Study Material
Semester-03
DSA
Unit-06

Unit 6: Hashing and File Organization

1. Introduction to Hashing

Hashing is a powerful technique used to quickly locate a data record in a database. By applying a hash function, we can convert a key into a hash value, which can be used as an index in a hash table. This approach is essential in scenarios where efficient data retrieval is paramount, such as in databases, caches, and associative arrays.
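
For example, with the simple hash function h(k) = k mod 10, the key 4372 maps to index 2 of a 10-slot table, so the record for 4372 can be found without scanning the other slots.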

2. Hash Tables

A hash table is a data structure that implements an associative array abstract data type, a structure that can map keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

2.1 Scattered Tables

A scattered table (also called a scatter table or scatter storage) is a hash table in which elements are stored at locations determined by the hash value produced by the hash function rather than in key order. When the hash function spreads keys evenly, clustering stays low, which improves the performance of lookup and insert operations.

3. Hash Functions

A hash function is a function that converts an input (or 'key') into a fixed-size value, typically an integer called the hash code. The output should ideally distribute keys uniformly across the hash table, and the quality of the hash function is crucial for the performance of hash tables.

3.1 Characteristics of a Good Hash Function

  1. Uniform Distribution: A good hash function will evenly distribute keys across the hash table.
  2. Deterministic: The same input should always produce the same output.
  3. Minimize Collisions: It should minimize the chances of two different keys hashing to the same index.
  4. Efficiency: The computation of the hash value should be efficient in terms of time complexity.
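
As a concrete illustration of these properties, the sketch below shows a simple multiply-and-add string hash in C++; the multiplier 31 and the function name stringHash are illustrative choices, not part of the material above.

    #include <string>
     
    // Deterministic, cheap to compute, and reasonably uniform for typical keys.
    // The multiplier 31 is a conventional (but not the only possible) choice.
    unsigned int stringHash(const std::string& key, unsigned int tableSize) {
        unsigned int h = 0;
        for (char c : key) {
            h = h * 31 + static_cast<unsigned char>(c); // Multiply-and-add step
        }
        return h % tableSize; // Map the hash code to a table index
    }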

4. Key-to-Address Transformation Techniques

Key-to-address transformations are techniques used to convert keys into hash table addresses. Common methods include:

  • Division Method: The hash value is computed as the remainder of the key divided by the table size p, which is usually chosen to be a prime number.

    h(k) = k mod p

  • Multiplication Method: The key is multiplied by a constant A (0 < A < 1), the fractional part of the product is kept, scaled by the table size m, and rounded down to give the index.

    h(k) = ⌊m × (kA mod 1)⌋

    where A is a constant and m is the size of the hash table.
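
As a quick illustration of both transformations, here is a minimal sketch in C++; the constant A ≈ 0.618 and the function names are illustrative choices, not prescribed above.

    #include <cmath>
     
    // Division method: remainder of the key divided by the table size p (often a prime).
    int divisionHash(int key, int p) {
        return key % p;
    }
     
    // Multiplication method: h(k) = floor(m * (k*A mod 1)) with 0 < A < 1.
    // A ≈ 0.6180339887 (the reciprocal of the golden ratio) is a commonly used value.
    int multiplicationHash(int key, int m) {
        const double A = 0.6180339887;
        double frac = std::fmod(key * A, 1.0); // Fractional part of k*A
        return static_cast<int>(std::floor(m * frac));
    }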

5. Collisions in Hashing

A collision occurs when two keys hash to the same index in a hash table. Collisions can significantly degrade performance, so efficient resolution methods are necessary.
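
For example, with a table of size 10 and the division method h(k) = k mod 10, the keys 12 and 22 both hash to index 2, producing a collision.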

5.1 Collision Resolution Techniques

  1. Linear Probing: When a collision occurs, this technique checks the next slot in the hash table and continues until an empty slot is found.

    • Example:
    // Returns the first free slot for key, or -1 if the table is full.
    // Empty slots are marked with -1; hashFunction(key) gives the home index.
    int linearProbing(int hashTable[], int key, int size) {
        int start = hashFunction(key) % size;
        int index = start;
        while (hashTable[index] != -1) {
            index = (index + 1) % size; // Move to the next index
            if (index == start) return -1; // Wrapped all the way around: table is full
        }
        return index;
    }
  2. Quadratic Probing: This method also finds the next available slot but uses a quadratic function to determine how far to probe.

    • Example:
    // Probes base, base + 1*1, base + 2*2, ... (mod size) and returns the first
    // free slot (marked -1), or -1 if none is found within size probes.
    int quadraticProbing(int hashTable[], int key, int size) {
        int base = hashFunction(key) % size;
        for (int i = 0; i < size; i++) {
            int index = (base + i * i) % size; // Quadratic probing offset from the base
            if (hashTable[index] == -1) {
                return index;
            }
        }
        return -1; // No empty slot found
    }
  3. Rehashing: In this technique, when a collision occurs, the hash value is hashed again (using the same or a second hash function) to compute a new index, and the process repeats until an empty slot is found.
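
    • Example (a minimal sketch of the idea: keep re-hashing the index until an empty slot is found; hashFunction and hashFunction2 are hypothetical helpers not defined in this unit):
    int rehashing(int hashTable[], int key, int size) {
        int index = hashFunction(key) % size;
        for (int attempts = 0; attempts < size; attempts++) {
            if (hashTable[index] == -1) return index; // Found an empty slot
            index = hashFunction2(index) % size; // Re-hash to get a new index
        }
        return -1; // No empty slot found after size attempts
    }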

  4. Chaining: In chaining, each slot in the hash table contains a linked list (or another data structure) to handle collisions.

    • Example:
    class Node {
    public:
        int key;
        Node* next;
    };
     
    class HashTable {
    private:
        Node** table; // One linked-list head per bucket
        int size;
    public:
        HashTable(int s) : size(s) {
            table = new Node*[size];
            for (int i = 0; i < size; i++) {
                table[i] = nullptr; // Initialize every bucket to an empty list
            }
        }
        // Insert a key at the head of its bucket's list; colliding keys
        // simply extend the list instead of probing for another slot.
        void insert(int key) {
            int index = key % size; // Simple key-to-bucket mapping
            table[index] = new Node{key, table[index]};
        }
    };
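
    With chaining, inserting a key at the head of its bucket's list takes constant time, and the average search cost grows with the load factor (the number of stored keys divided by the number of buckets).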

6. Concept of File

A file is a collection of related information stored on a storage device. Files are essential for data organization, management, and retrieval in computer systems.

7. File Types and File Organization

Files can be categorized into various types, and their organization can significantly affect performance and efficiency.

7.1 File Types

  • Text Files: Store data in a human-readable format (e.g., .txt, .csv).
  • Binary Files: Store data in a format that is not human-readable (e.g., .exe, .jpg).
  • Database Files: Store structured data in a database management system.

7.2 File Organization Methods

  1. Sequential File Organization: Data records are stored in sequence, one after the other. This method is efficient for bulk reading but can be slow for random access.

    Example of Sequential Access:

    #include <iostream>
    #include <fstream>
    using namespace std;
     
    // Append five records one after the other; the file can later be read back in the same order.
    void writeSequentialFile() {
        ofstream outFile("data.txt");
        for (int i = 1; i <= 5; i++) {
            outFile << "Record " << i << endl;
        }
        outFile.close();
    }
  2. Index Sequential File Organization: Combines sequential and indexed access. An index is maintained to allow for faster searches while keeping records in a sequential format.

  3. Direct Access File Organization: Each record is accessed directly based on its address rather than being stored sequentially. This method is useful for databases where quick access is necessary.
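
    Example of Direct Access (a minimal sketch; the file name, the fixed record size, and the helper readRecordDirect are illustrative assumptions):

    #include <fstream>
    #include <string>
    using namespace std;
     
    const int RECORD_SIZE = 32; // Assumed fixed size (in bytes) of every record
     
    // Jump straight to record number n (0-based) without reading the records before it.
    string readRecordDirect(const string& fileName, int n) {
        ifstream in(fileName, ios::binary);
        in.seekg(static_cast<streamoff>(n) * RECORD_SIZE); // Record's byte address = n * RECORD_SIZE
        char buffer[RECORD_SIZE + 1] = {};
        in.read(buffer, RECORD_SIZE);
        return string(buffer, in.gcount());
    }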

8. Comparison of Different File Organizations

  • Sequential Files: Easy to implement, good for batch processing, but poor random access performance.
  • Index Sequential Files: Faster access due to indexing, but requires additional storage for the index.
  • Direct Access Files: Offers the fastest access time for random reads and writes but requires more complex data structures to manage records.

9. Conclusion

Hashing and file organization are crucial concepts in computer science that directly impact data retrieval and storage efficiency. By understanding various hashing techniques, collision resolution methods, and file organization strategies, developers can design systems that efficiently manage and access data, ultimately improving performance and user experience. Mastering these concepts is essential for anyone looking to delve deeper into data structures and algorithms, as they lay the foundation for effective software development and database management.