Bitmaps from scratch
You can find the full implementation of
bitmapin the repository
To extend my knowledge of bitmaps and understand their limitations and possibilities, I've been playing around and exploring how to build bitmaps from scratch.
Background
A bitmap is a data structure used to store bit data in a compact manner. Bitmaps are often used for database indexing by encoding data in the form of bits. Bitmaps can be combined or transformed using bitwise operations, which are simple and efficient to execute by processors.
Classic bitmaps suffer from space fragmentation, leading to inefficient storage. Many solutions have been proposed by both industry and research to overcome this problem. One of the latest and most well-known implementations is Roaring Bitmaps.
The goal of this blog post is not to build Roaring Bitmaps but to explore different ideas around implementing bitmaps in general, as well as to delve into the implementation of bitwise operations.
Simple bitmap
A simple bitmap can be constructed by using a list of numbers. Each number in the list has a certain bit size, aiming to store bit data within a specific range.
usize has a bit size of 64 bits. Thus, each element in the vector is a chunk of 64-bit reserved space.
Note: We are omitting boundary checks for simplicity. You can check the complete implementation in the repository
Get and set
To access and mutate a certain bit position, we need to use bit operations to manipulate the usize value by using, e.g., left or right shift operations. Luckily, Rust has great support for such operations.
Accessing a certain bit position should respect the chunk size, so that an absolute position can be calculated as:
Get
To get a certain value, we can use the bit_index function above to know what the actual chunk index and relative index in that chunk are. Then, we can subtract the value by intersecting the chunk value with a usize value that has only a 1 in the desired position. Thus, returning the bit value.
Set
Changing a position value is a similar process. We'd find out the chunk and the bit index, then apply a series of bit operations to mutate the bitmap in the desired way:
As shown in the code, we make use of bit operations to extract and mutate bitmap values. These operations are very efficient and fast to run on the computer's processor (O(1)).
Bitwise operations
Bitwise operations are used to combine bitmap values in a very efficient manner. Examples of bitwise operations are the logical AND, OR, and XOR, negation, and logical shifts. For our custom bitmap, we can implement the basic ones using the existing traits: BitAnd, BitOr, BitXor and Not.
The BitAnd trait allows us to define a custom implementation for the bitwise AND operator &. For our chunk of bits, this can be implemented by iterating over each chunk and combining the inner bit values using the AND operator for usize.
The same approach can be used for the other bitwise operations BitOr, BitXor and Not:
// `BitOr` implementation
*chunk = self.chunks | rhs.chunks;
// `BitXor` implementation
*chunk = self.chunks ^ rhs.chunks;
// `Not` implementation
*chunk = !self.chunks;
Limitations
The presented implementation has a couple of limitations. Ignoring the fact that it's a naive implementation and aspects such as resizing and index boundaries are not being respected, it presents a fundamental problem by design:
Storing sparse bit sets leads to large storage space
Sparse bit sets are those that contain a 1 among many 0s (e.g. 10...010...01). In this scenario, chunks will contain many elements with only 0s. In other words, the memory is filled with "empty data".
Sparse bitmap
The sparse bitmap is optimized to store sparse bit sets. Instead of storing a complete 64-bit chunk, the sparse bitmap stores the position and length of each block of 1 values. These blocks are called runs.
Thus, we are storing only two 64-bit numbers for each complete run, rather than a large number of 64-bit entries. In scenarios where the bit set has a very heterogeneous distribution, the sparse bitmap would be less efficient. A similar implementation is used by Roaring Bitmaps when using run containers to store consecutive values efficiently.
Get and set
Get
For reading a certain bit value, we'd need to find which run contains the bit's position. If the position is found inside a run, the bit has a value of 1. If no run is found, the bit has a value of 0.
Note that if the runs are sorted ascendingly inside the bitmap, we need only to check if the position's lower bound includes a run that contains that position.
Set
Mutating a certain position implies changing the distribution of the runs, depending on whether a 1 or 0 is set in that position.
When setting a 0 value to a certain bit position inside a run, that run would need to be either shrunk, split into two runs, or deleted.
On the other hand, when setting a 1 value to a certain bit position, we would either need to extend an existing run, merge two runs into one, or create a new run.
As presented, getting and setting bits in the sparse bitmap is quite more cumbersome and less efficient than using bitwise operations in the initial bitmap. This increase in complexity comes as a trade-off for more efficient data storage.
Bitwise operations
The implementation for logical bitwise operations follows the same pattern as for the get and set methods. The boundaries of the runs need to be merged together depending on the operation.
For instance, to implement BitAnd, we'd iterate over both bitmaps' runs and intersect the runs that have common boundaries.
The result of intersecting two runs is the run with the higher start index and the minimal length.
The implementation for the rest of the logical bitwise operations follows the same idea. Combining two sparse bitmaps into one, by merging the distribution of the runs according to the operation's logic.
You can check out the details around implementing the rest of the operations in the repository.
Conclusions
- We've introduced the concept of bitmaps and their practical application in real-world scenarios.
- We've implemented a simple bitmap using an array of 64-bit chunks and explored how to implement getting and setting individual bits, as well as logical bitwise operations.
- We've discussed the limitations of the simple bitmap and implemented a compatible bitmap for sparse bit sets.
As a closing note, without getting into deep details, the idea behind Roaring Bitmaps is ultimately to understand which type of bitmap works best depending on its data distribution. Thus, the goal is not to use the same design for all possible data distributions, but to select the one that fits best.