https://aleics.me/posts/feed.xml

Building your own database index

2025-02-01

You can find the latest state of delta-search in the repository.

Lately, I've been exploring the idea of building a simple search engine from scratch. The primary motivation is the opportunity to learn how search engines are internally built and to have fun along the way. We'll be developing this in Rust 🦀.

In the first chapter, we'll explore Roaring Bitmaps and B-Trees and learn how to use them to build one of the core components of a search engine: indexing.

Background

At work, I've been exploring roaring bitmaps and how various databases and search engines leverage this data structure for indexing and faster lookups. Inspired by its powerful applications and widespread use in the industry, I decided to experiment and investigate how to use it for indexing data.

Building an index

Indexes are internal mechanisms in databases that enable fast lookups when querying data. They are typically attached to specific fields referenced by queries to ensure efficient matching.

Let's consider an example. Imagine we have a database containing players and their information:

#[derive(Debug)]
struct Player {
    name: String,
    sport: Sport,
    age: u32
}

#[derive(Debug)]
enum Sport {
    Football,
    Basketball
}

To start, let's build an index to filter players by their sport. A bitmap can help track which players belong to which sport:

let db = vec![
    Player {
        name: "Leo Messi".to_string(),
        sport: Sport::Football,
        age: 36
    },
    Player {
        name: "Lebron James".to_string(),
        sport: Sport::Basketball,
        age: 39
    },
    Player {
        name: "Cristiano Ronaldo".to_string(),
        sport: Sport::Football,
        age: 38
    },
    Player {
        name: "Luka Doncic".to_string(),
        sport: Sport::Basketball,
        age: 25
    },
];

let mut football_players = RoaringBitmap::new();
football_players.insert(0); // Leo Messi
football_players.insert(2); // Cristiano Ronaldo

let mut basketball_players = RoaringBitmap::new();
basketball_players.insert(1); // Lebron James
basketball_players.insert(3); // Luka Doncic

Now, we've categorized which players belong to which sport using two bitmaps, one for each sport.

We can collect all the field's possible values and their respective bitmaps into a single data structure called Index:

#[derive(Default)]
struct Index<T: Ord>(BTreeMap<T, RoaringBitmap>);

impl<T: Ord> Index<T> {
    fn add(&mut self, value: T, position: u32) {
        let bitmap = self.0.entry(value).or_default();
        bitmap.insert(position);
    }
}

let mut sport_index = Index::default();
let mut age_index = Index::default();

for (i, player) in db.iter().enumerate() {
    sport_index.add(player.sport, i as u32);
    age_index.add(player.age, i as u32);
}

Note: we assume that all indexed types implement the Ord trait.

In our Index, we collect all the possible values into a BTreeMap. This data structure, part of the Rust standard library, is an ordered map based on a B-Tree.

B-Trees are self-balancing tree data structures that maintains sorted data, and allow read and write operations in logarithmic time. They are widely used for indexing and storage, by breaking down a database into fixed-size blocks or pages.

Using an index

Once the index is available, we can use it to run queries and combine different filter criteria. Bitmaps can be combined using bitwise operations to compose logical combination of filters and execute more complex queries against the indexes.

impl<T: Ord> Index<T> {
    fn get(&self, value: &T) -> Option<&RoaringBitmap> {
        self.0.get(value)
    }
}

// Find basketball players
let basketball_players = sport_index.get(&Sport::Basketball)?;

// Find players with a certain age
let age_players = age_index.get(&39)?;

// Combine multiple index results using AND:
// sport = "Basketball" AND age = 39
let hits = match (basketball_players, age_players) {
    (Some(left), Some(right)) => left & right,
    _ => RoaringBitmap::new(),
};

for hit in hits.iter() {
    println!("{:?}", db.get(hit as usize)); // Lebron James
}

Instead of iterating through the entire db to filter, we use efficient bit operations on roaring bitmaps and retrieve only the final results. This example is all the data in memory stored under the db variable. However, in the real world this would prevent reading all the database from disk.

Inequality operators

Databases commonly allow filtering using inequality operators, such as > (greater than) and < (less than). Storing values using a B-Tree enables value comparisons due to the inherent order between values.

However, not all field types are "comparable" in the same way. Comparing different value types may also be impossible or yield indeterminate results. Therefore, we need to differentiate between types of filters and the data they index.

enum Index {
    Integer(IntegerIndex),
    String(StringIndex),
}

#[derive(Default)]
struct IntegerIndex {
    inner: SortableIndex<u32>,
}

#[derive(Default)]
struct StringIndex {
    inner: SortableIndex<String>,
}

#[derive(Default)]
struct SortableIndex<T: Ord>(BTreeMap<T, RoaringBitmap>);

Note: we've renamed the previous Index struct to SortableIndex to keep the high level index name.

Then, we can define inequality operators for an integer index as follows:

impl IntegerIndex {
    fn ge(self, value: &u32) -> RoaringBitmap {
        let mut hits = RoaringBitmap::new();

        // Iterate over all the fields that are greater or equal
        // than `value` and collect all the items.
        let left_bound = Bound::Included(value);
        let right_bound = Bound::Unbounded;
        for (_, bitmap) in self.inner.0.range((left_bound, right_bound)) {
            hits |= bitmap;
        }

        hits
    }
}

The solution can be interpolated for other inequality operators using the existing range API from the Rust standard library:

impl IntegerIndex {
    fn ge(&self, value: &u32) -> RoaringBitmap {
        self.between(Bound::Included(value), Bound::Unbounded)
    }

    fn gt(&self, value: &u32) -> RoaringBitmap {
        self.between(Bound::Excluded(value), Bound::Unbounded)
    }

    fn le(&self, value: &u32) -> RoaringBitmap {
        self.between(Bound::Unbounded, Bound::Included(value))
    }

    fn lt(&self, value: &u32) -> RoaringBitmap {
        self.between(Bound::Unbounded, Bound::Excluded(value))
    }

    fn between(&self, left: Bound<&u32>, right: Bount<&u32>) -> RoaringBitmap {
        let mut hits = RoaringBitmap::new();

        for (_, bitmap) in self.inner.0.range((left, right)) {
            hits |= bitmap;
        }

        bitmap
    }
}

Now we can use inequality operators to run queries against the indexes:

// Find basketball players
let basketball_players = sport_index.get(&Sport::Basketball);

// Find players younger than 39 year old
let young_players = age_index.lt(&39);

// Combine multiple index results using AND:
// sport = "Basketball" AND age = 39
let hits = match (basketball_players, young_players) {
    (Some(left), Some(right)) => left & right,
    _ => RoaringBitmap::new(),
};

for hit in hits.iter() {
    println!("{:?}, db.get(hit));
}

This same approach can be used to define indexes for other data types, such as decimals, dates, ordered enums, etc.

Sorting and pagination

Another important feature for databases and search engines is the ability to retrieve results in a sorted order. This feature is naturally unlocked by using a BTreeMap as part of our index. We can iterate over the index values and return a sorted list of elements.

The sorting step naturally follows filtering. We match the filter results against each value's bitmap during a sorted iteration.

impl<T: Ord> SortableIndex<T> {
    fn sort(&self, filter_result: &RoaringBitmap) -> Vec<u32> {
        let mut sorted = Vec::new();

        // Iterate over the values in ascending order
        for bitmap in self.0.values() {
            // Find the elements that are part of the filter
            // and match the current's iteration value.
            let round = filter_result & bitmap;

            sorted.extend(&round);
        }

        sorted
    }
}

The result of the sort function above is a vector of filter hits sorted using the BTreeMap keys. We can then use these hits to retrieve the corresponding data from our database:

// Sort hits by age in ascending order
for hit in age_index.sort(hits).iter() {
    println!("{:?}, db.get(hit));
}

Pagination is used to limit the response to different pages and retrieve a manageable amount of data at a time from a server. This reduces resource consumption for both the client and server when running a query.

A naive approach is to limit the number of hits before fetching the data. For example, using offset-based pagination:

let pagination = Pagination { offset: 5, limit: 10 };

let paginated_hits = sorted_hits
    .iter()
    .skip(pagination.offset)
    .take(pagination.size);

for hit in paginated_hits {
    println!("{:?}, db.get(hit));
}

This way, we avoid fetching all the data and limit operations to those needed for the given page.

In Rust, a more idiomatic and efficient approach would be to define a custom Iterator for the sorted list instead of allocating a vector. This avoids unnecessary data fetching and memory allocation for the complete vector of sorted hits.

Conclusions

  • Roaring bitmaps and B-Trees are fundamental data structures used for indexing across various databases, search engines, and data analytics engines.
  • Building your own index is straightforward by leveraging existing libraries and Rust's robust ecosystem.
  • Indexes can not only categorize data but also enable advanced filtering and sorting using custom operators.