Writing my solution for offline storage, part 2 — Storage efficiency
A case of poorly chosen data structures
A few weeks ago, right after reading the book Designing Data-Intensive Applications, I started reading the book Database Internals — both amazing pieces.
I started to understand better how a database works internally. I've been focusing on the storage engine so far, which means I went through different approaches and data structures for handling storage efficiently. I wish I had known all this when working on offline storage, because I did terrible things in our implementation.
Basics of data storage
First things first: the goal of a database is to store data and make it possible to read it back in the future. There is a plethora of database flavors out there: some work with primary memory, others with secondary memory; some enforce a schema, others don't; and so on.
Regardless of the flavor, you'll most likely find, among many others, the following components: data and indexes. The data is the actual content you store in the database, while indexes are structures that also hold data, but only to guide searches and make reads more efficient.
I'll take the risk of being repetitive or obvious, but that's why we have DBMSs instead of plain files. Databases do end up saving data into files, but they do it using powerful data structures.
Our not-so-good decisions
Honestly, I'm a bit ashamed writing this section, because it's by far the worst thing I've ever coded in my life.
To recap: we had a mobile application written in TypeScript, a backend written in Node.js, and a MongoDB database. So it seemed like a good idea to use localStorage for storing data. Using that package, we could save JSON-like data associated with a key.
NOTE: Even though the name is the same, this localStorage is not the same as the Local Storage we have in browsers.
// saving
await localStorage.set('someKey', {
  1: { id: 1, description: 'item 1' },
  2: { id: 2, description: 'item 2' }
});

// retrieving
const data = await localStorage.get('someKey');

// deleting
await localStorage.remove('someKey');
Usually, SQLite is associated with storage on local devices, and that makes sense, since it keeps everything in a single file and is lighter than other vendors. If you come from native Android development, you might draw a parallel between SharedPreferences and this localStorage. Also, if you're a web developer, you can think of this localStorage as similar to the one we have in browsers (they're not really equal, but the idea is to be as close as possible).
Even though we have this key/value interface, we can define a driver for it, and we chose SQLite in the beginning — we changed that after a testing period, but it doesn't matter for now.
As we had the power of SQLite and an interface for saving schemaless data as key/value pairs, we just started with that.
We implemented it so that when the application starts up, it fetches data from the server, does some transformation (converting it from an array into an object indexed by the id), and saves it locally.
Hypothetical code would look like the snippet below — I know it breaks the single responsibility principle and so on, but it's just an example.
async fetchSubjects() {
  const data = await this.httpClient.get('https://some-endpoint').toPromise();
  await this.storage.set('subjects', toObject(data, 'id'));
}
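The toObject helper isn't part of any library — it was just our own transformation. A minimal sketch of what it did, assuming records carrying a unique id field:

// Index an array of records by one of their fields.
function toObject<T extends Record<string, any>>(
  items: T[],
  key: keyof T
): Record<string, T> {
  return items.reduce((acc, item) => {
    acc[String(item[key])] = item; // e.g. { '1': { id: 1, ... } }
    return acc;
  }, {} as Record<string, T>);
}

// toObject([{ id: 1, name: 'Math' }], 'id')
// -> { '1': { id: 1, name: 'Math' } }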
For now, we won’t consider the synchronization problem, so let’s assume it was well implemented — spoiler: it wasn’t.
It worked like a charm until we started testing what we called heavy users: users with datasets containing thousands of records. In one specific collection, we had users with 5K records, and in our tests we noticed that from 2K records onwards, reading and writing became terribly slow. That was due to this bad decision and to the sync process (we'll see it in the next part of this series).
Problems with this approach
Trusting the lib
For some reason (maybe lack of experience), I believed the localStorage SQLite driver was doing a good job. In fact, it wasn't — at least not for our purposes. The table it uses for key/value storage is just a simple table containing key and value fields (reference: https://github.com/localForage/localForage/blob/master/src/drivers/websql.js#L18).
SQL example of such a table:
CREATE TABLE IF NOT EXISTS table_name (
  id INTEGER PRIMARY KEY,
  key unique,
  value
);
NOTE: The value field has no type declaration, so it can hold data of any type — SQLite uses dynamic typing, and a column declared without a type accepts any kind of value. It's a kind of generic type.
It's handy that it can assume different types. That's why localForage (the library localStorage uses under the hood) has a serializer: depending on what we pass as a value, it applies a different serialization.
We were passing an object, so it was simply stringifying it when saving (JSON.stringify) and parsing it when retrieving (JSON.parse) — https://github.com/localForage/localForage/blob/master/src/utils/serializer.js#L164.
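In practice, that meant something like this sketch (simplified, not localForage's actual code) on every single write and read of a collection:

const subjects = { 1: { id: 1, name: 'Math' } /* ...thousands more records */ };

// On every set(): the whole collection turns into one big string...
const serialized = JSON.stringify(subjects);

// ...which lands in the single `value` column shown below. And on
// every get(): the whole string is parsed back, even to read one record.
const restored = JSON.parse(serialized);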
So, in the end, our table was more like the one shown below.
CREATE TABLE IF NOT EXISTS table_name (
  id INTEGER PRIMARY KEY,
  key unique,
  value text
);
Even though it wouldn't be a problem for most users, this solution doesn't scale, and that's not what I would call a good job.
Poorly chosen and handled data structures
We were using a JavaScript object where the key was the id of the record, which seems like a smart choice: such a structure is indexed, so getting a record by id is really easy and fast. In terms of time complexity, it's what we call O(1) — on average.
But the way we handled it was not good. We kept the data in the table, and every time we wanted to manipulate it, we read it (parsing it implicitly), updated it, and saved the whole thing again.
For instance, there was an entity for colors, and the code for manipulating it looked like this:
export class ColorsStorage {
  private readonly keyName: string = 'colors';

  constructor(private storage: Storage) {}

  async create(color: Color) {
    const colors = await this.storage.get(this.keyName);
    await this.storage.set(this.keyName, { ...colors, [color.id]: color });
  }
}
NOTE: There were more abstractions, but nothing worth mentioning here.
This approach is bad because we're reading (and re-serializing) the entire structure whenever we need to write a single record.
A small improvement would have been keeping an in-memory structure (a memtable) where we could perform write operations much faster than going to disk every time. However, that would bring more complexity, since we would need a background process writing snapshots of the memtable to disk, as well as a way to detect when the user is about to close the app or send it to the background.
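A minimal sketch of that idea, assuming the same key/value storage interface used above — the MemTable name, the flush interval, and the interface are hypothetical, not from our codebase or any library:

interface KeyValueStorage {
  get(key: string): Promise<any>;
  set(key: string, value: any): Promise<void>;
}

export class MemTable<T extends { id: number }> {
  private buffer = new Map<number, T>();
  private dirty = false;

  constructor(
    private storage: KeyValueStorage,
    private keyName: string,
    flushIntervalMs = 5000
  ) {
    // The background process: periodically snapshot memory to disk.
    setInterval(() => this.flush(), flushIntervalMs);
  }

  // Writes touch memory only: O(1), no serialization, no I/O.
  set(record: T): void {
    this.buffer.set(record.id, record);
    this.dirty = true;
  }

  get(id: number): T | undefined {
    return this.buffer.get(id);
  }

  // Serialize and persist the whole buffer in one shot. This would
  // also need to run when the app is closed or backgrounded.
  async flush(): Promise<void> {
    if (!this.dirty) return;
    this.dirty = false;
    await this.storage.set(this.keyName, Object.fromEntries(this.buffer));
  }
}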
There's also another problem with this structure: it works like a hash index, which makes range queries inefficient.
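For example (a hypothetical Color type and helper, just for illustration):

interface Color {
  id: number;
  description: string;
}

// Every entry must be visited: there is no way to jump straight to
// the first id in the range, so the scan is O(n) regardless of how
// many records actually match.
function colorsInRange(
  colors: Record<number, Color>,
  min: number,
  max: number
): Color[] {
  return Object.values(colors).filter(c => c.id >= min && c.id <= max);
}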
What we could’ve done
First of all, we should have discussed the solution more thoroughly and defined the trade-offs properly. We could also have run more experiments than we did.
From a technical perspective, today I believe we could have used SQLite directly. That would give us well-chosen data structures out of the box, as well as write buffering along the lines of the memtable we mentioned in the last section. We would also have index structures to speed up our queries when needed.
Of course, it would bring other work. We would need to define our schemas and write code to map data coming from MongoDB (schemaless) to SQLite, a relational database. Also, adding fields would be a harder operation, since we would need to update the schemas in the local database, but I strongly believe it would have fit what we wanted to achieve much better.
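A rough sketch of what that could look like, assuming a promise-based executeSql API in the style of the Cordova/Ionic SQLite wrappers — the schema, table, and helpers are hypothetical:

interface SqliteDatabase {
  executeSql(sql: string, params?: any[]): Promise<any>;
}

async function setupColors(db: SqliteDatabase): Promise<void> {
  // An explicit schema instead of one opaque JSON blob per collection.
  await db.executeSql(`
    CREATE TABLE IF NOT EXISTS colors (
      id INTEGER PRIMARY KEY,
      description TEXT
    )
  `);
}

async function saveColor(db: SqliteDatabase, id: number, description: string) {
  // Writing one record touches one row, not the whole collection.
  await db.executeSql(
    'INSERT OR REPLACE INTO colors (id, description) VALUES (?, ?)',
    [id, description]
  );
}

async function colorsBetween(db: SqliteDatabase, min: number, max: number) {
  // Range queries become cheap: the primary key is indexed.
  return db.executeSql(
    'SELECT * FROM colors WHERE id BETWEEN ? AND ?',
    [min, max]
  );
}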
Conclusion
This article talks about a lot of mistakes my team and I made. I would like to point out that every team needs somebody really experienced participating in its decisions. In our team, we had experienced members who didn't take part in these decisions, and that was a problem. Today, I can look at the choices mentioned here and see that they weren't good.
Even though this article is a kind of essay, we don't assume things, we prove them. So don't take this as a silver bullet and tell your team something like "the Ionic Storage dependency sucks." You have to check your environment and do your own experiments.
I had prepared a proof of concept, but I didn't have a chance to demonstrate it to my team. Maybe I'll bring a benchmark comparing these two implementations to a future part of this series.
That’s all. See ya!