Thursday, May 19, 2016

Adding a strongly typed eager loading feature to an ORM

I'm currently accepting new clients and projects! If you need an experienced Software Developer, please contact me.

When I selected the data access technology to create Invotes, I was primarily searching for an ORM that reminded me of Entity Framework.

There were a number of different SQL based frameworks with their own pros and cons. Here's a short list of a few of the more prominent libraries that I evaluated:

  • Slick - The most popular Scala FRM, also known for its steep learning curve.
  • Anorm - Bundled with Play Framework, uses plain SQL to execute queries.
  • Squeryl - A strongly typed ORM that focuses on type safety with a familiar SQL-like DSL.

I decided to give Squeryl a shot since it was easy to pick up and it looked fairly similar to Entity Framework.

A week or so into the project, I ran into some productivity and performance concerns. Squeryl worked exactly as advertised, but I wasn't thrilled with the way I had to retrieve relations for a given entity query. Rather than making a single call with a join, related entities were loaded by making additional calls to the database. Consider this schema:

Database Tables

account
- id
- first_name
- last_name
- email

account_event
- id
- account_id
- event_id

event
- id
- name
- start_timestamp
- location

In Squeryl, the simplest way to retrieve an account with their events is by touching all the relation properties inside of a transaction block. This generates a query for each collection, executing the appropriate SQL and populating the relation.

(Note that if these relations are accessed outside of a transaction block, an exception will be thrown).

While this is a straightforward and effective approach, it's not always performant depending on the size of the dataset being operated on.

Imagine a scenario where there are hundreds of accounts with hundreds or even thousands of events. In this case, thousands of SQL queries would be executed to populate the model.
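To make that cost concrete, here's a minimal, self-contained sketch of the lazy-loading pattern (plain Scala with a query counter standing in for the database; not Squeryl's actual API):

```scala
// Self-contained sketch of lazy relation loading (illustrative, not Squeryl).
// Loading N accounts and then touching each one's events costs 1 + N queries.
object LazyLoadSketch {
  var queryCount = 0

  // Stand-in for a database round trip
  def runQuery[A](rows: Seq[A]): Seq[A] = { queryCount += 1; rows }

  case class Account(id: Long) {
    // The first touch of this property triggers its own query,
    // mimicking a lazily loaded relation
    lazy val accountEvents: Seq[Long] = runQuery(Seq(id * 10, id * 10 + 1))
  }

  val accounts: Seq[Account] = runQuery((1L to 100L).map(Account(_))) // 1 query
  def touchAll(): Unit = accounts.foreach(_.accountEvents)            // 100 more
}
```

Touching every relation here issues 101 queries for 100 accounts, and with nested relations the count multiplies accordingly.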

Joins are the obvious solution to this issue. When I first started working with Squeryl, I found that this was possible, but very inconvenient, particularly when working to retain immutable data structures.

While it worked, it was very difficult code to read and reason about. When it was time to build a new query, I found that my drive to continue drastically decreased.

My first thought was to build out an external library with helper functions, which would allow these joins to take place in a generic way. As I attempted to tackle the problem, it became apparent that this wasn't going to be a straightforward task, and it felt wasteful to spend a large amount of effort on a problem that would be better solved inside the Squeryl library itself.

The Inspiration: Entity Framework

My experience with Entity Framework reminded me that there are pre-established, elegant ways of dealing with this problem. EF comes with an eager loading feature built in: the Include method is used to load related entities. It instructs the query to select Accounts with a matching username, and to also select and materialize the AccountEvent and Event relations.

Inspired by that design, and with no prior experience in this area, I decided to try my hand at implementing a similar feature in Squeryl.

The Design

On the surface, the idea is fairly simple. Each selected entity needs to be joined to the relation that's specified to the left of it.

Jumping right in, I soon realized that entering an unfamiliar code base meant that I would be best served by breaking down the feature into more manageable chunks.

Step 1: Expose existing relational information in Squeryl

To generate a LEFT JOIN clause between two entities, there needs to be a way to retrieve the foreign key columns, which are used to build the ON clause.
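Once the foreign key columns are known, generating the clause itself is straightforward string assembly. Here's a minimal sketch using the example schema's names (illustrative only; the real Squeryl code derives this from its internal relation model):

```scala
object JoinClauseSketch {
  // Foreign key metadata for a one-to-many relation (names illustrative)
  case class OneToManyRelation(
      oneTable: String, oneKey: String,      // e.g. account.id
      manyTable: String, foreignKey: String) // e.g. account_event.account_id

  // Build the LEFT JOIN ... ON ... fragment from the relation metadata
  def leftJoinClause(r: OneToManyRelation): String =
    s"LEFT JOIN ${r.manyTable} ON ${r.oneTable}.${r.oneKey} = ${r.manyTable}.${r.foreignKey}"

  val accountToEvents =
    OneToManyRelation("account", "id", "account_event", "account_id")
}
```

For the example schema, this yields `LEFT JOIN account_event ON account.id = account_event.account_id`.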

In Squeryl, this relational information is defined up front (by you) in an object derived from the Schema base class.

The information existed inside this class, but it was not exposed. A new public method had to be created to retrieve the relation between two entity classes.

Looking through the Schema base class, there was already a similar method called findAllTablesFor[A](c: Class[A]). Instead of finding a table for a class, the new method needed to find the foreign key relation given two classes. Luckily, this information was already stored in a private field (_oneToManyRelations) that tracked all one-to-many relations. The method is implemented as follows:

[Final source code here]

Relations can now be found between two tables by calling this method on the Database object.
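A simplified, self-contained sketch of the idea (the names and types here are illustrative, not Squeryl's actual API; the real field holds much richer metadata):

```scala
object SchemaLookupSketch {
  // Simplified stand-in for Squeryl's one-to-many relation metadata
  case class RelationInfo(left: Class[_], right: Class[_], onClause: String)

  class SchemaSketch {
    // Mirrors the idea of Schema's private _oneToManyRelations field
    private var _oneToManyRelations: List[RelationInfo] = Nil

    protected def register(r: RelationInfo): Unit =
      _oneToManyRelations ::= r

    // The new public lookup: find the relation connecting two entity classes
    def findRelationsFor(left: Class[_], right: Class[_]): Option[RelationInfo] =
      _oneToManyRelations.find(r => r.left == left && r.right == right)
  }

  class Account
  class AccountEvent

  object Database extends SchemaSketch {
    register(RelationInfo(classOf[Account], classOf[AccountEvent],
      "account.id = account_event.account_id"))
  }
}
```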

Step 2: Basic Join Generation

At some point, a Squeryl query has to generate a real SQL query. Before investing a lot of time on the API side of the include feature, I wanted to figure out how a join could be introduced using a newly implemented include method.

Before we continue

One thing that's important to note is the way that Squeryl handles relations on an entity level. To create a field reference to all AccountEvents on an Account model, a reference to the relation is used.

Further, a manually written join in Squeryl returns a collection of tuples (Account, AccountEvent), meaning the AccountEvent data will not be attached to the collection Account.accountEvents.

The way that Squeryl models this underneath the covers is by attaching what is called a subqueryable instance (generally an entity table) and an associated join expression to the QueryYield class. More on this later.

Getting a naive include expression to generate expected SQL

To create a proof of concept, I leveraged the field relation to write a quick and dirty include signature on the Squeryl QueryYield definition:

[Original Source Code Here]

This allowed me to write a simple test query to validate that the correct SQL was being generated until the real API was designed.

All the pieces were in place to figure out how to generate the expected join SQL:

  • Access to foreign key information for both sides of the join
  • An understanding of how Squeryl builds joined queries through subqueryables
  • Temporary include method signature

First, the information derived from the include parameter (the table entity and foreign key information) is stored with the query (in the QueryYield class), so it can be retrieved at execution time. This is the include implementation as seen in the above example.

[Original Source Code Here]

The important lines are:

  • includeExpressions - a new subqueryable is stored and will be included as a left outer join on the specified table entity.
  • joinExpressions - the relationship between the left and the right table is stored, which is the information that will be used to generate the ON clause.
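A rough sketch of that extra state, with simplified types standing in for Squeryl's subqueryable and join AST nodes (names are illustrative, not the actual implementation):

```scala
object QueryYieldSketch {
  // Stand-ins for Squeryl's subqueryable and join expression types
  case class SubQueryableRef(table: String)
  case class JoinExpr(leftColumn: String, rightColumn: String)

  class QueryYield {
    // Tables to be joined as LEFT OUTER JOINs at query-build time
    var includeExpressions: List[SubQueryableRef] = Nil
    // Column pairs used to generate each join's ON clause
    var joinExpressions: List[JoinExpr] = Nil

    def include(table: SubQueryableRef, on: JoinExpr): this.type = {
      includeExpressions :+= table
      joinExpressions :+= on
      this
    }
  }

  val yld = new QueryYield().include(
    SubQueryableRef("account_event"),
    JoinExpr("account.id", "account_event.account_id"))
}
```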

As I mentioned earlier, Squeryl already had a code model for joins, so the information attached to the QueryYield class is utilized when building the query. This is done in the AbstractQuery class:

[Original Source Code Here]

At the top of the snippet, the subqueryables found in our include clause are being appended to the collection of items to be joined. Now the join clause can generate the correct SQL. The effect can be seen by inspecting the statement on a Squeryl query.

Step 3: The Include DSL

While the naive approach above works fine for simple scenarios, I wanted the feature to be capable of handling an arbitrary depth of nesting.

Consider again the original example: including all events for an account. Since account_event is the junction table, we need to be able to support an arbitrary amount of nested relations. I also wanted to support parent relations (many to one), adjacent relations, and any arbitrary combination of the above. I quickly realized that I was looking at implementing a tree structure.

To my surprise, this turned out to be the hardest part of the entire feature.

The first difficulty was the lack of expression trees in Scala. In .NET, an expression can be created to model code in a tree structure. The tree can be traversed at runtime to extract more meaning from the code provided, rather than just the return value of the call. In practical terms, this means the API could have been much simpler: in .NET, the include method's signature could accept an expression tree, and callers could pass a lambda that navigates the relations.

Inside the Include method, inspecting the expression would reveal that an Events relation was selected by traversing from Account, through the AccountEvents relation, and finally onto the Event relation. This is a concept that Scala lacks (although something like this may be possible with macros).

Instead, a more realistic approach was chosen to express relations in the include method: create an enclosing type for each node of the expression, and use a fluent-style API to chain each relation together, keeping the expression strongly typed.

I'm not generally a big fan of symbols for method names, but when words were used instead of symbols, it obfuscated which relations were actually being selected. Making that information easily parseable by a human was of primary importance. Specifically, the methods used are:

  • ->> (Select adjacent relations)
  • -* (Select the many side of a relation)
  • *- (Select the one side of a relation)

Below is an abridged version of the PathBuilder base class, which highlights how this fluent chaining is accomplished:

[Final Source Code Here]

Think about an instance of PathBuilder[OneToMany[Account]], where P is OneToMany[Account]. The one to many method (-*) accepts a function with P (OneToMany[Account]) as a parameter and a result of OneToMany[A]. This means you could pass something like include(pathBuilder => pathBuilder.-*(a => a.accountEvents)) and it would return a PathBuilder[A], which would be PathBuilder[OneToMany[AccountEvent]]. This is how static typing is preserved, while mitigating the absence of expression trees baked into the language.
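The mechanism can be illustrated with a compilable, self-contained sketch. It is heavily simplified: here P is the entity type itself and relation metadata is just a name, whereas the real PathBuilder wraps relation node types like OneToMany[Account] and carries Squeryl metadata (the adjacent-relation operator ->> is omitted for brevity):

```scala
object PathBuilderSketch {
  // Minimal relation descriptors: a name plus a sample of the target entity
  case class OneToMany[A](name: String, target: A)
  case class ManyToOne[A](name: String, target: A)

  class PathBuilder[P](sample: P, val path: List[String]) {
    // -* steps to the many side of a relation, staying strongly typed
    def -*[A](f: P => OneToMany[A]): PathBuilder[A] = {
      val rel = f(sample)
      new PathBuilder(rel.target, path :+ rel.name)
    }
    // *- steps to the one side (the parent) of a relation
    def *-[A](f: P => ManyToOne[A]): PathBuilder[A] = {
      val rel = f(sample)
      new PathBuilder(rel.target, path :+ rel.name)
    }
  }

  class Event
  class AccountEvent { val event = ManyToOne("event", new Event) }
  class Account { val accountEvents = OneToMany("accountEvents", new AccountEvent) }

  // Account -> accountEvents (many side) -> event (one side);
  // the compiler tracks the entity type at every step of the chain
  val path =
    new PathBuilder(new Account, Nil).-*(_.accountEvents).*-(_.event)
}
```

Each chained call returns a new, still strongly typed PathBuilder for the next relation, so an invalid navigation simply fails to compile.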

This allows for infinite flexibility: you can include as many relations as needed while leveraging the type safety Scala has to offer.

Developing this tree style syntax took a lot of trial and error. I would fix one piece, and find out that adjacent relations weren't implemented properly, then I would fix that issue and parent relations were broken. It was tedious, but with enough perseverance and a suite of valuable tests, all pieces ended up working properly. The final syntax is arguably ugly, but ultimately flexible.

On top of building this DSL, the AbstractQuery class was updated to recursively build out queries with the correct joins.

[Final Source Code Here]

Step 4: Materializing row data

It may seem that the feature would be complete at this point, but there's still one critical missing piece.

At this stage, the query was a single statement with the appropriate join clauses, but the relational row data still needed to be materialized and populated in the parent instances.

The first attempt

My first attempt to materialize the dataset into class instances was brute force and honestly somewhat embarrassing. But for a first step, getting the correct result matters more than getting the most efficient result, so at least I was able to write some valuable tests to validate my work.

In simple terms, the algorithm was:

  1. Materialize all objects in each row of the data set.
  2. Loop over each object from parent to child, and populate each relation with the appropriate children.
  3. Call distinct on each relation to ensure uniqueness.
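The three steps above can be sketched in a few lines of self-contained Scala, using the same sample result set shown further below (a deliberately naive simplification of the real code):

```scala
object NaiveMaterialize {
  case class Account(id: Long, username: String)
  case class AccountEvent(id: Long, accountId: Long)

  // One tuple per joined row: (account.id, account.username, account_event.id)
  val rows = Seq(
    (100L, "user1", 200L), (100L, "user1", 201L), (100L, "user1", 202L),
    (101L, "user2", 203L), (101L, "user2", 204L))

  // Step 1: materialize an object for every row (5 Accounts and 5 AccountEvents)
  val allAccounts: Seq[Account] =
    rows.map { case (aid, user, _) => Account(aid, user) }
  val allEvents: Seq[AccountEvent] =
    rows.map { case (aid, _, eid) => AccountEvent(eid, aid) }

  // Steps 2 and 3: populate each relation, calling distinct to ensure uniqueness
  val populated: Map[Account, Seq[AccountEvent]] =
    allAccounts.distinct
      .map(a => a -> allEvents.filter(_.accountId == a.id).distinct)
      .toMap
}
```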

This is less than ideal, but it worked and passed every test case I could think of (including all tests in the Invotes test suite).

But eventually I hit a performance issue. Relatively simple queries were taking seconds to return.

It occurred to me that there's no reason why every row would need to materialize multiple instances, especially when you think about the way SQL returns joined data:

#   account.id   account.username   account_event.id   account_event.account_id
1   100          user1              200                100
2   100          user1              201                100
3   100          user1              202                100
4   101          user2              203                101
5   101          user2              204                101

Clearly it's a waste of resources to materialize five Account classes (the number of rows) when only two are needed (the number of unique IDs). Ideally, each account_event row would be materialized, and the results would then be attached to existing account instances (if they were already materialized) rather than regenerating an account instance for each row.

I decided a better algorithm was necessary - one that makes better use of the tree structure already in place.

  1. Traverse the tree until an end node is reached (the furthest child)
  2. Read the primary key
  3. Search for a materialized instance in a hashtable with the given key
  4. If not found, materialize the object and store it, otherwise return the existing object
  5. Append the materialized object to the parent relation

This process repeats recursively until all objects have been materialized.
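Here's a self-contained sketch of that keyed-cache approach, flattened to two levels (account and account_event) rather than a full recursive tree walk; the names and shapes are illustrative, not the actual implementation:

```scala
object CachedMaterialize {
  case class AccountEvent(id: Long, accountId: Long)
  case class Account(id: Long, username: String) {
    val accountEvents = scala.collection.mutable.ListBuffer[AccountEvent]()
  }

  // Joined rows: (account.id, account.username, account_event.id)
  val rows = Seq(
    (100L, "user1", 200L), (100L, "user1", 201L), (100L, "user1", 202L),
    (101L, "user2", 203L), (101L, "user2", 204L))

  var materializedCount = 0
  private val cache = scala.collection.mutable.Map[(String, Long), Any]()

  // Look the primary key up in the cache; materialize only on a miss
  private def getOrMaterialize[A](table: String, key: Long)(make: => A): A =
    cache.getOrElseUpdate((table, key), { materializedCount += 1; make })
      .asInstanceOf[A]

  for ((accountId, username, eventId) <- rows) {
    val event =
      getOrMaterialize("account_event", eventId)(AccountEvent(eventId, accountId))
    val account =
      getOrMaterialize("account", accountId)(Account(accountId, username))
    account.accountEvents += event // append the child to the parent relation
  }

  val accounts: Seq[Account] =
    rows.map(_._1).distinct.map(id => cache(("account", id)).asInstanceOf[Account])
}
```

Only 7 objects are materialized (2 accounts plus 5 events) instead of 10, and the savings grow with every duplicated parent row.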

The result is much better - very little time is spent materializing rows...and it turned out that the biggest (but not only) culprit causing performance issues in my application was simply a missing index. Whoops.

A minor optimization

There's one other optimization I made, but only when the include path moves in the one-to-many direction of the tree. I never implemented the opposite direction, since I later found that the difference it made was trivial in the grand scheme of things.

Look at the previous example SQL data set, and notice how the account id value is repeated - once for the account table, and once for the account_event table, as these are the foreign key columns that link the tables. Now imagine working with a dataset that's several more relations deep. If the account referenced by account_event.account_id has already been materialized (meaning it is found in the hashtable), then only the account_event data needs to be materialized and attached, and there's no more work to do: none of the row data to the left needs to be read or materialized.

Key Takeaways

TDD is still extremely valuable to me

I know this often turns into a hotly debated subject, but I can't emphasize enough how much writing test cases before I write code helps me efficiently get to the desired result. It keeps me focused on which features are important, and allows me to design code from the consumer's perspective before I start writing any logic.

It gives me the ability to freely experiment. If the test passes, I'm pretty confident that the conceptual model I have in my head is correct. If it fails, I get a stack trace or a helpful failure message that clues me in on what I did wrong.

Plus, since I'm going to have to write the test eventually, and the alternative is using a REPL or some other more temporary test harness - why wouldn't I just put it in a test class first? It feels less wasteful to me.

One thing I should mention is that I don't believe TDD is only for unit testing, or that you have to write isolated unit tests (with mocks) if you're practicing TDD. I believe that testing is most important at code boundaries, and more robust if you minimize the amount of mocking needed. In this case, that just meant writing tests for public code that could be accessed outside of the library, and no mocks were needed.

Optimizations matter in libraries

This is obvious, but I can count the few times I've truly needed to optimize anything in application code, because generally the libraries I depend on have already done the hard work. It was kind of fun to run into an issue where I really did need to speed up an algorithm, since the payoff was noticeable and has the potential to benefit others.

Reflection in Scala is not ideal

.NET easily beats out Scala from a reflection standpoint. The lack of reified generics and expression trees was less than ideal. I recognize that there are limitations to the JVM, but it's still a point in .NET's favor.

On the positive side, it's certainly workable and I was able to accomplish everything despite missing these features.

Open Source is awesome

Huge shoutout to everyone who's worked on Squeryl - it's a great library and I'm always in awe of the sheer amount of free and open software available. Cases like this really highlight why having the ability to modify the source code benefits everyone. I get to build what I need rather than begging a volunteer for free work, and I can share it with others if they find what I built to be useful.

The Code

You can find most of the detailed commits in the original include branch on GitHub. To see the final diff, see the include2 branch, where the commits were squashed.

The code has not yet been merged into the official Squeryl master branch at this time, but I've been using this in several personal projects and I haven't encountered any issues so far. Since this is the first ORM feature I've ever built, I'd be happy to hear any constructive feedback on where I could improve things. Thanks for reading!

