Persism and the SELECT N+1 problem

Virtually all ORMs out there have mechanisms to support some kind of join or relationship where you have - let’s say a Customer object which has a List of Invoice objects. Persism has not had anything much to say about this - leaving it up to you to handle this situation on your own.

The reason for this mainly was I wanted to avoid the “impedance mismatch” problems inherent when you are combining OO/Procedural languages with SQL - which is a declarative query language.

So what is the SELECT N+1 problem?

The N+1 query problem can happen when the ORM executes N additional SQL statements to fetch the data for child tables.

So we have a relationship of Customer -> Invoices -> LineItems -> Product

public final class Customer {
    private String customerId;
    private String customerName;
    private String region;
    private String country;

    private List<Invoice> invoices = new ArrayList<>();
}

public final class Invoice {
    private int invoiceId;
    private String customerId;
    private Date invoiceDate;
    private double total;
    
    private List<LineItem> lineItems = new ArrayList<>();
} 

public final class LineItem {
    private int lineItemId;
    private int invoiceId;
    private int productId;
    private int quantity;
    private double price;     

    private Product product;
} 

public final class Product {
    private int productId;
    private String description;
    private double price;
} 

So what happens if we SELECT one Customer? That would mean 1 query to get the customer, 1 query to get the invoices for the customer, FOR EACH invoice query to get the line items for the invoice, FOR EACH line item a query to get the product.

If on average each customer had 50 invoices with 50 line items that would be 2500 additional queries to the database!

And it gets much worse if you were selecting hundreds of customers.

So how did I handle this situation?

In most cases I was able to avoid this by thinking in terms of the specific user requirement. If the user really just wanted a grid to display customer name, total number of invoices, total price and region. I would write and test an SQL query with this specific information and place it in a View mapped to a Record.

But in some cases I really did want objects in a hierarchy in situations where I was responding with JSON to a web service or a web request.

So how did I handle it? By kicking out the N.

Goodbye N

I select from customers then collect their customerIds
Use this list of customerIds to select invoices (SELECT FROM Invoices WHERE CustomerID IN (1,2,3,4,5))
Use this list of invoiceIds to get the LineItems (SELECT FROM LineItems WHERE InvoiceID IN (5,6,7,8,9))
Use this list of LineItem productIds to get the Products (SELECT FROM Products WHERE ProductID IN (10,11,12))

And then stitch all this together.

Figure. Photo by: Alan King

_{Figure. Photo by: Alan King - source https://www.cmaj.ca/content/169/12/1294.full}

Now instead of potentially 2502 queries I only have 4 + plus some work in memory to put everything together.

So how does this work out?

Generally this approach works well as long as you have indexes and your results are not too large. Most Databases have some limit to the IN (?,?,?) expression though - some maximum number of parameters which varies depending on the DB.

So it’s not a perfect solution. How do we make it better?

Maybe instead if IN (?,?,?) I could use sub-selects like SELECT FROM [CHILD TABLE] WHERE ID IN (SELECT ID FROM [PARENT TABLE])

What does this look like:

-- Single select - Customer
SELECT * FROM CUSTOMERS 
    WHERE "CUSTOMER_ID" = ? 

-- One to many - Customer to Invoices
SELECT * FROM INVOICES 
    WHERE "INVOICES"."CUSTOMER_ID" IN (SELECT "CUSTOMER_ID" FROM CUSTOMERS 
        WHERE "CUSTOMER_ID" = ?)

-- Many to many - Invoices to Line-items 
SELECT * FROM LINEITEMS 
    WHERE "LINEITEMS"."INVOICE_ID" IN (SELECT "INVOICE_ID" FROM INVOICES 
        WHERE "INVOICES".CUSTOMER_ID" IN (SELECT "CUSTOMER_ID" FROM CUSTOMERS 
            WHERE "CUSTOMER_ID" = ?)) 

-- Many to one - Line-items to Products
SELECT * FROM PRODUCTS 
    WHERE "PRODUCTS."PRODUCT_ID" IN (SELECT "PRODUCT_ID" FROM LINEITEMS 
        WHERE "LINEITEMS"."INVOICE_ID" IN (SELECT "INVOICE_ID" FROM INVOICES 
            WHERE "INVOICES".CUSTOMER_ID" IN (SELECT "CUSTOMER_ID" FROM CUSTOMERS 
                WHERE "CUSTOMER_ID" = ?)))      

Well it looks weird but it works! Performance is great as long we have indexes along primary and foreign columns. And it works fine with large data sets.

So How can I generalize all this into Persism

Moving these ideas into a general library is a challenge. What if the user joins on multiple columns of varying types? How does this perform if we don’t have indexes? What happens if I reference a parent class again in a child class? (answer: Infinite LOOP so be careful!)

Version 2.0.1 of Persism (available now!) Contains an experimental Join annotation. To try it out:

<dependency>
    <groupId>io.github.sproket</groupId>
    <artifactId>persism</artifactId>
    <version>2.0.1</version>
</dependency>

Now we have this:

public final class Customer {
    private String customerId;
    private String customerName;
    private String region;
    private String country;

    @Join(to = Invoice.class, onProperties = "customerId", toProperties = "customerId")
    private List<Invoice> invoices = new ArrayList<>();
}

public final class Invoice {
    private int invoiceId;
    private String customerId;
    private Date invoiceDate;
    private double total;
    
    @Join(to = LineItem.class, onProperties = "invoiceId", toProperties = "invoiceId")
    private List<LineItem> lineItems = new ArrayList<>();
} 

public final class LineItem {
    private int lineItemId;
    private int invoiceId;
    private int productId;
    private int quantity;
    private double price;     

    @Join(to = Product.class, onProperties = "productId", toProperties = "productId")
    private Product product;
} 

public final class Product {
    private int productId;
    private String description;
    private double price;
} 

The API for this fairly straightforward. The annotation takes the Class of the child type, the property name(s) of the parent key(s) (onProperties) and the child key(s) (toProperties). Yes it is possible to join on multiple keys).

So how do we get an infinite loop?

Let’s say for example you had a reference to Customer on the Product class.

public final class Customer {
    private String customerId;
    private String customerName;
    private String region;
    private String country;
    private int lastBoughtProductId;
    
    @Join(to = Invoice.class, onProperties = "customerId", toProperties = "customerId")
    private List<Invoice> invoices = new ArrayList<>();
    
}

etc...

public final class Product {
    private int productId;
    private String description;
    private double price;
    
    @Join(to = Customer.class, onProperties = "productId", toProperties = "lastBoughtProductId")
    private List<Customer> customersWhoAlsoBought;
} 

So when the customer goes to invoices, to line-items, to products and then products goes back to customer, This cycle will just keep repeating!

The work-around is to have a different class for Customer. Inherit Customer as CustomerExtraInfo with the additional Join annotations.

This new version is testable from the Unit Tests if you want to try it (Just run AllLocalTests which only uses local DBs).

Currently, I’m working on a benchmarking project using the StackOverflow2010 10 gigabyte DB. I’ll put this project up on github soon.

Open questions

How can I detect infinite loops and fail fast.
Is it possible to have mismatched join columns? Like 1 column on a parent and 2 on a child?
Handling case-sensitive string keys (currently there, but I need to set up a DB to test)
How should this work with Records? Currently, One to One, Many to One will fail. One to Many, Many to Many adds to the Collection. But should we even do that?
Anything else someone can break for me? :)

So what do you think of my solution?

Have a look my discussion here

And remember, there is no N.

_{There is no N}