Speeding Up Bulk Imports in Rails
This article is about speeding up bulk imports in Rails. Let’s imagine we have 50,000 products to import… not an absurd amount, but enough that we can start to notice some pain, and enough that we wouldn’t want to keep the user waiting while the system imports them all. That is, unless we can figure out a way to have these records imported quickly.
The Data We’re Working With
We’ll be working with a Product model which has the columns: sku, name, origin, and msrp_cents. To populate our database, we receive a nightly dump from the manufacturer which arrives to us as a CSV file. The data looks like this:
0abcd,0 Product,China,3000
1abcd,1 Product,Mexico,4000
2abcd,2 Product,Canada,2000
3abcd,3 Product,USA,5000
4abcd,4 Product,Canada,4000
Here is the schema:
create_table "products", force: :cascade do |t|
  t.string   "sku",        null: false
  t.string   "name",       null: false
  t.string   "origin",     null: false
  t.integer  "msrp_cents", null: false
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end

add_index "products", ["sku"], name: "index_products_on_sku", unique: true, using: :btree
The data is obviously fake, but that doesn’t matter: the principle is the same. The only validation our model has is that the sku field must be unique; the other fields simply can’t be null, which is enforced by the database constraints in the schema above.
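For reference, the model at this point is roughly the following (a minimal sketch; the original model file isn’t shown, so the exact code may differ):

class Product < ActiveRecord::Base
  # The only application-level validation: sku must be unique.
  # The NOT NULL rules on the other columns are enforced by the database.
  validates :sku, uniqueness: true
end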
Attempt 1 – The original (59 seconds)
My first attempt was the most straightforward one… nothing fancy is going on. I’m taking the CSV file, looping over the records, and saving each one.
task import_products1: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save!
  end
end
Running this with the time program produces the following output:
rake speedy:import_products1 59.45s user 4.44s system 77% cpu 1:22.59 total
Is 59 seconds fast? Slow? At this point we don’t really know. But let’s take a look at what’s going on for each of the records.
   (0.1ms)  BEGIN
  Product Exists (0.6ms)  SELECT 1 AS one FROM "products" WHERE "products"."sku" = '123' LIMIT 1
  SQL (0.3ms)  INSERT INTO "products" ("sku", "name", "origin", "msrp_cents", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id"  [["sku", "123"], ["name", "n"], ["origin", "Russia"], ["msrp_cents", 1000], ["created_at", "2015-12-05 01:34:41.896705"], ["updated_at", "2015-12-05 01:34:41.896705"]]
   (6.1ms)  COMMIT
Here is what’s happening step by step:
- BEGIN starts a database transaction. If there is a failure of any kind, it will run the ROLLBACK command to avoid finalizing these changes in the database.
- Next, because we have a uniqueness validation in the model, Rails queries to check whether this sku already exists.
- At this point it runs the actual INSERT statement, inserting the data into the database.
- Because there were no failures, it runs COMMIT to finalize all changes made within this database transaction.
What we don’t see is that there is a lot going on behind the scenes as well. Rails has to talk to the database over TCP four times for each record, and it has to build two different SQL statements each time. Luckily, that last part is quick thanks to some work by Tenderlove on AdequateRecord Pro, where the query construction is cached.
Attempt 2 – Rely on DB for unique validation (44 seconds)
The first thing we’ll do to speed up the import is to rely on our database to ensure the skus are unique, rather than on Rails. This will allow us to avoid running that extra SELECT statement.
You should always have a unique index on any field you want to actually be unique. Rails will do its best, but it can’t guarantee uniqueness due to race conditions.
The migration to add this unique index looks like this:
add_index :products, [:sku], unique: true
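Wrapped in a full migration, that could look something like this (the class name is just an example):

class AddUniqueIndexToProductsSku < ActiveRecord::Migration
  def change
    # Enforce uniqueness at the database level so a duplicate sku fails
    # the INSERT itself instead of relying on an application-level check.
    add_index :products, [:sku], unique: true
  end
end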
If there is a failure, it will raise an ActiveRecord::RecordNotUnique exception, which you can rescue from. To avoid running this validation, we could pass the save! method the option validate: false, but that would stop all validations from being run, not just the one we want to avoid.
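For completeness, rescuing that exception might look something like this sketch; what you do with the duplicate (skip it, log it, update the existing row) is up to you:

begin
  product.save!
rescue ActiveRecord::RecordNotUnique
  # The unique index rejected the row because this sku already exists.
  # Handle the duplicate here: skip it, log it, or update the existing record.
end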
Below is a technique for avoiding a single validation. It creates an attr_accessor called skip_uniqueness, which we use in conjunction with the unless option that we can pass to the validates method.
class Product < ActiveRecord::Base
  attr_accessor :skip_uniqueness

  validates :sku, uniqueness: true, unless: :skip_uniqueness

  def save_for_bulk_import!(options = {})
    self.skip_uniqueness = true
    save! options
  end
end
Using this technique, we created our own version of the save! method which sets the skip_uniqueness attribute to true, thereby skipping the uniqueness validation.
And finally, here’s the code that was run to produce this result:
task import_products2: :environment do
  require 'csv'

  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
    product.save_for_bulk_import!
  end
end
Now after running the import, we get a time of 44 seconds… about 15 seconds faster than the original!
rake speedy:import_products2 44.10s user 3.56s system 78% cpu 1:00.57 total
Attempt 3 – Wrap it in a transaction (34 seconds)
The way we have our code now, each of the 50,000 products gets its own database transaction. This is how Rails handles saving a model by default… unless of course we wrap it in our own transaction, a single one that we can use for all 50,000 records.
This means that if there is any failure, none of them will be inserted.
task import_products3: :environment do
  require 'csv'

  Product.transaction do
    CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
      product = Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
      product.save_for_bulk_import!
    end
  end
end
Now when we run the code, we see that it is down to 34 seconds… another 10 seconds faster!
rake speedy:import_products3 34.14s user 1.47s system 89% cpu 39.696 total
Attempt 4 – Bulk insert (4.6 seconds)
Our code still has to make a round trip to the database for each of the 50,000 records. I had a feeling this was the real reason for the slowness, because if I instantiate 50,000 products without saving them, the script takes only about 5 seconds.
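That sanity check was nothing more than a variation of the first task that builds the objects and never saves them; something along these lines (the task name is just for illustration):

task instantiate_products: :environment do
  require 'csv'

  # Instantiate every product without saving; timing this in isolation
  # shows how much of the total cost comes from the database round trips.
  CSV.foreach("#{Rails.root}/tmp/products.csv") do |p|
    Product.new(sku: p[0], name: p[1], origin: p[2], msrp_cents: p[3])
  end
end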
To solve this, let’s take advantage of something both MySQL and PostgreSQL support: inserting multiple rows with a single INSERT statement.
INSERT INTO films (code, title, did, date_prod, kind) VALUES
  ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
  ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
As with many things in Ruby, there is a gem that helps us do this called activerecord-import. Now we’ll read all of the products into an array and insert them into the database all at once:
task import_products4: :environment do
  require 'csv'

  Product.transaction do
    products = CSV.read("#{Rails.root}/tmp/products.csv")
    columns  = [:sku, :name, :origin, :msrp_cents]

    Product.import columns, products, validate: false
  end
end
This lowers our time all the way down to 4.6 seconds!
Keep in mind that in this case I’ve turned off validations entirely. As a result, you should be sure that your incoming data is valid, maybe by running it through a cleansing phase before it gets to this point. That being said, even with the validations running, the import took about 25 seconds, which is the second-fastest time I was able to get.
rake speedy:import_products4 4.64s user 0.30s system 75% cpu 6.527 total
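For reference, the 25-second validated run is just the same task with the validate option flipped to true; roughly like this (the task name is just for illustration):

task import_products4_with_validations: :environment do
  require 'csv'

  Product.transaction do
    products = CSV.read("#{Rails.root}/tmp/products.csv")
    columns  = [:sku, :name, :origin, :msrp_cents]

    # Same bulk insert, but activerecord-import runs the model validations
    # on each record before building the multi-row INSERT.
    Product.import columns, products, validate: true
  end
end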
Conclusion
Take advantage of your database! It’s very efficient at what it does, and by relying on its functionality, you’ll be able to figure out different ways to make your code more efficient.
In this article, we relied upon our not-null database constraints along with a unique index, wrapped the whole import in a single transaction, and then used the ability to insert multiple records with a single SQL query. All in all, we took the time to import 50,000 records from 59 seconds all the way down to 4.6 seconds, around 12 times faster.