When reviewing the contents of Flare’s database, we encountered a table with 1 billion records. Technically, that isn’t a problem: Flare runs on Vapor and uses an Aurora database, so it can handle that scale. But of course, there’s a cost to storing that many records.
We dove in and concluded that we could safely delete about 900 million records. They had all been created before a specific date.
Attempt 1: running a single delete query
A first naive approach would be to run a simple delete query.
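A minimal sketch using Laravel’s query builder; the table name and cutoff date are placeholders:

```php
use Illuminate\Support\Facades\DB;

// Delete every record created before the cutoff date in a single query.
DB::table('big_table')
    ->where('created_at', '<', '2021-01-01 00:00:00')
    ->delete();
```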
Performing a delete query like this on a small table will probably work. But when your table contains a billion items, it will likely cause problems.
In most cases, a database will try to perform all queries atomically, so no other queries see intermediary results. When a deletion query with a `where` clause starts, it will lock that table. All other queries that are run against the table will have to wait.
Because of the massive size of the table, the deletion query would probably run for hours. The lock would be in place for hours, essentially halting the entire app.
So, just running a single query to clean up the table isn’t a good idea.
Attempt 2: using a limit
To make the query complete faster, you can add a `limit` clause. Because the query completes faster, the table lock will be released much faster as well. Other queries won’t be halted as long.
Here’s an artisan command that performs that query.
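A sketch of what that command could look like; the table name, cutoff date, and `clean:table` signature are placeholders:

```php
namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class CleanTableCommand extends Command
{
    protected $signature = 'clean:table';

    protected $description = 'Delete old records from the big table';

    public function handle(): void
    {
        // Only delete the first 1000 records that match the where clause.
        DB::table('big_table')
            ->where('created_at', '<', '2021-01-01 00:00:00')
            ->limit(1000)
            ->delete();
    }
}
```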
In the code above, only the first 1,000 matching records are deleted. Our table contains many more records that should be deleted.
A deletion query in Laravel returns the number of deleted records. We can use that number to determine whether it’s worthwhile to rerun the deletion query.
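As a sketch, the command’s `handle()` method could keep looping until a pass no longer deletes any rows:

```php
public function handle(): void
{
    do {
        // delete() returns the number of rows it removed.
        $deletedRows = DB::table('big_table')
            ->where('created_at', '<', '2021-01-01 00:00:00')
            ->limit(1000)
            ->delete();
    } while ($deletedRows > 0);
}
```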
This strategy works quite well on traditional systems where there’s no limit on how long a process can run. If you use a conventional server instead of AWS Lambda and want to use the cleanup strategy above, look at our model cleanup package.
The problem with this strategy on AWS Lambda is that there is a hard execution time limit of 15 minutes. When cleaning up a huge table, the process would likely take multiple hours, so doing the cleanup in a single artisan command isn’t possible.
Attempt 3: using jobs
To overcome that execution time limit of 15 minutes, the deletion logic can be moved to a job. The job can dispatch itself to delete more records.
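A sketch of such a self-dispatching job, using the same placeholder table and cutoff date as before:

```php
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\DB;

class CleanTableJob implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function handle(): void
    {
        // Delete a small batch so each query (and its table lock) stays short.
        $deletedRows = DB::table('big_table')
            ->where('created_at', '<', '2021-01-01 00:00:00')
            ->limit(1000)
            ->delete();

        // A full batch means there are probably more records left,
        // so dispatch a fresh job to continue the cleanup.
        if ($deletedRows > 0) {
            self::dispatch();
        }
    }
}
```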
This strategy works great: each query executes fast, and the deletion process can keep running for as long as needed.
Scheduling the cleanup process
In Laravel’s console kernel, you can schedule jobs in addition to artisan commands.
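A minimal sketch of what that could look like in `app/Console/Kernel.php`:

```php
use App\Jobs\CleanTableJob;
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    $schedule->job(new CleanTableJob())->hourly();
}
```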
This will work, but there’s a caveat here. When Laravel performs the scheduled tasks, it will run them one after the other. This means that if one task takes a long time, all the others have to wait. That’s why you should always try to make scheduled tasks complete as fast as possible.
It’s certainly possible that our `CleanTableJob` will take a couple of seconds when the database is slow. That’s why it’s probably a good idea to schedule an artisan command that’s guaranteed to run fast.
Instead of the job, the command can be scheduled.
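A sketch of what that could look like; again, the `clean:table` signature is a placeholder:

```php
namespace App\Console\Commands;

use App\Jobs\CleanTableJob;
use Illuminate\Console\Command;

class CleanTableCommand extends Command
{
    protected $signature = 'clean:table';

    protected $description = 'Kick off the cleanup of the big table';

    public function handle(): void
    {
        // Dispatching a job is near-instant, so this command always finishes fast.
        dispatch(new CleanTableJob());
    }
}
```

In the console kernel, `$schedule->command('clean:table')->hourly();` then replaces the job scheduling shown earlier.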
Since the `CleanTableCommand` only dispatches a job, it will always complete very fast.
There is… one more caveat. In the example above, the `CleanTableCommand` is scheduled to run hourly. The command will dispatch the `CleanTableJob` that will re-dispatch itself if more records need to be deleted.
Now imagine that the `CleanTableJob` keeps re-dispatching itself for more than an hour. While there is still a `CleanTableJob` running, the `CleanTableCommand` will dispatch another `CleanTableJob`. That will cause multiple `CleanTableJob` jobs to run at the same time.
In our specific situation, where we are cleaning up a table with hundreds of millions of records, it will likely take hours or even days to clean up the table. If left unchecked, a great many `CleanTableJob` jobs would be running. Ideally, we want only one `CleanTableJob` running at any given time.
We can solve this problem by using an atomic lock. Laravel supports this out of the box. Here’s how using the lock could look.
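A sketch of the job’s `handle()` method using `Cache::lock()`; the lock name and timeout are our own choices:

```php
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;

// Inside App\Jobs\CleanTableJob.
public function handle(): void
{
    // Try to acquire an atomic lock, held for at most 60 seconds.
    // If another CleanTableJob already holds it, bail out.
    $lock = Cache::lock('clean-table', 60);

    if (! $lock->get()) {
        return;
    }

    try {
        $deletedRows = DB::table('big_table')
            ->where('created_at', '<', '2021-01-01 00:00:00')
            ->limit(1000)
            ->delete();
    } finally {
        $lock->release();
    }

    if ($deletedRows > 0) {
        self::dispatch();
    }
}
```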
This code looks good, but it can be improved. That locking code makes it hard to see the essence of the job, which is deleting rows.
The locking code can be moved to a job middleware.
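A Laravel job middleware is just a class with a `handle($job, $next)` method. A sketch of what such a middleware could look like, keyed on the job class so each job type gets its own lock:

```php
namespace App\Jobs\Middleware;

use Illuminate\Support\Facades\Cache;

class AtomicJobMiddleware
{
    public function handle(object $job, callable $next): void
    {
        // Use the job class name as the lock key so each job type
        // gets its own atomic lock.
        $lock = Cache::lock('atomic-job-' . get_class($job), 60);

        // If another instance of this job holds the lock, skip this run.
        if (! $lock->get()) {
            return;
        }

        try {
            $next($job);
        } finally {
            $lock->release();
        }
    }
}
```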
Here’s the rewritten `CleanTableJob` that makes use of the `AtomicJobMiddleware`.
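A sketch of the job with the middleware applied; with the locking concern moved out, `handle()` only deals with deleting rows:

```php
namespace App\Jobs;

use App\Jobs\Middleware\AtomicJobMiddleware;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\DB;

class CleanTableJob implements ShouldQueue
{
    use Dispatchable, Queueable;

    // Laravel runs these middleware around the job's handle() call.
    public function middleware(): array
    {
        return [new AtomicJobMiddleware()];
    }

    public function handle(): void
    {
        $deletedRows = DB::table('big_table')
            ->where('created_at', '<', '2021-01-01 00:00:00')
            ->limit(1000)
            ->delete();

        if ($deletedRows > 0) {
            self::dispatch();
        }
    }
}
```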
And with all this in place, we now have a pretty reliable way of cleaning up big tables.
Cleaning up big tables isn’t that difficult, but there are many caveats to take care of. I hope this post helps you clean up your tables.
If you’re using an Aurora cluster on AWS, you should be aware that deleting a lot of records won’t make your bill smaller as space used by the deleted records will still be allocated. Here’s how to reclaim the allocated space.
If you’re not on AWS and want to clean up big tables in a safe way, take a look at our model cleanup package.
We are currently optimizing Flare some more, and will soon release a first batch of new features. Stay tuned for that!