September 22, 2014

Preprocessing Data

There are many ways of improving response times for users. There are some people that spend a lot of time, energy and money on trying to have the application respond as fast as possible at the time when the users made the request.

Those people may miss out on an opportunity to do some or all of the processing the application needs to do at a different point in time. In other words, if you preprocess your data ahead of time, you can reduce the time it takes to complete a request.

 

Allow me to give you three examples of what I mean:

1)  There is a sales report that your managers would like to see on their fancy new dashboards. The query for this report takes 45 minutes to run and may disrupt other functions that the database server needs to do. You decide to run this report at 3am when there is very little happening on the database server and save the results to a separate table. When the dashboard queries the database, it only queries the ready made results in the separate table.

2) You work in a social micro-blogging start up and it is important for your users to send and receive messages very quickly. Should a user send a message to his or her group of friends, the user does not want to wait too long to submit a message and the users would not like to wait a long time to view all their messages when they reach the site. To help reduce load on the site’s infrastructure, you decide to use a queuing system to asynchronously update each friend’s new messages table after one of their friends has sent them a message. This means that the user doesn’t need to wait a long time to submit the message. It also means that when a user views their messages, they quickly get results from prepared tables.

3) You work for a movie review website and on the main page of the website are four “Top Ten” style charts. These charts are very popular and very process heavy on the database server. You decide to set up four seperate tables for each graph and schedule four jobs to run once an hour to repopulate the tables with current data. You then change the main page of the website to query these new tables. You have now improved the speed of the main page of the website as well as seriously reduce the load on the database. The business say that once an hour is more then enough for accurate results.

 

Points in Time

Here are the possible “points in time” when you might want to consider preprocessing your data:

On Insert

This is when you insert data into the database. Your application or the database itself (using TRIGGERs) does additional calculations or data insertions. This adds some overhead when data is being inserted, but according to your business requirements, it may be necessary. This is a form of event processing. Examples could be counter tables, audit logs and cache invalidation. This use case works better in environments where there are fewer inserts and perhaps many more reads.

Asynchronous

The work is done in the background and there is no wait time when data is inserted. This can be done by forking processes or using a queuing system. You would need an error monitoring system because if a forked process crashes, you would need to recover it in some way. This is a form of event processing. Examples could be messaging systems, bidding systems and even real-time analytics.

Off Peak

Similar to the example given above, you preprocess your data during a period when there is very little work happening on the database server. This is a form of batch processing. Examples could be reporting, data quality checks and optimizations.

Intervals

Similar to the retail example I gave, you run a scheduled job every certain interval. This is a form of batch processing. Examples could be reporting, updating caches (preventing cache stampedes) and counter tables.

Prefetching

If there is a likely chance that the user may ask for certain information in the future, then you can pre-generate that result. Examples could be pagination.

On Request

If there is no good way to preprocess the data then you just have to send the result as quickly as possible to the user when they request it.

 

Conclusion

I have tried to demonstrate that there are five points-in-time when you can preprocess your data to reduce the response time of user requests.
As a final point, similar with any caching technique, running large processes only once and avoiding doing them a multitude of times not only saves time, but also electricity. So please consider that as an additional and possibly important benefit.

 

Comments

  1. tobi says:

    Nice post. If there was a +1 button, I would click it.

  2. Good article Jonathan – looking forward to more of your insights! Steve

  3. Jasper Knops says:

    To decrease to load for the user even more you could implement an alternative action that would be faster than an insert query. For a large community site we triggered a syslog and parsed this data in the background to populate summary tables.

Speak Your Mind

*