How can we bring query to the data?Peter Zaitsev
Baron recently wrote about sending the query to the data looking at distributed systems like Cassandra. I want to take a look at more simple systems like MySQL and see how we’re doing in this space.
It is obvious getting computations as closer to the data as possible is the most efficient as we will likely have less data to work with on the higher level in this case. Internally MySQL starts add optimizations which help in this regard, such as Index Condition Pushdown which allow storage engine to do most rudimentary data filtering improving efficiency.
The more important case though is the Application – Database interaction. Modern applications often have quite complicated logic which might not map to SQL very well. Framework and the practices developers follow can only add to this problem. As results Application may be issuing a lot of queries to the database doing computation inside the application and paying a lot of inefficiency and latency for transferring data back and forth. Latency is really a key here as accessing data through network is thousands of times slower than accessing data in memory so many simple data processing algorithms you could imagine accessing data they need row by row simply do not work.
For some tasks simply learning SQL (including some voodoo practices with user variables etc) and using it correctly is good enough to do efficient computations with single round trip, for others – it might not map to SQL very well.
There is known solution to this problem which existed in many database systems for decades – stored procedures. These would allow you to store programs inside database servers so they can work with data locally often accessing it with much lower latency so you can implement much broader set of algorithms. Stored procedures also other other advantages such as giving more security and more control over what application does to DBA but they have limitations too.
There are MySQL limitations – Stored Procedures restrict transparency,hard to debug, do not perform very well and need to be implemented in the programming language from 70s. Though all of this can be fixed in time. The design limitation though is stored procedures do not support some of the modern development and operational practices very well.
If you look at a lot of modern applications with database backend they have the “code” which is something which lives in your version control system changed quickly and such changes can be developed in production many times a day – approach proved to be successful for many modern web application. This is all as long as no database change is needed… because database does not like you to be agile. Changing database schema, indexes etc takes significant effort and can’t be taken lightly. They also require different process as you can’t just “deploy changed database structure” as you do with data you have to have the migration process which can be ranging from trivial (such as adding a column) to rather complicated – such as major database redesign. In fact reducing pain from database maintenance by having no schema is one of the major draws for NoSQL systems which offer a lot more flexibility.
The code also can often be deployed in rolling fashion to make it downtime free then more than one version of code is allowed to run in production for short or long period of time. For example deploying performance optimized version of code I can deploy it only on one Web server for a few days to ensure there are no surprises before full scale deployment
The problem with Stored Procedures they take some middle place. They are really the code which is part of the application but they live in the database and updated through change to the database which is global for all application servers. This makes it for Developers which can’t use the same editing tool to just edit code save it and see how it runs in the test system without doing extra work. Second problem is solved by some people by implementing some versioning in the database, so instead of CALL MAKE_PAYMENT(…) they will use CALL MAKE_PAYMENT_(…) which is however also quite pain in the butt.
So what would be the alternative ?
Implementing something like this one would need to be extra careful as it would allow application developers to kill database servers either more effectively than they do now – the proper limit on resource usage (CPU, memory etc) would need to be enforced. We also would need to be extra careful with security as ability to change the “program” allows hackers a lot more ways to express their creativity than SQL injection currently does.
The downside of this approach compared to really “stored” procedures from performance standpoint is their dynamic nature – each would need to be compiled for execution from the scratch. I think this could be substantially optimized with modern technologies and it is a small price to pay for the power to avoid many round trips and get a lot more power on the data processing local to the database.
In the end I think something along those lines could be quite helpful in expending usability of relational databases for modern applications. What are your thoughts?