September 24, 2014

How can we bring query to the data?

Baron recently wrote about sending the query to the data, looking at distributed systems like Cassandra. I want to take a look at simpler systems like MySQL and see how we’re doing in this space.

It is obvious that getting computation as close to the data as possible is most efficient, as we will likely have less data to work with at the higher levels. Internally, MySQL has started adding optimizations which help in this regard, such as Index Condition Pushdown, which allows the storage engine to do rudimentary data filtering, improving efficiency.
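A minimal illustration of ICP, assuming MySQL 5.6+ and a made-up table: with a composite index, the engine can evaluate the non-prefix part of the WHERE clause against index entries instead of fetching every matching row to the server first.

CREATE TABLE people (
  id        INT PRIMARY KEY,
  zipcode   VARCHAR(10),
  last_name VARCHAR(50),
  address   VARCHAR(100),
  KEY k_zip_name (zipcode, last_name)
);

-- Without ICP, every row with zipcode = '95054' is read in full;
-- with ICP, the LIKE filter runs inside the storage engine.
EXPLAIN SELECT * FROM people
WHERE zipcode = '95054' AND last_name LIKE '%son%';
-- The Extra column shows "Using index condition" when ICP applies.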

The more important case, though, is the application–database interaction. Modern applications often have quite complicated logic which might not map to SQL very well. Frameworks and the practices developers follow can only add to this problem. As a result, the application may issue a lot of queries to the database, do the computation inside the application, and pay a lot in inefficiency and latency for transferring data back and forth. Latency is really the key here: accessing data over the network is thousands of times slower than accessing data in memory, so many simple data processing algorithms you could imagine, which access the data they need row by row, simply do not work.

For some tasks, simply learning SQL (including some voodoo practices with user variables, etc.) and using it correctly is good enough to do efficient computations in a single round trip; other tasks might not map to SQL very well.
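One classic example of such “voodoo” – a running total computed in a single round trip with user variables instead of summing row by row in the application (the orders table is hypothetical):

SET @running := 0;
SELECT order_id, amount,
       (@running := @running + amount) AS running_total
FROM orders
ORDER BY order_id;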

There is a known solution to this problem which has existed in many database systems for decades: stored procedures. These allow you to store programs inside the database server so they can work with the data locally, often accessing it with much lower latency, so you can implement a much broader set of algorithms. Stored procedures also have other advantages, such as giving the DBA more security and more control over what the application does, but they have limitations too.
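For concreteness, a minimal sketch of a stored procedure – the accounts table and the business logic are made up, but the point is that the whole update runs next to the data, in a single round trip from the application:

DELIMITER //
CREATE PROCEDURE apply_interest(IN rate DECIMAL(5,4))
BEGIN
  -- the scan and update happen entirely inside the server
  UPDATE accounts
     SET balance = balance * (1 + rate)
   WHERE type = 'savings';
END //
DELIMITER ;

CALL apply_interest(0.0150);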

There are MySQL-specific limitations: stored procedures restrict transparency, are hard to debug, do not perform very well, and must be implemented in a programming language from the 70s. All of this can be fixed in time, though. The design limitation is that stored procedures do not support some of the modern development and operational practices very well.

If you look at a lot of modern applications with a database backend, they have the “code”, which is something that lives in your version control system, changes quickly, and can be deployed to production many times a day – an approach that has proved successful for many modern web applications. This all works as long as no database change is needed… because the database does not like you to be agile. Changing the database schema, indexes, etc. takes significant effort and can’t be taken lightly. It also requires a different process: you can’t just “deploy the changed database structure” as you do with code; you have to have a migration process, which can range from trivial (such as adding a column) to rather complicated, such as a major database redesign. In fact, reducing the pain of database maintenance by having no schema is one of the major draws of NoSQL systems, which offer a lot more flexibility.
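Even the “trivial” end of that spectrum shows the difference in process. The statement below (the table is hypothetical) is a one-liner, but on a large table it may rebuild or lock the table, so it goes through a migration step or an online schema change tool rather than being “deployed” like code:

ALTER TABLE users ADD COLUMN last_login DATETIME NULL;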
The code can also often be deployed in a rolling fashion to make the deployment downtime-free; in that case more than one version of the code is allowed to run in production for a short or long period of time. For example, when deploying a performance-optimized version of the code, I can deploy it on only one web server for a few days to ensure there are no surprises before the full-scale deployment.

The problem with stored procedures is that they occupy a middle ground. They are really code, part of the application, but they live in the database and are updated through a change to the database, which is global for all application servers. The first problem is that developers can’t use the same editing tools to just edit the code, save it, and see how it runs in the test system without doing extra work. The second problem – the global update – is solved by some people by implementing versioning in the database: instead of CALL MAKE_PAYMENT(…) they will use a versioned name such as CALL MAKE_PAYMENT_V2(…), which is, however, also quite a pain in the butt.
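In practice the versioning workaround looks something like this (names and logic are purely illustrative): each release installs its own copy of the procedure, so application servers on different releases can coexist during a rolling deployment.

DELIMITER //
CREATE PROCEDURE make_payment_v1(IN acct INT, IN amt DECIMAL(10,2))
BEGIN
  UPDATE accounts SET balance = balance - amt WHERE id = acct;
END //
CREATE PROCEDURE make_payment_v2(IN acct INT, IN amt DECIMAL(10,2))
BEGIN
  -- new logic shipped alongside the old version
  UPDATE accounts
     SET balance = balance - amt
   WHERE id = acct AND balance >= amt;
END //
DELIMITER ;

-- a web server on the old release: CALL make_payment_v1(42, 9.99);
-- a web server on the new release: CALL make_payment_v2(42, 9.99);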

So what would be the alternative?

I would love to see MySQL extend the work started with INSERT ON DUPLICATE KEY UPDATE and multi-result sets by allowing clients to submit simple programs to run on the database server side as an alternative to queries (or even as an alternative to the SQL API). Using a language more friendly to modern developers, such as JavaScript, and allowing more than flat tables to be returned (for example, JSON objects) would be quite appreciated. Such an API would, for example, allow us to solve “dynamic join” problems efficiently – where the data which belongs to an object (and as such which tables need to be added to the join in SQL) depends on the object’s properties – as well as handle complex update logic which now often requires many round trips to the application.
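To see why the “dynamic join” is painful today, consider a hypothetical polymorphic schema: the application cannot know which detail table to read until it has inspected the row, so the decision costs an extra network round trip that a server-side program could make locally.

-- round trip 1: fetch the object and learn its type
SELECT id, object_type FROM objects WHERE id = 123;

-- round trip 2, issued only after the application has inspected
-- object_type; which table to read depends on the value just fetched
SELECT * FROM video_details WHERE object_id = 123;
-- ...or photo_details, or text_details, and so on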

Implementing something like this, one would need to be extra careful, as it would allow application developers to kill database servers even more effectively than they do now – proper limits on resource usage (CPU, memory, etc.) would need to be enforced. We would also need to be extra careful with security, as the ability to change the “program” gives hackers a lot more ways to express their creativity than SQL injection currently does.

The downside of this approach compared to truly “stored” procedures, from a performance standpoint, is its dynamic nature – each program would need to be compiled for execution from scratch. I think this could be substantially optimized with modern technologies, and it is a small price to pay for avoiding many round trips and getting a lot more data processing power local to the database.

In the end, I think something along these lines could be quite helpful in extending the usability of relational databases for modern applications. What are your thoughts?

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Justin Swanhart says:

    Shard-Query sends the query to the data. I even mentioned it in Baron’s comments. Why write about this with no mention of a tool that already shards (like Cassandra does) and sends the query to the data (which Cassandra doesn’t)?

  2. Justin Swanhart says:

    The computation is aggregation and group by. ICP doesn’t count – it is just selection and filtering, which Baron was complaining about. You have to be able to send a COUNT(*) to 100 nodes and only get 100 results back, not billions. That is sending the query to the data.

  3. Justin Swanhart says:

    What cassandra does is FINE for OLTP. MAKE_PAYMENT isn’t the problem. COUNT_PAYMENTS is.

  4. Justin Swanhart says:

    VoltDB takes this stored procedure approach for OLTP btw, instead of using actual SQL

  5. Leif Walsh says:

    I can imagine this is a possible future for MongoDB’s aggregation framework. What they’re building is a query language where the syntax is “pipe” and the vocabulary is fairly well optimized operators. From here, the work to be done is

    1) create more operators to make the “language” more general purpose,

    2) find interesting “phrases” in the language and optimize them (for example, sort followed by limit),

    3) create some way of defining new words as shorthand for phrases (for example, “function top_k() { sort | head -n$1; }” in bash) and

    4) possibly add ways for a pipeline element to modify existing collection data, rather than just outputting new data.

    I believe this is the direction MongoDB is headed, but if they don’t, it might be something I would want to work on in TokuMX.

    Piping as a way of creating a functional programming language has already been vetted in a fair number of places, most notably the shell. I feel like this is the right way to go for “take the query to the data” type operations (Javascript was a mistake, for example). I certainly don’t have the MySQL chops to design how this could be added to MySQL, but I believe the high-level design is applicable and a good approach for any database.

    In particular, it’s something that can be developed iteratively (start with just sort, group by, and limit, and build up interesting operators from there), it’s something that can be tuned and customized easily (unlike something like Javascript where you take an entire language runtime with all its warts), and it’s easy for users to develop with in the same way that languages with a REPL are easy to develop with (unlike stored procedures). You wouldn’t necessarily have to pick something JSON-ish as the intermediate format in MySQL, I think sticking with something SQL-ish might be the right choice, but I don’t know exactly what the language should look like.

  6. Justin Swanhart says:

    You are thinking in data flow. It happens that SQL is a natural data flow language, but people often overlook this fact. Map/Reduce is also a data flow framework, including tagging, which is an important concept in data flow architecture. Shard-Query uses techniques similar to map/reduce, but uses a technique called map/combine. Multiple map/reduce steps are combined into one process. A SQL query can count and filter at the same time. For cases where map/combine is not possible Shard-Query does map/reduce using tagging. It hashes the group-by attributes for the tags.
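    To illustrate the map/combine idea in plain SQL (conceptual, not Shard-Query’s literal rewrite; the tables are made up): the aggregate is pushed to each shard, and only small pre-aggregated rows come back to be combined.

    -- "map" step, sent to every shard in parallel
    SELECT region, COUNT(*) AS cnt FROM orders GROUP BY region;

    -- "combine" step, run over the collected per-shard rows
    SELECT region, SUM(cnt) AS cnt
    FROM collected_shard_results GROUP BY region;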

  7. Justin Swanhart says:

    Think about “select count(*) from table”

    you could implement that as:
    cat shard1 shard2 | wc -l

    but that is slow – it looks at all rows and the count is serialized

    # run the two counts in parallel; results go to files because a
    # backgrounded assignment like c1=$(...) & happens in a subshell and is lost
    wc -l < shard1 > c1.tmp &
    wc -l < shard2 > c2.tmp &
    wait
    echo "$(cat c1.tmp) + $(cat c2.tmp)" | bc

    that is what Shard-Query does

  8. Justin,

    There are many layers at which you can think about “sending the query to the data” and many functions it may or may not perform. For example, Netezza takes some processing down to a really low storage level, and the Kickfire appliance was able to run some functions on special-purpose hardware close to the data (compared to the main CPU). ICP in this case allows some operations (filtering) to be done closer to the data (at the storage engine level), which can provide benefits ranging from simply fewer physical data reads from disk and less copying, to significantly less network communication, as in the case of NDB/MySQL Cluster.

  9. Justin Swanhart says:

    I think of ICP as fixing a deficiency in the storage engine. ECP is more sending the query to the data, and yes, NDB uses it to good effect. But they are still single threaded in the ‘wc’ example above. That is the SQL node. Shard-Query sends the wc (count) to the data.

  10. Leif,

    I think SQL is one of the big reasons relational databases succeeded, but it is also a big reason the search for alternatives is happening now.

    Relational data structures, if you think about them, are very powerful. They essentially allow you to represent any data structure you would be able to represent in something like the classic C language, with its structures and pointers.

    The SQL language, though, is only able to work efficiently with a small subset of the data structures which we can represent in relational databases.

    Yes, you can keep extending SQL, but I think you can never get the flexibility of a conventional language in terms of its ability to work with data structures.

    The table as the result set is another serious limitation, which requires us either to have a lot of round trips or to duplicate data.
    Think about it for this blog: if I wanted to get the top 10 posts with all their comments, presented as a SQL result set, I would need either to duplicate the post content for every comment or to fetch the blog posts and comments separately and then “join” them some way in the application. Both of these options are ugly, and both would be fixed if I could get some form of hierarchical result set in return.
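    In today’s SQL the two options look roughly like this (posts and comments being this hypothetical blog schema):

    -- option 1: one round trip, but the post body repeats on every comment row
    SELECT p.id, p.title, p.body, c.author, c.comment_text
      FROM (SELECT * FROM posts ORDER BY published_at DESC LIMIT 10) p
      LEFT JOIN comments c ON c.post_id = p.id;

    -- option 2: no duplication, but two round trips and an application-side "join"
    SELECT * FROM posts ORDER BY published_at DESC LIMIT 10;
    SELECT * FROM comments
    WHERE post_id IN (1, 2, 3); -- ids collected from the first query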

  11. Justin Swanhart says:

    and they send all of the ‘cat’ over the network. ndb sends a few bytes to each node and gets potentially gigabytes back. shard query sends a few bytes to each node and gets a few bytes back.

  12. Justin Swanhart says:

    SQL-92 doesn’t have recursion, which is a big limitation. I want to add CTEs and recursive CTEs to Shard-Query to fix that. I just need time. Should I focus on that, Peter?
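    For reference, what recursion buys you, sketched in the standard SQL:1999 syntax (not something MySQL supports; the employees table is made up) – walking a whole hierarchy in one statement instead of one query per level:

    WITH RECURSIVE subordinates AS (
      SELECT id, manager_id, name FROM employees WHERE id = 1
      UNION ALL
      SELECT e.id, e.manager_id, e.name
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.id
    )
    SELECT * FROM subordinates;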

  13. Justin Swanhart says:

    And Peter, if we wrote a storage engine that could talk to hard drive filters, shard-query would still scale that out over more than one node. You can’t scale up forever. That is what fried Kickfire. We needed something like Shard-Query and we didn’t have it.

  14. Justin Swanhart says:

    Peter – I understand why Kickfire was fast. I know exactly how many cpus it had and the /tricks/ it did to get fast results.

  15. Roger Booth says:

    Maybe Percona has the resources to make MySQL Proxy a player in this space?

  16. Leif – Do your plans allow for an optimizer? Most users won’t understand what can be done to make analytic queries faster and what is done in a good parallel RDBMS.

    This was a fun example of moving code to data (non-stored procedures) – http://highscalability.com/blog/2010/11/9/facebook-uses-non-stored-procedures-to-update-social-graphs.html

  17. Baron says:

    Interestingly, MySQL does allow PROCEDURE at the end of a query, which has never really been used outside of PROCEDURE ANALYSE as far as I know.
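    For reference, the syntax looks like this – ANALYSE examines the result set and suggests an optimal data type for each column (the table is hypothetical):

    SELECT zipcode, last_name FROM people PROCEDURE ANALYSE();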

  18. Justin Swanhart says:

    Except that that interface was removed in 5.6. Now PROCEDURE ANALYSE is just a keyword.

  19. Justin Swanhart says:

    Now PS 5.6 allows you to select into a named pipe or unix socket which could allow a lot of interesting tricks.

  20. Justin Swanhart says:

    Roger Booth,

    Shard-Query has a MySQL proxy interface, btw. It is the optimizer that Mark was alluding to. It knows how to rewrite SQL in an analytics way for massively parallel processing. The goal isn’t just creating a pipeline as Leif suggested, but parallelizing that pipeline. Shard-Query “sends the query to the data” by adding WHERE clauses or PARTITION hints to look at tables in parallel for example and it does this on one or more shards, again in parallel.
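    A conceptual illustration of that rewrite in MySQL 5.6 syntax (not Shard-Query’s literal output, and assuming a partitioned orders table): one query becomes several per-partition queries that can run at once, with the results summed afterwards.

    -- original query
    SELECT COUNT(*) FROM orders;

    -- rewritten into parallel pieces, one per partition
    SELECT COUNT(*) FROM orders PARTITION (p0);
    SELECT COUNT(*) FROM orders PARTITION (p1);
    SELECT COUNT(*) FROM orders PARTITION (p2);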

  21. Leif Walsh says:

    Mark and Justin, I’m thinking about this purely from the language level. Peter suggested “Using language more friendly to modern developers, such as JavaScript and allowing to return more than flat tables (for example JSON objects) would be quite appreciated.” and I’m just responding to that prompt (I’m sort of getting away from the point of the post which was about sharding). I’m not saying SQL needs to be thrown out, but there’s an argument to be made that SQL is a very powerful and complex language, and—as I’m sure you’ll agree, Justin—it’s hard to make all of its features work well in a sharded environment.

    There are two reasons I see here for proposing an alternate query language. One is to start fresh with a smaller core, made up of only those features you know you can support efficiently. This seems to be the strategy taken by many NoSQL databases as they tried to figure out what a sharded database should do. Another is that SQL is kind of an annoying language and maybe we can come up with something better. I don’t think the latter is good enough on its own, but it’s a nice side effect if you decide it might be a good idea to try something new.

    Let me be clear, I’m not proposing that a new language SHOULD be written, I’m just responding to Peter’s suggestion that it might be worth thinking about, and I wanted to propose an idea I saw and appreciated, and get it swirling around in the heads of those with more likelihood of implementing it.

    Justin, you’re right, SQL is a dataflow language, and the thing I’m proposing (not really proposing, more like highlighting) is functionally isomorphic. Like anything, once you get used to it it’s easy and everything else seems hard, but I think SQL is particularly poor in a way that other dataflow languages, like the shell with pipes, are not: it’s very awkward to compose. You can’t really copy and paste subqueries, or conceptualize a segment of the dataflow pipeline as a new atom (maybe you can, I’m woefully inexperienced with SQL, but I can’t imagine it’s pretty). These are superficial details, but I believe they are lent importance by their frequency in one’s development process. I think the pipeline makes development incredibly better, if one starts from zero knowledge of either the pipeline or SQL.

    Mark, I haven’t thought out the details of optimization yet, and it’s really not the point I’m trying to make right now, but I think it should be fairly simple to optimize. As Justin pointed out, this sort of thing can be used to write the same sorts of programs that SQL can do. Since we know SQL can be optimized, the pipeline should be optimizable too. My first attempt at optimizing it would probably be to find sub-pipelines (I called them “phrases” above) and write special optimized implementations for those phrases. For example, the server could search for “sort | head -n1” anywhere in the pipeline and turn it into a point query. But I don’t think this would be fundamentally easier or harder to optimize than anything else; to be honest it would probably all just come down to implementation details anyway.

    Peter, I agree, I think SQL’s success stems from its power in terms of features, but that’s also the reason why it’s now under attack. Those features are great and people want them, but some of them become a lot harder to pull off in a distributed system. Some people thought it was easier to throw out everything and start over than to try to educate people “well, everything’s the same except that if you have this one special case or you use this feature, everything grinds to a halt suddenly.” Maybe you’re right and Javascript or something is the right direction to go. JSON is pretty easy to work with. Personally I’m not too concerned with denormalization and data duplication, Fractal Trees can usually soak up a lot of duplicate data, and I think eventually that will almost always just be the right thing to do.

  22. Justin Swanhart says:

    SQL was created to express the formal rules of relational algebra in a declarative programming way. The power of SQL comes from the fact that it is built from math. The best part of SQL is that you don’t have to know the gritty details of how COUNT(DISTINCT) is working under the covers. Just my opinion.
