When Should You Store SQL Serialized Data in the Database?

A while back Friendfeed posted a blog post explaining how they changed from storing data in MySQL columns to serializing data and just storing it inside TEXT/BLOB columns. It seems that since then, the technique has gotten more popular with Ruby gems now around to do this for you automatically.

So when is it a good idea to store SQL serialized Data with this technique?

If the application really is schema-less and has a lot of optional parameters that do not appear in every record, serializing the data in one column can be a better idea than having many extra columns that are NULL. The restriction on this would be that searching on these columns now becomes more difficult[1]. A good example of this optional nature of data is user preferences – you only really need to store the settings that differ from the default values.

The second situation where I can see this technique making sense is that when you update the text/blob, a large percentage of the data is actually modified. If you have to write back the full blob for 1 change, it is expensive.

Another potential pro for this technique is that ALTER TABLE commands are no longer required. Until 5.1-plugin simple operations like adding a secondary index on an InnoDB table require the whole table to be rebuilt (now just the index is built). I don’t really buy this pro, since using master-slave switching I’ve always been able to emulate online index adds. Projects like MMM and Flipper make this even easier.

In Friendfeed’s case, they also used the serialization to be able to compress the objects.  From 5.1-plugin this is now available natively.

[1] Friendfeed solved this indexing problem by creating separate ‘index’ tables.

What are the downsides?

I would say that the first serious downside is write amplification. If you are constantly making small updates to one piece of data in a very large blob, the effort MySQL has to go to is greatly increased.

The second downside I would mention is that this pattern tends to force you to read/write larger amounts of data at once. We’ve all been trained to know that SELECT * is bad. This creates a pattern where SELECT * is not optional. In practical terms, I would expect this increases the size of a working set since more data must be kept in the buffer pool.

The next obvious restriction is that there is a clear loss in functionality. You can no longer easily perform aggregation functions on the data (MIN, MAX, AVG). You are storing the data in a way that is very specific to one application, and you can’t just point a BI tool at it to process.

It can become difficult to apply even the simplest constraints on the data such as character length, if an age must be a number, and if the age must be unsigned.  MySQL doesn’t go as far as having check constraints, but what it has is better than nothing.

Some smaller issues I have is that if you are using a standard serialization technique like JSON it’s difficult to store pieces of data like numbers or IP addresses in their most efficient form and that technically this design breaks 1NF.

What would I recommend?

Josh Berkus (of PgExperts) calls this serialization E-blob. Much like EAV, he criticizes this as one of 10 ways to wreck your database (slides). I tend to side with Josh.  I’m optimistic that this design has its place in a smaller component of your application, but I’m weary every time I hear someone decide to switch to it exclusively.

Before making such decisions for yourself I would recommend reading Peter’s KISS KISS KISS post.

Share this post

Comments (52)

  • James Golcik

    A few comments…

    > The next obvious restriction is that there is a clear loss in functionality.

    You can easily run aggregations on the index tables if you like. Or, you can aggregate in process in the background and cache the values. I haven’t heard of many people running aggregation queries at web scale.

    > I don’t really buy this pro, since using master-slave switching I’ve always been able to emulate online index adds.

    If you actually run operations for any decent sized installation (or even read the FriendFeed article), you know that this process is extremely tedious and error prone. It just isn’t worth it if you have a lot of tables that you’re often modifying.

    Sure, this is purportedly fixed in 5.1, but at least in our case, 5.1 is about 30% slower than 5.0. So, that’s not a great solution.

    Those are a few things I picked out of the article. One convenient omission from your article is the fact that companies like Facebook and Twitter do *exactly* this. Especially in Facebook’s case, this strategy has been key to their scalability.

    Treating MySQL like a key value store makes it dead simple to cache your objects. My Friendly library that you linked to will read and write through to cache without any extra effort. We see a 99.9% cache hit rate in production.

    In practice, that means that even if the queries through to MySQL are somewhat slower, the machine is serving so many fewer queries that it’s hardly an issue.

    There’s this DBA attitude that you should always write your applications to conform to the performance characteristics of the database. It stinks of a lack of understanding of how code actually gets written. It’s like DBAs aren’t aware that anything other than the database even exists.

    Sometimes (actually, almost always) there are good reasons to trade a little performance for a lot of convenience.

    January 21, 2010 at 3:02 pm
  • James Golcik

    Oh, I should probably say that the database that we’re doing this with has about 20GB of data growth per week, serves around 70req/s at peak, and has shown absolutely stable performance characteristics since we deployed it.

    January 21, 2010 at 3:05 pm
  • James Golcik

    My first comment seems to have gone in to moderation…

    January 21, 2010 at 3:06 pm
  • Morgan Tocker

    James, I don’t dispute that you are getting those sorts of numbers.

    A better question would be could you have achieved the same without doing it. The problem with going to too much complexity (as in the link to Peter’s KISS KISS KISS post) is the operational overhead that can come with it.

    January 21, 2010 at 3:15 pm
  • James Golcik

    I submitted a much longer comment about it (before the other two), but for some reason, it’s not here.

    You claim that serializing objects is complex, but Friendly’s code base is an order of magnitude smaller than most ORMs I’ve seen. Without its tests, it’s 1200 lines. That’s not a lot of code.

    According to you, playing the operations hokey pokey to build indexes and add fields offline is simple? That’s a *manual* process and it’s error prone as hell. I should know, I’ve screwed it up.

    January 21, 2010 at 3:27 pm
  • bob

    If you don’t need a database, why are you using a database?

    January 21, 2010 at 3:49 pm
  • James Golcik

    The same reason a lot of companies (facebook, twitter, FriendFeed, etc) use this pattern. At this point, MySQL is stable and predictable. We know how to run it, scale it, perform and recover backups, etc.

    We did have mongodb in production for a while, but ran in to problems with it as our dataset grew. It was fast as hell at the beginning, but got brutally slow near the end. We have no such problems with MySQL.

    January 21, 2010 at 3:54 pm
  • Jonny Armstrong

    @bob that’s a really silly question. They are storing data that still requires some indexing. I think most people still use databases for storing indexed data.

    January 21, 2010 at 4:09 pm
  • Mark Callaghan

    Somewhat related to this, MySQL needs a storage engine that is crash safe and good at storing blobs. InnoDB is not it. Maybe PBXT is.

    January 21, 2010 at 4:25 pm
  • Morgan Tocker

    Mark – I’d be curious what features you think are missing for InnoDB to store blobs better?

    January 21, 2010 at 5:55 pm
  • Mark Callaghan

    Morgan – the overflow pages are not shared. When a column must be spilled to overflow pages it has exclusive use of those pages. Data from other columns in the same row and from other columns in other rows cannot use that space. This will waste a lot of space unless your average BLOB column size is many multiples of the page size. For large BLOB columns, too much IO will be wasted because the BLOB is written to the transaction log and the database file. It is better to avoid the transaction log in that case.

    January 21, 2010 at 6:01 pm
  • Darius Jahandarie

    No matter how you look at it, storing non-relational data in a relational database is a hack and nothing more. The solution is using a system designed for non-relational data. If it fails, then fix it.

    January 21, 2010 at 6:26 pm
  • peter


    Right a lot of people do exactly this… because they know no other solution than MySQL, or trust it enough.

    I think where are plenty of cases when storing serialized data makes sense but I would view it as EXCEPTION when “normal” database usage is not feasible for some reason.

    The important requirement for storing serialized data is it should be dealt with by application and you have to ensure if there are multiple applications their serialization is really compatible. Manually messing with serialized data is very error prone.

    When you typically need to access all (or most) of serialized data at the same time. If you serialize 1MB blob but you mostly need 50 bytes out of it it will be a big loss. Same applies for writes – changing 1 byte in blob is costly as it has to be completely replaced.

    Interesting enough reverse is true as well. If you need to store 10000 IDs retrieving single blob which holds them will be a lot faster than reading 10000 rows even if they only contain 1 column. It is even more true for writes – if you need to modify say 5000 of these 10000 rows it will be orders of magnitude faster to change the blob.

    This is why for example if Sessions are stored in MySQL (which is typically bad idea to start with) they are stored in stored as a blobs – because you need majority of session data at once and large portion of it may change between requests.

    January 21, 2010 at 6:51 pm
  • Morgan Tocker

    @Jonny – The key value store NOSQL database don’t have indexes, but MongoDB does:

    I don’t know enough about MongoDB to suspect why it was slow for James. I don’t believe one of the design goals of MongoDB was that all data needed to fit in memory. The wiki suggests it has a use case for analytics: http://www.mongodb.org/display/DOCS/Use+Cases

    January 21, 2010 at 8:14 pm
  • Baron Schwartz

    I have notes on a handful of clients we’ve helped with situations that fall on both sides of the “continental divide” between the EBLOB solution and “normal” relational usage. One of them, for example, was seeing terrible performance. When we dug into it, the issue was exactly what Peter mentions: big string, changing 1 or 2 characters, writing and reading the whole thing. I indexed it in my notes file under “absurd and ridiculous” because of the weird way they were working with the data. The solution? Not what I thought it would be: compress it before stuffing it into the database. Different strokes for different folks.

    70 requests per second is not a lot in the most general terms, and 20GB/week doesn’t mean much to me unless I know things like the total data size, schema, clustering characteristics, and nature of the queries. Without knowing much about James’s system it’s a total guess, but I’d be surprised if there’s really only one way to accomplish the goal under that kind of workload. I don’t dispute that it works for him, though.

    I wonder if the MongoDB folks are tuned into the cases in which performance can drop off a cliff as James wrote. I don’t know much about MongoDB myself so I wouldn’t know why that is, but the classic problem is something like a b-tree not fitting into memory anymore.

    January 21, 2010 at 8:18 pm
  • James Golcik

    70req/s isn’t a lot. I just wanted to give a (very general) sense of what our environment and workload looks like. My point is that using MySQL like a key/value store makes caching incredibly easy. So, there’s actually very little read load imposed on the machine, and performance stays extremely consistent, even as the dataset grows far beyond available memory.

    As you say, Baron, there’re almost certainly other ways to accomplish this. But, for our purposes, the tradeoffs (flexible schema even with huge tables and dead simple caching in exchange for somewhat reduced write performance) made perfect sense.

    It’s just a bit frustrating to read a “don’t ever do this” post about this topic. I think there’s a real tendency for DBAs to become so focused on the data store’s esoteric performance characteristics that they miss the bigger picture. There are a lot of other, sometimes more important concerns when building real applications.

    In our case, we gained back whatever write performance we lost 10-fold with easy caching.

    January 21, 2010 at 9:43 pm
  • Morgan Tocker

    James, to be perfectly clear I didn’t say “don’t ever do this”. Our blog is about MySQL performance, and my aim was to inform readers of the measurable net wins/losses 🙂

    I even said that I was optimistic that e-BLOB could be used efficiently in some parts of your application (if your application really is schema-less, if you modify a large percentage on each update).

    A lot of applications are read heavy. I’m glad you found an easy way to leverage this.

    January 21, 2010 at 10:06 pm
  • Roland Bouman


    I’ve been reading the article and the discussion with great interest, and I am curious what serialization formats people are using. In particular @James: what are you using for format? how large is a typical serialization string? what do you use in the application to serialize and deserialize? Do you use compression too?

    I realize this blog and post is mostly about databases and database performance, so I hope that you don’t feel I’m hijacking the thread. The reason I’m asking is mostly that I have seen some instances of serialization that were really expensive for the application, and I’m wondering what other people have found.



    January 22, 2010 at 1:43 am
  • alan

    @Morgan: you are right you didn’t say not to use it.

    We do use this technique sparingly, and it really does offer a huge gain in places where it literally is just meta-data. We classify it as data that will never be indexed or searched on.

    It is also extremely useful technique for rapid development at the start of a project, where you don’t really know what precise fields you need, so this is like a rapid development technique that if you use an ORM or any custom object wrapper to your database then it is easy to refactor out and make into a first-class column citizen.

    There is a trade-off without a doubt, like many of the suggestions this blog makes, you have to carefully weigh up what works for your application, and not assume just because someone said it was GREAT or POOR that you have to follow.

    January 22, 2010 at 1:47 am
  • Justin Campbell

    I’m currently using this method with a few projects. I’m using PHP, and have a model called KV that I’ve built to access the data, stored in JSON. The model also reads and writes from memcached. It has 3 simple functions: get(key), set(key, value), and delete(key). In the set() function, it also checks if the value is an object, and if so, JSON-encodes it. And there’s one more function called get_object(key) that simply JSON-decodes the value before returning it as an object.

    As far as “blob size”, I think some common sense should be used. I haven’t done tests to figure out how much data you can store before taking a performance hit, but I imagine it’s related to MySQL’s block size, which I’m sure I’m under. Most JSON strings for these smaller applications aren’t more than a few hundred characters. If they were larger, it might make sense to break them up with prefixes. But like James said above, going to the database less for reads makes the trade-off worth it in most of my applications.

    Anyway, this allows me, as a PHP programmer, to write code insanely fast, and will allow my projects to scale and be ported easily in the future.

    January 22, 2010 at 7:23 am
  • Robin

    I agree with your post – for most use cases. However, I’ve recently been working on a project where we had to deal with lots of “schema less” data, or better said, a mixture of multiple highly dynamic schemas. We’ve looked at the various key-value stores out there but didn’t really feel happy and comfortable with either of them (reasons for that being a bit too much to explain here). Instead we decided to store the data as XML blobs and we do that in MySQL not because it makes sense but because our Site Operations department felt more comfortable with it.. 😐 To deal with indexes (and aggregate functions etc) we index (but not store) the documents in a Solr index. This provides great flexibility for searching and for full-text searches it does a whole lot better than MySQL’s Full-Text Search. We now got faceted search, full-text search, aggregate functions and a search engine that is easy to extend with plugins and let mysql store and replicate the data. For our specific use case this works excellent. We now store more than 100 million documents without any hassle and our customers have already changed their data schema several times without us having to worry about it.
    I’m working on an article about our approach, but it may take some time still before it will appear online.
    Great article though – everyone should be fully aware of all the ups and downs before choosing an alternative -possibly very dangerous- route to solve their problems.

    January 22, 2010 at 7:24 am
  • Angelo Mandato

    I think there is too much thought into whether the technique is a good or bad one and not focusing on when to use the serialized data in tables and when not to use them. If you are of my opinion, this is another technique, that when used wisely, can be very powerful and if used wrongly, disastrous.

    WordPress has been using this technique for many years now. Though I’m not 100% in agreement with the way WordPress uses the technique, it does have advantages, depending on what your application needs the data for.

    For example, if you have a web site where users can set specific account settings, and you know these setting’s values are never going to need to be searched using SQL queries, then you may want to consider the flexibility of storing these settings into a TEXT or BLOG serialized. Combine this with associative arrays in PHP and you got yourself a really powerful combination. One big advantage is if you add or remove a setting, you don’t have to add/remove a column from your database because everything is stored in one field.

    Here is my recommendation. If you do this, store your TEXT/BLOB into a separate database table with one other column indexed to map to another table. For example, have a table of accounts, make another table called account_settings that has the account_id and the account_settings, where account_id is INDEXED and account_settings is a TEXT.

    One other important note, inserts/updates with TEXT/BLOGs are slower, so only do this if the design of your site will have a lot more selects of this data than inserts/updates. Usually for account settings, the user only does this every so often, most of the time the settings are simply being retrieved.

    My take on this? Add it to the tool belt, use the technique if it makes sense, avoid it otherwise.

    January 22, 2010 at 7:56 am
  • Robert Hodges

    Hi Morgan,

    Great topic. In addition to the pros and cons already listed for eBlobs, here are three more to consider:

    Pro: Modeling issues. Blobs are sometimes necessary to store data that can’t be effectively modeled in SQL. Replication and messaging systems often contain data that can’t expand easily into tables with fixed types.

    Con: Debugging. If something goes wrong, it’s hard or even impossible to look directly at the data at the SQL level. This is a big issue for debugging production systems, which is where your nastiest problems end to arise.

    Con: Upgrades. Unless you have a very disciplined approach to serialization, application upgrades can make old data unreadable. You will at that point yearn sorely for SQL schema evolution, which may be inconvenient but tends to get things right. Only choose serialization that addresses this problem.

    As previous posters have indicated there’s no right choice for everyone. You have to weigh the pros and cons very carefully. This can be hard because some of the cons happen far in the future (i.e., after next week), hence are hard to judge. However, they can be business-threatening when they do occur.

    January 22, 2010 at 10:09 am
  • Jeremy Cole

    I somewhat frequently recommend this option (for game objects with potentially hundreds of characteristics and/or parameters), but with two caveats: The BLOBs must be serialized in a portable way (protobuf is a good thing here) and they must be versioned. Then you must write code to upgrade v1 objects to v2, etc., and it’s very easy to migrate an object from v1 to v2 to v3 to v4 (latest) at read time. Then you go for model of read any version, write newest.

    January 22, 2010 at 10:35 am
  • Morgan Tocker

    @Roland – I don’t think you’re hijacking. I’m going to extract from what Mark Callaghan posted and say the edge case you don’t want to fit in is where blobs are located off-page, but can not fill up a significant amount of that page. So say the range of ~370 bytes – ~14K? This is going to be version specific, with Barracuda offering fully off-page storage.

    @Alan – The ORM should be able to create columns as easy as it does serialized blobs. YMMV, but I’m a little worried that with ‘Broken Window Theory’ that prototype never gets converted to columns. The problem with when you choose to convert, is that you can’t easily get statistics such as how often the parameter is specified or what the average length is. All meaningful use of existing tools (like PROCEDURE ANALYSE()) is lost.

    @Robin – Great to see you’re in one of the “when is it a good idea” cases.

    @Angelo – That indexing technique is described pretty well in the Friendfeed link in my article. WordPress is a completely different kettle of fish. They don’t want performance, they want it to run anywhere. There’s lots of optimizations possible with wordpress that will probably never be included. I don’t think we should necessarily look to them when designing our own applications.

    @Robert – thanks for your comments 😉

    Josh Berkus cited upgrades as one of the main reasons as well. Imagine that you had some application error which caused an unindexed column to store the word broccoli as ‘brocoli’ [sic]. For me that’s a very easy UPDATE statement that I can make run transactional at a cost of a tablescan and whatever I update. For the E-Blob users that’s a script that pulls everything out, examines, and sends some data back.

    On Debugging – Particularly w/compressed serialized objects, I loose all my tools I’m used to (mysql cli, etc). I then have to create new tools (which is why I believe that this technique has a higher operational cost all other things considered equal).

    On Modeling – I’d say I probably hinted at that in saying “if the application really is schema-less…”, but thank you for the two additional examples. Replication and messaging systems are both interesting in that they are so specific.

    I think one of the common mistakes people make is they don’t interpret the mixed-requirements of their application. They use the justification that ‘it needs to be this way for the message system’ to convert everything over to E-Blob. That’s a mistake. My message here is that E-Blob is okay, but its the EXCEPTION to the rule and I question every time I see a complete system designed that way.

    January 22, 2010 at 11:00 am
  • Steven Roussey

    I typically use a mixed form of this. Anything that we may want to query goes into columns, and the rest goes into a blob. However that is tempered by the use case of the data, typical query loads, etc. Sometimes production systems have characteristics that require non-intuitive solutions. well, until your intuition changes from your experience…

    As for compression — it typically is better done on the application end. If you have 20 web servers for each DB server, you can spread the compression load to 20 servers rather than having one DB server do it. I had proposed years ago to Monty to have compressed columns and change the client-server API to recognize them such that the client API would do the decompression. This would also mean less network traffic. And the server could do it if the client version was old.

    Unfortunately, MySQL’s compress/uncompress puts a length header in front of the data so you can’t use them out of the box and have them interop with an app that is doing compression, unless the app ads the length in front. You can create a stored procedure to tease the data out, but it is a pain when using a cli. If getting data from the mysql cli is important, the app should add this header info so you can use uncompress in your queries easily.

    The Drizzle folk are welcome to my suggestion to have compressed columns and integrate it at the client/server api level, especially with prepared statements.

    January 22, 2010 at 11:26 am
    • Baron Schwartz

      Steven, I agree with you about where to do compression. Your idea about the client automatically handling it is great. Why not mention it on the Drizzle mailing list?

      January 22, 2010 at 2:52 pm
  • Bill Karwin

    E-blob is appropriate when you need variable attributes per row (which already breaks the relational paradigm), and you aren’t allowed to use any NoSQL technology, and the only other option seems to be an EAV design.

    E-blob is still a square peg in the round hole of RDBMS, but at least it’s less evil than EAV.

    Note that you don’t need to put *all* of a table’s attributes in the E-blob. You can have a set of conventional columns for attributes that are the same on every row of the table, or if you need to use indexes or constraints or data types for these attributes. Only the attributes that must be variable belong in an E-blob.

    January 22, 2010 at 12:26 pm
  • Robert Hodges

    Good point. Tungsten Replicator uses E-blob for data to be replicated with index and metadata stored in SQL types, hence exactly what you recommend. I have used that pattern many times. Another interesting issue turns out to be [de-]serialization performance for the blobs. We use Java which is pretty fast but it’s still a significant tax on processing. This is undoubtedly an issue for many high-performance sites, and is a good reason to go for a No-SQL solution or start figuring out to cache more effectively within process memory. We are looking at both of these for Tungsten.

    January 22, 2010 at 1:13 pm
  • Morgan Tocker

    Robert – MongoDB uses a binary representation of JSON. They call it BSON:

    If you’re talking about fast formats for MySQL, I liked Jeremy’s suggestion (Google protobuf).

    January 22, 2010 at 1:26 pm
  • Sean

    Yup, this is a great technique for storing user preferences, e.g. the layout of a page or enabling/disabling certain feature. We just store a serialized associate array in the database for several different things on our site, and when the user logs in, we grab it, unserialize it, and store it in a session. Works great. It woulud be a huge pain in the ass if everytime we wanted to add a new option to the set, we had to change the table structure.

    As long as your datda doens’t need to be indexed or searchable, this can save a ton of time, short term and long term.

    January 24, 2010 at 4:46 am
  • Josh Berkus

    I just thought I’d mention where PostgreSQL is for this: It’s less costly to have null columns in PostgreSQL than in InnoDB, so having lots of null columns is not a reason to use E-Blob. Postgres is also very good at storing blobs, and compresses them for offline storage automatically. Also, with GIST, GIN, expression indexes and HStore (and the upcoming HStore-to-JSON) there are indexing options for blobs. So thats two reasons TO use them.

    Reasons to use E-Blob are:

    a) the data stored varies wildly per entity, over time, or between customer installations;

    b) the data in the blob is almost always retrieved, read, and updated all-at-once due to the application design;

    c) the data is not going to be used for aggregation, ad-hoc querying, or response-time-sensitive filtering (since cheap indexes are impossible);

    d) the data does not need to be constrained or used to enforce a constraint

    Reasons not to use it are:

    1) the above-mentioned update cost: update an entire 1K blob to change one value;

    2) complete inability to enforce meaningful constraints on the data, thus allowing garbage to creep into the database;

    3) high cost of blanket updates to the data which might be required by application design changes.

    Generally, I only consider e-blob for non-essential data which is going to vary by installation, or for specially structured data which is infrequently updated and thus works well with special index types which work with blobs.

    January 24, 2010 at 7:44 pm
  • Mark Callaghan

    Josh – what is the cost of a null column in Postgres? Do possibly null columns require a bit in the row header?

    January 24, 2010 at 7:52 pm
  • Brian Cavanagh

    Yeah PostgreSQLmight be a good choice if you can get around the lousy replication support, as it will let you store arrays and another data types natively in the fields, so you don’t have to worry about an inaccessible object model.

    January 26, 2010 at 8:53 am
  • Rob Wultsch

    “All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data. The header is detailed in Table 53-4. The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform. The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, t_natts bits altogether). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. The object ID is only present if the HEAP_HASOID bit is set in t_infomask. If present, it appears just before the t_hoff boundary. Any padding needed to make t_hoff a MAXALIGN multiple will appear between the null bitmap and the object ID. (This in turn ensures that the object ID is suitably aligned.) ”

    January 26, 2010 at 5:01 pm
  • Baron Schwartz

    This is completely OT and not meant to be trolling, but one of the things I’ve always thought Postgres could improve is make their data files less architecture-dependent.

    January 26, 2010 at 6:50 pm
  • Leo Petr

    Hi, I’m dropping by via the High Scalability blog.

    This is a neat technique.

    Something similar but much simpler can be done in IBM DB2. DB2 has a built-in, indexable XML column type. There’s native support for XPath and XQuery, so you can store a clob with arbitrary fields serialized as XML and then do SQL queries with XPath to extract arbitrary fields, run aggregation functions on them, etc. Effectively, this lets you do exactly the same thing except without the opaqueness and with potentially higher performance depending on what you want.

    This is included in the free edition (ibm.com/db2/express/)

    Disclaimer: I work on the DB2 team, the opinions are my own, etc.

    February 3, 2010 at 5:16 pm
  • Diego

    Great post and discussion. IMO, unless someone is really unconfortable with trying new dbs (which are somewhat not as proven as mysql), they shouldn’t mysql that way. It’s harder to administer than any nosql db, and it’s just not the best tool for storing key/value data.

    February 6, 2010 at 1:20 pm
  • Vinay Binny

    Another downside to this approach is refactoring. If you refactor a class whose instances are persisted in the database in serialized form, you have to take care of that. Usually by writing a separate program/s that needs to be run out of the application’s deployment environment. Imagine the plight of this refactoring when instances like Hibernate proxies are serialized inadvertently.

    The only way to avoid this refactoring nightmare is to store instances(serialized) of native language classes like Java’s String or classes that are part of stable frameworks, value objects are ok, store entity objects at our own risk as they may give you state synchronization problems. Once tool I see that does this is Quartz(or its spring supported bean), which is a scheduling framework written in Java.

    Of course, you can probably avoid this if your language gives you a ‘seamless’ way of refactoring/manipulating classes in the serialized data by just replacing the class names which are usually stored as plain strings rather than binary. Still that may not cover all the refactoring cases.

    January 17, 2011 at 2:02 am
  • Edward Wittlinger

    I have a startup that is in the very beginning stages of engineering a website that will process data between user accounts. The database will store user data which in most cases is static data that includes a mug-shot of each user (I’m not that worried about that)!

    Where the database get’s complicated is the user accounts each have a full figured avitar that needs to be rendered on the fly with an ongoing second by second manipulation of the data for each account!

    And like any good business man, I’m planning for growth. What if the website ends up with 50 million user accounts (I know, I’m dreaming, BUT)… What if?

    Trust me, if Facebook had to start over and knew what they know now, do you think they would have used MySQL again?

    Based on what I have been reading a lot of you don’t like PostgreSQL, but has anyone looked over the latest version 9.0 or greater with regards to the replication?

    streaming replication:
    This streaming replication allows one or more databases to be replicated from a master server. It’s an asynchronous transfer but very quick.

    In addition to the new streaming replication, Postgres 9.0 also features a hot standby ability. This allows a secondary database to be designated as a hot standby should the primary database fail.

    Another big new feature is the addition of an in-place upgrade.

    My question is would PostgreSQL be the way to go with it’s ability to handle geometric shapes?

    Edward Wittlinger

    February 9, 2011 at 7:40 am
  • Anish Sahare

    how to save json data into oracle using procdure
    and which are the packages and type require to run the procedure

    plz send the example to email address

    Anish Sahare

    December 7, 2011 at 5:08 am
  • Matt Young

    Make a version of Json/Bson and call if Sqlson. I did that here: https://github.com/Matt-Young/Semantic-Machine/wiki

    I can query the internal of a Bson stream by name or structure. I can join multiple Bson streas. What makes this all possible is a direct map from vertical store format to Bson/Json. It is built on sqlite3 and designed ultimately to support joibs between Bson streams and Bson web bots. Sqlson is a small footprint row format that hold one Bson item and any sub items it may have, recursive of course, so one gets a complete nested store. The code rapidly translates between Bson and Sqlson, but retains all the power of sql searches in Sqlson mode. As a bonus, the code can serials square standard sql and make them look like Bson. Open source, CutnPast license, guaranteed patent free, full network io, support a Lazy Json consolel, including remote terminals.


    February 13, 2012 at 1:36 am
  • Jerome

    Hi there,
    thanks for the info.
    I am serializing part of my users extra info whichis store in a different table ie. address, contacts, name, surname etc.. as they are optional and thus avoid me to add additional columns BUT give me headaches when it comes to lookups. ie lookup users in a specific country which now i am doing a LIKE country name .

    how it works i have 3 cols (excluding my auto increment) . col1 will be a pointer(FK) to the user, col 2 is the identifier of the next col. for example col2 will be named ‘address’ and col 3 will have the value in it -> object containing the full address line.

    does somebody can suggest a better alternative if any and hope my explanation makes sense.

    Thanks in advance


    June 1, 2012 at 3:22 am
  • Scott F

    Using MongoDB, is always a good option “If the application really is schema-less and has a lot of optional parameters that do not appear in every record.”

    If you’re open to different databases, and it’s a possibility you should look into it first, before storing large serializations into MySQL.

    August 17, 2012 at 1:00 am
  • SleepyBoBos

    I am import large amount of data from text to sql using .net SqlBulkCopy wrapper. Some entities have an obvious parent child (one to many) relationship which if I put DBA hat on results in two tables ie parent and child.

    However doing a bulk import to these two tables isn’t possible ie after bulk importing to parent I don’t know what my foreign keys are to import into child table and hence be able to match up any children with parents.

    The solution – Serialize data (save human readable XML to a Text type field), which would ordinarily go to child table, into it’s own column into parent table instead. This allows me to do my bulk import (ie need for speed). Obviously there is a trade off if doing this if you want to query db direct.

    December 9, 2012 at 7:37 pm
  • Ssekirime Geofrey

    A lot of attention has been directed toward whether to use serialization or not, a few recommedations have been put forwards but lets all agree that serialization is a necessary evil considering the context of MySQL which is intended for relational data storage.

    I wish to give this a new edge of view, I understand the two most common datatypes used for serialization are arrays and objects(classes) but what is not clear to me is which of the two is more efficient in terms of the final output size, and overhead in serializing/unserializing the data. On the current project i’m working on, we opted to you objects since documentation and code maintainance is quite more trivial with classes than arrays, and designcontrols can easily be enforced which is not the case for arrays which are losely handled. Whats your view?

    February 16, 2013 at 2:46 pm
  • Eric

    I didn’t read every last comment, but it appears no one has raised the issue of conflicting writes to a serialized object. If you application allows more than one actor to read, locally update, then write to the serialized object, then you can easily have scenarios where someone’s write gets lost.

    Say I have an array of prices for given time periods (price fluctuates over time, needs to be managed in realtime). Next Tuesday the price is set to $220. Next Wednesday the price is currently set to $200. Actor A pulls the array and updates the price for next Wednesday to $240. Actor B pulls the array in the middle of that sequence, getting the old values. Actor B updates the value for next Tuesday to $300, because that’s Actor B’s job. Now Actor A writes the array to the database with the new Wed value. Now Actor B writes the array to the database, but it has the old Wed value and the new Tues value. Actor A lost his change to Tuesday’s price because of Bs write to Wednesday’s price.

    I have run into this problem when dealing with some legacy code, it was nasty to debug, and it reinforced my policy to never serialized fields that could be written by multiple actors in the same time period. If that case is not explicitly restricted in your app, then serializing can create painful bugs. In general it’s best normalize your data out into new tables and index. It works, it performs, and integrity is bliss. With Redshift it even scales now.

    March 14, 2013 at 12:09 pm
  • alexia

    I am wondering in this case of JSON data stored in mysql, how would you store “user comments” on a story, assuming you have a lot of comments/seconds written for each story!

    Would you have some kind of:
    – 1 table: Id_comment+json(comment,other params)
    – 1 table : Id_story,json([Idcomment1,…..,Idcomment100])

    which means that you constantly push 1 element in the array, along with an insert in the first table? Is this efficient when you have a huge flow of comments?

    Or would you suggest another structure?

    July 26, 2014 at 7:06 pm
  • Vijay Raju

    if i want to insert more than 30 columns information to a table ,how can i insert ???.Normal insertion in table requires 30 fields ,So that the table become complex.Is their any solution to insert the form data into database .using either php or jsp.

    August 20, 2014 at 7:21 am
  • Jason Kistler

    I have a question regarding .NET object serialization. I have inherited a system that uses serialization to capture historical rate, product and inventory data at the time of a reservation. Thus the serialized object is a larger object with multiple child objects. It is stored in the transaction table which is the most active table in the database. The original author of the system and database did not seem concerned about performance but now I have to deal with serious performance issues. If I run a SELECT * against the transaction table it takes a certain amount of time (lengthy) to retrieve all the data. If I remove the one column from the query that includes the XML serialization data then the query literally goes from 4 minutes to around 15 seconds. In order to eliminate the use of the XML column it would require significant re-architecture of the system to convert it to a table structure and rewrite the code that retrieves the object. It would also require adding quite a few more tables in order to capture the bredth of data contained in this single column. So the question is how can I make this work with performance, or am I stuck re-writing into tabular structure? It is written as standard XML into a TEXT field in SQL. There is no need for querying the data in sql….this is simply to store a snapshot of pricing and availability, so the priority is to gain back the performance lost in this methodology of storing objects in the database. Would it help to normalize it out to a different table with just the index and text fields? Do I need to use compression? Should I use binary instead of straight XML? There is unfortunately not a lot of information pertaining to my situation specifically on the web so hopefully someone here can help. Thanks in advance.

    September 26, 2014 at 2:34 pm
  • Fabian K

    Nice article.
    However at some parts I didn’t agree 100%.
    Of course “SELECT *” patterns are not efficient, but at which part of it do you want to point here exactly? (Using * might not always be bad)
    Also interestingly you were mentioning the omission of aggregation functions but nothing regarding their efficency.
    In the end i guess it depends a whole lot on the environment and the structure that it’s beeing applied on.
    If the db is well modelled (indexes,ettcc) I wouldn’t see problems either using mysql for w/e sized databases.
    It is a bit hypothetic here however, in the end it depends on a set of advantages/disadvantages coming with it either more or less fitting the projects requirements.
    If for example an application has to receive and write data by millisecond intervals (networking devices),
    reads and writes to mysql might become problematic. Workarrounds doing this in a filesystem only way may cause lots of IO sockets in a minimal timeframe up to bsod depending on the repeatition of course.
    What could be much more compfortable here might be using memory storage engines as first layer, and a sql storage engine to take snapshots of the memory stored objects at the right time and release it at the same time.

    February 16, 2015 at 9:11 am
  • bluetreble

    https://github.com/torodb/torodb is a good alternative to storing raw blobs. It gives you the best of both worlds.

    June 2, 2016 at 12:13 pm

Comments are closed.

Use Percona's Technical Forum to ask any follow-up questions on this blog topic.