This was an interesting week for data discussions in the Boston area, with two back-to-back events: Big Data on Wednesday night, hosted by TiE Boston, and Channeling the Big Data Tsunami, hosted by MassTLC on Thursday morning. With such desperate-sounding headlines, I was beginning to fear that big data was going to storm the Boston ‘burbs and give us a British-style trouncing.
What a relief instead to hear luminaries from startups, established companies, research houses and VCs share their wisdom at the TiE and MassTLC events on what we need to worry about and how to think of the problem.
David Reinsel of IDC stepped back and gave a great macro-level overview of the problem and its sources. He talked about the incredible data volume – 0.8 ZB (a zettabyte, or “ZB”, is a billion terabytes) in 2009, growing to 35 ZB in 2020. And that’s just what is created. While David estimated that popular social networking sites generate on the order of 14 PB (a petabyte is a thousand terabytes) per year, consumption (how much people view) is on the order of 10,000 PB a year. Big numbers – so that’s our problem, right? Well, according to David, that’s not big data. “Big data is not the created content nor is it even its consumption – it’s the analysis of all the data surrounding or swirling around it”.
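To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal (SI) units and a smooth compound growth curve between 2009 and 2020; neither assumption comes from the talk, and the variable names are mine.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
ZB = 10**21

# Data created worldwide, per the IDC figures quoted above.
created_2009_zb = 0.8
created_2020_zb = 35.0
years = 2020 - 2009

# Overall growth, and the annual rate it implies if growth is
# assumed to compound evenly (an assumption, not an IDC claim).
growth_factor = created_2020_zb / created_2009_zb
implied_annual_growth = growth_factor ** (1 / years) - 1

# Social networking: content created vs. content consumed, in PB/year.
social_created_pb = 14
social_consumed_pb = 10_000
consumption_ratio = social_consumed_pb / social_created_pb

print(f"1 ZB = {ZB // TB:,} TB; 1 PB = {PB // TB:,} TB")
print(f"growth 2009-2020: {growth_factor:.2f}x")
print(f"implied annual growth: {implied_annual_growth:.1%}")
print(f"consumed vs. created: about {consumption_ratio:.0f}x")
```

Under those assumptions, the volume created grows a bit more than fortyfold over the period, and each byte created on the social sites is viewed several hundred times over.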
Examples of this analysis came from multiple areas outside the traditional data center that companies use for BI, including smart systems, web 2.0 and science. David mentioned America’s smartest bridge, built over the Mississippi River as a replacement after the previous bridge collapsed. This one bridge alone has 323 sensors and generates over 500 MB/day of data. And that’s one bridge. Professor Mike Stonebraker of MIT raised the case of typical web 2.0 companies like Yahoo, which has 1 TB/day of clickstream data to analyze. Adam Towvim of Jumptap noted that his firm’s network had 8 billion mobile impressions per month to manage. On the science front, Mike spoke of Johns Hopkins, which alone has 20 different departments, each with ½ PB of data to manage and analyze.
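For a sense of scale, the quoted rates can be normalized with a little arithmetic. This is only a sketch: the 365-day year, the 30-day month and the decimal units are my assumptions, and the bridge figure is a lower bound since the speaker said "over" 500 MB/day.

```python
# Figures quoted at the events; the calendar constants are assumptions.
DAYS_PER_YEAR = 365
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Smart bridge: at least 500 MB/day spread across 323 sensors.
bridge_mb_per_day = 500
bridge_sensors = 323
bridge_gb_per_year = bridge_mb_per_day * DAYS_PER_YEAR / 1000
bridge_mb_per_sensor_day = bridge_mb_per_day / bridge_sensors

# Clickstream: 1 TB/day.
clickstream_tb_per_day = 1
clickstream_tb_per_year = clickstream_tb_per_day * DAYS_PER_YEAR

# Mobile ad network: 8 billion impressions per month, averaged per second.
impressions_per_month = 8_000_000_000
impressions_per_sec = impressions_per_month / SECONDS_PER_MONTH

# University: 20 departments with half a petabyte each.
departments = 20
pb_per_department = 0.5
university_total_pb = departments * pb_per_department

print(f"bridge: {bridge_gb_per_year:.1f} GB/year, "
      f"{bridge_mb_per_sensor_day:.2f} MB per sensor per day")
print(f"clickstream: {clickstream_tb_per_year} TB/year")
print(f"impressions: about {impressions_per_sec:,.0f} per second on average")
print(f"university: {university_total_pb:.0f} PB in total")
```

The averages hide peaks, of course, but they show how differently "big" looks across the examples: a few hundred gigabytes a year for the bridge against hundreds of terabytes a year of clickstream data.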
OK, great. So we are drowning in data, never mind having to try to manage it, analyze it, and get value from it. What can database solution vendors do?
Any solution needs flexibility, performance, scalability and a working ecosystem to be deployed in most organizations. The standard off-the-shelf MySQL has shortcomings. Dan Weinreb of ITA Software pointed out how tough it can get in an organization if all of a sudden you want to track a new field with MySQL – perhaps a second salary column for someone depending on their project. There was also an acknowledgement at the TiE event that hardware remains a bottleneck: at the end of the day, performance is only as good as how quickly one can get data off of the drives. A number of folks also pointed out that newer solutions such as NoSQL lack data portability as well as a mature ecosystem. Finally, there is inertia – with training, investment and expertise in MySQL, it’s only under very limited circumstances that organizations want to consider hitting the “reset button” on existing architecture.
So, is there a middle ground? Can we get more performance out of MySQL? Or do we need to throw the baby out with the bathwater? Professor Dan Abadi of Yale noted that people are looking for solutions that get the best of both worlds – the “classic” high performance of SQL on structured data combined with the scalability of newer approaches such as Hadoop. Luke Lonergan of Greenplum/EMC spoke of pairing the familiarity of SQL with the flexibility to add hardware and “scale out”. It’s solutions like these, which extend the value of the SQL language or the popular MySQL RDBMS, that work for a lot of organizations. Of course, this is where TokuDB plays well – improving the performance and scalability of MySQL, so customers can get more performance without having to “rip and replace” MySQL with some alternative or move to a radically different hardware stack.
At the end of the day, I felt better. Yes, the tide of data keeps rising. However, we’ve got a clever and determined band of folks in the Boston area addressing big data. Some companies will get injured for sure (the start of the Revolutionary War left many wounded as well), but the creativity and mental prowess in the room gave me confidence that we can surely stem the tide.