EmergencyEMERGENCY? Get 24/7 Help Now!

The case for getting rid of duplicate “sets”

Posted on:



Share Button

The most useful feature of the relational database is that it allows us to easily process data in sets, which can be much faster than processing it serially.

When the relational database was first implemented, write-ahead-logging and other technologies did not exist. This made it difficult to implement the database in a way that matched with natural set theory. You see in, in set theory there is no possibility of duplicates in sets. This seems hard to fathom with the way we are used to dealing with data in a computer, but intuitive when you “think about how you think”.

When you say, “I have ten fingers and ten toes”, and I say, “how many is that combined?” you say 20, of course. Did you stop to add up the number one 10 times, and then then ten times more? Of course not. You did simple arithmetic to make the math faster, adding 10 + 10. This is a simple fact, that when you distribute computation you work faster.

How can we effectively reduce the data size of any set, possibly by orders of magnitude, with almost no work? Simple, we start thinking in sets like our brains do.

I am going to use a bigger example than “digits” because this is too small for you to notice an effective change.

Lets say I have a table of numbers, and I want to sum them all up:

Now, what if I structure my data differently? We can “compress” the table in the database by removing the duplicates:

This is very useful compression. First, the data is stored inside of a histogram in the set. The cardinality and the distribution is always known. Second, computation on the compressed data is done without decompression, because a mathematical transformation does not have to be decompressed. Third, the COUNT acts as RLE compression for the other attributes. Don’t use FLOAT with this method (don’t use it at all).

We can do even more interesting things with this concept. Lets consider that I want to determine if I have certain numbers in my set?

First, lets create a table to hold the set of numbers (and their desired minimum cardinality) to a “search set” table:

We are looking for at least one minus-two, at least two fifteens, etc

Great, things are looking good. What if we were to do this on our original relation before compression? This is a way to do it without using the “search set” trick above.

Also, notice that it takes a long time to delete. There is no index on this table.

We can also check to see if any two numbers sum to zero:

Share Button

Justin Swanhart

Justin is a former Principal Support Engineer on the support team. In the past, he was a trainer at Percona and a consultant. Justin also created and maintains Shard-Query, a middleware tool for sharding and parallel query execution and Flexviews, a tool for materialized views for MySQL. Prior to working at Percona Justin consulted for Proven Scaling, was a backend engineer at Yahoo! and a database administrator at Smule and Gazillion games.



Leave a Reply

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.

Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below.

No, thank you. Please do not ask me again.