The case for getting rid of duplicate “sets”

The most useful feature of the relational database is that it allows us to easily process data in sets, which can be much faster than processing it serially.

When the relational database was first implemented, write-ahead-logging and other technologies did not exist. This made it difficult to implement the database in a way that matched with natural set theory. You see in, in set theory there is no possibility of duplicates in sets. This seems hard to fathom with the way we are used to dealing with data in a computer, but intuitive when you “think about how you think”.

When you say, “I have ten fingers and ten toes”, and I say, “how many is that combined?” you say 20, of course. Did you stop to add up the number one 10 times, and then then ten times more? Of course not. You did simple arithmetic to make the math faster, adding 10 + 10. This is a simple fact, that when you distribute computation you work faster.

How can we effectively reduce the data size of any set, possibly by orders of magnitude, with almost no work? Simple, we start thinking in sets like our brains do.

I am going to use a bigger example than “digits” because this is too small for you to notice an effective change.

Lets say I have a table of numbers, and I want to sum them all up:

Now, what if I structure my data differently? We can “compress” the table in the database by removing the duplicates:

This is very useful compression. First, the data is stored inside of a histogram in the set. The cardinality and the distribution is always known. Second, computation on the compressed data is done without decompression, because a mathematical transformation does not have to be decompressed. Third, the COUNT acts as RLE compression for the other attributes. Don’t use FLOAT with this method (don’t use it at all).

We can do even more interesting things with this concept. Lets consider that I want to determine if I have certain numbers in my set?

First, lets create a table to hold the set of numbers (and their desired minimum cardinality) to a “search set” table:

We are looking for at least one minus-two, at least two fifteens, etc

Great, things are looking good. What if we were to do this on our original relation before compression? This is a way to do it without using the “search set” trick above.

Also, notice that it takes a long time to delete. There is no index on this table.

We can also check to see if any two numbers sum to zero: