The case for getting rid of duplicate “sets”

The most useful feature of the relational database is that it allows us to easily process data in sets, which can be much faster than processing it serially.

When the relational database was first implemented, write-ahead-logging and other technologies did not exist. This made it difficult to implement the database in a way that matched with natural set theory. You see in, in set theory there is no possibility of duplicates in sets. This seems hard to fathom with the way we are used to dealing with data in a computer, but intuitive when you “think about how you think”.

When you say, “I have ten fingers and ten toes”, and I say, “how many is that combined?” you say 20, of course. Did you stop to add up the number one 10 times, and then then ten times more? Of course not. You did simple arithmetic to make the math faster, adding 10 + 10. This is a simple fact, that when you distribute computation you work faster.

How can we effectively reduce the data size of any set, possibly by orders of magnitude, with almost no work? Simple, we start thinking in sets like our brains do.

I am going to use a bigger example than “digits” because this is too small for you to notice an effective change.

Lets say I have a table of numbers, and I want to sum them all up:

mysql> show create table ex1;
+-------+-----------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                        |
+-------+-----------------------------------------------------------------------------------------------------+
| ex1   | CREATE TABLE `ex1` (
  `val` bigint(20) NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+-----------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from ex1;
+----------+
| count(*) |
+----------+
| 73027220 |
+----------+
1 row in set (0.00 sec)

mysql> select sum(val) from ex1;
+-----------+
| sum(val)  |
+-----------+
| 871537665 |
+-----------+
1 row in set (5.49 sec)

mysql> show create table ex1;

+-------+-----------------------------------------------------------------------------------------------------+

| Table | Create Table |

+-------+-----------------------------------------------------------------------------------------------------+

| ex1 | CREATE TABLE `ex1` (

`val` bigint(20) NOT NULL DEFAULT '0'

) ENGINE=MyISAM DEFAULT CHARSET=latin1 |

+-------+-----------------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

mysql> select count(*) from ex1;

+----------+

| count(*) |

+----------+

| 73027220 |

+----------+

1 row in set (0.00 sec)

mysql> select sum(val) from ex1;

+-----------+

| sum(val) |

+-----------+

| 871537665 |

+-----------+

1 row in set (5.49 sec)

Now, what if I structure my data differently? We can “compress” the table in the database by removing the duplicates:

mysql> create table ex2 as select val, count(*) from ex1 group by val;
Query OK, 9 rows affected (19.51 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> select * from ex2;
+-----+----------+
| val | count(*) |
+-----+----------+
| -10 |        1 |
|  -3 |        1 |
|  -2 |        1 |
|   0 |   250000 |
|   4 |  8000000 |
|   7 | 14680064 |
|  14 | 14680064 |
|  15 | 35417088 |
|  16 |        1 |
+-----+----------+
9 rows in set (0.00 sec)

mysql> select sum(`count(*)`) from ex2;
+-----------------+
| sum(`count(*)`) |
+-----------------+
|        73027220 |
+-----------------+
1 row in set (0.00 sec)

mysql> create table ex2 as select val, count(*) from ex1 group by val;

Query OK, 9 rows affected (19.51 sec)

Records: 9 Duplicates: 0 Warnings: 0

mysql> select * from ex2;

+-----+----------+

| val | count(*) |

+-----+----------+

| -10 | 1 |

| -3 | 1 |

| -2 | 1 |

| 0 | 250000 |

| 4 | 8000000 |

| 7 | 14680064 |

| 14 | 14680064 |

| 15 | 35417088 |

| 16 | 1 |

+-----+----------+

9 rows in set (0.00 sec)

mysql> select sum(`count(*)`) from ex2;

+-----------------+

| sum(`count(*)`) |

+-----------------+

| 73027220 |

+-----------------+

1 row in set (0.00 sec)

This is very useful compression. First, the data is stored inside of a histogram in the set. The cardinality and the distribution is always known. Second, computation on the compressed data is done without decompression, because a mathematical transformation does not have to be decompressed. Third, the COUNT acts as RLE compression for the other attributes. Don’t use FLOAT with this method (don’t use it at all).

We can do even more interesting things with this concept. Lets consider that I want to determine if I have certain numbers in my set?

First, lets create a table to hold the set of numbers (and their desired minimum cardinality) to a “search set” table:

CREATE TABLE `search_set` (
  `val` bigint(20) DEFAULT NULL,
  `cnt` bigint(20) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

mysql> insert into search_set values (-2,1),(-3,1),(-10,1),(15,2),(16,1);
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

CREATE TABLE `search_set` (

`val` bigint(20) DEFAULT NULL,

`cnt` bigint(20) DEFAULT NULL

) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

mysql> insert into search_set values (-2,1),(-3,1),(-10,1),(15,2),(16,1);

Query OK, 5 rows affected (0.00 sec)

Records: 5 Duplicates: 0 Warnings: 0

We are looking for at least one minus-two, at least two fifteens, etc

select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;

mysql> select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;
+-----+----------+------+
| val | cnt      | cnt  |
+-----+----------+------+
| -10 |        1 |    1 |
|  -3 |        1 |    1 |
|  -2 |        1 |    1 |
|  15 | 35417088 |    2 |
|  16 |        1 |    1 |
+-----+----------+------+
5 rows in set (0.00 sec)

mysql> delete from compressed where val=16;
Query OK, 1 row affected (0.00 sec)

mysql> select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;
+-----+----------+------+
| val | cnt      | cnt  |
+-----+----------+------+
| -10 |        1 |    1 |
|  -3 |        1 |    1 |
|  -2 |        1 |    1 |
|  15 | 35417088 |    2 |
+-----+----------+------+
4 rows in set (0.00 sec)

select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;

mysql> select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;

+-----+----------+------+

| val | cnt | cnt |

+-----+----------+------+

| -10 | 1 | 1 |

| -3 | 1 | 1 |

| -2 | 1 | 1 |

| 15 | 35417088 | 2 |

| 16 | 1 | 1 |

+-----+----------+------+

5 rows in set (0.00 sec)

mysql> delete from compressed where val=16;

Query OK, 1 row affected (0.00 sec)

mysql> select * from compressed join search_set using (val) where compressed.cnt >= search_set.cnt;

+-----+----------+------+

| val | cnt | cnt |

+-----+----------+------+

| -10 | 1 | 1 |

| -3 | 1 | 1 |

| -2 | 1 | 1 |

| 15 | 35417088 | 2 |

+-----+----------+------+

4 rows in set (0.00 sec)

Great, things are looking good. What if we were to do this on our original relation before compression? This is a way to do it without using the “search set” trick above.

Also, notice that it takes a long time to delete. There is no index on this table.

mysql> delete from data where val=16;
Query OK, 1 row affected (3.14 sec)

mysql> select val, count(distinct id) from data             where val in (-2,-3,-10,15,15,16)  group by val ;
+-----+--------------------+
| val | count(distinct id) |
+-----+--------------------+
| -10 |                  1 |
|  -3 |                  1 |
|  -2 |                  1 |
|  15 |           35417088 |
+-----+--------------------+
4 rows in set (39.82 sec)

mysql> delete from data where val=16;

Query OK, 1 row affected (3.14 sec)

mysql> select val, count(distinct id) from data where val in (-2,-3,-10,15,15,16) group by val ;

+-----+--------------------+

| val | count(distinct id) |

+-----+--------------------+

| -10 | 1 |

| -3 | 1 |

| -2 | 1 |

| 15 | 35417088 |

+-----+--------------------+

4 rows in set (39.82 sec)

We can also check to see if any two numbers sum to zero:

mysql> select a.val, b.val from ex2 a,ex2 b where a.val + b.val = 0;
+-----+-----+
| val | val |
+-----+-----+
|   0 |   0 |
+-----+-----+
1 row in set (0.00 sec)

mysql> select a.val, b.val, count(*) from ex2 a,ex2 b where a.val + b.val = 0 group by 1,2 ;
+-----+-----+----------+
| val | val | count(*) |
+-----+-----+----------+
|   0 |   0 |        1 |
+-----+-----+----------+
1 row in set (0.00 sec)

mysql> insert into ex2 values (2,1);
Query OK, 1 row affected (0.00 sec)

mysql> select a.val, b.val, count(*) from ex2 a,ex2 b where a.val + b.val = 0 group by 1,2 ;
+-----+-----+----------+
| val | val | count(*) |
+-----+-----+----------+
|  -2 |   2 |        1 |
|   0 |   0 |        1 |
|   2 |  -2 |        1 |
+-----+-----+----------+
3 rows in set (0.00 sec)

mysql> select a.val, b.val from ex2 a,ex2 b where a.val + b.val = 0;

+-----+-----+

| val | val |

+-----+-----+

| 0 | 0 |

+-----+-----+

1 row in set (0.00 sec)

mysql> select a.val, b.val, count(*) from ex2 a,ex2 b where a.val + b.val = 0 group by 1,2 ;

+-----+-----+----------+

| val | val | count(*) |

+-----+-----+----------+

| 0 | 0 | 1 |

+-----+-----+----------+

1 row in set (0.00 sec)

mysql> insert into ex2 values (2,1);

Query OK, 1 row affected (0.00 sec)

mysql> select a.val, b.val, count(*) from ex2 a,ex2 b where a.val + b.val = 0 group by 1,2 ;

+-----+-----+----------+

| val | val | count(*) |

+-----+-----+----------+

| -2 | 2 | 1 |

| 0 | 0 | 1 |

| 2 | -2 | 1 |

+-----+-----+----------+

3 rows in set (0.00 sec)

2 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Justin Swanhart

Author

14 years ago

http://flexvie.ws

I wrote a complete materialized view solution for MySQL that supports all aggregate functions and deferred refresh for all aggregate functions. Only support for inner joins. You do not need outer joins in a knowledge management system.

Mark Leith

14 years ago

Welcome to data warehouse summary/aggregate/fact tables (just to name them properly)..

http://en.wikipedia.org/wiki/Aggregate_(Data_Warehouse)
http://en.wikipedia.org/wiki/Fact_table

MySQL 5.7
Support

Compare Percona to Leading Database Solutions

Software
Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

The case for getting rid of duplicate “sets”

Related Blog Articles

RECOMMENDED ARTICLES

JavaScript Stored Routines in Percona Server for MySQL: A New Era for Database Programmability

A Christmas Carol of Two Databases

How to Turn a MySQL Unique Key Into a Primary Key

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7 Support

Compare Percona to Leading Database Solutions

Software Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

The case for getting rid of duplicate “sets”

About the Author

Share This Post!

Stay up to date with the Percona Blog

Related Blog Articles

RECOMMENDED ARTICLES

JavaScript Stored Routines in Percona Server for MySQL: A New Era for Database Programmability

A Christmas Carol of Two Databases

How to Turn a MySQL Unique Key Into a Primary Key

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7
Support

Software
Downloads