The expected growth of ARM processors in data centers has been a hot topic for discussion for quite some time, and we were curious to see how it performs with PostgreSQL. The general availability of ARM-based servers for testing and evaluation was a major obstacle. The icebreaker was when AWS announced their ARM-based processors offering in their cloud in 2018. But we couldn’t see much excitement immediately, as many considered it is more “experimental” stuff. We were also cautious about recommending it for critical use and never gave enough effort in evaluating it. But when the second generation of Graviton2 based instances was announced in May 2020, we wanted to seriously consider. We decided to take an independent look at the price/performance of the new instances from the standpoint of running PostgreSQL.
Important: Note that while it’s tempting to call this comparison of PostgreSQL on x86 vs arm, that would not be correct. These tests compare PostgreSQL on two virtual cloud instances, and that includes way more moving parts than just a CPU. We’re primarily focusing on the price-performance of two particular AWS EC2 instances based on two different architectures.
For this test, we picked two similar instances. One is the older m5d.8xlarge, and the other is a new Graviton2-based m6gd.8xlarge. Both instances come with local “ephemeral” storage that we’ll be using here. Using very fast local drives should help expose differences in other parts of the system and avoid testing cloud storage. The instances are not perfectly identical, as you’ll see below, but are close enough to be considered same grade. We used Ubuntu 20.04 AMI and PostgreSQL 13.1 from pgdg repo. We performed tests with small (in-memory) and large (io-bound) database sizes.
Specifications and On-Demand pricing of the instances as per the AWS Pricing Information for Linux in the Northern Virginia region. With the currently listed prices, m6gd.8xlarge is 25% cheaper.
Graviton2 (arm) Instance
Instance : m6gd.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 1 x 1900 NVMe SSD (1.9 TiB)
Price : $1.4464 per Hour
Regular (x86) Instance
Instance : m5d.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 2 x 600 NVMe SSD (1.2 TiB)
Price : $1.808 per Hour
OS and PostgreSQL setup
We selected Ubuntu 20.04.1 LTS AMIs for the instances and didn’t change anything on the OS side. On the m5d.8xlarge instance, two local NVMe drives were unified in a single raid0 device. PostgreSQL was installed using .deb packages available from the PGDG repository.
The PostgreSQL version string shows confirm the OS architecture
postgres=# select version();
PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
** aarch64 stands for 64-bit ARM architecture
The following PostgreSQL configuration was used for testing.
max_connections = '200'
shared_buffers = '32GB'
checkpoint_timeout = '1h'
max_wal_size = '96GB'
checkpoint_completion_target = '0.9'
archive_mode = 'on'
archive_command = '/bin/true'
random_page_cost = '1.0'
effective_cache_size = '80GB'
maintenance_work_mem = '2GB'
autovacuum_vacuum_scale_factor = '0.4'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
wal_compression = 'ON'
log_checkpoints = 'ON'
log_autovacuum_min_duration = '0'
First, a preliminary round of tests is done using pgbench, the micro-benchmarking tool available with PostgreSQL. This allows us to test with a different combination of a number of clients and jobs like:
pgbench -c 16 -j 16 -T 600 -r
Where 16 client connections and 16 pgbench jobs feeding the client connections are used.
Read-Write Without Checksum
The default load that
pgbench creates is a tpcb-like Read-write load. We used the same on a PostgreSQL instance which doesn’t have checksum enabled.
We could see a 19% performance gain on ARM.
Read-Write With Checksum
We were curious whether the checksum calculation has any impact on Performance due to the architecture difference. if the PostgreSQL level checksum is enabled. PostgreSQL 12 onwards, the checksum can be enabled using pg_checksum utility as follows:
pg_checksums -e -D $PGDATA
To our surprise, the results were marginally better! Since the difference is around just 1.7%, we consider it as a noise. At least we feel that it is ok to conclude that enabling checksum doesn’t have any noticeable performance degradation on these modern processors.
Read-Only Without Checksum
Read-only loads are expected to be CPU-centric. Since we selected a database size that fully fits into memory, we could eliminate IO related overheads.
The results showed a 30% gain in tps for the ARM than the x86 instance.
Read-Only With Checksum
We wanted to check whether we could observe any tps change if we have checksum enabled when the load becomes purely CPU centric.
The results were very close to the previous one, with 26.5% gains.
In pgbench tests, we observed that as the load becomes CPU centric, the difference in performance increases. We couldn’t observe any performance degradation with checksum.
Note on checksums
PostgreSQL calculates and writes checksum for pages when they are written out and read in the buffer pool. In addition, hint bits are always logged when checksums are enabled, increasing the WAL IO pressure. To correctly validate the overall checksum overhead, we would need longer and larger testing, similar to once we did with sysbench-tpcc.
Testing With sysbench-tpcc
We decided to perform more detailed tests using sysbench-tpcc. We were mainly interested in the case where the database fits into memory. On a side note, while PostgreSQL on the arm server showed no issues, sysbench was much more finicky compared to the x86 one.
Each round of testing consisted of a few steps:
- Restore the data directory of the necessary scale (10/200).
- Run a 10-minute warmup test with the same parameters as the large test.
- Checkpoint on the PG side.
- Run the actual test.
In-memory, 16 threads:
With this moderate load, the ARM instance shows around 15.5% better performance than the x86 instance. Here and after, the percentage difference is based on the mean tps value.
You might be wondering why there is a sudden drop in performance towards the end of the test. It is related to checkpointing with full_page_writes. Even though for in-memory testing we used pareto distribution, a considerable amount of pages is going to be written out after each checkpoint. In this case, the instance showing more performance triggered checkpoint by WAL earlier than its counterpart. These dips are going to be present across all tests performed.
In-memory, 32 threads:
When concurrency increased to 32, the difference in performance reduced to nearly 8%.
In-memory, 64 threads:
Pushing instances close to their saturation point (remember, both are 32-cpu instances), we see the difference reducing further to 4.5%.
In-memory, 128 threads:
When both instances are past their saturation point, the difference in performance becomes negligible, although it’s still there at 1.4% Additionally, we could observe a 6-7% drop in throughput(tps) for ARM and a 4% drop for x86 when concurrency increased from 64 to 128 on these 32 vCPU machines.
Not everything we measured is favorable to the Graviton2-based instance. In the IO-bound tests (~200G dataset, 200 warehouses, uniform distribution), we saw less difference between the two instances, and at 64 and 128 threads, regular m5d instance performed better. You can see this on the combined plots below.
A possible reason for this, especially the significant meltdown at 128 threads for m6gd.8xlarge, is that it lacks the second drive that m5d.8xlarge has. There’s no perfectly comparable couple of instances available currently, so we consider this a fair comparison; each instance type has an advantage. More testing and profiling is necessary to correctly identify the cause, as we expected local drives to negligibly affect the tests. IO-bound testing with EBS can potentially be performed to try and remove the local drives from the equation.
More details of the test setup, results of the tests, scripts used, and data generated during the testing are available from this GitHub repo.
There were not many cases where the ARM instance becomes slower than the x86 instance in the tests we performed. The test results were consistent throughout the testing of the last couple of days. While ARM-based instance is 25 percent cheaper, it is able to show a 15-20% performance gain in most of the tests over the corresponding x86 based instances. So ARM-based instances are giving conclusively better price-performance in all aspects. We should expect more and more cloud providers to provide ARM-based instances in the future. Please let us know if you wish to see any different type of benchmark tests.