PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?

The expected growth of ARM processors in data centers has been a hot topic of discussion for quite some time, and we were curious to see how they perform with PostgreSQL. The lack of generally available ARM-based servers for testing and evaluation was a major obstacle. The icebreaker came when AWS announced its ARM-based processor offering in 2018. But there wasn’t much immediate excitement, as many considered it “experimental”. We were also cautious about recommending it for critical use and never put enough effort into evaluating it. But when the second-generation, Graviton2-based instances were announced in May 2020, we wanted to take them seriously. We decided to take an independent look at the price/performance of the new instances from the standpoint of running PostgreSQL.

Important: While it’s tempting to call this a comparison of PostgreSQL on x86 vs. ARM, that would not be correct. These tests compare PostgreSQL on two virtual cloud instances, which involves many more moving parts than just the CPU. We’re primarily focusing on the price-performance of two particular AWS EC2 instances based on two different architectures.

Test Setup

For this test, we picked two similar instances. One is the older m5d.8xlarge, and the other is the new Graviton2-based m6gd.8xlarge. Both instances come with local “ephemeral” storage, which we’ll be using here. Using very fast local drives should help expose differences in other parts of the system and avoid testing cloud storage. The instances are not perfectly identical, as you’ll see below, but they are close enough to be considered the same grade. We used Ubuntu 20.04 AMIs and PostgreSQL 13.1 from the PGDG repo. We performed tests with small (in-memory) and large (IO-bound) database sizes.

Instances

Specifications and on-demand pricing of the instances are as per the AWS pricing information for Linux in the Northern Virginia region. At the currently listed prices, m6gd.8xlarge is 25% cheaper.

Graviton2 (arm) Instance: m6gd.8xlarge, 32 vCPUs (each mapped 1:1 to a physical core), 128 GiB memory, one local NVMe SSD.

Regular (x86) Instance: m5d.8xlarge, 32 vCPUs (hyper-threads on 16 physical cores), 128 GiB memory, two local NVMe SSDs.

OS and PostgreSQL setup

We selected Ubuntu 20.04.1 LTS AMIs for the instances and didn’t change anything on the OS side. On the m5d.8xlarge instance, two local NVMe drives were unified in a single raid0 device. PostgreSQL was installed using .deb packages available from the PGDG repository.
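For reference, a sketch of that raid0 setup (the device names and mount point here are assumptions; actual NVMe device names vary by instance):

    sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
    sudo mkfs.ext4 /dev/md0
    sudo mount /dev/md0 /var/lib/postgresql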

The PostgreSQL version string confirms the OS architecture. Representative output (exact build details may differ):
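    postgres=# SELECT version();
     PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit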

** aarch64 stands for 64-bit ARM architecture

A tuned PostgreSQL configuration was used for testing; the exact settings are available in the GitHub repo linked at the end of this post.
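The snippet below is only an illustrative sketch of the kinds of parameters involved, with assumed values, not the exact configuration used:

    # postgresql.conf (illustrative values, not the exact test configuration)
    shared_buffers = 32GB               # keep the in-memory dataset fully cached
    checkpoint_timeout = 15min
    max_wal_size = 16GB                 # WAL volume can trigger checkpoints (see the dips in the charts below)
    checkpoint_completion_target = 0.9
    full_page_writes = on               # default; relevant to the post-checkpoint dips discussed later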

pgbench Tests

First, a preliminary round of tests was done using pgbench, the micro-benchmarking tool that ships with PostgreSQL. It allows testing with different combinations of client and job counts, for example:
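A representative invocation (the duration flag here is an assumption; the exact parameters used are in the GitHub repo linked at the end):

    pgbench -c 16 -j 16 -T 600 postgres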

Here, 16 client connections are used, fed by 16 pgbench jobs.

Read-Write Without Checksum

The default load that pgbench creates is a tpcb-like read-write load. We used the same on a PostgreSQL instance that doesn’t have checksums enabled.
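For reference, pgbench’s test tables are created with its -i (initialize) option; the scale factor below is an assumption chosen so that the dataset fits in memory:

    pgbench -i -s 1000 postgres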

We could see a 19% performance gain on ARM.

x86 (tps) 28878
ARM (tps) 34409

Read-Write With Checksum

We were curious whether the checksum calculation has any performance impact due to the architecture difference when PostgreSQL-level checksums are enabled. From PostgreSQL 12 onwards, checksums can be enabled on an existing cluster using the pg_checksums utility as follows:
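The cluster must be shut down cleanly first; $PGDATA below stands for the data directory:

    pg_checksums --enable -D $PGDATA

After restarting the server, SHOW data_checksums; should report on.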

x86 (tps) 29402
ARM (tps) 34701

To our surprise, the results were marginally better! Since the difference is only around 1.7%, we consider it noise. At the very least, we feel it is OK to conclude that enabling checksums doesn’t cause any noticeable performance degradation on these modern processors.

Read-Only Without Checksum

Read-only loads are expected to be CPU-centric. Since we selected a database size that fully fits into memory, IO-related overheads are eliminated.
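Read-only testing in pgbench is done with its built-in select-only script, for example:

    pgbench -S -c 16 -j 16 -T 600 postgres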

x86 (tps) 221436.05
ARM (tps) 288867.44

The results showed a 30% tps gain for the ARM instance over the x86 instance.

Read-Only With Checksum

We wanted to check whether we could observe any tps change with checksums enabled when the load becomes purely CPU-centric.

x86 (tps) 221144.3858
ARM (tps) 279753.1082

The results were very close to the previous ones, with a 26.5% gain for ARM.

In the pgbench tests, we observed that the performance difference increases as the load becomes more CPU-centric. We couldn’t observe any performance degradation from checksums.

Note on checksums

PostgreSQL calculates and writes page checksums when pages are written out of the buffer pool and verifies them when pages are read back in. In addition, hint bits are always logged when checksums are enabled, increasing the WAL IO pressure. To correctly evaluate the overall checksum overhead, we would need longer and larger tests, similar to the ones we did with sysbench-tpcc.

Testing With sysbench-tpcc

We decided to perform more detailed tests using sysbench-tpcc. We were mainly interested in the case where the database fits into memory. On a side note, while PostgreSQL itself showed no issues on the ARM server, sysbench was much more finicky there than on x86.
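For reference, sysbench with PostgreSQL support is typically built from source; a sketch of the usual steps (this is where most of the friction on ARM showed up):

    git clone https://github.com/akopytov/sysbench.git
    cd sysbench
    ./autogen.sh
    ./configure --with-pgsql
    make -j
    sudo make install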

Each round of testing consisted of a few steps:

  1. Restore the data directory of the necessary scale (10/200).
  2. Run a 10-minute warmup test with the same parameters as the large test.
  3. Issue a CHECKPOINT on the PostgreSQL side.
  4. Run the actual test (a representative invocation is shown below).
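A representative sysbench-tpcc run, assuming Percona’s sysbench-tpcc scripts (https://github.com/Percona-Lab/sysbench-tpcc); thread count, time, and scale varied per round, and the exact parameters are in the GitHub repo linked below:

    # prepare is only needed once per scale; we restored ready-made data directories instead
    ./tpcc.lua --db-driver=pgsql --pgsql-db=sbtest \
        --tables=10 --scale=10 --threads=16 --time=1800 \
        --report-interval=1 run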

In-memory, 16 threads:

[chart: tps over time, in-memory, 16 threads]

With this moderate load, the ARM instance shows around 15.5% better performance than the x86 instance. Here and below, the percentage difference is based on the mean tps value.

You might be wondering why there is a sudden drop in performance towards the end of the test. It is related to checkpointing with full_page_writes. Even though we used a pareto distribution for the in-memory testing, a considerable number of pages gets written out after each checkpoint. The instance showing higher performance generates WAL faster and therefore hits the WAL-triggered checkpoint earlier than its counterpart. These dips are present across all the tests performed.
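A quick way to confirm that checkpoints are WAL-triggered rather than time-triggered (a diagnostic sketch, not part of the original test scripts):

    -- checkpoints_req counts requested (e.g., WAL-triggered) checkpoints;
    -- checkpoints_timed counts those triggered by checkpoint_timeout
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;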

In-memory, 32 threads:

[chart: tps over time, in-memory, 32 threads]

When concurrency increased to 32, the difference in performance reduced to nearly 8%.

In-memory, 64 threads:

[chart: tps over time, in-memory, 64 threads]

Pushing the instances close to their saturation point (remember, both are 32-vCPU instances), we see the difference reducing further, to 4.5%.

In-memory, 128 threads:

[chart: tps over time, in-memory, 128 threads]

When both instances are past their saturation point, the difference in performance becomes negligible, although it is still there at 1.4%. Additionally, we observed a 6-7% drop in throughput (tps) for ARM and a 4% drop for x86 when concurrency increased from 64 to 128 on these 32-vCPU machines.

Not everything we measured favors the Graviton2-based instance. In the IO-bound tests (~200 GB dataset, 200 warehouses, uniform distribution), we saw less difference between the two instances, and at 64 and 128 threads, the regular m5d instance performed better. You can see this in the combined plots below.

A possible reason for this, especially the significant drop at 128 threads for m6gd.8xlarge, is that it lacks the second drive that m5d.8xlarge has. There is no perfectly comparable pair of instances available currently, so we consider this a fair comparison; each instance type has an advantage. More testing and profiling is necessary to correctly identify the cause, as we had expected the local drives to affect the tests only negligibly. IO-bound testing with EBS could be performed to take the local drives out of the equation.

More details of the test setup, results of the tests, scripts used, and data generated during the testing are available from this GitHub repo.

Summary

There were not many cases where the ARM instance was slower than the x86 instance in the tests we performed. The test results were consistent throughout the last couple of days of testing. While the ARM-based instance is 25% cheaper, it showed a 15-20% performance gain over the corresponding x86-based instance in most of the tests. So ARM-based instances offer conclusively better price-performance in all aspects. We should expect more and more cloud providers to offer ARM-based instances in the future. Please let us know if you would like to see any other types of benchmark tests.


Comments (19)

  • Yuriy Safris

    One note for comparison:
    m6gd.8xlarge Virtual CPUs : 32 – these are 32 physical cores
    m5d.8xlarge Virtual CPUs : 32 – these are 32 virtual threads or 16 physical cores
    Thus, you are comparing 32 physical cores against 16.
    Considering that the competitors were selected on the basis of comparable value, the comparison can be considered quite correct.
    But it should be borne in mind that with an equal number of cores, the solution with Graviton2 will be much slower.

    January 22, 2021 at 2:51 pm
    • chessmango

      Not sure this will necessarily be the case, though valid point anyway. One method could be to drop one instance size down for the Graviton2 variant, and disable SMT on the other: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/ (guide is for Amazon’s own distro, but same principle applies)

      January 23, 2021 at 8:14 am
    • Jobin Augustine

      Thank you @Yuriy. Yes, this is more of a value comparison between two instance types: what we pay for and what we get. The point you raised is very significant for cloud instances: in the m6gd.8xlarge instance, the vCPUs are in a 1-to-1 relation with physical cores. When there are more vCPUs than physical cores, the chance of “noisy neighbor” issues increases. As we know, some of the AWS instance types are notorious for noisy neighbor problems and unexpected CPU “steal” at the wrong times.

      January 23, 2021 at 11:30 pm
    • krunal bauskar

      vCPU vs. physical core is a valid point, but given two different architectures, it is difficult to get machines with exactly the same configuration.

      x86 has hyper-threading, turbo mode, and similar optimizations, and disabling them in favor of ARM (since ARM doesn’t have them) may not be right.

      —————-

      I would suggest we look at it from different perspective.

      Let’s keep the cost constant (and other resources like storage and memory) and allow the compute power to vary
      (ARM, being cheaper, will get more compute resources).

      This would give a fair idea of how much more TPS/USD ARM can deliver for a given cost.

      —————–

      This model is tagged as #cpm (cost performance model) and is widely used for MySQL-on-ARM evaluation.
      You can read more about it here: https://mysqlonarm.github.io/CPM/

      January 26, 2021 at 11:07 pm
  • Hareesh H

    The difference in CPU cache sizes between the Graviton and Intel processors could be the reason for the better performance of Intel on IO-bound loads. I have observed similar results between AMD and Intel processors.

    January 22, 2021 at 5:26 pm
  • Aituar

    I agree with Yuriy Safris.
    Also, the measurements themselves are meaningless without error margins.
    For example, 221436 ± 30000 would already overlap with 288867 ± 30000; in that case, the two measurements would be nearly equal to each other.

    January 23, 2021 at 5:37 am
  • anaconda

    Compare against Zen 2 real cores, and ARM will lose.

    January 23, 2021 at 10:49 am