05/12/2021

Percona Live Online 2021

With the rapid onset of the global COVID-19 pandemic in 2020, the US Centers for Disease Control and Prevention (CDC) quickly implemented a new COVID-19 pipeline to collect testing data from all of the US states and territories and produce multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka.

We built a similar (but simpler) demonstration pipeline for ingesting, indexing, and visualizing publicly available tidal data, using multiple open source technologies: Apache Kafka, Apache Kafka Connect, Apache Camel Kafka Connectors, Open Distro for Elasticsearch and Kibana, and Prometheus and Grafana.
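
To make the ingestion step concrete, here is a minimal sketch of how a source connector could be registered with the Kafka Connect REST API to poll a NOAA endpoint. This is not the talk's exact configuration: the connector class, the "camel.source.path.url" option, the connector name, topic, station, and host/port are all assumptions based on the Apache Camel Kafka Connectors naming scheme.

```python
import requests

# Minimal sketch: register a source connector with the Kafka Connect
# REST API (assumed to be listening on localhost:8083). The connector
# class and option names below are assumptions, not the talk's config.
connector = {
    "name": "noaa-tidal-source",            # hypothetical connector name
    "config": {
        "connector.class":
            "org.apache.camel.kafkaconnector.http.CamelHttpSourceConnector",
        "tasks.max": "1",
        "topics": "tides-topic",            # hypothetical Kafka topic
        # Hypothetical NOAA CO-OPS water-level query (station 8454000):
        "camel.source.path.url": (
            "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"
            "?date=latest&product=water_level&datum=MSL&units=metric"
            "&time_zone=gmt&format=json&station=8454000"
        ),
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```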

In this talk, we introduce each technology and the pipeline architecture, then walk through the steps, challenges, and solutions involved in building an initial integration pipeline: consuming USA National Oceanic and Atmospheric Administration (NOAA) tidal data, mapping and indexing the data types in Elasticsearch, and adding missing data with an ingest pipeline. The goal is to visualize the results with Kibana, where we'll see the period of the lunar day and the size and location of some small and large tidal ranges.
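
A minimal sketch of the mapping and ingest-pipeline steps, using the elasticsearch Python client: the index name, field names, and the "set" processor that fills in a missing station field are illustrative assumptions, not the talk's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical explicit mapping for the tidal index; field names are
# assumptions based on typical NOAA water-level records.
es.indices.create(
    index="tides",
    mappings={
        "properties": {
            "t": {"type": "date"},           # observation timestamp
            "v": {"type": "double"},         # water level value
            "station": {"type": "keyword"},  # NOAA station id
        }
    },
)

# An ingest pipeline that adds a missing field at index time, e.g.
# tagging each document with its source station (illustrative).
es.ingest.put_pipeline(
    id="add-station",
    processors=[
        {"set": {"field": "station", "value": "8454000", "override": False}}
    ],
)

# Documents indexed through the pipeline pick up the missing field:
es.index(index="tides", pipeline="add-station",
         document={"t": "2021-05-12T00:00:00Z", "v": 1.23})
```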

But what can go wrong? The initial pipeline only worked briefly, failing whenever it encountered exceptions. To make the pipeline more robust, we investigated Apache Kafka Connect exception handling and evaluated the benefits of using Apache Camel Kafka Connectors and Elasticsearch schema validation.
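
Kafka Connect's built-in error-handling options are one way to keep a connector running past bad records. The "errors.*" properties below are standard Kafka Connect settings (dead letter queues apply to sink connectors only), while the connector name and dead-letter topic are hypothetical; this is a sketch, not the fix described in the talk.

```python
import requests

# Hypothetical sink connector name on an assumed Connect worker:
BASE = "http://localhost:8083/connectors/tides-elasticsearch-sink"

# Fetch the connector's current configuration...
current = requests.get(f"{BASE}/config").json()

# ...and merge in Kafka Connect's standard error-handling options:
current.update({
    "errors.tolerance": "all",              # skip failed records instead of killing the task
    "errors.log.enable": "true",            # log each failure
    "errors.log.include.messages": "true",  # include record contents in the log
    "errors.deadletterqueue.topic.name": "tides-dlq",         # hypothetical DLQ topic
    "errors.deadletterqueue.context.headers.enable": "true",  # record failure context in headers
})

# PUT replaces the connector's whole configuration.
requests.put(f"{BASE}/config", json=current).raise_for_status()
```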

With a sufficiently robust pipeline in place, it was time to scale it up. The first step is to select and monitor the most relevant metrics across multiple technologies. We configured Prometheus to collect the metrics and Grafana to produce a dashboard. With the monitoring in place, we were able to systematically increase the pipeline throughput by increasing the number of Kafka connector tasks, while watching out for potential bottlenecks. We discovered, and fixed, two bottlenecks in the pipeline, proving the value of this approach to pipeline scaling.
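
Scaling a connector typically means raising its tasks.max and re-checking the metrics before the next increase. A rough sketch, assuming the same hypothetical sink connector and a Prometheus server scraping the Connect workers:

```python
import requests

# Hypothetical sink connector name on an assumed Connect worker:
BASE = "http://localhost:8083/connectors/tides-elasticsearch-sink"

# Double the connector's task count (tasks.max is a standard Kafka
# Connect property) and resubmit the full configuration.
config = requests.get(f"{BASE}/config").json()
config["tasks.max"] = str(int(config.get("tasks.max", "1")) * 2)
requests.put(f"{BASE}/config", json=config).raise_for_status()

# Then watch throughput in Prometheus before scaling further, e.g. a
# query like the following (the metric name depends on how the Kafka
# Connect JMX metrics are exported, so treat it as illustrative):
#   rate(kafka_connect_sink_task_sink_record_read_total[5m])
```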

We conclude the presentation with lessons learned so far and some potential future challenges.

Speaker: Paul Brebner – Instaclustr.com