
Managing big data? Say ‘hello’ to HP Vertica

 | September 18, 2014 |  Posted In: Big Data, Insight for DBAs, MySQL


Over the past few months, I’ve seen an increase in the following use case while working on performance and schema review engagements:

I need to store exponentially increasing amounts of data and analyze all of it in real-time.

This is also known simply as: “We have big data.” Typically, this data is used for user interaction analysis, ad tracking, or other common clickstream applications. However, it can also be seen in threat assessment (DDoS mitigation, etc.), financial forecasting, and other applications. While MySQL (and other OLTP systems) can handle this to a degree, it is by no means their forte. Some of the pain points include:

  • Cost of rapidly increasing, expensive disk storage (OLTP disks need to be fast == $$)
  • Performance decrease as the data size increases
  • Wasted hardware resources (excess I/O, etc)
  • Impact against other time-sensitive transactions (i.e. OLTP workload)

While there are many approaches to this problem – and oftentimes the solution is actually a hybrid of many individually tailored components – one solution that I have seen more frequently in recent work is HP Vertica.

At the 30,000-foot level, Vertica is built around the following principles:

  • Columnar data store
  • Highly compressed data
  • Clustered solution for both availability and scalability
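A quick way to see why a columnar layout compresses so well is run-length encoding (RLE), one of the encodings Vertica applies to sorted, low-cardinality columns. The sketch below is illustrative Python, not Vertica’s actual implementation; the column data and function names are made up for the example.

```python
def rle_encode(column):
    """Collapse runs of repeated values into [value, count] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1         # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# A sorted, low-cardinality column (think country codes) collapses to a
# handful of runs no matter how many rows it spans:
country = ["FR"] * 500 + ["US"] * 300 + ["VG"] * 200
runs = rle_encode(country)
print(len(country), "values ->", len(runs), "runs")  # 1000 values -> 3 runs
assert rle_decode(runs) == country
```

Because the run count depends on the number of distinct runs rather than the row count, a billion-row column of a few distinct values stays tiny on disk, and predicates can be evaluated against the runs without decompressing anything.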

Over the next few weeks, I’ll discuss several aspects of Vertica including:

  • Underlying architecture and concepts
  • Basic installation and use
  • Different data loading techniques
  • Various maintenance/operational procedures
  • Comparisons vs. traditional OLTP (MySQL) performance
  • Some potential use-cases
  • Integration with other tools (such as Hadoop)

While Vertica is by no means the silver bullet that will solve all of your needs, it may prove to be a very valuable tool in your overall approach to managing big data.

Mike Benshoof

Michael joined Percona in 2012 as a US-based consultant. Prior to joining Percona, Michael spent several years in a DevOps role maintaining a SaaS application specializing in social networking. His experience includes application development and scaling, systems administration, and database administration and design. He enjoys designing extensible and flexible solutions to problems. When not working, he enjoys golfing, grilling, watching sports, and spending time with his family.

20 Comments

  • For those interested in how Vertica works and how it can be used (scripts, how-tos, and other stuff), feel free to visit www.verticablog.com, http://www.aodba.com/main_tutorials.php?page=vertica&id_tit=22 or http://www.aodba.com/main_articles_single.php?art=9999&page=vertica&id_tit=6
    – here I write my day-to-day stuff about how I work with Vertica.

  • The biggest advantage of Vertica is its raw speed. It is extremely fast compared to other analytical databases, and it boasts features that make joins extremely fast. The trade-offs are the high license costs, slow updates, and Vertica’s insatiable need to eat through tons of disk space.

  • David Chen, you are very right about the fact that Vertica is “simply fast,” but I will have to disagree with you that it has an expensive license cost and also that it is a disk space hog!
    License – it is priced per TB of raw data loaded into the database; once inside, you can replicate it 1 million times and it will still be under the 1 TB raw license, and you can delete the unwanted (old/archive) data and free space (unlike other big data solutions like Splunk).
    Disk space – totally not true – one of the key assets of Vertica’s architecture is the fact that it will take a mass of data and apply all types of encoding (such as RLE) and compression so that the footprint will be as small as possible. I can tell you that, depending on the data types you have, your disk occupation can use 90% less space (this is one of the keys that makes Vertica so fast – fewer blocks to retrieve). Also, Vertica will query the encoded data with no extra CPU cost.
    Bad updates – that is true if you compare it with an OLTP database system. Vertica will go nuts when you do updates (more than 22+ per sec). Why? Quick answer – Vertica never “overwrites” the data files, so every time you update, a new write to storage will happen.
    One thing I love about it is the fact that you can use any hardware you have, unlike Teradata, Exadata, etc., which come with their own hardware and gadgets.
    I work with Oracle Exadata as well, and I can tell you that a 4 GB host with 4 CPUs will load and query data faster than a Quarter Rack Exadata machine, and I don’t even use SSD drives on Vertica hosts (this is only my opinion, from my experience).
    Myself, I love Vertica as it is so fast; it can be tricky and still has a lot to develop in many ways.
    One thing to all who might think of using Vertica – Vertica is not meant to replace your MySQL, Oracle, SQL Server, etc. OLTP database. Vertica is there to do the heavy lifting and help you do the analytics stuff with less expense (time, money).
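The “Vertica never overwrites the data file” point above can be sketched in a few lines of Python: an update appends a new row version and masks the old one through a delete list (a loose analogy to how an append-only column store behaves; the class and method names here are invented for illustration, not Vertica’s API).

```python
class AppendOnlyStore:
    """Toy append-only store: updates never rewrite existing data in place."""

    def __init__(self):
        self.rows = []        # written once, never modified
        self.deleted = set()  # positions masked out on read ("delete vector")

    def insert(self, row):
        self.rows.append(row)
        return len(self.rows) - 1

    def update(self, pos, new_row):
        # The old version stays on "disk"; it is only marked as deleted,
        # and the new version is appended. Frequent updates therefore pile
        # up dead versions -- which is why updates are costly in this model.
        self.deleted.add(pos)
        return self.insert(new_row)

    def scan(self):
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


store = AppendOnlyStore()
pos = store.insert({"id": 1, "country": "FR"})
store.update(pos, {"id": 1, "country": "US"})
print(store.scan())     # [{'id': 1, 'country': 'US'}]
print(len(store.rows))  # 2 -- both versions still occupy storage
```

Reads stay fast because the stored data is never rewritten, but each update costs an extra write plus dead space until old versions are purged – the same trade-off the comment describes.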

  • David Chen — Could you please back up what you mean by “Vertica’s insatiable need to eat thru tons of disk space”?

    Foo — Infobright and InfiniDB are niche players compared to leaders such as Vertica.

    • Hi Fadi,
      Well, that is the key to a fast result set!
      Like I said, we have compressed data and encoded data. The compressed data will require some extra CPU cycles when retrieved, but most of the time Vertica uses encoding, as I describe in this article http://www.aodba.com/tut_output_mysql.php?tut=6&page=vertica – encoding creates a smaller footprint, and by doing this, data retrieval will be faster.
      And maybe your concept of columnar databases is not up to date, as columnar storage mixed with Vertica’s technology (projections) will enable only the data within the predicates’ reach to be retrieved, unlike an ordinary RDBMS that, unless an index is present, will need to read all the blocks in order to get the results.
      I hope I was clear 🙂

    • Hi Fadi,
      Just wanted to complete the previous post with a simple example:
      We have the table EMPS with 46,000,000 rows (1.2 GB of raw data).
      An example of the data content:
      1,Torres,F,FR,Brainbox
      2,Ortiz,F,VG,Mudo
      …..
      ….
      Once I load it into Vertica and apply all the encoding (I chose the best for my data types), the entire 46 million rows of data will occupy only 1.1 MB of space on my disk!
      See the table definition:

      CREATE PROJECTION emps_tt
      (
        id ENCODING RLE,
        last_name ENCODING RLE,
        gender ENCODING RLE,
        country ENCODING RLE,
        company_name ENCODING RLE
      )
      AS
      SELECT id, last_name, gender, country, company_name
      FROM public.emps
      ORDER BY gender, country, last_name, company_name, id
      UNSEGMENTED ALL NODES;

      And the used space:

      dbadmin=> select row_count, cast((used_bytes/1024/1024) as float(2)) as MB from v_monitor.projection_storage;
       row_count |  MB
      -----------+-------
        46000000 | 1.172

      Now let’s do this in Oracle (just as an example) – this is an Exadata machine:
      – we loaded the data quite fast :).

      select segment_name table_name,
             sum(bytes)/(1024*1024) table_size_meg
      from dba_extents
      where segment_type = 'TABLE'
        and segment_name = 'EMPS'
      group by segment_name;

      TABLE_NAME  TABLE_SIZE_MEG
      ----------  --------------
      EMPS                  1516

      The table definition:

      create table emps (
        id int,
        last_name varchar(30),
        gender char,
        country varchar(3),
        company_name varchar(30)
      );

      So at the end of the data loading we get 1516 MB of used space in Oracle and 1.1 MB in Vertica – quite a space consumption difference, don’t you think? (And just to remind you: no indexes were created on the Oracle table!)
      Again, the space consumption ratio depends on your data types (of course, I have used data that helped me make a point).
      If I am wrong in any way, please feel free to correct me.
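For scale: taking the two figures the comment reports at face value, the ratio works out to roughly 1300x. A quick sanity check, using only the numbers quoted above:

```python
# Figures reported in the comment above (taken at face value):
oracle_mb = 1516.0   # plain heap table on Exadata, no compression, no indexes
vertica_mb = 1.172   # RLE-encoded Vertica projection
ratio = oracle_mb / vertica_mb
print(f"~{ratio:.0f}x smaller")  # ~1294x smaller
```

As the commenter concedes, the data set was chosen to flatter RLE (sorted, very low cardinality), so real-world ratios will be far more modest.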

  • @Adrian Oprea: Since you used compression in Vertica, you should compare against Oracle with compression.
    Since you have Exadata accessible, can you test with hybrid columnar compression?
    Use "compress for archive high" (highest compression) or "compress for query low" (lowest compression) in your create table statement.

    • Hi Henrik,
      I am so happy you asked me this. Well, you see, like I said, it is not compression, it is encoding, and it comes by default if you run the Database Designer (a built-in Vertica tuning tool – or at least that’s what I call it), unlike Oracle where you need to specify it.
      Just to make it clear: I love the Oracle database and I think it holds an important role in the market, and Vertica is not here to replace any RDBMS. Here is the key to great system architecture – make the best of each technology (as they all serve a purpose).
      I will try to apply the compression types and see the results. Just to remind you that any compression will require decompression, so more CPU, etc. Encoded data is read as it is!

  • Pricing is an interesting question, and it is always a source of speculation, as you can’t find prices on the Vertica website.
    The $100,000 per TB figure is not taken from nowhere: if you google “vertica price per terabyte”, every link mentions $100K/TB.
    I guess it comes from this ex-CEO interview: https://gigaom.com/2013/11/02/ex-vertica-ceo-hadoop-is-pulling-the-rug-from-under-the-database-industry/.

    More recent slides mention a significant price decrease to $20K/TB:
    http://www.slideshare.net/sasindia/future-of-analytics-is-here-15616679

    Of course, neither of these can be a reliable source, so the only way to know is to call Vertica sales and generate them one more lead.

    • Just to shed some light on the Vertica licenses:
      – All development, homologation, or disaster sites are covered under the initial production license – no extra money.
      – Raw-data-size based (valid to store up to some amount of raw data) – the license applies only to the RAW data loaded; once loaded, it can be replicated as many times as you like.
      – Term-based (valid until a specific date).
      – Both term-based and data-size based.
      – Unlimited duration and data storage (perpetual).
      – The Community version will only allow you to create a 3-node cluster with a max of 1 TB (no backup available 🙁 – and other things are not possible either).
      And yes, the price may vary per location and demand – I think it is all about the way you negotiate.

      I am not an HP Vertica rep, so I have no intention of convincing anybody.

  • This is a nice blog!
    I got a lot of information from this blog, and it touches on Vertica monitoring. Vertica monitoring is about knowing who is able to access your database, why, and when; there are strategies for keeping the whole of the data secure.
