Querying data from a Druid database (Part 3 of Druid tutorial)


This is part three in a series of tutorials about Druid, a high-performance, scalable, distributed time-series datastore. In this part we will learn how to query data from a Druid database with topN queries, timeseries queries and time boundary queries. More advanced querying techniques will be covered in later parts of this tutorial.

This tutorial, like the previous parts, expects the reader to have basic knowledge of system administration and some experience working on the command line. If you don't yet have a local Druid instance running, or don't have the sample dataset loaded into the database, please refer to the previous parts of this tutorial.

Your questions and suggestions are welcome in the comment section, as always. Examples are tested on macOS, but should work without changes on Linux as well. Windows users are advised to use a virtual machine to emulate a Linux environment, since Druid does not support Windows.

TopN query

Out of the box, Druid uses a powerful JSON-based query language. There are many third-party libraries and tools that provide other kinds of query interfaces, for example SQL-like query syntax. In this tutorial we will concentrate on the default way of interacting with Druid: JSON queries through a RESTful API.
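Every query in this tutorial follows the same transport pattern: you write the query as a JSON document and POST it to the broker node over HTTP. As a minimal sketch (my-query.json here is a placeholder for any of the query files below; 8082 is the broker port in the default quickstart configuration):

$ curl -H'Content-Type: application/json' -XPOST --data-binary @my-query.json http://localhost:8082/druid/v2/\?pretty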

As with the input data example, a simple Druid query example is packaged together with Druid itself. It's located in ./quickstart/wikiticker-top-pages.json:

{
  "queryType" : "topN",
  "dataSource" : "wikiticker",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "edits",
  "threshold" : 25,
  "aggregations" : [
    {
      "type" : "longSum",
      "name" : "edits",
      "fieldName" : "count"
    }
  ]
}

Let's discuss each field of this query:

  • the first thing Druid considers when receiving a query is the queryType parameter. The query type determines what kind of insight we want to receive as a result of our query. A topN query aggregates data over a single dimension and returns the records with the highest aggregated value;
  • dataSource is similar to the table name in relational databases;
  • intervals limits the time range of our query;
  • granularity specifies whether rows should be pre-aggregated into bigger time "chunks" in the result. Note that query granularity can not be finer than ingestion granularity (see the sketch after this list);
  • dimension describes which dimension should be used for grouping records;
  • metric describes which metric should be used for ordering results. In our case this metric is actually an aggregation specified in the same query;
  • threshold simply limits the number of output records;
  • aggregations specifies how the data should be aggregated. In our example we ask Druid to sum the values of the count field and expose the result under the name edits.
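To make granularity more tangible: the query above uses "all", which collapses the whole requested interval into a single result bucket. A hypothetical variant with hour granularity (not shipped with the quickstart, but valid against the same dataset) would instead return a separate top-25 list for every hour of September 12th:

{
  "queryType" : "topN",
  "dataSource" : "wikiticker",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "hour",
  "dimension" : "page",
  "metric" : "edits",
  "threshold" : 25,
  "aggregations" : [
    { "type" : "longSum", "name" : "edits", "fieldName" : "count" }
  ]
}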

This is what happens when we execute this query:

  1. The broker node will receive the topN request.
  2. The broker node will issue multiple requests to historical nodes, each asking for their top K results, where K equals 1000 if threshold is less than 1000, and equals threshold otherwise. This approach sacrifices precision for efficiency: with a threshold of 1000, only roughly the first 900 results are guaranteed to be correctly ordered (see the sketch after this list).
  3. The broker node will wait for the responses from all historical nodes, combine them into the final result and return it to the user.
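If you need correct ordering deeper into the list, Druid lets you raise this per-segment limit through the minTopNThreshold query context parameter, trading extra work on the historical nodes for accuracy. A sketch (the value 2000 is an arbitrary illustration, not a recommendation):

{
  "queryType" : "topN",
  "dataSource" : "wikiticker",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "edits",
  "threshold" : 25,
  "aggregations" : [
    { "type" : "longSum", "name" : "edits", "fieldName" : "count" }
  ],
  "context" : { "minTopNThreshold" : 2000 }
}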

To send the example topN request to our dataset, execute the following command from the Druid directory:

$ curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wikiticker-top-pages.json http://localhost:8082/druid/v2/?pretty
[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : [ {
    "edits" : 33,
    "page" : "Wikipedia:Vandalismusmeldung"
  }, {
    "edits" : 28,
    "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  }, {
    "edits" : 27,
    "page" : "Jeremy Corbyn"
  }, {
    "edits" : 21,
    "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  }, {
    "edits" : 20,
    "page" : "Flavia Pennetta"
  }, {
    ... 20 more items ...
  } ]
} ]

Timeseries query

Unlike topN queries, timeseries queries do not allow grouping by dimension, yet they usually include aggregations. For example, it might be interesting to see how many edits Wikipedia processes each hour (across all pages). The query to fetch this information looks like this:

{
  "queryType" : "timeseries",
  "dataSource" : "wikiticker",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "hour",
  "threshold" : 25,
  "aggregations" : [
    {
      "type" : "longSum",
      "name" : "edits",
      "fieldName" : "count"
    }
  ]
}

The fields in this query function the same as in the topN query, with a couple of differences:

  • there are no dimension, metric or threshold fields;
  • no grouping of results by dimension is performed;
  • results are ordered by timestamp instead of by metric value (see the sketch after this list).
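The default timestamp ordering is ascending. Timeseries queries also accept an optional descending flag to flip the order newest-first; a sketch, assuming a Druid version that supports it:

{
  "queryType" : "timeseries",
  "dataSource" : "wikiticker",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "hour",
  "descending" : true,
  "aggregations" : [
    { "type" : "longSum", "name" : "edits", "fieldName" : "count" }
  ]
}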

Save this query in a new file, "quickstart/wikiticker-timeseries.json", and query the database as follows:

$ curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wikiticker-timeseries.json http://localhost:8082/druid/v2/\?pretty
[ {
  "timestamp" : "2015-09-12T00:00:00.000Z",
  "result" : {
    "edits" : 268
  }
}, {
  "timestamp" : "2015-09-12T01:00:00.000Z",
  "result" : {
    "edits" : 1144
  }
}, {
  "timestamp" : "2015-09-12T02:00:00.000Z",
  "result" : {
    "edits" : 1102
  }
}, {
  ...
} ]

Time Boundary Query

The time boundary query is a very simple, yet often useful, type of Druid query. In the basic case, when the query has no parameters except for the required queryType and dataSource, it gives you the timestamps of the earliest and latest datapoints. This query also supports a bound parameter with a possible value of minTime or maxTime. If bound is set, only the time boundary from the specified side of the dataset will be returned. Let's try it out:

$ curl -L -H'Content-Type: application/json' -XPOST --data '{"queryType": "timeBoundary","dataSource":"wikiticker"}' http://localhost:8082/druid/v2/\?pretty
[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : {
    "maxTime" : "2015-09-12T23:59:59.200Z",
    "minTime" : "2015-09-12T00:46:58.771Z"
  }
} ]
$ curl -L -H'Content-Type: application/json' -XPOST --data '{"queryType": "timeBoundary","dataSource":"wikiticker","bound":"minTime"}' http://localhost:8082/druid/v2/\?pretty
[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : {
    "minTime" : "2015-09-12T00:46:58.771Z"
  }
} ]

As you can see, the sample data in our database contains events starting from 00:46 AM on September 12th, 2015 until 11:59 PM on the same day (almost a full 24 hours of data).

Summary

In this part of the tutorial we learned how to query a Druid database with topN, timeseries and time boundary queries. We will cover a number of topics in upcoming tutorials:

  • configuring real-time ingestion of Avro-encoded events from Kafka into our Druid cluster using the Tranquility utility
  • advanced query techniques
  • visualising data in Druid with Swiv (formerly Pivot) and Superset


