I wanted to write about using only SQL for analytical queries (possibly on big data), because when I talk about SQL, people usually say that it’s not a good fit for analytics and that they have to write complex code in Java or use a NoSQL solution with a custom query language, as if in the big-data world all the paradigms should change and become complex.
I actually like Mixpanel because of its simplicity, but since they argue that SQL is old and we need JS or JSON as a query language, I will try to answer their arguments about SQL; when I look at their landing page, I have a lot to talk about.
Here is a comparison of JQL and SQL from their landing page:
1. SQL is meant for rigid schemas for traditional relational databases
This is a common misconception. SQL is not an RDBMS; it’s a query language mostly used by RDBMSs, but recently even NoSQL solutions such as Cassandra have switched to a SQL-like query language. You can have MAP and ARRAY types in SQL, define the schema at query time, or even introduce a VARIANT type for unstructured data, similar to what Snowflake did. The author of the post also mentions another problem, and I will talk about why we (almost) always need a schema.
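As an illustrative sketch of what nested types look like in SQL (Presto-style syntax; the `events` table and its columns are hypothetical):

```sql
-- Hypothetical events table where "properties" is a MAP(VARCHAR, VARCHAR)
-- and "viewed_pages" is an ARRAY(VARCHAR)
SELECT
  properties['utm_source'] AS source,      -- direct map lookup
  cardinality(viewed_pages) AS page_count  -- array length
FROM events
WHERE element_at(properties, 'platform') = 'ios'
```

No rigid flat schema is required here; the nested structure is addressed directly in the query.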
2. Difficult to manipulate and transform the data
3. Complex queries are unwieldy to read & compose
Ah, let me demonstrate with the example query from Mixpanel’s JQL landing page:
One can write the same query in SQL as follows:
SELECT page_type, count(*)
FROM "Song played"
WHERE time BETWEEN timestamp '2015-07-01' AND timestamp '2015-07-31'
GROUP BY page_type
I don’t know about your preference, but you can probably guess my opinion. However, I’m aware that behavioral analytics queries such as funnels and retention are not easy to write in SQL, since every single action may cause a state change; people usually chain JOINs in order to perform funnel queries, but those are expensive. We have workarounds for this problem: just use UDFs and pass the raw data to them. We use this approach in Rakam and it works just fine. To simplify things further, we do have a funnel endpoint, and even there developers can see the underlying SQL queries that are executed for the funnel analysis. If they need to perform even more complex queries, they can just copy the SQL and alter it as they wish. The ClickHouse data warehouse takes a similar approach in Yandex Metrica.
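For reference, here is a minimal sketch of the chained-JOIN approach mentioned above, for a hypothetical two-step funnel (the `events` table, its columns, and the event names are illustrative, not from Mixpanel or Rakam):

```sql
-- Users who did "Sign up" and later did "Song played":
-- a self-join on the events table, matching on user and ordering by time
SELECT count(DISTINCT s.user_id) AS converted_users
FROM events s
JOIN events p
  ON p.user_id = s.user_id
 AND p.time > s.time
WHERE s.event = 'Sign up'
  AND p.event = 'Song played'
```

Each additional funnel step adds another self-join, which is why this pattern gets expensive and why pushing the raw event stream through a UDF can be cheaper.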
4. Limited flexibility due to query functions available in SQL
For the last two years, the big-data ecosystem has been working hard to adopt SQL as its query language. Spark developers have been improving their SQL engine for the last 1.5 years (see the commit history), Cassandra replaced its query language with a SQL-ish one, new distributed databases such as CockroachDB and InfluxDB use SQL, and new-generation distributed SQL query engines such as Impala, Presto and Drill let us run SQL queries on our data sets in a distributed environment without hassle.
Some database engines such as Elasticsearch prefer JSON as their query language, and considering their use case, I find that reasonable. However, just look at this repo that implements third-party SQL support for Elasticsearch: people tend to use what they’re familiar with.
I believe that SQL is enough for both batch and stream processing, and on top of it you can easily build your custom analytics services; that’s what we do at Rakam. We use Presto for ad-hoc analysis and implemented incremental materialized views and continuous query support on top of Presto’s query execution engine, so just try it out.