TL;DR
The semantic layer is fragmented between tools, suffers from vendor lock-in and requires duplicate encoding of business logic that is costly to maintain and causes frustration for Analytics Engineers and metric consumers alike.
Recent products launched by dbt Labs and Looker aim to create a true universal semantic layer but fail due to a lack of collaboration and integration, making their adoption economically unviable for many firms.
Solving these compatibility issues will ultimately benefit all players, as it will lower the barriers for adoption and make the universal semantic layer affordable beyond a privileged class of well-funded world class data teams.
The current state of the semantic layer
Back in 2013, the word “reconciliation” used to send shivers down my spine: it meant that two MDs had ended up in a heated argument about whether an extremely important metric had gone up or down, each armed with their own PowerPoint slides. Being the junior analyst on call, I was subsequently tasked with “reconciling” them: untangling opaque data lineage, CSV dumps scattered across spreadsheets created by ex-employees, and slow on-premise data warehouses. More often than not, the verdict was that both metric definitions were hopelessly flawed.
Fast forward twelve years and we’ve come a long way: dbt brought us transparent data lineage and transformation, columnar cloud DBMSs process billions of rows in a flash, and Looker brought the semantic layer to BI. Despite the rise of the Modern Data Stack, I would argue that most firms’ semantic layer is a mess today. Let’s dig into it, using four exemplary companies¹:
Earlybird Inc. jumped on the semantic layer bandwagon early and invested heavily in Looker. Its repo is full of hastily written and hastily reviewed LookML code. Users get lost in a myriad of irrelevant dashboards and PDTs that send data warehouse costs through the roof. As its range of data products expands to ML apps, reverse ETL and collaborative notebooks, it realises that a semantic layer living in a BI tool is tricky to leverage².
Transform Inc. leverages the transformation layer (usually dbt) and the data warehouse as its semantic layer and opted for Entity-Centric modelling. As a consequence, avoiding circular references in its dbt project has become increasingly difficult, and its DAG is starting to look like the world’s largest (and most expensive) bowl of spaghetti. The Analytics Engineering team is drowning in tickets, as “self-serve” data users are not able to pull the exact flavour and grain of metric they need.
Process Inc. is aware of the need for a universal semantic layer, but management has decided that there is no money for the required tools and the Analytics Engineers to maintain them. Instead, they instruct their teams to follow long checklists and processes to ensure metrics remain consistent across tools and business domains. The resulting overhead is consuming the analytics team’s productive time, and business stakeholders complain that nothing gets done.
Ignorance Inc. has not moved on much since 2013: they have a myriad of misaligned metric definitions, buried in 200+ Airflow DAGs, which their MDs keep arguing about. Their Analytics teams work night shifts compiling “reconciliations” and visualising them as waterfall charts. HR wonders why the average tenure in the Analytics team is six months.
The false promise of salvation
Enter 2023, and with it many promises to resolve all that mess:
dbt Labs announces the acquisition of Transform and, soon thereafter, ambitious plans to roll out a “universal semantic layer” based on MetricFlow, Transform’s core software package. This semantic layer can then be accessed from your cloud data warehouse (Snowflake, BigQuery, Databricks, Redshift), BI tool (Mode, Hex, Google Sheets), or anywhere else via a generic JDBC driver.
Looker announces Looker Modeler, which is supposed to let you use your existing LookML metric definitions as a single source of truth across pretty much all the most popular BI tools: Microsoft Power BI, Tableau, Looker Studio (formerly Google Data Studio). Similar to MetricFlow, the idea is to make this semantic layer available to any SQL-speaking tool via a JDBC driver.
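To make the “define once, query anywhere” idea concrete, here is roughly what a single metric looks like in MetricFlow-style YAML (a from-memory sketch of the dbt Semantic Layer spec; exact keys vary between versions, and the model and column names are hypothetical):

```yaml
semantic_models:
  - name: orders
    model: ref('orders')        # hypothetical dbt model
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: revenue
        agg: sum
        expr: sale_price        # hypothetical column

metrics:
  - name: total_revenue
    type: simple
    type_params:
      measure: revenue
```

Any connected tool can then request total_revenue at whatever grain it needs, without restating the aggregation logic in LookML, SQL, or a notebook.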
The problem? Let’s revisit our four exemplary companies:
Earlybird Inc. hesitates to use dbt MetricFlow since there is no integration that translates all its LookML to MetricFlow or vice versa, which means (1) a massive amount of manual migration work to redefine all LookML in MetricFlow and (2) constant overhead to keep the LookML and MetricFlow definitions in sync (a Process Inc. in the making). It also sees limited value in Looker Modeler since it does not intend to change BI tools and cannot use the semantic layer in the tools that matter (Hex, Deepnote), as they still lack native integrations. It also doesn’t like the idea of having to go through the BI layer to expose the metrics in the data warehouse.
Transform Inc. is very excited about the roll-out of dbt MetricFlow, as it would finally allow it to give data users more flexibility to self-serve. However, Transform Inc. has chosen to run dbt Core on its own infrastructure for good reasons and does not use dbt Cloud. Unfortunately, that means it doesn’t get to use MetricFlow and the Semantic Layer API either. It also wonders whether the whole idea of avoiding vendor lock-in with dbt Core holds true, now that the key features are reserved for paying customers.
Process Inc.’s data team, tired of manually duplicating metric definitions between repos, is convinced of the benefits of paying for dbt Cloud to roll out dbt MetricFlow. However, its conservative management is not willing to pay for dbt Cloud without case studies from peers proving clear economic benefits of a universal semantic layer.
Ignorance Inc. is too busy reconciling metrics and visualising the discrepancies in PowerPoint slides to remain informed about breaking news in the data world. Nothing happens.
The Semantic Nirvana on the horizon
As young me was poring over those spreadsheets back in the day, I had no idea what a semantic layer was, but I sure knew this was no great way to do analytics. Today, I am optimistic that we have come very close to a state of analytics where the metrics mess is finally resolved. For the sake of a catchy name, let’s call it the “Semantic Nirvana”.
In the Semantic Nirvana:
There is a truly open-source technology that allows Analytics Engineers to build a universal semantic layer that can be migrated freely from one cloud platform to the next, without vendor lock-in.
Analytics Engineers only have to encode business logic once, because there is an easy way to translate from and to the universal semantic layer.
Analytics Engineers can build the semantic layer without having to learn any new language, as long as they are familiar with the stuff they should master anyway (SQL, Python, template engines).
There is seamless integration of all downstream tools with the universal semantic layer, without the need for obscure third parties.
The universal semantic layer is affordable, allowing small companies with great ambitions to streamline their metrics and reduce overhead.
Having a universal semantic layer is considered standard best practice for all data-heavy companies, and executives are taught about it in their $120,000 MBAs.
Hence my appeal to the cloud giants and Modern Data Stack vendors: tackle the last remaining hurdles to get us to the Semantic Nirvana, so I don’t have to write about the semantic layer still not working ten years from now.
¹ Any similarities with real companies are purely accidental.
² There is an excellent podcast on Spotify by Sami Rahman and Harry Gollop which, among many other interesting things, touches on why leveraging the BI layer as a semantic layer is a bad idea.