Commit fad32ab

Carlin Eng authored
CLI blog post draft (#93)
1 parent f4a8abd commit fad32ab

3 files changed

+113
-0
lines changed

src/blog/malloy-cli/ga4_schema.png

134 KB

src/blog/malloy-cli/index.malloynb

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
>>>markdown
2+
# Announcing the Malloy Command Line Interface
3+
4+
Today we’re excited to announce the launch of the Malloy Command Line Interface (CLI). One of the primary jobs of SQL is transforming datasets. The most basic way to do this is by issuing SQL queries on a command line. The Malloy CLI serves this function, but offers simplicity and reusability in calculations that SQL lacks. With Malloy, metric calculations can be saved as part of a data model and reused in queries that calculate roll-ups at varying levels of granularity or slice across different dimensions. This reduces duplicate code, making data pipelines much easier to read, understand, and maintain.
5+
6+
Let’s take a look at a simple example. The [Google Analytics 4 (GA4) schema](https://support.google.com/analytics/answer/7029846) is notoriously difficult to query. The schema contains 23 columns, 11 of which are “record” types with nested data. Three of those record columns contain further nested types. Some of these nested types contain columns named “key” and “value”, so the schema can’t be inferred without actually querying the data:
7+
8+
<img src="ga4_schema.png" class="small-img">
9+
10+
This data needs to be transformed before it is usable by normal people. The Malloy model we define below does all this toilsome work: for example, converting the raw `event_timestamp` values from microseconds to an actual timestamp data type, or wrangling session IDs out of the unhelpfully named `event_params` column:
11+
12+
```malloy
13+
source: events is table('bigquery:bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*') {
14+
15+
rename:
16+
event_timestamp_raw is event_timestamp
17+
18+
dimension:
19+
event_timestamp is timestamp_micros!timestamp(event_timestamp_raw)
20+
new_user is pick 1 when event_name ? 'first_visit' | 'first_open' else 0
21+
event_value is (event_params.value.int_value ?? event_params.value.float_value ?? event_params.value.double_value)
22+
23+
measure:
24+
is_new_user is max(new_user)
25+
user_count is count(distinct user_pseudo_id)
26+
session_count is count(distinct event_params.value.int_value) { where: event_params.key = 'ga_session_id' }
27+
# percent
28+
conversion_rate is count(distinct user_pseudo_id) { where: event_name = 'purchase' } / count(distinct user_pseudo_id)
29+
30+
query: nested_index is {
31+
index: event_params.*, user_properties.*, items.*
32+
}
33+
}
34+
```
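To make the row-level transformations in the model concrete, here is a rough sketch of the same logic in plain Python (not Malloy). The sample `event` row and the helper names `param_int_value` and `new_user` are hypothetical, invented for illustration; only the column names and the key/value shape of `event_params` come from the GA4 schema described above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_micros(micros):
    """Convert microseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


def param_int_value(event_params, key):
    """Return the int_value stored under `key` in a list of key/value records."""
    for param in event_params:
        if param["key"] == key:
            return param["value"].get("int_value")
    return None  # key not present on this event


def new_user(event_name):
    """Mirror of the model's new_user dimension: 1 for first-touch events, else 0."""
    return 1 if event_name in ("first_visit", "first_open") else 0


# Hypothetical event row, shaped like a GA4 export record.
event = {
    "event_timestamp": 1_600_000_000_000_000,
    "event_name": "first_visit",
    "event_params": [
        {"key": "ga_session_id", "value": {"int_value": 1234567890}},
        {"key": "page_title", "value": {"string_value": "Home"}},
    ],
}

print(timestamp_micros(event["event_timestamp"]))  # 2020-09-13 12:26:40+00:00
print(param_int_value(event["event_params"], "ga_session_id"))  # 1234567890
print(new_user(event["event_name"]))  # 1
```

The point of the sketch is only that the meaning of a `value` record depends on its `key`, which is why the nested columns have to be unpacked before the data is usable.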
35+
>>>markdown
36+
Isn’t the same thing achievable with SQL views? We could certainly write the same transformation logic to unpack `event_params` with SQL, but things get more interesting when we start aggregating the data. Suppose we want to compute the conversion rate as the count of users who made a purchase divided by all users. In the model above, the logic for this metric is encapsulated into a named measure called `conversion_rate`. The business wants to see conversion rates sliced by different dimensions: date, month, and platform. In SQL, to generate each of these slices, the aggregation logic for `conversion_rate` would need to be duplicated for each view:
37+
38+
```sql
39+
CREATE OR REPLACE VIEW conversion_by_day_overall AS
40+
SELECT
41+
date(event_timestamp) as event_date
42+
, count(distinct case when event_name = 'purchase' then user_pseudo_id end) / count(distinct user_pseudo_id) as conversion_rate
43+
FROM events
44+
GROUP BY 1;
45+
46+
CREATE OR REPLACE VIEW conversion_by_day_and_platform AS
47+
SELECT
48+
date(event_timestamp) as event_date
49+
, platform
50+
, count(distinct case when event_name = 'purchase' then user_pseudo_id end) / count(distinct user_pseudo_id) as conversion_rate
51+
FROM events
52+
GROUP BY 1,2;
53+
54+
CREATE OR REPLACE VIEW conversion_by_month_overall AS
55+
SELECT
56+
date_trunc(date(event_timestamp), month) as event_month
57+
, count(distinct case when event_name = 'purchase' then user_pseudo_id end) / count(distinct user_pseudo_id) as conversion_rate
58+
FROM events
59+
GROUP BY 1;
60+
61+
CREATE OR REPLACE VIEW conversion_by_month_and_platform AS
62+
SELECT
63+
date_trunc(date(event_timestamp), month) as event_month
64+
, platform
65+
, count(distinct case when event_name = 'purchase' then user_pseudo_id end) / count(distinct user_pseudo_id) as conversion_rate
66+
FROM events
67+
GROUP BY 1,2;
68+
```
69+
70+
With Malloy, we can simply reference the named measure with each query. There’s a single place to define the measure, and that measure can be used to calculate roll-ups across any arbitrary set of dimensions. We can embed these queries inside of SQL DDL statements in a script:
71+
72+
```malloy
73+
-- create_views.malloysql
74+
create or replace view ga4.conversion_by_day_overall as %{
75+
events -> {
76+
group_by: event_timestamp.day
77+
aggregate: conversion_rate
78+
}
79+
}%
80+
;
81+
82+
create or replace view ga4.conversion_by_day_platform as %{
83+
events -> {
84+
group_by: event_timestamp.day, platform
85+
aggregate: conversion_rate
86+
}
87+
}%
88+
;
89+
90+
create or replace view ga4.conversion_by_month_overall as %{
91+
events -> {
92+
group_by: event_timestamp.month
93+
aggregate: conversion_rate
94+
}
95+
}%
96+
;
97+
98+
create or replace view ga4.conversion_by_month_platform as %{
99+
events -> {
100+
group_by: event_timestamp.month, platform
101+
aggregate: conversion_rate
102+
}
103+
}%
104+
;
105+
```
106+
107+
Now we use the Malloy CLI to execute this script and create the views inside of our database:
108+
109+
<img src="malloy_cli_run.png">
110+
111+
The Malloy code is much more concise, readable, and maintainable. If the definition of `conversion_rate` changes, it only needs to be updated in a single place: the events model. Contrast this with the SQL equivalent: the code is verbose, and any change to the underlying business logic requires an update to every single `CREATE VIEW` statement.
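The define-once, reuse-everywhere idea can also be sketched outside of Malloy. The Python below is only an illustration of the principle, not how Malloy works internally: the `EVENTS` rows are made-up sample data, and `rollup` is a hypothetical helper that applies a single `conversion_rate` definition to whatever set of dimensions it is given.

```python
from collections import defaultdict

# Made-up, already-flattened events; field names mirror the GA4 columns.
EVENTS = [
    {"day": "2021-01-01", "platform": "WEB", "user_pseudo_id": "u1", "event_name": "page_view"},
    {"day": "2021-01-01", "platform": "WEB", "user_pseudo_id": "u1", "event_name": "purchase"},
    {"day": "2021-01-01", "platform": "WEB", "user_pseudo_id": "u2", "event_name": "page_view"},
    {"day": "2021-01-02", "platform": "IOS", "user_pseudo_id": "u3", "event_name": "page_view"},
    {"day": "2021-01-02", "platform": "WEB", "user_pseudo_id": "u1", "event_name": "purchase"},
]


def conversion_rate(rows):
    """The single definition: distinct purchasers divided by distinct users."""
    users = {r["user_pseudo_id"] for r in rows}
    buyers = {r["user_pseudo_id"] for r in rows if r["event_name"] == "purchase"}
    return len(buyers) / len(users) if users else 0.0


def rollup(rows, dimensions, measure):
    """Group rows by the given dimensions and apply the measure to each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[d] for d in dimensions)].append(r)
    return {key: measure(group) for key, group in groups.items()}


# The same measure, sliced two ways, with no duplicated aggregation logic.
by_day = rollup(EVENTS, ["day"], conversion_rate)
by_day_platform = rollup(EVENTS, ["day", "platform"], conversion_rate)

print(by_day)
print(by_day_platform)
```

Changing how conversion is defined means editing `conversion_rate` once; every roll-up picks up the change, which is the same property the named measure gives the Malloy views.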
112+
113+
The Malloy CLI is relatively simple in its functionality today, but it still unlocks the power of Malloy for many users. We’ll be looking to build more advanced functionality into this tool to make it even more useful. Getting started is easy: head over to [GitHub Releases](https://github.com/malloydata/malloy-cli/releases) to grab the latest binary for your platform, and check out our [documentation](https://malloydata.github.io/documentation/malloy_cli/index) for detailed usage information. If you have any feedback or feature requests, don’t hesitate to join our community [Slack channel](https://join.slack.com/t/malloy-community/shared_invite/zt-1kgfwgi5g-CrsdaRqs81QY67QW0~t_uw) and drop us a note.
98.4 KB
