
Zero to Cohort Analysis in 60 Minutes

Cohort analysis is one of the best ways to understand your company’s growth — it often reveals things that are important but not obvious from top-line measures like MRR, MAUs, etc. Historically, it’s been kind of a slog to get cohort analysis up and running, usually because you either don’t have the data you want (e.g., a complete transaction ledger) or you can’t easily put it in the right format (e.g., it’s stuck in your point-of-sale system).

But today, it’s fantastically easier — especially if you process user transactions using a modern payments platform with a good API (like Shopify, Square, or Stripe). In fact, it’s totally feasible to go from nothing to automatically-updated cohort analysis in about 60 minutes. The rest of this post shows how. To keep things concrete, we’ll focus on the specific example of pulling revenue LTV curves for a fictional Shopify company, but the ideas apply more broadly.


Ingesting Your Data

The first thing you need is data. For this post, we’re looking at revenue LTVs, so we want our transaction stream from Shopify. Rather than reinventing the wheel for this ingestion, we like to use data loaders like Stitch and Fivetran. For this demo, we’ll use Stitch — it’ll ingest our Shopify data for us, and send it to our data warehouse every 30 minutes, automatically. (If you’d like to use something self-hosted / open source, Stitch supports Singer, a set of composable CLI packages for pulling data from one place and pushing it to another.)

The whole thing takes a few minutes. We’ve done it using BigQuery below, but Stitch and Fivetran support Redshift and Snowflake too. (If you don’t have a data warehouse set up yet, we recommend trying out BigQuery for this demo — it’s pretty fast to sign up for GCP and set up BigQuery.) Again, there’s nothing special about Shopify here — we could just as easily be pulling data from Stripe.


That’s pretty much all we need from a data ingestion point of view(!). Once Stitch syncs, our Shopify data will be sitting in BigQuery, ready to use.
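If you want to confirm that the sync worked, a few lines of Python against the BigQuery client will do it. A minimal sanity-check sketch, assuming the google-cloud-bigquery library and a hypothetical dataset name (fake_shopify) chosen during Stitch setup:

# Sanity check: list the tables Stitch has loaded into BigQuery.
# Assumes google-cloud-bigquery is installed; the dataset name
# ("fake_shopify") is whatever you picked in the Stitch UI.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project/credentials
for table in client.list_tables("fake_shopify"):
    print(table.table_id)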

Be aware that some of this data is sensitive. For example, the customers table is populated from the /admin/customers endpoint of the Shopify API, which returns customer emails, names, and addresses. You may want to hide the table behind an “authorized view” or put the sensitive fields in an intermediate “staging” database/warehouse.
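One lightweight way to do that, sketched below, is to expose a view that simply drops the PII columns and share only the view (e.g., as a BigQuery authorized view in a separate dataset). The dataset and column names here are illustrative, not prescriptive:

# Create a PII-free view over the synced customers table, and share
# that instead of the raw table. Dataset and column names are
# illustrative; adjust them to your own schema.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    create or replace view analytics.customers_no_pii as
    select id, created_at, orders_count, total_spent
    from fake_shopify.customers
    """
).result()  # block until the DDL has finished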

Cohortized Revenue LTV

Now we can start analyzing our data. As a simple proof of concept, we’ll draw the cohortized revenue LTV curves — we think this is an important early step in understanding your business’s growth.

You can do this in SQL, but it’s a lot easier to work in Python if you can. So for now we’ll simply use the BigQuery frontend to roll up our data into months, and then download a big CSV.

select
  customer_id,
  date_trunc(date_of_txn, month) as txn_month,
  sum(txn_amount) as txn_total,
  count(distinct txn_uuid) as txn_count
from fake_shopify.orders
group by txn_month, customer_id
order by txn_month, customer_id
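If you’d rather not click through the console, the same query can be run from Python and written to a local CSV. A minimal sketch, assuming the google-cloud-bigquery client plus pandas, with an illustrative file name:

# Run the monthly roll-up query and save the result as a local CSV.
# Assumes google-cloud-bigquery and pandas are installed; the dataset
# and file names are illustrative.
from google.cloud import bigquery

query = """
select
  customer_id,
  date_trunc(date_of_txn, month) as txn_month,
  sum(txn_amount) as txn_total,
  count(distinct txn_uuid) as txn_count
from fake_shopify.orders
group by txn_month, customer_id
order by txn_month, customer_id
"""

client = bigquery.Client()
df = client.query(query).to_dataframe()  # may also require the db-dtypes package
df.to_csv("fake_shopify_monthly.csv", index=False)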

To keep things interesting, we’ve generated some fake data for this demo — you can get the CSV here (csv link), and see the code that generated it here. Of course, if you have your own Shopify/Square/Stripe account with real data in it, there’s no need to do this.

Now that we have a local CSV, we can run ≈40 lines of Python to organize our data and plot the LTV curves:
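Here’s a minimal sketch of what such a script can look like: assign each customer to the month of their first purchase, accumulate each cohort’s revenue over its life, divide by cohort size, and plot one curve per cohort. Column names match the query above; the CSV file name is illustrative.

# Plot cohortized revenue LTV curves from the monthly roll-up CSV.
# This is a sketch, not the original script: column names match the
# query above, and the file name is illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fake_shopify_monthly.csv", parse_dates=["txn_month"])

# Each customer's cohort is the month of their first transaction.
df["cohort_month"] = df.groupby("customer_id")["txn_month"].transform("min")

# Whole months elapsed since the cohort's first month (0, 1, 2, ...).
df["months_since_first"] = (
    (df["txn_month"].dt.year - df["cohort_month"].dt.year) * 12
    + (df["txn_month"].dt.month - df["cohort_month"].dt.month)
)

# Cohort size: distinct customers whose first purchase fell in that month.
cohort_sizes = df.groupby("cohort_month")["customer_id"].nunique()

# Revenue per cohort per month-of-life, cumulative over the cohort's life.
cumulative_revenue = (
    df.groupby(["cohort_month", "months_since_first"])["txn_total"]
    .sum()
    .unstack(fill_value=0)
    .cumsum(axis=1)
)

# Average cumulative revenue per customer: the cohort's LTV curve.
ltv = cumulative_revenue.div(cohort_sizes, axis=0)

# Don't extend a cohort's curve past its observed age.
max_age = df.groupby("cohort_month")["months_since_first"].max()

fig, ax = plt.subplots(figsize=(10, 6))
for cohort_month, curve in ltv.iterrows():
    curve = curve.loc[:max_age[cohort_month]]
    ax.plot(curve.index, curve.values, label=cohort_month.strftime("%Y-%m"))

ax.set_xlabel("Months since first purchase")
ax.set_ylabel("Cumulative revenue per customer")
ax.set_title("Revenue LTV by monthly cohort")
ax.legend(title="Cohort", fontsize="small", ncol=2)
plt.tight_layout()
plt.show()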


[Figure: revenue LTV curves by monthly cohort for the fake data]

For this fake company, cohort-average LTV is getting worse over time — a common pattern for consumer businesses that are starting to saturate their acquisition channels. Of course, there’s more analysis to do here to better understand this company’s growth — this is by no means a complete view.

We think that tools like Stitch and BigQuery are exciting because of how quickly and easily they let you get out of infra and into analysis, as shown above. If you’re interested in diving deeper into the analysis itself, check out this great series from Jonathan Hsu.