As GPT-3 and its counterparts have come into the mainstream, I've begun seeing products and conversations about the interaction between Natural Language Processing (NLP) and data analytics appearing more and more frequently. I'm somewhere between excited and skeptical.

Inputs

What does NLP entail in the context of analytics? For the sake of this post, I'll say that it's the act of taking in words and producing some outcome. Input is pretty non-controversial among data analytics use cases. It's almost always some notion of "Type in the thing that you want to know." Some examples are:

  • What does retention look like for our top 10 customers?
  • What percentage of new users skip this onboarding step?
  • What is the average number of active users per team last month?

Outputs

The most common output that I see products and thought-leaders gravitating towards is getting answers. This is undoubtedly the goal that we're going for. However, the more I think about this space, the more I think that we won't achieve this anytime soon. Data questions are nuanced, complex, and often require institutional knowledge.

With that being said, I'm still excited about NLP applications in data analytics. Success isn't binary here. There's a spectrum of help that "text input → output" can provide. From most automated to least:

  • Questions → Answers: Fully automated and self-serve for users. I can see this working for templatized questions and data, but as I said before, nuance and complexity are inherent in data questions. I'm skeptical we'll get to the point where technology "just answers" our questions.
  • Questions → SQL: This hits a little closer to home since it keeps a human in the loop to validate that the code actually answers the question they asked. It's less self-serve, though, since non-technical users are hung out to dry a bit. Institutional knowledge is still a barrier to this being the answer, but something like this could definitely be a helpful assistant that makes analysts more productive. Example.
  • Questions → Datasets: You type in your question and get back a list of proposed datasets and columns. If you've been watching the plethora of data catalog startups popping up, then this might be of interest to you. It's still not truly self-serve, and similar to the SQL approach above, it would be more of an assistant to the person performing the analysis. Which is still helpful.
  • Questions → Previous answers: You type in your question and with semantic search, the tool accurately suggests a menu of previous analyses that are likely to hit on what you want. I love this vision, and it's what I'm betting on of these options, but there are also lots of dependencies to it working. Namely, you need to first maintain a collection of analyses and have a process to verify which are still correct. It's not a silver bullet, but with those dependencies in place, it gets us closer to self-serve.
  • Questions → Analyst → Answers: Just for kicks, here's the baseline at most companies. This is the opposite of self-serve and how most data teams function today. If you have an even slightly nuanced question, you send a Slack message to Analyst Alex and ask if they can help you out. Not ideal and prone to bottlenecks.

Looking forward

If I were to place my bet on one of these, it would be Questions → Previous answers. While this may seem like an incremental improvement, I think the hurdle of going from a question to "what you actually meant" is one of the larger ones in self-serve analytics.
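To make the Questions → Previous answers idea a bit more concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: the `Analysis` record, the `verified` flag, and the `suggest` function are hypothetical names, and the word-overlap cosine similarity is a crude stand-in for real semantic search (which would more likely compare embeddings). The part I care about is the shape: only analyses that someone has verified are eligible, and the user gets a short ranked menu back rather than a single "answer."

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Analysis:
    # Hypothetical record for one previously completed analysis.
    title: str
    summary: str
    verified: bool  # assumption: someone has confirmed it is still correct


def tokenize(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two word-count vectors (Counters).
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(question, analyses, top_k=3):
    # Rank verified analyses by similarity to the question.
    # Word overlap stands in for semantic search in this sketch.
    q = Counter(tokenize(question))
    scored = []
    for a in analyses:
        if not a.verified:
            continue  # stale or unchecked analyses never reach the menu
        score = cosine(q, Counter(tokenize(a.title + " " + a.summary)))
        if score > 0:
            scored.append((score, a))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for _, a in scored[:top_k]]


# Made-up library of past analyses, purely for illustration.
library = [
    Analysis("Retention for top customers",
             "Monthly retention for the top 10 customers by revenue", True),
    Analysis("Onboarding funnel",
             "Share of new users who skip each onboarding step", True),
    Analysis("Active users per team",
             "Average active users per team, old definition", False),
]

for match in suggest("What percentage of new users skip this onboarding step?", library):
    print(match.title)
```

Note that the third analysis never shows up, no matter how well it matches, because nobody has verified it. That's the dependency doing its job: the search is only as trustworthy as the collection behind it.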

This is all aspirational, but hey, answering analytics questions is still a really hard thing, despite new tools popping up in the modern data stack each day. I don't think NLP is a silver bullet, but I do have a feeling that it gets us closer to the data experience that we crave.

