Prefect

Intro

I was the lead designer for Prefect Cloud, a product that helps data engineers and data platform operators manage their orchestrated workflows. Workflows are collections of Python functions that run on cloud infrastructure, typically in an automated way, whether scheduled or triggered by events (e.g. webhooks).

Team and collaboration

I reported to the VP of Product, and collaborated most closely with our CEO, lead front-end developer, and technical product manager.

Team map

This is how we collaborated on projects:

Projects generally took 1-2 months to complete, depending on resource availability.

Runs page redesign problem

One of the problems I tackled was a redesign of the main Runs page, which is used to monitor the behavior of all workflows over time. We wanted to provide a more powerful timeline view of runs (inspired by Pydantic Logfire) and to add new top-level metrics such as SLA Violations.

However, we had gotten all the way to a soft launch of the new experience when we started getting feedback from customers that made us aware of a significant problem with the bar chart on the Runs tab. The bar chart in question shows the count of runs for each day in the selected time span (one week by default).

Runs page

People said they missed the original data visualization on this page, which we had dubbed "The Ball Pit". My job was to figure out why they missed it and what to do about it.

The Ball Pit

I've overlaid the Ball Pit on top of the bar chart to give you an idea of what it looked like.

The Ball Pit is a scatterplot, where each dot represents an individual run. The color of the dot represents the status of the run (for example, pending, running, completed, failed). The y-axis represents the duration of the run. The x-axis represents the time when the run started.

Runs page with Ball Pit overlay
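
To make the encoding concrete, here's a minimal sketch of how each run maps to a point in the scatterplot. The field names are hypothetical, not Prefect's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical run record; field names are illustrative, not Prefect's schema.
@dataclass
class Run:
    start_time: datetime     # x-axis: when the run started
    duration_seconds: float  # y-axis: how long the run took
    state: str               # color: e.g. "PENDING", "RUNNING", "COMPLETED", "FAILED"

STATE_COLORS = {
    "PENDING": "yellow",
    "RUNNING": "blue",
    "COMPLETED": "green",
    "FAILED": "red",
}

def to_point(run: Run) -> dict:
    """Map a run to a Ball Pit point: x = start time, y = duration, color = state."""
    return {
        "x": run.start_time,
        "y": run.duration_seconds,
        "color": STATE_COLORS.get(run.state, "gray"),
    }
```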

The Ball Pit has a significant problem. When there are more than 200 runs, the API truncates the response and only returns the 200 most recent runs. This is why the plot looks oddly broken much of the time. Most of Prefect's enterprise customers have tens of thousands of runs per week, so they're only seeing a small fraction of their runs.

This was the original motivation behind replacing the Ball Pit with a simpler bar chart. We couldn't show all of the individual runs, but we _could_ show a statistical aggregation of all runs (the count of runs). However, after getting feedback that people missed the Ball Pit, we tried to see if we could bring back more of its granularity.

Statistical aggregation exploration

My first idea was to do statistical aggregation by showing percentile thresholds instead of a total count (a rough sketch of this kind of aggregation follows the list). I did a few iterations to explore this idea:

  1. A clustered bar chart, where each color represents a specific threshold: Clustered bar chart
  2. A modified box and whiskers plot, which shows both the thresholds and the delta between thresholds: Modified box and whiskers plot
  3. A stacked bar chart, which shows the same information but in a less visually noisy way: Stacked bar chart
  4. A traditional line chart overlaid on top of daily run counts, which is closer to the industry standard for this type of visualization: Traditional line chart
  5. A nested stacked bar chart, where the vertical bars represent the daily run counts and thresholds. Nested stacked bar chart And when you hover, we highlight the horizontal bars, which represent the proportion of run states within each threshold band. Nested stacked bar chart - hover state
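
To ground these explorations, here's a minimal sketch of the kind of aggregation they rely on, assuming the thresholds are percentiles computed over run durations; the run dictionaries and their keys are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles

def daily_percentiles(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Group runs by day and compute p50 / p90 / p99 duration thresholds.

    Each run is a dict with hypothetical keys: "start_time" (a datetime)
    and "duration_seconds" (a float).
    """
    durations_by_day: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        day = run["start_time"].date().isoformat()
        durations_by_day[day].append(run["duration_seconds"])

    thresholds = {}
    for day, durations in durations_by_day.items():
        # quantiles(n=100) returns 99 cut points: index 49 = p50, 89 = p90, 98 = p99.
        cuts = quantiles(durations, n=100, method="inclusive")
        thresholds[day] = {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
    return thresholds
```

Each day's thresholds can then drive a clustered, stacked, or line representation without needing to load every individual run.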

In the end, we decided to use the traditional line chart for each of the high level tabs, with the exception of the Runs tab. The Runs tab is what needed to solve the Ball Pit problem, and unfortunately these approaches didn't quite get us there.

Multi-dimension exploration

While those initial explorations avoided the 200-run limit of the original Ball Pit, users told us they still missed its glanceability. Users don't want to have to choose a metric (Run Count, Failure Rate, SLA Violations, Duration, Lateness) in order to see whether that metric has anomalies. What users want is a single pane of glass they can look at to passively spot anomalies in any dimension, after which they can drill down further.

While the Ball Pit had significant shortcomings, it solved the single pane of glass problem very well. I tried a couple of explorations to see if I could represent all of these dimensions in a single plot.

  1. Show P99 thresholds where that makes sense (Duration, Lateness), and show other metrics sized according to their absolute counts (Failure, SLA Violations). Overlay multiple metrics
  2. Show all metrics as a clustered bar chart, which is simpler and less clever. Multiple metrics with clustered bar chart

However, these visualizations didn't resonate with internal stakeholders or our users, so I went back to the drawing board and approached the problem from first principles.

Observability first principles

Those explorations came closer to being replacements for the Ball Pit, but we didn't have enough confidence in them. Rather than shipping a mediocre experience, I revisited the core observability problems that users have, including problems that even the Ball Pit _doesn't_ solve.

  1. Most run failures are caused by changes to code, infrastructure, or run duration. Infrastructure annotations
  2. Users want to see anomalies in run duration, but they want to see the deviation from the median run for each run's parent deployment or work pool, not the mean for all runs as a whole (see the sketch after this list). Deviation from the mean
  3. If a run fails, users want to be able to see if this is a recurring pattern for other runs in the same workflow. Highlight related runs
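
As a rough illustration of the second point, here's a sketch of computing each run's deviation from the median duration of its group (its parent deployment or work pool). The field names are hypothetical, not Prefect's data model.

```python
from collections import defaultdict
from statistics import median

def deviation_from_group_median(
    runs: list[dict], group_key: str = "deployment_id"
) -> list[dict]:
    """Annotate each run with its deviation from its group's median duration.

    Runs are dicts with hypothetical keys: "duration_seconds" plus a grouping
    key such as "deployment_id" or "work_pool_id".
    """
    durations_by_group: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        durations_by_group[run[group_key]].append(run["duration_seconds"])

    medians = {group: median(values) for group, values in durations_by_group.items()}

    # A positive deviation means the run took longer than its group's median.
    return [
        {**run, "deviation_seconds": run["duration_seconds"] - medians[run[group_key]]}
        for run in runs
    ]
```

This grouping is the idea behind the opt-in "change from median" y-axis described later on.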

Interestingly, these problems are better solved by a scatterplot than by the statistical aggregations from the previous explorations. Users need to be able to see failures at the granularity of individual runs, which statistical aggregations can't show well.

Ball Pit redux

Given all of these explorations, we realized we probably wanted to enhance the Ball Pit rather than replace it with an entirely new visualization. However, we couldn't do that with a hard limit of 200 runs. So I went back to engineering to ask whether the limit was a law of physics or something we could find a creative solution for (progressive loading, optimizing the payload from the API, etc.).

It turned out that we could create a new API endpoint specific to the Ball Pit which would let us load up to 10,000 runs at a time. While not perfect, we decided this was good enough for most users, since:

  1. Users can restrict the timespan to limit the number of runs that are shown.
  2. We can introduce additional filters to hide runs that are less interesting (for example, successful runs, which are the majority of all runs).
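
As a sketch of how the client can stay under that ceiling, the idea is to narrow the timespan and filter out less interesting run states before requesting data. The endpoint path, parameter names, and helper below are hypothetical, not Prefect's actual API.

```python
from datetime import datetime, timedelta, timezone

import httpx

MAX_RUNS = 10_000  # per-request ceiling of the hypothetical Ball Pit endpoint

def fetch_ball_pit_runs(
    base_url: str,
    hours: int = 24,
    hidden_states: tuple[str, ...] = ("COMPLETED",),
) -> list[dict]:
    """Request runs for the Ball Pit, restricting the timespan and hiding noisy states.

    The endpoint and its parameters are illustrative only.
    """
    now = datetime.now(timezone.utc)
    payload = {
        "start_time_after": (now - timedelta(hours=hours)).isoformat(),
        "start_time_before": now.isoformat(),
        "exclude_states": list(hidden_states),
        "limit": MAX_RUNS,
    }
    response = httpx.post(f"{base_url}/ball-pit/runs", json=payload)
    response.raise_for_status()
    return response.json()
```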

With that limitation resolved, I went ahead and designed the solution that we ended up building.

New and improved Ball Pit

When you hover over an individual run in the chart, we highlight the other runs from the same workflow and show a trend line.

Hover over a run

When you click a run, we show a floating dialog with some additional metadata about the run.

Click a run

When you hover over an annotation on the x-axis, we show the associated infrastructure or code change.

Hover over an annotation

The Runs tab has a settings dialog that lets you dial in the chart to your preferences. One important function is the ability to hide specific run states (for example, Completed) if there is too much visual noise or if there are too many runs to display at once.

Tab settings

We decided to make the new relative duration y-axis opt-in. Even though users care more about deviation from the parent workflow's median, we didn't want to confuse people accustomed to the original y-axis (absolute duration).

Y-axis setting

If you do change the y-axis to show the change from the median, we show an additional setting that lets you control how the median is calculated. For example, the median can be calculated per parent deployment or per parent work pool (the infrastructure on which flows are run).

Median calculation

Options for median calculation