Data Flare

Data Flare

  • Docs
  • API

›Metrics based checks

Getting Started

  • Introduction
  • Writing your first suite of checks
  • Supported Scala and Spark versions

Available Checks

    Metrics based checks

    • Introduction to metrics and metric based checks
    • Metrics based checks on a single Dataset
    • Metrics based checks on a pair of Datasets
    • Track metrics not involved in any checks
    • Available metrics
  • Arbitrary checks

Persisting your results

  • Persisting results from your checks
  • Persisting metrics over time

Developer docs

  • Developer documentation

Available metrics

The available metrics are:

  • SizeMetric - count the number of rows in your dataset
  • SumValuesMetric - sum up a given column in your dataset
  • CountDistinctValuesMetric - count the distinct values across a given set of columns
  • ComplianceMetric - calculate the fraction of rows that comply with the given condition
  • DistinctnessMetric - calculate the fraction of rows that are unique
  • MinValueMetric - calculate the minimum value in a given column. Returns None if no rows in the dataset
  • MaxValueMetric - calculate the maximum value in a given column. Returns None if no rows in the dataset

With most metrics a filter can be applied before the metric gets calculated - you can see an example of this below.

Size Metric

Size metrics are very straightforward and just count the number of rows in your dataset. A filter can be applied to the dataset before the rows are counted. e.g.

import com.github.timgent.dataflare.metrics.MetricDescriptor.SizeMetric
import com.github.timgent.dataflare.metrics.MetricFilter
import org.apache.spark.sql.functions.col

SizeMetric(MetricFilter(col("fullName").isNotNull, "fullName is not null"))

A MetricFilter takes a filter condition and a descriptive string which is used for persistence.

SumValuesMetric

SumValuesMetric sums the values in a single column. A filter can be provided (the default is no filter) e.g.

import com.github.timgent.dataflare.metrics.MetricDescriptor.SumValuesMetric
import com.github.timgent.dataflare.metrics.MetricFilter
import com.github.timgent.dataflare.metrics.MetricValue.LongMetric

SumValuesMetric[LongMetric]("numberOfItems", MetricFilter.noFilter)

Note the type parameter. This is required to specify the type that should be used for the metric. For example if you have fractional values you should use a DoubleMetric, but for whole numbers a LongMetric is more appropriate.

ComplianceMetric

ComplianceMetric tells you what proportion of rows in a dataset comply with the given constraint. Again a filter can be applied before the metric is calculated. For example:

import com.github.timgent.dataflare.metrics.{ComplianceFn, MetricFilter}
import com.github.timgent.dataflare.metrics.MetricDescriptor.ComplianceMetric
import org.apache.spark.sql.functions.col

ComplianceMetric(ComplianceFn(col("fullName").isNotNull, "fullName is not null"), MetricFilter.noFilter)

CountDistinctValuesMetric

CountDistinctValuesMetric counts the number of distinct values in the given columns. A filter can be applied before the metric is calculated. For example:

import com.github.timgent.dataflare.metrics.MetricDescriptor.CountDistinctValuesMetric
import com.github.timgent.dataflare.metrics.MetricFilter

CountDistinctValuesMetric(List("firstName", "surname"), MetricFilter.noFilter)

DistinctnessMetric

DistinctnessMetric calculates, for a given set of columns, how unique those are across the whole dataset. A filter can be applied before the metric is calculated. For example:

import com.github.timgent.dataflare.metrics.MetricDescriptor.DistinctnessMetric
import com.github.timgent.dataflare.metrics.MetricFilter

DistinctnessMetric(List("firstName", "surname"), MetricFilter.noFilter)

MinValueMetric and MaxValueMetric

These calculate the minimum/maximum value of a given column in a Dataset. You must specify if they operate on columns with Ints or Longs (by using OptLongMetric) or Doubles (by using OptDoubleMetric). A filter can be provided For example:

import com.github.timgent.dataflare.metrics.MetricDescriptor.MinValueMetric
import com.github.timgent.dataflare.metrics.MetricFilter
import com.github.timgent.dataflare.metrics.MetricValue.OptLongMetric

MinValueMetric[OptLongMetric]("age", MetricFilter.noFilter)

The reason these checks use OptLongMetric or OptDouble metric is because in the case of an empty Dataset the only sensible metric value to calculate is None. These types of metric value calculate metrics that are options to acknowledge this possibility.

← Track metrics not involved in any checksArbitrary checks →
  • Size Metric
  • SumValuesMetric
  • ComplianceMetric
  • CountDistinctValuesMetric
  • DistinctnessMetric
  • MinValueMetric and MaxValueMetric
Data Flare
Docs
Getting StartedAPI Reference
Community
Stack Overflow
More
GitHubStar