Data Flare

Data Flare

  • Docs
  • API

›Metrics based checks

Getting Started

  • Introduction
  • Writing your first suite of checks
  • Supported Scala and Spark versions

Available Checks

    Metrics based checks

    • Introduction to metrics and metric based checks
    • Metrics based checks on a single Dataset
    • Metrics based checks on a pair of Datasets
    • Track metrics not involved in any checks
    • Available metrics
  • Arbitrary checks

Persisting your results

  • Persisting results from your checks
  • Persisting metrics over time

Developer docs

  • Developer documentation

Metrics based checks on a pair of Datasets

Performing metric based checks on a pair of datasets

To perform metric based checks on a pair of Datasets you will need to pass a dualDsChecks argument to your ChecksSuite.

import com.github.timgent.dataflare.checkssuite._
val checksSuite = ChecksSuite(
  "someChecksSuite", 
  dualDsChecks = ???
)

You will need to pass in a Map[DescribedDsPair, DualDsCheck]. The keys represent the pair of datasets that they checks will be performed on. The values represent the checks to be done on the pair of datasets.

How to create a DualMetricCheck

You can create a DualMetricCheck by specifying the metric you want to compare for each dataset, and providing a MetricComparator that describes what comparison to do between the 2 metrics. For example:

import com.github.timgent.dataflare.checks.metrics.DualMetricCheck
import com.github.timgent.dataflare.metrics.MetricComparator
import com.github.timgent.dataflare.metrics.MetricDescriptor._
DualMetricCheck(SizeMetric(), CountDistinctValuesMetric(onColumns = List("someColumn")), "dsA size is equal to dsB size",
  MetricComparator.metricsAreEqual
)

You can create your own MetricComparator if you choose, for example:

import com.github.timgent.dataflare.metrics.MetricComparator
import com.github.timgent.dataflare.metrics.MetricValue.LongMetric
val metricBIsDoubleMetricA = MetricComparator[LongMetric]("metricBIsDoubleMetricA", (metricA, metricB) => metricB == metricA * 2)

Putting it all together

import com.github.timgent.dataflare.checks.metrics.DualMetricCheck
import com.github.timgent.dataflare.checkssuite.{ChecksSuite, DescribedDs, DescribedDsPair}
import com.github.timgent.dataflare.metrics.MetricComparator
import com.github.timgent.dataflare.metrics.MetricDescriptor.{CountDistinctValuesMetric, SizeMetric}
import org.apache.spark.sql.DataFrame
val myDsA: DataFrame = ???
val myDsB: DataFrame = ???
val myDescribedDsA: DescribedDs = DescribedDs(myDsA, "myDs")
val myDescribedDsB: DescribedDs = DescribedDs(myDsB, "myDs")
val myDualMetricCheck = DualMetricCheck(
  SizeMetric(),
  CountDistinctValuesMetric(onColumns = List("someColumn")),
  "dsA size is equal to dsB size",
  MetricComparator.metricsAreEqual
)
val checksSuite = ChecksSuite(
  "someChecksSuite",
  dualDsChecks = Map(
    DescribedDsPair(myDescribedDsA, myDescribedDsB) -> Seq(myDualMetricCheck)
  )
)
← Metrics based checks on a single DatasetTrack metrics not involved in any checks →
  • Performing metric based checks on a pair of datasets
  • How to create a DualMetricCheck
  • Putting it all together
Data Flare
Docs
Getting StartedAPI Reference
Community
Stack Overflow
More
GitHubStar