Spark Data Quality

Track metrics not involved in any checks

You can store metrics for tracking over time even if they aren't used in any checks. To do this, provide a metricsPersister to your ChecksSuite along with your metricsToTrack. For example:

import com.github.timgent.sparkdataquality.checkssuite._
import com.github.timgent.sparkdataquality.metrics.MetricDescriptor.{SizeMetric, SumValuesMetric}
import com.github.timgent.sparkdataquality.metrics.MetricValue.LongMetric
import org.apache.spark.sql.DataFrame

val myDsA: DataFrame = ???
val myDescribedDsA = DescribedDs(myDsA, "myDsA")
val checksSuite = ChecksSuite(
  "someChecksSuite",
  metricsToTrack = Map(
    myDescribedDsA -> Seq(SizeMetric(), SumValuesMetric[LongMetric](onColumn = "number_of_items"))
  )
)
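The example above only declares which metrics to track; to actually record them between runs, a metricsPersister is passed to the ChecksSuite as well, as described in the paragraph above. A minimal sketch follows. Note this is an illustration, not the definitive API: the exact persister implementations and the signature of the run method are assumptions here, so check the "Persisting metrics over time" docs and the API reference for the real names.

```scala
import java.time.Instant
import com.github.timgent.sparkdataquality.checkssuite._
import com.github.timgent.sparkdataquality.metrics.MetricDescriptor.SizeMetric

// Sketch only: the concrete persister is left as a placeholder because the
// available implementations are documented in "Persisting metrics over time".
val myMetricsPersister = ??? // e.g. a persister backed by your metrics store

val trackedSuite = ChecksSuite(
  "someChecksSuite",
  metricsToTrack = Map(
    myDescribedDsA -> Seq(SizeMetric())
  ),
  metricsPersister = myMetricsPersister // stores the computed metrics on each run
)

// Running the suite computes the tracked metrics and hands them to the
// persister (assumed run signature - verify against the API docs)
val result = trackedSuite.run(Instant.now)
```

With a persister in place, each run appends the computed metric values under the suite's timestamp, which is what enables tracking metrics over time even when no check consumes them.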