case class EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...) extends Product with Serializable

spark-daria can be used as a lightweight framework for running ETL analyses in Spark.

You can define EtlDefinitions, group them in a collection, and run the ETLs via jobs.
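For instance, several EtlDefinitions can be keyed by name in a Map and run together in a single job. This is a hedged sketch, not a spark-daria API: the `etls` map, the sample DataFrames, and the output paths are all illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.github.mrpowers.spark.daria.sql.EtlDefinition

val spark = SparkSession.builder().master("local[*]").appName("etl-job").getOrCreate()
import spark.implicits._

// Two illustrative source DataFrames.
val personsDF = Seq(("bob", 14), ("liz", 20)).toDF("name", "age")
val ordersDF = Seq(("o1", 10.0), ("o2", -5.0)).toDF("id", "amount")

// Group the ETL definitions in a collection...
val etls = Map(
  "persons" -> EtlDefinition(
    sourceDF = personsDF,
    transform = (df: DataFrame) => df.dropDuplicates(),
    write = (df: DataFrame) => df.write.mode("overwrite").parquet("./tmp/persons")
  ),
  "orders" -> EtlDefinition(
    sourceDF = ordersDF,
    transform = (df: DataFrame) => df.filter("amount > 0"),
    write = (df: DataFrame) => df.write.mode("overwrite").parquet("./tmp/orders")
  )
)

// ...and run them all.
etls.foreach { case (_, etl) => etl.process() }
```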

Components of an ETL

An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.

The EtlDefinition class is generic and can be molded to suit all ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files on your local filesystem.
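A sketch of that exact scenario follows; the S3 bucket, CSV schema options, and local output path are hypothetical, and the `s3a://` read assumes the appropriate Hadoop S3 connector is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.github.mrpowers.spark.daria.sql.EtlDefinition

val spark = SparkSession.builder().getOrCreate()

// Extract: read a CSV file from S3 (bucket and key are illustrative).
val sourceDF = spark.read.option("header", "true").csv("s3a://some-bucket/people.csv")

val etl = EtlDefinition(
  sourceDF = sourceDF,
  // Transform: filter rows and repartition.
  transform = (df: DataFrame) => df.filter("age > 18").repartition(4),
  // Load: write Parquet files to the local filesystem.
  write = (df: DataFrame) => df.write.mode("overwrite").parquet("./tmp/adults")
)

etl.process()
```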

Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any

Instance Constructors

  1. new EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native() @HotSpotIntrinsicCandidate()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  8. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  9. val metadata: Map[String, Any]
  10. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  11. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  12. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  13. def process(): Unit

    Runs an ETL process

    val sourceDF = spark.createDF(
      List(
        ("bob", 14),
        ("liz", 20)
      ), List(
        ("name", StringType, true),
        ("age", IntegerType, true)
      )
    )

    def someTransform()(df: DataFrame): DataFrame = {
      df.withColumn("cool", lit("dude"))
    }

    def someWriter()(df: DataFrame): Unit = {
      val path = new java.io.File("./tmp/example").getCanonicalPath
      df.repartition(1).write.csv(path)
    }

    val etlDefinition = new EtlDefinition(
      sourceDF = sourceDF,
      transform = someTransform(),
      write = someWriter()
    )
    
    etlDefinition.process()
  14. val sourceDF: DataFrame
  15. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  16. val transform: (DataFrame) ⇒ DataFrame
  17. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  18. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  19. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. val write: (DataFrame) ⇒ Unit

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] ) @Deprecated
    Deprecated
