case class EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...) extends Product with Serializable
spark-daria can be used as a lightweight framework for running ETL analyses in Spark.
You can define EtlDefinitions, group them in a collection, and run the ETLs via jobs.
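As a hedged sketch of that pattern (the map, the job object, and the ETL names below are hypothetical, not part of spark-daria; the EtlDefinition values are assumed to be built as in the process() example further down):

```scala
// Hypothetical sketch: group EtlDefinitions in a collection and run them via a job.
// coolEtl and anotherEtl are assumed to be EtlDefinition instances built elsewhere.
val etls: Map[String, EtlDefinition] = Map(
  "cool_etl"    -> coolEtl,
  "another_etl" -> anotherEtl
)

object EtlJobs {
  // Run a single named ETL
  def run(name: String): Unit = etls(name).process()
  // Run every ETL in the collection
  def runAll(): Unit = etls.values.foreach(_.process())
}

EtlJobs.run("cool_etl")
```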
Components of an ETL
An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.
The EtlDefinition class is generic and can be molded to suit all ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files on your local filesystem.
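For instance, an S3-CSV-to-local-Parquet pipeline might be sketched as follows (the bucket path, column name, and output directory are hypothetical, and a SparkSession named `spark` is assumed to be in scope):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Extract: read a CSV file from S3 (hypothetical bucket and path)
val extractDF: DataFrame = spark
  .read
  .option("header", "true")
  .csv("s3a://some-bucket/extracts/people.csv")

// Transform: any (DataFrame) => DataFrame works here
def withGreeting()(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

// Load: write Parquet files to the local filesystem
def parquetWriter(df: DataFrame): Unit =
  df.write.mode("overwrite").parquet("/tmp/etl_output")

val etlDefinition = EtlDefinition(
  sourceDF = extractDF,
  transform = withGreeting(),
  write = parquetWriter
)

etlDefinition.process()
```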
Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any
Instance Constructors
- new EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...)
Value Members
- final def !=(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def ##(): Int
  Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def asInstanceOf[T0]: T0
  Definition Classes: Any
- def clone(): AnyRef
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native() @HotSpotIntrinsicCandidate()
- final def eq(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- final def getClass(): Class[_]
  Definition Classes: AnyRef → Any
  Annotations: @native() @HotSpotIntrinsicCandidate()
- final def isInstanceOf[T0]: Boolean
  Definition Classes: Any
- val metadata: Map[String, Any]
- final def ne(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- final def notify(): Unit
  Definition Classes: AnyRef
  Annotations: @native() @HotSpotIntrinsicCandidate()
- final def notifyAll(): Unit
  Definition Classes: AnyRef
  Annotations: @native() @HotSpotIntrinsicCandidate()
- def process(): Unit
  Runs an ETL process.

  val sourceDF = spark.createDF(
    List(
      ("bob", 14),
      ("liz", 20)
    ), List(
      ("name", StringType, true),
      ("age", IntegerType, true)
    )
  )

  def someTransform()(df: DataFrame): DataFrame = {
    df.withColumn("cool", lit("dude"))
  }

  def someWriter()(df: DataFrame): Unit = {
    val path = new java.io.File("./tmp/example").getCanonicalPath
    df.repartition(1).write.csv(path)
  }

  val etlDefinition = new EtlDefinition(
    sourceDF = sourceDF,
    transform = someTransform(),
    write = someWriter()
  )

  etlDefinition.process()
- val sourceDF: DataFrame
- final def synchronized[T0](arg0: ⇒ T0): T0
  Definition Classes: AnyRef
- val transform: (DataFrame) ⇒ DataFrame
- final def wait(arg0: Long, arg1: Int): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()
- final def wait(): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- val write: (DataFrame) ⇒ Unit
Deprecated Value Members
- def finalize(): Unit
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( classOf[java.lang.Throwable] ) @Deprecated