Configurable data lineage solution for Apache Spark
The project provides a pluggable solution for Apache Spark applications to capture the lineage of the data assets sourced and created by the application. It supports both the DataFrame API and raw queries submitted through the Spark SQL API.
The captured lineage events can be forwarded to a variety of supported backends.
The supported backends are:
- Console: primarily used for debugging; it logs the captured events
- Neo4J: pushes the lineage events to a Neo4j database and stores the lineage as a Neo4j graph
- HTTP: pushes the events to an API server
- Kafka: pushes the events to a Kafka topic
Orion can be plugged into any Spark application using the configuration below, which can be set at the cluster level or in an individual application's Spark conf:
spark.sql.queryExecutionListeners=com.mkv.ds.orion.core.listeners.OrionQueryInterceptor
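The listener registration above is typically paired with a backend selection. A minimal sketch of a spark-defaults.conf entry, using the Console backend (the simplest option) from the list below:

```properties
# Register Orion's query execution listener so lineage is captured
spark.sql.queryExecutionListeners=com.mkv.ds.orion.core.listeners.OrionQueryInterceptor
# Route captured lineage events to the Console backend (logs events for debugging)
spark.orion.lineage.backend.handler=Console
```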
The backend is selected via the spark.orion.lineage.backend.handler Spark configuration:
- Console backend: spark.orion.lineage.backend.handler=Console
- Neo4J backend: spark.orion.lineage.backend.handler=Neo4J
- HTTP backend: spark.orion.lineage.backend.handler=HTTP
- Kafka backend: spark.orion.lineage.backend.handler=KAFKA
[Kafka & HTTP backend implementations will be added in a future release]
To push events to a Neo4j database, set the following configurations:
spark.orion.lineage.backend.handler="NEO4J"
spark.orion.lineage.neo4j.backend.uri="bolt://localhost:7687"
spark.orion.lineage.neo4j.backend.username="neo4j"
spark.orion.lineage.neo4j.backend.password="password"
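The Neo4j settings above can also be supplied on the command line. A sketch of a spark-submit invocation (the application jar name is an illustrative placeholder, not an artifact name from the project):

```shell
spark-submit \
  --conf "spark.sql.queryExecutionListeners=com.mkv.ds.orion.core.listeners.OrionQueryInterceptor" \
  --conf "spark.orion.lineage.backend.handler=NEO4J" \
  --conf "spark.orion.lineage.neo4j.backend.uri=bolt://localhost:7687" \
  --conf "spark.orion.lineage.neo4j.backend.username=neo4j" \
  --conf "spark.orion.lineage.neo4j.backend.password=password" \
  my-spark-app.jar   # placeholder application jar
```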
Running the OrionTestDriver class with the Neo4j backend will capture the lineage shown below in Neo4j.
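Once events have been pushed, the stored graph can be inspected from the Neo4j Browser. A generic Cypher query is shown below; Orion's actual node labels and relationship types are not documented here, so this intentionally matches everything:

```cypher
// Return every node and relationship in the lineage graph
MATCH (src)-[rel]->(dst)
RETURN src, rel, dst;
```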