Sunday, June 7, 2026

Capture data lineage of Amazon EMR Spark jobs into Amazon SageMaker Unified Studio


Data engineers running Apache Spark jobs on Amazon EMR face a persistent challenge: understanding how data moves through Spark pipelines as it’s transformed, joined, and written to downstream tables. Tracking these transformations manually requires examining job logs, reviewing code, and piecing together transformation logic across multiple sources. As pipelines scale, this process becomes increasingly complex. The visibility gap affects key business activities: troubleshooting data quality issues takes longer, impact analysis for schema changes requires more effort, and compliance audits need extensive documentation of data provenance.

Amazon SageMaker is the center for all your data and analytics, where you can find and access all the data in your organization and act on it using tools across various use cases. This unified platform addresses the data visibility challenge by bringing together data governance, collaboration, and discovery into a single interface. At the heart of this platform is Amazon SageMaker Catalog, a centralized hub that enables organizations to catalog, govern, and discover all their data assets with complete visibility into lineage. By capturing data lineage across your entire data ecosystem, from raw sources through transformations to final outputs, SageMaker Catalog enables you to track data provenance across your platform, collaborate with clear visibility into data ownership and quality metrics, build trust through lineage that supports compliance and confident decision-making, and accelerate discovery of trustworthy, governance-ready data assets. You can access and visualize this lineage directly in Amazon SageMaker Unified Studio, which serves as the unified interface to explore data relationships and collaborate across your analytics workflows.

Starting with release 7.11, Amazon EMR includes native OpenLineage support that automates lineage capture. OpenLineage is an open-source framework for data lineage. With this integration, your data transformation jobs automatically emit lineage metadata directly into Amazon SageMaker Catalog, or other data governance solutions, without requiring customizations.

This EMR native support of OpenLineage is part of a growing set of integrations across AWS analytics services including AWS Glue, Amazon EMR Serverless, and Amazon Redshift. The complete list of services with native OpenLineage integration can be found in the data lineage support matrix.

In this post, you’ll walk through a practical, step-by-step example that shows how to capture and track data lineage from Spark jobs running on Amazon EMR directly into Amazon SageMaker Catalog using OpenLineage. You’ll see how lineage metadata flows automatically and explore data relationships and dependencies across your workflows in Amazon SageMaker Unified Studio.

Solution overview

Imagine you’re part of a large enterprise that relies on HR analytics to optimize workforce planning, compensation strategies, and talent retention practices. Your data engineering team owns the delivery of these analytical products by processing raw HR datasets (including employee records, attendance logs, and compensation details), with Spark jobs running on your Amazon EMR infrastructure.

With time, Spark jobs have grown in complexity. Your team now struggles to maintain visibility into how data moves through pipelines, who modified it, and how to map dependencies between datasets and final analytical products.

The following solution demonstrates how you can address these challenges by automatically capturing data lineage end-to-end from Spark jobs running on your EMR infrastructure and visualizing it in Amazon SageMaker Unified Studio so that you and the business understand data provenance of the final analytical products.

AWS cloud data pipeline architecture diagram showing data flowing from Amazon S3 CSV files (employees.csv, attendance.csv) through Amazon EMR with Apache Spark processing, AWS Glue Data Catalog metadata management, and Amazon SageMaker Catalog integration, producing salary_adjustments.csv and bonus_payments.csv output files stored in Amazon S3.

The architecture includes a Data Layer with CSV files containing employee, attendance, salary, and bonus data stored in Amazon Simple Storage Service (Amazon S3), representing typical HR and payroll source systems.

The Processing Layer uses an Amazon EMR cluster running Apache Spark jobs that transform raw data into analytical tables. The first Spark job joins employee and attendance data, while the second Spark job combines attendance with compensation data. Both jobs use the Apache Iceberg table format to provide ACID (atomicity, consistency, isolation, and durability) transactions and time travel capabilities.

The Metadata Layer uses AWS Glue Data Catalog to store Iceberg table metadata, making tables discoverable and accessible across AWS analytics services. A Lineage Layer uses the OpenLineage integration in EMR to automatically track input/output datasets (CSV files and Iceberg tables), transformation logic at column level (joins, filters, aggregations), and job execution metadata.

Finally, the Data Governance Layer uses Amazon SageMaker Catalog to capture and process OpenLineage events posted by the EMR Spark jobs and automatically build a comprehensive lineage graph that shows complete data provenance from CSV source files through Spark transformations to Iceberg analytical tables.

Before you deploy this solution, make sure you have the following resources in place.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • Your assumed role should have full access to Amazon EMR, Amazon S3, AWS Identity and Access Management (IAM), and AWS Lambda. Note that for production workloads, least-privilege permissions are recommended.
  • An Amazon VPC (Virtual Private Cloud) with at least one subnet with internet access. You can provision this VPC as you create the Amazon SageMaker domain next.
  • An existing Amazon SageMaker Unified Studio domain and project. To get started, use the quick setup option as explained here. To create a project, follow the instructions here.
  • An S3 bucket with the sample data files and Spark scripts uploaded (see Prepare your source data below).
  • Default EMR service roles — if this is your first time using EMR in this account, run `aws emr create-default-roles` from the AWS CLI or CloudShell to create them.

With these prerequisites in place, let’s examine what the AWS CloudFormation template will deploy to your AWS environment.

Architecture components

The deployment creates several interconnected components that work together to capture and visualize lineage:

  • An S3 bucket to store all data and artifacts for the solution.
  • An EMR cluster (release 7.12.0) with Apache Iceberg support enabled and the OpenLineage integration pre-installed, ready to run Spark jobs with lineage tracking.
  • A set of IAM policies that grant the EMR cluster the permissions it needs to post lineage events to your SageMaker Unified Studio domain.
  • A set of AWS Lake Formation permissions that allow the EMR cluster to create, alter, and drop Iceberg tables in your specified Glue database.

With an understanding of what will be deployed, you’re ready to launch the CloudFormation stack.

Deploy the solution

Note: While this walkthrough uses the Amazon EMR console and AWS CLI to verify the cluster and run Spark jobs, you can also perform these steps directly from Amazon SageMaker Unified Studio. SageMaker Unified Studio provides a unified interface to create and manage EMR clusters, submit Spark jobs, and monitor execution, all within the same environment where you’ll later explore the lineage captured in Amazon SageMaker Catalog.

Prepare your source data

Before deploying the CloudFormation stack, clone or download the Git repository.

git clone https://github.com/aws-samples/sample-capture-data-lineage-of-amazon-emr-ec2

Upload the CSV files from the cloned repository to the input/ prefix of your S3 bucket, and the Spark scripts to the scripts/ prefix. You can run the following commands to upload the files:

aws s3 cp employees.csv s3://YOUR-BUCKET/input/
aws s3 cp attendance.csv s3://YOUR-BUCKET/input/
aws s3 cp salary_adjustments.csv s3://YOUR-BUCKET/input/
aws s3 cp bonus_payments.csv s3://YOUR-BUCKET/input/
aws s3 cp emr-lineage-spark-job.py s3://YOUR-BUCKET/scripts/
aws s3 cp emr-lineage-compensation-job.py s3://YOUR-BUCKET/scripts/

To deploy the solution, complete the following steps in the CloudFormation console:

  1. Create a new stack by specifying the CloudFormation YAML template file that you downloaded from the Git repository.
  2. Enter a stack name (for example, emr-lineage-demo) and provide the following parameters:
    • SourceS3BucketName: S3 bucket containing your CSV files and Spark scripts.
    • SourceCSVPrefix: S3 prefix where CSV files are located.
    • SourceScriptsPrefix: S3 prefix where Spark scripts are located.
    • GlueDatabaseName: The name of the Glue database associated with your Amazon SageMaker Unified Studio project.
    • DataZoneDomainId: Your SageMaker Unified Studio domain ID.
    • VpcId: The ID of the VPC that was deployed as part of the prerequisites.
    • For EMRReleaseLabel, MasterInstanceType, CoreInstanceType and CoreInstanceCount, keep the default values.
  3. Acknowledge IAM resource creation, choose Next and then Submit. The CloudFormation stack takes approximately 10 to 15 minutes to complete.
  4. In the EMR console, wait for the cluster status to show as WAITING before moving to the next step.

Screenshot of the Amazon EMR on EC2 Clusters management console showing a list of 14 clusters, with the cluster "EMR-Lineage-Demo-emr-ec2-lineage-demo-stack" (ID: j-3APWOTUDNYO2T) highlighted in a "Waiting – Ready to run steps" status with a green badge.

Now that the EMR cluster is running with OpenLineage enabled, let’s examine how the Spark jobs are configured to capture lineage metadata.

Explore data lineage configuration in EMR

When submitting Spark jobs to EMR, specific configurations enable OpenLineage to create and post lineage events to SageMaker Unified Studio as the job runs:

  • spark.hadoop.hive.metastore.client.factory.class – Configures Spark to use AWS Glue as the Hive metastore.
  • spark.jars – Path to the pre-installed OpenLineage library (available on EMR 7.11+).
  • spark.extraListeners – Registers an OpenLineage listener to capture metadata of input / output datasets and transformations.
  • spark.openlineage.transport.type – Uses the OpenLineage DataZone transport option to send lineage events directly into SageMaker Catalog.
  • spark.openlineage.transport.domainId – The ID of your SageMaker Unified Studio domain, that serves as the target for lineage events.
  • spark.glue.accountId – Your AWS account ID for Glue data catalog operations.
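To make the list above more concrete, here is a minimal Python sketch that assembles these properties into `--conf` arguments for spark-submit. The helper names are hypothetical. The factory class, listener class, JAR path, and transport type are the values we recall from the AWS documentation for this integration, so treat them as assumptions; the Job1SubmitCommand output of the CloudFormation stack remains the authoritative source for your cluster.

```python
from typing import Dict, List


def openlineage_spark_conf(domain_id: str, account_id: str) -> Dict[str, str]:
    """Return the Spark properties listed above as a dictionary.

    The literal values are assumptions recalled from AWS documentation;
    compare them with the submit command generated by the stack.
    """
    return {
        # Use the AWS Glue Data Catalog as the Hive metastore.
        "spark.hadoop.hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        # Assumed location of the pre-installed OpenLineage library on EMR 7.11+.
        "spark.jars":
            "/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar",
        # Listener that captures dataset and transformation metadata.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Assumed name of the DataZone transport that posts to SageMaker Catalog.
        "spark.openlineage.transport.type": "amazon_datazone_api",
        "spark.openlineage.transport.domainId": domain_id,
        "spark.glue.accountId": account_id,
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dictionary into ['--conf', 'key=value', ...]."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder domain and account IDs, for illustration only.
    example = openlineage_spark_conf("dzd_EXAMPLE", "111122223333")
    print(" ".join(to_spark_submit_args(example)))
```

Keeping the properties in one dictionary makes it easier to compare them against the generated submit command and to reuse them across both jobs.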

Now that you understand the configuration that enables automatic lineage capture, you’re ready to run the data pipeline.

When running this two-step pipeline, you will calculate the total employee compensation by combining salary adjustments, bonuses, and attendance data. The final analytical asset will serve payroll processing and budgeting.

Run employee attendance analysis job

The first job reads employee details (in employees.csv dataset) and attendance records (in attendance.csv dataset), joins the datasets on EmployeeID and creates a unified dataset (employee_attendance Iceberg table) in your Glue database.

Follow the steps below to run this first job:

  1. In the CloudFormation console, navigate to the stack’s Outputs tab
  2. Copy the value of the Job1SubmitCommand output key. Note that this is the command you’ll use to submit the first job in EMR with the right configuration.

AWS CloudFormation console screenshot showing the Outputs tab for the "emr-ec2-lineage-demo-stack" stack, displaying 9 outputs including the Job1SubmitCommand — an AWS EMR add-steps command with Apache Spark configuration for the EMR Lineage Demo Job targeting cluster j-3APWOTUDNYO2T.

  3. Run the command in your terminal or AWS CloudShell.
  4. Monitor the job in the Amazon EMR console under Steps.
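If you prefer to monitor the step from a script instead of the console, the following Python sketch polls the step state until it reaches a terminal value. It is an illustration rather than part of the sample repository: `wait_for_step` is a hypothetical helper, and it expects any client object that exposes the EMR `describe_step` call, such as the one returned by `boto3.client("emr")`.

```python
import time
from typing import Any

# Step states after which EMR no longer changes the step.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client: Any, cluster_id: str, step_id: str,
                  poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Poll describe_step until the step reaches a terminal state.

    Returns the final state, or raises TimeoutError after max_polls attempts.
    """
    for _ in range(max_polls):
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Step {step_id} did not finish after {max_polls} polls")
```

With boto3 installed and credentials configured, calling `wait_for_step(boto3.client("emr"), cluster_id, step_id)` with the cluster ID and the step ID returned by the submit command should return COMPLETED for a successful run. Injecting the client keeps the helper easy to exercise with a stub.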

Screenshot of the Amazon EMR console Steps tab for the cluster "EMR-Lineage-Demo-emr-ec2-lineage-demo-stack," showing one completed step named "EMR-Lineage-Demo-Job" with Step ID s-0270631D8DHBCJZKBAZ and a green "Completed" status checkmark.

Run employee compensation analysis job

Now, you will calculate the total employee compensation (employee_compensation Iceberg table) by combining salary adjustments (salary_adjustments.csv dataset), bonuses (bonus_payments.csv dataset), and attendance (the employee_attendance table created in the previous step):

  1. Repeat steps 1 to 4 from the previous section to run Job 2.
  2. After completion, open the AWS Glue console.
  3. Navigate to Data Catalog, then Tables and select your SageMaker project’s database.
  4. Confirm that employee_attendance and employee_compensation tables are listed.
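The same check can be scripted. The sketch below, offered as an illustration only, lists the tables in a Glue database and reports which of the two expected tables are missing. The helper names are hypothetical; the client is assumed to expose the AWS Glue `get_tables` call, as `boto3.client("glue")` does.

```python
from typing import Any, Dict, Iterable, List, Set


def list_table_names(glue_client: Any, database: str) -> Set[str]:
    """Collect all table names in a Glue database, following NextToken pages."""
    names: Set[str] = set()
    kwargs: Dict[str, str] = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.update(table["Name"] for table in page.get("TableList", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def missing_tables(
    glue_client: Any,
    database: str,
    expected: Iterable[str] = ("employee_attendance", "employee_compensation"),
) -> List[str]:
    """Return the expected tables that are not present in the database."""
    return sorted(set(expected) - list_table_names(glue_client, database))
```

An empty list from `missing_tables(boto3.client("glue"), "your_glue_database")` indicates that both jobs created their tables; the database name here is a placeholder for your project’s Glue database.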

With both Spark jobs complete, you can now visualize the complete data lineage graph in Amazon SageMaker Unified Studio.

Visualizing lineage in SageMaker Unified Studio

SageMaker Unified Studio provides a graph-based data lineage visualization that helps data engineers, analysts, and data scientists clearly understand which source datasets (files or tables) feed into each dataset, what transformations and logic are applied at every step, which downstream analytics assets consume the data, and how changes to upstream data or transformations may impact the rest of the data pipeline.

Now that the data pipeline has run successfully, let’s review the captured lineage for the HR data in SageMaker Unified Studio:

  1. Navigate to the SageMaker Unified Studio console and sign in to your domain.
  2. Open your project and go to Data Sources.
  3. Find your AWS Glue Data Catalog source.

Screenshot of the Amazon SageMaker project catalog Data Sources page listing three configured data sources: a Redshift Serverless source, an AWS Glue Lakehouse source named "AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource" (highlighted), and a Tooling SageMaker model package group source — all scheduled MTWTFSS and in Ready or Running status.

  4. Click RUN. Two new assets will be created.

Screenshot of the AWS Glue Data Catalog interface showing run activities for the data source "AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource," with two completed on-demand runs and a highlighted asset table showing employee_attendance and employee_compensation successfully created in the emr_ec2_lineage_blogpost_glue_db database.

  5. Navigate to Assets and click on employee_compensation. Under the LINEAGE tab you’ll find the lineage graph view that SageMaker builds based on the OpenLineage metadata captured from the EMR Spark jobs as they run.

AWS Glue data lineage visualization showing the flow of the employee_compensation dataset from an Apache Spark job (default.emr_lineage_compensa, COMPLETE, Dec 22 2025 11:42:47 AM) through an AWS Glue Iceberg table (20 columns) to an AWS Glue Inventory destination table, with a right sidebar displaying lineage metadata including the dataset ARN, OpenLineage producer URL, Iceberg snapshot ID, and projected field names EmployeeID, Name, and Department.

    • You’ll first see three lineage nodes from left to right: one representing the EMR Spark job that created the final Iceberg table, a second one representing the actual Iceberg table in the Glue catalog, and a third one representing the data asset in the SageMaker Catalog inventory that maps to the Glue table.
    • Click on any lineage node to view its underlying metadata in the details pane, including dataset names, S3 locations, schema, data types, job execution details and more.
  6. Expand the lineage to the left by clicking on the double arrow next to the first lineage node. Keep expanding until you hit the originating datasets.

Data pipeline lineage diagram showing the complete ETL flow from Amazon S3 source files (input/attendance.csv with 6 columns, input/employees.csv with 5 columns) through two Apache Spark jobs to intermediate tables (input/salary_adjustments.csv, iceberg/employee.csv, AWS Glue employee_attendance with 14 columns) and final destination tables (AWS Glue iceberg/employee_compensation with 29 columns, AWS Glue Inventory employee_compensation_hive with 30 columns), all timestamped Dec 22, 2025.

    • Expanding the graph to the left reveals the complete data pipeline back to original CSV source files. You can see how compensation data depends on upstream attendance analytics.
    • Note how each lineage node represents an element in the data pipeline you ran, including both Spark jobs and even the intermediate employee_attendance Iceberg table that connects them.
  7. You can expand column-level lineage by clicking on the column section of a lineage node of a dataset or data asset. This allows you to understand how data changes at a column level as it moves downstream through your data pipeline.

Data lineage diagram showing the employee compensation ETL pipeline with four Amazon S3 source tables (employee.csv with 5 columns, input/attendance.csv with 6 columns, input/salary_adjustments.csv with 4 columns, output/employee_attendance.csv with 14 columns) processed by two Apache Spark jobs to produce a final s3://employee_compensation table with 20 columns, all dated Dec 22, 2025.
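To give a sense of what sits behind the column-level view, the following Python sketch flattens the columnLineage facet of an OpenLineage run event into source-to-target column pairs. It follows the facet layout of the OpenLineage specification (outputs, facets, columnLineage, fields, inputFields). The sample event is hypothetical and heavily trimmed: its namespaces and dataset names are placeholders, not values captured from an actual EMR run.

```python
from typing import Any, Dict, List, Tuple

# (source "dataset.column", target "dataset.column")
ColumnEdge = Tuple[str, str]


def column_edges(event: Dict[str, Any]) -> List[ColumnEdge]:
    """Flatten the columnLineage facet of a run event into column edges."""
    edges: List[ColumnEdge] = []
    for output in event.get("outputs", []):
        facet = output.get("facets", {}).get("columnLineage", {})
        for out_col, detail in facet.get("fields", {}).items():
            for src in detail.get("inputFields", []):
                edges.append((f"{src['name']}.{src['field']}",
                              f"{output['name']}.{out_col}"))
    return edges


# Hypothetical event for illustration only; real events carry many more fields.
SAMPLE_EVENT = {
    "eventType": "COMPLETE",
    "outputs": [{
        "namespace": "example-namespace",
        "name": "employee_attendance",
        "facets": {"columnLineage": {"fields": {
            "EmployeeID": {"inputFields": [
                {"namespace": "s3://YOUR-BUCKET",
                 "name": "input/employees.csv", "field": "EmployeeID"},
                {"namespace": "s3://YOUR-BUCKET",
                 "name": "input/attendance.csv", "field": "EmployeeID"},
            ]},
            "Department": {"inputFields": [
                {"namespace": "s3://YOUR-BUCKET",
                 "name": "input/employees.csv", "field": "Department"},
            ]},
        }}},
    }],
}

for source, target in column_edges(SAMPLE_EVENT):
    print(f"{source} -> {target}")
```

Each printed pair corresponds to one column-level edge of the kind the lineage graph draws between a source dataset and a downstream table.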

Cleanup

To avoid ongoing charges, clean up the resources:

  1. First, empty the destination bucket by running the following command in your terminal or with AWS CloudShell.

aws s3 rm s3://${DEST_BUCKET}/ --recursive

  2. Delete the CloudFormation stack.
    • On the AWS CloudFormation console, choose Stacks in the navigation pane.
    • Choose the stack you created, then choose Delete and then Delete stack when prompted.

Conclusion

In this post, you explored how to capture data lineage from Spark jobs in Amazon EMR (v7.11+) directly into Amazon SageMaker Unified Studio. You learned how to set up an Amazon EMR cluster with native OpenLineage support to automatically track lineage metadata from Spark jobs processing your data. You also configured the integration between EMR and Amazon SageMaker Catalog so that lineage information flows into your governance platform. Finally, you explored the resulting lineage graph in SageMaker Unified Studio and saw how it provides comprehensive visibility into data transformations, from source CSV files through Spark processing jobs to final analytical tables using the Apache Iceberg format.

We encourage you to now test these capabilities with your own data pipelines running on EMR. By implementing automated lineage tracking, many customers have strengthened their governance frameworks while gaining valuable insights into data dependencies, impact analysis, and compliance requirements. This approach enables data teams to build trust in their analytics outputs while maintaining the agility needed to derive business value from their data assets.


About the authors

Yanick Houngbedji is a Solutions Architect for Independent Software Vendors (ISV) at Amazon Web Services (AWS), based in Montréal, Canada. He specializes in helping customers architect and implement highly scalable, performant, and secure cloud solutions on AWS. Before joining AWS, he spent over 8 years providing technical leadership in data engineering, big data analytics, business intelligence, and data science solutions.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.
