What is the difference between AWS Glue and AWS Lambda?

Asked by CsabaToth in AWS on Mar 29, 2024

I am currently working on designing a data processing pipeline for a large e-commerce platform. The pipeline extracts, transforms, and loads data from various sources into a centralized data warehouse. When should I choose AWS Glue over AWS Lambda, or vice versa?

Answered by Deepa bhawana

In the context of AWS, here is how the two services differ and when to choose each:

AWS Glue

Choose AWS Glue when you are dealing with complex ETL workflows that require automated schema discovery, data cataloging, and orchestration.

AWS Glue is well suited to large-scale data processing tasks where the transformation involves multiple steps and multiple data sources.

Here is an example of a simple AWS Glue ETL job script (the bucket names and the existing_column field are placeholders for illustration):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Initialize the Spark and Glue contexts
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Define source and target data paths
source_path = "s3://my-source-bucket/input-data/"
target_path = "s3://my-target-bucket/output-data/"

# Read the source data from S3 into a DynamicFrame, then convert to a DataFrame
source_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [source_path]},
    format="json"
).toDF()

# Perform a transformation: derive a new column from an existing one
transformed_df = source_df.withColumn("new_column", source_df["existing_column"] + 1)

# Write the transformed data to the target location as Parquet
transformed_df.write.mode("overwrite").format("parquet").save(target_path)
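Note that create_dynamic_frame returns a Glue DynamicFrame, a schema-flexible wrapper designed for messy or evolving data; calling toDF() converts it into a plain Spark DataFrame so the standard Spark column API can be used for the transformation.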

AWS Lambda

You can opt for AWS Lambda for lightweight, event-driven processing tasks that require quick implementation and on-demand scalability. Lambda is a good fit for processing individual records, triggering actions based on events, and integrating seamlessly with other AWS services.

Here is an example of an AWS Lambda function in Python (the bucket names and the process_data logic are placeholders):

import boto3

def lambda_handler(event, context):
    # Extract the object key from the S3 event that triggered the function
    input_data = event["Records"][0]["s3"]["object"]["key"]
    # Process the data (e.g., perform calculations, format data)
    processed_data = process_data(input_data)
    # Save the processed data to the target location
    save_to_s3(processed_data, "output-data/processed-data.txt")

def process_data(input_data):
    # Placeholder for the data processing logic
    processed_data = input_data.upper()  # Example: convert the data to uppercase
    return processed_data

def save_to_s3(data, key):
    # Write the result to the target bucket under the given key
    s3 = boto3.client("s3")
    s3.put_object(Body=data, Bucket="my-target-bucket", Key=key)
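For reference, here is roughly what the S3 event payload this handler expects looks like (a trimmed-down sketch; real notifications carry many more fields, and the bucket and key values here are hypothetical):

# Minimal sketch of an S3 put-event payload
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-source-bucket"},          # hypothetical bucket
                "object": {"key": "input-data/order-123.json"}   # hypothetical key
            }
        }
    ]
}

# Invoke the handler locally with the sample event (the context argument is unused here)
lambda_handler(sample_event, None)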

Finally, the two services can be combined in a single pipeline: a Lambda function reacts to an event, such as a new file landing in the source bucket, and starts the Glue ETL job above, which then performs the heavy Spark-backed transformation.

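Here is a minimal sketch of such a trigger function, assuming the Glue script above has been registered as a job named ecommerce-etl-job (a hypothetical name) and that the Lambda function is subscribed to S3 put events:

import boto3

# Hypothetical name under which the Glue ETL script above was registered
GLUE_JOB_NAME = "ecommerce-etl-job"

def lambda_handler(event, context):
    # Identify the object that landed in S3 and triggered this function
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Start the Glue job, passing the object's location as job arguments
    # (the Glue script can read these via awsglue.utils.getResolvedOptions)
    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
    return {"job_run_id": response["JobRunId"]}

This division of labor keeps Lambda doing what it does best, cheap and fast event handling, while Glue runs the long-lived, Spark-backed transformation.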

