Career Roadmap

AWS Data Engineer: Zero to Hero

This roadmap is structured around the four official DEA-C01 exam domains and their weightings. Data Ingestion and Transformation (34%) is the largest domain and receives proportionally more coverage. The roadmap reflects the current DEA-C01 exam — 65 questions, 130 minutes, 720/1000 passing score, $150 USD, recommended 2-3 years data engineering experience with 1-2 years AWS hands-on. Use ExamOS practice quizzes at every step to make progress measurable before booking.

10 steps | 3 certifications | ~5-7 months | 01-Jun-2026

Step 0 - Data engineering and programming foundations

Build the technical foundation that every AWS data engineering concept depends on. DEA-C01 is described as the most technical associate-level AWS exam — the foundations matter more here than on most other AWS certifications.

3-4 weeks

Python proficiency — pandas for data manipulation, PySpark basics, boto3 for AWS API interaction, writing ETL logic
SQL fluency — SELECT, JOIN, GROUP BY, window functions, CTEs, query optimization concepts, EXPLAIN plans
Data formats — CSV, JSON, Parquet, ORC, Avro — what each is, why Parquet dominates ML and analytics workloads, compression trade-offs
Data engineering concepts — batch versus streaming processing, ETL versus ELT, idempotency in pipelines, exactly-once versus at-least-once delivery
Database fundamentals — relational model, OLTP versus OLAP, columnar versus row-based storage, normalization versus denormalization for analytics
Distributed computing basics — what Apache Spark is, partitioning for parallel processing, shuffle operations
Git and workflow basics — version-controlled ETL code, reproducible pipeline development

💡 PySpark basics are directly relevant to AWS Glue, which uses Spark under the hood. Candidates who understand how Spark partitioning and transformations work find Glue scenarios significantly more approachable.

💡 SQL fluency matters more on DEA-C01 than on most AWS exams. Athena queries, Redshift query optimization, and Glue Data Quality rules all require SQL reasoning. Window functions and query plan interpretation appear in scenario questions.
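
The window-function and CTE reasoning mentioned above can be practiced locally before touching Athena or Redshift. The following is a minimal sketch using Python's built-in sqlite3 module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The `orders` table and its rows are invented for illustration; the `SUM(...) OVER` and `ROW_NUMBER() OVER` syntax is standard SQL.

```python
import sqlite3

# Hypothetical order data, invented purely for illustration.
ROWS = [
    ("alice", "2024-01-01", 30.0),
    ("alice", "2024-01-03", 20.0),
    ("bob", "2024-01-02", 50.0),
    ("bob", "2024-01-05", 10.0),
]

# The CTE computes a per-customer running total and a recency rank;
# the outer query keeps only each customer's most recent order.
QUERY = """
WITH ranked AS (
    SELECT
        customer,
        order_date,
        SUM(amount) OVER (
            PARTITION BY customer ORDER BY order_date
        ) AS running_total,
        ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_date DESC
        ) AS recency_rank
    FROM orders
)
SELECT customer, order_date, running_total
FROM ranked
WHERE recency_rank = 1
ORDER BY customer
"""


def latest_order_per_customer(rows):
    """Load rows into an in-memory table and run the window-function query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

Calling `latest_order_per_customer(ROWS)` returns one row per customer with the running total as of their latest order.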

💡 The difference between ETL (transform before loading) and ELT (load raw then transform in the destination) is a design decision the exam tests. Know when each pattern is appropriate for described workloads.
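
To make the ETL versus ELT distinction concrete, here is a small sketch that applies the same cleaning rule in both styles, with an in-memory SQLite database standing in for the destination warehouse. The data, table names, and cleaning rule are all assumptions made up for this illustration.

```python
import sqlite3

# Hypothetical raw records: (name, score) as untrusted text.
RAW = [("  Alice ", "42"), ("BOB", "17"), ("carol", "not-a-number")]


def etl(raw):
    """ETL: clean in application code first, then load only the clean rows."""
    clean = [(n.strip().lower(), int(s)) for n, s in raw if s.isdigit()]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", clean)
    return conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()


def elt(raw):
    """ELT: load the raw rows untouched, then transform with SQL in the destination."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_scores (name TEXT, score TEXT)")
    conn.executemany("INSERT INTO raw_scores VALUES (?, ?)", raw)
    conn.execute(
        """
        CREATE TABLE scores AS
        SELECT lower(trim(name)) AS name, CAST(score AS INTEGER) AS score
        FROM raw_scores
        WHERE score <> '' AND score NOT GLOB '*[^0-9]*'
        """
    )
    return conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
```

Both functions produce the same curated table. The practical difference is where the compute runs and whether the raw rows remain available in the destination for later reprocessing, which only the ELT version preserves.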

Step 1 - AWS fundamentals and data-relevant architecture (SAA-C03 recommended)

Build the AWS architecture knowledge that DEA-C01 scenarios assume. Data engineering on AWS is built on top of core infrastructure — IAM, S3, networking, and compute appear in every data pipeline.

4-6 weeks

Amazon S3 — bucket architecture, object storage patterns, storage classes, lifecycle policies, S3 Select, S3 event notifications, versioning
AWS IAM — roles for data services, policy design for least privilege, service-linked roles, cross-account data access patterns
AWS networking — VPCs for data infrastructure, VPC endpoints for private access to S3, Glue, Redshift, and Athena
Amazon RDS — database engines, Multi-AZ for source systems, read replicas, when relational databases feed data pipelines
AWS Cost management — understanding data transfer costs, choosing between services based on cost profiles
CloudWatch — logs, metrics, dashboards, alarms for data pipeline monitoring
AWS Lambda — event-driven compute for pipeline triggers, small transformation tasks, connector logic

Certifications

AWS Certified Cloud Practitioner (CLF-C02)

💡 SAA-C03 is the strongest foundation for DEA-C01 if you have time for it. The Resilient Architectures and High-Performing Architectures domains of SAA-C03 directly overlap with DEA-C01 storage and pipeline design scenarios.

💡 CLF-C02 is sufficient for candidates who want to establish cloud familiarity before DEA-C01. Candidates with existing AWS experience can skip both and proceed directly.

💡 S3 knowledge at genuine depth matters for DEA-C01. Partitioning strategies (Hive-style partitioning for Athena and Glue performance), storage class selection by access pattern, and S3 event notification patterns for pipeline triggering all appear in exam scenarios.
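
As an illustration of Hive-style partitioning, the sketch below builds an S3 object key with `year=/month=/day=` path segments. The prefix and file name are hypothetical; the `column=value` segment layout is the convention that lets query engines treat path components as partition columns.

```python
from datetime import date


def hive_partitioned_key(prefix, event_date, filename):
    """Build an object key using Hive-style column=value path segments."""
    return (
        f"{prefix}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}/{filename}"
    )


# Hypothetical usage: where a file for 7 March 2024 would land.
example_key = hive_partitioned_key("events", date(2024, 3, 7), "part-0000.parquet")
```

Zero-padding the month and day keeps keys sorting in date order, which is a design choice of this sketch rather than a requirement.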

💡 VPC endpoints for data services are tested in the security domain. Know that S3 Gateway endpoints are free, while interface endpoints for Glue, Redshift, and Kinesis have hourly charges, and why each is used.

Step 2 - DEA-C01 Domain 1 Part A — Data Ingestion Patterns and Services (34% total)

Build deep knowledge of how data moves from source systems into AWS. Data Ingestion and Transformation at 34% is the largest single domain — no other domain comes close. Split across two steps to give it appropriate depth.

4-5 weeks

AWS Database Migration Service (DMS) — homogeneous versus heterogeneous migration, full load versus CDC (change data capture), replication instances, task configuration
Amazon Kinesis Data Streams — shards, partition keys, sequence numbers, retention period, enhanced fan-out consumers, IteratorAge metric
Amazon Kinesis Data Firehose — delivery stream destinations (S3, Redshift, OpenSearch, Splunk, HTTP), buffering hints, data transformation with Lambda, dynamic partitioning
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) - Flink applications, legacy SQL applications for streaming, stateful processing
Amazon MSK (Managed Streaming for Apache Kafka) — cluster configuration, topic design, consumer groups, MSK Serverless, MSK Connect for connectors
Amazon SQS and SNS for pipeline triggers — decoupling ingestion components, fan-out patterns, FIFO versus Standard queues for ordered ingestion
AWS DataSync — migrating large datasets to S3, EFS, FSx, comparing DataSync versus DMS for different source types
AWS Snow family — Snowcone, Snowball Edge, Snowmobile for offline bulk ingestion, when network transfer is impractical
S3 Transfer Acceleration and multipart upload for large dataset ingestion
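
The shard and partition-key mechanics listed for Kinesis Data Streams can be sketched in a few lines. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that integer. The sketch below assumes the shards split the hash space evenly, which may not hold for a stream that has been resharded; the function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard an evenly split stream would route this key to."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the top of the hash space falls in the last shard.
    return min(hash_value // range_size, shard_count - 1)
```

Because the mapping depends only on the key, every record with the same partition key lands on the same shard. This is why a low-cardinality or skewed partition key can create a hot shard.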

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 DMS CDC (change data capture) appears in scenarios about real-time replication from operational databases into data lakes. Know that DMS CDC depends on the source database's change log: binary logging must be enabled for MySQL, supplemental logging for Oracle, and logical replication for PostgreSQL.

💡 Kinesis IteratorAge is the key metric for detecting consumer lag in Streams. If IteratorAge is growing, consumers are falling behind producers. This appears in monitoring and troubleshooting scenarios.

💡 MSK (Managed Kafka) is tested at a configuration and use case level. Know when Kafka is preferred over Kinesis: primarily when you need compatibility with existing Kafka clients and tooling, Kafka-style consumer groups, or retention beyond the 365-day maximum that Kinesis Data Streams supports.

💡 Use ExamOS for ingestion scenario practice that tests service selection for described data source characteristics, velocity, and volume.

Step 3 - DEA-C01 Domain 1 Part B — Data Transformation with AWS Glue (34% total)

Master AWS Glue as the primary ETL and data transformation service on AWS. Glue is the most heavily tested service on DEA-C01 and is significantly more complex than most candidates expect.

4-5 weeks

AWS Glue architecture — Glue Data Catalog, Glue crawlers, Glue ETL jobs, Glue Studio, Glue DataBrew, Glue Workflows
Glue Data Catalog — databases, tables, partitions, schemas, cross-account catalog sharing, Lake Formation integration
Glue crawlers — crawler configuration, classifiers, partition detection, incremental crawls, scheduled versus on-demand
Glue ETL jobs - worker types (G.025X, G.1X, G.2X, larger G.4X and G.8X, and the legacy Standard type), job bookmarks for incremental processing, DynamicFrames versus DataFrames
Glue Studio — visual ETL authoring, custom transforms, job monitoring, source and target connectors
AWS Glue DataBrew — no-code data transformation, 250+ built-in transformations, profiling data quality, recipe publishing
Glue Data Quality — data quality rulesets (completeness, uniqueness, referential integrity), DQ results in Catalog, anomaly detection
Glue Streaming ETL — continuous processing of Kinesis and Kafka streams, micro-batch processing
PySpark transformations in Glue — common DynamicFrame operations, relationalize for nested JSON, resolveChoice for ambiguous types
Glue Workflows — orchestrating multiple crawlers and ETL jobs, triggers (scheduled, on-demand, event-based)

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Glue job bookmarks are specifically tested in incremental data processing scenarios. Job bookmarks track which data has already been processed so that rerunning the job only processes new data. Know how to enable, reset, and troubleshoot bookmarks.
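
The idea behind bookmark-driven incremental processing can be sketched without Glue at all. The function below is a conceptual analog, not Glue's implementation: it assumes each object carries a last-modified value, and it treats the bookmark as the highest value already processed. Passing `None` plays the role of a bookmark reset.

```python
def run_incremental(objects, bookmark):
    """Process only objects newer than the bookmark.

    objects:  list of (key, last_modified) pairs (hypothetical listing)
    bookmark: highest last_modified already processed, or None after a reset
    Returns (processed_keys, new_bookmark).
    """
    new = [(k, ts) for k, ts in objects if bookmark is None or ts > bookmark]
    processed = [k for k, _ in sorted(new, key=lambda item: item[1])]
    new_bookmark = max((ts for _, ts in new), default=bookmark)
    return processed, new_bookmark
```

Rerunning with an unchanged listing processes nothing, which is the behavior the exam scenarios describe. Resetting the bookmark causes everything to be processed again.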

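💡 Glue worker type selection appears in cost and performance scenarios. G.1X is the usual choice for typical workloads, G.2X and larger suit memory-intensive jobs, and G.025X is intended for low-volume streaming jobs. The Flex execution class is a separate setting from the worker type: it runs non-urgent batch jobs on spare capacity where cost matters more than speed. Standard is a legacy worker type.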

💡 The difference between DynamicFrame and Spark DataFrame appears in scenarios about handling semi-structured and nested data. DynamicFrames handle schema inconsistencies more gracefully. DataFrames offer the full Spark API for complex transformations.

💡 Glue DataBrew is tested in scenarios where non-technical users need to prepare data without writing code. Know that DataBrew produces recipes that can be published and rerun, and that it integrates with Glue Data Catalog for output.

💡 Glue Data Quality rules are increasingly tested. Know the rule types — ColumnValues (check specific column values), Completeness (check for nulls), Uniqueness (check for duplicates), ReferentialIntegrity (check foreign key relationships).
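
To build intuition for what completeness and uniqueness rules measure, here is a plain-Python sketch of the two metrics. It does not use Glue. The definitions are assumptions of this sketch: completeness is the fraction of rows with a non-null value, and uniqueness is the fraction of non-null values that occur exactly once.

```python
from collections import Counter


def completeness(rows, column):
    """Fraction of rows whose value for `column` is not None."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 1.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


# Hypothetical sample with one missing email and one duplicated id.
SAMPLE = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "c@example.com"},
]
```

A ruleset then amounts to comparing such scores against thresholds, for example requiring completeness of `email` to be at least 0.95.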

Step 4 - DEA-C01 Domain 2 — Data Store Management (26%)

Design and manage the right data store for each type of data workload across the full spectrum of AWS data storage services. The second largest domain at 26%.

4-5 weeks

Amazon S3 as a data lake — Hive-style partitioning (year/month/day) for query performance, compaction strategies, columnar format benefits
Amazon Redshift — cluster architecture, RA3 node types with managed storage, Serverless versus provisioned
Redshift distribution styles - AUTO (default), EVEN, KEY (join column), ALL (small dimension tables) and performance implications
Redshift sort keys — compound versus interleaved, when sort keys accelerate queries, VACUUM and ANALYZE
Redshift Spectrum — querying S3 data from Redshift, external schemas, Lake Formation integration
Amazon Athena — serverless SQL on S3, partition projection for performance, workgroup configuration, query result caching
Athena query optimization — columnar formats (Parquet/ORC), partitioning, file size (too-small files problem), compression codecs
Amazon OpenSearch Service — full-text search, log analytics, vector search for AI workloads, integration with Kinesis
Amazon DynamoDB for operational data — partition key design, global secondary indexes, streams for change capture
Amazon ElastiCache — Redis versus Memcached for caching analytical query results, reducing Redshift load
Amazon Timestream — purpose-built time-series database, when to use over DynamoDB or Redshift
Data lake house patterns — combining S3, Glue Catalog, Athena, Redshift Spectrum for unified analytics

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Redshift distribution style selection is the most consistently tested storage configuration topic. Know why KEY distribution on join columns dramatically improves join performance by collocating matching rows on the same node. Know why ALL distribution is only appropriate for small, rarely-changing dimension tables.
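
The co-location effect of KEY distribution can be demonstrated with a toy model. The sketch below assigns rows to slices by hashing the distribution key. The CRC32 hash and the table contents are stand-ins chosen for this illustration and are not what Redshift uses internally.

```python
import zlib


def slice_for(key, slice_count):
    """Deterministically map a distribution key value to a slice index."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count


def distribute(rows, key_column, slice_count):
    """Place each row on the slice chosen by hashing its distribution key."""
    slices = {i: [] for i in range(slice_count)}
    for row in rows:
        slices[slice_for(row[key_column], slice_count)].append(row)
    return slices


# Hypothetical fact and dimension tables sharing customer_id as the key.
ORDERS = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
CUSTOMERS = [{"customer_id": c} for c in range(5)]

order_slices = distribute(ORDERS, "customer_id", 4)
customer_slices = distribute(CUSTOMERS, "customer_id", 4)
```

Because both tables hash the same column, every order sits on the same slice as its customer row, so the join needs no data movement between slices. Distributing `ORDERS` on `order_id` instead would break that property.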

💡 The Athena small-files problem appears in scenarios about poor Athena performance. Many small files in S3 (often produced by streaming ingestion) cause excessive metadata and request overhead and slow queries. Compaction using Glue or EMR to merge small files into larger Parquet files is the correct solution.
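
Compaction planning is essentially a grouping problem. The sketch below groups consecutive small files until an assumed target size is reached; the 128 MB default is an assumption for illustration, and the function only plans the groups. The actual merge would be done by a Glue or EMR job.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group consecutive file sizes so each group stays at or under target_mb.

    A single file larger than the target forms its own group.
    """
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, `plan_compaction([64, 64, 64, 10])` yields two output files instead of four inputs.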

💡 Redshift Serverless versus provisioned appears in cost optimization scenarios. Serverless is better for intermittent analytics workloads with unpredictable patterns. Provisioned clusters with Reserved Nodes are more cost-effective for consistent, predictable workloads.

💡 Use ExamOS for data store selection scenario practice that tests choosing the right storage service for described query patterns, data volumes, and access frequency requirements.

Step 5 - DEA-C01 Domain 3 Part A — Pipeline Orchestration and EMR (22% total)

Build and orchestrate complex multi-step data pipelines across AWS services. Data Operations and Support at 22% covers orchestration, performance tuning, and troubleshooting.

3-4 weeks

AWS Step Functions for data pipelines — state machine design, Map state for parallel processing, error handling with Catch and Retry, Express versus Standard workflows
Amazon MWAA (Managed Workflows for Apache Airflow) — DAG structure, operators for AWS services, environment configuration, when to choose over Step Functions
AWS Glue Workflows — orchestrating Glue-native steps, trigger types, dependency configuration
Amazon EMR - cluster architecture, node types (primary, core, task), Spot instances for task nodes
EMR Serverless — on-demand Spark and Hive without cluster management, application configuration
EMR on EKS — running Spark on Kubernetes, virtual clusters, when EMR on EKS versus EMR on EC2
Amazon EventBridge — event-driven pipeline triggers, scheduled rules, custom event patterns, cross-account events
AWS Lambda for pipeline logic — triggering pipelines from S3 events, lightweight data transformations, connector logic

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Step Functions versus MWAA (Airflow) is a design decision the exam tests. Step Functions is native AWS, simpler for AWS-centric workflows, better for event-driven patterns. MWAA is better when teams have existing Airflow expertise, complex DAG dependencies, or need Airflow's operator ecosystem.
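
The Retry and Catch error handling listed for Step Functions can be sketched as an Amazon States Language definition built from a Python dictionary. The job name, state names, and retry numbers are hypothetical; the field names (`Retry`, `Catch`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) and the Glue `startJobRun.sync` integration are part of the States Language.

```python
import json

STATE_MACHINE = {
    "Comment": "Hypothetical nightly ETL with retry and failure handling",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes the state wait for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-nightly-etl"},
            # Retry twice with exponential backoff: 60s, then 120s.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing after retries routes to the Fail state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PipelineFailed"}],
            "End": True,
        },
        "PipelineFailed": {
            "Type": "Fail",
            "Error": "EtlFailed",
            "Cause": "Glue job failed after retries",
        },
    },
}

definition = json.dumps(STATE_MACHINE, indent=2)
```

The resulting `definition` string is what would be supplied when creating a state machine. In a real pipeline the catch target would more likely notify someone before failing.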

💡 EMR Spot instance usage for task nodes appears in cost optimization scenarios. Primary and core nodes should use On-Demand: losing the primary node terminates the cluster, and losing core nodes risks losing HDFS data. Task nodes do not store HDFS data and can safely use Spot for significant savings.

💡 EventBridge scheduled rules for pipeline triggering appear in scenarios about replacing cron-based scheduling with managed event-driven scheduling. Know how to configure cron and rate expressions and how EventBridge connects to Step Functions, Glue, and Lambda targets.
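
As a quick reference for the cron and rate expressions mentioned above, the sketch below builds both forms as strings. The helper names are hypothetical. The rules they encode come from EventBridge: rate units are singular when the value is 1, cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), and day-of-month and day-of-week cannot both be specified, so one of them must be `?`.

```python
def rate_expression(value, unit):
    """Build a rate expression such as rate(5 minutes) or rate(1 hour)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    return f"rate({value} {unit if value == 1 else unit + 's'})"


def cron_expression(minutes, hours, day_of_month, month, day_of_week, year="*"):
    """Build a six-field cron expression, enforcing the '?' rule."""
    if (day_of_month == "?") == (day_of_week == "?"):
        raise ValueError("exactly one of day-of-month and day-of-week must be '?'")
    return f"cron({minutes} {hours} {day_of_month} {month} {day_of_week} {year})"


# Hypothetical schedules: every five minutes, and daily at 02:00.
every_five_minutes = rate_expression(5, "minute")
nightly = cron_expression("0", "2", "*", "*", "?")
```

A standard five-field Unix crontab line such as `0 2 * * *` is therefore not valid as-is, which is a common migration stumbling block.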

💡 Use ExamOS for orchestration scenario practice that tests choosing between Step Functions, MWAA, and Glue Workflows for described pipeline complexity and team requirements.

Step 6 - DEA-C01 Domain 3 Part B — Performance Tuning and Troubleshooting (22% total)

Optimize data pipeline performance, troubleshoot failures, and manage operational aspects of running data systems at scale.

2-3 weeks

Redshift performance tuning — query analysis with EXPLAIN, distribution key selection, sort key optimization, workload management (WLM) queues
Amazon Redshift Advisor — automated recommendations for distribution keys, sort keys, compression encoding
Athena performance — partition pruning, columnar formats, file size optimization, query result caching
Glue job performance — dynamic frame optimization, partition pushdown, job metrics in CloudWatch
Kinesis scaling — adding shards for increased throughput, merging shards to reduce cost, capacity mode selection
Pipeline failure handling — dead-letter queues for failed messages, Glue job retry configuration, Step Functions error handling
CloudWatch for data pipelines — custom metrics from ETL jobs, alarms on IteratorAge, dashboard design for pipeline health
AWS X-Ray for distributed tracing — tracing data pipeline requests across Lambda, API Gateway, and downstream services
Cost optimization for data pipelines — right-sizing EMR clusters, Glue DPU selection, S3 Intelligent-Tiering for data lake cost management

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Redshift WLM (Workload Management) configuration appears in scenarios about query concurrency and priority. Know the difference between automatic WLM (AWS manages queue allocation) and manual WLM (you define queue slots and memory allocation).

💡 The Athena partition pruning concept is critical for performance scenarios. Queries that filter on partition columns skip reading irrelevant data entirely. Queries that don't filter on partition columns scan the entire dataset. This is why Hive-style partitioning design matters at ingestion time.
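
Partition pruning can be illustrated with a toy planner that decides which objects a query would need to read. This is a conceptual sketch only: it parses partition values out of hypothetical Hive-style keys, whereas a real engine consults partition metadata in the catalog.

```python
def parse_partitions(key):
    """Extract column=value path segments from a Hive-style object key."""
    return dict(part.split("=", 1) for part in key.split("/") if "=" in part)


def prune(keys, **filters):
    """Return only the keys whose partition values match every filter."""
    return [
        k
        for k in keys
        if all(parse_partitions(k).get(col) == val for col, val in filters.items())
    ]


# Hypothetical data lake listing.
KEYS = [
    "events/year=2024/month=03/day=07/a.parquet",
    "events/year=2024/month=04/day=01/b.parquet",
    "events/year=2023/month=03/day=07/c.parquet",
]
```

With `prune(KEYS, year="2024", month="03")` only one object is read. With no filter on a partition column, `prune(KEYS)` returns every object, which mirrors a full scan.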

💡 Kinesis IteratorAge as a CloudWatch alarm threshold appears in operational monitoring scenarios. When IteratorAge exceeds an acceptable threshold, consumers are falling behind. The solution is adding consumers, processing more records per batch, or adding shards.
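
An alarm of this kind could be described with parameters like the following. This is a sketch: the stream name, alarm name, period, and threshold are assumptions, and the function only builds the parameter dictionary without calling AWS. The metric `GetRecords.IteratorAgeMilliseconds` in the `AWS/Kinesis` namespace is the CloudWatch metric behind IteratorAge.

```python
def iterator_age_alarm(stream_name, max_lag_seconds):
    """Build CloudWatch alarm parameters for consumer lag on one stream."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,               # evaluate one-minute datapoints
        "EvaluationPeriods": 5,     # lag must persist for five minutes
        "Threshold": max_lag_seconds * 1000,  # the metric is in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical stream with a five-minute lag tolerance.
params = iterator_age_alarm("example-clickstream", 300)
```

These keys match the keyword arguments of the boto3 CloudWatch client's `put_metric_alarm` call, so the dictionary could be unpacked into it in an environment with credentials configured.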

💡 Use ExamOS for performance tuning scenario practice that presents symptoms of poor pipeline performance and asks for the most appropriate optimization approach.

Step 7 - DEA-C01 Domain 4 — Data Security and Governance (18%)

Implement security controls, data governance frameworks, and compliance requirements for data systems on AWS. Security and Governance at 18% is consistently underweighted in preparation and consistently overrepresented in exam day surprises.

3-4 weeks

AWS Lake Formation — data lake governance, fine-grained access control (column-level, row-level, cell-level), data permissions model
Lake Formation integration with Glue Catalog and Athena — how permissions layer on top of IAM
AWS KMS for data encryption — CMKs for Redshift, S3, Glue, and Kinesis encryption, key rotation, cross-account key sharing
S3 encryption options — SSE-S3, SSE-KMS, SSE-C, DSSE-KMS and when each is appropriate for data lake scenarios
Redshift encryption — cluster encryption at rest with KMS, TLS for data in transit, column-level encryption
AWS CloudTrail for data audit — data events for S3 (expensive, selective), management events, Athena querying CloudTrail logs
VPC endpoints for data services - Gateway endpoints (S3 and DynamoDB, no additional charge), interface endpoints (Glue, Kinesis, and Redshift over AWS PrivateLink)
IAM data service permissions — Glue service roles, Kinesis data stream access, cross-account S3 bucket policies for data sharing
Amazon Macie - sensitive data discovery in S3, PII detection, compliance scanning for data lakes
AWS Glue Data Catalog resource policies — cross-account catalog access, resource-based policies
Data masking and tokenization — Glue transforms for PII masking, Redshift dynamic data masking

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Lake Formation is the most important security service to understand for DEA-C01 and the one most candidates skip in preparation. Lake Formation's permissions model sits on top of IAM — Lake Formation permissions AND IAM permissions both need to allow an action for it to succeed. This is the same two-door model as cross-account IAM in other contexts.

💡 Lake Formation column-level and row-level security appears in scenarios about different analyst groups needing access to different subsets of the same data lake. Lake Formation enables this without creating separate copies of the data.
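
The effect of combining column-level and row-level restrictions can be pictured with a small analog in plain Python. This is not how Lake Formation is configured or enforced (that happens in the service when a query engine requests data); the table, columns, and analyst group are invented for illustration.

```python
def apply_data_filter(rows, allowed_columns, row_predicate):
    """Return only permitted rows, projected down to permitted columns."""
    return [
        {col: row[col] for col in allowed_columns}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical shared table in the data lake.
ORDERS = [
    {"order_id": 1, "region": "EU", "card_number": "4111-0000"},
    {"order_id": 2, "region": "US", "card_number": "4222-0000"},
    {"order_id": 3, "region": "EU", "card_number": "4333-0000"},
]

# Hypothetical EU analyst group: EU rows only, no card numbers.
eu_view = apply_data_filter(
    ORDERS,
    allowed_columns=["order_id", "region"],
    row_predicate=lambda r: r["region"] == "EU",
)
```

Each analyst group gets a different view of the same stored data, so no per-group copy of the table is needed.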

💡 CloudTrail data events for S3 are expensive (additional charge per event) and not enabled by default. Scenarios asking how to audit all S3 object access for compliance are testing whether you know both that data events exist and that they must be explicitly enabled.

💡 Amazon Macie sensitive data discovery appears in scenarios about discovering where PII lives in a large S3 data lake before implementing access controls. Know what Macie can detect automatically and how to configure custom data identifiers for domain-specific sensitive data.

💡 Use ExamOS for security and governance scenario practice that tests Lake Formation permission design and encryption configuration decisions.

Step 8 - Advanced data patterns and exam readiness

Consolidate preparation through integrated data pipeline scenarios, advanced pattern recognition, and targeted gap closure before booking the exam.

2-3 weeks

Data mesh architecture on AWS — domain-oriented decentralized data ownership, data as a product, federated governance with Lake Formation
Modern data lake house patterns — Apache Iceberg and Apache Hudi on S3 for ACID transactions, time travel, schema evolution
Amazon Redshift data sharing — sharing data between Redshift clusters without copying, producer and consumer cluster configuration
Cross-account data architecture — Lake Formation cross-account sharing, S3 bucket policies for data consumers, centralized versus federated catalog
Amazon QuickSight — connecting to Redshift, Athena, and S3 for BI, SPICE for in-memory acceleration, embedding analytics
End-to-end pipeline design - S3 landing zone → Glue crawler → Glue ETL → S3 curated layer → Redshift Spectrum or Athena → QuickSight
Exam readiness — domain-weighted practice, timed full-length simulations, targeted gap closure

Certifications

AWS Certified Data Engineer - Associate (DEA-C01)

💡 Apache Iceberg support in Glue and Athena is increasingly tested in 2026 DEA-C01 scenarios. Iceberg provides ACID transactions on S3, schema evolution without rewriting data, and time travel queries. Know what problem Iceberg solves compared to plain Parquet files and when you would choose it.
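
Time travel is easier to reason about with a toy model of snapshots. The sketch below is a conceptual illustration, not Iceberg's metadata format: it assumes each commit records an immutable snapshot listing the data files that make up the table at that moment, and a time travel read picks the latest snapshot committed at or before the requested time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int   # epoch seconds (hypothetical values below)
    data_files: tuple   # files that make up the table at this snapshot


def snapshot_as_of(snapshots, timestamp):
    """Return the most recent snapshot committed at or before `timestamp`."""
    candidates = [s for s in snapshots if s.committed_at <= timestamp]
    if not candidates:
        raise LookupError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s.committed_at)


# Hypothetical history: an append, then a compaction that rewrites files.
HISTORY = [
    Snapshot(1, 100, ("a.parquet",)),
    Snapshot(2, 200, ("a.parquet", "b.parquet")),
    Snapshot(3, 300, ("c.parquet",)),
]
```

Plain Parquet files in a prefix have no such snapshot log, so a reader cannot tell which files belonged to the table at an earlier point or whether a write finished.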

💡 Redshift data sharing is tested in multi-team analytics scenarios where different teams need access to the same data without maintaining separate copies. The producer cluster grants access. The consumer cluster creates a database from the datashare.

💡 Consistent performance above 80% on Legend mode across five or more consecutive ExamOS sessions is the clearest DEA-C01 readiness signal. Given DEA-C01's reputation as the most technical associate-level AWS exam, this threshold represents genuine operational knowledge rather than conceptual familiarity.

Final step - Certification readiness and follow-on paths

DEA-C01 is the most technically demanding associate-level AWS certification and should not be underestimated. Data Ingestion and Transformation carries 34% of the exam and leans heavily on AWS Glue and Kinesis, so a candidate who knows these services only at a surface level will struggle regardless of how well they know the other domains. Before booking, ensure stable performance above 80% on timed ExamOS scenario practice across multiple sessions, with particular strength in the Glue and Lake Formation topics that cut across Domains 1, 2, and 4. After DEA-C01, the most natural follow-on credentials are MLA-C01 (AWS Certified Machine Learning Engineer - Associate) for engineers moving into ML data pipelines, AIP-C01 (AWS Certified Generative AI Developer - Professional) for engineers building generative AI data infrastructure, and SAP-C02 (AWS Certified Solutions Architect - Professional) for engineers moving toward senior architecture roles.

Certifications

AWS Certified Cloud Practitioner (CLF-C02)

AWS Certified Solutions Architect - Associate (SAA-C03)

AWS Certified Data Engineer - Associate (DEA-C01)

Realistic timeline

2 hours per day: approximately 5-7 months for the complete path
3-4 hours per day: approximately 3.5-5 months
Candidates who already hold SAA-C03: approximately 10-14 weeks for DEA-C01 specific preparation
Domain 1 (Data Ingestion and Transformation, 34%) should receive approximately one-third of total preparation time — weight your effort proportionally
Hands-on lab time building real data pipelines in AWS is essential for DEA-C01 — build at least one complete pipeline using Glue, Kinesis, Redshift, and Athena before your exam
Apache Iceberg, Lake Formation, and Glue Data Quality are newer topics that many study materials cover only thinly, so verify that your materials address them before relying on them
Consistency across daily sessions produces significantly better DEA-C01 outcomes than periodic marathon sessions
The most effective preparation pattern reported by recent DEA-C01 passers: daily scenario practice targeting the 34% ingestion domain first, then adding storage (26%), then operations (22%), then security (18%)
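
The proportional weighting above can be turned into a rough study-hour split. The sketch below uses the four domain weightings stated in this roadmap; the total hour budget passed in is whatever the candidate chooses, and the 300-hour figure is only an example.

```python
# Exam domain weightings (percent) as given in this roadmap.
DOMAIN_WEIGHTS = {
    "Data Ingestion and Transformation": 34,
    "Data Store Management": 26,
    "Data Operations and Support": 22,
    "Data Security and Governance": 18,
}


def split_hours(total_hours):
    """Allocate a study-hour budget across domains in proportion to weight."""
    return {
        domain: total_hours * weight / 100
        for domain, weight in DOMAIN_WEIGHTS.items()
    }


# Hypothetical budget of 300 domain-specific study hours.
plan = split_hours(300)
```

With a 300-hour budget this gives 102 hours to ingestion and transformation, 78 to data stores, 66 to operations, and 54 to security and governance.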
