Dev.to
Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy
TL;DR
In Part 1, Athena provided serverless read-only SQL. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake worked after configuration changes. In Part 4, DuckDB Lambda delivered the cheapest path. Part 5 shows the full-power Spark ETL path with write-back.
EMR Serverless Spark can read, transform, and write back Parquet files on FSx for ONTAP via S3 Access Points. Spark execution for a full ETL pipeline (read → aggregate → window → write) took 16 seconds; the whole job, including cold start, took 37 seconds and cost about $0.05.
No cluster to manage. No data to copy. No idle cost.
Quick Decision Guide:
Need Spark's full power (UDFs, ML, window functions) + write-back → EMR Serverless
Read-only SQL, no Spark needed → Use Athena (Part 1) or DuckDB Lambda (Part 4)
Need enterprise governance on results → Combine EMR write-back + Athena/Lake Formation for reads
GitHub: fsxn-lakehouse-integrations/integrations/emr-spark/
How to Read This Article
This article is:
A reproduction-focused validation report
Evidence from one environment (EMR Serverless emr-7.1.0, ap-northeast-1)
A deployment guide for EMR Serverless + FSx for ONTAP S3 AP
Read by role:
Data engineer: Architecture → Critical Findings → PySpark Job
Platform engineer: Deploy and Run → Gotchas → Cost Analysis
Partner / SA: Partner Decision Card → Discovery Questions
Security reviewer: Governance Impact → When to Use
Prerequisite Concepts
Before reading this article, it helps to understand:
EMR Serverless — a deployment option for EMR that runs Spark/Hive jobs without managing clusters
EMRFS — EMR's S3 filesystem implementation (s3:// prefix) that natively supports S3 AP aliases
S3A vs EMRFS — s3a:// (Hadoop's S3AFileSystem) does NOT support S3 AP aliases; always use s3://
PySpark — Python API for Apache Spark
Parquet timestamp resolution — Spark requires microsecond timestamps; nanosecond (pandas default) causes errors
Why EMR Serverless + FSx for ONTAP?
| Traditional ETL | This approach |
|---|---|
| Provision EMR cluster (minutes) | Submit job to EMR Serverless (seconds) |
| Copy data from NAS to S3 | Read NAS data in place via S3 AP |
| Pay for idle cluster | Pay only during job execution |
| Manage cluster scaling | Auto-scales per job |
| Write results to separate S3 bucket | Write results back to FSx for ONTAP |
EMR Serverless eliminates cluster management entirely. Combined with FSx S3 AP, you get a fully serverless ETL pipeline that reads and writes directly to your NAS storage.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│  EMR Serverless Application (Spark 3.5, emr-7.1.0)              │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  PySpark Job                                             │   │
│  │  ├── Read: spark.read.parquet("s3://<AP>/sensor-data/")  │   │
│  │  ├── Transform: GROUP BY, Window functions               │   │
│  │  └── Write: df.write.parquet("s3://<AP>/gold/output/")   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                    EMRFS (s3://)                                │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
           S3 Access Point (internet-origin)
                           │
                           ▼
         FSx for ONTAP Volume (Parquet files)
Key: EMR Serverless uses EMRFS (s3:// prefix) which natively supports S3 AP aliases. No special configuration needed.
Benchmark Results
| Operation | Duration | Notes |
|---|---|---|
| Read 10K rows | 6.78s | First read includes Spark initialization |
| GROUP BY aggregation | 2.52s | Status + AVG(temperature) |
| Window function | 1.19s | Moving average per device |
| Write-back to FSxN | 3.61s | Parquet output to S3 AP |
| Total Spark execution | 16.35s | All operations combined |
| Job total (with cold start) | 37s | Includes EMR Serverless startup |
Environment: EMR Serverless, emr-7.1.0, Spark 3.5, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.
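The timings in the table can be cross-checked with a little arithmetic. The sketch below uses only the figures reported above; reading the remainders as untimed Spark overhead and as cold start is an interpretation, since the source does not break them down further.

```python
# Step timings (seconds) copied from the benchmark table.
steps = {
    "read": 6.78,
    "group_by": 2.52,
    "window": 1.19,
    "write_back": 3.61,
}
spark_total = 16.35  # "Total Spark execution" row
job_total = 37.0     # "Job total (with cold start)" row

# Sum of the four individually timed operations.
measured = round(sum(steps.values()), 2)

# Time inside the Spark run that the four timers do not cover
# (assumed to be session setup/teardown; not itemized in the source).
untimed = round(spark_total - measured, 2)

# Time outside Spark execution, i.e. the approximate cold start.
startup = round(job_total - spark_total, 2)

print(f"timed steps: {measured}s, untimed: {untimed}s, startup: {startup}s")
```

The four timed steps add up to about 14.1 s, so roughly 2.25 s of the 16.35 s Spark total falls outside the timers, and the remaining 20.65 s of the job lines up with the ~20 s cold start quoted elsewhere in this article.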
Evidence Matrix
| Layer | Evidence | Result | Interpretation |
|---|---|---|---|
| EMR Serverless app | create-application | ✅ Pass | Spark 3.5 app created |
| IAM role | Execution role with S3 AP permissions | ✅ Pass | GetObject + PutObject on AP ARN |
| EMRFS read | spark.read.parquet("s3://AP/...") | ✅ Pass | EMRFS natively handles AP alias |
| Spark transforms | GROUP BY, Window, aggregation | ✅ Pass | Full Spark SQL works |
| Write-back | df.write.parquet("s3://AP/gold/...") | ✅ Pass | PutObject to FSxN via S3 AP |
| S3A (negative test) | spark.read.parquet("s3a://AP/...") | ❌ Expected fail | S3A cannot parse AP alias |
| Job lifecycle | start → running → success | ✅ Pass | 37s total including cold start |
Critical Finding: EMRFS vs S3A
This is the most important thing to know:
# ✅ WORKS — EMRFS natively supports S3 AP aliases
df = spark.read.parquet("s3://my-ap-alias-ext-s3alias/sensor-data/")
# ❌ FAILS — S3A cannot parse AP alias URLs
df = spark.read.parquet("s3a://my-ap-alias-ext-s3alias/sensor-data/")
# Error: IllegalArgumentException: Invalid S3 URI
Always use s3:// (EMRFS) with EMR. The s3a:// filesystem (Hadoop's S3AFileSystem) does not understand S3 AP alias format.
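If paths reach the job from configuration or from code written for other Spark environments, a small guard can normalize the scheme before any read or write. This is a minimal sketch: `to_emrfs_uri` is a hypothetical helper, not part of EMR or Spark, and it assumes the job runs on EMR, where `s3://` resolves to EMRFS.

```python
def to_emrfs_uri(uri: str) -> str:
    """Rewrite s3a:// or s3n:// URIs to s3://; leave s3:// URIs unchanged.

    Hypothetical helper for illustration. Raises ValueError for anything
    that is not an S3-style URI, so mistakes fail early and loudly.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in {"s3", "s3a", "s3n"}:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return f"s3://{rest}"


# Example with the alias used in this article.
path = to_emrfs_uri("s3a://my-ap-alias-ext-s3alias/sensor-data/")
print(path)
```

Calling the helper at the point where paths enter the job keeps the `s3://` rule in one place instead of relying on every caller to remember it.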
Critical Finding: Parquet Timestamp Compatibility
If you generate Parquet files with pandas or DuckDB, they default to nanosecond timestamps. Spark cannot read these:
AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))
Fix: Generate Parquet with microsecond timestamps:
import pyarrow as pa
import pyarrow.parquet as pq
# df is assumed to be an existing pandas DataFrame with datetime64[ns] columns
table = pa.Table.from_pandas(df)
# Convert nanosecond → microsecond before writing, keeping any timezone
new_fields = []
for field in table.schema:
    if pa.types.is_timestamp(field.type):
        new_fields.append(field.with_type(pa.timestamp('us', tz=field.type.tz)))
    else:
        new_fields.append(field)
new_schema = pa.schema(new_fields)
# safe=False lets the cast truncate sub-microsecond digits instead of raising
table = table.cast(new_schema, safe=False)
pq.write_table(table, 'output.parquet')
This affects cross-engine compatibility: if you write Parquet with DuckDB or pandas and want to read it with Spark (EMR, Glue, Databricks), always use microsecond resolution.
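Before handing a file to Spark, it can help to check whether it still carries nanosecond timestamps. The sketch below is duck-typed so it runs without pyarrow installed; `nanosecond_columns` is a hypothetical helper, and it relies on pyarrow printing nanosecond timestamp types as `timestamp[ns]` or `timestamp[ns, tz=...]`. It only inspects top-level fields, not timestamps nested inside structs or lists.

```python
def nanosecond_columns(schema):
    """Return names of top-level fields whose type is a nanosecond timestamp.

    Accepts any iterable of objects with .name and .type attributes,
    such as a pyarrow.Schema (iterating a Schema yields its fields).
    """
    return [f.name for f in schema if str(f.type).startswith("timestamp[ns")]


# Usage with pyarrow (assumes pyarrow is installed; the file name is illustrative):
#   import pyarrow.parquet as pq
#   bad = nanosecond_columns(pq.read_schema("sensor_data.parquet"))
#   if bad:
#       print("Rewrite as microseconds before reading with Spark:", bad)
```

Reading only the schema avoids loading the data, so this check is cheap enough to run as a gate before submitting the Spark job.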
Comparison with Other Engines in This Series
| Aspect | EMR Serverless | Athena (Part 1) | DuckDB Lambda (Part 4) | Snowflake (Part 3) | Databricks (Part 2) |
|---|---|---|---|---|---|
| Read from FSx for ONTAP S3 AP | ✅ Direct | ✅ Direct | ✅ Direct | ✅ With ARN | ⚠️ Partial (explicit path only) |
| Write-back to FSx for ONTAP | ✅ Best | ✅ CTAS | ✅ COPY TO | ⚠️ TBD | ❌ Blocked |
| Complex transforms (UDF, ML) | ✅ Best | ❌ SQL only | ❌ SQL only | ⚠️ Snowpark | ✅ Best (if data in UC) |
| Cold start | ~20s | ~2s | 1.9s | N/A | N/A (cluster always on) |
| Cost per job | ~$0.05 | $5/TB scanned | $0.00001 | Credits | DBU |
| Governance | IAM only | ✅ Glue + LF | ❌ None | ✅ Tags + RBAC | ❌ UC blocked on S3 AP |
| Distributed processing | ✅ Best | ✅ | ❌ | ✅ | ✅ Best (if data in UC) |
| Session policy issues | ✅ None | ✅ None | ✅ None | Resolved with ARN | ❌ Blocks table creation |
Why EMR Serverless instead of Databricks for FSx for ONTAP S3 AP?
EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively — no special configuration needed. In contrast, Databricks UC generates a restrictive session policy that blocks subdirectory listing, table creation, and write operations on FSx for ONTAP S3 AP paths (confirmed by Databricks Support, May 2026).
For teams that need Spark processing on FSx for ONTAP data today:
EMR Serverless: Direct read + write-back, no session policy issues, IAM governance
Databricks: Requires DataSync → S3 → UC (data copy), but provides full UC governance + Mosaic AI
Partner Decision Card
| Customer requirement | EMR Serverless today | Recommended path |
|---|---|---|
| Full Spark ETL with write-back | ✅ Best fit | Deploy EMR Serverless |
| Complex transforms (UDFs, ML pipelines) | ✅ Best fit | Deploy EMR Serverless |
| Large-scale distributed processing | ✅ Best fit | Deploy EMR Serverless |
| Read-only SQL analytics | ⚠️ Overkill | Use Athena or DuckDB Lambda |
| Sub-second query latency | ❌ 20s cold start | Use DuckDB Lambda |
| Enterprise governance on results | ⚠️ IAM only | Write to FSxN → read via Athena + Lake Formation |
| Delta/Iceberg table format | ❌ Write not supported on S3 AP | Write flat Parquet only. Iceberg read (pre-existing table) is theoretically possible via GetObject but not validated. |
| Scheduled batch ETL | ✅ Good fit | EMR Serverless + Step Functions |
Discovery Questions for Partners
When a customer asks about EMR Serverless + FSx for ONTAP S3 Access Points:
Does the workload require Spark-specific features (UDFs, ML, window functions, graph)?
Is write-back to FSx for ONTAP required? (EMR is the best write-back path)
What is the typical dataset size? (EMR shines at > 1 GB; for < 1 GB, DuckDB Lambda is cheaper)
Is the workload batch or interactive? (EMR has 20s cold start — not suitable for interactive)
Does the team have Spark expertise? (If not, Athena SQL may be simpler)
Is Delta/Iceberg table format required? (Not supported for write on FSx S3 AP)
What is the job frequency? (10 jobs/day = $15/month; 100 jobs/day = $150/month)
Is there an existing EMR or Glue investment? (Leverage existing IAM roles and scripts)
Governance Impact
| Capability | EMR Serverless | Notes |
|---|---|---|
| Authentication | IAM (execution role) | Standard AWS IAM |
| Authorization | S3 AP policy + IAM | No table/column-level control natively |
| Audit trail | CloudWatch Logs + CloudTrail | Job logs + S3 API calls logged |
| Data classification | ❌ None built-in | Can integrate with Lake Formation for reads |
| Row/column security | ❌ None built-in | Apply at read layer (Athena + LF) |
| Catalog integration | ⚠️ Optional (Glue Catalog) | Can register output in Glue for downstream governance |
Governance model: EMR Serverless uses IAM + S3 AP policy for access control. For enterprise governance, write results back to FSxN and read them via Athena + Lake Formation (Part 6). This gives you Spark's processing power with Lake Formation's governance on the output.
Recommended pattern for governed ETL:
FSxN (raw) → EMR Spark (transform) → FSxN (gold) → Athena + Lake Formation (governed read)
AI Readiness Score
| Pattern | Governance | Performance | AI Capability | Cost | Operational Simplicity | Overall |
|---|---|---|---|---|---|---|
| EMR Serverless Spark | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | 3.0 |
| Athena + Lake Formation | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★★★☆ | ★★★★☆ | 3.6 |
| DuckDB Lambda | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | 3.2 |
| Snowflake External Table | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | 3.4 |
Governance: Access control, audit, classification capabilities
Performance: Processing throughput for ETL workloads
AI Capability: Built-in ML/AI integration (Spark MLlib, etc.)
Cost: Total cost for batch ETL workloads
Operational Simplicity: Setup and maintenance effort
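Scoring methodology: Each dimension is rated by the author based on validated evidence, and each Overall value matches the unweighted mean of the five dimension scores. EMR ties with DuckDB Lambda for the highest Performance score and rates 3 of 5 on AI Capability (Spark MLlib, distributed ML), but scores lower on Governance (IAM-only) and Simplicity (requires Spark expertise).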
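The Overall column can be reproduced from the star counts. This sketch assumes Overall is the unweighted mean of the five dimensions; the article does not state the formula, but the numbers in the table are consistent with it.

```python
# Star counts per dimension, in table order:
# Governance, Performance, AI Capability, Cost, Operational Simplicity.
ratings = {
    "EMR Serverless Spark": [2, 4, 3, 3, 3],
    "Athena + Lake Formation": [5, 3, 2, 4, 4],
    "DuckDB Lambda": [1, 4, 1, 5, 5],
    "Snowflake External Table": [4, 2, 4, 3, 4],
}

# Unweighted mean, rounded to one decimal place as in the table.
overall = {
    name: round(sum(stars) / len(stars), 1)
    for name, stars in ratings.items()
}

for name, score in overall.items():
    print(f"{name}: {score}")
```

Because every dimension carries equal weight, a team that cares mostly about one dimension (governance, say) would do better to re-weight the list than to rely on the Overall column.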
Cost Analysis
| Component | Cost |
|---|---|
| EMR Serverless (37s job) | ~$0.05 |
| FSx for ONTAP (existing) | $0 incremental |
| S3 AP requests | $0 (included in FSx) |
| Script storage (S3) | < $0.01 |
Monthly estimate (10 jobs/day):
300 jobs × $0.05 = $15/month
Zero idle cost (application stopped between jobs)
Compare with:
EMR on EC2 (m5.xlarge cluster): ~$200/month (always-on)
Glue ETL (same workload): ~$0.44/job × 300 = $132/month
DuckDB Lambda: ~$1.10/month (but no distributed processing)
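The monthly figures above follow from a simple per-job multiplication, which makes it easy to plug in a different job frequency. The sketch below uses the per-job and per-month prices quoted in this section and assumes a 30-day month; the break-even point against the always-on cluster is derived from those same figures, not separately measured.

```python
def monthly_cost(jobs_per_day: int, cost_per_job: float, days: int = 30) -> float:
    """Monthly cost in USD for a fixed number of identical jobs per day."""
    return round(jobs_per_day * days * cost_per_job, 2)


# Prices quoted in this section (USD).
EMR_SERVERLESS_PER_JOB = 0.05
GLUE_PER_JOB = 0.44
EMR_ON_EC2_MONTHLY = 200.0

emr_monthly = monthly_cost(10, EMR_SERVERLESS_PER_JOB)
glue_monthly = monthly_cost(10, GLUE_PER_JOB)

# Jobs per month at which EMR Serverless would cost as much as the
# always-on m5.xlarge cluster estimate.
break_even_jobs = round(EMR_ON_EC2_MONTHLY / EMR_SERVERLESS_PER_JOB)

print(emr_monthly, glue_monthly, break_even_jobs)
```

At these prices the always-on cluster only becomes cheaper past about 4,000 jobs per month (roughly 133 per day), assuming every job resembles the 37-second benchmark job.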
The PySpark Job
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time
spark = SparkSession.builder.appName("FSxN-S3AP-Verification").getOrCreate()
S3_AP = "s3://<your-ap-alias-ext-s3alias>"
# --- Read ---
start = time.time()
df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
row_count = df.count()
print(f"Read: {row_count} rows in {time.time()-start:.2f}s")
# --- Transform: GROUP BY ---
start = time.time()
agg_df = df.groupBy("status").agg(
    F.count("*").alias("count"),
    F.avg("temperature").alias("avg_temp"),
    F.avg("humidity").alias("avg_humidity"),
)
agg_df.show()
print(f"GROUP BY: {time.time()-start:.2f}s")
# --- Transform: Window function ---
start = time.time()
window_spec = Window.partitionBy("device_id").orderBy("timestamp").rowsBetween(-5, 0)
window_df = df.withColumn("moving_avg_temp", F.avg("temperature").over(window_spec))
window_df.select("device_id", "timestamp", "temperature", "moving_avg_temp").show(5)
print(f"Window: {time.time()-start:.2f}s")
# --- Write-back ---
start = time.time()
agg_df.write.mode("overwrite").parquet(f"{S3_AP}/gold/emr_spark_output/")
print(f"Write-back: {time.time()-start:.2f}s")
spark.stop()
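The job repeats the same `start = time.time()` and `print` pattern for every stage. A small context manager can factor that out. This is a sketch with hypothetical names (`timed`, `timings`), not code from the repository. Keep in mind that Spark transformations are lazy, so a timer only measures real work when the block contains an action such as `count()`, `show()`, or a write.

```python
import time
from contextlib import contextmanager

# Collected wall-clock durations, keyed by stage label.
timings = {}


@contextmanager
def timed(label):
    """Record and print the wall-clock seconds spent in the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
        print(f"{label}: {timings[label]:.2f}s")


# Stand-in workload so the sketch runs without Spark.
with timed("demo"):
    total = sum(range(1_000_000))

# Inside the job it would wrap each stage, for example:
#   with timed("Read"):
#       df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
#       row_count = df.count()  # the action that triggers the read
```

Keeping the durations in a dictionary also makes it straightforward to write them out alongside the job output for later comparison between runs.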
Deploy and Run
# 1. Create EMR Serverless application
aws emr-serverless create-application \
  --name "fsxn-spark" \
  --release-label "emr-7.1.0" \
  --type "SPARK" \
  --region ap-northeast-1
# 2. Upload script to S3 (regular bucket, not S3 AP)
aws s3 cp scripts/spark_verification.py \
  s3://my-scripts-bucket/emr-scripts/
# 3. Submit job
aws emr-serverless start-job-run \
  --application-id <app-id> \
  --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-scripts-bucket/emr-scripts/spark_verification.py"
    }
  }'
# 4. Check status
aws emr-serverless get-job-run \
  --application-id <app-id> \
  --job-run-id <job-run-id>
# 5. Stop application (zero cost when stopped)
aws emr-serverless stop-application --application-id <app-id>
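Step 4 usually has to be repeated until the job finishes. The sketch below shows one way to poll for a terminal state from Python. `wait_for_job` is a hypothetical helper that takes any callable returning the current state string, so it can be tried without AWS access; the commented lines show how it might be wired to the boto3 `emr-serverless` client, assuming boto3 is installed and credentials are configured.

```python
import time

# Job run states after which no further transitions are expected.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Wiring to boto3 (APP_ID and RUN_ID are placeholders for your own IDs):
#   import boto3
#   client = boto3.client("emr-serverless", region_name="ap-northeast-1")
#   final = wait_for_job(
#       lambda: client.get_job_run(applicationId=APP_ID, jobRunId=RUN_ID)["jobRun"]["state"]
#   )
```

Passing the state lookup in as a callable keeps the polling logic independent of AWS, which makes it easy to exercise with a fake sequence of states before pointing it at a real job.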
Known Failure Signatures
| Symptom | Likely cause | Next step |
|---|---|---|
| IllegalArgumentException: Invalid S3 URI | Using s3a:// instead of s3:// | Switch to EMRFS (s3://) prefix |
| Illegal Parquet type: INT64 (TIMESTAMP(NANOS)) | Nanosecond timestamps in Parquet | Regenerate with microsecond resolution |
| Job stuck in PENDING > 60s | EMR Serverless capacity | Check service quotas; retry |
| AccessDeniedException on S3 AP | IAM role missing AP permissions | Add S3 AP ARN to execution role policy |
| Script not found | Script on S3 AP instead of regular S3 | Move script to regular S3 bucket |
| Write fails with 501 | Attempting Delta/Iceberg write | Use flat Parquet write only |
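The table can also be turned into a first-pass triage helper for job logs. This is a minimal sketch: the substrings are taken from the symptoms above, but matching on them is an assumption about how the errors appear in your driver logs, and `next_step` is a hypothetical name.

```python
# (substring to look for, suggested next step), taken from the table above.
SIGNATURES = [
    ("Invalid S3 URI", "Switch to EMRFS (s3://) prefix"),
    ("Illegal Parquet type", "Regenerate with microsecond resolution"),
    ("AccessDenied", "Add S3 AP ARN to execution role policy"),
]


def next_step(log_text):
    """Return the suggested next step for the first known signature found, else None."""
    for needle, advice in SIGNATURES:
        if needle in log_text:
            return advice
    return None


hint = next_step("AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))")
print(hint)
```

A lookup like this only covers failures already seen in this validation; an unmatched log returns `None` and still needs a human to read it.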
Gotchas and Lessons
1. Script must be on regular S3 (not S3 AP)
EMR Serverless loads the PySpark script from S3. The script location must be a regular S3 bucket, not an FSx S3 AP. The script then reads/writes data from/to the S3 AP.
2. IAM role needs both S3 bucket and S3 AP permissions
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::my-scripts-bucket/*",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>/object/*"
  ]
}
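The fragment above is a single statement; a complete IAM policy document wraps it in `Version` and `Statement`. The sketch below generates such a document from parameters. `build_policy` is a hypothetical helper, and the account ID and access point name in the example call are placeholders, not values from the validated environment.

```python
import json


def build_policy(scripts_bucket, region, account_id, ap_name):
    """Build a policy document covering the scripts bucket and the S3 access point."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{scripts_bucket}/*",  # script objects
                    ap_arn,                              # the access point itself
                    f"{ap_arn}/object/*",                # objects behind the access point
                ],
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_policy("my-scripts-bucket", "ap-northeast-1", "123456789012", "my-ap")
print(json.dumps(policy, indent=2))
```

Generating the document from parameters avoids hand-editing ARNs when the same role definition is reused across accounts or access points.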
3. Cold start is ~20 seconds
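EMR Serverless has a cold start of ~20 seconds before Spark begins executing. For latency-sensitive workloads, keep the application in the "started" state with pre-initialized capacity (estimated here at ~$0.01/hour; pre-initialized workers are billed for as long as the application is started, so check the rate for your worker configuration).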
4. No session policy issues
Unlike Databricks and Snowflake, EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively.
When to Use EMR Serverless vs Other Engines
| Requirement | EMR Serverless | Athena | DuckDB Lambda | Glue ETL |
|---|---|---|---|---|
| Read-only SQL | ✅ | ✅ Best | ✅ | ✅ |
| Write-back to FSxN | ✅ Best | ✅ (CTAS) | ✅ | ✅ |
| Complex Spark transformations | ✅ Best | ❌ | ❌ | ✅ |
| Sub-second latency | ❌ (cold start) | ❌ | ✅ Best | ❌ |
| Zero idle cost | ✅ | ✅ | ✅ | ✅ |
| Large-scale distributed | ✅ Best | ✅ | ❌ | ✅ |
What's Next
Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance (4-layer authorization) on NAS data
Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead
Previously in this series:
Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query
References
FSx for ONTAP S3 Access Points
AWS Tutorial: Run Spark jobs using Amazon EMR Serverless
EMR Serverless documentation
GitHub: fsxn-lakehouse-integrations
Key achievement: This validation established that EMR Serverless Spark provides the most capable read-write ETL path for FSx for ONTAP S3 AP data — full Spark SQL, UDFs, window functions, and write-back in 16 seconds of Spark execution at $0.05/job. No cluster management, no data copy, no session policy issues. The trade-off is cold start latency (20s) and lack of built-in governance — pair with Athena + Lake Formation for governed reads on the output.
All benchmarks are from a specific test environment (EMR Serverless emr-7.1.0, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads.
Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.