EMR Serverless and Redshift Spectrum Validate NAS Access via FSx for ONTAP S3 Access Points

AI-Generated Summary

1 sources

1 hour ago


Key Points

EMR Serverless Spark can read and write Parquet on FSx for ONTAP through S3 Access Points using EMRFS and an s3:// URI; s3a:// does not support S3 AP aliases.
The Spark ETL validation exercises full Spark transformations (including window functions) and write-back to S3 AP paths; the example reports ~16 seconds of Spark execution and ~37 seconds including cold start.
A compatibility requirement is that Parquet timestamps must use microsecond resolution when Spark reads the files; nanosecond timestamps fail in the validation.
Redshift Serverless with Redshift Spectrum queries FSx for ONTAP data via external schemas that reuse the same Glue Catalog as Athena, without additional table registration.
Adding Lake Formation enables fine-grained governance on the FSx for ONTAP data accessed through Spectrum and Athena, including table/column permissions, row filters, and LF-tag classification.

Two validation reports describe ways to access and govern Parquet data stored on FSx for ONTAP from AWS analytics services using S3 Access Points (S3 APs). In the EMR Serverless Spark report, the author shows a fully serverless Spark ETL pipeline that reads and writes Parquet files directly through S3 APs using the EMRFS filesystem with an s3:// URI format. The validation includes Spark transformations such as GROUP BY and window functions and reports ~16 seconds of Spark execution time for a test workload, with ~37 seconds total including EMR Serverless cold start. It also identifies compatibility constraints: using s3a:// with S3 AP aliases fails, and Parquet timestamps must be microsecond resolution for Spark to read them.

In the Redshift Spectrum plus Lake Formation report, the author validates that Redshift Serverless can query FSx for ONTAP data through Spectrum external schemas mapped to the same Glue Catalog tables used by Athena, and that Lake Formation adds enterprise governance on top. The report verifies table-level access, column-level permissions, row filtering, and LF-tag-based classification. The measurements show Redshift Serverless running about 2x slower than Athena on simple scans (for example, 4,277 ms versus 2,196 ms for COUNT(*) over 5M rows), and both Spectrum queries and governance behavior depend on the configured Glue and Lake Formation permissions.

How Outlets Covered This Story

DEV

Dev.to

Redshift Spectrum + Lake Formation — Enterprise Governance on NAS Data

TL;DR

In Part 1, Athena provided serverless SQL. In Part 2, Databricks hit boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda was cheapest. In Part 5, EMR Spark delivered full ETL. This Part 6 adds enterprise governance: Redshift Spectrum + Lake Formation provides 4-layer authorization on NAS data.

Redshift Serverless (8 RPU) successfully queries FSx for ONTAP data via S3 Access Points using the same Glue Catalog tables as Athena, with no additional data registration needed. Add Lake Formation on top for table-level, column-level, and tag-based access control.

| Query | Duration | Comparison with Athena |
|---|---|---|
| COUNT(*) 10K rows | 3,231 ms | Athena: ~1,500 ms |
| GROUP BY aggregation | 2,580 ms | Athena: ~1,800 ms |
| COUNT(*) 5M rows | 4,277 ms | Athena: 2,196 ms |

Redshift Serverless is ~2x slower than Athena for simple scans (cold start overhead), but Redshift adds DWH capabilities: federated JOINs with local tables, materialized views, and stored procedures.

Quick Decision Guide:

- Need DWH JOINs with NAS data → Redshift Spectrum (this article)
- Need enterprise governance (table/column/tag) → Add Lake Formation
- Need serverless SQL only (no DWH) → Use Athena (Part 1), which is faster and cheaper

GitHub: fsxn-lakehouse-integrations

How to Read This Article

This article is:

- A reproduction-focused validation report
- Evidence from one environment (Redshift Serverless 8 RPU, ap-northeast-1)
- A governance architecture guide for Lake Formation + FSx S3 AP

Read by role:

- DWH engineer: Architecture → Setup → Benchmark Results
- Security / governance reviewer: 4-Layer Authorization → Governance Impact
- Data engineer: When to Use → Comparison with Athena
- Partner / SA: Partner Decision Card → Discovery Questions

Prerequisite Concepts

Before reading this article, it helps to understand:

- Redshift Spectrum: Redshift's ability to query data in S3 via external schemas (Glue Catalog)
- Redshift Serverless: pay-per-query Redshift without cluster management (measured in RPU)
- Lake Formation: AWS's centralized governance layer for data lakes (table/column/tag permissions)
- Glue Catalog: AWS's metadata catalog (shared by Athena, Redshift Spectrum, EMR, Glue)
- External Schema: a Redshift schema that maps to a Glue Catalog database

Architecture

    ┌─────────────────────────────────────────────────────────────────┐
    │  Redshift Serverless (8 RPU)                                    │
    │                                                                 │
    │  ┌──────────────────────────────────────────────────────────┐   │
    │  │  SQL Query                                               │   │
    │  │  SELECT * FROM fsxn_spectrum.sensor_readings             │   │
    │  │  JOIN local_table ON ...                                 │   │
    │  └──────────────────────────────────────────────────────────┘   │
    │                          │                                      │
    │            External Schema (Glue Catalog)                       │
    └──────────────────────────┼──────────────────────────────────────┘
                               │
                  ┌────────────┼────────────┐
                  │            │            │
           Lake Formation  IAM Role  S3 Access Point
           (table/column   (API      (resource
            permissions)    access)   policy)
                  │            │            │
                  └────────────┼────────────┘
                               │
                               ▼
                     FSx for ONTAP Volume
                        (Parquet files)

4-Layer Authorization:

1. Lake Formation: who can access which tables/columns (fine-grained)
2. IAM: who can call which AWS APIs
3. S3 Access Point Policy: which principals can access this access point
4. File System: UNIX permissions on the underlying files

Benchmark Results

| Query | Duration (ms) | Rows | Notes |
|---|---|---|---|
| CREATE EXTERNAL SCHEMA | 240 | n/a | One-time setup |
| COUNT(*) 10K rows | 3,231 | 10,000 | Cold start overhead |
| GROUP BY + AVG aggregation | 2,580 | 3 groups | Status grouping |
| COUNT(*) 5M rows | 4,277 | 5,000,000 | Large scan |

Environment: Redshift Serverless 8 RPU, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Performance note: Redshift Serverless has cold start overhead (~2-3s for first query). Warm queries on provisioned Redshift clusters would be faster. For simple scans, Athena is ~2x faster because it has no DWH initialization overhead.
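As a quick check on the "~2x" figure, the ratios can be recomputed from the durations reported in the comparison table. This is a minimal sketch: the dictionary keys are hypothetical labels, and two of the three Athena figures are approximate in the source, so the output is indicative only.

```python
# Durations in milliseconds, copied from the comparison table in this article.
# The dictionary keys are illustrative labels, not names used by the author.
redshift_ms = {"count_10k": 3231, "group_by": 2580, "count_5m": 4277}
athena_ms = {"count_10k": 1500, "group_by": 1800, "count_5m": 2196}  # first two are approximate


def slowdown(redshift, athena):
    """Return Redshift duration divided by Athena duration for each query."""
    return {query: round(redshift[query] / athena[query], 2) for query in redshift}


ratios = slowdown(redshift_ms, athena_ms)
for query, ratio in ratios.items():
    print(f"{query}: Redshift Serverless took {ratio}x the Athena duration")
```

On these numbers the two COUNT(*) scans land near 2x, while the GROUP BY aggregation is closer to 1.4x.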
Evidence Matrix

| Layer | Evidence | Result | Interpretation |
|---|---|---|---|
| Redshift Serverless | Workgroup creation (8 RPU) | ✅ Pass | Serverless endpoint available |
| IAM role | Spectrum role with S3 AP permissions | ✅ Pass | GetObject + ListBucket on AP ARN |
| External Schema | CREATE EXTERNAL SCHEMA from Glue | ✅ Pass | Same catalog as Athena |
| Spectrum read (small) | COUNT(*) 10K rows | ✅ Pass | 3,231ms |
| Spectrum read (aggregation) | GROUP BY + AVG | ✅ Pass | 2,580ms |
| Spectrum read (large) | COUNT(*) 5M rows | ✅ Pass | 4,277ms |
| Lake Formation admin | put-data-lake-settings | ✅ Pass | Admin configured |
| Lake Formation grant | Table-level SELECT grant | ✅ Pass | Fine-grained permission works |
| LF column-level | SELECT on 3 permitted columns | ✅ Pass | Non-permitted column returns "cannot be resolved" |
| LF column deny | SELECT on denied column (humidity) | ✅ Pass (denied) | "Column cannot be resolved or requester is not authorized" |
| LF row filter | Data cells filter creation | ✅ Pass | Row filter (status='normal') + column filter combined |
| LF-Tag creation | sensitivity tag (public/internal/confidential) | ✅ Pass | Tag created and assigned to table |
| LF-Tag permission | Tag-based DESCRIBE+ASSOCIATE grant | ✅ Pass | Scalable governance via classification |
| Athena under LF | Query with LF permissions active | ✅ Pass | Same governance applies to Athena |

Setup

Step 1: Create External Schema (reuses Glue Catalog)

    CREATE EXTERNAL SCHEMA fsxn_spectrum
    FROM DATA CATALOG
    DATABASE 'fsxn_athena_verification'
    IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-redshift-spectrum-role'
    REGION 'ap-northeast-1';

Key insight: This uses the same Glue Catalog database that Athena uses. No additional table registration is needed: if Athena can query it, Redshift Spectrum can too.
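For teams that script this step, the statement can be generated from parameters rather than hand-edited. The sketch below only builds the SQL text; the function name and the identifier check are assumptions for illustration, and the account ID in the usage lines is a placeholder. Submitting the statement (for example through a SQL client or the Redshift Data API) is left out.

```python
def build_external_schema_sql(schema, glue_database, iam_role_arn, region):
    """Build a CREATE EXTERNAL SCHEMA statement like the one in Step 1.

    Hypothetical helper: it only formats text and performs a basic
    identifier check; it does not connect to Redshift.
    """
    for name in (schema, glue_database):
        # Allow only letters, digits, and underscores to avoid quoting issues.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in identifier: {name!r}")
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}';"
    )


sql = build_external_schema_sql(
    schema="fsxn_spectrum",
    glue_database="fsxn_athena_verification",
    iam_role_arn="arn:aws:iam::123456789012:role/fsxn-redshift-spectrum-role",  # placeholder account ID
    region="ap-northeast-1",
)
print(sql)
```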
Step 2: Query FSx for ONTAP Data

    -- Simple count
    SELECT COUNT(*) FROM fsxn_spectrum.sensor_readings;
    -- Result: 10000 (3,231ms)

    -- Aggregation
    SELECT status, COUNT(*), AVG(temperature)
    FROM fsxn_spectrum.sensor_readings
    GROUP BY status;
    -- Result: 3 groups (2,580ms)

    -- JOIN with local Redshift table (DWH capability)
    SELECT s.device_id, s.temperature, d.location
    FROM fsxn_spectrum.sensor_readings s
    JOIN device_master d ON s.device_id = d.device_id
    WHERE s.temperature > 35;

Step 3: Add Lake Formation Governance

    # Set Lake Formation admin
    aws lakeformation put-data-lake-settings \
      --data-lake-settings '{"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/<admin>"}]}'

    # Grant table-level SELECT to a role
    aws lakeformation grant-permissions \
      --principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:role/fsxn-analyst-role"}' \
      --resource '{"Table": {"DatabaseName": "fsxn_athena_verification", "Name": "sensor_readings"}}' \
      --permissions '["SELECT", "DESCRIBE"]'

Screenshot caption: Lake Formation Data Permissions, showing the fine-grained table-level SELECT grant for fsxn-athena-glue-role on the sensor_readings table.
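The article reports a column-level test in which humidity is the denied column, but it does not show the grant itself. One way such a grant could be expressed is a request that selects all columns except an exclusion list. The sketch below only builds the request dictionary in the shape used by the Lake Formation GrantPermissions API (Principal, Resource with TableWithColumns and ColumnWildcard, Permissions). The helper name and account ID are hypothetical, and this is an assumption about how the test could be reproduced, not the author's actual command.

```python
import json


def column_grant_request(principal_arn, database, table, excluded_columns):
    """Build a GrantPermissions-style request for SELECT on all columns
    except the excluded ones. Hypothetical helper; it makes no AWS call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except those listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


request = column_grant_request(
    principal_arn="arn:aws:iam::123456789012:role/fsxn-analyst-role",  # placeholder account ID
    database="fsxn_athena_verification",
    table="sensor_readings",
    excluded_columns=["humidity"],
)
print(json.dumps(request, indent=2))
```

A dictionary in this shape could be passed as keyword arguments to an AWS SDK call or adapted to the CLI flags shown in Step 3; validate the exact form against the current Lake Formation documentation before use.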
Lake Formation Governance Value

| Capability | Without Lake Formation | With Lake Formation |
|---|---|---|
| Table-level access | S3 AP policy (all-or-nothing per prefix) | Per-table SELECT/DESCRIBE grants |
| Column-level security | ❌ Not possible | ✅ Column-level grants + masking |
| Row-level filtering | ❌ Not possible | ✅ Data Cells Filter (row filter expressions) |
| Tag-based access control | ❌ Not possible | ✅ Classify data → auto-grant by tag (LF-Tags) |
| Centralized audit | CloudTrail (API-level) | Lake Formation audit (table/column-level) |
| Cross-account sharing | Share S3 AP (complex) | Share tables via Lake Formation (simple) |

Fine-Grained Governance: Verified (May 2026)

All three fine-grained Lake Formation capabilities have been validated on FSx for ONTAP S3 AP data:

| Feature | Test | Result |
|---|---|---|
| Column-level permission | Grant SELECT on 3 of 4 columns; query the denied column | ✅ Permitted columns return data; denied column (humidity) returns "cannot be resolved" |
| Row filter (Data Cells Filter) | Create filter status = 'normal'; query returns only matching rows | ✅ Only rows matching the filter expression are returned |
| LF-Tag | Create tag sensitivity: public/internal/confidential; assign to table | ✅ Tag created, assigned, and queryable via Lake Formation console |

Governance implication for regulated workloads: Lake Formation on FSx for ONTAP S3 AP data provides the same fine-grained access control as on native S3 data. Column masking, row filtering, and tag-based classification all work without data movement. This is the strongest AWS-native governance path for FSx for ONTAP data.

Iceberg + Lake Formation path: Glue Data Catalog supports Iceberg table registration natively. For transactional workloads requiring ACID guarantees: sync FSx for ONTAP data to S3 via DataSync → write as Iceberg table (EMR Spark) → register in Glue Catalog → query via Redshift Spectrum with full Lake Formation governance (column/row/tag). This provides the best of both worlds: FSx for ONTAP as source of truth + Iceberg ACID + Lake Formation governance.
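To make the combined effect of a row filter and a column exclusion concrete, the following sketch applies the same two rules to a few in-memory rows. It is a local illustration only, not Lake Formation itself: the sample rows and the function name are invented, and the only details taken from the article are the filter expression (status = 'normal') and the denied column (humidity).

```python
# Invented sample rows; only the column names and the filter come from the article.
rows = [
    {"device_id": "dev-1", "temperature": 21.5, "humidity": 40.0, "status": "normal"},
    {"device_id": "dev-2", "temperature": 36.2, "humidity": 55.0, "status": "warning"},
    {"device_id": "dev-3", "temperature": 22.1, "humidity": 42.5, "status": "normal"},
]


def apply_cells_filter(rows, row_predicate, excluded_columns):
    """Keep rows that satisfy the predicate, then drop the excluded columns."""
    visible = []
    for row in rows:
        if row_predicate(row):
            visible.append({k: v for k, v in row.items() if k not in excluded_columns})
    return visible


visible = apply_cells_filter(
    rows,
    row_predicate=lambda r: r["status"] == "normal",  # row filter: status = 'normal'
    excluded_columns={"humidity"},                     # column filter: hide humidity
)
for row in visible:
    print(row)
```

A principal governed by such a filter would see only the two "normal" rows, without the humidity field, which mirrors the behavior the validation reports.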
Enterprise governance use cases:

- Healthcare: Column-level masking of PHI fields (e.g., hide patient_name from analysts)
- Finance: Row-level filtering by business unit (each team sees only their data)
- Public sector: LF-Tag classification enforcement (sensitivity: public/internal/confidential)

Comparison with Other Engines in This Series

| Aspect | Redshift Spectrum | Athena (Part 1) | DuckDB Lambda (Part 4) | EMR Spark (Part 5) |
|---|---|---|---|---|
| Query latency (5M rows) | 4,277ms | 2,196ms | N/A (memory limit) | 6,780ms |
| DWH JOINs with local tables | ✅ Best | ❌ | ❌ | ❌ |
| Lake Formation governance | ✅ | ✅ | ❌ | ⚠️ Optional |
| Materialized views | ✅ | ❌ | ❌ | ❌ |
| Stored procedures | ✅ | ❌ | ❌ | ❌ |
| Zero idle cost | ✅ (Serverless) | ✅ | ✅ | ✅ |
| Write-back to FSxN | ❌ (results stay in Redshift) | ✅ CTAS | ✅ COPY TO | ✅ Best |
| Cold start | ~3s (Serverless) | ~2s | 1.9s | 20s |
| Cost model | RPU-seconds | $/TB scanned | $/invocation | $/job |

Partner Decision Card

| Customer requirement | Redshift Spectrum + LF today | Recommended path |
|---|---|---|
| JOIN NAS data with DWH tables | ✅ Best fit | Redshift Spectrum external schema |
| Enterprise governance (table/column/tag) | ✅ Best fit | Add Lake Formation |
| Existing Redshift investment | ✅ Natural extension | Add external schema to existing cluster |
| Serverless SQL only (no DWH) | ⚠️ Overkill | Use Athena (faster, cheaper for simple queries) |
| Write-back to FSxN | ❌ Not supported | Use EMR Serverless (Part 5) |
| Sub-second latency | ❌ Cold start overhead | Use DuckDB Lambda (Part 4) |
| Cross-account data sharing | ✅ Lake Formation sharing | Configure LF cross-account grants |
| Column-level masking for compliance | ✅ Lake Formation | Configure column-level permissions |

Discovery Questions for Partners

When a customer asks about Redshift Spectrum + Lake Formation + FSx for ONTAP S3 AP:

1. Does the customer already have a Redshift cluster or Serverless workgroup? (If yes, adding Spectrum is trivial)
2. Do they need to JOIN NAS data with existing DWH tables? (This is Redshift Spectrum's unique value)
3. Is table/column-level governance required? (Lake Formation adds this layer)
4. Is the workload read-only analytics, or does it need write-back? (Spectrum is read-only from external data)
5. What is the query frequency? (For < 10 queries/day, Athena is cheaper)
6. Is cross-account data sharing needed? (Lake Formation simplifies this)
7. Are there compliance requirements for column-level masking? (Lake Formation provides this)
8. What is the acceptable query latency? (Redshift Serverless has ~3s cold start)

Governance Impact Summary

| Access path | Authorization layers | Auditability | Production suitability |
|---|---|---|---|
| Redshift Spectrum (no LF) | IAM + S3 AP + File System (3 layers) | Medium (CloudTrail) | Good for non-regulated workloads |
| Redshift Spectrum + Lake Formation | LF + IAM + S3 AP + File System (4 layers) | High (LF audit + CloudTrail) | Recommended for regulated workloads |
| Athena + Lake Formation | LF + IAM + S3 AP + File System (4 layers) | High (LF audit + CloudTrail) | Recommended for serverless regulated workloads |

Key insight: Redshift Spectrum and Athena share the same Glue Catalog and Lake Formation permissions. Governance configured for one automatically applies to the other. This means you can use EMR Spark for write-back, register output in Glue, apply Lake Formation permissions, and query from both Athena and Redshift Spectrum with the same governance.

AI Readiness Score

| Pattern | Governance | Performance | AI Capability | Cost | Operational Simplicity | Overall |
|---|---|---|---|---|---|---|
| Redshift Spectrum + LF | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ | 3.2 |
| Athena + Lake Formation | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★★★☆ | ★★★★☆ | 3.6 |
| Snowflake External Table | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | 3.4 |
| DuckDB Lambda | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | 3.2 |
| EMR Serverless Spark | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | 3.0 |

Scoring methodology: Redshift Spectrum + LF scores highest on Governance (same as Athena + LF) but lower on Cost and Simplicity due to RPU pricing and DWH management overhead. Choose Redshift Spectrum when DWH JOINs are required; choose Athena when serverless SQL is sufficient.
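The monthly Redshift Serverless estimate in the Cost Analysis section of this article (100 queries/day at 5 seconds each on 8 RPU, $0.36 per RPU-hour) can be reproduced with a few lines of arithmetic. This is a sketch under the article's stated inputs: the function name is hypothetical, a 30-day month is assumed, and any minimum billing increment the service may apply is ignored.

```python
def redshift_serverless_cost(queries_per_day, avg_seconds, rpu, price_per_rpu_hour, days=30):
    """Return (daily, monthly) cost in dollars from RPU-seconds consumed.

    Hypothetical helper; assumes per-second billing with no minimum charge.
    """
    rpu_hours_per_day = queries_per_day * avg_seconds * rpu / 3600
    daily = rpu_hours_per_day * price_per_rpu_hour
    return daily, daily * days


daily, monthly = redshift_serverless_cost(
    queries_per_day=100, avg_seconds=5, rpu=8, price_per_rpu_hour=0.36
)
print(f"daily: ${daily:.2f}, monthly: ${monthly:.2f}")
```

Changing avg_seconds or queries_per_day shows how quickly the estimate scales for heavier workloads.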
Cost Analysis

| Component | Cost |
|---|---|
| Redshift Serverless (8 RPU, per query) | ~$0.36/RPU-hour (billed per second) |
| Redshift Serverless (idle) | $0 (scales to zero) |
| Lake Formation | $0 (no additional charge) |
| Glue Catalog | $1/100K objects/month |
| FSx for ONTAP (existing) | $0 incremental |

Monthly estimate (100 queries/day, avg 5s each):

100 queries × 5s × 8 RPU × $0.36/RPU-hour ÷ 3600 = ~$0.40/day = ~$12/month

Compare with:

- Athena (same queries): ~$5/TB × data scanned
- DuckDB Lambda: ~$1.10/month (but no DWH JOINs)

When Redshift Spectrum is cost-justified: when you already have Redshift and need to JOIN NAS data with local tables. The marginal cost of adding Spectrum queries is low.

When to Use (and When Not To)

Use Redshift Spectrum + Lake Formation when:

- Customer already has Redshift (adding Spectrum is trivial)
- Need to JOIN NAS data with DWH tables
- Enterprise governance (table/column/tag) is required
- Cross-account data sharing is needed
- Compliance requires column-level masking

Don't use when:

- Simple serverless SQL is sufficient (use Athena, which is faster and cheaper)
- Need write-back to FSxN (use EMR Serverless)
- Need sub-second latency (use DuckDB Lambda)
- No existing Redshift investment (Athena is simpler to start)
- Dataset is small and ad-hoc (DuckDB Lambda is cheapest)

Known Failure Signatures

| Symptom | Likely cause | Next step |
|---|---|---|
| permission denied for schema | IAM role not associated with Redshift | Associate IAM role with Redshift namespace |
| S3 access denied on external table | IAM role missing S3 AP permissions | Add S3 AP ARN to role policy |
| External schema creation fails | Glue database doesn't exist | Create database in Glue Catalog first (or use Athena) |
| Query returns 0 rows | Table location doesn't match S3 AP path | Verify Glue table LOCATION uses AP alias |
| Spectrum is not supported | Using provisioned cluster without Spectrum | Enable Spectrum or use Serverless |
| Lake Formation permission denied | LF permissions not granted | Grant SELECT via aws lakeformation grant-permissions |

What's Next

Part 7: Table Format Boundaries. Why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead (critical knowledge for architecture decisions).

Previously in this series:

- Part 1: Athena - Query NAS Data In Place
- Part 2: Databricks - A Layer-by-Layer Validation of Observed Boundaries
- Part 3: Snowflake - From 'Access Denied' to Working External Tables
- Part 4: DuckDB Lambda - Serverless Analytics for $0.00001/Query
- Part 5: EMR Spark - Read-Write ETL on NAS Data

References

- Redshift Spectrum documentation
- AWS Lake Formation documentation
- FSx for ONTAP S3 Access Points
- Redshift Serverless documentation
- GitHub: fsxn-lakehouse-integrations

Key achievement: This validation established that Redshift Spectrum + Lake Formation provides the strongest enterprise governance path for FSx for ONTAP S3 AP data: 4-layer authorization (Lake Formation → IAM → S3 AP → File System), table/column-level access control, and seamless sharing of Glue Catalog with Athena. The same governance configuration applies to both Athena and Redshift Spectrum queries, enabling a unified governance model across query engines.

All benchmarks are from a specific test environment (Redshift Serverless 8 RPU, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Performance improves with warm queries and provisioned clusters.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

2 hours ago

DEV

Dev.to

Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

TL;DR

In Part 1, Athena provided serverless read-only SQL. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with config. In Part 4, DuckDB Lambda delivered the cheapest path. This Part 5 shows the full-power Spark ETL path with write-back.

EMR Serverless Spark can read, transform, and write back Parquet files on FSx for ONTAP via S3 Access Points. Total Spark execution: 16 seconds for a full ETL pipeline (read → aggregate → window → write). Job total including cold start: 37 seconds. Cost: ~$0.05 per job. No cluster to manage. No data to copy. No idle cost.

Quick Decision Guide:

- Need Spark's full power (UDFs, ML, window functions) + write-back → EMR Serverless
- Read-only SQL, no Spark needed → Use Athena (Part 1) or DuckDB Lambda (Part 4)
- Need enterprise governance on results → Combine EMR write-back + Athena/Lake Formation for reads

GitHub: fsxn-lakehouse-integrations/integrations/emr-spark/

How to Read This Article

This article is:

- A reproduction-focused validation report
- Evidence from one environment (EMR Serverless emr-7.1.0, ap-northeast-1)
- A deployment guide for EMR Serverless + FSx for ONTAP S3 AP

Read by role:

- Data engineer: Architecture → Critical Findings → PySpark Job
- Platform engineer: Deploy and Run → Gotchas → Cost Analysis
- Partner / SA: Partner Decision Card → Discovery Questions
- Security reviewer: Governance Impact → When to Use

Prerequisite Concepts

Before reading this article, it helps to understand:

- EMR Serverless: a deployment option for EMR that runs Spark/Hive jobs without managing clusters
- EMRFS: EMR's S3 filesystem implementation (s3:// prefix) that natively supports S3 AP aliases
- S3A vs EMRFS: s3a:// (Hadoop's S3AFileSystem) does NOT support S3 AP aliases; always use s3://
- PySpark: Python API for Apache Spark
- Parquet timestamp resolution: Spark requires microsecond timestamps; nanosecond (pandas default) causes errors

Why EMR Serverless + FSx for ONTAP?
| Traditional ETL | This approach |
|---|---|
| Provision EMR cluster (minutes) | Submit job to EMR Serverless (seconds) |
| Copy data from NAS to S3 | Read NAS data in place via S3 AP |
| Pay for idle cluster | Pay only during job execution |
| Manage cluster scaling | Auto-scales per job |
| Write results to separate S3 bucket | Write results back to FSx for ONTAP |

EMR Serverless eliminates cluster management entirely. Combined with FSx S3 AP, you get a fully serverless ETL pipeline that reads and writes directly to your NAS storage.

Architecture

    ┌─────────────────────────────────────────────────────────────────┐
    │  EMR Serverless Application (Spark 3.5, emr-7.1.0)              │
    │                                                                 │
    │  ┌──────────────────────────────────────────────────────────┐   │
    │  │  PySpark Job                                             │   │
    │  │  ├── Read: spark.read.parquet("s3://<AP>/sensor-data/")  │   │
    │  │  ├── Transform: GROUP BY, Window functions               │   │
    │  │  └── Write: df.write.parquet("s3://<AP>/gold/output/")   │   │
    │  └──────────────────────────────────────────────────────────┘   │
    │                          │                                      │
    │                    EMRFS (s3://)                                │
    └──────────────────────────┼──────────────────────────────────────┘
                               │
                               ▼
               S3 Access Point (internet-origin)
                               │
                               ▼
                     FSx for ONTAP Volume
                        (Parquet files)

Key: EMR Serverless uses EMRFS (s3:// prefix), which natively supports S3 AP aliases. No special configuration needed.

Benchmark Results

| Operation | Duration | Notes |
|---|---|---|
| Read 10K rows | 6.78s | First read includes Spark initialization |
| GROUP BY aggregation | 2.52s | Status + AVG(temperature) |
| Window function | 1.19s | Moving average per device |
| Write-back to FSxN | 3.61s | Parquet output to S3 AP |
| Total Spark execution | 16.35s | All operations combined |
| Job total (with cold start) | 37s | Includes EMR Serverless startup |

Environment: EMR Serverless, emr-7.1.0, Spark 3.5, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.
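The cold start figure quoted elsewhere in the article (~20 seconds) follows from the benchmark numbers: it is the job total minus the reported Spark execution time. The sketch below reproduces that subtraction and also sums the four listed stages. The variable names are illustrative; all figures come from the benchmark table, and the table does not break down the remainder between the stage sum and the reported total.

```python
# Figures in seconds, copied from the benchmark table in this article.
stage_seconds = {
    "read_10k_rows": 6.78,
    "group_by": 2.52,
    "window": 1.19,
    "write_back": 3.61,
}
total_spark_seconds = 16.35   # "Total Spark execution" as reported
job_total_seconds = 37.0      # "Job total (with cold start)" as reported

# Startup overhead implied by the two reported totals.
startup_seconds = job_total_seconds - total_spark_seconds

# Sum of the individually listed stages, and what the table leaves unattributed.
listed_seconds = sum(stage_seconds.values())
unattributed_seconds = total_spark_seconds - listed_seconds

print(f"implied startup overhead: {startup_seconds:.2f}s")
print(f"sum of listed stages:     {listed_seconds:.2f}s")
print(f"not itemized in table:    {unattributed_seconds:.2f}s")
```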
Evidence Matrix

| Layer | Evidence | Result | Interpretation |
|---|---|---|---|
| EMR Serverless app | create-application | ✅ Pass | Spark 3.5 app created |
| IAM role | Execution role with S3 AP permissions | ✅ Pass | GetObject + PutObject on AP ARN |
| EMRFS read | spark.read.parquet("s3://AP/...") | ✅ Pass | EMRFS natively handles AP alias |
| Spark transforms | GROUP BY, Window, aggregation | ✅ Pass | Full Spark SQL works |
| Write-back | df.write.parquet("s3://AP/gold/...") | ✅ Pass | PutObject to FSxN via S3 AP |
| S3A (negative test) | spark.read.parquet("s3a://AP/...") | ❌ Expected fail | S3A cannot parse AP alias |
| Job lifecycle | start → running → success | ✅ Pass | 37s total including cold start |

Critical Finding: EMRFS vs S3A

This is the most important thing to know:

    # ✅ WORKS: EMRFS natively supports S3 AP aliases
    df = spark.read.parquet("s3://my-ap-alias-ext-s3alias/sensor-data/")

    # ❌ FAILS: S3A cannot parse AP alias URLs
    df = spark.read.parquet("s3a://my-ap-alias-ext-s3alias/sensor-data/")
    # Error: IllegalArgumentException: Invalid S3 URI

Always use s3:// (EMRFS) with EMR. The s3a:// filesystem (Hadoop's S3AFileSystem) does not understand the S3 AP alias format.

Critical Finding: Parquet Timestamp Compatibility

If you generate Parquet files with pandas or DuckDB, they default to nanosecond timestamps. Spark cannot read these:

    AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))

Fix: Generate Parquet with microsecond timestamps:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # df is an existing pandas DataFrame with nanosecond timestamp columns
    table = pa.Table.from_pandas(df)

    # Convert nanosecond → microsecond before writing, keeping any time zone
    new_fields = []
    for field in table.schema:
        if pa.types.is_timestamp(field.type):
            new_fields.append(field.with_type(pa.timestamp('us', tz=field.type.tz)))
        else:
            new_fields.append(field)

    # cast() is a safe cast by default and raises if sub-microsecond precision would be lost
    table = table.cast(pa.schema(new_fields))
    pq.write_table(table, 'output.parquet')

This affects cross-engine compatibility: if you write Parquet with DuckDB or pandas and want to read it with Spark (EMR, Glue, Databricks), always use microsecond resolution.
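Because the s3a:// failure only surfaces at job run time, a small guard in job code can catch the wrong scheme earlier. The following is a minimal sketch, assuming paths are passed around as plain strings; the function name is hypothetical and the alias in the usage lines is the placeholder from the example above.

```python
from urllib.parse import urlparse


def to_emrfs_uri(uri):
    """Return the URI with the s3:// (EMRFS) scheme.

    Rewrites s3a:// to s3:// and rejects any other scheme.
    Hypothetical helper for illustration; it does not contact S3.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a"):
        raise ValueError(f"expected an s3:// or s3a:// URI, got: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket or access point alias in: {uri!r}")
    return f"s3://{parsed.netloc}{parsed.path}"


fixed = to_emrfs_uri("s3a://my-ap-alias-ext-s3alias/sensor-data/")
print(fixed)
```

Calling the helper on every input and output path before handing it to spark.read or df.write keeps the scheme choice in one place.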
Comparison with Other Engines in This Series

| Aspect | EMR Serverless | Athena (Part 1) | DuckDB Lambda (Part 4) | Snowflake (Part 3) | Databricks (Part 2) |
|---|---|---|---|---|---|
| Read from FSx for ONTAP S3 AP | ✅ Direct | ✅ Direct | ✅ Direct | ✅ With ARN | ⚠️ Partial (explicit path only) |
| Write-back to FSx for ONTAP | ✅ Best | ✅ CTAS | ✅ COPY TO | ⚠️ TBD | ❌ Blocked |
| Complex transforms (UDF, ML) | ✅ Best | ❌ SQL only | ❌ SQL only | ⚠️ Snowpark | ✅ Best (if data in UC) |
| Cold start | 20s | ~2s | 1.9s | N/A | N/A (cluster always on) |
| Cost per job | $0.05 | ~$5/TB scanned | $0.00001 | Credits | DBU |
| Governance | IAM only | ✅ Glue + LF | ❌ None | ✅ Tags + RBAC | ❌ UC blocked on S3 AP |
| Distributed processing | ✅ Best | ✅ | ❌ | ✅ | ✅ Best (if data in UC) |
| Session policy issues | ❌ None | ❌ None | ❌ None | Resolved with ARN | ❌ Blocks table creation |

Why EMR Serverless instead of Databricks for FSx for ONTAP S3 AP?

EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively, with no special configuration needed. In contrast, Databricks UC generates a restrictive session policy that blocks subdirectory listing, table creation, and write operations on FSx for ONTAP S3 AP paths (confirmed by Databricks Support, May 2026).

For teams that need Spark processing on FSx for ONTAP data today:

- EMR Serverless: Direct read + write-back, no session policy issues, IAM governance
- Databricks: Requires DataSync → S3 → UC (data copy), but provides full UC governance + Mosaic AI

Partner Decision Card

| Customer requirement | EMR Serverless today | Recommended path |
|---|---|---|
| Full Spark ETL with write-back | ✅ Best fit | Deploy EMR Serverless |
| Complex transforms (UDFs, ML pipelines) | ✅ Best fit | Deploy EMR Serverless |
| Large-scale distributed processing | ✅ Best fit | Deploy EMR Serverless |
| Read-only SQL analytics | ⚠️ Overkill | Use Athena or DuckDB Lambda |
| Sub-second query latency | ❌ 20s cold start | Use DuckDB Lambda |
| Enterprise governance on results | ⚠️ IAM only | Write to FSxN → read via Athena + Lake Formation |
| Delta/Iceberg table format | ❌ Write not supported on S3 AP | Write flat Parquet only. Iceberg read (pre-existing table) is theoretically possible via GetObject but not validated. |
| Scheduled batch ETL | ✅ Good fit | EMR Serverless + Step Functions |

Discovery Questions for Partners

When a customer asks about EMR Serverless + FSx for ONTAP S3 Access Points:

1. Does the workload require Spark-specific features (UDFs, ML, window functions, graph)?
2. Is write-back to FSx for ONTAP required? (EMR is the best write-back path)
3. What is the typical dataset size? (EMR shines at > 1 GB; for < 1 GB, DuckDB Lambda is cheaper)
4. Is the workload batch or interactive? (EMR has 20s cold start, so it is not suitable for interactive use)
5. Does the team have Spark expertise? (If not, Athena SQL may be simpler)
6. Is Delta/Iceberg table format required? (Not supported for write on FSx S3 AP)
7. What is the job frequency? (10 jobs/day = $15/month; 100 jobs/day = $150/month)
8. Is there an existing EMR or Glue investment? (Leverage existing IAM roles and scripts)

Governance Impact

| Capability | EMR Serverless | Notes |
|---|---|---|
| Authentication | IAM (execution role) | Standard AWS IAM |
| Authorization | S3 AP policy + IAM | No table/column-level control natively |
| Audit trail | CloudWatch Logs + CloudTrail | Job logs + S3 API calls logged |
| Data classification | ❌ None built-in | Can integrate with Lake Formation for reads |
| Row/column security | ❌ None built-in | Apply at read layer (Athena + LF) |
| Catalog integration | ⚠️ Optional (Glue Catalog) | Can register output in Glue for downstream governance |

Governance model: EMR Serverless uses IAM + S3 AP policy for access control. For enterprise governance, write results back to FSxN and read them via Athena + Lake Formation (Part 6). This gives you Spark's processing power with Lake Formation's governance on the output.
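The job-frequency figures in the discovery questions above (10 jobs/day = $15/month; 100 jobs/day = $150/month) come from multiplying the reported ~$0.05 per job by a 30-day month. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the per-job cost is the single measured figure from this report, so real jobs with different durations or resources will differ.

```python
def monthly_job_cost(jobs_per_day, cost_per_job=0.05, days=30):
    """Estimate monthly spend as jobs/day x days x cost per job.

    Hypothetical helper; cost_per_job defaults to the ~$0.05 reported
    for the 37-second validation job.
    """
    return jobs_per_day * days * cost_per_job


for jobs in (10, 100):
    print(f"{jobs} jobs/day: ${monthly_job_cost(jobs):.2f}/month")
```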
Recommended pattern for governed ETL:

    FSxN (raw) → EMR Spark (transform) → FSxN (gold) → Athena + Lake Formation (governed read)

AI Readiness Score

| Pattern | Governance | Performance | AI Capability | Cost | Operational Simplicity | Overall |
|---|---|---|---|---|---|---|
| EMR Serverless Spark | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | 3.0 |
| Athena + Lake Formation | ★★★★★ | ★★★☆☆ | ★★☆☆☆ | ★★★★☆ | ★★★★☆ | 3.6 |
| DuckDB Lambda | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ | 3.2 |
| Snowflake External Table | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | 3.4 |

- Governance: Access control, audit, classification capabilities
- Performance: Processing throughput for ETL workloads
- AI Capability: Built-in ML/AI integration (Spark MLlib, etc.)
- Cost: Total cost for batch ETL workloads
- Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. EMR scores highest on Performance and AI Capability (Spark MLlib, distributed ML) but lower on Governance (IAM-only) and Simplicity (requires Spark expertise).

Cost Analysis

| Component | Cost |
|---|---|
| EMR Serverless (37s job) | ~$0.05 |
| FSx for ONTAP (existing) | $0 incremental |
| S3 AP requests | $0 (included in FSx) |
| Script storage (S3) | < $0.01 |

Monthly estimate (10 jobs/day): 300 jobs × $0.05 = $15/month. Zero idle cost (application stopped between jobs).

Compare with:

- EMR on EC2 (m5.xlarge cluster): ~$200/month (always-on)
- Glue ETL (same workload): ~$0.44/job × 300 = $132/month
- DuckDB Lambda: ~$1.10/month (but no distributed processing)

The PySpark Job

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    import time

    spark = SparkSession.builder.appName("FSxN-S3AP-Verification").getOrCreate()

    S3_AP = "s3://<your-ap-alias-ext-s3alias>"

    # --- Read ---
    start = time.time()
    df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
    row_count = df.count()
    print(f"Read: {row_count} rows in {time.time()-start:.2f}s")

    # --- Transform: GROUP BY ---
    start = time.time()
    agg_df = df.groupBy("status").agg(
        F.count("*").alias("count"),
        F.avg("temperature").alias("avg_temp"),
        F.avg("humidity").alias("avg_humidity")
    )
    agg_df.show()
    print(f"GROUP BY: {time.time()-start:.2f}s")

    # --- Transform: Window function ---
    start = time.time()
    window_spec = Window.partitionBy("device_id").orderBy("timestamp").rowsBetween(-5, 0)
    window_df = df.withColumn("moving_avg_temp", F.avg("temperature").over(window_spec))
    window_df.select("device_id", "timestamp", "temperature", "moving_avg_temp").show(5)
    print(f"Window: {time.time()-start:.2f}s")

    # --- Write-back ---
    start = time.time()
    agg_df.write.mode("overwrite").parquet(f"{S3_AP}/gold/emr_spark_output/")
    print(f"Write-back: {time.time()-start:.2f}s")

    spark.stop()

Deploy and Run

    # 1. Create EMR Serverless application
    aws emr-serverless create-application \
      --name "fsxn-spark" \
      --release-label "emr-7.1.0" \
      --type "SPARK" \
      --region ap-northeast-1

    # 2. Upload script to S3 (regular bucket, not S3 AP)
    aws s3 cp scripts/spark_verification.py \
      s3://my-scripts-bucket/emr-scripts/

    # 3. Submit job
    aws emr-serverless start-job-run \
      --application-id <app-id> \
      --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role \
      --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://my-scripts-bucket/emr-scripts/spark_verification.py"
        }
      }'

    # 4. Check status
    aws emr-serverless get-job-run \
      --application-id <app-id> \
      --job-run-id <job-run-id>

    # 5. Stop application (zero cost when stopped)
    aws emr-serverless stop-application --application-id <app-id>

Known Failure Signatures

| Symptom | Likely cause | Next step |
|---|---|---|
| IllegalArgumentException: Invalid S3 URI | Using s3a:// instead of s3:// | Switch to EMRFS (s3://) prefix |
| Illegal Parquet type: INT64 (TIMESTAMP(NANOS)) | Nanosecond timestamps in Parquet | Regenerate with microsecond resolution |
| Job stuck in PENDING > 60s | EMR Serverless capacity | Check service quotas; retry |
| AccessDeniedException on S3 AP | IAM role missing AP permissions | Add S3 AP ARN to execution role policy |
| Script not found | Script on S3 AP instead of regular S3 | Move script to regular S3 bucket |
| Write fails with 501 | Attempting Delta/Iceberg write | Use flat Parquet write only |

Gotchas and Lessons

1. Script must be on regular S3 (not S3 AP)

EMR Serverless loads the PySpark script from S3. The script location must be a regular S3 bucket, not an FSx S3 AP. The script then reads/writes data from/to the S3 AP.

2. IAM role needs both S3 bucket and S3 AP permissions

    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-scripts-bucket/*",
        "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>",
        "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>/object/*"
      ]
    }

3. Cold start is ~20 seconds

EMR Serverless has a cold start of ~20 seconds before Spark begins executing. For latency-sensitive workloads, keep the application in "started" state (costs ~$0.01/hour for pre-initialized capacity).

4. No session policy issues

Unlike Databricks and Snowflake, EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively.
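The policy statement in gotcha 2 has three resource ARNs that differ only in bucket, region, account, and access point name, so it can be generated rather than edited by hand. The sketch below mirrors that statement as written in the article; the function name, the "Version" wrapper, and the account ID and access point name in the usage lines are assumptions for illustration.

```python
import json


def build_execution_policy(scripts_bucket, region, account_id, access_point_name):
    """Build an IAM policy document mirroring the statement in gotcha 2.

    Hypothetical helper; it only assembles JSON and makes no AWS call.
    """
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{scripts_bucket}/*",  # script objects in the regular bucket
                    ap_arn,                               # the access point itself
                    f"{ap_arn}/object/*",                 # objects reached through the access point
                ],
            }
        ],
    }


policy = build_execution_policy(
    scripts_bucket="my-scripts-bucket",
    region="ap-northeast-1",
    account_id="123456789012",       # placeholder account ID
    access_point_name="fsxn-ap",     # placeholder access point name
)
print(json.dumps(policy, indent=2))
```

Review the generated document against your own least-privilege requirements before attaching it to an execution role.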
When to Use EMR Serverless vs Other Engines

| Requirement | EMR Serverless | Athena | DuckDB Lambda | Glue ETL |
|---|---|---|---|---|
| Read-only SQL | ✅ | ✅ Best | ✅ | ✅ |
| Write-back to FSxN | ✅ Best | ✅ (CTAS) | ✅ | ✅ |
| Complex Spark transformations | ✅ Best | ❌ | ❌ | ✅ |
| Sub-second latency | ❌ (cold start) | ❌ | ✅ Best | ❌ |
| Zero idle cost | ✅ | ✅ | ✅ | ✅ |
| Large-scale distributed | ✅ Best | ✅ | ❌ | ✅ |

What's Next

- Part 6: Redshift Spectrum + Lake Formation, for teams that need DWH-integrated analytics with enterprise governance (4-layer authorization) on NAS data
- Part 7: Table Format Boundaries, covering why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead

Previously in this series:

- Part 1: Athena - Query NAS Data In Place
- Part 2: Databricks - A Layer-by-Layer Validation of Observed Boundaries
- Part 3: Snowflake - From 'Access Denied' to Working External Tables
- Part 4: DuckDB Lambda - Serverless Analytics for $0.00001/Query

References

- FSx for ONTAP S3 Access Points
- AWS Tutorial: Run Spark jobs using Amazon EMR Serverless
- EMR Serverless documentation
- GitHub: fsxn-lakehouse-integrations

Key achievement: This validation established that EMR Serverless Spark provides the most capable read-write ETL path for FSx for ONTAP S3 AP data: full Spark SQL, UDFs, window functions, and write-back in 16 seconds of Spark execution at $0.05/job. No cluster management, no data copy, no session policy issues. The trade-off is cold start latency (20s) and lack of built-in governance, so pair it with Athena + Lake Formation for governed reads on the output.

All benchmarks are from a specific test environment (EMR Serverless emr-7.1.0, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

2 hours ago
