Job Description
Data Engineer - AWS & PySpark
Position: Data Engineer - AWS & PySpark
Location: Nagpur/Pune
Type of Employment: Full-Time
Purpose of the Position: You will be a critical member of the InfoCepts Cloud Data Architect Team. We are seeking an experienced Data Engineer with robust expertise in Databricks, PySpark, AWS, and Python to design and deliver scalable data pipelines, high-performance ETL frameworks, and reliable data solutions. The ideal candidate has a solid understanding of distributed data processing, cloud architecture, and modern data engineering best practices.
Key Result Areas and Activities:
Data Engineering & ETL Development
- Design, build, and optimize ETL/ELT pipelines using PySpark/Scala and Databricks on large-scale distributed data environments.
- Develop reusable data ingestion frameworks, transformation modules, and feature engineering pipelines.
- Ensure high-quality data processing with robust data validation, error handling, and observability.
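As a minimal illustration of the validation and error-handling expectations above, here is a pure-Python sketch of row-level validation with a quarantine stream. The record schema (`order_id`, `amount`) and the rules are illustrative assumptions, not part of the job description; in practice this logic would typically run inside a PySpark job.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for one record (empty list = valid)."""
    errors = []
    if not record.get("order_id"):
        errors.append("missing order_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    return errors

def split_valid_invalid(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route records into a clean stream and a quarantine stream with reasons."""
    valid, invalid = [], []
    for rec in records:
        errs = validate_record(rec)
        if errs:
            # Keep the bad record plus the reasons, for observability/replay.
            invalid.append({**rec, "_errors": errs})
        else:
            valid.append(rec)
    return valid, invalid
```

Routing failures into a quarantine stream rather than dropping them silently is what makes a pipeline observable and debuggable.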
Databricks Platform Engineering
- Work extensively with the Databricks Lakehouse platform—clusters, notebooks, Delta Lake, MLflow, jobs, and workflows.
- Implement best practices for Delta Lake, including schema evolution, time travel, vacuuming, Z-Ordering, partitioning, and optimization.
- Collaborate on job orchestration using Databricks Workflows, the Jobs API, or Airflow.
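As a rough illustration of the Databricks Workflows orchestration mentioned above, a job definition submitted to the Jobs API (2.1) might look like the following sketch. The task names, notebook paths, and cluster settings are hypothetical.

```json
{
  "name": "daily_ingest_pipeline",
  "tasks": [
    {
      "task_key": "ingest_raw",
      "notebook_task": { "notebook_path": "/pipelines/ingest_raw" },
      "job_cluster_key": "etl_cluster"
    },
    {
      "task_key": "transform_silver",
      "depends_on": [ { "task_key": "ingest_raw" } ],
      "notebook_task": { "notebook_path": "/pipelines/transform_silver" },
      "job_cluster_key": "etl_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "etl_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```

The `depends_on` field expresses the task DAG, and sharing one `job_cluster_key` lets sequential tasks reuse a cluster instead of provisioning per task.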
AWS Cloud Engineering
- Build and maintain data pipelines leveraging AWS services such as:
- S3, Glue, Lambda, IAM, Step Functions, Athena, Redshift or Snowflake, CloudWatch
- Implement secure data architectures following IAM, networking, encryption, and cost-optimized design principles.
- Integrate Databricks with AWS data sources and event-driven systems.
- Working knowledge of open table formats (OTFs) such as Delta Lake and Apache Iceberg.
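To illustrate the event-driven integration pattern above, here is a minimal sketch of an S3-triggered AWS Lambda handler. It parses the standard S3 event notification shape; the downstream action is a placeholder (a real handler might, for example, call the Databricks Jobs API), and the bucket/key names in any usage are hypothetical.

```python
def handler(event: dict, context=None) -> dict:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            objects.append({"bucket": bucket, "key": key})
    # Placeholder: a real pipeline would trigger downstream processing here,
    # e.g. submitting a Databricks job run for the new objects.
    return {"status": "ok", "objects": objects}
```

Keeping the handler a pure function over the event payload makes it trivial to unit-test without any AWS infrastructure.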
Programming & Data Processing
- Write high-quality, production-grade Python code (modular, optimized, reusable).
- Develop PySpark jobs for batch and near real-time data transformations.
- Optimize Spark performance (partitions, broadcast variables, caching, cluster tuning).
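As a small sketch of the "modular, optimized, reusable" expectation, transformation steps can be written as composable pure functions. The step names and record shapes below are illustrative assumptions, not from the posting.

```python
from functools import reduce
from typing import Callable

# A Transform maps a batch of records to a batch of records.
Transform = Callable[[list[dict]], list[dict]]

def drop_nulls(field: str) -> Transform:
    """Reusable step: remove records whose `field` is missing or None."""
    return lambda rows: [r for r in rows if r.get(field) is not None]

def rename(old: str, new: str) -> Transform:
    """Reusable step: rename a field on every record."""
    return lambda rows: [{**{k: v for k, v in r.items() if k != old},
                          **({new: r[old]} if old in r else {})} for r in rows]

def pipeline(*steps: Transform) -> Transform:
    """Compose steps left-to-right into a single reusable transformation."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)
```

The same composition idea carries over to PySpark, where each step would take and return a DataFrame and be chained via `DataFrame.transform`.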
Data Architecture, Governance & Quality
- Contribute to the design of data models, storage layers, and data lifecycle management.
- Implement best practices for data governance, metadata management, and lineage tracking.
- Ensure data reliability, performance, and accuracy across multiple environments.
Cross-Functional Collaboration
- Partner with analysts, data scientists, product teams, and business stakeholders to understand requirements.
- Document workflows, maintain Git-based version control, and participate in architecture reviews.
- Support production pipelines, troubleshoot issues, and continuously enhance system performance.
Roles & Responsibilities
Essential Skills:
- 5+ years of hands-on experience in Data Engineering.
- Strong expertise in PySpark and distributed data processing.
- Deep understanding of Databricks Lakehouse (Delta Lake, clusters, jobs, Workflows, MLflow).
- Proficiency in the AWS data ecosystem (S3, Glue, Redshift, Lambda, Step Functions, EMR).
- Strong programming proficiency in Python (pandas, PySpark, APIs, modular code).
- Solid SQL skills (analytical functions, performance tuning).
- Experience with Git, CI/CD basics, and production deployments.
- Experience working with AI-based productivity tools such as GitHub Copilot.
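The SQL analytical-function skill listed above can be sketched with a window function. This example uses Python's built-in `sqlite3` module (window functions require SQLite 3.25+, bundled with modern Python builds); the table and data are hypothetical.

```python
import sqlite3

# Rank each customer's orders by amount using an analytical (window) function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 50.0), ("alice", 120.0), ("bob", 75.0)])

rows = conn.execute("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
""").fetchall()
```

`PARTITION BY` restarts the ranking per customer, which is the kind of analytical pattern (top-N per group, running totals, deduplication by `ROW_NUMBER`) this skill refers to.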
Desirable Skills:
- Familiarity with Unity Catalog, governance, and fine-grained access controls.
- Experience with Airflow or other orchestration tools.
- Knowledge of Databricks SQL dashboards and visualization.
- Exposure to ML/AI workflows on Databricks (not mandatory).
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Demonstrated continued learning through one or more technical certifications or related methods
- 5+ years of relevant experience in Data Analytics
Qualities:
- Strong problem-solving mindset with attention to detail.
- Ability to work in agile, cross-functional, distributed teams.
- Excellent communication, documentation, and collaboration skills.
- Ownership-driven, proactive, and committed to delivering high-quality outcomes.