
Sai Kumar Gunda - Senior Data Engineer
[email protected]
Location: Chicago, Illinois, USA
Relocation:
Visa: Green card
PROFESSIONAL SUMMARY
With 11+ years of hands-on expertise, I bring extensive experience as an Azure Data Engineer proficient in
both big data technologies and data warehousing solutions, delivering impactful insights and scalable data
architectures.
6+ years of hands-on experience in Azure cloud services: Azure Active Directory, Azure Key Vault, Azure Data
Factory (ADF), Azure Event Hubs, Azure Databricks (ADB), Azure Blob Storage, Azure Data Lake Storage
(ADLS Gen1 & Gen2), Azure SQL DB, Azure Synapse Analytics, Azure Stream Analytics, Azure DevOps,
Azure Cosmos DB (NoSQL), Azure Purview, Microsoft Fabric, and Azure Data Explorer (ADX)/Kusto;
successfully deployed Azure environments utilizing Azure IaaS Virtual Machines (VMs) and cloud services (PaaS).
Throughout my career, I have embraced the challenges and opportunities across the Finance, Networking, Healthcare, and
Manufacturing domains, where I created data solutions that drive strategic business decisions.
Expertise in design and implementation of pipelines using Azure Data Factory, encompassing the orchestration of
automated triggers, intricate mapping of data flows, and secure credential management through Azure Key Vault.
Crafted and fine-tuned Apache Spark Databricks notebooks with Python (PySpark) and Spark SQL for intricate data
transformations and managed the transition of datasets in Azure Data Lake Storage Gen2 from raw ingestion to
curated zones.
Experience in debugging and performance tuning of the Informatica mappings, Sessions and workflows.
Proficient in Apache Spark operations within Databricks, utilizing PySpark with resilient distributed datasets (RDDs),
DataFrames, and Datasets for streamlined transformations, actions, advanced analytics, and data persistence.
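For illustration, a minimal PySpark sketch of that contrast, using made-up data: the RDD path chains a transformation and an action, and the DataFrame path performs the equivalent aggregation and persists the result (the output path is hypothetical).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD API: reduceByKey is a transformation; collect is the action that triggers execution.
rdd = spark.sparkContext.parallelize([("web", 3), ("app", 5), ("web", 2)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b).collect()

# DataFrame API: the same aggregation, persisted to Parquet for downstream use.
df = spark.createDataFrame(rdd, ["channel", "events"])
df.groupBy("channel").sum("events").write.mode("overwrite").parquet("/tmp/channel_totals")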
Leveraged Databricks features including shared notebooks and real-time collaboration along with PySpark for
increased productivity, seamlessly integrating with Azure services such as Azure Machine Learning, Azure SQL
Database, and Azure Synapse Analytics to create comprehensive analytics solutions.
Experience in developing ETL program for supporting Data Extraction, transformations and loading using
Informatica power center.
Pioneered comprehensive analytics solutions by seamlessly integrating Azure Synapse Analytics and Azure Stream
Analytics, addressing both historical data analysis and real-time insights leveraging Azure Event Hubs and IoT Hub.
Spearheaded secure authentication and access management system utilizing Azure Active Directory (Azure AD),
while also ensuring the robust encryption and secure management of sensitive data and credentials through Azure
Key Vault.
Specialized in data management with Azure Purview to optimize data discovery, classification, and lineage tracking
workflows, enabling streamlined data management and governance practices.
Engineered continuous integration and continuous deployment (CI/CD) pipelines within Azure DevOps for Azure-
based data solutions, enhancing automation and streamlining deployment cycles.
Tailored UDFs in Pig and Hive, integrating Python or Java methods for advanced data processing, extending
capabilities beyond built-in functions, and enabling complex analytics in Pig Latin and HiveQL scripts.
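As a hedged sketch of one common way to plug a Python method into HiveQL, the streaming script below could be invoked with SELECT TRANSFORM (...) USING 'python flag_traffic.py'; the script name, columns, and threshold are hypothetical.

import sys

# Hive streams tab-separated rows to stdin; emit transformed rows on stdout.
for line in sys.stdin:
    ip, bytes_sent = line.rstrip("\n").split("\t")
    # Hypothetical rule: label each source IP by traffic volume.
    label = "heavy" if int(bytes_sent) > 1_000_000 else "normal"
    print(ip + "\t" + label)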
Expertise in troubleshooting Pig and Hive scripts, optimizing MapReduce jobs, and actively administering
Hadoop logs for adept diagnosis of issues, with strong proficiency in YARN configuration for efficient resource allocation.
Expertise in employing diverse Hadoop infrastructures, including HDFS, MapReduce, Yarn, Pig, Hive, Zookeeper,
Sqoop, Oozie, Flume, Kafka, and Spark, for the storage and analysis of data. Skilled in programming languages
such as Scala, Python, SQL, and working with both SQL (MySQL, Oracle, Microsoft) and NoSQL databases
(HBase, MongoDB, and Cassandra).
Competent in Installing, configuring, administrating, and managing Hadoop clusters and services using Cloudera
Manager, actively supporting the Deployment team in cluster setup and service configuration.
Comprehensive understanding of Hadoop architecture and core framework, coupled with demonstrated expertise in
smooth data transfer between HDFS and relational databases using Sqoop.
Specialized in structured data ingestion using Sqoop and real-time streaming through Flume, orchestrating clusters
with Zookeeper, overseeing and querying data using Hive, Pig, and HBase, establishing event streaming pipelines
via Kafka, and orchestrating workflows with Oozie.
Proficient in Spark Core, Spark SQL, Scala, and Spark Streaming, with practical expertise in implementing Star and
Snowflake schemas to centralize and analyze all data records efficiently.

Proficient in Hive optimization techniques such as partitioning, bucketing, map-side joins, bucket-map joins,
skew joins, and index creation.
Diligent in writing Infrastructure as Code (IaC) in Terraform and well versed in version control and CI/CD
technologies like Jenkins, Docker, Git, and GitHub.
Strong experience with SQL (DDL, DML, TCL, DCL) in implementing and developing stored procedures, transactions,
nested queries, joins, cursors, views, user-defined functions, indexes, user profiles, and relational database models,
and in creating and updating tables.
Significant experience in creating applications specialized in data processing tasks, employing Teradata, Oracle,
PostgreSQL, and MySQL databases.
Harnessed Power BI and Tableau to design and develop visually compelling, insightful dashboards and reports
for effective data visualization, including time-series analysis using DAX expressions.
Skilled in leveraging Snowflake utilities like SnowSQL and Snowpipe, adeptly implementing role-based access
control, data encryption, and network policies to safeguard sensitive information.
Managed Agile project processes, translating user stories, overseeing stakeholder management, facilitating process
re-engineering, and aligning development efforts with business goals, leveraging robust SDLC understanding and
analytical skills across Agile and Waterfall methodologies throughout the project lifecycle.
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, YARN, MapReduce, Pig, HBase, Hive, Sqoop, Flume, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming
Hadoop Distributions: Cloudera, Hortonworks
Shell Scripting: Bash, PowerShell, Azure CLI
Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala, PySpark
Automation Tools: Ant, Maven, Terraform
Version Control & CI/CD Tools: Git, GitHub, Jenkins, Bitbucket, GitLab, Azure DevOps
IDE & Build Tools: Eclipse, Visual Studio, Notepad++
Visualization Tools: Power BI, Tableau, SSRS
Cloud Services: Azure Data Factory, Databricks, Logic Apps, Function Apps, Synapse Analytics, HDInsight, Stream Analytics, Event Hubs, Purview, Snowflake, Azure DevOps, Azure Blob Storage, Azure Data Lake Storage Gen1 & Gen2, Active Directory, Key Vault
Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS
Databases: MS SQL Server 2012/2014/2016, Azure SQL DB, Azure Synapse, MS Access, Oracle 11g/12c, Cosmos DB, PostgreSQL

EDUCATION:
Master's in Information Technology and Management, University of Texas at Dallas, 2013
Bachelor's in Mechanical Engineering, Aditya College of Engineering and Technology, 2011
PROFESSIONAL EXPERIENCE:
Client: Samsung Electronics America, Ridgefield Park, NJ Jan 2022 - Present
Role: Senior Azure Data Engineer
Responsibilities:
Designed scalable data solutions on Azure, leveraging services like Azure Data Lake Storage, Azure Synapse
Analytics, and Azure Data Factory for efficient data processing.

Implemented automated email notifications using Azure Logic Apps, Azure Functions, and APIs for real-time data retrieval and
storage in Azure Cosmos DB, enhancing analytical capabilities.
Created and deployed ETL mappings and developed PL/SQL procedures to execute complex data transformations.
Deployed real-time data streaming solutions with Azure Event Hubs and Kafka for handling continuous streams
of network events and customer feedback.
Extensively involved in performance tuning of Informatica ETL mappings by using caches, overriding
SQL queries, and using parameter files.
Engineered and managed scalable ETL pipelines using PySpark within Azure Databricks, orchestrating complex data
workflows through Directed Acyclic Graphs (DAGs) to optimize performance and resource utilization.
Developed and executed Spark Notebooks for data processing and analysis, leveraging Azure Databricks Clusters
to handle large datasets and ensure efficient data transformation and aggregation.
Optimized data processing workflows in Azure Databricks by implementing advanced PySpark functions, utilizing
coalesce and repartitioning techniques to enhance the efficiency, performance, and scalability of ETL pipelines.
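As a hedged illustration of that repartition/coalesce tuning (paths and column names are hypothetical), a Databricks-style PySpark sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()

events = spark.read.format("delta").load("/mnt/bronze/network_events")  # hypothetical path

# Repartition on the grouping key so the shuffle is spread evenly across executors.
events = events.repartition(200, "device_id")

daily_counts = events.groupBy("device_id", "event_date").count()

# Coalesce before writing to avoid producing thousands of tiny output files.
(daily_counts.coalesce(8)
    .write.format("delta").mode("overwrite")
    .save("/mnt/silver/device_daily_counts"))  # hypothetical path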
Designed and implemented scalable, high-performance databases using Azure Cosmos DB and PostgreSQL,
optimizing data storage and retrieval for various applications and use cases.
Strong command of SQL, including T-SQL and PostgreSQL, with extensive experience in writing, optimizing, and
managing complex queries across various environments.
Utilized SQL Server, Teradata, Snowflake, and Synapse for data storage and retrieval, enhancing data accessibility
and facilitating robust data analysis.
Architected and managed data solutions using Azure Data Lake and Delta Lake, implementing Delta Live Tables
within a Medallion architecture to ensure real-time data processing, incremental updates, and enhanced data quality
and accessibility.
Provided support for the Oracle R12.1.3 E-Business Suite application.
Developed ETL programs using Informatica to implement the business requirements.
Implemented the Medallion Architecture in Azure Data Lake Storage, efficiently moving data through the bronze,
silver, and gold layers to ensure a structured, clean, and high-quality dataset for advanced analytics and reporting.
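A minimal Delta Live Tables sketch of the bronze-to-silver hop in such a Medallion flow, assuming it runs inside a Databricks DLT pipeline (where spark is predefined); the table names, landing path, and expectation rule are hypothetical.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events landed as-is (bronze).")
def bronze_events():
    # `spark` is provided by the DLT runtime; cloudFiles is Auto Loader incremental ingestion.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/events"))           # hypothetical landing path

@dlt.table(comment="Cleaned, deduplicated events (silver).")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def silver_events():
    return (dlt.read_stream("bronze_events")
            .dropDuplicates(["event_id"])
            .withColumn("ingested_at", F.current_timestamp()))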
Designed and developed interactive dashboards and reports in Power BI, enabling data-driven decision-making and
providing actionable insights to stakeholders.
Engaged in Agile scrum meetings, including daily stand-ups and globally coordinated PI Planning, to ensure effective
project management and execution.
Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Spearheaded batch data processing initiatives using Azure Data Factory and Azure Databricks, orchestrating
seamless transfer and transformation of historical logs and configuration backups.
Combined multiple Azure data services, including Azure Synapse, Power BI, and Data Factory, into a cohesive
environment within Microsoft Fabric, enabling streamlined management of data ingestion, transformation, and
visualization for end-to-end analytics solutions.
Created shell scripts to fine tune the ETL flow of the Informatica workflows.
Implemented Role-Based Access Control (RBAC) using Azure Active Directory to secure and manage access to
Azure resources, ensuring compliance and enhancing data security across the organization.
Employed query performance optimization techniques, including index optimization and query rewriting, to enhance
data retrieval speeds.
Used the Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
Established data encryption at rest and in transit using Azure Key Vault and Azure Security Center to protect sensitive
data.
Developed and maintained data pipelines using Azure Data Factory (ADF) with a focus on leveraging both Azure
Integration Runtime and Self-hosted Integration Runtime for efficient data movement and transformation across
hybrid environments.
Designed and implemented Databricks Unity Catalog for centralized data governance and security, ensuring fine-
grained access control and compliance with organizational data protection policies.
Implemented and scheduled ADF pipelines using various triggers (e.g., schedule, tumbling window, and event-based
triggers), ensuring seamless integration and timely execution of ETL processes.
Leveraged Azure Databricks and Azure Data Factory, along with Synapse Analytics, for comprehensive ETL
processes, including data cleansing, deduplication, normalization, and joins, utilizing PySpark for robust root cause
analysis and anomaly detection.

Gathered configuration data using NetFlow, REST APIs, and PowerShell scripts, facilitating thorough analysis and
storage in Azure Blob Storage or Azure SQL Database.
Migrated and optimized ETL workflows from SQL Server Integration Services (SSIS) to Azure Data Factory,
leveraging ADF's pipelines and data flows for improved scalability and efficiency.
Designed and implemented scalable data storage solutions using Azure Blob Storage, optimizing data retrieval and
ensuring secure, cost-effective storage for large datasets.
Developed and optimized complex SQL queries and transformations in Azure Databricks, leveraging Spark SQL for
efficient data processing and analytics across large datasets.
Pioneered dimensional modelling with star and snowflake schemas for multidimensional analysis of network
operations and customer satisfaction metrics.
Instituted automated workflows and CI/CD pipelines to streamline model development, testing, and deployment
processes.
Implemented data masking and anonymization strategies to safeguard sensitive information in compliance with GDPR
and HIPAA regulations.
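One way such masking can look in PySpark, as a hedged sketch with hypothetical table and column names (spark is assumed to come from the Databricks session):

from pyspark.sql import functions as F

customers = spark.table("silver.customers")  # hypothetical source table

masked = (customers
    .withColumn("customer_id", F.sha2(F.col("customer_id").cast("string"), 256))  # pseudonymize the key
    .withColumn("email", F.lit("***redacted***"))                                 # remove a direct identifier
    .drop("ssn"))                                                                  # drop unneeded PII outright

masked.write.format("delta").mode("overwrite").saveAsTable("gold.customers_masked")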
Proficient in working with Parquet and Delta formats, ensuring optimized storage and retrieval of large-scale
datasets in big data environments, particularly within Azure ecosystems.
Orchestrated performance optimization strategies using Azure Data Factory and Databricks, and PySpark to fine-tune
data processing pipelines and reduce latency.
Leveraged DBT to build and orchestrate modular data transformation workflows in Azure, implementing CI/CD
practices, model versioning, and data quality tests to ensure accuracy and scalability across data pipelines.
Used Azure Application Insights, Azure Diagnostics, and Azure Log Analytics for root cause analysis and performance
tuning, and aggregated data with Azure Monitor and Syslog to optimize network throughput.
Utilized Delta Live Tables to process and analyze real-time data streams, enabling instant insights for time-critical
decision-making.
As the senior Oracle PL/SQL resource on the project, I developed an abstraction layer of complex views to support
backward compatibility for legacy data warehouse data consumers.
Deep understanding of the SDLC process, implementing DevOps practices, including standard deployment
processes (dev, test, prod) with peer-reviewed code. Experienced in managing CI/CD pipelines using Azure DevOps
to streamline and automate software releases.
Collaborated with cross-functional teams including data scientists, analysts, and business stakeholders to
understand data requirements and deliver high-impact, data-driven solutions.
Developed and optimized ETL pipelines to extract, transform, and load data from CRM systems into Azure SQL
Database and Azure Data Lake Storage, ensuring seamless data integration and high data quality for advanced
reporting and analytics.
Monitored workload, job performance and capacity planning using Cloudera Manager.
Automated governance processes by configuring Unity Catalog for seamless integration with Azure Active
Directory and Key Vault.
Led migration of on-premises data systems to Azure cloud, including architecture design, data transfer, and
integration, ensuring seamless transition and optimized cloud performance.
Client: Subaru Of America, Camden, NJ May 2018 to Dec 2021
Role: Azure Data Engineer
Responsibilities:
Extracted, transformed, and loaded data from diverse source systems into Azure Data Lake Storage (ADLS) using
Azure Data Factory (ADF) and Azure Databricks, ensuring reliable data availability and data integrity.
Designed and implemented Azure Data Factory (ADF) pipelines for batch and streaming data ingestion into Azure
Data Lake Storage (ADLS), enabling real-time insights generation and processing capabilities.
Used Informatica file watch events to poll the FTP sites for the external mainframe files.
Created database objects such as tables, views, procedures, and packages using Oracle tools like Toad, PL/SQL
Developer, and SQL.
Developed and orchestrated ETL workflows in Azure Data Factory, utilizing its SQL Server
Integration Services (SSIS) integration for efficient data extraction, transformation, and loading.
Effectively worked in Informatica version-based environment and used deployment groups to migrate the objects.
Played a key role in integrating Apache Airflow with Databricks and cloud services, enabling automated
orchestration of data pipelines and increasing the overall operational efficiency.
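A minimal Airflow sketch of that orchestration pattern, assuming the apache-airflow-providers-databricks package and a configured databricks_default connection; the DAG name, cluster spec, and notebook path are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="daily_crm_ingest",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = DatabricksSubmitRunOperator(
        task_id="run_transform_notebook",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Repos/etl/transform_crm"},  # hypothetical path
    )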
Implemented data migration processes to transition CRM data from on-premises SQL databases to Azure,
leveraging tools like Azure Data Factory and Azure Databricks for efficient data flow and transformation.
In-depth understanding of Apache Spark job execution components such as DAGs, lineage graphs, the DAG scheduler, the task
scheduler, and stages; worked on relational and NoSQL databases including HBase, PostgreSQL, Cassandra, and
MongoDB.
Leveraged analytical capabilities of Synapse serverless pools to handle massive datasets and perform complex data
transformations, ensuring efficient and cost-effective data processing for the organization.
Developed automation scripts for Hadoop ETL jobs using Python, and implemented CI/CD pipelines with Azure
DevOps and Jenkins to enhance productivity, security, and continuous delivery.
Developed event-driven architectures using Azure Event Grid, integrating with Azure Cosmos DB and PostgreSQL
to enable real-time data processing and seamless communication between distributed systems.
Built near real-time data processing pipelines using Kafka, Spark Structured Streaming, and HBase, facilitating timely
insights generation.
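A simplified Spark Structured Streaming sketch of that pattern (broker, topic, schema, and sink paths are hypothetical, and a Parquet sink stands in here for the HBase writer used on the project):

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

schema = T.StructType([
    T.StructField("vehicle_id", T.StringType()),
    T.StructField("event_type", T.StringType()),
    T.StructField("event_ts", T.TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "vehicle-events")               # hypothetical topic
       .load())

events = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/streams/vehicle_events")              # hypothetical sink
         .option("checkpointLocation", "/data/checkpoints/vehicle_events")
         .start())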
Maintained and enhanced an Oracle PL/SQL batch process for patient-level data collected in a clinical trial and reporting
system.
Effectively used Informatica parameter files for defining mapping variables, Workflow variables, FTP connections
and relational connections.
Expertise in integrating and processing data from diverse sources, including relational databases (SQL Server,
Oracle), cloud storage (Azure Blob Storage, AWS S3), streaming platforms (Kafka, Azure Event Hubs), and APIs
(REST, GraphQL), ensuring seamless data ingestion, transformation, and analysis for data engineering workflows.
Designed and implemented data delivery processes to seamlessly integrate operational systems and files into the
Data Lake.
Deployed Hadoop cluster using Cloudera Hadoop 4 (CDH4) with Pig, Hive, HBase and Spark.
Integrated PySpark with Apache Kafka, Apache Hadoop, and Apache Hive for streamlined data processing workflows.
Utilized Python and Spark SQL to translate Hive/SQL native queries into Spark DataFrame transformations within
Apache Spark, enabling efficient analysis of outdoor recreation data.
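As an illustration of that translation (table and column names are hypothetical, spark taken from the notebook session), the HiveQL aggregate in the comment maps to the DataFrame chain below:

from pyspark.sql import functions as F

# HiveQL: SELECT park_id, COUNT(*) AS visits FROM trail_visits WHERE visit_year = 2020 GROUP BY park_id
trail_visits = spark.table("trail_visits")           # hypothetical Hive table
visits_2020 = (trail_visits
    .filter(F.col("visit_year") == 2020)
    .groupBy("park_id")
    .agg(F.count("*").alias("visits")))
visits_2020.show()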
Developed and implemented Spark batch jobs using Python and Spark SQL to optimize workflows and extract
actionable insights from data.
Managed and automated the lifecycle of data stored in Azure Blob Storage, including data retention policies,
archiving strategies, and access controls to enhance data governance and compliance.
Leveraged Azure Data Factory pipelines for transformations with Databricks Spark, importing data from various
sources like HDFS/Hive into Spark Data Frames using Spark 2.0 for insights generation in outdoor recreation
management.
Employed the Spark API over Cloudera Hadoop YARN to analyze data in Hive, extracting valuable insights for demand
forecasting and trail optimization in outdoor recreation management.
Integrated data from various sources into Power BI, creating visualizations and reports to monitor key performance
indicators and track business metrics effectively.
Utilized GIT and coordinated with Continuous Integration (CI) tools to facilitate efficient collaboration and version
control management during the development of demand forecasting and trail optimization solutions.
Implemented Slowly Changing Dimensions (SCD) and Change Data Capture (CDC) in Azure Data Factory to
manage historical data and track real-time changes, enhancing data accuracy and integration.
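A hedged sketch of the SCD/CDC idea using a Delta Lake MERGE, a technique swapped in here for brevity (table and column names are hypothetical); this simplified version only expires changed rows and inserts brand-new keys, whereas a full Type 2 flow would also insert the new version of each changed row.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "silver.dim_customer")   # hypothetical dimension table
changes = spark.table("bronze.customer_changes")          # hypothetical CDC feed

(dim.alias("d")
 .merge(changes.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
 .whenMatchedUpdate(
     condition="d.row_hash <> u.row_hash",                # only expire rows that actually changed
     set={"is_current": F.lit(False), "end_date": F.current_date()})
 .whenNotMatchedInsert(values={
     "customer_id": "u.customer_id",
     "name": "u.name",
     "row_hash": "u.row_hash",
     "is_current": F.lit(True),
     "start_date": F.current_date(),
     "end_date": F.lit(None).cast("date"),
 })
 .execute())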
Attended daily sync-up calls between onsite and offshore teams to discuss the ongoing features/work items, issues,
blockers, and ideas to improve the performance, readability, and experience of the data presented to end users.
Client: BNY Mellon, New York, NY Feb 2016 - Apr 2018
Role: Big Data Developer
Responsibilities:
Implemented real-time data streaming and processing pipelines using Apache Flink, enabling low-latency data
analytics and event-driven applications for large-scale data environments.
Automated data pipelines and workflows using Oozie and implemented Flume for collecting and storing web log
data for manufacturing process analysis.
In-depth knowledge of Hadoop architecture and various components such as HDFS, the ApplicationMaster, NodeManager,
ResourceManager, NameNode, and DataNode, along with MapReduce concepts.

Utilized Kerberos authentication principles to ensure secure network communication on the cluster and conducted
testing of HDFS, Hive, Pig, and MapReduce to grant access to new users.
Orchestrated mappings for transferring data from Oracle and SQL Server to the new Data Warehouse, facilitating
efficient integration and analysis.
Managed and optimized PySpark applications, implementing scheduling and automation for efficient data processing
and job execution.
Orchestrated data migration from Oracle and SQL Server to Hadoop using Sqoop, handling flat files in various formats.
Developed a Flume and Sqoop data pipeline to ingest customer behavioral data histories into HDFS.
Engineered Hive tables and MapReduce programs, using Hive queries for data loading, transformation, and
manufacturing process analysis.
Developed and optimized Big Data products and platforms using Python, Scala, Spark, Hadoop tools like Hive and
Impala, ensuring efficient data processing workflows and scalable architectures.
Implemented Kafka and Spark streaming for real-time data processing, configured Spark to store data in HDFS, and
developed Spark and Hive jobs for data summarization and transformation.
Expert in designing and implementing Kafka-based stream processing solutions and data pipelines, as well as
managing and optimizing Kafka clusters for high availability and performance.
Packaged and deployed PySpark applications using Docker for scalability and portability across environments.
Utilized Spark and Python for data processing, leveraging Spark Data Frames to ingest, transform, validate,
cleanse, and aggregate unstructured data into structured formats, ensuring efficient and accurate data handling.
Expert in using Apache Ant for optimizing complex build processes and integrating with CI/CD pipelines.
Proficient in using Maven for dependency management and project modularization, and developing custom plugins
to enhance build workflows.
Implemented partitioning, dynamic partitions, and buckets within Hive for efficient data organization and querying.
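The same partitioning and bucketing ideas expressed with the Spark DataFrame writer for brevity (table and column names are hypothetical, spark from the session); in plain HiveQL these correspond to PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS.

orders = spark.table("staging.orders")   # hypothetical source table

(orders.write
    .partitionBy("order_date")           # one partition directory per date
    .bucketBy(32, "order_id")            # 32 buckets on the frequent join key
    .sortBy("order_id")
    .mode("overwrite")
    .saveAsTable("analytics.orders_bucketed"))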
Demonstrated proficiency in using Jenkins for continuous integration, automating build and deployment processes for
Hadoop-based solutions and manufacturing process enhancements.
Leveraged Git for version control, ensuring collaborative development and tracking changes in the codebase.
Utilized JIRA for efficient project management, tracking tasks, and facilitating communication among team members.
Managed Hadoop infrastructure installation, configuration, maintenance, and monitoring using Cloudera Manager and
shell scripts, ensuring seamless operations and security.
Client: Upward Health, New York Jan 2014 - Jan 2016
Role: SQL Developer/Data Warehouse Developer
Responsibilities:
Designed and implemented ETL processes using SSIS to efficiently extract, transform, and load data into the data
warehouse, ensuring optimal performance and data integration.
Developed and maintained reports with SSRS and created OLAP cubes using SSAS, enabling comprehensive data
analysis and delivering key business insights through automated reporting.
Optimized T-SQL queries through advanced performance tuning techniques, improving execution time and
database performance, while managing large datasets and ensuring data integrity.
Leveraged SQL for optimizing query performance, tuning Data Transformation Manager (DTM) buffer and block
sizes, and identifying bottlenecks in sources, targets, mappings, and sessions.
Deployed ETL modules and monitored performance in the production environment, identifying and addressing
read/write errors using Workflow and Session logs.
Utilized Erwin Data Modeler to design DataMarts and generate DDL scripts for review by Database Administrators
(DBAs) and designed and built ETL modules using technical transformation documents.
Leveraged Informatica for ETL processes, designing and developing data integration workflows to transform and load
data from various sources into target databases, ensuring data accuracy.
Strong expertise in Data Integration, including ETL/ELT processes for large-scale data transformation and
management, using tools like Informatica PowerCenter and Data Quality for high-performance data processing.
Expert in designing and implementing ETL processes using Informatica PowerCenter, optimizing workflows and
mappings to ensure data quality, consistency, and processing efficiency.
Hands-on experience with ELT and Change Data Capture (CDC) solutions, specifically using Informatica PowerExchange,
ensuring real-time data flow between multiple systems and sources.
Designed and developed Informatica Mappings for incremental loads from source to target tables and implemented
Slowly Changing Dimensions (SCD) Type I and II as per requirements.
Developed SQL shell scripts for pre-session and post-session tasks, including index management and email
notifications.
Used DTS/SSIS and T-SQL stored procedures to transfer data from OLTP databases to the staging area and finally
into data marts, performing operations on XML data as needed.
Implemented DDL (Data Definition Language) and DML (Data Manipulation Language) operations to create,
modify, and manage database structures, ensuring efficient data storage and retrieval.
Developed and maintained complex stored procedures, triggers, and functions to encapsulate business logic and
automate repetitive tasks.
Utilized window functions for advanced data analysis, such as running totals, moving averages, and ranking, to
solve complex business requirements.
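The window-function patterns above (running total, moving average, ranking), shown here as a PySpark sketch for consistency with the rest of these examples; the claims table and columns are hypothetical, with spark assumed from the session.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

claims = spark.table("dw.fact_claims")   # hypothetical fact table

by_member = Window.partitionBy("member_id").orderBy("claim_date")
last_three = by_member.rowsBetween(-2, 0)
by_amount = Window.partitionBy("member_id").orderBy(F.desc("claim_amount"))

enriched = (claims
    .withColumn("running_total", F.sum("claim_amount").over(by_member))
    .withColumn("moving_avg_3", F.avg("claim_amount").over(last_three))
    .withColumn("claim_rank", F.rank().over(by_amount)))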
Worked with various SQL environments including MySQL, PostgreSQL, and Microsoft SQL Server and Oracle to
manage and analyze large datasets.
Collaborated in Agile Scrum methodology, actively participating in daily stand-up meetings, utilizing Visual SourceSafe
for Visual Studio 2010, and tracking project progress through Trello.