Sai Kumar Gunda - Senior Data Engineer |
[email protected] |
Location: Chicago, Illinois, USA |
Relocation: |
Visa: Green card |
PROFESSIONAL SUMMARY
PROFESSIONAL SUMMARY
- Azure Data Engineer with over 11 years of hands-on experience in big data technologies and data warehousing solutions, delivering impactful insights and scalable data architectures.
- 6+ years of hands-on experience in Azure cloud services: Azure Active Directory, Azure Key Vault, Azure Data Factory (ADF), Azure Event Hubs, Azure Databricks (ADB), Azure Blob Storage, Azure Data Lake Storage (ADLS Gen1 & Gen2), Azure SQL DB, Azure Synapse Analytics, Azure Stream Analytics, Azure DevOps, Azure Cosmos DB (NoSQL), Azure Purview, Microsoft Fabric, and Azure Data Explorer (ADX/Kusto); successfully deployed Azure environments using Azure IaaS virtual machines (VMs) and PaaS cloud services.
- Delivered data solutions across the Finance, Network, Healthcare, and Manufacturing domains to drive strategic business decisions.
- Expertise in designing and implementing pipelines using Azure Data Factory, encompassing orchestration of automated triggers, mapping of data flows, and secure credential management through Azure Key Vault.
- Crafted and fine-tuned Apache Spark notebooks in Databricks with Python (PySpark) and Spark SQL for complex data transformations, and managed the promotion of datasets in Azure Data Lake Storage Gen2 from raw ingestion to curated zones (see the sketch following this summary).
- Experience in debugging and performance tuning of Informatica mappings, sessions, and workflows.
- Proficient in Apache Spark operations within Databricks, using PySpark with resilient distributed datasets (RDDs), DataFrames, and Datasets for streamlined transformations, actions, advanced analytics, and data persistence.
- Leveraged Databricks features such as shared notebooks and real-time collaboration along with PySpark for increased productivity, integrating with Azure services such as Azure Machine Learning, Azure SQL Database, and Azure Synapse Analytics to build comprehensive analytics solutions.
- Experience in developing ETL programs for data extraction, transformation, and loading using Informatica PowerCenter.
- Pioneered comprehensive analytics solutions by integrating Azure Synapse Analytics and Azure Stream Analytics, addressing both historical data analysis and real-time insights leveraging Azure Event Hubs and IoT Hub.
- Spearheaded secure authentication and access management using Azure Active Directory (Azure AD), and ensured robust encryption and secure management of sensitive data and credentials through Azure Key Vault.
- Specialized in data management with Azure Purview to optimize data discovery, classification, and lineage tracking workflows, enabling streamlined data management and governance practices.
- Engineered continuous integration and continuous deployment (CI/CD) pipelines within Azure DevOps for Azure-based data solutions, enhancing automation and streamlining deployment cycles.
- Tailored UDFs in Pig and Hive, integrating Python or Java methods for advanced data processing, extending capabilities beyond built-in functions and enabling complex analytics in Pig Latin and HiveQL scripts.
- Expertise in troubleshooting Pig and Hive scripts, optimizing MapReduce jobs, and administering Hadoop logs; strong proficiency in YARN configuration for efficient resource allocation.
- Expertise in employing diverse Hadoop infrastructure components, including HDFS, MapReduce, YARN, Pig, Hive, Zookeeper, Sqoop, Oozie, Flume, Kafka, and Spark, for data storage and analysis.
- Skilled in programming languages such as Scala, Python, and SQL, and in working with both SQL (MySQL, Oracle, Microsoft SQL Server) and NoSQL databases (HBase, MongoDB, and Cassandra).
- Competent in installing, configuring, administering, and managing Hadoop clusters and services using Cloudera Manager, actively supporting the deployment team in cluster setup and service configuration.
- Comprehensive understanding of Hadoop architecture and the core framework, with demonstrated expertise in smooth data transfer between HDFS and relational databases using Sqoop.
- Specialized in structured data ingestion using Sqoop and real-time streaming through Flume, orchestrating clusters with Zookeeper, querying data with Hive, Pig, and HBase, building event streaming pipelines with Kafka, and orchestrating workflows with Oozie.
- Proficient in Spark Core, Spark SQL, Scala, and Spark Streaming, with practical expertise in implementing Star and Snowflake schemas to centralize and analyze data records efficiently.
- Proficient in Hive optimization techniques such as partitioning, bucketing, map-side joins, bucket-map joins, skew joins, and index creation.
- Diligent in writing Infrastructure as Code (IaC) in Terraform, and well versed in version control and CI/CD technologies such as Jenkins, Docker, Git, and GitHub.
- Strong experience with SQL (DDL, DML, TCL, DCL) in implementing and developing stored procedures, transactions, nested queries, joins, cursors, views, user-defined functions, indexes, user profiles, and relational database models, and in creating and updating tables.
- Significant experience in creating applications specialized in data processing tasks using Teradata, Oracle, PostgreSQL, and MySQL databases.
- Harnessed Power BI and Tableau to design and develop visually compelling, insightful dashboards and reports, including time series analysis using DAX expressions.
- Skilled in leveraging Snowflake utilities such as SnowSQL and Snowpipe, implementing role-based access control, data encryption, and network policies to safeguard sensitive information.
- Managed Agile project processes, translating user stories, overseeing stakeholder management, facilitating process re-engineering, and aligning development efforts with business goals, leveraging a robust understanding of the SDLC and analytical skills across Agile and Waterfall methodologies throughout the project lifecycle.
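A minimal, illustrative sketch of the raw-to-curated promotion pattern referenced above, assuming a Databricks notebook with access to ADLS Gen2; the storage account, container paths, and column names below are hypothetical placeholders, not client specifics:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("raw_to_curated").getOrCreate()

    # Hypothetical ADLS Gen2 locations for the raw and curated zones
    raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/2024/"
    curated_path = "abfss://curated@examplelake.dfs.core.windows.net/sales/"

    # Ingest raw JSON, then apply basic cleansing and typing with PySpark
    raw_df = spark.read.json(raw_path)
    curated_df = (
        raw_df.dropDuplicates(["order_id"])                       # deduplicate on a business key
              .withColumn("order_ts", F.to_timestamp("order_ts")) # enforce a timestamp type
              .filter(F.col("amount").isNotNull())                # drop incomplete records
    )

    # Persist to the curated zone in Delta format, partitioned by date
    (curated_df.withColumn("order_date", F.to_date("order_ts"))
               .write.format("delta")
               .mode("overwrite")
               .partitionBy("order_date")
               .save(curated_path))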
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, YARN, MapReduce, Pig, HBase, Hive, Sqoop, Flume, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming
Hadoop Distribution: Cloudera, Hortonworks
Shell Scripting: Bash, PowerShell, Azure CLI
Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala, PySpark
Automation Tools: Ant, Maven, Terraform
Version Control & CI/CD Tools: Git, GitHub, Jenkins, Bitbucket, GitLab, Azure DevOps
IDE & Build Tools: Eclipse, Visual Studio, Notepad++
Visualization Tools: Power BI, Tableau, SSRS
Cloud Services: Azure Data Factory, Databricks, Logic Apps, Function App, Synapse Analytics, HDInsight, Stream Analytics, Event Hub, Purview, Snowflake, Azure DevOps, Azure Blob Storage, Azure Data Lake Storage Gen1 & Gen2, Active Directory, Key Vault
Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS
Databases: MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Access, Oracle 11g/12c, Cosmos DB, PostgreSQL

EDUCATION:
Master's in Information Technology and Management, University of Texas at Dallas, 2013
Bachelor's in Mechanical Engineering, Aditya College of Engineering and Technology, 2011

PROFESSIONAL EXPERIENCE:

Client: Samsung Electronics America, Ridgefield Park, NJ | Jan 2022 to Present
Role: Senior Azure Data Engineer
Responsibilities:
- Designed scalable data solutions on Azure, leveraging services such as Azure Data Lake Storage, Azure Synapse Analytics, and Azure Data Factory for efficient data processing.
- Implemented automated email notifications using Azure Logic Apps, Azure Functions, and APIs for real-time data retrieval and storage in Azure Cosmos DB, enhancing analytical capabilities.
- Created and deployed ETL mappings and developed PL/SQL procedures to execute complex data transformations.
- Deployed real-time data streaming solutions with Azure Event Hubs and Kafka to handle continuous streams of network events and customer feedback.
- Extensively involved in performance tuning of Informatica ETL mappings by using caches, overriding SQL queries, and using parameter files.
- Engineered and managed scalable ETL pipelines using PySpark within Azure Databricks, orchestrating complex data workflows through Directed Acyclic Graphs (DAGs) to optimize performance and resource utilization.
- Developed and executed Spark notebooks for data processing and analysis, leveraging Azure Databricks clusters to handle large datasets and ensure efficient data transformation and aggregation.
- Optimized data processing workflows in Azure Databricks by implementing advanced PySpark functions, using coalesce and repartitioning techniques to enhance the efficiency, performance, and scalability of ETL pipelines (see the sketch at the end of this role).
- Designed and implemented scalable, high-performance databases using Azure Cosmos DB and PostgreSQL, optimizing data storage and retrieval for various applications and use cases.
- Strong command of SQL, including T-SQL and PostgreSQL, with extensive experience in writing, optimizing, and managing complex queries across various environments.
- Utilized SQL Server, Teradata, Snowflake, and Synapse for data storage and retrieval, enhancing data accessibility and facilitating robust data analysis.
- Architected and managed data solutions using Azure Data Lake and Delta Lake, implementing Delta Live Tables within a Medallion architecture to ensure real-time data processing, incremental updates, and enhanced data quality and accessibility.
- Provided support for the Oracle R12.1.3 E-Business Suite application.
- Developed ETL programs using Informatica to implement business requirements.
- Implemented the Medallion architecture in Azure Data Lake Storage, efficiently moving data through the bronze, silver, and gold layers to ensure structured, clean, and high-quality datasets for advanced analytics and reporting.
- Designed and developed interactive dashboards and reports in Power BI, enabling data-driven decision-making and providing actionable insights to stakeholders.
- Engaged in Agile Scrum ceremonies, including daily stand-ups and globally coordinated PI Planning, to ensure effective project management and execution.
- Used the Spark API on Cloudera Hadoop YARN to perform analytics on data in Hive.
- Spearheaded batch data processing initiatives using Azure Data Factory and Azure Databricks, orchestrating seamless transfer and transformation of historical logs and configuration backups.
- Combined multiple Azure data services, including Azure Synapse, Power BI, and Data Factory, into a cohesive environment within Microsoft Fabric, enabling streamlined management of data ingestion, transformation, and visualization for end-to-end analytics solutions.
- Created shell scripts to fine-tune the ETL flow of Informatica workflows.
- Implemented Role-Based Access Control (RBAC) using Azure Active Directory to secure and manage access to Azure resources, ensuring compliance and enhancing data security across the organization.
- Employed query performance optimization techniques, including index optimization and query rewriting, to enhance data retrieval speeds.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Established data encryption at rest and in transit using Azure Key Vault and Azure Security Center to protect sensitive data.
- Developed and maintained data pipelines using Azure Data Factory (ADF), leveraging both the Azure Integration Runtime and the Self-hosted Integration Runtime for efficient data movement and transformation across hybrid environments.
- Designed and implemented Databricks Unity Catalog for centralized data governance and security, ensuring fine-grained access control and compliance with organizational data protection policies.
- Implemented and scheduled ADF pipelines using various triggers (e.g., schedule, tumbling window, and event-based triggers), ensuring seamless integration and timely execution of ETL processes.
- Leveraged Azure Databricks and Azure Data Factory, along with Synapse Analytics, for comprehensive ETL processes, including data cleansing, deduplication, normalization, and joins, utilizing PySpark for robust root cause analysis and anomaly detection.
- Gathered configuration data using NetFlow, REST APIs, and PowerShell scripts, facilitating thorough analysis and storage in Azure Blob Storage or Azure SQL Database.
- Migrated and optimized ETL workflows from SQL Server Integration Services (SSIS) to Azure Data Factory, leveraging ADF pipelines and data flows for improved scalability and efficiency.
- Designed and implemented scalable data storage solutions using Azure Blob Storage, optimizing data retrieval and ensuring secure, cost-effective storage for large datasets.
- Developed and optimized complex SQL queries and transformations in Azure Databricks, leveraging Spark SQL for efficient data processing and analytics across large datasets.
- Pioneered dimensional modeling with star and snowflake schemas for multidimensional analysis of network operations and customer satisfaction metrics.
- Instituted automated workflows and CI/CD pipelines to streamline model development, testing, and deployment processes.
- Implemented data masking and anonymization strategies to safeguard sensitive information in compliance with GDPR and HIPAA regulations.
- Proficient in working with Parquet and Delta formats, ensuring optimized storage and retrieval of large-scale datasets in big data environments, particularly within Azure ecosystems.
- Orchestrated performance optimization strategies using Azure Data Factory, Databricks, and PySpark to fine-tune data processing pipelines and reduce latency.
- Leveraged dbt to build and orchestrate modular data transformation workflows in Azure, implementing CI/CD practices, model versioning, and data quality tests to ensure accuracy and scalability across data pipelines.
- Used Azure Application Insights, Azure Diagnostics, and Azure Log Analytics for root cause analysis and performance tuning, and aggregated data with Azure Monitor and Syslog to optimize network throughput.
- Utilized Delta Live Tables to process and analyze real-time data streams, enabling instant insights for time-critical decision-making.
- As the senior Oracle PL/SQL resource on the project, developed an abstraction layer of complex views to support backward compatibility for legacy data warehouse data consumers.
- Deep understanding of the SDLC process, implementing DevOps practices, including standard deployment processes (dev, test, prod) with peer-reviewed code.
- Experienced in managing CI/CD pipelines using Azure DevOps to streamline and automate software releases.
- Collaborated with cross-functional teams, including data scientists, analysts, and business stakeholders, to understand data requirements and deliver high-impact, data-driven solutions.
- Developed and optimized ETL pipelines to extract, transform, and load data from CRM systems into Azure SQL Database and Azure Data Lake Storage, ensuring seamless data integration and high data quality for advanced reporting and analytics.
- Monitored workload, job performance, and capacity planning using Cloudera Manager.
- Automated governance processes by configuring Unity Catalog for seamless integration with Azure Active Directory and Key Vault.
- Led migration of on-premises data systems to Azure cloud, including architecture design, data transfer, and integration, ensuring a seamless transition and optimized cloud performance.
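A brief, hypothetical PySpark fragment illustrating the coalesce and repartitioning pattern referenced in this role; the table and column names are placeholders, and partition counts would be tuned against the actual workload rather than hard-coded as shown:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("partition_tuning").getOrCreate()

    # Hypothetical curated tables registered in the metastore
    orders = spark.table("curated.orders")
    customers = spark.table("curated.customers")

    # Repartition on the join key so shuffle partitions are evenly sized
    joined = (
        orders.repartition(200, "customer_id")
              .join(customers, "customer_id", "left")
              .withColumn("net_amount", F.col("amount") - F.col("discount"))
    )

    # Coalesce before writing to avoid producing many small output files
    (joined.coalesce(32)
           .write.mode("overwrite")
           .format("delta")
           .saveAsTable("curated.orders_enriched"))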
Client: Subaru of America, Camden, NJ | May 2018 to Dec 2021
Role: Azure Data Engineer
Responsibilities:
- Extracted, transformed, and loaded data from diverse source systems into Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF) and Azure Databricks, ensuring reliable data availability and data integrity.
- Designed and implemented Azure Data Factory (ADF) pipelines for batch and streaming data ingestion into Azure Data Lake Storage (ADLS), enabling real-time insight generation and processing capabilities.
- Used Informatica file watch events to poll FTP sites for external mainframe files.
- Created database objects such as tables, views, procedures, and packages using Oracle tools like Toad, PL/SQL Developer, and SQL.
- Developed and orchestrated ETL workflows in Azure Data Factory, utilizing features related to SQL Server Integration Services (SSIS) for efficient data extraction, transformation, and loading.
- Worked effectively in an Informatica version-based environment and used deployment groups to migrate objects.
- Played a key role in integrating Apache Airflow with Databricks and cloud services, enabling automated orchestration of data pipelines and increasing overall operational efficiency.
- Implemented data migration processes to transition CRM data from on-premises SQL databases to Azure, leveraging tools like Azure Data Factory and Azure Databricks for efficient data flow and transformation.
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, and stages; worked on relational and NoSQL databases including HBase, PostgreSQL, Cassandra, and MongoDB.
- Leveraged the analytical capabilities of Synapse serverless pools to handle massive datasets and perform complex data transformations, ensuring efficient and cost-effective data processing for the organization.
- Developed automation scripts for Hadoop ETL jobs using Python, and implemented CI/CD pipelines with Azure DevOps and Jenkins to enhance productivity, security, and continuous delivery.
- Developed event-driven architectures using Azure Event Grid, integrating with Azure Cosmos DB and PostgreSQL to enable real-time data processing and seamless communication between distributed systems.
- Built near real-time data processing pipelines using Kafka, Spark Structured Streaming, and HBase, facilitating timely insight generation (see the streaming sketch following this role).
- Maintained and enhanced Oracle PL/SQL batch processes for patient-level data collected in a clinical trial and reporting system.
- Effectively used Informatica parameter files for defining mapping variables, workflow variables, FTP connections, and relational connections.
- Expertise in integrating and processing data from diverse sources, including relational databases (SQL Server, Oracle), cloud storage (Azure Blob Storage, AWS S3), streaming platforms (Kafka, Azure Event Hubs), and APIs (REST, GraphQL), ensuring seamless data ingestion, transformation, and analysis for data engineering workflows.
- Designed and implemented data delivery processes to seamlessly integrate operational systems and files into the data lake.
- Deployed a Hadoop cluster using Cloudera Hadoop 4 (CDH4) with Pig, Hive, HBase, and Spark.
- Integrated PySpark with Apache Kafka, Apache Hadoop, and Apache Hive for streamlined data processing workflows.
- Utilized Python and Spark SQL to translate Hive/SQL native queries into Spark DataFrame transformations within Apache Spark, enabling efficient analysis of outdoor recreation data.
- Developed and implemented Spark batch jobs using Python and Spark SQL to optimize workflows and extract actionable insights from data.
- Managed and automated the lifecycle of data stored in Azure Blob Storage, including data retention policies, archiving strategies, and access controls, to enhance data governance and compliance.
- Leveraged Azure Data Factory pipelines for transformations with Databricks Spark, importing data from sources such as HDFS/Hive into Spark DataFrames using Spark 2.0 for insight generation in outdoor recreation management.
- Employed the Spark API over Cloudera Hadoop YARN to analyze data in Hive, extracting valuable insights for demand forecasting and trail optimization in outdoor recreation management.
- Integrated data from various sources into Power BI, creating visualizations and reports to monitor key performance indicators and track business metrics effectively.
- Utilized Git and coordinated with Continuous Integration (CI) tools to facilitate efficient collaboration and version control during the development of demand forecasting and trail optimization solutions.
- Implemented Slowly Changing Dimensions (SCD) and Change Data Capture (CDC) in Azure Data Factory to manage historical data and track real-time changes, enhancing data accuracy and integration.
- Attended daily sync-up calls between onsite and offshore teams to discuss ongoing features and work items, issues, blockers, and ideas to improve the performance, readability, and experience of the data presented to end users.
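A condensed, hypothetical sketch of the Kafka-to-Spark Structured Streaming pattern mentioned in this role; the broker address, topic, schema, and paths are placeholders, and a Delta sink stands in here for the HBase sink used in practice:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("telemetry_stream").getOrCreate()

    # Placeholder schema for the JSON events carried on the Kafka topic
    event_schema = StructType([
        StructField("vehicle_id", StringType()),
        StructField("metric", StringType()),
        StructField("value", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the stream from a hypothetical Kafka topic
    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "vehicle-telemetry")
             .load()
             .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
             .select("e.*")
    )

    # Windowed aggregation with a watermark for late-arriving events
    agg = (
        events.withWatermark("event_time", "10 minutes")
              .groupBy(F.window("event_time", "5 minutes"), "metric")
              .agg(F.avg("value").alias("avg_value"))
    )

    # Write the rolling aggregates to a sink with checkpointing for fault tolerance
    query = (
        agg.writeStream.format("delta")
           .outputMode("append")
           .option("checkpointLocation", "/tmp/checkpoints/telemetry")
           .start("/tmp/curated/telemetry_agg")
    )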
Client: BNY Mellon, New York, NY | Feb 2016 - Apr 2018
Role: Big Data Developer
Responsibilities:
- Implemented real-time data streaming and processing pipelines using Apache Flink, enabling low-latency data analytics and event-driven applications for large-scale data environments.
- Automated data pipelines and workflows using Oozie and implemented Flume for collecting and storing web log data for manufacturing process analysis.
- In-depth knowledge of Hadoop architecture and its components, including HDFS, Application Master, Node Manager, Resource Manager, NameNode, DataNode, and MapReduce concepts.
- Utilized Kerberos authentication principles to ensure secure network communication on the cluster and conducted testing of HDFS, Hive, Pig, and MapReduce to grant access to new users.
- Orchestrated mappings for transferring data from Oracle and SQL Server to the new data warehouse, facilitating efficient integration and analysis.
- Managed and optimized PySpark applications, implementing scheduling and automation for efficient data processing and job execution.
- Orchestrated data migration from Oracle and SQL Server to Hadoop using Sqoop, handling flat files in various formats.
- Developed a Flume and Sqoop data pipeline to ingest customer behavioral data histories into HDFS.
- Engineered Hive tables and MapReduce programs, using Hive queries for data loading, transformation, and manufacturing process analysis.
- Developed and optimized big data products and platforms using Python, Scala, Spark, and Hadoop tools such as Hive and Impala, ensuring efficient data processing workflows and scalable architectures.
- Implemented Kafka and Spark Streaming for real-time data processing, configured Spark to store data in HDFS, and developed Spark and Hive jobs for data summarization and transformation.
- Expert in designing and implementing Kafka-based stream processing solutions and data pipelines, as well as managing and optimizing Kafka clusters for high availability and performance.
- Packaged and deployed PySpark applications using Docker for scalability and portability across environments.
- Utilized Spark and Python for data processing, leveraging Spark DataFrames to ingest, transform, validate, cleanse, and aggregate unstructured data into structured formats, ensuring efficient and accurate data handling.
- Expert in using Apache Ant for optimizing complex build processes and integrating with CI/CD pipelines.
- Proficient in using Maven for dependency management and project modularization, and in developing custom plugins to enhance build workflows.
- Implemented partitioning, dynamic partitions, and buckets within Hive for efficient data organization and querying (see the partitioning sketch following this role).
- Demonstrated proficiency in using Jenkins for continuous integration, automating build and deployment processes for Hadoop-based solutions and manufacturing process enhancements.
- Leveraged Git for version control, ensuring collaborative development and tracking changes in the codebase.
- Utilized JIRA for efficient project management, tracking tasks, and facilitating communication among team members.
- Managed Hadoop infrastructure installation, configuration, maintenance, and monitoring using Cloudera Manager and shell scripts, ensuring seamless operations and security.
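An illustrative fragment of the Hive dynamic partitioning and bucketing approach noted above, written here as Spark SQL calls from Python with hypothetical database, table, and column names:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_partitioning")
             .enableHiveSupport()
             .getOrCreate())

    # Allow Hive to derive partition values from the incoming rows
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # Hypothetical target table: partitioned by load date, bucketed by a join key
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.web_events (
            session_id STRING,
            page       STRING,
            duration   DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (session_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert: load_date is taken from the source rows
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.web_events PARTITION (load_date)
        SELECT session_id, page, duration, load_date
        FROM staging.web_events_raw
    """)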
Client: Upward Health, New York | Jan 2014 - Jan 2016
Role: SQL Developer / Data Warehouse Developer
Responsibilities:
- Designed and implemented ETL processes using SSIS to efficiently extract, transform, and load data into the data warehouse, ensuring optimal performance and data integration.
- Developed and maintained reports with SSRS and created OLAP cubes using SSAS, enabling comprehensive data analysis and delivering key business insights through automated reporting.
- Optimized T-SQL queries through advanced performance tuning techniques, improving execution time and database performance while managing large datasets and ensuring data integrity.
- Leveraged SQL for optimizing query performance, tuning Data Transformation Manager (DTM) buffer and block sizes, and identifying bottlenecks in sources, targets, mappings, and sessions.
- Deployed ETL modules and monitored performance in the production environment, identifying and addressing read/write errors using workflow and session logs.
- Utilized Erwin Data Modeler to design data marts and generate DDL scripts for review by database administrators (DBAs), and designed and built ETL modules from technical transformation documents.
- Leveraged Informatica for ETL processes, designing and developing data integration workflows to transform and load data from various sources into target databases, ensuring data accuracy.
- Strong expertise in data integration, including ETL/ELT processes for large-scale data transformation and management, using tools such as Informatica PowerCenter and Informatica Data Quality for high-performance data processing.
- Expert in designing and implementing ETL processes using Informatica PowerCenter, optimizing workflows and mappings to ensure data quality, consistency, and processing efficiency.
- Hands-on experience with ELT and Change Data Capture (CDC) solutions, specifically using Informatica PowerExchange, ensuring real-time data flow between multiple systems and sources.
- Designed and developed Informatica mappings for incremental loads from source to target tables and implemented Slowly Changing Dimensions (SCD) Type I and Type II as per requirements.
- Developed SQL shell scripts for pre-session and post-session tasks, including index management and email notifications.
- Used DTS/SSIS and T-SQL stored procedures to transfer data from OLTP databases to the staging area and then into data marts, and performed operations on XML data.
- Implemented DDL (Data Definition Language) and DML (Data Manipulation Language) operations to create, modify, and manage database structures, ensuring efficient data storage and retrieval.
- Developed and maintained complex stored procedures, triggers, and functions to encapsulate business logic and automate repetitive tasks.
- Utilized window functions for advanced data analysis, such as running totals, moving averages, and ranking, to solve complex business requirements (see the sketch below).
- Worked with various SQL environments, including MySQL, PostgreSQL, Microsoft SQL Server, and Oracle, to manage and analyze large datasets.
- Collaborated in Agile Scrum methodology, actively participating in daily stand-up meetings, using Visual SourceSafe with Visual Studio 2010, and tracking project progress through Trello.
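For illustration, a small Python sketch of the kind of window-function query used for running totals and ranking in this role; the pyodbc connection string, database, table, and column names are hypothetical placeholders:

    import pyodbc

    # Placeholder connection string for a SQL Server instance
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=SalesDW;Trusted_Connection=yes;"
    )

    # Running total and per-customer ranking with T-SQL window functions
    query = """
        SELECT
            customer_id,
            order_date,
            amount,
            SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date
                              ROWS UNBOUNDED PRECEDING)                      AS running_total,
            ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
        FROM dbo.FactOrders
    """

    for row in conn.cursor().execute(query):
        print(row.customer_id, row.order_date, row.running_total, row.amount_rank)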