Ravi Teja - Sr. Data Engineer / Data Analyst |
[email protected] |
Location: Durham, North Carolina, USA |
Relocation: Yes |
Visa: H1B
Name: RAVITEJA THOTA
[email protected] | Morrisville, North Carolina, USA
________________________________________
PROFESSIONAL SUMMARY:
- 8+ years of professional experience as a Data Engineer in the design, development, and implementation of cloud, Big Data, Spark, Scala, and Hadoop solutions and in the maintenance of data pipelines.
- Strong experience in data modeling, data migration, design, data warehousing, data ingestion, data integration, data consumption, data delivery, and integrated reporting.
- Extensive experience processing and analyzing Big Data, with hands-on experience in Big Data ecosystems and related technologies such as Hive, Spark, Cloudera, Hortonworks, Navigator, Mahout, HBase, Pig, Zookeeper, Sqoop, Flume, Oozie, and HDFS.
- Experience using Spark to improve the performance and optimization of existing Hadoop algorithms, working with Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.
- Experience designing and developing end-to-end Hadoop architecture and Hadoop components such as HDFS, MapReduce, Hive, HBase, Kafka, Sqoop, Spark, Scala, Oozie, YARN, NoSQL, Postman, and Python.
- Extensive experience with Azure Cloud using Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analytical Services, Azure SQL Data Warehouse, NoSQL DB, Azure HDInsight, and Databricks.
- Good knowledge of transforming Azure projects and implementing ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Good working experience with Amazon EC2, S3, RDS, CloudWatch, Glue, Lambda, EMR, Redshift, DynamoDB, and other AWS services.
- Experience in scripting to automate the ingestion process using PySpark and Scala as needed from various sources such as AWS S3, Teradata, and Redshift.
- Strong understanding of ETL framework metadata to recognize the current state of an ETL implementation, and experience in ELT with transformation and optimization.
- Good exposure to data quality, data mapping, and data filtration using data warehouse ETL tools such as Talend and Informatica.
- Experience implementing serverless architecture using Lambda and S3 buckets.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie, using operators such as PythonOperator, BashOperator, and Google Cloud Storage download operators.
- Good knowledge of implementing Continuous Integration and Continuous Deployment (CI/CD) through Jenkins for Hadoop jobs and managed the container clusters using Git and SVN.
- Experience working with structured and unstructured data in various file formats such as text, Parquet, and JSON files.
- Expertise in data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Proficient in identifying facts and dimensions and in star schema and snowflake schema design for modeling a data warehouse using relational, dimensional, and multidimensional modeling.
- Strong experience working in UNIX environments and writing shell scripts.
- Experience in writing SQL queries and optimizing Teradata, Oracle, and SQL Server queries.
- Extensive experience in data storage and documentation using databases such as MongoDB, Snowflake, and HBase.
- Experience with the Software Development Life Cycle (analysis, design, development, and testing), familiarity with configuration management and project execution, and experience with Agile and traditional Waterfall methodologies.
- Good knowledge of Teradata utilities such as BTEQ, FastLoad, and MultiLoad; extensively worked on performance tuning of Teradata SQL scripts.
- Experience visualizing data using BI services and tools such as Power BI, Tableau, Plotly, and Matplotlib.

TECHNICAL SKILLS:
Programming Languages: Python, SQL, PL/SQL, Shell scripts, Java, Scala, Unix
Big Data Tools: Hadoop, Apache Spark, MapReduce, PySpark, Hive, YARN, Kafka, Flume, Oozie, Airflow, Zookeeper, Sqoop, HBase, Aerospike
Cloud Services: AWS Glue, S3, Redshift, EC2, EMR, DynamoDB, Data Lake, AWS Lambda, CloudWatch, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analytical Services, HDInsight, Azure SQL Data Warehouse, Angular MVC
ETL/Data Warehouse Tools: Informatica, Talend, DataStage, Power BI, and Tableau
Version Control & Containerization Tools: SVN, Git, Bitbucket, Docker, and Jenkins
Databases: Oracle, MySQL, MongoDB, and DB2
Operating Systems: Ubuntu, Windows, and Mac OS
Methodologies: Agile and traditional Waterfall

PROFESSIONAL EXPERIENCE

Client: BCBS | Oct 2023 - Dec 2024
Location: Chicago, IL
Role: Sr. Data Engineer / Data Analyst
Roles & Responsibilities:
- Implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB).
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
- Worked on migration of on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
- Implemented Copy activities and custom Azure Data Factory pipeline activities.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Moved data from Azure Data Lake to Azure SQL Data Warehouse using PolyBase; created external tables in the data warehouse on 4 compute nodes and scheduled the loads.
- Worked on data ingestion from multiple sources into Azure SQL Data Warehouse; transformed and loaded data into Azure SQL Database and maintained data storage in Azure Data Lake.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Responsible for development and maintenance of data pipelines on the Azure analytics platform using Azure Databricks.
- Developed purging scripts and routines to purge data on Azure SQL Server and Azure Blob Storage.
- Developed Azure Databricks notebooks to apply business transformations and perform data cleansing operations.
- Developed complex data pipelines using Azure Databricks and Azure Data Factory (ADF) to create a consolidated and connected data lake environment.
- Wrote Hive SQL scripts to create complex tables with high-performance features such as partitioning, clustering, and skewing.
- Loaded data from HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
- Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Worked on data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the illustrative sketch below).
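A minimal sketch of the Hive-to-Spark conversion pattern referenced above. The table and column names (claims, claim_status, claim_amount, member_id) are hypothetical placeholders used only for illustration, not details from the actual project.

from pyspark.sql import SparkSession, functions as F

# Hypothetical Hive query being converted:
#   SELECT member_id, SUM(claim_amount) AS total_claims
#   FROM claims WHERE claim_status = 'PAID' GROUP BY member_id
spark = SparkSession.builder.appName("hive_to_spark").enableHiveSupport().getOrCreate()

claims = spark.table("claims")                       # Hive-managed source table
totals = (claims
          .filter(F.col("claim_status") == "PAID")   # WHERE clause as a DataFrame filter
          .groupBy("member_id")
          .agg(F.sum("claim_amount").alias("total_claims")))
totals.write.mode("overwrite").saveAsTable("claims_summary")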
- Ingested data into Hadoop from various data sources such as Oracle and MySQL using the Sqoop tool.
- Created Sqoop jobs with incremental loads to populate Hive external tables.
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed front-end applications using the Angular MVC architecture, integrating with Azure Data Lake and Azure SQL Database for real-time data retrieval and dynamic reporting.
- Built and maintained Angular components and services to interact with backend data pipelines, ensuring seamless data flow and optimized performance for end users.
- Collaborated with backend developers to design RESTful APIs, enabling smooth integration between the Angular front end and Azure-based data storage solutions such as Azure SQL Data Warehouse and Blob Storage.
- Configured and monitored data pipelines on Cloudera (Hadoop) environments for optimized performance and scalability.
- Ingested, stored, and processed high-velocity transactional data using Aerospike for real-time analytics.
- Utilized Hive on Cloudera for querying and transforming large datasets with efficient partitioning and bucketing strategies.
- Integrated Cloudera tools to process structured and semi-structured data for data science workflows.
Environment: Python, Hadoop, Spark, Spark SQL, Spark Streaming, PySpark, Hive, Scala, MapReduce, HDFS, Kafka, Sqoop, HBase, MS Azure, Blob Storage, Data Factory, Cloudera, Angular MVC, Databricks, SQL Data Warehouse, Apache Airflow, Snowflake, Oracle, MySQL, UNIX shell scripting, Perl, PowerShell, SSIS, Power BI, Aerospike

Client: Citizens Bank | May 2022 - Aug 2023
Role: Sr. Data Engineer
Roles & Responsibilities:
- Worked on extraction, transformation, and loading of data from source systems to Azure data storage services using a combination of Azure Data Factory and Spark SQL.
- Worked on designing ETL data-driven workflows to transform data from multiple sources into master tables.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from JSON/Parquet files, analyzing and transforming the data to uncover insights into customer usage patterns.
- Participated in developing the project plan and implementation schedule through consultations with the project team and external consultants.
- Gathered requirements and performed analysis, design, and development of enhancements to the application.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage using Azure Data Factory.
- Worked on the design, development, and implementation of an Azure Data Factory framework with error logging to populate data into Azure SQL Data Warehouse from Azure Blob Storage and Azure Data Lake Store.
- Worked on ingesting data into Azure Data Lake, Azure Storage, and Azure DW and processed the data in Azure Databricks.
- Worked on developing data-driven workflows using Spark, Hive, Pig, Python, Impala, and HBase for further data ingestion.
- Worked on streaming data using Kafka, Spark, and Hive and developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL.
- Created Python notebooks on Azure Databricks for processing datasets and loading them into Azure SQL databases (see the illustrative sketch below).
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure and integrated data using Databricks.
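A minimal sketch of the kind of Databricks notebook cell described above. The ADLS account, container, paths, Azure SQL server, database, and table names are hypothetical placeholders; in a Databricks notebook the SparkSession is already provided as spark, and credentials would normally come from a secret scope rather than literals.

from pyspark.sql import SparkSession, functions as F

# In Databricks, getOrCreate() simply returns the session already provided as `spark`.
spark = SparkSession.builder.getOrCreate()

# Read raw Parquet from ADLS Gen2 (hypothetical container, account, and path).
raw = spark.read.parquet("abfss://raw@exampleaccount.dfs.core.windows.net/transactions/2023/")

# Basic cleansing: drop duplicate keys and rows missing the customer identifier.
cleaned = raw.dropDuplicates(["txn_id"]).filter(F.col("customer_id").isNotNull())

# Load into Azure SQL Database over JDBC; in practice the credentials would come from
# a secret scope (e.g. dbutils.secrets.get) rather than placeholder strings.
(cleaned.write.format("jdbc")
    .option("url", "jdbc:sqlserver://example-sql.database.windows.net:1433;database=analytics")
    .option("dbtable", "dbo.transactions_cleaned")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .mode("append")
    .save())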
- Worked on developing data pipelines in Azure Data Factory using activities such as Move & Transform, Copy, Filter, ForEach, Get Metadata, Lookup, and Databricks notebook activities.
- Worked on implementing large enterprise data and Lambda architectures using Azure Data Platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, and Azure SQL Server.
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
- Designed and developed reusable Angular components and services to create a modular front-end architecture, improving code maintainability and scalability across multiple web applications.
- Implemented Angular Material for building a consistent and user-friendly interface, enhancing user experience with modern UI elements such as navigation, forms, and data tables.
- Utilized the Angular CLI for efficient project setup, testing, and deployment, ensuring a streamlined development process and reducing build time for production environments.
- Worked on designing, developing, and testing dimensional data models using star and snowflake schema methodologies.
- Worked on implementation of ad-hoc analysis solutions using Azure Data Lake Analytics and HDInsight.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and PySpark.
- Worked in an Agile/Scrum development environment with frequently changing requirements and actively participated in daily scrum meetings and reviews with biweekly sprint deliveries.
- Worked on DirectQuery in Power BI to compare legacy data with current data, generated reports, and published dashboards.
Environment: Python, SQL, Oracle, SQL Server, DB2, MongoDB, Azure Data Lake Storage, Azure Data Factory, Azure SQL Data Warehouse, Azure Blob Storage, Azure Data Lake Store, Spark, Angular MVC, Spark SQL, JSON, Parquet files, Kafka, Hive, Azure Databricks, Pig, Impala, HBase, Snowflake and star schema, Agile, Power BI.

Client: AT&T | Nov 2020 - Feb 2022
Role: Data Engineer
Roles & Responsibilities:
- Participated in requirement discussions, designed solutions, and built scalable distributed data solutions using Hadoop.
- Migrated on-premises data to Azure Data Lake Storage using Azure Data Factory.
- Performed statistical data analysis and data visualization using Python and R, and implemented analysis programs in Spark using Scala.
- Implemented data integration using Data Factory and Databricks from input sources to Azure services.
- Developed Python programs for manipulating data read from various Teradata tables and consolidating it into single CSV files (see the illustrative sketch below).
- Automated data flows and pipelines interacting with multiple Azure services using Azure Databricks and Power Automate (Flow).
- Worked on designing ETL pipelines in Azure Data Factory and used Databricks for transforming the workflows.
- Worked on creating filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Performed data visualization using tools such as Tableau and packages in R, interacting with other data scientists and architecting custom solutions.
- Worked on Tableau Server implementations for biweekly and monthly increments based on business changes to ensure that views and dashboards displayed the changed data accurately.
- Used Azure SQL and Azure SQL Data Warehouse for creating and managing data-driven workflows in Azure Data Factory.
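Referring to the Teradata-to-CSV consolidation mentioned above, here is a minimal sketch. The choice of the teradatasql driver, the host, credentials, schema, and table names are all assumptions made for illustration.

import csv
import teradatasql

# Hypothetical source tables to consolidate into a single CSV extract.
TABLES = ["usage_daily", "usage_monthly"]

con = teradatasql.connect(host="td.example.com", user="etl_user", password="***")
cur = con.cursor()
with open("combined_extract.csv", "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for table in TABLES:
        cur.execute(f"SELECT * FROM edw.{table}")
        if not header_written:
            writer.writerow([col[0] for col in cur.description])  # column names from the first table
            header_written = True
        writer.writerows(cur.fetchall())
cur.close()
con.close()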
- Worked on maintenance of high-volume data sets and combined data from various sources using SQL queries, Excel, Visio, and Access.
- Created Hive queries that helped market analysts spot emerging trends by comparing incremental data with Teradata reference tables and past metrics.
- Worked on creating Hive tables, loading structured data resulting from MapReduce jobs into the tables, and writing Hive queries to further analyze the logs to identify issues and behavioral patterns.
- Designed and developed an analytics POC on Spark and implemented and stored the results using Azure.
- Involved in running MapReduce jobs for processing millions of records.
- Wrote complex SQL queries using joins and OLAP functions such as CSUM, COUNT, and RANK.
- Involved in extensive routine operational reporting, ad-hoc reporting, and data manipulation to produce routine metrics and dashboards for management.
- Built and published customized interactive reports and dashboards and scheduled reports using Tableau Server.
- Wrote several Teradata SQL queries using Teradata SQL Assistant for ad-hoc data pull requests.
- Responsible for data modeling per requirements in HBase and for managing and scheduling jobs on a Hadoop cluster using Oozie.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQL context.
- Designed and developed ETL processes using the Informatica ETL tool for dimension and fact file creation.
Environment: Python, R, Azure Data Factory, Azure Data Lake Storage, Databricks, Azure SQL, Azure SQL Data Warehouse, Scala, Spark, Tableau, JIRA, Tableau Server, Excel, Access, SQL queries, Hive, Teradata, MapReduce, OLAP, HBase, Hadoop, Oozie, Informatica ETL.

Client: TJX | Apr 2018 - Oct 2020
Role: Sr. Data Analyst / Data Engineer
Roles & Responsibilities:
- Participated in gathering requirements, analyzing the entire system, and providing estimates for development and testing efforts.
- Performed analysis of enterprise data report integration and provided functional specifications to further develop and build enterprise reporting systems.
- Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for processing and storage of small data sets, and maintained the Hadoop cluster using AWS EMR.
- Performed transformations using Spark and saved the results back to HDFS and later to the target Snowflake database.
- Worked on designing and developing ETL processes in AWS Glue to migrate data from external sources such as S3, in file formats such as JSON, Parquet, and text files, into AWS Redshift.
- Worked on processing real-time data analytics using Spark Streaming, Kafka, and Flume.
- Configured Spark Streaming to get ongoing information from Kafka and store the stream data in HDFS.
- Transformed and cleansed the input data extracted from external sources using Spark.
- Created DataStage jobs using stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator.
- Worked on creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes (see the illustrative DAG sketch below).
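A minimal sketch of an Airflow DAG of the kind described above, assuming Airflow 2.x. The DAG id, schedule, task names, and the Snowflake load callable are hypothetical placeholders; the actual load logic (for example a COPY INTO from a stage) is omitted.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def load_to_snowflake(**context):
    # Placeholder for the real load step, e.g. running COPY INTO against a Snowflake stage.
    print("loading validated batch into Snowflake")

with DAG(
    dag_id="nightly_snowflake_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",   # nightly batch at 02:00
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_files",
                           bash_command="echo 'extracting source files to staging'")
    load = PythonOperator(task_id="load_snowflake", python_callable=load_to_snowflake)
    extract >> load   # the load runs only after extraction succeeds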
- Built ETL pipelines for data ingestion, data transformation, and data validation on AWS, working alongside the data steward under data compliance requirements.
- Worked on scheduling and validating all jobs with Airflow scripts in Python, adding tasks to the directed acyclic graph and defining dependencies between the tasks using Lambda.
- Worked on extracting and filtering data in data pipelines using PySpark and transformed the data pipelines.
- Monitored the servers using CloudWatch and stored and retrieved the data extracted from the workflows.
- Used Spark applications with Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Worked on writing Hive and Spark queries optimized with window functions and a customized Hadoop shuffle.
- Worked on estimating cluster size, monitoring, and troubleshooting the Spark Databricks cluster, and wrote UNIX shell scripts for automating data load processes.
- Worked on implementing monitoring solutions using Docker and Jenkins.
- Implemented dashboard designs and created worksheets and data visualization dashboards using Tableau.
- Worked within the Agile/Scrum development methodology and tested the application in each iteration.
Environment: Python, Hive, Spark, AWS EC2, S3, AWS EMR, AWS Glue, HDFS, Spark Streaming, Kafka, Flume, JSON, Parquet, text files, AWS Redshift, DataStage, Airflow, Snowflake, ETL pipelines, Lambda, PySpark, CloudWatch, Spark SQL, UNIX shell, Agile, Tableau, Docker, and Jenkins.

Client: Teradata | Jul 2016 - Nov 2017
Role: Data Engineer / Data Analyst
Roles & Responsibilities:
- Analyzed large amounts of raw data to determine optimal ways to aggregate and transform it into meaningful and useful formats using Python and SQL.
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
- Developed code to import data from SQL Server into HDFS and created Hive views on data in HDFS using Spark in Scala.
- Created scripts to append data from temporary HBase tables to target HBase tables in Spark, wrote Spark programs in Scala, and ran Spark jobs on YARN.
- Developed a web service using HBase and Hive to compare schemas between HBase and Hive tables.
- Worked with NoSQL databases such as HBase and used Spark for real-time streaming of data into the cluster.
- Used Tableau Desktop to analyze and obtain insights into large data sets using groups, bins, hierarchies, sorts, sets, and filters.
- Monitored SQL scripts and modified them for improved performance using PySpark and SQL.
- Monitored threads and memory for the HBase and Hive schema-check web application.
- Used Python for time series analysis and machine learning to build the supply chain management system.
- Created and managed policies for S3 buckets and utilized S3 buckets for storage and backup on AWS.
- Worked on integrating Big Data and analytics based on Hadoop, Spark, Kafka, and web methods technologies.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Produced a comprehensive analysis report on legacy data, data structure, and statistical summaries with Python (see the illustrative profiling sketch below).
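A minimal sketch of the kind of Python-based legacy-data profiling described above, using pandas; the input file name and the choice of summary columns are assumptions for illustration.

import pandas as pd

# Hypothetical legacy extract to profile.
legacy = pd.read_csv("legacy_extract.csv")

# Per-column structural summary: data type, completeness, and cardinality.
summary = pd.DataFrame({
    "dtype": legacy.dtypes.astype(str),
    "non_null": legacy.notna().sum(),
    "null_pct": (legacy.isna().mean() * 100).round(2),
    "distinct": legacy.nunique(),
})
print(summary)

# Basic statistical summary covering numeric and non-numeric columns.
print(legacy.describe(include="all").transpose())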
- Developed Apache Spark applications with Python to perform different kinds of validations and standardization on fields based on defined validation rules for incoming data.
- Developed Python and Scala Spark programs for data reformatting after extraction from HDFS for analysis.
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS sources.
- Worked in an Agile development environment and actively participated in daily scrums and other design-related meetings.
Environment: Python, SQL, HBase, Hive, Amazon Redshift, AWS CloudWatch, EC2, S3, EMR, Hadoop, HDFS, Spark, Scala, YARN, NoSQL, Tableau, PySpark, Kafka, Apache Spark, RDBMS.

EDUCATION
Bachelor's degree, Electronics and Communications Engineering - 2016
Vellore Institute of Technology