Yamini - AWS Data Engineer
[email protected]
Location: Abilene, Kansas, USA
Relocation: Open to relocating anywhere in the USA
Visa:
Summary:
- 10 years of IT experience as a Data Engineer, creating, implementing, and documenting data models for enterprise-level applications.
- Background in ETL data pipelines, data visualization, data warehousing, and data lakes.
- Knowledge of accelerating data processing by running PySpark jobs on a Kubernetes cluster.
- Involved in creating RDDs and DataFrames from the necessary HDFS files to convert Hive queries into Spark actions and transformations.
- Practical knowledge of dimensional modelling using star and snowflake schemas.
- Extensive knowledge of the Big Data ecosystem, specifically the Hadoop framework and related technologies including HDFS, MapReduce, Hive, Pig, HBase, Storm, YARN, Oozie, Sqoop, Airflow, and Zookeeper, along with Spark Core, Spark SQL, Spark Streaming, Scala, and Kafka.
- Solid background in migrating other databases to Snowflake, with a comprehensive understanding of Snowflake schema and table architecture.
- Set up, developed, and maintained CI/CD (continuous integration and deployment) pipelines, and automated environments and applications using tools such as Git, Terraform, and Ansible.
- Skilled in transactional modelling, fact/dimensional modelling (star schema, snowflake schema), and slowly changing dimensions (SCD).
- Solid background in data modelling, data pipelines, and SQL and NoSQL databases.
- Participated in building and automating end-to-end ETL pipelines using Python and SQL.
- Installed and configured Apache Airflow for workflow management and constructed DAGs to run jobs sequentially and in parallel.
- Knowledge of version control tools such as GitLab, SVN, CodeCommit, and Bitbucket to manage code versions and configurations.
- Expert with Spark: improved performance and optimized existing Hadoop jobs using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs, working specifically with PySpark and Scala.
- Solid working understanding of command-line utilities in UNIX and Linux shell environments.
- Skilled in Agile methodologies, Scrum stories, sprints, daily stand-up meetings, and pair programming to achieve high-quality deliverables on schedule.
- Extensive knowledge of dimensional data modelling, relational data modelling, star schema/snowflake modelling, fact and dimension tables, physical and logical data modelling, and data analysis.
- Expertise across the full software development life cycle (SDLC), designing scalable platforms, object-oriented programming, database design, and agile approaches.
- Solid working knowledge of Docker's image format and registry, and of Kubernetes for container-based deployments.
- Deployed AWS Lambda code from Amazon S3 buckets and created Lambda deployment functions triggered by S3 bucket events (see the sketch after this summary).
- Experience with databases such as DynamoDB, S3, MySQL, and ElastiCache on the AWS Cloud, with good knowledge of IaaS, PaaS, and SaaS.
- Good knowledge of NoSQL databases such as MongoDB, Redis, and Apache Cassandra.
- Strong leadership, work ethic, and communication abilities for effective teamwork.
- Used Agile and Waterfall methodologies for implementation and support.
- Knowledgeable in EC2, CloudWatch, CloudFormation, and managing security groups on AWS, as well as automating, configuring, and deploying instances on these platforms.
- Strong compliance with Python PEP guidelines and experience with Linux Bash scripting.
- Skilled in creating SQL queries, stored procedures, functions, packages, tables, views, and triggers for relational databases such as Oracle, DB2, MySQL, and MS SQL Server.
- Excellent understanding of SQL fundamentals and working knowledge of Teradata and Oracle databases.
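To illustrate the S3-triggered Lambda pattern referenced above, here is a minimal, hypothetical handler sketch in Python. The processing step (logging object size) is a placeholder only; bucket and key values come from the S3 event itself.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; fetches each new object and logs basic metadata."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()
        # Placeholder processing step: real pipelines would parse/validate/route the payload.
        print(f"Received {len(body)} bytes from s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("ok")}
```

In practice the function package would be uploaded to S3 and wired to the bucket's event notifications; the exact deployment mechanism is not specified here.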
Technical Skills:
- ETL Tools: AWS Glue, Airflow, Spark, Sqoop, Flume, Apache Kafka, Spark Streaming
- NoSQL Databases: MongoDB, Cassandra, Amazon DynamoDB, HBase
- Data Warehouses: AWS Redshift, Google Cloud Storage, Snowflake, Teradata
- Tools: PyCharm, Visual Studio, RStudio, Power BI, Tableau, SAS Studio, Gephi, Eclipse, PuTTY, Mainframes, Excel, Jupyter Notebook, Azure Databricks
- Web Development: HTML, XML, JSON, CSS, jQuery, JavaScript
- Monitoring Tools: Splunk, Chef, Nagios, ELK
- Source Code Management: JFrog Artifactory, Nexus, GitHub, CodeCommit
- Containerization: Docker & Docker Hub, Kubernetes, OpenShift
- Hadoop Distributions: Cloudera, Hortonworks, MapR, AWS EMR, GCP Dataproc
- Programming and Scripting: Spark, Scala, Python, Java, MySQL, PostgreSQL, Shell Scripting, Pig, HiveQL
- AWS: EC2, S3, Glacier, Redshift, RDS, EMR, Lambda, Glue, CloudWatch, Rekognition, Kinesis, CloudFront, Route 53, DynamoDB, CodePipeline, EKS, Athena, QuickSight
- Hadoop Tools: HDFS, HBase, Hive, YARN, MapReduce, Pig, Apache Storm, Sqoop, Oozie, Zookeeper, Spark, Solr, Atlas
- Build & Development Tools: Jenkins, Maven, Gradle, Bamboo
- Methodologies: Agile/Scrum, Waterfall

Education:
- Master of Science (Computer Science), 2015, University of Central Missouri, Lee's Summit, MO, USA
- Bachelor's in Computer Science, 2013, Audisankara Institute of Technology, Gudur, India

Professional Experience:

Mclaneco, Temple, TX    Apr 2023 to Present
Sr. AWS Data Engineer
Responsibilities:
- Performed actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts while working on a Scala codebase connected to Apache Spark.
- Ensured the development process adhered to SDLC (Software Development Life Cycle) principles.
- Worked on the design of RESTful software services and stored the data in MySQL and PostgreSQL databases.
- Planned and executed a successful migration of databases from on-premises to the AWS cloud.
- Created data pipelines from many sources into Snowflake.
- Designed the Snowflake warehouse strategy and set it up to use PUT scripts to migrate a terabyte of data from S3 into Snowflake.
- Transferred data from an AWS S3 bucket to Snowflake by creating a customized read/write Snowflake utility function in Scala.
- Created an external S3 stage in Snowflake to be used for data migration.
- Used and configured a variety of AWS services, including Redshift, EMR, EC2, and S3, to keep up with business standards.
- Conducted data blending, prepared data for Tableau consumption using Alteryx and SQL, and published data sources to Tableau Server.
- Developed Spark applications using PySpark and Spark SQL for data extraction, processing, and aggregation from various file formats.
- Built Spark Streaming tasks in Python to retrieve JSON files from AWS S3 buckets and read messages from Kafka (see the sketch after this section).
- Converted manual report systems into fully automated CI/CD data pipelines that feed data from several marketing platforms into an AWS S3 data lake.
- Loaded data into Parquet Hive tables from Avro Hive tables after creating partitioned and bucketed Hive tables in Parquet file format with Snappy compression.
- Deployed Lambda functions and other requirements in AWS to automate EMR spin-up jobs.
- Created, scheduled, and monitored data pipelines with Apache Airflow.
- Worked with Jira and Confluence and produced data visualizations using Matplotlib and the Seaborn package.
- Involved in creating a test environment using Docker containers and orchestrating the containers with Kubernetes.
- Scheduled Spark processes/applications in the AWS EMR cluster.
- Involved in processing huge datasets of diverse types, including structured, semi-structured, and unstructured data.
- Created tables in Hive and MySQL, and used Sqoop to process data, including importing and exporting databases to HDFS.
- Knowledge of big data analytics, including Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka, as well as MapReduce programming.
- Created action filters, parameters, and calculated sets to build dashboards and workbooks in Power BI.
- Created a data pipeline using Spark, Scala, and Apache Kafka to ingest data from a CSL source and store it in a secured HDFS folder.
- Built user-friendly website interfaces using Python and Django view controllers and the templating language.
- Created and developed UNIX shell scripts for task scheduling, and wrote PL/SQL scripts for dropping and rebuilding indexes as well as pre- and post-session shell scripts.
Environment: Python, SQL, Amazon Web Services (EC2, S3, Amazon SimpleDB, RDS, Elastic Load Balancing, Elasticsearch, Amazon MQ, Lambda, SQS, IAM, CloudWatch, EBS, CloudFormation), Google Cloud Platform (GCP), Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, NLTK, Docker, Machine Learning, Snowflake, Hadoop, Hive, Sqoop, YARN, HDFS, Flume, PySpark, Spark SQL, Tableau, SAS Visual Analytics
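A minimal PySpark Structured Streaming sketch in the spirit of the Kafka/S3 streaming work described above: it reads JSON messages from a Kafka topic and lands the parsed records in an S3 data-lake path as Parquet. The broker address, topic name, schema, and bucket paths are hypothetical placeholders, and the Kafka connector package is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-json-to-s3").getOrCreate()

# Hypothetical schema for the incoming JSON messages.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read messages from a Kafka topic (broker and topic names are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "marketing-events")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Write the parsed events to an S3 data-lake prefix as Parquet, micro-batching every minute.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-data-lake/marketing-events/")
         .option("checkpointLocation", "s3a://example-data-lake/checkpoints/marketing-events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```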
Mesirow, Chicago, IL    Feb 2021 to Mar 2023
AWS Data Engineer
Responsibilities:
- Developed methods for ingesting data from different sources and processing data at rest using big data tools such as Hadoop, MapReduce frameworks, HBase, and Hive.
- Deployed Snowflake following best practices and provided subject-matter expertise in data warehousing, especially with Snowflake.
- Used the Oozie workflow engine to orchestrate numerous Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark processes.
- Created a variety of mappings in Informatica Designer using the full set of sources, targets, and transformations.
- Conducted POCs using Tableau and AWS QuickSight to suggest product direction, and created a dashboard and a metadata layer.
- Used HDFS and Sqoop commands to handle extensive data input, gathering partitioned data in a variety of storage formats such as text, JSON, and Parquet.
- Loaded data from the Linux file system into HDFS.
- Configured and integrated the required AWS services in accordance with business requirements to stand up Infrastructure as Code (IaC) on the AWS cloud platform from scratch.
- Created and crafted the study enrollment view for the metadata layer.
- Performed data analysis and visualization using AWS Athena and QuickSight.
- Participated in all phases of the project and its scope; used MDM reference data to develop a data dictionary and a source-to-target mapping in the MDM data model.
- Created a data pipeline using Spark, Scala, and Apache Kafka to ingest data from a CSL source and store it in a secured HDFS folder.
- Designed and deployed multi-tier applications with an emphasis on high availability, fault tolerance, and auto-scaling using AWS CloudFormation and AWS services such as EC2, AWS Glue, Athena, Lambda, S3, RDS, Redshift, DynamoDB, SNS, SQS, and IAM.
- Developed custom Kafka producers and consumers for a variety of publish and subscribe use cases on Kafka topics (see the sketch after this section).
- Created and deployed several ETL systems with different data sources using sophisticated SQL scripting, ETL tools, Python, shell scripting, and scheduling tools.
- Performed data profiling and manipulation on XML, web feeds, and files using Python, Unix, and SQL.
- Migrated data between many Teradata servers to assist the development team.
- Architected and created serverless web applications with AWS Lambda, API Gateway, DynamoDB, and Security Token Service (STS).
- Contributed to several stages of the program's Software Development Life Cycle (SDLC), including requirements gathering, design, analysis, and code development.
- Created data models for applications, views, tables, and other database objects containing metadata.
- Built automated ETL pipelines capable of processing a wide range of data sources.
- Knowledgeable about loading and extracting data in Python and working with Python tools for data analysis, including Matplotlib, NumPy, SciPy, and Pandas.
- Built Python-based database models, APIs, and views to create an interactive web-based solution.
- Coordinated team development using the Git version control tool.
- Integrated Jenkins and SonarQube to enable SonarQube's Maven scanner to perform continuous code quality inspection and analysis.
- Transformed Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and used Sqoop to import and export data between RDBMS and HDFS.
- Created dimensional data models using star and snowflake schemas as well as 3NF data models for OLTP systems.
- Involved in provisioning AWS infrastructure using Terraform scripts from Jenkins.
- Used Docker and Kubernetes heavily to securely ship, run, and deploy the application in containers and to speed up the build and release engineering process.
- Created Sqoop tasks to import data from DB2 to HDFS.
- Created visuals, processed XML, exchanged data, and implemented business logic using Python and Django.
Environment: Python, Machine Learning, AWS (S3, EMR, Lambda, CloudFormation, Redshift, Elasticsearch), Flask, Snowflake, JSON, Hadoop, Hive, MapReduce, Scala, HBase, HDFS, YARN, PySpark, Spark, Kafka, Apache NiFi, SQL
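A minimal sketch of a Kafka producer/consumer pair of the kind mentioned above, using the kafka-python client as an example library; the broker address, topic name, and message payload are assumptions for illustration only.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["broker:9092"]        # placeholder broker address
TOPIC = "enrollment-events"      # placeholder topic name

# Producer: publish JSON-encoded messages to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"record_id": 1, "status": "ENROLLED"})
producer.flush()

# Consumer: subscribe to the same topic and decode each message back into a dict.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```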
Fifth Third Bank, Cincinnati, OH    Apr 2019 to Dec 2020
AWS Data Engineer
Responsibilities:
- Knowledge of Amazon AWS services such as Redshift, EC2, S3, and EMR, which offer quick and effective big data processing.
- Used Sqoop to import and export data into HDFS, Pig, Hive, and HBase.
- Controlled and examined Hadoop log files.
- Set up and configured Hadoop MapReduce and HDFS, and created several Java MapReduce jobs for cleaning and processing data.
- Used S3, Lambda, Glue, DynamoDB, Elasticsearch, CloudWatch, and Athena to create and maintain a data lake across AWS.
- Loaded and transformed enormous sets of structured, semi-structured, and unstructured data into the Hadoop system.
- Created Java MapReduce programs for data analysis, both simple and complex.
- Used Flume to ingest data from numerous sources into HDFS.
- Created MapReduce scripts to process raw data, fill staging tables, and store the cleaned data in partitioned tables in the EDW.
- Used Spark SQL to prepare data and store it in AWS S3, working with DataFrames formed by importing data from Hive databases (see the sketch after this section).
- Developed FTP programs to save DB2 Sqoop data in AWS in Avro format.
- Examined the Hadoop cluster and several big data analysis tools, such as MapReduce, Hive, and Spark.
- Created extract, transform, and load (ETL) software for DB2 fact and dimension tables.
- Used Sqoop to connect to the MySQL database, created the Oozie workflow, and converted the MySQL data into Avro before writing it to HDFS.
- Developed Hive queries that helped market analysts spot new trends by comparing recent data with EDW reference tables and historical measures.
- Participated in significant development projects using codebases built on Python, Django, R, MySQL, MongoDB, and jQuery.
- Exported data using Sqoop into HDFS and Hive for report analysis.
Environment: Python, NumPy, Pandas, SQLAlchemy, scikit-learn, AWS, Lambda, SQS, Snowflake, Hadoop, Hive, MapReduce, Pig, HDFS, Flume, Scala, Sqoop, Spark, Spark SQL, Kafka, JSON, GitHub, Oracle SQL Server, MS Excel, Linux
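A minimal sketch of the Spark SQL pattern described above: reading a table registered in the Hive metastore, shaping it with DataFrame operations, and writing the prepared output to S3 as Parquet. The database, table, column, and bucket names are hypothetical, not taken from the actual project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support lets Spark SQL read tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-to-s3-prep")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical source table and columns.
txns = spark.table("edw.transactions")

# Simple preparation step: filter, derive a month column, and aggregate.
prepped = (txns
           .filter(F.col("txn_date") >= "2020-01-01")
           .withColumn("txn_month", F.date_format("txn_date", "yyyy-MM"))
           .groupBy("txn_month", "merchant_category")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("txn_count")))

# Persist the prepared data to S3 in Parquet, partitioned by month.
(prepped.write
 .mode("overwrite")
 .partitionBy("txn_month")
 .parquet("s3a://example-prep-bucket/transactions_monthly/"))
```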
AT&T, Dallas, TX    Jan 2016 to Mar 2019
Data Engineer
Responsibilities:
- Worked with dimensional modeling, data migration, data cleansing, and ETL processes for data warehouses.
- Developed ETL processes (DataStage Open Studio) to load data from multiple data sources to HDFS using Sqoop.
- Monitored resources and applications using AWS CloudWatch, including creating alarms for metrics across EBS, EC2, ELB, RDS, S3, EMR, IAM, Athena, Glue, and SNS, and configured notifications for the alarms generated based on defined events.
- Designed, developed, and implemented pipelines using the Python API (PySpark) of Apache Spark on AWS EMR.
- Created, modified, and executed DDL for AWS Redshift and Snowflake tables to load data.
- Created Hive external tables and used custom SerDes based on the structure of the input files so that Hive knows how to load them into Hive tables.
- Designed and developed the core data pipeline code, involving work in Python and built on Kafka and Storm.
- Orchestrated data workflows using Airflow, managing and scheduling them by creating DAGs in Python.
- Imported data from different sources such as HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Implemented Spark Streaming and Spark SQL using DataFrames.
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations.
- Imported required tables from RDBMS to HDFS using Sqoop, and used Storm/Spark Streaming and Kafka to stream data in real time into HBase.
- Used NiFi to automate the data flow between disparate systems.
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS sources.
- Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
- Used HiveQL to analyze partitioned and bucketed data, executing Hive queries on Parquet tables stored in Hive to perform data analysis that met the business specification logic.
- Wrote MapReduce jobs for tracking customer preferences and providing recommendations implicitly and explicitly.
- Used Avro, Parquet, RCFile, and JSON file formats, and developed UDFs in Hive and Pig.
- Converted Hive SQL queries into Spark transformations using Spark RDD and PySpark concepts.
- Worked with the Log4j framework for logging debug, info, and error data.
- Responsible for data cleaning, pre-processing, and modelling using Spark and Python.
Environment: Python, AWS Services, Cloudera Distributions, Hadoop, Hive, SerDes, HBase, HDFS, Pig, Apache NiFi, MapReduce, Sqoop, Spark, PySpark, Spark SQL, Scala, Kafka, Airflow, Snowflake, ETL, SQL, Avro, Parquet, RCFile, Unix Shell Scripting, Tableau

Mastercard, St. Louis, MO    Mar 2015 to Dec 2015
Data Engineer
Responsibilities:
- Installed and configured Hadoop and Snowflake, and was responsible for maintaining the cluster and managing and reviewing Hadoop log files.
- As a Big Data Engineer, responsible for developing, troubleshooting, and implementing programs.
- Installed Hadoop, Cassandra, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Built Hadoop star schema solutions for big data problems using MR1 and MR2 on YARN.
- Implemented the big data solution using Hadoop, Hive, and Informatica to pull/load data into HDFS.
- Installed and configured Hadoop ecosystem components such as HBase, Flume, Pig, and Sqoop.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into an RDBMS through Sqoop.
- Performed star schema data analysis and data manipulation of source data from SQL Server and other data structures to support the business organization.
- Responsible for building scalable distributed data solutions using big data technologies such as Apache Hadoop, MapReduce, shell scripting, and Hive.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist the data into HDFS.
- Involved in all phases of data mining: data collection, data cleaning, model development, validation, and visualization.
- Performed Spark and Hive data extraction, analysis, and manipulation, and prepared various production and ad-hoc reports to support cost optimization initiatives and strategies.
- Responsible for data mapping and data mediation between source data tables and target data tables using MS Access and MS Excel.
- Developed PL/SQL programs, including views, stored procedures, packages, functions, and database triggers.
- Performed data analysis and data profiling using various source systems, including Oracle, SQL Server, and DB2.
- Wrote complex SQL scripts and PL/SQL packages to extract data from various source tables of the data warehouse.
Environment: Big Data, Cloudera Manager (CDH5), Snowflake, Hadoop, Hive, HDFS, Sqoop, MapReduce, Cassandra, Spark, Pig, Scala, YARN, Oozie, Kafka, Flume, Python, Shell Scripting, MS Excel, Git, SQL, Oracle