Site Reliability Engineer (SRE) at Remote, USA
Email: [email protected]
http://bit.ly/4ey8w48
https://jobs.nvoids.com/job_details.jsp?id=1244671&uid=

From: Vikrama Rao, ValiantIQ INC
[email protected]
Reply to: [email protected]

Position: Site Reliability Engineer (SRE)
Duration: 9 Months (also listed as 6 to 12 Months)
Location: 100% Remote

Site Reliability Engineer (SRE)

The Site Reliability Engineer will play a key role in building Observability and Resilience capabilities on the cloud platform (Azure).

Responsibilities of the SRE will be:
- Build and configure the alerts, tracing, telemetry, and instrumentation required for Infrastructure Monitoring and Application Performance Management.
- Implement dashboards to monitor and share observability data at various levels (engineering teams, portfolio, senior management).
- Support resilience engineering (application and infrastructure resilience) to meet availability requirements.
- Work with development engineers, cloud engineers, product teams, and support engineers to gather requirements and to implement and evolve observability and resilience solutions.

Key Skillsets:
- Extensive knowledge of Observability and Application Performance Monitoring best practices and KPIs/metrics on Cloud platforms
- Experience with monitoring tools: Dynatrace and Splunk
- Experience with incident resolution (on-call support) and with troubleshooting application errors and performance using Dynatrace and Splunk to assist application teams with root cause analysis
- Experience working with SLOs and error budgets; understanding of SLA/SLI/SLO
- Expertise with the Splunk query language (SPL)
- Experience building monitoring solutions for container-based workloads (Java / Spring Boot desirable), databases, Kafka, and Kubernetes
- Experience in resilience engineering and implementing high availability solutions
- Experience creating monitoring dashboards using Dynatrace and Splunk
- Ability to work in a fast-paced, agile environment

SRE Maturity Level 3 (Expectation)

DevOps
- Observability: DORA Metrics are visible (Deployment frequency, Mean Time to Restore (MTTR), Cycle time, Change failure rate).
- IaC (Infrastructure as Code): Platforms leverage IaC.
- Test / Release automation:
  - Unit tests: Test in a vacuum.
  - Integration tests
  - Load test results validated against SLOs.
  - Tests run as part of the CI/CD pipeline.
  - Automated rollback

Business Continuity
- Plan for Recovering Service(s)
- Capacity planning review: Show saturation of service as compared to load test and production peak load.

Product Management (Security)
- Security scanning
- Documented procedures for Vulnerability Management
- Integrated into CI/CD pipeline (partner with security)

SRE Maturity Level 4 (Advanced)
- Modernized application: Deployment to Kubernetes, Azure, or SaaS via CI/CD pipeline
- Synthetic Monitoring
- Canary / Blue Green Deployment
- Self-Healing
- Auto scaling
- Identify KPIs for business performance.
- Chaos Engineering

Thanks & Regards,
Vikrama Rao
Recruitment Executive, ValiantIQ Inc.
"Searching Best Minds"
Email: [email protected]
P. 704-249-2259
F. (302) 482-3672

Disclaimer: If you are not interested in receiving our e-mails, please reply with "REMOVE" in the subject line for automatic removal, and list all the e-mail addresses to be removed, including any addresses that might be forwarding our e-mails to you. We are sorry for the inconvenience.

Keywords: continuous integration, continuous deployment
09:42 PM 22-Mar-24