Site Reliability Engineer

Company: Tata Consultancy Services

Location: Atlanta

Closing Date: 23/10/2024

Hours: Full Time

Type: Permanent

Job Description

Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
Debugging complex problems across an entire stack and creating solid solutions
Developing and building CI/CD processes to improve cadence
Using Chaos Engineering to test what you build under real-world conditions
Triage product or system issues and debug/track/resolve by analyzing the sources of issues and the impact on hardware, network, or service operations and quality.
Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog.
Performance Measurement and Tuning: Knowledge of system performance, testing and programming; ability to monitor, measure, and optimize system performance and network communication.
Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
Improve end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.
Strong experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications
Partner with other SREs to share best practices and lessons learned from across the organization
Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency
Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages and security threats in Development, UAT, Staging and Production environments
Practice sustainable incident response and blameless postmortems
Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE

Other Skills

AWS SysOps Administrator OR AWS DevOps Engineer certification
Experience with Akamai or related WAF application preferred.
Experience with OpenShift, Kubernetes.
Experience with setting up synthetic monitors and tracking SLAs.
Experience with airline applications and infrastructure technology is a plus.
Experience developing applications and/or automation running in Red Hat OpenShift is a plus.
