Site Reliability Engineer

Company: Tata Consultancy Services
Location: Atlanta
Closing Date: 08/11/2024
Hours: Full Time
Type: Permanent
Job Requirements / Description

Job Type: Full-time

Location: Atlanta GA (Onsite)

Experience: 6+ years


  • Automating work, including infrastructure needs, testing, failover solutions, and failure mitigation
  • Debugging complex problems across an entire stack and creating solid solutions
  • Developing and building CI/CD processes to improve cadence
  • Using Chaos Engineering to test what you build under real-world conditions
  • Triage product or system issues, then debug, track, and resolve them by analyzing their sources and their impact on hardware, network, or service operations and quality.
  • Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
  • Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog.
  • Performance Measurement and Tuning: Knowledge of system performance, testing and programming; ability to monitor, measure, and optimize system performance and network communication.
  • Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
  • Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
  • Improve end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.
  • Strong experience setting SLOs, SLIs, and error budgets, and managing reliability for infrastructure and applications
  • Partner with other SREs to share best practices and lessons learned from across the organization
  • Scale and optimize existing infrastructure and services sustainably through automation and other mechanisms, and evolve them by improving reliability and efficiency
  • Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
  • Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments
  • Practice sustainable incident response and blameless postmortems
  • Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE

Other Skills

  • AWS SysOps Administrator OR AWS DevOps Engineer certification
  • Experience with Akamai or related WAF application preferred.
  • Experience with OpenShift and Kubernetes.
  • Experience with setting up synthetic monitors and tracking SLAs.
  • Experience with airline applications and infrastructure technology is a plus.
  • Experience developing applications and/or automation running in Red Hat OpenShift is a plus.
