Company: Tata Consultancy Services
Location: Atlanta
Closing Date: 23/10/2024
Hours: Full Time
Type: Permanent
Job Requirements / Description
- Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
- Debugging complex problems across an entire stack and creating solid solutions
- Developing and building CI/CD processes to improve cadence
- Using Chaos Engineering to test what you build under real-world conditions
- Triage product or system issues, then debug, track, and resolve them by analyzing their sources and their impact on hardware, network, or service operations and quality.
- Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
- Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog.
- Performance Measurement and Tuning: Knowledge of system performance, testing and programming; ability to monitor, measure, and optimize system performance and network communication.
- Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
- Support capacity planning, availability, scalability, security and latency considerations for new infrastructure and service provisioning as appropriate
- Improve end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
- Strong experience setting SLOs, SLIs, and error budgets and managing reliability for infrastructure and applications
- Partner with other SREs to share best practices and lessons learned from across the organization
- Scale and optimize existing infrastructure and services sustainably through mechanisms, including automation, and evolve them by improving reliability and efficiency
- Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
- Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages and security threats in Development, UAT, Staging and Production environments
- Practice sustainable incident response and blameless postmortems
- Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE
Other Skills
- AWS SysOps Administrator OR AWS DevOps Engineer certification
- Experience with Akamai or related WAF application preferred.
- Experience with OpenShift, Kubernetes.
- Experience with setting up synthetic monitors and tracking SLAs.
- Experience with airline applications and infrastructure technology is a plus.
- Experience developing applications and/or automation running in Red Hat OpenShift is a plus.