Role Description -

We are looking for a proactive and technically strong Site Reliability Engineer (SRE) with 4–6 years of hands-on experience to join our Engineering team. The SRE will be responsible for ensuring the reliability, performance, and availability of our production systems and applications. The ideal candidate is comfortable bridging the gap between development and operations, thrives in high-availability environments, and brings a strong bias toward automation, observability, and continuous improvement.

Key Responsibilities -

Monitoring & Observability -

- Monitor application and infrastructure health using Dynatrace and Splunk to proactively detect anomalies, performance degradation, and service disruptions.
- Define and maintain dashboards, alerts, and SLIs/SLOs to ensure real-time visibility into system health and availability.

Incident Management & Production Support -

- Serve as part of the on-call rotation for production support, ensuring rapid triage, escalation, and resolution of critical incidents within defined SLA timelines.
- Lead or contribute to war-room calls during high-severity incidents, coordinating across engineering and business teams to drive timely resolution.

Root Cause Analysis & Preventive Actions -

- Conduct thorough Root Cause Analysis (RCA) for incidents and production issues, documenting findings clearly and driving the implementation of preventive actions to eliminate recurrence.

Deployment Support & Post-Deployment Validation -

- Actively support deployment activities via Jenkins CI/CD pipelines, ensuring smooth and controlled releases with minimal production downtime.
- Execute and own post-deployment validation checks to confirm application stability, correctness, and performance after each release.

Performance Testing -

- Design, develop, and maintain performance and load test scripts in Java to simulate real-world traffic patterns and proactively identify system bottlenecks before they impact production.

Cross-functional Collaboration -

- Partner closely with Development, QA, and Product teams to identify reliability risks early in the SDLC and drive engineering solutions that improve system resilience.
- Participate in design and architecture reviews from an SRE perspective, contributing input on scalability, observability, and graceful failure handling.

Documentation & Knowledge Management -

- Maintain comprehensive and up-to-date documentation for runbooks, incident reports, RCAs, deployment procedures, and operational processes.
- Contribute to the team's knowledge base to improve operational efficiency and reduce mean time to resolution (MTTR).

Required Skills & Qualifications

Technical Skills -

- 4–6 years of hands-on experience in an SRE, DevOps, or Production Engineering role in a large-scale environment.
- Strong proficiency in application and infrastructure monitoring using Dynatrace and Splunk.
- Hands-on experience with Jenkins for CI/CD pipeline management and deployment automation.
- Solid working knowledge of Git for version control and collaborative development workflows.
- Proficiency in Java (basic to intermediate) for writing and maintaining performance test scripts.
- Working knowledge of SQL for querying databases during incident investigation and root cause analysis.
- Familiarity with IntelliJ IDEA or equivalent IDE for development, debugging, and script execution.

Core Competencies -

- Strong analytical and troubleshooting skills with the ability to diagnose complex production issues calmly under pressure.
- Solid understanding of incident management processes, RCA frameworks, and ITIL best practices.
- Proven experience with on-call production support in high-availability, 24x7 environments.
- Good understanding of distributed systems, microservices architecture, and cloud-native deployment patterns.
- Excellent communication skills with the ability to articulate technical issues clearly to both technical and non-technical stakeholders.
- Strong documentation habits: clear, structured, and consistent maintenance of operational records.

Tools & Technologies

Category	Tools / Technologies	Proficiency Required
Monitoring & Observability	Dynatrace, Splunk	Required
CI/CD & Deployment	Jenkins	Required
Version Control	Git	Required
IDE	IntelliJ IDEA	Required
Programming Language	Java	Intermediate
Database Querying	SQL	Working Knowledge

Good to Have -

- Familiarity with containerization and orchestration tools such as Docker and Kubernetes.
- Knowledge of Agile/Scrum methodologies and experience working in sprint-based delivery models.
- Exposure to chaos engineering practices and tools for proactive resilience and fault-tolerance testing.
- Experience in the payments, fintech, or financial services domain.
- Experience with cloud platforms such as AWS or GCP.

Site Reliability Engineer - SRE