Site Reliability Engineer
Intellyk Inc.
Job Description
In This Role, You Will
Platform & Reliability Engineering
- Embed SRE and production engineering principles into Payments Modernization from design through early life support
- Define and validate non-functional requirements (NFRs) covering resilience, scalability, observability, recovery, and operability
- Drive replay, retry, and exception-handling validation for event-driven payment flows
- Lead capacity and performance testing, including volume growth and peak event scenarios (e.g., FedNow, CHIPS, SWIFT)
Service Transition & Operational Readiness
- Own Permit-to-Operate readiness across environments (NFR Testing)
- Define cutover, shadow support, and early life support models
- Ensure runbooks, support procedures, on-call readiness, and escalation paths are production-grade before go-live
- Partner with Change Assurance to apply risk-based release controls, canary/blue-green strategies, and rollback automation
Observability & Stability
- Implement end-to-end observability across Kafka, MongoDB, API layers, and downstream payment components
- Define and monitor SLOs, error budgets, and golden signals
- Reduce alert noise through signal design, correlation, and automation
- Analyze early defects and exception patterns (ACK/NACKs, business errors) to drive stabilization
Chaos Engineering & Continuous Improvement
- Design and execute controlled failure testing (chaos engineering) to validate recovery patterns and blast radius
- Lead blameless RCAs, ensuring corrective actions are owned and recurrence is prevented
- Drive continuous service improvement (CSI) initiatives, including automation, resilience uplift, and technical debt reduction
Required Qualifications:
- 4+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 2+ years of application support experience
- 2+ years of Application Frameworks experience with Spring Boot, Spring WebFlux, or similar
- 2+ years of Data Stores & Caching experience with MongoDB, Redis
- 2+ years of Platform experience in Kubernetes / container orchestration
- 2+ years of CI/CD & Automation experience with progressive delivery, automated rollback, and reliability-as-code concepts
Desired Qualifications:
- 2+ years of Resilience experience with Resilience4j, retry/replay patterns
- 2+ years of Observability experience with distributed tracing, metrics, logging, and SLO tooling
- 2+ years of Testing & Resilience Validation experience with BlazeMeter, Chaos Monkey
- Strong experience in SRE, Production Engineering, Platform Engineering, or Service Transition within a complex technology or financial services environment
- Demonstrated ability to productionize new platforms, not just support them
- Solid understanding of high-value payment systems (Wires, RTP, SWIFT, CHIPS, FedNow) and their operational risk profile
- Experience working with event-driven, distributed architectures
- Proven ability to partner with engineering teams while representing the production and operational lens
- Comfortable operating in early-stage, ambiguous transformation environments
- Strong communication skills, with the ability to explain technical risk to senior stakeholders