Accepting candidates in Brazil ONLY.

Professional Role Overview

We are seeking a Site Reliability Engineer (L1) to ensure the continuous availability and performance of our mission-critical production services. This role is designed for a professional who possesses the technical rigor required to manage complex distributed systems under a 100% on-call mandate within South American time zones. You will be responsible for the stewardship of high-stakes data environments—specifically those involving message queuing, relational and non-relational databases, and enterprise data warehouses—with a primary objective of maintaining strict service-level objectives (SLOs) through proactive monitoring, rapid incident response, and automated intervention.

Key Responsibilities

Production Stewardship: Serve as the first responder for production anomalies, managing the end-to-end incident lifecycle from initial detection to post-incident resolution.
Data Infrastructure Management: Ensure the reliability and scalability of high-throughput data platforms, including message brokers, relational databases (PostgreSQL or similar), non-relational databases (MongoDB or similar), and data warehouse environments.
Operational Excellence: Execute 100% on-call rotations, providing consistent coverage and rapid response to critical system alerts.
Automation & Toil Reduction: Develop and maintain scripts (Python, Go, or Bash) to automate routine operational tasks, enhancing system resilience and reducing manual overhead.
Observability & Telemetry: Configure and optimize monitoring suites (e.g., Prometheus, Grafana, Datadog) to ensure comprehensive visibility into application and system health.

Must Have:

Prior SRE/On-call Experience: A mandatory background in SRE or production support roles, with a demonstrated ability to manage high-pressure on-call rotations and run production services.
Data Systems Proficiency:
  Message Queuing: Experience managing brokers (e.g., Kafka) and topics, and troubleshooting throughput issues.
  Relational & Non-Relational Databases: Proficiency in managing database health, query optimization, and high-availability configurations.
  Data Warehouse: Experience in managing large-scale data warehouse performance and resource allocation.
Systems Engineering: Strong competency in Linux internals and networking protocols.
Regional Alignment: Must be based in and able to operate effectively within South American time zones to facilitate synchronized operations.

Preferred Skills:

Analytical Rigor: The ability to diagnose root causes in complex, interconnected systems rather than applying superficial fixes.
Communication: Exceptional technical documentation skills and the ability to provide concise, professional updates during active incidents.
Dedication: A steadfast commitment to system uptime and a proactive approach to identifying potential points of failure before they impact the user experience.

Education

Bachelor’s degree in Technology, Computing, or a related field

Job Types: Full-time, Contract

Pay: $35,000.00 - $48,000.00 per year

Benefits

Dental insurance
Flexible schedule
Health insurance
Paid time off
Vision insurance

Application Question(s):

Do you have previous on-call experience?
Are you located in South America?

Work Location: Remote

Site Reliability Engineer - SRE (L1)

Sarvin

Full Time, Entry Level, Remote

$35k – $48k, Posted February 4, 2026
