Senior Manager of Reliability Engineering
Upwork ($UPWK) is the leading tech solution for companies looking to hire the best talent, maintain flexibility, and get more done. We’re passionate about our mission to create economic opportunities so people have better lives. Every year, more than $2 billion of work is done through Upwork by skilled professionals who want the freedom of working anytime, anywhere. Top companies connecting with extraordinary talent around the globe? Upwork is how.
The Sr. SRE Manager will lead a team that combines software and engineering to build and enable large-scale, fault-tolerant systems and services. You will provide technical leadership on key projects and empower the teams that maintain our infrastructure, platform & services to operate highly reliable, performant applications and services in AWS (Amazon Web Services). Do you want to help transform the way we get work done? Apply now!
- Lead the team in engineering resilient, scalable infrastructure & platform services and solutions.
- Apply engineering leadership and deep knowledge of infrastructure and software development at scale to lead the operation, adoption, and evolution of these services.
- Forecast growth and future needs for both engineering resources and system scalability. Own the strategy and development of service capacity management.
- Lead by example: mentor the team, establish credibility through quality technical execution, and pitch in with hands-on help and code as needed to keep things running smoothly.
- Develop, mentor and train other SREs on modern methodologies, troubleshooting techniques, and processes for infrastructure, platform & application services in production.
- Develop tight relationships with Development, Engineering and Product partners, ensuring that Product needs are met from an operational perspective.
What it takes to catch our eye:
- Deep experience engineering and building large, geographically dispersed infrastructure & platform technologies supporting critical business cloud services.
- Deep experience with 24/7/365 distributed-site monitoring and first-response support for availability & performance SLAs.
- Solid understanding of and experience with modern Chaos/Failure Engineering techniques and tools (Simian Army, Chaos Monkey, Gremlin, etc.).
- Strong experience in a cloud environment (AWS, Azure, GCP) and with cloud data infrastructure, and the ability to recommend when a cloud or vendor-managed service can be used.
- Ability to write code in at least one language (Python or Ruby preferred). You are comfortable reviewing both functional implementation and tests.
Come change how the world works.
At Upwork, you’ll shape talent solutions for how the world works today. We’re a remote-first organization supported by offices in Santa Clara and Chicago, working together to create exciting remote work opportunities for a global community of professionals.
Our vibrant culture is built on shared values and our mission to create economic opportunities so that people have better lives. We build amazing teams, put our community first, and have a bias toward action. We encourage everyone to bring their whole selves to work and grow together through development opportunities, mentorship, and employee resource groups. Oh yeah, we’ve also got amazing benefits.
Check out our Life at Upwork page to learn more about the employee experience.
Upwork is proudly committed to recruiting and retaining a diverse and inclusive workforce. As an Equal Opportunity Employer, we never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical condition), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.