Principal Engineer, Reliability & Automation

Remote

Upwork ($UPWK) is the world’s work marketplace. We help connect companies large and small with top independent talent from around the world. Simply put, our mission is to create economic opportunities so people have better lives.

Every year, more than $2 billion of work is done through Upwork by skilled independent professionals who want the freedom of working anytime, anywhere.


Cloud Engineering & Operations (CLEO) is at the core of the technology engine that enables the Upwork platform. We engineer and operate all infrastructure (IaaS), platform services (PaaS), automation and tooling, and we drive modern Service Engineering (DevOps, Ops Eng, Ops Analytics) to successfully deliver & manage our end-to-end offering.


This is an opportunity to bring your expertise and experience to help lead & drive our ongoing efforts to deliver great service through modern Service Engineering. From ensuring resilience, reliability and intelligent (orchestrated) automation to driving the analytics that target and align our team’s focus, this role helps drive innovation and improvements on behalf of our teams, our customers and the services we provide.


This is a highly technical role that needs a leader who is inquisitive by nature, uses data to answer questions, solves problems as an engineer and is detailed in execution.

Your Responsibilities:

  • Drive Service Engineering:
    • Operations (health management) analytics and intelligence (from traceability/observability to modern event correlation)
    • Experience Health Management (STM/synthetics, RUM/behavior analytics)
    • Automated Remediation engineering, working with other CLEO teams
    • Deliver data-driven Service Management tooling & dashboards
  • Drive Ops Analytics:
    • Lead monthly “forest from the trees” analysis across root cause / post-mortems and other operational intelligence reporting activities.
    • Identify and drive initiatives to address data gaps and ongoing problem trends, target teams, and ultimately improve TTD/TTR and prevent recurring incidents.
  • Enable continual service improvement through the effective use of metrics in support of key performance indicators and critical success factors used to manage service delivery. Work with Ops Intelligence engineers to deliver those metrics in the form of visualized data, tooling & dashboards.
    • Perform data analysis and provide a level of expertise in layering the data to show the journey. Analyze patterns and trends to drive a continuous improvement cadence within the organization.
    • Partner with the Sr. Director on tooling selection, implementation and support to capture key data points used in operational analytics.
  • Drive the maturation of Change, Incident, Problem and Knowledge management, taking a technology/engineering-first, process-second approach.
  • Be the technical owner for the Service Management roadmap. Work with Technical PMs to ensure projects are accurately tracked & prioritized and work to ensure appropriate resources are assigned. Provide technical mentorship for other engineers as needed.

What it takes to catch our eye:

  • Above all else, you want to know why and what the data says about something. You’re then able to go find the data, interpret it and translate it into a defined set of next steps and actions.
  • You have experience in modern-day operational & engineering techniques, which may include synthetic monitoring, chaos engineering, event correlation, AIOps, etc.
  • You have previous operations analytics (SLOs, SLAs, MTTD, MTTR, etc.), data analytics or business intelligence experience.
  • You’re incredibly detail-oriented, care about quality and know there are multiple ways of solving a problem. This manifests in your ability to pivot as priorities shift without losing sight of the end goal & customer requirements.
  • You have a high degree of autonomy and drive. You demonstrate a customer-first mentality and take full ownership of the work. You can navigate professionally and collaboratively through ambiguity to deliver outcomes.
  • You have experience in the operational world at scale, whether as a systems engineer, as a site reliability engineer or in a service management/service delivery role.

 


Come change how the world works.

At Upwork, you’ll shape talent solutions for how the world works today. We are a remote-first organization working together to create exciting remote work opportunities for a global community of professionals. While we have physical offices in San Francisco and Chicago, we currently also support hiring corporate full-time employees in 15 states in the United States. Please speak with a member of our recruitment team to determine whether you are located in a state in which we are hiring corporate full-time employees.

Our vibrant culture is built on shared values and our mission to create economic opportunities so that people have better lives. We foster amazing teams, put our community first, and have a bias toward action. We encourage everyone to bring their whole selves to work and grow together through development opportunities, mentorship, and employee resource groups. Oh yeah, we’ve also got amazing benefits.

Check out our Life at Upwork page to learn more about the employee experience.   

Upwork is proudly committed to recruiting and retaining a diverse and inclusive workforce. As an Equal Opportunity Employer, we never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical condition), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.

