As a founder of oDesk, which has since evolved into Upwork, Stratis Karamanlakis has had a front-row seat to Upwork’s evolution.
In this article, which is part of a series from Upwork Engineering, he talks about Upwork’s journey from fledgling startup to the largest website for freelancers, with a community of more than 17 million worldwide.
The Upwork technology stack has evolved dramatically over the last 11 years, growing into a website where millions of freelancers deliver services worth in excess of $1 billion. Like so many companies on a similar journey facing rapid growth, after years of incremental technology enhancements we decided we had to modernize our technology stack.
We knew the effort would be complex, require a number of years to complete, and involve cultural as well as technological changes. To ensure all stakeholders were aligned, we spent some time on introspection to define the problems we were trying to solve. This involved the following orthogonal issues:
- Aging domain model: Our codebase and data models reflected 10+ years of incremental and often ad-hoc solutions that, to varying degrees, no longer corresponded to our current understanding of our business.
- Monolithic system: The legacy middle-tier system was a monolith characterized by problems of scalability and performance, tangled dependencies, and poor resilience to failure. This also hurt us organizationally and culturally, since ownership boundaries were vague or missing.
- Outdated technology: Our legacy middle-tier was based on Perl, which was a solid choice for web-scale systems a decade ago. Perl has not disappointed us in that regard, but ten years later it is no longer a popular application development language: Perl open-source libraries are the last to appear and are often poorly maintained, and few developers continue to be interested in it, which impacts our ability to recruit and retain top talent.
- Architectural inconsistency: Ten years of evolution and the unavoidable technical debt had resulted in inconsistencies in nomenclature, design principles, protocol definitions, data types, you name it. This tremendously increased the cognitive load engineers had to manage.
Microservices: A fine-grained approach
The nature of the problems above mandated an overhaul of how we developed software and operated our systems. We knew that our solution had to give us confidence that it could extend and grow as our business evolved, so that future modernization efforts could be more incremental.
We agreed on the following outline of our next-generation middle-tier stack, which we dubbed Agora (from Greek: “Αγορά”), meaning marketplace (pun intended).
What is Agora? Agora is
- A fine-grained services architecture, combined with
- A JVM-based core libraries stack.
It enables our engineering organization to standardize on
- Architecture style,
- Language and libraries, and
- Tools and processes used for development, testing, deployment and monitoring.
We chose the term fine-grained services, instead of microservices, to avoid focusing on size and to emphasize separation of concerns and the single-responsibility principle. Since then, the ubiquity of the term microservices has made us adopt it as well, even though we often find it misleading.
The reasons for going with a fine-grained services architecture are not unique to Upwork, but they are worth repeating here.
Assuming that service boundaries have been properly defined to correspond with business subdomains that can operate independently, and that a number of best practices in microservice design are followed, this architectural style can help with:
- Complexity: Services can better enforce a separation of concerns.
- Scalability: Services can scale up or down independently.
- Resilience: Services can fail independently.
- Agility: Services can be developed, tested and deployed independently.
Design patterns and constraints
The core of our Agora stack is built around
- Dropwizard, an HTTP-based services container that integrates a collection of best-in-class components to make it easy for engineers to do the right thing.
- A selection of libraries from the Netflix open-source stack, including Eureka for service discovery, Hystrix for fault isolation, and Archaius for dynamic configuration.
As part of Agora, we adopted the following fundamental patterns and constraints to help us become more efficient and stay on track as we try to avoid past mistakes.
1. Components as services (vs monolithic system)
Establishing and respecting clear component boundaries is an architectural imperative as system complexity grows. HTTP-based services and service contracts (see below) constitute clear, bright lines that support the separation of logical subsystems and help engineers maintain and evolve these boundaries consciously. Furthermore, building services as executable JARs with an embedded Jetty server has allowed us to deploy each one independently as a simple Unix process and to leverage all of the existing Unix process management tools to control it, inspect its resource utilization, and so on.
2. REST (vs tangled interfaces)
We found the REST style to be very helpful as a set of guidelines for our design of service interfaces. Though we have not been able to fully support REST (e.g., the HATEOAS principle still eludes us), we believe we have managed to go a long way toward well-designed service interfaces.
3. Thrift (vs absence of service contracts)
Thrift is an interface description language that makes it easy to build typed service contracts, with native code support in multiple languages. Furthermore, it helps with documenting service contracts and allows them to evolve with backwards compatibility as a first-class concern. Thrift was our only option in that regard, since alternatives such as Protobuf or Avro would not support our PHP and Perl environments.
4. Fault isolation (vs absence of resilience patterns)
Coding defensively against failures is necessary for the stability of a distributed system. What is equally important is to establish a common vocabulary and a common set of design patterns so fault isolation practices are consistent across the organization. Michael T. Nygard’s Release It! documented a collection of fault isolation patterns (e.g., circuit-breakers, bulkheads) and Hystrix by Netflix delivers a brilliant implementation. We are proud to have given back to the community phystrix, our own PHP implementation of some of Hystrix’s capabilities.
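To make the circuit-breaker pattern mentioned above concrete, here is a minimal sketch. It is not Hystrix or phystrix code; the `CircuitBreaker` class, its parameters, and `CircuitOpenError` are hypothetical names chosen for illustration, and the sketch assumes a breaker that opens after a run of consecutive failures and allows a single trial call once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a sick dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            if fallback is not None:
                return fallback()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, fallback=lambda: cached_profile)`, where both arguments are placeholders. Failing fast while the circuit is open is what keeps one slow dependency from tying up every caller upstream.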
5. Rich metric instrumentation (vs absent/ad-hoc instrumentation)
Visibility into the workings of a complex distributed system needs to be baked in. Agora provides standardized and detailed metric instrumentation through the complete request lifecycle and also supports request tracing across the dependency tree using Zipkin.
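As a rough illustration of what "baked in" instrumentation can look like, the sketch below wraps a handler so that every call is timed and every failure is counted without the handler author doing anything. It does not reproduce Agora's instrumentation or any Zipkin API; `timed`, `TIMINGS`, `ERRORS`, and `get_profile` are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registries keyed by metric name.
TIMINGS = defaultdict(list)  # name -> list of call durations in seconds
ERRORS = defaultdict(int)    # name -> number of calls that raised


def timed(name, clock=time.perf_counter):
    """Record how long each call takes, and count calls that raise."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[name] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is measured.
                TIMINGS[name].append(clock() - start)
        return wrapper
    return decorator


@timed("profiles.get")
def get_profile(user_id):
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

Because the decorator is applied once at the definition, the metric names and shapes stay uniform across services, which is the point of standardizing instrumentation in a shared stack.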
6. Dynamic service discovery (vs static configuration)
Dynamic service discovery built around Netflix Eureka allows us to adapt quickly to changes in the runtime environment as new services come online, as others fail or are terminated, and as engineers load-balance traffic between service instances during deployments.
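The idea behind dynamic discovery can be sketched with a small in-memory registry in which instances announce themselves through heartbeats and are dropped once they go quiet. This is an illustration of the general heartbeat-and-lease approach, not Eureka's API; `ServiceRegistry`, its methods, and the addresses used are hypothetical.

```python
import time


class ServiceRegistry:
    """Hypothetical in-memory registry with lease-based expiry."""

    def __init__(self, lease_seconds=30.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance, or renew its lease if already known."""
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Remove an instance that is shutting down cleanly."""
        self._last_seen.pop((service, address), None)

    def instances(self, service):
        """Return addresses whose lease has not expired, in sorted order."""
        now = self.clock()
        return sorted(
            addr
            for (name, addr), seen in self._last_seen.items()
            if name == service and now - seen <= self.lease_seconds
        )
```

Clients ask `instances("profiles")` at call time instead of reading a static host list, so a crashed instance disappears on its own when its lease lapses and a new one becomes routable after its first heartbeat.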
Moving toward continuous delivery
As part of the modernization process, we decided to adopt best practices for our delivery pipeline and move toward continuous delivery. To support this, we decided to shift to a DevOps culture where development teams would be fully empowered to build, test, deploy, and operate their applications using automated tooling.
The Agora architecture of individually deployable services fits nicely with this shift. We have built a toolchain that enables engineering teams to release multiple times a day without any dependency on TechOps. This deployment workflow system allows engineers to
- Select which service artifact to deploy
- Select and provision a target cluster configuration
- Load-balance traffic to the new cluster using custom or standard deployment patterns (e.g., blue/green, canary). This also means that when things go wrong, a deployment can be rolled back quickly, minimizing risk and downtime.
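The canary pattern in the last step can be sketched as two small functions: one that routes a single request by weight, and one that decides the next weight from the canary's observed error rate. This is an assumption about how such a rollout might work, not a description of our deployment tooling; the function names, the 1% error threshold, and the 25% step are all hypothetical.

```python
import random


def pick_cluster(canary_weight, rng=random.random):
    """Route one request: 'canary' with probability canary_weight, else 'stable'."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return "canary" if rng() < canary_weight else "stable"


def next_weight(current, error_rate, max_error_rate=0.01, step=0.25):
    """Advance the rollout one step, or roll back to 0 if the canary is unhealthy."""
    if error_rate > max_error_rate:
        return 0.0  # rollback: send all traffic to the stable cluster
    return min(1.0, current + step)
```

A blue/green switch is the degenerate case in which the weight jumps from 0 to 1 in one step. In both patterns a rollback only changes a routing weight, which is why it can happen quickly.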
Each service can evolve independently as long as its owners ensure backwards compatibility, something Thrift facilitates programmatically. As multiple service versions can coexist, our engineering teams needed a clear and succinct vocabulary to deal with software versioning. We’ve found Semantic Versioning to be a simple and pragmatic solution.
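The vocabulary Semantic Versioning provides can be reduced to one compatibility rule: within a major version, a newer release can stand in for an older one. The sketch below encodes that rule under simplifying assumptions: it handles only plain `MAJOR.MINOR.PATCH` strings, ignores pre-release and build tags, and does not special-case major version 0. The names `Version`, `parse`, and `is_compatible` are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(text):
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_compatible(required, provided):
    """True if a service at `provided` can serve a client built against `required`."""
    need, have = parse(required), parse(provided)
    # Same major version means no breaking changes; the provider must also be
    # at least as new as the version the client was built against.
    return have.major == need.major and have >= need
```

For example, `is_compatible("1.4.2", "1.5.0")` holds, while `is_compatible("1.4.2", "2.0.0")` does not, because a major bump signals a breaking change.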
We still have a significant way to go toward achieving our continuous delivery goal. Some of our teams have already fully automated their integration and delivery pipelines, but most still require manual approval during the deployment process.
There are gaps in two areas that are important to continuous delivery.
First, we need testing strategies that will provide confidence that a service artifact can proceed to the production environment. Testing a large distributed system is hard, and continuous delivery makes this harder since the shared runtime environment is changing constantly. We are seeing promising results with contract testing, which is one of the various microservices testing strategies we’ve considered.
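One simple form of contract testing is for each consumer to declare the fields and types it relies on, and for the provider's pipeline to check real responses against those declarations. The sketch below assumes that shape. It is not our test harness, and `violations` and the example contract are hypothetical.

```python
def violations(contract, response):
    """List the ways `response` breaks `contract` (field name -> expected type)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    # Extra fields are allowed, so a provider can add data without breaking consumers.
    return problems


# A consumer that only needs an integer id and a string name:
profile_contract = {"id": int, "name": str}
```

Because the check runs against the provider alone, it gives some confidence in compatibility without standing up the whole shared environment.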
Second, we need automated analysis of environment and service metrics so we can quickly recognize whether a new deployment is successful or not. Automated analysis is still at an early stage. For the time being, our deployment automation checks a few individual metrics, but we plan to move toward combining multiple metrics in order to calculate a comprehensive confidence score.
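One possible way to combine several metrics into a single confidence score is a weighted average of per-metric scores, each comparing the new deployment against a baseline. The sketch below is only an assumption about how such a score could be computed, not our implementation. It treats every metric as lower-is-better (latency, error rate), and the function names, weights, and tolerances are hypothetical.

```python
def metric_score(baseline, observed, tolerance):
    """Score one lower-is-better metric in [0, 1].

    Returns 1.0 if the metric is no worse than baseline, falling linearly
    to 0.0 once it is `tolerance` (a fraction of baseline) worse.
    """
    if baseline <= 0 or tolerance <= 0:
        raise ValueError("baseline and tolerance must be positive")
    regression = (observed - baseline) / baseline
    return max(0.0, min(1.0, 1.0 - regression / tolerance))


def confidence(checks):
    """Weighted average of per-metric scores.

    `checks` is a list of (weight, baseline, observed, tolerance) tuples.
    """
    total_weight = sum(weight for weight, *_ in checks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        weight * metric_score(baseline, observed, tolerance)
        for weight, baseline, observed, tolerance in checks
    )
    return weighted / total_weight
```

A deployment step could then promote a release only when `confidence(...)` stays above a chosen threshold, and roll back otherwise.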
Embracing the cloud
We have been running upwork.com (then odesk.com) from a data center for more than 10 years. We’ve realized that we should not be wasting our precious resources doing the undifferentiated heavy lifting of planning, provisioning, managing, and maintaining data center hardware, software and networking.
We had been monitoring the emergence of Amazon Web Services as it quickly became the biggest infrastructure-as-a-service provider; we had started experimenting with S3 (Amazon Simple Storage Service) quite a few years ago.
With microservices, DevOps and continuous delivery at the core of our engineering practices, we decided that AWS was the right environment to support our focus on infrastructure automation while giving us redundancy and scalability.
With a limited investment in cloud configuration and management tooling, we have been able to move faster than ever before. Our computing capacity can grow or shrink instantly. A template-based configuration ensures we avoid snowflakes and enables us to easily roll out configuration upgrades. Whole environments can be made available on demand to our engineers via self-service tooling.
Our experience with the cloud has not been without glitches, a couple of which have been major. Cloud infrastructure can be volatile and can fail in ways different from those we had learned to expect in a data center. This has resulted in periods of significant instability and a lot of stress for the teams involved.
The important takeaway is that failure is a given and we need to plan and design around it. Detailed monitoring can make all the difference when it comes to separating signal from noise. Resiliency must be baked into our designs and plans.
The principles of Chaos Engineering and tools such as the Netflix Chaos Monkey emphasize the practice of proactively stressing a system in order to understand its limits and build confidence in its expected availability and resiliency. We are not there yet, but we will soon start to adopt some of these approaches.
Cost management is another area where the cloud requires new approaches and processes. We are still learning best practices here, but the basics include the need to:
- Understand your baseline capacity requirements and make use of EC2 instance reservations.
- Tag resources systematically so engineering owners can be kept informed of their expenditure and are able to manage their budget properly. With ownership comes accountability!
- Automatically monitor your environment for unused resources.
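The tagging practice above lends itself to a simple automated audit. The sketch below works on plain dictionaries rather than any AWS API; the required tag names, the resource ids, and the `monthly_cost` field are hypothetical and stand in for whatever an inventory export would contain.

```python
REQUIRED_TAGS = ("owner", "service")  # hypothetical tagging policy


def untagged(resources, required=REQUIRED_TAGS):
    """Return ids of resources missing any required tag (or with an empty value)."""
    return [
        res["id"]
        for res in resources
        if any(not res.get("tags", {}).get(tag) for tag in required)
    ]


def cost_by_owner(resources):
    """Sum monthly cost per owner; untagged spend is grouped under 'unowned'."""
    totals = {}
    for res in resources:
        owner = res.get("tags", {}).get("owner") or "unowned"
        totals[owner] = totals.get(owner, 0.0) + res["monthly_cost"]
    return totals


inventory = [
    {"id": "i-1", "tags": {"owner": "search", "service": "indexer"}, "monthly_cost": 120.0},
    {"id": "i-2", "tags": {"owner": "search"}, "monthly_cost": 80.0},
    {"id": "i-3", "monthly_cost": 50.0},
]
```

Run on a schedule, a report like this shows each team its own spend and makes the "unowned" bucket visible, which is usually where unused resources hide.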
The modernization journey and next steps
Modernization is not a project but a process—and we have already identified opportunities to improve our tools and processes as we move forward.
The most immediate among them is the move to Docker, which will happen in early 2017. Docker containers will give us two concrete benefits:
- Unification of the deployment pipeline across environments as well as across JVM and non-JVM applications: We currently manage the deployment of our Symfony/Angular presentation tier differently from our JVM/Agora middle tier, which results in unnecessary variance, replication of effort, and waste.
- The ability to run several application/service instances on a shared host instead of dedicating a host to each one: Container scheduling will allow us to consolidate our computing resources with improved utilization while maintaining isolation between service instances.
Modernization has been a unique experience for the engineering teams, who get to rethink and rebuild our technology stack as well as our engineering processes while supporting the millions of clients and freelancers who rely on us.