Tokopedia: Scaling to accommodate major shopping events with Google Kubernetes Engine
Google Cloud results
- Enables Tokopedia Play to handle a 20x increase in traffic thanks to Google Cloud
- Supports growth by optimally scaling up to 30x with automated provisioning
- Reduces operating costs by 90% by migrating to Google Cloud
How do you prepare for an epic online shopping event that rivals Black Friday in the U.S. and Singles’ Day in China? According to Tahir Hashmi, Vice President of Engineering at Tokopedia, a successful delivery depends on tech support and a reliable network.
In May 2018, Tokopedia launched Ramadan Ekstra, the first-ever online shopping festival in Indonesia. The event attracted over 332 million visits to the Tokopedia platform during the Muslim holy month of Ramadan. Ramadan Ekstra was so successful that the transactions from May 25 alone equaled the total transactions from Tokopedia’s first five years of operations. On top of that, Tokopedia welcomed 73 million visitors to its platform during that month.
Although high-profile online events like the Ramadan Ekstra campaign help Tokopedia acquire many new users at once, they require careful planning to minimize disruptions. Any glitch in Tokopedia’s network can affect millions of users and result in complaints from sellers and online shoppers as well as negative publicity. According to a survey by Unbounce Research in 2018, nearly 70% of consumers admit that page load speed influences their willingness to buy from an online retailer.
“We benefited from the knowledge of Google Cloud engineers who have the experience of running large-scale events. If we had to roll out the project with our limited resources, we would have to read a lot of documentation, run many experiments, and perhaps still end up in blind alleys.”—Tahir Hashmi, Vice President of Engineering, Tokopedia
To cope with the increase in website visitors and transactions, Tokopedia turned to Google Cloud to deliver uninterrupted service to shoppers and merchants alike.
“Internally, we did a lot of prep work, with help from Google Cloud, to make sure that our system could handle peak demand without slowing down performance,” says Tahir Hashmi, Vice President of Engineering at Tokopedia. “It’s critical that we offer a frictionless shopping experience to turn new customers into return shoppers.”
“We benefited from the knowledge of Google Cloud engineers who have the experience of running large-scale events,” says Tahir. “We were able to roll out the new technology faster, and with more confidence than we would have if we were doing it without their support.”
For big events and promotions, Tokopedia runs the overall design by the Google Cloud team to see if it fits the Google Cloud infrastructure. In the preparation stage, Tokopedia sets up load and performance testing to simulate large-scale traffic on the application. This exercise gives the team plenty of time to uncover and resolve bottlenecks. Before executing the event, Tokopedia coordinates with Google Cloud to freeze changes during the promotion timeframe so network performance isn’t affected by software updates or bug fixes.
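As a rough illustration of the load and performance testing step, the sketch below fires a batch of concurrent requests and reports a latency percentile. It is not Tokopedia's tooling: `fake_checkout` is a hypothetical stand-in for a real call to the service under test, and the request counts and concurrency level are made-up values.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def fake_checkout(request_id):
    """Hypothetical stand-in for a real HTTP call to the service under test."""
    start = time.perf_counter()
    time.sleep(0.001)  # pretend the service takes about 1 ms to respond
    return time.perf_counter() - start

def run_load_test(handler, total_requests, concurrency):
    """Issue total_requests calls through a fixed worker pool; return sorted latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(handler, range(total_requests)))
    return sorted(latencies)

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

if __name__ == "__main__":
    samples = run_load_test(fake_checkout, total_requests=200, concurrency=20)
    print(f"p95 latency: {percentile(samples, 95) * 1000:.2f} ms")
```

Raising `concurrency` until the p95 latency climbs sharply is one simple way to locate a bottleneck before the event rather than during it.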
Minimizing downtime with autoscaling on Google Kubernetes Engine
According to a recent study by McKinsey, electronic retailing, or “e-tailing,” revenue in Indonesia is expected to grow from $5 billion in 2017 to $40 billion in 2022, driven by tech-savvy customers who are willing to pay for convenience.
“Our mission at Tokopedia is to democratize commerce through technology. We want to transform lives by reducing distances between merchants and consumers in this vast country we call home,” says Tahir. “Running our ecommerce platform on Google Kubernetes Engine (GKE) helps us to improve user experience and keeps shoppers coming back.”
“We used to experience partial downtime after adding a new VM that wasn’t configured correctly. Such headaches have been pretty much eliminated since moving to application clusters on Google Kubernetes Engine.”—Tahir Hashmi, Vice President of Engineering, Tokopedia
Before moving to Google Cloud, Tokopedia had scalability and reliability issues with its previous service provider. One major challenge was that Tokopedia’s largest-scale interactive product, Tokopedia Play, could only support 55,000 concurrent users. The application was rebuilt as a microservice on GKE in five weeks and is now able to support 1.5 million concurrent users. Tokopedia manages and secures the microservices with the Istio service mesh and configures global load balancing on GKE for resiliency.
“Unlike our previous VM-based environment, adding and removing compute capacity is extremely reliable on Kubernetes,” says Tahir. “We had to put in a lot of effort to avoid partial downtime after adding or removing VMs due to complicated configuration changes. Such headaches have been pretty much eliminated since moving to application clusters on Google Kubernetes Engine.”
Autoscaling comes in handy when Tokopedia runs limited-time campaigns such as Semarak Maret Mantap, or “Great March,” which encourages users to open the Tokopedia Play app on their phone and shake it to win prizes. The application, supported by Google Cloud, scaled servers down by 30x after the dual-screen event on TV and the Tokopedia app ended. According to Tahir, Tokopedia saved money by not having to provision hardware just for that purpose.
“Scalability, at a very basic level, means your application can handle a bigger load if you add more hardware to it,” Tahir explains. “By moving to GKE, we have more than just scalability, we have reliable scalability, better known as elasticity. We can scale up and down as many times as needed, without having to laboriously configure VMs.”
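To make the idea of elasticity concrete, the sketch below applies the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The source does not say which autoscaler or metrics Tokopedia uses, so the function name, the bounds, and the sample numbers are illustrative assumptions, and the real autoscaler's tolerance and stabilization behavior are left out.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=1000):
    """Replica count from the documented HPA ratio rule, clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

if __name__ == "__main__":
    # Hypothetical numbers: 10 pods with a 60% CPU utilization target.
    peak = desired_replicas(10, current_metric=180, target_metric=60)   # load triples
    idle = desired_replicas(peak, current_metric=2, target_metric=60)   # event is over
    print(peak, idle)
```

The same rule drives both directions: capacity grows when the observed metric exceeds the target and shrinks when the event ends, with no per-VM configuration step in between.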
Achieving redundancy with global load balancing
Tokopedia uses Cloud Load Balancing to provision service instances in Google Cloud regions around the world. This feature is useful for Tokopedia’s multi-region business continuity planning. The load balancer doesn’t need to be pre-warmed to handle spikes in traffic. If a server in one region fails because of a man-made or natural disaster, the load balancer seamlessly shifts the traffic to servers with the most available capacity.
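The failover behavior described above can be sketched as a simple selection rule: skip unhealthy regions and send traffic to the one with the most spare capacity. This is a simplified model, not the Cloud Load Balancing algorithm (which also weighs proximity to the user, for instance); the region names and capacity figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    capacity: int          # hypothetical max requests/sec the region can serve
    load: int              # current requests/sec
    healthy: bool = True

    @property
    def available(self) -> int:
        return max(0, self.capacity - self.load)

def pick_region(regions):
    """Return the healthy region with the most spare capacity, or None."""
    candidates = [r for r in regions if r.healthy and r.available > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.available)

if __name__ == "__main__":
    regions = [
        Region("region-a", capacity=1000, load=400),
        Region("region-b", capacity=800, load=100),
        Region("region-c", capacity=1200, load=900),
    ]
    first = pick_region(regions)        # region-b has the most headroom
    regions[1].healthy = False          # simulate a regional outage
    fallback = pick_region(regions)     # traffic shifts to region-a
    print(first.name, fallback.name)
```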
“One of the features that I particularly like on Google Cloud is global load balancing because our application is available via a single global IP address,” says Tahir. “Previously, we used DNS policy for application load balancing by configuring multiple IP addresses on the same domain. That method worked, but global load balancing offers a more scalable and robust DNS setup.”
Reducing complexity with Istio on Google Kubernetes Engine
The concept of a service mesh came about from the need to manage and deploy huge numbers of microservices. “At Tokopedia, we run a few hundred microservices, and all of them interact with each other in a way that isn’t immediately obvious. It’s hard to tell which microservice depends on which other microservices without auditing the code and the traffic,” Tahir notes. “Istio on GKE makes it easy to manage the overall microservices ecosystem and observe the telemetry from the containers.” Tokopedia is the first company in Indonesia to deploy Istio for such a high volume of traffic.
Tokopedia plans to leverage Istio’s identity and access control policies to help secure microservices running on GKE beyond the current security provided by IAM. Istio helps to authenticate services and provide the right level of access to data.
Democratizing commerce through big data with Google BigQuery
At the moment, Tokopedia uses BigQuery to analyze traffic and transactional data such as logistics and billing and create reports on customer insights. For example, product managers use sales forecasting to anticipate how much budget is needed for promotion on a daily, weekly, or monthly basis.
Moving forward, Tokopedia is looking to improve customer satisfaction with data science. Analyzing transactional data in BigQuery would allow Tokopedia to predict demand and plan delivery times and logistics costs more effectively. “This would allow merchants and customers from different islands to enjoy same-day delivery,” says Herman Widjaja, Senior Vice President of Engineering at Tokopedia.
Live streaming sales made possible with Cloud CDN
Tokopedia and its sellers are increasingly using videos as a marketing tool to reach online shoppers. Whether it’s a K-pop event or a gadget demo, videos grab the buyer’s attention and close sales. However, website performance is critical to the customer’s viewing experience. A low-quality product video is more likely to attract negative reviews than sales.
Tokopedia uses Cloud CDN to deliver hundreds of gigabytes, terabytes, and even petabytes of data in 700 milliseconds for a smooth user experience. Although Tokopedia hosts its live video shopping module, Tokopedia Play, on another cloud, Cloud CDN still accelerates its video content using the QUIC protocol. It was found to be 15% faster than other CDN providers under test conditions for Tokopedia Play. “Cloud CDN provides great flexibility for our multicloud environment,” says Tahir Hashmi, Vice President of Engineering at Tokopedia. “In terms of performance, we have relatively low latency compared to other CDN providers, and from a price-point perspective, Cloud CDN also offers competitive pricing.”
To optimize performance, Tokopedia keeps track of Cloud CDN logs in Cloud Logging. By combining log metrics with other data, such as billing data and sales metrics, in Grafana, an open source visualization tool, engineers can quickly view and analyze usage patterns and forecast capacity for a positive user experience.
Additionally, Cloud CDN helps Tokopedia protect against distributed denial of service (DDoS) attacks by dispersing the traffic across many points of presence (PoPs), rather than letting it bombard one IP address. “Cloud CDN improves our security posture as it provides a built-in anti-DDoS and extensive protection by enabling Google Cloud Armor,” Tahir says. “The added layer of security defends our web apps from malicious attacks and minimizes potential downtime for customers.”
Delivering high availability for ecommerce campaigns at scale
When it comes to mega sales events, shoppers have many tips and tricks to get the best deals. One winning strategy is to add items to the shopping cart in advance, in case they run out when the sale starts. Imagine the shopper’s disappointment if they find that their cart is empty on sales day, even after preloading it.
To achieve high availability and minimize data loss, Tokopedia runs some of its mission-critical applications with Redis Enterprise, an in-memory database available on Google Cloud Marketplace, which it uses for session management. For example, when a shopper logs in, frequently accessed information is served from Redis to improve the customer experience on the Tokopedia platform.
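The session-management pattern can be sketched with a small in-memory store that keeps each shopper's cart under a session key with an expiry. This is only a stand-in for a session cache such as Redis, not the Redis Enterprise API or Tokopedia's implementation; the class name, the time-to-live policy, and the injectable clock are assumptions made for illustration.

```python
import time

class SessionStore:
    """Tiny in-memory stand-in for a session cache such as Redis (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._data = {}          # session_id -> (expires_at, cart_items)

    def get_cart(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(session_id, None)   # drop expired sessions lazily
            return []
        return list(entry[1])

    def add_to_cart(self, session_id, item):
        cart = self.get_cart(session_id)
        cart.append(item)
        # Each write refreshes the expiry so an active shopper keeps the cart.
        self._data[session_id] = (self._clock() + self._ttl, cart)

if __name__ == "__main__":
    store = SessionStore(ttl_seconds=3600)
    store.add_to_cart("session-123", "phone")
    store.add_to_cart("session-123", "phone case")
    print(store.get_cart("session-123"))
```

In a real deployment, the durability and replication of the backing database determine whether a preloaded cart survives until sale day, which is the availability concern this section describes.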
“We once had an incident where we had to significantly upscale our Redis instance within 15 minutes in the middle of a big sales event,” says Tahir. “The Redis Enterprise team supported us to upscale the instance seamlessly without any disruption to our event traffic.”
Redis Enterprise improves developer productivity by automating maintenance tasks, such as database scaling and software patching. DevOps can focus on delivering features instead of worrying about managing or maintaining Redis workloads at scale. For real-time observability, the Redis database integrates with Grafana so developers can quickly identify slow commands causing latency and troubleshoot performance issues before they become a problem.
From a business perspective, Tokopedia can bypass the procurement cycle and deploy the database service quickly as part of its Google Cloud subscription. By selecting Redis Enterprise on Google Cloud Marketplace, Tokopedia benefits from a simple and integrated bill that shows Redis Enterprise charges alongside Google Cloud usage fees.