How to build a platform which will survive Black Friday

I still remember George Colony of Forrester saying, “Throughput and processing integrity will not be the key considerations; the magic will center on overall customer experience”. Is he right? Let's see.

As you know, online retailers struggled to meet demand on Black Friday in the UK (28th Nov 2014). The websites of John Lewis, Tesco, Boots, Currys, Argos and GAME, along with many others, crashed or struggled within a few hours of the peak. Some of them came back, while for others the problems persisted almost until noon. Currys even went to the extent of creating queues, with waiting times of hours before users were allowed in. In the “Age of the Customer”, where a retailer's conversion drops if page response time increases by even a couple of seconds, it is odd that users have to wait for hours.

All of these websites have a great customer experience (yes, there is always room for improvement), but maybe their lack of enterprise architecture and stability was exposed.

This is where, I believe, we need to challenge George. The statement should be “Throughput and processing integrity will remain the key considerations; the magic will center on overall customer experience”. This magic is enabled by a combination of enterprise technology and creative technology.

Here are some best practices and key considerations for making sure you can scale for peak:

  • Always use your marketing levers to control the traffic to the site. For example, there is no need to increase paid search in the first half hour of the sale, when you are going to get traffic anyway from word of mouth and the hype created around Black Friday.
  • Isolate your front end, i.e. your websites, from the legacy back end. Most places have modern digital platforms connected to legacy (sometimes mainframe) systems. These legacy systems do not scale, so they become a bottleneck that limits how many orders your front-end systems can take. You should be able to take orders and batch them up even when none of the legacy systems, including payments, are working.
  • Use your caching strategy cleverly. Please refer to the details on caching strategy. Make sure your cache TTLs (time to live) are set as high as is practical for the business scenario. You will need a different TTL strategy for peak days than for BAU (business as usual). This makes sure your servers only process what they really need to process.
  • Make sure you have provision for dynamic bursting of your servers. Your architecture should scale horizontally. If user load exceeds your projections, you should be able to burst into the cloud or into your test environments, so that you make full use of every environment you have.
  • Have a kill switch for functionalities. You should be able to disable functionalities one at a time, so you can free up as much room as possible for the things that matter.
  • Most important of all: monitor, monitor, monitor. Make sure you monitor real user timings and work out whether there is any impact on the end user. You need RUM (real user monitoring) tools for this. Failing that, at least rely on synthetic user testing, as it will give you some good insights. Monitoring your log files for 4XX errors is not a bad idea either.
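
To make the kill switch and peak TTL points above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `PeakControls` class, the feature names and the TTL values are made up, and a real platform would keep these settings in a shared config store so that every server sees a change at once.

```python
from dataclasses import dataclass, field


@dataclass
class PeakControls:
    """Hypothetical in-memory operational controls for a peak trading day."""

    # When True, the longer "peak" TTLs are used instead of the BAU ones.
    peak_mode: bool = False

    # Hypothetical feature flags; True means the feature is switched on.
    features: dict = field(default_factory=lambda: {
        "recommendations": True,
        "reviews": True,
        "wishlist": True,
        "checkout": True,
    })

    # Hypothetical cache TTLs in seconds, stored as (BAU, peak) pairs.
    ttl_seconds: dict = field(default_factory=lambda: {
        "product_page": (300, 3600),
        "category_page": (120, 1800),
        "stock_level": (30, 60),
    })

    def kill(self, feature: str) -> None:
        """Switch a single feature off; unknown names raise KeyError."""
        if feature not in self.features:
            raise KeyError(feature)
        self.features[feature] = False

    def is_enabled(self, feature: str) -> bool:
        """Unknown features are treated as disabled, which is the safe default."""
        return self.features.get(feature, False)

    def ttl_for(self, resource: str) -> int:
        """Return the TTL for a resource, depending on whether peak mode is on."""
        bau_ttl, peak_ttl = self.ttl_seconds[resource]
        return peak_ttl if self.peak_mode else bau_ttl


if __name__ == "__main__":
    controls = PeakControls()
    controls.peak_mode = True           # sale has started
    controls.kill("recommendations")    # shed a non-essential feature first
    print(controls.is_enabled("recommendations"), controls.ttl_for("product_page"))
```

The point of the sketch is that both levers are plain configuration: on the day, you flip `peak_mode` and switch off features one at a time, without a code release.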

In summary, you have a lot of levers, and with a well-thought-through enterprise architecture you can have good peak days and, most importantly, happy customers.

Please let me know if you have found anything else that has worked for you.