PAYMENTSfn

Surviving Black Friday: Tales from an e-Commerce Engineer

Join us for an entertaining look at what it takes to prepare for Black Friday, as explained by Aaron Suggs from Glossier. Learn how to plan, what went well, and what you should do to prepare for planned site traffic spikes.

Written by
Jordan Chavis
Publication Date
July 2, 2019

Preparing for the "13th Month"

Did you know that the four days from the Friday after Thanksgiving in the US (Black Friday) through the following Monday (Cyber Monday) can often bring in the same amount of sales as a typical month? Physical retail stores are often packed full of shoppers, with crowds that can sometimes rival large sporting events.

e-Commerce sales during this same four-day period have been growing at a steady rate over the past few years as well, with many brands planning for peak traffic and sales well in advance.

Glossier gets ready

This is the case for Glossier, a leading skin care brand in the US. Aaron Suggs, Glossier's Director of Engineering, was tasked with planning for the biggest shopping event of the year.

Learn about capacity testing, see which pages are most important to customers, and find out what happens at exactly 12:01 am when Glossier's ultra rare 20% coupon code goes live. Spoiler alert: not everything goes to plan.

Watch above as Aaron weaves an insightful, entertaining, and super useful look at what the team at Glossier did to get ready for this monumental annual event.

Rough transcript of "Surviving Black Friday: Tales from an e-Commerce Engineer"

Aaron Suggs:        How are people feeling right now? All right, all right.

Audience:           Woo.

Aaron Suggs:        A little more. A little more.

Audience:           Woo-hoo.

Aaron Suggs:        All right. Fantastic. Yeah, so, I wanted to start out by saying thank you to Peter and Helen at Spreedly, and all the folks who helped organize and put on this conference. It's my second year being here. And I think it's a really great local community and a great payments resource that you all are building here. I wanted to say thanks for that.

Aaron Suggs:        So, this is an original talk that I'm giving and so I'm excited to drop some fresh content here. Let's make sure my clicker's going. Okay. So, yeah, I'm Aaron Suggs. I go by ktheory on the Internet, that's like Twitter and GitHub. Feel free to at me and slide into my DMs. I'm Director of Engineering at Glossier and one of the teams I lead is our tech platform team that's responsible for site reliability, infrastructure, performance, security and developer productivity. And before Glossier, I was doing payments and platform ops engineering at Kickstarter like Peter mentioned.

Aaron Suggs:        And so, for those of you who might not be familiar with Glossier, let me tell you a little bit about what our brand is, what our platform is. So, we're a direct-to-consumer e-commerce brand in skincare and makeup and beauty, that category. And so, if you identify as a woman and you use Facebook or Instagram, there's a pretty good chance that you've seen some of our ads. What makes us sort of special and unique as a beauty company is that we are really good at listening to our customers. And so, the company started out as Into the Gloss, a blog by Emily Weiss, who's sort of interviewing women about their beauty routines, sort of sitting in their bathrooms, talking about their skincare regimen and that sort of thing.

Aaron Suggs:        And then, we have what we call our G team or like our customer experience team, that's really interactive on Reddit and Instagram and all these places where there is a really solid community of people talking about beauty and skin care. The value of the company is skin first, makeup second, as in like taking care of yourself and being healthy and having good hygiene is like a necessary thing that sort of everybody wants to do and then makeup can be an optional second.

Aaron Suggs:        Just in terms of business metrics, we don't share a lot of public information about the financials, but in 2018, we did over a hundred million dollars of revenue and that is growing very quickly. It's a mix of both the e-commerce website and some permanent retail locations and some pop-up retail locations we do. We're about 200 employees, 30 of them are software engineers on the tech team and just because people are interested in our tech stack, it's Ruby on Rails using Solidus, which is a fork of Spree E-commerce. We use Stripe as our payment gateway and it's all on AWS.

Aaron Suggs:        So, I wanted to talk about our Black Friday experience. This is for people who may be not super familiar with the US consumer industry. There's a consumer bonanza in the United States. It's the Friday after Thanksgiving. We sort of think of this as like our peak holiday period extending from Friday, Saturday, Sunday through Cyber Monday. And it's like those four days can often represent a typical month's worth of revenue compressed in that short time period.

Aaron Suggs:        All right. And so, for Glossier specifically, we run a 20% off promotion. It's pretty uncommon for us to do these sort of discounts or promotions at all. And so, the consumers are really eager to get our products during this time where they're on sale. And so, having this huge surge of traffic when a lot of new customers are looking at your site, it presents a unique scaling challenge to deliver a reliable e-commerce experience. Visually, this is what our order volume looked like on Black Friday 2018, so you can see that once we get to Friday, we're doing about 10 times our normal traffic. Friday is the biggest day and Monday is our second biggest day of the year. And then, the Saturday and Sunday in between are still way above typical daily order volume, and are our third and fourth largest days of the year.

Aaron Suggs:        And so, this talk is a narrative about how we prepared, how it went and what we learned from that experience. They say people don't want to see how sausage gets made. And this is a talk where I'm going to show you how sausage gets made. I invite you to celebrate our wins or enjoy the schadenfreude or commiserate in our failures.

Aaron Suggs:        So, it's not all smooth sailing, it wasn't all rosy. This is a PagerDuty alert that went out at 12:07 AM, seven minutes into Black Friday. The site is under high load, would appreciate assistance. It's a really interesting story of what happened in this case. We did a lot of things well, but some things didn't go as planned and we learned a lot.

Aaron Suggs:        So, let's rewind. We're going to go back to September of 2018, when we decide that we are going to get serious about preparing for Black Friday. We know it's coming up. The first thing we decided to do is make a team. This is on the technology side of things. On our sort of logistics and customer experience and product development side of things, they'd been planning for this for a long time. The tech team, we were starting to face this challenge head on and saying like, "Okay, we're at a scale where we can't just all be working on user facing features. We have a lot of shared infrastructure and site reliability concerns."

Aaron Suggs:        This is when we sort of set aside four engineers and named a directly responsible individual. That was myself, to just sort of say like, "You are the point person for ensuring that our site is going to work on Black Friday." The long-term mission of this team was sort of more broad than that. Empowering our other tech teams to quickly deliver reliable product features is our sort of like general tech platform mission. But the narrow specific short term thing was just make sure we're up for that peak traffic.

Aaron Suggs:        Okay. So, we made a team. Team has the skills that we need to address the challenge. Next step was to make a plan. And so, we were going to be more rigorous and systematic about fully answering the question, what should we expect to happen during this peak period, and have we covered all the bases that we should in order to gain confidence that things will go as planned?

Aaron Suggs:        And then, you execute the plan. And now, sometimes I look at this and it's like, "Oh, is this too trite?" It seems so simple and obvious like, "Duh, okay, make a team and make a plan." And it sounds so easy to have these three step programs. But I also want to say not all three step plans are created equally. It's pretty easy to have like a simple sounding plan that actually turns out to be completely ineffective because it's relying on poor assumptions or wishful thinking.

Aaron Suggs:        And, to this end, I wanted to shout out this book recommendation here, Good Strategy, Bad Strategy by Richard Rumelt. This book has really helped clarify my own thinking and several of my colleagues and friends' thinking about how to clearly state a challenge and develop coherent steps to address whatever sort of business challenge you're trying to face. So, we have this team with the skills to address the question of how do we provide this reliable e-commerce experience on Black Friday?

Aaron Suggs:        We're making a plan, but what is the plan? The plan is capacity testing. So, doing capacity testing of your system forces you to understand how it behaves in different scenarios. And I've seen some teams do this in incomplete or ineffective ways and so, I wanted to explain what I see are the three necessary important ingredients of doing good capacity testing.

Aaron Suggs:        Number one is define your target. You have to know what you're aiming for and what success looks like when your system is operating well. This forces you to make a prescription about what your system ought to do, rather than just doing the description of how your system behaves.

Aaron Suggs:        So, next up, you measure what your actual capacity is and this is really helpful to just know the limits of your system. This is the descriptive part. And then, the final step is you remove bottlenecks until you meet your capacity target, right? And so, steps two and three here are kind of in a programming loop, right? You just keep looping back in between measuring, removing bottlenecks, measure again until you meet your capacity target.
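
The measure-and-fix loop described here can be sketched in a few lines of Python. This is an illustrative sketch, not Glossier's tooling: the names tune_until_target and ToySystem and all the numbers are made up, and in practice "measure" would be a load test run and "remove a bottleneck" would be engineering or scaling work.

```python
from typing import Callable, List


def tune_until_target(
    measure: Callable[[], float],
    remove_bottleneck: Callable[[], bool],
    target: float,
    max_rounds: int = 10,
) -> List[float]:
    """Repeat measure -> remove bottleneck until capacity meets the target."""
    history: List[float] = []
    for _ in range(max_rounds):
        capacity = measure()          # step 2: measure actual capacity
        history.append(capacity)
        if capacity >= target:        # step 1 defined what "done" means
            break
        if not remove_bottleneck():   # step 3: fix the current bottleneck
            break                     # nothing left to fix
    return history


class ToySystem:
    """Stand-in for a real site; each fix multiplies capacity (made-up numbers)."""

    def __init__(self) -> None:
        self.capacity = 100.0
        self.fixes = [1.5, 2.0, 1.2]

    def measure(self) -> float:
        return self.capacity

    def remove_bottleneck(self) -> bool:
        if not self.fixes:
            return False
        self.capacity *= self.fixes.pop(0)
        return True


system = ToySystem()
history = tune_until_target(system.measure, system.remove_bottleneck, target=250.0)
print(history)
```

The loop stops as soon as a measurement meets the target, so the third fix in the toy system is never applied.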

Aaron Suggs:        Okay, so let's dig into how you define a good capacity target. First off, this is a very collaborative experience, right? So, even though the tech team is doing a lot of the work on this, figuring out what should we expect on Black Friday, that is a very cross-functional and collaborative effort. So, in particular the data team and the marketing team in our case brought a lot of context and expertise to helping to come up with a good target. And because it's this sort of prescription of what you're aiming for, it sort of focuses organizational alignment on the same goal.

Aaron Suggs:        In our case, what we decided to focus on for our targets were the peak orders per minute and the peak page views per minute across the three most important types of pages on our site. That's the homepage, PLPs and PDPs. If you're not familiar with the e-commerce jargon, PLP is a product listing page, like a search result where you just see a whole bunch of products in a list. PDP is a product detail page where you've clicked on the product and you're seeing everything about that product, the reviews, ingredients, or measurements or whatever, that kind of thing.

Aaron Suggs:        And so, for most e-commerce sites, those three types of pages, PLPs, PDPs, homepage are the most important experience. And this sort of captured the customer journey that we were expecting during peak traffic. Right? Somebody lands on the homepage, they click around some PLPs, maybe they click a few PDPs, they add some stuff to their cart, they click around these three pages some more, eventually they start to check out and they do an order.

Aaron Suggs:        And then, knowing from just my understanding of how our website works, we knew that the peak order volume, whatever minute was the sort of high water mark of what we should expect in terms of volume, that's what we needed to make sure we could sustain. And then, the rest of the time when the order volume is less than that, we're sitting pretty.

Aaron Suggs:        Okay. But, what other information did we bring to defining this target? Fortunately, because the company has been around for more than one Black Friday, we had prior information that we could look at. So, we said, "Okay, let's look at this window of time, the few months preceding that peak holiday traffic from 2017, and basically assume a same proportional change." We can say what was the proportional change from say a random weekday in November to the peak Black Friday traffic in 2017? And now we can assume that it's going to be about the same here.
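
The "same proportional change" estimate is simple arithmetic, sketched below in Python. The function name forecast_peak, the padding parameter and every number are hypothetical, chosen only to illustrate the calculation. They are not Glossier's figures.

```python
def forecast_peak(
    baseline_now: float,
    baseline_last_year: float,
    peak_last_year: float,
    padding: float = 1.0,
) -> float:
    """Scale this year's ordinary traffic by last year's peak-to-baseline ratio."""
    ratio = peak_last_year / baseline_last_year
    return baseline_now * ratio * padding


# Made-up example: an ordinary November weekday last year peaked at
# 40 orders per minute, and Black Friday peaked at 400 (a 10x ratio).
expected = forecast_peak(baseline_now=60, baseline_last_year=40, peak_last_year=400)

# The capacity target can sit somewhat above the expectation.
target = forecast_peak(60, 40, 400, padding=1.25)
print(expected, target)
```

Note that this only projects the size of the peak. It says nothing about the shape of the traffic curve, which is a separate assumption.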

Aaron Suggs:        And looking at the sort of shape of our traffic, we were expecting this nice gentle hill. This was the sort of expected customer behavior where we'd see some steady growth throughout the morning. This was like organically growing traffic and then it would peak around say 11:00 AM or so and then gently taper off throughout the day. This was distinctly different than a big countdown to a flash sale where it's just very low sales, very low sales, and then like boom, right at a certain moment there's like a huge spike.

Aaron Suggs:        I know that is a harder thing to plan for and ensure that you have the sufficient capacity for, whereas when you have that nice gentle peak, you have time to react to all the volume that is building up. Let's just put a pin in this important point because it might come up in the future. All right, so we're planning for that gentle peak.

Aaron Suggs:        Okay, so step two. We want to measure our actual capacity, right? So, this is something that the tech platform team could own. We use a service called Flood.io. They worked out great. They're very helpful. You sort of write a TypeScript file that sort of models the flow throughout your test. We decided to use our production environment because there was this question of like, "Oh, should we use staging or build some other production-like environment?"

Aaron Suggs:        We really didn't want to get in the place where our tests weren't realistic because we'd made some bad assumption about whether our test environment was sufficiently production-like, so we decided we're just going to use the same infrastructure. But in order to do this, we needed to enable the sandbox account in Stripe because there's no credit card that would actually let you place thousands of orders in a couple of minutes and have those go through.

Aaron Suggs:        So, we made a little customization to allow us to do that. We ensured that these real orders were not actually fulfilled by our warehouse. These were orders paid for with fake money. And so, we didn't want to actually deliver them. And then, we made a little exception to our business reporting that we would exclude these orders as well. So, the marketing team didn't look at our capacity testing and say, "Wow, our conversion rate is through the roof." And I wanted to say Kudos to our data team for like having this built into our data pipeline to like automatically exclude certain email regexes from our business reporting dashboards.
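
Excluding load-test orders from reporting by email pattern might look roughly like the Python sketch below. The patterns, field names and addresses are assumptions for illustration. The talk does not say which regexes or pipeline Glossier's data team actually uses.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical patterns identifying synthetic load-test customers.
TEST_EMAIL_PATTERNS = [
    re.compile(r"^loadtest\+\d+@example\.com$"),
    re.compile(r"@capacity-test\.example$"),
]


def is_test_order(email: str) -> bool:
    """True if the customer email matches any load-test pattern."""
    return any(pattern.search(email) for pattern in TEST_EMAIL_PATTERNS)


def reportable_orders(orders: Iterable[Dict]) -> List[Dict]:
    """Keep only orders that should count toward business metrics."""
    return [order for order in orders if not is_test_order(order["email"])]


orders = [
    {"id": 1, "email": "customer@example.org"},
    {"id": 2, "email": "loadtest+42@example.com"},
    {"id": 3, "email": "bot@capacity-test.example"},
]
kept_ids = [order["id"] for order in reportable_orders(orders)]
print(kept_ids)
```

The same predicate could also gate warehouse fulfilment, so test orders are neither shipped nor counted.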

Aaron Suggs:        Oh, pro tip. So, before getting into this, I asked the team, I was like, "What do you think our current capacity is actually going to be?" And then, the winner got the baked good of their choice and the most ridiculous hat I could find on Amazon. And this was a really nice moment because we were actually kind of pessimistic about what our volume would be. The person who'd named like this super high number, it seemed almost outlandishly high, they were actually the most accurate and it was because they knew that we had really optimized the promotion logic that had been a bottleneck previously in the year.

Aaron Suggs:        Oh, another bonus of doing this capacity testing, just to get our capacity testing to work. We sort of sussed out a bunch of bugs and race conditions that had been affecting a few orders throughout the year. And so, this really just sort of like was a bit of hardening on our system in order to even support the capacity testing.

Aaron Suggs:        Okay. So, we were measuring our capacity. Now step three is you remove bottlenecks until you meet your capacity target. There are really like two big ways to do this, right? Either you can scale up, add more servers, bigger servers, you can just add more capacity that way. Or you can optimize, which is, instead of spending money to get more server resources, you spend engineering effort to make your system work more efficiently. And there's sort of a trade-off. We did some of both, honestly.

Aaron Suggs:        And then, I wanted to call out this trap of, you cannot just improve any performance aspect of the website. You're only adding capacity if you improve the performance of something that's the bottleneck. So, let's say for example, you make your JavaScript payload smaller by removing some dependencies. Unless that network bandwidth of downloading the JavaScript was the bottleneck, which it probably isn't, you haven't actually increased your capacity, but you have increased performance. So, it's like, you've done something nice, but you have to be careful to make sure that you're addressing the bottleneck if you are trying to increase your capacity.

Aaron Suggs:        Okay, so a little framework for how to identify bottlenecks. This is coming from some experience with systems thinking and just trying to chase down a lot of bottlenecks from time to time. You sort of pick one of each of these two columns that I'm going to show. So, first one is computing resources. Every server you know has a CPU, it has disk IO, has network IO, and then you have all the tiers of your application stack, your load balancers or application servers or databases or whatever.

Aaron Suggs:        Whenever you're making a web request and you're waiting for that to come back, what you're waiting for is one of these compute resources on one of these system tiers. And so, finding that bottleneck is just chasing and like, "Okay now we're waiting for CPU on the app servers. Now we're waiting for disk IO on the database servers." Whichever one is taking the most time and is the easiest to get rid of in that request response life cycle. That's what you want to be addressing in order to improve performance and improve the capacity of your system.

Aaron Suggs:        Second book recommendation, Thinking In Systems by Donella Meadows. This has been a really helpful book for me to sort of clarify how to model and understand the behavior of complex systems. I think a theme of this conference has been payment systems are really complex systems and this has been a great way to sort of break it down.

Aaron Suggs:        All right, so quick recap. Capacity testing. Three easy steps. Define your target, that's collaborative and cross functional. Measure your capacity, Flood.io helps a lot there. And then remove bottlenecks until you meet the capacity target.
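
The two-column framework (one compute resource on one system tier) can be written down as a small lookup. The Python sketch below assumes you already have per-request wait times broken out by tier and resource. The numbers and the find_bottleneck helper are invented for illustration, and it considers only time spent, not how easy each bottleneck is to remove.

```python
from typing import Dict, Tuple

# (system tier, compute resource) -> seconds a typical request spends waiting.
# All values are made up.
WaitTimes = Dict[Tuple[str, str], float]


def find_bottleneck(wait_seconds: WaitTimes) -> Tuple[str, str]:
    """Return the (tier, resource) pair where requests wait the longest."""
    return max(wait_seconds, key=lambda pair: wait_seconds[pair])


checkout_waits: WaitTimes = {
    ("load_balancer", "network_io"): 0.01,
    ("app", "cpu"): 0.12,
    ("database", "cpu"): 0.45,
    ("database", "disk_io"): 0.08,
}
bottleneck = find_bottleneck(checkout_waits)
print(bottleneck)
```

Once the largest entry is fixed, the next measurement will usually point at a different pair, which is why measuring and fixing form a loop.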

Aaron Suggs:        A couple more things that came out of this process that were pretty helpful, fleshing out the plan a little bit. We happen to know a familiar contractor who had optimized a bunch of checkout systems previously and so we hired that person and it was a big help. I wouldn't say that's super generalizable, but if you can hire good talent, do it. We put all the copy and promo code changes for the Black Friday launch behind a feature flag. That means we're testing it in production amongst staff for weeks ahead of time. The actual go live at midnight was the most trivial code change. It was just flipping on a feature and it was all code paths that we'd been testing for the capacity testing beforehand. It's so nice to be able to de-risk big changes like that.

Aaron Suggs:        We also had a really solid internal communication plan. For example, we made a dedicated Slack channel that everybody would be on. So, this was the tech team, our logistics, our retail team, customer support, data team, all in there sort of talking together, having quick decisions, quick context sharing about what was going on, on Black Friday.

Aaron Suggs:        We made a special PagerDuty alert that anybody could just send an email and page out several engineers all at once to make sure that there was quick attention on any issue that came up. We weren't planning on using that. And then, we also made an hourly on call rotation for throughout the weekend. So, fortunately, we have a bunch of engineers in Canada and in Canada, Black Friday is just Friday. And so, they were on call throughout the day. Saturday and Sunday was like we'd all pick an hour or two that we would be on call for. And we were really sitting at our desk knowing that this was a critical time for the website.

Aaron Suggs:        Okay. So, the results of our capacity testing, where did we end up after doing all of this work? The lowest number we had was the expected traffic volume. Then we had set our target, a little padded above that. And then, we even exceeded our target by 2-4x on certain metrics. So, we thought we were sitting pretty there. And then we knew what our bottlenecks were at our capacity that we've measured. For the checkout rate, our bottleneck was database CPU, and a lot of that came from inventory accounting. This is like atomically decrementing inventory as you add it to cart or checkout. And for page views, homepage, PLPs, et cetera, it's application CPU. It's just limited by how many app servers we were running.
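
A staff-only feature flag that later gets flipped on for everyone could be sketched like this in Python. The FeatureFlags class, the flag name black_friday_promo and the discount_rate helper are hypothetical. The talk does not describe which flagging system Glossier uses.

```python
from typing import Set


class FeatureFlags:
    """Tiny in-memory flag store (hypothetical; real systems persist flags)."""

    def __init__(self) -> None:
        self._staff_only: Set[str] = set()
        self._everyone: Set[str] = set()

    def enable_for_staff(self, name: str) -> None:
        self._staff_only.add(name)

    def enable_for_everyone(self, name: str) -> None:
        self._everyone.add(name)

    def is_enabled(self, name: str, is_staff: bool = False) -> bool:
        if name in self._everyone:
            return True
        return is_staff and name in self._staff_only


def discount_rate(flags: FeatureFlags, is_staff: bool) -> float:
    """Same code path for staff testing and for the public launch."""
    return 0.20 if flags.is_enabled("black_friday_promo", is_staff) else 0.0


flags = FeatureFlags()
flags.enable_for_staff("black_friday_promo")      # weeks before launch
before_staff = discount_rate(flags, is_staff=True)
before_customer = discount_rate(flags, is_staff=False)

flags.enable_for_everyone("black_friday_promo")   # the midnight "go live"
after_customer = discount_rate(flags, is_staff=False)
print(before_staff, before_customer, after_customer)
```

The launch itself changes only flag state, so the code customers hit at midnight is the code staff have already been exercising.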

Aaron Suggs:        So, knowing those capacity limits, we could even, as an extra backup plan, prepare some mitigation techniques, right? So, if that database CPU was a bottleneck, we could disable inventory accounting per SKU. This really means we're switching from a strongly consistent inventory tracking to an eventually consistent one, where we would look at the recent line items we've sold and ask ourselves, "Do we have enough in stock?"

Aaron Suggs:        Because of how our business is, this was kind of easy to do, or this was possible to do, because we knew we had plenty in stock and we weren't going to sell out. So, this was kind of okay. It was just a little gotcha for this program called BOPS, which is buy online, pick up in store. It's for like New York retail locations and this was a really popular thing to do, but we just have much lower inventory in our retail locations and in our store, so we decided that we would leave the strongly consistent inventory tracking enabled for that BOPS experience because we're so over capacity and we don't expect that we're even going to need this capacity.

Aaron Suggs:        And so, if we needed to add more app servers to scale up the app CPU, we knew that took about 20 minutes to do. We also could vertically scale our database. That takes about 50 ... There's a Postgres RDS database on AWS, that takes about 15 minutes to bring up the new server. But your site's available during that time, two to five minutes to reboot.
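
The eventually consistent check ("look at the recent line items we've sold and ask whether we have enough in stock") can be sketched as a periodic reconciliation. The Python below is an assumption-laden illustration: the find_oversold helper, the SKU codes and the quantities are all invented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def find_oversold(
    stock_on_hand: Dict[str, int],
    recent_line_items: Iterable[Tuple[str, int]],
) -> Dict[str, int]:
    """Return {sku: shortfall} for SKUs sold beyond the stock on hand.

    Nothing is locked or decremented at checkout time; this runs
    afterwards, off the hot path.
    """
    sold: Counter = Counter()
    for sku, quantity in recent_line_items:
        sold[sku] += quantity
    return {
        sku: sold[sku] - stock_on_hand.get(sku, 0)
        for sku in sold
        if sold[sku] > stock_on_hand.get(sku, 0)
    }


stock = {"SKU-A": 1000, "SKU-B": 5}
line_items = [("SKU-A", 2), ("SKU-B", 3), ("SKU-B", 4), ("SKU-A", 1)]
oversold = find_oversold(stock, line_items)
print(oversold)
```

The trade-off is that a SKU can be briefly oversold, which is only acceptable when stock is plentiful relative to demand.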

Aaron Suggs:        All right. And then we had various feature flags to disable anything that we didn't absolutely need. Right? Phew. Okay. We had this rock solid plan, tested the shit out of it. We got this, right? Black Friday comes, 12:07, comes the PagerDuty alert. What's going on? Oh my goodness. All right, let's go back to 10:00 PM on Thursday, Thanksgiving and we send out this email. Subject line is T-minus two hours. I wouldn't be surprised if some folks in the audience got this email and we said, "Oh." The call to action here is to add something to your calendar, which isn't like the best call to action, but we're just like, "Oh, get psyched about this deal that's coming in two hours."

Aaron Suggs:        And so, when marketing and I had reviewed this together, we convinced ourselves, we're like, "Oh, this isn't like a flash sale countdown" because we knew that this sale is going to go on for four days and we have plenty of inventory. So, you don't need to do this big rush right at midnight. But, we hadn't sufficiently put ourselves in the customer mindset where they're used to some products going out of stock from time to time and they're like, "Oh there's going to be this big sale. I want to get them before they possibly go out of stock."

Aaron Suggs:        So, here's what this looks like from a Datadog monitoring perspective. This is our add to cart metric, right? So, how often are people adding stuff to their cart? We see it at 10:00 PM when this email goes out, suddenly a lot of people are adding stuff to cart and if I showed the actual checkout rate, it's funny because the checkout rate goes down so people are just adding stuff to their cart, leaving it there knowing that this 20% discount is going to go on at midnight.

Aaron Suggs:        All right. So, I have a 6:00 AM Friday morning on-call shift. I think I'm being a mensch, going to take like the early morning shifts. So I'm going to sleep. And meanwhile a bunch of people on the team who are staying up for like the go live at midnight, see this and they're like, "What is going on?"

Aaron Suggs:        All right. Now we're going to zoom ahead to include midnight, 20% promo goes live and, oh my God, there's that sheer cliff that we knew we wanted to avoid. The site becomes barely usable. Several orders are getting through. We're having more order volume than we'd ever seen before. But there were also a lot of people who are getting errors, particularly timeouts. PagerDuty goes off. Myself and many other engineers all hop on this conference call. We're in the Slack room sending lots of metrics, trying to talk about what is going on 'cause we were not expecting this.

Aaron Suggs:        So, really quickly we sort of align on three different levels to look at the problem. So, I'm going to say what the symptoms are from a business level. The site was sluggish and customers were getting frequent errors. Fortunately, we were already using the comms channels that we'd set up as part of our planning. Kudos to us. Good planning. At an app level though, the problems were very high page views, more page views than we'd planned for and our checkout rate was well above what we'd planned for and there were many timeout errors.

Aaron Suggs:        Right now in the silveriest of linings, the site was broken but in a way that was just how we predicted it would break from our capacity testing. So, we're actually able to use experience we'd seen in our capacity testing to know what we can do in this case. So, there was something familiar about these symptoms, but we still didn't understand why there was so much demand. We hadn't totally connected the dots to how that countdown email had changed customer behavior at midnight.

Aaron Suggs:        And then, from a system level (we were looking at this business level, app level, system level), it's just like our app and database CPU are pegged at 90% plus. There's basically no capacity left to process any more checkouts, et cetera. So, what are the levers that we can pull to make this better? We decided to pull all three levers at once that we had. We disabled inventory tracking, we added a bunch more app servers and we vertically scaled our database to the biggest one.

Aaron Suggs:        Now in retrospect, we could have probably just disabled inventory tracking and a lot of the extra page views weren't necessarily extra customers, but really just people who are frustrated getting timeouts and so you start refreshing or your page is taking awhile to load, so you refresh and it's just sort of that vicious cycle.

Aaron Suggs:        But, so, our key learning here was that we should have prepared some of these mitigation scripts ahead of time. We assumed again that we'd have that gentle curve and we could say like, "Oh, inventory tracking is taking a lot of database CPU, let's disable it for one or two SKUs and we'd be clicking around in a web UI to do that." In fact, what we needed to do was en masse disable all the inventory tracking and so it took us a couple of minutes to just like write this script because we didn't want to click through our scores of SKUs. And so, this added several minutes to the remediation.

Aaron Suggs:        Another key learning was that when you try to scale a Postgres RDS database that's under high load, it takes longer than when it's under low load. So, in our case, it took about 20 minutes to go from when it started being unavailable to restart, until it actually came back online. This was really surprising. There was a really dark moment where we were like, "Boy, do we promote one of our replicas to the leader? Is it going to work? It's only ever taken five minutes in the past." I think there was some stuff around transactions and timeouts that we needed to set and configure in Postgres that would allow this to be faster in the future.
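
A pre-written bulk mitigation script of the kind described might look like the Python sketch below. It is purely illustrative: the record shape, the track_inventory field and the keep_tracked exception list (for something like the in-store pickup SKUs mentioned earlier) are assumptions, and a real script would update the store's database rather than in-memory dictionaries.

```python
from typing import Dict, FrozenSet, List


def disable_inventory_tracking(
    skus: List[Dict],
    keep_tracked: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Turn off inventory tracking for every SKU except the listed exceptions.

    Returns the codes that were changed, so the action can be logged
    and reversed later.
    """
    changed: List[str] = []
    for sku in skus:
        if sku["code"] in keep_tracked or not sku["track_inventory"]:
            continue
        sku["track_inventory"] = False
        changed.append(sku["code"])
    return changed


catalog = [
    {"code": "SKU-A", "track_inventory": True},
    {"code": "SKU-B", "track_inventory": True},
    {"code": "SKU-C", "track_inventory": False},
    {"code": "SKU-PICKUP", "track_inventory": True},
]
changed = disable_inventory_tracking(catalog, keep_tracked=frozenset({"SKU-PICKUP"}))
print(changed)
```

Having this written, reviewed and rehearsed before launch turns a several-minute scramble into a single command.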

Aaron Suggs:        So, here's a graph of our orders per day from midnight to 1:00 AM, and you can see that there was this really high spike early on and that spike probably would have been faster if we'd had the sufficient capacity. It's kind of hard to say how high that really would have been. But in order to understand the impact of what we lost here, it's really where this crying face is. Everything in that area is where there was a bad customer experience that we'd aim to do better in the future. And now in retrospect, had we just disabled inventory tracking before this went live, we're pretty confident, but we don't know for sure, but I would say it's likely that it would have been smooth sailing all the way through.

Aaron Suggs:        One other thing that we had to do was fix broken orders. Right? So, we lacked atomicity on some of our checkout processes we realized and so there were a bunch of orders placed in that hour that were in this inconsistent state. Maybe missing collateral or we hadn't collected the payment even though we sent a confirmation email. And so, here we got on a conference call with our customer experience team and they started to workshop what the customer comms are. We were writing a script to fix things. This was a really effective collaboration that let us go from this like unfortunate customer experience to something that we ended up communicating really well and handling really well. We fixed all the callbacks.

Aaron Suggs:        And so, then here's the rest of Black Friday, right? This is what we were expecting with that nice gentle hill. Oh my gosh. That is so nice. So, this was interesting because I go to bed late at night and then I wake up in the morning and I don't know if I'm like in this new world where like, "Oh my gosh, are all our expectations different? Midnight was so different. It's going to be a wild ride the rest of the time." No, this was crazy accurate what our data and marketing team had forecast. Our peak traffic was within 10% of what they'd predicted. I'm like, "If engineers can even estimate that's like within 10% what is this magical super power that you have?" So, that was really impressive.

Aaron Suggs:        And overall, we exceeded our revenue targets for the day, despite the problems at midnight. So, overall it's a success with an asterisk. Right? Mrs. Lincoln, besides that, how'd you like the rest of the play? There's obviously room for improvement, but we exceeded our revenue expectations. The rest of the weekend was delightfully boring and predictable from a site reliability perspective. And boy, that midnight thing was unexpected.

Aaron Suggs:        So, how do we turn that into learnings for next year? Right? So, we do our like blameless learning review, sort of notes to ourselves for next year. Boy, midnight was a surprise, let's better understand customer experience when we're making one of these flash sale countdowns. I wanted to call out this sort of like anti-pattern of wanting to say like, "Boy, well, we padded our expectations a little bit. Do we just need to pad them more in the future?" And I wanted to sort of call that out as like an anti-strategy.

Aaron Suggs:        You can't just say our estimates were wrong, let's just pad them more. That's not bringing any new information to the estimate. And really you want to understand what you missed and what you weren't capturing in your estimate and then look to make architectural improvements that would dramatically improve our capacity for subsequent years. We thought through like what the user behavior was that was different. Some of the operator behavior of like having that script ready to go for disabling inventory or just disabling inventory tracking preemptively, and prioritize a more resilient architecture for next year.

Aaron Suggs:        So, our 2019 tech roadmap includes these things like dramatically improving the reliability and performance. So, one of the big ones here is pre-generated pages for the homepage, PDP, PLP. That just makes that app CPU backend work sort of obviated, it would just all be cacheable on a CDN, which is very scalable.

Aaron Suggs:        We're also moving to like an asynchronous checkout flow where we're sort of optimistically taking orders with just minimal validations, knowing that we can fix whatever's wrong sort of in arrears or retroactively to say like, "Oh, that payment method that you'd use previously didn't work this time. Please log in and update your payment method" or something like that. But we don't need to do all the validations and all the inventory accounting before taking your order.

Aaron Suggs:        Now, I wouldn't want to say like we're just doing this so that Black Friday is easier. We're doing this to also drive important business goals around conversion and retention. Having a fast reliable experience improves conversion and your customer lifetime value. And so, that's why it's sort of an easy organizational sell to say, "These are projects that we should invest in because it's going to drive these important business metrics."

Aaron Suggs:        Throughout the team, we've deepened our debugging and systems thinking expertise. Our capacity planning and testing has been super useful and we now do that before any major launch and it's given us a lot more confidence. And so, as a moment of good news, I will say in March we launched a new brand called Glossier Play, and we basically took all these learnings and we sort of applied them. We did the capacity testing, we preemptively disabled inventory tracking. We were able to look at optimizations that we've made to our checkout flow and had measured our capacity three times beyond what our peak was that midnight Black Friday, and our Glossier Play launch from a tech reliability perspective was delightfully boring.
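
An asynchronous checkout along these lines could be sketched as "accept now, validate later". The Python below is a minimal illustration under stated assumptions: the AsyncCheckout class, the status strings and the fake_charge function are hypothetical, and a production system would use a durable job queue and a real payment gateway call rather than an in-memory deque.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Order = Dict[str, object]


class AsyncCheckout:
    """Accept orders after minimal checks; do the expensive work later."""

    def __init__(self, charge: Callable[[Order], bool]) -> None:
        self._charge = charge
        self._pending: Deque[Order] = deque()

    def place_order(self, order: Order) -> str:
        # Only cheap, synchronous validation happens on the hot path.
        if not order.get("email") or not order.get("items"):
            return "rejected"
        order["status"] = "accepted"
        self._pending.append(order)
        return "accepted"

    def process_pending(self) -> List[Order]:
        # Runs in the background: payment capture and any follow-up.
        processed: List[Order] = []
        while self._pending:
            order = self._pending.popleft()
            ok = self._charge(order)
            order["status"] = "confirmed" if ok else "needs_payment_update"
            processed.append(order)
        return processed


def fake_charge(order: Order) -> bool:
    """Stand-in for a payment gateway call."""
    return order["card"] != "expired"


checkout = AsyncCheckout(fake_charge)
responses = [
    checkout.place_order({"email": "a@example.com", "items": ["SKU-A"], "card": "ok"}),
    checkout.place_order({"email": "b@example.com", "items": ["SKU-B"], "card": "expired"}),
    checkout.place_order({"email": "", "items": [], "card": "ok"}),
]
statuses = [order["status"] for order in checkout.process_pending()]
print(responses, statuses)
```

An order that ends up in needs_payment_update is the case where the customer would be asked to log in and update their payment method.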

Aaron Suggs:        All right. That's all I have. Thank you everyone for your kind attention. I have samples that I can give out after the talk or come find me at a break. Thanks for your kind attention.
