Joel Taylor of TaskRabbit joined the PAYMENTSfn 2020 event to give a keynote presentation about working effectively with legacy payment systems. The session offers a first-hand account of how TaskRabbit, an online and mobile marketplace that matches freelance labor with local demand, had to learn, identify, and correct legacy code issues in its payment stack in order to improve the customer experience.
In his session, Joel describes how payments in the marketplace domain face the unique challenge of being responsible for both payment processing and disbursements. This balance of customer and merchant needs is what creates the trustworthy system all parties rely on.
The session takes us on a journey through TaskRabbit's "haunted forest" to discover how wrong assumptions, no matter how minuscule, can have a global impact.
WANT MORE? Joel is back with us for our new PAYMENTSfn Fireside Chats series. Join us August 12 at 1:00 pm EDT to discuss legacy code and ask all your questions. Register here.
Hello, and welcome. My name is Joel, and this is Where's My Money? I'm so excited to take part in this remote conference, and honored to be a part of this community. For the longest time I've been looking for a payment-centric conference and I was so excited when I was on YouTube and I stumbled across a PAYMENTSfn talk, because I've been looking for places to share knowledge and learn what other challenges people are facing. This is exactly what I was looking for and exactly the kind of content that I wanted, so I'm very excited to be able to share my use case with all of you and hopefully it can benefit you at some point.
So there's this running joke at work where I start every talk with payments, payments, payments. It's because I like payments, you should like payments and, well, this is going to be a talk about payments, so I had to do it here as well. Over the past couple of years, I've really leaned in to this whole ecosystem. I've really enjoyed the unique challenges that are out there, and payments is moving at such an interesting rate. There's so many new problems out there. That's probably the reason why I want to do a use case talk is so I could share the experiences I've had, going from primarily working in just US-based e-commerce areas to going to an international marketplace, which has completely different challenges than I've experienced. I'm hoping that maybe some of the things that I've come across will benefit you.
I'd like to establish a bit of vocabulary. For starters, I work at TaskRabbit and we have customers, which is self-explanatory. They're the ones that we're charging, doing the processing for; they're paying for goods and services. We're a gig economy, so that means we also have a merchant, and they use our platform in order to do their services. Then we mediate all the funds and make sure everyone gets paid.
What's interesting here is that we have a second player, a merchant, that's not just interested in e-commerce. So instead of just payment processing, we also have to be responsible for doing disbursements reliably. I think that's a really unique and difficult balance and challenge to handle, because if you neglect the customer payment processing side of things and look after only the merchant, then you have fewer customers and the merchants have less work. But also the flip side, if you don't do good payment processing and it's buggy and you double-charge, then customers run away and then your merchants start to suffer. There's a fine balance of harmony there.
I think also too, for the platform, if you write buggy code, you might get fired, who knows? I think finding that harmony is really tricky because there's trade-offs between both sides in the way that we balance those things and the choices we make in the payment processing side and the choices we make in disbursements.
One more term I want to talk about, and that is haunted forest. This is relatively new to me, but I picked it up in Stripe's quarterly magazine Increment, and there's this article called Exit the Haunted Forest by John Millikin. It's a brilliant piece and a commentary on tech debt. It starts off with asking you to imagine for a moment that you're visiting a small village and you've just checked into the resort and a local is giving you some advice about the area. He tells you that to the east there's some orchards, to the west is a beach, to the north is a haunted forest, and then there's a lake somewhere else.
The character in the story says, "Wait, what? Did you say haunted forest?", to which the clerk responds, "Oh yes, the haunted forest. A bit of a nuisance really. You can't stay out past sundown or the ghosts will get you. Nothing we can do. It's been around for too long to fix now." Which, that just blew my mind when I read that, because what he's asserting here is that we use tech debt as this euphemism to refer to areas of the code that are less than ideal, areas that we've chosen maybe to make compromises or to ignore in order to move faster. But the author asserts that tech debt doesn't fully capture all of the damage that can come along with that and the toxicity that tech debt can get into, that over time, I think, grow to a certain point where it's harmful. They point out a number of characteristics, so there's three that I thought are really interesting.
The first is uncertain behavior, meaning that the engineers don't know what the code should be doing: it's magical, mystical, things kind of happen, and that's a scary place to be. You don't know what you don't know, what do you do? Especially if it's hard to test the code, if it's hard to get into it.
The second is that it's obviously unacceptable. Engineers know that this isn't good, code shouldn't be written like this. This shouldn't be continued. It's kind of just what it is.
The third, which is all too true, is that engineers have fallen. I've seen this where an engineer goes in, gung-ho into this portion of code, and just comes out with the fear of code in their eyes. You can tell, you need to leave them alone for a few weeks to just get their sanity back.
One of the points the author makes that I strongly agree with is that no one sets out to write a haunted forest. Instead, these things happen over time. Sometimes, code has to change rapidly. Sometimes you do have to pivot and there are trade-offs, but the choice there is how we continue to live, or if we're going to tackle that forest. I think that's an important distinction. We have that option to eradicate it, if we want to.
Now we have that over, I want to jump right into the story. That is, I joined TaskRabbit at an interesting time. It was transitional, a lot of new engineers, a lot of new hands shifting. This is a monolithic application, it's been around for 10 years, it's a pretty decent size. I had come to learn about this portion of code that's conveniently called Payments, and everyone avoids it. It's semi-tested, but in a really vague way that's not consistent, it's a little confusing, sparse documentation. No one just really knows too much about it.
At the heart of this is this one model that has 95 columns, 27 unique types and one magical method that does everything. It does direct charges, destination charges, refunds, balance adjustments, credits, everything, in one single method. Then to add to that, there's bus events, there's callbacks, there's state changes, all that fun stuff, all the moving parts, all just packed in there into this one portion of code that nobody wants to touch because the risk is super high. No one wants to be the engineer that types in the code that ends up causing fires in the Payments portion, because that's the lifeblood of an app.
It's scary. It's been about two years now I've been working with this code, and I've grown to learn more about it. I know its favorite kind of food, I know the music it likes to listen to. Even as I've grown accustomed to it, it's still very confusing and it's costly to ramp anyone up on it. Sure, I understand the 27 types, but anyone who wants to refactor, who wants to add code to this, is going to face an uphill battle. They're going to enter a very spooky forest. It's very risky and it's scary.
So what I want to do is share a few bugs that we've come across in this haunted forest of TaskRabbit. Hopefully some of them can save you some trouble, maybe. If not, hopefully it'll just be entertaining. So we're going to have our first event here, that's what I'm going to call these things. We're going to have our first spooky monster.
This all began a few weeks after I started and I received this innocuous ticket. It essentially boiled down to, "Our merchants were saying that they needed to contact support because their money wasn't where it should be." Our app makes it very easy to contact us when those issues arise. The way the state flows is it begins in collecting. This is when the customer has been invoiced: the job is done, and the Tasker, which is what we call the merchant at TaskRabbit, has provided the service to the customer, so the customer is happy and they're done, and the merchant submits the invoice.
There's an arbitrary amount of time between the submission and when we actually do the payment processing. For simplicity, for the talk, let's just say it's a day. Then from there, it goes into disbursement. Now the funds have been charged; we've charged the card, the customer, and now the funds are being sent off to the merchant. The vernacular for this, in the marketplace, is called a destination charge. We're sending money, and the platform takes a small cut of it; that's our fees. The ideal case next is that now we're in disbursed. Everything's happy, it's great.
But there's one more state, and that's the help state, and that's the one that was in this ticket. We had logs and cases of merchants reaching out, asking for help. And so I started doing my very scientific method of picking random cases and looking at them: looking at them in the database, looking at the dates, checking them out; the payment processor, looking at their logs.
And what I find is, okay, all of these have settled, and I see that at the bank account. Part of me thinks, our customers are wrong? But I'm also not going to just, in the first two weeks at a job tell people, "No, it's the customer, not us." There's no way. So I decided I'm going to, instead, try to find out where this code is happening, where is the code that's giving this text to the users? Maybe there's something interesting there.
So brace yourselves for this crazy bit of machine learning, it's three whole conditionals. So this is a simplified version of the code, but it represents what it is. If you read through it top to bottom, it says that if a payment is settled, it's going to be disbursed. Cool. Then it says that if a payment is older than seven days, created as of seven days ago, then "help". Okay. And if the customer's paid, it's disbursing, and it defaults to collecting. It kind of trickles down there.
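The slide is not reproduced in this transcript, so the following is only a minimal Python sketch of the three conditionals as described. The `Payment` fields, the `display_status` name, and the status strings are assumptions made for illustration, not TaskRabbit's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The unexplained threshold from the talk.
HELP_AFTER = timedelta(days=7)


@dataclass
class Payment:
    created_at: datetime            # when the invoice was submitted
    charged_at: Optional[datetime]  # when the customer's card was charged, if it was
    settled_at: Optional[datetime]  # when the funds settled for the merchant, if they did


def display_status(payment: Payment, now: datetime) -> str:
    """Return the status shown to the merchant, checked top to bottom."""
    if payment.settled_at is not None:
        return "disbursed"
    if now - payment.created_at > HELP_AFTER:
        return "help"  # prompts the merchant to contact support
    if payment.charged_at is not None:
        return "disbursing"
    return "collecting"
```

Because the checks run in order, any unsettled payment older than the threshold reports "help", whether or not anything has actually gone wrong with it.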
What's fun here is that that seven seems like quite a magical number. And I tend to get a little curious when I see numbers like that, especially when there's no comments, there's no documentation there, it's just "seven days". That's what prompted me to do an average in the database to see the average created, average settled date, and I'm going to want to just group it by currency codes to see if there's any anomalies.
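That kind of query can be sketched roughly as follows, using an in-memory SQLite database. The `payments` table, its column names, and the sample rows are all invented for illustration; the real schema is not shown in the talk.

```python
import sqlite3

# Invented sample rows: (currency, created_at, settled_at).
SAMPLE_ROWS = [
    ("USD", "2020-01-01", "2020-01-04"),
    ("USD", "2020-01-02", "2020-01-05"),
    ("EUR", "2020-01-01", "2020-01-08"),
    ("EUR", "2020-01-02", "2020-01-10"),
    ("GBP", "2020-01-03", None),  # not settled yet, so excluded from the average
]


def average_days_to_settle(rows):
    """Return {currency: average days between creation and settlement}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (currency TEXT, created_at TEXT, settled_at TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    query = """
        SELECT currency,
               AVG(julianday(settled_at) - julianday(created_at)) AS avg_days
        FROM payments
        WHERE settled_at IS NOT NULL
        GROUP BY currency
        ORDER BY currency
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    for currency, days in average_days_to_settle(SAMPLE_ROWS).items():
        print(f"{currency}: {days:.1f} days")
```

Grouping by currency is a cheap way to surface regional differences without needing to know in advance which regions behave differently.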
And lo and behold, I discover that, in the US we see an average of three days, and then in European countries, we're seeing seven to eight days. So I hop onto the payment processor that we're using and I look at their documentation, and sure enough, there it says that it's a seven-day period for European countries and two to three, in this case, for the United States, which is like, oh, okay, well, this makes perfect sense. This was news to me because I'd always been in just the US payments, not into international. ACH was what I was familiar with.
So I take one of the cases of the European transactions and decide to put it through the flow, just to verify what's going on here. We're onto collecting, and this is day zero. We wait our arbitrary delay, which we're calling one day for now, and then we add the seven days that the processor says it's going to take, and boom, there we are. There is the bug. Pretty simple, it seems like.
So I do what every self-respecting engineer does when you find code and you don't understand why it was written that way, and I run git blame. I find out the engineer's name, I reach out and I ask, "Hey, why did we choose seven days? Like, can you give me a little explanation here?" What I was hoping for was maybe some crazy game of telephone. Maybe an engineer talked to a representative of our payment processor, who talked to a director, to a project manager, and maybe something just got lost in translation and that's why it's seven days. But the answer is great, I love it.
He said, "Oh, seven days just seemed like people should be paid by then", which I love the sentiment behind, because yes, people should be paid quickly. We want that. But it's also not really going to work in every case, especially in international payments. What's really funny about this situation is that this assumption built on three days was everywhere in the code base. I mean, it was coded in the logic, it was in our sparse documentation, and it was even in the material that we use to train our support, so we're communicating this constantly. It's setting the expectation for our merchants that they should see their payments in three to five days, which is just setting them up for failed expectations, and that's just a terrible experience. If you're providing a good, you want to make sure you're paid at the time that you expect it. You're planning on it, this might be how you make your living. It's important that, as a platform, we set the expectations and we deliver accurately.
So we fixed that up with a few lines of code; it was pretty simple, and it feels pretty good when, after a day of digging, you just make a one-line change. Our contact rate dropped by 40% for these cases where we were directing our merchants straight into our Support queue. "Contact" here is when a merchant contacts a human, and that's expensive. It takes our Customer Support time, it takes Project Manager time, it takes Engineering time to debug. It has a big impact on the entire flow of just development. That solved a portion of this problem. As I said, the tickets had a lot of cases, and in this one, it turned out that it was our European friends, and the Canadians, who were mostly suffering from this wrong assumption.
There was another one that was lurking there, and that was still the US payments. And the difference in our US payments that I noticed is that a lot of them said pending, they actually weren't in a disbursed state. But when I went into the payment processor site, it was disbursed. So we'd go back and forth with the merchants and they would check their bank accounts and sure enough, it was good. Now, there were over 10,000 transactions like this. That's a lot of manual intervention to go back and forth. We were able to eventually, programmatically, check the states, update them, all is well. Everyone was happy, and we resolved, reconciled, the whole slew of transactions.
And so, trying to figure out the root cause of that, we were looking at our logs and noticed that we didn't see any incoming webhooks. With destination payments, what we do is we have webhook listeners for when the transfer has settled, and it's pretty common practice to use those. Some folks will say that you should poll and that webhooks are optimizations, but we were using webhooks entirely because they were working for us. They seemed to be reliable.
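The actual change is not shown in the talk. One way such a fix could look, purely as a hypothetical sketch, is to derive the "help" threshold from the processor's documented settlement window per currency instead of a single hard-coded number. The table values below echo the figures quoted in the talk; the grace period, the fallback, and all names are assumptions.

```python
from datetime import datetime, timedelta

# The talk's simplified "one day" between invoice submission and charging.
COLLECTION_DELAY = timedelta(days=1)
# Hypothetical buffer before prompting the merchant to contact support.
GRACE = timedelta(days=1)

# Settlement windows as quoted in the talk; a real table would be taken
# from the processor's documentation for every supported currency.
SETTLEMENT_WINDOW = {
    "USD": timedelta(days=3),
    "EUR": timedelta(days=7),
}
# Assumption: unknown currencies fall back to the slowest known window.
DEFAULT_WINDOW = timedelta(days=7)


def help_threshold(currency: str) -> timedelta:
    """How old an unsettled payment may be before we show "help"."""
    window = SETTLEMENT_WINDOW.get(currency, DEFAULT_WINDOW)
    return COLLECTION_DELAY + window + GRACE


def needs_help(currency: str, created_at: datetime, now: datetime) -> bool:
    """True only once the payment is older than its regional threshold."""
    return now - created_at > help_threshold(currency)
```

With these numbers, an eight-day-old EUR payment is still within its expected window, while an eight-day-old USD payment is genuinely overdue.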
In this case, for this specific processor, I noticed they were missing webhooks in the logs. So either they weren't sending it, or our logs weren't logging. All I know is that, after many conversations, we discovered that sometimes webhooks don't make it and they get a false 200 OK response, which is mind-blowing that that can happen. But sure enough, we can deal with that, we're engineers, let's figure out a solution. It's janitor time.
So I figure, okay, we'll make a worker that will just poll. We'll check the status of a transaction, just to see if it's settled or not: simple. So, write that up, write some tests for it, get two code reviews, ship it, good to go. What's fun is that, a few days go by, and then I start getting a bunch of frantic Slack messages. Merchants are angry; they're being told that their money has been disbursed and it's not disbursed. Uh oh. So again, I pop open the logs and I look at a few cases there. I pick one at random, just a proven scientific method for finding the solution to bugs.
And here's the flow for this one. So we're on the day of collecting, which is good, day zero. We go to day one; we've charged the customer, so now we're going to do the destination charge. In that same day, we also run our janitor code to do a check. We check with the processor saying, "Hey, what's the status of this transaction?" On that same day, I don't know the time span, it could have been end of day, but it was a very short amount of time. We got a response back and it was disbursed, which okay, that's not right. We're not doing instant payouts, we're not that progressive yet. It may be great, but we're not getting that for free, especially via ACH. ACH just can't do that.
So again, hop on Support calls, talk to their team, to this processor. And it turns out that the transaction state isn't the state of the transfer, it's just the state that has been settled on their end and that's been posted to the clearing house. So there's the bug and the spookiness in this one. Now, what's really fun about this one is that I asked the service, "Well, then how on earth do we know when it's settled?" Like, "Oh, use our webhooks." It's like, well, we can't do that because it's not reliable.
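The distinction can be sketched as a polling janitor that only marks a payment disbursed when the merchant-side transfer has completed, not when the customer-side transaction has settled. This is a minimal sketch that assumes a processor exposing both states separately; `ProcessorRecord`, `FakeProcessor`, and the status strings are hypothetical stand-ins, not any real processor's API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ProcessorRecord:
    transaction_status: str  # customer-side charge, e.g. "settled"
    transfer_status: str     # merchant-side transfer, e.g. "pending" or "paid"


class FakeProcessor:
    """In-memory stand-in for a processor client (hypothetical)."""

    def __init__(self, records: Dict[str, ProcessorRecord]):
        self._records = records

    def fetch(self, payment_id: str) -> ProcessorRecord:
        return self._records[payment_id]


def reconcile(pending_ids: Iterable[str], client: FakeProcessor) -> Dict[str, str]:
    """Return the status updates to apply to locally pending payments."""
    updates = {}
    for payment_id in pending_ids:
        record = client.fetch(payment_id)
        # A settled transaction only means the charge side is done;
        # the merchant has been paid only when the transfer says so.
        if record.transfer_status == "paid":
            updates[payment_id] = "disbursed"
    return updates
```

Keying on `transaction_status` instead would reproduce the bug from the talk: a payment charged on day one would be reported as disbursed that same day.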
The moral of this story is that processors are different and they have very different behavior. Over time, what we did is actually phased that one out. There were so many bugs like this, so many different nuances between multiple processors like this. We had two, and there's always this assumption because their vernacular is just saying, they both do destination charges, they both have transfers, they both have payments, they both have credit cards. But there are so many minor differences between the two, and each one had these kinds of implications where it could cause a transaction to be stuck, it could cause unreliable behavior in general. And as we were beginning to tackle those and we were just patching things up, we eventually saw a 90% decrease in contacts regarding stuck payments. Now, things are fluid again and we're moving forward. It begs the question, "Now what?"
So the two issues; the first problem was just the wrong assumption. TaskRabbit had started off in the US and then went international, and then the assumption that payments would be as fast as the US everywhere just kind of stuck around. That's an innocent mistake and I don't think there's really anyone to blame, that's just what was understood because that's what the processor they used had told them and that's what was in the documentation. In the second case, that was just a nuance that had to be discovered, and maybe you just had to be bit to learn. It's unfortunate.
Now I know if we ever add another processor, I'm going to have a bunch of questions. Like, can you tell me the disbursement's status confidently? Are your webhooks guaranteed; do you retry webhooks? Is escrow a part of your destination charge? And if we do a refund on a destination charge, does it refund the whole thing? Is there a balance or a transfer reversal, or is it just the customer reversal? These are all things that we had to come across the hard way, as a team that was learning the system, learning how it behaved without really knowing what was going on, which is scary because it creates a lot of room for mistakes.
And so the solution for this, not to be facetious, but I believe it's RTFM, which in this case stands for "Read The FinTech Manual". It's our responsibility to perform the due diligence and understand the core components. What I've really had to fight with in communication around payments at TaskRabbit is the concept of what a destination charge is. It's such a loaded term. It's easier to just say something like, "Oh, it just transfers money", but it's important to understand that it's a charge and a transfer. That way we know there are two things that can go wrong there. It's not just a magical payment operation, there's a bank account involved and there is a credit card or debit card involved. Knowing that's important, and knowing how your processor handles that is critical to providing a service that's reliable so you can know those nuances, especially as you go to different countries. Even with the best abstraction, there's still going to be these cases that you have to account for, and we have to be prepared to ask the right questions.
Now, there are tools out there, I think, and the two that I've come across that have been really helpful in our "haunted forest" are these. The first is Working Effectively with Legacy Code by Michael Feathers; this has been pivotal in providing strategies for how to wrangle these pieces of code that you just don't understand, or that are too complex to put under test. And I think this is just a classic that is really a great toolbox to have and to reach for. The second is the bible of accounting, Analysis Patterns by Martin Fowler. The patterns in there are time-tested and proven, and I think provide a great grounding for where to base a design.
I've also taken a lot of inspiration from open source projects like Kill Bill from Groupon, which I think does a great job of modeling their payment system. Most notable in their project is that you can see the lifecycle of a payment. You can see the params that were sent to the processor, you can see the response and see the entire paper trail. I think it's a really well-designed system that pulls on a lot of the analysis patterns there.
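The paper-trail idea can be illustrated with a small append-only log that records, for each processor call, what was sent and what came back. This is a generic sketch of the pattern only; it is not Kill Bill's data model or API, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Attempt:
    operation: str            # e.g. "charge", "transfer", "refund"
    request: Dict[str, Any]   # params sent to the processor
    response: Dict[str, Any]  # what the processor returned
    recorded_at: datetime


@dataclass
class PaymentTrail:
    payment_id: str
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, operation: str, request: Dict[str, Any],
               response: Dict[str, Any]) -> None:
        """Append one processor interaction; entries are never edited."""
        self.attempts.append(
            Attempt(operation, dict(request), dict(response),
                    datetime.now(timezone.utc))
        )

    def lifecycle(self) -> List[str]:
        """Summarize the payment's history in the order it happened."""
        return [
            f"{a.operation}: {a.response.get('status', 'unknown')}"
            for a in self.attempts
        ]
```

A trail like this answers "where's my money?" by replaying what was asked of the processor and what it said, rather than by inferring history from the current state of one wide row.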
And then of course, you can only refactor to certain points. Sometimes the code in question is just resistant. It fights back. It's too brittle, it's not meeting the business needs. That's where we've arrived at TaskRabbit. We have done a lot of refactoring and we've improved areas; we've added numerous areas of logging and metrics, so we can know when things are going wrong and we can provide a reliable service. But still, at the end of all of this, it's time for us to conjure rewrites and to create something that we know how it should behave, we know the outcomes that we want from it. That way, we can provide our merchants and our customers with the best possible, most reliable and trustworthy platform to do business on.
I think it'd only be fitting, since this is all derived from John Millikin's Haunted Forest, to end with a quote that he uses that I think really sums up part of what I want to get at. The true moral of the story is that a rewrite is a good idea, if the new version will be better. It's simple. Yes, there's going to be caveats to it. But I think in most cases, when a payment system isn't acting as it should, when there is the question posed to your team of "Where's my money?", something has gone drastically wrong and that needs to be addressed. It either needs to be refactored and fixed up, if possible, or it needs to just be thrown out and rebuilt. It's our job to do this well and to provide the best experience for our customers. Thank you.