Sveriges mest populära poddar

The Business of Open Source

Exploring Ant Financial’s Cloud-Native Journey with Haojie Hang

34 min • 10 juni 2020

Some highlights of the show include

  • The challenges of operating digital commerce at scale, including the need for resource pooling and resiliency — and how this caused Ant Financial to re-think their infrastructure. 
  • Ant Financial’s former approach to scaling, which was mostly manual, and highly resource-intensive. 
  • How Kubernetes is expediting cloud development for Ant Financial.
  • Haojie’s thoughts on the global engineering skills gap, and China’s growing cloud computing market including driving factors and barriers. 
  • Why Ant Financial’s migration has largely been a success — and why achieving operational security is now a top priority for the company.  
  • How Ant Financial is managing disconnect between its engineers and business leaders. 
  • The company’s ongoing mission to migrate its systems and applications away from legacy architectures.


Links


Transcript


Announcer: Welcome to The Business of Cloud Native podcast where we explore how end users talk and think about the transition to Kubernetes and cloud-native architectures.



Emily: So, I always start the same way. Can you introduce yourself?



Haojie: Hey, my name is Haojie Hang. I'm a product manager in the CTO office at Ant Financial. I work on the product and strategy side for, basically, the CTO and the other executive leaders, as well as leading a small product teams within the org to look at the frontier technology in the cloud and other infrastructure businesses.



Emily: And can you tell me a little bit more about what Ant Financial does? And then, also, what do you do on a day to day basis? What do you do when you get into the office?



Haojie: Yeah, I'll do a quick introduction about the Ant Financial business. It's not just one business or two business, it's a group of businesses that we innovate and we do, mostly in China, but we're also expanding very rapidly all over the world. So, Ant Financial is basically a group of businesses including credit for both consumers and the enterprise, as well as loan businesses, both consumer and enterprise businesses. We say that the parent organization is basically, we call it Alipay, it’s the earliest business we do since 2004 when the business was basically born from Taobao, which is our parent company. So, in short, the Ant Financial Business has a lot of presence in the business of payments business, remittance, credit card, loans, securities, and many other businesses like intelligent technology, blockchain, pretty much everything you can imagine in the FinTech and financial services, we’re in there.



Emily: Tell me a little bit more about the cloud-native journey for Ant Financial. When did it start? Why did it start? What was some of the motivations behind moving to cloud-native?



Haojie: Yeah, it's actually quite interesting. I joined Ant Financial in 2008, but actually, the entire company started to look at cloud-native technology quite early, in 2012. So, back then, people were just looking at these technologies around the world, mostly from the US, they look at this open-source community, look at what other companies are doing, how to use the cloud-native technology to help with their business in the peak time, so during event. There’s online promotion event we're doing every year, called Double 11—Shuāng shíyī in Chinese. Every year, so we have a large amount of promotional events happening online, trying to help merchants and the customer is trying to sell and buy stuff in our Tmall and Taobao platform in very, very discounted price. So, for that promotion event online, we have to think about the resilience, the resource pooling, oftentimes the visits has to increase multiple times, sometimes over 100 times the increase compared to the normal time. So in that case, we have to think about how we can be very resilient and efficient infrastructure to support that business needs. So, this is a very large topic. And then, back then, there was a lot of focus and study in our cloud computing department. So, we started looking at this technology called Mesos in 2012. And then, we do a lot of experiments around this technology, but from the business perspective, it's still hard to justify the benefits of moving to Mesos completely. So, we have multiple teams doing a lot of research in Mesos, in Kubernetes, sometimes in our own technology stack, but there's not enough proof or enough confidence for us to move completely over to that technology, until the emergence of Docker container, this Docker technology. Then we started to look at our container infrastructure, really do the investigation around this technology, and understand why this is taking over so quickly over the world, from the business perspective, and from the technology perspective. If you look at the community of Docker, the thing does not really happen until 2015. But we are already in the game for about a year or two. So, we're actually quite happy about our original strategy, but it's just in terms of the research. We're actually a little bit behind in terms of moving to this cloud-native architecture. But as you can see, that I had an interview with CNCF. So, we are very happy about the results that we have right now. Pretty much the entire architecture we run within Ant Financial is, basically, on Kubernetes ecosystem. It's not just using the open-source version of it. We're doing a lot of customization around this open-source framework. Yeah, I can talk more about the details.



Emily: Yeah. Well, let's back up just a little bit. I’m curious what you were doing to manage this scaling before? And how did that change? And what about the whole process changed? Like, how stressful is it now, compared to before?



Haojie: The process was very manual, I would say. We have extremely large team of engineers, and DevOps, security teams. And oftentimes their responsibility are overlap. So, some engineers are doing security work, some engineers are doing basically operational work. I would say, some people really hated it because they have to be on the computer, look at monitor 24/7, making sure transactions succeeded. When the peak time happens, there's nothing wrong with it. Sometimes they have to keep their phone open 24/7, basically to make sure this thing will not fail, right? And then, just many parts of work has to—so in the previous way, the way we do this operation is quite manual. We don't have a mature system or methodology telling us what we should do first, we should do second, and what's what would you do after this. So, basically the collaboration chain was not there. Therefore, when issue happens, our operation team has to respond very quickly. But then, how can we quickly identify the problem, and make it a problem? That's a problem, right? So, we have to make sure every time we respond, we respond in a very effective manner. That's the problem. In the previous process when something unexpected happen, who had to engage with the entire team from product, engineering, operation, security, everybody has to get up and look at the problem together, which was quite inefficient. So, after we moved to this cloud-native architecture—it's not the standard cloud architect, it's, kind of—we have a lot of innovation on...

00:00 -00:00