Sweden's 100 most popular podcasts

Serverless Chats

Serverless Chats is a podcast that geeks out on everything serverless. Join Jeremy Daly and Rebecca Marshburn as they chat with a special guest each week.

Subscribe

iTunes / Overcast / RSS

Website

serverlesschats.com

Episodes

Episode #110: Mapping the Inevitability of Serverless with Simon Wardley

Simon Wardley is a researcher for the Leading Edge Forum focused on the intersection of IT strategy and new technologies. Simon is a seasoned executive who has spent the last 15 years defining future IT strategies for companies in the FMCG, retail, and IT industries, from Canon's early leadership in the cloud-computing space in 2005 to Ubuntu's recent dominance as the top cloud operating system. As a geneticist with a love of mathematics and a fascination for economics, Simon has always found himself dealing with complex systems, whether in behavioral patterns, the environmental risks of chemical pollution, developing novel computer systems, or managing companies. He is a passionate advocate and researcher in the fields of open source, commoditization, innovation, organizational structure, and cybernetics.

Simon's most recent published research, "Clash of the Titans: Can China Dethrone Silicon Valley?", assesses the high-tech challenge from China and what this means for the future of global technology industry competition. His previous research covers topics including the nature of technological and business change over the next 20 years, value chain mapping, strategies for an increasingly open economy, Web 2.0, and a lifecycle approach to cloud computing. Simon is a regular presenter at conferences worldwide and was voted one of the UK's top 50 most influential people in IT in Computer Weekly's 2011 and 2012 polls.

Twitter: https://twitter.com/swardley
Medium: https://swardley.medium.com
Blog: https://blog.gardeviance.org
Wardley Maps (free online book): https://medium.com/wardleymaps

Simon's slides discussed during the podcast

2021-09-13
Link to episode

Episode #109: Serverless for Newbies with Emily Shea

Emily Shea is a Sr. Serverless GTM Specialist at AWS. Emily has been at Amazon for 5 years and currently works with customers adopting serverless in the UK & Ireland. In her free time, Emily has learned to code and build her own serverless applications. Emily's current personal project is a daily Chinese vocabulary app with over 100 subscribers.
 
Twitter: https://twitter.com/em__shea
Personal blog: https://emshea.com/
Chinese vocabulary app: https://haohaotiantian.com/
re:Invent talk: Getting started building your first serverless web application

2021-09-06
Link to episode

Episode #108: Mulling over Multi-cloud with Corey Quinn

Corey Quinn is the Cloud Economist at The Duckbill Group. Corey's unique brand of snark combines with a deep understanding of AWS's offerings, unlocking a level of insight that's both penetrating and hilarious. He lives in San Francisco with his spouse and daughter.

Twitter: https://twitter.com/QuinnyPig
LinkedIn: https://www.linkedin.com/in/coquinn/
Last Week in AWS: https://www.lastweekinaws.com/
The Morning Brief and Screaming in the Cloud: https://www.lastweekinaws.com/podcast/
Duckbill Group: https://www.duckbillgroup.com/
2021-08-30
Link to episode

Episode #107: Serverless Infrastructure as Code with Ben Kehoe

About Ben Kehoe
Ben Kehoe is a Cloud Robotics Research Scientist at iRobot and an AWS Serverless Hero. As a serverless practitioner, Ben focuses on enabling rapid, secure-by-design development of business value by using managed services and ephemeral compute (like FaaS). Ben also seeks to amplify voices from dev, ops, and security to help the community shape the evolution of serverless and event-driven designs.

Twitter: @ben11kehoe
Medium: ben11kehoe
GitHub: benkehoe
LinkedIn: ben11kehoe
iRobot: www.irobot.com

Watch this episode on YouTube: https://youtu.be/B0QChfAGvB0

This episode is sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly.

Rebecca: And I'm Rebecca Marshburn.

Jeremy: And this is Serverless Chats. And this is a momentous occasion on Serverless Chats because we are welcoming in Rebecca Marshburn as an official co-host of Serverless Chats.

Rebecca: I'm pretty excited to be here. Thanks so much, Jeremy.

Jeremy: So for those of you that have been listening for hopefully a long time, and we've done over 100 episodes. And I don't know, Rebecca, do I look tired? I feel tired.

Rebecca: I've never seen you look tired.

Jeremy: Okay. Well, I feel tired because we've done a lot of these episodes and we've published a new episode every single week for the last 107 weeks, I think at this point. And so what we're going to do is with you coming on as a new co-host, we're going to take a break over the summer. We're going to revamp. We're going to do some work. We're going to put together some great content. And then we're going to come back on, I think it's August 30th with a new episode and a whole new show. Again, it's going to be about serverless, but what we're thinking is ... And, Rebecca, I would love to hear your thoughts on this as I come at things from a very technical angle, because I'm an overly technical person, but there's so much more to serverless. There's so many other sides to it that I think that bringing in more perspectives and really being able to interview these guests and have a different perspective I think is going to be really helpful. I don't know what your thoughts are on that.

Rebecca: Yeah. I love the tech side of things. I am not as deep in the technicalities of tech and I come at it I think from a way of loving the stories behind how people got there and perhaps who they worked with to get there, the ideas of collaboration and community because nothing happens in a vacuum and there's so much stuff happening and sharing knowledge and education and uplifting each other. And so I'm super excited to be here and super excited that one of the first episodes I get to work on with you is with Ben Kehoe because he's all about both the technicalities of tech, and also it's actually on his Twitter, a new compassionate tech values around humility, and inclusion, and cooperation, and learning, and being a mentor. So couldn't have a better guest to join you in the Serverless Chats community and being here for this.

Jeremy: I totally agree. And I am looking forward to this. I'm excited. I do want the listeners to know we are testing in production, right? So we haven't run any unit tests, no integration tests. I mean, this is straight test in production.

Rebecca: That's the best practice, right? Total best practice to test in production.

Jeremy: Best practice. Right. Exactly.

Rebecca: Straight to production, always test in production.

Jeremy: Push code to the cloud. Here we go.

Rebecca: Right away.

Jeremy: Right. So if it's a little bit choppy, we'd love your feedback though. The listeners can be our observability tool and give us some feedback and we can ... And hopefully continue to make the show better. So speaking of Ben Kehoe, for those of you who don't know Ben Kehoe, I'm going to let him introduce himself, but I have always been a big fan of his. He was very, very early in the serverless space. I read all his blogs very early on. He was an early AWS Serverless Hero. So joining us today is Ben Kehoe. He is a cloud robotics research scientist at iRobot, as I said, an AWS Serverless Hero. Ben, welcome to the show.

Ben: Thanks for having me. And I'm excited to be a guinea pig for this new exciting format.

Rebecca: So many observability tools watching you be a guinea pig too. There's lots of layers to this.

Jeremy: Amazing. All right. So Ben, why don't you tell the listeners for those that don't know you a little bit about yourself and what you do with serverless?

Ben: Yeah. So I mean, as with all software, software is people, right? It's like Soylent Green. And so I'm really excited for this format being about the greater things that technology really involves in how we create it and set it up. And serverless is about removing the things that don't matter so that you can focus on the things that do matter.

Jeremy: Right.

Ben: So I've been interested in that since I learned about it. And at the time saw that I could build things without running servers, without needing to deal with the scaling of stuff. I've been working on that at iRobot for over five years now. As you said, early on in serverless, at the first Serverlessconf organized by A Cloud Guru, now Pluralsight.

Jeremy: Right.

Ben: And yeah. And it's been really exciting to see it grow into the large-scale community that it is today and all of the ways in which community are built like this podcast.

Jeremy: Right. Yeah. I love everything that you've done. I love the analogies you've used. I mean, you've always gone down this road of how do you explain serverless in a way to show really the adoption of it and how people can take that on. Serverless is a ladder. Some of these other things that you would ... I guess the analogies you use were always great and always helped me. And of course, I don't think we've ever really come to a good definition of serverless, but we're not talking about that today. But ...

Ben: There isn't one.

Jeremy: There isn't one, which is also a really good point. So yeah. So welcome to the show. And again, like I said, testing in production here. So, Rebecca, jump in when you have questions and we'll beat up Ben from both sides on this, but, really ...

Rebecca: We're going to have Ben from both sides.

Jeremy: There you go. We'll embrace him from both sides. There you go.

Rebecca: Yeah. Yeah.

Jeremy: So one of the things though that, Ben, you have also been very outspoken on which I absolutely love, because I'm in very much closely aligned on this topic here. But is about infrastructure as code. And so let's start just quickly. I mean, I think a lot of people know or I think people working in the cloud know what infrastructure as code is, but I also think there's a lot of people who don't. So let's just take a quick second, explain what infrastructure as code is and what we mean by that.

Ben: Sure. To my mind, infrastructure as code is about having a definition of the state of your infrastructure that you want to see in the cloud. So rather than using operations directly to modify that state, you have a unified definition of some kind. I actually think infrastructure is now the wrong word with serverless. It used to be with servers, you could manage your fleet of servers separate from the software that you were deploying onto the servers. And so infrastructure being the structure below made sense. But now as your code is intimately entwined in the rest of your resources, I tend to think of resource graph definitions rather than infrastructure as code. It's a less convenient term, but I think it's worth understanding the distinction or the difference in perspective.
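Ben's distinction can be made concrete with a toy sketch (Python, with invented resource names): an infrastructure-as-code tool holds a declared desired state and computes the operations needed to reach it, rather than the operator issuing those operations by hand.

```python
# Toy model of declarative IaC: declare the end state, let the tool
# derive the plan. Resource names are invented for illustration.

def plan(desired: dict, current: dict) -> dict:
    """Diff a desired resource map against the currently deployed one."""
    return {
        "create": sorted(desired.keys() - current.keys()),
        "delete": sorted(current.keys() - desired.keys()),
        "update": sorted(k for k in desired.keys() & current.keys()
                         if desired[k] != current[k]),
    }

desired = {"ApiFunction": {"runtime": "python3.9"}, "DataBucket": {"versioned": True}}
current = {"ApiFunction": {"runtime": "python3.8"}, "OldQueue": {"fifo": False}}

print(plan(desired, current))
# {'create': ['DataBucket'], 'delete': ['OldQueue'], 'update': ['ApiFunction']}
```

The operator never writes "create this, delete that"; the tool derives it, which is the difference between declaring state and issuing operations.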

Jeremy: Yeah. No, and I totally get that. I mean, I remember even early days of cloud when we were using the Chefs and the Puppets and things like that, that we were just deploying the actual infrastructure itself. And sometimes you deploy software as part of that, but it was supporting software. It was the stuff that ran in the runtime and some of those and some configurations, but yeah, but the application code that was a whole separate process, and now with serverless, it seems like you're deploying all those things at the same time.

Ben: Yeah. There's no way to pick it apart.

Jeremy: Right. Right.

Rebecca: Ben, there's something that I've always really admired about you and that is how strongly you hold your opinions. You're fervent about them, but it's also because they're based on this thorough nature of investigation and debate and challenging different people and yourself to think about things in different ways. And I know that the rest of this episode is going to be full with a lot of opinions. And so before we even get there, I'm curious if you can share a little bit about how you end up arriving at these, right? And holding them so steady.

Ben: It's a good question. Well, I hope that I'm not inflexible in these strong opinions that I hold. I mean, it's one of those strong opinions loosely held kind of things that new information can change how you think about things. But I do try and do as much thinking as possible so that there's less new information that I have to encounter to change an opinion.

Rebecca: Yeah. Yeah.

Ben: Yeah. I think I tend to try and think about how people ... But again, because it's always people. How people interact with the technology, how people behave, how organizations behave, and then how technology fits into that. Because sometimes we talk about technology in a vacuum and it's really not. Technology that works for one context doesn't work for another. I mean, a lot of my strong opinions are that there is no one right answer kind of a thing, or here's a framework for understanding how to think about this stuff. And then how that fits into a given person is just finding where they are in that more general space. Does that make sense? So it's less about finding out here's the one way to do things and more about finding what are the different options, how do you think about the different options that are out there.

Rebecca: Yeah, totally makes sense. And I do want to compliment you. I do feel like you are very good at inviting new information in if people have it and then you're like, "Aha, I've already thought of that."

Ben: I hope so. Yeah. I was going to say, there's always a balance between trying to think ahead so that when you discover something you're like, "Oh, that fits into what I thought." And the danger of that being that you're twisting the information to fit into your preexisting structures. I hope that I find a good balance there, but I don't have a principle way of determining that balance or knowing where you are in that it's good versus it's dangerous kind of spectrum.

Jeremy: Right. So one of the opinions that you hold that I tend to agree with, I have some thoughts about some of the benefits, but I also really agree with the other piece of it. And this really has to do with the CDK and this idea of using CloudFormation or any sort of DSL, maybe Terraform, things like that, something that is more domain-specific, right? Or I guess declarative, right? As opposed to something that is imperative like the CDK. So just to get everybody on the same page here, what are the top reasons why you believe, or you think, that the DSL approach is better than the imperative approach?

Ben: Yeah. So I think we get caught up in the imperative versus declarative part of it. I do think that declarative has benefits that can be there, but the way that I think about it is with the CDK and infrastructure as code in general, I'm like mildly against imperative definitions of resources. And we can get into that part, but that's not my smallest objection to the CDK. I'm moderately against not being able to enforce deterministic builds. And the CDK program can do anything. Can use a random number generator and go out to the internet to go ask a question, right? It can do anything in that program and that means that you have no guarantees that what's coming out of it you're going to be able to repeat.

So even if you check the source code in, you may not be able to go back to the same infrastructure that you had before. And you can if you're disciplined about it, but I like tools that help give you guardrails so that you don't have to be as disciplined. So that's my moderately against. My strongly against piece is I'm strongly against developer intent remaining client side. And this is not an inherent flaw in the CDK, it's a choice that the CDK team has made to turn organizational dysfunction in AWS into ownership for their customers. And I don't think that's a good approach to take, but that's also fixable.
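A minimal illustration of the deterministic-builds concern (a Python sketch, not how the CDK actually synthesizes): if you fingerprint the synthesized template, a synth step that reaches a clock, RNG, or the network produces a different artifact on every run, so checking in the source no longer pins the infrastructure.

```python
import hashlib
import json
import random

def template_fingerprint(template: dict) -> str:
    """Hash a canonical serialization so two builds can be compared exactly."""
    canonical = json.dumps(template, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def synth_deterministic() -> dict:
    # Output depends only on the source: same input, same artifact.
    return {"Resources": {"Fn": {"Type": "AWS::Lambda::Function"}}}

def synth_nondeterministic() -> dict:
    # A random logical ID stands in for anything a program can reach at
    # synth time: clocks, RNGs, network calls. The build is unrepeatable.
    return {"Resources": {f"Fn{random.getrandbits(64)}": {"Type": "AWS::Lambda::Function"}}}

# Two runs of the deterministic synth agree; two runs of the other don't.
assert template_fingerprint(synth_deterministic()) == template_fingerprint(synth_deterministic())
assert template_fingerprint(synth_nondeterministic()) != template_fingerprint(synth_nondeterministic())
```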

So I think if we want to start with the imperative versus declarative thing, right? When I think about the developers expressing an intent, I want that intent to flow entirely into the cloud so that developers can understand what's deployed in the cloud in terms of the things that they've written. The CDK takes this approach of flattening it down, flattening the richness of the program the developer has written into ... They think of it as assembly language. I think that is a misinterpretation of what's happening. The assembly language in the process is the imperative plan generated inside the CloudFormation engine that says, "Here's how I'm going to take this definition and turn it into an actual change in the cloud."

Jeremy: Right.

Ben: They're just translating between two definition formats in the CDK's case. But it's a flattening process, it's a lossy process. So then when the developer goes to the Console or the API and has to ask, "What's deployed here? What's going wrong? What do I need to fix?" None of it is framed in terms of the things that they wrote in their original language.

Jeremy: Right.

Ben: And I think that's the biggest problem, right? So drift detection is an important thing, right? What happened when someone went in through the Console? Went and tweaked some stuff to fix something, and now it's different from the definition that's in your source repository. And in CloudFormation, it can tell you that. But what I would want if I was running CDK is that it should produce another CDK program that represents the current state of the cloud with a meaningful file-level diff with my original program.
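To make the drift idea concrete, a toy Python sketch (property names invented) of what a drift report contains: the declared definition compared field by field against what is actually live after someone tweaked things in the Console.

```python
# Toy drift detection: compare declared properties against the live
# resource and report every field that no longer matches.

def detect_drift(declared: dict, live: dict) -> dict:
    """Report properties that differ between the template and the live resource."""
    drift = {}
    for key, want in declared.items():
        have = live.get(key)
        if have != want:
            drift[key] = {"declared": want, "live": have}
    return drift

declared = {"MemorySize": 256, "Timeout": 30}
live = {"MemorySize": 512, "Timeout": 30}   # memory was changed in the Console

print(detect_drift(declared, live))
# {'MemorySize': {'declared': 256, 'live': 512}}
```

Ben's point is that a CDK user would want this report expressed back in terms of their constructs, not in terms of the flattened template the tool generated.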

Jeremy: Right. I'm just thinking this through, if I deploy something to CDK and I've got all these loops and they're generating functions and they're using some naming and all this kind of stuff, whatever, now it produces this output. And again, my naming of my functions might be some function that gets called to generate the names of the function. And so now I've got all of these functions named and I have to go in. There's no one-to-one map like you said, and I can imagine somebody who's not familiar with CloudFormation which is ultimately what CDK synthesizes and produces, if you're not familiar with what that output is and how that maps back to the constructs that you created, I can see that as being really difficult, especially for younger developers or developers who are just getting started in that.

Ben: And the CDK really takes the attitude that it's going to hide those things from those developers rather than help them learn it. And so when they do have to dive into that, the CDK refers to it as an escape hatch.

Jeremy: Yeah.

Ben: And I think of escape hatches on submarines, where you go from being warm and dry and having air to breathe to being hundreds of feet below the sea, right? It's not the sort of thing you want to go through. Whereas some tools like Amplify talk about graduation. In Amplify they aim to help you understand the things that Amplify is doing for you, such that when you grow beyond what Amplify can provide you, you have the tools to do that, to take the thing that you built and then say, "Okay, I know enough now that I understand this and can add onto it in ways that Amplify can't help with."

Jeremy: Right.

Ben: Now, how successful they are in doing that is a separate question I think, but the attitude is there to say, "We're looking to help developers understand these things." Now the CDK could also if the CDK was a managed service, right? Would not need developers to understand those things. If you could take your program directly to the cloud and say, "Here's my program, go make this real." And when it made it real, you could interact with the cloud in an understanding where you could list your deployed constructs, right? That you can understand the program that you wrote when you're looking at the resources that are deployed all together in the cloud everywhere. That would be a thing where you don't need to learn CloudFormation.

Jeremy: Right.

Ben: Right? That's where you then end up in the imperative versus declarative part where, okay, there's some reasons that I think declarative is better. But the major thing is that disconnect that's currently built into the way that CDK works. And the reason that they're doing that is because CloudFormation is not moving fast enough, which is not always on the CloudFormation team. It's often on the service teams that aren't building the resources fast enough. And that's AWS's problem, AWS as an entire company, as an organization. And this one team is saying, "Well, we can fix that by doing all this client side."

What that means is that the customers are then responsible for all the things that are happening on the client side. The reason that they can go fast is because the CDK team doesn't have ownership of it, which just means the ownership is being pushed on customers, right? The CDK deploys Lambda functions into your account that they don't tell you about that you're now responsible for. Right? Both the security and operations of. If there are security updates that the CDK team has to push out, you have to take action to update those things, right? That's ownership that's being pushed onto the customer to fix a lack of ACM certificate management, right?

Jeremy: Right. Right.

Ben: That is ACM not building the thing that's needed. And so AWS says, "Okay, great. We'll just make that the customer's problem."

Jeremy: Right.

Ben: And I don't agree with that approach.

Rebecca: So I'm sure as an AWS Hero you certainly have pretty good, strong, open communication channels with a lot of different team members across teams. And I certainly know that they're listening to you and are at least hearing you, I should say, and watching you and they know how you feel about this. And so I'm curious how some of those conversations have gone. And some teams as compared to others at AWS are really, really good about opening their roadmap or at least saying, "Hey, we hear this, and here's our path to a solution or a success." And I'm curious if there's any light you can shed on whether or not those conversations have been fruitful in terms of actually being able to get somewhere in terms of customer and AWS terms, right? Customer obsession first.

Ben: Yeah. Well, customer obsession can mean two things, right? Customer obsession can mean giving the customer what they want or it can mean giving the customer what they need, and different AWS teams' approaches fall differently on that scale. The reason that many of those things are not available in CloudFormation is that those teams are ... It could be under-resourced. They could have a large majority of customers that want new features rather than infrastructure as code support. Because as much as we all like infrastructure as code, there are many, many organizations out there that are not there yet. And with the CDK in particular, I'm a relatively lone voice out there saying, "I don't think this ownership that's being pushed onto the customer is a good thing." And there are lots of developers who are eating up CDK saying, "I don't care."

That's not something that's in their worry. And because the CDK has been enormously successful, right? It's fixing these problems that exists. And I don't begrudge them trying to fix those problems. I think it's a question of do those developers who are grabbing onto those things and taking them understand the full total cost of ownership that the CDK is bringing with it. And if they don't understand it, I think AWS has a responsibility to understand it and work with it to help those customers either understand it and deal with it, right? Which is where the CDK takes this approach, "Well, if you do get Ops, it's all fine." And that's somewhat true, but also many developers who can use the CDK do not control their CI/CD process. So there's all sorts of ways in which ... Yeah, so I think every team is trying to do the best that they can, right?

They're all working hard and they all have ... Are pulled in many different directions by customers. And most of them are making, I think, the right choices given their incentives, right? Given what their customers are asking for. I think not all of them balance where customers ... meeting customers where they are versus leading them where they should, like where they need to go as well as I would like. But I think ... I had a conclusion to that. Oh, but I think that's always a debate as to where that balance is. And then the other thing when I talk about the CDK, that my ideal audience there is less AWS itself and more AWS customers ...

Rebecca: Sure.

Ben: ... to understand what they're getting into and therefore to demand better of AWS. Which is in general, I think, the approach that I take with AWS, is complaining about AWS in public, because I do have the ability to go to teams and say, "Hey, I want this thing," right? There are plenty of teams where I could just email them and say, "Hey, this feature could be nice", but I put it on Twitter because other people can see that and say, "Oh, that's something that I want or I don't think that's helpful," right? "I don't care about that," or, "I think it's the wrong thing to ask for," right? All of those things are better when it's not just me saying I think this is a good thing for AWS, but it being a conversation among the community differently.

Rebecca: Yeah. I think in the spirit too of trying to publicize types of what might be best next for customers, you said total cost of ownership. Even though it might seem silly to ask this, I think oftentimes we say the words total cost of ownership, but there's actually many dimensions to total cost of ownership or TCO, right? And so I think it would be great if you could enumerate what you think of as total cost of ownership, because there might be dimensions along that matrices, matrix, that people haven't considered when they're actually thinking about total cost of ownership. They're like, "Yeah, yeah, I got it. Some Ops and some security stuff I have to do and some patches," but they might only be thinking of five dimensions when you're like, "Actually the framework is probably 10 to 12 to 14." And so if you could outline that a bit, what you mean when you think of a holistic total cost of ownership, I think that could be super helpful.

Ben: I'm bad at enumeration. So I would miss out on dimensions that are obvious if I was attempting to do that. But I think a way that I can, I think effectively answer that question is to talk about some of the ways in which we misunderstand TCO. So I think it's important when working in an organization to think about the organization as a whole, not just your perspective and that your team's perspective in it. And so when you're working for the lowest TCO it's not what's the lowest cost of ownership for my team if that's pushing a larger burden onto another team. Now if it's reducing the burden on your team and only increasing the burden on another team a little bit, that can be a lower total cost of ownership overall. But it's also something that then feeds into things like political capital, right?

Is that increased ownership that you're handing to that team something that they're going to be happy with, something that's not going to cause other problems down the line, right? Those are the sorts of things that fit into that calculus because it's not just about what ... Moving away from that topic for a second. I think about when we talk about how does this increase our velocity, right? There's the piece of, "Okay, well, if I can deploy to production faster, right? My feedback loop is faster and I can move faster." Right? But the other part of that equation is how many different threads can you be operating on and how long are those threads in time? So when you're trying to ship a feature, if you can ship it and then never look at it again, that means you have increased bandwidth in the future to take on other features to develop other new features.

And so even if you think about, "It's going to take me longer to finish this particular feature," but then there's no maintenance for that feature, that can be a lower cost of ownership in time than, "I can ship it 50% faster, but then I'm going to periodically have to revisit it and that's going to disrupt my ability to ship other things," right? So this is where I had conversations recently about increasing use of Step Functions, right? And being able to replace Lambda functions with Step Functions express workflows because you never have to go back to those Lambdas and update dependencies in them because Dependabot has told you that you need to or a version of Python is getting deprecated, right? All of those things, just if you have your Amazon States Language however it's been defined, right?

Once it's in there, you never have to touch it again if nothing else changes and that means, okay, great, that piece is now out of your work stream forever unless it needs to change. And that means that you have more bandwidth for future things, which serverless is about in general, right? Of say, "Okay, I don't have to deal with this scaling problems here. So those scaling things. Once I have an auto-scaling group, I don't have to go back and tweak it later." And so the same thing happens at the feature level if you build it in ways that allow you to do that. And so I think that's one of the places where when we focus on, okay, how fast is this getting me into production, it's okay, but how often do you have to revisit it ...

Jeremy: Right. And so ... So you mentioned a couple of things in there, and not only in that question, but in the previous questions as you were talking about the CDK in general, and I am 100% behind you on this idea of deterministic builds because I want to know exactly what's being deployed. I want to be able to audit that and map that back. And you can audit, I mean, you could run CDK synth and then audit the CloudFormation and test against certain things. But if you are changing stuff, right? Then you have to understand not only the CDK but also the CloudFormation that it actually generates. But in terms of solving problems, some of the things that the CDK does really, really well, and this is something where I've always had this issue with just trying to use raw CloudFormation or Serverless Framework or SAM or any of these things is the fact that there's a lot of boilerplate that you often have to do.

There's ways that companies want to do something specifically. I basically probably always need 1,400 lines of CloudFormation. And for every project I do, it's probably close to the same, and then add a little bit more to actually make it adaptive for my product. And so one thing that I love about the CDK is constructs. And I love this idea of being able to package these best practices for your company or these compliance requirements, excuse me, compliance requirements for your company, whatever it is, be able to package these and just hand them to developers. And so I'm just curious on your thoughts on that because that seems like a really good move in the right direction, but without the deterministic builds, without some of these other problems that you talked about, is there another solution to that that would be more declarative?

Ben: Yeah. In theory, if the CDK was able to produce an artifact that represented all of the non-deterministic dependencies that it had, right? That allowed you to then store that artifact, so you could come back and put that into the program and say, "I'm going to get out the same thing." But because the CDK doesn't control upstream of it, the code that the developers are writing, there isn't a way to do that. Right? So on the abstraction front, the constructs are super useful, right? CloudFormation now has modules which allow you to say, "Here's a template and I'm going to represent this as a CloudFormation type itself," right? So instead of saying that I need X different things, I'm going to say, "I packaged that all up. Here it is as a type."
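For readers who haven't seen CloudFormation modules, usage looks roughly like this: a registered module (the name below is hypothetical) is consumed like any other resource type and expanded into its underlying resources at deploy time.

```yaml
# Hypothetical module registered as MyOrg::Storage::SecureBucket::MODULE.
# Consumers declare it like a normal resource; CloudFormation expands it
# into the module's underlying template when the stack is deployed.
Resources:
  TeamBucket:
    Type: MyOrg::Storage::SecureBucket::MODULE
    Properties:
      BucketName: team-artifacts
```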

Now, currently, modules can only be plain CloudFormation templates, and there's a lot of constraints on what you can express inside a CloudFormation template. And I think the answer for me is ... What I want to see is more richness in the CloudFormation language, right? One of the things that people do in the CDK that's really helpful is say, "I need a copy of this in every AZ."

Jeremy: Right.

Ben: Right? There's so much boilerplate in server-based things. And CloudFormation can't do that, right? But imagine that it had a map function that allowed you to say, "For every AZ, stamp me out a copy of this little bit." And then the CDK constructs would be able to translate to that. Instead of doing all this generation only down to the L1 piece, being able to say, "I'm going to translate this into richer CloudFormation templates, so that the CloudFormation template is as advanced as possible."

Right? Then it could do things like say, "Oh, I know we need to do this in every AZ. I'm going to use this map function in the CloudFormation template rather than just stamping it out." Right? And so I think that's possible. Now, modules should also be able to be defined as CDK programs, right? You should be able to register a construct as a CloudFormation type.
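
A sketch of the client-side "stamping" Ben describes: today a tool like the CDK flattens one logical resource into N copies, one per AZ, before the template ever reaches CloudFormation. All resource names and the property layout here are illustrative.

```python
# Expand one prototype resource into a copy per availability zone, the way a
# client-side tool would before handing the flattened template to CloudFormation.
import copy

def stamp_per_az(template: dict, logical_id: str, azs: list) -> dict:
    """Expand one prototype resource into a copy per availability zone."""
    out = copy.deepcopy(template)
    proto = out["Resources"].pop(logical_id)
    for az in azs:
        stamped = copy.deepcopy(proto)
        stamped["Properties"]["AvailabilityZone"] = az
        # "Subnet" + "us-east-1a" -> "SubnetUsEast1A"
        out["Resources"][logical_id + az.title().replace("-", "")] = stamped
    return out

template = {
    "Resources": {
        "Subnet": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {"VpcId": {"Ref": "Vpc"}},
        }
    }
}
expanded = stamp_per_az(template, "Subnet", ["us-east-1a", "us-east-1b"])
print(sorted(expanded["Resources"]))  # -> ['SubnetUsEast1A', 'SubnetUsEast1B']
```

A template-level map function would let this expansion live in CloudFormation itself, so the deployed template stays as close to the original intent as possible instead of receiving the flattened output.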

Jeremy: It would be pretty cool.

Ben: There's no reason you shouldn't be able to. Yeah. Because I think the declarative versus imperative thing is, again, not the most important piece. It's shifting right in this case, right? How do you shift what's happening with the developer further into the process of deployment, so that more of their context is present? And so one of the things that the CDK does that's hard to replicate is have non-local effects. And this is both convenient and, I think, often a code smell.

So you can pass a bucket resource from another stack into a piece of code in your CDK program that's creating a different stack and you say, "Oh great, I've got this Lambda function, it needs permissions to that bucket. So add permissions." And it's possible for the CDK programs to either be adding the permissions onto the IAM role of that function, or non-locally adding to that bucket's resource policy, which is weird, right? That you can be creating a stack and the thing that you do to that stack or resource or whatever is not happening there, it's happening elsewhere. I don't think that's a great approach, but it's certainly convenient to be able to do it in a lot of situations.
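
A toy model of the non-local effect Ben describes, using plain dictionaries to stand in for stacks; the structure and names are illustrative, not the CDK's actual internals.

```python
# Granting a function access to a bucket from another stack can mutate either
# the function's IAM role (a local effect) or the bucket's resource policy
# (a non-local effect in a different stack). Toy dictionaries, not real CDK.

def grant_read(bucket: dict, role: dict, cross_account: bool) -> None:
    if not cross_account:
        # Local: the statement lands on the role, in the caller's own stack.
        role["Policies"].append(
            {"Action": "s3:GetObject", "Resource": bucket["Arn"]}
        )
    else:
        # Non-local: the *bucket's* stack is modified from over here.
        bucket["Policy"].append(
            {"Action": "s3:GetObject", "Principal": role["Arn"]}
        )

bucket_stack = {"Arn": "arn:aws:s3:::my-bucket", "Policy": []}
fn_role = {"Arn": "arn:aws:iam::123456789012:role/fn", "Policies": []}

grant_read(bucket_stack, fn_role, cross_account=True)
print(len(bucket_stack["Policy"]), len(fn_role["Policies"]))  # -> 1 0
```

The stack being "created" is untouched while a sibling stack changes underneath it, which is exactly what makes the effect hard to audit.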

Now, that's not representable within a module. A module is a contained piece of functionality that can't touch anything else. So think of SAM, where you can add events onto a function: you create the API events on different functions, and then SAM aggregates them and creates an API Gateway for you. Right? If AWS::Serverless::Function was a module, it couldn't do that, because you'd have these in different places and you couldn't aggregate something between all of them and put it in the top-level thing, right?

This is what CloudFormation macros enable, but they don't have a... There's no proper interface to them, right? They don't define, "This is what I'm doing. This is the kind of resources I can create." There's none of that that would help you understand them. So they're infinitely flexible, but then also maybe less principled for that reason. So I think there are ways to evolve, but it's investment in the CloudFormation language that allows us to shift that burden from being a flattening inside client-side code from the developer and shifting it to be able to be represented in the cloud.

Jeremy: Right. Yeah. And I think from that standpoint too, if we go back to the solving people's problems standpoint, everything you explained there is loaded with nuances, loaded with gotchas, right? Like, "Oh, you can't do this, you can't do that." So that's just why I think the CDK is so popular, because you can do so much with it so quickly, and it's very, very fast. And I think that trade-off, people are just willing to make it.

Ben: Yes. And where they're willing to make it, do they fully understand the consequences of it? And does AWS communicate those consequences well? But then you get into that question of: okay, you're a developer who's brand new to AWS and you've been tasked with standing up some Kubernetes cluster, and you're like, "Great, I can use the CDK to do this." You're also tasked with the operations, and something is malfunctioning. You go in through the Console, and maybe all the things that are out there are new to you because they're hidden inside L3 constructs, right?

You're two levels down from where you were defining what you want, and then you find out what's wrong and you have no idea how to turn that into a change in your CDK program. So instead of going back and doing the thing that infrastructure as code is for, which is tweaking your program to go fix the problem, you go and you tweak it in the Console ...

Jeremy: Right. Which you should never do.

Ben: ... and you fix it that way. Right. Well, and that's the thing that I struggle with, with the CDK is how does the CDK help the developer who's in that situation? And I don't think they have a good story around that. Now, I don't know. I haven't talked with enough junior developers who are using the CDK about how often they get into that situation. Right? But I always say client-side code is not a replacement for a managed service because when it's client-side code, you still own the result.

Jeremy: Right.

Ben: If a particular CDK construct was a managed service in AWS, then all of the resources that would be created underneath it would be AWS's problem to make work. And the interface that the developer has is the only level of ownership that they have. Fargate is this. Because you could do all the things that Fargate does with a CDK construct, right? Set up EC2, do all the things, and represent it as something that looks like Fargate in your CDK program. But every time your EC2 fleet is unhealthy, that's your problem. With Fargate, that's AWS's problem. If we didn't have Fargate, that's essentially what the CDK would be trying to do for ECS.

And I think we all recognize that Fargate is very necessary and helpful in that case, right? And I just want that for all the things, right? Whenever I have an abstraction, if it's an abstraction that I understand, then I should have a way of zooming into it while not having to switch languages, right? So that's why you shouldn't dump me out to the CloudFormation to understand what you're doing. You should help me understand the low-level things in the same language. And if it's not something that I need to understand, it should be a managed service. It shouldn't be a bunch of stuff that I still own that I haven't looked at.

Jeremy: Makes sense. Got a question, Rebecca? Because I was waiting for you to jump in.

Rebecca: No, but I was going to make a joke, but then the joke passed, and then I was like, "But should I still make it?" I was going to be like, "Yeah, but does the CDK let you test in production?" But that was a 30-seconds-ago joke, and I was really wrestling with whether or not I should tell it, but I told it anyway. Hopefully, someone gets a laugh.

Ben: Yeah. I mean, there's the thing that Charity Majors says, right? Which is that everybody tests in production. Some people are lucky enough to have a development environment in production. No, sorry. I said that the wrong way. It's everybody has a test environment. Some people are lucky enough that it's not in production.

Rebecca: Yeah. Swap that. Reverse it. Yeah.

Ben: Yeah.

Jeremy: All right. So speaking of talking to developers and getting feedback from them, I actually put a question out on Twitter a couple of weeks ago and got a lot of really interesting reactions. And essentially I asked, "What do you love or hate about infrastructure as code?" And there were a lot of really interesting things here. I don't know, maybe it might be fun to go through a couple of these and get your thoughts on them. So this is probably not a great one to start with, but I thought it was interesting because I think this represents the frustration that a lot of us feel. And it was basically that they love that automation minimizes future work, right? But they hate that it makes life harder over time. And that pretty much every approach to infrastructure as code at present is flawed, right? So really there are no good solutions right now.

Ben: Yeah. CloudFormation is still a pain to learn and deal with. If you're operating in certain IDEs, you can get tab completion.

Jeremy: Right.

Ben: If you go to CDK you get tab completion, which is, I think probably most of the value that developers want out of it and then the abstraction, and then all the other fancy things it does like pipelines, which again, should be a managed service. I do think that person is absolutely right to complain about how difficult it is. That there are many ways that it could be better. One of the things that I think about when I'm using tools is it's not inherently bad for a tool to have some friction to use it.

Jeremy: Right.

Ben: And this goes to another infrastructure as code tool that goes even further than the CDK and says, "You can define your Lambda code inline with your infrastructure definition." So this is Pulumi. And there's some other ... I think Punchcard also lets you do some of this. It basically extracts out the bits of your code where you say, "This is a custom thing that glues together two things I'm defining in here," and it'll make that a Lambda function for you. And for me, that is too little friction for defining a Lambda function.

Because when I define a Lambda function, just going back to that bringing in of ownership, every time I add a Lambda function, that's something that I own, that's something that I have to maintain, that I'm responsible for, that can go wrong. So if I'm thinking about, "Well, I could have API Gateway go direct into DynamoDB, but it'd be nice if I could change some of these fields. And so I'm just going to drop in a little sprinkle of code, three lines of code in between here to do some transformation that I want." That is all of a sudden an entire Lambda function you've brought into your infrastructure.

Jeremy: Right. That's a good point.

Ben: And so I want a little bit of friction to do that, to make me think about it, to make me say, "Oh, yeah, downstream of this decision that I am making, there are consequences that I would not otherwise think about if I'm just trying to accomplish the problem," right? Because I think developers, humans, in general, tend to be a bit shortsighted when you have a goal especially, and you're being pressured to complete that goal and you're like, "Okay, well I can complete it." The consequences for later are always a secondary concern.

And so you can change your incentives in that moment to say, "Okay, well, this is going to guide me to say, 'Ah, I don't really need this Lambda function in here.' Then I'm better off in the long term while accomplishing that goal in the short term." So I do think that there is a place for tools making things difficult. That's not to say that the amount of difficulty that infrastructure as code has today is at all reasonable, but I do think it's worth thinking about, right?

I'd rather take on the pain of creating an ASL definition by hand for an express workflow than the easier thing of writing Lambda code, because I know the long-term consequences of that. Now, if that could be flipped, where the thing that took more ownership was harder and the right thing was just easy to do, right? You'd always do the right thing. But I think it's always worth asking, "Can I do the harder thing now to pay off later?"

Jeremy: And I always call those shortcuts "tomorrow-Jeremy's" problem. That's how I like to look at those.

Ben: Yeah. Yes.

Jeremy: And the funny thing about that too is I remember right when EventBridge came out and there was no CloudFormation support for a long time, which was super frustrating. But Serverless Framework, for example, implemented a custom resource in order to do that. And I remember looking at a clean stack and being like, "Why are there two Lambda functions there that I have no idea?" I'm like, "I didn't publish ..." I honestly thought my account was compromised that somebody had published a Lambda function in there because I'm like, "I didn't do that." And then it took me a while to realize, I'm like, "Oh, this is what this is." But if it is that easy to just create little transform functions here and there, I can imagine there being thousands of those in your account without anybody knowing that they even exist.

Ben: Now, don't get me wrong. I would love to have the ability to drop in little transforms that did not involve Lambda functions. In other words, the thing that VTL does for API Gateway REST APIs, but without it being VTL. Because that's hard, and it's also restricted in what you can do, right? It's not, "Oh, I can drop in arbitrary code in here." But enough to say, "Oh, I want to flip this: these fields should go from a key-value mapping to a list of key-value pairs," right? In a way that addresses how inconsistently tags are defined across services, those kinds of things, right? And you could drop that into any service, but once you've defined it, there's no maintenance for you, right?
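
The tag-shape reshuffle Ben mentions is exactly the kind of tiny, stateless transform in question. A sketch, with an illustrative function name (this is not any service's actual transform API):

```python
# Reshape a {key: value} mapping into the list-of-objects tag format that many
# AWS APIs expect. Small enough that, with a managed transform layer, there
# would be nothing left to own once it's defined.

def tags_to_list(tags: dict) -> list:
    return [{"Key": k, "Value": v} for k, v in sorted(tags.items())]

print(tags_to_list({"env": "prod", "team": "cloud"}))
# -> [{'Key': 'env', 'Value': 'prod'}, {'Key': 'team', 'Value': 'cloud'}]
```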

You're writing JavaScript, but it's not actually a JavaScript engine underneath or something. It's just getting translated into some big multi-tenant fancy thing. And I have a hypothesis that that should be possible. You should be able to do it where you could even do it in the parsing of the JSON, being able to do transforms without ever having to have the whole object in memory. And if we could get that, then, "Oh, sure. Now I have sprinkled all of these little transforms all over the place." Now there's a little bit of overhead in whether the transform is defined correctly or not, right? But once it is, then it just works. And having all those little transforms everywhere is then fine, right? And that incentive to make it harder doesn't need to be there, because it's not bringing ownership with it.

Rebecca: Yeah. It's almost like taking the idea of tomorrow-Jeremy's problem and switching it to say tomorrow-Jeremy's celebration, where tomorrow-Jeremy gets to look back at past-Jeremy and be like, "Nice. Thank you for making that decision, past-Jeremy." Because I think we often do look at it in terms of tomorrow-Jeremy will solve this problem, rather than approaching it by saying, how do I make tomorrow-Jeremy thankful for today-Jeremy? And that's a simple linguistic switch, but a hard switch to actually make decisions based on.

Ben: Yeah. I don't think tomorrow-Ben is ever thankful for today-Ben. I think it's tomorrow-Ben is thankful for yesterday-Ben setting up the incentives correctly so that today-Ben will do the right thing for tomorrow-Ben. Right? When I think about people, I think it's easier to convince people to accept a change in their incentives than to convince them to fight against their incentives sustainably.

Jeremy: Right. And I think developers, and I'm guilty of this too, I mean, we make decisions based off of expediency. We want to get things done fast. And when you get stuck on that problem you're like, "You know what? I'm not going to figure it out. I'm just going to write a loop or I'm going to do whatever I can do just to make it work. Another if statement here isn't going to hurt anybody." All right. So let's move to ... Sorry, go ahead.

Ben: We shouldn't feel bad about that.

Jeremy: You're right.

Ben: I was going to say, we shouldn't feel bad about that. That's where I don't want tomorrow-Ben to have to be thankful for today-Ben, because the implication there is that today-Ben is fighting against his incentives to do good things for tomorrow-Ben. And I don't want to have to get to that point. If the right path is just the easiest path, right? Which means putting friction in the right places, then it's never a question of whether today-Ben is doing something that's worth being thankful for. It's just doing the job, right?

Jeremy: Right. No, that makes sense. All right. I got another question here that I think falls under the category of service discovery, which I know is another topic that you love. So this person said, "I love IaC, but hate the fuzzy boundaries where certain software awkwardly falls. Things like Istio and Prometheus and cert-manager. They can be considered part of the infrastructure, but then it's awkward to deploy them with something like Terraform due to circular dependencies relating to K8s and things like that."

So, I mean, I know that we don't have to get into the actual details of that, but I think that is an important aspect of infrastructure as code, where best practices sometimes are: deploy a stack that has your permanent resources, and then deploy a stack that has your more ephemeral resources, the ones that are going to be changing, the more mutable ones, maybe your Lambda functions and some of those sorts of things. If you're using Terraform or you're using some of these other services as well, you do have that really awkward mix where you're trying to use outputs from one stack in another stack and trying to do all that. And really, I mean, there are some good tools that help with it, but, I mean, just overall thoughts on that.

Ben: Well, we certainly need to demand better of AWS services when they design new things that they need to be designed so that infrastructure as code will work. So this is the S3 bucket notification problem. A very long time ago, S3 decided that they were going to put bucket notifications as part of the S3 bucket. Well, CloudFormation at that point decided that they were going to put bucket notifications as part of the bucket resource. And S3 decided that they were going to check permissions when the notification configuration is defined so that you have to have the permissions before you create the configuration.

This creates a circular dependency when you're hooking it up to anything in CloudFormation, because the notification depends on the resource policy of an SNS topic, an SQS queue, or a Lambda function, and that resource policy depends on the bucket name if you're letting CloudFormation name the bucket, which is the best practice. So the bucket name has to exist, which means the resource has to have been created, but the notification depends on the thing that it's notifying, which doesn't have the name yet, and the resource policy doesn't exist, so it all fails. And this is solved in a couple of different ways. One of which is name your bucket explicitly, again, not a good practice. Another is what SAM does, which says, "The Lambda function will say I will allow all S3 buckets to invoke me."
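
The cycle Ben walks through can be sketched abstractly; the resource names and dependency edges below are an illustrative simplification of the S3-to-SQS case.

```python
# The bucket's notification config needs the queue policy to exist first, but
# the queue policy needs the CloudFormation-generated bucket name -- a cycle.
# A depth-first search over the dependency edges finds it.

DEPENDS = {
    "Bucket": ["QueuePolicy"],           # notification permissions checked at create time
    "QueuePolicy": ["Bucket", "Queue"],  # policy references the bucket's name/ARN
    "Queue": [],
}

def has_cycle(deps: dict) -> bool:
    visiting, done = set(), set()
    def visit(node):
        if node in visiting:
            return True       # back-edge: cycle found
        if node in done:
            return False
        visiting.add(node)
        cyclic = any(visit(d) for d in deps[node])
        visiting.discard(node)
        done.add(node)
        return cyclic
    return any(visit(n) for n in deps)

print(has_cycle(DEPENDS))  # -> True
```

Splitting the notification out as its own resource would break the edge from Bucket to QueuePolicy, which is the design fix being argued for.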

So it has a star permission in its resource policy. So then the notification will work. None of which is good. Or there are custom resources that get created, right? Now, if those resources had been designed with infrastructure as code as part of the process, then it would have been obvious: "Oh, you end up with a circular dependency. We need to split out bucket notifications as a separate resource." And not enough teams are doing this. Often they're constrained by the API that they develop first ...

Jeremy: That's a good point.

Ben: ... they come up with the API, which often makes sense for a Console experience that they desire. So this is where API Gateway has this whole thing where you create all the routes and the resources and the methods and everything, right? And then you say, "Great, deploy." And in the Console you only need one mutable working copy of that at a time, but it means that you can't create two deployments or update two stages in parallel through infrastructure as code and API Gateway because they both talk to this mutable working copy state and would overwrite each other.

And if infrastructure as code had been on their list, it would have been, "Oh, if you have a definition of your API, you should be able to go straight to the deployment," right? And so trying to push that upstream, which to me is more important than infrastructure as code support at launch. But people are often like, "Oh, I want CloudFormation support at launch," and that often means that they get no feedback from customers on the design and therefore make it bad. KMS asymmetric keys should have been a different resource type so that you can easily tell which key types are in your template.

Jeremy: Good point. Yeah.

Ben: Right? So that you can use things like CloudFormation Guard more easily on those. Sure, you can control the properties or whatever, but you should be able to think in terms of, "I have a symmetric key or an asymmetric key in here." And they're treated completely separately because you use them completely differently, right? They don't get used to the same place.

Jeremy: Yeah. And it's funny that you mentioned the lacking support at launch, because that was another complaint that was quite prevalent in this thread: people complaining that they don't get that CloudFormation support right away. But I think you made a very good point that they do build the APIs first. And that's another thing. I don't know which one of these comments mentioned it, but there was a lot of anger over the fact that you go to the docs for AWS and it focuses on the Console and it focuses on the CLI, and then it gives you the API stuff with very little mention of CloudFormation at all. And usually, you have to go to a whole separate set of docs to find the CloudFormation. And it really doesn't tie all the concepts together, right? So you get just a block of JSON or of YAML and you're like, "Am I supposed to know what everything does here?"

Ben: Yeah. I assume that's data-driven. Right? And we exist in this bubble where everybody loves infrastructure as code.

Jeremy: True.

Ben: And that AWS has many more customers who set things up using the Console, people who learn by doing it first through the Console. I assume that's true; if it's not, then AWS has somehow gotten on the extremely wrong track. But I imagine that's how they find that they get the right engagement. Now maybe the CDK will change some of this, right? Maybe the amount of interest that it's generating will get us to the point where blogs get written with CDK programs in them. I think that presents different problems about what that CDK program might hide from you when you're learning about a service. But yeah, it's definitely not ... I wrote a blog for AWS, and my first draft had it as CloudFormation, and then we changed it to the Console. Right? And ...

Jeremy: That must have hurt. Did you die a little inside when that happened?

Ben: I mean, no, because they're definitely our users, right? That's the way in which they interact with us, and they should be able to learn from that at their company, right? Because again, developers are often not fully in control of this process.

Jeremy: Right. That's a good point.

Ben: And so they may not be able to say, "I want to update this through CloudFormation," right? Either because their organization says it or just because their team doesn't work that way. And I think AWS gets requests to prevent people from using the Console, but also to force people to use the Console. I know that at least one of them is possible in IAM. I don't remember which, because I've never encountered it, but I think it's possible to make people use the Console. I'm not sure, but I know that there are companies who want both, right? There are companies who say, "We don't want to let people use the API. We want to force them to use the Console." There are companies who say, "We don't want people using the Console at all. We want to force them to use the APIs."

Jeremy: Interesting.

Ben: Yeah. There's a lot of AWS customers, right? And there's every possible variety of organization and AWS should be serving all of them, right? They're all customers. And certainly, I want AWS to be leading the ones that are earlier in their cloud journey and on the serverless ladder to getting further but you can't leave them behind, I think it's important.

Jeremy: So that people argument, and those different levels, and coming in at a different, I guess, level of comfortability with APIs versus infrastructure as code and so forth. There was another comment on this that said, "I love the idea of committing everything that makes up my solution to text and resurrecting an entire solution out of nothing other than an account key. Love the ability to compare versions and unit test every bit of my solution, and not having to remember that one weird setting if you're using the Console. But hate that it makes some people believe that any coder is now an infrastructure wizard."

And I think this is a good point, right? And I don't 100% agree with it, but I think it's a good point that it basically ... Back to your point about creating these little transformations in Pulumi, you could do a lot of damage, I mean, good or bad, right? When you are using these tools. What are your thoughts on that? I mean, is this something where ... And again, the CDK makes it so easy for people to write these constructs pretty quickly and spin up tons of infrastructure without a lot of guard rails to protect them.

Ben: So I think if we tweak the statement slightly, there's truth there, which isn't about the self-perception but about what they need to be, right? And I think this is more about serverless than about infrastructure as code. Infrastructure as code is just saying that you can define it, right? I think it's more about the resources that are in a particular definition that require that. My former colleague Aaron Camera says, "Serverless means every developer is an architect," because you're not in that situation where the code you write goes onto something; you write the whole thing, right?

And so you do need to be an infrastructure wizard, whether or not you're given the tools and the education to do that, right? Not always; only if you're lucky. And the self-perception is again a different thing, right? Especially if coders think that there's nothing to be learned ... If programmers, software developers, think that there's nothing to be learned from the folks who traditionally define the infrastructure, which is Ops, right? They think, "Those people have nothing to teach me because now I can do all the things that they did." Well, you can create the things that they created, and it does not mean that you're as good at it ...

Jeremy: Or responsible for monitoring it too. Right.

Ben: ... and have the ... Right. The monitoring, the experience of saying these are the things that will come back to bite you that are obvious, right? This is how much ownership you're getting into. There's very much a long-standing problem there of devaluing Ops as a function and as a career. And for my money when I look at serverless, I think serverless is also making the software development easier because there's so much less software you need to write. You need to write less software that deals with the hard parts of these architectures, the scaling, the distributed computing problems.

You still have those big computing problems, but you're considering them functionally rather than coding things that address them, right? And so I see a lot of operations folks who come into serverless and learn a new programming language or just upskill, right? They're writing Python scripts to control stuff, and then they learn more about Python to be able to do software development in it. And then they bring all of that Ops experience and expertise into it, and look at something and say, "Oh, I'd much rather have Step Functions here than something where I'm running code for it, because I know how much my scripts break, and those kinds of things, when an API changes or I have to update it or whatever it is."

And I think that's something that Tom McLaughlin talks about, having come from an Ops background into serverless. And so I think there's definitely a challenge there in both directions, right? Ops needs to learn more about software development to be more engaged in that process. Software development does need to learn much more about infrastructure, and is also at this risk of approaching it from an "I know the syntax, but not the semantics" sort of thing, right? We can create ...

Jeremy: Just because I can doesn't mean I should.

Ben: ... an infrastructure. Yeah.

Rebecca: So Ben, as we're looping around this conversation and coming back to this idea that software is people and that really software should enable you to focus on the things that do matter. I'm wondering if you can perhaps think of, as pristine as possible, an example of when you saw this working, maybe it was while you've been at iRobot or a project that you worked on your own outside of that, but this moment where you saw software really working as it should, and that how it enabled you or your team to focus on the things that matter. If there's a concrete example that you can give when you see it working really well and what that looks like.

Ben: Yeah. I mean, iRobot is a great example of this, having been a company that needed software that scaled to consumer electronics volumes, right? Roomba volumes. And needing to build an IoT cloud application to run connected Roombas, and being able to do that without having to gain that expertise, without having to build a team that could deal with auto-scaling fleets of servers, all of those things. It was able to build that up completely serverlessly, and so skip an entire level of organizational expertise, because that's just not necessary to accomplish those tasks anymore.

Rebecca: It sounds quite nice.

Ben: It's really great.

Jeremy: Well, I have one more question here that I think could probably end up ... We could talk about for another hour. So I will only throw it out there and maybe you can give me a quick answer on this, but I actually had another Twitter thread on this not too long ago that addressed this very, very problem. And this is the idea of the feedback cycle on these infrastructure as code tools where oftentimes to deploy infrastructure changes, I mean, it just takes time. In many cases things can run in parallel, but as you said, there's race conditions and things like that, that sometimes things have to be ... They just have to be synchronous. So is this something where there are ways where you see in the future these mutations to your infrastructure or things like that potentially happening faster to get a better feedback cycle, or do you think that's just something that we're going to have to deal with for a while?

Ben: Yeah, I think it's definitely a very extensive topic. I think there's a few things. One is that the deployment cycle needs to get shortened. And part of that I think is splitting dev deployments from prod deployments. In prod it's okay for it to take 30 seconds, right? Or a minute or however long because that's at the end of a CI/CD pipeline, right? There's other things that are happening as part of that. Now, you don't want that to be hours or whatever it is. Right? But it's okay for that to be proper and to fully manage exactly what's going on in a principled manner.

When you're doing development, it would be okay, for example, to change the Lambda code without going through CloudFormation, right? And this is what Architect does: there's a notion of a dirty deploy, which just packages up your code. Now, if your resource graph has changed, you do need to deploy again, right? But if the only thing that's changing is your code, sure, you can go and say, "Update function code," on that Lambda directly, and that's faster.
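
A sketch of that dirty-versus-full decision, assuming a simple fingerprinting scheme; this is illustrative, not Architect's actual mechanism.

```python
# If the resource graph (template minus code locations) is unchanged, only the
# code differs, so an UpdateFunctionCode-style call suffices; otherwise do a
# full CloudFormation deploy.
import copy
import hashlib
import json

def graph_fingerprint(template: dict) -> str:
    """Hash the template with code locations stripped out."""
    t = copy.deepcopy(template)
    for res in t["Resources"].values():
        res.get("Properties", {}).pop("Code", None)
    return hashlib.sha256(json.dumps(t, sort_keys=True).encode()).hexdigest()

def choose_deploy(old: dict, new: dict) -> str:
    if graph_fingerprint(old) == graph_fingerprint(new):
        return "dirty"  # just push the new code, much faster
    return "full"       # resource graph changed: real deploy

old = {"Resources": {"Fn": {"Type": "AWS::Lambda::Function",
                            "Properties": {"Code": {"S3Key": "v1.zip"}}}}}
new = copy.deepcopy(old)
new["Resources"]["Fn"]["Properties"]["Code"]["S3Key"] = "v2.zip"
print(choose_deploy(old, new))  # -> dirty
```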

But calling it a dirty deploy is, I think, important because that is not something that you want to do in prod, right? You don't want there to be drift between what the infrastructure-as-code service understands and what's actually deployed. But then you go further than that and imagine: there's no reason that you actually have to do this whole zip file process. You could be rsyncing the code directly, or you could be operating on the code remotely over SSH, right? There are many different ways in which the loop from "I have a change in my Lambda code" to that Lambda having that change could be even shorter than that, right?
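As a minimal sketch of that faster dev path in Python with boto3: package only the changed handler in memory and push it straight to the function with `update_function_code`, bypassing CloudFormation entirely (the function and file names here are hypothetical, and as Ben says, this is for dev only, since it creates drift):

```python
import io
import zipfile


def package_code(filename: str, source: str) -> bytes:
    """Zip a single handler file in memory, like a 'dirty deploy' would."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, source)
    return buf.getvalue()


# With boto3 (not invoked here), the bundle is pushed directly to Lambda,
# skipping the CloudFormation deploy when only the code has changed:
#
#   import boto3
#   boto3.client("lambda").update_function_code(
#       FunctionName="my-dev-function",  # hypothetical function name
#       ZipFile=package_code("handler.py", open("handler.py").read()),
#   )
```

This only holds while the resource graph is unchanged; any change to infrastructure still needs a full deploy.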

And for me, that's what it's really about. I don't think that local mocking is the answer. You and Brian LeRoux were talking about this recently. I mean, I agree with both of you. So I think about it as: I want unit tests of my business logic, but my business logic doesn't deal with AWS services. So I want to unit test something that says, "Okay, I'm performing this change in something," and that's entirely within my custom code. Right? It's not touching other services. It doesn't mean that I actually need adapters, right? I could be dealing with the native formats that I'm getting back from a given service, but I'm not actually making calls out of the code. I'm mocking out, "Well, here's what the response would look like."
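The approach described here can be sketched in Python: the business logic is a pure function that accepts DynamoDB's native response shape but never calls AWS, so the unit test just hands it "what the response would look like." The function, attribute names, and data are all hypothetical illustrations:

```python
def total_unshipped(query_response: dict) -> int:
    """Pure business logic: sum the quantities of unshipped orders.

    Operates on DynamoDB's native low-level response format, but makes
    no AWS calls -- the response is passed in, so no adapter is needed.
    """
    total = 0
    for item in query_response.get("Items", []):
        if item["status"]["S"] != "shipped":
            total += int(item["quantity"]["N"])
    return total


# In a unit test, mock out "what the response would look like":
fake_response = {
    "Items": [
        {"status": {"S": "pending"}, "quantity": {"N": "2"}},
        {"status": {"S": "shipped"}, "quantity": {"N": "5"}},
        {"status": {"S": "pending"}, "quantity": {"N": "1"}},
    ]
}
assert total_unshipped(fake_response) == 3  # only the pending orders count
```

Whether the wiring between services is correct is then verified in the cloud, not locally.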

And so I think that's definitely necessary in the unit testing sense of saying, "Is my business logic correct?" I can do that locally. But "Is the wiring all correct?" is something that should only happen in the cloud. There's no reason to mock API Gateway into Lambda locally, in my mind. You should just be dealing with the Lambda side of it in your local unit tests rather than trying to set up this multi-service thing. Another part of the story is, okay, so these deploys have to happen faster, right? And then how do we help set up those end-to-end tests and give you observability into them? Right? X-Ray helps, but X-Ray can't yet sort through all the services that you might use in a serverless architecture, or deal with how it works in my Lambda function when it's batching from Kinesis or SQS into my function.

So multiple traces are now being handled by one invocation, right? These are problems that aren't solved yet. Until we get that kind of inspection, it's going to be hard for us to feel as good about cloud development. And again, this is where I feel sometimes there's more friction, but there's a bigger payoff. It's one of those things where, again, you're fighting against your incentives, which is not the place that you want to be.

Jeremy: I'm going to stop you before you disagree with me anymore. No, just kidding! So, Rebecca, you have any final thoughts or questions for Ben?

Rebecca: No. I just want to say to both of you and to everyone listening that I hope your today self is celebrating your yesterday-self right now.

Jeremy: Perfect. Well, Ben, thank you so much for joining us and being a guinea pig as we said on this new format that we are trying. Excellent guinea pig. Excellent.

Rebecca: An excellent human too but also great guinea pig.

Jeremy: Right. Right. Pretty much so. So if people want to find out more about you, read some of the stuff you're doing and working on, how do they do that?

Ben: I'm on Twitter. That's the primary place. I'm on LinkedIn, I don't post much there. And then I write articles that show up on Medium.

Rebecca: And just so everyone knows your Twitter handle I'll say it out loud too. It's @ben11kehoe, K-E-H-O-E, ben11kehoe.

Jeremy: Right. Perfect. All right. Well, we will put all that in the show notes and hopefully people will like this new format. And again, we'd love your feedback on this, things that you'd like us to do in the future, any ideas you have. And of course, make sure you reach out to Ben. He's an amazing resource for serverless. So again, thank you for everything you do, and thank you for being on the show.

Ben: Yeah. Thanks so much for having me. This was great.

Rebecca: Good to see you. Thank you.

2021-06-28

Episode #106: Building Apps on the Decentralized Web with Nader Dabit

About Nader Dabit
Nader Dabit is a web and mobile developer, author, and Developer Relations Engineer building the decentralized future at Edge and Node. Previously, he worked as a Developer Advocate at AWS Mobile working with projects like AWS AppSync and AWS Amplify. He is also the author and editor of React Native in Action and OpenGraphQL.

Nader Dabit Twitter: @dabit3
Edge and Node Twitter: @edgeandnode
Graph protocol Twitter: @graphprotocol
Edge and Node: edgeandnode.com
Everest: everest.link
YouTube: YouTube.com/naderdabit
What is Web3? The Decentralized Internet of the Future Explained


Watch this episode on YouTube: https://youtu.be/pSv_cCQyCPQ

This episode is sponsored by CBT Nuggets and Fauna.

Transcript
Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today I am joined again by Nader Dabit. Hey Nader, thanks for joining me.

Nader: Hey Jeremy. Thanks for having me.

Jeremy: You are now a developer relations engineer at Edge & Node. I would love it if you could tell the listeners a little bit about yourself. I think a lot of people probably know you already, but a little bit about your background and then what Edge & Node is.

Nader: Yeah, totally. My name is Nader Dabit, like you mentioned, and I've been a developer for about, I guess, nine or ten years now. A lot of people might know me from my work with AWS, where I worked with the Amplify team on the front end web and mobile team, doing a lot of full stack stuff there as well as serverless. I was working as a developer relations person, a developer advocate actually, leading the front end web and mobile team at AWS; I was there for a little over three years, and I was a manager for the last year. I became really, really interested in serverless while I was there. It led to me writing a book, which is Full Stack Serverless. It also just led me down the rabbit hole of managed services and philosophy and all this stuff.

It's been really, really cool to learn about everything in the space. Edge & Node is my next step, I would say, in doing work and what I consider maybe a serverless area, but it's an area that a lot of people might not associate with the traditional, I would say definition of serverless or the types of companies they often associate with serverless. But Edge & Node is a company that was spun off from a team that created a decentralized API protocol, which is called the Graph protocol. And the Graph protocol started being built in 2017. It was officially launched in a decentralized way at the end of 2020. Now we are currently finalizing that migration from a hosted service to a decentralized service actually this month.

A lot of really exciting things going on. We'll talk a lot about that and what all that means. But Edge & Node itself, we do support the Graph protocol, that's part of what we do, but we also build out decentralized applications ourselves. We have a couple of applications that we're building as engineers. We're also doing a lot of work within the Web3 ecosystem, which is known as the decentralized web ecosystem by investing in different people and companies and supporting different things and spreading awareness around some of the things that are going on here because it does have a lot to do with maybe the work that people are doing in the Web2 space, which would be the traditional webspace, the space that I was in before.

Jeremy: Right, right. Here I am. I follow you on Twitter. Love the videos that you do on your YouTube channel. You're like a shining example of what a really good developer relations dev advocate is. You just produce so much content, things like that, and you're doing all this stuff on serverless and I'm loving it. And then all of a sudden, I see you post this thing saying, hey, I'm leaving AWS Amplify. And you mentioned something about blockchain and I'm like, okay, wait a minute. What is this that Nader is now doing? Explain to me, and hopefully the audience as well: what does the blockchain have to do with these decentralized applications, or decentralized, I guess, Web3?

Nader: Web3, as defined in what you might see if you do some research, would be what a lot of people are talking about as the next evolution of the web as we know it. In a lot of these articles and stuff, people are trying to formalize the ideas: the original web was the read-only web, where we were not creators; the only creators were maybe the developers themselves. Early on, I might've gone and read a website and been able to only interact with the website by reading information. The current version that we're experiencing might be considered Web2, where everyone's a creator. All of the interfaces, all of the applications that we interact with are built specifically for input. I can actually create a comment, I can upload a video, I can share stuff, and I can write to the web. And I can read.

And then the next evolution, a lot of people are categorizing, yes, is Web3. It's like taking a lot of the great things that we have today and maybe improving upon those. A lot of people and everyone kind of, this is just a really, a very old discussion around some of the trade-offs that we currently make in today's web around our data, around advertising, around the way a lot of business models are created for monetization. Essentially, they all come down to the manipulation of user data and different tricks and ways to steal people's data and use that essentially to create targeted advertising. Not only does this lead to a lot of times a negative experience. I just saw a tweet yesterday that resonated a lot with me that said, "YouTube is no longer a video platform, it's now an ad platform with videos in between." And that's the way I feel about YouTube. My kids ...

Jeremy: Totally.

Nader: ... I have kids that use YouTube and it's interesting to watch them because they know exactly what to do when the ads come up and exactly how to time it because they're used to, ads are just part of their experience. That's just what they're used to. And it's not just YouTube, it's every site that's out there, that's a social site, Instagram, LinkedIn. I think that that's not the original vision that people had, right, for the web. I don't think this was part of it. There have been a lot of people proposing solutions, but the core fundamental problem is how these applications are engineered, but also how the applications are paid for. How do these companies pay for developers to build. It's a really complex problem that, the simplest solution is just sell ads or maybe create something like a developer platform where you're charging a weekly or monthly or yearly or something like that.

I would say a lot of the ideas around Web3 are aiming to solve this exact problem. In order to do that you have to rethink how we build applications. You have to rethink how we store data. You have to rethink about how we think about identity as well, because again, how do you build an application that deals with user data without making it public in some way? Right? How do we deal with that? A lot of those problems are the things that people are thinking about and building ways to address those in this decentralized Web3 world. It became really fascinating to me when I started looking into it because I'm very passionate about what I'm doing. I really enjoy being a developer and going out and helping other people, but I always felt there was something missing because I'm sitting here and I love AWS still.

In fact, I would 100% go back and work there or any of these big companies, right? Because you can't really look at a company as, in my opinion, a black or white, good or bad thing; companies are doing good things and bad things at the same time. For instance, at AWS, I would meet a developer, teach them something at a workshop, and a year later they would contact me and be like, hey, I got my first job, or I created a business, or I landed my first client. So you're actually helping improve people's lives, while at the same time you're reading these articles about Amazon in the news with some of the negative stuff going on. The way that I look at it is, I can't sit there and say any company is good or bad. But I felt like for a lot of the applications that people were building, the end goal, when you hear some of these VC discussions or people raising money, the end goal for some of the people I was working with was just selling advertising.

And I'm like, is this really what we're here to do? It doesn't feel fulfilling anymore when you start seeing that over and over and over. I think the thing that really fascinated me was that people are actually building applications that are monetized in a different way. And then I started diving into the infrastructure that enabled this and realized that there was a lot of similarities between serverless and how developers would deploy and build applications in this way. And it was the entry point to my rabbit hole.

Jeremy: I talked to you about this and I've been reading some of the stuff that you've been putting out and trying to educate myself on some of this. It seems very much like that show Silicon Valley on HBO, right? This decentralized web and things like that, but there's kind of, and totally correct me if I'm wrong here, but I feel there's two sides of this. You've got one side that is the blockchain, that I think some people are familiar with in, I guess, the context of cryptocurrency, right? This is a very popular use of the blockchain because you have that redundancy and you have the agreement amongst multiple places, it's decentralized. And so you have that security there around that. But there's other uses for the blockchain as well.

Especially things like banking and real estate and some of those other use cases that I'd like to talk about. And then there's another side of it that is this decentralized piece. Is the decentralized piece of it like building apps? How is that related to the blockchain or are those two separate things?

Nader: Yeah, absolutely. I'm a big fan of Silicon Valley. Working in tech, it's almost like every single episode resonates with you if you've been in here long enough because you've been in one of those situations. The blockchain is part of the discussion. Crypto is part of the discussion, and those things never really interested me, to be honest. I was a speculator in crypto from 2015 until now. It's been fun, but I never really looked at crypto in any other way other than that. Blockchain had a really negative, I would say, association in my mind for a long time, I just never really saw any good things that people were doing with it. I just didn't do any research, maybe didn't understand what was going on.

When I started diving into it originally what really got me interested is the Graph protocol, which is one of the things that we work on at Edge & Node. I started actually understanding, why does this thing exist? Why is it there? That led me to understanding why it was there and the fact that 90% of dApps, decentralized apps in the Ethereum ecosystem are using it. And billions of queries, companies with billions of dollars in transactions are all using this stuff. I'm like, okay, this whole world exists, but why does it exist? I guess to give you an example, I guess we can talk about the Graph protocol. And there are a lot of other web, I would say Web3 or decentralized infrastructure protocols that are out there that are similar, but they all are doing similar things in the sense of how they're actually built and how they allow participation and stuff like that.

When you think of something like AWS, you think of, AWS has all of these different services. I want to build an app, I need storage. I need some type of authentication layer, maybe with Cognito, and then maybe I need someplace to execute some business logic. So maybe I'll spin up some serverless functions or create an EC2 instance, whatever. You have all these building blocks. Essentially what a lot of these decentralized protocols like the Graph are doing, are building out the same types of web infrastructure, but doing so in a decentralized way. Why does that even matter? Why is that important? Well, for instance, when you live, let's say for example in another country, I don't know, in South America and outside the United States, or even in the United States in the future, you never know. Let's say that you have some application and you've said something rude about maybe the president or something like that.

Let's say that for whatever reason, somebody hacks the server that you're dealing with or whatever, at the end of the day, there is a single point of failure, right? You have your data that's controlled by the cloud provider or the government can come in and they can have control over that. The idea around some of, pretty much all of the decentralized protocols is that they are built and distributed in a way that there is no single point of failure, but there's also no single point of control. That's important when you're living in areas that have to even worry about stuff like that. So maybe we don't have to worry about that as much here, but in other countries, they might.

Building something like a server is not a big deal, right? With AWS, but how would you build a server and make it available for anyone in the world to basically deploy and do so in a decentralized way? I think that's the problem that a lot of these protocols are trying to solve. For the Graph in particular, if you want to build an application using data that's stored on a blockchain. There's a lot of applications out there that are basically using the blockchain for mainly, right now it's for financial, transactional reasons because a lot of the transactions actually cost a lot of money. For instance, Uniswap is one of these applications. If you want to basically query data from a blockchain, it's not as easy as querying data from a traditional server or database.

For us we are used to using something like DynamoDB, or some type of SQL database, that's very optimized for queries. But on the blockchain, you're basically having these blocks that add up every time. You create a transaction, you save it. And then someone comes behind them and they save another transaction. Over time you build up this data that's aggregated over time. But let's say you want to hit that database with the, quote-unquote, database with a query and you want to retrieve data over time, or you want to have some type of filtering mechanism. You can't do that. You can't just query blockchains the way you can from a regular database. Similar to how a database basically indexes data and stores it and makes it efficient for retrieval, the Graph protocol basically does that, but for blockchain data.

Anyone that wants to build an application, one of these decentralized apps on top of blockchain data has a couple of options. They can either build their own indexing server and deploy it to somewhere like AWS. That takes away the whole idea of decentralization because then you have a single point of failure again. You can query data directly from the blockchain, from your client application, which takes a very long time. Both of those are not, I would say the most optimal way to build. But also if you're building your own indexing server, every time you want to come up with a new idea also, you have to think about the resources and time that go into it. Basically, I want to come up with a new idea and test it out, I have to basically build a server index, all this data, create APIs around it. It's time-intensive.

What the Graph protocol allows you to do is, as a developer you can basically define a subgraph using YAML, similar to something like CloudFormation, or a very condensed version of that, maybe more like the Serverless Framework, where you're defining: I want to query data from this data source, and I want to save these entities. And you deploy that to the network. And that subgraph will basically then go and look into that blockchain and will look for all the transactions that have happened, and it will go ahead and save those and make those available for public retrieval. And also, again, one of the things that you might think of is, all of this data is public. All of the data that's on the blockchain is public.

Jeremy: Right. Right. All right. Let me see if I could repeat what you said and you tell me if I'm right about this. Because this was one of those things where blockchain ... you're right. To me, it had a negative connotation. Why would you use the blockchain, unless you were building your own cryptocurrency? Right. That just seemed like that's what it was for. Then when AWS comes out with QLDB or they announced that or whatever it was. I'm like, okay, so this is interesting, but why would you use it, again, unless you're building your own cryptocurrency or something because that's the only thing I could think of you would use the blockchain for.

But as you said, with these blockchains now, you have highly sensitive transactions that can be public, but a real estate transaction, for example, is something really interesting, where like, we still live in a world where if Bank of America or one of these other giant banks, JPMorgan Chase or something like that gets hacked, they could wipe out financial data. Right? And I know that's backed up in multiple regions and so forth, but this is the thing where if you're doing some transaction, that you want to make sure that transaction lives forever and isn't manipulated, then the blockchain is a good place to do that. But like you said, it's expensive to write there. But it's even harder to read off the blockchain because it's that ledger, right? It's just information coming in and coming in.

So if you were doing event sourcing or something like that, it's that idea. The idea with these indexers is they're basically separate apps that run, and again, I'm assuming that these protocols are software and things that you don't have to build yourself; essentially you can just deploy these things. Right? But this will read off of the blockchain and do that aggregation for you and then make that available. Basically, it caches the blockchain. Right? And makes that available to you. And you could deploy that to multiple indexers if you wanted to. Right? And then you would have access to that data across multiple providers.

Nader: Right. No single point of failure. That's exactly right. You basically deploy a very concise configuration file that defines how you want your data stored and made available. And then it goes, and it just starts at the very beginning and it queries all those blocks, or reads all those blocks, saves the data in a database, and then it keeps up with additional new updates. If someone writes a new transaction after that, it also saves that and makes it available for efficient retrieval. This is just for blockchain data; this is the data layer. But it's not just blockchain data in the future. You can also query from IPFS, which is a file storage layer, somewhat like S3. You can query from other chains other than Ethereum, which is kind of like the main chain.

In the future really what we're hoping to have is a complete API on top of all public data. Anybody that wants to have some data set available can basically deploy a subgraph and index it and then anyone can then essentially query for it. It's like when you think of public data, we're not really used to thinking of data in this way. And also I think a good thing to talk about in a moment is the types of apps that you can build because you wouldn't want to store private messages on a blockchain or something like that. Right? The types of apps that people are building right now at least are not 100% in line with everything. You can't do everything I would say right now in Web3 that you can do in Web2.

There are only certain types of applications, but those applications that are successful seem to be wildly successful and have a lot of people interested in them and using them. That's the general idea, is like you have this way to basically deploy APIs and the technology that we use to query is GraphQL. That was one of the reasons that I became interested as well. Right now the main data sources are blockchains like Ethereum, but in the future, we would like to make that available to other data sources as well.
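As a concrete sketch of what querying a subgraph looks like from an application, here is a hypothetical GraphQL request built in Python, modeled loosely on a Uniswap-style subgraph. The entity and field names (`swaps`, `timestamp`, `amountUSD`) are illustrative assumptions, not a specific published schema:

```python
# A GraphQL query against a hypothetical subgraph: fetch the most
# recent swaps, newest first, using the subgraph's indexed data.
RECENT_SWAPS = """
query RecentSwaps($first: Int!) {
  swaps(first: $first, orderBy: timestamp, orderDirection: desc) {
    id
    timestamp
    amountUSD
  }
}
"""


def build_subgraph_query(first: int = 5) -> dict:
    """Build the JSON body for a GraphQL POST to a subgraph endpoint."""
    return {"query": RECENT_SWAPS, "variables": {"first": first}}


# Sending it is just an HTTP POST to the subgraph's URL, e.g.:
#   requests.post(
#       "https://api.thegraph.com/subgraphs/name/<org>/<subgraph>",
#       json=build_subgraph_query(3),
#   )
```

The point of the example is the shape of the workflow: the client never touches the chain directly; it sends an ordinary GraphQL query to an indexer that has already aggregated the block data.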

Jeremy: Right. You mentioned earlier too, because there are apps obviously being built on this that you said are successful. And the problem though, I think right now, because I remember I speculated a little bit with Bitcoin and I bought a whole bunch of Ripple, so I'm still hanging on to it. Ripple, XRP, whatever, let's go. Anyways, but it was expensive to make a transaction. Right? Reading off of the blockchain itself, I think just connecting generally doesn't cost money, but if you're, and I know there's some costs with indexers and that's how that works. But in terms of the real cost, it's writing to the blockchain. I remember moving some Bitcoin at one point, I think it cost me $30 to make one transaction, to move something like that.

I can see if you're writing a $300,000 real estate transaction, or maybe some really large wire transfer or something that you want to record, something where it makes sense that you could charge a fee of $30 or $40 in order to do that. I can't see you doing that for ... certainly not for web streaming or click tracking or something like that. That wouldn't make sense. But even for smaller things where you might be writing more often, $30 or whatever that would be seems quite expensive. What's the hope around that?

Nader: That was one of the biggest challenges and that was one of the reasons that when I first, I would say maybe even considered this as a technology back in the day, that I would be considering as something that would possibly be usable for the types of applications I'm used to seeing. It just was like a no-brainer, like, no. I think right now, and that's one of the things that attracted me right now to some of the things that are happening, is a lot of those solutions are finally coming to fruition for fixing those sorts of things. There's two things that are happening right now that solve that problem. One of them is, they are merging in a couple of updates to the base layer, layer one, which would be considered something like Ethereum or Bitcoin. But Ethereum is the main one that a lot of the financial stuff that I see is happening.

Basically, there are two different updates that are happening, I think the main one that will make this fee transactional price go down a little bit is sharding. Sharding is basically going to increase the number of, I believe nodes that are basically able to process the transactions by some number. Basically, that will reduce the cost somewhat, but I don't think it's ever going to get it down to a usable level. Instead what the solutions seem to be right now and one of the solutions that seems to actually be working, people are using it in production really recently, this really just started happening in the last couple of months, is these layer 2 solutions. There are a couple of different layer 2 solutions that are basically layers that run on top of the layer one, which would be something like Ethereum.

And they treat Ethereum as the settlement layer. It's almost like when you interact with the bank and you're running your debit card. You're probably not talking to the bank directly and they are doing that. Instead, you have something like Visa who has this layer 2 on top of the banks that are managing thousands of transactions per second. And then they take all of those transactions and they settle those in an underlying layer. There's a couple different layer 2s that seem to be really working well right now in the Ethereum ecosystem. One of those is Arbitrum and then the other is I think Matic, but I think they have a different name now. Both of those seem to be working and they bring the cost of a transaction down to a fraction of a penny.

You have, instead of paying $20 or $30 for a transaction, you're now paying almost nothing. But now that's still not cheap enough to probably treat a blockchain as a traditional database, a high throughput database, but it does open the door for a lot of other types of applications. The applications that you see building on layer one where the transactions really are $5 to $20 or $30 or typically higher value transactions. Things like governance, things like financial transactions, you've heard of NFTs. And that might make sense because if someone's going to spend a thousand bucks or 500 bucks, whatever ...

Jeremy: NFTs don't make sense to me.

Nader: They're not my thing either, the way they're being, I would say, talked about today especially, but I think in the future, the idea behind NFTs is interesting, but yeah, I'm in the same boat as you. But still to those people, if you're paying a thousand dollars for something then that 5 or 10 or 20 bucks might make sense, but it's not going to make sense if I just want to go to an e-commerce store and pay $5 for something. Right? I think that these layer 2s are starting to unlock those potential opportunities where people can start building these true financial applications that allow these transactions to happen at the same cost or actually a lot cheaper maybe than what you're paying for a credit card transaction, or even what those vendors, right? If you're running a store, you're paying percentages to those companies.

The idea around decentralization comes back to this discussion of getting rid of the middleman, and a lot of times that means getting rid of the inefficiencies. If you can offload this business logic to some type of computer, then you've basically abstracted away a lot of inefficiencies. How many billions of dollars are spent every year by banks flying their people around the world and private jets and these skyscrapers and stuff. Now, where does that money come from? It comes from the consumer and them basically taking fees. They're taking money here and there. Right? That's the idea behind technology in general. They're like whenever something new and groundbreaking comes in, it's often unforeseen, but then you look back five years later and you're like, this is a no-brainer. Right?

For instance Blockbuster and Netflix, there's a million of them. I don't have to go into that. I feel this is what that is for maybe the financial institutions and how we think about finance, especially in a global world. I think this was maybe even accelerated by COVID and stuff. If you want to build an application today, imagine limiting yourself to developers in your city. Unless you're maybe in San Francisco or New York, where that might still work. If I'm here in Mississippi and I want to build an application, I'm not going to just look for developers in a 30-mile radius. That is just insane. And I don't use that word mildly, it's just wild to think about that. You wouldn't do that.

Instead, you want to look in your nation, but really you might want to look around the world because you now have things like Slack and Discord and all these asynchronous ways of doing work. And you might be able to find the best developer in the world for 25% or 50% of what you would typically find locally and an easy way to pay them might just be to just send them some crypto. Right? You don't have to go find out all their banking information and do all the wiring and all this other stuff. You just open your wallet, you send them the money and that's it. It's a done deal. But that's just one thing to think about. To me when I think about building apps in Web2 versus Web3, I don't think you're going to see the Facebook or Instagram use case anytime in the next year or two. I think the killer app for right now, it's going to be financial and e-commerce stuff.

But I do think in maybe five years you will see someone crack that application: something like a social media app where we're basically rebuilding something that we use today, but maybe in a better way. And that will be done using some off-chain storage solution. You're not going to be writing all these transactions to a blockchain again. You're going to have maybe a protocol like the Graph that allows you to have a distributed database that is managed by one of these networks that you can write to. Those are the ideas that really excite me, anyway.

Jeremy: Let's go back to GraphQL for a second, though. If you were going to build an app on top of this, and again, that's super exciting, getting those transaction fees down, because I do feel like every time you try to move money between banks there's the $3 fee, and if you go to a foreign ATM and take money out, they charge you. Everybody wants to take a cut somewhere along the way, and there's probably reasons for it, but also corporate jets cost money. So that makes sense as well. But in terms of the GraphQL protocol here, if I wanted to build an application on top of it, and maybe my application doesn't write to the blockchain, it just reads from it with one of these indexers, because maybe I'm summing up some financial transactions or something, or I've got an app where we can look things up or whatever, I'm building something.

I'm querying using GraphQL, that makes sense. I have to use one of these indexers that's aggregating that data for me. But what if I did want to write to the blockchain? Can I use GraphQL to do a mutation and actually write something to the blockchain, or do I have to write to it directly?

Nader: Yeah, that's actually a really, really good question. And that's one of the things that we are currently working on with the Graph. Right now if you want to write a transaction, you typically are going to be using one of these JSON-RPC wallets, and some type of client library that interacts with the wallet and signs the transaction with the private key. And then that sends the transaction to the blockchain directly. You're talking to the blockchain directly, and you're just using something like the Graph to query. But what we think would be ideal is if someone could use a single technology, a single language, and a single abstraction to do everything, not only reading and writing but also subscriptions for real-time updates.

That's where we think the whole idea for this will ultimately be, and that's what we're working on now. Right now you can only query. And if you want to write a transaction, you basically are still going to be using something like ethers.js or web3.js or one of these other libraries that allows you to sign a transaction using your wallet. But we're already building this right now: an end-to-end GraphQL library that allows you to write transactions as well as read. That way someone just learns a single API and it's a lot easier. It would also make it easier for developers coming from a traditional web background, because there's a little bit of a learning curve to understanding how to create one of these signed providers and write the transaction. It's not that much code, but it is a new way of thinking about things.
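As a rough sketch of the read side Nader describes: a subgraph is queried with ordinary GraphQL over HTTP, while writes still go through a wallet library. The endpoint URL, entity, and field names below are made up for illustration; a real subgraph defines its own schema in its manifest.

```python
import json

# Hypothetical subgraph endpoint -- each subgraph exposes its own URL.
SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/example/my-subgraph"

# Hypothetical entity ("transfers") and fields, for illustration only.
QUERY = """
query RecentTransfers($first: Int!) {
  transfers(first: $first, orderBy: timestamp, orderDirection: desc) {
    id
    from
    to
    value
  }
}
"""

def build_request(first: int) -> dict:
    """Build the JSON body for a GraphQL POST against a subgraph."""
    return {"query": QUERY, "variables": {"first": first}}

# This payload would be POSTed to SUBGRAPH_URL with any HTTP client.
# Writes, by contrast, are signed by a wallet and sent to the chain itself.
print(json.dumps(build_request(5))[:40])
```

The point of the end-to-end library Nader mentions is that reads like this and wallet-signed writes would share one GraphQL API instead of two separate toolchains.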

Jeremy: Well I think both of us coming from the serverless space, we know that new way of thinking about things certainly can throw a wrench in the system when a new developer is trying to pick that stuff up.

Nader: Yeah.

Jeremy: All right. So that's the blockchain side of things with the data piece of it. I think people can wrap their heads around that; it makes a lot of sense. But I'm still on the decentralized side, the other things that you talked about. You mentioned something that's sort of an S3-type protocol that you can use. And what are some of the other ones? I think I've written some of them down here: Akash was one, Filecoin, Livepeer. Are these all different protocols or services that are hosted by the indexers, or is this a different thing than the indexers? How does that work? And then how would you use that to save data, maybe some blob storage or something like that?

Nader: Let's talk about the tokenomics idea around how crypto fits into this and how it actually powers a protocol like this, and then we'll talk about some of those other protocols. How do people actually build all this stuff? Are they getting paid for it? Is it free? How does this network actually stay up? Because everything costs money, developers' time costs money, and so on and so forth. For something like the Graph, during the building phase of this protocol, there were white papers, there were blog posts, and there were people in Discords talking about the ideas that were here. They basically had this idea to build this protocol. And this is a very typical life cycle, I would say.

You have someone that comes up with an idea, they document some of it, they start building it. And the people that start building it are going to be essentially the founding team, in the sense that they're going to be having equity. Because at the end of the day, to actually launch one of these decentralized protocols, the way that crypto comes into it, there's typically some type of a token offering. For a network like this, the tokens need to be some type of utility token that keeps the network running in the future. You're not just going to create some crypto and that's it, right? I think that's the whole idea I thought was going on, when in reality these tokens are typically used for powering the protocol.

But let's say early on you have, say, 20 developers and they all build 5% of the system, or whatever percentage you want to talk about. Let's say you have these people helping out, and then you actually build the thing, and you want to go ahead and launch it, and you have something that's working. A lot of times what people will do is they'll have a token offering, where they'll basically say, okay, we're going to mint X number of tokens, we're going to put these on the market, and we're also going to pay these people that helped build this system X number of tokens, and that's going to be their payment. And then they can go and sell those or keep those or trade those or whatever they would like to do.

And then you have the tokens that are put on the public market, essentially. Once you've launched the protocol, you have to have tokens to continue to power and fund the protocol. There are different people that interact with the protocol in different ways. You have the indexers themselves, which are basically software engineers deploying whatever infrastructure to something like AWS or GCP. These people are still using these cloud providers, or maybe they're doing it at their house, whatever. All you basically need is a server, and you run this indexer node, which is open source software. Basically, you can go ahead and say, okay, I want to start being an indexer and I want to be one of the different nodes on the network.

To do that you basically buy some GRT, the Graph Token, and in our case you stake it, meaning you are putting this money up to affirm that you are an indexer on the protocol and you are going to be accepting subgraph developers to deploy their subgraphs to your indexer. You stake that money, and then when people use the API, they're paying money just like they might pay money to something like API Gateway or AppSync. Instead, they're paying money for their subgraph, and that money is paid in GRT and distributed to the people in the ecosystem. Like me as a developer, I'm deploying the subgraph, and then if I have a million people using it, then I make some money. That's one way to use tokens in the system.

Another way is, as an outside person looking in, I can say, this indexer is really, really good. They know what they're doing. They're a very strong engineer. I'm going to put some money into their indexer, basically backing them as an indexer, and then I will also share in the money that comes in from the query fees. And then there are also subgraph developers, which is the stuff that I've been working with mainly, where I can come up with a new API. I can be like, it'd be cool if I took data from this blockchain and this file system and merged it together, and made this really cool API that people can use to build their apps with. I can deploy that. And people can signal on this subgraph using tokens, saying that they believe this is a good subgraph to use.

And then when people use that, I can also make money in that way. Basically, people are using tokens to be part of the system itself, but also to use it. If I'm a front-end application like Uniswap and I want to use the Graph, I can basically say, okay, I'm going to put a thousand dollars in GRT tokens and I'm going to be using this API endpoint, which is a subgraph. And then all of the money that I have put up as someone that's using this is going to be drawn down as people start using it. Let's say I have a million queries and each query is one hundredth of a cent; then after those million queries are up, I've spent $100 or something like that. Kind of similar to how you might pay AWS, you're now paying subgraph developers and indexers.
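The back-of-the-envelope math here can be sketched directly. The price per query is an illustrative figure, not a real GRT rate; real query pricing is set by the market in GRT.

```python
# Illustrative figures only. Working in cents avoids floating-point
# truncation surprises when dividing dollar amounts.
price_per_query_cents = 0.01        # one hundredth of a cent per query
deposit_cents = 100 * 100           # a $100 deposit, in cents

queries_covered = round(deposit_cents / price_per_query_cents)
print(queries_covered)  # 1000000 -- a million queries before the deposit runs out
```

So a $100 deposit at a hundredth of a cent per query covers a million queries, matching the numbers in the conversation.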

Jeremy: Right. Okay. That makes sense. So that's the payment method. So then these other protocols that get built on top of it, the Akash and Filecoin and Livepeer. So those ...

Nader: They're all operating in a very similar fashion.

Jeremy: Okay. All right. And so it's ...

Nader: They have some type of node software that's run, and people can basically run this node on some server somewhere and make it available as part of the network. And then they can use the tokens to participate. There's Filecoin for file storage. There's also IPFS, which is a completely free service, but it's also not something that's as reliable as S3 or Filecoin. And then you have, like you mentioned, Akash, which is a way to execute arbitrary code, business logic, and stuff like that. You have Ceramic Network, which is something you can use for authentication. You have Livepeer, which is something you use for live streaming. So you have all these decentralized services fitting into these different niches.

Jeremy: Right, right. Okay. So then now you've got a bunch of people. Now you mentioned this idea of, you could say, this is a good indexer. What about bad indexers? Right?

Nader: That's a really good question.

Jeremy: Yeah. You're relying on people to take data off of a public blockchain, and then you're relying on them to process it correctly and give you back good data. I'm assuming they could manipulate that data if they wanted to. I don't know why, but let's say they did. Is there a way to guarantee that you're getting the correct data?

Nader: Yeah. That's a whole part of how the system works. There's this whole, really deep rabbit hole of crypto-economics and how these protocols are structured to incentivize and also disincentivize. In our protocol you have this idea of slashing, and this is a fairly well-known and widely used mechanism in the ecosystem and in the space. Basically, you incentivize people to go out and find people that are serving incorrect data. And if someone finds a person that's serving incorrect data, then the person serving the incorrect data is, quote-unquote, slashed. That means that not only are they not going to receive the money from the queries they were serving, but they also might lose the money that they put up to be a part of the network.

I mentioned you have to actually put up money to deploy an indexer to the network; that money could also be at risk. You're very, very much financially disincentivized to do that. And there are, again, incentives in the network for people to go and find those people. It's all about incentives, game theory, and things like that.
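A toy model of the stake-and-slash incentive Nader describes, with made-up numbers: an honest indexer keeps its stake and earns query fees, while one caught serving bad data forfeits fees and loses slashed stake. The slash fraction here is illustrative; real protocols slash a protocol-defined portion.

```python
def settle(stake: float, earned_fees: float, caught_cheating: bool,
           slash_fraction: float = 1.0) -> float:
    """What the indexer walks away with under a simple slashing rule.

    slash_fraction is a made-up parameter for this sketch; real
    protocols define how much of the stake is slashed.
    """
    if caught_cheating:
        # Query fees are forfeited and the stake is slashed.
        return stake * (1 - slash_fraction)
    return stake + earned_fees

print(settle(10_000, 500, caught_cheating=False))  # 10500.0
print(settle(10_000, 500, caught_cheating=True))   # 0.0
```

The expected-value gap between the two outcomes is what makes serving correct data the rational strategy.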

Jeremy: Which makes a ton of sense. That's good to know. You mentioned, you threw out the number, five years from now, somebody might build the killer app or whatever, they'll figure out some of these things. Where are we with this though? Because this sounds really early, right? There's still things that need to be figured out. Again, it's public data on the blockchain. How do you see this evolving? When do you think Web3 will be more accessible to the masses?

Nader: Today people are actually building really, really interesting applications that fit the current technology stack; the things you can build today, people are already building. But when you think about the current state of the web, where you have something like Twitter, or Facebook or Instagram, I would say something like Facebook especially is extremely complex, with a lot of UI interaction, a lot of private data, messages and stuff. To build something like that, yeah, it's going to be a couple of years. And then you might not even see certain types of applications being built. I don't think it's going to be a thing where these types of applications no longer exist and there are only these new types. I think it's more that there's a new type of application that people are going to be building, and it's not going to be winner-takes-all, just like in all tech, in my opinion.

I wouldn't say all, but in many areas of tech you're thinking of something as a zero-sum game, and I don't think this is one. But I do think that the most interesting stuff is around how Web3 essentially enables native payments, and how people are going to use these native payments in interesting ways that maybe we haven't thought of yet. You're starting to see venture capitalists investing in a lot of these companies; if you look at a lot of the companies coming out of YC, and a lot of the new companies that these traditional venture capitalists are investing in, a lot of them are crypto companies.

When you think about the financial incentives, the things that we talked about early on, let's say you want to have the next version of YouTube and you don't want to have ads. How would that even work? Right? You still need to enable payments. But there's a couple of things that could happen there. First of all, if you're building an application in the way that I've talked about, you have these native payments or these native tokens that can be part of the whole process now, instead of waiting 10 years to do an IPO for an application that has been around for those 10 years, and then paying back all of its investors and all of those people that had been pulling money out of their pockets to take part in it.

What if someone that has a really interesting idea, and maybe a really good track record, comes out with a new application and basically says, okay, if you want to own a piece of this, we're going to create a token and you can have ownership in it. You might see people doing these ICOs, initial coin offerings, where they're offering portions of the company to anyone that wants to own it, and then incentivizing people to use those tokens to govern how the application is built in the future. Let's say I own 1% of this company and a proposal is put up to do something new. I can use that portion of my ownership to vote on things. And then people that are speculating can say, this company is doing interesting things, I'm going to buy into it, therefore driving the price up or down.

Kind of like the traditional stock market, but without all of the regulation and friction that comes with that. I think that's interesting, and you're already seeing companies doing that. Not the majority of companies or anything like that, but you are starting to see those types of things happening. And that brings around the discussion of regulation. Can you even do something like that in the United States? Well, maybe, maybe not. Does that mean people are going to start building these companies elsewhere? That's an interesting discussion as well. Right now if you want to build an application this way, you need to have some type of utility that these tokens are there for. You can't just do them purely on speculation, at least right now. But I think it's going to be interesting to watch, for sure.

Jeremy: Right. And I'm thinking too, if you're a bank, you maybe have a bunch of transactions that you want to keep private. Because again, I don't even know how we get to private transactions on the blockchain. I could see you wanting some transactions on a public blockchain and some that were private, and maybe a hybrid approach would make sense for some companies.

Nader: I think the idea that we haven't really talked about at all is identity, and how identity works compared to how we're used to it. The way we're used to identity working is, we go to a new website and we're like, this looks awesome, let me try it out. And they're like, oh wait, we need your name, your email address, your phone number, and possibly your credit card and all this other stuff. We do that over and over, and over time we've now given our personal information to 500 people. And then you start getting these emails: your data has been breached. Every week you get one of these emails, if you're someone like me. I don't know, maybe I'm just signing up for too much stuff. Maybe not every week, but every month or two. But you're giving out your personal data.

But we're used to identity being tied to our own physical name and address and things like that. What if identity was something more abstract? I think that's the way you typically see identity managed in Web3. When you're dealing with authentication mechanisms, one of the most interesting parts of this whole discussion is the idea of a single sign-on mechanism where you own your identity and you can take it across all your applications, and no one else is in control of it. When you use something like an Ethereum wallet, like MetaMask, for example, it's an extension you can just download and put crypto in and make payments on the web with. When you create a wallet, you're given a wallet address. And the wallet address is created using public key cryptography: you start with a private key, your public key is derived from the private key, and then your address is derived from the public key.

And when you send a transaction, you sign the transaction with your private key and you send your public key along with the transaction, and the person that receives it can use the public key to verify that that's who signed the transaction. Using this public key cryptography, only you can sign with your own address and your own password. It's all stored on the blockchain or in some decentralized manner; actually in this case stored on the blockchain, or it depends on how you use it, I guess. But anyway, the whole idea here is that you completely own your identity. If you never decide to associate that identity with your name and your phone number, then who knows who's sending these transactions, because why would you need to associate your own name and phone number with these types of things, in these situations where you're making payments and stuff like that? Right?
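The derivation chain described here (private key → public key → address) can be sketched in a toy form. To be clear about the stand-ins: Ethereum actually uses secp256k1 elliptic-curve multiplication for the public key and Keccak-256 for the address hash; Python's standard library has neither, so this sketch substitutes a plain hash for the curve step and NIST sha3_256 for Keccak-256, just to show that each step is one-way and deterministic.

```python
import secrets
import hashlib

# Toy derivation chain -- NOT real Ethereum key derivation.
# Real chain: pubkey = secp256k1 point multiplication of the private key;
# address = last 20 bytes of the Keccak-256 hash of the public key.
def toy_public_key(private_key: bytes) -> bytes:
    # Stand-in for elliptic-curve point multiplication.
    return hashlib.sha3_256(b"pub:" + private_key).digest()

def toy_address(public_key: bytes) -> str:
    # Last 20 bytes of the hash, hex-encoded with the usual 0x prefix.
    return "0x" + hashlib.sha3_256(public_key).digest()[-20:].hex()

private_key = secrets.token_bytes(32)   # only the owner ever holds this
address = toy_address(toy_public_key(private_key))
print(address)  # deterministic from the key, but not reversible
```

The property that matters for the identity discussion is exactly this asymmetry: anyone can check that an address and signature belong together, but nothing in the address reveals the private key or any real-world name.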

What is the idea of a user profile anyway, and why do you actually need it? Well, you might need it on certain applications. You might need it or want it on a social network, or maybe not, or you might come up with a pseudonym, because maybe you don't want to associate yourself with whatever. You might want to in other cases, but that's completely up to you, and you can have multiple wallet addresses. You might have a public wallet address that you associate your name with, that you're using on social media. You might have a private wallet address that you're never associating with your name, that you're using for financial transactions. It's completely up to you, but no one can change that information. One of the applications that I recently built was called Decentralized Identity. I built it and released it a few days ago.

It's an implementation of this, and it's using some of these Web3 technologies. One of them is IDX. One of them is Ceramic, which is a decentralized protocol similar to the Graph but for identity. And it's using something called DIDs, decentralized identifiers, which are a way to have a completely unique ID based on your address. And you own the control over that. You can go in and make updates to that profile, and then any application across the web that you choose to use can access that information. You're only dealing with it stored in one place. You have full control over it; at any time you can go in and delete it or change it. No one has control over it except for you.
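For context on the DIDs mentioned here: the W3C standardizes them with a `did:<method>:<method-specific-id>` syntax. A minimal parser for that shape is sketched below; the regex is a simplified approximation of the full grammar, and the example identifier is made up.

```python
import re

# Simplified W3C DID shape: did:<method>:<method-specific-id>.
# The real ABNF allows a richer id charset; this is a close approximation.
DID_PATTERN = re.compile(r"^did:([a-z0-9]+):([A-Za-z0-9.\-_:%]+)$")

def parse_did(did: str) -> tuple:
    """Split a DID into (method, method-specific id), or raise ValueError."""
    match = DID_PATTERN.match(did)
    if not match:
        raise ValueError(f"not a valid DID: {did!r}")
    return (match.group(1), match.group(2))

method, ident = parse_did("did:example:123456789abcdefghi")
print(method, ident)  # example 123456789abcdefghi
```

The method part (`example` above) names which network or protocol resolves the identifier, which is how one ID scheme can span many independent systems.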

The idea of identity is a mind-bending thing in this space because I think we're so used to just handing everybody our real names and our real phone numbers and all of our personal information and just having our fingers crossed, that we're just not used to anything else.

Jeremy: It's all super interesting. You mentioned earlier, would it be legal in the United States? I'm thinking of all these recent ransomware attacks; I think they were able to trace some Bitcoin transactions back to the individual group that accepted the payment. It opens up a whole can of worms. I love this idea of being anonymous and not being tracked, but then it's also like, what could bad actors do with anonymous financial transactions and things like that? So ...

Nader: There has kind of been an anonymous transaction layer for a long time: cash. You can't really do a lot of illegal stuff these days without cash. So should we get rid of cash? I think with any technology ...

Jeremy: No, but there's a limit though, right? You can't withdraw more than $10,000 worth of cash without it being flagged to the FBI, and you can't deposit more, you know what I mean?

Nader: You can't take a million dollars worth of Bitcoin that you've gotten from ransomware and turn it into cash either.

Jeremy: That's also true. Right.

Nader: Because it's all tracked on the blockchain; that's probably how they caught those people. Right? They somehow had their personal information tied to a transaction, because if you follow these transactions long enough, you're going to find some origination point. I agree, though. There are definitely trade-offs with everything; I don't think I'm ever the type to argue that there aren't. There are good things and there are bad things. You have to look at the whole picture and decide for yourself what you think. I'm the type that's like, let's lay out all of the ideas and let the market decide.

Jeremy: Right. Yeah. I totally agree with that. All this stuff is fascinating, there is way too much more for me to learn at this point. I think my brain is filled at this point. Anything else about Edge & Node? Any cool things you're working on there or anything you want people to know?

Nader: We're working on a couple of different projects. I can't really talk about some of them because they're not released yet, but we are working on a new version of something called Everest, and Everest is already out. If you want to check it out, it's at everest.link. It's basically a repository of a bunch of different applications that have already been built in the Web3 ecosystem. It also ties in a lot of the stuff that we talked about, like identity. You can sign in with your Ethereum wallet and interact with different applications, but you can also just see the types of stuff people are building. It's categorized into games, financial apps, and so on. If you've listened to this and you're like, this sounds cool, but are people actually building stuff? This is a place to see hundreds of apps that people have already built, that are out there and successful.

Jeremy: Awesome. All right. Well, listen, Nader, this was awesome. Thank you so much for sharing this with me. I know I learned a ton. I hope the listeners learned a ton. If people want to learn more about this or just follow you and keep up with what you're doing, what's the best way to do that?

Nader: I would say check out Twitter, we're on Twitter @dabit3 for me, @edgeandnode for Edge & Node, and of course @graphprotocol for Graph protocol.

Jeremy: Okay. And then edgeandnode.com. Your YouTube channel is youtube.com/naderdabit, N-A-D-E-R D-A-B-I-T. And then you had an article on Web3 that I'll put in the show notes.

Nader: Yeah, put it in the show notes. It's for freeCodeCamp, and it's called What is Web3. It's really a condensed version of a lot of the stuff we talked about, and it maybe goes into a little bit more depth around native payments and how people might build companies in the way that we've talked about here.

Jeremy: Awesome. All right. Well, I will get all that stuff into the show notes. Thanks again, Nader.

Nader: Thanks for having me. It was good to talk.

2021-06-21

Episode #105: Building a Serverless Banking Platform with Patrick Strzelec

About Patrick Strzelec

Patrick Strzelec is a fullstack developer with a focus on building GraphQL gateways and serverless microservices. He is currently working as a technical lead at NorthOne making banking effortless for small businesses.

LinkedIn: Patrick Strzelec
NorthOne Careers: www.northone.com/about/careers


Watch this episode on YouTube: https://youtu.be/8W6lRc03QNU  

This episode sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Patrick Strzelec. Hey, Patrick, thanks for joining me.

Patrick: Hey, thanks for having me.

Jeremy: You are a lead developer at NorthOne. I'd love it if you could tell the listeners a little bit about yourself, your background, and what NorthOne does.

Patrick: Yeah, totally. I'm a lead developer here at NorthOne, I've been focusing on building out our GraphQL gateway here, as well as some of our serverless microservices. What NorthOne does, we are a banking experience for small businesses. Effectively, we are a deposit account, with many integrations that act almost like an operating system for small businesses. Basically, we choose the best partners we can to do things like check deposits, just your regular transactions you would do, as well as any insights, and the use cases will grow. I'd like to call us a very tailored banking experience for small businesses.

Jeremy: Very nice. The thing that is fascinating, I think about this, is that you have just completely embraced serverless, right?

Patrick: Yeah, totally. We started off early on with this vision of being fully event driven, and we started off with a monolith, like a Python Django big monolith, and we've been experimenting with serverless all the way through, and somewhere along the journey, we decided this is the tool for us, and it just totally made sense on the business side, on the tech side. It's been absolutely great.

Jeremy: Let's talk about that because this is one of those things where I think you get a business and a business that's a banking platform. You're handling some serious transactions here. You've got a lot of transactions that are going through, and you've totally embraced this. I'd love to have you take the listeners through why you thought it was a good idea, what were the business cases for it? Then we can talk a little bit about the adoption process, and then I know there's a whole bunch of stuff that you did with event driven stuff, which is absolutely fascinating.

Then we could probably follow up with maybe a couple of challenges, and some of the issues you faced. Why don't we start there. I am always fascinated to know who in an organization says, "Hey, we absolutely need to do serverless," and just starts beating that drum. What was the business and technical case that made your organization swallow that pill?

Patrick: Yeah, totally. I think just at a high level we're a user experience company; we want to make sure we offer small businesses the best banking experience possible. We don't want to spend a lot of time on operations, and reliability is incredibly important. If we can offload that burden and move faster, that's what we need to do. When we're talking about who's beating that drum, I would say our VP, Blake, really early on seemed to see serverless as this amazing fit. I joined about three years ago today, so I guess this is my anniversary at the company. We were just deciding what to build. At the time there were a lot of architecture diagrams, and Blake hypothesized that serverless was a great fit.

We had a lot of versions of the world, some with Apache Kafka and a bunch of microservices going through there. There were other versions with serverless in the mix and some of the tooling around that, and this other hypothesis that maybe we want a GraphQL gateway in the middle of there. It was one of those things where we wanted to test our hypotheses as we go. That ties into this innovation velocity that serverless allows for. It's very cheap to put a new piece of infrastructure up in serverless. Just the other day we wanted to test Kinesis for an event streaming use case, and that was just a half an hour to set up that config, and you could put it live in production and test it out, which is completely awesome.

I think that innovation velocity was the hypothesis. We could just try things out really quickly. They don't cost much at all. You only pay for what you use, for the most part. We were able to try that out, as well as reliability. AWS really does a good job of making sure everything's available all the time, something that maybe a young startup isn't ready to take on. When I joined the company, Blake proposed, "Okay, let's try out GraphQL as a gateway, as a concept. Build me a prototype." In that prototype, there was a really good opportunity to try serverless. Apollo Server had just launched the serverless package, which was super easy to deploy.

It was a complete no-brainer. We tried it out, we built the case. We just started with this GraphQL gateway running on serverless, on AWS Lambda. It's funny because at first we were just using it in development; nobody's going to be hitting our services. It was still a year out from when we were going into production. Once we went into prod, this Lambda's hot all the time, which is interesting. I think the cost case breaks down there, because you're running this thing basically forever. But it was this GraphQL server in front of our Python Django monolith, with this vision of event driven microservices, which fits well for banking. If you just think about the banking world, everything is pretty much eventually consistent.

Just, that's the way the systems are designed. You send out a transaction, it doesn't settle for a while. We were always going to do event-driven, but when you're starting out with a team of three developers, you're not going to build this whole microservices environment and everything. We started with that monolith with the GraphQL gateway in front, which scaled pretty nicely, because even today we have the same GraphQL gateway. We just changed the services backing it, which was really sweet. The adoption process was like, let's try it out. We tried it out with GraphQL first, and then as we were heading into launch, we had this monolith that we needed to manage. I mean, manually managing AWS resources is easier than back in the day when you were managing your own virtual machines and stuff, but it's still not great.

We didn't have a lot of time, and there were a lot of last-minute changes we needed to make. A big refactor to our scheduled transactions functions happened right before launch. That was an amazing serverless use case. And there's our second one, where we're like, "Okay, we need to get this live really quickly." We created this worker pattern really quickly as a test with serverless, and it worked beautifully. We also had another use case come up, which was just a simple phone scheduling service. We just wrapped an API and exposed some endpoints, but it was just a lot easier to do with serverless. Just threw it off to two developers to figure out how to do it, and it was ready to be live. And then ...

Jeremy: I'm sorry to interrupt you, but I want to get to this point, because you're talking about standing up infrastructure, using infrastructure as code, or the tools you're using. How many developers were working on this thing?

Patrick: I think at the time, maybe four developers on backend functionality, before launch, when we were just starting out.

Jeremy: But you're building a banking platform here, so this is pretty sophisticated. I can imagine another business case for serverless is just the sense that we don't have to hire an operations team.

Patrick: Yeah, exactly. We were well past launch. I think we'd been live for a couple of months before we hired our first DevOps engineer, which is incredible. Our VP took on a lot of that too; I'm sure his hands were a little dirtier than he'd like early on. But it was just amazing. We were able to manage all that infrastructure, and scale was never a concern. In the early stages, maybe it shouldn't be just yet, but it was just really, really easy.

Jeremy: Now you started with four, and I think, what are you now? Somewhere around 25 developers? Somewhere in that space now?

Patrick: About 25 developers now, and we're growing really fast. We doubled this year during COVID, which is just crazy to think about, and somehow we've been scaling somewhat smoothly, at least in terms of our output as a dev team. We'll probably double again this year. This is maybe where I shamelessly plug that we're hiring, and we always are, and you could visit northone.com and just check out the careers page, or just hit me up for a warm intro. It's been crazy, and that's one of the things that serverless has helped us with too. We haven't had this scaling bottleneck, which is an operations team. We don't need to hire X operations people for a certain number of developers.

Onboarding has been easier. There was one example during a major project where we hired a developer. He was new to serverless, but a very experienced developer, and he had a production-ready serverless service ready in a month, which was just an insane ramp-up time. I haven't seen that very often. He didn't have to talk to any of our operations staff, and we'd already used serverless long enough that we had all of our presets and boilerplates ready, and permissions locked down, so it was just super easy. It's super empowering for him to be able to just play around with the different services. Because we hit that point where we've invested enough that every developer, when they open a branch, that branch deploys its own stage, which has all of the services and AWS infrastructure deployed.

You might have a PR open that launches an instance of Kinesis, and five SQS queues, and 10 Lambdas, and a bunch of other things, and then tears down almost immediately, and the cost isn't something we really worry about. The innovation velocity there has been really, really good. Just being able to try things out. If you're thinking about something like Kinesis, where it's like a Kafka, that's my understanding, and if you think about the organizational buy-in you need for something like Kafka, because you need to support it, come up with opinions, and all this other stuff, you'll spend weeks trying it out. But for one of our developers, it's like, this seems great.

We're streaming events, we want this to be real-time. Let's just try it out. This was for our analytics use case, and it's live in production now. It seems to be doing the thing, and we're testing out that use case, and there isn't that roadblock. We could always switch to a different design if we want. The experimentation piece there has been awesome. During major projects we've changed the way we've thought about our resources a few times, and in the end it works out, and often it is about resiliency. It's just jamming queues into places we didn't think about in the first place, but that's been awesome.

Jeremy: I'm curious about that, though, with 25 developers ... Kinesis for the most part works pretty well, but you do have to watch those iterator ages and make sure that they're not backing up, or that you're losing events if they get flooded or whatever. And also, sticking queues everywhere sounds like a really good idea, and I'm a big fan of that, but it also means there's a lot of queues you have to manage, and watch, and set alarms on, and all that kind of stuff. Then you also talked about what sounds like a pretty great CI/CD process to spin up new branches and things like that. There's a lot of DevOps-y work that is still there. How are you handling that now? Do you have dedicated ops people, or do you just have your developers looking after that piece of it?

Patrick: I would say we have a very spirited group of developers who are inspired. We do a lot of our code-sharing via internal packages. A few of our developers just figured out some of the patterns that we need, whether it's CI, or how we structure our event stores, or how we do our queue subscriptions. We manage these internal packages. This won't scale well, by the way. This is just us being inspired and trying to reduce some of this burden. It is interesting, I've listened to this podcast and a few others, and this idea of infrastructure as code being part of every developer's toolbox, it's starting to really resonate with our team.

In our migration, or our swift shift to, I'd say, doing serverless properly, we've learned to really think in it. Think in terms of infrastructure when creating solutions. Not saying we're doing serverless the right way now, but we certainly did it the wrong way in the past, where we would spin up a bunch of API Gateways that would talk to each other. A lot of REST calls going around the spider web of communication. Also, I'll call these monster Lambdas, that have a whole procedure list that they need to get through, and a lot of points of failure. When we were thinking about the way we're going to do Lambda now, we try to keep one Lambda doing one thing, and then there are pieces of infrastructure stitching that together. EventBridge between domain boundaries, SQS for commands where we can, instead of using API Gateway. I think that transitions pretty well into our big break. I'm talking about this as our migration to serverless. I want to talk more about that.

Jeremy: Before we jump into that, I just want to ask this question, because, again, some people call them fat Lambdas; I call them Lambda lifts. I think there's Lambda lifts, then fat Lambdas, then your single-purpose functions. It's interesting, again, moving towards that direction, and I think it's super important just admitting, we were definitely doing this wrong. Because I think so many companies find that adopting serverless is very much an evolution, and it's a learning thing where the teams have to figure out what works for them, and in some cases discover best practices on their own. That you've gone through that process, I think, is great, so definitely kudos to you for that.

Before we get into that adoption and the migration or the evolution process that you went through to get to where you are now, one other business or technical case for serverless, especially with something as complex as banking: I still don't understand why I can't transfer money from my personal TD Bank account to my wife's local checking account, why that's so hard to do. But it seems like there's a lot of steps. Steps that have to work. You can't get halfway through five steps in some transaction and then be like, oops, we can't go any further. You've got to roll that back and things like that. I would imagine orchestration is a huge piece of this as well.

Patrick: Yeah, 100%. Banking lends itself really well to these workflows, I'll call them. If you're thinking about even just the start of any banking process, there's this whole application process where you put in all your personal information, you send off a request to your bank, and then there's this whole waterfall of things that needs to happen. All kinds of checks, and making sure people aren't on any fraud lists or money laundering lists, or even just getting a second look from our compliance department. There are a lot of steps there, and even just keeping our own systems in sync with our off-provider and other places. We definitely lean on using Step Functions a lot. I think they work really, really well for our use case. Just the visual, being able to see this is where a customer is in their onboarding journey, is very, very powerful.

Being able to restart at any point of their journey, or even just giving our compliance team a view into that process, or even adding a pause portion. I think that's one of the biggest wins there: we could process somebody through any one of our pipelines, and we may need a human eye there, at least for this point in time. That's one of the interesting things about the banking industry. There are still manual processes behind the scenes, and there are, I find this term funny, but there are wire rooms in banks where there are people reviewing things and all that. There are a lot of workflows that just lend themselves well to Step Functions. That pausing capability, and being able to return later with a response, allows you to build other internal applications for your compliance teams and other teams that, behind the scenes, call back and say, "Okay, resume this waterfall."
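The pause-and-resume idea described here maps onto Step Functions' task-token integration. Below is a minimal sketch of a state machine definition (Amazon States Language expressed as a TypeScript object); the state names, queue URL, and Lambda steps are illustrative, not NorthOne's actual workflow.

```typescript
// Sketch of a Step Functions workflow with a human-review pause.
// The ".waitForTaskToken" suffix makes the execution stop at that state
// until something calls SendTaskSuccess/SendTaskFailure with the token.

const onboardingStateMachine = {
  StartAt: "RunFraudChecks",
  States: {
    RunFraudChecks: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke", // illustrative Lambda step
      Next: "AwaitComplianceReview",
    },
    AwaitComplianceReview: {
      Type: "Task",
      // Send the task token to a queue a compliance-review app consumes;
      // the app echoes the token back to resume the waterfall.
      Resource: "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      Parameters: {
        QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/ComplianceReviewQueue", // placeholder
        "MessageBody.$": "$$.Task.Token",
      },
      Next: "ApproveApplication",
    },
    ApproveApplication: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke", // illustrative Lambda step
      End: true,
    },
  },
};

console.log(onboardingStateMachine.States.AwaitComplianceReview.Resource);
```

The internal compliance tool then only needs the token to call `SendTaskSuccess`, which is what makes the "human eye at this point in time" step possible without polling.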

I think that was the visualization, especially in an events world when you're talking about sagas, I guess. We're talking about distributed transactions here in a way, where there's a lot of things happening, and a common pattern now is the saga pattern. You probably don't want to be doing two-phase commits and all this other stuff, but when we're looking at sagas, it's the orchestration you could do, or the choreography. Choreography gets very messy because there's a lot of implicit behavior. I'm a service and I know what I need to do when these events come through, and I know which compensating events I need to dump, and all this other stuff. But now there's a very limited view.

If a developer is trying to gain context in a certain domain and understand the chain of events, although you are decoupled, there's still this extra coupling now, having to understand what's going on in your system, and being able to share it with external stakeholders. Using Step Functions is, I guess, the serverless way of doing orchestration. Just being able to share that view. We had this process where we needed to move a lot of accounts, or a lot of user data, to a different system. We were able to just use an orchestrator there as well, just to keep an eye on everything that's going on.

We might be paused in migrating, but let's say we're moving over contacts, a transaction list, and one other thing; you could visualize which one of those is in the red, and which one we need to come in and fix, and also share that progress with external stakeholders. Also, it makes for fun launch parties, I'd say. It's kind of funny, because when developers do their job, you press a button and everything launches, and there's not really anything to share or show.

Jeremy: There's no balloons or anything like that.

Patrick: Yeah. But it was kind of cool to look at these: the customer is going through this branch of the logic, and I know it's all green. Then I think one of the coolest things was just the retryability as well. When somebody does fail, or when one of these workflows fails, you could see exactly which step, you can see the logs, and all that. I think one of the challenges we ran into there, though, was because we are working in the banking space, we're dealing with sensitive data. Something I almost wish AWS solved out of the box would be being able to obfuscate some of that data. Maybe you can, I'm not sure, but we had to think of patterns for tokenization, for instance.

Stripe does this a lot, where in certain parts of their platform you put in personal information and you get back a token, and you use that reference everywhere. We do tokenization, and we also limit the amount of details flowing through steps in our orchestrators. We'll use an event store with identifiers flowing through, and we'll do reads back to that event store in between steps, to do what we need to do. You lose some of that debuggability, you can't see exactly what information is flowing through, but we need to keep user data safe.
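The tokenization pattern described here can be sketched in a few lines: swap a sensitive value for an opaque token before it enters the workflow, and read it back ("rehydrate") from a store between steps. This is a minimal illustration; in production the store would be something like DynamoDB or a vault service rather than an in-memory Map, and the `tok_` prefix is an invented convention.

```typescript
// Minimal sketch of tokenization: only the token flows through the
// state machine; steps that need the real value look it up.
import { randomUUID } from "crypto";

const tokenStore = new Map<string, string>(); // stand-in for a secure store

function tokenize(sensitiveValue: string): string {
  const token = `tok_${randomUUID()}`;
  tokenStore.set(token, sensitiveValue);
  return token; // this opaque reference is what the orchestrator sees
}

function detokenize(token: string): string | undefined {
  // A step "rehydrates" the real value only when it actually needs it.
  return tokenStore.get(token);
}

const ssnToken = tokenize("123-45-6789");
console.log(ssnToken.startsWith("tok_")); // true
console.log(detokenize(ssnToken));        // "123-45-6789"
```

The trade-off Patrick mentions falls out directly: the execution history only ever shows `tok_…` references, which protects the data but hides it from debugging too.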

Jeremy: Because it's the use case for it. I think that you mentioned a good point about orchestration versus choreography, and I'm a big fan of choreography when it makes sense. But I think one of the hardest lessons you learn when you start building distributed systems is knowing when to use choreography, and knowing when to use orchestration. Certainly in banking, orchestration is super important. Again, with those saga patterns built-in, that's the kind of thing where you can get to a point in the process and you don't even need to do automated rollbacks. You can get to a failure state, and then from there, that can be a pause, and then you can essentially kick off the unwinding of those things and do some of that.

I love that idea that the token pattern and using just rehydrating certain steps where you need to. I think that makes a ton of sense. All right. Let's move on to the adoption and the migration process, because I know this is something that really excites you and it should because it is cool. I always know, as you're building out applications and you start to add more capabilities and more functionality and start really embracing serverless as a methodology, then it can get really exciting. Let's take a step back. You had a champion in your organization that was beating the drum like, "Let's try this. This is going to make a lot of sense." You build an Apollo Lambda or a Lambda running Apollo server on it, and you are using that as a strangler pattern, routing all your stuff through now to your backend. What happens next?

Patrick: I would say when we needed to build new features, developers just gravitated towards using serverless. It was just easier. We were using TypeScript instead of Python, which we just tend to like as an organization, so it's just easier to hop into TypeScript land, but I think it was just easier to get something live. Now we had all these Lambdas popping up and doing their job, but I think the problem was we weren't using them properly. Also, there was a lot of difference between each of our serverless setups. We would learn each time and we'd be like, okay, we'll use this parser function here to simplify some of it, because it is very bare-bones if you're just pulling in the Serverless Framework, and it took a little ...

Every service looked very different, I would say. Also, we never really took the time to sit back and say, "Okay, how do we think about this? How do we use what serverless gives us to enable us, instead of it just being an easy thing to spin up?" I think that's where it started. It was just easy to start, but we didn't embrace it fully. I remember having a conversation at some point with our VP, being like, "Hey, how about we just put Express into one of our Lambdas?" Now I know it's a Lambda lift. It was just easier. Everybody knows how to use Express, why don't we just do this? Why are we writing our own parsers for all these things? We had 10 versions of a make-response helper function that was copy-pasted between repos, and we didn't really have a good pattern for sharing that code yet in private packages.

We realized that we liked serverless, but we realized we needed to do it better. We started with having a serverless chapter with some of our team members, and we made some moves there. We created a shared boilerplate at some point, so it reduced some of the differences you'd see between some of the repositories, but we needed a step-change difference in our thinking, when I look back, and we got lucky that the opportunity came up. At this point, we probably had another six Lambda services, maybe more actually. I'd say we probably had around 15 services at this point, without a governing body around patterns.

At this time, we had this interesting opportunity where we found out we were going to be re-platforming. A big announcement we just made last month was that we moved to a new bank partner called Bancorp, the bank partner that supports Chime. They're, I'll call it, an engine boost. We put in a much larger, more efficient engine for our small businesses. If you just look at the capabilities they provide, they're just absolutely amazing. It's what we need to build forward. Their events API is amazing, as well as just their base banking capabilities; the unit economics they can offer, the timelines, things were just better. We found out we were doing an engine swap. The people on the business side of our company trusted our technical team to do what we needed to do.

Obviously, we needed to put together a case, but they trusted us to choose our technology, which was awesome. I think we just had a really good track record of delivering, so we had free rein to decide what to do. But the timeline was tight, so what we decided to do, and this was COVID times too, was a few of our developers got COVID tested, and we rented a house and did a bubble situation, like how in the NHL or NBA you have a bubble. We had a dev bubble.

Jeremy: The all-star team.

Patrick: The all-star team, yeah. We decided, let's sit down, let's figure out what patterns are going to take us forward. How do we make the step-change in our technology stack at the same time as we're swapping out this bank, this engine, essentially, for the business? In this house, we watched almost every YouTube video you can imagine on event-driven and serverless. Leading up, just knowing that we were going to be doing this, I think all of us independently started prototyping, and watching videos, and reading a lot of your content, and Alex DeBrie and Yan Cui. We all had a lot of ideas already going in.

When we all got to this house, we started off with an event storming exercise, popular in the domain-driven design community, where we just threw down our entire business on a wall with sticky notes. It would have been better to have every business stakeholder there, but luckily we had two people from our product team there as representatives. That's how invested we were in building this right, that we had product sitting in the room with us to figure it out.

We slapped down our entire business on a wall, this took days, and then drew circles around it and iterated on that for a while. Then we started looking at what the technology looks like. What are our domain boundaries, and what prototypes do we need to make? For a few weeks there, we were just prototyping. We built out what I'd call baby's first balance. That was the running joke: how do we get an account opened with a balance, with transactions, minimally, with some new patterns? We really embraced some of this domain-driven design thinking, as well as just event-driven thinking. When we were rethinking architecture, three concepts became very important for us, not entirely new, but important. Idempotency was a big one, dealing with distributed transactions was another one of those, as well as eventual consistency. The eventual consistency portion is kind of funny because we were already doing it a lot.

Our transactions wouldn't always settle very quickly. We didn't think about it, but now our whole system becomes eventually consistent, which typically happens once you divide all of your architecture across domains and decouple everything. We created some early prototypes. We created our own version of an event store, which is, I would just say, an opinionated schema around DynamoDB, where we keep track of revisions, payload, timestamp, all the things you'd want to be able to do event sourcing. That's another thing we decided on. Event sourcing seemed like the right approach for state, for a lot of our use cases. Banking, if you just think about a banking ledger, it is events. An accounting ledger, you're just adding up rows: add, subtract, add, subtract.
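The "a ledger is just events" point can be shown with a tiny reducer: fold credit/debit events into a balance. A sketch only; the event names and cent-based amounts are invented for illustration, not NorthOne's actual schema.

```typescript
// Event-sourced balance: the current state is a fold over the event history.
type LedgerEvent =
  | { type: "TransactionCredited"; amountCents: number }
  | { type: "TransactionDebited"; amountCents: number };

function balanceReducer(balanceCents: number, event: LedgerEvent): number {
  switch (event.type) {
    case "TransactionCredited":
      return balanceCents + event.amountCents; // add a row
    case "TransactionDebited":
      return balanceCents - event.amountCents; // subtract a row
  }
}

const events: LedgerEvent[] = [
  { type: "TransactionCredited", amountCents: 10_000 },
  { type: "TransactionDebited", amountCents: 2_500 },
];

const balanceCents = events.reduce(balanceReducer, 0);
console.log(balanceCents); // 7500
```

Because the balance is derived, not stored, "restating" an account is just replaying its events, which is exactly what makes the eventual-consistency model tolerable here.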

We created a lot of prototypes for these things. Our event store pattern became basically just DynamoDB with opinions around the schema, as well as a shared code package with simple dispatch functions: one dispatch function that really looks at enforcing optimistic concurrency, and one that's a little bit more relaxed. Then we also had some reducer functions built into there. That was one of the packages that we created. Another prototype around that was, how do we create the actual subscriptions to this event store? We landed on SNS to SQS fan-out, and it seems like fan-out first is the serverless way of doing a lot of things. We learned that along the way, and it makes sense. It was one of those things we read about in a lot of these blogs and YouTube videos, and it really made sense in production, when all the data is streaming from one place, and now you just add subscribers all over the place. Just new queues. Fan-out first, highly recommend. We just landed there by following best practices.
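The optimistic-concurrency dispatch function described here can be sketched as an append that fails if another writer has already advanced the stream's revision. This is an in-memory stand-in, with invented names; on DynamoDB the same check would be a conditional write on the (streamId, revision) key.

```typescript
// Sketch of an event-store dispatch with optimistic concurrency:
// the caller says which revision it expects; a mismatch means a conflict.
type StoredEvent = { revision: number; type: string; payload: unknown; timestamp: number };

const streams = new Map<string, StoredEvent[]>(); // stand-in for DynamoDB

function dispatch(streamId: string, expectedRevision: number, type: string, payload: unknown): StoredEvent {
  const stream = streams.get(streamId) ?? [];
  const currentRevision = stream.length;
  if (currentRevision !== expectedRevision) {
    // DynamoDB equivalent: a ConditionExpression that the revision item
    // doesn't exist yet, so a concurrent writer makes the put fail.
    throw new Error(`Concurrency conflict: expected revision ${expectedRevision}, stream is at ${currentRevision}`);
  }
  const event: StoredEvent = { revision: currentRevision + 1, type, payload, timestamp: Date.now() };
  streams.set(streamId, [...stream, event]);
  return event;
}

dispatch("account-123", 0, "AccountOpened", {});
dispatch("account-123", 1, "TransactionCredited", { amountCents: 10_000 });

try {
  // A writer working from a stale revision gets rejected instead of
  // silently clobbering history.
  dispatch("account-123", 1, "TransactionDebited", { amountCents: 500 });
} catch (err) {
  console.log((err as Error).message);
}
```

A "more relaxed" variant, as mentioned above, would simply skip the revision check and always append at the tail.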

Jeremy: Great. You mentioned a bunch of different things in there, which is awesome, but so you get together in this house, you come up with all the events, you do this event storming session, which is always a great exercise. You get a pretty good visualization of how the business is going to run from an event standpoint. Then you start building out this event driven architecture, and you mentioned some packages that you built, we talked about step functions and the orchestration piece of this. Just give me a quick overview of the actual system itself. You said it's backed by DynamoDB, but then you have a bunch of packages that run in between there, and then there's a whole bunch of queues, and then you're using some custom packages. I think I already said that but you're using ... are you using EventBridge in there? What's some of the architecture behind all that?

Patrick: Really, really good question. Once we created these domain boundaries, we needed to figure out how we communicate between domains and within domains. We landed on really differentiating milestone events and domain events. Milestone events, in other terms, might be called integration events, but the idea is that these are key business milestones. An account was opened, an application was approved or rejected, things that every domain may need to know about. Then within our domains, or domain boundaries, we have these domain events, which might reduce to a milestone event, and we can maintain those contracts in the future and change those up. We needed to think about how we message all these things across. How do we communicate? We landed on EventBridge for our milestone events. We have one event bus that we talk to between domain boundaries, basically.

EventBridge there, and then each of our services now subscribes to that EventBridge and maintains its own event store. That's backed by DynamoDB. Each of our services has its own data store. It's usually an event stream or a projection database, but it's almost all Dynamo, which is interesting because our old platform used Postgres, and we did have relational data. It was interesting. I was really scared at first: how are we going to maintain relations and things? It became a non-issue, and I don't even know why, now that I think about it. Every service maintains its nice projection through events and builds its own view of the world, which brings its own problems. We have DynamoDB in there, and then SNS to SQS fan-out. Then when we're talking about packages ...
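The milestone-versus-domain split described here can be sketched as a type boundary: internal domain events stay private, and only a few of them "reduce" to milestone (integration) events published on the shared bus. All event names and shapes below are invented for illustration.

```typescript
// Domain events: internal to one bounded context (e.g. onboarding).
type DomainEvent =
  | { type: "ApplicationSubmitted"; applicationId: string }
  | { type: "FraudCheckPassed"; applicationId: string }
  | { type: "ComplianceApproved"; applicationId: string };

// Milestone (integration) event: the cross-domain contract, shaped like an
// EventBridge entry (source / detail-type / detail).
type MilestoneEvent = {
  source: "onboarding";
  detailType: "AccountApproved";
  detail: { applicationId: string };
};

// Only the final internal event crosses the domain boundary; the rest
// never leave the domain, so their shapes can change freely.
function toMilestone(event: DomainEvent): MilestoneEvent | null {
  if (event.type === "ComplianceApproved") {
    return {
      source: "onboarding",
      detailType: "AccountApproved",
      detail: { applicationId: event.applicationId },
    };
  }
  return null;
}

console.log(toMilestone({ type: "FraudCheckPassed", applicationId: "a1" }));                 // null
console.log(toMilestone({ type: "ComplianceApproved", applicationId: "a1" })?.detailType);   // "AccountApproved"
```

The payoff is the contract maintenance Patrick mentions: internal events can be renamed or reshaped without breaking other domains, as long as the milestone mapping stays stable.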

Jeremy: That's DynamoDB Streams?

Patrick: Exactly, yeah. It's Dynamo Streams to SNS, to SQS. Then we use shared code packages to make those subscriptions very easy. If you're looking at doing that SNS to SQS fan-out, or just creating SQS queues, there is a lot of CloudFormation boilerplate that we were creating, and we needed to move really quickly on this project. We got pretty opinionated quickly, and we created our own subscription function that just generates all this CloudFormation with naming conventions, which was nice. I think the opinions were good, because early on we weren't opinionated enough, I would say. You'd look in your AWS dashboard and the resources aren't prefixed correctly, and there's all this garbage. Now you're able to have consistent naming throughout, and make it really easy to subscribe to an event.

We would publish packages to help with certain things. Our event store package was one of those. We also created a Lambda handlers package, which leverages a Lambda middleware compose package that's out there, which is quite nice, and all the common functionality we were doing a lot of, like parsing a body from S3, or SQS, or API Gateway, that's just middleware that we now publish. Validation in and out. We highly recommend the library Zod; we really embrace the TypeScript-first object validation. Really, really cool package. We created all these middlewares, and then subscription packages. We have a lot of shared code in this internal NPM repository that we install across services.
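The middleware-compose idea can be sketched with plain functions: wrap a business handler in small layers that parse and validate before it runs. This is a hand-rolled stand-in for the compose-package-plus-Zod setup described above, with invented middleware names.

```typescript
// Minimal middleware composition for a Lambda-style handler.
type Handler = (event: any) => any;
type Middleware = (next: Handler) => Handler;

// reduceRight wraps the handler so the first-listed middleware runs outermost.
const compose = (...middlewares: Middleware[]) => (handler: Handler): Handler =>
  middlewares.reduceRight((next, mw) => mw(next), handler);

// Parse an API Gateway-style JSON string body.
const parseJsonBody: Middleware = (next) => (event) =>
  next({ ...event, body: JSON.parse(event.body) });

// Tiny stand-in for Zod-style input validation.
const requireField = (field: string): Middleware => (next) => (event) => {
  if (!(field in event.body)) throw new Error(`Missing field: ${field}`);
  return next(event);
};

const handler = compose(parseJsonBody, requireField("accountId"))(
  (event) => ({ statusCode: 200, accountId: event.body.accountId })
);

console.log(handler({ body: '{"accountId":"acct-1"}' }).accountId); // "acct-1"
```

Publishing `parseJsonBody`-style middlewares once, instead of copy-pasting a make-response helper into every repo, is exactly the consolidation Patrick is describing.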

I think one challenge we had there was, eventually you abstract away too much from the CloudFormation, and it's hard for new developers. It's easy for them to create event subscriptions, but it's hard for them to evolve our serverless thinking, because they're so far removed from it. I still think it was the right call in the end. I think the next step of the journey is figuring out how we share code effectively while not hiding away too much of serverless, especially because it's changing so fast.

Jeremy: It's also interesting, though, that you take that approach to hide some of that complexity and bake in some of that boilerplate that someone mostly didn't have to write themselves anyways. Like you said, copying and pasting between services is not the best way to do it. I tried the whole shared packages thing one time, and it kind of worked. It's just, when you make a small change to that package and you have 14 services that you then have to update to get the newest version, sometimes that's a little frustrating. Lambda layers haven't been a huge help with some of that stuff either. But anyways, it's interesting, because again, you've mentioned this a number of times about using queues.

You did mention resiliency in there, but I want to touch on that point a little bit, because that's one of those things too, where I would assume in a banking platform, you do not want to lose events. You don't want to lose things. And so if something breaks, or something gets throttled or whatever, having to go and retry those events, having the alerts in place to know that a queue is backed up or whatever. Then just, I'm thinking ordering issues and things like that. What kinds of issues did you face, and tell me a little bit more about what you've done for reliability?

Patrick: Totally. Queues are definitely ... SQS is a workhorse for our company right now. We use a lot of it. Dropping messages is one of the scariest things, so you're dead-on there. When we were moving to event-driven, that was what scared me the most. What if we drop an event? A good example of that is if you're using EventBridge and you're subscribing Lambdas to it. I was under the impression early on that EventBridge retries forever, but I'm pretty sure it'll only retry the invoke twice. I think that's what we landed on.

Jeremy: Interesting.

Patrick: I think so, and don't quote me on this. That was an example of where a dropped message could be a problem. We put a queue in front of there, an SQS queue, as the subscription. That way, if there's any failure to deliver, it's just going to retry all the time for a number of days. At that point you've got to think about DLQs, and that's something we're still thinking about. But yeah, I think the reason we've been using queues everywhere is that now queues are in charge of all your retries. Now that we've decomposed these Lambdas, from one Lambda lift into five Lambdas with queues in between, if anything fails in there, it just pops back into the queue, and it'll retry for days before messages get dropped. That's something we learned, luckily, in the prototyping stage, and there are a few places where we use dead letter queues. But one of the issues there as well was ordering. Ordering didn't play too well with ...

Jeremy: Not with DLQs. No, it does not, no.

Patrick: I think that's one lesson I'd want to share: only use ordering when you absolutely need it. We found ways to design some of our architecture where we didn't need ordering. There are places we were using FIFO SQS, which was something that had just launched when we were building this thing. When we were thinking about messaging, we were like, "Oh, well, we can't use SQS because it doesn't respect ordering." Then bam, the next day we see this blog article. We got really hyped on that and used FIFO everywhere, and then realized it's unnecessary in most use cases. So when we were going live, we actually changed those FIFO queues into just regular SQS queues in as many places as we could. In that use case, you could really easily attach a dead letter queue and you don't have to worry about anything, but with FIFO things get really, really gnarly.

Ordering is an interesting one. Another place we got burned, I think, on dead letter queues, or a tough thing to do with dead letter queues, is when you're using state machines. Limiting the concurrency of our state machines is another wishlist item in AWS. I wish there was just, at the top of the file, a limit on concurrent executions of your state machine. Maybe it exists. Maybe we just didn't learn to use it properly, but we needed it. There are a few patterns out there. I've seen the [INAUDIBLE] pattern where you can use the actual state machine flow to look back at how many concurrent executions you have, and pause. We landed on setting reserved concurrency on a number of Lambdas and throwing errors. If we've hit the max concurrency, it'll pause that Lambda, but the problem with DLQs there was, these are all errors. They're coming back as errors.

We're like, we're fine with them. This is a throttle error. That's fine. But it's hard to distinguish that from a poison message in your queue, so when do you dump those into a DLQ? If it's just a throttling thing, I don't think it matters to us. That was another challenge we had. We're still figuring out dead letter queues and alerting. For now we've just relied a lot on CloudWatch alarms for our alerting, and there's a lot you can do. Even just in the state machines, you can get pretty granular there. Once certain things fail, it can announce to your Slack channel. We use that Slack integration, it's pretty easy. You just go to a Slack channel, there's an email address in there, you plop it into the console in AWS, and you have your very early alerting mechanism there.
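One hedged way to make that distinction explicit is to classify failures before deciding where they go: throttles get retried, poison messages get dead-lettered. The error class names here are hypothetical, not NorthOne's actual code:

```python
class ThrottleError(Exception):
    """Transient failure, e.g. we hit our own reserved concurrency cap."""

class ValidationError(Exception):
    """The payload itself is bad: a poison message."""

def route_failure(exc):
    """Decide where a failed message goes: back to the queue, or to the DLQ."""
    if isinstance(exc, ThrottleError):
        return "retry"  # transient: let the queue redeliver it later
    if isinstance(exc, ValidationError):
        return "dlq"    # retrying a poison message will never help
    return "retry"      # default: treat unknown errors as retryable
```

The point is simply that "error" is not one category: a DLQ full of throttle errors tells you nothing, while a DLQ that only receives unrecoverable messages is worth alerting on.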

Jeremy: The thing with Elasticsearch ... not Elasticsearch, I'm sorry. I'm totally off-topic here. The thing with EventBridge and Lambda, these are one of those things that, again, they're nuances, but with EventBridge, as long as it can deliver to the Lambda service, then the Lambda service kicks off and queues it automatically. Then that will retry a certain number of times. I think you can control that now. But if that retries multiple times and eventually fails, then that kicks it over to the DLQ or whatever. There are all different ways that it works like that, but that's why I always liked the idea of putting a queue in between there as well, because I felt you just had a little bit more control over exactly what happens.

As long as it gets to the queue, then you know you haven't lost the message, or you hope you haven't lost a message. That's super interesting. Let's move on a little bit about the adoption issues. You mentioned a few of these things, obviously issues with concurrency and ordering, and some of that other stuff. What about some of the other challenges you had? You mentioned this idea of writing all these packages, and it pulls devs away from the CloudFormation a little bit. I do like that in that it, I think, accelerates a lot of things, but what are some of the other maybe challenges that you've been having just getting this thing up and running?

Patrick: I would say IAM is an interesting one. Because we are in the banking space, we want to be very careful about what access you give to which machines or developers, and I think machines are important too. There have been cases where ... so we do have a separate developer setup with its own permissions, and in development it's really easy to spin up all your services, within reason. But now that we're going into production, there are times when our CI doesn't have the permissions to delete a queue or create a queue, or certain things, and there's a lot of tweaking you have to do there. You've got to do a lot of thinking about your IAM policies as an organization, especially because now every developer's touching infrastructure.

That becomes this shared operational overhead that serverless did introduce. We're still figuring that out. Right now we're functioning on least privilege, so it's better to not be able to deploy than to deploy something you shouldn't, or read logs that you shouldn't, and that's where we're starting. But that's something that will be a challenge for a little while, I think. There are all kinds of interesting things out there. I think temporary IAM permissions is a really cool one. There are times when we're in production and we need to view certain logs, or be able to access a certain queue, and there's tooling out there where, or at least so I've heard, you can give temporary permissions. You have this queue permission for 30 minutes, and it expires and it's audited, and I think there's some CloudTrail tie-in you could do there. I'm speaking about my wishlist for our next evolution here. I hope my team is listening ...
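Temporary permissions of the kind Patrick wishes for can be approximated today with an IAM condition key. This hypothetical policy grants read access to a single queue until a fixed expiry time; the account ID, queue name, and timestamp are made up, and a real tool would generate, expire, and audit these automatically:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TemporaryQueueAccess",
      "Effect": "Allow",
      "Action": ["sqs:ReceiveMessage", "sqs:GetQueueAttributes"],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-queue",
      "Condition": {
        "DateLessThan": { "aws:CurrentTime": "2021-07-01T00:30:00Z" }
      }
    }
  ]
}
```

After the `aws:CurrentTime` cutoff, the statement simply stops matching, and CloudTrail still records every call made while it was live.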

Jeremy: Your team's listening to you.

Patrick: ... will be inspired as well.

Jeremy: What about ... because this is something too that I always found to be a challenge, especially when you start having multiple services, and you've talked about these domain events, but then milestone events. You've got different services that need to communicate across services, or across domains, and realize certain things like that. Service discovery in and of itself, and which queue are we mapping to, or which service am I talking to, and which version of the service am I talking to? Things like that. How have you been dealing with that stuff?

Patrick: Not well, I would say. Very, very ad hoc. Right now, at least, we have tight communication between the teams, so we roughly know which service we need to talk to, and we output our URLs in the CloudFormation outputs, so at least you can reference the URLs across services a little easier. Really, GraphQL is one of the only services that talks to a lot of our API gateways, so at least there's less of that, knowing which endpoint to hit. Most of our services feed into EventBridge, and then within services a lot of that's abstracted away, so the queue subscriptions are a little easier. Service discovery is a bit of a nightmare.

Once our services grow, it'll be, I don't know, it'll be a huge challenge to understand even which services are using older versions of Node, for instance. I saw that AWS is now deprecating Node version 10, and we'll have to take a look internally: are we using version 10 anywhere, and how do we make sure that's fine? Or even things like just knowing which services now have vulnerabilities in their npm packages, because we're using Node. That's another thing. I don't even know if that falls under service discovery, but it's an overhead of ...

Jeremy: It's service management too. There's a lot there. That actually brings me to this idea of observability too. You mentioned doing some CloudWatch alerts and some of that stuff, but what about using an observability tool, or tracing like X-Ray, and things like that? Have you been implementing any of that, and if you have, have you had any success and/or problems with it?

Patrick: I wish we had a better view of some of the observability tools. I think we were just building so quickly that we never really invested the time into trying them out, so we rolled our own tooling internally to at least do what we know. We did use X-Ray, but the problem with X-Ray is, we do subscribe all of our services, yet X-Ray isn't implemented everywhere internally in AWS, so we lose our trail somewhere in that DynamoDB stream to SNS, or SQS. It's not a full trace. Also, just digesting that huge graph of information is very difficult. I don't use it often. I think it's a really cool graphic to show: "Hey, look how many services are running, and it's going so fast."

It's a really cool thing to look at, but it hasn't been very useful. I think our most useful tool for debugging and observability has been just our logging. We created a JSON logger package, so we output JSON logs and can actually filter on different properties, and we ship those to Elasticsearch. Now you can have a view of all of the functions within a given domain at any point in time. You can really see the story. Because early on, when we were opening up CloudWatch and you'd have like 10 tabs open, trying to understand this flow of information, it was very difficult.
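A minimal version of that JSON logger idea, assuming nothing about NorthOne's actual package: emit one JSON object per log line, with arbitrary properties to filter on once the logs land in Elasticsearch. The field names are illustrative:

```python
import json
import time

def make_logger(service, trace_id=None):
    """Return a log function bound to one service (and optionally a trace ID)."""
    def log(level, message, **fields):
        record = {
            "timestamp": time.time(),
            "service": service,
            "level": level,
            "trace_id": trace_id,
            "message": message,
            **fields,  # arbitrary structured properties to filter on later
        }
        print(json.dumps(record))  # one JSON document per log line
        return record
    return log
```

Because every line is a self-contained JSON document, a query like "all `level: error` records for `service: payments` in the last hour" becomes a filter instead of ten CloudWatch tabs.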

We also implemented our own trace ID pattern. I think we just followed a Lumigo article, where we introduced some properties in each of our Lambdas at a higher level, in one of our middlewares, and we were able to trace through. It's not ideal. Observability is something that we'll probably have to work on next. It's been tolerable for now, but I can't see it scaling much longer.
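That trace ID pattern can be sketched as a small middleware: reuse the incoming trace ID if the event carries one, mint one at the edge if not, and stamp it on the output so the next function in the chain can log it. The field names are assumptions, not the actual NorthOne middleware:

```python
import uuid
from functools import wraps

def with_trace_id(handler):
    """Middleware: ensure every event in and out of this handler carries a trace_id."""
    @wraps(handler)
    def wrapper(event, context=None):
        # Reuse the caller's trace ID, or mint one if we're at the edge.
        trace_id = event.get("trace_id") or str(uuid.uuid4())
        event["trace_id"] = trace_id
        result = handler(event, context)
        if isinstance(result, dict):
            result.setdefault("trace_id", trace_id)  # propagate downstream
        return result
    return wrapper

@with_trace_id
def handler(event, context=None):
    # A stand-in for real business logic; it just sees event["trace_id"].
    return {"ok": True}
```

Combined with the JSON logger, filtering logs on one `trace_id` reconstructs a single request's path through all five Lambdas and their queues.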

Jeremy: That's the other thing too, is even the shared package issue. It's like when you have an observability tool, they'll just install a layer or something, where you don't necessarily have to worry about updating your own tool. I always find if you are embracing serverless and you want to get rid of all that undifferentiated heavy lifting, observability tools, there's a lot of really good ones out there that are doing some great stuff, and they're specializing in it. It might be worth letting someone else handle that for you than trying to do it yourself internally.

Patrick: Yeah, 100%. Do you have any that you've used that are particularly good? I know you work with serverless so-

Jeremy: I've played around with all of them, because I love this stuff, so it's always fun. But I mean, obviously Lumigo and Epsagon, and Thundra, and New Relic. They're all great. They all do things slightly differently, but they all follow a similar implementation pattern, so it's very easy to install them. We can talk more about some recommendations. I think it's just one of those things where, in a modern application, not having that insight is really hard. It can be really hard to debug stuff. If you look at some of the tools that AWS offers, I think they're there, it's just, they're maybe a little harder to implement, and not quite as refined and targeted as some of the observability tools. But still, you've got to get there. Again, that's why I keep saying it's an evolution, it's a process. Maybe one time you get burned and you're like, "We really needed to have observability," and that's when it becomes more of a priority, when you're moving fast like you are.

Patrick: Yeah, 100%. I think it's got to be a priority sooner rather than later. I'll do some reading now that you've dropped some of these options. I have seen them floating around, but it's one of those things where when it's too late, it's too late.

Jeremy: It's never too late to add observability though, so you should. Actually, a lot of them now, again, make it really, really easy. I'm not trying to pitch any particular company, but take a look at some of them, because they are really great. Just one other challenge that I also find a lot of people run into, especially with serverless, because there are all these artificial account limits in place: even the number of queues you can create, the number of concurrent Lambda functions in a particular region, and stuff like that. Have you run into any of those account limit issues?

Patrick: Yeah. I can give you the easiest way to run into an account limit issue, and that is to replay your entire EventBridge archive to every subscriber. You will find a bottleneck somewhere. That's something ...

Jeremy: Somewhere it'll fall over? Nice.

Patrick: 100%. It's a good way to do a quick check in development to see where you might need to buffer something, but we have run into that. I think the solution, in a lot of places, was just really playing with concurrency where we needed to, and being thoughtful about where to reserve concurrency in the places that absolutely needed to stay functioning. The challenge there is that reserved concurrency eats into your total account concurrency, which was an interesting learning. So definitely playing around there, and just being thoughtful about where you are replaying. A couple of things. We use replays a lot. Because we are using these milestone events between service boundaries, when you launch a new service, you want to replay that whole history all the way through.

We've done a lot of replaying, and that was one of the really cool things about EventBridge. It was just so easy. You just set up an archive, and it'll record everything coming through, and then you just press a button in the console, and it'll replay all of it. That was really awesome. But you have to be very mindful of where you're replaying to. If you replay to all of your subscriptions, you'll hit Lambda concurrency limits real quick. Even just another case: early on we needed to replay ... we have our own domain event store, and we wanted to replay some of those events. Those come off the DynamoDB stream, so we were using DynamoDB to kick those to a stream, to SNS, and fan out to all of our SQS queues. But there would only be one or two queues that actually needed to subscribe to those events, so we created an internal utility just to dump those events directly into the SQS queues we needed. I think it's just about not being wasteful with your resources, because sure, they are cheap.
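That internal utility might look something like this sketch: route each archived event only to the queues that subscribe to its type, and only if those queues are in the set being replayed, instead of fanning out to every subscriber. The event types and queue names are invented for illustration:

```python
# Hypothetical subscription map: which queues care about which event types.
SUBSCRIPTIONS = {
    "account.created": ["kyc-queue", "welcome-email-queue"],
    "transaction.settled": ["ledger-queue"],
}

def replay(archive, only_queues):
    """Deliver archived events only to the named queues, skipping everyone else."""
    delivered = {q: [] for q in only_queues}
    for event in archive:
        for queue in SUBSCRIPTIONS.get(event.get("type", ""), []):
            if queue in only_queues:  # skip subscribers that don't need this replay
                delivered[queue].append(event)  # a real tool would call sqs.send_message
    return delivered
```

Replaying to one queue instead of every subscription is what keeps a backfill from eating the account's Lambda concurrency.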

Jeremy: But if you use them, they start to cost money.

Patrick: Yeah. They start to cost some money, and they can also lock you out of other functionality. If you hit your Lambda limits, now our API Gateway is tapped.

Jeremy: That's a good point.

Patrick: You could take down your whole system if you just aren't mindful about those limits, and then you have to call up AWS in a panic and be like, "Hey, can you update our limits?" Luckily we haven't had to do that yet, but it's definitely something in your back pocket if you need it, if you can make the case to AWS that maybe you do need bigger limits than the default. I think it's just not being wasteful, being mindful of where you're replaying. Another interesting thing there is dealing with partners too. It's really easy to scale in the Lambda world, but not every partner can handle that volume really quickly. If you're not buffering every event coming through EventBridge to your new service that hits a partner each time, you're going to hit their API rate limit really quickly, because you're just going to blow right through it.

You might be doing thousands of API calls when you're instantiating a new service. That's one of those interesting things that we have to deal with, particularly in our orchestrators, because they are talking to different partners. That's why we need to really make sure we can limit the concurrent executions of the state machines themselves. In a way, some of our architecture is too fast to scale.
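Buffering a partner API like that is often done with a token bucket. This in-memory sketch shows the shape of the decision; the rates are illustrative, and a real system would keep the bucket state somewhere shared (Redis, DynamoDB) since Lambdas scale out:

```python
import time

class TokenBucket:
    """Allow at most `rate_per_sec` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity          # start full: allow an initial burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # safe to call the partner now
        return False      # back off: requeue the message instead of calling
```

A worker that gets `False` back can simply throw a retryable error, letting the queue redeliver the message once the partner's rate limit has breathing room.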

Jeremy: It's too good.

Patrick: You still have to consider downstream. That, and even just, if you are using relational databases or anything else in your system, now you have to worry about connection limits and ...

Jeremy: I have a whole talk I gave on that.

Patrick: ... spikes in traffic.

Jeremy: Yes, absolutely.

Patrick: Really cool.

Jeremy: I know all about it. Any final advice for companies like you that are trying to bite off a piece of the serverless apple, I guess? That's really bad. Anyway, any advice for people looking to get into this?

Patrick: Yeah, totally. I would say start small. I think we were wise to just try it out. It might not land with your development team. If you don't really buy in, it's one of those things that can just end up unnecessarily messy, so start small, see if you like it in your shop, and then reevaluate once you hit a certain point. That, and I would say build shared boilerplate packages sooner rather than later. I know shared code is a problem, but it is nice to have an unopinionated starter pack so that you're at least not doing anything really crazy. Even just things like having opinions around logging. In our industry, it's really important that you're not logging sensitive details.

For us, that means things like wrapping our HTTP clients to make sure we're not logging sensitive details, or having shared Lambda packages that make sure that out of the box you're opinionated about not doing something terribly awful. I would say those two things: start small, and a boilerplate package. And maybe the third thing is just pay attention to the code smell of a growing Lambda. If you are doing three API calls in one Lambda, chances are you can probably break that up and think about it in a more resilient way. If any one of those pieces fails, now you have retryability in each one of them. Those are the three things I would say. I could probably talk forever about the rest of our journey.

Jeremy: I think that was great advice, and I love hearing about how companies are going through this process and what that process looks like, and I hope that companies listen to this and can skip a lot of these mistakes. I don't even want to call them all mistakes, I think it's just evolution. The stuff that you've done, we've all done, we've all gone through that process, and the more we can solidify these practices, the more companies will benefit from hearing stories like these. Thank you very much for sharing that. Again, thank you so much for spending the time to do this and sharing all of this knowledge, and this journey that you've been on and are continuing to be on. It would be great to continue to get updates from you. If people want to contact you, I know you're not on Twitter, but what's the best way to reach out to you?

Patrick: I almost wish I had a Twitter account. It's the developer thing to have, so maybe in the future. LinkedIn would be great, and if anybody's interested in working with our team and figuring out how to take serverless to the next level, just hit me up on LinkedIn or look at our careers page at northone.com, and I can give you a warm intro.

Jeremy: That's great. Just your last name is spelled S-T-R-Z-E-L-E-C. How do you say that again? Say it in Polish, because I know I said it wrong in the beginning.

Patrick: I guess for most people it would just be Strzelec, but if there are any Slavs in the audience, it's "Strzelec." A very intense four-consonant last name.

Jeremy: That is a lot of consonants. Anyways again, Patrick, thanks again. This was great.

Patrick: Yeah, thank you so much, Jeremy. This has been awesome.

2021-06-14

Episode #104: The Rise of Data Services with Patrick McFadin

About Patrick McFadin

Patrick McFadin is the VP of Developer Relations at DataStax, where he leads a team devoted to making users of Apache Cassandra successful. He has also worked as Chief Evangelist for Apache Cassandra and as a consultant for DataStax, where he helped build some of the largest and most exciting Cassandra deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/developer for over 15 years.

Twitter: @PatrickMcFadin
LinkedIn: Patrick McFadin
DataStax website: datastax.com
K8ssandra: k8ssandra.io
Stargate: stargate.io
DataStax Astra: Cassandra-as-a-Service

Watch this episode on YouTube: https://youtu.be/-BcIL3VlrjE

This episode sponsored by CBT Nuggets and Fauna.

Transcript
Jeremy: Hi everyone, I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Patrick McFadin. Hey Patrick, thanks for joining me.

Patrick: Hi Jeremy. How are you doing today?

Jeremy: I am doing really well. So you are the VP of Developer Relations at DataStax, so I'd love it if you could tell the listeners a little bit about yourself and what DataStax is all about.

Patrick: Sure. Well, I mean mostly I'm just a nerd with a cool job. I get to talk about technology a lot and work with technology. So DataStax, we're a company that was founded around Apache Cassandra, just supporting and making it awesome. And that's really where I came to the company. I've been working with Apache Cassandra for about 10 years now. I've been a part of the project as a contributor.

But yeah, I mean mostly data infrastructure has been my life for most of my career. I did this in the dotcom era, back when it was really crazy, when we had dozens of users. And when that washed out, I'm like, oh, then real scale started, and during that period of time I worked a lot on just trying to scale infrastructure. It seems like that's been what I've been doing for 30 years ... 20 years, 20 years, I'm not that old. But yeah, right now I spend a lot of my time just working with developers on what's next in Kubernetes, and I'm part of CNCF now, so yeah. I just can't seem to stay in one place.

Jeremy: Well, so I'm super interested in the work that DataStax is doing because I have had the pleasure/misfortune of managing a Cassandra ring for a start-up that I was at. And it was a very painful process, but once it was set up and it was running, it wasn't too, too bad. I mean, we always had some issues here and there, but this idea of taking a really good database, because Cassandra's great, it's an excellent data store, but managing it is a nightmare and finding people who can manage it is sort of a nightmare, and all that kind of stuff. And so this idea of taking these services and DataStax isn't the only one to do this, but to take these open-source services and turn them into these hosted solutions is pretty fantastic. So can you tell me a little bit more, though? What this shift is about? This moving away from hosting your own databases to using databases as a service?

Patrick: Yeah. Well, you touched on something important. You want to take that power, I mean Cassandra was a database that was built in the scale world. It was built to solve a problem, but it was also built by engineers who really loved distributed computing, like myself, and it's funny you say like, "Oh, once I got it running, it was great," well, that's kind of the experience with most distributed databases, is it's hard to reason around having, "Oh, I have 100 mouths to feed now. And if one of them goes nuts, then I have to figure it out."

But it's the power, that power, it's like stealing fire from the gods, right? It's like, "Oh, we could take the technology that Netflix and Apple and Facebook use and use it in our own stuff." But you got to pay the price, the gods demand their payment. And that's something that we've been really trying to tackle at DataStax for a couple of years now, actually three, which is how ... Because the era of running your own database is coming to an end. You should not run your own database. And my philosophy as a technologist is that proper, really important technology like your data layer should just fade into the background and it's just something you use, it's not something you have to reason through very much.

There's lots of technology that's like that today. How many times have you ... When was the last time you managed your own memory in your code?

Jeremy: Right. Right. Good point. I know.

Patrick: Thank god, huh?

Jeremy: Exactly.

Patrick: Whew.

Jeremy: But I think that you make a really good point, because you do have these larger companies like Facebook or whatever that are using these technologies and you mentioned data layers, which I don't think I've worked for a single company, I don't think I actually ... I founded a start-up one time and we built a data layer as well, because it's like, the complexity of understanding the transaction models and the routing, especially if you're doing things like sharding and all kinds of crazy stuff like that, hiding that complexity from your developers so that they can just say, "I need to get this piece of information," or, "I need to set this piece of information," is really powerful.

But then you get stuck with these data layers that are bespoke and they're generally fragile and things like that, so how is that you can take data as a service and maybe get rid of some of that, I don't know, some of that liability I guess?

Patrick: Yeah. It's funny because you were talking about sharding and things like that. These are things that we force on developers to reason through, and it's just cognitive load. I have an app to get out, and I have some business desire to get this application online, the last thing I need to worry about is my sharding algorithm. Jeremy, friends don't let friends shard.

Jeremy: Right. That's right. That's a good point.

Patrick: But yeah, I mean I think we actually have all the parts that we need, and this is closer than you think. Look at where we've already started going, and that is with APIs, using REST. Now GraphQL, which I think is deserving of its hotness, is starting to bring together some things that are really important for this kind of world we want to live in. GraphQL is unifying data and collections into actual queries, it's a QL, and why they call it Graph, I have no idea. But it gives you this ability to have this more abstract layer.

I think GraphQL, and here's a prediction, is going to be like the SQL of working with data services on the internet and for cloud-native applications. And so what does that mean? Well, that means I just have to know, well, I need some data, and I don't really care what's underneath it. I don't care if I have this field indexed or anything like that. And that's pretty exciting to me, because then we're writing apps at that point.
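Concretely, that prediction looks like a client asking for exactly the fields it needs against a hypothetical schema, with no knowledge of the store, drivers, or indexes underneath:

```graphql
# Hypothetical schema: the client names only the fields it wants back,
# and the service decides how (and where) to fetch them.
query AccountSummary($id: ID!) {
  account(id: $id) {
    name
    balance
    transactions(last: 10) {
      amount
      postedAt
    }
  }
}
```

Whether `transactions` lives in Cassandra, a document store, or somewhere else entirely is invisible to the app, which is the whole point.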

Jeremy: Right. Yeah. And actually, that's one of the things I really like about GraphQL too is just this idea that it's almost like a universal data access layer in a sense because it does, you still have to know it, you have to know what you're requesting if you're an end developer, but it makes it easier to request the things that you need and have those mutations set and have some of those other things standardized across the company, but in a common format because isn't that another problem? Where it's like, I'm working with company A and I move to company B maybe and now company B is using a different technology and a different bespoke data layer and some of these other things.

So, I think data as a service for one, maybe with GraphQL in front of it is a great way to have this alignment across companies, or I guess, just makes it easier for developers to switch and start developing right away when they move into a new company.

Patrick: Yeah, and this is a concept I've been trying to push pretty hard and it's driven by some conversations I've had with some friends that they're engineering leaders and they have this common desire. We want to have a zero day dev, which is the first day that someone starts, they should be producing production code. And I don't think that's crazy talk, we can do this, but there's a lot of things that are in front of it. And the database is one of them. I think that's one of the first things you do when you show up at company X is like, "Okay, what database are you using? What flavor of SQL or GRPC or CQL, Cassandra query language? What's the data model? Quick, where's that big diagram on the wall with my ERD? I got to go look at that for a while."

Jeremy: How poorly did you structure your Git repositories? Yeah.

Patrick: Yeah, exactly. It's like all these things. And no, I would love to see a world where the most troublesome part of your first day is figuring out where the coffee and the bathroom are, and then the rest of it is just total, "Hey, I can do this. This is what I get paid to do."

Jeremy: Right. Yeah. So that idea of zero day developer, I love that idea and I know other companies are trying to do that, but what enables that? Is it getting the idea of having to understand something bespoke? Is it getting that off of the table? Or not having to deal with the low-level database aspect of things? I mean because APIs, I had this conversation with Rob Sutter, actually, a couple weeks ago. And we were talking about the API economy and how everything is moving towards APIs. And even data, it was around data as well.

So, is that the interface, you think, of the future that just says, "Look, trying to interface directly with a database or trying to work with some other layer of abstraction just doesn't make sense, let's just go straight from code right to the data, with a very simple API interface?"

Patrick: Yeah, I think so. And it's this idea of data services, because if you think about it, if you're doing React or something like front-end code, I don't want to have a driver. Drivers are a total impediment. Driver hell can be difficult at large organizations, getting the matching right: oh, we're using this database, so you have to use this driver. And if you don't, you are now rejected at the gate. So it's using HTTP protocols, but it's also things like, when you're using React or Angular, Vue, whatever you're using on the front end, you have direct access.

But most times what you're needing is just a collection or an object. And so just do a get, "I need this thing right now. I'm doing a pick list. I need your collection." I don't need a complicated setup and spend the first three days figuring out which driver I'm using and make sure my Gradle file is just perfect. Yeah. So, I think that's it.
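That driverless, HTTP-first access might look like building a URL and doing a GET. The path shape below loosely follows a Stargate-style document API, but treat the exact base URL, endpoints, and names as assumptions for illustration:

```python
# Hypothetical data service endpoint; a real deployment would supply its own.
BASE = "https://db.example.com/v2/namespaces"

def collection_url(namespace, collection, doc_id=None):
    """Build the URL for a whole collection, or a single document within it."""
    url = f"{BASE}/{namespace}/collections/{collection}"
    return f"{url}/{doc_id}" if doc_id else url

# A front end would then just fetch(collection_url("store", "picklist"))
# over HTTP: no driver, no Gradle file, no connection pool to configure.
```

The request is plain HTTP plus an auth header, which is exactly what React, Angular, or Vue already know how to speak.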

Jeremy: Yeah. No, I'd be curious how you feel about ORMs, or O-R-Ms, certainly for relational databases, I know a lot of people love them. I can't stand them. I think it adds a layer of abstraction and just more complexity where I just want access to the database. I want to write the query myself, and as soon as you start adding in all this extra stuff on top of it to try to make it easier, I don't know, it just seems to mess it up for me.

Patrick: All right. So yeah, I think we have an accord. I am really not a fan of ORMs at all. And I mean, this goes back to Hibernate. Everyone was like, "Oh, Hibernate's going to be the end of databases." No, it's not. Oh yeah, it was the end of the database on the other side, because it would create these ridiculous queries. It's like, why is every query a full table scan?

Jeremy: Exactly.

Patrick: Because that's the way Hibernate wanted it. Yeah. I actually banned Hibernate at one company I was working at. I was Chief Architect there and I just said, "Don't ever put Hibernate in our production." Because I had more meetings about what it was doing wrong than what it was doing right.

Jeremy: Right. Right. Yeah. No, that's sounds, yeah.

Patrick: Is that a long answer? Like, no.

Jeremy: No, I've had the same experience, where with certain ORMs you're just like, no, you can't do this, because for one, I think it locks you in, in a sense. I mean, there's all kinds of lock-in in the cloud, and if you're using a data service or an API, or you're using something native in AWS or IBM Cloud, you're still going to be locked in in some way. But I do feel like whenever you start going down that path of building custom things, or forcing developers to get really low level, that just builds up all kinds of tech debt, right? That you eventually are going to have to work down.

Patrick: Well, it's organizational inertia. When you start getting into this, when you start using annotations in Hibernate, where you're just cutting through all the layers and now you're way down in the weeds, try to move that. There are a couple of companies that I've worked with now that are looking at the true reality of portability in their data stores. Like, "Oh, we want to move from one to a different one, from a key value store to a document store, without developers knowing." Well, how do you get to that point?

Jeremy: Right. Yeah.

Patrick: And it's just, that's not giving access to those things, first of all, but this is the tech debt that's going to get in your way. We're really good, as technologists, we're really good at just racking up the charges on our tech debt credit card, especially when we're trying to get things out the door quickly. And I think that's actually one of the problems that we all face. I mean, I don't think I've ever talked to a developer who was ahead of schedule and didn't have somebody breathing down their neck.

Jeremy: Very true.

Patrick: You take shortcuts. You're like, "We've got to shift this code this week. Skip the annotations and go straight into the database and get the data you need." Or something. You start making trade-offs real fast.

Jeremy: What can we hard code that will just get us past it?

Patrick: Yeah. Is it green? Shift it. Yeah.

Jeremy: Yeah, no, I totally, totally agree. All right. So let's talk a little bit more about, I guess, skillsets and things like that. Because there are so many different databases out there. Cassandra is just one and if you're a developer working just at the driver level, I guess, with something like Cassandra, it's not horrible to work with. It's relatively easy once a lot of these things are set up for you.

Same is true of MongoDB, or DynamoDB, or any of these other ones where the interface to it isn't overly difficult, but there's always something you want to build on top of it to make it a little bit easier. But I'm just curious, in terms of learning these different things and switching between organizations and so forth, there is a cognitive load going from saying, "I'm working on Cassandra," to saying, "I'm working on DynamoDB," or something like that. There's going to be a shift in understanding of how the data can be brought back, what the limitations are, just a whole bunch of things that you have to think about. And that's not even including managing the actual thing. That's a whole other thing.

So, hiring people, I guess, or hiring developers, how much do we want developers to know? Are you on board with me where it's like, I mean I like understanding how Cassandra works and I like understanding how DynamoDB works, and I like knowing the limits, but I also don't want to think about them when I'm writing code.

Patrick: Yeah. Well, it's interesting because Cassandra, one of the things I really loved about Cassandra initially was just how it works. As a computer scientist, I was like, "This is really neat." I mean, my degree field is in distributed computing, so of course, I'm going to nerd out.

Jeremy: There you go.

Patrick: But that doesn't mean that it doesn't have mass appeal, because it's doing the thing that people want. And I think that's going to be the challenge of any properly built service layer. I think I mentioned to you before we started this, I work on a project called Stargate. And Stargate is a project that is meant to build a data layer on top of databases. And right now it's with Cassandra. And it's abstracting away some of the harder things to understand or reason about.

For instance, with distributed computing, we're trying to reduce the reliance on coordination. There is a great article about this by Pat Helland about how coordination is the last really expensive thing that we have in development. Memory, CPU, super cheap. I can rent that all day long. Coordination is really, really hard, and I don't expect a new programmer to understand, to reason through coordination problems. "Oh, yeah, the just in time race conditions," and things like that.

And I think that's where distributed computing, it's super powerful, but then whenever people see what eventual consistency is, they freak out and they're like, "I just want my SQLite on my laptop. It's very safe." But that's not going to get you there. That's not a global database, it's not going to be able to take you to a billion users. Come on, don't cut ...

Jeremy: Maybe you don't need to be.

Patrick: ... your apps short Jeremy. You're going to have a billion users.

Jeremy: You should strive for it, at least, is how I feel about it. So that's, I guess, the point I was trying to get to: if developers are the ones you don't want learning some of this stuff, there are ways to abstract it away, again, like we talked about with data as a service and APIs and so forth. And I think that's where I would love to see things shifting. And as you said earlier, that's probably where things are going.

But if you did want to run your own database cluster, and you wanted to do this on your own, I mean you have to hire people that know how to do this stuff. And the more I see the market heating up for this type of person, there are very, very few specialists out there that are probably available. So how would you even hire somebody to run your Cassandra ring? They probably all work at DataStax.

Patrick: No, not all of them. There's a few that work at Target and FedEx, Apple, the biggest Cassandra users in the world. Huawei. We just found out lately that Huawei now has the biggest cluster on the planet. Yeah. They just showed up at ApacheCon and said, "Oh yeah, hold my beer." But I mean, you're right, it's a specialized skillset and one of the things we're doing at DataStax, we feel, yeah, you should just rent that. And so we have Astra, which is our database as a service.

It's fully compatible with open-source Cassandra. If you don't like it, you can just take it over and use open-source. But we agree and we actually can run Cassandra cheaper than you can, and it's just because we can do it at scale. And right now Astra, the way we run it is truly serverless, you only pay for what you need, and that's something that we're bringing to the open-source side of Cassandra as well, but we're getting Cassandra closer to Kubernetes internally.

So if you don't want to think about Kubernetes, if you don't want to think about all that stuff, you can just rent it from us, or you could just go use it in open-source, either way. But you're right. I mean, "get better at running Cassandra" should not be a 2020s skillset. I think those days should be behind us. Leave it to somebody else. If you want to go work at DataStax and run Cassandra, great, we're hiring right now, you will love it. But you don't have to. Yeah.

Jeremy: So the idea of it being open-source, so again, I'm not a huge fan of this idea of vendor lock-in. I think if you want to run on AWS Lambda, yeah, most of what you can do can only run on AWS Lambda, but changing the compute, switching that over to Azure or switching that over to GCP or something like that, the compute itself is probably not that hard to move, right? I think especially depending on what you're doing, setting up an entire Kubernetes cluster just to run a few functions is probably not worth it. I mean, obviously, if you've got a much bigger implementation, that's a little different.

But with data, data is just locked in. No matter where you go, it is very hard to move a lot of data. So even with the open-source flair that you have there, do you still see a worry about lock in from a data side?

Patrick: Yeah. And it's becoming more of a concern with larger companies too, because options, #options. There was a pretty famous story a few years ago where the CEO of Target said, "I am not paying Amazon any more money," and they just picked up shop and moved from AWS to Google Cloud. And the CEO made a technical decision. It was like everybody downstream had to deal with that. And I think that luckily Target's a huge Cassandra shop and they were just like, "Okay, we'll just move it over there."

But the thing is that you're right, I mean, and I love talking about this, because back when cloud was first starting and I was talking about it and thinking about it, what do the clouds promise you? Oh, you get commodity scale of CPU and network and storage. And that's what they want to sell you, because that's what they're building. Those big buildings in north Virginia are full of compute, network, and storage. But they know they need to hook you in, and the way that they hook you in, there are some services that are really handy, they're great, but really the hook is the data.

Once you get into the database, the bespoke database for the cloud, one of the features of that database is that it will not connect to any other database outside of that cloud, and they know that. I mean, this is why I'm starting to strongly advocate this idea that the move towards data on Kubernetes is a way for open-source to take back the cloud. Because now we're deploying these virtual data centers and using open-source technology to create this portability. So we can use the compute, network, and storage at Google, Amazon, Azure, on-prem, wherever, it doesn't matter.

But you need to think of like, "All right. How is that going to work?" And that's why we're like, "If you rent your Cassandra from DataStax with Astra, you can also use the open-source Cassandra as well." And if we aren't keeping you happy, you should feel totally fine with moving it to an open-source workload. And we're good with that. One way or the other, we would love for you to use a database that works for you.

Jeremy: Right. And so this Stargate project that you're working on, is that the one that allows you to basically route to multiple databases?

Patrick: That's the dream. Right now it just does Cassandra, but there's been some really interesting ... There are some folks coming out of the woodwork that really want to bring their database technology to Stargate. And that's what I'm encouraged by. It's an open-source project, Stargate.io, and you can contribute any of the connectors for the underlying data store, but if you're using GraphQL, if you're using gRPC, if you're using REST, the underlying data store is really somewhat irrelevant in that case. You're just doing gets and puts, or gets and sets. Gets and puts, yeah, that's right. Gets, sets, puts, it's a lot of words.
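A minimal sketch of the idea Patrick describes here, not Stargate's actual code: when the API layer only speaks gets and puts, the underlying data store becomes a pluggable detail behind a common interface. All class and method names below are hypothetical.

```python
# Sketch: a gateway that only knows "get" and "put"; any store can plug in
# behind it, which is what makes the underlying database "somewhat irrelevant".
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Interface the gateway codes against."""

    @abstractmethod
    def get(self, key: str): ...

    @abstractmethod
    def put(self, key: str, value: dict) -> None: ...


class InMemoryStore(DataStore):
    """Stand-in backend; a Cassandra-backed class would implement the same two methods."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class Gateway:
    """The service layer: REST, GraphQL, and gRPC handlers would all call these."""

    def __init__(self, store: DataStore):
        self.store = store

    def handle_get(self, key):
        return self.store.get(key)

    def handle_put(self, key, value):
        self.store.put(key, value)


gateway = Gateway(InMemoryStore())
gateway.handle_put("user:1", {"name": "Ada"})
print(gateway.handle_get("user:1"))  # {'name': 'Ada'}
```

Swapping `InMemoryStore` for a class that talks to Cassandra, or any other database, would leave the gateway and its callers untouched, which is the portability argument being made.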

Jeremy: Whatever words. Yeah. Exactly.

Patrick: That's what I love about standard, Jeremy, there's so many to pick from.

Jeremy: Right, because there are ... Exactly, which standard do you choose? Yeah. So, that's an interesting thing for me too, just this idea of, I mean, it would be great to live in a perfect little cloud where you could say, "Oh, well AWS has all the services I need, and I can just keep all my stuff there, whatever." But best-of-breed services, or again, the cost of hosting something in AWS, maybe if you're hosting a Cassandra cluster there, versus hosting it in GCP or hosting it with you, you said you could run it cheaper than they could, or than we could host it ourselves.

And so I do think that there is ... and again, we've had this conversation about multi-cloud and things like that where it's not about agnostic, it's not about being cloud agnostic, it's about using the best of breed for any service that you want to use. And APIs seem to be the way to get you there. So I love this idea of the Stargate project because it just seems like that's the way where it could be that standard across all these different clouds and onto all these different databases, well I mean, right now Cassandra, but eventually these other ones. I don't know, that seems like a pretty powerful project to me.

Patrick: Well, the time has come. It's cloud native ... I work a lot with CNCF and cloud-native data is a kind of emerging topic. It's so emerging that I'm actually in the middle of writing a book, an O'Reilly book on it. So, yeah. Surprise. I just dropped it. This just in.

Yeah, because I can see that this is going to be the future, but when we build cloud-native applications, we want scale, we want elasticity, and we want self-healing. Those are the three cloud-native things that we want. And that doesn't give us a whole lot ... So if I want to crank out a quick React app, that's what I'm going to use. And Netlify's a great example, or Vercel, they're creating this abstraction layer. But Netlify and Vercel have both been partnering with us on the Stargate project, because they're seeing like, "Okay, we want to have that very light touch, developers just come in and use it," in building cloud-native applications.

And whenever you're building your application, you're just paying for what you use. And I think that's really key, not spinning up a bunch of infrastructure that you get a monthly bill for. And that bill can be expensive.

Jeremy: It seems crazy. Doesn't it seem crazy nowadays? Actually provisioning an EC2 instance and paying for it to run even if it does nothing. That seems crazy to me.

Patrick: There are start-ups around the idea of finding the instance that's running that's costing you money that you're not using.

Jeremy: Which is crazy, isn't it? It's crazy. All right. So let's go a little bit more into standards, because you mentioned standards. So there are standards now for a lot of things, and again, GraphQL being a great example, I think. But also from a database perspective, looking at things like T-SQL, and developers come into an organization and they're familiar with MySQL, or they're familiar with PostgreSQL, whatever it is. Or maybe they're familiar with Cassandra or something like that, but I think most people, at least from what I've seen, have been very, very comfortable with the SQL approach to getting data. So, how do you bring developers in and start teaching them, or getting them to understand more of that NoSQL feel?

Patrick: I think it's already happened, it's just the translation hasn't happened in a lot of minds. When you go to build an application, you're designing your application around the workflows your application's going to have. You're always thinking about like, "I click on this. I go there." I mean, this is where we wireframe out the application. At that point, your database is now involved and I don't think a lot of folks know that.

It's like, at every point you need to put data or get data. And I think this is where we've gotten to, it could be anybody building applications, which makes it really difficult to say, "No, no, no, start with your data domain first and build out all those models. And then you write your application to go against those models." And I'll tell you, I've been involved in a few of these application boot camps, like JavaScript boot camps and things, and they don't go into data modeling. It's just not a part of it.

Jeremy: Really?

Patrick: And I think this is that thing where we have to acknowledge, "Yeah, we don't really need that as much anymore, because we're just building applications." If I build a React app, and I have a form and I'm managing the authentication and I click a button and then I get profile information, I just described every database interaction that I need and the objects that I need. I'm going to put my user profile at some point, and I'm going to click my ID and get that profile back as an object. Those are the interactions that I need. At no point did I say, "And then I'm going to write SELECT ... FROM ... WHERE." No, I just need to get that data.
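The wireframe-first view Patrick describes can be sketched in a few lines, each UI step maps directly to a put or a get of a whole profile object, with no SQL involved. The store and function names here are illustrative stand-ins, not any particular product's API.

```python
# Each application interaction is a put or a get of a whole document-shaped
# object; nested data round-trips as-is, no SELECT/FROM/WHERE required.
import json

profiles = {}  # stand-in for a document store keyed by user id


def save_profile(user_id, profile):
    """'I submit the form' -> put the profile document."""
    profiles[user_id] = json.dumps(profile)


def load_profile(user_id):
    """'I click my avatar' -> get the profile document back as an object."""
    return json.loads(profiles[user_id])


save_profile("u42", {"name": "Ada", "links": {"twitter": "@ada"}})
print(load_profile("u42")["links"]["twitter"])  # @ada
```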

Jeremy: And I love thinking about data as objects anyways. It makes more sense, rather than rows of spreadsheets essentially that you join together, describing an object even if it's got nested data, like a document form or things like that, I think makes a ton of sense. But is SQL, is it still relevant do you think? I mean, in the world we're moving into? Should I be teaching my daughters how to write TSQL? Or would I be wasting my time?

Patrick: Yeah. Well, yes and no. Depends on what your kid's doing. I think that SQL will go to where it originally started and where it will eventually end, which is in data engineering and data science. And I mean, I still use SQL every once in a while, BigQuery, that sort of thing, for exploring my data. For analytics or reporting data and things like that, SQL is very expressive. I don't see any reason to change that. But this is a guy who's been writing SQL for a million years.

But I mean, that world is still really moving. Like Presto and Snowflake and Redshift, they all use SQL to express their reporting capabilities. But ... And I think this is how you and I got sucked into this: that was the database that we had, so we started using reporting languages to build applications. And how'd that work out?

Jeremy: Yeah. Well, it certainly didn't scale very well, I can tell you that, going back to sharding, because that is always something that was very hard to do. So I guess, I get the point that essentially if you're going to be in the data sciences and you actually need to analyze that data and maybe you do need to do joins, or maybe you need to work with big data in a way, that's a specialized aspect of it and I think people could dabble in that if they were just regular developers and they didn't want to go too deep.

But it sounds like the bigger, or the end goal here, maybe altruistic, is to just give people access to data. So even if they don't know SQL or they don't know something complex, just make it so that whatever data is there that anybody, with whatever level is, they can consume it.

Patrick: Yeah. And move fast with the thing that you're building. I use a Facebook term, but Facebook does do this. Internally there's a system called Occhio that provides gets and puts for your data, but it abstracts away things like geography and so on. But the companies that are trying to move quickly understood this a long time ago. If you have to reason through, "Am I doing a full table scan? Is that an efficient inner join?" If you have to reason through that, you're not moving fast anymore.

Jeremy: Right. Right. All right. Cool. All right, so let's talk about Astra a little bit more and this whole idea of, because Astra is the serverless version, the hosted version, the serverless version of Cassandra, right? Through DataStax?

Patrick: Right. And ...

Jeremy: Did I get that right?

Patrick: You got it right. And so it gives you full access. You can use port 9042 if you still want to use a driver, but it gives you access via GraphQL, REST, and there's also a Document API. So if you just want to persist your JSON and then pull it back out, it does full documents. So it emulates what a MongoDB or DocumentDB does. But the important thing, and this is the somewhat revolutionary side of this, and again, this is something that we're looking to put into open-source, is the serverless nature of it.
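To make the Document API access concrete, here is a hedged sketch of what such a call might look like. The URL path and `X-Cassandra-Token` header follow the shape of Stargate's documented Document API, but the database id, region, token, and names below are illustrative placeholders, not real credentials or endpoints, so the code only builds the request rather than sending it.

```python
# Build (but don't send) a PUT request for a JSON document, in the style of
# the Stargate/Astra Document API. Everything concrete here is a placeholder.
import json


def build_put_document_request(db_id, region, token, namespace, collection, doc_id, doc):
    url = (f"https://{db_id}-{region}.apps.astra.datastax.com"
           f"/api/rest/v2/namespaces/{namespace}/collections/{collection}/{doc_id}")
    headers = {"X-Cassandra-Token": token, "Content-Type": "application/json"}
    return {"method": "PUT", "url": url, "headers": headers, "body": json.dumps(doc)}


req = build_put_document_request("my-db-id", "us-east1", "AstraCS:example-token",
                                 "app", "users", "u42", {"name": "Ada"})
print(req["method"], req["url"])
```

A matching GET against the same path would return the stored JSON back, which is the "persist your JSON and pull it back out" round-trip Patrick mentions.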

You only pay for what you use. And when you want to create a Cassandra database, we don't even call it a Cassandra database on the Astra panel anymore. We just create a database. You give it a name. You click. And it's ready. And it will scale infinitely. As long as we can find some compute and network for you to use somewhere, it'll just keep scaling and that's kind of that true portion of serverless that we're really trying to make happen. And for me, that's exciting because finally, all that power that I feel like I've been hoarding for a long time is now available for so many more people.

And then if you do a million writes per second for 10 minutes and then you turn it off, you only pay for that little short amount of time. And it scales back. You're not paying a persistent charge forever.

Jeremy: I'm just curious from a technical implementation standpoint, because I'm having PTSD, or nightmares, from my days running Cassandra, and so I'm just trying to think how this works. Is it a shared tenancy model? Or is there a way to do single tenancy if you wanted that as a service?

Patrick: Under the covers, yes, it is multi-tenant, but we had to do some really interesting engineering inside. So my RCO's going to kill me if I talk about this, but hey, you know what, Jeremy? We're friends, we can do this. He's like, "Don't talk about the underlying architecture." I'm talking about the underlying architecture. The thing that we did was take Cassandra and decompose it into microservices, mostly. It's still Cassandra, it's just that how we run it makes it way more amenable to multi-tenancy and scaling in that fashion, where the queries are separated from the storage. And the things running in the background, if you're familiar with Cassandra, because it's log-structured storage it has to do compactions and things like that, all of that is just kind of on the side. It doesn't impact your query.

But it gives us the ability to, if you create a database and all of a sudden you just hammer it with a million writes per second, there's enough infrastructure in total to cover it. And then we'll spin up more in the back to cover everything else. And then whenever you're done, we retract it back. That's how we keep our costs down. But then the storage side is separated and away from the compute side, and the storage side can scale its own way as well.

And so whenever you need to store a petabyte of Cassandra data, you're just storing, you're just charged for the petabyte of storage on disk, not the thousandth of a cluster that you just created. Yeah.

Jeremy: No. I love that. Thank you for explaining that though, because that is, every time I talk to somebody who's building a database or running some complex thing for a database, there's always magic. Somebody has to build some magic to make it actually work the way everyone hopes it would work. And so if anybody is listening to this and is like, "Ah, I'm just getting ready to spin up our own Cassandra ring," just think about these things because these are the really hard problems that are great to have a team of people working on that can solve this specific problem for you and abstract all of that crap away.

Patrick: Yeah. Well, I mean it goes back to the Dynamo paper, and how distributed databases work, but it requires that they have a certain baseline. And they're all working together in some way. And Cassandra is a share-nothing architecture. I mean, you don't have a leader node or anything like that. But like I said, because that data is spread out, you could have these little intermittent problems that you don't want to have to think about. Just leave that to somebody else. Somebody else has got a Grafana dashboard that's freaking out. Let them deal with it. But you can route around those problems really easily.

Jeremy: Yeah. No, that's amazing. All right. So a couple more technical questions, because I'm always curious how some of these things work. So if somebody signs up and they set up this database and they want to connect to it, you mentioned you could use the driver, you mentioned you can use GraphQL or the REST API, or the Document API. What's the authentication method look like for that?

Patrick: Yeah. So, it's a pretty standard thing with tokens. You create your access tokens, so when you create the database, you define the way that you access it with the token, and then whenever you connect to it, if you're using JavaScript, there are a couple of connection libraries that just take that as one of the environment variables.

And so it's pretty standard for connecting the cloud databases now where you have your authentication token. And you can revoke that token at any time. So for instance, if you mistakenly commit that into your Git ...

Jeremy: Say GitHub. We've never done that before.

Patrick: No judging. You can revoke it immediately. But it also gives you RBAC, the controls over whether it's read or write or admin, if you need to create new tables and that sort of thing. You can give that level of access to whatever that token is. So, a very simple model, but at that point you're just interacting through a REST call, using any of the HTTP protocols, or the CQL protocol.

Jeremy: And now, can you create multiple tokens with different levels of permission, or is it all just one token that gives you full access?

Patrick: No, there are multiple levels of permission, and actually that's probably the best way to do it. For instance, your CI/CD system should be able to create databases and tear them down, right? That would be a good use for that. But if you have, for instance, a very basic application, you just want it to be able to read and write. You don't want it to change any of the underlying data structures.

Jeremy: Right. Right.

Patrick: That's a good layer of control, and so you can have all these layers going on one single database. But you can even have read-only access too, for ... I think that's something that's becoming more and more common now that there are reporting systems on the side.

Jeremy: Right. Right. Good.

Patrick: No, you can only read from the database.
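The scoped-token model described above can be sketched as a small role check: each token carries a role, the service checks the role before allowing an operation, and revoking a token kills its access immediately. The role names and token strings here are illustrative, not Astra's actual role taxonomy.

```python
# Sketch of scoped tokens: a token maps to a role, a role maps to allowed
# operations, and revocation simply removes the token from the registry.
ROLES = {
    "read_only":  {"read"},
    "read_write": {"read", "write"},
    "admin":      {"read", "write", "create_table", "drop_table"},
}

# Hypothetical issued tokens and their roles.
tokens = {"AstraCS:aaa": "read_only", "AstraCS:bbb": "admin"}


def authorize(token, operation):
    role = tokens.get(token)
    if role is None:
        return False  # unknown or revoked token
    return operation in ROLES[role]


def revoke(token):
    tokens.pop(token, None)  # e.g. after accidentally committing it to GitHub


print(authorize("AstraCS:aaa", "write"))  # False: read-only token
print(authorize("AstraCS:bbb", "write"))  # True: admin token
revoke("AstraCS:bbb")
print(authorize("AstraCS:bbb", "write"))  # False: revoked
```

A CI/CD pipeline would hold an admin-level token, the application a read/write one, and a reporting system a read-only one, which is the layering Patrick describes.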

Jeremy: And what about data backups or exporting data or anything like that?

Patrick: Yeah, we have a pretty rudimentary backup now, and we're working on some more sophisticated versions of it. Data backup in Cassandra is pretty simple because it's all based on snapshots. If you know Cassandra, the data you write is immutable, and that's a great starting point when you come to backing up data. But yeah, with the rudimentary backup system we have now, if you need to restore your data, you have to put in a ticket to have it restored to a certain point.

I don't personally like that as much. I like the self-service model, and that's what we're working towards. And with more granularity, because with snapshots you can do things like snapshot, this is one of the things that we're working on, is doing like a snapshot of your production database and restoring it into a QA cluster. So, works for my house, oh, try it again. Yeah.

Jeremy: That's awesome. No, so this is amazing. And I love this idea of just taking the pain of managing a database away from you. I love the idea of making it simple to access the data. Don't create these complex things where people have to build more. If people want to build a data access layer, it should maybe just be enforcing a model or something like that, not figuring out, "If you're on this shard, we route you to this particular port," or whatever. All that stuff is just insane. So yeah, maybe to go back to the idea of this whole episode, which is just: stop using databases, start using these data services, because they're so much easier to use. I mean, I'm sure there are concerns for some people, especially when you get to larger companies and you have all the compliance requirements and things like that, and I'm sure Astra and DataStax have all the compliance covered. But yeah, any final words, advice to people who might still be thinking databases are a good idea?

Patrick: Well, I have an old 6502 on a breadboard, which I love to play with. It doesn't make it relevant. I'm sorry. That was a little catty, wasn't it?

Jeremy: A little bit, but point well taken. I totally get what you're saying.

Patrick: I mean, I think that it's, what do we do with the next generation? And this is one of the things, this will be the thought that I leave us with is, it's incumbent on a generation of engineers and programmers to make the next generation's job easier, right? We should always make it easier. So this is our chance. If you're currently working with database technology, this is your chance to not put that pain on the next generation, the people that will go past where you are. And so, this is how we move forward as a group.

Jeremy: Yeah. Love it. Okay. Well Patrick, thank you so much for sharing all this and telling us about DataStax and Astra. So if people want to find out more about you or they want to find out more about Astra and DataStax, how do they do that?

Patrick: All right. Well, plenty of ways at www.datastax.com and astra.datastax.com if you just want the good stuff. Cut the marketing, go to the good stuff, astra.datastax.com. You can find me on LinkedIn, Patrick McFadin. And I'm everywhere. If you want to connect with me on LinkedIn or on Twitter, I love connecting with folks and finding out what you're working on, so please feel free. I get more messages now on LinkedIn than anything, and it's great.

Jeremy: Yeah. It's been picking up a lot. I know. It's kind of crazy. LinkedIn has really picked up. It's ...

Patrick: I'm good with it. Yeah.

Jeremy: Yeah. It's ...

Patrick: I'm really good with it.

Jeremy: It's a little bit better format maybe. So you also have, we mentioned the Stargate project, so that's just Stargate.io. We didn't talk about the K8ssandra project. Is that how you say that?

Patrick: Yeah, the K8ssandra project.

Jeremy: K8ssandra? Is that how you say it?

Patrick: K8ssandra. Isn't that a cute name?

Jeremy: It's K-8-S-S-A-N-D-R-A.io.

Patrick: Right.

Jeremy: What's that again? That's the idea of moving Cassandra onto Kubernetes, right?

Patrick: Yeah. It's not Cassandra on Kubernetes, it's Cassandra in Kubernetes.

Jeremy: In Kubernetes. Oh.

Patrick: So it's in concert with how Kubernetes works. Yes. So it's using Cassandra as your default data store for Kubernetes. It's actually another one of the projects that's just taking off. KubeCon was last week from when we're recording now, or two weeks ago, and it was just a huge hit, because again, it's like, "Kubernetes makes my infrastructure easier to run, and Cassandra is hard. Put those together. Hey, I like this idea."

Jeremy: Awesome.

Patrick: So, yeah.

Jeremy: Cool. All right. Well, if anybody wants to find out about that stuff, I will put all of these links in the show notes. Thanks again, Patrick. Really appreciate it.

Patrick: Great. Thanks, Jeremy.

2021-06-07

Episode #103: Differing Serverless Perspectives Between Cloud Providers with Mahdi Azarboon

About Mahdi Azarboon

Mahdi Azarboon started working as a serverless specialist and evangelized it through blog posts, conference talks, and open source projects. He climbed the corporate ladder and currently works as Senior Manager - Cloud Presales at Cognizant. He helps big, traditional corporations move into the cloud and improve their existing cloud environments. Having a hands-on background and currently working at the corporate level of cloud journeys, he has matured his overall understanding of serverless.

Linkedin: linkedin.com/in/azarboon/
Twitter: @m_azarboon

Watch this episode on YouTube: https://youtu.be/QG-N3hf1zqI

This episode sponsored by CBT Nuggets and Lumigo.

Transcript:
Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Mahdi Azarboon. Hey, Mahdi. Thanks for joining me.

Mahdi: Hi. Thanks for having me.

Jeremy: So, you are a senior manager for cloud pre-sales in the Nordic region for Cognizant. So, I'd love it if you could tell the listeners a little bit about yourself, your background, and what it is that you do at Cognizant?

Mahdi: Yeah. Just a little bit of background: I started as a full stack developer, then I joined Accenture as a serverless specialist, and there I started to play with AWS Lambda specifically. I started to do some geeky stuff, writing blog posts, speaking at conferences, and so on. Then I was developing several solutions for multiple corporations in Finland, and then I joined another consultancy company, Eficode, which is known for DevOps. They have a very good reputation for that in the Nordic region. I was the AWS practice lead, driving their business. Then I joined my current company, Cognizant, where I work in a pre-sales capacity. I'm not hands-on anymore, but basically I do whatever is needed to make our customers happy and get them to the cloud. That means high-level solutioning, talking with the customer, and, as a senior architect, commenting on stuff, making diagrams, and translating business and technical requirements, basically acting as an interface between delivery and the customer side. Yeah, that's all.

Jeremy: Right. Awesome. All right. Well, you mentioned some of the blog posts that you were writing, and some of that was a little while ago, and I think there's actually some interesting perspective there, so I want to get into that in a little while. But I want to start with this post that you wrote about what you need to know about Azure Functions versus AWS Lambda and vice versa. It was sort of a lead-in to this concept of multi-cloud, and not cloud-agnostic, like being able to run the same workloads, but being able to understand the differences, or some of the nuances, in Azure versus AWS. And of course, that got extended to GCP and IBM Cloud and some of these other things. But I'm curious, why is understanding different serverless services, or different cloud services across clouds, in this multi-cloud world we're living in now, so important?

Mahdi: Yeah. That's a good question. First of all, I would like to clarify that whatever I say in this podcast is just my personal opinion and doesn't reflect my employer. This is just to cover myself.

Jeremy: Absolutely. Like the standard Twitter disclaimer.

Mahdi: Yeah.

Jeremy: Views are my own, right? Yeah.

Mahdi: I don't want to answer to my boss after this podcast. Answering your question, the thing is that multi-cloud is inevitable, and even AWS, which in its best practices a few years ago, I remember, was saying no, try to avoid that, has started to admit it through their offerings; they are embracing multi-cloud with their Kubernetes offerings. The thing is that, whether AWS fans like it or not, Azure is gaining a lot of market share, and it depends on the country. For example, in Finland at least, AWS is really popular. But now I'm dealing with other countries, like Norway or the UK, where Azure is very popular. I mean, you can restrict yourself to only one cloud, but in my opinion you are missing a lot of opportunities, both to learn and, as a company, to embrace the capabilities, because whether ...

Well, Azure provides some things that are better than AWS. I mean, I heard from one corporation that they like the AI capabilities of Azure much better than AWS's, and they do a lot of analytics. So it's inevitable, whether many people like to admit it or not.

Jeremy: Right. Right. But even given that it's inevitable, multi-cloud is one of those terms ... I just talked to Rob Sutter about multi-cloud a couple of episodes ago and it's so expansive. I mean, everything from SaaS providers to, obviously, the public cloud providers, to maybe even on-prem cloud, I know that sounds weird, but your hybrid cloud and things like that. So the problem is that there are a lot of providers, there are a lot of SaaS products, things like that. I mean, are you advocating that people try to become experts in multiple clouds? What level of knowledge do you think you need to have in order to work across multiple clouds?

Mahdi: I haven't met a single person who can claim to be an expert in more than one cloud provider, and I have talked with many experts, because I have been running the serverless community in Finland, so I have been talking with many experts. None of them dared to claim that. I mean, even keeping up with one single cloud provider is a lot of work, and I don't consider myself an expert in any of them either, because I'm not hands-on anymore. The thing is that, no, you don't have to be an expert in everything to work with different stuff. Of course, at some level you need experts: for example, you might need an Azure expert to work with Azure and an AWS expert to work with AWS. But in my opinion, if you really want to keep up with the technology, you need to be good in one provider, really good with that, then know the fundamentals of the cloud and the best practices, which are, I would say, irrespective of which cloud provider you are using, and be willing to learn.

For example, it happened to me. At the time I wrote that blog post, I was only working with AWS. Then they said to me, okay, you have this project on Azure, go for it, and I had never touched Azure before. It was a lot of pain, but I learned a lot. So, as I said, the fundamentals are the same: be an expert in one and be willing to learn. In my opinion, that should be good enough.

Jeremy: Right. I think that's good advice, to be well-rounded. That's always good advice for technologists; going a mile wide and an inch deep is usually good enough. But like you said, being an expert in a specific field or a specific technology can really help. So you think it's certainly a good career choice to start to broaden your perspective a little bit?

Mahdi: Definitely. Actually, I was one of those AWS fans really following the Serverless Heroes and so on, basically parroting whatever AWS was saying, and I was saying that I just wanted to work with AWS. Actually, it happened to be like that, but when I joined my current company, my manager said that most of the opportunities I would be filling, in my department at least, are mostly Azure. So basically they said, it is as it is, cope with it. And I felt very happy, actually. Well, I'm sure that anyone who is in the cloud gets many job offers from recruiters. I was thinking about it; when I was an AWS-only guy, at least in my experience, half of those job ads were irrelevant and ...

Jeremy: Right. Right.

Mahdi: ... depending on the country. For example, in Finland, AWS is very popular at least, and if you are an Azure expert, you are going to miss a lot of opportunities. But at least in my experience, if you can say that you have also worked with the other one, that you know something, a lot of career opportunities open up. This is my observation.

Jeremy: Right. Right. Yeah. And I think actually, you made a really good point and that's certainly, in terms of AWS heroes and so forth. I'm an AWS Serverless Hero and we get inside information but we spend a lot of time thinking about things the AWS way. AWS is very good at what they are doing with serverless and they have an interesting perspective in terms of what they believe serverless is supposed to be and what that roadmap looks like. But even just hosting this show and talking to so many different people in different clouds and different ways that they do it, getting that different perspective of how other people or other clouds think about serverless and how they are building it out. I think that's actually really good context to have.

Mahdi: Yeah, I agree. Actually, you are one of my heroes also; I was following you. But I should say that it had its own advantages, and the disadvantage was that I was in a kind of AWS bubble. When I started to see that, okay, even AWS itself is opening up with this multi-cloud offering, and some Serverless Heroes started to write about that, I was like, okay, it's time to open up your thinking. But by that time I had actually already started to use Azure. So again, I would say that what you have been doing, what the Heroes are doing, is a great job, really a great job.

Jeremy: Absolutely, totally agree.

Mahdi: Azure also has something similar. If I remember correctly, they call it MVP, something like that.

Jeremy: I guess that's the MVP program, yeah.

Mahdi: The thing is that, at least based on my observation, they have more or less the same dynamics and narrative among themselves. They also focus on Azure the way the AWS ones focus on AWS. But I was lucky, maybe by chance, that somehow I had to join, use, or attach to both communities. Yeah, it has been a very valuable experience.

Jeremy: Yeah. Yeah. So you went through that process, you were sort of an Azure convert from AWS, and you stayed connected. But that idea of transferring your skills and transferring the concepts, you mentioned the pillars are the same as they are in AWS and you have some of the general concepts. But as someone who went through that, what were some of the challenges and barriers that you faced going from AWS and that way of thinking into the Microsoft world?

Mahdi: That's a very good question. The thing is that, in the department, at that time I was working at Accenture, and actually all of us were big AWS fans; at least, Accenture owned Avanade, so Azure was very separate and we were in an AWS bubble. Yeah. I'm sure that AWS is much more mature in many aspects than Azure, no doubt. At least it was like that, and I'm sure it's still like that; the gap has become narrower, but that still might be the case. I remember at that time many of my colleagues were really bashing Azure, and they were right. I mean, some of its services were really immature. But then again, I had the chance to ... Actually, it wasn't quite a choice; they said to me, okay, this is an Azure project. Basically, a team, I would say quite junior, had developed something on Azure, something that you probably never want to hear about.

They developed everything in the browser, no infrastructure as code at all. They were junior, so they made quite a few mistakes too, but they got the app up and running. It didn't matter how; it was just running, and that's all. So they told me, okay, we need some little improvement, and that little improvement basically forced me to reverse engineer whatever they had done, and that required me to upgrade the whole application, because, as I said, there was no infrastructure as code. If I wanted to do local development, I had to use Windows; I had only a Mac, so I had to change platforms completely. That was a very tedious process by itself. On top of that, I had to figure out how Azure Functions work, and that was another pain.

The thing is that I had an AWS mindset, and I was thinking, okay, AWS is the best, they came out first with the cloud and Lambda, so Azure should be something like that. As I elaborated in the blog post, no, actually they are different, and there are some small gotchas or nuances that can take even days to figure out; but you need to figure them out, otherwise your app doesn't work. After a while, when I reflected on things, I realized that, okay, of course I was angry and pissed off ... I was really bashing Azure; that was the dynamic over there. But after a while, when I reflected on the whole process, and I actually wrote this in the blog post, I realized that part of the blame was on me, because I was expecting Azure to work the AWS way. No, that's not how it works.

I mean, when you look at, for example, authentication, the mindset is different. That requires a learning curve; you need to search Stack Overflow, and actually the Azure community is really supportive. I really like it. They have their own community, which is really supportive. So the pain basically was that I had to find out how things work in Azure and what's different. But now that I'm doing pre-sales in both clouds, I can say that, again, the fundamentals are the same.

Jeremy: Right.

Mahdi: And this AWS Well-Architected Framework, there are five pillars. You can see that Azure has copied it from AWS; it's obvious. They haven't even changed the names. The naming is similar, and you can tell it's just a rough copy, at least as of a few months ago when I had to work with it. But in the end, Azure is catching up fast.

Jeremy: Right.

Mahdi: It's undeniable. And the fundamentals are more or less the same. I mean, for example, if you want to innovate, you should have a shorter time to market, so basically you need to use infrastructure as code. If you want to make your app really highly available, you need to follow best practices, maybe do SRE. At the high level it's the same, but when it comes to the detail level, it can be very different. Even the documentation was really confusing, and it wasn't just me saying that.

Jeremy: Out of curiosity, was the documentation for AWS more confusing or was the documentation for Azure more confusing?

Mahdi: This is a million-dollar question. Actually, I thought that maybe it was just me. I found the Azure docs very confusing, but I thought it was me, so I asked, I think, nine of my friends who are AWS experts: "What's your opinion? Have you worked with Azure? Do you find the documentation readable?" I think all of them said that it's confusing.

Jeremy: Yeah.

Mahdi: So I was like, okay, then it's confusing. Then I talked with a few Azure experts who breathe Azure, they are Windows guys who never touched AWS, and they said, "No, the documentation is good. Everything is fine." Actually, if I remember correctly, one of them said, "Actually, I find the AWS documentation confusing." It seemed like two different worlds, you know?

Jeremy: Right. I find them both confusing, actually.

Mahdi: Maybe now it has changed.

Jeremy: Right. Yeah. So, that's interesting. I mean, I think the documentation is a good ... Well, first of all having good documentation is important and I think they both have good documentation, but I do think it's organized differently, right?

Mahdi: Yes.

Jeremy: And again, it's organized more towards, I think, that different mindset. But let's talk a little bit about maturity, because, to be fair to Azure, Azure Functions have come a very, very long way. I remember way back in 2018, way back, I mean, it seems like a long time ago at this point, seeing very early demos of Durable Functions and thinking, oh, that's just a mess, that is not the way you want to do that. Now fast forward three years and Durable Functions are pretty cool; they do a lot of really interesting things. It does take time to catch up. So certainly, between your criticism of Azure Functions back then and what they are now, there is probably a huge gap.

Mahdi: Yeah. I'm sure that most of the detailed criticism I mentioned in the blog post has either been fully addressed or improved a lot. That's why I don't want to focus that much on the details, and I would focus more on the high-level things. Yeah.

Jeremy: Right. So speaking of the high-level things, let's go there for a second. You mentioned the Well-Architected Framework, this idea of there being something very similar, maybe even a carbon copy, in Microsoft. But what about getting down into the weeds? You said that the individual skills there are certainly different. For the most part, though, event-driven, stateless compute, things like that, do those skills transfer over?

Mahdi: Yeah, they do. It's just a matter of implementation. Well, there are some caveats. For example, I remember in the Azure community, at that time, and this has probably changed, but I think it shows some kind of mindset, I was struggling to find the observability tools of Azure. If I remember correctly, one of the tools is called Application Insights, and they had some event-driven insight feature they called near real-time. I remember that when I wanted to get the logs from the functions, it took three minutes for them to come up. Three minutes. At the same time, with CloudWatch, for example, they were coming in 10, 20 seconds, something like that, and I mentioned it in their community.

If I remember correctly, it was a notable guy, maybe one of the product team, and he said that a three-minute delay is, in his opinion, near real-time. He said that, and I remember we made a lot of jokes with my colleagues about that sentence.

Jeremy: I can imagine.

Mahdi: But that shows some kind of mindset. I mean, three minutes, I don't think that is near real-time. Most probably this time has been reduced since, but I just wanted to illustrate the mindset. But, yeah, event-driven, stateless stuff, those skills are transferable. When it comes to implementation, though, it's different. For example, as I mentioned in that blog post, there was some authentication stuff you can do with certain environment variables in AWS, but the same thing in Azure, if I remember correctly, is done through something like service principals. It's different. So if you try to play with environment variables, it turns out, no, it doesn't work that way. When you get down to the detailed stuff, it gets different. Yeah.
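A rough sketch of the difference Mahdi is describing: AWS SDKs read credentials straight from well-known environment variables, while Azure typically authenticates through a service principal, an Azure AD identity described by a tenant/client/secret triple (the real SDK equivalent is azure.identity's ClientSecretCredential). The helper functions here are illustrative, not part of either SDK; the variable names are the conventional ones each platform looks for.

```python
import os

# AWS style: SDKs like boto3 pick up credentials directly from
# well-known environment variables; no separate identity object needed.
def aws_credentials_from_env(env=None):
    env = os.environ if env is None else env
    return {
        "access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        "region": env.get("AWS_DEFAULT_REGION", "eu-north-1"),
    }

# Azure style: authentication usually goes through a service principal,
# an identity registered in Azure AD with a tenant/client/secret triple
# (in the real SDK this becomes azure.identity.ClientSecretCredential).
def azure_service_principal_from_env(env=None):
    env = os.environ if env is None else env
    return {
        "tenant_id": env.get("AZURE_TENANT_ID"),
        "client_id": env.get("AZURE_CLIENT_ID"),
        "client_secret": env.get("AZURE_CLIENT_SECRET"),
    }

demo = {"AWS_ACCESS_KEY_ID": "example-key", "AZURE_TENANT_ID": "example-tenant"}
print(aws_credentials_from_env(demo)["region"])          # falls back to the default
print(azure_service_principal_from_env(demo)["tenant_id"])
```

The point is not the code itself but the shape: one model is "set two env vars and go", the other is "register an identity first", which is exactly the kind of nuance that costs days when you carry one mindset to the other cloud.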

Jeremy: Yeah. Right. Right. I'm curious about another thing, the interconnectivity, what you would connect. I'm trying to remember what they are called, triggers and bindings in Azure Functions, as opposed to event sources, I think we call them, in the Lambda world. Would you look at the way that you connect to other services? Is that another thing that is similar between the two?

Mahdi: Okay. I should say that I don't remember that much of these details anymore, but as far as I remember, again, the high-level concepts were more or less the same. Okay, they call them triggers, but I don't remember now what AWS Lambda calls them. But it was more or less the same.

Jeremy: I can't even remember what it's called, it's like event sources or something like that.

Mahdi: Yeah. It was more or less same. Yeah, yeah, yeah.

Jeremy: Yeah.

Mahdi: And they had something like an event bus, in order to have a centralized event-driven thing. It's the same, I would say.

Jeremy: Yeah.

Mahdi: Again, it's different when it comes to the poor person who has to implement it, if they haven't done it before. But for the person doing the high-level architecture, I can easily say I don't see that much difference. I know that someone who has to implement it and hasn't done it before will go through the most pain, because they have to find those small configuration things that, unfortunately, you need to get right. Otherwise, it doesn't work. But at a high level, it's the same. It's event ...

Jeremy: Yeah.

Mahdi: Yeah.
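To make the triggers-versus-event-sources parallel concrete: both platforms hand your function a structured event payload, and the difference is mostly in the wrapping. A simplified sketch follows; the SQS "Records" shape follows AWS's documented event format, while the Azure side is reduced to a plain function receiving the queue message body, standing in for what a queue trigger binding would deliver.

```python
import json

# Lambda style: the event source (here SQS) delivers a batch wrapped in
# a "Records" list; each record carries the message in its "body" field.
def lambda_handler(event, context=None):
    return [json.loads(record["body"])["order_id"] for record in event["Records"]]

# Azure Functions style: a queue trigger binding hands the function the
# message payload directly (simplified here to a plain string argument).
def azure_queue_handler(msg: str):
    return json.loads(msg)["order_id"]

sqs_event = {"Records": [{"body": json.dumps({"order_id": 42})}]}
print(lambda_handler(sqs_event))                          # [42]
print(azure_queue_handler(json.dumps({"order_id": 7})))   # 7
```

Architecturally both are "queue triggers a function", which matches Mahdi's point: the high level is the same, and the pain lives in the differing payload shapes and configuration details.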

Jeremy: I think the nuances are always those tough things. So thinking of the overall mindset here, and maybe the approach to serverless, I know you went from AWS to Azure, but I'm curious, do you think it would be easier to go from Azure to AWS, or from AWS to Azure?

Mahdi: Well, I came from this side of the river to the other one, so I can only speculate about the other direction. But I would say it's more or less the same, because when I talked with a few Azure people who have always been breathing Azure and never, or barely, touched AWS, I felt that they feel the same way about AWS. They would need to go through the same pain; they would find AWS stuff very confusing, especially since they would not have that great AWS community support, so they would need to either do the Stack Overflow thing or have AWS enterprise support. I would say it's more or less the same for them.

Jeremy: Yeah. I mean, I think that's interesting too, that it is different enough that there is pain there, right? It would be nice if there were some standards, and I know there's the Cloud Native Computing Foundation with CloudEvents and some of those things, not that that's all working out ... I think Kubernetes and Knative and some of those projects are implementing it, but I'm not sure the same things fall into AWS. But anyway, go ahead. Do you have any thoughts on that?

Mahdi: Actually, about that foundation, I was working at Eficode and they really work on that stuff. They are so good at Kubernetes. I find that a completely different world too.

Jeremy: Yeah.

Mahdi: This Cloud Foundation stuff. I never had to implement any of that for any of our customers in any of the companies that I worked, that they were AWS or Azure. Yeah, some of them they used Kubernetes also, but that CNC or whatever it was ...

Jeremy: Yeah, CNCF.

Mahdi: Yeah, yeah. I found it, that's a different world for me too, I should say. Sometimes, out of curiosity, I played with it, but nobody ever asked me, do you want to use that?

Jeremy: Right. Right. Yeah. No, that makes sense. All right. So we've been talking about the difficulties of switching between different cloud providers, but also the value of knowing those different cloud providers, and more so, so that you can build serverless applications. So let's talk about serverless in general. I know you're not in the developer role anymore, but it could actually be really interesting to get your perspective on the management approach to this, and how companies are thinking about the value of serverless at a management level, or, I guess, at a planning level. So let me ask you this question then: are you seeing companies looking at serverless and adopting serverless and that serverless mindset? And then maybe a follow-up question would be, if they are not, why do you think they should be embracing serverless?

Mahdi: Okay. First I'll answer the second part. Basically, the thing is that nowadays the world is fast-changing. Many corporations are benefiting from their existing market share, or regulation, or the monopoly that they have. For now, it works. But if they don't want to change, if they have the mindset that things are working so what's the point of change, most probably within a decade or so their business is going to die, because the world is fast-changing and they need to adapt to the market.

So ideally, they need to go through the pain and disrupt themselves. Disruption always brings pain; you cannot disrupt yourself and feel that everything runs smoothly. Ideally, they need to disrupt themselves, go through the pain, and become really agile in order to understand customer feedback and deliver the value that the customer really wants. They can either go through this phase, or they can ignore it and say, okay, things are working, we are making money through our monopoly, regulation, existing market share, whatever, and then their business is going to go away. Those are the two choices, that's all. Yeah. A painful process to become more competitive and stay ahead for customers, or assume that everything is okay, and by the time they realize, it's going to be very late.

Jeremy: Right. So let me go back to that first question then. So you are seeing people not doing that?

Mahdi: Okay. The thing is that what I'm saying is going to be biased, because I work in a cloud team, so the opportunities that are brought to me, of course, come from departments and companies that are interested in the cloud. So my view is a bit biased, but what I'm seeing is that it varies a lot, and I mostly focus on corporations, because, yeah, for startups it's much easier to go for it.

Jeremy: Right. Of course.

Mahdi: At least in Finland, my observation was that it goes two ways, and it depends on the executive leadership. For example, a major bank in Finland says, we want to go to the cloud, we want to go for it, and once one of these big ones goes, there is a domino effect on the others. But there are some other ones that say, no, it's the cloud, who is going to take care of the data? We are not going to do that. And they don't touch it.

There are also companies where some departments are interested in trying things out, and then they have to fight internally with the more conservative departments. So there are these three levels. But mostly, I work with the ones who are inclined toward using the cloud.

Jeremy: Right. Right. So then, for the ones that are starting to dabble in the cloud, clearly there's lift and shift, right? Which I think we all probably understand at this point is not the best implementation or the best use of the cloud, right? It is better to maybe use more cloud-native services to do that. So in terms of people just rehosting, or maybe re-platforming, are you seeing that rearchitecting, or I guess refactoring, or is that something companies are staying away from?

Mahdi: First of all, I respectfully maybe have to disagree with you.

Jeremy: Okay.

Mahdi: Actually, I think rehosting is a good approach, and that's what even AWS promotes. For conservative companies who want to start working with the cloud and get the fastest result in the shortest period of time with the least amount of pain, it's better to migrate through the easiest route, which is lift and shift. Easiest relatively speaking, of course.

Jeremy: Right.

Mahdi: And then, have a data-driven approach to see what really needs to be improved and then refactor or rearchitect or re-platform based on data. So in AWS terms, I'm sure you're right there with me, have that evolutionary architecture in a data-driven approach. So lift and shift, I don't consider bad at all. Actually, I consider it a very good cornerstone, stepping stone at the beginning, for the beginning.

Jeremy: Interesting. Okay.

Mahdi: Yeah. What was the other question?

Jeremy: No. I was just going to say, so you've got companies that are lift and shift, and, yeah ...

Mahdi: Oh, okay. Sorry. Sorry. Yeah. Sorry, I just remembered.

Jeremy: Yeah.

Mahdi: Sorry to interrupt you. Actually, I'm a bit careful about using the term cloud native. I remember, in a previous company where I was working, we had a philosophical fight about it, and I'm sure everyone ended up dissatisfied because I had to declare, authoritatively, this is the definition of cloud native. I'm sure many of them hated me after that. The thing is, I really struggle to find a consensus on what cloud native means, and if you spend some time on it, you will find a variety of definitions. So I'm picky about the term cloud native. There is a lot of fighting that can happen over what exactly cloud native is. Some consider Kubernetes cloud native; some consider using AWS or Azure cloud native. It's a very controversial term, I would say. Yeah.

Jeremy: Well, let me interrupt you for a second. So when I think of cloud native, what I'm thinking of are services and components that are built specifically to run in the cloud, things like API Gateway at AWS, or Azure Functions, things that are very much built to run in the cloud environment. It's that serverless aspect; I think of it more serverlessly. I mean, I know containers and so forth fit in there as well, but that's how I think of cloud native: going beyond just your traditional VM and running everything on the VM, and moving to the higher-level services that are more managed for you.

Mahdi: May I challenge you?

Jeremy: Absolutely.

Mahdi: So you just said, basically, things that use the cloud, like API Gateway and so on. And now I have to ask a more technical question: what is the cloud?

Jeremy: Right. Well, that's another good question. Right.

Mahdi: Okay. I can tell you, based on the several definitions that I read and reflected on, I have this definition of cloud native; most probably many people will disagree, and that's fine, because it's very controversial. In my opinion, cloud native is very simple. If your application is architected in a way that it can leverage the advantages of the cloud environment, then it's cloud native. It doesn't matter if it is on Kubernetes, on AWS, on Azure, or so on. If it can scale to zero, and theoretically to infinity, and you pay only for what you use, then it's cloud native. That's my definition, and I read so many definitions before I came up with this. But feel free to disagree; many people have, and I'm fine with that.
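Mahdi's scale-to-zero, pay-for-what-you-use test can be made concrete with a back-of-the-envelope cost comparison. The prices below are illustrative assumptions in the spirit of published Lambda pricing, not official figures, and the VM cost is likewise an assumed placeholder.

```python
# Illustrative, not official, prices roughly in line with published
# Lambda pricing at the time of writing (an assumption for this sketch).
PER_MILLION_REQUESTS = 0.20    # USD per million invocations
PER_GB_SECOND = 0.0000166667   # USD per GB-second of compute

def pay_per_use_cost(invocations, avg_duration_ms, memory_gb):
    # Cost scales with actual usage: zero traffic costs zero.
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * memory_gb
    return (invocations / 1_000_000) * PER_MILLION_REQUESTS + gb_seconds * PER_GB_SECOND

ALWAYS_ON_VM_MONTHLY = 70.0  # assumed cost of a modest VM billed 24/7, idle or not

# A mostly idle internal service: 100k requests/month, 200 ms, 512 MB.
monthly = pay_per_use_cost(100_000, 200, 0.5)
print(f"pay-per-use: ${monthly:.2f}  vs  always-on VM: ${ALWAYS_ON_VM_MONTHLY:.2f}")
```

Under these assumptions the pay-per-use bill is well under a dollar a month; the gap is the practical meaning of "scale to zero and pay only for what you use".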

Jeremy: That's all right. You are not the only one, I'm sure, with a differing opinion of what cloud native is. So let me ask this, because I think that's interesting, the way you explained the strategy of lift and shift: that it's probably the lowest-risk way to take an application that's on-prem and move it into the cloud, and then to use data and so forth to figure out what parts of the application you might want to migrate to, again, I don't want to overload the term, more cloud-native things. I think that's actually really interesting. I have seen many companies that seem to do this, where they just rehost without really thinking through what that strategy is going to be, and then they basically end up having their on-prem in the cloud, without benefiting from some of those managed services and some of the benefits of the cloud. That's what I have seen.

Mahdi: Well, you know it better than me. Your cloud environment is never perfect, and it's always an ongoing operation. So going to the cloud ... Again, if you take your own machines and put them on, I don't know, EC2 or whichever VMs on AWS or Azure, that's a very good first step ...

Jeremy: Right. That's probably true.

Mahdi: ... but you need to be able to start leveraging it. At least get the data on what is being used, and hopefully, when you are going to the cloud, you have done some analysis and realized that some of the services don't even work in the cloud: some of them need to retire, some of them cannot be rehosted and must be rearchitected, because they are so legacy. But even assuming you have done your homework and done the rehosting, okay, you need to leverage it: go and look at all the things AWS or Azure provide, see how over-utilized or under-utilized your CPUs and so on are, and, based on that, do right-sizing.

Jeremy: Right.

Mahdi: That's a good step. Then, if something requires refactoring, do the refactoring and use more managed services. So again, rehosting is a good first step, but the cloud is a long journey. I don't know who came up with the idea that the cloud is cheap, I really don't know.
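The data-driven right-sizing step Mahdi describes can be sketched as a toy calculation: look at CPU utilization from the rehosted VMs and shrink capacity accordingly. The p95 rule and the 70% target-peak threshold here are illustrative assumptions, not an AWS or Azure recommendation, and real tools use far richer data.

```python
# Toy right-sizing: given CPU utilization samples (fractions of current
# capacity), suggest a vCPU count so the observed p95 would sit near a
# target peak. Assumes the load is CPU-bound and scales linearly.
def rightsize(cpu_samples, current_vcpus, target_peak=0.70):
    p95 = sorted(cpu_samples)[int(0.95 * (len(cpu_samples) - 1))]
    used_vcpus = p95 * current_vcpus          # capacity actually consumed at p95
    return max(1, round(used_vcpus / target_peak))

# A lifted-and-shifted 8-vCPU VM that mostly idles:
samples = [0.05, 0.08, 0.06, 0.10, 0.07, 0.09, 0.22, 0.06, 0.05, 0.11]
print(rightsize(samples, current_vcpus=8))  # suggests far fewer vCPUs
```

This is the "use the data, then re-platform or refactor" loop in miniature: measure first, then let the numbers drive the next migration step.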

Jeremy: Right. No, I totally agree. You are right about the first step, and I actually loved your point about which services you might be able to retire and not move at all, because in a lot of these big companies there are a lot of services that you probably don't need anymore, or that are redundant, and you could get rid of those moving to the cloud. Good point. All right. I've got a couple more minutes, and I want to go back to an article that you wrote. Now, this is about three years old, and reading the article now, not all of it is still relevant, because so many things have changed. But what is relevant is what has changed. This was an article about the worrying and promising signals from the serverless community, from an event you went to in Germany, and you called out a couple of different points.

One of the points was that users had ignored security, and that was a worrying sign for you. Where do you think cloud security, or more specifically serverless security, is now? Do you think people are thinking about it and have brought it front and center like it probably should be, or do you think it's still a worrying factor?

Mahdi: In all the cloud solutions I have implemented, mostly for enterprises and a few startups, I haven't seen a single one of them have a cloud security specialist. Most of the corporations, at least in my experience, when they want to go to the cloud, must address its security, typically because of customer requirements, so they bring in a security guy who has worked with, let's put it this way, all their traditional security stuff, and he has to weigh in on the cloud part. And it's funny, sometimes I basically have to teach him. I remember one customer's head of security; I really had to teach him, and I actually had a Lambda function in front of him and he was like, wow, is it really like that? I had to teach him what the attack methods are, and it was funny: he had to sign off that my solution was secure, but basically I had to tell him what the priorities were.

Jeremy: You had to tell him what it was.

Mahdi: They address it in a traditional way. Yeah, they do some kind of automated test and that kind of thing, which is, yeah, definitely ... Again, I'm not a security expert, but as far as I understand, they have some fundamentals which are safe, that's true, but when it comes to the cloud, especially serverless and function as a service, you will see that there are a lot more attack vectors, and unfortunately, I have not seen any security experts who have expertise in that. I learned about it because I was curious, and I started to work with professionals, some startups which provide security solutions for serverless. So that's how I got it, but again, I had to go through the pain. It took a few months to read through so much material. But I haven't seen any security specialists working on cloud projects who have done this.

Jeremy: Yeah.

Mahdi: So I would say customers consider it, but no, there is still a long way to go to mature.

Jeremy: They are not addressing it. Yeah. It's funny because I remember that in 2018, 2019, there were a couple of companies that were in the serverless security space and they were all acquired. So now they are part of larger platforms which is ...

Mahdi: Exactly.

Jeremy: ... great for them, don't get me wrong. All right. So then another thing you said and I think this is important, because the biggest complaint that I always hear about serverless is, just the workflows are not easy. So you had mentioned that DevOps was finding its way and that was sort of a promising signal, you think that we've ... I mean, we have got a lot of tools for serverless now. Speaking of Azure, the way to deploy an Azure function right through VS Code now with the plugins is really, really slick and Serverless Frameworks, SAM, CDK, all these are there, Terraform and so forth. I mean, have we gotten to some stability around serverless and sort of mixing in DevOps there?

Mahdi: Based on my experience, at least with the ones that I have worked with, I can say that, yes, DevOps is now a part of the solution that's provided to the customer, and maybe that's because, personally, I went through the pain, so whenever I propose any solution to a customer, I always use infrastructure as code and always try to have a DevOps-centric viewpoint on the solution. So I try to push for that and, yes, I find customers receptive to that. It seems to me that DevOps is no longer one of those buzzwords for cool kids who just want to do this stuff; even the corporate guys are more receptive to it. Again, there is more work needed to really do the DevOps stuff, because you know that many companies claim that they are doing DevOps, but in reality, they are not. You know this better than me.

Jeremy: Right. Of course.

Mahdi: But, yeah, it's good. I'm happy for that. I mean, a few years ago DevOps was one of those buzzwords, but now I don't think it's a buzzword anymore.

Jeremy: Yeah. Yeah. And I think that serverless has actually made it a lot easier for teams to do automation and things like that; there's a lot that you can do because you have that little bit of compute power that you can do something with. So I think that's definitely promising. So speaking of compute power and other things that you can do with it, one of the things you mentioned as a promising signal was that serverless-based prototypes were on the rise, meaning different services, whether it was queues or whatever, or I guess Lex and things like that, all kinds of services that allow you to do different things and that specialize in different capabilities. So how do you feel about that now? Because there are a lot of those APIs out there.

Mahdi: Yeah. Actually, I also find that even in these legacy corporations that I have been working with, I like that now, when they want to do a migration especially, or do anything cloud, first they do a POC. Yes, I find it good. Honestly, I was sometimes impressed; from some people I would never expect it, they'd say, first let's do a POC, then see what comes out of it. Oh, really? Yeah, it's good in my opinion. It's finding its way.

Jeremy: Yeah. Yeah. No, I like that too and I think you are right about proof of concept, because it's just one of those things where even if it's expensive to use the Google Vision API or something like that, it's a really good way to prove out how that fits into whatever the business use case you have for that and then like you said, you can certainly take a step further and create more sophisticated or I won't say sophisticated but maybe more integrated tools or something like that, that would work around that. So I think that's interesting, allow people to fail fast, learn quickly, and just build out their applications.

Mahdi: Yeah. When we say POC, I should say that I wouldn't limit it only to the cool new serverless or AI stuff that AWS and Azure provide. Even for migration, actually, a POC is highly recommended. Again, I was working for some period of time in consultancy for, I would say, one of the most conservative banks in Finland, small and conservative, but even then, as we were trying to push the cloud, they said, "Yeah, first let's do a POC of the migration and see what's going to happen." Again, I was really surprised. I would never have expected it from them. But the idea of fail fast and learn fast, I think it requires some level of maturity to reach that.

Jeremy: True.

Mahdi: There is really more room for improvement there: fail fast, learn fast. Yeah. There's just something, I don't know, I would like to address about this cloud stuff if I can.

Jeremy: Yeah, absolutely.

Mahdi: Yeah. Basically, when companies or customers decide to go to the cloud, I'd recommend that they don't look only at the technical aspect of it, because I see that there is a lot of debate, for example ... At least it was like that: AWS, Azure, this kind of thing. In the end, I'd say that most of it doesn't matter that much. I mean, it depends on, sometimes, company policy, how much discount you can get, how much funding you can get from the cloud provider. So it's not really the technical people who decide; sometimes it's the executives who decide.

Jeremy: Right.

Mahdi: But even then, when you go to the cloud, in my opinion, as much as the technology and maturity of the cloud provider matter, the amount that your company is ready to change its operations is also important. This is my favorite example: I developed, at that time at least, a state-of-the-art serverless solution with DevOps and CI/CD stuff for a major bank in Finland, and I was the first one who managed to do that among the many consultants that they have. It was really good. I'm proud of what I did, and actually, I open-sourced it. Basically, we could deploy multiple times per day, and we went to their release manager and I said, "Okay. It's like that. Everything is perfect. DevOps, CI/CD, we can release multiple times per day." And she said, "No. It doesn't work like that. We need to release once per month," and we had to go through a very painstaking process and fill out so many useless documents.

It didn't matter how much I tried to convince her: "Well, the idea is different. You need to do small deployments. That way you actually have less risk. You deploy once per month and still something goes wrong every time, but when you do more frequent deployments, your risk is lowered." She said, "No. We are a bank. It is as it is. Sorry." Most of the effort that I made, at least at that time, basically went to waste, because the process was legacy, even though the technology was good. I'm sure that by now they have changed, because I was among those innovators, basically, or the early adopters, who made it through that. But in my opinion, technology matters, but operations and processes and releases also matter, and everything needs to change. So basically, it needs to be a holistic approach to going to the cloud, not just implementing it from a technical viewpoint.
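Mahdi's argument for small, frequent releases can be put in rough numbers. This is a toy model with made-up probabilities, not data from any real deployment: if each change independently has some chance of breaking production, a monthly release that bundles twenty changes is far more likely to break something, and it leaves you twenty suspects to dig through when it does.

```python
# Toy model with illustrative numbers: assume each change independently
# has a 2% chance of breaking production when released.
p = 0.02
monthly_batch = 20  # one big release per month bundles ~20 changes
daily_batch = 1     # frequent releases ship one change at a time

# Probability that at least one change in a release breaks something.
fail_monthly = 1 - (1 - p) ** monthly_batch
fail_daily = 1 - (1 - p) ** daily_batch

print(f"monthly release breaks prod: {fail_monthly:.0%}")
print(f"single small release breaks prod: {fail_daily:.0%}")
# And when the big monthly release does break, there are 20 changes to
# dig through instead of 1.
```

The individual failure probability is invented for illustration; the point is only that batching multiplies both the chance of a bad release and the scope of the rollback.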

Jeremy: Mahdi, thank you so much for having this conversation with me. This was a lot of fun, and I love talking to people who have experience moving from one cloud to another. It's a huge shift, but I think your advice here is great: know the basics on those other platforms. So if people want to reach out to you, find out more about you, or follow you on Twitter, how do they do that?

Mahdi: Well, I have a Twitter account, but nowadays I mostly post non-serverless stuff, so LinkedIn is a good option for me.

Jeremy: Okay. Great. And it's m_azarboon on Twitter, and I will put your LinkedIn and Twitter in the show notes as well. Thanks again, Mahdi.

Mahdi: Thank you very much for having me. Bye-bye. Thank you.

2021-05-31

Episode #102: Creating and Evolving Technical Content with Amy Arambulo Negrette

About Amy Arambulo Negrette

With over ten years of industry experience, Amy Arambulo Negrette has built web applications for a variety of industries including Yahoo!, Fantasy Sports, and NASA Ames Research Center. One of her projects modernized two legacy systems impacting the entire research center and won her a Certificate of Excellence from the Ames Contractor Council. She has built APIs for enterprise clients at cloud consulting firms and led a team of Cloud Software Engineers. Currently, she works as a Cloud Economist at the Duckbill Group doing bill analyses and leading cost optimization projects. Amy has survived acquisitions, layoffs, and balancing life with two small children.

Website: www.amy-codes.com
Twitter: @nerdypaws
LinkedIn: linkedin.com/in/amycodes

Watch this episode on YouTube: https://youtu.be/xc2rkR5VCxo

This episode sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi everyone, I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Amy Arambulo Negrette. Hey, Amy thanks for joining me.

Amy: Thank you, glad to be here.

Jeremy: You are a Cloud Economist at the Duckbill Group, so I'd love it if you could tell the listeners a little bit about yourself and your background and what you do at the Duckbill Group.

Amy: Sure thing. I used to be an application developer, I did a bunch of AWS stuff for a while, and now at the Duckbill Group, a cloud economist is someone who goes through Cost Explorer and your usage report and tries to figure out where you're spending too much money and how best to help you. It is the best-known use of a small skill I have, which is being able to dig through someone's receipts and find out what their story is.

Jeremy: Sounds like a forensic accountant, maybe forensic cloud economist or something to that effect.

Amy: Yep. That's basically what we do.

Jeremy: Well, I'm super excited to have you here. First of all, I have to ask this question, I've known Corey for quite some time, and I can imagine that working with him is either amazing or an absolute nightmare. I'm just curious, which one is it?

Amy: It is not my job to control Corey, so it's great. He's great to talk to. He really is fully engaged in any conversation you have with him. You've talked to him before, I'm sure you know that. He loves knowing what other people think on things, which I think is a really healthy attitude to have.

Jeremy: I totally agree, and hopefully he will subtweet this episode. Anyways, getting into this episode, one of the things that I've noticed is that you create a lot of technical content. I've seen a lot of the talks that you've given, and I think you've done such a great job of not only coming up with content, but making that content interesting.

Sometimes when you put together technical content, it's not super exciting, but you have a very good way of taking that technical content and making it interesting, and then also following up on it. You have this series of talks where you started talking about managing FaaS, and then you went to the whole frenemies thing with Fargate versus Lambda. Now, I think the latest one you did was about container support within Lambda. Maybe we can just go back, or start at a point where, for people who are interested in doing talks, what is the reason for even creating some of these talks in the first place?

Amy: I feel a lot of engineers have the same problem day-to-day, where they will run into a bug, and then they'll go hit the all-knowing software engineer, which is the Google search engine, and either have absolutely nothing come up, or have six posts that say, I'm having this problem, with no answer ever given. This is just a fast way of answering those questions before someone has to ask.

Jeremy: Right. When you come up with these ... You run into this bug, and you're thinking to yourself, you can't find the answer. So, you do the research, you spend the time digging through, and finding the right way to solve it. When you put these talks together, do you get a sense that it's helping people and then that it's just another way to connect with the community?

Amy: Yeah. When I do it, it's really great, because after a talk, I'll see people either in the hallway, or I'll meet someone at a booth, and they'll say, I ran into this exact same problem, and I gave up because it was such a strange edge case that it was too hard to fix, and we just moved on to another solution, which is entirely possible.

I also get to express to just the general public that I do, in fact, know what I'm talking about, because someone has given me a stage to talk for 30 minutes, and just put up all of my proofs. That's an actually fun and weirdly empowering place to be.

Jeremy: Yeah. I actually think that's really interesting. Again, for me, I loved your talks, and some of those things are ... I put those things at the back of my mind, but I know for people who give talks, who maybe get judged for other reasons or whatever, that it certainly is empowering. You certainly shouldn't have to do it; there certainly should be that same level of respect. But is that something that you found, that doing these talks really just sets the tone, right off the bat?

Amy: Yeah, I feel it does. It helps that when someone Googles you, a bunch of YouTube videos on how to solve their problem come up; that is extremely helpful. I do a lot of consulting, so if I ever have to go onsite, and someone wants to know what I do, I can pull up an actual YouTube playlist of things that I've done. It's like being in developer relations without having to write all of that content; I get to write a fraction of that content.

Jeremy: Right. Unfortunately, that is a fact that we live with right now, which is, it is completely unfair, but I think that, again, the fact that you do that, you put that out there, and that gives you that credibility, which again, you should have from your resume, but at the same time, I think it's an interesting way to circumvent that, given the current world we live in.

Amy: It also helps when there are either younger engineers or other younger professionals who are looking at the tech industry, which especially right now does not have the best reputation, to be able to see that there are people from different backgrounds, either educationally or financially, or what have you, who are able to go out and be subject matter experts in whatever it is that they're talking about.

Jeremy: Right. I definitely agree with that. That's that thing, where the more that we can amplify those types of voices and make sure that people can see that diversity, it's incredibly important. Good for you, obviously, for pushing through that, because I know that I've heard a lot of horror stories around that stuff that makes my blood boil.

Let's talk to some of these people out here who potentially want to do some of these talks, and want to use this as a way to, again, sell themselves. Because I can tell you one thing, once I started writing blog posts and doing talks and doing those sorts of things, clearly, I have a very different background, but it just gave me a bunch of exposure; job offers and consulting clients and things like that, those just become much easier to get when you can actually go out there and do some of this stuff.

If you're interested in doing that, I think one of the hard things for most people is, what even makes a good talk? You've come up with some really great talks. What's that secret sauce? How do you do that?

Amy: I think it can also be very intimidating, since a lot of the talks that get a lot of promotion are huge vendor events where they're trying to push a product or a solution. That usually takes up a lot of advertising real estate, essentially; that's what you see in all the threads and everything. When you actually get to these community conferences, or even when I would speak at AWS Summit, it was ... I had a very specific problem that I needed to solve. I ran into a bug, and the bug was not in the documentation, because why would it be?

Jeremy: Why would you put that in there, right?

Amy: Of course. Then Google, three pages down, maybe put me on the path to finding the right answer, and it's the journey of trying to put all of the bug fixes in place to make it work for your specific environment and then being able to share that.

Jeremy: Right, yeah. That idea of taking these experiences that you've had, or trying to solve a problem, and then finding the nuances maybe in solving the problem as opposed to the happy path, which it's always great when you're following a blog post and it says, run this command, then run this command, then run this command. Well, what happens on that third command when the thing blows up, and you have no idea what to do? Then you end up Googling for five hours trying to find your way out of that.

You take this path of, find those bugs or find that non-happy path and solve it. Then what do you do around there? How do you then take that ... You got to make that interesting somehow.

Amy: Yes. A lot of people use gifs and memes. I use pictures of food and screencaps from Dungeons and Dragons. That's usually just different enough that it'll snap someone out of their phone going, "Why is there a huge elf on my screen trying to attack people, screaming ELF errors?" Well, that's because that's what they thought it would be great to call it. It's not a great error code. It doesn't explain what it is, and it makes you very confused.

Jeremy: Right. Part of that is, and again, there's that relatability when you create talks, and you want to connect with the audience in some way. But this is the other thing that I've always found the hardest when I'm creating talks: trying to find the right level. Because AWS always does this thing where they're like, it's a 200 level, or it's a 400 level, and so forth. I think that's helpful, but you're going to get people of all different skill levels. How do you take a problem like that and make it relatable, or understandable? How do you find that right level?

Amy: The way I see it, there's going to be at least one person of these two types in the room who is not going to be your target audience: someone who doesn't know what you're talking about, but sees that a tool they're considering is going to pose a problem, and wants to know how difficult it is to fix. Or there's going to be a business person who has no technical background, and they just want to know if what they're evaluating is worth evaluating; if this error is going to be so difficult to narrow down and resolve, then why would we go through something where my engineers are going to spend hours trying to fix what's essentially a configuration issue?

When I write any section of a talk, I make sure that it addresses a person who may not have come into that with that exact problem in mind. For the people who have, they'll understand the ... In animation, it's called key images, where there are very specific slots where you understand the topic of what is happening and the context around it. I always produce more verbose notes that go with my presentation. I usually release it either at the end of the day, or later on that week, once everyone has had time to settle, and it provides a tutorial-esque experience where this is what you saw, this is how you would actually do it if you were in front of a screen.

Jeremy: Yeah.

Amy: There are people who go to technical talks with a laptop on their lap because they're also working while they're trying to do it. But most of the time, they're not going to have the console open while you're walking through the demo. So, how are you going to address that issue? It's just easier that way.

Jeremy: I like that idea too, of ... I try to do high-level bullet points, and then talk about the bullet point. Because one thing that I try to do, and I'd love to hear your thoughts on this as well. Here I am picking your brain trying to make my own talks better. But basically, I do a bullet point, and then I talk through it. I actually animate the bullet points coming in.

I'm not a huge fan of showing an entire slide with all the bullet points and then letting people read ahead, I bring a bullet point in, talk about the bullet point, bring another bullet point in. Is that something you recommend doing too? Or do you just present all the concepts and then walk people through it?

Amy: I think it depends. I tend to have very dense slides, which is not great for reading, especially if you're several rows back. I truly understand that. But because I also talk very fast when I'm on stage, I want there to be enough context around what's happening, so that if I gloss over a concept, then you can visually understand what's happening.

That said, that's because the entire bullet block on my slide is going to be about one very specific thing that's happening. It's not something that you have to view step-by-step. Now, I do have a few where, especially in a more workshop scenario, you're going, I want you to think about this first and then go on to this next concept. I totally hide stuff. I just discovered, for a talk that I was constructing the other day, that there's an animation that drops them down like index cards, and that's now my favorite animation.

Jeremy: When you're doing that ... Because this is the other thing, just for people out there who have ever written a talk or given a talk: the first iteration of it is never going to be the right one. You have to go through and revise. It is sort of weird, and I don't know, maybe you felt this way too: in the pre-pandemic world, when you would give talks in person, most of the time you'd give them to a relatively small audience, a couple of hundred people or whatever, as opposed to now, when we do talks post-pandemic and they're online, they're immediately available online.

It's hard to give the same talk over and over and over again without somebody potentially having seen it. A lot of work goes into a single talk, so not being able to use the same one over and over again is not great. But how do you refine it? Do you test it with a live audience, or do you use a family member or a friend, or a colleague? How do you test and refine your talks?

Amy: I'm actually an organizer at a meetup group, specifically built around giving people of marginalized gender identities a place to stage and write technical content. It is a very specific audience.

Jeremy: I can imagine.

Amy: But it addresses that issue I had earlier about visibility, it also does help you ... If you don't have a lot of contacts in this industry, just as an aside, technical speaking is a way to do it, because everyone loves talking to each other after the stress has worn off, and you become the friendliest person after you've done that.

But also, there are meetup groups out there, specifically about doing technical feedback, or just general speaking feedback. If you want to do something general, Toastmasters is a great organization to do. If you want to do strictly technical, if you do any cloud-related stuff, the DevOps communities are super friendly, even if it's not specifically about DevOps. I'm not a DevOps person, but I have a lot of DevOps friends. Some of my best friends are DevOps people.

And you can get on a meetup or a Zoom call and just burn through your slides for about 10 or 15 minutes and see ... Your friends will be very honest with you, in a small group.

Jeremy: Right. One of the things I did notice, too, giving a talk in person versus giving a talk via Zoom call, is that sometimes when you don't hear any laughs or chuckles from a little joke that you make in there, it can feel very lonely in that space while you're waiting for a reaction. It's a little bit ...

Amy: It's worse when there are people in the room. I assure you, it is so much worse.

Jeremy: That is very true. If something falls flat, that's a good point. Just going back to this idea of creating good talks, and what makes a good talk. Where do you find ... You mentioned, maybe it's a vendor conference or something, and you install the vendor stuff, and you find the bugs and so forth. But are there any other places that you get inspiration from? Are there any resources you use to sort of build some of these talks?

Amy: Again, the communities help. The communities will tell you, really; it's like, I don't understand this thing, can someone hop on a call with me for a real quick minute and explain why this concept is so hard? That's a very good place to base your talk off of. As far as making them engaging and interesting, I tend to clone video gaming videos, just because that's what I watch. I know if it's going to be interesting to me, then it will probably at least be different from the content that's out there.

Jeremy: Right. That's a good way to think of things too, is if it's something that you find interesting, chances are, there are lots of other people that will find that interesting. All right, let's go back to just this idea of creating new talks. You had mentioned this idea of, again, finding the bugs and so forth. But one of the things that I think we see quite a bit is always that bleeding edge stuff. People always want to write content about something new that happened.

I'm guilty of this, I would think from a serverless standpoint where you're talking about things that are really, really bleeding edge. It's useful and they're interesting. Certainly, if you go to a conference about serverless, then it's really nice to see you have these talks and what might be possible. But sometimes when you're going to more practical type things. Again, even DevOps Days, and some of those other things, I think you've got attendees or talk listeners who are looking for very practical advice.

I guess the question is, how do you take a new piece of content, one of these problems, whatever it is ... I guess, how do you keep finding new content? That's probably the better way to ask that question.

Amy: Well, to just roll back a little bit: my problem with bleeding edge content is that, while I love watching it, bleeding edge content will almost always be a product demo, because it's someone who developed a new solution and wants to share it with everybody, which is just going to walk you through how it's used. Which is great, except, and this is just the nature of the cloud industry, all of this stuff changes day-to-day.

These tools may not be applicable in a few months, or they may become the new standard. There's no way to tell until you're already six months out, and by then, they've already gone through several product revisions. I once did a talk where I was talking about best practices, and AWS released their updated best practices the day before my talk, and I had to update three slides. It threw off my timing, it was great.

That's just one of those pitfalls that you have to roll with. As far as getting new content, though, it depends who your audience is, because my audience tends to be either ICs or technical leads, and by then you're usually in a company where, if you're not developing these bleeding edge solutions, you're just using the tools that are out there already.

You had brought up my "Serverless Frenemies" talk, which is still my favorite title of any talk that I've ever made, because when I did the managing containers one, and I love all my DevOps friends, they all got into my mentions about, why don't you just use Fargate? If you're at the containerization stage, why don't you just use Fargate? Because it's not even close to the same thing; it is closer to Kubernetes than it is to Lambda, and I was looking for a Lambda-like solution. That's what that whole deal was about, and I was able to stretch that out into, I think, 30 minutes, because Twitter will tell you what's wrong, whether it's accurate or not, and whether or not they're actually your friends. They are my friends, but come on.

Jeremy: Twitter can definitely be brutal. I think that, and maybe unpack a little bit what you were saying, is you're creating content around existing tools. One way to do it is, you're using existing tools, you're creating content around that, or you can create content around that. Looking at those solutions, you introduce a new solution to something, or you're even using an existing tool, nothing's perfect. You had mentioned that idea of bugs and so forth. But just, I guess new solutions, or just solutions, in general, maybe higher-level abstractions, everything creates some new type of problem that you have to deal with, and that's probably a pretty effective way to generate new content.

Amy: It is. If you ever have to write down an RCA, which, for those who have not had the pleasure of doing one is called a root cause analysis, where you took down production, and you had to explain why.

Jeremy: Yep.

Amy: Or you ever did this, hopefully in staging, or hopefully in development, where you ran into a situation where ... I had a situation once where a Lambda would not delete itself. I call it my Skynet problem: it hit a stage where it was trying to both save and delete at the same time. It would lock itself, and I had to destroy the entire stack and send that command several times just to force it through.

If you ever have a problem like that, that is a thing that you write up instantly, and then you turn it into slide decks, and then you go to SlidesCarnival, you throw a very flashy background on it, and next thing you know, you have a TED talk, or a technical talk.
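Amy's "send the command several times to force it through" fix can be sketched as a plain retry loop. Everything below is hypothetical: `flaky_delete` is a simulated stand-in for the real call (for a CloudFormation stack it would be something like boto3's `delete_stack` plus a waiter), and the lock error is faked just to show the shape of the retry.

```python
import time

def force_delete(delete_fn, attempts=5, base_delay=1.0):
    """Re-send a delete until it goes through or we run out of retries."""
    for attempt in range(1, attempts + 1):
        try:
            delete_fn()
            return attempt              # number of tries the delete took
        except RuntimeError:
            if attempt == attempts:
                raise                   # out of retries; surface the error
            time.sleep(base_delay * attempt)  # simple linear backoff

# Simulated stack that is locked (saving and deleting at once) for the
# first two attempts, then finally accepts the delete.
state = {"calls": 0}

def flaky_delete():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("stack locked: UPDATE_IN_PROGRESS")

tries = force_delete(flaky_delete, base_delay=0)
print(f"delete went through after {tries} tries")
```

The linear backoff is arbitrary; against a real API, exponential backoff with jitter is the more common choice.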

Jeremy: Right. The other thing too, is, I find use cases to be an interesting, just like ... Non-traditional use cases are kind of fun too, how can I use this in a way that it wasn't meant to be used, and do something like that?

Amy: I love those. Those are my favorite. I love watching people break away from what the tutorial says you have to do: I'm going to get a little weird with it, and that to me is totally fascinating. When the whole "I fed these scripts into a computer" meme came out, I thought that was super fascinating, because that was something a company I had worked for did. They used analytics ... I used to work in Fantasy Sports ... to write color commentary for your Fantasy Football team, and they would send it out.

If you did really well, you would get a really raving review, and if you did really poorly, you would get roasted by a computer, and then that gets sent to everyone in the league, and it's hilarious. But that is not a thing you would just assume a computer would do: write hot takes on your fantasy football team.

Jeremy: That's ... Sure, go ahead.

Amy: It's so much fun. I love watching people get weird with the tools that are there.

Jeremy: There are times where you could do something like that, where you create content around some strange use case or whatever, and I love that idea of getting weird with it. The other part of it, though, is that if you're sitting through a talk about some super interesting problem, and again, I don't know, maybe it's some database replication thing that you're just really into, that makes sense. But I think the majority of problems that developers have are not that interesting, they're just frustrating.

Probably the worst thing is having to sit through a talk about some frustrating issue you already have. Is there a way to basically say, "Look, I have a problem that I want to talk about. It's not the most interesting problem"? How can you flip that, and take a problem that's not interesting and make it interesting?

Amy: The batching containers and the frenemies talk was all based off of a bin library error from within the Lambda AMI. That, on paper, is extremely boring, and should be a thing that you can easily look up; it is not. When I went about trying to make tracking down library errors interesting, just narrating it straight is very slow and can drain the energy out of your voice.

But I put a lot of energy into my work in general, and that's just how I had to approach pulling these talks together: I like what I do, generally. When I try to explain what I do to people, it sounds super boring, and I own that. Now I'm doing it with spreadsheets, which is much, much worse. But when I tell people, it's not about the error itself, it's about everything that happened to make this one particular error happen. The reason why this error happened was because Lambda uses AWS's very specific Linux AMI when they did not use to, and they left stuff out for either security or performance purposes.

Whether or not we as a group agree with that, that's a business decision that they made. How does their business decision affect your future business decisions and your future technical ones? Well, that becomes a way more interesting conversation, because it's like, we know this is going to break at this part, do we still want to use SSH? Do we still need it for this reason? You can approach it more from a narrative standpoint of, I wasted way too much time with this, did I need to? It's like, well, you shouldn't have, this should not have happened, but no bug should have happened, right?

Jeremy: Right.

Amy: You work through your process of finding a solution instead of concentrating on what the solution is because the solution they can look up in your show notes later.

Jeremy: Right. No, I love that idea of documenting your process as opposed to just the solution itself. You find the problem, you pull the thread and where does that take you? I think to myself, a lot of times I go down the rabbit hole on trying to find the solution to a problem that I have or a bug fix, whatever. Sometimes, the resolution is underwhelming. Maybe it's not worth sharing. But other times, there's a revelation in there. I think you're right, with a little bit of storytelling, you can usually take that and turn that into a really interesting talk.

Amy: One of the things it will also do, if you look at it from a process and from a narrative standpoint, is that when you take this video and send it to either a technical lead or a product manager, they'll understand what the problem was, because you did not bog it down with code. There's very little live code in mine, because I understand that people build things differently; everyone's code is as different as every person. I get that and I've come to terms with it. This is the best way to share that information.

Jeremy: Absolutely. All right, let's wrap up the idea of building talks. What is your advice to someone who is starting out new? What's the best way for them to get started, or what's just some general advice for people starting to build talks?

Amy: The best content new engineers can produce, mostly because tutorials are never written from this standpoint, is this: as someone who knows very little of the way a language or a framework should work, write down your entire process of getting a framework onboarded, or how you build a messaging system, things that people have written a billion times. Because chances are, one, you got that work from someone else's blog post or their documentation, and you can cite that. And two, when you do it that way, you not only get into the habit of writing, but you get into the habit of editing it in a way that makes it more palatable for people who do not share your specific experience.

When you do it this way, people can actually see, from an outsider's perspective, exactly what is hard about the thing that they built, or what people who do not have that level of experience are going through. If a tutorial is targeted at engineers who know where the memory leaks in PHP are, that's a thing that comes with experience; it is not a thing that can be trained.

When a new engineer hits that point, finds it in a new framework, and fixes it, they start knowing where to fix other problems. That way, more senior engineers and more vetted people can learn from your experience, and then they will contact you and teach you how to find these issues so you don't run into them again, and you end up with someone you can just bounce ideas off of. That's how you get pulled into these technical communities. It's a really self-healing process.

Jeremy: Yeah, I love that. I think this idea of approaching something from a slightly different angle: your experience, the way that you do it, the way that you see it, the way that you perceive the error or the next prompt that comes back, or how you read an error message, any of those things. You sharing your experience around that is hugely valuable to the people that are building these things. But also, you may run into problems that other people like you run into, and sometimes all it takes is a tiny twisting of the words, rearranging a sentence in a way that now clicks with somebody when it didn't before. I love that.

That's why I always encourage people: even if somebody has written this content 100 times before, whatever slight difference there is in your version could have a powerful effect on someone else.

Amy: Yeah, it really can.

Jeremy: Awesome. All right, let me ask you a couple of questions about Lambda and Functions as a Service because I know that you spent quite a bit of time on this stuff. I guess a question, especially, maybe even from a cloud economist, what's next for Lambda and Functions as a Service? Because I know you've written about the Lambda containers, but what's maybe that next evolution?

Amy: What AWS did recently when they released Lambda container image support is basically put it at feature parity with Azure and GCP, which already had that ability; they had function services where you could upload your own container. AWS finally released the base image. Granted, if you knew where to look, you could get it before, but now they've actually released it and announced it to the general public, so you don't have to know someone in order to be able to use it.

What I see a lot of people being able to do with this now is local development testing, so they don't have to push anything to their account and rack up charges, when all they want to do is make sure that the one-line update they made actually worked, and that they didn't put a space or a tab in the wrong place and take down the entire stack. Which, again, we've all done at least once, so don't worry about it. If you've ever taken down production, don't worry, you're not the only one, I promise you. You can't throw a t-shirt into a conference room and not hit a dude who took down production. I'm going to save that for later.
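The simplest form of that local-first loop is just invoking the handler with a sample event before anything touches an account. The handler name and event shape below are illustrative, not from the episode:

```python
# handler.py: a minimal Lambda-style handler with a local smoke test, so a
# one-line change can be checked before anything is pushed to an account.
import json

def handler(event, context):
    # Echo the "name" field back, standing in for real business logic.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": json.dumps({"greeting": f"hello, {name}"})}

if __name__ == "__main__":
    # Invoke locally with a sample event instead of deploying.
    print(handler({"name": "Amy"}, None))
```

For container images specifically, the AWS-provided base images bundle the Lambda Runtime Interface Emulator, so you can also `docker run -p 9000:8080 your-image` and POST test events to `http://localhost:9000/2015-03-31/functions/function/invocations` to exercise the packaged function locally.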

Local development testing and live simulation are really big things. I've also seen people ask to do full-on data science just on Lambda containers, so they don't have to use Kubernetes anymore. Because, speaking of cost stuff, Lambda is easier to track cost-wise than Kubernetes, where you have to tie a bunch of stuff together in order to make that tracking work. That would be great.

I think from here on, a lot of the FaaS changes are not going to be front-end features anymore; it's all going to be optimizations by the providers, so you're not going to see much of that anymore. It's not like before, where they would add three more fields and make a blog post about it. I think everything is just going to be tuning, from Lambda's perspective. That, and hooking it to more things, because they love their integrations. What good is Lambda if you can't integrate it with everything?

Jeremy: Right, if you can't hook it up to events. It's interesting, though, this move to support containers as a packaging format. You're right, these capabilities have existed for a while in IBM, in Google, in Microsoft, to use a container, and again, that's a very overloaded word, I know, but to use that as a packaging format. But beyond the parity with the other cloud providers, who is that conversation for? Whose mind does that change about serverless, or FaaS, I guess?

Amy: The security team.

Jeremy: Security, okay.

Amy: Because if you talk to any engineer, if it's a technical problem, they'll find a way to fix it. That's just the way, especially at the individual contributor level, the brain works: oh, this is a small thing, I bet I can fix it in a few days, or a weekend. The weekend turns into a month, but that's a completely different problem. I've had clients who did not want to use Lambda because they could not control the containerization system. You would be pushing your code into containers that were owned by Amazon, and they saw that as a liability.

It does have some very strong technical implications, because you're now able to choose your runtime more easily than by trying to string layers together. I know layers are supposed to fix this problem, but it's so hard. It's so hard for something that you should be able to download off of Docker Hub, play with, and put back. It's so unnecessarily hard, and it makes me so angry.

If you're willing to incur that responsibility, you can tweak your memory and you have more technical control, but you also have more control at a business level, and that is a conversation that will go way easier as far as adoption.

Jeremy: Right. The other thing, in terms of, I guess, the complexity of running K8s, or Kubernetes, is that it just seems like a lot of complexity. You mentioned the billing aspect of it and trying to track cost. Not that everyone's trying to narrow down exactly how much this Lambda container ran them, and maybe you have more insight into that than I do, but there's the idea of just the complexity.

It seems to me that if you start thinking about cost, the total cost of ownership of running a container in a Lambda function or running it in Fargate, versus having to install and maintain ... I would say, even if you're using one of the managed services like EKS or something like that, the total cost of ownership of going down the serverless route has got to be better.

Amy: Yeah, especially if you're one of these apps that are very user-generated-content based. You're tracking mostly events and content, and not even a huge amount of content; you're not streaming video, you're sharing pictures, or sharing ... If you were trying to rebuild Foursquare, you would just be sharing geo data, which is comparatively an extremely small piece of data.

You don't need an entire instance, or an entire container, to do that. You can do that on a very small scale and build it out really quickly. That said, if you're one of these three-person teams, and then there's interest in your product for whatever reason and it explodes, then it's not just your cost: if you had to manage the traffic of that, if you had to manage the actual resources of that, and your usage did not keep pace with your bill, that's not great.

Being able to, at least in the first few years of the company, just use Lambda for everything is probably the safer solution, because you're still rapidly iterating, you're still changing things very quickly, and you're still transmitting very small bits of data. That said, there are also large enterprise companies that are heavy Lambda users, and even then, their Lambda bill compared to their Kubernetes bill ... if you rounded down their Kubernetes bill, you would get their Lambda bill.
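As a rough illustration of why small event-driven workloads stay cheap on Lambda, here is the back-of-the-envelope arithmetic. The rates below are AWS's published on-demand Lambda prices at the time of writing (us-east-1, x86) and should be treated as assumptions; the workload numbers are invented:

```python
# Back-of-the-envelope Lambda cost for a small event-driven workload.
# Rates are assumptions based on published on-demand pricing and may change.
PER_MILLION_REQUESTS = 0.20    # USD per one million invocations
PER_GB_SECOND = 0.0000166667   # USD per GB-second of compute

def monthly_cost(requests, avg_ms, memory_mb):
    # Compute cost is billed on memory allocated times duration used.
    gb_seconds = requests * (avg_ms / 1000.0) * (memory_mb / 1024.0)
    return requests / 1_000_000 * PER_MILLION_REQUESTS + gb_seconds * PER_GB_SECOND

# One million 100 ms invocations a month at 128 MB comes out well under a dollar.
print(round(monthly_cost(1_000_000, 100, 128), 2))
```

Running the numbers like this, before and after a traffic spike, is a quick way to sanity-check whether staying on Lambda still beats an always-on cluster for your shape of workload.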

Jeremy: Right. Gotcha. I think that's really interesting, and I would actually love to know your thoughts on whether you even see this. I don't know if we have enough data yet to know, but there's this idea of using Lambda early on, in startups or even in projects within an enterprise, and having that flexibility and the low operational overhead and so forth, which I think is really great. But do you see, or do you think it will happen, that you get to a point where you say, "We've found some sort of stability point with this product, and we now need to move it over to something like Kubernetes, or a container management system, because overall it's going to end up being cheaper in the long run"?

Amy: What usually happens when you're making that transition from Lambda to ECS or Fargate, or eventually Kubernetes, is that your business logic or your infrastructure requirements have become so complex that Lambda can't do it cleanly anymore. You end up maxing out on either memory or CPU utilization, or ... Lambda also has a limit on how many concurrent invocations you can run, which some people have hit in real life.

Those are the times when it stops being the cheaper solution and stops being the target solution, because you can run your own FaaS environment on instances, and then you have a similar environment to what you were building, so you don't have to rebuild everything, but you don't incur that on-demand cost anymore. That's one path I've seen someone take, and that's usually the deciding factor: Lambda, with its old limits, couldn't hold it.

Now that you can bring your own container, so long as it fits within the requirements, you can pad that runway out a little bit and stretch out how long you have before you do a full conversion to an ECS environment. But that is usually how it goes: you overload it, or you have maybe 50 Lambdas trying to support one application, which is totally a thing you can do, but may not be the best, even with Step Functions, even with everything else. When that becomes too complex, you end up just going to containers anyway.

Jeremy: Right. I think that's interesting, and I think any company that grows to the point where it needs to start thinking about that next level of infrastructure, that's probably a good thing. It's a good point to start having those conversations.

All right, I've got just one more question for you, because I'm really interested. You mentioned what you do as a cloud economist, reading through people's bills and things like that. Now, I thought Corey just made this thing up. I didn't even know this job existed until Corey came out with it, and he probably coined the term. But in terms of that ...

Amy: That's what he tells people.

Jeremy: He does tell people that, right. I think he did, so I will definitely give him credit there. But in terms of that role, of being a cloud economist and having to look through people's bills and trying to find ways to save them money, it's pretty insane that we need people like you to do that, isn't it?

Amy: Yes, it's a bananas job. I cannot believe this is a job that I'm actually doing. It's also a lot of fun. But if you think about it, when I was starting out, everything was LAMP stack; that was the hot new tech when I started. The solution to all of those problems was, we're going to throw more hardware at it. Then the following question was, why are we spending so much on hardware?

Their solution to that problem was, we're going to buy real estate to store all of the hardware in. Now you don't have to do that, but you still have the mindset of, I'm going to solve this problem by throwing more hardware at it. That mindset is still alive and well, and you still end up with the same problem, except now you don't have the excuse that at least you own the facility the data is in, because you don't anymore.

Since you don't actually own the chassis and the blades and everything, you don't have to worry about disposing of them or holding on to stuff that you don't actually use anymore. A lot of my problems are: one of our services' costs has gone out of control, and we don't know why. Then I will tell you who is spending that money, and I will talk to that team to make sure they know it's happening, because sometimes they don't even know. Something got spun up in their account; maybe it was a testbed, maybe it was a demo, maybe they hired a vendor to load something into their environment, and those costs got out of control.

It's not like I'm going out trying to tell you that you did something wrong. It's like, this is where the problem is, let's go find out what happened. Forensic cloud bill person, I'm going to workshop that into a business card, because that sounds way better than the title that Corey uses.

Jeremy: Forensic cloud accountant or something like that.

Amy: Yes.

Jeremy: I think it's also interesting that the bills you get from AWS are a leading indicator of things that are potentially going wrong. It's interesting, because I don't know if people connect this. Maybe I'm underestimating people here, but seeing EC2 instance costs spiking, or seeing higher load or higher bandwidth or things like that, those can all be indicators of poorly written code, of bad or missing compression settings, all kinds of things that can jump out at you. Unless somebody is paying attention to those bills, I don't think most developers and most teams are going to see that.

Amy: Yeah. The only time they pay attention is when things start spiraling out of control, and ... Okay, this sounds like an intuitive response: the first thing people will do is go, "We're going to log everything, and we're going to find out where the problem is."

Jeremy: It'll cost you more money.

Amy: There is a threshold where CloudWatch becomes very expensive.

Jeremy: Right, absolutely.

Amy: Then they hit that threshold, and now their bill is four times as much.

Jeremy: Right.

Amy: A lot of the time it's misconfiguration. Very rarely does any product get to the point where it's built so poorly that it can barely hold itself up. That's never been the case. It's always been, this has been turned off. Or, AWS offers S3 analytics, but you have to turn it on per bucket, and that's not a policy that's usually written into anyone's AWS config. When they launch, they just launch without any analytics, so they don't know if a thing is supposed to be sending data to Glacier; if it's highly used data, there's no way to tell.
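A sketch of closing that particular hole: turning S3 storage-class analysis on for every bucket, since it is off by default and must be enabled per bucket. The boto3 call names are the real API; the configuration ID is made up, and credentials are assumed to be configured:

```python
# Sketch: enable S3 storage-class analysis on every bucket in an account.
# The configuration ID is illustrative.

def analytics_config(config_id="whole-bucket-analysis"):
    # Analyze the whole bucket (no filter); add a DataExport section under
    # StorageClassAnalysis if you want daily CSVs shipped to another bucket.
    return {"Id": config_id, "StorageClassAnalysis": {}}

def enable_everywhere():
    import boto3  # imported here so the sketch reads without boto3 installed
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        s3.put_bucket_analytics_configuration(
            Bucket=bucket["Name"],
            Id="whole-bucket-analysis",
            AnalyticsConfiguration=analytics_config(),
        )
```

With the analysis running, the access-pattern reports are what tell you whether a bucket's data is hot or a candidate for a Glacier lifecycle rule.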

It's trying to find little holes like that, where it seems like it shouldn't be a problem, but the minute it becomes a problem, it's because you spent $20,000.

Jeremy: Right. Yeah. No, you can spend money very, very fast in the cloud. I think that is a lesson learned by many, many people.

Amy: The difference between being on metal and throwing hardware at a problem and being on the cloud and throwing hardware at a problem is that you can throw hardware at a problem at scale on the cloud.

Jeremy: Exactly. Right. There's no stopping point like we used to have, where we'd have to go buy servers ...

Amy: No one will stop you.

Jeremy: No one will stop you. Just maybe the credit card company or whatever. Anyways, Amy, you are doing some amazing work with that, because I actually find that to be very, very fascinating. I think, in terms of what that can do, and the need for it, it's a fascinating field, and super interesting. Good for Corey for really digging into that and calling it out. Then again, for people like you who are willing to take that job, because that seems to me like poring through those numbers can't be the most interesting thing to do. But it must feel good when you do find a way to save somebody some money.

Amy: Spreadsheets can be interesting. Again, it's like everything else about my job. If I try to explain why it's interesting, I just make it sound more boring.

Jeremy: Awesome. All right. Well, let's leave it there. Amy, thank you again, for joining me, this was awesome. If people want to find out more about you, or maybe they have horribly large AWS cloud bills, and they want to check out the Duckbill Group, how do they do that?

Amy: Honestly, if you search for Corey Quinn, you can find the Duckbill Group real fast. If you want to go talk to me because I like doing community engagement, and I like doing talks, and I like roasting people on Twitter just about different stuff, you can hit me up on Twitter @nerdypaws. If you want to be a professional, I'm also on LinkedIn under Amy Codes.

Jeremy: All right, and then you also have a website, Amy-codes.com.

Amy: Amy-codes.com is the archive of all my talks. It's currently only showing the talks from last year because, for some reason, it somehow became very hard to find a speaking slot this past year. Who knew?

Jeremy: A lot of people doing talks. But anyways, all right, Amy, thank you again. Appreciate it.

Amy: Thank you. Had so much fun.

2021-05-24

Episode #101: How Serverless is Becoming More Extensible with Julian Wood

About Julian Wood

Julian Wood is a Senior Developer Advocate for the AWS Serverless Team. He loves helping developers and builders learn about, and love, how serverless technologies can transform the way they build and run applications at any scale. Julian was an infrastructure architect and manager in global enterprises and start-ups for more than 25 years before going all-in on serverless at AWS.

Twitter: @julian_wood
All things Serverless @ AWS: ServerlessLand
Serverless Patterns Collection
Serverless Office Hours - every Tuesday 10am PT
Lambda Extensions
Lambda Container Images

Watch this episode on YouTube: https://youtu.be/jtNLt3Y51-g

This episode sponsored by CBT Nuggets and Lumigo.

Transcript
Jeremy: Hi everyone, I'm Jeremy Daly and this is Serverless Chats. Today I'm joined by Julian Wood. Hey Julian, thanks for joining me.

Julian: Hey Jeremy, thank you so much for inviting me.

Jeremy: Well, I am super excited to have you here. I have been following your work for a very long time and of course, big fan of AWS. So you are a Serverless Developer Advocate at AWS, and I'd love it if you could just tell the listeners a little bit about your background, so they get to know you a bit. And then also, sort of what your role is at AWS.

Julian: Yeah, certainly. Well, I'm Julian Wood. I am based in London, but yeah, please don't let my accent fool you. I'm actually originally from South Africa, so the language purists aren't scratching their heads anymore. But yeah, I work within the Serverless Team at AWS, and hopefully do a number of things. First of all, explain what we're up to and how our serverless things work, and, as I like to sometimes say a bit cheekily, basically help the world fall in love with serverless as I have. And then on the other side, to be a proxy and the voice of builders, and developers, and whoever's building serverless applications, and carry their voices internally. So you can also keep us on our toes to help build the things that will brighten your days.

And just before, I've worked for too many years probably, as an infrastructure racker, stacker, architect, and manager. I've worked in global enterprises babysitting their Windows and Linux servers, and running virtualization, and doing all the operations kind of stuff to support that. But, I was always thinking there's a better way to do this and we weren't doing the best for the developers and internal customers. And so when this, you know in inverted commas, "serverless way" of things started to appear, I just knew that this was going to be the future. And I could happily leave the server side to much better and cleverer people than me. So by some weird, auspicious alignment of the stars, a while later, I managed to get my current dream job talking about serverless and talking to you.

Jeremy: Yeah. Well, I tell you, I think a lot of serverless people or people who love serverless are recovering ops and infrastructure people that were doing racking and stacking. Because I too am also recovering from that and I still have nightmares.

I thought that it was interesting too, how you mentioned though, developer advocacy. It's funny, you work for a specific company, AWS obviously, but even developer advocacy in general, who is that for? Who are you advocating for? Are you advocating for the developers to use the service from the company? Are you advocating for the developers so that the company can provide the services that they actually need? Interesting balance there.

Julian: Yeah, it's true. I mean, the honest answer is we don't have great terms for this kind of role, but yeah, I think primarily we are advocating for the people who are developing the applications and on the outside. And to advocate for them means we've got to build the right stuff for them and get their voices internally. And there are many ways of doing that. Some people raise support requests and other kind of things, but I mean, sometimes some of our great ideas come from trolling Twitter, or yes, I know even Hacker News or that kind of thing. But also, we may get responses from 10 different people about something and that will formulate something in our brain and we'll chat with other kind of people. And that sort of starts a thing. It's not just necessarily each time, some good idea in Twitter comes in, it gets mashed into some big surface database that we all pick off.

But part of our job is to be out there and try and think and be developers in whatever backgrounds we come from. And I mean, I'm not a pure software developer where I've come from, and I come, I suppose, from infrastructure, but maybe you'd call that a bit of systems engineering. So yeah, I try and bring that background to try and give input on whatever we do, hopefully, the right stuff.

Jeremy: Right. Yeah. And then I think part of the job too, is just getting the information out there and getting the examples out there. And trying to create those best practices or at least surface those best practices, and encourage the community to do a lot of that work and to follow that. And you've done a lot of work with that, obviously, writing for the AWS blog. I know you have a series on the Serverless Lens and the Well-Architected Framework, and we can talk about that in a little while. But I really want to talk to you about, I guess, just the expansion of serverless over the last couple of years.

I mean, it was very narrowly focused, probably, when it first came out. Lambda was ... FaaS as a whole new concept for a lot of people. And then as this progressed and we've gotten more APIs, and more services and things that it can integrate with, it just becomes complex and complicated. And that's a good thing, but also maybe a bad thing. But one of the things that AWS has done, and I think this is clearly in reaction to the developers needing it, is the ability to extend what you can do with a Lambda function, right? I mean, the idea of just putting your code in there and then, boom, that's it, that's all you have to do. That's great. But what if you do need access to lifecycle hooks? Or what if you do want to manipulate the underlying runtime or something like that? And AWS, I think has done a great job with that.

So maybe we can start there. So just about the extensibility of Lambda in general. And one of the new things that was launched recently was, and recently, I don't know what was it? Seven months ago at this point? I'm not even sure. But was launched fairly recently, let's say that, is Lambda Extensions, and a couple of different flavors of that as well. Could you kind of just give the users an over, the users, wow, the listeners an overview of what Lambda Extensions are?

Julian: I could hear the ops background coming in, talking about our users. Yeah. But I mean, from the get-go, serverless was always a terrible term because, why on earth would you name something for what it isn't? I mean, you know? I remember talking to DBAs about NoSQL, and they go, "Well, if it's not SQL, then what is it?" So we're terrible at that, serverless as well. And yeah, Lambda was very constrained when it came out. Lambda was never built to be a serverless thing; that's what the outcome was. Sometimes we focus too much on the tools rather than the outcome. And there's the story of S3 just turning 15. The genesis of Lambda was being an event trigger for S3, and people thought, you upload something to S3, fire off a Lambda function, how cool is that? And then obviously the clever clogs at the time were like, "Well, hang on, let's not just do this for S3, let's do this for a whole bunch of things."

So Lambda was born out of that, and it's got that great history, which has created an arc into the present and into the future, which I know we're also going to get onto: the power of event-driven applications. But the power of Lambda has always been its simplicity, and removing that operational burden and that heavy lifting. But sometimes that line is a bit of a gray area, and there are people who can be purists about serverless and purists about FaaS and say, "Everything needs to be ephemeral. Lambda functions can't extend to anything else. There shouldn't be any state, shouldn't be any storage, shouldn't be any ..." All this kind of thing.

And I think both of us can agree, but I don't want to speak for you, but I think both of us would agree that in some sense, yeah, that's fine. But we live in the real world and there's other stuff that needs to connect to and we're not here about building purist kind of stuff. So Lambda Extensions is a new way basically to integrate Lambda with your favorite tools. And that's the sort of headline thing we like to talk about. And the big idea is to open up Lambda to more effectively work mainly with partners, but also your own tools if you want to write them. And to sort of have deeper hooks into the Lambda lifecycle.

And our partners are awesome; they do a whole bunch of stuff for serverless, plus customers also have connections to on-prem stuff, or EC2 stuff, or containers, or all kinds of things. How can we make the tools more seamless? How can we have a common set of tools that maybe you even use on-prem or in the cloud or in containers or whatever? Why does Lambda have to be unique or different? And Extensions is one of the starts of that: being able to use these kinds of tools and get more out of Lambda. So, just the kinds of tools that we've already got on board, there's things like Splunk and AppDynamics. And Lumigo, Epsagon, HashiCorp, Honeycomb, Coralogix, Dynatrace, Thundra, Sumo Logic, Check Point. Yeah, I'm sorry, sorry for any partners I've forgotten.

Jeremy: No, right, no. That's very good. Shout them out, shout them out. No, I mean just, and not to interrupt you here, but ...

Julian: No, please.

Jeremy: ... I think that's great. I mean, I think that's one of the things that I like about the way that AWS deals with partners. I mean, I think AWS knows they can't solve all these problems on their own. I mean, maybe they could, right? But it would be their own way of solving the problems, and there are other people who are solving these problems differently. And giving you the ability to extend your Lambda functions into those partners is a huge win, not only for the partners, because it creates that ecosystem for them, but also for AWS, because it makes the product itself more valuable.

Julian: Well, never mind the big win for customers, because ultimately they're the ones who then get a common deployment tool, or a common observability tool, or HashiCorp Vault, so you can manage secrets in a Lambda function from HashiCorp Vault. I mean, that's super cool. I mean, also AWS services are picking this up, because it makes it easy for them to do stuff. So if anybody's used Lambda Insights, or even seen Lambda Insights in the console, it's somewhere in the monitoring thing, and you just click something over and you get this tool which can pull stuff that you can't normally get from a Lambda function. So things like CPU time and network throughput, which you couldn't normally get. But actually, under the hood, Lambda Insights is using Lambda Extensions. And you can see that if you look. It automatically adds the Lambda layer and job done.

So anyway, this is how a lot of the tools work: a layer is just added to a Lambda function and off you go, the tool can do its work. So there's also very much a simplicity angle on this, in that in a lot of cases you don't have to do anything. You configure some of the extensions via environment variables, if that's called for; you may just have an API key or a log retention value or something like that, I don't know, any kind of example of that. But you just configure that as a normal Lambda environment variable, and this partner extension, which is just a Lambda layer, and off you go. Super simple.
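
For example, wiring up a hypothetical partner extension in a SAM template might look something like this; the layer ARN, variable name, and API key value are placeholders, not a real partner's values:

```yaml
# Illustrative SAM snippet: a partner extension delivered as a Lambda layer,
# configured with a plain environment variable. All names here are made up.
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: app.handler
    Runtime: python3.9
    Layers:
      # the partner publishes this layer; you just reference its ARN
      - arn:aws:lambda:us-east-1:123456789012:layer:partner-extension:1
    Environment:
      Variables:
        PARTNER_API_KEY: "your-api-key-here"
```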

Jeremy: Right. So explain Extensions exactly, because I think that's one of those things, because now we have Lambda layers and we have Lambda Extensions. And there's also like the runtime API and then something else. I mean, even I'm not 100% sure what all of the naming conventions are. I'm pretty sure I know what they do ...

Julian: Yeah, fair enough.

Jeremy: ... but maybe we could say the names and say exactly what they do as well.

Julian: Yeah, cool. You get an API, I get an API, everybody gets an API. So Lambda layers, let's just start with that, because although it's not related to Extensions, it's how extensions are delivered to your functions. And Lambda layers is just another way to add code to a Lambda function, or not even code, it can be a dependency. And it's cool because they are shareable. So if you have some dependencies, or you have a library, or an SDK, or some training data for something, a Lambda layer just allows you to add some bits and bobs to your Lambda function. That's a horrible explanation. There's another word I was thinking of, because I don't want to use the word code, because it's not necessarily code, but it's dependency, whatever. It's just another way of adding something. I'll wake up in a cold sweat tonight thinking of the word I was thinking of, but anyway.

But Lambda Extensions introduces a whole new companion API. So the runtime API is the little bit of code that allows your function to talk to the Lambda service. So when an event comes in, this is from the outside, this could be via API Gateway or via the Lambda API, or where else, EventBridge or Step Functions or wherever, that then transports that data via the Lambda service as an HTTP call, and Lambda transposes that into an event and sends it on to the Lambda function. And it's that API that manages that. And just as a sidebar, what I find cool on a sort of geeky, technical level is that that API actually sits within the execution environment. People are like, "Oh, that's weird. Why would your Lambda runtime API sit within the execution environment, basically within the bubble that contains your function, rather than on the Lambda service?"

And the cool answer for that is it's actually a security mechanism. Your function can then only ever talk to the Lambda runtime API, which is in that secure execution environment. And so our security can be a lot stronger, because we know that no function code can ever talk directly out of your function into the Lambda service; it's all got to talk locally. And then the Lambda service gets that response from the runtime API and sends it back to the caller or whatever. Anyway, sidebar, thought that was nerdy and interesting. So what we've now done is we've released a new Extensions API. So the Extensions API is another API that an extension can use to get information from Lambda. And there are two different types of extensions, just briefly: internal and external extensions.
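
To make that request/response cycle concrete, here's a minimal sketch of the loop a runtime performs against the runtime API. The transport here is stubbed out with plain Python callables, and the event shape and handler are invented for illustration; in the real execution environment these are local HTTP calls.

```python
import json

# The real runtime API lives inside the execution environment at
# http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next (GET)
# and .../invocation/<request_id>/response (POST). Here the transport is
# stubbed out with callables so the loop itself can run anywhere.

def runtime_loop(get_next, post_response, handler, max_invocations=1):
    """Poll for events, invoke the handler, and post each result back."""
    for _ in range(max_invocations):
        request_id, event = get_next()     # blocks until an event arrives
        result = handler(event)            # run the customer's function code
        post_response(request_id, result)  # hand the result back to Lambda

# --- stubbed transport standing in for the local HTTP calls ---
responses = {}

def fake_get_next():
    return "req-1", {"body": json.dumps({"name": "world"})}

def fake_post_response(request_id, result):
    responses[request_id] = result

def handler(event):
    name = json.loads(event["body"])["name"]
    return {"statusCode": 200, "body": f"hello {name}"}

runtime_loop(fake_get_next, fake_post_response, handler)
print(responses["req-1"]["body"])  # → hello world
```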

Now, internal extensions run within the runtime process, so it's basically just another thread. So you can use this for Python or Java or something and say, when the Python runtime starts, let's start it with another parameter and also run this extra file that may do some observability, or logging, or tracing, or finding out how long the modules take to launch, for example. I know there's an example for Python. So that's one way of doing extensions: internal extensions. There are two different flavors, but I'll provide a link to the blog posts before we go too far down the rabbit hole on that.

And then the other type of extensions are external extensions. And this is a cool part, because they actually run as completely separate processes, but still within that secure bubble, that secure execution environment that Lambda runs in. And this gives you some superpowers if you want. Because first of all, an extension can run in any language, because it's a separate process. So if you've got a Node function, you could run an extension in another kind of language. Now, what we do recommend is you run your extension as a compiled binary, just because you've got to provide the runtime that the extension's got to run in anyway, so a compiled binary is super easy and super useful. So something like Go is what a lot of people are using, because you write a single extension in Go, and then you can use it on your Node functions, your Java functions, your PowerShell functions, whatever. So that's a really good, simple way that you can have the portability.

But now, what can these extensions do? Well, the extensions basically register with the Extensions API, and then they say to Lambda, "Lambda, I want to know what happens when my function invokes." So the extension can start up; maybe it's got some initialization code, maybe it needs to connect to a database, or log into an observability platform, or pull down a secret. That it can do; it's got its own init that can happen. And then it's basically ready to go before the function invokes. And the extension registers and says, "I want to know when the function invokes and when it shuts down. Cool." And that's just something that registers with the API. Then what happens is, when a function invoke comes in, Lambda tells the runtime API, "Hello, you now have an event," sends it off to the Lambda function, which the runtime manages, but also the extension, or extensions, multiple ones, hear information about that event. So it can tell you the time it's going to run, and it has some metadata about that event. So it doesn't have the actual event data itself, but it's like the Lambda context, a version of that, that it's going to send to the extension.

So the extension can use that to do various things. It can start collecting telemetry data. It can also instrument some of your code. It could be managing a secret as a separate process that it is going to cache in the background. For example, we've got one with AppConfig, which is really cool. AppConfig is a service where you manage parameters external to your Lambda function. Well, if each time your Lambda function warm invokes you've got to do an external API call to retrieve that, it's going to be a little bit inefficient. First of all, you're going to pay for it, and it's going to take some time.

So how about this: since the extension can run before the Lambda function, why don't we just cache that locally? And then when your Lambda function runs, it just makes a local HTTP call to the extension to retrieve that value, which is going to be super quick. And some extensions are super clever, because they're their own process. They will go, "Well, my value is valid for 30 minutes, and every 30 minutes, if I haven't been refreshed, I will then update the value from the source." So that's useful. Extensions can then also, when the runtime ... Sorry, let me back up.
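
A sketch of that caching pattern: the TTL, the fake clock, and the fetch function here are all illustrative stand-ins. A real extension would fetch from something like AppConfig over the network and serve the cached value to the function via a local endpoint.

```python
import time

class CachingExtension:
    """Illustrative sketch of an extension that caches an external config
    value and only re-fetches it after a TTL expires, so the function can
    read it with a cheap local call instead of a network hop."""

    def __init__(self, fetch, ttl_seconds=1800, clock=time.monotonic):
        self._fetch = fetch    # the expensive external call (e.g. AppConfig)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()  # refresh from the real source
            self._fetched_at = now
        return self._value               # served locally on warm invokes

# --- demo with a fake clock so the TTL behavior is visible ---
calls = []
fake_now = [0]
ext = CachingExtension(fetch=lambda: calls.append("fetch") or len(calls),
                       ttl_seconds=1800, clock=lambda: fake_now[0])

print(ext.get())   # first invoke: fetches → 1
print(ext.get())   # warm invoke inside the TTL: served from cache → 1
fake_now[0] = 1801
print(ext.get())   # TTL expired: fetches again → 2
```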

When the runtime is finished, it sends its response back to the runtime API, and extensions signal when they're done. So the runtime can send its response back and the extension can carry on processing, saying, "Oh, I've got the information about this. I know that this Lambda function has done X, Y, Z, so let me do some telemetry. Let me maybe, if I'm writing logs, write a log to S3 or to Kinesis or whatever. Do some kind of thing after the actual function invocation has happened." And then when it's ready, it says, "Hello, Extensions API, I'm telling you I'm done." And then it's gone. And then Lambda freezes the execution environment, including the runtime and the extensions, until another invocation happens. And then the cycle happens again.

And then the last little bit is, instead of an invoke coming in, we've extended the Lambda lifecycle, so when the environment is going to be shut down, the extension can receive the shutdown event and actually do some stuff and say, "Okay, well, I was connected to my observability platform over HTTP, so let me close that connection. I've got some extra logs to flush out. I've got whatever else I need to do," and just be able to cleanly shut down that extra process that is running in parallel to the Lambda function.
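
The register, invoke, shutdown flow described above can be sketched like this. This is a simulation, not the real Extensions API: the HTTP calls to register and fetch the next event are replaced with stub callables, and the event payloads are invented for illustration.

```python
# Hedged sketch of the external-extension lifecycle: register for INVOKE
# and SHUTDOWN, then loop on "next event". The real calls go to
# http://$AWS_LAMBDA_RUNTIME_API/2020-01-01/extension/...; here a canned
# list of events stands in for the Lambda service.

log = []

def run_extension(register, next_event, on_invoke, on_shutdown):
    register(events=["INVOKE", "SHUTDOWN"])  # what we want to hear about
    while True:
        event = next_event()   # blocks; environment is frozen between calls
        if event["eventType"] == "INVOKE":
            on_invoke(event)   # telemetry, caching, instrumentation, etc.
        elif event["eventType"] == "SHUTDOWN":
            on_shutdown(event) # flush logs, close connections
            break

# --- stand-ins for the HTTP calls ---
pending = [{"eventType": "INVOKE", "requestId": "req-1"},
           {"eventType": "INVOKE", "requestId": "req-2"},
           {"eventType": "SHUTDOWN", "shutdownReason": "spindown"}]

run_extension(
    register=lambda events: log.append(("register", tuple(events))),
    next_event=lambda: pending.pop(0),
    on_invoke=lambda e: log.append(("invoke", e["requestId"])),
    on_shutdown=lambda e: log.append(("shutdown", e["shutdownReason"])),
)
print(log)
```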

Jeremy: All right.

Julian: So that was a lot of words.

Jeremy: That was a lot and I bet you that would be great conversation for a dinner party. Really kicks things up. Now, the good news is that, first of all, thank you for that though. I mean, that's super technical and super in-depth. And for anyone listening who ...

Julian: You did ask, I did warn you.

Jeremy: ... kind of lost their way ... Yes, but something that is really important to remember is that you likely don't have to write these yourself, right? There are all those companies you mentioned earlier, all those partners, they've already done this work. They've already figured this out and they're providing you access to their tools via this, which allows you to build things.

Julian: Exactly.

Jeremy: So if you want to build an extension and you want to integrate your product with Lambda or so forth, then maybe go back and listen to this at half speed. But for those of you who just want to take advantage of it because of the great functionality, a lot of these companies have already done that for you.

Julian: Correct. And that's the sort of easiness thing, of just adding the Lambda layer or including it in a container image. And yeah, you don't have to worry about any of that, but behind the scenes, there's some really cool functionality where we're literally opening up how Lambda operates and allowing you to plug in around how a function responds.

Jeremy: All right. All right. So let me ask another, maybe an overly technical question. I have heard, and I haven't experienced this, but that when it runs the life cycle that ends the Lambda function, I've heard something like it doesn't send the information right away, right? You have to wait for that Lambda to expire or something like that?

Julian: Well, yes, for now, but that's about to change. So currently Extensions is actually in preview. And that's not because it's in beta or anything like that, but it's because we spoke to the partners and we didn't want to dump Extensions on the world, where all the partners had to come out with their extensions on day one and then try and figure out how customers are going to use them and everything. So what we did, which I think in this case worked out really well, is we worked with the partners and said, "Well, let's release this in preview mode and then give everybody a whole bunch of months to work out what the best use cases are, how can we best use this?" And some partners have said, "Oh, amazing. We're ready to go." And some partners have said, "Ah, it wasn't quite what we thought. Maybe we're going to wait a bit, or we're going to do something differently, or we've got some cool ideas, just give us time." And so that's what this time has been.

The one other thing that has happened is we've actually added some performance enhancements during the preview. So yes, currently during the preview, the runtime and all extensions need to finish before we give your response back from your Lambda function. So if you're in an asynchronous mode, you don't really care, but obviously if you're in a synchronous mode behind an API, yeah, you don't really want that. But when Extensions goes GA, which isn't going to be long, then that is no longer the case. So basically what'll happen is the runtime will respond and the result goes directly back to whoever's calling that, maybe API Gateway, and the extensions can carry on, partly asynchronously, in the background.

Jeremy: Yep. Awesome. All right. And I know that the plan is to go GA soon. I'm not sure when that will be relative to when this episode comes out, but soon, so that's good to know that that is ...

Julian: And in fact, when we go GA, that performance enhancement is part of the GA. So when it goes GA, then, you know, it's not something else you need to wait for.

Jeremy: Perfect. Okay. All right. So let's move on to another bit of, I don't know if this is extensibility of the actual product itself or more so I think extensibility of maybe the workflow that you use to deploy to Lambda and deploy your serverless applications, and that's container image support. I mean, we've discussed it a lot. I think people kind of have an idea, but just give me your quick overview of what that is to set some context here.

Julian: Yeah, sure. Well, container image support, as a simple sort of headline thing, is being able to build and package your functions as a container image. So you basically build a function using a Dockerfile. Before, you used a zip file, and a lot of people use Serverless Framework or SAM, or whatever, where that's all abstracted away from you, but it's actually creating a zip file and uploading it to Lambda or S3. So with container image support, you use a Dockerfile to build your Lambda function. That's the headline of what's happening.

Jeremy: Right. And so the idea of creating, and this is also, and again, you mentioned packaging, right? I mean, that is the big thing here. This is a packaging format. You're not actually running the container in a Lambda function.

Julian: Correct. Yeah, let's maybe think, because I mean, "containers," in inverted commas again for people who are on the audio, is ...

Jeremy: What does it even mean?

Julian: Yeah, exactly. It can be quite an overloaded term and definitely causes some confusion. And I sort of think there are maybe four things in the container world. One, containers as an isolation mechanism. So on Linux, this is cgroups, namespaces, seccomp, other bits and pieces that can be used to isolate processes or maybe groups of processes. And then the second one, containers as the packaging mechanism. This is what Docker really popularized, and this is about taking some code and the dependencies needed to run the code, and then packaging them all up together, maybe with some metadata to describe it.

And then, three is containers as a design philosophy. This is the idea that if we can package and isolate software, it's easier to run. Maybe smaller pieces of software are easier to reason about and manage independently. So I don't necessarily want to say microservices, but there's some component of that with it. And the emphasis here is on software rather than services, and standardized tooling to simplify your ops. And then the fourth thing is containers as an ecosystem. This is where all the products, tools, know-how, all the actual things of how to do containers are. And I mean, these are certainly useful, but I wouldn't say they're anything without the other kinds of things.

What is cool and worth appreciating is how independent these things maybe are. So when I spoke about containers as isolation, well, we could actually replace containers as isolation with micro VMs, such as we do with Firecracker, and there's no real change in the operational properties. So if we think, what are we doing with containers and why? Well, that one is in a way ticked off with Lambda; Lambda does have secure isolation. And containers as a packaging format, I mean, you could replace it with static linking, and there maybe wouldn't really be a change, but there's less convenience. And the design philosophy, that could really be applicable whether we're talking microservices on instances, containers, or certainly functions; it's all the same kind of thing.

So if we talk about the packaging of Lambda functions, it's really for people who are more familiar with containers. Why does Lambda have to be different? Why does Lambda have to be a snowflake that you have to manage differently? And if you are packaging dependencies, and you're doing npm or pip install, and you're used to building Dockerfiles, well, why can't we just do all the same things for Lambda? And we've got some other things that come with that: larger function sizes, up to 10 gig, which is enabled with some of this technology. So it's a packaging format, but on the backend there's a whole bunch of different stuff which has to be done to allow this. Benefits are: use your tooling. You've got your CI/CD pipelines already for containers, well, you can use that.

Jeremy: Yeah, yeah. And I actually like that idea too. And when I first heard of it, I was like, I have nothing against containers, containers are great. But when I was thinking about it, I'm like, "Wait, containers? No, what's happening here? We're losing something." But I will say, when Lambda layers came out, which was I think maybe 2019 or something like that, maybe 2018, the idea of it made a lot of sense, being able to kind of supplement, add additional dependencies or code or whatever. But it always just seemed awkward. And some of the publishing for it was a little bit awkward. The versioning used like a numbered versioning instead of semantic versioning and things like that. And then you had to share it to multiple places, and if you published it as a SAR app, then you got global distri ... Anyways, it was a little bit hard to use.

And so when you're trying to package large dependencies and put those in a layer and then combine them with a Lambda function, the other problem you had was you still had a maximum size that you could use when those were combined. So I like this idea of saying, "Look, I'd like to just kind of create this little isolate," like you said, "put my dependencies in there." Whether that's PyTorch or some other thing that is a big dependency that maybe I don't want to install directly in a Lambda layer, or I don't want to do directly in my Lambda function. But you package that all together and then that whole process just is a lot easier. And then you can actually run those containers, you could run those locally and test those if you wanted to.

Julian: Correct. So that's also one of the sort of superpowers of this. And that's what I was talking about with just being able to package them up; well, that now enables a whole bunch of extra stuff. So yes, first of all, you can then use those container images that you've created for your local testing. And I know it's silly for anyone to poo poo local testing. And we do like to say, "Well, bring your testing to the cloud rather than bringing the cloud to your testing." But testing locally for unit tests is super great. It's going to be super fast. You can iterate on your Lambda functions, but we don't want to be mocking all of DynamoDB, or building harebrained S3 mocks locally.

But the cool thing is the same Dockerfile that you're going to use to build your function for Lambda can be the same Dockerfile you use to run it locally. And it is literally exactly the same Lambda function that's going to run. And yes, that may be locally, but, with a bit of a stretch, you could also run those Lambda functions elsewhere. So even if you need to run it on EC2 instances or ECS or Fargate or some kind of thing, this gives you a lot more opportunities to be able to use the same Lambda function, maybe in different ways, shapes or forms, even if it is on-prem. Now, obviously you can't recreate all of Lambda, because that's connected to IAM and it's got huge availability, and scalability, and latency and all those kinds of things, but you can actually run a Lambda function in a lot more places.

Jeremy: Yeah. Which is interesting. And then the other thing I had mentioned earlier was the size. So now the size of these containers, or these packages, can be much, much bigger.

Julian: Yeah, up to 10 gig. So the serverless purists in the back are shouting, "What about cold starts? What about cold starts?"

Jeremy: That was my next question, yes.

Julian: Yeah. I mean, zip function archives are also still available, nothing changes with that. Lambda layers, which many people use and love, that's all available. This isn't a replacement, it's just a new way of doing it. So now we've got Lambda functions that can be up to 10 gig in size and surely, surely that's got to be insane for cold starts. But actually, part of what I was talking about earlier, of some of the work we've done on the backend to support this, is to be able to support these super large package sizes. And the high-level thing is that we actually cache those things really close to where the Lambda function is going to be run.

Now, in the Docker ecosystem, you build your Dockerfiles based on base images, and for Lambda this needs to be Linux. One of the super things with container image support is you don't have to use Amazon Linux or Amazon Linux 2 for Lambda functions; you can actually now build your Lambda functions also on Ubuntu, Debian or Alpine or whatever else. And so that also gives you a lot more functionality and flexibility. You can use the same Linux distribution, maybe across your entire suite, be it on-prem or anywhere else.

Jeremy: Right. Right.

Julian: And there are two little components. There's a runtime interface client, which you install; it's just another Docker layer. And that's the runtime API shim that talks to the runtime API. And then there's a runtime interface emulator, and that's the thing that pretends to be Lambda, so you can shunt those events between HTTP and JSON. And that's the thing you would use to run locally. So the runtime interface client means you can use any Linux distribution, add the runtime interface client, and you're compatible with Lambda; and then the runtime interface emulator is what you would use for local testing, or if you want to spread your wings and run your Lambda functions elsewhere.
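
As a rough sketch of what such a Dockerfile might look like on a non-Amazon Linux base image: `awslambdaric` is AWS's published Python runtime interface client, while the Python version, file names, and handler name are placeholder choices, not prescriptions.

```dockerfile
# Illustrative sketch: packaging a Python handler on a plain Debian-based
# image. app.py / app.handler are made-up names for this example.
FROM python:3.9-slim-buster

WORKDIR /function
COPY requirements.txt .
RUN pip install -r requirements.txt awslambdaric --target /function

COPY app.py .

# The runtime interface client shims between Lambda's runtime API
# and your handler, so any Linux base image becomes Lambda-compatible
ENTRYPOINT ["/usr/local/bin/python", "-m", "awslambdaric"]
CMD ["app.handler"]
```

For local testing, the runtime interface emulator is what lets you invoke the same image over HTTP on your own machine before pushing it anywhere.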

Jeremy: Right. Awesome. Okay. So the other thing I think container support does is open up a broader set of, or I guess a larger audience of, people who are familiar with containerization and how that works, bringing those people to Lambda functions. And one of the things that you really don't get when you run a container, I guess, on EC2, or, not EC2, I'm sorry, ECS, or Fargate or something like that, without kind of adding another layer on top of it, is the eventing aspect of it. I mean, Lambda is just naturally an event-driven compute layer, right? And so, eventing and this idea of event-driven applications and so forth has just become much more popular and I think much more mainstream. So what are your thoughts? What are you seeing in terms of, especially working with so many customers and businesses that are using this now, how are you seeing this sort of evolution or adoption of event-driven applications?

Julian: Yeah. I mean, it's quite funny to think that actually event-driven applications were the genesis of Lambda rather than it being serverless. I mentioned earlier about starting with S3. Yeah, the whole crux of Lambda has been, I respond to an event from API Gateway, or something on SQS, or via the API, or anything. And so the whole point, in a way, of Lambda has been this event-driven computing, which I think people are starting to understand as a bigger thing than, "Oh, this is just the way you have to do Lambda." Because I do think that serverless has a unique challenge, where there is a new conceptual learning maybe that you have to go through. And one thing that holds back serverless development is people are used to client-server and maybe ports and sockets. And even if you're doing containers or on-prem, or EC2, you're talking IP addresses and load balancers, and sockets and firewalls, and all this kind of thing.

But ultimately, when we're building these applications that are going to be composed of multiple services talking together using APIs and events, the events are actually going to be a super important part of it. And I know he is, not for so much longer, my ultimate boss, but I can blame Jeff Bezos just a little bit, because he did say that, "If you want to talk via anything, talk via an API." And he was 100% right and that was great. But now we're sort of evolving that: it doesn't just have to be an API, and it doesn't have to be something behind API Gateway or some API that you can run. And you can use the sort of power of events, particularly in an asynchronous model, to not just be "forced," again in inverted commas, to use APIs, but have far more flexibility in how data and information is going to flow through, maybe not just your application, but your suite of applications, or to and from your partners, or wherever that is.

And ultimately applications are going to be distributed, and maybe that is connecting to partners, that could be SaaS partners, or it's going to be an on-prem component, or maybe things in other kinds of places. And those things need to communicate. And so this way of thinking about events is a super powerful way of thinking about that.

Jeremy: Right. And it's not necessarily new. I mean, we've been doing web hooks for quite some time. And that idea that something is going to happen somewhere and I want to be notified of it is, again, not a new concept. But I think certainly the way that it's evolved with Lambda, and the way that other FaaS products have done eventing and things like that, with those tight integrations and all of the, I guess, connective tissue that runs between those things to make sure that the events get delivered, and that you can DLQ them, and do all these other things with retries and stuff like that, is pretty powerful.

I know you have, I actually just mentioned this on the last episode, one of my favorite books, I think, that changed my thinking and really got me thinking about how microservices communicate with one another. And that was Building Microservices by Sam Newman, which I actually said was sort of like my Bible for a couple of years. Yes, I used that. So what are some of the others? I know you have a favorite book on this.

Julian: Well, that Building Microservices, Sam Newman, and I think there's a part two. I think it's part two, or there's another one ...

Jeremy: Hopefully.

Julian: ... in the works. I think even on O'Reilly's website, you can go and see some preview copies of it. I actually haven't seen that. But yeah, I mean, that is a great kind of Bible to be talking about. And sometimes we do conflate this microservices thing with a whole bunch of stuff, but if you are talking events, you're talking about separating things. But yeah, the book recommendation I have is one called Flow Architectures by James Urquhart. And James Urquhart actually works with VMware, but he's written this book which is looking sort of at the current state, and also looking into the future, of how information flows through our applications and between companies and all this kind of thing.

And he goes into some of the technology. When we talk about flow, we are talking about streams and we're talking about events. So let's maybe put some AWS words around it: streams would be something like Kinesis, and events would be something like EventBridge, and topics would be SNS, and queues would be SQS. And I know we've got all these things and I wish some clever person would create the one flow service to rule them all, but we're not there. And they've got different properties which are helpful for different things, and I know confusingly some of them merge. But James' sort of big idea is, in the future we are going to be moving data around between businesses, between applications. So how can we think of that as a flow? And what does that mean for designing applications and how we handle that?

And Lambda is part of it, but even more nicely, I think, are some of the native integrations where you don't have to have a Lambda function. So if you've got API Gateway talking to Step Functions directly, for example, well, that's even better. I mean, you don't have any code to manage, and if it's any code that I've written, you probably don't want to manage it. So yeah, I mean, this idea of flow: Lambda's great for doing some of this moving around, but we are even evolving to be able to flow data around our applications without having to do anything, and just wire up some things in a console or in a terminal.

Jeremy: Right. Well, so you mentioned someone could build the ultimate sort of flow control system or whatever. I mean, I honestly think EventBridge is very close to that. And I actually had Mike Deck on the show, I think it was like episode five, so two years ago, whenever it was, when EventBridge came out. And we were talking and I sort of made the joke, like, so this is like serverless web hooks, essentially. Because there were the partner integrations where partners could push events onto an event bus, which they still can do. But this has evolved, right? Because the issue was always sort of like, I would have to subscribe to web hooks, I'd have to build a web hook to get events from a particular company. Which was great, always worked fine, but you're still maintaining that infrastructure.

So EventBridge comes along, it creates these partner integrations, and now a partner can just push an event onto the bus, and your applications, whether it's a Lambda function or other services, you can push the events to an SQS queue, you can push them into a Kinesis stream, all these different destinations. You can go ahead and pull that data in and that's just there. So you don't have to worry about maintaining that infrastructure. And then the EventBridge team went ahead and released the destinations API, I think it's called.

Julian: Yeah, API destinations.

Jeremy: EventBridge API destinations, right, where now you can set up these integrations with other companies, so you don't even have to make the API call yourself anymore, but instead you get all of the retries, you get the throttling, you get all that stuff kind of built in. So I mean, it's just really, really interesting where this is going. And actually, I mean, if you want to take a second to tell people about EventBridge API destinations and what they can do, because I think that now sort of creates both sides of that equation for you.

Julian: It does. And I was just thinking over there, you've done a 10 times better job at explaining API destinations than I have, so you've nailed it on the head. And the fact is, it's kind of simple. Events land up in your EventBridge and you can just pump events to any arbitrary endpoint. So it doesn't have to be in AWS, it can be on-prem. It can be to your Raspberry Pi, it can literally be anywhere. But it's not just about pumping the events over there because, okay, how do we handle failover? And how do we handle throttling? And so this is part of the extra cool goodies that came with API destinations, is that you can, for instance, if you are sending events to some external API and you're only licensed for 1,000 invocations, not invocations, that could be too Lambda-ish, but 1,000 hits on the API every minute.

Jeremy: Quotas. I think we call them quotas.

Julian: Quotas, something like that. That's a much better term. Thank you, Jeremy. With some sort of quota, well, you can just apply that in API destinations and it'll basically store the data in the meantime in EventBridge and fire that off to the API destination at the right rate. If the API destination is throttling, or if the API destination is down, well, it's going to be able to do some exponential back-off and calm down a little bit, so it doesn't over-flood this external API. And then eventually when the API does come back, it will be able to send those events. So that really gives you excellent power: rather than maintaining all these individual API endpoints yourself, you're no longer handling the availability of the endpoint API, just whatever your code is that needs to talk to that destination.
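
The retry behavior Julian describes, buffering events and backing off exponentially when the endpoint is throttling or down, can be sketched roughly as below. This is a toy model for illustration only; the base delay, cap, and jitter strategy are assumptions, not EventBridge's actual internal retry schedule.

```python
import random

def backoff_schedule(max_retries=5, base=1.0, cap=60.0, seed=0):
    """Illustrative capped exponential back-off with full jitter,
    similar in spirit to what EventBridge applies before redelivering
    to an API destination that is throttling or unavailable."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):
        # 1, 2, 4, 8, 16 ... seconds, capped at `cap`
        ceiling = min(cap, base * (2 ** attempt))
        # full jitter: pick a random delay up to the ceiling so
        # retries from many senders don't arrive in lockstep
        delays.append(rng.uniform(0, ceiling))
    return delays

delays = backoff_schedule()
# each delay is bounded by the capped exponential for its attempt
```

The key property is that the ceiling doubles per attempt while jitter spreads retries out, which is why the external API doesn't get flooded the moment it recovers.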

Jeremy: Right. And I don't want to oversell this to anybody, but that also ...

Julian: No, keep going. Keep going.

Jeremy: ... adds the capability of enhanced security, because you're not exposing those API keys to your developers or anybody else, they're all baked in and stored within API destinations, within EventBridge. You have the ability, you mentioned this idea of not needing Lambda, to maybe talk directly, API Gateway to DynamoDB or to Step Functions or something like that. I mean, the cool thing about this is you do have translation capabilities, or transformation capabilities, in EventBridge where you can transform the event. I haven't tried this, but I'm assuming it's possible to say, get an event from Salesforce and then pipe it into Stripe or some other API that you might want to pipe it into.
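
The transformation Jeremy is describing is what EventBridge calls an input transformer: named paths are pulled out of the event and substituted into a template. The sketch below is a simplified stand-in to show the idea, not the service's actual implementation (real input transformers support richer JSONPath expressions and escaping rules), and the Salesforce-style field name is a made-up example.

```python
def transform_event(event, paths, template):
    """Simplified model of an EventBridge input transformer:
    `paths` maps variable names to dotted paths into the event dict,
    and `template` references them as <name> placeholders."""
    def lookup(obj, dotted):
        # walk the event dict one key at a time
        for key in dotted.split("."):
            obj = obj[key]
        return obj

    out = template
    for name, dotted in paths.items():
        out = out.replace(f"<{name}>", str(lookup(event, dotted)))
    return out

# hypothetical Salesforce-shaped event being reshaped for another API
salesforce_event = {"detail": {"payload": {"Email__c": "jane@example.com"}}}
paths = {"email": "detail.payload.Email__c"}
template = '{"customer_email": "<email>"}'
print(transform_event(salesforce_event, paths, template))
# {"customer_email": "jane@example.com"}
```

The point is that the reshaping happens declaratively on the bus itself, so no Lambda function is needed just to translate one provider's event shape into another's.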

So I mean, just that idea of having that centralized bus that can communicate with all these different things. I mean, we're talking about distributed systems here, right? So why is it different sending an event from my microservice A to my microservice B? Why can't I send it from my microservice A to company Y's microservice B or whatever? And being able to do that in a secure, reliable way, with all of that stuff kind of built in for you, I think it's amazing. So I love EventBridge. To me EventBridge is one of those services that rivals Lambda. It's, I guess, as important as Lambda is in this whole serverless equation.

Julian: Absolutely, Jeremy. I mean, I'm just sitting here. I don't actually have to say anything. This is a brilliant interview and Jeremy, you're the expert. And you're just like laying down all of the excellent use cases. And exactly it. I mean, I like to think we've got sort of three interlinked services which do three different things, but are awesome. Lambda, we love if you need to do some processing or you need to do something that's literally your business logic. You've got EventBridge that can route data in and out of SaaS partners to any other kind of API. And then you've got Step Functions that can do some coordination. And they all work together, but you've got three different things that really have sort of superpowers in terms of the amount of stuff you can do with it. And yes, start with them. If you land up bumping up against anything that doesn't work, well, first of all, get in touch with me, I'll work on that.

But then you can maybe start thinking about, is it containers or EC2, or that kind of thing? But using literally just Lambda, Step Functions and EventBridge, okay. Yes, maybe you're going to need some queues, topics and APIs, and that kind of thing. But ...

Jeremy: I was just going to say, add DynamoDB in there for some permanent state or for some data persistence. Right? Yeah. But other than that, no, I think you nailed it. Honestly, sometimes you're starting to build applications and yeah, you're right. You maybe need a queue here and there and things like that. But for the most part, no, I mean, you could build a lot with those three or four services.

Julian: Yeah. Well, I mean, even think of it what you used to do before with API destinations. Maybe you drop something on a queue, you'd have Lambda pull that from a queue. You have Lambda concurrency, which would be set to five per second to then send that to an external API. If it failed going to that API, well, you've got to then dump it to Lambda destinations or to another SQS queue. You then got something ... You know, I'm going down the rabbit hole, or just put it on EventBridge ...

Jeremy: You just have it magically happen.

Julian: ... or we talk about removing serverless infrastructure, not normal infrastructure, and just removing even the serverless bits, which is great.

Jeremy: Yeah, no. I think that's amazing. So we talked about a couple of these different services, and we talked about packaging formats and we talked about event driven applications, and all these other things. And a lot of this stuff, even though some of it may be familiar and you could probably equate it or relate it to things that developers might already know, there is still a lot of new stuff here. And I think, my biggest complaint about serverless was not about the capabilities of it, it was basically the education and the ability to get people to adopt it and understand the power behind it. So let's talk about that a little bit because ... What's that?

Julian: It sounds like my job description, perfectly.

Jeremy: Right. So there we go. Right, that's what you're supposed to be doing, Julian. Why aren't you doing it? No, but you are doing it. You are doing it. No, and that's why I want to talk to you about it. So you have that series on the Well-Architected Framework and we can talk about that. There's a whole bunch of really good resources on this. Obviously, you're doing videos and conferences, well, you used to be doing conferences. I think you probably still do some of those virtual ones, right? Which are not the same thing.

Julian: Not quite, no.

Jeremy: I mean, it was fun seeing you in Cardiff and where else were you?

Julian: Yeah, Belfast.

Jeremy: Cardiff and Northern Ireland.

Julian: Yeah, exactly.

Jeremy: Yeah, we were all over the place together.

Julian: With the Guinness and all of us. It was brilliant.

Jeremy: Right. So tell me a little bit about, sort of, the education process that you're trying to do. Or maybe even where you sort of see the state of Serverless education now, and just sort of where it's evolved, where we're getting best practices from, what's out there for people. And that's a really long question, but I don't know, maybe you can distill that down to something usable.

Julian: No, that's quite right. I'm thinking back to my extensions explanation, which is a really long answer. So we're doing really long stuff, but that's fine. But I like to also bring this back to also thinking about the people aspect of IT. And we talk a lot about the technology and Lambda is amazing and S3 is amazing and all those kinds of things. But ultimately it is still sort of people lashing together these services and building the serverless applications, and deciding what you even need to do. And so the education is very much tied with, of course, having the products and features that do lots of kinds of things. And Serverless, there's always this lever, I suppose, between simplicity and functionality. And we are adding lots of knobs and levers and everything to Lambda to make it more feature-rich, but we've got to try and keep it simple at the same time.

So there is sort of that trade-off, and of course with that, that obviously means not just the education side, but education about Lambda and serverless, but generally, how do I build applications? What do I do? And so you did mention the Well-Architected Framework. And so for people who don't know, this came out in 2015, and in 2017, there was a Serverless Lens which was added to it; what is basically serverless specific information for Well-Architected. And Well-Architected means bringing best practices to serverless applications. If you're building prod applications in the cloud, you're normally looking to build and operate them following best practices. And this is useful stuff throughout the software life cycle, it's not just at the end to tick a few boxes and go, "Yes, we've done that." So start early with the well-architected journey, it'll help you.

And it just sort of answers the question, am I well architected? And I mean, that is a bit of a fuzzy question. But the idea is to give you more confidence in the architecture and operations of your workloads, and that's not a goal in itself, but it's to reduce and minimize the impact of any issues that can happen. So what we do is we try and distill some of our questions and thoughts on how you could do things, and we built that into the Well-Architected Framework. And so the Serverless Lens has a few questions on each of the pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. Excellent. I knew I was going to forget one of them and I didn't. So yeah, these are things like, how do you control access to an API? How do you do lifecycle management? How do you build resiliency into your application? All these kinds of things.

And so in the Well-Architected Framework with the Serverless Lens, there's a whole bunch of guidance to help you do that. And I have been slowly writing a blog series to literally cover all of the questions; there are nine questions in the Well-Architected Serverless Lens. And I'm about halfway through, and I had to pause because we have this little conference called re:Invent, which requires one or two slides to be created. But yeah, I'm desperately keen to pick that up again. And yeah, that's just providing some more opinionated stuff, because the documentation is awesome and it's very in-depth and it's great when you need all that kind of stuff. But sometimes you want to know, well, okay, just tell me what to do, or what do you think is best, rather than these are the seven different options.

Jeremy: Just tell me what to do.

Julian: Yeah.

Jeremy: I think that's a common question.

Julian: Exactly. And I'll launch off from that to mention my colleague, James Beswick, he writes one or two things on serverless ...

Jeremy: Yeah, I mean, every once in a while you see something from it. Yeah.

Julian: ... every day. The Beswick bot machine of serverless. He's amazing. James, he's so knowledgeable and writes like a machine. He's brilliant. Yeah, I'm lucky to be on his team. So when you talk about education, I learn from him. But anyway, in a roundabout way, he's created this blog series, and other series, called the Lambda Operations Guide. And this is literally a whole in-depth study on how to operate Lambda. And it goes into a whole bunch of things; it's sort of linked to the Serverless Lens because there's a lot of common stuff, but it's also a great read if you are more nerdily interested in Lambda than just firing off a function, just to read through it. It's written in an accessible way. And it has got a whole bunch of information on how to operate Lambda and some of the stuff behind the scenes, how it works, just so you can understand it better.

Jeremy: Right. Right. Yeah. And I think you mentioned this idea of confidence too. And I can tell you right now I've been writing serverless applications, well, let's see, what year is it? 2021. So I started in 2015, writing or building applications with Lambda. So I've been doing this for a while and I still get to a point every once in a while, where I'm trying to put something in CloudFormation or I'm using the Serverless Framework or whatever, and you're trying to configure something and you think about, well, wait, how do I want to do this? Or is this the right way to do it? And you just have that moment where you're like, well, let me just search and see what other people are doing. And there are a lot of myths about serverless.

There's a lot of good information out there, but there's a lot of bad information out there too. And that's something that is kind of hard to combat, but I think that maybe we could end it there. What are some of the things, the questions people are having, maybe some of the myths, maybe some of the concerns, what are those top ones that you think you could sort of ...

Julian: Dispel.

Jeremy: ... to tell people, dispel, yeah. That you could say, "Look, these are these aren't things to worry about. And again, go and read your blog post series, go and read James' blog post series, and you're going to get the right answers to these things."

Julian: Yeah. I mean, there are misconceptions and some of them are just historical, where people think Lambda functions can only run for five minutes; they can now run for 15 minutes. Lambda functions can also now run with up to 10 gig of RAM. Before re:Invent it was only 3 gig of RAM. That's more than a three times increase in memory, with a proportional increase in CPU. So I like to say, if you had a CPU-intensive job that took 40 minutes and you couldn't run it on Lambda, you've now got three times the CPU. Maybe you can run it on Lambda now, because that would work. So yeah, some of those historical things have just changed. We've got EFS for Lambda, which addresses the old idea that you can't do state with Lambda. EFS and NFS isn't everybody's cup of tea, but that's certainly going to help some people out.
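
Julian's back-of-the-envelope reasoning works because Lambda allocates CPU roughly in proportion to the memory you configure. A minimal sketch of that arithmetic, assuming perfectly linear scaling (real CPU-bound workloads rarely scale this cleanly):

```python
def estimated_runtime_minutes(baseline_minutes, baseline_mb, new_mb):
    """Rough estimate for a CPU-bound job on Lambda: CPU scales
    roughly linearly with configured memory, so runtime scales
    inversely. A simplification, not a guarantee."""
    return baseline_minutes * (baseline_mb / new_mb)

# A CPU-bound job that took ~40 minutes at the old 3,008 MB ceiling,
# re-estimated at the 10,240 MB (10 GB) ceiling:
t = estimated_runtime_minutes(40, 3008, 10240)
# ~11.8 minutes -- now under Lambda's 15-minute limit
```

That's the whole argument in two lines: roughly three times the CPU turns a 40-minute job into a roughly 12-minute job, which fits inside the 15-minute execution limit.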

And then the other big one is also cold starts. And this is an interesting one because, obviously, we've solved the cold start issue with connecting Lambda functions to a VPC, so that's no longer an issue. And that had been a barrier for lots of people, for good reason, and that's now no longer the case. But the other thing about cold starts is interesting because people do still get caught up on them, particularly during development, because they create a Lambda function, they run it, that's a cold start, then they update it and they run it and go, oh, that's a cold start. And they don't sort of grok that the more you run your Lambda function, the fewer cold starts you have, just because they're warm starts. It's really only the Lambda functions that start running at exactly the same time that will have a cold start, and then every subsequent Lambda function invocation for quite a while will be using a warm function.
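
The execution-environment reuse Julian describes can be modeled with a simple warm pool: an invocation is only a cold start when there aren't enough idle warm environments to serve it. This is a toy model for intuition; real Lambda also expires idle environments over time, which the sketch ignores.

```python
def simulate(concurrent_batches):
    """Toy model of Lambda environment reuse. Each entry in
    `concurrent_batches` is a number of simultaneous invocations;
    cold starts happen only when the batch exceeds the warm pool."""
    warm = 0
    cold_starts = 0
    total = 0
    for batch in concurrent_batches:
        total += batch
        if batch > warm:
            cold_starts += batch - warm   # new environments spun up
            warm = batch                  # pool grows to the new peak
    return cold_starts, total

cold, total = simulate([5, 5, 8, 8, 8, 3, 8])
# 8 cold starts out of 45 invocations: only the invocations that
# pushed concurrency to a new peak paid the cold-start penalty
```

This is why the cold-start percentage falls as traffic ramps up and repeats: steady or repeated load is served almost entirely from warm environments.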

And so as usage ramps up, we see cold starts fall to a small percentage of the invocations that actually happen. And when we're talking again about the container image support, that's got a whole bunch of complexity which people are trying to understand. Hopefully, people are learning from this podcast about that as well. But also with cold starts there, container images can be huge, and there are particular ways that you can construct your Lambda functions to really reduce those cold starts, and it's best practice anyway. But yeah, cold starts is also definitely one of those myths. And the other one ...

Jeremy: Well, one note on cold starts too, just as something that I find to be interesting. I know that we, I even had to spend time battling with that earlier on, especially with VPC cold starts; that's all sort of gone away now, so much more efficient. The other thing is provisioned concurrency. If you're using provisioned concurrency to get your cold starts down, I'm not even sure that's the right use for provisioned concurrency. I think provisioned concurrency is more just to make sure you have enough capacity because of the ramp-up time for Lambda. You certainly can use it for cold starts, but I don't think you need to; that's just my two cents on that.

Julian: Yeah. No, that is true. And they're two different use cases for the same kind of thing. Yeah. As you say, Lambda is pretty scalable, but there is a bit of a ramp-up to get up to many, many, many, many thousands or tens of thousands of concurrent executions. And so yeah, using provisioned concurrency, you can get that up in advance. And yeah, some people do also use provisioned concurrency for getting those cold starts down. And that is another very valid use case, but it's really only an issue for synchronous workloads. Anything that is asynchronous, you really shouldn't be caring too much about cold starts, other than from a cost perspective, because it's going to take longer to run.

Jeremy: Sure. Sure. I have a feeling that the last one you were going to mention, because this one bugs me quite a bit, is this idea of no ops or some people call it ops-less, which I think is kind of funny. But that's one of those things where, oh, it drives me nuts when I hear this.

Julian: Yeah, exactly. And it's a frustrating thing. And I think often, sometimes when people are talking about no ops, they either have something to sell you. And sometimes what they're selling you is getting rid of something, which is never the case. It's not as though we develop serverless applications and we can then get rid of half of our development team, it just doesn't work like that. And it's crazy, in fact. And when I was talking about the people aspect of IT, this is a super important thing. And me coming from an infrastructure background, everybody is dying in their jobs to do more meaningful work and to do more interesting things and have the agility to try those experiments or try something else. Or do something that's better, or even improve the way you build or improve the way your CI/CD pipeline runs or anything, rather than just having to do a lot of work in the lower levels.

And this is what serverless really helps you do, is to be able to, we'll take over a whole lot of the ops for you, but it's not all of the ops, because in a way there's never an end to ops. Because you can always do stuff better. And it's not just the operations of deploying Lambda functions and limits and all that kind of thing. But I mean, think of observability and not knowing just about your application, but knowing about your business. Think of if you had the time that you weren't just monitoring function invocations and monitoring how long things were happening, but imagine if you were able to pull together dashboards of exactly what each transaction costs as it flows through your whole entire application. Think of the benefit of that to your business, or think of the benefit that in real-time, even if it's on Lambda function usage or something, you can say, "Well, oh, there's an immediate drop-off or pick-up in one region in the world or one particular application." You can spot that immediately. That kind of stuff, you just haven't had time to play with to actually build.

But if we can take over some of the operational stuff with you and run one or two or trillions of Lambda functions in the background, just to keep this all ticking along nicely, you're always going to have an opportunity to do more ops. But I think the exciting bit is that ops is not just IT infrastructure, plumbing ops, but you can start even doing even better business ops where you can have more business visibility and more cool stuff for your business because we're not writing apps just for funsies.

Jeremy: Right. Right. And I think that's probably maybe a good way to describe serverless, is it allows you to focus on more meaningful work and more meaningful tasks maybe. Or maybe not more meaningful, but more impactful on the business. Anyways, Julian, listen, this was a great conversation. I appreciate it. I appreciate the work that you're doing over at AWS ...

Julian: Thank you.

Jeremy: ... and the stuff that you're doing. And I hope that there will be a conference soon that we will be able to attend together ...

Julian: I hope so too.

Jeremy: ... maybe grab a drink. So if people want to get a hold of you or find out more about serverless and what AWS is doing with that, how do they do that?

Julian: Yeah, absolutely. Well, please get hold of me anytime on Twitter, it's the easiest way probably, julian_wood. Happy to answer your questions about anything serverless or Lambda. And if I don't know the answer, I'll always ask Jeremy, so you're covered twice over there. And then, three different things. If you're talking specifically Lambda, James Beswick's operations guide, have a look at that. Just so many nuggets of super information. We've got another thing, we did just sort of jump around, you were talking about CloudFormation and the spark was going off in my head. We have something which we're calling the Serverless Patterns Collection, and this is really super cool. We didn't quite get to talk about it, but if you're building applications using SAM, the Serverless Application Model, or using the CDK, either way, we've got a whole bunch of patterns which you can grab.

So if you're pulling something from S3 to Lambda, or from Lambda to EventBridge, or SNS to SQS with a filter, all these kinds of things, they're literally copy-and-paste patterns that you can put immediately into your CloudFormation or your CDK templates. So when you are down the rabbit hole of Hacker News or Reddit or Stack Overflow, this is another resource that you can use to copy and paste. So go for that. And that's all hosted on our cool site called serverlessland.com. So that's serverlessland.com, and that's an aggregation site that we run, because we've got video talks, and we've got blog posts, and we've got learning path series, and we've got a whole bunch of stuff. Personally, I've got a learning path series coming out shortly on Lambda extensions and also one on Lambda observability. There's one coming out shortly on container image support. And our team is talking about as many things as we can virtually. I'm actually speaking about container images at DockerCon, which is coming up, which is exciting.

And yeah, so serverlessland.com, that's got a whole bunch of information. That's just an easy one-stop shop where you can get as much information about AWS serverless services as you can. And if not, get in touch, I'm happy to help. I'm happy to also carry your feedback. And yeah, at the moment, internally, we're sort of doing our planning for the next cycle of what Lambda and all the serverless stuff we're going to do. So if you've got an awesome idea, please send it on. And I'm sure you'll be super excited when something pops out in the near future, for some cool new functionality you could have been involved in.

Jeremy: Well, I know that serverlessland.com is an excellent resource, and it's not that the AWS Compute blog is hard to parse through or anything, but serverlessland.com is certainly a much easier resource to get there. So awesome. Julian, I will get all that stuff in the show notes. Thank you so much.

Julian: Oh, thank you very ... Oh, one more thing I didn't mention is Serverless Office Hours. Every Tuesday at 10:00 AM, Pacific Time, I'm in London, that's 6:00 PM. So Serverless Office Hours for an hour every week, we rotate about five different topics and bring any of your questions, anything. It's not just Lambda, it's Step Functions, API gateway, messaging, Lambda, serverless surprise as well. So have any questions, join us. And the links are also on Serverlessland and it's on Twitter and YouTube. That's another way you can get in touch. And yeah, just to finish up, Jeremy, thank you so much for inviting me. You've been a light in the serverless world and we really, really appreciate it, internally at AWS and personally about how you've created and talked about community and people, and just made the serverless thing such a cool place to be. So, yeah. Thank you for all you've done. And I really appreciate being able to share a little bit of time with you.

Jeremy: Well, thank you. It was great.

2021-05-17

Episode #100: All Things Serverless with Jeremy Daly

About Rebecca Marshburn

Rebecca's interested in the things that interest people: What's important to them? Why? And when did they first discover it to be so? She's also interested in sharing stories, elevating others' experiences, exploring the intersection of physical environments and human behavior, and crafting the perfect pun for every situation. Today, Rebecca is the Head of Content & Community at Common Room. Prior to Common Room, she led the AWS Serverless Heroes program, where she met the singular Jeremy Daly, and guided content and product experiences for fashion magazines, online blogs, AR/VR companies, education companies, and a little travel outfit called Airbnb.

Twitter: @beccaodelay
LinkedIn: Rebecca Marshburn
Company: www.commonroom.io
Personal work (all proceeds go to the charity of the buyer's choice): www.letterstomyexlovers.com

Watch this episode on YouTube: https://youtu.be/VVEtxgh6GKI

This episode sponsored by CBT Nuggets and Lumigo.

Transcript:
Rebecca: What a day today is! It's not every day you turn 100 times old, and on this day we celebrate Serverless Chats 100th episode with the most special of guests. The gentleman whose voice you usually hear on this end of the microphone, doing the asking, but today he's going to be doing the telling, the one and only, Jeremy Daly, and me. I'm Rebecca Marshburn, and your guest host for Serverless Chats 100th episode, because it's quite difficult to interview yourself. Hey Jeremy!

Jeremy: Hey Rebecca, thank you very much for doing this.

Rebecca: Oh my gosh. I am super excited to be here, couldn't be more honored. I'll give your listeners, our listeners, today, the special day, a little bit of background about us. Jeremy and I met through the AWS Serverless Heroes program, where I used to be a coordinator for quite some time. We support each other in content, conferences, product requests, road mapping, community-building, and most importantly, I think we've supported each other in spirit, and now I'm the head of content and community at Common Room, and Jeremy's leading Serverless Cloud at Serverless, Inc., so it's even sweeter that we're back together to celebrate this Serverless Chats milestone with you all, the most important, important, important, important part of the podcast equation, the serverless community. So without further ado, let's begin.

Jeremy: All right, hit me up with whatever questions you have. I'm here to answer anything.

Rebecca: Jeremy, I'm going to ask you a few heavy hitters, so I hope you're ready.

Jeremy: I'm ready to go.

Rebecca: And the first one's going to ask you to step way, way, way, way, way back into your time machine, so if you've got the proper attire on, let's do it. If we're going to step into that time machine, let's peel the layers, before serverless, before containers, before cloud even, what is the origin story of Jeremy Daly, the man who usually asks the questions.

Jeremy: That's tough. I don't think time machines go back that far, but it's funny, when I was in high school, I was involved with music, and plays, and all kinds of things like that. I was a very creative person. I loved creating things, that was one of the biggest sort of things, and whether it was music or whatever and I did a lot of work with video actually, back in the day. I was always volunteering at the local public access station. And when I graduated from high school, I had no idea what I wanted to do. I had used computers at the computer lab at the high school. I mean, this is going back a ways, so it wasn't everyone had their own computer in their house, but I went to college and then, my first, my freshman year in college, I ended up, there's a suite-mate that I had who showed me a website that he built on the university servers.

And I saw that and I was immediately like, "Whoa, how do you do that"? Right, just this idea of creating something new and being able to build that out was super exciting to me, so I spent the next couple of weeks figuring out how to do HTML, and this was before, this was like when JavaScript was super, super early and we're talking like 1997, and everything was super early. I was using this, I eventually moved away from using FrontPage and started using this thing called HotDog. It was software for HTML coding, but I started doing that, and I started building websites, and then after a while, I started figuring out what things like CGI-bins were, and how you could write Perl scripts, and how you could make interactions happen, and how you could capture FormData and serve up different things, and it was a lot of copying and pasting.

My major at the time, I think was psychology, because it was like a default thing that I could do. But then I moved into computer science. I did computer science for about a year, and I felt that that was a little bit too narrow for what I was hoping to sort of do. I was starting to become more entrepreneurial. I had started selling websites to people. I had gone to a couple of local businesses and started building websites, so I actually expanded that and ended up doing sort of a major that straddled computer science and management, like business administration. So I ended up graduating with a degree in e-commerce and internet marketing, which is sort of very early, like before any of this stuff seemed to even exist. And then from there, I started a web development company, worked on that for 12 years, and then I ended up selling that off. Did a startup, failed the startup. Then from that startup, went to another startup, worked there for a couple of years, went to another startup, did a lot of consulting in between there, somewhere along the way I found serverless and AWS Cloud, and then now it's sort of led me to advocacy for building things with serverless and now I'm building sort of the, I think what I've been dreaming about building for the last several years in what I'm doing now at Serverless, Inc.

Rebecca: Wow. All right. So this love story started in the 90s.

Jeremy: The 90s, right.

Rebecca: That's an incredible era. And welcome to 2021.

Jeremy: Right. It's been a journey.

Rebecca: Yeah, truly, that's literally a new millennium. So in a broad way of saying it, you've seen it all. You've started from the very HotDog of the world, to today, which is an incredible name, I'm going to have to look them up later. So then you said serverless came along somewhere in there, but let's go to the middle of your story here, so before Serverless Chats, before its predecessor, which is your weekly Off-by-none newsletter, and before, this is my favorite one, debates around, what the suffix "less" means when appended to server. When did you first hear about Serverless in that moment, or perhaps you don't remember the exact minute, but I do really want to know what struck you about it? What stood out about serverless rather than any of the other types of technologies that you could have been struck by and been having a podcast around?

Jeremy: Right. And I think I gave you maybe too much of a surface level of what I've seen, because I talked mostly about software, but if we go back, I mean, hardware was one of those things where hardware, and installing software, and running servers, and doing networking, and all those sort of things, those were part of my early career as well. When I was running my web development company, we started by hosting on some hosting service somewhere, and then we ended up getting a dedicated server, and then we outgrew that, and then we ended up saying, "Well maybe we'll bring stuff in-house". So we did on-prem for quite some time, where we had our own servers in the T1 line, and then we moved to another building that had a T3 line, and if anybody doesn't know what that is, you probably don't need to anymore.

But those are the things that we were doing, and then eventually we moved into a co-location facility where we rented space, and we rented electricity, and we rented all the utilities, the bandwidth, and so forth, but we had Blade servers and I was running VMware, and we were doing all this kind of stuff to manage the infrastructure, and then writing software on top of that, so it was a lot of work. I know I posted something on Twitter a few weeks ago, about how, when I was, when we were young, we used to have to carry a server on our back, uphill, both ways, to the data center, in the snow, with no shoes, and that's kind of how it felt, that you were doing a lot of these things.

And then 2008, 2009, as I was kind of wrapping up my web development company, we were just in the process of actually saying it's too expensive at the colo. I think we were paying probably between like $5,000 and $7,000 a month between the ... we had leases on some of the servers, you're paying for electricity, you're paying for all these other things, and we were running a fair amount of services in there, so it seemed justifiable. We were making money on it, that wasn't the problem, but it just was a very expensive fixed cost for us, and when the cloud started coming along and I started actually building out the startup that I was working on, we were building all of that in the cloud, and as I was learning more about the cloud and how that works, I'm like, I should just move all this stuff that's in the co-location facility, move that over to the cloud and see what happens.

And it took a couple of weeks to get that set up, and now, again, this is early, this is before ELB, this is before RDS, this is before, I mean, this was very, very early cloud. I mean, I think there was S3 and EC2. I think those were the two services that were available, with a few other things. I don't even think there were VPCs yet. But anyways, I moved everything over, took a couple of weeks to get that over, and essentially our bill to host all of our clients' sites and projects went from $5,000 to $7,000 a month, to $750 a month or something like that, and it's funny because had I done that earlier, I may not have sold off my web development company because it could have been much more profitable, so it was just an interesting move there.

So we got into the cloud fairly early and started sort of leveraging that, and it was great to see all these things get added, and all these specialty services like RDS just taking away that responsibility, because I literally was installing Microsoft SQL Server on an EC2 instance, which is not something that you want to do; you want to use RDS. It's just a much better way to do it. But anyways, so I was working for another startup, this was like startup number 17 or whatever it was, and we had this incident where ... we had a pretty good setup. I mean, everything was on EC2 instances, but we were using DynamoDB to do some caching layers for certain things. We were using a sharded MySQL database for product information, and so forth.

So the system was pretty resilient; it handled all of the load testing we did and things like that, but then we actually got featured on Good Morning America, and they mentioned our app, it was the Power to Mobile app. So we get mentioned on Good Morning America. I think it was Good Morning America. The Today Show? Good Morning America, I think it was. One of those morning shows, anyways. We got about 10,000 sign-ups in less than a minute, which was amazing, it was just this huge spike in traffic, which was great. The problem was, we had this really weak point in our system where we had to basically get a lock on the database in order to get an incremental ID, and so essentially what happened is the database choked just creating user accounts, other users couldn't sign in, and there were all kinds of problems, so we basically lost out on all of this capability.
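(As an aside, the lock-on-an-incremental-ID bottleneck described here is the classic argument for coordination-free identifiers: a UUID can be minted on any app server without touching the database at all. A minimal sketch, purely illustrative and not the actual Power to Mobile code:)

```python
import uuid

def create_user_id() -> str:
    # A UUID is generated independently on whichever server handles
    # the request, so a traffic spike never serializes sign-ups
    # behind a single database lock the way an incremental ID does.
    return str(uuid.uuid4())

# Simulate a burst of 10,000 sign-ups like the one described above;
# every ID is unique without any coordination between requests.
ids = {create_user_id() for _ in range(10_000)}
```

The trade-off is that UUIDs aren't ordered, so anything that relied on the counter's ordering needs a separate timestamp or sequence.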

So I spent some time doing a lot of research and trying to figure out how do you scale that? How do you scale something that fast? How do you have that resilience in there? And there's all kinds of ways that we could have done it with traditional hardware, it's not like it wasn't possible to do with a slightly better strategy, but as I was digging around in AWS, I'm looking around at some different things, and we were, I was always in the console cause we were using Dynamo and some of those things, and I came across this thing that said "Lambda," with a little new thing next to it. I'm like, what the heck is this?

So I click on that and I start reading about it, and I'm like, this is amazing. We don't have to spin up a server, we don't have to use Chef, or Puppet, or anything like that to spin up these machines. We can basically just say, when X happens, do Y, and it enlightened me. And this was early 2015, so this would have been right after Lambda went GA. I had never heard of Lambda as part of the preview; I mean, I wasn't in that re:Invent, I don't know, what would you call it? Vortex, maybe, is a good way to describe the event.
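(That "when X happens, do Y" model is, concretely, just a function the platform invokes with an event payload. A minimal sketch, assuming Python and AWS Lambda's standard handler signature; the event fields here are illustrative, not from any real app:)

```python
import json

def handler(event, context):
    # "When X happens, do Y": Lambda calls this function whenever the
    # configured event source fires (an upload, a queue message, a
    # sign-up, ...). There is no server to provision and no Chef or
    # Puppet run to get a machine ready first.
    user = event.get("user", "unknown")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"welcome, {user}"}),
    }
```

Wiring the function to its trigger (an event source mapping or, later, API Gateway) is configuration, not code, which is exactly what made the model feel so light.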

Rebecca: Vortex sounds about right. That's about how it feels by the end.

Jeremy: Right, exactly. So I wasn't really in that group yet, I wasn't part of that community, so I hadn't heard about it. And as I started playing around with it, I immediately saw the value there, because, for me, as someone who again had managed servers and had built out really complex networking too ... I think there are some things you don't think about when you're on-prem managing your own stuff, things the cloud manages for you. I mean, we had firewalls, and we had to do all the firewall rules ourselves, right? I mean, I know you still have to do security groups and things like that in AWS, but the level of complexity is a lot lower when you're in the cloud, and of course there are so many great services and systems that help you do that now.

But just the idea of saying, "wait a minute, so if I have something happen, like a user signup, for example, and I don't have to worry about provisioning all the servers that I need in order to handle that," and again, it wasn't so much the server aspect of it as it was the database aspect of it, but one of the things that was sort of interesting about the idea of serverless, too, was this asynchronous nature of it, this idea of being more event-driven, and that things don't have to happen immediately necessarily. So that just struck me as something where it seemed like it would reduce a lot, and again, this term has been overused, but the undifferentiated heavy lifting, we use that term over and over again, but there is not a better term for that, right?

Because there were just so many things that you have to do as a developer, as an ops person, somebody who is trying to straddle teams, or just a PM, or whatever you are, so many things that you have to do in order to get an application running, first of all, and then even more you have to do in order to keep it up and running, and then even more, if you start thinking about distributing it, or scaling it, or getting any of those things, disaster recovery. I mean, there's a million things you have to think about, and I saw serverless immediately as this opportunity to say, "Wait a minute, this could reduce a lot of that complexity and manage all of that for you," and then again, literally let you focus on the things that actually matter for your business.

Rebecca: Okay. As someone who worked, how should I say this, in metatech, or the technology of technology in the serverless space, when you say that you were starting to build that without ELB even, or RDS, my level of anxiety is like, I really feel like I'm watching a slow horror film. I'm like, "No, no, no, no, no, you didn't, you didn't, you didn't have to do that, did you"?

Jeremy: We did.

Rebecca: So I applaud you for making it to the end of the film and still being with us.

Jeremy: Well, the other thing ...

Rebecca: Only one protagonist does that.

Jeremy: Well, the other thing that's interesting too, about serverless, and where it was in 2015 when Lambda went GA, this will give you some anxiety: there was no API Gateway. So there was no way to actually trigger a Lambda function from a web request, right? There was no VPC access in Lambda functions, which meant you couldn't connect to a database. The only thing you could do was connect via HTTP, so you could connect to DynamoDB or things like that, but you could not connect directly to RDS, for example. So if you go back and you look at the timeline of when these things were released, just from 2015, you literally feel like a caveman thinking about what you could do back then. It's banging two sticks together versus where we are now, and the capabilities that are available to us.

Rebecca: Yeah, you're sort of in Plato's cave, right, and you're looking up and you're like, "It's quite dark in here," and Lambda's up there, outside, sowing seeds, being like, "Come on out, it's dark in there". All right, so I imagine you discovering Lambda through the console is not a sentence you hear every day or general console discovery of a new product that will then sort of change the way that you build, and so I'm guessing maybe one of the reasons why you started your Off-by-none newsletter or Serverless Chats, right, is to be like, "How do I help tell others about this without them needing to discover it through the console"? But I'm curious what your why is. Why first the Off-by-none newsletter, which is one of my favorite things to receive every week, thank you for continuing to write such great content, and then why Serverless Chats? Why are we here today? Why are we at number 100? Which I'm so excited about every time I say it.

Jeremy: And it's kind of crazy to think about all the people I've gotten a chance to talk to. But so, I think if you go back, I started writing blog posts maybe in 2015, so I haven't been doing it that long, and I certainly wasn't prolific. I wasn't consistently writing a blog post every week, or two a week, like some people do now, which is kind of crazy. I don't know how they do that, I mean, it's hard enough writing the newsletter every week, never mind writing original content. But I started writing about serverless, I think it wasn't until the beginning of 2018, maybe the end of 2017, and there was already a lot of great content out there. I mean, Ben Kehoe was very early into this and a lot of his stuff I read very early.

I mean, there's just so many people that were very early in the space, I mean, Paul Johnson, I mean, just so many people, right, and I started reading what they were writing and I was like, "Oh, I've got some ideas too, I've been experimenting with some things, I feel like I've gotten to a point where what I could share could be potentially useful". So I started writing blog posts, and I think one of the earlier blog posts I wrote was, I want to say 2017, maybe it was 2018, early 2018, but was a post about serverless security, and what was great about that post was that actually got me connected with Ory Segal, who had started PureSec, and he and I became friends and that was the other great thing too, is just becoming part of this community was amazing.

So many awesome people that I've met, but so I saw all this stuff people were writing and these things people were doing, and I got to maybe August of 2018, and I said to myself, I'm like, "Okay, I don't know if people are interested in what I'm writing". I wasn't writing a lot, but I was writing a little bit, but I wasn't sure people were overly interested in what I was writing, and again, that idea of the imposter syndrome, certainly everything was very early, so I felt a little bit more comfortable. I always felt like, well, maybe nobody knows what they're talking about here, so if I throw something into the fold it won't be too, too bad, but certainly, I was reading other things by other people that I was interested in, and I thought to myself, I'm like, "Okay, if I'm interested in this stuff, other people have to be interested in this stuff," but it wasn't easy to find, right.

I mean, there was sort of a serverless Twitter, if you want to use that terminology, where a lot of people tweet about it and so forth. Obviously it's gotten very noisy now because people slapped that term on way too many things, but I don't want to have that discussion. But so I'm reading all this great stuff and I'm like, "I really want to share it," and I'm like, "Well, I guess the best way to do that would just be a newsletter."

I had an email list for my own personal site that I had had a couple of hundred people on, and I'm like, "Well, let me just turn it into this thing, and I'll share these stories, and maybe people will find them interesting," and I know this is going to sound a little bit corny, but I have two teenage daughters, so I'm allowed to be sort of this dad-jokey type. I remember when I started writing the first version of this newsletter and I said to myself, I'm like, "I don't want this to be a newsletter." I was toying around with this idea of calling it an un-newsletter. I didn't want it to just be another list of links that you click on, and I know that's interesting to some people, but I felt like there was an opportunity to opine on it, to look at the individual links, and maybe even tell a story as part of all of the links that were shared that week, and I thought that that would be more interesting than just getting a list of links.

And I'm sure you've seen over the last 140 issues, or however many we're at now, that there's been changes in the way that we formatted it, and we've tried new things, and things like that, but ultimately, and this goes back to the corny thing, I mean, one of the first things that I wanted to do was, I wanted to basically thank people for writing this stuff. I wanted to basically say, "Look, this is not just about you writing some content". This is big, this is important, and I appreciate it. I appreciate you for writing that content, and I wanted to make it more of a celebration really of the community and the people that were early contributors to that space, and that's one of the reasons why I did the Serverless Star thing.

I thought, if somebody writes a really good article some week, and it just really hits me, or somebody else says, "Hey, this person wrote a great article," or whatever, I wanted to sort of celebrate that person and call them out. Because that's one of the things, too: writing blog posts or posting things on social media without a good following, or without the dopamine hit of people liking it, or retweeting it, and things like that, can be a pretty lonely place. I mean, I know I feel that way sometimes when you put something out there, and you think it's important, or you think people might want to see it, and just not enough people see it.

It's even worse, I mean, 240 characters, or whatever it is to write a tweet is one thing, or 280 characters, but if you're spending time putting together a tutorial or you put together a really good thought piece, or story, or use case, or something where you feel like this is worth sharing, because it could inspire somebody else, or it could help somebody else, could get them past a bump, it could make them think about something a different way, or get them over a hump, or whatever. I mean, that's just the kind of thing where I think people need that encouragement, and I think people deserve that encouragement for the work that they're doing, and that's what I wanted to do with Off-by-none, is make sure that I got that out there, and to just try to amplify those voices the best that I could. The other thing where it's sort of progressed, and I guess maybe I'm getting ahead of myself, but the other place where it's progressed and I thought was really interesting, was, finding people ...

There's the heavy hitters in the serverless space, right? The ones we all know, and you can name them all, and they are great, and they produce amazing content, and they do amazing things, but they have pretty good engines to get their content out, right? I mean, some people who write for the AWS blog, they're on the AWS blog, right, so they're doing pretty well in terms of getting their things out there, right, and they've got pretty good engines.

There are some good dev advocates, too, that just have good Twitter followings and things like that. Then there's that guy who writes the story. I don't know, he's in India or he's in Poland or something like that. He writes this really good tutorial on how to do some odd edge case for serverless. And you go and you look at his Medium and he's got two followers on Medium, five followers on Twitter or something like that. And that, to me, just seems unfair, right? I mean, he's written a really good piece and it's worth sharing, right? And it needs to get out there. I don't have a huge audience. I know that. I mean, I've got a good following on Twitter. I feel like a lot of my Twitter followers, we can have good conversations, which is what you want on Twitter.

The newsletter has continued to grow. We've got a good listener base for this show here. So, I don't have a huge audience, but if I can share that audience with other people and get other people to the forefront, then that's important to me. And I love finding those people and those ideas that other people might not see because they're not looking for them. So, if I can be part of that and help share that, that to me, it's not only a responsibility, it's just it's incredibly rewarding. So ...

Rebecca: Yeah, I have to ... I mean, it is your 100th episode, so hopefully I can give you some kudos, but if celebrating others' work is one of your main tenets, you nail it every time. So ...

Jeremy: I appreciate that.

Rebecca: Just wanted you to know that. So, that's sort of the Genesis of course, of both of these, right?

Jeremy: Right.

Rebecca: That underpins the foundation of both: how to share others' work through different channels. I'm wondering how it transformed. There's this newsletter, and then of course it also has this other component, which is Serverless Chats. And that moment when you were like, "All right, this newsletter, this narrative that I'm telling behind serverless, highlighting all of these different authors from all these different global spaces ... you know what else I want to do? I don't have enough to do, I'm going to start a podcast." How did we get here?

Jeremy: Well, so the funny thing is, now that I think about it, I think it just goes back to this tenet of fairness, this idea where I was fortunate, and I was able to go down to New York City and go to Serverless Days New York in late 2018. I was able to ... Tom McLaughlin actually got me connected with a bunch of great people in Boston. I live just outside of Boston. We got connected with a bunch of great people, and we started Serverless Days Boston for 2019, and we were on that committee. I started traveling and I was going to conferences and I was meeting people. I went to re:Invent in 2018, which I know a lot of people just don't have the opportunity to do. And the interesting thing was that I was pulling aside brilliant people, either in the hallway at a conference or, more likely, for a very long, deep discussion that we would have about something at a pub in Northern Ireland or something like that, right?

I mean, these were opportunities that I was getting that I was privileged enough to get. And I'm like, these are amazing conversations. Just things that, for me, I know changed the way I think. And one of the biggest things that I try to do is evolve my thinking. What I thought a year ago is probably not what I think now. Maybe call it flip-flopping, whatever you want to call it. But I think that evolving your thinking is the most progressive thing that you can do and starting to understand as you gain new perspectives. And I was talking to people that I never would have talked to if I was just sitting here in my home office or at the time, I mean, I was at another office, but still, I wasn't getting that context. I wasn't getting that experience. And I wasn't getting those stories that literally changed my mind and made me think about things differently.

And so, here I was in this privileged position, being able to talk to these amazing people and in some cases funny, because they're celebrities in their own right, right? I mean, these are the people where other people think of them and it's almost like they're a celebrity. And these people, I think they deserve fame. Don't get me wrong. But like as someone who has been on that side of it as well, it's ... I don't know, it's weird. It's weird to have fans in a sense. I love, again, you can be my friend, you don't have to be my fan. But that's how I felt about ...

Rebecca: I'm a fan of my friends.

Jeremy: So, a fan and my friend. So, I was having these really deep conversations with these other people on serverless, and conversations that go beyond serverless, too. Actually, I had quite a few conversations with some people that had nothing to do with serverless. Actually, Peter Sbarski and I, every time we get together, we only talk about the value of going to college for some reason. I don't know why. It usually has nothing to do with serverless. So, I'm having these great conversations with these people and I'm like, "Wow, I wish I could share these. I wish other people could have this experience," because I can tell you right now, there are people who can't travel, especially a lot of people outside of the United States. It's hard to travel to the United States sometimes.

So, these conversations are going on and I thought to myself, I'm like, "Wouldn't it be great if we could just have these conversations and let other people hear them, hopefully without bar glasses clinking in the background?" And so I said, "You know what? Let's just try it. Let's see what happens. I'll do a couple of episodes. If it works, it works. If it doesn't, it doesn't. If people are interested, they're interested." But that was the genesis of that, I mean, it just goes back to this idea where I felt a little selfish having conversations and not being able to share them with other people.

Rebecca: It's the very Jeremy Daly tenet slogan, right? You got to share it. You got to share it ...

Jeremy: Got to share it, right?

Rebecca: The more he shares it, it celebrates it. I love that. I think you do ... Yeah, you do a great job giving a megaphone so that more people can hear. So, in case you need a reminder, actually, I'll ask you, I know what the answer is to this, but do you know the answer? What was your very first episode of Serverless Chats? What was the name, and how long did it last?

Jeremy: What was the name?

Rebecca: Oh yeah. Oh yeah.

Jeremy: Oh, well I know ... Oh, I remember now. Well, I know it was Alex DeBrie. I absolutely know that it was Alex DeBrie because ...

Rebecca: Correct on that.

Jeremy: If nobody, if you do not know Alex DeBrie, not only is he an AWS data hero, as well as the author of The DynamoDB Book, but he's also like the most likable person on the planet too. It is really hard if you've ever met Alex, that you wouldn't remember him. Alex and I started communicating, again, we met through the serverless space. I think actually he was working at Serverless Inc. at the time when we first met. And I think I met him in person, finally met him in person at re:Invent 2018. But he and I have collaborated on a number of things and so forth. So, let me think what the name of it was. "Serverless Purity Versus Practicality" or something like that. Is that close?

Rebecca: That's exactly what it was.

Jeremy: Oh, all right. I nailed it. Nailed it. Yes!

Rebecca: Wow. Well, it's a great title. And I think ...

Jeremy: Don't ask me what episode number 27 was though, because no way I could tell you that.

Rebecca: And just for fun, it was 34 minutes long and you released it on June 17th, 2019. So, you've come a long way in a year and a half. That's some kind of wildness. So it makes sense, like, "THE," capital, all caps, bold, italic, author for databases, Alex DeBrie. Makes sense why you selected him as your guest. I'm wondering if you remember any of the ... What do you remember most about that episode? What was it like planning it? What was the reception of it? Anything funny happened recording it or releasing it?

Jeremy: Yeah, well, I mean, so the funny thing is that I was incredibly nervous. I still am, actually; with a lot of guests that I have, I'm still incredibly nervous when I'm about to do the actual interview. And I think it's partially because I want to do justice to the content that they're presenting and to their expertise. And I feel like there's a responsibility to them, but I also feel like the guests that I've had on, some of them are just so smart, and the things they say, I'm just in awe of some of the things that come out of these people's mouths. And I'm like, "This is amazing and people need to hear this." And so, I feel like we've had really good episodes and we've had some okay episodes, but I feel like I want to try to keep that level up, because I owe that to my listeners: to make sure that there's a high-quality episode, high-quality information, that they're going to get out of it.

But going back to the planning of the initial episodes, so I actually had six episodes recorded before I even released the first one. And the reason why I did that was because I said, "All right, there's no way that I can record an episode and then wait a week and then record another episode and wait a week." And I thought batching them would be a good idea. And so, very early on, I had Alex and I had Nitzan Shapira and I had Ran Ribenzaft and I had Marcia Villalba and I had Erik Peterson from CloudZero. And so, I had a whole bunch of these episodes and I reached out to, I think, eight or nine people. And I said, "I'm doing this thing, would you be interested in it?" Whatever, and we did planning sessions, still a thing that I do today, it's still part of the process.

So, whenever I have a guest on, if you are listening to an episode and you're like, "Wow, how did they just keep the thing going ..." It's not scripted. I don't want people to think it's scripted, but we do review the outline and we go through some talking points to make sure that, again, it's a high-quality episode and that the guest says all the things that the guest wants to say. A lot of it is spontaneous, right? I mean, the language is spontaneous, but we do try to plan these episodes ahead of time so that we make sure we get the content out and we talk about all the things we want to talk about. But with Alex, it was funny.

He was actually the first of the six episodes that I recorded, though. I wasn't sure who I was going to release first, I hadn't quite picked it yet, but I recorded with Alex first. And it was an easy, easy conversation. And the reason why it was an easy conversation was because we had talked a number of times, right? It was that talking-in-a-pub kind of thing, having that friendly chat. So, that was a pretty easy conversation. And I remember the first several conversations I had: I knew Nitzan very well. I knew Ran very well. I knew Erik very well. Erik helped plan Serverless Days Boston with me. And I had known Marcia very well. Marcia actually had interviewed me when we were in Vegas for re:Invent 2018.

So, those were very comfortable conversations. And so, it actually was a lot easier to do, which probably gave me a false sense of security. I was like, "Wow, this was ... These came out pretty well." The conversations worked pretty well. And also it was super easy because I was just doing audio. And once you add the video component into it, it gets a little bit more complex. But yeah, I mean, I don't know if there's anything funny that happened during it, other than the fact that I mean, I was incredibly nervous when we recorded those, because I just didn't know what to expect. If anybody wants to know, "Hey, how do you just jump right into podcasting?" I didn't. I actually was planning on how can I record my voice? How can I get comfortable behind a microphone? And so, one of the things that I did was I started creating audio versions of my blog posts and posting them on SoundCloud.

So, I did that for a couple of ... I'm sorry, a couple of blog posts that I did. And that just helped make me feel a bit more comfortable about being able to record and getting a little bit more comfortable, even though I still can't stand the sound of my own voice, but hopefully that doesn't bother other people.

Rebecca: That is an amazing ... I think we so often talk about ideas around you know where you want to go and you have this vision and that's your goal. And it's a constant reminder to be like, "How do I make incremental steps to actually get to that goal?" And I love that as a life hack, like, "Hey, start with something you already know that you wrote and feel comfortable in and say it out loud and say it out loud again and say it out loud again." And you may never love your voice, but you will at least feel comfortable saying things out loud on a podcast.

Jeremy: Right, right, right. I'm still working on the, "Ums" and, "Ahs." I still do that. And I don't edit those out. That's another thing too, actually, that one of the things I do want people to know about this podcast is these are authentic conversations, right? I am probably like ... I feel like I'm, I mean, the most authentic person that I know. I just want authenticity. I want that out of the guests. The idea of putting together an outline is just so that we can put together a high quality episode, but everything is authentic. And that's what I want out of people. I just want that authenticity, and one of the things that I felt kept that, was leaving in, "Ums" and, "Ahs," you know what I mean? It's just, it's one of those things where I know a lot of podcasts will edit those out and it sounds really polished and finished.

Again, I mean, I figured if we can get the clinking glasses out from the background of a bar and just at least have the conversation that that's what I'm trying to achieve. And we do very little editing. We do cut things out here and there, especially if somebody makes a mistake or they want to start something over again, we will cut that out because we want, again, high quality episodes. But yeah, but authenticity is deeply important to me.

Rebecca: Yeah, I think it probably certainly helps that neither of us are robots, because robots wouldn't say, "Um" so many times. As I say, "Uh." So, let's talk about ... Alex DeBrie was your first guest, but there have been a hundred episodes, right? So, from, I might say, the best guest as a hundredth-episode guest, which is our very own Jeremy Daly, but let's go back to ...

Jeremy: I appreciate that.

Rebecca: Your guests, one to 99. And I mean, you've chatted with some of the most thoughtful, talented, Serverless builders and architects in the industry, and across coincident spaces like ML and Voice Technology, Chaos Engineering, databases. So, you started with Alex DeBrie and databases, and then I'm going to list off some names here, but there's so many more, right? But there's the Gunnar Grosches, and the Alexandria Abbasses, and Ajay Nair, and Angela Timofte, James Beswick, Chris Munns, Forrest Brazeal, Aleksandar Simovic, and Slobodan Stojanovic. Like there are just so many more. And I'm wondering if across those hundred conversations, or 99 plus your own today, if you had to distill those into two or three lessons, what have you learned that sticks with you? If there are emerging patterns or themes across these very divergent and convergent thinkers in the serverless space?

Jeremy: Oh, that's a tough question.

Rebecca: You're welcome.

Jeremy: So, yeah, put me on the spot here. So, yeah, I mean, I think one of the things that I've, I've seen, no matter what it's been, whether it's ML or it's Chaos Engineering, or it's any of those other observability and things like that. I think the common thing that threads all of it is trying to solve problems and make people's lives easier. That every one of those solutions is like, and we always talk about abstractions and, and higher-level abstractions, and we no longer have to write ones and zeros on punch cards or whatever. We can write languages that either compile or interpret it or whatever. And then the cloud comes along and there's things we don't have to do anymore, that just get taken care of for us.

And you keep building these higher level of abstractions. And I think that's a lot of what ... You've got this underlying concept of letting somebody else handle things for you. And then you've got this whole group of people that are coming at it from a number of different angles and saying, "Well, how will that apply to my use case?" And I think a lot of those, a lot of those things are very, very specific. I think things like the voice technology where it's like the fact that serverless powers voice technology is only interesting in the fact as to say that, the voice technology is probably the more interesting part, the fact that serverless powers it is just the fact that it's a really simple vehicle to do that. And basically removes this whole idea of saying I'm building voice technology, or I'm building a voice app, why do I need to worry about setting up servers and all this kind of stuff?

It just takes that away. It takes that out of the equation. And I think that's the perfect idea of saying, "How can you take your use case, fit serverless in there and apply it in a way that gets rid of all that extra overhead that you shouldn't have to worry about." And the same thing is true of machine learning. And I mean, and SageMaker, and things like that. Yeah, you're still running instances of it, or you still have to do some of these things, but now there's like SageMaker endpoints and some other things that are happening. So, it's moving in that direction as well. But then you have those really high level services like the NLU API from IBM, which is Watson Natural Language Understanding.

You've got speech recognition, you've got the vision API, you've got sentiment analysis through all these different things. So, you've got a lot of different services that are very specific to machine learning and solving a discrete problem there. But then basically relying on serverless or at least presenting it in a way that's serverless, where you don't have to worry about it, right? You don't have to run all of these Jupyter notebooks and things like that, to do machine learning for a lot of cases. This is one of the things I talked about with Alexandra Abbas, was that these higher level APIs are just taking a lot of that responsibility or a lot of that heavy lifting off of your plate and allowing you to really come down and focus on the things that you're doing.

So, going back to that, I do think that serverless, that the common theme that I see is that this idea of worrying about servers and worrying about patching things and worrying about networking, all that stuff. For so many people now, that's just not even a concern. They didn't even think about it. And that's amazing to think of, compute ... Or data, or networking as a utility that is now just available to us, right? And I mean, again, going back to my roots, taking it for granted is something that I think a lot of people do, but I think that's also maybe a good thing, right? Just don't think about it. I mean, there are people who, they're still going to be engineers and people who are sitting in the data center somewhere and racking servers and doing it, that's going to be forever, right?

But for the things that you're trying to build, that's unimportant to you. That is the furthest from your concern. You want to focus on the problem that you're trying to solve. And so I think that, that's a lot of what I've seen from talking to people is that they are literally trying to figure out, "Okay, how do I take what I'm doing, my use case, my problem, how do I take that to the next level, by being able to spend my cycles thinking about that as opposed to how I'm going to serve it up to people?"

Rebecca: Yeah, I think it's the mantra, right, of simplify, simplify, simplify, or maybe even to credit Bruce Lee, be like water. You're like, "How do I be like water in this instance?" Well, it's not to be setting up servers, it's to be doing what I like to be doing. So, you've interviewed these incredible folks. Is there anyone left on your list? I'm sure there ... I mean, I know that you have a large list. Is there a few key folks where you're like, "If this is the moment I'm going to ask them, I'm going to say on the hundredth episode, 'Dear so-and-so, I would love to interview you for Serverless Chats.'" Who are you asking?

Jeremy: So, this is something that, again, we have a stretch list of guests that we attempt to reach out to every once in a while just to say, "Hey, if we get them, we get them." But so, I have a long list of people that I would absolutely love to talk to. I think number one on my list is certainly Werner Vogels. I mean, I would love to talk to Dr. Vogels about a number of things, and maybe even beyond serverless, I'm just really interested. More so from a curiosity standpoint of like, "Just how do you keep that in your head?" That vision of where it's going. And I'd love to drill down more into the vision because I do feel like there's a marketing aspect of it, that's pushing on him of like, "Here's what we have to focus on because of market adoption and so forth. And even though the technology, you want to move into a certain way," I'd be really interesting to talk to him about that.

And I'd love to talk to him more too about developer experience and so forth, because one of the things that I love about AWS is that it gives you so many primitives, but at the same time, the thing I hate about AWS is it gives you so many primitives. So, you have to think about 800 services, I know it's not that many, but like, what is it? 200 services, something like that, that all need to kind of connect together. And I love that there's that diversity in those capabilities, it's just from a developer standpoint, it's really hard to choose which ones you're supposed to use, especially when several services overlap. So, I'm just curious. I mean, I'd love to talk to him about that and see what the vision is in terms of, is that the idea, just to be a salad bar, to be the Golden Corral of cloud services, I guess, right?

Where you can choose whatever you want and probably take too much and then not use a lot of it. But I don't know if that's part of the strategy, but I think there's some interesting questions, could dig in there. Another person from AWS that I actually want to talk to, and I haven't reached out to her yet just because, I don't know, I just haven't reached out to her yet, but is Brigid Johnson. She is like an IAM expert. And I saw her speak at re:Inforce 2019, it must have been 2019 in Boston. And it was like she was speaking a different language, but she knew IAM so well, and I am not a fan of IAM. I mean, I'm a fan of it in the sense that it's necessary and it's great, but I can't wrap my head around so many different things about it. It's such a ...

It's an ongoing learning process and when it comes to things like being able to use tags to elevate permissions. Just crazy things like that. Anyways, I would love to have a conversation with her because I'd really like to dig down into sort of, what is the essence of IAM? What are the things that you really have to think about with least privilege? Especially applying it to serverless services and so forth. And maybe have her help me figure out how to do some of the cross-role IAM things that I'm trying to do. Certainly would love to speak to Jeff Barr. I did meet Jeff briefly. We talked for a minute, but I would love to chat with him.
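The least-privilege idea mentioned above boils down to granting only the specific actions, on the specific resources, that a function actually needs. As a rough, hypothetical illustration only (real IAM evaluation also involves explicit denies, conditions, and the tag-based controls mentioned above, and the table ARN here is made up):

```python
# Toy illustration of least-privilege thinking -- NOT real IAM
# evaluation logic, which adds explicit denies, conditions,
# tag-based access control, and much more.
POLICY = {
    "Effect": "Allow",
    # Only the two DynamoDB actions this function actually needs ...
    "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
    # ... and only on one specific (hypothetical) table, never "*".
    "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Orders",
}

def is_allowed(policy, action, resource):
    # Default deny: anything not explicitly allowed is rejected.
    return (policy["Effect"] == "Allow"
            and action in policy["Action"]
            and resource == policy["Resource"])
```

The point of the sketch is the shape of the policy: a tightly scoped action list and a single resource, rather than broad wildcards.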

I think he sets a shining example of what a developer advocate is. Just the way that ... First of all, he's probably the only person alive who knows every service at AWS and has actually tried it because he writes all those blog posts about it. So that would just be great to pick his brain on that stuff. Also, Adrian Cockcroft would be another great person to talk to. Just this idea of what he's done with microservices and thinking about the role, his role with Netflix and some of those other things and how all that kind of came together, I think would be a really interesting conversation. I know I've seen this in so many of his presentations where he's talked about the objections, what were the objections of Lambda and how have you solved those objections? And here's the things that we've done.

And again, the methodology of that would be really interesting to know. There's a couple of other people too. Oh, Sam Newman who wrote Building Microservices, that was my Bible for quite some time. I had it on my iPad and had a whole bunch of bookmarks and things like that. And if anybody wants to know, one of my most popular posts that I've ever written was the ... I think it was ... What is it? 16, 17 architectural patterns for serverless or serverless microservice patterns on AWS. Can't even remember the name of my own posts. But that post was very, very popular. And that even was ... I know Matt Coulter who did the CDK. He's done the whole CDK ... What the heck was that? The CDKpatterns.com. That was one of the things where he said that that was instrumental for him in seeing those patterns and being able to use those patterns and so forth.

If anybody wants to know, a lot of those patterns and those ideas and those ... The sort of the confidence that I had with presenting those patterns, a lot of that came from Sam Newman's work in his Building Microservices book. So again, credit where credit is due. And I think that that would be a really fascinating conversation. And then Simon Wardley, I would love to talk to. I'd actually love to ... I actually talked to ... I met Lin Clark in Vegas as well. She was instrumental with the WebAssembly stuff, and I'd love to talk to her. Merritt Baer. There's just so many people. I'm probably just naming too many people now. But there are a lot of people that I would love to have a chat with and just pick their brain.

And also, one of the things that I've been thinking about a lot on the show as well, is the term "serverless." Good or bad for some people. Some of the conversations we have go outside of serverless a little bit, right? There's sort of peripheral to it. I think that a lot of things are peripheral to serverless now. And there are a lot of conversations to be had. People who were building with serverless. Actually real-world examples.

One of the things I love hearing was Yan Cui's "Real World Serverless" podcast where he actually talks to people who are building serverless things and building them in their organizations. That is super interesting to me. And I would actually love to have some of those conversations here as well. So if anyone's listening and you have a really interesting story to tell about serverless or something peripheral to serverless please reach out and send me a message and I'd be happy to talk to you.

Rebecca: Well, good news is, it sounds like A, we have at least ... You've got at least another a hundred episodes planned out already.

Jeremy: Most likely. Yeah.

Rebecca: And B, what a testament to Sam Newman. That's pretty great when your work is referred to as the Bible by someone. As far as in terms of a tome, a treasure trove of perhaps learnings or parables or teachings. I ... And wow, what a list of other folks, especially AWS power ... Actually, not AWS powerhouses. Powerhouses who happened to work at AWS. And I think have paved the way for a ton of ways of thinking and even communicating. Right? So I think Jeff Barr, as far as setting the bar, raising the bar if you will. For how to teach others and not be so high-level, or high-level enough where you can follow along with him, right? Not so high-level where it feels like you can't achieve what he's showing other people how to do.

Jeremy: Right. And I just want to comment on the Jeff Barr thing. Yeah.

Rebecca: Of course.

Jeremy: Because again, I actually ... That's my point. That's one of the reasons why I love what he does and he's so perfect for that position because he's relatable and he presents things in a way that isn't like, "Oh, well, yeah, of course, this is how you do this." I mean, it's not that way. It's always presented in a way to make it accessible. And even for services that I'm not interested in, that I know that I probably will never use, I generally will read Jeff's post because I feel it gives me a good overview, right?

Rebecca: Right.

Jeremy: It just gives me a good overview to understand whether or not that service is even worth looking at. And that's certainly something I don't get from reading the documentation.

Rebecca: Right. He's inviting you to come with him and understanding this, which is so neat. So I think ... I bet we should ... I know that we can find all these twitter handles for these folks and put them in the show notes. And I'm especially ... I'm just going to say here that Werner Vogels's twitter handle is @Werner. So maybe for your hundredth, all the listeners, everyone listening to this, we can say, "Hey, @Werner, I heard that you're the number one guest that Jeremy Daly would like to interview." And I think if we get enough folks saying that to @Werner ... Did I say that @Werner, just @Werner?

Jeremy: I think you did.

Rebecca: Anyone if you can hear it.

Jeremy: Now listen, he did retweet my serverless musical that I did. So ...

Rebecca: That's right.

Jeremy: I'm sort of on his radar maybe.

Rebecca: Yeah. And honestly, he loves serverless, especially with the number of customers and the types of customers and ... that are doing incredible things with it. So I think we've got a chance, Jeremy. I really do. That's what I'm trying to say.

Jeremy: That's good to know. You're welcome anytime. He's welcome anytime.

Rebecca: Do we say that @Werner, you are welcome anytime. Right. So let's go back to the genesis, not necessarily the genesis of the concept, right? But the genesis of the technology that spurred all of these other technologies, which is AWS Lambda. And so what ... I don't think we'd be having these conversations, right, if AWS Lambda was not released in late 2014, and then went GA, I believe, in 2015.

Jeremy: Right.

Rebecca: And so subsequently the serverless paradigm was thrust into the spotlight. And that seems like eons ago, but also three minutes ago.

Jeremy: Right.

Rebecca: And so I'm wondering ... Let's talk about its evolution a bit and a bit of how if you've been following it for this long and building it for this long, you've covered topics from serverless CI/CD pipelines, observability. We already talked about how it's impacted voice technologies or how it's made it easy. You can build voice technology without having to care about what that technology is running on.

Jeremy: Right.

Rebecca: You've even talked about things like the future and climate change and how it relates to serverless. So some of those sort of related conversations that you were just talking about wanting to have or having had with previous guests. So as a host who thinks about these topics every day, I'm wondering if there's a topic that serverless hasn't touched yet or one that you hope it will soon. Those types of themes, those threads that you want to pull in the next 100 episodes.

Jeremy: That's another tough question. Wow. You got good questions.

Rebecca: That's what I said. Heavy hitters. I told you I'd be bringing it.

Jeremy: All right. Well, I appreciate that. So that's actually a really good question. I think the evolution of serverless has seen its ups and downs. I think one of the nice things is you look at something like serverless that was so constrained when it first started. And it still has constraints, which are good. But it ... Those constraints get lifted. We just talked about Adrian's talks about how it's like, "Well, I can't do this, or I can't do that." And then like, "Okay, we'll add some feature that you can do that and you can do that." And I think that for the most part, and I won't call it anything specific, but I think for the most part that the evolution of serverless and the evolution of Lambda and what it can do has been thoughtful. And by that I mean that it was sort of like, how do we evolve this into a way that doesn't create too much complexity and still sort of holds true to the serverless ethos of sort of being fairly easy or just writing code.

And then, but still evolve it to open up these other use cases and edge cases. And I think that for the most part, that it has held true to that, that it has been mostly, I guess, a smooth ride. There are several examples though, where it didn't. And I said I wasn't going to call anything out, but I'm going to call this out. I think RDS proxy wasn't great. I think it works really well, but I don't think that's the solution to the problem. And it's a band-aid. And it works really well, and congrats to the engineers who did it. I think there's a story about how two different teams were trying to build it at the same time actually. But either way, I look at that and I say, "That's a good solution to the problem, but it's not the solution to the problem."

And so I think serverless has stumbled in a number of ways to do that. I also feel EFS integration is super helpful, but I'm not sure that's the ultimate goal to share ... The best way to share state. But regardless, there are a whole bunch of things that we still need to do with serverless. And a whole bunch of things that we still need to add and we need to build, and we need to figure out better ways to do maybe. But I think in terms of something that doesn't get talked about a lot, is the developer experience of serverless. And that is, again I'm not trying to pitch anything here. But that's literally what I'm trying to work on right now in my current role, is just that that developer experience of serverless, even though there was this thoughtful approach to adding things, to try to check those things off the list, to say that it can't do this, so we're going to make it be able to do that by adding X, Y, and Z.

As amazing as that has been, that has added layers and layers of complexity. And I'll go back way, way back to 1997 in my dorm room. CGI-bins, if people are not familiar with those, essentially just running on a Linux server, it was a way that it would essentially run a Perl script or other types of scripts. And it was essentially like you're running PHP or you're running Node, or you're running Ruby or whatever it was. So it would run a programming language for you, run a script and then serve that information back. And of course, you had to actually know ins and outs, inputs and outputs. It was more complex than it is now.

But anyways, the point is that back then though, once you had the script written. All you had to do is ... There's a thing called FTP, which I'm sure some people don't even know what that is anymore. File transfer protocol, where you would basically say, take this file from my local machine and put it on this server, which is a remote machine. And you would do that. And the second you did that, magically it was updated and you had this thing happening. And I remember there were a lot of jokes way back in the early, probably 2017, 2018, that serverless was like the new CGI-bin or something like that. But more as a criticism of it, right? Or it's just CGI-bins reborn, whatever. And I actually liked that comparison. I felt, you know what? I remember the days where I just wrote code and I just put it to some other server where somebody was dealing with it, and I didn't even have to think about that stuff.
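A CGI-bin script like the ones described above can be sketched in a few lines. This is a hedged, modernized illustration in Python rather than the original Perl: the web server passes request data in environment variables like QUERY_STRING and relays whatever the script prints back to the browser.

```python
#!/usr/bin/env python3
# Minimal CGI-style script. The server sets environment variables
# (here, QUERY_STRING) and the script's stdout becomes the HTTP
# response: headers, a blank line, then the body.
import os
from urllib.parse import parse_qs

def respond(environ):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]
    # Headers first, then a blank line, then the body.
    return "Content-Type: text/plain\r\n\r\nHello, {}!".format(name)

if __name__ == "__main__":
    print(respond(os.environ))
```

Deploying it really was as simple as the FTP workflow described above: copy the file to the server's cgi-bin directory and the change was live immediately.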

We're a long way from that now. But that's how serverless felt to me, one of the first times that I started interacting with it. And I felt there was something there, that was something special about it. And I also felt the constraints of serverless, especially the idea of not having state. People rely on things because they're there. But when you don't have something and you're forced to think differently and to make a change or find a way to work around it. Sometimes workarounds, turn into best practices. And that's one of the things that I saw with serverless. Where people were figuring out pretty quickly, how to build applications without state. And then I think the problem is that you had a lot of people who came along, who were maybe big customers of AWS. I don't know.

I'm not going to say that you might be influenced by large customers. I know lots of places are. That said, "We need this." And maybe your ... The will gets bent, right. Because you just... you can only fight gravity for so long. And so those are the kinds of things where I feel some of the stuff has been patchwork and those patchwork things haven't ruined serverless. It's still amazing. It's still awesome what you can do, and of course, we're still really just focusing on FaaS here, with everything else that's built, with all the APIs and so forth and everything else that's serverless in the full-service ecosystem. There's still a lot of amazing things there. But I do feel we've become so complex with building serverless applications, that you can't ... the Hello World is super easy, but if you're trying to build an actual application, it's a whole new mindset.

You've got to learn a whole bunch of new things. And not only that, but you have to learn the cloud. You have to learn all the details of the cloud, right? You need to know all these different things. You need to know CloudFormation or the Serverless Framework or SAM or something like that, in order to get the stuff into the cloud. You need to understand the infrastructure that you're working with. You may not need to manage it, but you still have to understand it. You need to know what its limitations are. You need to know how it connects. You need to know what the failover states are like.

There's so many things that you need to know. And to me, that's a burden. And that's adding new types of undifferentiated heavy-lifting that shouldn't be there. And that's the conversation that I would like to have continuing to move forward is, how do you go back to a developer experience where you're saying you're taking away all this stuff. And again, to call out Werner again, he constantly says serverless is about writing code, but ask anybody who builds serverless applications. You're doing a lot more than writing code right now. And I would love to see us bring the conversation back to how do we get back there?

Rebecca: Yeah. I think it kind of goes back to ... You and I have talked about this notion of an ode to simplicity. And it's sort of what you want to write into your ode, right? If we're going to have an ode to simplicity, how do we make sure that we keep the simplicity inside of the ode?

Jeremy: Right.

Rebecca: So I've got ... I don't know if you've seen these.

Jeremy: I don't know.

Rebecca: But before I get to some wrap-up questions more from the brainwaves of Jeremy Daly, I don't want to forget to call out some long-time listener questions. And they wrote in via Twitter and they wanted to perhaps pick your brain on a few things.

Jeremy: Okay.

Rebecca: So I don't know if you're ready for this.

Jeremy: A-M-A. A-M-A.

Rebecca: I don't know if you've seen these. Yeah, these are going to put you in the ...

Jeremy: A-M-A-M. Wait, A-M-A-A? Asked me almost anything? No, go ahead. Ask me anything.

Rebecca: A-M-A-A. A-M-J. No. Anyway, we got it. Ask Jeremy almost anything.

Jeremy: There you go.

Rebecca: So there's just three to tackle for today's episode that I'm going to lob at you. One is from Ken Collins. "What will it take to get you back to a relational database with Lambda?"

Jeremy: Ooh, I'm going to tell you right now. And without a doubt, Aurora Serverless v2. I played around with that right after re:Invent 2000. What was it? 20. Yeah. Just came out, right? I'm trying to remember what year it is at this point.

Rebecca: Yes. Indeed.

Jeremy: When that just ... Right when that came out. And I had spent a lot of time with Aurora Serverless v1, I guess if you want to call it that. I spent a lot of time with it. I used it on a couple of different projects. I had a lot of really good success with it. I had the same pains as everybody else did when it came to scaling and just the slowness of the scaling and then ... And some of the step-downs and some of those things. There were certainly problems with it. But v2 just the early, early preview version of v2 was ... It was just a marvel of engineering. And the way that it worked was just ... It was absolutely fascinating.

And I know it's getting ready or it's getting close, I think, to being GA. And when that becomes GA, I think I will have a new outlook on whether or not I can fit RDS into my applications. I will say though. Okay. I will say, I don't think that transactional applications should be using relational databases though. One of the things that was sort of a nice thing about moving to serverless, speaking of constraints. Was this idea that MySQL or Postgres or whatever, really didn't have the scale or without, again, engineering a whole cluster and failover and sharding and all kinds of crazy things like that to make sure that you had the scale. Relational databases were just not the best choice when you were building things with serverless.

And so when I quickly realized that, I tried to find a solution. So I built something called Serverless MySQL, which sort of is a ... And again, I don't want the RDS proxy people to think that because RDS proxy sort of was trying to solve the same problem that Serverless MySQL was, that I have any problem with that. I'm actually glad. The fact that I had to use my Serverless MySQL was only because there wasn't a better solution for it. But I built that because I wanted to continue to use it. And even though I built that and it worked, there was just so many limitations. And it was one of those things where using NoSQL or no SQL just made so much more sense. And I forced myself into thinking that way because of the constraint. And that was huge, that changed my mind on how NoSQL works.

And I absolutely have to call out Rick Houlihan as well as Alex DeBrie. But Rick Houlihan, speaking about another sort of person who influenced so many of my thought processes and changed my mind so dramatically. When I saw Rick's 2018 talk at re:Invent about single table design. And I know that they're calling it something different now, but essentially single table design. This idea of ... I went back and watched 2017. And that was like, Okay, now my mind's moving around. And then watching 2017, watching 2018, then watching 2019. 2019, I was in his session watching it. And I could see his evolution of thinking. Of how he changed the way that he was approaching different problems and the patterns that he was using. And that clicked so much for me, that now I think about ... I feel like, this is going to sound strange, but if you've seen the movie The Matrix. You've probably seen the movie The Matrix.

Rebecca: Oh, yeah.

Jeremy: Oh, of course. Okay. When one of the guys is watching the green characters scroll down the screen and he's like, "I don't even see the code anymore. I just see there's a blonde, there's whatever." That's how I feel looking at DynamoDB now. It just makes sense to me. It clicked. And I wish everybody could feel that way. I don't feel superior or anything like that, but it just works. It just works in my brain now. And that's one of the reasons why I built DynamoDB Toolbox too, was I was trying to say, "How can I translate how I see this into a way that might be more relatable to other people and hopefully get them to sort of click." And now actually at Serverless Inc., one of the things with Serverless Cloud, we're building something called Serverless Data, which is a similar key-value store type thing.

Which again is me ... Is my manifestation of how I envision sort of the interface into these things and hopefully will make sense to people. So, but yeah. So to answer Ken's question: Serverless Aurora, or Aurora Serverless v2, will definitely get me using it for anything that's got to be analytical processing. And also probably using it as I sort of do now, as sort of a separate secondary data store so that I can run queries on data, even though I want it to be more highly available on the front end through something like DynamoDB.
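The single-table design idea discussed above can be sketched with an in-memory stand-in for DynamoDB: different entity types live in one table, related items share a partition key, and a sort-key prefix pulls back just one type. The pk/sk key names and the begins_with-style filter are common conventions rather than anything mandated, and the customer/order data is purely illustrative; real code would use boto3's Query API against an actual table.

```python
# Toy single-table design: a customer profile and that customer's
# orders share one partition key, so "get the customer with all their
# orders" is a single query instead of a relational join.
TABLE = [
    {"pk": "CUST#jane", "sk": "PROFILE",    "name": "Jane"},
    {"pk": "CUST#jane", "sk": "ORDER#1001", "total": 42},
    {"pk": "CUST#jane", "sk": "ORDER#1002", "total": 7},
]

def query(pk, sk_prefix=""):
    # Rough equivalent of DynamoDB's
    # Query(pk = :pk AND begins_with(sk, :prefix)).
    return [item for item in TABLE
            if item["pk"] == pk and item["sk"].startswith(sk_prefix)]
```

Querying `"CUST#jane"` with no prefix returns the profile plus both orders in one pass, while the `"ORDER#"` prefix narrows it to orders only. That access-pattern-first shape is the mental shift the single-table talks describe.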

Rebecca: Oh my gosh. That moment where it clicks, I just have this mental image of two brain synapses extending their hands toward each other and finally touching. Yeah. Finally touching index fingers, being like, "Hey, we did it." All right. So from Matt Coulter.

Jeremy: Oh.

Rebecca: Comes another one.

Jeremy: Love Matt too.

Rebecca: Right? There you got some fans and some friends. Some friends that you're fans of, or that are fans of you.

Jeremy: Speaking of Northern Ireland. Nor, Nor, Nor, Norn Ire ... what do they say? Norn Iron? Norn Iron.

Rebecca: I would... It would be trouble if I tried to pronounce it the way they say it. "So with the terms serverless-first," or Matt asks, "With the term serverless-first and cloud-native causing confusion as to what we know it means (code as a liability), do you have a less confusing name we could use?"

Jeremy: Ooh. So it's funny. I had this conversation with Ajay Nair over breakfast one time where, I think we were at the serverless breakfast. Where were we? It must've been re:Invent, I guess. But we were having this conversation and he asked me a very similar question. And I said, "Look, it's as simple as this. Serverless is just the way." Right. So when somebody says, "So what's the term for serverless-first or for cloud-native or whatever," when you're building an application, it's just the way. That, it blows my mind. Now look, containers are great, by the way. I love containers. I know I don't talk about them very much. I'm not a fan of Kubernetes because I think it's overly complex for what it needs to do, but it works really well too. So I mean that if you know how to manage it, which not very many people do, that's why all the cloud providers are doing it now, using containers can be a very, very good way and a smart way to build your applications. You get portability around them. There're all kinds of reasons to do it. I love the fact that ... My skin was crawling a little bit when they announced that Lambda would support containers. Then when I realized it was just a packaging format, I was like, "Okay, that's much better. That I can deal with." But I think that's one of the things where it introduces a level of comfortability.

And if you put your, or if I should say, if I put my product manager hat on and I'm looking at that, I need to find familiarity, right? When I'm building a product, I need something that's familiar to people and not necessarily revolutionary, maybe evolutionary. I think serverless is revolutionary, which is part of the reason why it's not being adopted as quickly as it could be, because it's not quite as evolutionary as something like containers were. Containers are like the next step in VMs. Or it's this idea where you can now split things up into little, smaller chunks, and smaller chunks, and so forth, and then came orchestration and you had all the problems around that.

So I think that no matter how you're building your applications, whether you're building them in containers as a packaging format, as a runtime, whatever, how you serve those containers up, whether those are again, serverless, running in Lambda and Firecracker, or you're running them on IBM, or you're running them on Google Cloud ... Google Cloud Run, I think is a fascinating technology. No matter how you're doing it, I think the key here is to focus on the fact that you should be building your applications in a way that is going to be able to run on one of these modern types of infrastructures.

So that's the only thing I would say: if you're trying to write code to run directly on an EC2 instance, 1997 called, they want their ... I don't know, their Pearl Jam shirt back. I don't know. Anyways. So ...

Rebecca: They want their HotDog shirt back.

Jeremy: They want their HotDog shirt back, right. I would say they want their Green Day shirt back, but I love Green Day. I was just listening to them the other day. Anyways. So again, I don't think I have a better term for you, Matt, unfortunately, other than just to say, I don't think serverless is a very good term. I've never been a fan of the term. I've tried to defend it. And then I feel like ... what's the saying? If you argue with an idiot in public, people can't tell the difference ... I don't know what the saying is. I was really ...

Rebecca: Who is the idiot?

Jeremy: Who's the idiot, right? Exactly. So it's one of those things. The problem is that the term is not great. And things like serverless-first ... I liked that idea. I mean, I love the sentiment of it, right? This idea of saying, we're going to try to build everything serverless first, and then we're going to fall back to containers for the things we can't. And then maybe for the things we can't do there, we're going to fall back to VMs. And then, in the worst-case scenario, you've got to build stuff on bare metal.

But, so, I love that sentiment. And I think that sentiment is always going to be the way, but I think there's just this idea that this is just modern app development. This is just how you build apps now. And if you're building apps starting from the bottom up, unless you have a really compelling reason to do that, you are really handicapping yourself. And you're going to struggle, because this is just the way the world's moving. So again, I just say: it's the way. It's just the way to build applications now.

Rebecca: Yeah. And as someone who had a large part in writing so many of those serverless messages, those serverless narratives, Lambda packaged as containers, I'm sorry for making your skin crawl, and to any listeners whose skin I made crawl, I am @beccaodelay if you want to hit me up on Twitter about that, and I can apologize to you publicly. So lastly, from Rob Sutter, and maybe my personal favorite, because I can hear Rob's brand of humor in this question, since I have the great honor of knowing him personally as well. He asks, "Do you ever go back and stand up instances just to remind yourself of everything you don't have to deal with anymore?"

Jeremy: Well, I just had Rob on episode 99, talking about FaunaDB, which is also a fascinating thing. So super exciting stuff that they're doing over there. So every once in a while ... and I'm going to admit to this, and hopefully people won't flood my website to crash it, but I have been so busy doing these other things that my personal website, jeremydaly.com, is still a load-balanced WordPress site running on EC2 with an ELB in front of it, or maybe an ALB, whatever it is. And that is still out there running on multiple instances. And I have backup instances. I got hit by a Hacker News article one time that shut my site down, so I had to spin up some additional instances, but it is still running there. And I am so afraid of going to touch that thing, because I just haven't done it in so long, that I actually avoid it, and I haven't done it yet.

So Off-by-none and the Serverless Chats sites, those are all static Jamstack sites. But yeah ... so I don't do it often, but every once in a while I do have to type in SSH and get into an old-school server. At the last startup that I was at, we had actually inherited a Symfony app that was running on PHP.

So I did have to actually manage a few clusters of servers, and have all the load balancing and the scaling groups and auto-scaling and things like that. So I am not unfamiliar with all of that stuff, but other than for my own personal amusement, or I guess my own personal suffering, I would not do that on purpose anymore. And that would be my advice to anybody: maybe you want to learn that stuff, and you probably do if you want to get, you know, some of your AWS certs, but from a practical standpoint, I would not do that myself unless the absolute need arose.

Rebecca: Yeah. I want to say I'm like, that sounds like fun dot, dot, dot. All right. So those are the listener questions. And I, I want to get back to a few of my own because this is me interviewing you. Right. But it's a time to reflect. Right. And so I just, I'm curious in the spirit of reflection in this 100th episode, if you were just starting Serverless Chats today, or just starting your newsletter or your blog or your conference talks, what advice would you offer to your own self?

Jeremy: Oh, well, one thing I absolutely have to call out, because I certainly don't want people to think that these things just magically happen: I have a team that helps me with these now. When I first started, it was just me. I was writing the newsletter, I was copy editing, I was doing all of that stuff myself. But now I have two absolutely amazing people that help me out. Angela Milinazzo basically does a lot of the social media stuff. She runs the newsletter for Serverless Chats Insiders, and she does all the guest outreach and the coordination with the guests and so forth.

She does a lot of the marketing stuff, and just so many things get taken off my plate so that I can focus more on, again, having the conversations and, hopefully, on the more important stuff. The undifferentiated heavy lifting, I guess, is what I don't have to do anymore, thanks to some of the things that Angela does for me, which is amazing.

So thank you. And I've been working with Angela for, I want to say, maybe eight or nine years now, something like that. We worked at the same startup together, and then we went our separate ways when we left that startup, but then we came back together to do this. So that was super exciting. And then also Melissa DeLorenzo, who is a longtime friend of mine. She actually worked at my web development company with me, and she does copy editing for me, helps with the research for the Serverless Stars, and copy edits the newsletter. And she does all these other things; she does my accounting for all the stuff that we do. So I would not be able to do this without having a really amazing team behind me.

And I think that is one of the biggest pieces of advice I would give anybody: teamwork. I mean, I've been doing this ... if we want to consider 1997 the start date of my career around this stuff, that's 24 years or so that I've been doing this. And I think any person is the sum of their experiences. Some things you remember, some things you forget, but whatever you experience is going to shape who you are. It's going to shape the way that you think. But there are just experiences that I will never have.

There are backgrounds or situations that I will never experience. There are thoughts that I will never have, because of who I am, because of where I grew up, because of where I went to school, because of the experiences that I've had or whatever. And without having diverse teammates to help you see things from a different angle, and I don't mean just to help you work and accomplish things, but without having people around you who differ, whether in the slightest way or in a massive way, without that different level of perspective, you can't grow.

And so, I mean, that is one of the greatest things ... I think the greatest thing for me has been getting the chance not only to interview all these people for Serverless Chats, not only to read all of the posts that I've shared on Off-by-none and all of the articles there, but to get to go to the conferences, to give these conference talks, to have people question things that I've written in my blog posts. I should call out something really important here, and thank him for it. I wrote a blog post about serverless security; not my main post, but another one that was about a SQL injection thing. And I used some language in there that was sort of general and sort of speculative, right? And Chris Munns actually called me out on that, and he actually kind of had to do a little tweetstorm.

I think he got some word from up above that he had to call it out and make sure that he corrected the record on these things. And while I wasn't being critical, I think I used some generalities that weren't accurate, or that maybe were misplaced in a way. And I don't think I did anything wrong, other than the fact that I was too general and I wasn't clear enough. And when I got that feedback from him, I wasn't offended by it. Right? And that's another important thing: you've got to learn to take criticism. Absolutely, whether it's constructive or not, you need to learn to take it. But I remember getting that criticism from him and thinking to myself, no, he's absolutely right.

Like, I can see how people could misinterpret what I said. And again, once you get a voice and once people start reading what you're writing, you have a responsibility, and you absolutely have to make sure you get it right. And that's one of the things that I did. I mean, from that point forward, every post that I've ever written is highly researched. Right? And if I don't know something, I usually will say, I'm not a hundred percent sure about this; I could be wrong about this, or whatever. So I will try to call those things out if I'm not a hundred percent sure. But you can make a lot of assumptions when you're writing blog posts, because sometimes it's just easier to fill in a sentence here and there by adding some bit of flair to it, or whatever, that will make an assumption.

And while it might seem right at the time, those words can have an effect on somebody. And that's why I spend so much time now, when I do write blog posts, trying to heavily research them and make sure that the things I say are accurate, and that, again, I'm not using terms like "simple" and "easy" and things like that. 'Cause that's another thing: what is simple to me, or what seems easy to me, or what seems easy to you, could be wildly different from somebody else's perception. And that's another thing, I guess, if we're moving on with the advice here: know your audience. Again, going back to Jeff Barr, he just has this way of explaining things that brings it down to a level that doesn't make you feel dumb, right?

That your eyes don't gloss over because you have no idea what he's talking about, but at the same time, it doesn't feel like he's talking down to you either. And that's a hard thing to learn. I don't read a lot of blog posts that manage to do that. So if you can find that level of humility where you say, I may know something, and I may know it really, really well; how can I communicate that to people in a way that doesn't talk down to them, but at the same time is accessible? 'Cause accessibility is a huge thing as well.

Yeah. What else? I mean, going back again, I think the diversity thing ... I don't want to harp on it too much, but that is one of those things where the person I was in 2018, when I first started writing these blog posts, versus the person that I am now, and the way that I think about things, politically as well as intellectually, the way I think about technology, all these different things ...

I am a different person, and I think differently now. I have evolved my thinking dramatically, because I was able, just for a moment, just for a few minutes, for 45 minutes or 30 minutes in a conversation, or for five minutes in the hallway at a conference, to feel or be empathetic to somebody else's predicament or their perspective, and to get a tiny taste of that background, that insight, and so forth. And that has dramatically changed who I am. And I mean, I'm pretty happy with where I am now. I still think I've got a lot of work to do on myself in terms of continuing to open my mind, but I've met more people than I can ...

I don't remember all their names, unfortunately, but I have met so many people. And that is just one of those things where, I guess ... so my advice here is: whatever you do, whether you're speaking at conferences, or writing blog posts, or doing open source projects, or doing a podcast or whatever, open your channels up for feedback and talk to people. And if somebody has a problem with something you say ... again, there are trolls, and then there are legitimate people who have concerns, or who want to talk to you about something. And if you can take that criticism, and you can open yourself up to that stuff, then I think you just make yourself a better person. And then the other thing, I guess, I'll just say is: it's fine to make assumptions about products, and fine to make assumptions about building things and software and whatever; never make assumptions about people.

The thing is, you might read something somebody wrote, and it might be wildly inaccurate. It might be way off base from your own personal thinking. And the biggest advice that I give to a lot of people, especially younger people, is: over the course of your career, you will hear a lot of good ideas and you'll hear a lot of bad ideas, and you will not know the difference. It takes a long time to start understanding.

Most times when you hear something, you don't have the context you need to understand whether it's good or bad. Sometimes you do, but most times you don't. So just take everything with a grain of salt, and know that the best thing you can do is continue to learn, keep an open mind, and get as much context as you can. And hopefully that turns you into a better person and a better member of the community.

Rebecca: So I asked you for some serverless advice, but I think that's just really some great life advice. Most of those can be applied to a broader ...

Jeremy: Did I go off there?

Rebecca: No, I mean, thank you. It's more than serverless, right? It's the power of building a great team, the power of being able to receive critiques in a way that makes you want to improve your work and help others share theirs, the way that you want to remain open to understanding where your own blind spots are. Yeah. I'm a little floored, but yeah.

Thank you for sharing that with me. A huge thank you, of course, to Angela and Melissa and your team, a huge thank you to Rob and Matt and Ken, who submitted the questions that we asked on the air, and then a huge thank you to all of your listeners and to the serverless community. And so on this day, this very special day of your hundredth Serverless Chats episode: what do you hope stays with them? With your team, with your listeners, with the serverless community? What do you hope stays in their mind's ear as they drift off to sleep this evening, after hearing you as your own guest on Serverless Chats?

Jeremy: Oh man. Well, I mean, again, I think it's just one of those things where I'm wired in a way where I do what I do, I think, because I like to help people. And that's why I say don't make assumptions about people, because you don't know. You have no idea what that person has gone through, or what they're going through, or what's behind a smile or a frown, or whatever. You just don't know. And human potential is amazing. And if you limit the possibility of human potential because you judge too soon, then I think you do everyone that disservice, including yourself, and including, I guess, the evolution of mankind, which is something that I'm passionate about.

So I guess the advice that I would leave is just look, don't make assumptions, be good to people, right? Like just, we're all in this together. So sort of like, I don't know, just go forth, be good, treat people, treat people well and build with serverless because that is definitely the future.

Rebecca: All right. I promise you, as I drift off to sleep tonight, I will definitely think: be good to people, treat others well, build with serverless. Yeah. I love it. It's three easy sentences, good to remember. Jeremy, thank you so much for being our guest, but really your own guest, on Serverless Chats. And thank you for the honor of letting me be the asker. I can't wait to see you at episode 101. And I can't wait to see ... did I hear Werner is going to be episode 101?

Jeremy: I don't know about 101, but I mean, there's plenty of episodes between now and 200, so.

Rebecca: Well, I can't wait. Is there anything else you want to leave us with?

Jeremy: No, I just want to thank you, Rebecca. I mean, you have been ... I think, what's the right word for it? You've been instrumental in me being able to do this, right? I mean, one of the things I had said earlier was, validation is extremely important. It's just what humans crave: validation. And it's not so much the notoriety or the popularity or any of those things that matter to me. What matters to me is that I help people. And when you put a lot of time and energy into something to try to help somebody, or you think you're helping somebody, and it just doesn't get amplified, it can be really frustrating. And again, like I said, there are so many ideas out there.

There are so many ideas out there that you do not know, that I do not know, that I've never heard before; there are perspectives that I've never heard before. And if you can't get to those, because, you know, that person has five followers on Twitter, that's a shame. That's a horrible shame. And so I guess what I want to say is that you and the Heroes Program at AWS, and the help that you gave me, just with dealing with things like trying to do sponsorships and these other things; I mean, you helped coordinate some guests for me and reached out to some people like that.

That's the kind of help, that's the type of thing, where that feedback from you, that validation from you, the validation from AWS, that recognition that what I was doing was helping people and hopefully moving the needle; those are the kinds of things that people need. That was something that I needed. And I don't think I'd be able to be doing what I'm doing now if I didn't have that encouragement and that support from you and other members of the community.

Rebecca: Well, I have no doubt that you would have achieved it regardless, but I am happy that I got to be a person who helped you do so. And I think we share that ethos and enthusiasm around the power of bringing community members together and being able to share their ideas and perspectives excitedly. We're on the same plane here. I love it. Yeah.

Jeremy: Awesome. Awesome. Well, thank you, Rebecca. I appreciate it.

Rebecca: Yeah. Can't wait to see episode 100, happy episode 100; or, you know, hear it, I should say, 'cause I've seen it this whole time. And I can't wait for the next 100 episodes. Really looking forward to it.

Jeremy: Awesome.

2021-05-10

Episode #99: You already have a Multi-Cloud Strategy with Rob Sutter

About Rob Sutter

Rob Sutter, a Principal Developer Advocate at Fauna, has woven application development into his entire career, from time in the U.S. Army and U.S. Government to stints with the Big Four and Amazon Web Services. He has started his own company twice: once providing consulting services, and most recently with WorkFone, a software-as-a-service startup that provided virtual digital identities to government clients. Rob loves to build in public with cloud architectures, Node.js or Go, and all things serverless!

Twitter: @rts_rob
Personal email: [email protected]
Personal website: robsutter.com
Fauna Homepage
Learn more about Fauna
Supported Languages and Frameworks
Try Fauna for Free
The Calvin Paper

This episode sponsored by CBT Nuggets: https://www.cbtnuggets.com/

Watch this video on YouTube: https://youtu.be/CUx1KMJCbvk

Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Rob Sutter. Hey, Rob. Thanks for joining me.

Rob: Hey, Jeremy. Thanks for having me.

Jeremy: So you are now the or a Principal Developer Advocate at Fauna. So I'd love it if you could tell the listeners a little bit about your background and what Fauna is all about.

Rob: Right. So as you've said, I'm a DA at Fauna. I've been a serverless user in production since 2017, and I started the Serverless User Group in Dubai. So that's how I got into serverless in general. Previously, I was a DA on the serverless team at AWS, and I've been a SaaS startup co-founder, a government employee, an IT auditor, and, like a lot of serverless people I find, I have a lot of Ops in my background, which is why I don't want to do it anymore. There's a lot of us that end up here that way, I think. Fauna is the data API for modern applications. So it's a database that you access as an API, just as you would access Stripe for payments or Twilio for messaging. You just put your data into Fauna and access it that way. It's flexible and serverless. It's transactional, so it's a distributed database with ACID transactions. And again, it's as simple as accessing any other API that you're already accessing as a developer, so that you can simplify your code and ship faster.

Jeremy: Awesome. All right. Well, so I want to talk more about Fauna, but I want to talk about it actually in a broader ... I think in the broader ecosystem of what's happening in the cloud right now, and we hear this term "multicloud" all the time. By the way, I'm super excited to have you on. I wanted to have you on for the longest time, and then just schedules, and it's like ...

Rob: Yeah.

Jeremy: You know how it is, but anyways.

Rob: Thank you.

Jeremy: No. But seriously, I'm super excited because your tweets, and everything that you've written, and the things that you were doing at AWS and things like that I think just all reinforced this idea that we are living in this multicloud world, right, and that when people think of multicloud ... and this is something I try to be very clear on. Multicloud is not cloud-agnostic, right?

Rob: Right.

Jeremy: It's a very different thing, right? We're not talking about running the same work load in parallel on multiple service providers or whatever.

Rob: Right.

Jeremy: We're talking about this idea of using the best services that are available to you across the spectrum of providers, whether those are cloud service providers, whether those are SaaS companies, or even to some degree, some open-source projects that are out there that make up this strategy. So let's start there right from the beginning. Just give me your thoughts on this idea of what multicloud is.

Rob: Right. Well, it's sort of a dirty word today, and people like to rail against it. I think rightly so because it's that multicloud 1.0, the idea of, as you said, cloud-agnostic that "write once, run everywhere." All that is, is a race to the bottom, right? It's the lowest common denominator. It's, "What do I have available on every cloud service provider? And then let me write for that as a risk management strategy." That's a cost center when you want to put it in business terms.

Jeremy: Right.

Rob: Right? You're not generating any value there. You're managing risk by investing against that. In contrast, what you and I are talking about today is this idea of, "Let me use best in class everywhere," and that's a value generation strategy. This cloud service provider offers something that this team understands, and wants to build with, and creates value for the customer more quickly. So they're going to write on that cloud service provider. This team over here has different needs, different customers. Let them write over there. Quite frankly, a lot of this is already happening today at medium businesses and enterprises. It's just not called multicloud, right?

Jeremy: Right.

Rob: So it's this bottom-up approach that individual teams are consuming according to their needs to create the greatest value for customers, and that's what I like to see, and that's what I like to promote.

Jeremy: Yeah, yeah. I love that idea of bottom-up because I think that is absolutely true, and I don't think you've seen this as aggressively as you have in the last probably five years as more SaaS companies have become or SaaS has become a household name. I mean, probably 10 years ago, I think Salesforce was around, and some of these other things were around, right?

Rob: Yeah.

Jeremy: But they just weren't ... They weren't the household names that they are now. Now, you watch any sport, any professional sport, and you see advertisements for all these SaaS companies now because that seems to be the modern economy. But the idea of the bottom-up approach, that is something where you basically give a developer or maybe you don't give them, but the developer takes the liberties, I would say, to maybe try and experiment with something new without having to do years of research, go through procurement, get approval to use some platform. Even companies trying to move to AWS, or on to Azure, or something like that, they have to go through hoops. Basically, jump through hoops in order to get them there. So this idea of the bottom-up approach, the developers are the ones who are experimenting. Very low-risk experiments, by the way, with some of these other services. That approach, that seems like the right marketing approach for companies that are building these services, right?

Rob: Yeah. It seems like a powerful approach for them. Maybe not necessarily the only one, but it is a good one. I mean, there's a historical lesson here as well, right? I want to come back to your point about the developers after, but I think of this as shadow cloud. Right? You saw this with the early days of SaaS where people would go out and sign up for accounts for their business and use them. They weren't necessarily regulated, but we saw even before that with shadow IT, right, where people were bringing their own software in?

Jeremy: Right.

Rob: So for enterprises that are afraid of this that are heavily risk-focused or control-focused top-down, I would say don't be so afraid because there's an entire set of lessons you can learn about this as you bring it, as you come forward to it. Then, with the developers, I think it's even more powerful than the way you put it because a lot of times, it's not an experiment. I mean, you've seen the same things on Twitter I've seen, the great tech turnover of 2021, right? That's normal for tech. There's such a turnover that a lot of times, people are coming in already having the skills that they know will enhance delivery and add customer value more quickly. So it's not even an experiment. They already have the evidence, and they're able to get their team skilled up and building quickly. If you hire someone who's coming from an AWS shop, you hire someone who's coming from an Azure shop on to two different teams, they're likely going to evolve that excellence or that capability independently, and I don't necessarily think there's a reason to stop that as long as you have the right controls around it.

Jeremy: Right. I mean, you mentioned controls, and I think that if I'm the CTO of some company or whatever, or I'm the CIO, because we're dealing in a super enterprise-y world, and I have developers that are starting to use tools ... Maybe not Stripe, but maybe a Twilio, or maybe they're using, I don't know, ChaosSearch or something; something where data from within my corporate walls is going out somewhere or being stored somewhere else. The security risk around that ... I mean, there's something there, right?

Rob: Yeah, there absolutely is. I think it's incumbent on the organizations to understand what's going on and adapt. I think it's also imcu,bent on the cloud service providers to understand those organizational concerns and evolve their product to address them, right?

Jeremy: Right.

Rob: This is one thing. My classic example of this is data exfiltration in a Lambda function. Some companies get ... I want to be able to inspect every packet that leaves, and they have that hard requirement for reasons, right?

Jeremy: Right.

Rob: You can't argue with them that they're right or wrong. They made that decision as a company. But then, they have to understand the impact of that is, "Okay. Well, every single Lambda function that you ever create is going to run inside of VPC or is going to run connected to a VPC." Now, you have the added complexity of managing a VPC, managing your firewall rules, your NACLs, your security groups. All of this stuff that ... Maybe you still have to do it. Maybe it really is a requirement. But if you examine your requirements from a business perspective and say, "Okay. There's another way we can address this with tightly-scoped IAM permissions that only allow me to read certain records or from certain tables, or access certain keys, or whatever." Then, we assume all that traffic goes out and that's okay. Then, you get to throw all of that complexity away and get back to delivering value more quickly. So they have to meet together, right? They have to meet.

Jeremy: Right.

Rob: This led to a lot of the work that AWS did with VPC networking for Lambda functions or removing the cold start because a lot ... Those enterprises weren't ready to let go of that requirement, and AWS can't tell them, "You're wrong." It's their business. It's their requirement. So AWS built that for them to reduce the cold start so that Lambda became a viable platform for them to build against.

Jeremy: Right, and so if you're a developer and you're looking at some solution because ... By the way, I mean, like you said, choosing the best of breed. There are just a lot of good services out there. There are thousands and thousands of SaaS companies, and I think ... I don't know if we made this point, but I certainly consider SaaS companies themselves to be part of the cloud. I would think you would probably agree with that, right?

Rob: Yeah.

Jeremy: It might as well be cloud providers themselves. Most of them run on top of the cloud providers anyways, but they found ...

Rob: But they don't have to, and that's interesting to me, and another truth: you could be consuming services from somebody else's data center, and that's still multicloud.

Jeremy: Right, right. So, anyway. So my thought here or I guess the question I have is if you're a developer and you're trying to choose something best in breed, right? Whatever that is. Let's say I'm trying to send text messages, and I just think Twilio is ... It's got all the features that I need. I want to go with Twilio. If you're a developer, what are the things that you need to look for in some of these companies that maybe don't have ... I mean, I would say Twilio does, but like don't necessarily have the trust or the years of experience or I guess years under their belts of providing these services, and keeping data secure, and things like that. What's the advice that you give to developers looking to choose something like that to be aware of?

Rob: To developers in particular I think is a different answer ...

Jeremy: Well, I mean, yeah. Well, answer it both ways then.

Rob: Yeah, because there's the builder and the buyer.

Jeremy: Right.

Rob: Right?

Jeremy: Right.

Rob: Whoever the buyer is, and a lot of times, that could just be the software development manager who's the buyer, and they still would approach it in different ways. I think the developer is first going to be concerned with, "Does it solve my problem?" Right? "Overall, does it allow me to ship faster?" The next thing has to be stability. You have to expect that this company will be around, which means there is a certain level of evidence that you need of, "Okay. This company has been around and has served customers," and that's a bit of a chicken-and-egg problem.

Jeremy: Right.

Rob: I think the developer is going to be a lot less concerned with that and more concerned with the immediacy of the problem that they're facing. The buyer, whether that's a manager, or CIO, or anywhere in between, is going to need to be concerned with other things according to their size, right? You even get the weird multicloud corner cases of, "Well, we're a major supplier of Walmart, and this tool only runs on a certain cloud service provider that they don't want us to use. So we're not going to use it." Again, that's a business decision, like would I build my software that way? No, but I'm not subject to that constraint. So that means nothing in that equation.

Jeremy: Right. So you mentioned a little bit earlier this idea of bringing people in from different organizations like somebody comes in and they can pick up where somebody else left off. One of the things that I've noticed quite a bit in some of the companies that I've worked with is that they like to build custom tools. They build custom tools to solve a job, right? That's great. But as soon as Fred or Sarah leave, right, then all of a sudden, it's like, "Well, who takes over this project?" That's one of the things where I mentioned ... I said "experiments," and I said "low-risk." I think something that's probably more low-risk than building your own thing is choosing an API or a service that solves your problem because then, there's likely someone else who knows that API or that service that can come in, and can replace it, and then can have that seamless transition.

And as you said, with all the turnover that's been happening lately, it's probably a good thing if you have some backup. Even if you don't have that person, with a custom system built in-house, there may be no one who can support it. But if you have a system where you're interfacing with Twilio, or Stripe, or whatever it is, you can find a lot of developers who could come in, even as consultants, and continue to maintain and solve your problems for you.

Rob: Yeah, and it's not just those external providers. It's the internal tooling as well.

Jeremy: Right.

Rob: Right?

Jeremy: Right.

Rob: We're guilty of this in my company. We wrote a lot of stuff. Everybody is, right, like you like to do it?

Jeremy: Right.

Rob: It's a problem that you recognized. It feels good to solve it. It's a quick win, and it's almost always the wrong answer. But when you get into things like ... in a lot of cases, it doesn't matter which specific tool you use. 10 years ago, if you had chosen Puppet, or Chef, or Ansible, which one wouldn't be as important as the fact that you chose one of those, so that you could then go out and find someone who knew it. Today, of course, we've got Pulumi, Terraform, and all these other things that you could choose from, and it's just better than writing a bunch of Bash scripts to tie the stuff together. I believe Bash should more or less be banned in the cloud, but that's another ... That's my hot take for another time. Come at me on Twitter if you don't like that one.

Jeremy: So, yeah. So I think just bringing up this idea of tooling is important because the other thing that you potentially run into is the variety of tooling that's out there, and you mentioned the original IaC tools. I guess that's what we'd call them, right? Ansible and those sorts of things, right?

Rob: Right.

Jeremy: All of those things, the Chefs and the Puppets. Those were great because you could have repeatable deployments and things like that. I mean, there's still work to be done there, but that was great because you made the choice to not build something yourself, right?

Rob: Right.

Jeremy: Something that somebody else could jump in on. But now with Terraform and with ... You mentioned Pulumi. We've got CloudFormation, and even Microsoft has their own ... I think it's called ARM or something like that, that is infrastructure as code. We've got the Serverless Framework. We've got SAM. We've got Begin ... or Architect, right? You've got all of these choices, and I think what happens too is that if teams don't necessarily ... If they don't rally around a tool, then you end up with a bunch of people using a bunch of different tools. Maybe not all these tools are going to be compatible, but I've seen really interesting mixes of people using Terraform, and CloudFormation, and SAM, and the Serverless Framework, binding it all together, and I think that just becomes ... I think that becomes a huge mess.

Rob: It does, and I get back to my favorite quote about complexity, right? "Simplicity before complexity is worthless. Simplicity beyond complexity is priceless." Trying to force everything into one tool is, like, artificial, premature optimization, a fake simplicity.

Jeremy: Yeah.

Rob: If you force yourself into one tool all the time, then you're going to find it doing what it wasn't built to do. A good example of this, you talked about Terraform and the Serverless Framework. In my opinion, they're great together: your Terraform handles your persistent infrastructure, and your Serverless Framework comes in and consumes the outputs of those Terraform stacks, but then builds the constantly churning infrastructure pieces on top of it. Right? There's a blast-radius benefit there as well: you can't take down your database, or S3 bucket, or all of that with a bad deploy when all of it is managed in Terraform, either by your team, or by another team, or by another process. Right? So there's a certain irreducible complexity that we get to, but you don't want to have duplication of effort with multiple tools, right?
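One common way to wire the split Rob describes, assuming Terraform owns the persistent pieces and publishes their identifiers through SSM Parameter Store (all names and paths here are illustrative):

```hcl
# Terraform side: persistent infrastructure publishes its identifiers.
resource "aws_ssm_parameter" "orders_table_name" {
  name  = "/myapp/orders-table-name"
  type  = "String"
  value = aws_dynamodb_table.orders.name
}
```

```yaml
# serverless.yml side: the fast-churning application layer consumes
# the Terraform-owned value instead of managing the table itself.
provider:
  environment:
    ORDERS_TABLE: ${ssm:/myapp/orders-table-name}
```

A bad serverless deploy can then be rolled back without ever touching the table, which is the blast-radius separation being described.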

Jeremy: Right.

Rob: You don't want to use CloudFormation to manage some of your persistent data over here and Terraform to manage your persistent data over there, because then ... That's like that agnostic model where you're not benefiting from the excellent features of each. You're only using whatever is common between them.

Jeremy: Right, right, and I totally agree with you. I do like the idea of consuming. I mean, I have been using AWS for a very, very long time like 2007, 2008.

Rob: Yeah, same. Oh, yeah.

Jeremy: Right when EC2 instances were a thing. I guess 2008. But the biggest thing for me looking at using Terraform or something like that, I always felt like keeping it in, keeping it in the family. That's the wrong way to say it, but like using CloudFormation made a lot of sense, right, because I knew that CloudFormation ... or I thought I knew that CloudFormation would always support the services that needed to be built, and that was one of my big complaints about it. It was like you had this delay between ... They would release some service, and you had to either do it through the CLI or through the console. But then, CloudFormation support came months later. The problem that you have with some of that was then again other tools that were generating CloudFormation, like a Serverless Framework, that they would have to wait to get CloudFormation support before they could support it, and that would be another delay or they'd have to build something custom, which is not always the cleanest way to do it.

Rob: Right.

Jeremy: So anyways, I've always felt like the CloudFormation route was great if you could get to that CloudFormation, but things have happened with CDK. We didn't even mention CDK, but CDK, and Pulumi, and Terraform, and all of these other things, they've all provided these different ways to do things. But the thing that I always thought was funny was, and this is ... and maybe you have some insight into this if you can share it, but with SAM, for example, SAM wasn't extensible, right? You would just run into issues where you're like, "Oh, I can't do that with SAM." Whereas the Serverless Framework had this really great third-party plug-in system that allowed you to do some of these other things. Now, granted, not all third-party plug-ins were super stable or the best way to do something, right, because they'd either interact with APIs directly or whatever, but at least it unblocked you. Whereas I felt like SAM, and even CloudFormation when it didn't support something, would block you.

Rob: Yeah. Yeah, and those are just two different implementation philosophies from two different companies at two different stages of their existence, right? Like AWS ... and let's separate the reality from the theory here. The theory is that a large company can exert control over release cycles and limit what it delivers, but deliver it with a bar of excellence. A small company can open things up, and it depends on its community members for contributions to solve problems. It's very much the Cathedral and the Bazaar of cloud tooling, right?

AWS has that CloudFormation architecture that they're working around, with its own goals and approach, the one way to do it. The Serverless Framework is, "Look, you want to set up a stall here and insert IAM policies per function? Set up a stall. It will be great. Maybe people come and maybe they don't," and the system inherently sorts or bubbles up the value, right? So you see things like the Step Functions plug-in for the Serverless Framework. It was one of the early ones that became very popular very quickly, whereas Step Functions support in SAM trailed, but eventually came in. I think that team, by the way, deserves a lot of credit for really being focused on developers, but that's not the point of the difference between the two.

A small young company like Serverless Framework that is moving very quickly can't have that cathedral approach, and both are valid, right? They're both just different strategies, and good for the marketplace, quite frankly. I have my preferred approach, which is not about AWS or SAM vs. the Serverless Framework. It's that the extensibility of plug-in frameworks is, to me, a key component of tooling that adapts as quickly as the clouds change, and you see this. Terraform was the first place that I really learned about plug-ins, and their plug-in framework is fantastic, the way they do providers. The Serverless Framework as well is another good example, but you can't know how developers are going to build with your services. You just can't.
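The Bazaar-style extensibility Rob is describing boils down to a hook registry: the core tool defines lifecycle events, and plug-ins attach behavior the core never anticipated. This toy sketch is not the Serverless Framework's real plug-in API, just the shape of the idea:

```python
# Toy plug-in framework: the core engine exposes named lifecycle
# hooks, and plug-ins register handlers without modifying the core.
# Illustrative only; real tools (Serverless Framework plug-ins,
# Terraform providers) have much richer interfaces.

class Engine:
    def __init__(self):
        self.hooks = {"before:deploy": [], "after:deploy": []}

    def register(self, plugin):
        # A plug-in is anything exposing a `hooks` dict mapping
        # hook names to callables.
        for name, fn in plugin.hooks.items():
            self.hooks.setdefault(name, []).append(fn)

    def deploy(self, stack):
        events = []
        for fn in self.hooks["before:deploy"]:
            events.append(fn(stack))
        events.append(f"deploy:{stack}")
        for fn in self.hooks["after:deploy"]:
            events.append(fn(stack))
        return events

class StepFunctionsPlugin:
    # A community plug-in adding behavior the core never envisioned.
    hooks = {"before:deploy": lambda stack: f"compile-state-machine:{stack}"}

engine = Engine()
engine.register(StepFunctionsPlugin())
print(engine.deploy("my-service"))
# ['compile-state-machine:my-service', 'deploy:my-service']
```

The core never has to know what a state machine is; the plug-in simply hooks the deploy lifecycle, which is how the Step Functions plug-in could ship ahead of first-party support.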

You do customer development. You talk to them ahead of time. You get all this research. You talk to a thousand customers, and then you release it to 14 million customers, right? You're never going to guess, so let them. Let them build it, and if people put the work in and find there's value in it, sometimes you can bring it in, and sometimes you leave it up to the community to maintain. But you just ... You have to be willing to accept that customers are going to use your product in different ways than you envisioned, and that's a good thing, because it means customers are using your product.

Jeremy: Right, right. Yeah. So I mean, from your perspective though ... let's talk about SAM for a minute, because I was excited when SAM came out. I was thinking to myself, "All right. Simplified tooling that is focused on serverless. Right? It gives me all the things that I think I'm going to need." And then, from a developer experience standpoint, let's call out the elephant in the room: AWS and developer experience are not always the same thing. They don't always give you the developer experience that you would want. They give you tons of tools, right, but they don't always give you that ...

Rob: Funny enough, you can spell "developer experience" without AWS.

Jeremy: Right. So I mean, that's my ... I was disappointed when I started using SAM, and I immediately reverted back to the Serverless Framework. Not because I thought that it wasn't good or that it wasn't well-thought-out. Like you said, there was a level of excellence there, which certainly you cannot diminish. It just didn't do the things I needed it to do, and I'm just curious if that was consistent feedback that you got as someone on the dev advocate team there. Was that something that you felt as well?

Rob: I need to give two answers to this to be fair, to be honest. That was something that I felt as well. I never got as comfortable with SAM as I am with the Serverless Framework, but there's another side to this coin, and that's that enterprise uptake of SAM CLI has been really strong.

Jeremy: Right.

Rob: For enterprises, it does what they need it to do, and it addresses their concerns, and they like getting tooling from AWS. It just goes back to there being a place for both, right?

Jeremy: Right.

Rob: Enterprises are much more likely to build cathedrals. They want that top-down, "Okay, everybody. This is how you define something. In fact, we've created a module for you. Consume it here. Thou shalt not write new S3-to-web-server configuration in your SAM templates. Thou shalt consume this." That's not wrong, and the usage numbers don't lie with SAM. It's got a lot of fans, and it's got a lot of uptake, but that's an entirely different answer from how I feel about it. I think it also goes back to the fact that I'm not running an enterprise. I've never run an enterprise. The biggest I've got in terms of responsibility is at best a small company, right? So I think it's natural for me to feel that way when I try to use a tool that has such popularity amongst enterprises. Now, of course, you have exceptions both ways, right? You have enterprises using the Serverless Framework, and you have small builders using SAM. But in general, I think the success there was with the enterprise, and it's a validation of their strategy.

Jeremy: Right, right. So let's talk about enterprises for a second, because this is where we look at tools like the CDK and SAM, the Serverless Framework, things like that. You look at all those different tools, and like you said, there's adoption across some of those. But at the end of the day, most of those tools are compiling down to CloudFormation, or they're compiling down to ... What's it called? The Azure Resource Manager language or whatever the heck it is, right?

Rob: ARM templates.

Jeremy: ARM templates. What's the value now in CloudFormation and those sort of things that the final product that you get to ... I mean, certainly, it's so much easier to build those using these frameworks, but do we need CloudFormation in those things anymore? Do we need to know those? Does an individual developer need to be able to understand those, or can they just basically take a step back and say, "Look, CDK does it for me," or, "Pulumi does it for me. So why do I need to know what's baked into those templates?"

Rob: Yeah. So let's set Terraform aside and talk about it after, because it's different. I think the choice of JSON and YAML as implementation languages for most of this tooling was a very effective choice, because you don't necessarily have to know CloudFormation to look at a template and figure out what it's doing.

Jeremy: Right.

Rob: Right? You don't have to understand transforms. You don't have to understand parameter replacement and all of this stuff to look at the final transformed template in the CloudFormation console and reason quickly about what's happening. That's good. Do I think there's value in learning to create multi-thousand-line CloudFormation templates by hand? I don't. It's the assembly language of the cloud, right?

Jeremy: Right.

Rob: It's there when you need it, and just like with procedural languages, you might want to look underneath at the instructions, how it unrolled certain loops, how it decided not to unroll others, so that you can make changes at the next level. But I think that's rare, and that's optimization. In terms of getting things done and getting things shipped and delivered, I wouldn't start with plain CloudFormation for anything of any meaningful production size. That's not a criticism of CloudFormation. It's just, like you said, all this other tooling is there to help us generate what we want consistently.
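The "assembly language" relationship can be made concrete with a toy transform: a terse declaration fans out into the verbose, default-filled document the underlying service actually executes, the way SAM or the Serverless Framework compile their shorthand down to CloudFormation. The expansion logic below is a simplified stand-in, not any tool's real template schema:

```python
# Toy "transform": expand a terse function declaration into a
# verbose CloudFormation-style resource. Simplified illustration
# of how higher-level IaC tools compile down to the raw template.

def transform(shorthand):
    resources = {}
    for name, props in shorthand.items():
        # One line of shorthand fans out into a fully specified,
        # default-filled resource definition.
        resources[name] = {
            "Type": "AWS::Lambda::Function",
            "Properties": {
                "Handler": props.get("handler", "index.handler"),
                "Runtime": props.get("runtime", "python3.12"),
                "MemorySize": props.get("memory", 128),
                "Timeout": props.get("timeout", 3),
            },
        }
    return {"Resources": resources}

template = transform({"HelloFn": {"handler": "app.hello"}})
print(template["Resources"]["HelloFn"]["Properties"]["Runtime"])
# python3.12
```

Reading the expanded output is still tractable, which is Rob's point: you rarely write the verbose form by hand, but you can always inspect it.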

The other benefit is that once you have that underlying lingua franca of the cloud, you can build visualization, and debugging, and monitoring, and ... I mean, all of this other evaluatory ... "Evaluatory?" Is that a word? It's a word now. You heard it here first on this podcast. Forensic, cloud-forensic-type tooling that lets you see what's going on, because it is a universal language among all of the tools.

Jeremy: Yeah, and I want to get back to Terraform because I know you mentioned that, but I also want to be clear. I don't suggest you write CloudFormation. I think it is horribly verbose, but probably needs to be, right?

Rob: Yeah.

Jeremy: It probably needs to have that level of fidelity there, all that descriptive information. Yeah. I'm with you; I don't suggest that people choose that as their way to go. I'm just wondering if it's one of those things where we don't need to be able to look at ones and zeroes anymore and understand what they do, right?

Rob: Right.

Jeremy: We've got higher-level constructs that do that for us. I get the assembly-language comparison. I think that's a good comparison, but it's just that if you are an enterprise, right ... Do you trust something like CDK to do everything you need it to do and make sure that it's covering all its bases? Because, again, you're writing layers of abstraction on top of a layer of abstraction that then abstracts it even more. So yeah, I'm just wondering. You had mentioned forensic tools. I think there's value there in being able to understand exactly what gets put into the cloud.

Rob: Yeah, and I'd agree with that. It takes 15 seconds to run into the limit of my knowledge of CDK, by the way. But that knowledge includes the fact that cdk synth is there, which generates the CloudFormation template for you, and that's actually the thing which is uploaded to the CloudFormation service and executed. You'd have to bring in somebody like Richard Boyd or someone to talk about the guard rails that are there around it. I know they exist. I don't know what they are. It's wildly popular, and adoption is through the roof. So again, whether I philosophically think it's a good idea or not is irrelevant. Developers want it and want to build with it, right?

Jeremy: Yeah.

Rob: It's a Bazaar-type tool where they give you some basic constructs, and you can write your own constructs around it and get whatever you need. But ultimately, that comes back to CloudFormation, which is then subject to all the controls that your organization puts around CloudFormation, so it is ... There's value there. It can't be denied.

Jeremy: Right. No, and the thing that I like about the CDK is the idea of being able to create those constructs, because I think, especially from a ... What's the right word? Compliance standpoint or something like that, you can write these constructs that say, "You need to use these constructs when you deploy a microservice," or whatever it is, and then you have those guard rails, as you mentioned, and all of those checkboxes are ticked because you can put that all into one construct. So I totally think that's great. All right. So let's talk about Terraform.

Rob: Yeah, so there's ... First, it's a completely different model, right?

Jeremy: Right.

Rob: This is an interesting discussion to have, because it's API calls. You write your provider for whatever your infrastructure is, and anything that can be an API call can now be a Terraform declarative resource. So it's that mapping between declarative and imperative that I find fascinating. It also builds the dependency graph for you, so it handles all of those aspects, which makes it a really powerful tool. The thing that they did so well ... Terraform is equally verbose as CloudFormation. You've got to configure all the same options. You get the same defaults, et cetera. It can be terribly verbose, but it's modular. Every Terraform file that you have in one directory is concatenated, and that is a huge distinction from how CloudFormation wants everything in one template. Well, you can refer to something in an S3 bucket, but that's not actually useful to me as a developer.

Jeremy: Right, right.

Rob: I can't mount an S3 bucket as a drive on my workstation and compose all of these independent files at once and work on them that way. Sidebar here: maybe I can. Maybe it supports that and I haven't been able to discover it, right? Whereas Terraform, by default, out of the box, puts everything in its own file according to function. It's very easy to look in your databases.tf and understand what's in there, look in your vpc.tf and understand what's in there, and not have to go through thousands of lines of code at once. Yes, we have find and replace. Yes, we have search, and ... Anybody who's ever built any of this stuff knows that's not the same thing. It's not the same thing as being able to open a hundred lines in your text editor, look at everything all at once, and gain an understanding of it, and then dive into the next level of detail a hundred lines at a time and understand that.
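The layout Rob describes relies on Terraform concatenating every .tf file in a directory into one configuration, so the filenames are purely organizational. A typical (illustrative) split:

```
infra/
├── vpc.tf        # networking: VPC, subnets, security groups
├── databases.tf  # persistent data stores
├── variables.tf  # input variables
└── outputs.tf    # values exported for other stacks to consume
```

Each file stays small enough to read in one screen, while `terraform plan` still sees the whole directory as a single configuration.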

Jeremy: Right. But now, just a question here, because the API thing, I love that idea, and actually, Serverless Components used an API approach to do it and bypassed CloudFormation. Actually, I believe Architect originally was using APIs and then switched to CloudFormation. But the question I have about that is: if you don't have a CloudFormation template, if you don't have that assembly language of the web, and that's not sitting in your CloudFormation tooling built into the dashboard, you don't get the drift protection, right, or the detection, and you don't have that resource map necessarily up there, right?

Rob: First, I don't think CloudFormation is the assembly language of the web. I think it's the assembly language of AWS.

Jeremy: I'm sorry. Right. Yes. Yeah. Yeah.

Rob: That leads into my point here, which is, "Okay. AWS gives you the CloudFormation dashboard, but what if you're now consuming things from Datadog, or from Fauna, or from other places that don't map this the same way?"

Jeremy: Right.

Rob: Terraform actually does manage that. You can run a plan against your existing files, and it will go out and check the actual state of all of your resources, compare them to what you've asked for declaratively, and show you what the changeset would be. If it's zero, there's no drift. If there is something, then there's either drift or you've added new functionality. Now, with Terraform Cloud, which I've only used at a basic level, I'm not sure how automatic that is or whether it provides that for you. If you're from HashiCorp and listening to this, I would love to learn more. Get in touch with me. Please tell me. But the tooling is there to do that, and it's there to do it across anything that can be treated as an API that has ... really just create and retrieve. You don't even necessarily need the update and delete functionality there.
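Mechanically, the plan Rob describes is a diff between what you declared and what the provider's API reports. A minimal sketch, with made-up resource shapes rather than Terraform's actual state format:

```python
# Minimal drift check: compare declared configuration with the
# actual state reported by a provider API and emit a changeset.
# An empty changeset means no drift. Shapes are illustrative.

def plan(declared, actual):
    changes = []
    for rid, spec in declared.items():
        if rid not in actual:
            changes.append(("create", rid))
        elif actual[rid] != spec:
            changes.append(("update", rid))
    for rid in actual:
        if rid not in declared:
            changes.append(("delete", rid))
    return changes

declared = {"db": {"size": "small"}, "bucket": {"versioning": True}}
actual = {"db": {"size": "large"}, "orphan": {}}

print(plan(declared, actual))
# [('update', 'db'), ('create', 'bucket'), ('delete', 'orphan')]
```

Anything with a create-and-read API can be diffed this way, which is why the model extends beyond any one cloud provider.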

Jeremy: Right, right. Yeah, and I certainly ... I am a fan of Terraform and all of these services that make it easier for you to build clouds more easily, but let's talk about APIs because you mentioned anything with an API. Well, everything has APIs now, right? I mean, essentially, we are living in a ... I mentioned SaaS, but now we're sort of ... This whole idea of the API economy. So just overall, what are your thoughts on this idea of basically anything you want to do is available now via an API?

Rob: It's not anything you want to do. It's everything you want to do short of fulfilling your customer's needs, short of solving your customer's specific problem is available as an API. That means that you get to choose best-in-class for everything, right? Your customer's need isn't, "I want to spend $25 on my credit card." Your customer's need is, "I need a book."

Jeremy: Right.

Rob: So it's not, "I want to store information about books in a database." It's, "I need a book." So everything, every step of the way there, can now be consumed from an API. Again, it's like serverless in general, right? It allows you to focus purely, or as close to purely as we can right now, on generating customer value and addressing customer problems, so that you can ship faster and so that you have it as a competitive advantage. I can write a payment processing program. I know I can, because I've done it back in 2004, and it was horrible, and it was awful. It wasn't a very good one, but it worked. It took your money, but this was like pre-PCI DSS.

If I had to comply with all of those things, why would I do that? I'm not a credit card payment processor. Stripe is, and they have specialists in all of the areas related to the problem of, "I need to take and process payments." That's the customer problem that they're solving. The specialization of labor that comes along with the API economy is fantastic. Ops never went away. All the ops people work at the cloud service providers now.

Jeremy: Right.

Rob: Right? Audit never went away. All the auditors have disappeared from view and gone into internal roles in payments companies. All of this continues to happen where the specialists are taking their deep, deep knowledge and bringing it inside companies that specialize in that domain.

Jeremy: Right, and I think the domain expertise value that you get, whether it's from a database company or a payment company ... The number of people that you would need to hire to have that level of specialization, versus what you're paying, two cents per transaction or $50 a month for some service, you couldn't even begin. The total cost of ownership on those things is ... It's not even a conversation you would want to have. But I also built a payment processing system, and I did have to pass PCI, which we did pass, but it was ...

Rob: Oh, good for you.

Jeremy: Let's put it this way. It was for a customer, and we lost money on that customer because we had to go through PCI compliance, but it was good. It was a good experience to have, and it's a good experience to have because now I know I never want to do it again.

Rob: Yeah, yeah. Back to my earlier point on Ops and serverless.

Jeremy: Right. Exactly.

Rob: These things are hard, right?

Jeremy: Right, right.

Rob: Sorry.

Jeremy: No, no. Go ahead.

Rob: Not to interrupt, but these are all really hard problems that people with graduate degrees and post-graduate research, who are 30 years old when they start working on the problem, are solving. There's a supply question there as well, right? There's just not enough people, and so you and I can ... Well, I'm not going to project this onto you. I can stumble through an implementation and check off the requirements, just like I worked in an optical microscopy lab in college and could create computer programs that modeled those concepts, but I was not an optical microscopist. I was not generating PhD-level understandings of these things, and all of these problems are just so hard. Why would you do that when customer problems are equally hard and this set is solved?

Jeremy: Right, right.

Rob: This set of problems over here is solved, and you can't differentiate yourself by solving it better, and you're not likely to solve it better either. But even if you did, it wouldn't matter. This set of problems are completely unsolved. Why not just assemble the pieces from best-in-class so that you can attack those problems yourself?

Jeremy: Again, I think that makes a ton of sense. So speaking of expertise, let's talk about what you might have to pay, say, a database administrator. If you were to hire a database administrator to maintain all the databases for you and keep all that uptime, maybe you have to hire six database administrators in order for them to ... Well, I'm thinking multi-region and all that kind of stuff. Maybe you have to hire a hundred, depending on what it is. I mean, I'm getting a little ahead of myself, but ... So if I can buy a service like Fauna instead, tell me a little bit more about just how that works.

Rob: Right. Well, I mean, six database engineers in the US, you're over a million dollars a year easily, right?

Jeremy: Right.

Rob: I don't know what the exact number is, but when you consider benefits, and total cost, and all of that, it's a million dollars a year for six database engineers. Then, there are some very difficult problems, especially in distributed databases and database scaling, that Fauna solves. A number of other products or services solve some of them. I'm biased, of course, but I happen to think Fauna solves all of them in a way that no other product does. But you're looking ... You mentioned distributed transactions. Fauna is built atop the Calvin paper, which came out of Yale. It's a very brief but dense academic research paper, PhD-level research, and it describes a model for distributed transactions in databases. It's a serialization layer that sits atop your database.

So let's say you wanted to replicate something like Fauna. Not only do you need six database engineers who understand the underlying database, but you need to find engineers who understand this paper, understand the limitations of the theory in the paper, and how to overcome them in operations: in reality, what happens when you actually start running regions around the world and replicating transactions between those regions. Quite frankly, there's a level of sophistication there, and most of the set of people who satisfy that criteria already work at Fauna. So there's not much of a supply there. Now, there are other database competitors that solve this problem in different ways, and most of those specialists already work at those companies as well, right? I'm not saying that they aren't equally competent database engineers. I'm just saying there's not a lot of them.

Jeremy: Right.

Rob: So if your thing is to sell books at a certain scale like Amazon, then it makes sense to have them because you are a database creator as well. But if your thing is to sell books at some level below that, why would you compete for that talent rather than just consuming it?

Jeremy: Right. Yeah, and I would say unless you're a horse with a horn on your head, it's probably not worth maintaining your own database and things like that. So let's talk a little bit more, though, about that. I guess just this idea of maybe a shortage of people, like if you're ... You're right. There's a limited number of resources, right? I'm sure there's brilliant database engineers all around the world, and they have the experience where, right, they could come in and they could probably really help you maintain your database. Even if you could afford six of them and you wanted to do that, I think the problem is it's got to be the interestingness of the problem. I don't think "interestingness" is a word either, but like if I'm a database engineer, wouldn't I want to be working on something like Fauna that I could help millions and millions of people as opposed to helping some trucking company maintain their internal database or something like that?

Rob: Yeah, and I would hope so. I hope it's okay that I mention we're hiring. So come to Fauna.com and look at our roles for database engineers.

Jeremy: You just read that Calvin paper first. Go ahead.

Rob: But read the Calvin paper first. I think it's only like 12 pages, and even just the first page is enough. I'm happy to talk about that at any length because I find it fascinating and it's public. It is an interesting problem and the ... It's the reification or the implementation of theory. It's bringing that theory to the real world and ... Okay. First off, the theory is brilliant. This is not to take away from it, but the theory is conceived inside someone's mind. They do some tests, they prove it, and there's a world of difference between that point, which is foundational, and deploying it to production where people are trusting their workloads on it around the world. You're actually replicating across multiple cloud providers, and you're actually replicating across multiple regions, and you're actually delivering on the promise of the paper.

What's described in the paper is not what we run at Fauna other than as a kernel, as a nugget, right, as the starting point or the first principle. That I think is wildly interesting for that class of talent like you talked about, the really world-class engineers who want to do something that can't be done anywhere else. I think one thing that Fauna did smartly early was be a remote-first company, which means that they can take advantage of those world-class engineers and their thirst for innovation regardless of wherever Fauna finds them. So that's a big deal, right? Would you rather work on a world-class or global problem or would you rather work on a local problem? Now, look, we need people working on local problems too. It's not to disparage that, but if this is your wheelhouse, if innovation is the thing that you want to do, if you want to be doing things with databases that nobody else is doing, this is where you want to be. So I think there's a strong argument there for coming to work in a place like Fauna.

Jeremy: Yeah, and I want to make sure I apologize to any database engineer working at a trucking company because I'm sure there are actually probably really interesting problems with logistics and things like that that they are solving. So maybe not the best example. Maybe. I don't know. I can't think of another example. I don't want to offend anybody who's chosen a more local problem because you're right. I mean, there are local problems that need to be solved, but I do think that there are people ... I mean, even someone like me. I want to work on a bigger problem. You know what I mean? I owned a web development company for 12 years, and I was solving other people's problems by building them a website or whatever, and it just got to a point where I'm like, "I'm not making enough of an impact here." You're not solving a big enough problem. You want to work on something more interesting.

Rob: Yeah. Humans crave challenge, right? Challenge is a necessary precondition for growth, and at least most of us, we want to grow. We want to be better at whatever it is we're doing, or just to be something next year that we aren't today, and you can't do that without challenge. If you build other people's websites for 12 years, eventually you get to a point where maybe you're too good at it. Maybe that's great from a business perspective, but it's not so great from a personal fulfillment perspective.

Jeremy: Right.

Rob: If it's, "Oh, look, another brochure website. Okay. Here you go. Oh, you need a contact form?" Again, it's not to disparage this. It's the fact that if you do anything for 12 years, sometimes mastery is stasis. Not always.

Jeremy: Right, and I have nightmares of contact forms, of building contact forms, by the way, but...

Rob: It makes sense. Yeah. You know what you should do is just put all of those directly into Fauna and don't worry about it.

Jeremy: Easy enough. Easy enough.

Rob: Yeah, but it's not necessarily stasis, but I think about craftsmen and people who actually make things with their hands, physical builders, and I think a lot of that ... Like if you're making furniture, you're a cabinet maker. I think a lot of that is every time, it's just a little bit wrong, right? Not wrong, but just a little bit off from your optimum no matter how long you do it, and so everything has a chance to evolve. That's there with software to a certain extent, but the problem is never changing.

Jeremy: Right.

Rob: So, yeah. I can see both sides of it, but for me, I ... You can see it when I was on serverless four years ago and now that I'm on a serverless database now. I like to be out at the edge, pushing that edge out for everyone who's coming behind. It can be challenging because sometimes there's just no way forward, and sometimes everybody is not ready to come with you. In a lot of ways, being early is the same as being wrong.

Jeremy: Right. Well, I've been ...

Rob: Not an original statement, but ...

Jeremy: No, but I've been early on many things as well where like five years after we tried to do something, like then, all of a sudden, it was like this magical thing where everybody is doing it, but you mentioned the edge. That would be something ... or you said on the edge. I know you mean this way, but the edge in terms of the actual edge. That's going to be an interesting data problem to solve.

Rob: Oh, that's a fascinating data problem, especially for us at Fauna. Yeah, compute, and Andy Jassy, when he was at AWS, talked about how compute was bifurcating, right? It's either moving all the way out to the edge or it's moving all the way into the cloud, and that's true. But I think at Fauna, we take that a step further, right? The edge part is true, and a lot of the work that we've done recently, announcements with Cloudflare Workers. We're ready for that. We believe in it, and we like pushing that out as close to the user as possible. One thing that we do uniquely is we have this concept of user-defined functions, and anybody who wrote T-SQL back in the day, who wrote stored procedures, is going to be familiar with this, but it's ... You bring that business logic and that code to your data. Not near your data, to your data.

Jeremy: Right.

Rob: So you bring the compute not just to the cloud where it still needs to pass through top-of-rack and all of this. You bring it literally on to the same instance as your data where these functions execute against them there. So you get not just the database, but you get a compute layer in there, and this helps for things like filtering for things like the equivalent of joins, stuff that just ... If you've got to load gigabytes of data and move it somewhere, compute against it, reduce it to something, and then store that back, the speed of light still matters. Even if it's the speed of light across a couple switches, it still matters, and so there are some really interesting things that you can begin to do as you pull more and more of that logic into your data layer, and you also protect that logic from other components of your application.

So I like that because things like GraphQL that endpoints already speak and already understand, just send it over, and again, they don't care about the architectural, quite frankly, genius--I can say that because I didn't create it--the genius behind all of this stuff. They just care that, "Look, I send this request over and I get it back," and entire workflows, and complex processes, and everything are executing behind the scenes just so that the endpoints can send and retrieve what they need more effectively and more quickly. The edge is fascinating. The thing I regret the most about the edge is I have no hardware skills, right? So I can't make fun things to do fun things in my house. I have to buy them, but you can't do everything.
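Rob's "bring the compute to your data" point can be sketched in a few lines. This is a toy illustration in Python with invented names, not Fauna's actual API (Fauna's real user-defined functions are written in FQL); the idea is simply that filtering and aggregating next to the data means the client receives one small result instead of the whole dataset.

```python
# Toy sketch of a user-defined function running at the data layer.
# All names here are invented for illustration.

# Pretend this is the data living on the database node.
orders = [{"id": i, "region": "eu" if i % 2 else "us", "total": i * 10}
          for i in range(1, 1001)]

def udf_eu_revenue(rows):
    # Filter + aggregate happen next to the data, inside the "UDF".
    return sum(r["total"] for r in rows if r["region"] == "eu")

# The client receives a single number instead of a thousand rows,
# so far less data crosses the network.
result = udf_eu_revenue(orders)
print(result)  # 2500000
```

The design choice being illustrated: the reduction runs where the data lives, so only the answer travels over the network.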

Jeremy: Yeah. Well, no. I think you make a good point though about bringing the compute to the data side, and other people have said there's no ... Ben Kehoe has been talking about this for a while too where like it just makes sense. Run the compute where the data is, and then send that data somewhere else, right, because there's more things that can be done with data after that initial bit of compute. But certainly, like you said, filtering it down or getting the bits that are relevant and moving a small amount of data as opposed to a large amount of data I think is hugely important.

Now, the other thing I just want to mention before I let you go, or I want to talk about quickly, is this idea of going back to the API economy aspect of things and buying versus building. If you think about what you've had to do at Fauna, and I know you're relatively new there, but you know what they've done and the work that had to go in to build this distributed system. I mean, I think about most systems now, and I think like anything I'm going to build now, I've got to think about scale, right?

I don't necessarily have to build to scale right away, especially if I'm doing an MVP or something like that. But if I was going to build a service that did something, I need to think about multi-region, and I need to think about failover, and I need to think about potentially providing it at the edge, and all these other things. So you come down to this thing, and I will just use the database example. But even if you were say using like MySQL, or Postgres, or something like that, that's going to scale. That's going to scale pretty well to get to a certain point, and then you're going to have to start sharding, right? When data gets hard, it's time to shard, right? You just have to start sharding everything.
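The "start sharding everything" step Jeremy describes usually means hash-routing each key to one of N database nodes in your own application code. A minimal sketch of what you end up owning (the routing here, and eventually the rebalancing when N changes — names invented for illustration):

```python
import hashlib

# Minimal hash-sharding sketch: the application routes each key to one of
# NUM_SHARDS databases. Rebalancing when NUM_SHARDS changes is the painful
# part that managed services absorb for you.
NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}  # dicts stand in for real databases

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```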

Rob: Yeah.

Jeremy: Essentially, what you end up doing is rebuilding DynamoDB, or trying to rebuild Fauna, or something like that. So just thinking about that, anything you're building ... Maybe you have some advice for developers who ... I know we've talked about this a little bit, but I just go back to this idea of like, if you think about how complex some of these SaaS companies and these services that are being built out right now, why would you ever want to take that complexity on yourself?

Rob: Pride. Hubris. I mean, the correct answer is you wouldn't. You shouldn't.

Jeremy: People do.

Rob: Yeah, they do. I would beg and plead with them like, "Look, we did take a lot of that on. Fauna scales. You don't need to plan for sharding. You don't need to plan for global replication. All of these things are happening." I raise that as an example of understanding the customer's problem. The customer didn't want to think about, "Okay, past a thousand TPS, I need to create a new read replica. Past a million TPS, I need to have another region with active-active." The customer wanted to store some data and get that data, knowing that they had the ACID guarantees around it, right, and that's what the customer has.

So get that good understanding of what your customer really wants. If you can buy that, then you don't have a product yet. This is even out of software development and into product ideation at startups, right? If you can go ... Your customer's problem isn't they can't send text messages programmatically. They can do that through Twilio. They can do that through Amazon. They can do that through a number of different services, right? Your customer's problem is something else. So really get a good understanding of it. This is where knowing a little ... Like Joe Emison loves to rage against senior developers for knowing not quite enough. This is where knowing like, "Oh, yeah, Postgres. You can just shard it." "Just," the worst word in computer science, right?

Jeremy: Right.

Rob: "You can just shard it." Okay. Now, we're back to those database engineers that you talked about, and your customer doesn't want to shard a database. Your customer wants to store and retrieve data.

Jeremy: Right.

Rob: So any time that you can really focus in, and I guess I really got this one, this customer obsession beaten into me from my time at AWS. Really focus in on what the customer is asking you to do or what the customer needs even if they don't know how to express it, and build for that.

Jeremy: Right, right. Like the saying. I forgot who said it. Somebody from Harvard Business Review, but, "Your customers don't want a quarter-inch drill. They want a quarter-inch hole."

Rob: Right, right.

Jeremy: That's basically true. I mean, the complexity that goes behind the scenes are something that I think a vast majority of customers don't necessarily want, and you're right. If you focus on that product ideation thing, I think that's a big piece of that. All right. Well, anyway. So I have one more question for you just to ...

Rob: Please.

Jeremy: We've been talking for a while here. Hopefully, we haven't been boring people to death with our talk about APIs and stuff like that, but I would like to get a little bit academic here and go into that Calvin paper just a tiny bit because I think most people probably will not want to read it. Not because they don't want to, but because people are busy, right, and so they're listening to the podcast.

Rob: Yeah.

Jeremy: Just give us a quick summary of this because I think this is actually really fascinating and has benefits beyond just I think solving data problems.

Rob: Yeah. So what I would say first: I actually have this paper on my desk at all times. I would say read Section 1. It's one page, front and back. So if you're interested in it, you don't have to read the whole paper. Read that, and then listeners to this podcast will probably understand when I say this. Previously, for distributed databases and distributed transactions, you had what was called a two-phase commit. The first phase was you'd go out to all of your replicas, and you'd say, "Hey, I need a lock." When everybody comes back, and acknowledges, and says, "Okay. You have the lock," then you do your transaction, and then you replicate it, and then you say, "Hey, everybody. I'm done. Release the lock." So it's a two-phase commit. If something went wrong, you rolled it all the way back and you said, "Hey, everybody. Forget it."

Calvin is event-sourcing for databases. So if I could distill the entire paper down into one concept, that's it. Right? Instead of saying, "Hey, everybody. Give me a lock, I'm going to do something," you say, "Hey, everybody. Here's what we're going to do." It's a deterministic application of the transaction so that you can ... You both create the lock and execute the transaction at the same time. So rather than having this outbound round trip, and then doing the thing in an outbound round trip, you have the outbound round trip, and you're done.

They all apply it independently, and then it gets into how you structure the guarantees around that, which again is very similar to event-sourcing in that you use snapshotting or checkpointing. "So, hey. At this point, we all agree. So we can forget all of our old transactions, and we roll forward from here." If somebody leaves the cluster, they come back in at that checkpoint. They reapply all of the events that occurred prior to that. Those events are transactions. They'd get up to working speed, and then off they go. The paper I think uses Paxos. That's an implementation detail, but the really interesting thing about it is you're not having a double round trip.
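The distinction Rob draws can be sketched as a toy: instead of a per-transaction lock round trip, every replica receives the same agreed-upon log of transactions and applies it deterministically, so identical input order yields identical state everywhere. This is only an illustration of the idea, not Calvin or Fauna itself; in the real system the ordering comes from a consensus protocol.

```python
# Toy sketch of deterministic transaction application ("Calvin-style"):
# replicas agree on the order of transactions once, then each applies the
# same log independently -- no per-transaction lock round trips.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, txn):
        # Deterministic: same log, same order, same resulting state.
        for key, delta in txn:
            self.store[key] = self.store.get(key, 0) + delta

# The single agreed-upon log (in the real system this ordering comes from
# consensus, e.g. Paxos; here it is just a Python list).
log = [
    [("balance:alice", +100)],
    [("balance:alice", -30), ("balance:bob", +30)],
]

replicas = [Replica("us-east"), Replica("eu-west"), Replica("ap-south")]
for replica in replicas:
    for txn in log:
        replica.apply(txn)

# Every replica converges to the same state without exchanging locks.
assert all(r.store == replicas[0].store for r in replicas)
print(replicas[0].store)  # {'balance:alice': 70, 'balance:bob': 30}
```

The parallel to event-sourcing is direct: the log of transactions is the source of truth, and a replica rejoining the cluster just replays it from the last checkpoint.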

Jeremy: Yeah.

Rob: Again, I love the idea of event-sourcing. I think Amazon EventBridge is the best service that they've released in the past couple years.

Jeremy: Totally.

Rob: If you understand all of that and are already building serverless applications that way, well, that's what we're doing, just event-sourcing for database. That's it. That's it.

Jeremy: Just event-sourcing. It's easy. Simple. All right. All the words you never want to hear. Simple, easy, just. Right. Yeah. Perfect.

Rob: Yeah, but we do the hard work, so you don't have to. You don't care about all of that. You want to write your data somewhere, and you want to retrieve your data from an API, and that's what Fauna gives you.

Jeremy: Which I think is the main point here. So awesome. All right. Well, Rob, listen. This was great, and I'm super happy that I finally got you on the show. Congratulations for the new role at Fauna and on what's happening over there because it is pretty exciting.

Rob: Thank you.

Jeremy: I love companies that are innovating, right? It's not just another hosted database. You're actually building something here that is innovative, which is pretty amazing. So if people want to find out more about you, follow you on Twitter, or find out more about Fauna, how do they do that?

Rob: Right. Twitter, rts_rob. Probably the easiest way, go to my website, robsutter.com, and you will link to me from there. From there, of course, you'll get to fauna.com and all of our resources there. Always open to answer questions on Twitter. Yeah. Oh, [email protected]. If you're old-school like me and you prefer email, there you go.

Jeremy: All right. Awesome. Well, I will get all that into the show notes. Thanks again, Rob.

Rob: Thank you, Jeremy. Thanks for having me.

2021-05-03

Episode #98: Making Serverless Accessible with Bit Project with Daniel Kim

About Daniel Kim

Daniel Kim (He/Him) is a Senior Developer Relations Engineer at New Relic and the founder of Bit Project, a 501(c)(3) nonprofit dedicated to making tech accessible to underserved communities. He wants to inspire generations of students in tech to be the best they can be through inclusive, accessible developer education. He is passionate about diversity & inclusion in tech, good food, and dad jokes.

Twitter: @learnwdaniel
Volunteer with Bit Project: bitproject.org/volunteer
Learn Serverless with Bit Project: bitproject.org/course/serverless

Watch this video on YouTube: https://youtu.be/oDdrbDXQG6w

This episode is sponsored by CBT Nuggets.

Transcript:
Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Daniel Kim. Hey, Daniel. Thanks for joining me.

Daniel: Hi, Jeremy. How's it going?

Jeremy: It's going real ...

Daniel: I'm glad to be here.

Jeremy: Well, I'm glad that you're here. So, you are a Senior Developer Relations Engineer at New Relic, but you're also the founder of Bit Project. So, I would love it if you could tell the listeners a little bit about yourself and your background and what Bit Project is all about.

Daniel: That sounds great, Jeremy. My name is Daniel. I'm a Senior Developer Relations Engineer at New Relic, which means I get to help the community and go find developers and help them become better developers. And I got into developer relations because I founded a school club and now it's a nonprofit, but it started as a school club, called Bit Project, where me and my friends gathered together to teach each other awesome web technologies. And yeah, that's how I got my start. And I am still running Bit Project as a nonprofit to help students around the world build and ship projects using awesome technologies and help them learn and become better developers.

Jeremy: Right. And one of those awesome technologies is serverless. And that's what I want to talk to you about today because this is a really great program that you're running here that helps make Serverless more accessible to more people, which is what I'm all about, right? So, I absolutely love this. So, let's go back and talk a little bit about Bit Project and just get into how it got started. You mentioned it was a project you were doing with some college friends, but how did it go from that to what it is now?

Daniel: Yeah. So, I started this, I think, late freshman year when I was still in school at UC Davis. I was not a computer science major, actually. I was an electrical engineering major, but as I got into technology and seeing all the possibilities of things you can build with cool tech, I was like, "I really need to get into web development because this is so awesome. I can make changes on the fly. I can see awesome things. I can build awesome things with my hands." Well, with my computer. So, yeah, I got a couple of friends together because I'm a very social person so I like to build and learn things together with my friends. So, I got a couple of them together. We rented a lecture hall and then we just taught each other everything we knew to each other. For example, I was super into Gatsby and React, so I was teaching my friends React. Other friends were super into backend development, so they were teaching me things like how to design APIs and how to connect a frontend to a backend, like really awesome things to each other.

And it started like that until I decided to scale the program so I could help more and more of my fellow students. So, instead of doing four-person meetups, I would organize a workshop. And those workshops turned into sponsored workshops with funding, which meant a lot of free food, which meant more people, and it just ballooned into this awesome student organization where we always had the best food. We had free Boba, free pizza, and we would share with each other all these awesome technologies and tools that we learned how to work with using in our projects. So, that's how it started.

Jeremy: Right. And then, so once you got this thing rolling, obviously you're seeing some success with it, then you get into developer relations?

Daniel: Yeah, definitely. So, that's when I understood what I wanted to do for the rest of my life. I didn't want to be that production engineer on-call all the time. I wanted to be that engineer that helped other engineers become more successful and find the joy in programming. I love seeing when developers find that "aha moment" when they're learning something new, and I love helping them become better developers. And I found that out when I was teaching my friends how to program because I got more joy out of seeing other people succeed than me succeeding myself. So, I was like, "Developer relations is the path for me." So, that's why I directly entered developer relations right out of college, because I was like, "This is what I'm meant to do." Because one of my favorite things to do is figure out how to break down really complex ideas and concepts into more fun, easy-to-understand chunks so everyone can succeed and have a good time. That's my thing.

Jeremy: No, I love that. I love that because I feel like, especially people who are maybe not traditional tech people or don't have a traditional tech background, sometimes it just takes a little bit of twisting of the presentation for them to really understand that. And I love that idea of just reaching out and trying to help more people because I'm on the total same page with you here. So, now you go and so you get into developer relations and you've got this Bit Project thing. And so is this something that you wanted to keep as a side project? What was the next evolution of that?

Daniel: Yeah, definitely. So, I think Bit Project is an extension of the advocacy work I do at New Relic. Because at New Relic, my job is not to push New Relic the product. We have amazing product marketing managers and other folks who do that. My job is to make it easy for people in the community to level up as developers. And one way I do that is through Bit Project. So, a lot of the work I do at New Relic mirrors or is parallel to the work I'm doing at Bit Project, where I help make complex ideas more accessible to developers. So, in a way, it's not so much a side project. It's like a parallel project: what I'm doing at New Relic, what I'm doing at Bit Project.

Jeremy: Right. And so in terms of the things that you're teaching at Bit Project too because that's the other thing too. I think leveling up developers is one of those things where, I mean, if somebody wants to go learn HTML or CSS or one of those things, there's probably plenty of resources for them to go and do that. There's probably nine million YouTube tutorials out there, right?

Daniel: Definitely.

Jeremy: But for concepts like Serverless, right? And I mean even Serverless with Azure and AWS and some of these other things, these are newer things. I've actually interviewed quite a few candidates for a recent position that I'm trying to fill and not a lot of them are learning this stuff in college.

Daniel: Definitely. Something that we really wanted to instill in our students was that this is not your average bootcamp or course. We're not promising any six-figure salary after our bootcamps. That's not what we're promising. What we're promising is the opportunity to learn a concept that is foreign to many developers, even seasoned developers, because it's a relatively new technology, and we give you the tools and teach you the ways to become successful. So, we won't teach you everything you need to know, but we will teach you how to find the things you need to know to become successful developers. So, we help establish a good foundation for developers to learn new things and then build things on their own.

Jeremy: Right. And this is ...

Daniel: That's the focus of our program.

Jeremy: Yeah. And this is completely free, right?

Daniel: It's completely free. We're run thanks to the generosity of our corporate sponsors. So, shout out to them. Yeah. So, it's completely free for all students. So, please go and apply if you're interested and you are a student.

Jeremy: So, one of the major things that you focus on, and I know that you have different courses or different workshops that you're going through. And I know some of the other ones are a little bit earlier like the DevOps one. But you have a pretty robust serverless. I mean, that's the main thing, right? Teaching people to build serverless applications on Microsoft Azure. So, I'm curious, especially having somebody jump in from maybe a non-traditional tech background or no tech background at all, and also students of all ages, right? We're not just talking about high school or college kids here, that jumping into something like serverless, what makes serverless such a good, I guess, jumping in point for the types of candidates that you're looking for?

Daniel: Yeah. This is actually a great question because I have this conversation a lot with my colleagues at New Relic because when seasoned engineers hear about serverless, they jump straight into the, "How is this scalable for my enterprise use case? How is this going to integrate with my 70,000 other microservices?" They get into those questions immediately. But if you really boil down what serverless is, it's basically running code without thinking about infrastructure. That's the crux of what serverless is. And if you think about it from that perspective and not worry about all the other technical hurdles into implementing it in scale, it becomes a lot easier to digest for students. And it becomes a really friendly medium to get started with coding a project because you just have to code a small JavaScript or Python function that you just deploy to the cloud.

It just magically works. We try not to overwhelm students with all the infrastructure talk and focus more on the code that they're writing. And I was really inspired because one of my mentors for my career is Chloe Condon from Microsoft, and I remember her writing a lot of blogs around getting started with serverless. She built this fake boyfriend app with Twilio and serverless. And I was like, "Hey, this is not that unapproachable for students to get started with serverless functions," because it was only maybe 40, 50 lines of code. It integrated multiple APIs. So, I was like, "This is the perfect medium," because it's relatively simple to understand the idea of just writing code and deploying it to a magical kingdom where the magical kingdom controls everything, you know?

Jeremy: Right.

Daniel: So, that's my inspiration for using serverless as a medium to teach people how the modern full-stack app works, if that makes sense.

Jeremy: Yeah. No, I totally agree, and I use this quite a bit where I tell people when I was a kid, when I first started programming in the late 1990s, everything was CGI-bin scripts, right? So, we were just uploading code using FTP, but it was seriously magical. Now, again, it wouldn't scale, right? But it was magical in terms of how that happened. But even if it didn't scale, the point where you can get to that, what do we call it? The "aha moment," right? Where you're like, "Oh, this is how that works," or, "Oh, I get it now." I think you just get there faster with serverless.

Daniel: Exactly. I think that's one of the reasons I love serverless is that we have students spin up a serverless function day one of the camp. We don't wait until day three or day four to teach them how to build with serverless. We're like, "Hey, this is the environment that you're going to work in," and then we have them write their own serverless functions based on a boilerplate code that we have written already. So, we try to make the barrier to entry as low as possible, so students don't get intimidated by the word "serverless."
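The "write a small function, deploy it, it magically works" experience Daniel describes has roughly this shape. The snippet below mimics the outline of an HTTP-triggered serverless handler in plain Python so it runs anywhere; the names are illustrative stand-ins, and a real Azure Function would use the platform's own request and response types instead.

```python
# The whole "app" a student writes on day one: a handler that takes a
# request and returns a response. The platform owns servers, scaling, TLS.
# (Plain-Python stand-in; the shape is illustrative, not Azure's actual API.)

def handler(req: dict) -> dict:
    name = req.get("params", {}).get("name", "world")
    return {"status": 200, "body": f"Hello, {name}!"}

# Exercising it locally the way the platform would on an HTTP request:
print(handler({"params": {"name": "Daniel"}})["body"])  # Hello, Daniel!
print(handler({})["body"])                              # Hello, world!
```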

Jeremy: Right. Right. Yeah. And I think also it's probably a good place to get people started thinking about just what the cloud is and how the cloud works in general.

Daniel: Yeah. Definitely. Some of our students have never even heard of what an API is. So, we really take students from zero to understanding how different services work on the internet and how we can take advantage of services and other code that other people have written to write our own applications. Because a lot of students, especially junior developers, don't realize how little you have to code to actually get an app working. Because most likely there's someone in the world who's coded something that you're looking for to implement already. So, it's more like a jigsaw puzzle than trying to build something yourself.
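Daniel's jigsaw-puzzle point, in code: a useful app can be mostly glue between services other people already wrote. Both service calls below are hypothetical stand-ins invented for illustration (a geocoding API and a weather API), not real endpoints.

```python
# An app as a jigsaw puzzle: your code is mostly glue between services.
# Both "services" here are hypothetical stand-ins, not real APIs.

def geocode(address: str) -> dict:
    # Stand-in for a call to a geocoding API.
    return {"lat": 38.54, "lon": -121.74}

def forecast(lat: float, lon: float) -> dict:
    # Stand-in for a call to a weather API.
    return {"summary": "sunny", "high_f": 75}

def weather_for(address: str) -> dict:
    # The only code "you" actually write: composing the two services.
    loc = geocode(address)
    return forecast(loc["lat"], loc["lon"])

print(weather_for("1 Shields Ave, Davis, CA")["summary"])  # sunny
```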

Jeremy: Right. It's that whole Lego concept, just sticking those building blocks together. So, you mentioned, though, some of your students have never even heard of an API or they don't know what an API is. So, I'm just thinking from the perspective of an absolute beginner, how do you scope a project for an absolute beginner to get them to somewhere where they actually have something that gets them to that "aha moment," makes them feel like, "Hey, I've actually done something interesting here," but not overwhelm them with things like OpenAPI Spec 3.0? You know what I mean? All this kind of stuff.

Daniel: Yeah. I think one of the most important things when you're designing a curriculum is understanding the pain points of the student. So, this curriculum was designed by a bunch of students. I'm not the only one that wrote this curriculum. This curriculum had a lot of contributors from all over the world who are high school and college students. We knew that we didn't want to go too in-depth from the beginning because we have a lot of students from non-traditional backgrounds that don't have a lot of previous knowledge. So, what we try to do is set up guide rails and have boilerplates and things like that to ensure that they're successful. Because the worst thing you can do when you're working with a junior developer is just overwhelm them with information and have really, really hard assignments that lead to frustration.

So, we try to make that path really, really easy. But instead, what we try to do is have stretch goals or have extra-curricular assignments where they can apply what they have learned. So, if they're a little bit more advanced and they're getting the concepts and they're understanding at a deeper level how things work, they're able to practice and hone those skills. So, what we do is we try to work with our mentors, our fabulous mentors who are engineers in the industry, to help students code those stretch goals and help them understand at a deeper level if they have the capacity to do so. So, we try to customize the experience for every student based on their previous experience.

Jeremy: Right. And I think another important thing is setting expectations with the students as well. I mean, you mentioned earlier that this isn't a bootcamp that you're guaranteeing $100,000 salaries when you walk out. I think that that is something, to me, where I think that level of honesty and truth is really important because I think there are a lot of these eight to 12-week boot camps that over-promise. And I don't know. I mean, I've been doing this for 24 years, and I don't feel like I'm an expert on anything and I've been doing it for a very long time. So, eight weeks doesn't get you to be an expert in anything, but if you can become productive, that's pretty exciting.

Daniel: Yeah. Our goal is not to get you a six-figure job. Because that would be nice, but I feel like that's straight-up lying. Because I don't know all the students personally before they start, and I can't promise them a six-figure job. That's just ridiculous to me. But what I can promise is that you will ship an app. That's what I can promise. And I feel like when you're shipping an app and you're writing code to build an actual app that will work, you learn so much. You learn how to plan for a software project, how to ask questions, how to look for things on Google. So, that's what we promise: the experience, not necessarily the shiny six-figure salary. Even though I wish I could promise that. That would be amazing.

Jeremy: Right. Yeah. And I think probably the greatest skill you can teach anyone as a developer is how to Google and how to use Stack Overflow.

Daniel: Definitely.

Jeremy: All right. So, you mentioned something about customizing, trying to make sure that the curriculum is adapted for the particular student. So, tell me a little bit more about that because that sounds really interesting.

Daniel: Yeah. So, one of the reasons that I find our content and curriculum really special is that it's open-ended. It's not like they're programming exactly what every other student programs. So, for the first four weeks, we teach how serverless functions work, how to set up your development environment, everything through pair programming. So, instead of having lectures, we have senior engineers actually pair program with junior developers, younger students, or students with less experience, so they can ask questions in the chat and learn as they are doing it with a mentor. And during the last four weeks, we actually have the students apply the things they've learned in the first four weeks through pair programming in their own applications. So, we tell them, "Hey, by week one, you should have this part of your project done. Week two, you should have this part of your project done." But we don't really specify exactly what their project should be. So, at the end of the camp, every single student has a different project they have built based on the interests they have, which has been really awesome to see.

Jeremy: Well, that's also great, too. It's one of those things where, when your English teacher forces you to read Romeo and Juliet and you're not interested in Shakespeare, it's really hard to excel at that sometimes. So, letting people pick and choose where they go is, again, just a really good motivator. And again, just not over-promising. Just teaching people some of the basics, and then you have something to work on, something to iterate on, something to go a little bit deeper on and start understanding. If you just know that there are headers when you call an API, then you can maybe start doing some research as to what the other headers are and what you can do with them. And I think that level of curiosity would be really great for somebody and again, would excite them and get them going down that path.

Daniel: Definitely. And I think the best way that students learn is actually trying to implement the things they have in their head. Because some of these projects that students have built for their capstone projects have been very, very complicated using serverless functions. One of the students actually built a Dropbox clone using serverless functions, and it was actually amazing. I couldn't do that, honestly, but she built it in three weeks, I think. So, what I find really, really impressive and amazing every cohort we have is the creativity, the variance in projects that we have for every single student.

Jeremy: Right. Yeah. So, what are some of those projects? Because I think that'd be really interesting. Just give some examples of the sort of things you can build, right? Because the "Hello, World" tutorials are out there. People can go and probably cobble something together, but it sounds to me like the students that you have are building something that is, maybe not production-ready, but something that solves a real problem with a real solution. So, what are some of those different projects?

Daniel: Definitely. One of our students, Bo, built an IoT heart rate monitor that connected to a serverless function. So, every time the heartbeat went over a certain number, it would send a Twilio text message to the family members of whoever was subscribed to that particular heartbeat monitor. And he built that because his grandfather was suffering from some heart issues, and it was really important to his family that they knew he was doing okay. They got alerted every time his heartbeat got too fast. So, he actually built this whole thing using a Raspberry Pi. He had a heartbeat sensor that was attached to a bracelet, and it actually connected to a serverless function. And he demoed it and actually did jumping jacks to get his heart rate up. It actually worked, which was super awesome. We got to see it during our demo day.
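For anyone curious what the core of a project like that might look like, here is a minimal sketch of the threshold-and-alert logic described above. Everything in it is an assumption for illustration: the 120 bpm limit, the function names, and the injected `send_sms` callback standing in for a real Twilio client (the actual student project ran on Azure Functions with the real Twilio API):

```python
# Hypothetical sketch of the heart-rate alerting logic; all names and the
# threshold value are illustrative assumptions, not the student's real code.

HEART_RATE_LIMIT = 120  # beats per minute; an assumed threshold


def should_alert(bpm: int, limit: int = HEART_RATE_LIMIT) -> bool:
    """Return True when a reading exceeds the configured limit."""
    return bpm > limit


def handle_reading(bpm: int, send_sms) -> bool:
    """Call the injected SMS sender (e.g. a wrapper around a Twilio client)
    when a reading is too high. Returns whether an alert was sent."""
    if should_alert(bpm):
        send_sms(f"Alert: heart rate reached {bpm} bpm")
        return True
    return False


# Example usage: collect alerts in a list instead of actually texting anyone.
sent = []
handle_reading(90, sent.append)   # below the limit, no alert
handle_reading(150, sent.append)  # above the limit, one alert
```

In a real deployment, the sender callback would wrap an SMS service such as Twilio's REST client, and the handler would be invoked by the serverless runtime whenever the Raspberry Pi posts a new reading.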

Another student built a face mask detector. So, she would have someone take a picture on her website wearing a face mask. And it would tell, using some cognitive APIs, whether the person was wearing a mask or not. And she designed that because she knew a lot of local businesses that didn't have staff right at the entrance of the business, and she wanted a solution where the owners could make sure someone was wearing a mask before they entered the establishment. So, that was a really cool project. There was another student, actually in his forties, a mining engineer who wanted to make a career change. He built this awesome serverless function that sent out earthquake notifications based on data from the government, which was really, really awesome as well. So, there are so many projects that students have built with serverless functions, ranging from IoT to big data, and so many things that I've learned, actually, by watching all these projects being built.
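The earthquake-notification idea can be sketched in a similarly hedged way. Government sources such as the USGS publish public GeoJSON earthquake feeds, and a serverless function could filter a feed by magnitude before notifying subscribers. The sample data, the 5.0 threshold, and the function name below are illustrative assumptions, not the student's actual code:

```python
import json

# Hypothetical sketch: filter a USGS-style GeoJSON earthquake feed by
# magnitude. The structure mirrors the public USGS feeds, but the sample
# data and threshold here are made up for illustration.

SAMPLE_FEED = json.dumps({
    "features": [
        {"properties": {"mag": 3.2, "place": "10km N of Townsville"}},
        {"properties": {"mag": 5.8, "place": "Offshore somewhere"}},
    ]
})


def significant_quakes(feed_json: str, min_magnitude: float = 5.0):
    """Return (magnitude, place) pairs at or above the threshold."""
    feed = json.loads(feed_json)
    return [
        (f["properties"]["mag"], f["properties"]["place"])
        for f in feed["features"]
        if f["properties"]["mag"] >= min_magnitude
    ]


alerts = significant_quakes(SAMPLE_FEED)
# each surviving entry could then be turned into a notification
```

A scheduled serverless function could fetch the live feed on a timer, run a filter like this, and hand each match to a notification service.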

Jeremy: Yeah. That's amazing. And actually I think that something that's really interesting, you mentioned the gentleman with the career change, is that developers, I think, especially career developers, I mean, we get narrowly focused on solving software problems, right?

Daniel: Exactly.

Jeremy: And we maybe don't think so much about some of these other real-world problems that exist. So, that idea of taking your existing life experiences and problems that you've been dealing with and have a solution maybe in your head, but you can't express that. That's really frustrating, right? So, being able to do something like this and being able to express that, I think that's absolutely amazing.

Daniel: Definitely. And I think this is one of the reasons why I find this program really rewarding for both students and the people who actually run the program because they see folks who have zero experience getting to the point where they can build the things that are in their head, which I think is magical.

Jeremy: Yeah. No, I totally agree. Also, I think you said there's some other case studies on the blog?

Daniel: Yeah. So, if you go to bitproject.org and go to the blog, we have a bunch of case studies that are still being uploaded. So, every week we're going to have new student projects that are going to be uploaded there. So, if you want to see some of the cool stuff that our students have built, feel free to go check it out.

Jeremy: Awesome. All right. So, you just mentioned that this is a really rewarding thing. And I know for me, I do a lot of open source projects. I try to help as many people as I can. I don't run a nonprofit that runs courses. Maybe someday. But I do get exactly what you're saying because it is great to get that feedback, to see someone be successful because you've helped enable that. So, I know you're looking for mentors, right?

Daniel: Yeah, definitely. We're looking for mentors who have previous experience or passion with serverless to mentor students, to get them to that point where they can build their own apps. So, we'd love to have you if you are interested and have a couple of hours per week to spare.

Jeremy: Right. What's the requirement or the time commitment? It's just a few hours a week?

Daniel: We recommend four to five hours a week to just work directly one-on-one with the student, and previous experience in serverless or just regular full-stack development is quite encouraged because we want to make sure that you are able to answer some of the technical questions that students might have around the content.

Jeremy: Right. And you mentioned that, again, just going back to the mining example, but it sounds like that gentleman was a little bit older. So, what's the age range of the students that you have in this program?

Daniel: We don't have a minimum or maximum age that we accept. We just care about passion and the willingness to complete the program. Because the program is completely free, the standard that we set for our applicants is not of experience, but more of passion and desire to learn and become a successful developer.

Jeremy: Right, right. Yeah. So, what about for mentors or people who are looking to do this? Again, I know it's rewarding to work with people and to help people. You know it's rewarding. What can you tell people, though, that might be interested in this? What are some of the other benefits, I guess, of being a mentor?

Daniel: Yeah. Some of the really cool benefits I've seen is that we've been working directly with the Azure Functions team at Microsoft to mentor our students, because they are using Azure Functions as the platform to host their serverless functions. And we've actually had PMs who are building Azure Functions work with our students to get new ideas for product features, as well as engineers getting direct feedback on features they worked on only a couple of weeks prior. Which I think is quite magical, because I've seen the PMs who are building the product the students are using, and the students are very blunt, let me tell you. So they're like, "This feature makes no sense." Then, a couple of weeks later, it's magically fixed for some reason. I don't know how that could have happened, but things get resolved quite quickly when the student feedback comes in.

Jeremy: Yeah. And the other thing is that, again, it's feedback that's, I guess, untainted by the experience of being a developer, right?

Daniel: Definitely.

Jeremy: So, it's like that childhood honesty that is what probably every product management team needs to figure that stuff out. So, all right. Well, so where are you going with this? What do you hope to do with Bit Project? I mean, is this something you want to grow or you want to add more courses? What's the future?

Daniel: Definitely. That's a great question. So, as I work for New Relic, we are pivoting to create more content, more courses, and more interactive learning materials and experiences in the DevOps field. So, right now we're creating content around observability, around container orchestration, things like that, that are more niche skills students could learn to better their chances of getting a job as a site reliability engineer or a DevOps engineer. But most importantly, right now what we're trying to do is make sure that we're ready to scale as soon as possible, because we feel like we have something really special here where we're teaching students how to ship apps, not just to learn specific concepts like HTML or CSS. I think we have a really unique model here of how we're teaching students and how we're working with industry, leveraging cloud advocates and engineers who want to volunteer their specialized skills to better the community. So, right now I see the future as us helping train and lead more of the engineers of the future, so we can have better services and a better internet, hopefully, in a couple of years or a couple of decades.

Jeremy: Well, it's a very noble goal. And so what about data science? I know there's a thing on the site about data science and I think you're doing some work with universities around that, right?

Daniel: Yeah, definitely. So, we have a program called Bit University where we create these really easy-to-integrate data science courses for humanities classrooms, because there's a huge demand right now for humanities students to get data science experience, to get research opportunities as well as job opportunities. But a lot of them actually don't have access to data science courses because they're a humanities major, especially at smaller schools. So, what we do is we partner with universities like Cal State Fullerton and Sacramento State University to provide data science courses specifically tailored for humanities majors at these schools partnering with professors. So, yeah, that's the program and it's been super successful and we've had so many humanities students learn the basic skills they need to get these internships and these research opportunities, which has been really rewarding.

Jeremy: Yeah. That's awesome. So Daniel, is there anything else you want to tell the listeners about Bit Project?

Daniel: Definitely. Yeah. So, if you or your company want to help us make more technical content, like let's say you work in DevOps, or you work with serverless functions and you want to extend the work we're doing, especially if you're an advocate, please reach out to me. I'm on Twitter. I'm on email. So, please reach out to me to work together on more technical content, because my job is to make things more accessible. So, if you want to make anything, whether it's your area of expertise or something you think could be more accessible, I'd love to work with you to make sure that happens. And that is a free resource that's available to the community. That's my plug.

Jeremy: That's awesome.

Daniel: Reach out to me.

Jeremy: That's awesome. No, I love it. Daniel, I appreciate, one, you being here and sharing this with everybody, but also the work that you're doing with the community is just amazing. The more people we can get into serverless and the more people we can get to understand this next generation of, I don't know, applications, I guess you want to call it, is absolutely a very, very noble goal. So, you mentioned Twitter. So, it's just learnwdaniel, right?

Daniel: Definitely. Yeah.

Jeremy: And then also the Bit Project has a Twitter, just bitPRJ. And then if you're interested in volunteering, you go to bitproject.org/volunteer. And students, if students want to sign up, how do they do that? They just go to bitproject.org?

Daniel: Yeah. You can go directly to apply at bitproject.org, or if you want more information about the program, just go to bitproject.org. There's a huge banner at the top that will lead you directly to that website.

Jeremy: Awesome. All right. Well, I will make sure I get all that into the show notes. Thanks again, Daniel.

Daniel: Thank you.

2021-04-26

Episode #97: How Serverless Fits in to the Cyclical Nature of the Industry with Gojko Adzic

About Gojko Adzic

Gojko Adzic is a partner at Neuri Consulting LLP. He is one of the 2019 AWS Serverless Heroes, the winner of the 2016 European Software Testing Outstanding Achievement Award, and the 2011 Most Influential Agile Testing Professional Award. Gojko's book Specification by Example won the Jolt Award for the best book of 2012, and his blog won the UK Agile Award for the best online publication in 2010.

Gojko is a frequent speaker at software development conferences and one of the authors of MindMup and Narakeet.

As a consultant, Gojko has helped companies around the world improve their software delivery, from some of the largest financial institutions to small innovative startups. Gojko specializes in agile and lean quality improvement, in particular impact mapping, agile testing, specification by example, and behavior driven development.

Twitter: @gojkoadzic
Narakeet: https://www.narakeet.com
Personal website: https://gojko.net

Watch this video on YouTube: https://youtu.be/kCDDli7uzn8

This episode is sponsored by CBT Nuggets: https://www.cbtnuggets.com/

Transcript
Jeremy: Hi everyone, I'm Jeremy Daly and this is Serverless Chats. Today my guest is Gojko Adzic. Hey Gojko, thanks for joining me.

Gojko: Hey, thanks for inviting me.

Jeremy: You are a partner at Neuri Consulting, you're an AWS Serverless Hero, you've written I think, what? I think 6,842 books or something like that about technology and serverless and all that kind of stuff. I'd love it if you could tell listeners a little bit about your background and what you've been working on lately.

Gojko: I'm a developer. I started developing software when I was six and a half. My dad bought a Commodore 64 and I think my mom would have kicked him out of the house if he told her that he bought it for himself, so it was officially for me.

Jeremy: Nice.

Gojko: And I was the only kid in the neighborhood that had a computer, but didn't have any ways of loading games on it because he didn't buy it for games. I stayed up and copied and pasted PEEKs and POKEs in a book I couldn't even understand until I made the computer make weird sounds and print rubbish on the screen. And that's my background. Basically, ever since, I only wanted to build software really. I didn't have any other hobbies or anything like that. Currently, I'm building a product for helping tech people who are not video editing professionals create videos very easily. Previously, I've done a lot of work around consulting. I've built a lot of product that is used by millions of school children worldwide collaborate and brainstorm through mind-mapping. And since 2016, most of my development work has been on Lambda and on team stuff.

Jeremy: That's awesome. I joke a little bit about the number of books that you wrote, but the ones that you have, one of them's called Running Serverless. I think that was maybe two years ago. That is an excellent book for people getting started with serverless. And then, one of my probably favorite books is Humans Vs Computers. I just love that collection of tales of all these things where humans just build really bad interfaces into software and just things go terribly.

Gojko: Thank you very much. I enjoyed writing that book a lot. One of my passions is finding edge cases. I think people with a slight OCD like to find edge cases and in order to be a good developer, I think somebody really needs to have that kind of intent, and really look for edge cases everywhere. And I think collecting these things was my idea to help people first of all think about building better software, and to realize that stuff we might glance over like, nobody's ever going to do this, actually might cause hundreds of millions of dollars of damage ten years later. And thanks very much for liking the book.

Jeremy: If people haven't read that book, I don't know, when did that come out? Maybe 2016? 2015?

Gojko: Yeah, five or six years ago, I think.

Jeremy: Yeah. It's still completely relevant now though and there's just so many great examples in there, and I don't want to spend the whole time talking about that book, but if you haven't read it, go check it out because it's these crazy things like police officers entering in no plates whenever they're giving parking tickets. And then, when somebody actually gets that plate, they end up with thousands of parking tickets, and it's just crazy stuff like that. Or, not using the middle initial or something like that for the name, or the birthdate or whatever it was, and people constantly getting just ... It's a fascinating book. Definitely check that out.

But speaking of edge cases and just all this experience that you have just dealing with this idea of, I guess finding the problems with software. Or maybe even better, I guess a good way to put it is finding the limitations that we build into software mostly unknowingly. We do this unknowingly. And you and I were having a conversation the other day and we were talking about way, way back in the 1970s. I was born in the late '70s. I'm old but hopefully not that old. But way back then, time-sharing was a thing where we would basically have just a few large computers and we would have to borrow time against them. And there's a parallel there to what we were doing back then and I think what we're doing now with cloud computing. What are your thoughts on that?

Gojko: Yeah, I think absolutely. We are I think going in a slightly cyclic way here. Maybe not cyclic, maybe spirals. We came to the same horizontal position but vertically, we're slightly better than we were. Again, I didn't start working then. I'm like you, I was born in the late '70s. I wasn't there when people were doing punch cards and massive mainframes and time-sharing. My first experience came from home PC computers and later PCs. The whole serverless thing, people were disparaging about that when the marketing buzzword came around. I don't remember exactly when serverless became serverless because we were talking about microservices and Lambda was a way to run microservices and execute code on demand. And all of a sudden, I think the JAWS people realized that JAWS is a horrible marketing name, and decided to rename it to serverless. That was probably 2017 or something like that. 2000 ...

Jeremy: Something like that, yeah.

Gojko: Something like that. And then, because it is a horrible marketing name, but it's catchy, it caught on and then people were complaining how it's not serverless, it's just somebody else's servers. And I think there's some truth to that, but actually, it's not even somebody else's servers. It really is somebody else's mainframe in a sense. You know in the '70s and early '80s, before the PC revolution, if you wanted to be a small software house or a small product operator, you probably were not running your own data center. What you would do is you would rent it based on paying for time to one of these massive, massive, massive operators. And in fact, we ended up with AWS being a massive data center. As far as you and I are concerned, it's just a blob. It's not a collection of computers, it's a data center we rent something from, and Google is another one and then Microsoft is another one.

And I remember reading a book about Andy Grove, who was the CEO of Intel, where they were thinking about the market for PC computers in the late '70s when somebody came to them with the idea that they could repurpose what became an 8080 processor. They were doing this I think for some Japanese calculator, and then somebody said, "We can attach a screen to this and make this a universal computer and sell it." And they realized maybe there's a market for four or five computers in the world like that. And I think that that's ... You know, we ended up with four or five computers, it's just the definition of a computer changed.

Jeremy: Right. I think that's a good point because you think about after the PC revolution, once the web started becoming really big, people started building data centers and colocation facilities like crazy. This is way before the cloud, and everybody was buying racks, and Dell was getting really popular because people were buying servers from Dell, and installing these in their data centers and doing this. And it just became this massive, whole industry built around doing that. And then you have these few companies that say, "Well, what if we just handled all that stuff for you? Rather than just racking stuff for you," but started just managing the software, and started managing the networking, and the backups, and all this stuff for you? And that's where the cloud was born.

But I think you make a really good point where the cloud, whatever it is, Amazon or Google or whatever, you might as well just assume that that's just one big piece of processing that you're renting and you're renting some piece of that. And maybe we have. Maybe we've moved back to this idea where ... Even though everybody's got a massive computer in their pocket now, tons of compute power, in terms of the real business work that's being done, and the real global value, and the things that are powering global commerce and everything else like that, those are starting to move back to run in four, five, massive computers.

Gojko: Again, there's a cyclic nature to all of this. I remember reading about the advent of power networks. Because before people had electric power, there were physical machines and movement through physical power, and there were water-powered plants and things like that. And these whole systems of shafts and belts and things like that powering factories. And you had this one central power source in a factory that was somewhere in the middle, and then from there, you actually had physical belts, rotating cogs in other buildings, and those were rotating some shafts that were rotating other cogs, and things like that.

At first, when people were able to package up electricity into something that's distributable, they were running their own small electricity generators next to these big massive machines that were powering early factories. And one of the first effects of that was they could reclaim up to 30% of their factory space, because up to 30% of the workspace in the factory was taken up by all the belts and shafts. And all that movement was producing a lot of air movement and a lot of dust, and people were getting sick. But now, you just plug in a cable and you no longer have all this bad air, and you don't have employees getting sick and things like that. Things started changing quite a lot, and then all of a sudden, you had this completely new revolution where you no longer had to operate your own electric generator. You could just plug in and get power from the network.

And I think part of that is again, cyclic, what's happening in our industry now, where, as you said, we were getting machines. I used to make money as a Linux admin a long time ago and I could set up my own servers and things like that. I had a company in 2007 where we were operating our own gaming system, and we actually had physical servers in a physical server room with all the LEDs and lights, and bleeps, and things like that. Around that time, AWS really made it easy to get virtual machines on EC2 and I realized how stupid the whole "let's manage everything ourselves" approach is. But, we are getting to the point where people had to run their own generators, and now you can actually just plug into the electricity network. And of course, there is some standardization. Maybe the U.S. still has 110 volts and Europe has 220, and we never really got global standardization there.

But I assume before that, every factory could run their own voltage they wanted. It was difficult to manufacture for these things but now you have standardization, it's easier for everybody to plug into the ecosystem and then the whole ecosystem emerged. And I think that's partially what's happening now where things like S3 is an API or Lambda is an API. It's basically the electric socket in your wall.

Jeremy: Right, and that's that whole Wardley maps idea, they become utilities. And that's the thing where if you look at that from an enterprise standpoint or from a small business standpoint if you're a startup right now and you are ordering servers to put into a data center somewhere unless you're doing something that's specifically for servers, that's just crazy. Use the cloud.

Gojko: This product I mentioned that we built for mind mapping, there's only two of us in the whole company. We do everything from presales, to development, testing, and support, to everything. And we're competing with companies that have several orders of magnitude more employees, and we can actually compete and win because we can benefit from this ecosystem. And I think this is totally wonderful and amazing, and for anybody thinking about starting a product, it's easier to start a product now than ever. And another thing that I think is totally crazy about this whole serverless thing is how, in effect, we got a bookstore to offer it first.

You mentioned the word utility. I remember I was the editor of a magazine in 2001 in Serbia, and we had licensing with IDG to translate some of their content. And I remember working on this piece from I think PC World in the U.S. where they were interviewing Hewlett Packard people about utility computing. And people from Hewlett Packard back then were predicting that in a few years' time, companies would not operate their own stuff, they would use utility computing and things like that. And it's totally amazing that in order to reach us over there, that had to be something that was already evaluated and tested, and there was probably a prototype and things like that. And you had all these giants. Hewlett Packard in 2001 was an IT giant. Amazon was just up-and-coming then and they were a bookstore then. They were not even anything more than a bookstore. And you had, what? A decade later, the tables completely turned where HP's ... I don't know ...

Jeremy: I think they bought Compaq at some point too.

Gojko: You had all these giants, IBM completely missed it. IBM totally missed ...

Jeremy: It really did.

Gojko: ... the whole mobile and web and everything revolution. Oracle completely missed it. They're trying to catch up now but fat chance. Really, we are down to just a couple of massive clouds, or whatever that means, that we interact with as we're interacting with electricity sockets now.

Jeremy: And going back to that utility comparison, or, not really a comparison. It is a utility now. Compute is offered as a utility. Yes, you can buy and generate compute yourself and you can still do that. And I know a lot of enterprises still will. I think cloud is like 4% of the total IT market or something. It's a fraction of it right now. But just from that utility aspect of it, from your experience, you mentioned you had two people and you built, is it MindMup.com?

Gojko: MindMup, yeah.

Jeremy: You built that with just two people and you've got tons of people using it. But just from your experience, especially coming from the world of being a Linux administrator, which, again, I didn't administer ... Well, I guess I was. I did a lot of work in data centers in my younger days. But coming from that idea and seeing how companies were building in the past and how companies are still building now, because not every company is using the cloud, far from it, and not taking advantage of that utility, what are those major disadvantages? How badly do you think that's going to slow down companies that are trying to innovate?

Gojko: I can give you a story about MindMup. You mentioned MindMup. When was it? 2018, when the Intel processor vulnerabilities were discovered.

Jeremy: Right, yes.

Gojko: I'm not entirely sure what the year was. A few years ago, anyway. We got an email from a concerned university admin when the second one was discovered. The first one made all the news, and a month later a second one was discovered. By then everybody knew about it; people were in a panic and things like that. After the second one was discovered, we got an email from a university admin. And universities are big users; they need to protect their data and things like that. And he was insisting that we tell him what our plan was for mitigating this thing, because he knew we were on the cloud.

I'm working on European time. The customer was in the U.S., probably somewhere U.S. Pacific, because it arrived in the middle of the night. I woke up, still trying to get my head around it and drinking coffee, and there's this long CVE number that he sent me. I have no idea what it's about. I took it, pasted it into Google to figure out what's going on. The first result I got from Google was that AWS Lambda was already patched. Copy, paste, my day's done. And I assume lots and lots of other people were having a totally different conversation with their IT department that day. And that's why I said I think for products like the one I'm building with video, and for MindMup, being able to rent operations as a utility, really totally rent ops as a utility and not have to worry about anything below my unique business level, is really, really important.

And yes, we could hire people to work on that, and it could even end up being slightly cheaper technically, but in terms of my time and where my focus goes and my interruptions, I think deploying on a utility platform, whatever that utility platform is, as long as it's reliable, lets me focus on adding value where I can actually add value. That makes my product unique rather than the generic stuff.

Jeremy: You mentioned the video product that you're working on too, and something that is really interesting I think too about taking advantage of the cloud is the scalability aspect of it. I remember, it was maybe 2002, maybe 2003, I was running my own little consulting company at the time, and my local high school always has a rivalry football game every Thanksgiving. And I thought it'd be really interesting if I was to stream the audio from the local AM radio station. I set up a server in my office with ReelCast Streaming or something running or whatever it was. And I remember thinking as long as we don't go over 140 subscribers, we'll be okay. Anything over that, it'll probably crash or the bandwidth won't be enough or whatever.

Jeremy: And that's just one of those things now, if you're doing any type of massive processing or you need bandwidth, bandwidth alone ... I remember T1 lines being great and then all of a sudden it was like, well, now you need a T3 line or something crazy in order to get the bandwidth that you need. Just from that aspect of it, the ability to scale quickly, that just seems like such a huge blocker for companies that need to order and provision servers, maybe get a utility company to come in and install more bandwidth for them, and things like that. That's just stuff that's so far out of scope for building a business to me. At least building a software business, or building any business. It's crazy.

Gojko: When I was doing consulting, I did a bit of work for what used to be one of the largest telecom companies in the world.

Jeremy: Used to be.

Gojko: I don't want to name names on a public chat. Somewhere around 2006, '07 let's say, we did a software project where they just needed to deploy it internally. And it took them seven months to provision a bunch of virtual machines to deploy it internally. Seven months.

Jeremy: Wow.

Gojko: Because of all the red tape and all the bureaucracy and all the waiting for capacity and things like that. That's around the time when Amazon's EC2 became commercially available. I remember working with another client, and they were waiting for some servers to arrive so they could install more capacity. And I remember just turning on the Amazon console. I didn't have anything useful to run on it then, but just being able to start up a virtual machine in, I think it was less than half an hour, that was totally fascinating back then. Here's a new Linux machine, and in less than half an hour you can use it. And it was totally crazy. Now we're getting to the point where Lambda will start up in less than 10 milliseconds or something like that. Waiting for that kind of capacity is just insane.

With the video thing I'm building, because of Corona and all of this remote teaching stuff, for some reason we ended up getting lots of teachers using the product. It was one of these half-baked experiments, because I didn't have time to build the full user interface for everything, and I realized that lots of people were using PowerPoint to prepare that kind of video. I thought, well, how about if I shorten that loop: just take your PowerPoint and convert it into video. Just type up what you want in the speaker notes, and we'll use neural text-to-speech to generate the audio and things like that. Teachers like it for one reason or another.

We had this influential blogger from Russia explain it on his video blog, and then it got picked up, my best guess from what I could see through Google Translate, by some virtual meeting of teachers in Russia where they recommended people try it out. I woke up the next day and the metrics went totally crazy, because a significant portion of teachers in Russia had tried my tool overnight in a short space of time. Something like that, I couldn't predict. It's lovely, but as you said, as long as we don't go over a hundred subscribers, we're fine. If I was in a situation like that, the thing would completely crash because it's unexpected. We'd have a thing that's amazingly good for marketing that would be amazingly bad for business, because it would exhaust all the capacity we had. Or we'd have had to prepare for a lot more capacity than we needed. But because this is all running on Lambda, Fargate, and other auto-scaling things, it's just fine. No sweat at all. It was a lovely thing to see actually.

Jeremy: You actually have two problems there if you're not running in the cloud or not running on-demand compute. One is that you would've potentially failed, things would've fallen over, you would've lost all those potential customers, and you wouldn't have been able to grow.

Gojko: Plus you've lost paying customers who are using your systems, who've paid you.

Jeremy: Right, that's the other thing too. But on the other side of that problem, you can't necessarily anticipate some of those things. What do you do? Over-provision and just hope that maybe someday you'll get whatever? That's the crazy thing where the elasticity piece of the cloud, to me, is such a no-brainer. Because I know people always talk about, well, if you have predictable workloads. Well yeah, I know we have predictable workloads for some things, but if you're a startup or you're a business that has like ... Maybe you pick up some press. I worked for a company where we picked up some press. We had 10,000 signups in a matter of like 30 seconds and it completely killed our backend MySQL database. Those are hard to prepare for if you're hosting your own equipment.

Gojko: Absolutely, and not only if you're hosting your own.

Jeremy: Also true, right.

Gojko: Before moving to Lambda, the app was deployed to Heroku. There, you basically need to predict how many virtual machines you need. Yes, it's in the cloud, but if you're running on EC2 and you have your 10, 50, 100 virtual machines, whatever, running there, and all of a sudden you get a lot more traffic, will it scale or will it not scale? Have you designed it to scale like that? And one of the best things that I think Lambda brought as a constraint was forcing people to design this stuff in a way that scales.

Jeremy: Yes.

Gojko: I can deploy stuff in the cloud and make it a distributed monolith, so it doesn't really scale well, but with Lambda, because it was so constrained when it launched, and this is one thing you mentioned, partially we're losing those constraints now, but it was so constrained when it launched that it was really forcing people to design things that were easy to scale. We had total isolation, there was no way of sharing things, there was no session stickiness and things like that. And then you have to come up with actually good ways of resolving that.

I think one of the most challenging things about serverless is that even a Hello World is a distributed transaction processing system, and people don't get that. They think, well, I had this DigitalOcean five-dollar-a-month server and it was running my, you know, Rails app just fine. I'm just going to use the same ideas to redesign it in Lambda. Yes, you can, but then you're not going to really get the benefits of all of this other stuff. And if you design it as a massively distributed transaction processing system from the start, then yes, it scales like crazy. It scales up and down and it's lovely. But as Lambda's maturing, I have this slide deck that I've been using since 2016 to talk about Lambda at conferences. And every time I need to do another talk, I pull it out and adjust it a bit. And I have this whole Git history of it, because I do markdown to slides and I keep the markdown in Git so I can go back. There's this slide about limitations where originally it's only ... I don't remember what the time limitation was, but something very short.

Jeremy: Five minutes originally.

Gojko: Yeah, something like that, and then it was no PCI compliance, and the retries are difficult, and all of this stuff basically became solved. And one of the last things that was there was: don't even try to put it in a VPC. You definitely can, but it's going to take 10 minutes to start. Now that's reasonably okay as well. One thing that I remember as a really important design constraint was that it was effectively a share-nothing platform, because you could not easily share data between two Lambdas running at the same time on the same VM. Now that we can connect Lambdas to EFS, you effectively can do that as well. You can have two Lambdas, one writing into an EFS, the other reading the same EFS at the same time. No problem at all. One can pump data into a file and the other can just read the file and get the data out.
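
The EFS pattern Gojko describes can be sketched with a pair of handlers. This is a hypothetical illustration, not MindMup's actual code; the mount path and the optional `mount` parameter (added here so the sketch can run outside Lambda) are assumptions:

```python
import os

# Hypothetical mount point; in a real function this is the EFS access
# point configured on the Lambda (e.g. /mnt/shared).
EFS_MOUNT = os.environ.get("EFS_MOUNT", "/mnt/shared")

def writer_handler(event, context, mount=None):
    """One Lambda appends records to a file on the shared EFS volume."""
    mount = mount or EFS_MOUNT
    path = os.path.join(mount, "records.log")
    with open(path, "a") as f:
        f.write(event["record"] + "\n")
    return {"written": event["record"]}

def reader_handler(event, context, mount=None):
    """A second Lambda, possibly running concurrently, reads the same
    file from the same EFS volume — exactly the data sharing that the
    original share-nothing Lambda platform ruled out."""
    mount = mount or EFS_MOUNT
    path = os.path.join(mount, "records.log")
    with open(path) as f:
        return {"records": f.read().splitlines()}
```

Running both against the same mount shows one function's writes becoming visible to the other, which is the constraint-loosening being discussed.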

As the platform is maturing, I think we're losing some of these design constraints, and sometimes constraints breed creativity. And yes, you still of course can design the system to be good, but it's going to be interesting to see. And this 15-minute limit that we have in Lambda now is just an artificial number that somebody thought up.

Jeremy: Yeah, it's arbitrary.

Gojko: And at some point, when somebody who is important enough asks AWS to give them half-hour Lambdas, they will get that. Or 24-hour Lambdas. It's going to be interesting to see if Lambda ends up as just another way of running EC2, and starting EC2, that's simpler because you don't have to manage the operating system. And I think the big difference we'll get between EC2 and Lambda is what percentage of ops your developers are responsible for, and what percentage of ops Amazon's developers are responsible for.

Because if you look at all these different offerings that Amazon has, like Lightsail and EC2 and Fargate and AWS Batch and CodeDeploy, I don't know how many other ways there are to run code on AWS. The big difference with Lambda, really, at least until very recently, was that apart from your application, Amazon is responsible for everything. But now we're losing design constraints: you can put a Docker container in, you can be responsible for the OS image as well, which is, again, a bit interesting to look at.

Jeremy: Well, I also wonder too, if you took all those event sources that you can point at Lambda and you add those to Fargate, what's the difference? It seems like they're just merging into two very similar products.

Gojko: For the video build platform, the last step runs in Fargate, because people are uploading things that are massive, massive, massive for video processing, and they just don't finish in 15 minutes. I have to run that in Fargate, and the big difference is that the container I packaged up for Fargate takes about 40 seconds to actually deploy for a new event. I can optimize that, but I can't optimize it too much. Fargate is still an order of magnitude of tens of seconds to process an event. I think as Fargate gets faster and as Lambda gets more of these capabilities, it's going to be very difficult to tell them apart.

With Fargate, you're intended to manage the container image yourself. You're responsible for patching software, you're responsible for patching OS vulnerabilities and things like that. With Lambda, unless you use a container image, Amazon is responsible for that. They come close. When looking at this video building thing for the first time, I was actually considering using CodeBuild for it, because CodeBuild is also a way to run things on demand in containers, and you can actually get quite decent machines with CodeBuild. And it's also event-driven, and Fargate is event-driven, AWS Batch is event-driven, and all of these things are converging on each other. And really, AWS is famous for having 10 products that effectively do the same thing and you can't tell them apart, and maybe that's where we'll end up.

Jeremy: And I'm wondering too, the thing that was great about Lambda, at least for me, like you said, was the share-nothing architecture, where you almost didn't have to rely on anything other than the event that came in and the processing of that Lambda function. And if you designed your systems well, you might have some bottleneck up front, but especially if you used distributed transactions and async invocations of downstream functions, where you could basically pass in the data a function needed, it wouldn't need to communicate with anything other than itself to process that data. The scale there was massive. You could just keep scaling and scaling and scaling. As you add things like EFS, that adds constraints in terms of the number of transactions and connections that it can make and all those sorts of things. Do these things become less reliable? By allowing it to do more, are we building systems that are less reliable, because we're not using some of those tried-and-true constraints that were there?

Gojko: Possibly, because every time you add a new moving part, you create one more potential point of failure there. And I think for me, one of the big lessons came when I was working on ... I spent a few years working on very high throughput transaction processing systems. That's why this whole thing rings a bell a lot. A lot of it really was figuring out what type of messages you send and where you send them. The craze for messaging in distributed transaction processing systems in the early 2000s created the whole craze of enterprise service buses that came later. We now have this ... What is it called? It's not called an enterprise service bus, it's called EventBridge, or something like that.

Jeremy: EventBridge, yes.

Gojko: That's effectively an enterprise service bus; it's just that the enterprise is the Amazon cloud. The big challenge in designing things like that is decoupling. It's realizing that when you have a complicated system like that, stuff is going to fail. And especially when you're operating close to hardware, stuff is going to fail badly, if only occasionally, and you need to not bring the whole house down when some storage starts working a bit slower. You create circuit breakers, you create layers and layers of stuff that disconnect things. I remember when we were looking originally at Lambdas, trying to get our heads around it and experimenting: should one Lambda call another? Or should one Lambda not call another? And things like that.

I realized, let's say for now, until we decide we want to do something else, a Lambda should only ever talk to SNS and nothing else. Or SQS, or something like that. When one Lambda completes, it's going to drop a message somewhere, and we need to design these messages to be good so that we can decouple different parts of the process. And so far, that has held up well as a constraint. Very, very few times do we have one Lambda calling another: mostly when we actually need a synchronous response back, and for security reasons we wanted to isolate something to a single Lambda, but that's effectively just black-box security isolation. So creating these isolation layers, through messages, through queues, through topics, becomes a fundamental part of designing these systems.
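
The "a Lambda only ever talks to SNS" rule above can be sketched like this. Everything here is a hypothetical illustration (the topic ARN, field names, and the injectable `publish` callable are assumptions); in production the publish call would be boto3's SNS `publish`:

```python
import json

# Hypothetical topic ARN; in practice this would come from the
# function's environment configuration.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:map-updated"

def make_event(map_id, action, payload):
    """Build the message this Lambda emits when it finishes. Designing
    this message shape well is the 'protocol' being described:
    downstream Lambdas depend only on the message, not on this code."""
    return {
        "TopicArn": TOPIC_ARN,
        "Message": json.dumps(
            {"mapId": map_id, "action": action, "payload": payload}
        ),
        "MessageAttributes": {
            "action": {"DataType": "String", "StringValue": action},
        },
    }

def handler(event, context, publish=None):
    """Do this function's one job, then hand off via SNS instead of
    invoking the next Lambda directly."""
    result = event["text"].upper()  # stand-in for the real work
    msg = make_event(event["mapId"], "transformed", result)
    if publish:  # in production: boto3 SNS client's publish(**msg)
        publish(**msg)
    return msg
```

Because the handler only emits a well-shaped message, any Lambda that subscribes to the topic can be rewritten or replaced without touching this one, which is the decoupling point being made.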

I remember speaking at a conference to somebody, I forget the name of the person, who was talking about airline systems. He was presenting after me and he said, "Look, I can relate to a lot of what you said." Apparently, and I'm not an airline programmer, he told me that in the airline community they talk about designing the protocol as the biggest challenge. Once you design the protocol between your components, the messages, who sends what where, you can recover from almost any other design flaw, because it's decoupled. So if you've made a mess in one Lambda, you can redesign that Lambda, throw it away, rewrite it, decouple things a different way. If the global protocol is good, you get all the flexibility. If you mess up the protocol for communication, then nothing's going to save you at the end.

Now we have EFS and Lambda can talk to an EFS. Should this Lambda talk directly to an EFS or should this Lambda just send some messages to a topic, and then some other Lambdas that are maybe reserved, maybe more constrained talk to EFS? And again, the platform's evolved quite a lot over the last few years. One thing that is particularly useful in that regard is the SQS FIFO queues that came out last year I think. With Corona ...

Jeremy: Yeah, whenever it was.

Gojko: Yeah, I don't remember if it was last year or two years ago. But one of the things it allows us to do is really run lots and lots of Lambdas in parallel while guaranteeing that no two Lambdas access the same business entity at the same time. For example, for this mind mapping thing, we have lots and lots of people modifying lots and lots of files in parallel, but we need to aggregate a single map. If we have 50 people over here working on one map and 60 people working on a different map, the aggregation can run in parallel, but I never, ever want the aggregation for two people modifying the same map to run in parallel.

And for Lambda, that was a massive challenge. You had to put Kinesis between one Lambda and other Lambdas, and things like that. Kinesis is provisioned capacity; it costs a lot and it doesn't auto-scale. But now with SQS FIFO queues, you can just send a message, and you can say the message group ID is this map ID that we have. Which means that SQS can run thousands of Lambdas in parallel, but it will never run more than one Lambda for the same map ID at the same time. Designing your protocols like that becomes how you decouple one end of your app that's massively scalable and massively parallel from another end of your app where you have some reserved capacity or limits.
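
The per-map serialization Gojko describes hangs on one field of the SQS FIFO send call. A minimal sketch, with the helper name, queue URL, and message shape being assumptions of this illustration:

```python
import hashlib
import json

def fifo_send_params(queue_url, map_id, change):
    """Build the parameters for send_message on an SQS FIFO queue.
    MessageGroupId is set to the map ID, so SQS will happily run many
    consumers in parallel across different maps but will never deliver
    two in-flight messages for the same map at once."""
    body = json.dumps({"mapId": map_id, "change": change})
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": map_id,  # serializes processing per map
        # Deduplication ID lets SQS drop accidental resends of the
        # same change (alternatively, enable content-based dedup).
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# In production these parameters would go straight to boto3, e.g.:
#   sqs = boto3.client("sqs")
#   sqs.send_message(**fifo_send_params(queue_url, "map-42", {"op": "add-node"}))
```

The queue URL would be a `.fifo` queue; the group ID is the only piece that implements the "one Lambda per map at a time" guarantee.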

Like this video thing: the original idea was to let me build marketing videos more easily, and I can't get rid of this accent. Unfortunately, everything I do sounds like I'm threatening someone to blackmail them. I'm like a cheap Bond villain, and that's not good, but I can't do anything else. I can pay other people to do it for me, and we used to do that, but then that becomes a big problem when you want to modify tiny things. We paid this lady to professionally record audio for a marketing video that we needed, and then six months later, we wanted to change one screen and now the narration is incorrect. And we paid the same woman again. Same person, but the sound is totally different because of two different sets of equipment.

Jeremy: Totally different, right.

Gojko: You can't just stitch it up. Then you end up like, okay, do we go and pay for the whole thing again? And I realized that neural text-to-speech has learned so much that it can do English better than I can. You're a native English speaker so you can probably defeat those machines, but I can't.

Jeremy: I don't know if I could. They're pretty good now. It's kind of scary.

Gojko: I started looking at it like, why don't I just put stuff in Markdown and use Markdown to generate videos and things like that? With all of these things, you still get quota limits. We were limited on Google; Google gave us something like five requests per second in parallel, and it took me a really long time to even raise these quotas and things like that. I don't want lots of people requesting stuff in parallel and then thrashing this other thing over there. We need to create these layers that run things within decent limits, and that's where I think designing the protocol for this distributed system becomes important.

Jeremy: I want to go back, because I think you bring up a really good point about a different type of architecture, the architectural design of decoupling systems and these event-driven things. You mentioned a Lambda function processes something and sends it to SQS, or sends it to SNS so it can do a fan-out pattern, or in the case of the FIFO queue, an ordered pattern for sequential processing, which are all great patterns. And there are things that AWS has added, such as Lambda destinations. If you run an asynchronous Lambda function, you used to have to write some code that said, "When this is finished processing, now call some other component." And that's just another opportunity for failure. They basically said, "Well, if it succeeds, then you can actually just forward it off to one of these other services automatically and we'll handle all of the retries and all the failures and that kind of stuff."

And those things have been added in to basically give you that warm and fuzzy feeling that if an event doesn't reach where it's supposed to go, some sort of cloud trickery will kick in and make sure it gets processed. But what that has introduced, I think, is a cognitive overload for a lot of developers designing these systems, because you're no longer just writing a script that does X, Y, and Z and makes a few database calls. Now you're saying, okay, I've got to write a script that can massively scale and take the transactions that I need to maybe parallelize, or that I maybe need to queue or delay or throttle or whatever, and pass those down to another subsystem. And then that subsystem has to pick those up, and maybe it has to parallelize them, or maybe there are failure modes in there, and I've got all these other things that I have to think about.
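
The Lambda destinations feature Jeremy describes boils down to a small configuration structure. A sketch of what it looks like; the function name and ARNs here are hypothetical:

```python
def destination_config(success_arn, failure_arn):
    """The DestinationConfig shape used by Lambda destinations for
    asynchronous invocations: on success, the invocation result is
    forwarded automatically to the success target (SQS, SNS,
    EventBridge, or another Lambda); on failure, the event goes to the
    failure target, with retries handled by the platform instead of
    hand-written glue code."""
    return {
        "OnSuccess": {"Destination": success_arn},
        "OnFailure": {"Destination": failure_arn},
    }

# Hypothetical ARNs; in practice this would be applied with boto3:
#   lambda_client.put_function_event_invoke_config(
#       FunctionName="process-upload",
#       MaximumRetryAttempts=2,
#       DestinationConfig=destination_config(
#           "arn:aws:sqs:us-east-1:123456789012:done-queue",
#           "arn:aws:sns:us-east-1:123456789012:alerts",
#       ),
#   )
```

The point of the feature, as the discussion notes, is that the "when this finishes, call the next thing" step moves out of your code and into this declarative config.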

Just that effect on your average developer, I think you and I think about these things. I would consider myself to be a cloud architect, if that's a thing. But essentially, do you see this being I guess a wall for a lot of developers and something that really requires quite a bit of education to ramp them up to be able to start designing these systems?

Gojko: One of the topics we touched upon is the cyclic nature of things, and I think we're going back to where moving from apps working on a single machine to client-server architectures was a massive brain melt for a lot of people, and then three-tier architectures, which came later, ended up with their own host of design problems and things like that. That's where a lot of these architectural patterns and design patterns emerged, like circuit breakers and things like that. I think there's a whole body of knowledge there for people to research. It's not something that's entirely new, and I think you can get started with Lambda quite easily and not necessarily make a mess, but make something that won't necessarily scale well, and then start improving it later.

That's why I was mentioning earlier in the discussion that, as long as the protocol makes sense, you can salvage almost anything later. Designing that protocol is important, but then we're back to good software design. I think teaching people how to do that is something that every 10 years we have to recycle and reinvent and figure out, because people don't like to read books from more than 10 years ago. All of this stuff, like designing fault-tolerant systems and fail-safe systems and things like that, there's a ton of books about that from 20 years ago, from 10 years ago. For people listening to you and me, they probably use Amazon more for compute than for getting books, but Amazon has all these books. Use it for what Amazon was originally intended for, get some books there, and read through this stuff. I think looking at the design of distributed systems and stuff like that becomes really, really critical for Lambdas.

Jeremy: Yeah, definitely. All right, we've got a few minutes left and I'd love to go back to something we were talking about a little bit earlier and that was everything moving onto a few of these major cloud providers. And one of the things, you've got scale. Scale is a problem when we talked about oh, we can spin up as many VMs as we want to, and now with serverless, we have unlimited capacity really. I know we didn't say that, but I think that's the general idea. The cloud just provides this unlimited capacity.

Gojko: Until something else decides it's not unlimited.

Jeremy: And that's my point here, where with every major cloud provider that I've been involved with, and I've heard the stories, once you start to move the needle at all, there's always an SA that reaches out to you and really wants to understand what your usage is going to be, and what your patterns are going to be. And that's because they need to make sure that where you're running your applications, they provision enough capacity, because there is not unlimited capacity in the cloud.

Gojko: It's physically limited. There are only so many buildings where you can have data centers on the surface of the Earth.

Jeremy: And I guess that's where my question comes in because you always hear these things about lock-in. Like, well serverless, if you use Lambda, you're going to be locked in. And again, if you're using Oracle, you're locked in. Or, you're using MySQL you're locked in. Or, you're using any of the other things, you're locked in.

Gojko: You're actually not locked in physically. There's a key and a lock.

Jeremy: Right, but this idea of being locked in not to a specific cloud provider, but just locked into a cloud in general and relying on the cloud to do that scaling for you, where do you think the limitations there are?

Gojko: I think again, going back to cyclic, cyclic, cyclic. The PC revolution started when a lot more edge compute was needed than mainframes could provide, and people wanted to get stuff done on their own devices. And I think probably, if we do ever see the limitations of this and it goes into a next cycle, my best guess is it's going to be driven by lots of tiny devices connected to a cloud. Not necessarily computers as we know computers today. I pulled out some research preparing for this, from IDC. They are predicting basically that data generated by IoT will go from 18.3 zettabytes in 2019 to 73.1 zettabytes by 2025. That's roughly four times as much in the space of six years. If you went to Amazon now and told them, "You need to have four times more data space in six years," I'm not sure how they would react to that.

This stuff, everything is taking more and more data, and everything is more and more connected to the cloud. The impact of something like that going down is becoming totally crazy. There was a case where S3 started getting a bit more latency than usual in US East 1, in I think February of 2017. There were cases where people couldn't turn the lights on in their houses, because the management software was running on S3 and depending on S3, expecting S3 to be indestructible. Last year, in November, Kinesis pretty much went offline, as far as everybody outside AWS was concerned, for about 15 hours I think. There were people on Twitter saying they couldn't get back into their house because their smart lock was no longer that smart.

And I think we are getting to places where there will be more need for compute on the edge. First of all, there's going to be a lot more demand for data centers and cloud power, and I think that's going to keep going on for the next five, ten years. But then people will realize they've hit some limitation of that, and they're going to start moving towards the edge. And we're going from mainframe back into client-server computing, I think. We're getting these products now. I assume most of your listeners have seen these fancy Ubiquiti Wi-Fi thingies that cost hundreds of dollars and look like pieces of furniture, just sitting discreetly on the wall. And there was a massive security breach published yesterday. Somebody took their AWS keys and took all the customer data and everything.

The big advantage over all the ugly routers was that it's just a thin piece of glass that sits on your wall, and it's amazing and it looks good. But the reason they could make a very thin piece of glass is that a minimal amount of software is running on that piece of glass; the rest is running in the cloud. It's not just lock-in in terms of whether it's on Amazon or Google; it's that it's so tightly coupled with something totally outside of your home, where your network router needs Amazon to be alive, and in a very specific region of Amazon where everybody's been deploying for the last 15 years, and which is running out of capacity very often. Not very often, but often enough.

There are some really interesting questions that I guess we'll answer in the next five, ten years. I think we're on the verge of IoT exploding, because people are trying to come up with these new products that you wouldn't even have thought of before: smart shoes and smart whatnot, smart glasses and things like that. And when that gets into consumer technology, we're no longer going to have five or ten computing devices per person, we'll have dozens and dozens. Think about it this way: fifteen years ago, how many computing devices were you carrying with you? Probably a mobile phone and a laptop. Probably not more. Now, in the headphones you have there, that's Bose ...

Jeremy: Watch.

Gojko: ... you have a microprocessor in the headphones, you have your watch, you have a ton of other stuff you're carrying with you that's low-powered, all doing a bit of processing there. A lot of that processing is probably happening in the cloud somewhere.

Jeremy: Or, it's just sending data. It's just sending, hey here's the information. And you're right. For me, I got my Apple Watch, my thermostat is connected to Wi-Fi and to the cloud, my wife just bought a humidifier for our living room that is connected to Wi-Fi and I'm assuming it's sending data to the cloud. I'm not 100% sure, but the question is, I don't know why we need to keep track of the humidity in my living room. But that's the kind of thing too where, you mentioned from a security standpoint, I have a bunch of AWS access keys on my computer that I send over the network, and I'm assuming they're secure. But if I've got another device that can access my network and somebody hacked something on the cloud side and then they can get in, it gets really dangerous.

But you're right, the amount of data that we are now generating and compute that we're using in the cloud for probably some really dumb things like humidity in my living room, is that going to get to a point where... You said there's going to be a limitation like five years, ten years, whatever it is. What does the cloud do then? What does the cloud do when it can no longer keep up with the pace of these IOT devices?

Gojko: Well, if history is repeating, and we'll see if history is repeating, people will start getting throttled, and all of a sudden your unlimited supply of Lambdas will no longer be an unlimited supply of Lambdas. It will be something that you have to reserve upfront and pay upfront, and who knows, we'll see when we get there. Or we get things that we have with power networks, like that completely severe power cut you had in Texas, and you get an IT cut. I don't know. We'll see. The more we go into utility, the more we'll start seeing parallels between compute and power networks. And maybe power networks are something you can look at to see what comes later. That's why I think the next cycle is probably going to be some equivalent of client-server computing reemerging.

Jeremy: Yeah. All right, well, I got one more question for you, and this may be a little bit of a tongue-in-cheek question. Because we talked about it a little bit ... we talked about the merging of Lambda, and of Fargate, and some of these other things. But just from your perspective, serverless five years from now, where do you see that going? Do you see that just becoming the main ... This idea of utility computing, on-demand computing without setting up servers and managing ops and some of these other things, do you see that as the future of serverless, it just becoming the way we build applications? Or do you think that it's got a different path?

Gojko: There was a tweet by Simon Wardley. You mentioned Simon Wardley earlier in the talk. There was a tweet a few days ago where he mentioned some data. I'm not sure where he pulled it from. This might be unverified, but generally Simon knows what he's talking about. Amazon itself is deploying roughly 50% of all new apps they're building on serverless. I think five years from now, that way of running stuff, I'm not sure if it's Lambda or some new service that Amazon starts and gives it some even more confusing name that runs in parallel to everything. But, that kind of stuff where the operator takes care of all the ops, which they really should be doing, is going to be the default way of getting utility compute out.

I think a lot of these other things will probably remain useful for specialist use cases, where you can't really deploy in that way, or you need more stability, or it's not transient, and things like that. My best guess is, first of all, we'll get Lambdas that run for longer, and I assume that after we get Lambdas that run for longer, we'll probably get some ways of controlling routing to Lambdas, because you already can set up pre-provisioned Lambdas and hot Lambdas and reserved capacity and things like that. When you have reserved capacity and you have longer-running Lambdas, the next logical thing is to have session stickiness, and routing, and things like that. And I think a lot of the stuff that was really complicated to do earlier, where you had to run EC2 instances or complicated networks of services, you'll be able to do in Lambda.

And I wouldn't be surprised if they launch a totally new service, some AWS cloud socket, whatever, something that is an implementation of the same principle, just in a different way, that becomes the default way of running compute for lots of people. And I think GPUs are still a bit limited. I don't think you can run GPUs as a utility anywhere now, and that's limiting for a whole host of use cases. And again, it's not that they don't have the technology to do it, it's just that they probably didn't get around to doing it yet. But I assume in five years' time, you'll be able to do GPUs on-demand, and processing on GPUs, and things like that. I think the buzzword itself will lose any special meaning, and that's going to just be the way of running stuff.

Jeremy: Yeah, absolutely. Totally agree. Well, listen Gojko, thank you so much for spending the time chatting with me. Always great to talk with you.

Gojko: You, too.

Jeremy: If people want to get in touch with you, find out more about what you're doing, how do they do that?

Gojko: Well, I'm very easy to find online because there's not a lot of people called Gojko. Type Gojko into Google, you'll find me. And gojko.net works, gojko.com works, gojko.org works, and all these other things. I was lucky enough to get all those domains.

Jeremy: That's G-O-J-K-O ...

Gojko: Yes, G-O-J-K-O.

Jeremy: ... for people who need the spelling.

Gojko: Excellent. Well, thanks very much for having me, this was a blast.

Jeremy: All right, yeah. And make sure you check out ... You mentioned Narakeet. It's a speech thing?

Gojko: Yeah, for developers that want to build videos without hassle, and want to put videos in continuous integration, and things like that. Narakeet, that's like parakeet with an N for narration. Check that out and thanks for plugging it.

Jeremy: Awesome. And then, check out MindMup as well. Awesome stuff. I've got all the stuff in the show notes. Thanks again, Gojko.

Gojko: Thank you. Bye-bye.

2021-04-19

Episode #96: Serverless and Machine Learning with Alexandra Abbas

About Alexandra Abbas

Alexandra Abbas is a Google Cloud Certified Data Engineer & Architect and an Apache Airflow contributor. She currently works as a Machine Learning Engineer at Wise. She has experience with large-scale data science and engineering projects. She spends her time building data pipelines using Apache Airflow and Apache Beam and creating production-ready machine learning pipelines with TensorFlow.

Alexandra was a speaker at ServerlessDays London 2019 and presented at the TensorFlow London meetup.

Personal links

Twitter: https://twitter.com/alexandraabbas
LinkedIn: https://www.linkedin.com/in/alexandraabbas
GitHub: https://github.com/alexandraabbas

datastack.tv's links
Web: https://datastack.tv
Twitter: https://twitter.com/datastacktv
YouTube: https://www.youtube.com/c/datastacktv
LinkedIn: https://www.linkedin.com/company/datastacktv
GitHub: https://github.com/datastacktv
Link to the Data Engineer Roadmap: https://github.com/datastacktv/data-engineer-roadmap


This episode is sponsored by CBT Nuggets: cbtnuggets.com/serverless and
Stackery: https://www.stackery.io/

Watch this video on YouTube: https://youtu.be/SLJZPwfRLb8

Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today I'm joined by Alexa Abbas. Hey, Alexa, thanks for joining me.

Alexa: Hey, everyone. Thanks for having me.

Jeremy: So you are a machine learning engineer at Wise and also the founder of datastack.tv. So I'd love it if you could tell the listeners a little bit about your background and what you do at Wise and what datastack.tv is all about.

Alexa: Yeah. So as you said, I'm a machine learning engineer at Wise. Wise is an international money transfer service. We are aiming for very transparent and very low fees compared to banks. At Wise, I'm basically designing, maintaining, and developing the machine learning platform, which serves data scientists and analysts, so they can train their models and deploy their models easily.

Datastack.tv is, basically, it's a video service or a video platform for data engineers. So we create bite-sized videos, educational videos, for data engineers. We mostly cover open source topics, because we noticed that some of the open source tools in the data engineering world are quite underserved in terms of educational content. So we create videos about those.

Jeremy: Awesome. And then, what about your background?

Alexa: So I've always worked as a data engineer or machine learning engineer in terms of roles. For a small amount of time, I also worked as a data scientist. In terms of education, I did a big data engineering Master's, but my Bachelor's is actually in economics, so quite a mix.

Jeremy: Well, it's always good to have a ton of experience and that diverse perspective. Well, listen, I'm super excited to have you here, because machine learning is one of those things where it probably is more of a buzzword, I think, to a lot of people where every startup puts it in their pitch deck, like, "Oh, we're doing machine learning and artificial intelligence ..." stuff like that. But I think it's important to understand, one, what exactly it is, because I think there's a huge confusion there in terms of what we think of as machine learning, and maybe we think it's more advanced than it is sometimes, as I think there's lower versions of machine learning that can be very helpful.

And obviously, this being a serverless podcast, I've heard you speak a number of times about the work that you've done with machine learning and some experiments you've done with serverless there. So I'd love to just pick your brain about that and just see if we can educate the users here on what exactly machine learning is, how people are using it, and where it fits in with serverless and some of the use cases and things like that. So first of all, I think one of the important things to start with anyways is this idea of MLOps. So can you explain what MLOps is?

Alexa: Yeah, sure. So really short, MLOps is DevOps for machine learning. In traditional software engineering projects, you have a streamlined process; you can release really often, really quickly, because you already have all these best practices that traditional software engineering projects implement. Machine learning is still at a quite early stage, and MLOps is at a quite early stage. But what we try to do in MLOps is streamline machine learning projects the same way traditional software engineering projects are streamlined, so data scientists can train models really easily, and they can release models really frequently and really easily into production. So MLOps is all about streamlining the whole data science workflow, basically.

And I guess it's good to understand what the data science workflow is, so I'll talk a bit about that as well. Before actually starting any machine learning project, the first phase is an experimentation phase. It's a really iterative process where data scientists are looking at the data, trying to find features, and training many different models; they are doing architecture search, trying different architectures and different hyperparameter settings with those models. So it's a really iterative process of trying many models, many features.

And then by the end, they probably find a model that they like and that hit the benchmark that they were looking for, and then they are ready to release that model into production. And this usually looks like ... so sometimes they use shadow models, in the beginning, to check if the results are as expected in production as well, and then they actually release into production. So basically MLOps tries to create the infrastructure and the processes that streamline this whole process, the whole life cycle.

Jeremy: Right. So the question I have is, so if you're an ML engineer or you're working on these models and you're going through these iterations and stuff, so now you have this, you're ready to release it to production, so why do you need something like an MLOps pipeline? Why can't you just move that into production? Where's the barrier?

Alexa: Well, I guess ... I mean, to be honest, the thing is there shouldn't be a barrier. Right now, that's the whole goal of MLOps. They shouldn't feel that they need to do any manual model artifact copying or anything like that. They just, I don't know, press a button and they can release to production. So that's what MLOps is really about, and we can version models, we can version the data, things like that. And we can create reproducible experiments. So I guess right now, many bits in this whole lifecycle are really manual and could be automated. For example, releasing to production is sometimes a manual thing. You just copy a model artifact to a production bucket or whatever. We would like to automate all these things.
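The release step Alexa describes, copying a model artifact to a production location, is easy to automate with versioning built in. This is a minimal sketch assuming a file-based model store; in practice the "production bucket" would be S3 or GCS and the copy an object-store call, and the function name is hypothetical:

```python
import shutil
from pathlib import Path

def promote_model(artifact: Path, prod_dir: Path, version: str) -> Path:
    """Copy a trained model artifact into a versioned production path
    and update a 'CURRENT' pointer so serving code knows which to load."""
    prod_dir.mkdir(parents=True, exist_ok=True)
    target = prod_dir / f"model-{version}{artifact.suffix}"
    shutil.copy2(artifact, target)                   # immutable, versioned copy
    (prod_dir / "CURRENT").write_text(target.name)   # pointer to the live version
    return target
```

Keeping every version immutable and only moving a pointer makes rollback a one-line change.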

Jeremy: Which makes a lot of sense. So then, in terms of actually implementing this stuff, because we hear all the time about CI/CD. If we're talking about DevOps, we know that there's all these tools that are being built and services that are being launched that allow us to quickly move code through some process and get into production. So are there similar tools for deploying models and things like that?

Alexa: Well, I think this space is quite crowded. It's getting more and more crowded. So there are the cloud providers, who are trying to create tools that help these processes, and there are also many third-party platforms trying to create the ML platform that everybody uses. So I think there is no go-to thing that everybody uses; there are many tools that we can use.

Some examples: TensorFlow is a really popular machine learning library, and they created a package on top of TensorFlow, which is called TFX, TensorFlow Extended, which is exactly for streamlining this process and serving models easily. So I would say TFX is a really good example. There is Kubeflow, which is a machine learning toolkit for Kubernetes. There are also many custom implementations in-house at many companies; they create their own machine learning platforms, their own model serving APIs, things like that. And from the cloud providers, on AWS we have SageMaker, which tries to cover many parts of the data science lifecycle. And on Google Cloud, we have AI Platform, which is really similar to SageMaker.

Jeremy: Right. And what are you doing at Wise? Are you using one of those tools? Are you building something custom?

Alexa: Yeah, it's a mix actually. We have some custom bits. We have a custom serving API for serving models. But for model training, we are using many things. We are using SageMaker Notebooks. And we are also experimenting with SageMaker endpoints, which are actually serverless model serving endpoints. And we are also using EMR for model training and data preparation, so some Spark-based things, a bit more traditional type of model training. So it's quite a mix.

Jeremy: Right. Right. So I am not well-versed in machine learning. I know just enough to be dangerous. And so I think that what would be really interesting, at least for me, and hopefully be interesting to listeners as well, is just talk about some of these standard tools. So you mentioned things like TensorFlow and then Kubeflow, which I guess is that end-to-end piece of it, but if you're ... Just how do you start? How do you go from, I guess, building and training a model to then productizing it and getting that out? What's that whole workflow look like?

Alexa: So, actually, in the data science workflow I mentioned, the first bit is that experimentation, which is really iterative, really free, so you just try to find a good model. And then, when you've found a good model architecture and you know that you are going to receive new data, let's say every day or every week, then you need to build out a retraining pipeline. And that is, I think, what the productionization of a model really means: that you can build a retraining pipeline, which can automatically pick up new data, prepare that new data, retrain the model on that data, and release that model into production automatically.
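The retraining pipeline Alexa describes can be sketched as a chain of plain functions: prepare data, train a candidate, evaluate it, and gate the release. This is a toy illustration, not any specific framework's API; the "model" here is just a mean predictor standing in for real training, and the record shape is made up:

```python
def preprocess(rows):
    """Keep only complete records; pull out (feature, label) pairs."""
    return [(r["x"], r["y"]) for r in rows if "x" in r and "y" in r]

def train(pairs):
    """Toy stand-in for training: a model that predicts the mean label."""
    mean = sum(y for _, y in pairs) / len(pairs)
    return lambda x: mean

def evaluate(model, pairs):
    """Mean absolute error on a held-out set."""
    return sum(abs(model(x) - y) for x, y in pairs) / len(pairs)

def retraining_run(new_rows, holdout, current_error):
    """One scheduled run: prepare data, train a candidate, and release
    only if it is at least as good as the production model."""
    candidate = train(preprocess(new_rows))
    error = evaluate(candidate, holdout)
    return candidate, error, error <= current_error
```

In a real pipeline each function becomes a scheduled task (e.g. an Airflow operator), but the control flow is the same.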

Jeremy: Right. Yeah. And so by being able to build and train a model and then having that process where you're getting that feedback back in, is that something where you're just taking that data and assuming that that is right and fits in the model or is there an ongoing testing process? Is there supervised learning? I know that's a buzzword. I'm not even sure what it means. But those ... I mean, what types of things go into that retraining of the models? Is it something that is just automatic or is it something where you need constant, babysitting's probably the wrong word, but somebody to be monitoring that on a regular basis?

Alexa: So monitoring is definitely necessary. Especially, I think, when you've retrained your model, you shouldn't release it automatically into production just because you've trained on new data. I mentioned this shadow model thing a bit. Usually, after you retrain the model in this retraining pipeline, you release that model into shadow mode; you serve that model in parallel to your actual production model, and then you check the results from your new model against your production model. And that can be a manual thing, or maybe you can automate it as well. So if it is comparable with your production model, or if it's even better, then you replace it.
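The shadow-versus-production comparison she describes boils down to scoring both models on the same live traffic and promoting only on a win. A hedged sketch; the accuracy metric and promotion margin are illustrative choices, not a prescribed policy:

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the observed labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def shadow_report(prod_preds, shadow_preds, labels, margin=0.0):
    """Score production and shadow models on the same traffic;
    recommend promotion only if the shadow matches or beats
    production by at least `margin`."""
    prod_acc = accuracy(prod_preds, labels)
    shadow_acc = accuracy(shadow_preds, labels)
    return {"prod": prod_acc, "shadow": shadow_acc,
            "promote": shadow_acc >= prod_acc + margin}
```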

And also, in terms of the data quality in the beginning, you should definitely monitor that. And I think that's quite custom; it really depends on what kind of data you work with. So it's really important to test your data. This space is also quite crowded. There are many tools that you can use to monitor the distribution of your data and see that the new data actually corresponds to your already existing dataset. So there are many bits in this whole retraining pipeline that you can monitor, and you should monitor.
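A minimal version of the distribution monitoring she mentions: flag a feature whose new-batch mean sits too many standard deviations from the reference data. Production tools use richer tests (KS statistic, population stability index); this sketch only shows the idea, and the threshold is an arbitrary example:

```python
from statistics import mean, stdev

def drifted(reference, new_batch, z_threshold=3.0):
    """True if the new batch's mean is more than z_threshold
    reference standard deviations away from the reference mean."""
    mu, sigma = mean(reference), stdev(reference)
    if sigma == 0:
        return mean(new_batch) != mu
    return abs(mean(new_batch) - mu) / sigma > z_threshold
```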

Jeremy: Right. Yeah. And so, I think of some machine learning like use cases of like sentiment analysis, for example... looking at tweets or looking at customer service conversations and trying to rate those things. So when you say monitoring or running them against a shadow model, is that something where ... I mean, how do you gauge what's better, right? if you've got a shadow... I mean, what's the success metric there as to say X number were classified as positive versus negative sentiment? Is that something that requires human review or some sampling for you to kind of figure out the quality of the success of those models?

Alexa: Yeah. So actually, I think that really depends on the use case. For example, when you are trying to catch fraudsters, your false positive rate and true positive rate are really important. If your true positive rate is higher, that means you are catching more fraudsters. But let's say with your new model the false positive rate is also higher, which means that you are catching more people who are actually not fraudsters, and you have more work, because I guess it's a manual process to actually check those people. So I think it really depends on the use case.
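The trade-off Alexa describes is exactly the true positive rate versus false positive rate calculation. A small sketch, with 1 meaning "fraud":

```python
def fraud_rates(labels, preds):
    """labels/preds use 1 for 'fraud', 0 for 'legitimate'.
    TPR = share of real fraudsters caught.
    FPR = share of legitimate users wrongly flagged (manual-review cost)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)
```

A new model can raise both numbers at once, which is why the comparison depends on how costly each false positive is to review.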

Jeremy: Right. Right. And you also said that the market's a little bit flooded and, I mean, I know of SageMaker and then, of course, there's all these tools like, what's it called, Rekognition, a bunch of things at AWS, and then Google has a whole bunch of the Vision API and some of these things, and Watson's Natural Language Processing over at IBM. So there's all these different tools that are just available via an API, which is super simple and great for people like me that don't want to get into building TensorFlow models and things like that. So is there an advantage to building your own models beyond those things, or are we getting to a point where ... I mean, again, I know SageMaker has a whole library of models that are already built for you and things like that. So are we getting to a point where some of these models are just good enough off the shelf, or do we really still need ... And I know there are probably some custom things. But do we really still need to be building our own models around that stuff?

Alexa: So to be honest, I think most data scientists are using off-the-shelf models, maybe not the serverless API type of models that Google has, but just off-the-shelf TensorFlow models, or SageMaker's built-in containers for some really popular model architectures like XGBoost. I think most people don't tweak these, as far as I know. They just use them out of the box, and they really try to tweak the data instead, the data that they have, and feed these off-the-shelf models with higher and higher quality data.

Jeremy: So shape the data to fit the model as opposed to the model to fit the data.

Alexa: Yeah, exactly. Yeah. So you don't actually have to know ... You don't have to know how those models work exactly. As long as you know what the input should be and what output you expect, then I think you're good to go.

Jeremy: Yeah, yeah. Well, I still think that there's probably a lot of value in tuning the models though against your particular data sets.

Alexa: Yeah, right. But also there are services for hyperparameter tuning. There are services even for neural architecture search, where they try a lot of different architectures for your data specifically and then they will tell you what is the best model architecture that you should use and same for the hyperparameter search. So these can be automated as well.
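The simplest form of the tuning services Alexa mentions is an exhaustive grid search; managed offerings layer smarter (e.g. Bayesian) search on top of the same loop. A generic sketch with hypothetical `train_fn`/`score_fn` callables standing in for real training and validation:

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Try every hyperparameter combination in `grid`, train a model
    for each, and keep the best-scoring parameters."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(train_fn(**params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```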

Jeremy: Yeah. Very cool. So if you are hosting your own version of this ... I mean, maybe you'll go back to the MLOps piece of this. So I would assume that a data scientist doesn't want to be responsible for maintaining the servers or the virtual machines or whatever it is that it's running on. So you want to have this workflow where you can get your models trained, you can get them into production, and then you can run them through this loop you talked about and be able to tweak them and continue to retrain them as things go through. So on the other side of that wall, if we want to put it that way, you have your ops people that are running this stuff. Is there something specific that ops people need to know? How much do they need to know about ML, as opposed to ... I mean, the data scientists, hopefully, they know more. But in terms of running it, what do they need to know about it, or is it just a matter of keeping a server up and running?

Alexa: Well, I think ... So I think the machine learning pipelines are not yet as standardized as a traditional software engineering pipeline. So I would say that you have to have some knowledge of machine learning or at least some understanding of how this lifecycle works. You don't actually need to know about research and things like that, but you need to know how this whole lifecycle works in order to work as an ops person who can automate this. But I think the software engineering skills and DevOps skills are the base, and then you can just build this knowledge on top of that. So I think it's actually quite easy to pick this up.

Jeremy: Yeah. Okay. And what about, I mean, you mentioned this idea of a lot of data scientists aren't actually writing the models, they're just using the preconfigured model. So I guess that begs the question: How much does just a regular person ... So let's say I'm just a regular developer, and I say, "I want to start building machine learning tools." Is it as easy as just pulling a model off the shelf and then just learning a little bit more about it? How much can the average person do with some of these tools out of the box?

Alexa: So I think most of the time, it's that easy, because usually the use cases that someone tries to tackle, those are not super edge cases. So for those use cases, there are already models which perform really well. Especially if you are talking about, I don't know, supervised learning on tabular data, I think you can definitely find models that will perform really well off the shelf on those type of datasets.

Jeremy: Right. And if you were advising somebody who wanted to get started... I mean, because I think that I think where it might come down to is going to be things like pricing. If you're using Vision API and you're maybe limited on your quota, and then you can ... if you're paying however many cents per, I guess, lookup or inference, then that can get really expensive as opposed to potentially running your own model on something else. But how would you suggest that somebody get started? Would you point them at the APIs or would you want to get them up and running on TensorFlow or something like that?

Alexa: So I think, actually, for a developer, just using an API would be super easy. Getting started with those APIs just to understand the concepts is very useful, but I would definitely recommend getting started with TensorFlow itself, or just Keras, or scikit-learn, which is a more basic package for more basic machine learning. Those are really good starting points. And there are so many tutorials to get started with, and if you have an idea of what you would like to build, then you will definitely find tutorials similar to your own use case, and you can use those to build your custom pipeline or model. So for developers, I would definitely recommend jumping into TensorFlow or scikit-learn or XGBoost or things like that.

Jeremy: Right, right. And how many of these models exist? I mean, are we talking there's 20 different models or are we talking there's 20,000 models?

Alexa: Well, I think ... Wow. Good question. I think today we are maybe not at 20,000, but definitely many thousands. But there are popular models that most people use, and I think there are maybe 50 or 100 models that are the most popular; most companies use them, and you are probably fine just using those for most of the use cases.

Jeremy: Right. Now, and speaking of use cases, so, again, I try to think of use cases or machine learning and whether it's classifying movies into genres or sentiment analysis, like I said, or maybe trying to classify news stories, things like that. Fraud detection, you mentioned. Those are all great use cases, but what are ... I know you've worked on a bunch of projects. So what are some of the projects that you've done and what were the use cases that were being solved there, because I find these to be really interesting?

Alexa: Yeah. So I think a nice project that I worked on was a project with Lush, which is a cosmetics company. They manufacture soaps and bath bombs. And they have this nice mission that they would like to eliminate packaging from their shops. So when I worked at Datatonic, we worked on a small project with them. They asked us to create an image recognition model, to train one, and then create a retraining pipeline that they could use afterwards. They provided us with many hundred thousand images of their products, taken from different angles with different lighting and all of that, so a really high-quality image dataset of all their products.

And then, we used a MobileNet model, because they wanted this model to be built into their mobile application. So when users actually use this model, they download it with their mobile application. And then, they created a service called Lush [inaudible], which you can use from within their app. People can just scan the products and see the ingredients and how-to-use guides and things like that. So this is how they are trying to eliminate all kinds of packaging from their shops; they don't actually need to put paper there or packaging with ingredients and things like that.

And in terms of what we did on the technical side, as I mentioned, we used a MobileNet model, because we needed to quantize the model in order to put it on a mobile device. And we used TF Lite to do this. TF Lite is specifically for models that you want to run on an edge device, like a mobile phone. So that was already a constraint, and this is how we picked the model. I think, back then, there were only a few model architectures supported by TF Lite, maybe only two. So we picked MobileNet, because it had a smaller size.

And then, in terms of the retraining, we automated the whole workflow with Cloud Composer on Google Cloud, which is a managed version of Apache Airflow, the open source scheduling package. The training happened on AI Platform, which is Google Cloud's equivalent of SageMaker.

Jeremy: Yeah.

Alexa: Yeah. And what else? We also had an image pre-processing step just before the training, which happened on Dataflow, which is an auto-scaling processing service on Google Cloud. And after we trained the model, we just saved the model artifact in a bucket, and then ... I think we also monitored the performance of the model, and if it was good enough, then we just shipped the model to the developers, who manually updated the model file that went into the application that people can download. So we didn't really see whether they used any shadow model thing or anything like that.

Jeremy: Right. Right. And I think that is such a cool use case, because, if I'm hearing you right, there were just like a bar soap or something like that with no packaging, no nothing, and you just hold your mobile phone camera up to it or it looks at it, determines which particular product is, gives you all that ... so no QR codes, no bar codes, none of that stuff. How did they ring them up though? Do you know how that process worked? Did the employees just have to know what they were or did the employees use the app as well to figure out what they were billing people for?

Alexa: Good question. So I think they wanted the employees as well to use the app.

Jeremy: Nice.

Alexa: Yeah. But when the app was wrong, then I don't know what happened.

Jeremy: Just give them a discount on it or something like that. That's awesome. And that's the thing you mentioned there about ... Was it Tensor Lite, was it called?

Alexa: TF Lite. Yeah.

Jeremy: TF Lite. Yes. TensorFlow Lite or TF Lite. But, basically, that idea of being able to really package a model and get it to be super small, like you said. You said edge devices, and I'm thinking serverless compute at the edge, I'm thinking Lambda functions. I'm thinking other ways that, if you could get your models small enough and packaged, you could run them there. That'd be a pretty cool way to do inference, right? Because even if you're using edge devices, if you're on an edge network or something like that, and you could do inference at the edge, that'd be a pretty fast response time.
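That edge-inference idea can be sketched as a Lambda-style handler with a tiny model held in memory, so each invocation answers locally with no call out to a remote model server. The model, weights, and event shape here are made up for illustration; only the `(event, context)` handler signature follows the Lambda convention:

```python
import json

# Toy "model" loaded once at cold start: a linear scorer's weights.
MODEL = {"weights": [0.4, 0.6], "bias": -0.3}

def predict(features):
    """Linear score thresholded into a binary label."""
    score = sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]
    return 1 if score > 0 else 0

def handler(event, context=None):
    """Lambda-style entry point: parse features from the request body,
    run inference in-process, and return the label."""
    features = json.loads(event["body"])["features"]
    return {"statusCode": 200,
            "body": json.dumps({"label": predict(features)})}
```

Because the weights load once at cold start, warm invocations pay only the arithmetic, which is what makes inference at the edge fast.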

Alexa: Yeah, definitely. Yeah.

Jeremy: Awesome. All right. So what about some other stuff that you've done? You've mentioned some things about fraud detection and things like that.

Alexa: Yeah. So fraud detection is a use case for Wise. As I mentioned, Wise offers international money transfers as one of its services. So, obviously, if you are doing anything with money, then a fraud use case is for sure something you will have. So, I mean, in terms of ... I don't actually develop models at Wise, so I don't know what models they use. I know that they use H2O, which is a Spark-based library that you can use for model training. I think it's quite an advanced library, but I haven't used it myself too much, so I cannot talk about it too much.

But in terms of the workflow, it's quite similar. We also have Airflow to schedule the retraining of the models. And they use EMR for data preparation, so quite similar to Dataflow, in a sense: a Spark-based auto-scaling cluster that processes the data. And then they train the models on EMR as well, but using this H2O library. And then in the end, when they are happy with the model, we have this tool that they can use for releasing shadow models in production. And then, if they are satisfied with the performance of the model, they can actually release it into production. And at Wise, we have a custom microservice, a custom API, for serving models.
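The shadow-release pattern Alexa describes is worth making concrete: the shadow model sees live traffic but its output is only logged, never returned. A minimal stdlib-only sketch (the scoring functions and logger here are hypothetical stand-ins, not Wise's actual tooling):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def predict_fraud_score(features, production_model, shadow_model=None):
    """Serve the production model; run the shadow model on the side.

    The shadow prediction is logged for offline comparison but never
    returned to the caller, so it cannot affect live traffic.
    """
    result = production_model(features)
    if shadow_model is not None:
        try:
            shadow_result = shadow_model(features)
            log.info("shadow=%s production=%s", shadow_result, result)
        except Exception:
            # A broken shadow model must never break the live path.
            log.exception("shadow model failed")
    return result

# Hypothetical models: trivial threshold scorers for illustration.
prod = lambda f: 0.2 if f["amount"] < 1000 else 0.9
shadow = lambda f: 0.1 if f["amount"] < 500 else 0.8

score = predict_fraud_score({"amount": 1200}, prod, shadow)
print(score)  # 0.9 — the production score
```

Once the logged shadow scores look at least as good as production's, the shadow model can be promoted, which matches the release flow described above.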

Jeremy: Right. Right. And that sounds like you need a really good MLOps flow to make all that stuff work, because you just have a lot of moving parts there, right?

Alexa: Yeah, definitely. Also, I think we have many bits that could be improved. I think there are many bits that are still a bit manual and not streamlined enough. But I think most companies struggle with the same thing. It's just that we don't yet have those best practices that we can implement, so many people try many different things, and then ... Yeah, so I think it's still a work in progress.

Jeremy: Right. Right. And I'm curious if your economics background helps at all with the fraud and the money laundering stuff at all?

Alexa: No.

Jeremy: No. All right. So you worked on another data engineering project for Vodafone, right?

Alexa: Yeah. Yeah, so that was purely a data engineering project, so we didn't do any machine learning. Well, Vodafone has their own Google Analytics library that they use in all their websites and mobile apps and things like that, and that sends clickstream data to a server in a Google Cloud Platform project, and we consumed that data in a streaming manner with Dataflow. So, basically, the project was really about processing this data by writing an Apache Beam pipeline, which was always on and always expected messages to come in. And then, we dumped all the data into BigQuery tables, which is the data warehouse in Google Cloud. And then, these BigQuery tables powered some of the dashboards that they use to monitor the uptime and, I don't know, different metrics for their websites and mobile apps.
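The always-on pipeline Alexa describes has a simple logical shape: parse each clickstream message, drop malformed ones, and write rows to the warehouse. A rough stdlib-only sketch of those stages (the event fields and the in-memory sink are made up for illustration; the real implementation was an Apache Beam streaming pipeline writing to BigQuery):

```python
import json

def parse_event(raw):
    """Parse one clickstream message; return None if it's malformed."""
    try:
        event = json.loads(raw)
        return {"site": event["site"], "path": event["path"], "ts": event["ts"]}
    except (json.JSONDecodeError, KeyError):
        return None

def run_pipeline(messages, sink):
    """Mimic the pipeline's stages over a stream of messages."""
    for raw in messages:          # in Beam: a streaming read, e.g. from Pub/Sub
        row = parse_event(raw)    # in Beam: beam.Map(parse_event)
        if row is not None:       # in Beam: beam.Filter(...)
            sink.append(row)      # in Beam: a BigQuery write

rows = []
stream = [
    '{"site": "example.com", "path": "/home", "ts": 1}',
    'not json',
    '{"site": "example.com", "path": "/shop", "ts": 2}',
]
run_pipeline(stream, rows)
print(len(rows))  # 2 — the malformed message is dropped
```

In the real streaming version the loop never ends; Beam keeps the pipeline running and Dataflow scales the workers as message volume changes.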

Jeremy: Right. But collecting all of that data is a good source for doing machine learning on top of that, right?

Alexa: Yeah, exactly. Yeah. I think they already had some use cases in mind. I'm not sure if they actually did those or not, but it's a really good base for machine learning, once we collected the data there in BigQuery, because that is an analytical data warehouse, so some analysts can already start to explore the data as a first step of the machine learning process.

Jeremy: Right. I would think anomaly detection and things like that, right?

Alexa: Yeah, exactly.

Jeremy: Right. All right. Well, so let's go on and talk about serverless a little bit more, because I saw you do a talk where you ran some experiments with serverless. And so, I'm just kind of curious, what are the limitations that you see? And I know that there continues ... I mean, we now have EFS integration, and we've got 10 gigs of memory for Lambda functions, you've even got Cloud Run, which I don't know how much you could do with that, but where are still some of the limitations for running machine learning in a serverless way, I guess?

Alexa: So I think, actually, for many bits of this data science lifecycle, cloud providers offer a lot of serverless options. For data preparation, there is Dataflow, which is kind of like a serverless data processing service, so you can use that for data processing. For model training, there are SageMaker and AI Platform, which are kind of serverless, because you don't actually need to provision the clusters that you train your models on. And for model serving, in SageMaker, there are the model endpoints that you can deploy. So there are many options, I think, for serverless in the machine learning lifecycle.

In my experience, many times, it's a cost thing. For example, at Wise, we have this custom model serving API, where we serve all our models. And if we used SageMaker endpoints, I think a single SageMaker endpoint is about $50 per month, that's the minimum price, and that's for a single model and a single endpoint. And if you have thousands of models, then your price can go up pretty quickly, or maybe not thousands, but hundreds of models, then your price can go up pretty quickly. So I think, in my experience, the limitation could just be price.
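Alexa's cost concern is easy to make concrete: with a per-endpoint floor, the bill scales linearly with the number of models (the roughly $50-per-month figure is her estimate from the conversation, not an official price):

```python
def monthly_endpoint_cost(num_models, cost_per_endpoint=50.0):
    """Minimum monthly cost if every model gets its own always-on endpoint."""
    return num_models * cost_per_endpoint

for n in (10, 100, 1000):
    print(f"{n} models -> ${monthly_endpoint_cost(n):,.0f}/month minimum")
# 10 models   ->    $500/month minimum
# 100 models  ->  $5,000/month minimum
# 1000 models -> $50,000/month minimum
```

A single multi-model serving API, like the one described at Wise, amortizes that fixed cost across all models instead.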

But in terms of ... So I think, for example, if I compare Dataflow with a Spark cluster that you program yourself, then I would definitely go with Dataflow. I think it's just much easier, and maybe cost-wise as well you might be better off, I'm not sure. But in terms of comfort and developer experience, it's a much better experience.

Jeremy: Right. Right. And so, we talked a little bit about TF Lite there. Is that something possible where maybe the training piece of it, running that on Functions as a Service or something like that maybe isn't the most efficient or cost-effective way to do that, but what about running models or running inference on something like a Lambda function or a Google Cloud function or an Azure function or something like that? Is it possible to package those models in a way that's small enough that you could do that type of workload?

Alexa: I think so. Yeah. I think you can definitely make inferences using a Lambda function. But in terms of model training, I think that's not a ... Maybe there were already experiments for that, I'm sure there were. But I think it's not the kind of workload that would fit Lambda functions. Those are typically parallelizable, really large-scale workloads ... you know the MapReduce type of data processing workloads? I think those are not necessarily a fit for Lambda functions. So I think for model training and data preparation, maybe those are not the best options, but for model inference, definitely. And I think there are many examples using Lambda functions for inference.
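The inference pattern Alexa endorses usually looks the same regardless of cloud: load a small packaged model once, outside the handler, so warm invocations skip the load. A minimal sketch (the handler signature follows AWS Lambda's Python convention; the "model" here is a made-up linear scorer standing in for a real artifact such as a TF Lite file):

```python
import json

def load_model():
    """Stand-in for loading a small packaged model from the deployment bundle."""
    weights = {"bias": 0.1, "amount": 0.0005}
    def model(features):
        return weights["bias"] + weights["amount"] * features["amount"]
    return model

# Loaded once at cold start and reused across warm invocations.
MODEL = load_model()

def handler(event, context):
    features = json.loads(event["body"])
    score = MODEL(features)
    return {"statusCode": 200, "body": json.dumps({"score": score})}

resp = handler({"body": '{"amount": 1000}'}, None)
print(resp["body"])
```

Keeping the artifact small, which is exactly what TF Lite-style packaging is for, is what makes this fit within a function's deployment size and cold-start budget.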

Jeremy: Right. Now, do you think that ... because this is always something where I find with serverless, and I know you're more of a data scientist, ML expert, but I look at serverless and I question whether or not it needs to handle some of these things. Especially with some of the endpoints that are out there now, we talked about the Vision API and some of the other NLP things, are we putting in too much effort maybe to try to make serverless be able to handle these things, or is it just something where there's a really good way to handle these by hosting your ... I mean, even if you're doing SageMaker, maybe not SageMaker endpoints, but just running SageMaker machines to do it or whatever, are we trying too hard to squeeze some of these things into a serverless environment?

Alexa: Well, I don't know. I think, as a developer, I definitely prefer the more managed versions of these products. So the less I need to bother with, "Oh, my cluster died and now we need to rebuild a cluster of things," and I think serverless can definitely solve that. I would definitely prefer the more managed version. Maybe not serverless, because, for some of the use cases or some of the bits from the lifecycle, serverless is not the best fit, but a managed product is definitely something that I prefer over a non-managed product.

Jeremy: Right. And so, I guess one last question for you here, because this is something that always interests me. Just there are relevant things that we need machine learning for. I mean, I think the fraud detection is a hugely important one. Sentiment analysis, again. Some of those other things are maybe, I don't know, I shouldn't call them toy things, but personalization and some of the things, they're all really great things to have, and it seems like you can't build an application now without somebody wanting some piece of that machine learning in there. So do you see that as where we are going where in the future, we're just going to have more of these APIs?

I mean, out of AWS, because I'm more familiar with the AWS ecosystem, but they have Personalize and they have Connect and they have all these other services, they have the recommendation engine thing, all these different services ... Lex, or whatever, that will read text, natural language processing and all that kind of stuff. Is that where we're moving to just all these pre-trained, canned products that I can just access via an API or do you think that if you're somebody getting started and you really want to get into the ML world that you should start diving into the TensorFlows and some of those other things?

Alexa: So I think if you are building an app and your goal is not to become an ML engineer or a data scientist, then these canned models are really useful, because you can have a really good recommendation engine in your product, you can have a really good personalization engine in your product, things like that. And so, those are, I think, really useful, and you don't need to know any machine learning in order to use them. So I think we're definitely going in that direction, because most companies won't hire data scientists just to train a recommender model. I think it's just easier to use an API endpoint that is already really good.

So I think, yeah, we are definitely heading into that direction. But if you are someone who wants to become a data scientist or wants to be more involved with MLOps or machine learning engineering, then I think jumping into TensorFlow and understanding, maybe not, as we discussed, not getting into the model architectures and things like that, but just understanding the workflow and being able to program a machine learning pipeline from end to end, I think that's definitely recommended.

Jeremy: All right. So one last question: If you've ever used the Watson NLP API or the Google Vision API, can you put on your resume that you're a machine learning expert?

Alexa: Well, if you really want to do that, I would give it a go. Why not?

Jeremy: All right. Good. Good to know. Well, Alexa, thank you so much for sharing all this information. Again, I find the use cases here to be much more complex than maybe some of the surface ones that you sometimes hear about. So, obviously, machine learning is here to stay. It sounds like there's a lot of really good opportunities for people to start kind of dabbling in it and using that without having to become a machine learning expert. But, again, I appreciate your expertise. So if people want to find out more about you or more about the things you're working on and datastack.tv, things like that, how do they do that?

Alexa: So we have a Twitter page for datastack.tv, so feel free to follow that. I also have a Twitter page, feel free to follow me, account, not page. There is a datastack.tv website, so it's just datastack.tv. You can go there, and you can check out the courses. And also, we have created a roadmap for data engineers specifically, because there was no good roadmap for data engineers. I definitely recommend checking that out, because we listed most of the tools that a data engineer and also machine learning engineer should know about. So if you're interested in this career path, then I would definitely recommend checking that out. So under datastack.tv's GitHub, there is a roadmap that you can find.

Jeremy: Awesome. All right. And that's just, like you said, datastack.tv.

Alexa: Yes.

Jeremy: I will make sure that we get your Twitter and LinkedIn and GitHub and all that stuff in there. Alexa, thank you so much.

Alexa: Thanks. Thank you.

2021-04-12

Episode #95: Going Serverless with IBM Cloud Code Engine with Jason McGee

About Jason McGee

Jason McGee, IBM Fellow, is VP and CTO at IBM Cloud Platform. Jason is currently responsible for technical strategy and architecture for all of IBM's Cloud Platform, across public, dedicated, and local delivery models. Previously Jason has served as CTO of Cloud Foundation Services, Chief Architect of PureApplication System, WebSphere Extended Deployment, WebSphere sMash, and WebSphere Application Server on distributed platforms.

Twitter: @jrmcgee
LinkedIn: https://www.linkedin.com/in/jrmcgee/
IBM Cloud Code Engine: Learn more during this live virtual event on April 14th (also available on-demand after April 14th)
Read more: https://www.ibm.com/cloud/code-engine
Get started today: https://cloud.ibm.com/docs/codeengine?topic=codeengine-getting-started

Watch this episode on YouTube: https://youtu.be/yH_mgW2kGzU

This episode sponsored by IBM Cloud.

Transcript:
Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm joined by Jason McGee. Hey Jason, thanks for joining me.

Jason: Thanks for having me.

Jeremy: So you are an IBM fellow and the VP and CTO of the IBM Cloud platform. So I'd love it if you could tell our guests a little bit about yourself and what it is that you do at IBM.

Jason: Sure. I spend my day at IBM worried about developers and platform services on our public cloud. So I'm responsible for both the technical strategy and the delivery of our Kubernetes and OpenShift platforms, our serverless environments, and kind of all the things that surround that space, logging, and monitoring and other developer tools that kind of make up the developer platform for IBM Cloud.

Jeremy: And what about yourself? What's your background?

Jason: Been a software, kind of middleware guy, my whole life. I used to be the chief architect for WebSphere app server. So I spent the last 20 plus years working on enterprise application platforms and helping companies be able to build mission-critical business systems.

Jeremy: Awesome. So I had Michael Behrendt on the show not too long ago and it was great. We talked about a whole bunch of different things. IBM's point of view of serverless. We talked a little bit about the future of serverless and we talked about the IBM Cloud Code Engine, which I want to get into, but for the benefit of our listeners and just because I'm so fascinated by some of the things that IBM is doing now with serverless, it's just super interesting. So could you sort of give me your point of view or IBM's point of view on serverless and just sort of refresh the listener's memory sort of about how IBM is thinking about serverless and how they're probably thinking about it maybe differently than some of the other cloud providers?

Jason: Yeah, sure. I mean, it's such a fascinating space and it's really changed a lot, I think, over the last five years or so from its kind of maybe beginnings in being very aligned with serverless functions and kind of event-driven computing and becoming a more general concept about how developers especially can consume cloud platforms. I think if you look at the IBM perspective on serverless, there's a couple layers to the problem that we think about. First is we've been pretty clear that we think Kubernetes and distributions of Kubernetes like OpenShift are kind of the key foundation compute environment for developers to use going forward. And we've done a ton of work in kind of building out our Kubernetes and OpenShift platforms and delivering them as a service on our public cloud. And that's an incredibly flexible platform that you can really build any kind of application. I think over the last five years, we've proven we can run anything on Kubernetes databases and AI and stateless apps and whatever you want.

Jeremy: Right.

Jason: So very, very flexible. However, sometimes flexible also means complicated and it means that there's lots to manage and there's lots of concepts to get your head around. And so we've been thinking a lot about, well, how do you actually consume a platform like Kubernetes more easily? How does the developer stay more focused on what they're really trying to do, which is like build application logic, solve problems? Now they don't really want to stand up Kube clusters and configure security policies. They just want to write code and run code and they want to get the power of cloud to do that. Right? And so I think serverless has kind of morphed to be, for us, more about the experience that we can build on top of that container platform that's more oriented around how developers get work done and allows them to kind of more easily take advantage of the scale and power of public clouds without having to kind of take on the burden of a lot of that kind of work and management.

And so the work that we've been doing is really aligned in that direction, that we've been working in projects like Knative, in the open source community to build simpler abstractions on top of Kubernetes. And we've been starting to deliver those in our cloud through things like Code Engine.

Jeremy: Yeah. And I think that's interesting too because I always have, this is probably the wrong way to say it, but it's sort of a chip on my shoulder about Kubernetes because it just got so complicated. Right? It's just so many things that you have to do, so hard to manage. And as a serverless guy myself, I love just the simplicity of being able to write some code and just get it out there, have it auto scale, tie into all those events. So I think that a lot of cloud providers have sort of moved that way to say like, "Well, we're going to manage your Kubernetes cluster for you." Right? Which essentially is just, I think moving backwards, but also moving forwards at the same time, if that makes sense. But so in terms of the use cases that this opens up because now you're not necessarily limited to a sort of bespoke implementation of some serverless platform, you have a lot more capabilities. So what types of use cases does this open up?

Jason: Yeah. I mean, I may have a couple of comments on that. I mean, so I think with Kubernetes, you have the complexity of managing the Kubernetes environment, but even if that's totally taken care of for you, and even if you're using a managed Kubernetes service like the things we offer on IBM Cloud, you still have that kind of resource burden of using Kubernetes. You have services and pods and replica sets and namespaces and all kinds of concepts that you have to kind of wrap your head around and know how to use in the right way. And so there's a value in like, "Can we abstract that? Can we move away from that?" And it's not like this idea hasn't been tried before. I mean, we've had paths platforms, like kind of Cloud Foundry style, Heroku, very opinionated paths environments in the past and they definitely simplify the user experience. However, they came with this negative, which is if you don't fit within the box of the opinion ...

Jeremy: Right.

Jason: ... then you can't do what you want to do. And the cost of going outside the box was super high. Maybe you had to completely switch platforms. You were completely blocked. You had to switch to some other approach. And so part of what's informing us as we think about this is how do you have more of a continuum? You have a simple model. It's aligned around what you're doing. Just run my source code, just run my container image. I want to run a batch job, but it's all running on one platform. They're running next to each other. You can drop down a layer into Kubernetes if you want to. If what you're trying to accomplish needs some of that flexibility, you should have access to it without having to kind of start over. And so that's kind of how we've approached the problem a little bit differently is bringing this all together into kind of one unified serverless environment on top of Kubernetes.

And that lets us handle different use cases. That lets those handle kind of stateless, data processing and functions. That lets us handle simple web apps. That lets us handle very data-intensive, high-scale computation and data processing, async processing like batch all in one combined way.

Jeremy: Right. Yeah. And I think it's interesting because there are artificial limitations may be put in place sometimes on serverless platforms. If you think about AWS Lambda, for example, you get 15 minutes of compute and they bumped things up. So now, and again, I've just sort of grew up in the AWS environment, but they have things like 10 gigs for a function or something like that. And so they've increased these things, but they are sort of artificial limits that I think, depending on the type of workload that you're doing, they can really get in your way, especially if, like you said, you're doing these data-intensive things. So from an IBM perspective, I mean that's sort of gone, right?

Jason: Right. Exactly. That's a great, very concrete way to look at the problem. The approaches that have been taken in some of the other cloud environments is these different use cases like serverless functions, single containers, batch processing, they're different services. And every service has its own kind of limitations or rules about what you can and cannot do. How long your thing can execute, how big your code can be, how much data you can transfer. We've taken a different approach to say, "Let's eliminate all those limits and let's have one logical service, one environment that supports all those styles." We can still expose a simplified kind of consumption model for the developer like just give me your source code or just give me your image, but I can run it in a way that doesn't have those computational limits, and therefore I can do more. Right? I can run more kinds of workloads. I don't run up against some of those walls that kind of stopped me from getting my work done.

Jeremy: Right. Right. Yeah. And I like that approach too because I'm a big fan of managed services. I think that if you have a service that does image recognition for you, that's great. And do you have a service that does queuing for you? That's great. But in some cases, you start stringing together so many different services and I feel like you lose a lot of that control. So I like that idea of just basically being able to say, "Look, I've got the compute. I can do whatever I need to do with it. It will scale to whatever I needed to scale to." And I think that's where this idea of IBM Cloud Code Engine comes in, which just became GA so I'd love it if you could tell the listeners exactly what that is.

Jason: Yeah, absolutely. So, so Code Engine is the new service that we launched that makes some of these concepts I've been talking about real. It is a service that allows developers to deploy functions, containers, source code, batch jobs, into IBM Cloud. The entire environment behind that application is managed for you. So we handle you don't manage clusters, you don't provision infrastructure. You can scale all the way to zero. So you can literally only pay for what you're using. You can scale up to thousands of cores that are in parallel processing your application and we manage that entire runtime environment for you. So you can think of it as a multi-tenant shared Kubernetes-based runtime environment that you can run your workloads on that presents to you the personality that you need for different workloads. And because it's all in one service, if you have an application that's like a mix of some single containers and batch jobs, they can actually talk to each other, they can talk to each other over a private network connection. They can work together instead of being kind of siloed in these completely different environments.

Jeremy: Right? Yeah. And so from the developer, I guess, perspective, you had mentioned that you can deploy just code or you could deploy a container if you want to. So what does that developer experience look like? So is this something where I could just say, "Look, I don't need to have a whole ops team now managing this for me. If I just want to write code, deploy it into these things, I'm sure there's some things I need to know," but for the most part, what does that developer experience look like?

Jason: Yeah. So you absolutely could do it without a whole ops team. The experience right now, there's like maybe kind of three basic entry points. You can give me source code and we will take care of compiling that source code, combining with a runtime, executing it for you, giving it a web end point, scaling it. You can give me some hints about kind of how much resource you think you need and things like that and we can scale that up and down and manage it for you, including all the way down to zero. That's nice if you're coming from maybe a historical paths background or it's just like, "Here's my code, run it for me." You can have that experience with Code Engine. You could also start with a container image. So lots of developers now, because of things like Kubernetes and Docker, are very familiar and comfortable with packaging up their application as a container image, but you don't want to then deal with creating a cluster and dealing with Kubes.

So you can just say like, "Here's my image, run it for me." And one of the advantages we have with Code Engine is we can really do that with any container image. You don't have to have a container image that follows some particular framework that's built in a very special way. We can take any container image and you can just literally point me at the image and say, "Run this for me," and Code Engine will execute it and scale it and manage it for you. Or you can start with a batch job interface. So like a more of an async kind of parallel job submission model. So maybe I'm doing Monte Carlo simulations or data processing and I want to parallelize that across a whole bunch of machines and cores, Code Engine gives you an interface for that. So as a developer, you kind of start with one of those three entry points and let Code Engine take care of how to run that and scale it and keep it highly available and things like that.
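The batch entry point Jason describes typically fans one job out across many instances, each of which figures out its slice of the work from its index. A stdlib sketch of that sharding pattern (Code Engine documents an index variable injected into each instance of an array job; the exact environment variable names used here, `JOB_INDEX` and `JOB_ARRAY_SIZE`, should be treated as assumptions and checked against the docs):

```python
import os

def my_shard(items, index, array_size):
    """Instance `index` of `array_size` takes every array_size-th item."""
    return items[index::array_size]

# Assumed to be injected by the batch runtime into each job instance;
# defaults let the sketch also run as a single local process.
index = int(os.environ.get("JOB_INDEX", "0"))
array_size = int(os.environ.get("JOB_ARRAY_SIZE", "1"))

work = list(range(10))  # stand-in for the real units of work
print(my_shard(work, index, array_size))
```

Because every instance runs the same container image with only the index differing, this is the kind of workload the service can parallelize across many cores without the developer managing any cluster.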

Jeremy: Right. So I love the idea of the batch jobs. I want to talk about that a little bit more, but let's go back to some of the use cases here. So what if I was building just like a REST API, that seems to be a very popular, serverless use case, what would I do for that? Do I need to have some sort of an API type gateway type thing in front of it? Or how does that work?

Jason: No, Code Engine provides all that for you. So you would literally either just take your implementation and package it in a container or point us at your source code directory. If you have source code, we use things like Paketo Buildpacks to build a runtime around that source code. And so you can use different languages. So you can either point us, with our CLI tool, at the source code directory and we'll build it and package it in a runtime and run it for you. Or you point us at a container image that you've uploaded to our container registry or to your container registry of choice and then Code Engine will execute that for you. It will give you that web end point, right? So it'll give you a HTTP end point that you can use to access that service. And it will watch the demand on that system and scale it up and down as needed. And by default, we'll just scale it to zero. So it'll just be kind of registered in the system and it'll take care of scaling it up as needed to handle the demand on the app.

Jeremy: All right. Cool. And then what about these batch jobs? So I talked a little bit about this with Michael and this idea of being able to run massively parallel execution. So how does that all work?

Jason: Yeah. So similar, obviously with batch, there's a little bit more kind of metadata that you have to provide to describe the job and what you want to execute and how things relate to each other. So there's some input data you provide along with the implementation of the batch job, which itself could just be like a container image and you submit that job. So the CLI interface is a little bit different. You're not standing up a long-running REST end point, you're submitting a job to Code Engine for execution, and it will go take that job and execute it and parallelize it for you. You can also use Frameworks on top. One of the things we've been doing a lot of work on, maybe Michael talked about it a little bit when he was here, is some work we're doing around Ray. Ray is a really interesting new project that lets you do kind of distributed computing, especially around data workloads in a really easy way.

And so you can actually stand up Ray on top of Code Engine and so Ray acts as kind of the application interface for the developer to be able to easily parallelize their code, particularly Python code, and then Code Engine acts as the runtime below it. And you can take a simple function in Python, mark it as Ray remote and it'll now execute on the cloud and distribute itself across a thousand cores. And you get your answer back 20 times faster than you would have running it locally. And so you can have those kinds of async environments as well.
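The Ray pattern Jason describes, marking a plain function as remote and fanning it out, can be mimicked with the standard library to show the shape of the idea (real Ray code would decorate `simulate` with `@ray.remote` and gather with `ray.get`; this stdlib sketch only illustrates the fan-out-and-gather pattern, not Ray itself):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Stand-in for one Monte Carlo trial; with Ray this would be @ray.remote."""
    x = seed
    for _ in range(1000):
        # Simple deterministic pseudo-random walk (an LCG step).
        x = (1103515245 * x + 12345) % (2 ** 31)
    return x % 100

# Fan out across workers, then gather — the same shape as
# ray.get([simulate.remote(s) for s in seeds]) on a Ray cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(8)))

print(len(results))  # 8
```

The appeal of Ray on top of a serverless runtime is that the same few lines of application code can fan out across hundreds or thousands of cores that the platform provisions and reclaims for you.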

Jeremy: Awesome. And so what about some customers? So do you have customers that are having success with this now?

Jason: Yeah, we have a number. I mean, we have the European Molecular Biology Laboratory, which is using it to do science processing and provide access for scientists to the large-scale compute environments of the cloud. We have some airlines that are leveraging this. The airline scenario, I think, is actually kind of interesting because it shows the power of combining REST end points, more interactive workloads, with batch workloads. In their case, they're exploring using it to do dynamic pricing. So if you think about how you do dynamic pricing, there's kind of two dimensions. It's like, there's a very interactive, somebody is getting a price on a ticket or a route, and you want to be able to present them with dynamic price information as part of that web interaction. But then there's like a data processing angle.

You're looking at all kinds of data coming from your backend systems from route data, from the fleet and historical information. And you're trying to decide what the right price table is for that route. And so you're doing batch processing in the background, and then you're doing this interactive processing. You can implement both halves on serverless with Code Engine and they scale as needed. If you're getting a lot of traffic on the web front end, it scales up as needed without you having to do anything. So they can kind of combine both halves in one environment.

Jeremy: Right. Right. And so in terms of, I think we kind of talked about this a little bit, but when you see all these different services, right, and no matter what it is, whether it's Google's Kubernetes engine that they run or it's EKS on AWS or something like that, I think a lot of people look at these and like, "Oh, it's just another managed Kubernetes cluster." Right? So what are the major differences? I know we talked about it a little bit, but maybe you could just be a little bit more succinct and sort of talk about why is it so different than other sort of previous generations of tools or some of the other competing products out there.

Jason: Yeah. So if you look kind of behind the curtain on Code Engine, you'd see a couple of things. One is there is Kubernetes there, there is a Kubernetes environment there. The difference is that that Kubernetes environment is completely managed by the Code Engine service. In IBM Cloud, we have the IBM Cloud Kubernetes Service and our Red Hat OpenShift service. In those services, we're managing a cluster on your behalf, but we give you the cluster. It's like, "Here's your Kube cluster. We'll manage its life cycle, but you have direct access to it." With Code Engine, we have a Kube cluster there, but we completely manage it in all respects. You have no kind of direct access to it. That allows us to manage scale and capacity. We run that in a multi-tenant way. I mean, we have security and isolation between tenants, but logically you can think of it as a big Kube cluster that lots of users are sharing, which is how the pay-as-you-go model ultimately works, because we're keeping track of what you're actually running and just charging you for that.

So one part of it is fully managing that runtime environment. We've layered on top of that things like Knative, so that we have that developer abstraction, a simpler way to define services, to do the source code and image stuff that I talked about. That's coming largely through things like Knative, which, again, we're completely running for you, but it gives you some of that simple interface that we talked about, and we're doing that in an open-source way with the community. So it's not proprietary to IBM Cloud. And then on top of that, we built the batch processing system, so batch scheduling and some of these unique interfaces, the command line interface, and the user experience to get into that environment for the different workflows that I talked about. And one of the cool things is, because we built it on top of that Kubernetes layer, we can also expose the Kubernetes API if we want.

So like the Ray example I gave you, Ray doesn't really know anything about Code Engine, but Ray knows how to deploy onto and leverage a Kube cluster. So we're able to actually hand Ray the Kubernetes API server endpoint inside of Code Engine for your instance, and that framework can use Kubernetes to stand itself up. And then you can use the kind of simple abstractions on top, and that's still all in Code Engine. It's still pay as you go and it still scales to zero. And that's what I meant by saying you can kind of blend the lines: you, or the framework, can drop down to something like Kubernetes as needed to give you that flexibility.

Jeremy: Yeah, that's awesome. So you mentioned you have a fully managed Kubernetes service, and then you also have a bunch of other serverless services that run within the IBM Cloud. So OpenWhisk or, I guess, IBM Cloud Functions now. And then also, I mean, you mentioned Cloud Foundry, which is sort of a PaaS, but it's also sort of an easy-to-use serverless environment in a sense. Right? And so I guess, is this an evolution? Is this where you suggest people go?

Jason: Yeah. Yeah. So I think the simplest way to think about it is yes, Code Engine is the evolution of those ideas. It doesn't necessarily always have a direct technical lineage between those projects, but the problem that IBM Cloud Functions, with OpenWhisk, was trying to solve and the problem that Cloud Foundry was trying to solve with the start-from-source-code path are both represented in what we're doing in Code Engine. So Code Engine will be the kind of natural evolution path for those workloads and for the problems that those users are using those platforms for. The Cloud Foundry one, I think, is super interesting, in the sense that the rise of Kubernetes has clearly pivoted many people who were doing Cloud Foundry into doing Kubernetes.

Jeremy: Yeah.

Jason: And people are using Kubernetes as their foundation, and the Cloud Foundry project, which we're deeply involved in, has done a lot of work to realign Cloud Foundry with Kubernetes in a better way. But what never went away, what people always still saw value in with Cloud Foundry, was the simple push-my-source-code developer experience. Right? And so that still carries forward. With Code Engine, we're taking that same experience that we had in Cloud Foundry, and we're bringing it into this new service and onto Kubernetes, so the developer still gets that similar experience, but without the boundaries that we talked about. The challenge with Cloud Foundry was always, as soon as you want to do stateful things, or you want to do async jobs, Cloud Foundry didn't solve that problem. Go use a Kube cluster or go use some completely different environment. So it's kind of the same experience with the boundaries removed, and that's where we would see people go.

Jeremy: Right. So if I'm in one of those services now, if I've got things written in Cloud Functions or in Cloud Foundry, and I've hit some of those limits, or I just want to take advantage of some of the cooler things that Code Engine does, is there a simple migration path?

Jason: Yeah. In general, yes. For Cloud Foundry, for sure. It's pretty straightforward to take the same source code directory that you have and just push it to Code Engine instead. Right? So I think the path for Cloud Foundry, I mean, there's edge cases with everything obviously, but the basic workflow is the same. You can use the same source input directories. We mapped to Paketo Buildpacks, and a lot of that stuff came out of Cloud Foundry. So that has a really clean path. For Cloud Functions, there's a little bit of a timing thing. In general, yes, you can take your same functions and run them on Code Engine. OpenWhisk still has some advantages that we haven't quite gotten built into Code Engine yet. It's got faster startup times, for example. In the runtime model behind Code Engine, we're still starting a container, like a full container.

In OpenWhisk, we had done a bunch of work on warm starts of containers and container pooling, so we can get startup times of a small number of milliseconds on those functions. And some of that hasn't worked its way into Code Engine yet. So there are still some cases where Cloud Functions has some capability that doesn't quite exist in Code Engine yet, but over time that will get filled in, and there'll be a simple path to move all those workloads over to Code Engine as well.
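
The warm-start trick described here (a pool of primed instances that code is injected into) can be sketched in a few lines. This is a toy, stdlib-only illustration of the general pattern, not OpenWhisk's or Code Engine's actual implementation; the `Sandbox` and `WarmPool` names are invented for the example.

```python
# Toy illustration of warm starts: pay the slow "boot" cost up front for a
# pool of primed sandboxes, then inject user code into an already-warm one.
import time
from collections import deque

class Sandbox:
    def __init__(self):
        time.sleep(0.01)  # stand-in for slow container boot / runtime init
    def run(self, code_fn, *args):
        return code_fn(*args)  # "inject" the user's code and execute it

class WarmPool:
    def __init__(self, size: int):
        # Boot cost is paid once here, not on each request.
        self.pool = deque(Sandbox() for _ in range(size))
    def invoke(self, code_fn, *args):
        # Use a warm sandbox if one is free; otherwise fall back to a cold start.
        sandbox = self.pool.popleft() if self.pool else Sandbox()
        try:
            return sandbox.run(code_fn, *args)
        finally:
            self.pool.append(sandbox)  # recycle: stays warm for the next call

pool = WarmPool(size=4)
print(pool.invoke(lambda x: x + 1, 41))  # 42
```

Because the sandboxes are recycled rather than torn down, each `invoke` skips the simulated cold start as long as a warm sandbox is available, which is the effect Jason describes OpenWhisk achieving with container pooling.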

Jeremy: Right. So with Code Engine, because you mentioned this idea of sort of like the cold starts. So does Code Engine keep containers warm for a certain amount of time or is it always a cold start?

Jason: It is, in general, a cold start. In the scale-up, scale-down cycle, it may keep some containers around for a while, so it isn't overly aggressive about scaling them down and bringing them right back. But it's not doing some of the warm-start tricks yet that OpenWhisk was doing, where we have a pool of primed container instances and we're injecting code into them and running them. That's work in progress. There's work to do in Knative to improve that stack and then stuff to do in Code Engine. There's a balancing act there too ...

Jeremy: Yeah, definitely.

Jason: ... on things like network isolation and getting on customer VPC networks and other things which are harder to do in that warm start model.

Jeremy: Yeah, definitely. All right. So if somebody wanted to get started with Code Engine, what's the best way for them to do that, just sign up and start writing some code or how do they do that?

Jason: Yeah, kind of. I mean, obviously, we've been talking a lot about how developers use these things. And so I always think the best way to get started is either to build something on it or to try out some specific source code project. We've done a lot of things to try to make that easy. So there's a Code Engine landing page on IBM Cloud. It has some great examples to guide you through those three starting points I talked about: start from source code, start from an image, and run batch jobs. We have some really nice tutorials, specific text analysis tutorials, for example, that'll show you how to build applications on Code Engine. And we actually have a pretty cool Git repo, which will take you through tons of samples of how to use Code Engine to solve all kinds of problems.

So there's a lot of really good code assets out there that a developer could use to try something real on Code Engine, and the getting-started experience is super easy. You've got IBM Cloud, you log in, you go to Code Engine, you create a project, you push an image, and in a couple of minutes you'll have something up and running that you can play with.

Jeremy: Amazing. All right. So I love watching the evolution of things, and again, just this different way that IBM is thinking about serverless and, again, trying to make it easier. Because I always look back and I think of Lambda when it first came out. I was like, "Oh, it's so easy. You just put some code there and it's just done for you." And then we got more and more complex. And not that we didn't need to, I mean, some of this complexity is absolutely necessary, but I'm just curious, seeing the evolution and where things have gone. I talked to a bunch of people earlier, Rodric Rabbah, for example, who was one of the first people involved with the OpenWhisk project, I guess it was Apache OpenWhisk, or it became Apache OpenWhisk, seeing that evolution and seeing the changes that these different cloud providers have gone through, seeing the changes that IBM has gone through, and where you are now with Cloud Code Engine.

I'd love to get your perspective here on where you think this is going, not just maybe what the future is for IBM, but what you think the future of serverless is and just cloud computing maybe in general. I know that's a lot of question.

Jason: I'll give you a long answer.

Jeremy: Perfect.

Jason: So that brings to mind two things. First, let me talk about the complexity thing for a second. Managing complexity is always hard. You are so right that many things start out with a value prop of, "This is easy." And then as people use it, you add more and more, and three years later, we're like, "We need a new thing that's easy, because that other thing is too hard now." And there's no magic pill for that. That's always a hard problem to manage. However, one of the things I like about the approach that we're trying to take with Code Engine is that, because we've layered it on Kubernetes, it gives us a way to decide where we want that complexity to show up. When we had a Cloud Functions OpenWhisk stack and a Cloud Foundry stack and a Kubernetes stack, you had to try to solve all problems within each stack.

So each stack was getting more complex because you were trying to say, "Oh, I need storage, and I need private networking, and I need all these things." With Code Engine, I think we have an opportunity to say, once you cross some line, we're just going to ask you to drop down a layer and go use it directly in Kubernetes. You can push some of the complexity down, and that allows us to hold a harder line on complexity in the developer layer on top. The balancing act we're trying to play is: because we built it on a common platform, we don't have to solve all problems in Code Engine directly.

Jeremy: Right.

Jason: So that's kind of my viewpoint on the complexity problem. On the evolution, it's really interesting. So one of the other things that my team's working on and launched recently is this thing called IBM Cloud Satellite, which is about distributing cloud outside of cloud data centers, so you can consume cloud services anywhere you want. So cloud computing in general, and this is not just an IBM thing, in the industry, cloud computing is diversifying to be kind of omnipresent. You can consume cloud on-prem, at the edge, in our cloud data centers, wherever you want. There's a programming model dimension to that problem, too. Especially as you go to the edge, you want some of this simple-to-consume, easy-to-deploy, scale-to-zero, resource-efficient model, because at the edge, especially, you don't have 2,000 cores worth of compute to go deal with.

You have one box in a retail store, or you have two servers in the back of the distribution center. And so I think things like Code Engine layered on top of distributed cloud and in our case, things like Satellite, is actually a really powerful combination. I think we're going to see serverless become the dominant application development and deployment model, especially for these edge use cases, because it combines ease of deployment and management with efficiency and scale to zero footprint, which are all really attractive when you get outside of a mega data center like you have in cloud.

Jeremy: Right. Right. So I love this idea, too, about exposing the complexity only when the complexity needs to be exposed. I love this idea of creating sane defaults, right? If you could default Kubernetes to do all the optimal things that you would need it to do for use case X, just do that for me, and then if I say, "Oh, I want to tweak this one thing," be able to go down to that level. But I love this idea you mentioned about the edge, too, because that's one of those things where, from a programming model, as you said, how do you write code that's sort of, I guess, environment-aware? How does it know whether it's running at the edge versus running in a data center versus running maybe in a hybrid cloud, partially in your own private cloud or your own private data center? Wrapping your head around that model from a developer standpoint, I think, is incredibly complex right there.

Jason: Yeah. It is. And sometimes it's like, how do they know? And sometimes it's like, how do I just operate at a high enough level of abstraction that the differences between those environments can get handled below me? If I'm consuming Kubernetes clusters directly, the shape of that Kubernetes cluster in a retail store, or a telco data center in Atlanta somewhere, or in the cloud are all going to be different, because you have a different amount of capacity, you have a different networking environment. So you're going to have to deal with the differences. If I'm giving you a container image and saying, "Run this," the developer doesn't have to deal with those differences. The provider might have to deal with those differences, but the developer doesn't. So that's where I think things like serverless and approaches like Code Engine become much more valuable, because you're just dealing at this higher level of abstraction, and then Satellite and Code Engine and other services can kind of magically deal with the complexity for you.

Jeremy: Yeah. And so I know we talked a lot about Kubernetes and what's running underneath a lot of these services. Is that something you see, though, as being that sort of common format across all these different services, or do you think that something will evolve beyond Kubernetes to become a standard?

Jason: Right now, I really think that Kubernetes will become the base platform. What Kubernetes is will probably keep evolving. And I'm not saying it's Kubernetes forever, but I don't think we should underestimate the power of the kind of industry-wide alignment that exists around containerization and Kubernetes as the next infrastructure platform, if you will, because that's kind of really what it is. And I told you at the beginning, I used to build WebSphere app servers. So I was very involved in the whole Java app server era, the late 90s and early 2000s. And at that time, the industry kind of aligned around two platforms, Java and .NET, as the two dominant, at least enterprise, application platforms. Now we have everyone aligned on Kube. Literally, there's nobody in the industry who's not saying, "Kubernetes is the platform." So I think it will be the abstraction for infrastructure in all these environments. The question will be, how do you consume it? Who manages it? How's it delivered? How does it optimize itself? And at what level do you consume it?

And I don't think Code Engine is the end of it at all. I think there's lots of room for improving the consumption experience on top of Kubernetes for these developer use cases.

Jeremy: Yeah. Yeah. And that actually was going to be my next question: what's the next evolution of Code Engine? Is that going to be driving into specific use cases more and trying to solve those, or becoming more flexible? How do you see developers, I don't know, in five years, and this is probably a hard question, but in five years, how are we going to be writing cloud applications?

Jason: Yeah. It's a great and super hard question, but I think projects like Ray are an interesting forward look into where this might go. One of the things that I've always felt, if I look at the whole history of PaaS in particular over the last five, six, seven years, is that PaaS has always been about simplifying the experience for the developers, but fundamentally, most PaaS environments don't change anything about how you write the code. They change how you package the code, how you deploy the code, how the code is executed, and how the dependencies of the code are satisfied. But the actual code you write probably wasn't any different. Right? And that's where I think the next step is: how do we actually get into the languages, into the code structure itself, to be able to take advantage of cloud capacity, to be able to take advantage of scale? And there's lots of projects that have taken attempts at that.

Ray, as an example, I think is a particularly interesting one, because there's some good examples where you can take a Python function, you literally add like one annotation to it in the language, and now it becomes remotely executable and horizontally scalable for you.

Jeremy: Right.

Jason: It's that kind of stuff that I think three or four years from now, there'll be a lot more of, where we're actually changing how code is written because that code can assume there's some containerized, scalable fabric out there somewhere that it can go execute on top of.
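
The one-annotation idea Jason describes can be sketched with only the Python standard library. This stand-in mimics the shape of Ray's `@ray.remote` / `.remote()` / `ray.get()` API, with a thread pool playing the role of the containerized, scalable fabric; it is not Ray itself, and the `remote` and `get` names here are just modeled on Ray's.

```python
# Stdlib-only sketch of the "annotate a function to make it remotely
# executable" pattern: one decorator adds a .remote(...) method that
# schedules the function on a pool (a stand-in for a real cluster).
from concurrent.futures import ThreadPoolExecutor, Future

_pool = ThreadPoolExecutor(max_workers=8)  # stand-in for cluster capacity

def remote(fn):
    """Decorator: attach a .remote(...) method that runs fn asynchronously."""
    def submit(*args, **kwargs) -> Future:
        return _pool.submit(fn, *args, **kwargs)
    fn.remote = submit
    return fn

def get(futures):
    """Block until results are ready, like ray.get()."""
    if isinstance(futures, Future):
        return futures.result()
    return [f.result() for f in futures]

@remote
def square(x: int) -> int:
    return x * x

# Fan the work out horizontally, then gather the results.
results = get([square.remote(i) for i in range(5)])
print(results)  # [0, 1, 4, 9, 16]
```

With real Ray, the decorator is `@ray.remote` and the futures resolve against a cluster, for example one stood up through the Kubernetes API that Code Engine exposes, as discussed earlier in the episode.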

Jeremy: Right. Yeah. And I think about that pendulum swing for developers, especially developers in the cloud, who used to be writing a bunch of code, whether it was JavaScript or Python or Java, whatever it was, and then all of a sudden now they have to switch context and be like, "All right, now I have to write a YAML file in order to configure my cloud resources," and that sort of back and forth. So yeah, that marrying of, basically, a programming language for the cloud is a really interesting concept.

Jason: And I think the distributed cloud notion, funnily enough, is a big enabler of that. Because the other tension I see right now is, let's say you want to use Lambda or you want to use serverless functions. That only works in your cloud environment, but you're also running something at the edge or in your data center, so you're forced to use different approaches, which tends to force you toward some common-denominator models.

Jeremy: Right. Right.

Jason: And so you're kind of holding back from really adopting some of these newer models because of the diversity. Well, if cloud goes everywhere and those services go everywhere, then now I can just say, "Well, I'll use the serverless model everywhere. And so I can really deeply adopt it." So I think the distributed cloud thing will open up the opportunity to embed these approaches more deeply in kind of day-to-day development activities.

Jeremy: Yeah. No, I love that. I'm all for that approach because I think this split-brain sort of approach to it is getting very complex and it's not super easy. So is there anything else that you'd like to let the listeners know about IBM Cloud Code Engine?

Jason: No. I mean, I think we touched on a lot of the motivation behind it and the kind of core capabilities. I would just encourage you to go check it out, go check out the space, go give it a try and love to hear people's feedback as they do that.

Jeremy: Awesome. Well, first of all, I've got to make sure I thank IBM Cloud for sponsoring this episode, because the team over there and everything that all of you are working on is amazing stuff, and I appreciate the support. We appreciate the support in the community for what you're doing. So if people want to find out more about you or more about Cloud Code Engine, how do they do that?

Jason: Yeah. And you can find me on Twitter, JRMcGee, or LinkedIn. For me personally, I love to talk to people. For Code Engine, I think the best place to start is the product page, which is ibm.com/cloud/code-engine. And from there, you can get to all of the code examples I talked about.

Jeremy: Awesome. All right. Well, I will put all that stuff in the show notes. Thanks again, Jason.

Jason: Yeah. Great. Thanks, Jeremy.

2021-04-05

Episode #94: Serverless for Scientific Research with Denis Bauer

About Denis Bauer

Dr. Denis Bauer is an internationally recognized expert in artificial intelligence, who is passionate about improving health by understanding the secrets in our genome using cloud-computing technology. She is CSIRO's Principal Research Scientist in transformational bioinformatics and adjunct associate professor at Macquarie University. She keynotes international IT, LifeScience, and Medical conferences and is an AWS Data Hero, determined to bridge the gap between academe and industry. To date, she has attracted more than $31M to further health research and digital applications. Her achievements include developing open-source bioinformatics software to detect new disease genes and developing computational tools to track, monitor, and diagnose emerging diseases, such as COVID-19.

Twitter: https://twitter.com/allPowerde
LinkedIn: https://www.linkedin.com/in/denisbauer/
Webpage: https://bioinformatics.csiro.au/

Watch this episode on YouTube: https://youtu.be/5MGxgYd93Jw

This episode sponsored by New Relic.

Transcript:

Jeremy: Hi everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm chatting with Denis Bauer. Hey, Denis, thanks for joining me.

Denis: Thanks for having me. Great to be on your show.

Jeremy: So you are a Group Lead at CSIRO and an Honorary Associate Professor at Macquarie University in Sydney, Australia. So I would love it if you could explain and tell the listeners a little bit about your background and what CSIRO does.

Denis: Yeah. CSIRO is Australia's government research agency, and Macquarie University is one of Australia's Ivy League universities. They've been working together on really translating research into products that people can use in their everyday life. Specifically, they worked together to invent WiFi, which is now used in 5 billion devices worldwide. CSIRO has also collaborated with other universities and, for example, developed the first treatment for influenza. And on a lighter note, it has developed a recipe book, the Total Wellbeing Diet book, which is now on the bestseller list alongside Harry Potter and The Da Vinci Code. From that perspective, CSIRO really has this nice balance between products that people need and products that people enjoy.

Jeremy: Right. And what's your background?

Denis: So my background is in bioinformatics, which means that in my undergraduate, I was together with the students that did IT courses, math, and stats, as well as medicine and molecular biology, and then in the last year of the study, all of this was brought together in a specialized way of really focusing on what bioinformatics is. Which is using computers, back in the days it was high-performance compute, to analyze massive amounts of life science data. Today, this is of course cloud computing, for me at least.

Jeremy: Right. Well, that's pretty amazing. Today's episode ... I've seen you talk a number of times, all remotely, unfortunately. I hope one day I'll be able to see you speak in person when we can start traveling again. I've seen you speak a lot about the scientific research that's being done and the work CSIRO is doing, and more specifically, how you're doing it with serverless, and how serverless is enabling you to do some of these things in a way that was probably only possible for really large institutions in the past. I want to focus this episode on this idea of serverless for scientific research. We're going to talk about COVID later, and we can talk about a couple of other things, but really it's a much broader thing. I had a conversation with Lynn Langit before, where we were talking about Big Data and the role that plays in genomics and some of these other things, and how the cloud accelerates people's ability to do that. Maybe before we get into the serverless part of this, we could take a step back and you could give me a little more context on the type of research that you and your organization have been doing.

Denis: Yeah. So my group is the Transformational Bioinformatics Team. So again, it's translating research into something that affects the real world. In our case that usually is medical practice because we want to research human health and improve disease treatment and all this management going forward and for that data is really critical. It's sort of the one thing that separates a hunch from actually something that you can point to and say, "Okay, this is evidence moving forward," and from there you can incrementally improve and you know that you're going in the right direction rather than just exploring the space.

Jeremy: Right. And you mentioned data again. Data is one of those things where, and I know this is something you mention in your talks, the importance of data, or the amount of data and what you can do with it, is becoming almost as important, if not just as important, as the actual clinicians on the frontline treating disease. So can you expand upon that a little bit? What role does data play? And maybe you could give us an example of where data helped make better decisions.

Denis: Yeah. So a very recent example is of course COVID, where no one knew anything really at the beginning. I mean, coronaviruses were studied, but not to that extent. So the information that we had at the beginning of the pandemic was very basic. From that perspective, when you know nothing about a disease, the first thing you need to do is collect information. Back then, we did not have that information, and actions were needed. So some of the decisions that had to be made back then were based on hunches and previous assumptions that were made about other diseases. So for example, in the UK they defined their strategy based on how influenza behaved and how it spread, and we now know that how influenza spreads and how coronavirus spreads are vastly different. So over the course of events, more research was done, and based on that, they adjusted, probably the whole world adjusted, how they managed or interfered with the disease. We now know that whatever we did at the beginning was not as good as what we're doing now, so data is absolutely critical.

Jeremy: Right. And the problem with medical data, I would assume, is that it's massive, right? There's just so much of it out there. When we start talking about genomics and gene sequencing and things like that, I can imagine there's a lot of data in every sample. And so, you've got this massive amount of data that you need to deal with. I do want to get into that a little bit. Maybe we can start getting into this idea of genome editing and things like that, and where serverless fits in there.

Denis: Yeah, absolutely. So my group researches two different areas. One is genome analysis, where we try to understand disease genes and predict risk, for example, of developing heart disease or diabetes in the future. But the other element is around doing something, treating actual patients with newer technology, and this is where genome editing or genomic surgery comes in, where the aim is to cure diseases that were previously thought to be incurable genetic diseases. The aim of genome engineering is to go into a living cell and make a change in the genome, at a specific location, at a specific time, without accidentally editing other genes. And this is a massively complicated task on a molecular level, but also on a guidance level, on a computational level, which is where serverless comes in.

Jeremy: Right. Now, this is that CRISPR thing, right?

Denis: Exactly. So CRISPR is the genome engineering or genome editing machinery. It's basically a nano machinery that goes into yourself, find right location in the genome, and makes that edit at that spot.

Jeremy: Right. So then how do you find the spot that you're supposed to edit?

Denis: Mm-hmm. So CRISPR is programmable, so as IT people we can easily relate to that, in that it basically does a string search. It goes through the genome, which is 3 billion letters, and it finds a specific string that you program it with. This particular string needs to provide the landing pad for the machinery to actually interact with the DNA, because it can't interact at just any location.

Jeremy: Right.

Denis: From that perspective, it's like finding the right grain of sand on a beach. It has to be the right shape, the right size, and the right color for this machinery to actually be able to interact with the genome, which of course is very complicated. But it doesn't stop there, because we want it to be editing only a specific gene and not accidentally editing another, correct gene. Therefore, this particular landing pad, or string, needs to be unique enough in the 3 billion letters of the genome so that it doesn't accidentally draw the machinery somewhere else. This particular string needs to be compared to all the other potential binding sites in the genome to make sure that it's unique enough to faithfully attract this machinery. This particular string is actually very short; therefore, when you think of the combinatorics, it's a hugely complicated problem that requires a lot of computational methods to get us there.
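
To make the "unique enough landing pad" idea concrete, here is a deliberately simplified, hypothetical sketch: it counts approximate occurrences of a candidate guide string in a genome, within a Hamming-distance budget, and accepts the guide only if the intended site is its sole near-match. Real tools, including the team's search engine, use far more sophisticated, position-aware scoring; the function names here are invented for the illustration.

```python
# Toy uniqueness check for a candidate CRISPR target site ("guide").
def hamming(a: str, b: str) -> int:
    """Number of mismatched letters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def off_target_sites(genome: str, guide: str, max_mismatches: int = 2):
    """All positions where the genome matches the guide with at most
    max_mismatches substitutions (including the intended site itself)."""
    k = len(guide)
    return [
        i for i in range(len(genome) - k + 1)
        if hamming(genome[i:i + k], guide) <= max_mismatches
    ]

def is_unique(genome: str, guide: str, max_mismatches: int = 2) -> bool:
    """A guide is usable only if exactly one (near-)match exists."""
    return len(off_target_sites(genome, guide, max_mismatches)) == 1

toy_genome = "AACCGGTTACGTTTGACGTA"  # stand-in for 3 billion letters
print(off_target_sites(toy_genome, "ACGT", max_mismatches=0))  # [8, 15]
print(is_unique(toy_genome, "ACGT", max_mismatches=0))  # False: two exact sites
```

Brute-forcing billions of candidate positions per guide is what makes this problem computationally heavy at genome scale, which is the part the cloud architecture has to absorb.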

Jeremy: Yeah, I can imagine. So before CRISPR can go in and even identify that spot, I'm assuming there's more research that goes into understanding where that spot is, right? Like how you would even find that spot within the sequenced genome.

Denis: Yeah, of course. The first thing you need to find out is what kind of gene you actually want to edit, where the problem is, and this is the first part of my group's research: finding the disease genes. And even within the gene, because it has a complicated structure, you need to find the location that is actually most beneficial for the machinery to interact with, and this is where we developed the search engine for the genome. It's a webpage where researchers can type in the gene that they want to edit, and the computation then goes in and finds the right spot, the binding site with the right shape, color, and size, but also makes sure that it's unique enough compared to all the other sites.

Jeremy: Right. And so, this search engine, how exactly does it work? What's the architecture of it?

Denis: Yeah. So in order to build the search engine for the genome, we wanted to have something that is always online, that researchers can go to at any time of the day and trigger off, or kick off, this massive compute. In order to do that in the cloud, you would have the option of having massive EC2 instances running 24/7, which of course would have broken the bank. Or, we could have used an autoscaling group, where it would eventually scale out to the massive amount of compute needed to serve that task. Researchers tend to not have a lot of patience when it comes to online tools and online analysis, therefore it needed to be something that could be done within seconds. So an autoscaling group wasn't an option either, and therefore the only thing that we could do was use serverless. This search engine for the genome is built on serverless architecture, and back then, we built it like four years ago, that was one of the first real-world architectures that did something more complicated than serve an Alexa skill.

Jeremy: Right. You obviously can't fit 4 billion letters into a single Lambda function, so how do you actually use something like Lambda, which is stateless, to basically load all that data to be able to search it?

Denis: Yeah, exactly. That was the first problem that we actually ran into, and back then, we weren't really aware of this problem. Back then, the resource limits were even lower. It wasn't only the memory issue, but it was also the timing-out issue. We figured, "Okay, well, how about rather than processing this one task in one go, we could break it up into smaller chunks and parallelize it."

Jeremy: Right.

Denis: And this is exactly what we've done with a serverless architecture: we used an SNS topic in order to send the payload of which region in the genome a specific Lambda function should analyze. And then, from there, the result of that Lambda function was put into a DynamoDB table, sort of an asynchronous way of collecting all the information, and after all of this was done, the summary was sent back to the user.
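The fan-out half of that pattern could look roughly like this. The chunking helper is pure Python; `fan_out` takes a boto3-style SNS client, and the topic ARN, chunk size, and message shape are illustrative assumptions, not CSIRO's actual resources.

```python
import json

def genome_regions(genome_length: int, chunk_size: int) -> list:
    """Pure helper: [start, end) regions covering the whole genome."""
    return [
        {"start": s, "end": min(s + chunk_size, genome_length)}
        for s in range(0, genome_length, chunk_size)
    ]

def fan_out(sns_client, topic_arn: str, genome_length: int, chunk_size: int) -> int:
    """Publish one SNS message per region; each message triggers a Lambda
    worker that analyzes its region and writes the result to DynamoDB.
    Returns the number of tasks started."""
    regions = genome_regions(genome_length, chunk_size)
    for region in regions:
        sns_client.publish(TopicArn=topic_arn, Message=json.dumps(region))
    return len(regions)
```

The fan-in side would then poll or count the DynamoDB records until all regions have reported, and assemble the summary for the user.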

Jeremy: So like a fan-in fan-out pattern?

Denis: That's exactly right.

Jeremy: Right. Cool. So then, where were you storing the genome data, was that in like S3?

Denis: Exactly. This particular one is in S3. We did experiment with other options, like having a database or having Athena work with that, but the problem was that the interaction wasn't quite as seamless as S3. Because in bioinformatics we do have a lot of tricks around the indexing of large flat files, any other solution that tried to shortcut this wasn't as efficient as these purpose-built indexing approaches. So, therefore, having the files just sit on S3 and querying from there was the most efficient way of doing things.

Jeremy: Right. And so, are you just searching through like one sequence, or are there like thousands of sequences that you're searching through as part of this? And then how were they stored? Were you storing like 4 billion letters in one flat file, or are they multiple files? How does that work?

Denis: Yeah, so it is 3 billion letters in one flat file.

Jeremy: Did I say 4 billion, sorry, 3 billion.

Denis: 3 billion letters in one flat file, with indexing in order to not start from the beginning but jump straight to where that letter is. It depends on the application case as well. If you're searching one reference genome, which is basically what it's called when you search a specific genome for a specific species, for example, human, it typically is one genome. But if you search bacterial data or viral data, there can be multiple organisms in one file, so it really depends on the application case.
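The jump-straight-in trick can be sketched as a tiny index mapping each record (say, a chromosome) to its byte offset in the flat file. With S3, `fetch_region` would become a `GetObject` call with a `Range` header instead of a string slice; the record names and index layout here are assumptions for illustration.

```python
def build_index(records: dict) -> tuple:
    """Concatenate named sequences into one flat string, recording each
    record's (offset, length) so we can jump to it later."""
    parts, index, pos = [], {}, 0
    for name, seq in records.items():
        index[name] = (pos, len(seq))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), index

def fetch_region(flat: str, index: dict, name: str, start: int, length: int) -> str:
    """Read a sub-sequence directly via the index. Against S3 this would be
    GetObject with Range=f'bytes={offset+start}-{offset+start+length-1}',
    so only the needed slice ever leaves the bucket."""
    offset, _ = index[name]
    return flat[offset + start : offset + start + length]
```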

Jeremy: Awesome. Yeah. I'm just curious how that actually works, because I can see this being a solution for other big data problems as well, like being able to search through a massive amount of text in parallel and breaking that up, so that's pretty cool. Basically, what you're doing is using Lambda here sort of as a parallelized supercomputer, right? Sort of as high-performance compute. From a cost standpoint, you mentioned having this running all the time would be sort of insane. How does the cost differ? I mean, is this something where anybody can use this, or is it still somewhat cost-prohibitive?

Denis: Anyone can use it, for sure. Not for this application, but for another application, we've made a side-by-side comparison of running it the standard way in the cloud, with EC2 instances and databases and things like that. The task that we looked at cost around $3,000 a month, and this was for hosting human data for rare disease research, whereas using serverless, we can bring that down to $15 a month ...

Jeremy: Wow.

Denis: ... which is like less than a cup of coffee to advance human research. So to me that's absolutely a no-brainer to go into this area.

Jeremy: I would say. What tricks might you have been using to speed up some of this processing? Like in terms of loading the data and things like that, was there anything where you could use serverless to power that?

Denis: Well, we looked at Parquet indexing as one of the solutions, and that worked really well for the super-massive files in the human space. But again, it comes down to indexing S3, and there was nothing really special around the serverless access. In saying that, one of the big benefits of serverless, again, is being able to parallelize it, which means the data doesn't have to be in one account. It can be spread over multiple accounts, and you just point the Lambda functions to the multiple accounts and then collect back the results. And this is something that we've done, for example, for the COVID research, where we did the parallelization in a different way. In genome research there's always the problem of having to deal with large data. We always go serverless first, and therefore we came up against this problem of running out of resources in the Lambda function very frequently.

Jeremy: Right.

Denis: Therefore, we came up with this whole range of different parallelization patterns that serve anything from completely asynchronous approaches, like with GT-Scan, where you deposit the data back in DynamoDB (an asynchronous approach is one where you don't actually have to collect the data back), to completely synchronous approaches, where you basically have to monitor everything that you do and make sure that everything is running in sync in order to collect the data back together.

Jeremy: Right. Let's get into the COVID response here because I know there was quite a bit of work that your organization did around that. Before we get into the pattern differences, what exactly was the involvement of CSIRO in the Australian government's response to coronavirus?

Denis: Yeah, so we were fortunate in working together with CEPI, which is the international consortium sponsored by the Gates Foundation, which way back when was preparing for disease X, pandemic X, to come, and it was curious that only a year later COVID hit. So all of this pre-work in preparing for this hypothetical future disease was actually needed only a year later. So, therefore, CSIRO and CEPI had already put everything in place in order to have a rapid response should the pandemic hit, being able to test the vaccine developments, so the efficacy in animal models; that was the part that CSIRO was tasked to do. But in order to do that, because with pathogens, RNA viruses in this particular case, we know that they mutate, which means the genome changes slightly with every replication cycle. Also, we've heard about the England strain or the South African strain being slightly different.

So with every mutation, there is a risk that the vaccine might not be working anymore, might not be effective anymore. Therefore, the first thing we needed to find out was: where is this whole global pandemic heading? Is it mutating away in a certain direction? Is that direction something that we should focus future disease research on, rather than focusing on the current strains that are available? So, therefore, we've done the first study around this particular question of how the virus is mutating and whether the future of vaccine development is actually jeopardized by that. The good news was that coronavirus is mutating relatively slowly, and therefore the changes that we observed back then, and likely nowadays as well, are probably not going to affect vaccine efficacy dramatically.

Jeremy: Right. You had mentioned in another talk that you gave something about being able to look at those different variants and trying to identify the peaks or whatever that were close to one another, so you could determine how far apart each individual variant was or something like that. Again, I know nothing about this stuff, but I thought that was kind of fascinating where it was like, I don't know if it was, you could look at the different strains and figure out if different markers had something to do with whether or not it was more dangerous or they were easier to spread and things like that, so I found that sort of to be really interesting.

Denis: Yeah. There are different properties, again, with those mutations. We don't know what actually could come out of this because, again, coronaviruses are not studied to the extent that we could really be confident in saying a change here would definitely cause this kind of effect.

Jeremy: Right.

Denis: Therefore, coming back to a purely data-driven approach, and that's what we've done. We've converted each virus, with the 20,000 letters in its sequence, into a k-mer profile. K-mers are little strings, little words, and we collected how often each specific word appeared in that sequence. So basically, serializing it, or [inaudible] coding, if you want to. And with that kind of information, we were running a principal component analysis in order to put it on a 2D map. And then from there, the distance between one dot, which represents a particular virus strain, and the next dot represents the evolutionary distance between those two entities. And from there, we can then overlay the time component to see if it's moving away from its origin. And we do know that this is happening, because every mutation gets passed on to the next generation of viruses, which mutates in turn, and so on, so it does slightly drift away from the first instance that we recorded.

And this is what we've done with machine learning in order to identify and create this 2D map for researchers, to really have a sort of understanding and a way of monitoring how fast it's actually moving and whether that pace is accelerating or not. Currently, there are 500,000 instances of the virus collected from around the world. So 500,000 times 20,000, the length of the genome: that is 10 billion data points that we need to analyze in order to really monitor where this whole pandemic is going.
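The k-mer profiling step can be sketched like this, with a simple Euclidean distance between profiles standing in for the evolutionary distance between two strains. The real pipeline runs a principal component analysis over these profiles to produce the 2D map; the choice of `k=3` and the distance measure here are illustrative only.

```python
from collections import Counter

def kmer_profile(sequence: str, k: int = 3) -> Counter:
    """Count how often each k-letter substring (k-mer) occurs in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def profile_distance(a: Counter, b: Counter) -> float:
    """Euclidean distance between two k-mer profiles -- a simple stand-in
    for the evolutionary distance between two virus strains."""
    keys = set(a) | set(b)
    return sum((a[key] - b[key]) ** 2 for key in keys) ** 0.5
```

Projecting many such profiles into two dimensions (for example with PCA) gives the kind of map Denis describes, where nearby dots are evolutionarily close strains.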

Jeremy: Right. And so are you using a similar infrastructure to do that, or is that different?

Denis: We are. Although in this particular case we had to actually give up on serverless, in that the actual compute that we're doing is not done on serverless. We're using an EC2 instance, but the EC2 instance is triggered by serverless, and the rest of this whole thing is handled and managed by serverless. Eventually, we're planning on making it serverless, but it requires some re-implementation of the traditional approaches, which we just didn't have time for at the moment.

Jeremy: Right. Is that because of the machine learning aspect?

Denis: It's not necessarily the machine learning aspect; it's more the traditional methods of generating these distances, if you want. There's another element to it, which is around creating phylogenetic trees, which is basically a similar way of recording the genetic distances between two organisms. So you can think of this like the tree of life, where you have humans and the apes, and so on. A phylogenetic tree is basically that, except for only the coronavirus space. And in order to create that, we needed to use traditional approaches, which use massive amounts of memory, and there was no way of us parallelizing it in one of those clever ways to bring it down into the memory constraints of a Lambda function yet.

Jeremy: But you say yet, so you think that it is possible though that you could definitely build this in a serverless way?

Denis: Yeah, absolutely. I mean, it's just a matter of parallelizing it with one of the clever parallelization methods that we've developed now. In another COVID approach, for example, which we implemented from scratch, we're using serverless parallelization in a different way. Here we're using recursion in order to break down the tasks in a more dynamic way, which would basically be required in the tracking approach as well. With this one, the approach is around being able to trace the origin of infection. So imagine someone comes into a pathology lab and it is not quite clear where they got the infection from. Therefore, social tracing happens: interviews about where they've been, who they got in contact with, and so on. But molecular tracing can also happen, where you can look at the specific mutation profile that that individual's virus has and compare it to all the 500,000 virus strains that are known from around the world, and the ones closest to it are probably close to the origin of where someone got it from.

And therefore, being able to quickly compare this profile with the 10 billion data points that are online was the task, and doing that serverlessly was what we developed. It's called the Path Beacon approach, because Beacon is a protocol in the human health space that we adopted, and it's completely serverless. What it does is break down the task of searching all those 10 billion elements out there. It breaks it down into dynamic chunks, because we don't necessarily know how many mutations are in each element of the genome; sometimes there might be two or three mutations, and sometimes there might be thousands of them.

Jeremy: Right.

Denis: Therefore, we first parallelize it in larger chunks, and then if necessary, when a Lambda function would be running out of time, we can split off new Lambda functions that handle sub-tasks, and so on. So we can proceed down the recursion, spinning up more and more Lambda functions that all individually deposit their data. It's another asynchronous approach, because we don't have to go back up the recursion tree in order to resolve the whole chain; each Lambda function itself has the capability of recording, handling, and shutting down its part of the analysis.
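The recursive splitting Denis describes can be sketched with the time budget and the invoker injected as callables, which keeps the splitting logic testable. In a real Lambda, `time_left_ms` would be the context's `get_remaining_time_in_millis` and `invoke` an asynchronous Lambda invocation; the 10-second threshold is a made-up number.

```python
def process_range(start, end, time_left_ms, invoke, write_result, min_budget_ms=10_000):
    """Process genome positions [start, end). If the remaining time budget is
    too small for the range, split it in two and hand each half to a fresh
    (child) Lambda invocation instead of timing out."""
    if time_left_ms() < min_budget_ms and end - start > 1:
        mid = (start + end) // 2
        invoke(start, mid)   # async child Lambda for the left half
        invoke(mid, end)     # async child Lambda for the right half
        return "split"
    write_result(start, end)  # small enough: do the work, deposit the result
    return "done"
```

Because each child deposits its own result, nothing has to walk back up the recursion tree; the chain resolves itself, which matches the asynchronous pattern described above.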

Jeremy: Let's say that I'm an independent lab somewhere, I'm a lab in the United States or whatever, and I run the test and then I get that sequence. Is this something I can just put into this service and then that service will run that calculation for me and come back and say, "This strain is most popular or occurs most likely in XYZ?"

Denis: That's exactly right. That's exactly the idea. And this is so valuable because the pathology labs might have their own data from their local environment, like from the local country, which they aren't necessarily in a position to share with the world yet. And therefore, being able to merge these two things, the international data with the local data, because serverless allows you to have different data sources in different accounts, I think is going to be crucial going forward. Especially around vaccination status and things like that, where we do want to know if the virus managed to escape the vaccine, should it escape. All of this is really crucial information for monitoring the progression going forward.

Jeremy: Right. Now, you get some of the data from, was it GISAID, or something like that? And I remember you mentioning something along the lines of, you were trying to look at different characteristics, like maybe different symptoms that people were having, but the reporting was wildly inconsistent; it varied greatly. I think one of the examples you gave was the loss of smell, which was described multiple ways in free text. So what were you doing with that? What was the purpose of trying to collect that data?

Denis: Yeah. GISAID is the largest database for genomic COVID virus data around the world. They originally came from influenza data collection and then very quickly moved toward COVID and provided this fantastic resource for the world and the pathology labs of the world to deposit their data. In that effort to collect the crucial data, the genomic data for tracing and tracking, and make it available, they didn't necessarily implement the medical data collection part in a way that enables the analysis that we would want to do. Partly because of the technical aspects, but mainly because it requires a lot more ethical, data-responsibility, and security consideration in order to get access to that kind of data. All they had was a free-text field with every sample, so that, if the pathology lab had that information, it could quickly annotate how the patient was doing.

This clearly was a crude proxy for what we actually would have needed: the exact definition of the diseases, ideally annotated in an interoperable way using standard terminologies, and this is basically what we've developed. We're using FHIR, which is really the most accepted terminology approach around the world, which allows you to catalog certain responses. Instead of saying anosmia, which is the loss of the sense of smell, it has a specific code attached to it. This code is universal, and it's relatively straightforward to just type in the free text, and then the tool that we've developed automatically converts that into the right code, and this should be the information that is recorded. Similarly, in the future, what kind of vaccines a person has received, and so on. And then from there we can run the analysis of saying: of the 20,000 letters in the SARS-CoV-2 genome, so the COVID virus genome, is any one of those mutations associated with how infectious a certain strain is, or whether it has a different disease progression, or whether it's resistant to a certain vaccine?

All of this is really critical, but because there are 20,000 letters, these associations can be very spurious. In order to get to a statistically significant level, we do need a lot of data, and currently this data is just not available. We went through and looked for the annotations where we had good-quality data on how the patient was doing. I think we ended up with 500 instances, out of the 200,000 that were submitted back then, that were well enough annotated to do this association analysis of saying a mutation is associated with an outcome. And while we found some associations, specifically in the spike protein, that would affect how virulent this particular strain could be or what kind of disease it could cause, it definitely was not statistically significant. So we definitely need to repeat that once we have more data and better-annotated data.
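The free-text-to-code normalization step could be sketched as a lookup into a synonym table that emits a FHIR-style Coding. The synonym list below is invented for illustration, and the SNOMED CT code shown for anosmia is only an example of the kind of code such a tool would emit; the actual tool's mapping table is certainly richer.

```python
# One target concept, expressed as a FHIR-style Coding (system/code/display).
ANOSMIA = {"system": "http://snomed.info/sct", "code": "44169009", "display": "Anosmia"}

# Hypothetical synonym table: the many free-text spellings map to one code.
SYNONYMS = {
    "anosmia": ANOSMIA,
    "loss of smell": ANOSMIA,
    "lost sense of smell": ANOSMIA,
    "can't smell": ANOSMIA,
}

def to_coding(free_text):
    """Map a free-text annotation to a coded concept, or None if unknown."""
    key = free_text.strip().lower()
    return SYNONYMS.get(key)
```

Once every lab's annotations pass through such a normalizer, "loss of smell", "anosmia", and "can't smell" all count toward the same bucket, which is what makes association analysis across 200,000 samples feasible at all.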

Jeremy: Yeah. But that's pretty amazing if you could say someone's loss of smell for example, is associated with particular variants of the disease or that certain ones are more deadly or more contagious or whatever. And then if you were able to track that around the world, you'd be able to make decisions about whether or not you might need a lockdown because there was a very contagious strain or something like that. Or, maybe target where vaccines go in certain areas based off of, I guess, the deadliness of that strain or whatever it was. That's pretty cool stuff.

Denis: Yeah, exactly. So rather than shutting down completely, based on any strain, it could be more targeted in the future and probably will be more targeted in the future.

Jeremy: All right. Now, is this something where everything you've built, all of this information you've learned, that when the next pandemic comes because that's another thing I hear quite a bit. It's like, the next pandemic is probably right around the corner, which is not comforting news, but unfortunately probably true. Is this the kind of thing though where with all this stuff you're putting into place that the next round of data is just going to be so much better and we're going to be so much better prepared?

Denis: Absolutely. That is definitely the aim. I mean, you do have to learn from the past, and having this instance happen firmly moves it from the theoretical space, where everyone was talking about it before, to "Oh, yes. It is actually happening." There was a paper published in Nature last month, sorry, last year. It was around how much money we have lost through this particular pandemic; the lives lost, obviously, are invaluable. But looking at the pure economics of it: how much money have we lost, and how far will this damage extend into the future? Therefore, they did a cost-benefit analysis of saying, "How much are we willing to invest in order to prevent anything like this from happening in the future?"

Jeremy: Right.

Denis: The figures that they came up with, and this was way back when we didn't really even know what the complete effect was, and we still don't know. But even back then the figures were astronomical. So I think there's going to be a huge shift toward seeing the value of being prepared, the value of the data, the value of collecting all this information, the value of making science-based decisions. I think it's going to ...

Jeremy: It will be nice. A change of pace at least here in the United States.

Denis: ... I'll be very optimistic going forward, we're much more prepared than we ever were in the past.

Jeremy: That's awesome. All right. So you are part of this Transformational Bioinformatics group, and so you have the capabilities to work on some of these serverless things and build some other products or solutions to help you do this research. But I can imagine there are a lot of small labs or research groups that don't necessarily have the money to pay for all this compute power, but also maybe don't have the expertise to build these really cool things that you've built, which are obviously incredibly helpful. What have you done in terms of making sure that the technical side of the work you've done is accessible to other researchers?

Denis: Yeah, absolutely. My group, the Transformational Bioinformatics group, is very privileged in that we do have a lot of support from CSIRO to build the latest tools with the latest compute. As you said, other researchers around the world are not as privileged; therefore, the tools that we develop we want to make as broadly applicable and as broadly accessible as possible, so that other people can build on the achievements that we've had. If COVID has taught us anything, it's that working together to really move in the right direction is not only rewarding but also necessary in order to keep up with the threats that are all around us. So with that, digital marketplaces, from my perspective, are the way to do this. Typically, with digital marketplaces you think of an EC2 instance that is spun up with a Windows machine or something like that, where you subscribe to a specific service that is set up for a fixed consumption.

But from my perspective, because it allows you to spin up a specific environment with a specific workflow in there, that you have access to because it's in your account, you can build upon it. Therefore, this is the perfect reproducible-research and collaborative-research approach, where someone like us can put in the initial offering and other people can build on top of that. This is what we've done with VariantSpark, which is our genome analysis technology, for finding associations between disease genes and certain diseases. This is a hugely complicated workflow, because you first have to normalize things, you have to quality-control things, you then have to actually run VariantSpark, and then visualize the outcomes.

So typically, being able to describe all of that, and for other people to set it up in their account from scratch without us helping them, is complicated. And this is basically the bane of the existence of biomedical research: the workflows are so complicated that reproducing them is typically impossible. Whereas now, we can just make a Terraform or CloudFormation or ARM template or whatnot, put it into the marketplace for other people to subscribe to, to spin it up in the way that we intended and optimized, and from there they have this perfectly reproducible base to build upon. Unfortunately, this whole thing ... VariantSpark is an Elastic MapReduce offering, and the marketplaces are currently only looking at EC2 instances, the virtual machines, as their basis.

Jeremy: Right.

Denis: What we definitely need is a serverless marketplace.

Jeremy: Right. I totally agree with that. So you mentioned something about the data. Could your organization run this in your AWS account, for example, and then have other people just send their data to you?

Denis: That certainly would be an option. The problem typically with medical data is that there's a security and a privacy concern around it. Genomic data is the most identifiable data you can think of: you only have one genome, and it encodes basically your future disease risks and everything else. Basically, it is a blueprint of your body. From that perspective, keeping that data as secure as possible is the aim of the game. Never mind that it's so large that you don't want to shift it around, but I think the security element is what really sells me on the idea of bringing the compute to the data, bringing the compute and the infrastructure to the securely protected data source of the researchers or research organizations that have and hold the data and are responsible for it. It also allows dynamic consent, for example, where people who consent for their data to be used for research can revoke that, so it's a dynamic process. Being able to have the data in one place and handle it in one place directly allows this to be executed faithfully, robustly, and swiftly, which I think is absolutely crucial in order to build the trust so that people can donate or lend their data to genomic research.

Jeremy: Yeah. That is certainly something from a privacy standpoint, where you think about ... You're right. Everything about you is encoded in your DNA, right? So there's a lot of information there. But now I'm curious: if somebody else was running this in their environment, after they do the processing, and again, I'm just completely ignorant as to what happens on the other end of this thing, the data that they get out of the other end, is that something that can be shared and used for collaboration?

Denis: Yeah. Typically, the process is you run the analysis, you get the result out, you publish that, and it sort of ends there. I think in order for genomic data to be truly used in clinical practice and to inform anything from disease risk to what kind of treatments someone should receive, or what kind of adverse drug reactions they are at risk of, it really needs to be a bit more integrated. So, therefore, the results that come out of it should somehow feed back into a self-learning environment. That's one avenue. The other avenue is that the results that are coming out really need to be validated and processed. Therefore, typically there are wet labs that verify that this theoretical analysis is correct in order to move forward.

Jeremy: Interesting. Yeah. I'm just thinking, I know I've seen these companies that supposedly analyze your DNA and try to come up with, like, are you more susceptible to carbohydrates, those sorts of things. Now, while that may be a lofty endeavor for some, I'm thinking more of things like people who are allergic to things, or environmental exposures that may trigger certain things. Tying all that information together, and I'm assuming that has to be encoded in your DNA somewhere, like your allergies, I keep using that example. So how does that information get shared? Is that just way out of scope, because you've got people testing just their own group of samples and doing specific analysis on it, but then not sharing that back to a larger pool where everybody can look at it?

Denis: Definitely, that is the aim going forward. The Global Alliance for Genomics and Health is putting things in place in order to enable this data sharing on a global scale. The serverless beacon that we've developed is moving along the same lines, to make it more efficient for individual research labs to light their own beacon in order to share their results with the rest of the world, like the $15 per month in order to share data with the world. I don't think we're quite there yet, in terms of the trust and in terms of the processes, to make this actually a reality within the next, I don't know, five years.

Ultimately, it definitely is the aim, and ultimately this is the need. An element to that is also that the human genome is incredibly complex, and therefore there is no real one-to-one relationship between a mutation and an outcome. We do know that, for example, for cystic fibrosis, it's one mutation that causes this devastating disease, but typically it's a whole range of different exacerbation factors and resilience factors that work together, and the kind of risk that it generates is very personal. In order to quantify this risk, we need massive amounts of data, massive amounts of examples of which kind of combination is causing what kind of outcome.

Jeremy: Right.

Denis: In order to do that, putting all the data in the same place is probably not going to happen, ever. Therefore, sharing the models that were created on individual sub-parts and refining the models on a global level, like sharing machine learning models, I think is probably going to be the future. And this is a really interesting, exciting, and new space as well, sort of a combination of secret sharing and distributed machine learning, in order to build models that truly capture the complexity of the human genome.
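The "share the models, not the data" idea can be sketched as federated averaging in its simplest weighted form, with each lab's model reduced to a plain list of parameters. This is a generic illustration of the concept, not the specific technique CSIRO uses.

```python
def federated_average(models, sample_counts):
    """Weighted average of parameter vectors from several labs, weighted by
    each lab's local sample count. Only parameters travel; raw genomic data
    never leaves the lab that holds it."""
    total = sum(sample_counts)
    n_params = len(models[0])
    return [
        sum(w * m[i] for m, w in zip(models, sample_counts)) / total
        for i in range(n_params)
    ]
```

A coordinator would repeat this round after round: send the averaged model back out, let each lab refine it locally, and average again, so the global model learns from all the data without the data ever being pooled.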

Jeremy: Yeah. Well, it's certainly amazing and fascinating stuff, and I am glad we have people like you working on it, because it is really exciting in terms of where we're going: not only tracking and tracing diseases and creating vaccines, but getting to the point where we can start curing other diseases that are plaguing us as well. I think that's just amazing, and I think it's really cool that serverless is playing a part.

Denis: Absolutely. So my goal is really to bring the world together and see the value of scientific research and bring that scientific research into industry practices.

Jeremy: Awesome. All right. Well, Denis, thank you so much for sharing all this knowledge with me. I don't think I understood half of what you said, but again, like I said, I'm glad we have people like you working on this stuff. If people want to reach out to you or find out more about CSIRO and some of the other research and things that you're doing, or they want to use some of your tools, how do they do that?

Denis: Yeah. The easiest is to go to our web page, which is bioinformatics.csiro.au, or find me on LinkedIn, which is allPowerde, and start the conversation from there.

Jeremy: All right. That sounds great. I will get all that information in the show notes. Thanks again, Denis.

Denis: Fantastic to be here.

2021-03-29

Episode #93: WebAssembly and WASI with Aaron Turner

About Aaron Turner

Aaron Turner is a senior engineer at Fastly. They were previously doing rad stuff at Google and various startups and agencies. In their spare time, they are hacking on various WebAssembly projects on the web, cooking up some dope beats, and shredding local skateparks.

Twitter: @torch2424
Website: https://aaronthedev.com/

Links related to the content in the episode:

Getting Started:
WasmByExample
MadeWithWebAssembly
Wasi.dev

Choosing a Wasm Language:
AssemblyScript Website
Production AssemblyScript article mentioned: Micrio, by Marcel Duin
Emscripten: Compiling to Wasm
Rust Wasm Book

Great WebAssembly Talks:
WasmSummit YouTube Channel
WasmSF YouTube Channel
Patrick Hamann | WebAssembly - To the browser and beyond! | performance.now() 2019
WebAssembly for JavaScript Developers, by Aaron Turner
Simulating Sand: Building Interactivity With WebAssembly, by Max Bittker | JSConf EU 2019
Robert Aboukhalil :: Level up your web tools with WebAssembly :: #PerfMatters Conference 2019

Keeping up with WebAssembly:
Fastly Blog
Bytecode Alliance Blog
WasmWeekly Twitter and Newsletter

WebAssembly In the Future:
Lin Clark's Wasm Summit Keynote
WebAssembly Specification Proposals
WASI Specification GitHub


Watch this episode on YouTube: https://youtu.be/Ef1iE9KaAd8

This episode sponsored by New Relic and Epsagon.

Transcript
Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm joined by Aaron Turner, hey Aaron, thanks for joining me.

Aaron: Yeah. Thanks for having me. I'm really excited to be here.

Jeremy: Awesome. So you are a Senior Software Engineer at Fastly. So I'd love it if you could tell the listeners a little bit about yourself and your background and what you do at Fastly?

Aaron: Yeah, totally. So, what I do is I work a lot with WebAssembly. I've been doing that for about two and a half years. I started getting really involved in the community, and through that work I was going to a lot of meetups and things, and I ran into Tyler, who is the CTO of Fastly, and they were working on this new edge WebAssembly thing. The timing lined up, our interests lined up, and we were both passionate about the same thing. So I joined the company and it's been going great so far. What I'm working on there is a lot of WebAssembly work, both in terms of bringing new languages to the platform, but also a lot of community work: participating in a lot of events, still doing podcasts and things, and just hanging out with people and having a good time.

Jeremy: Awesome. So actually, I had Tyler on the show not that long ago, and we talked mostly about Compute@Edge, but we got into WebAssembly a little bit, and I'm finding this whole thing fascinating. Because I remember way back in the day, Java applets, and of course Flash, if anybody remembers that; I think that's still around. But this idea of trying to bring compute and more complex applications, in bytecode form, to the browser is a really interesting thing. I don't think it worked out well in the end, but it seems like WebAssembly is a better shot at doing that, and I find that really, really interesting. So I'd love to pick your brain for a little while here and talk about WebAssembly, but maybe for the benefit of the audience, why don't we start with what exactly WebAssembly is?

Aaron: Yeah, totally. So WebAssembly, the way I like to describe it, and really what it is, is bytecode for the web, like you were alluding to. And that has a few implications. One of my personal favorites is predictable performance. When you look at something like JavaScript, it's an interpreted language, but we got really good at running it really fast. We built just-in-time compilers, which go ahead and read your JavaScript and compile it over and over, many times. The JIT makes some assumptions about what your code is doing, which can get really, really fast if it assumes the right things, but if it makes the wrong assumptions, things can get really slow. Whereas if you're running bytecode, it's always predictably performant. So, a really bad analogy that people tell me not to use, but I like to think of it like driving on the freeway: the freeway is your WebAssembly. Depending on what you're trying to do, most of the time taking the freeway is going to be faster.

There might be times when taking the streets and driving around the neighborhood is a little faster, but nine times out of 10, if it's a few miles away, you'll just use the freeway. You know what I'm saying? So that's how I like to describe the predictable performance; at least that's how it works in my head, and it's what I say when kids ask me, or my little brother is like, what are you doing? And then another thing about it is that it's very portable. As is the nature of the web, it's great for distributing logic wherever it may be. It's portable in the sense that it runs in all major browsers, which is a huge plus when you start shipping it. It's also very bundle-able, so if you wanted to throw WebAssembly into an NPM package, for example, and use it there, it can run in the background and give you some of that predictable performance inside your JavaScript ecosystem.

It's also very language-agnostic. If your parent language compiles down to WebAssembly, the actual bytecode, then you can use essentially any language and interface with it through the JavaScript WebAssembly API. And it's portable in the sense that multiple runtimes support it. Node, for example, has its own adapters, but people have also built standalone runtimes for WebAssembly itself, like Wasmtime, and there's what Fastly built, which is called Lucet, an ahead-of-time compiled WebAssembly runtime. But yeah, that's probably how I would describe it best.
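To make "bytecode for the web" concrete, here's a minimal sketch runnable in Node.js or any browser console. The smallest valid WebAssembly module is nothing but the magic number "\0asm" plus a version field, and the standard JavaScript WebAssembly API can validate and compile those raw bytes directly (the byte values shown are from the binary format specification, not from the episode):

```javascript
// The smallest valid WebAssembly module: the magic bytes "\0asm"
// followed by binary-format version 1. No functions, no memory.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // "\0asm" magic number
  0x01, 0x00, 0x00, 0x00, // version 1 (little-endian)
]);

// validate() checks the bytecode without running anything.
console.log(WebAssembly.validate(bytes)); // true

// Compiling it yields a (completely empty) module object.
const module = new WebAssembly.Module(bytes);
console.log(module instanceof WebAssembly.Module); // true
```

The same bytes compile identically in every major browser and in standalone runtimes, which is the portability Aaron is describing.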

Jeremy: All right. So there's a lot to unpack there. And so what we can do is we'll go through some of those things. But just in terms of like, I think this is where we look at things that run in the browser. And I know we'll get into this, that you can run a WebAssembly in more places than just the browser. But running in the browser one of the things that I think we think a lot about is security, what does it have access to? Can it do network calls? Can it access local resources, things like that? What are the capabilities of WebAssembly?

Aaron: Yeah, thank you very much for asking that, because I'm always like, performance, portability, awesome, and then it's like, oh yeah, there are also these other great things. So yeah, totally, on security: the one thing that's really nice about WebAssembly is that it has this concept of linear memory. The idea is that you're given a heap, or the way I like to think of it from a JavaScript background, self-taught here, just one really big array. You can't go past the end of the array and you can't go before the array. That's all you can access, and it's sandboxed. Because of that, you can't escalate out of that memory and do things with the host; WebAssembly just doesn't allow it. And on the topic of capabilities, another thing that's nice about WebAssembly is this concept of host calls.

So essentially you can say, and we talked about the host here, it can be JavaScript like you mentioned, or one of those standalone runtimes: hey host, I know you have access to this function; if I call it, I want you to go do this on my behalf. A common example people use in WebAssembly is console.log. You can go ahead and say, hey look, I want to import console.log, and when I call it, I'm going to pass it this value. JavaScript can then say, oh hey, you called this function that I gave you access to, this host call that I provided. Cool, I'll go do some work for you.

In this case, log out a number, let's say, and then let you continue executing. This opens up all types of cool use cases, depending on what the host wants to provide. You can start doing some really cool things, and you get this security model where a user can give you some code that you don't really know the inside of, but it only has as much power as you let it have through those host calls. And since that memory is sandboxed, it can't work outside of there either. So it starts to become very secure; you can start trusting code that people give you because of these two features, which is really exciting.
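Both ideas can be sketched in a few lines of JavaScript. The module below is a hypothetical, hand-assembled toy (the bytes and the `env.log` / `run` names are illustrative, not from the episode): it imports a single host function and exports `run`, which calls it with the number 42. The host decides what `log` actually does, and a `WebAssembly.Memory` shows the "one big array" a module is confined to:

```javascript
// A tiny hand-encoded module: (import "env" "log" (func (param i32)))
// plus an exported "run" that does log(42). The host supplies "log".
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // header
  0x01, 0x08, 0x02, 0x60, 0x01, 0x7f, 0x00, 0x60, 0x00, 0x00, // types: (i32)->(), ()->()
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import module "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                         //   field "log", func type 0
  0x03, 0x02, 0x01, 0x01,                                     // one function, type 1
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" (func 1)
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x41, 0x2a, 0x10, 0x00, 0x0b, // body: i32.const 42; call 0
]);

let received;
const { exports } = new WebAssembly.Instance(
  new WebAssembly.Module(wasmBytes),
  { env: { log: (n) => { received = n; } } } // the host call, defined by us
);
exports.run();
console.log(received); // 42: the module can only do what "log" lets it do

// Linear memory really is one big, bounded array: one 64 KiB page here.
const memory = new WebAssembly.Memory({ initial: 1 });
console.log(new Uint8Array(memory.buffer).length); // 65536
```

If the host passed in an empty import object instead, instantiation would fail: the module literally has no other way to reach the outside world.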

Jeremy: Yes. So you mentioned use cases and I think use cases are probably the best way to communicate to people what the capabilities are, right? So great performance, linear memory, sandbox, security, that all sounds awesome. But if you can't explain what you can do with it, right, it's hard sometimes to visualize. So you mentioned this idea of calling JavaScript or being able to do those host calls and stuff, but what are some of the practical use cases that you would use it for? And maybe even more importantly, what's the use case and then why is WebAssembly better for it?

Aaron: Yeah, totally. So WebAssembly started off as something for the browser. It's starting to evolve more into serverless use cases and things, but taking a step backward, the major use case to start was speeding up JavaScript in the browser. JavaScript was in this interesting place where it had an unwilling monopoly, I will say, on the browser: you had to use JavaScript. Because of that, lots of different companies had different interests in it and started pulling it in directions it really wasn't ever designed for. So WebAssembly was like, hey look, that whole performance thing we're trying to squeeze out of JavaScript? Yeah, it's good for that, but let's take a little bit of the weight off of JavaScript and give it to something else. I bring all this up to say: speeding up JavaScript, really. So say you have a loop doing image blurring, for example, something that isn't supported by the browser natively. I don't know if there was a CSS thing proposed, but let's pretend there isn't ... go ahead.

Jeremy: Even if there was, it's not going to work the same way in all browsers I'm sure, so ...

Aaron: Yeah. So that's really computationally intensive. You're going to be looping around trying to figure out which pixels need to be duplicated and which ones don't, and WebAssembly is really good at those tasks because of that whole predictable-performance thing. So what you would do is take that block of JavaScript and instead replace it with a WebAssembly module. You pass it the data and say, hey look, here are the pixels I want you to transform; return the new image in your linear memory, and I'll go ahead and display it. Having access to that predictable performance lets you do those things in the browser a lot more easily, and they're not so taxing. You can imagine this same computationally-intensive speeding-up-of-JavaScript is great for game engines. I've seen a few game engines already take their physics engines and start replacing pieces with WebAssembly modules, because it's just built and designed for the computationally intensive math operations that game engines really need.

And then another one is probably just general business logic. There are all types of times where we're just like, hey, we've got this data structure and we need to make it look like this now for the server. Sometimes JavaScript's good at that, or will be; it definitely depends on the use case. So one thing I'll note is that when I say something works most of the time, I'm sure there's that one use case where it's like, okay, fine, but you know what I mean? Nine times out of 10. Let's say we have to take some JSON, it's huge, and we need to convert these objects into an array, let's say. I don't know, I'm just making things up, you get the point ...

Jeremy: I get the point.

Aaron: That type of business logic, where you're trying to translate things, stuff that's really tedious on the CPU, you can start to put that in WebAssembly modules. And with the portability, you can start sharing it across different platforms. And it's like, oh, this is really exciting. Yeah.
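The pattern Aaron describes, handing a hot, CPU-bound computation to a Wasm module and orchestrating it from JavaScript, can be sketched with a toy kernel. The hand-encoded module and its `double` export below are hypothetical stand-ins for a real compiled filter (an image blur, a JSON transform, physics math):

```javascript
// A hand-encoded module exporting double(x: i32) -> i32, standing in
// for a real compiled kernel produced by Emscripten, Rust, etc.
const kernelBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // header
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type: (i32) -> (i32)
  0x03, 0x02, 0x01, 0x00,                                     // one function of that type
  0x07, 0x0a, 0x01, 0x06, 0x64, 0x6f, 0x75, 0x62, 0x6c, 0x65, // export "double"
  0x00, 0x00,
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x41, 0x02, 0x6c, // body: local.get 0;
  0x0b,                                                       //   i32.const 2; i32.mul
]);

const { exports } = new WebAssembly.Instance(new WebAssembly.Module(kernelBytes));
// JavaScript orchestrates; the bytecode does the predictable number-crunching.
console.log(exports.double(21)); // 42
```

In a real pipeline you would copy pixels into the module's linear memory, call the exported transform, and read the result back out, but the call shape from JavaScript's side is exactly this.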

Jeremy: So I'm curious about this too. You said JavaScript isn't always fast enough, and a lot of companies have to use JavaScript and sometimes make it do things it maybe shouldn't. So, if I'm a developer now, I love Node on the backend. I love Node because I'm so familiar with JavaScript that it's really easy to go back and forth between the front end and the backend using this one single language. And pretty much anytime you had to write something for the front end, it has always been JavaScript; you basically had to. So is this something where, if ... we will get into the other languages you can use ... could I compile an entire application, maybe a front end app, or what do they call it, a single page app, something like that, that I might write in JavaScript now? Could I compile something like that down to WebAssembly and then do a similar thing?

Aaron: Yeah, so that's actually a really interesting question; I'm glad you asked it. The short answer is no, and we'll get into why. It comes down to two different reasons. The first is that WebAssembly, by the nature of its binary format, depends on strict typing. You have to say, hey, I have an i32, I have an i64, and so on and so forth. JavaScript is dynamically typed, and that's part of why we need to interpret it and do that just-in-time thing to figure out the types on the fly. So JavaScript just isn't quite designed to compile down to WebAssembly. And then two, I'll say there are some alternatives. Let's say you wanted to write a Rust app that is a full SPA. There are a few projects out there that do that; I've seen them on GitHub. I don't know any off the top of my head, but they totally exist, and if that's something you're into then totally feel free to do it.

But I will make the point that JavaScript, as we mentioned, has been pulled in a lot of different directions, but one thing it absolutely was designed for was interacting with the DOM and building UIs. That's where I think JavaScript really excels, and why I wouldn't say you really want to compile a straight SPA into WebAssembly: WebAssembly is really for those computationally intensive things. With SPAs, for example, when you see those React demos like, look, we've got 10,000 triangles rendering every second, JavaScript is amazing at that, because we've been iterating on making JavaScript good at it for so long. So yeah, I hope that answers it. Does that make sense?

Jeremy: It does. No. No, it does. And actually, to extend that, there are things you might want to do in a browser that are, I would say, fairly insecure, especially if you have to do anything with crypto, or sign a call, or call an API where maybe you need a secret in that API call. Obviously, you don't want that available in your JavaScript, just by viewing source. So are those some of the things you could potentially do with WebAssembly, where you could compile down something that did cryptographic signing or something like that? And then, with the bytecode, could you reverse engineer that bytecode? Could you come back and find the actual source? Or is it something that might be secure, where you could use it for some of those more complex things that would add a little bit of security to your app?

Aaron: Yeah, so that's a really interesting question; I'm glad you asked it. I will first reiterate that cryptography uses a lot of math operations and is very performance-intensive, which makes it a great fit for WebAssembly. That being said, on the security side of things it gets really interesting, because there is a text format for WebAssembly. It's not the same as JavaScript, where you can go view source and it's like, oh yeah, this is totally what's running on the page; nowadays with mangling and transpilation it gets more complicated, but WebAssembly is more of a binary format, where you can do some funny things. You'd be like, hey look, it's moving this memory here and there, but seeing the actual source code is a little more difficult. That being said, I won't pretend I'm a security expert, especially in the cryptography space ...

Jeremy: I'm not either. So yeah.

Aaron: I've definitely seen some projects that are like, hey look, we can make these things secure with WebAssembly, here's our white paper on why. But I wouldn't be able to confidently sit here and be like, oh yeah, totally, just throw secrets in WebAssembly, no one will ever know what they are. You know what I mean? So maybe I'll have to get back to you on that one ...

Jeremy: Yeah. No, I'm just curious. I'm thinking through the places where it really fits in, because I think that's a problem, security in the browser, with so many people using APIs now. And again, a good use case for serverless is essentially setting up a function whose only job is to add that secure key, or access token or whatever it is, into a third-party API call, so that you can just pass it through from your browser. So if you could eliminate that step, you know what I mean, and do some of that in the browser, I think that would be interesting.

Aaron: Yeah, totally. And I definitely agree. For example, whenever we're importing a bunch of packages that are all accessing global scope, we just cross our fingers and hope they don't do something they're not supposed to. That is one benefit of WebAssembly: all that sandboxing, that linear memory, means a module only really has access to what you give it. If I had three WebAssembly modules, they couldn't go talk to each other without JavaScript being in between, saying, hey, you're telling this person this, cool, here you go, and making sure they're not doing anything funny between one another. At least in the current state of WebAssembly today, so.

Jeremy: Awesome. All right. So let's move on. Let's talk about WASI. What is WASI?

Aaron: Oh yeah. So WASI is an acronym for the WebAssembly System Interface. In my head, WASI is the Node of WebAssembly. I know that's a very loaded term, so please take it with a grain of salt, but essentially it's a standardized system interface for WebAssembly. So you get things a lot like POSIX calls; it's getting low level, but you can imagine stuff like fd_write, fd_read. So reading and writing file descriptors: you get access to those things. In Node, you have the file system, and that's how you would, say, generate files on a server; you'd use the fs module. WASI offers the lower-level primitives that let you do those things in WebAssembly, like create files, read them, and move them around your file system. I'm talking in circles, but I think you get the point. So what actually ...
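Under the hood, WASI functions like fd_write arrive in the module as imports, exactly like the host calls discussed earlier. As a hedged sketch (the module below is a hand-encoded toy; only the `wasi_snapshot_preview1` / `fd_write` names and signature follow the real WASI preview1 interface), a plain JavaScript host can satisfy that import namespace itself to show there's no magic involved:

```javascript
// A toy module importing wasi_snapshot_preview1.fd_write
// (fd, iovs_ptr, iovs_len, nwritten_ptr) -> errno, with an exported
// _start that calls fd_write on fd 1 (stdout). A real WASI runtime
// (Wasmtime, Lucet, Node's wasi module) would provide this import;
// here we stub it in JavaScript to show it's "just" a host call.
const wasiBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // header
  0x01, 0x0c, 0x02, 0x60, 0x04, 0x7f, 0x7f, 0x7f, 0x7f, 0x01, // types:
  0x7f, 0x60, 0x00, 0x00,                                     // (i32 x4)->i32, ()->()
  0x02, 0x23, 0x01, 0x16,                                     // import section
  0x77, 0x61, 0x73, 0x69, 0x5f, 0x73, 0x6e, 0x61, 0x70, 0x73, // "wasi_snapshot
  0x68, 0x6f, 0x74, 0x5f, 0x70, 0x72, 0x65, 0x76, 0x69, 0x65, //  _preview1"
  0x77, 0x31,
  0x08, 0x66, 0x64, 0x5f, 0x77, 0x72, 0x69, 0x74, 0x65,       // "fd_write"
  0x00, 0x00,                                                 // func, type 0
  0x03, 0x02, 0x01, 0x01,                                     // one func, type 1
  0x07, 0x0a, 0x01, 0x06, 0x5f, 0x73, 0x74, 0x61, 0x72, 0x74, // export "_start"
  0x00, 0x01,                                                 //   (func index 1)
  0x0a, 0x0f, 0x01, 0x0d, 0x00, 0x41, 0x01, 0x41, 0x00, 0x41, // _start body:
  0x00, 0x41, 0x00, 0x10, 0x00, 0x1a, 0x0b,                   //  fd_write(1,0,0,0)
]);

const calls = [];
const { exports } = new WebAssembly.Instance(new WebAssembly.Module(wasiBytes), {
  wasi_snapshot_preview1: {
    // Stub: record the fd instead of writing, return errno 0 (success).
    fd_write: (fd, iovs, iovsLen, nwrittenPtr) => { calls.push(fd); return 0; },
  },
});
exports._start();
console.log(calls); // [1]: the module asked the host to write to fd 1
```

Swap the stub for a runtime-provided implementation and the same bytecode writes to a real file descriptor, which is why the module runs unchanged under any WASI-compliant host.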

Jeremy: So you wouldn't use WASI in the browser, right?

Aaron: So it gets funny there, because there have been some ... you can use things like IndexedDB, if you're familiar, to create a pseudo file system and start to port things. Let's say you wanted to compile something in the browser: if you want to bring a C compiler into the browser, you could use the WASI things and mock out some of those system-level resources with a browser equivalent, and get into this funny world where it does make sense. But for WASI itself, one of its goals right now isn't really to run in the browser, even though people are bringing it there. If you want to, again, bring in a compiler, you totally could, which is exciting and really cool. I think there's a talk by Ben Smith where they did exactly this for a class they taught, I forget at what university, but they made a C-to-WebAssembly compiler that took C source and compiled it. And then there's something else I wanted to say about that ...

Jeremy: Well, I'm just curious, so if it's not really for the browser, I get it. Everybody loves to do those things like, oh, can I take this thing that wasn't built for this and make it work in that? But what would you say are the primary use cases, then, for using WASI?

Aaron: A lot of it is probably bringing WebAssembly to the server, or even just standard command-line applications. WASI is really exciting; there's a lot you can do, and you'll hear me say that a lot about WebAssembly. Really, I like to think of WASI as all the benefits of WebAssembly, so the sandboxing you get, plus that host-call interface. Because what WASI really uses is host calls, just like, hey, you have access to the file system, please do what you want with it, and so you get those benefits. There's an infamous tweet. I know, I keep talking, but ...

Jeremy: I think an infamous.

Aaron: Infamous, that's not the right word, but a tweet that went viral, by Solomon Hykes, if you're familiar, the co-founder of Docker, about how if WASM plus WASI had existed in 2008, when they made Docker, they maybe would not have needed to make Docker. The reasoning there is that if you could take different applications, compile them down, and give them access to things like file systems through a standardized system interface, you can start to imagine this world where you get container-like functionality. You're just like, hey, I'm going to compile my whole app and all of its dependencies down to WebAssembly, give it access to WASI, and it can start to do things and operate in a sandboxed way, where you don't have to worry about it messing with the parent operating system or competing with other apps and things like that.

And then of course, we're on Serverless Chats, so there's a lot of serverless use case there, because WebAssembly is very lightweight, and the sandboxing makes it a great contender for serverless: you can instantiate a WebAssembly module, start running immediately, and then close it down, and even do things like snapshotting the memory and saving state. We'll get to the future; there are still some kinks to figure out, just in general, in the whole ecosystem. And then a lot of standalone applications. Like we were mentioning, you could compile a C compiler to WebAssembly, give it access to WASI, and run it on your local machine. Now you can start compiling things with, let's say, standalone runtimes: feed C source code to a WebAssembly module and have it compiled.

Yeah, sure, cool. And I'm sure there will be use cases for it, because then you can imagine a world where I use the exact same compiler binary on Mac, Windows, and Linux, without recompiling for each architecture and each operating system. So again, that portability. Those are probably the best use cases I can think of off the top of my head right now.

Jeremy: Right. And this might be a stupid question, but I guess the idea of it being the Node of WebAssembly, in the sense that it's basically its own container, essentially, that can do anything: it can interact with the file system, it's got all of the capabilities, HTTP, networking calls, it can do all that stuff. And I know the V8 engine is pretty popular with edge computing and things like that. Is that something that can run on a V8 engine, or is there another type of underlying, I guess, container management system or something that would need to run those?

Aaron: Yeah, so I'm glad you brought this up. One thing I would say as a quick note: HTTP is still being standardized. All of this is really young, so I wouldn't want somebody to go, oh, HTTP isn't in there? Cool, all right, let me just close the tab and start crying. You know what I mean? There are a lot of standardized APIs still in the works, and we can get into that too if you'd like. But to answer your question, hey, could you use WASI inside of V8? Again, we can probably polyfill some things, and then you can totally run it in there, in that JavaScript way. But what gets interesting is that if you imagine a world where maybe you don't really need the JavaScript that V8 provides, then with V8 you're instantiating both a JavaScript runtime and a WebAssembly runtime, whereas if you use one of these other runtimes that only supports WebAssembly, you get a lighter-weight output from it, and things like that.

So yes, you can use V8, but it depends on your use case. If you want that solid WebAssembly-JavaScript relationship at all times, V8 is probably the right answer, but if you just want to use WebAssembly, then a standalone runtime is going to be lighter weight, and able to optimize specifically for WebAssembly a little bit better.

Jeremy: Right. So there are standardized runtimes for WebAssembly?

Aaron: Yes.

Jeremy: Okay. That makes sense. All right. So you mentioned some standardized APIs, and we talked a little bit about crypto and how that's a good fit. But you've got crypto, you've got machine learning, all kinds of complex things that are really hard to do, especially in JavaScript. And again, I know we're now more towards the server side of things, but for those standardized APIs you mentioned, what can they do? What are their capabilities?

Aaron: Yeah, totally. So I will say these are also still in flux; they're still being developed, and if you want to participate, there's totally a community group for WASI that you could hop in and join. I attend the meetings, so I'm sharing the little things that are going on here and there. One of them is definitely crypto. A colleague of mine, Frank Denis, is working on that. There are a lot of common crypto operations where we would want to say: hey, host runtime, we know the module is just raw bytecode, with essentially no layer between you and the kernel, so could you please provide SHA-256, make sure it's working correctly, and expose the right thing?

So probably the common crypto functions people are often using; I don't know if SHA-256 is one of them, but if you're a crypto person out there, you know, exposing those common ones folks use. And then machine learning, which I'm a little more familiar with: the Bytecode Alliance is a group of different companies working together on WASI and WebAssembly and all these specs, and they recently announced they're working on a neural network proposal for WASI, providing the primitives there. If you want to build neural networks on top of WASI, what are some of the host calls we'll need? What are some of the functionalities we'll need? So we can start to build our own neural networks in WebAssembly, which is really exciting.

Jeremy: So, these standardized APIs, and again, this may be a stupid question just because I don't know enough about this stuff yet, but are those like NPM packages, or Python packages from PyPI or something? Whatever it is, is that the idea behind some of these APIs, where you say, I need a crypto package, or I need a machine learning package, or I need an image manipulation package? Is it something where there will be an ecosystem where people can write and contribute packages that other people could just pull in?

Aaron: Yeah, so this is a level above that, I would say. You can imagine Temporal, for example, the new date API coming into JavaScript. There were already a few libraries out there like, hey, we're playing around with this new date API, here's my version of it and your version of it, but a lot of the community is working together on deciding: okay, my organization has these needs for Temporal, yours has those, what are the compromises we'll need to make across the entire community to have one standardized API? Eventually Chrome, or whoever's implementing your JavaScript runtime, will support it natively, and you won't have to use a library anymore. So I would say it's a level above, and I'm sure as these APIs develop, people will build things like, hey look, here's the current version of WASI in Rust, let's say, and you can include it in your Rust program and it compiles all the way down to WebAssembly. And so, yeah, I hope that ...

Jeremy: Yeah. No, I'm just wondering here, because one of the things that made Node and JavaScript grow so much as ecosystems was that people contribute to them. So if there's a way for people to do that, and to know, oh, if this doesn't do this now, I don't have to write the whole thing myself; somebody else may have done it for me and I can just bring that in. So I don't know if there will eventually be an NPM for WASI, or for WebAssembly in general, but anyway, I'm just thinking that would be a cool thing to have. So if there's not one of those things out there, then my suggestion is you create one and make it so people can contribute code.

Aaron: Yeah, so if I could, on that note, like I mentioned a little earlier and glossed over: in the browser at least, and I think with WASI too, you could ship Wasm in an NPM package today, and it could use WASI like we mentioned, so there's a little path there depending on what you want to do. So the registry in today's world is probably NPM. I have seen some smaller WebAssembly package-manager-type things popping up here and there, but right now I think NPM is, for lack of a better word, the king of the packages. We'll see if something else comes up that's more WebAssembly-focused, rather than JavaScript with WebAssembly, if that makes sense.

Jeremy: Yeah. Cool. So let's get into the toolchains here, and maybe we just focus on the popular ones, because I'm sure there are a lot of people going down this road. You mentioned earlier that WebAssembly is language-agnostic, so can you just write it in any language you want? There have got to be some limitations here.

Aaron: Yeah, so there are some. Pretty much the limitation is: is your language willing to support WebAssembly? Especially if it's a strictly typed language. So for example, Go is working on something, Swift has a WebAssembly implementation, and Zig is an up-and-coming language that has a WebAssembly implementation. There are some languages where it gets funny: the dynamically typed ones like JavaScript and Python are in a world where the language just doesn't quite line up with what WebAssembly needs when it comes to being compiled. I mention those, but probably the biggest three right now are, first, Emscripten, which is a toolchain for compiling C and C++ to WebAssembly. A lot of folks use it; Google, I know, has a lot of folks working on Emscripten, and they highlight a lot of projects using it. If I'm not mistaken, there's a talk from WebAssembly SF in which the Google Earth team talked about how they're using Emscripten to, again, take all that business logic and make it more portable across wherever they need to run Google Earth, which is really exciting.

Another big one is Rust. I had mentioned earlier, it's a language that's grown to become quite popular. It's another systems-level programming language, but Rust has a little flavor of both: taking these older C applications, I guess not really porting them, but building these one-off modules to go off and do, maybe, stuff next to your JavaScript, serverless-type stuff, whatever it may be. Rust is really starting to shine in that area. And a lot of folks are really passionate about it. The community is really cool and there's a lot of great documentation. I guess what I'm trying to say really is that Rust has really solid WebAssembly support. They are going all-in on it, which is really exciting.

Jeremy: Well, just thinking back to things that I've heard, I feel like whenever I hear Rust I just think WebAssembly, is that the wrong way to think of it?

Aaron: Rust does a lot of different things, but it's maybe not the wrong thing, just because they've spent a lot of time really building out a lot of the tooling and a lot of the community around it and things. So Rust definitely also has lots of great use cases that I've seen at Rust conferences, where they run Rust on the server, they do it in games, so on and so forth. It's a systems-level programming language, so you can imagine, anywhere you can run C, nine times out of 10 I think you could run Rust there. But their WebAssembly ... What's the right word? Involvement, there's another word that starts with an "I" that means what I'm trying to say, but they've spent a lot of time working on creating a great developer experience for WebAssembly in Rust. So yeah, because right now I will say some languages like Go and things are still very young in their WebAssembly implementation. So some of the tooling is like, good luck, but I'm sure that will get better over time as more of the community works on it and things.

And then if I could transition too, there's one more toolchain that's really popular right now and it's called AssemblyScript; I'm a member of the team on that. And what AssemblyScript is, is a very TypeScript-like, not TypeScript exactly, but if you can read TypeScript, a TypeScript-like language that compiles to WebAssembly as the target. There are a lot of these JavaScript developers that we mentioned earlier, and JavaScript can't compile to WebAssembly. So it's like, come on, we want some of the fun too. So we're hoping to maybe fill that gap and give developers the closest thing to JavaScript that we can provide for them, that allows them to access all the benefits of WebAssembly. So yeah.
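To make that concrete, here's a rough sketch of how a JavaScript or TypeScript host consumes a compiled AssemblyScript module. The bytes below are hand-assembled as a stand-in for what the AssemblyScript compiler would emit for something like `export function add(a: i32, b: i32): i32 { return a + b; }` — the host only ever sees a WebAssembly module, never the source language.

```typescript
// Hand-assembled minimal WebAssembly module exporting add(a: i32, b: i32): i32.
// A stand-in for real compiler output (e.g. from AssemblyScript's asc).
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic number + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0; local.get 1; i32.add
]);

const instance = new WebAssembly.Instance(new WebAssembly.Module(wasmBytes));
const { add } = instance.exports as { add(a: number, b: number): number };
console.log(add(2, 3)); // prints 5
```

In a real project you would load the `.wasm` file the compiler produced instead of inlining bytes, but the host-side instantiation looks the same.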

Jeremy: Yeah. Well, so the AssemblyScript thing because I was doing some research before this and when I saw that, I was like, okay, now you might have me because I'm thinking some of these other things. I spend a lot of time in Node and in TypeScript and JavaScript. So for me, I was like, oh, this would be really great. So I do have a bunch of questions on this though and since you work on the AssemblyScript team, you're the perfect person to answer these. So what are the differences between TypeScript and AssemblyScript? Is it almost exactly the same or are there some significant things that I'd have to worry about?

Aaron: Yeah. So one thing I'll definitely point out is that we mentioned SPAs earlier, but just imagine ... I'm trying to think. So if you take a TypeScript React app or a TypeScript Node app, you can't just grab the AssemblyScript compiler and be like, hey, I get WebAssembly for free. Cool. That's not how it works. There are a lot of small fundamental differences where you have to actually take the time to port things. But the porting comes down to like, okay, you know this number type isn't, again, a strictly typed integer of however many bits. So you have to take your numbers and convert them to i32s, or if you'd use a float there, you need to specifically say, I want a float here. And things of that sort. But for the most part ... I actually wrote an article on the Fastly blog about porting TypeScript to AssemblyScript and what that looks like.

Yeah, I think it might even have been a JavaScript application, but yeah, just like, hey, look, there's this variable here. It's a number, we know that, but JavaScript does the work for us. Let's just explicitly say what kind of number this is. And yeah, pretty much those are a lot of the big fundamental differences. There are some "gotchas" to AssemblyScript. One of them, since it's young, it's only about a two-year-old language, though it's been getting popular thankfully, is closures. That's a big one. So if you're doing a lot of callbacks ... as of right now, we're still working on getting that working in WebAssembly memory and things. So you have to pull those functions out into separate functions, which is a little annoying, but we'll get there. You know what I mean? And most use cases can work around that. And then ...
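As a hypothetical illustration of the kind of porting described here: in AssemblyScript, `i32` and `f64` are built-in sized types, and the project's bundled tsconfig aliases them for TypeScript tooling. The aliases at the top of this sketch mimic that, which also lets the snippet run as plain TypeScript.

```typescript
// These aliases mimic what AssemblyScript's bundled tsconfig provides for
// editor tooling; in real AssemblyScript, i32 and f64 are native sized types.
type i32 = number;
type f64 = number;

// TypeScript original might be:
//   function scale(count: number, factor: number): number
// The AssemblyScript port picks explicit sized types instead of `number`:
function scale(count: i32, factor: f64): f64 {
  return count * factor;
}

console.log(scale(4, 2.5)); // prints 10
```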

Jeremy: So in terms of the workflow for some of these things: I'm writing TypeScript now, you've got to compile it, right? And then I usually run tests against my TypeScript and then compile, or compile and run the tests. So what does that workflow look like? I guess I'd just be writing in AssemblyScript, which has to be very similar to TypeScript, and then just compile it down to WebAssembly, but how does the testing work and some of that other stuff?

Aaron: Yeah. So that's actually probably the closest thing. So AssemblyScript is an NPM package that you npm install into a project, and you can scaffold out an AssemblyScript project essentially. Another thing is AssemblyScript pretty much uses the .ts file extension, so if you look at the .ts code it's like, oh, hey, this is TypeScript. And it comes with a tsconfig. So the tsconfig will say like, hey, i32 is an alias for number, for your linters and things. So you open up VS Code and it's like, oh, this is just TypeScript. Cool. Awesome. So you can just hit the ground running in that aspect. If you're used to writing TypeScript, the amount of workflow difference should be near nothing in my experience. When you start doing some really specific TypeScript things, or if you have small things here and there, it gets funny.

A good example is that even for documentation, for all the AssemblyScript packages I've been writing, I just use TypeDoc. TypeDoc can look through AssemblyScript and be like, cool, yeah, this goes there, this goes there, and there are almost no actual small things I need to do there. And in terms of testing, AssemblyScript has its own testing library, and I can promise you it's extremely solid. It's called as-pect. It's a Jest-like library written by Joshua Tenner. And if you've written Jest, it looks like Jest testing. It's like, describe, it does that, expect, toBe; it offers all of that in AssemblyScript.
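A rough sketch of the Jest-like shape an as-pect test takes. The `describe`/`it`/`expect` shims here are stand-ins defined inline so the snippet is self-contained and runnable as TypeScript — in a real project, as-pect provides those globals and runs the assertions inside the WebAssembly module.

```typescript
// Stand-in shims for the globals as-pect normally provides; included only
// so this sketch is self-contained. Do not take these as as-pect's internals.
function describe(name: string, body: () => void): void { body(); }
function it(name: string, body: () => void): void { body(); }
function expect(actual: number) {
  return {
    toBe(expected: number): void {
      if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`);
    },
  };
}

// The function under test; in AssemblyScript the params would be typed i32.
function add(a: number, b: number): number {
  return a + b;
}

// The Jest-style surface: describe / it / expect(...).toBe(...)
describe("add", () => {
  it("adds two integers", () => {
    expect(add(2, 3)).toBe(5);
  });
});
```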

Jeremy: Right. So you write something in AssemblyScript, I know we were talking about this, and it compiles down; is that using WASI? And then what can I do? Can I pull in Node packages there, or does everything I do need to be AssemblyScript?

Aaron: Yeah. I'm glad you asked. So it ends up compiling to WebAssembly. This is really technical, but it uses a compiler backend called Binaryen. And it pretty much sits at the same level as LLVM, if you're familiar. So it uses Binaryen's intermediate representation, which then compiles to WebAssembly. And then on the note of using separate NPM packages, it would have to be AssemblyScript all the way down. So that is one thing where some people are like, oh, but my favorite image library, is it there? And it's like, well, the dependencies will have to be ported, which is annoying, but the ecosystem is really growing. For example, I've been writing a lot of URL packages lately for URL parsing. And like I said, we have the testing library, and there is someone who recently wrote a JSON parser. So there are lots of little small things popping up. The community is growing and it's all on NPM.

So if it says like, hey, look, I'm an AssemblyScript package, then you can install it, and it should work in your NPM project. I think you asked one more thing.

Jeremy: No, I was just wondering, again, if you have the same access to some of the low-level things. So if you're compiling down to WebAssembly using AssemblyScript, are you then able to do things like HTTP calls and access the file system and those things?

Aaron: Yeah. So that's where you'd ask about WASI. So yeah, essentially you would say, hey, import WASI, and then it'll give you access to those fd_read, fd_write things and stuff, where you have to directly access things through file descriptors. Another community project, as-wasi, sits more at the same level as Node, where you import the file system as a whole. And then it has a create file, or maybe not create file, read file, write file, whatever the API names are that I can't remember off the top of my head. But we do get into this interesting place. We have a project, an AssemblyScript project, that we're waiting for HTTP to be standardized in WASI for, that is just called assemblyscript/node, because ideally, since the APIs are so similar, there's nothing stopping us from creating an API as close to Node as possible, where you could maybe just copy-paste Node code and it should just work, because the languages are so similar.

Aaron: One last ramble if you don't mind ...

Jeremy: No, go ahead.

Aaron: ... is that AssemblyScript is so similar to TypeScript that even though you can't just take AssemblyScript code and run it through the TypeScript compiler as-is, you can take AssemblyScript code, do some small tweaking here and there, and get it to run through the TypeScript compiler. You can have the same source code, and let's say you're running in an environment that for whatever reason doesn't support WebAssembly, you could just compile it to JavaScript instead and call it a day. Which I think is a testament to how similar these two languages are. So yeah.

Jeremy: Well, that's portability too, right? That's actually really cool. So you mentioned a little bit about the ecosystem and the community around it. That is a huge thing, where, again, things grow when Stack Overflow is a lot of people's friend, right? It's a problem if you can't get your questions answered there, or you can't Google for some of these things. So I know you said the ecosystem is growing, and you've written a bunch of blog posts, and there are a bunch of other people working on this. But if I'm out there and I'm working on AssemblyScript, am I going to be able to find a lot of blog posts on this, or is this still very, very early?

Aaron: Yeah. So I think it's probably a little 50/50, not that it's that early. So I guess the reason I'm saying 50/50 is because there aren't swaths of people on Stack Overflow who are all AssemblyScript experts and can answer all your questions. The community's a little tighter. So we have a Discord server and we have a help channel that's very active, if you want to have a question answered by the people that write AssemblyScript. Yeah, they're there, almost always ready to answer any questions; I'm there. Our team is about four to five people, sorry, maybe a sixth person, but we're all there. And whenever I have downtime at work, I'll check and see if anyone asked me any questions. So it's a tight community on our Discord.

In terms of blog posts, there are a lot of blog posts. The only reason why I'm 50/50 there is because the project has grown really fast. So there are some blog posts, even ones that I've written, that are like, hey, here's how you use pointers in AssemblyScript. Yeah, not everyone wants to do that. Now we have a more mature runtime, garbage collection and stuff. So your mileage may vary depending on what blog post you find. There's a lot out there, yes, but maybe 50% of them are still very relevant to the AssemblyScript you would write today, if that makes sense. Or they're like, here's how you write your own JSON thing, but now there's a package for it, so you don't have to do it yourself. So yeah.

Jeremy: That makes sense. So have people been using this? Are there some success stories here? Have people been successfully building applications with AssemblyScript?

Aaron: Yeah. So, just going chronologically in my head, the first one, if I may do a little self-plug here, is that the reason why I got involved in the project was because I was really excited about WebAssembly when I heard about the performance, and I was like, oh, I've got to get on this ASAP. So one thing I like to do is build emulators, because I think they're really good for testing any new technology, because you need graphics, you need audio, you need to make sure that it runs fast enough, and things of that sort. So I built a Game Boy emulator called wasmBoy in AssemblyScript. And it was really early days, where it was just pointers, memory, stuff moving around, but because it was a Game Boy emulator, that's what you have to do anyways. And from that, we found a lot of bugs in the project and we worked through them together.

Yeah, that probably, I guess, helped get people in with this small community early on; it was like, you went and built the Game Boy thing, nice! So that's probably the first one. Probably the most recent one I can think of, in terms of just the general community, is there's someone named Marcel Duin, I hope I said their name right. But they work on a storytelling application, I think it's called Micrio. And they wrote a whole article about how they were using JavaScript and, I think, Canvas at the time and things like that. And they had taken all those hot paths, those things that were computationally intensive, rewritten them in AssemblyScript, and started using WebGL and stuff, just updated the application to modern day. But AssemblyScript was a huge part of that. And they wrote an article about it, like, yeah, I could read it, it made sense; compared to alternatives I didn't have to learn a whole new language. They had a really good experience with it.

And I'm 99% sure the article is still up right now, but they shipped it to production and their users are happy, and they saw huge performance increases from just the nature of WebAssembly, because they used it in the right places and things. So that's really exciting. And then probably more recently, today Fastly has been using AssemblyScript; it's one of our supported languages on Compute@Edge. It's still in beta, and we've had some customers try it out and give some really good feedback about it. And some folks really like it, again, because it's like, hey, look, I know JavaScript, this isn't too much of a transition for me. Cool. Thank you. So that's really exciting.

And then I know Shopify publicly announced they've been playing with AssemblyScript a lot. I know from being on the team that we chat with them a bit, but I don't want to get too into their business, though I'm very happy for them. I would redirect you over to what they've said, just so I don't say anything wrong. Yeah. But they're trying it out as well, which is really exciting. And shipping stuff, if I'm not mistaken. So, yeah.

Jeremy: Awesome. So it sounds like WASM and WASI, AssemblyScript, all of these things are coming together. It's growing, it's becoming more solid like you said, you wrote some blog posts where it was really low-level stuff, and then you started building ways that make that easier. So I guess there's still some limitations in here, it's probably not the right choice for everybody, but I'm excited about it because I think that it could change a lot of things, but I guess maybe since you work on this team and you're part of this ecosystem, what's the future? Where do you think this is going to go? And are we going to get to a point where we can use this for pretty much just everyday stuff?

Aaron: Yeah, totally. So the first one that I'd probably be most excited to bring up is this idea of nanoprocesses. That sounds really flashy. I'm not the one championing this, I'm not the main person behind this; I would very much redirect you to Lin Clark, who gave a talk about this idea at the WebAssembly Summit last year. But from how I interpreted it, nanoprocesses carry this concept of shared-nothing linking. So for example, in the NPM ecosystem, you might have an NPM module that requires another NPM module, and the top NPM module is like, hey, look, I want access to everything that NPM or Node provides. And then for some reason the top module is like, I'm going to throw some things on global because I feel like it, and the bottom module is like, oh, that's cool, I'm going to figure out some way to get required by this person. That way I can start accessing the things that I didn't have to require.

Therefore, I look okay from a security perspective, but I'm taking advantage of some of the JavaScript-type things that NPM allows. And we've had a lot of security scares, I guess, and they were valid scares, in the ecosystem because of this problem. So the idea of shared-nothing linking is that, in the future, we're hoping that WASI modules can import other WASM modules. From that, you have to declare, again, to the host call or whatever it may be, I want these specific things, so we can figure out, okay, when we run in the runtime, we know you only need access to fs. You don't need access to machine learning or, whatever, crypto; your parent module has access to that, but you don't need it. So we're going to make sure you only have access to those things, not everything that your parent needs, only what you need. This creates a really good situation where you're not escalating out of one WASM module into other WASM modules, because you're really only linking what you need and not the whole everything. If that makes sense.

And another big one is interface types. So we're already on this topic of WASM modules importing other WASM modules and linking together and things. But not every language is exactly the same. We talked about all these languages on top of, let's say, WebAssembly. So you start to see a problem where it's like, well, Python strings are different from JavaScript strings, which are different from Rust strings, which are different from Java strings; what do we do? Hence this idea of interface types. It's pretty much a specification for like, hey, look, if we know your language uses UTF-8 and this language uses UTF-16, let's figure out a way so that when y'all talk to each other, there's not a huge performance cost that needs to be paid of re-encoding every single time y'all go back and forth between each other. Or if you just take a string and pass it through to a third WebAssembly module, we shouldn't have to re-encode it from UTF-8 to UTF-16 back to UTF-8; it should just go straight through, you know what I mean?

So we're trying to specify that and find a way we can standardize it, so that WebAssembly stays fast in that regard of passing memory around between other WebAssembly modules, and leading towards that ... take it with a grain of salt, but that NPM-ness, where everyone is connecting these packages like LEGOs that build a bigger picture and a better structure, using community and open source and things. So, yeah. Let me try to think, and then ...
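A small sketch of the re-encoding step that interface types aim to avoid: JavaScript strings are UTF-16 internally, while many WASM-targeting languages use UTF-8, so each boundary crossing today pays an encode or decode like this.

```typescript
// JavaScript strings are UTF-16 internally; a UTF-8 module on the other
// side of the boundary needs bytes like these. Without interface types,
// this conversion happens on every crossing.
const encoder = new TextEncoder();            // JS string -> UTF-8 bytes
const decoder = new TextDecoder("utf-8");     // UTF-8 bytes -> JS string

const utf8Bytes = encoder.encode("héllo");    // what a UTF-8 module would receive
const roundTripped = decoder.decode(utf8Bytes); // paid again on the way back

console.log(utf8Bytes.length);       // prints 6 -- "é" takes two bytes in UTF-8
console.log(roundTripped === "héllo"); // prints true
```

Interface types would let hosts and modules agree on string representations so a value passing straight through a chain of modules is not re-encoded at every hop.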

Jeremy: Was there anything on performance? I know the performance is pretty good, but any updates or ideas on performance?

Aaron: So glad you brought that up. Thank you. So yeah, one of the big ones is SIMD. One of the champions for it, one of the most involved people, is Thomas Lively, and they gave a really funny talk, in which ... SIMD, what it stands for is single instruction, multiple data. And I can't explain it well over video, but the idea is pretty much, they're like, hey, let's say we have this array of four numbers and we want to add them all to another four numbers. If we were to do that in JavaScript, we'd be like, okay, well, array zero plus array zero of this one becomes the new array zero, array one plus array one, there goes the new one. And then in their slides, they're like, but with SIMD we just add them directly and there's the result. And you know it's faster because it took fewer slides to explain that than going array zero to array zero.

So that's really the idea: if you had a vector, in the mathematical terms, or an array of numbers, and you want to add them to another array, or apply the same single instruction to multiple data, it just does it all at once. And you get lots of performance benefits out of that. Getting that working in WebAssembly takes a ridiculous number of different instructions, because you have to think, okay, well, i32 lanes of one width versus another; it just creates a lot of WebAssembly instructions. I'm probably rambling a little too much. And then I had one more note: threads. Threads is another ...
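A hypothetical scalar sketch of the lane-by-lane loop described above; with WASM SIMD, a single `i32x4.add` instruction would replace all four additions in the loop body.

```typescript
// Scalar version: one add per lane, the "array zero plus array zero,
// array one plus array one" pattern. WASM SIMD's i32x4.add would do
// all four lanes in a single instruction.
function addLanes(a: Int32Array, b: Int32Array): Int32Array {
  const out = new Int32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] + b[i];
  }
  return out;
}

const result = addLanes(
  new Int32Array([1, 2, 3, 4]),
  new Int32Array([10, 20, 30, 40])
);
console.log(Array.from(result)); // prints [ 11, 22, 33, 44 ]
```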

Jeremy: Yeah. Threads, I was curious about that because I didn't ask you about it earlier. And I'm just wondering, when WASM runs, is it single-threaded or multi-threaded? Can it do multiple threads? How does that work?

Aaron: Yeah. So as of today, WebAssembly is all synchronous and it's all single-threaded. In JavaScript ... I've given a lot of talks about this. If you use Web Workers, which unlock multiple threads on the web, with WebAssembly, the performance is, like, chef's kiss. Yes, if anyone out there is building a computationally-intensive web app, Web Workers and WebAssembly are a match made in heaven. But for a lot of folks out there that wanted to maybe port over a larger C application, for example, launching off threads in WebAssembly itself is something that a lot of folks are excited about. If I remember correctly, at the last Google I/O or Chrome Dev Summit or whatever it may be, VLC has been talking about a port to the web and what they're doing there.

And they're really building a lot of their port on top of WebAssembly threads, and a lot of folks at Google are doing a lot of specification work there. So it just shows, hey, look, we can watch whatever random video format in the browser, not only the supported ones, using a VLC port. Yeah. And also, again, I'm not an expert on the work at VLC, so we can do some Google searches and I can send a link or provide more research later, but yeah, those are super exciting. So yeah.

Jeremy: And I mentioned earlier this idea of an NPM-type thing, but is it possible? Because this is another thing that I like about serverless: you can write one function in Python, another function in JavaScript or Node, and then another function in Java or whatever. And of course, they're isolated, so they don't necessarily have to run together, so you can use those different runtimes. But is that something that's possible? If somebody wrote a WebAssembly script in Rust and compiled it down, and then somebody else wrote one in AssemblyScript, are those able to work together?

Aaron: So that's the future we're definitely headed towards, and we're really excited about it. There's a specification I mentioned earlier about a WebAssembly module importing another WebAssembly module. We're hoping to do that with this thing called module imports, which we're working on. So essentially you wouldn't have to have all AssemblyScript or all Rust; once they compile to WebAssembly, they can say, okay, well, we're both WebAssembly now. So let's import each other; it doesn't matter what our source code was. So module imports is hoping to solve that problem. And once we get more of these things standardized while things are still young, that's the future we're headed towards. My dream, and it seems like yours as well, is like ... Personally, I love all the work that Python does in the machine learning community, but Python's not for me, yet I would love to access all the cool things they're doing over there. But as a primarily JavaScript developer, I'm like, oh, I'm just going to play with the DOM ...

Jeremy: And that's actually ...

Aaron: ... not to downplay myself, but you know what I mean.

Jeremy: No, no, no, no. No, but that's interesting, because that's the thing where, think of a global NPM, you know what I mean? Where some functionality was built, and you can pull that in. Maybe you're an AssemblyScript person because that's what you're familiar with, but you can't make AssemblyScript do some complex crypto thing or whatever that may be; some other language could, and compile down, because it's either further along or whatever it is. Being able to share those across applications and reuse them, that sounds pretty exciting.

Aaron: Yeah. It sounds super exciting. Definitely, I think that's the future we're headed towards. As a community, we'll see what gets announced here and there and things. And again, there are some folks trying this idea out already, but we'll see. Really, again, people are very bullish on this idea of, yes, we can totally have this crossover. No longer are we bound to our languages. It's like, we just write code and we can all be one large ecosystem. It's going to be awesome. I'm looking forward to it.

Jeremy: Cool. So I want to ask you one more thing though. So let's say that I'm a developer I'm listening to this. And I say, WebAssembly sounds amazing, how would I convince my development team or more importantly, probably my boss, how would I convince them, hey, this is something we should start investing in?

Aaron: Yeah. I'm so glad you asked. So the thing about WebAssembly that's really interesting, and I get asked this often, is that it really depends on what you're doing. We've covered like, oh, you can do this, you could do that, so it depends what your area is. For web applications, definitely, if you're building an application that does computationally-intensive things, maybe you're working on an online photo editor, or whatever it may be. Maybe you're working on a spreadsheet tool where you have to do complex math functions. Just understanding really where WebAssembly fits into an application, that's where you can start to make the pitch to your boss. So if you're really excited about this, right now I have a resource that I built called WASM By Example, which is how you can get started with little bite-size examples of WebAssembly, but another one is Made with WebAssembly.

So wasmbyexample.dev, and then madewithwebassembly.com. Made with WebAssembly is just a showcase of a bunch of folks using WebAssembly in production and side projects, showing, okay, well, what's WebAssembly good at; these are some examples. So you can go on madewithwebassembly and be like, oh, my app does something similar to that, maybe I'm on the right track. Maybe it's worth considering, well, what your team needs and things. So yeah, I guess really just understanding what WebAssembly is good at, and then seeing how you can fit it into your application, is probably your best bet, because then you can make a solid argument for why.

Jeremy: Awesome. All right. Well, I'm going to put all of that information in the show notes. I know there's a bunch of links that you have, and all really good documentation and just good ideas, like you said, if you want to convince your boss or whoever to use it. So I'll make sure I get all that in the show notes. But Aaron, thank you for sharing this. This was super exciting, because I just don't know enough about this stuff, so anytime I can, I just love learning this stuff. And I'm super excited about the Compute@Edge stuff. I think WebAssembly is such a great packaging format for that. But I'll put some of those other links you mentioned in the show notes. If people just want to find out more about you, how do they do that?

Aaron: Yeah, totally. Probably the best way to get ahold of me is on Twitter. So my username is torch2424. Just as a quick aside, people ask me why: I made it when I was really young, when I was like six. Torch was my imaginary friend, and 24 is my favorite number, Kobe Bryant was 24, so 24 twice. So ...

Jeremy: There you go.

Aaron: Yeah. Hopefully that story helps the name make sense, but yeah, I'm on Twitter most of the time. So yeah, reaching out to me on Twitter is probably the best. I'm trying to think. If you like ...

Jeremy: I've got your LinkedIn here, I've got your GitHub which is torch2424 as well, check out Fastly.com and everything that your team is working on over there. And then I think, yeah, you mentioned The Bytecode Alliance too, right? So that's just bytecodealliance.org. Probably some great information there and then I'll get everything else in the show notes so people can check this stuff out. Give them some weekend reading to do, but thanks again, Aaron. I really appreciate it.

Aaron: Yeah. Thank you very much for having me. I super appreciate it. And yeah, looking forward to keeping in touch and things.

2021-03-22

Episode #92: Streaming Data at Scale Using Serverless with Anahit Pogosova (PART 2)

About Anahit Pogosova

Anahit is an AWS Community Builder and a Lead Cloud Software Engineer at Solita, one of Finland's largest digital transformation services companies. She has been working on full-stack and data solutions for more than a decade. Since getting into the world of serverless she has been generously sharing her expertise with the community through public speaking and blogging.

Twitter: @anahit_fi
LinkedIn: https://www.linkedin.com/in/anahit-pogosova/
Solita: https://www.solita.fi/en/
"Mastering AWS Kinesis Data Streams, part 1": https://dev.solita.fi/2020/05/28/kinesis-streams-part-1.html
"Mastering AWS Kinesis Data Streams, part 2": https://dev.solita.fi/2020/12/21/kinesis-streams-part-2.html
AWS Community Day Nordics 2020: https://youtu.be/gtE2o8qsq-4

Watch this episode on YouTube: https://youtu.be/7pmJJcm0sAU

This episode sponsored by New Relic and Stackery.

Transcript


Jeremy: So you mentioned poll-based versus stream and things like that. So when you connect Kinesis to Lambda, this is the other thing, too, that I think confuses people sometimes: you're not actually connecting it to Lambda directly. For pretty much all of these triggers and integrations, there's another service in between. So what's the difference between the Lambda service and the Lambda function itself?

Anahit: That's a great one, because I think it's, again, one of those very confusing topics which are not explained too well in the documentation. And the thing is that when you're just starting to dip your toes in the Lambda world, you just think, "Okay, I write my code, and I upload it and deploy it, and everything just works. And this is my Lambda," right? But you don't really know how much extra magic is happening behind the scenes, and how many components are actually involved in making it a seamless service. And there are a lot of components that come into it. So you can think of the Lambda function as the function that we actually write and deploy and invoke. But then the Lambda service is what does all the triggering, invoking and batching and error handling.

And it really depends on the way the Lambda service works. It really depends on the invocation model, whether it's poll-based or not poll-based. So again, one thing that is not too clearly explained, in my opinion, is that there are actually three different ways you can work with Lambda, or communicate with Lambda. So you can invoke a Lambda synchronously, the traditional request-response way, and the best example, I think, is API Gateway, which does that: it requests something from Lambda and waits for the response. Then there is the async way, which is one of the most common. So you just send something to Lambda and you don't care about what happens next.

Jeremy: Which uses an SQS queue behind the scenes to queue ...

Anahit: Exactly. Yes. That's also one of those fun facts that you learn along the way. But the point is that services like SNS, for example, or S3 notifications, they all use the async model, because they don't care about what happens with the invocation. They just invoke Lambda and that's it. But then there is this third, gray area, a totally different way of invoking the Lambda function, and it's called poll-based. And that's exactly how Kinesis operates with Lambda. It's meant for streaming event sources, so that's both Kinesis Data Streams and DynamoDB Streams. Kafka also currently uses the poll-based model. And it also works with queue-based event sources like SQS.
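
For readers following along, the first two invocation models differ only in how the Invoke call is made; the third, poll-based, is not an Invoke call you make at all. A minimal sketch of the first two, assuming boto3-style parameters; the function name and payload are placeholders:

```python
import json

def build_invoke_params(function_name, payload, asynchronous=False):
    """Build the parameters for a Lambda Invoke call.

    InvocationType='RequestResponse' waits for the result (the API Gateway
    style), while 'Event' queues the request and returns immediately
    (the SNS / S3-notification style).
    """
    return {
        "FunctionName": function_name,
        "InvocationType": "Event" if asynchronous else "RequestResponse",
        "Payload": json.dumps(payload),
    }

# With boto3 you would pass these straight through, e.g.:
#   boto3.client("lambda").invoke(**build_invoke_params("my-fn", {"id": 1}))
```

The poll-based model has no equivalent client call: as discussed next, the event source mapping performs the reads and invocations on your behalf.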

Jeremy: Right. SQS, yeah.

Anahit: And Amazon MQ, I think, also uses the poll-based method. And the component that is most essential in the poll-based model is called the event source mapping. One of the most misunderstood components, or one of the hidden heroes of Lambda, I would say, because it's an essential part of the Lambda service. And the event source mapping actually takes care of all the extra things that the Kinesis plus Lambda combination is capable of. So it's responsible for batching, it's responsible for keeping track of the position in the stream within a shard, where it's ...

Jeremy: A shard iterator, for anybody who wants to know the ...

Anahit: Yes, exactly, shard iterator.

Jeremy: ... technical term for it.

Anahit: Yes, thank you. And, yeah, the most important for me, it handles the errors and retries behind the scenes.

Jeremy: Right.

Anahit: And basically, if you don't have the event source mapping, you can't have batching. So in the case of a standard consumer, it polls your Kinesis stream on your behalf, it accumulates batches of records, and then it invokes your Lambda function with the batch of records that it accumulated. In the case of enhanced fan-out, of course, it doesn't poll, it gets the records from the Kinesis stream directly. But from the perspective of your Lambda function, it doesn't matter, it just gets triggered by the event source mapping. Because as you've said yourself, it's not the Lambda that you connect to the Kinesis stream, it's the event source mapping that you connect to the stream, and then you point your Lambda to that event source mapping, so.
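
As a rough mental model (the names and internals here are illustrative, not the actual AWS implementation), the event source mapping for a standard consumer behaves something like this:

```python
def event_source_mapping(poll_shard, invoke_lambda, batch_size=100):
    """Toy model of a poll-based event source mapping for one shard.

    It polls the shard on your behalf, accumulates up to `batch_size`
    records, and only then invokes the function -- the function itself
    never talks to Kinesis.
    """
    batch = []
    for record in poll_shard():
        batch.append(record)
        if len(batch) >= batch_size:
            invoke_lambda(batch)
            batch = []
    if batch:  # flush a final partial batch
        invoke_lambda(batch)
```

The real service also tracks the shard iterator and handles errors and retries, which the conversation returns to below.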

Jeremy: Right. So the Lambda service can poll the Kinesis stream directly, or you can use enhanced fan-out and have the records pushed to the Lambda function. Although, for all intents and purposes, it's pretty much the same thing.

Anahit: Yeah. And for your Lambda function, it doesn't really matter how that data ended, or how those records ended up there, you just get a batch of records, and then you deal with it. And I mean, all the rest is pretty much the same from the perspective of a Lambda function, because it's nicely abstracted behind the event source mapping, which hides all that magic that happens behind the scenes.

Jeremy: Right. So you mentioned some aggregations stuff in there and about like Windows and time windows and things like that. So tumbling windows, that's something you can do in Kinesis, as well. Can you explain that?

Anahit: Yeah, it's a feature that actually came out very, very recently. At the end of re:Invent, I would even say, and I think it was one day before I was going to publish the second part of my blog post, which was finally ready to submit, and then in the evening I get this announcement and I was like, "Okay, I have to write a whole new chapter now." But it is a very interesting aspect. You can use it with both Kinesis and DynamoDB streams, actually, so it's available for both. And it's a totally different way of using streams, which wasn't there before. So with a Lambda function, you know that you can't retain state between your function executions unless you are using some external data source or database.

And here, what you're allowed to do with this tumbling window is that you can persist the state of your Lambda function within that one tumbling window. So the tumbling window is just a time window, it can be a maximum of 15 minutes, and all your invocations within that 15-minute max interval can pass the state along, aggregating it and passing it to the next Lambda invocation. So you can do cool things like real-time analysis that you could previously do only with Kinesis Data Analytics, for example. Here you can do it right inside your Lambda. And then when the interval is ending, the 15-minute interval, for example, you can send that data somewhere, let's say to a database or somewhere else. And then the next interval starts, and then you're accumulating again.
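
A sketch of what such a handler can look like. The event fields (`state`, `isFinalInvokeForWindow`) follow the documented tumbling-window event shape, but the summing logic and the `amount` field are invented for illustration:

```python
import base64
import json

def handler(event, context=None):
    """Tumbling-window aggregation sketch: sum a counter across all
    invocations in the window, then flush when the window closes."""
    state = event.get("state") or {}   # state carried over from the previous invocation
    total = state.get("total", 0)
    for record in event.get("Records", []):
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        total += payload.get("amount", 0)

    if event.get("isFinalInvokeForWindow"):
        # Window closed: persist `total` somewhere durable (DynamoDB, S3, ...)
        # and start the next window with empty state.
        return {"state": {}}
    return {"state": {"total": total}}  # passed to the next invocation in the window
```
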

And so it's pretty fascinating in the sense that it allows you to do something that wasn't there before. It's a completely different way of using Lambda with streams, basically. But of course, there are limitations: you can only aggregate the data within the same shard, because one Lambda is processing one shard at a time.

And then there is also this thing called the parallelization factor, which we haven't talked about. But it basically means that instead of having one Lambda reading from one shard at a time, you can actually have up to 10 Lambdas reading from that same shard. So you can boost the reading power, because if one Lambda execution takes too long, for example, and you can't keep up with your stream, then you can either add more shards to your stream, but that's expensive, takes time and has some limits. Or you can immediately just throw more Lambdas at it, just add more horsepower, and they will take care of it. But if you have more than one Lambda reading from a shard, you can't use this new tumbling window feature, which makes sense, of course.

Jeremy: Right. And that depends on what you're doing because, I mean, the idea of the parallelization factor, such a hard word to say. But the whole point of that is to say you're reading up to 1,000 records per second off of this stream. And if for some reason it takes more than a second to process one of those batches or whatever, then you're going to see the problem of not being able to process records quickly enough, because you're backing up your stream as you're writing to it. So again, it's just one of those trade-offs.

Anahit: Yeah, but again, this new feature, I think it's going to be developed further, maybe someday it's going to have some kind of support for that, though I can't really see how, under the hood, it would work across different Lambdas. But anyhow, this, I think, is a very cool new thing that I'm actually eager to try out in production, if I can just figure out a use case for it, because it just looks so cool. And it's so simple to do.

Jeremy: Yeah. Well, I mean, the other thing is that it depends on what you're doing with it. So the use case that I've seen, and actually I started playing around with SQS batch windows to try to do something similar. I know they're different, but it's the same idea, where when you're doing aggregations, if you're just reading off a stream and trying to aggregate, you have to grab that data from somewhere, like you said, because Lambdas are stateless.

So you have to query a DynamoDB table or something like that, and pull back what the last aggregations were. And then you read in the data from the stream, you do your aggregation there, and then you write it back to DynamoDB. If you're doing that hundreds of times a second, that is pretty inefficient. Whereas if you set your tumbling windows to even one minute, you could read thousands of records and then just write the result back to the database one time. The efficiency gain there is huge.

Anahit: Exactly, exactly. If you have a use case like that, because I personally don't, that's why ... I'm trying to come up with one in the future ...

Jeremy: Come up with ...

Anahit: Yes.

Jeremy: Find the problem for the solution, right?

Anahit: Yes, exactly. But yeah, it can be very, very helpful. And again, it's pretty straightforward to use. So I can see a lot of people really loving it.

Jeremy: Yeah. Awesome. All right. So another thing I just want to mention, because we keep talking about Lambda, and we mentioned the concurrency thing and some of those other bits. In terms of provisioning shards and having one Lambda per shard, and then potentially, if you do the parallelization factor, 10 Lambdas per shard, if you had 100 shards, because you had a lot of data coming in, and you had the parallelization factor turned on, then you've got 1000 concurrent Lambdas being run at once, which I did ...

Anahit: And guess what happens next?

Jeremy: So what happens next, yeah. And for people who don't know, the soft limit in any region is 1000 concurrent executions for your Lambda concurrency. So that's just something that people need to think about.

Anahit: Yeah, for sure. And that's something I bring up quite often, because we've been there, done that. But actually, 100 shards is not even too much. There are apparently streams with thousands of shards. We have something like 40 shards in our stream, so it's really quite a decent amount. But yes, as you said, if you have, for example, 100 shards, and then you have a parallelization factor of 10, you will have 100 times 10, 1000 Lambdas, running, or consuming that stream at all times. So there will be constantly 1000 concurrent Lambda invocations. And you probably won't run into any problems until there is some other Lambda in that same region in the same account, that is probably very business-critical, that does something very important, and then it starts to fail for some unknown reason. And that reason is not even that Lambda, the reason is your stream, which is consuming the entire concurrency budget that you have allocated for Lambda.
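
The arithmetic here is worth making explicit, since it is easy to hit the default regional limit without noticing:

```python
def stream_concurrency(shards, parallelization_factor=1):
    """Steady-state concurrent Lambda invocations a stream consumer can
    pin down: one invocation per shard, times the parallelization
    factor (1 to 10)."""
    return shards * parallelization_factor

# 100 shards with parallelization factor 10 eats the entire default
# regional soft limit of 1000 concurrent executions, starving every
# other function in the account/region:
DEFAULT_SOFT_LIMIT = 1000
assert stream_concurrency(100, 10) == DEFAULT_SOFT_LIMIT
```
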

So yeah, it's something people overlook quite often: though Lambda potentially scales endlessly, in reality all the services come with the safety mechanism of soft limits. There is no service in AWS, I think, that comes out of the box with no limits, just use it as it is. So basically, for your own safety, there are some soft limits. On the other hand, they are soft, which means that you can increase them by submitting a ticket to support. It will take some time, I warn you, especially if you go higher than normal.

But though you can do that, there is still going to be a limit. There is always going to be a limit. And you just need to know that it exists, because one day you're probably going to hit it. So you have to monitor that all the time. And yeah, that's one thing to keep in mind. But that's a common thing with SQS as well, for example, and maybe SNS. So with all the services that can scale Lambdas pretty much out of hand, you're faced with those Lambda concurrency limits that you have to be careful with.

Jeremy: Right. One limit that I love that has nothing to do with Kinesis, but with SNS: if you send SMS messages with SNS, I think the default limit is a $1 spend per month. So if you send like 200 text messages or something like that, it ends up cutting you off. Maybe not 200, you can probably send more than that. But it is a very, very low limit. And I think it's just because they don't want people, I don't know, spamming SMS, or something like that. But anyway ...

Anahit: Can you increase it? Is it a soft limit?

Jeremy: Oh, you can increase it, yeah. But you have to submit the ticket. But basically, I remember, I set up a new account for something and we were doing all these alarms, and it was like, within two days, I get a message saying ...

Anahit: "That's it. That's enough."

Jeremy: ... "No, you've exceeded your limit." And I was like, "Well, that was fast." So if you're using something like Control Tower, or any of these things to provision hundreds of accounts, some of these soft limits that are in there can affect you. Whether it's Lambda or some of these other ones, but ...

Anahit: It's surprisingly easy to reach all of the soft limits as soon as you go from those "Hello, World!" cases to real-world cases. And yeah, reaching the limits is not a problem per se, the problem is that many people don't know that they are there.

Jeremy: Yeah, yeah good point.

Anahit: That's when the problem starts.

Jeremy: Good point. All right. So speaking of limits, there are limits, obviously, to Kinesis. And some of these things maybe even go beyond the soft limits. I mean, there are just limitations in distributed systems, and there are limitations in network throughput and some of those other things. So as you hit some of those limits, or maybe let's just talk about errors in general, as you start to run up against problems, whether they're caused by limits or by something else, what are some of the things, I guess, that could go wrong when you're using Kinesis?

Anahit: Right. Well, that's my favorite topic, really. But I mean, with every service, as you said, nobody says it better than Werner Vogels, who says that "Everything fails all the time." And I love that phrase because it's true.

Jeremy: Very true.

Anahit: And it's not because you want to be pessimistic, but rather because you want to be prepared, and you want to sleep better at night. Because if you're not prepared, then surprising things will happen eventually. And for me personally, with Kinesis, or any other service, really, when I start working with a new service, the first thing I ask is, "What are the ways in which it fails? What are the possible errors? What are the possible limits? Are they hard limits? Are they soft limits?" All this. And even, what's the built-in functionality for retries, for example? What are the default timeouts? That kind of thing.

Those are the kinds of things you start to question after you have had a lot of headaches with one of the services, and then you ask those specific questions whenever you start working with a new service. And with Kinesis, you can roughly separate the errors into errors for writing to the Kinesis stream and errors for reading from it.

So for writing, well, first, it's nice that the AWS SDK has built-in functionality for retrying all the system failures that happen, and timeouts as well. And it's not documented, or it used to be not documented too well, because personally, I learned about the built-in retries when I was developing unit tests, and they were behaving weirdly, and I was like, "Something's going on here. What's that?" And then I realized, oh, it retries three times by default for every system error. Wonderful, that's wonderful news.

But the maybe not-so-wonderful news is the thing called partial failure. And it's actually very common for all the services that use batching. What it means is that when you, for example, write a batch of records to Kinesis, it's not an atomic operation; it's not that either the entire batch succeeds or the entire batch fails and you get an error code back from Kinesis. The reality is that you almost always get a success code back from Kinesis, and it's very misleading, because parts of that batch could have failed, and you don't know about that. And what you should do, instead of just waiting for an error to come back, is look at the response that comes back from Kinesis, and see if there is this field called FailedRecordCount, which basically tells you whether there were failures within that batch, records that didn't go through to Kinesis. And that can happen, for example, because of throttling. So some of the records just didn't make it to the Kinesis stream.
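
A sketch of handling those partial failures when writing, assuming a boto3-style Kinesis client. The `PutRecords` response really does carry `FailedRecordCount` and a per-record `ErrorCode`; the backoff here is deliberately crude (no jitter):

```python
import time

def put_records_with_retries(client, stream_name, entries, max_attempts=3):
    """Retry only the failed entries of a PutRecords call.

    A 200 response can still contain partial failures: check
    FailedRecordCount and the per-record ErrorCode instead of relying
    on an exception being raised.
    """
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=entries)
        if response.get("FailedRecordCount", 0) == 0:
            return []  # everything made it
        # The result list lines up index-by-index with the request;
        # keep only the entries whose matching result carries an ErrorCode.
        entries = [entry for entry, result in zip(entries, response["Records"])
                   if "ErrorCode" in result]
        time.sleep(2 ** attempt * 0.1)  # crude exponential backoff
    return entries  # still-failed entries after all attempts
```
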

So that's one of the main issues that we have had with Kinesis streams. You have to take care of those partial failures manually, and you have to do some smart retries with backoff and random jitter and things like that. And then there are the timeouts, of course, which always happen, and you need to know the default settings for the timeouts. Because in the case of Kinesis, for example, the service times out after two minutes. And actually, there are two timeouts: there is a timeout for a new socket connection, and then there is a timeout for sending a request.

So first, you will wait two minutes to create a socket connection, then you will wait another two minutes for sending the request, and then it will be retried three times. And then you're 10 minutes in and still waiting for one batch to go through, in a pessimistic scenario. And again, those are things you don't really see in the documentation right away; those are the things you end up finding out because you have problems. And the other point is that you almost always, or always, have to set the timeouts to a lower value than the default two minutes.

Jeremy: Yeah, those defaults ...

Anahit: It's crazy.

Jeremy: ... defaults are not great.

Anahit: Yeah. No, not at all. So those are the main things with writing: partial failures, sometimes timeouts, that kind of thing. But with reading, things get even more interesting, because there are so many options. And one of the things that is very common is called the poison pill record.

Jeremy: Yeah, the poison pill.

Anahit: Oh, the poison pill, yes. And nowadays, it's actually pretty avoidable, but let's get back to that later. But the idea of the poison pill is that you have a Lambda function attached to your Kinesis stream, and it's reading from the shard, and everything is fine, until there is some corrupt record in your shard for some reason. And your Lambda function tries to read that record and it fails, and you don't have proper error handling, because, well, who needs error handling, and then your entire Lambda function fails, right? But when your entire Lambda function fails, what happens is that Lambda, or as we know, the event source mapping, actually returns the entire batch back to the stream. And then it makes Lambda retry with that entire batch that just failed.

Jeremy: And speaking of defaults, it retries 10,000 times I think by default?

Anahit: No, you're too optimistic. By default, it retries forever.

Jeremy: Oh, forever.

Anahit: Yes, it retries until the data expires, which means from 24 hours up to seven days, or even one year. But let's explain why that's not a good thing. Well, first of all, you don't want to have all these unnecessary Lambda invocations that don't do anything. They just get sent the same records, and then they ...

Jeremy: They keep failing ...

Anahit: Yes. They keep failing at the same point in the batch, and then they start all over again. And it's pointless. But the problem is that in, well, let's say 24 hours, let's take the optimistic scenario, so in 24 hours the batch finally expires and Lambda can forget about it. So the batch gets deleted from the stream, and then the next batch comes in. But the problem here is that by that moment in time, your stream, or your shard, is probably filled with records that were written around the same time as the records you were trying to process, which means they are expiring around the same time as the previous batch.

And if your Lambda is not quick enough, you might end up in a situation where records start falling off your stream. It's this overflowing sink analogy that I had in my blog post: you basically pour water into the sink more quickly than it can drain, and then the water ends up on the floor. So that's the exact situation. So basically, just because of one bad record and no proper error handling, you can potentially end up losing a lot of records. Hence the poison pill, because that one bad record poisoned the entire shard, basically.

Jeremy: And I'm actually curious, it's something I've never tested, but let's say that you get a batch of records, it's only, say, 100 records, because there are only 100 records in the stream. So it sends that batch to Lambda and then Lambda fails, because there's a poison pill, and it sends those 100 records back. If another 100 records come in, because it's still under whatever your threshold was for batches, would it then send in the 200 the next time, and will it keep sending up to the full batch amount as it retries those batches?

Anahit: I would imagine it should, really. That would make sense. We just never had that situation, because we usually have a complete batch.

Jeremy: Right, you have a full batch.

Anahit: But I would imagine that's how it should work, because it accumulates an entire batch. But it doesn't matter, because it will stop at the exact same record. It processes them in order, so it will stop at that exact same record. And then, well, of course, you could process all the records in parallel if you wanted to, but then you lose the ordering. But yeah, it's a very funny situation, and it's very easy to end up in it, and, been there, done that once again. But luckily ... well, first of all, proper error handling in your Lambda function, where you don't allow the entire function to fail just because of one record that didn't go through. And then there are different ways to approach that.

And then the thing that I was talking about, or mentioning a lot, is the error handling that comes out of the box with the event source mapping. And it's actually still developing; each year they are adding new functionality and new possibilities that weren't there before. So what you said about 10,000 retry attempts, that's a totally new feature, it wasn't there before. They added this maximum retry attempts setting to the event source mapping. But again, by default it's minus one, which means that it retries infinitely. But you can set it to up to 10,000 if you want to. And then you can set the maximum age of the record that Lambda will accept. So if the records get older than some specific age, I think it can be up to one week even, your Lambda will skip those records and won't process them.

And then there are on-failure destinations, where you can send the information about your failed records if everything fails. Then I think one of the fun possibilities is the so-called batch bisecting. It's when you basically split your problematic batch in two, and then Lambda tries to send these two parts separately, and then hopefully one of them succeeds. And then it continues with the failed one and splits it recursively further until, hopefully, you end up with just one bad record.
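
A toy model of that bisecting behavior, showing how much re-processing it causes along the way; this simulates the idea, not Lambda's actual implementation:

```python
def bisect_on_failure(batch, process_record, processed_log):
    """Toy model of BisectBatchOnFunctionError: try the whole batch; on
    any failure, split it in two and recurse until the bad record is
    isolated. `processed_log` collects every record each attempt
    touches, to show how much re-processing bisecting causes."""
    try:
        for record in batch:
            processed_log.append(record)
            process_record(record)
    except Exception:
        if len(batch) == 1:
            return batch  # the isolated poison pill
        mid = len(batch) // 2
        return (bisect_on_failure(batch[:mid], process_record, processed_log)
                + bisect_on_failure(batch[mid:], process_record, processed_log))
    return []  # whole batch succeeded
```

Note how every successful record ahead of the poison pill gets processed again on each split, which is exactly the inefficiency described next.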

Jeremy: Just the one.

Anahit: Yes, but on the way there, you actually end up sending, or processing, the same records over and over and over again. So it's not optimal. And then there was one more announcement around the same time, because of which I had to update my workflows. I think it's called custom checkpoints.

Jeremy: Custom checkpoints, yep.

Anahit: Yeah. It's basically common sense. Instead of just failing your Lambda saying, "Well, no can do. There was a batch, I don't know, something bad happened," you can return the exact sequence number of the record that caused the problem. So you go on with your batch, you process your records, and then you return that sequence number back to the event source mapping, and it knows, "Okay, next time I retry, I will start from there, rather than starting again from scratch." So.
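
A sketch of a handler using that response contract. The `batchItemFailures` / `itemIdentifier` shape is the documented one (the event source mapping has to be configured with `FunctionResponseTypes` including `ReportBatchItemFailures`), while `process` is a hypothetical stand-in for your per-record logic:

```python
def process(record):
    """Hypothetical per-record business logic; raises on a bad record."""
    if record["kinesis"].get("bad"):
        raise ValueError("poison record")

def handler(event, context=None):
    """Checkpoint handler: process records in order and, on the first
    failure, report that record's sequence number so the event source
    mapping retries from there instead of from the start of the batch."""
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            return {"batchItemFailures": [
                {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
            ]}
    return {"batchItemFailures": []}  # empty list: the whole batch succeeded
```
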

Jeremy: And that should eliminate the need to do the bisecting?

Anahit: Yeah, that's ...

Jeremy: The bisecting. Yeah, right. So if you're ...

Anahit: That's what I'm thinking.

Jeremy: ... have an existing system that is using bisecting, you don't have to change it. But AWS likes to do that, where they keep the old functionality in, but there's a better way to do it. The same thing with dead-letter queues and Lambda destinations, right?

Anahit: Yes, exactly. But for the sake of it, if you like the idea of just splitting your batch and sending the parts separately behind the scenes without doing anything, well, you can have that. But yes, of course, this new functionality is so much better, because you avoid all this unnecessary re-processing of the same records. Yeah.

Jeremy: Right. So what are some of those other common issues? I mean, you mentioned timeouts, and network issues obviously happen, but what are some of the other distributed systems things that pop up when you're using Kinesis?

Anahit: Yeah, I think the timeouts and network problems are really the core of it, most of the time. And the other one that I've mentioned several times already is the at-least-once delivery guarantee. Basically, with Kinesis, you are not guaranteed to get your data exactly once, it's at least once. So you will have duplicates in your stream. And it's because of, for example, the retry functionality that we just discussed, both with sending and receiving the records. But also the network issues contribute to that, because you might have sent a batch of records to Kinesis but never heard back from it. You just didn't get the message, so to speak. And then you will retry it, because you don't know whether it went through or not. And maybe it did go through, and then you end up writing the same batch all over again.

These are things that happen pretty much all the time. And the only way to deal with them is to know that they happen and to be prepared for them. With an at-least-once guarantee, your downstream systems must be idempotent, in the sense that they won't change if the same data comes over and over again. So they need to be able to handle repeated records in your stream. And with the network problems, well, there's not much you can do about network problems. Of course, if you have a producer that is running inside a VPC, creating a Kinesis VPC endpoint is a good idea, so the traffic won't leave your VPC. But pretty much, that's the only thing you can do about those.
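
One minimal way to sketch that idempotency on the consumer side. Here the deduplication store is an in-memory set; a real system would use something durable, such as a DynamoDB conditional write keyed on the record's sequence number:

```python
def make_idempotent(apply_change, seen=None):
    """Wrap a side-effecting consumer so that redelivered records are
    ignored. `seen` stands in for a durable deduplication store."""
    seen = set() if seen is None else seen

    def consume(record_id, record):
        if record_id in seen:
            return False  # duplicate delivery: already applied, do nothing
        apply_change(record)
        seen.add(record_id)
        return True
    return consume
```
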

But on the other hand, you can handle those issues with ... let's say, timeouts are also network issues in a way, quite often. And the thing that we were discussing before, the default timeouts are really not that great; most of the time you need to adjust those, with Kinesis especially, or not especially, but it's a good example, maybe. But actually, one fun thing I remembered about the timeouts is related to DynamoDB, which is probably familiar to you, because DynamoDB also has some ridiculous default timeout, like a minute or two, something like that.

And when, a couple of years ago, at re:Invent, I was speaking with one of the DynamoDB guys, I was asking, "Okay, we have this API that needs to retrieve data from DynamoDB, and it needs to be very, very quick. So latency should be very low." And we used to have Lambda in between, so Lambda was doing calls to DynamoDB. And the first thing he said was, "Reduce the timeouts." Because apparently, DynamoDB can time out pretty frequently, so it's much better to drop the connection sooner rather than later. You set the timeout to, I don't know, 1000 milliseconds, and then you let the SDK handle the retry, instead of waiting forever. But that was funny, that was the first thing they recommended me to do. "Okay."

Jeremy: Yep. Even though they set those defaults pretty high, but ...

Anahit: Yeah, exactly.

Jeremy: All right. So then, in terms of monitoring this, though, I mean, that's one thing that I really like about Kinesis is that you do get quite a few metrics where you can look and see how your shards are doing, how quickly they're being drained, how backed up they are, and stuff like that. What are some of those, I guess, the most important metrics that you want to keep your eyes on?

Anahit: Right. So of course, there are separate ones for writing to the stream and for reading from the stream. I would say for writing, it's the, what is it, WriteProvisionedThroughputExceeded metric, the one that tells you that you exceeded the throughput of your stream, basically. So that's the one that was pretty eye-opening for us. Because, well, the thing is, I think, with metrics in general, that they are at best minute-based. So they are aggregated values over one minute of time, right? And with Kinesis, as we have mentioned several times, all the limits are per second. So it's 1000 records per second, one megabyte per second. And that's the information you don't get from the metrics. You don't see the per-second picture.

So there is a metric that tells you how many records come in and how much data comes in. And you might look at those and think, "Okay, the threshold is still far, far away, I for sure have no issues with the stream." And then you notice this provisioned throughput exceeded metric popping up, and you figure out that, okay, apparently bad things can happen even in this situation. Of course, it's because of, for example, the network issues that we discussed before, or a spike in traffic, because the records ...

Jeremy: Lots of traffic.

Anahit: Yep. The records arrive to your stream non-uniformly, in a way. So it might be that one second it's 5000 records, and the next second it's just three records. And you can't see that in the metrics, not even in one metric. You have to observe the metrics that tell you what goes wrong, in a way. That's the key, I guess. And the same goes for reading from the stream, really. There is this ReadProvisionedThroughputExceeded, which is basically only for the shard iterator case, the standard way of consuming the stream. So when you exceed, for example, the two megabytes, or you exceed the five requests per second, which we won't even go into. Read my blog post, you will know what I'm talking about.

But you get those, and then there is, I think, the most important one when it comes to reading from the stream: the iterator age. Because that's the one that tells you the age of the records, meaning how long they have been in that stream. And apparently, if the age increases, it means that you can't consume them fast enough. So then you might have a problem there with your consumer, for example, or you have too many consumers, and then you need the enhanced fan-out and things like that.

But those are basically the two or three metrics that you have to keep an eye on. And if you see any issues with those, then you have to dig deeper, maybe enable the enhanced metrics, which are not stream-level but shard-level metrics. For each shard, you can have the same or similar information, so you can diagnose things more precisely.
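
As a sketch, an alarm on the iterator age metric might be parameterized like this. The metric name, namespace, and dimension are the real Kinesis ones, but the threshold and evaluation periods are arbitrary examples to tune against your stream's retention:

```python
def iterator_age_alarm(stream_name, threshold_ms=60_000):
    """Parameters for a CloudWatch alarm on GetRecords.IteratorAgeMilliseconds.

    A growing iterator age means the consumer can't keep up with the
    stream, which is exactly the poison-pill / slow-consumer symptom
    discussed above."""
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,               # one-minute datapoints
        "EvaluationPeriods": 5,     # sustained for five minutes
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm("orders"))
```
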

Jeremy: Right, yeah. And if it was only serverless, or I should say fully serverless, and did this automatically for us, that would be much better.

Anahit: Yes.

Jeremy: Well, so just like your blog post, this episode turned out to be quite lengthy. But I hope people got quite a bit of knowledge from this, and are not afraid of using Kinesis, because it's an amazing service. Yes, it has all of those caveats that we talked about, but it's still an amazing service. But if you've got a few more minutes, I'd love to just pick your brain for a second, because I think there are a lot of common misconceptions about building serverless applications. And again, whether Kinesis is serverless or not, we'll put that aside. But with all of these different services, even Lambda, having to build in the retries, and know about either bisecting or using the custom checkpoints or doing some of these other things, there's a lot that goes into it. So what are some of the ... and maybe just even from your own perspective, when you're building serverless applications or using fully managed services, what are just some of those misconceptions that maybe people have?

Anahit: Yes, I've noticed a few of them, actually, when working with serverless. And people usually have strong opinions about serverless, they go one way or the other. I think many people assume that it's either very easy, and you don't have to do anything, everything is done for you, or that it's way too complicated. And I think, again, Yan Cui had a nice blog post lately about the complexity of serverless, or the perceived complexity of serverless. And what he was saying is that serverless is not complex, it just reveals the underlying complexity of the systems that we used to build before. So all those things that were built in and hidden from everybody's eyes, but were still there. Now they are more obvious, with using all the different components, and you connect them to each other, and you have all that ecosystem living there, but ...

Jeremy: Which gives you more control over the individual components as well.

Anahit: ... it gives you more observability, it gives you more control, and all those nice things that are why we love serverless. So I'm all for it. But on the other hand, I think it's a simplistic view to think that fully managed and serverless means that you basically just deploy your code and have to worry about nothing. Because, as we have discussed several times already, yeah, you will probably get away with that on the "Hello, World!" level, it will be pretty much okay. But then when you get to the real world, and real-world scale, you actually do need to know in quite some detail how each and every service that you are using works, and how it fails. Because, once again, they will fail at some point, and you basically ... you need to know how they fail and what can happen, just to sleep at night.

Jeremy: Yeah. And I also think there's this idea that, again, you can set it and forget it for simple things, like you said, yes, but there's ongoing management, right? I mean, optimizations and shards, refactoring code, and with the shards thing, monitoring that and saying, "Hey, we're starting to creep up to this next level, or maybe we're not processing fast enough, or maybe our shard iterator keeps pushing over a certain amount of time during certain times of the day."

Anahit: I'm getting anxious, now.

Jeremy: All right. You want to go back and look at all those metrics, right?

Anahit: Exactly! That's exactly right, and maybe it will sound scary, we'll put it that way. But on the other hand, again, it reveals the complexity your systems have anyway. And the good news here, I think, is that in the case of AWS, there is a lot of commonality in how services work.

Jeremy: Yeah, true.

Anahit: And once again, I think understanding one service through and through will help you to understand all these issues with distributed systems, and errors, and built-in retries, and whatnot. So you don't really need to remember every single thing by heart, and it's not as overwhelming as we make it sound at the moment. It does require some work, but I think it's well worth it.

Jeremy: I totally agree. Well, Anahit, thank you so much for taking the time to talk with me and educate the masses about Kinesis. If people want to find out more about what you do or want to contact you, how do they do that?

Anahit: Well, first of all, they need to read the blog. It's long, but I hope it's worth it, and it has some nice pictures, so there are some benefits. Then they can reach me on LinkedIn, first name, last name. And Twitter, again, first name, last name. And yeah, I think that's about it.

Jeremy: Awesome. And then the blog is at solita.fi. And then you've got a really good talk that you gave, I think it was at AWS Community Day, maybe Stockholm. So, that as well.

Anahit: Oh my God, it's been over a year already. That was the last trip that I made before ... it's horrible.

Jeremy: Isn't that crazy? I know. It's been a year, it's been a year.

Anahit: It's been a year.

Jeremy: We just celebrated, or celebrated, I guess ... a year just passed since ServerlessDays Nashville, which was the last conference that I went to in person. So I am looking forward to doing that again and bumping into people and talking to people about this in the hallway, because those are the best conversations. So-

Anahit: For sure.

Jeremy: ... anyways, I will take all of this stuff, your Twitter, LinkedIn, blog, the two blog posts that you wrote about this, as well as that video talk from the community day in Stockholm. I will put all that into the show notes. Anahit, thank you again so much.

Anahit: Thank you so much, Jeremy. It was so much fun.

2021-03-15

Episode #91: Streaming Data at Scale Using Serverless with Anahit Pogosova (PART 1)

About Anahit Pogosova

Anahit is an AWS Community Builder and a Lead Cloud Software Engineer at Solita, one of Finland's largest digital transformation services companies. She has been working on full-stack and data solutions for more than a decade. Since getting into the world of serverless she has been generously sharing her expertise with the community through public speaking and blogging.

Twitter: @anahit_fi
LinkedIn: https://www.linkedin.com/in/anahit-pogosova/
Solita: https://www.solita.fi/en/
"Mastering AWS Kinesis Data Streams, part 1": https://dev.solita.fi/2020/05/28/kinesis-streams-part-1.html
"Mastering AWS Kinesis Data Streams, part 2": https://dev.solita.fi/2020/12/21/kinesis-streams-part-2.html
AWS Community Day Nordics 2020: https://youtu.be/gtE2o8qsq-4

Watch this episode on YouTube: https://youtu.be/U4snzWHMrtU

Thanks to our episode sponsor, Epsagon.

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Anahit Pogosova. Hi, Anahit, thanks for joining me.

Anahit: Hi, Jeremy. Thanks so much for having me.

Jeremy: So you are an AWS community builder and also a lead cloud software engineer at Solita. So I would love it if you could tell the listeners a little bit about your background, and what it is you do at Solita.

Anahit: Right. So yes, I have been working at Solita for a pretty long time. It's a digital transformation company. It originated in Finland some 25, 26 years ago, and out of those years, I have been on board for 11. Which sounds extraordinary nowadays, I suppose, because everybody gets surprised. But during those years, I've had several roles as a backend and full-stack developer. And then I moved to the cloud, to AWS, and started doing all the cool stuff with serverless. And I have also been working as a data engineer for several years with one of our customers, so a lot of different stuff.

And we actually have offices in six countries in Europe, of course, they are empty at the moment. And I'm based here in Finland. And yeah, we focus on software development and cloud integration services, analytic services, some consultancy, and service design. So if you're interested, we are hiring. And yeah, that's about Solita and me.

Jeremy: Well, any company that can retain someone for 11 years, sounds like a good place to work at.

Anahit: Right? I think so too. No, apparently, it sounds suspicious to many people. Why exactly?

Jeremy: I don't know. That's a conversation for another podcast, I think, about the job-hopping thing. But anyways, well, I'm glad that you're here. And thank you very much for taking the time to talk to me. I'm super, super excited about this topic, actually, because I came across this blog post that you wrote. Now, the first part of this, I think, was maybe almost a year ago now or something like that.

Anahit: Yeah, something like that.

Jeremy: But then you had a second part of it that came out in maybe November. And this was two posts, they were called "Mastering AWS Kinesis Data Streams." And now the cool thing about Kinesis is, it's a super powerful service. I think we learned from a recent outage at AWS that Kinesis, pretty much powers everything, every backend service at AWS is powered by Kinesis, which is pretty cool, but also scary at the same time. But, but it's a fascinating service. And I want to warn the listeners, because I want to get super technical with you. I want to get into some of these different details about how this service works, some of the limitations, some of the use cases for it and things like that.

And I would absolutely suggest that people read the two posts that you wrote, now they are very, very long, it took me a long time to get through them. But they are excellent, they're really well written. And it reads a lot easier than the documentation, and you give some good examples in there and some good reasoning behind it, which the documentation doesn't always do. So first of all, I want to start with why you wrote this post in the first place because there is a lot of documentation out there. But why did you write these two posts?

Anahit: Yeah, these two very long posts, as you said. So maybe to give some background, I've been working with Kinesis for a bit over three years now with one of my customers, which is the Finnish national broadcasting company called YLE. I always bring this example: you can think of it as the BBC of Finland, a highly respected company with a lot of content and a lot of viewers as well. So our team is responsible for streaming the user interaction data to the cloud. And at the moment, we have something over 0.6 terabytes of data per day. At the moment of writing the first blog, it was half a terabyte, so it's growing constantly.

And yeah, so we do that with Kinesis. And when I started, like three-plus years ago, I basically had no production experience with it, just like the "Hello, World!" kind of thing. And most of the things I learned, or most of the things that are in the blog posts, I actually learned the hard way, so by making the mistakes, and by seeing the failures, and that kind of thing. And I actually wish that a blog post, or two blog posts like that, would have existed back then when I started, because, as you said, there's a lot of documentation on AWS, of course, but, for example, in the case of Kinesis and Lambda, you have to read the Kinesis documentation, and then you have to read the Lambda documentation, and then you have to marry them together. And it's a lot of reading and not necessarily too clear.

So I wrote it, in a way, to myself three years ago, that kind of thing. And I hope it will help others not to make the same mistakes that I had to make myself. So maybe it will help somebody who has already started their Kinesis journey, or is just thinking about it. And the thing is that while I was writing those blog posts, or before that, working with Kinesis, I learned so much when I started to dig under the hood of how the service actually works. I learned so much about how AWS services work in general. So digging deep, or understanding deeply just one service, in my opinion, gives you a wider understanding of all the other services. So even if you're not that interested in using Kinesis, I would still recommend reading my blog posts.

And I actually point out some of the common issues, or things that are common to other services as well, and distributed systems in general, things like idempotency, and timeouts, and error handling, and that kind of stuff. And to tell the truth, I still use my own blog posts as a reference manual pretty often myself, because I have a horrible memory, especially when it comes to exact numbers. So it's nice to have one place where I go to look for stuff. And yeah, so to help myself and to help others is the short answer to your question.

Jeremy: Well, no, I think that's great that, first of all, that you did that to help others, but the fact that you did it to help yourself, that is not an uncommon thing. I know, for me, most of the blog posts that I wrote were just ways for me to make sure that I wrote something down, and it would actually live out there that I would be able to go back and reference myself, just like you said. Because I figured out things my own way, and then it's really helpful for me to go back and see how I did it, as opposed to try to find a needle in a haystack somewhere else. So yeah, so awesome. So again, I think that's amazing. And I'll say it again, I read those blog posts, and I learned so much about Kinesis that I thought I already knew, but just seeing it in that different way was really, really helpful to me.

Anahit: Oh, great to hear that, especially from you, because I assume you do know quite a bit about Kinesis already.

Jeremy: I know a little bit. Yeah, no, I've used it quite a bit, but I mean, just in terms of like failure modes and some of these other things and the different caveats you run into, which is something that the documentation doesn't capture as well as it needs to. And that's one thing I find about AWS documentation, but documentation in general is, it's very easy to get to that, "Hello, World!" phase, like you mentioned, but then to get over that hump and bring it into production, I mean, that's a whole other beast.

Anahit: Yeah. And maybe the simplicity of serverless and the managed services nowadays is also quite deceiving in that sense, because nobody reads the documentation from start to finish anymore. You just skim through it, and then it's like, "Okay, I will try it out and see how this works." And then you try it out.

Jeremy: And you can get going.

Anahit: Yeah, you get going. And it's like, "Okay, this thing is working. I know how it works." Yeah, you do, until something fails, because it will.

Jeremy: Exactly, exactly. All right, well, so let's start. Let's take a step back, because I know what Kinesis is, you know what Kinesis is, but I'm not sure everyone knows exactly what Kinesis is. So let's start there. Why don't you give a quick overview of what Kinesis is, and why you would use it?

Anahit: Yeah, so Kinesis is a massively scalable and fully managed service in AWS, which is meant for streaming data, huge amounts of data, really. And what they say is that it actually scales pretty much endlessly, not unlike Lambda functions. And it has a lot of service integrations, like other services can send events to Kinesis. For example, AWS IoT Core has that functionality, CloudWatch Events and Logs do. There are even some more exotic options: the Database Migration Service also has some sort of integration with Kinesis. So it's pretty common to use them in that combination.

And then, as you mentioned in the beginning, it's actually a pretty crucial service in AWS itself. And not everybody realizes that a lot of services use Kinesis under the hood. CloudWatch Events themselves use it under the hood, the Logs, IoT services use it, and even Kinesis Firehose uses Kinesis as its underlying service. And as far as I know, they're one of the biggest customers of the Kinesis team, so it's cross-pollination in that sense. And yeah, that outage last November actually showed that. I would say that many people didn't know about Kinesis before the outage, I suppose, or not too much, at least.

And yeah, so Cognito failed, CloudWatch failed. And then there was this chain of failures that they experienced for an entire day, because Kinesis didn't work the way it was supposed to work. So it's a pretty important service, whether or not you use it in your everyday life.

Jeremy: Right, right. Yeah. So in terms of what it actually does, you mentioned it's a data streaming service for high volumes of data. And AWS is famous for creating a bunch of services that do very similar things. We've got SQS, EventBridge exists now, SNS is a pub-sub type thing, which I guess you could think of Kinesis as as well. So, I guess, maybe why not use SQS or EventBridge or SNS? What specific reasons would you use Kinesis over those?

Anahit: Yeah, that's a really great question. And I think it's a question a lot of people struggle with, especially when they just start their AWS journey, or messaging service journey, or whatnot. Because there are so many services that look alike, and it's very difficult to distinguish which one of them you actually need to use, and how to actually choose between them. I have a feeling that those services have been converging lately. I think they are becoming even closer together than they used to be. For example, with the SQS and SNS FIFO support that they added recently. So those are more similar than they used to be, compared to back in the day when the SQS Lambda trigger wasn't there. So it used to be the SQS, SNS, Lambda pattern, and now you can do it directly. So that went in that direction as well.

And now, especially when they added, I think it was before re:Invent this year, or at re:Invent, I don't remember anymore, they added to SQS that batch window support, exactly the same actually as Kinesis has. So in that sense, they are exactly the same now: the same amount of time, or the same amount of records, that you can batch before sending them to a Lambda function, which is quite cool. So they are getting even closer. And the question is, what would you actually choose?
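The batch window idea can be sketched as a simple flush decision: a batch goes to the function when it hits a maximum record count, a maximum payload size, or a maximum wait time, whichever comes first. The limits below are illustrative placeholders, not the exact service quotas.

```python
# Sketch of batch-window behavior: flush when any limit is reached.
# All three limits here are illustrative assumptions.

def should_flush(record_count, payload_bytes, seconds_waited,
                 max_records=100, max_bytes=1_000_000, max_window_s=30):
    return (record_count >= max_records
            or payload_bytes >= max_bytes
            or seconds_waited >= max_window_s)

print(should_flush(100, 10_000, 1))  # True: record count reached
print(should_flush(3, 500, 30))      # True: batch window elapsed
print(should_flush(3, 500, 5))       # False: keep accumulating
```

The batch window is the third trigger: it puts an upper bound on how long a small trickle of records can wait before being delivered.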

And I think the truth of the matter is that in many cases, you can go with many of those services. It wouldn't necessarily be a wrong choice. But there probably is going to be one particular service that is better tuned for your particular use case. And in that case ... what it comes down to is that there are several questions you need to ask. For example, the throughput requirements: how much throughput are you going to handle? Is it like individual events every now and then? Or is it a stream of events and huge volumes of events? And then again, what's the size of the events? So, for example, SQS and SNS can't support too big payloads, it's like 256 kilobytes or something, and with Kinesis, it's one megabyte, so that kind of thing.

Then you should think about the data retention requirements, because some services can store data for a longer time and others can't. Ordering. How do you want to write data to the stream? Do you want to batch the records, do you want to write individual records, do you want to have direct integrations, or custom code that writes to the stream? And how do you want to consume the records? So do you want to do the pub-sub, as you said, or do you want to poll? Do you want to batch, once again? So there are several questions you can go through before deciding.

And actually, Kinesis in that sense stands apart from all the other services, because it's not even in the same part of the service list in the console. It's considered to be an analytics service, as opposed to an application integration service. So they make that distinction quite strongly. And basically, with Kinesis, as I said, you have virtually limitless scaling possibilities using the shards: you can have more shards, and you can scale more, depending on how much data you need to accommodate. And one record can be as much as one megabyte. So it's a huge chunk of data that you can pretty much send to any other service, for that matter.

Jeremy: And so, I want to talk about shards, but let me interrupt you for a second. The thing that is interesting, like you mentioned with SQS, is that SQS right now, with FIFO, the first-in-first-out, you can do ordered records. So that's one of the things that Kinesis has always done. I know that one of the big differences, I think, though, is that SQS can really only have one subscriber. Once you take the message off of that queue, it's gone. Whereas with Kinesis, you can actually have multiple subscribers. And as you said, with the data retention, you can go back in time, right? So I think that's another big thing. But it's funny you mentioned the analytics versus application integration, because I know way back in the beginning, Kinesis was a really great choice for application integration, and people were using it almost like EventBridge, essentially, to do eventing and stuff like that. But of course, you had to handle multiple subscribers, and it was sort of a pain.

Anahit: That's an interesting piece of information. I didn't even know about that, actually. Because now ... they are trying to make the difference bigger now, I think, between Kinesis and the other services. The analytics services stand separately, from the AWS point of view. But of course, it doesn't mean you can't use it, and actually, you can, pretty successfully, as a messaging service.

Jeremy: Right, yep.

Anahit: And, yes, you actually mentioned yourself that the big difference between SQS and Kinesis is that you can have multiple consumers for the same stream. But then again, SNS has that as well, and I think EventBridge as well?

Jeremy: Right.

Anahit: But for Kinesis, you can't actually even do the filtering that you can do with SNS and EventBridge, so it's ...

Jeremy: That's also true.

Anahit: You have to send all the events or the same events to ...

Jeremy: Can't they build one service that just does everything for me?

Anahit: Right? That's what I'm thinking. And it's actually pretty funny that you mentioned message retention, because they have announced, I think, once again before re:Invent, this extended message retention. Before, it used to be that you could keep your messages in Kinesis from 24 hours up to seven days if you needed to. And now you can have it up to one year, which makes a database out of it all of a sudden. And I think it will bring all sorts of new use cases with it, because if you can just put your data in Kinesis and then do whatever you want with it for a year, in many, many cases, you don't even need to deliver it to any destination after that, it's just fine like that. Of course, you have to pay extra for that, but that's a different conversation.

But yeah, that's a pretty big difference to pretty much any other of the messaging services, because you can't do that with them. And even with ordering, though SQS and SNS also have ordering, at least with SQS, the FIFO queues have lower throughput than the normal queues. So there is already this limit. And with Kinesis, you don't have that, because ordering comes pretty much out of the box.

I think the main difference for me personally is how they work with Lambda functions, because I think Lambda has wonderful support, and it's improving every year. And this year, they again added new possibilities, or new functionality, there. It has great support for handling Kinesis records, or batches, and errors, which is always an interesting aspect for me. So that's a big difference. But of course, the big pink elephant in the room here is the cost. That's what everybody is concerned about. And I have heard so many times that Kinesis is too expensive. And I think it's still a bit more of an enterprise product rather than a smaller company or startup thing, mainly because it doesn't have a free tier, that's my opinion. You just start to pay immediately from the get-go, while SQS at least has those free messages per month. And we had this interesting conversation with Yan Cui a while ago, who was talking about the sweet spot between SQS and Kinesis, that there is ...

Jeremy: Yes, I remember that.

Anahit: ... actually a point here after which Kinesis actually costs you less than SQS. If you have a big enough amount of incoming data, or your data is large enough in its volume, then SQS will start to cost you much, much more than Kinesis, not to speak of how difficult it will be to manage the consumption and all those things. Yeah, so here are a few differences for you to consider, but I think each service has its strong suit. And as you said, we don't have one service that has all of the features that we would like it to have. So every one of them is better suited for a particular use case, I'd say.
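The "sweet spot" argument can be made concrete with a back-of-the-envelope comparison: SQS charges per request, while Kinesis charges per shard-hour plus PUT payload units, so past some volume the flat shard cost wins. All prices below are illustrative placeholders, not current AWS list prices, and the model ignores egress and Lambda costs.

```python
# Back-of-the-envelope cost crossover between SQS and Kinesis.
# All prices are illustrative assumptions, NOT real AWS pricing.

SQS_PRICE_PER_MILLION = 0.40          # assumed $/1M requests
SHARD_PRICE_PER_HOUR = 0.015          # assumed $/shard-hour
PUT_PRICE_PER_MILLION_UNITS = 0.014   # assumed $/1M 25KB payload units

SECONDS_PER_MONTH = 3600 * 24 * 30

def monthly_cost_sqs(requests_per_second):
    requests = requests_per_second * SECONDS_PER_MONTH
    return requests / 1e6 * SQS_PRICE_PER_MILLION

def monthly_cost_kinesis(records_per_second, shards):
    put_units = records_per_second * SECONDS_PER_MONTH  # assume <=25KB records
    return (shards * 24 * 30 * SHARD_PRICE_PER_HOUR
            + put_units / 1e6 * PUT_PRICE_PER_MILLION_UNITS)

# At 1,000 small records/second (one shard's worth), per-request SQS
# pricing overtakes shard-based Kinesis pricing.
print(monthly_cost_sqs(1000) > monthly_cost_kinesis(1000, shards=1))  # True
```

At a trickle of traffic the relationship flips: the always-on shard costs more than a handful of SQS requests, which is the "no free tier" point made above.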

Jeremy: Right. So with Kinesis, another thing, again, that I think separates it very much from your SQS and your EventBridge, is that you do have to set up the shards. So you have to actually provision something in order to send data to it, so it's not just an endpoint where you send data and it'll accept as much as you want. So explain shards and then partitions, because we could go super deep on this. FIFO queues in SQS, for example, have a group ID, or a message group or whatever, that allows you to do sharding there as well, but without provisioning it. But let's keep the conversation focused on Kinesis here. So shards and partition keys, what are those all about?

Anahit: Yeah, so as you said, unlike SQS, Kinesis does need provisioning. And you can think of a shard as some sort of ordered queue within the stream. So your Kinesis stream is basically a combination of a set of these queues, and each queue comes with its own throughput limitations. So you can send 1,000 records, or one megabyte of data, per second to each shard, and then on the way out, you can get like two megabytes per second. So if you have more data, you basically need to add more shards to your stream, and that's the way your stream is going to scale. Of course, each shard is going to cost you, so that's why you have to consider how many shards you are actually adding to your stream.
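Those per-shard limits (1,000 records/s and 1 MB/s in, 2 MB/s out) translate directly into a sizing rule: take the ceiling of each ratio and let the tightest limit decide. A small sketch of that arithmetic, with the helper name as an illustrative assumption:

```python
import math

# Sizing sketch from the per-shard limits quoted above:
# 1,000 records/s in, 1 MB/s in, 2 MB/s out (shared-throughput consumers).

def shards_needed(records_per_s, mb_in_per_s, mb_out_per_s):
    return max(
        math.ceil(records_per_s / 1000),  # ingest record-count limit
        math.ceil(mb_in_per_s / 1),       # ingest bandwidth limit
        math.ceil(mb_out_per_s / 2),      # egress bandwidth limit
    )

print(shards_needed(4995, 3.0, 8.0))  # 5: record count is the bottleneck
print(shards_needed(500, 6.0, 4.0))   # 6: ingest bandwidth is the bottleneck
```

Note the egress term grows with each additional shared-throughput consumer reading the full stream, which is exactly when enhanced fan-out starts to pay off.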

And the way your data is spread across the shards in the stream is by using the partition key that you mentioned. It's basically just a string that you add to every single data payload that you send to your stream. You just add a separate string called the partition key. And what Kinesis does is it calculates a hash function of that string, and based on that hash, it decides which shard the record belongs to. So each shard is assigned a range of hash values which don't overlap. So basically, when you send your record to a stream, it ends up in exactly one shard in that stream. So that's the mechanism. It's a pretty simple mechanism, but it's pretty powerful as well. And yeah, the records, as I said, are ordered inside each of the shards. So you have this ordering out of the box on the shard level.
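The hashing mechanism described above can be sketched in a few lines: Kinesis takes the MD5 hash of the partition key as a 128-bit integer and routes the record to whichever shard owns that point of the hash range. The even split of the range among shards below is the default case; the function name is an illustrative assumption.

```python
import hashlib

# How a partition key picks a shard: MD5 of the key, interpreted as a
# 128-bit integer, mapped onto non-overlapping per-shard hash ranges.
# Here the hash space is split evenly among the shards.

HASH_SPACE = 2 ** 128

def shard_for_key(partition_key, num_shards):
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // HASH_SPACE  # index of the owning shard

# The same key always lands on the same shard, preserving per-key ordering,
# while many distinct keys spread across all shards.
assert shard_for_key("user-42", 5) == shard_for_key("user-42", 5)
print(sorted({shard_for_key(f"user-{i}", 5) for i in range(1000)}))  # [0, 1, 2, 3, 4]
```

This is also why the partition keys need to be "random enough": a single constant key hashes to one point, so every record lands on one shard.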

Jeremy: Right. And then the sharding itself, so if you have five shards set up, the algorithm will split that into ... then, of course, the partition keys have to be different enough, right?

Anahit: Yes.

Jeremy: So that it can actually split them, but you can't just send like one or whatever, just send like a single digit or something ...

Anahit: No, you can, but you probably shouldn't.

Jeremy: Right. And actually, you could probably control which shard it goes into by doing that. But then if you want to expand, so let's say that you're writing 4,995 records per second across five different shards, and you say, "Okay, now I need to add a sixth shard, or a seventh shard," or whatever, and keep adding shards, how does that rebalancing work?

Anahit: Yeah, so you can do it by doing so-called resharding, so you can add new shards. And there are actually two ways to add shards. You can split the existing shards, as far as I remember, and then you can add a separate shard. So when you split a shard, the partition keys are split between those two shards more or less equally, because the idea is that, as you said, you have to have, or it's better to have, a random partition key. In that case, your records will be distributed equally, or uniformly, across all the shards, instead of all of them going to the first shard, overwhelming it while the rest sit idle, not using the capacity they could have been using. So a random enough distribution of partition keys is very important.

But then if, for some reason, you have a shard which is overwhelmed, so it has more records coming in than the others, then you can split that particular shard and make it into two, and then Kinesis just takes care of spreading the records between those two based on the partition key.
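A shard split can be pictured as cutting the parent's hash-key range in two, with each child taking one half, so keys that used to map to the parent spread roughly 50/50 between the children. A minimal sketch of that range arithmetic, with the even midpoint cut as an illustrative choice (a real split lets you pick the cut point):

```python
# Sketch of a shard split: the parent's inclusive hash range [start, end]
# is cut into two child ranges with no gap and no overlap.

def split_shard(start, end):
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)

parent = (0, 2**128 - 1)          # the full 128-bit hash space
left, right = split_shard(*parent)
print(left[0] == 0 and right[1] == 2**128 - 1)  # True: full range preserved
print(left[1] + 1 == right[0])                  # True: no gap, no overlap
```

The reverse operation, merging two adjacent ranges back into one, is how you scale the stream down again.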

Jeremy: Right, right. And the fact that you have to do that manually, that you have to say, "Okay, this is a hot shard," or "My numbers are going up, I have to add something separately." The big question is: is this really serverless?

Anahit: Yeah, that's a big question indeed. And my blog post, I actually argued that it's not entirely. So ...

Jeremy: I have this ongoing thing with Chris Munns at AWS where I think it's serverless, and he thinks it's not but ...

Anahit: Okay, so I'm more on ...

Jeremy: So kind of serverless?

Anahit: ... his side. Yeah, but it's getting there. In my blog post, I actually compare it to DynamoDB in its early days, because DynamoDB also started out without auto scaling, without on-demand capacity. So you provision capacity, not unlike shards, and then you pay for what you provision, even if you don't use it at all. So it's pretty much the same. And then you had to use API calls to add capacity and remove capacity, and everybody was not too happy about it. But still, it was assumed to be a serverless service, right? DynamoDB ...

Jeremy: Right.

Anahit: ... always, from the get-go. So in that sense, yeah, kind of. But here as well, we have the same fully managed service, but again, we need to take care of the throughput ourselves. So there is no mechanism that would take into account the incoming and outgoing records and decide, "Okay, now I scale up." You can build it, and there is actually a blog post about Kinesis auto scaling, which uses like five other components to do that. So you can automate it, but it's still something not supported by the service itself. Though everybody's holding their breath for it to come any moment now. I was actually hoping it would come at re:Invent, but well, what can you do? I guess the outage was a bigger thing to concentrate on.

Jeremy: That was a bigger thing they had to deal with. Yeah, maybe they were pushing the auto scaling functionality and they broke it, but ...

Anahit: Actually, that's exactly what I thought when they broke it. I was like, "Yes, auto scaling is coming," but then it turned out there were some other issues with that.

Jeremy: Right, right. Yeah. So well, anyways, alright, so Kinesis, though, in terms of getting data into it, there are a number of different ways to send data into Kinesis. And another thing that's fascinating, I think, about the Kinesis service in general is Lambda. We think of Lambda, because it's function as a service, as a very serverless service, it sits perfectly in the serverless ecosystem. So whether or not Kinesis is 100% serverless, it is used in a lot of applications, right, applications that have nothing to do with serverless or anything like that. It is just a really good service that powers a lot of things, as we said. So, what are some of the different ways that you can get data into Kinesis? Because you mentioned batching and some of those other things.

Anahit: Yeah, sure. So, of course, it wouldn't be too useful if we couldn't get data into it, right?

Jeremy: Right.

Anahit: So ...

Jeremy: And quickly.

Anahit: Yeah, that as well. So there are actually many different ways to do that. For example, one useful way, if you are going to stream your data from outside the cloud to the cloud, is the Amazon Kinesis agent, which is a standalone application that you run on your server, for example, that can stream files to Kinesis. So if you want to stream your logs from your server to the cloud, that can be done with the Kinesis agent. So, that's one way.

Then, as I said, there are some direct integrations: some services can actually push events directly to Kinesis, like CloudWatch and stuff like that. One interesting service among those is, of course, API Gateway, because it does require some work, as it basically acts like a proxy over the API calls for Kinesis. So you need to do some VTL magic and stuff. And it comes with some throughput limitations, of course, as with API Gateway in general, but it's very useful in many cases.

And then there are tons of community-contributed tools and libraries that you can use, but I think the most common way to write data to the stream is actually using the Kinesis Producer Library, KPL for short. It's basically another level of abstraction above the API calls, and it gives you some extra functionality, but it also runs asynchronously in the background. So you need to have a C++ daemon running on your system all the time, and it will collect the records and send them to Kinesis asynchronously, which means that you might have some delay, or latencies, that come with it; it won't push them immediately. But the biggest issue with it is that it's actually only available in Java. So, a bit of a limited use case.

And then my favorite, because it gives you the most flexibility in how to write data and how to handle the errors and such, is the AWS SDK, which is basically the API calls. And luckily, there is a lot of SDK language support, so you don't have to be bound to just Java. Though I don't have anything against Java; I worked with it for like seven, eight years, I don't remember anymore.
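As a rough sketch of what writing to a stream through the SDK looks like, here is a Python example with boto3. The stream name, the `user_id` partition-key field, and the helper names are illustrative, not from the episode; the `PutRecords` call itself and its 500-record-per-call limit are from the Kinesis API.

```python
import json

def make_entries(events, key_field="user_id"):
    """Build PutRecords entries; the partition key determines the target shard."""
    return [
        {"Data": json.dumps(e).encode("utf-8"),
         "PartitionKey": str(e[key_field])}
        for e in events
    ]

def chunked(entries, size=500):
    """PutRecords accepts at most 500 records per call."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

def put_events(stream_name, events):
    """Send events to Kinesis (requires AWS credentials; not executed here)."""
    import boto3
    client = boto3.client("kinesis")
    for batch in chunked(make_entries(events)):
        resp = client.put_records(StreamName=stream_name, Records=batch)
        # PutRecords is not all-or-nothing: retry only the failed records.
        if resp["FailedRecordCount"]:
            failed = [r for r, out in zip(batch, resp["Records"])
                      if "ErrorCode" in out]
            client.put_records(StreamName=stream_name, Records=failed)
```

The per-record retry is the flexibility Anahit is referring to: the SDK surfaces partial failures directly, where the KPL handles them in its background daemon.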

Jeremy: Yeah, I'm not a big fan of Java anymore. But I write a lot of Lambda functions, so every time you ... I've never been able to get them to boot up quickly. The cold start has always been horrible with Java. So I've stuck to mostly Node and Python, just to ...

Anahit: Same.

Jeremy: ... keep things simple. But alright, so you mentioned the Kinesis Producer Library, which I actually remember way back in the day; we had like a Ruby ETL thing and we were using the Kinesis Client Library, and it was a mess, there were just a lot of things that had to happen. So it's easier if you can just have a nice simple SDK, or even better, have the service just natively push into Kinesis for you, and then have another native service consume off of that, which is super easy. But so there are a lot of use cases with Kinesis. I think people can probably use their imagination for high-throughput streaming data: click tracking, ad-network-type stuff, sensor data, you mentioned IoT integrations and some of those things. So I think that makes a lot of sense. But what are some of the less common use cases? I know you have some ideas around how you can use the system to your benefit when it's not super-fast, high-volume streaming data.

Anahit: Not really with my customer, no. We do use it mainly for big data and streaming all that user interaction data, the classical way of using it. So, that's what we do. And then there is actually one more use case nowadays: you can use it with DynamoDB to capture change events. They added that just recently. So that's another cool thing, because I think DynamoDB Streams had some extra limitations that Kinesis doesn't.

Jeremy: Yep.

Anahit: But yeah, actually, as I mentioned in the beginning, the difference between the integration services and the analytics services doesn't necessarily exist; it's more in our heads. We don't really have to use Kinesis with a huge amount of data. And one use case that I personally found extremely useful is when you, for example, have a Lambda function that needs to consume events from some stream or queue, and you want to invoke exactly one Lambda function at all times. So you want to process the events in order, or basically sequentially, not in parallel.

So with SQS, what you can do is use Lambda reserved concurrency for that purpose. You can say, "Okay, I only allow one execution of this particular Lambda at all times." But it will mean that all the others will be throttled; then you have to take care of the SQS visibility timeout and make sure that the retry attempts are big enough so your valid messages don't end up in a dead-letter queue, and all that kind of extra worrying, I would even say, that is not necessary.

And what I found very useful is that with Kinesis, the way Lambda works with Kinesis is that it gives you one concurrent Lambda execution per shard. So, if you have a Kinesis stream, which is attached to a Lambda function, there is going to be as many concurrent Lambda executions at any given time as you have shards. So each Lambda will be reading from each dedicated shard.

Jeremy: Right.

Anahit: So basically, if your throughput requirements are okay with it, and you can have a stream with just one shard where you push all your events, and then you have a Lambda consuming from it, then out of the box you get a situation where just one Lambda function is reading from the stream at all times. You don't have concurrent executions, you don't have to worry about all the throttling and such, and you get all the nice error-handling functionality that Kinesis comes with out of the box. So I actually love it for that use case, and it won't cost you millions; it will probably cost you like a couple hundred dollars per year. And I think it's pretty well worth it if you think of all the management costs that you are avoiding that way.
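The single-shard pattern Anahit describes can be sketched as a plain Python Lambda handler. Kinesis records arrive base64-encoded inside the event; with one shard there is exactly one concurrent execution, so records are processed in order. The `handle_in_order` helper is a hypothetical placeholder for the sequential business logic.

```python
import base64
import json

def handler(event, context):
    """Lambda handler attached to a single-shard Kinesis stream.

    One shard means one concurrent execution, so records in each
    batch (and across batches) are processed strictly in order.
    """
    processed = 0
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        handle_in_order(payload)
        processed += 1
    # Raising an exception here would make Lambda retry the batch,
    # which is the built-in error handling Anahit mentions.
    return {"processed": processed}

def handle_in_order(payload):
    # Placeholder for the sequential work, e.g. issuing shipping
    # labels one after another.
    pass
```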

Jeremy: Right. Yeah. And actually, with SQS, reading off of the queue, the Lambda trigger for that, I believe you need to set a minimum of five concurrent ...

Anahit: Yep, yep.

Jeremy: ... for that, because it works that way. But I think, and again, I could be wrong about this, because how can you possibly know all the services in AWS? But I believe if you use SQS FIFO queues with a single message group ID, that will also only invoke one Lambda function. I'm not 100% sure of that. But either way, no matter which service you use to do it, that is a really cool use case. Because I can think of some cool things like, I don't know, maybe you were doing billing, or shipping labels, something where you needed them to be one after the other, they needed to be sequential. There could definitely be some cool use cases for that type of stuff.

Anahit: Yeah, we have found it very useful in one of our use cases. And the fun part was that I was struggling with SQS, like, "How do I do this properly? And I don't like this," and like ... then it was like, "Okay, I have been talking about Kinesis for like two years now to everybody around, so why didn't I think about it in the first place?" But yeah, it's a fun way to do that.

Jeremy: Right. All right. So let's move on to consuming data off of the stream. So there are a bunch of different ways to do this. I mentioned the Kinesis Client Library, which I think is also Java-based, and you need to run it yourself. But anyways, the easiest way to consume data off of a Kinesis stream, and you've mentioned this, I think most people would agree, would just be to use Lambda, because it is a really, really cool integration. So what's the Lambda-Kinesis story?

Anahit: Yeah, so I've mentioned it several times because it's really my favorite way of consuming data from Kinesis. You don't need all that extra headache of keeping track of where exactly you are in each shard and each stream at every given moment of your life. So it's very nice, and I think it takes care of a lot of the heavy lifting of reading from the stream on your behalf.

Jeremy: Right.

Anahit: And well, as I said, error handling is one big thing that Lambda makes much easier for you with a Kinesis stream. Then batching: Lambda can read batches of records from the stream, up to 10,000 records in a single batch. And then keeping track of where exactly you are in the stream, that's probably the most important, because otherwise you have to have some external way to do it. For example, the Kinesis Client Library uses a DynamoDB table for that, which it actually spins up behind the scenes without you even knowing about it.
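The batching Anahit mentions is configured on the Lambda event source mapping. A hedged boto3 sketch, where the function name and stream ARN are placeholders; the parameter names match the Lambda `CreateEventSourceMapping` API, and 10,000 is Lambda's maximum batch size for Kinesis.

```python
def mapping_params(function_name, stream_arn, batch_size=10000):
    """Parameters for a Lambda event source mapping on a Kinesis stream."""
    assert 1 <= batch_size <= 10000  # Lambda's limit for Kinesis sources
    return {
        "FunctionName": function_name,
        "EventSourceArn": stream_arn,
        "BatchSize": batch_size,
        "StartingPosition": "LATEST",
        # Wait up to 5 s to fill a batch before invoking the function.
        "MaximumBatchingWindowInSeconds": 5,
    }

def create_mapping(params):
    """Create the mapping (requires AWS credentials; not executed here)."""
    import boto3
    return boto3.client("lambda").create_event_source_mapping(**params)
```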

Jeremy: And it's provisioned, too.

Anahit: And it's provisioned, and ...

Jeremy: Not on demand.

Anahit: ... and it's pretty low. And then one day somebody from your team comes knocking on the door saying, "Hey, I'm getting these weird DynamoDB provisioned throughput exceeded errors. We don't have a DynamoDB table." Hmm, where does that one come from? So yeah, it's much easier. But then, of course, the other services that we actually want to mention here are Kinesis Firehose and Kinesis Analytics, because those two are the other services in the Kinesis family. And they both have a very nice integration with Kinesis streams, so they both can be attached to a Kinesis stream as a stream consumer. In the case of Kinesis Analytics, it can actually be a stream producer as well.

So, Firehose is a service that is used for streaming data to a destination. If Kinesis streams is just for streaming the data, and then you have to consume it somehow, the entire purpose of Firehose is to deliver the data to a destination. So you can connect the two, to stream the data and then deliver it to the destination, and the destination can be S3, Redshift, Elasticsearch. And I think the coolest recent one is the HTTP endpoint, so basically, you can deliver it anywhere you want. And then Firehose also has some pretty neat features like batching, transforming the data, and converting the format.

Jeremy: Yeah, transforming.

Anahit: Converting from like, for example, JSON to Parquet, that's what we use a lot, in our case, compressing the data ...

Jeremy: And then you can query it from Athena, for example.

Anahit: Yep, from Athena, Redshift Spectrum, and all those things. Yes, so it's very, very useful. And you can connect it directly to Kinesis streams, and it is truly serverless because you don't have to provision it. So it scales.
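The data transformation they mention is done by a Lambda function that Firehose invokes on each batch. The function follows a fixed contract: it receives base64-encoded records and must return each one with its `recordId`, a `result` status, and re-encoded data. A minimal Python sketch, where the added `processed` field is just an example transformation:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda.

    Returns every input record with the same recordId, a result of
    "Ok" (or "Dropped"/"ProcessingFailed"), and base64-encoded data.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # example transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Newline-delimit so the records land query-friendly in S3.
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("ascii"),
        })
    return {"records": output}
```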

Jeremy: But there are limits, though, right? If you're choosing between Kinesis Data Streams and Kinesis Data Firehose, there's an upper limit to Firehose, right?

Anahit: Yes, that's true. I don't remember the exact limits. But the scary thing about Firehose, several years ago, was that there was no mention anywhere that any of the operations could fail, that writing to the stream, or to Firehose, could actually fail. Kinesis had all these metrics for exceeding the throughput, for example. So you see a metric that says something bad happened, so you know that something bad can happen. With Firehose, they didn't even have a metric that would tell you that. What they had was documentation that said it's an endlessly scaling service or something like that. And then there is the fine print with, "But if you use it with this, and this, and this ..."

But as far as I know, if you're using it with Kinesis streams, it actually adjusts to the throughput of the Kinesis stream. So those limits don't apply anymore. So there's ...

Jeremy: Ah, interesting.

Anahit: ... yeah, there's this separation.

Jeremy: Interesting.

Anahit: Yeah. And then the other service from the Kinesis family is Kinesis Data Analytics, which is one of my favorite ones, really, because it seems small, but you can do a lot of neat things with it. What you can do is analyze your streaming data in near real time. You can basically write SQL queries with Kinesis Data Analytics, and it will perform joins and different filters and aggregates over, for example, a time-based window. And then it can send the results of those aggregates to either another stream, or another Firehose, or actually to Lambda, so you can do whatever you want with them. So there are a lot of cool use cases that come with Kinesis Analytics. And they both integrate very nicely with Kinesis streams. But the "gotcha" moment here, which apparently not many people realize, is that both Firehose and Kinesis Analytics act as a normal consumer for the stream.
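As a rough illustration of what a time-windowed aggregate in Kinesis Data Analytics computes, here is the same idea in plain Python: a tumbling (fixed, non-overlapping) window count per key. The event shape and window size are illustrative, not from the episode.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window, key), like a tumbling-window SQL aggregate.

    `events` is an iterable of (epoch_seconds, key) pairs; each event
    falls into exactly one fixed window of `window_seconds`.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)
```

In the managed service, results like these would be emitted continuously to another stream, a Firehose, or a Lambda, as Anahit describes.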

So I mentioned that there is this throughput limit for each shard, right? There is only one megabyte per second that you can write, and two megabytes per second that you can read. In practice, this means that you can have two consumers reading from each shard at the same time. And Kinesis Analytics and Firehose are both considered consumers. So if you have a Kinesis Analytics application and a Firehose attached to the same stream, and then you want to add a Lambda function, for example, you might end up exceeding that throughput. So you have to be careful about that, so yeah.

Jeremy: Yeah. But so then with that, though, so again, that makes sense. You can have two consumers, but they added something called enhanced fan-out. So how does that come into play?

Anahit: Right. So, enhanced fan-out is funny in the sense that it's very difficult to understand what actually happens by reading the documentation. I think that part took me the longest time to figure out, because I personally don't use it at work, so it was mostly a research project for me, trying to figure out what is happening there, like all the combinations of enhanced fan-out and how it works with other features. But what it basically is, is that instead of sharing this two-megabyte outgoing throughput with all the other consumers, you get kind of your own dedicated highway with the stream, your own two megabytes per second of throughput. And you can have up to 20 consumers at the moment, I think, and each of them will get its own two megabytes. So you can basically consume a lot of data with that.

And the nice part is that the latency here is also much lower than with the standard throughput. I think they claim it's 70 milliseconds of latency versus a minimum of 200 milliseconds for the standard throughput, which is a big, big deal. And it actually stays the same, in contrast with the standard throughput, where it goes up with each added consumer. So it's a really nice feature. And part of how they achieve it is by using an HTTP/2 persistent connection instead of HTTP. And instead of the consumer polling for records, as it does with the standard throughput, Kinesis actually pushes the records through that persistent connection to the consumer. So in that way, we avoid all the limitations that come with polling the records from the stream, the GetRecords API limits and that kind of thing. So that removes all the headache, but you have to pay for it.
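The throughput trade-off they're discussing can be summarized in a couple of lines of Python. The registration call mirrors the Kinesis `RegisterStreamConsumer` API; the stream ARN and consumer name are placeholders.

```python
def per_consumer_read_mbps(consumers, enhanced_fanout=False):
    """Per-consumer read throughput for one shard, in MB/s.

    Shared (polling) mode splits the shard's 2 MB/s read limit across
    all consumers; enhanced fan-out gives each registered consumer its
    own dedicated 2 MB/s, pushed over HTTP/2.
    """
    return 2.0 if enhanced_fanout else 2.0 / consumers

def register_fanout_consumer(stream_arn, consumer_name):
    """Register an enhanced fan-out consumer (requires AWS credentials;
    not executed here)."""
    import boto3
    resp = boto3.client("kinesis").register_stream_consumer(
        StreamARN=stream_arn, ConsumerName=consumer_name
    )
    return resp["Consumer"]["ConsumerARN"]
```

So with three consumers in shared mode, each effectively gets about 0.67 MB/s per shard, which is why Anahit says up to three consumers is usually fine but more than that starts to pinch.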

Jeremy: Of course, of course.

Anahit: So, that's the problem. And the thing is that it sounds very cool, and you might think, like, "Why wouldn't you use it all the time?" Well, you have to pay for it. The truth of the matter is, in most cases, you don't need it. So if you have just up to three consumers for your stream, you're probably going to be just fine with a normal shared throughput model.

2021-03-08

Episode #90: Full-Stack Observability with the New Relic Explorer with Buddy Brewer

About Buddy Brewer

Buddy Brewer is the Field CTO for New Relic in the Americas. In this role, he helps customers get long-term value out of New Relic. Buddy has over 20 years of experience leading engineering and product management teams building tools to help developers and operations professionals deliver better digital experiences. A former entrepreneur in the observability space, Buddy has helped companies across every geography and industry in the world improve their software's speed, quality, and user experience.


LinkedIn: https://www.linkedin.com/in/bbrewer/
Twitter: @bbrewer
Personal Website: BuddyBrewer.com
New Relic Free Tier: https://newrelic.com/signup
New Relic Explorer: https://newrelic.com/platform/full-stack-observability

Watch this video on YouTube: https://youtu.be/Y4n3fE8g9Ec

This episode is sponsored by New Relic.

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Buddy Brewer. Hey Buddy, thanks for joining me.

Buddy: Hey Jeremy. Thanks for having me.

Jeremy: You are a Field CTO at New Relic so I'd love it if you could tell the listeners a little bit about yourself and what's new with New Relic.

Buddy: Yeah. Been with New Relic for a couple years now and in this Field CTO role I get to spend lots of time with our customers to help them get long-term value out of our observability platform. I'm an engineer by trade. Started my career as a software developer like many of our customers in New Relic. Spent substantially all of my career in product development in various capacities. Engineering, leading engineering teams, product management. And like I said, now I spend most of my time with customers helping them tackle their own observability challenges in their businesses. We're doing a lot right now with New Relic to help people make sense out of the volume of data and to help people pull all of the different types of metrics, events, logs, and traces that go into all this observability into views that they can actually use to help their customers get better experiences in a world where software architectures are just ... They're just becoming more complex by the month.

Jeremy: Right. Well, awesome. First of all, I want to thank New Relic for sponsoring this episode and for the amazing amount of support that they give to us here at Serverless Chats and what we do. So thank you very much for that. Now, you mentioned these tools that you're working on to be able to observe modern applications. And the new tool that was recently launched is the New Relic Explorer. I've looked at this thing. This is absolutely fascinating. It does all kinds of really great things. But I'd love it if you could tell the listeners a little bit more about that product.

Buddy: Yeah. It's part of our full stack observability product in the New Relic One platform. So it's an in-place upgrade that everyone who uses full stack observability today gets. And what it does is it takes all of the information across all of the different dimensions that people are used to seeing in New Relic One, it pulls them together into new views that help people make sense at a macro level of what's going on in the health of their software across all of the dimensions that matter today. So infrastructure, front end, the application logic. All of that stuff in single views. And there's another part of New Relic Explorer that helps people understand in realtime what the key changes are that are happening in a way that requires zero configuration, which is really important to our customers today because the software architectures and the underlying containers and everything that serve those are changing so fast that people just don't have time to manually configure things today like they used to be able to.

Jeremy: Yeah, right. And one of the things too with cloud infrastructures, you've got all this telemetry data coming in from all these different places and most of the time ... I mean, I know at least what I had been doing is using a bunch of different dashboards and basically jumping between different things trying to figure out what's healthy, what's not healthy. And I love these new views that are in the New Relic Explorer because it actually shows you the changing ... If a problem is getting worse and worse and worse, it gives you this growing bubble. So these visualizations are really, really helpful. So I think that's Lookout right? That does that?

Buddy: That's right. Yeah, that's Lookout. The way that I think of Lookout is imagine if you could take something like the Unix diff command and apply it to all of your telemetry data comparing now versus any point in the past. Whereas the Unix diff command is a text console rendering, what Lookout does is it renders all of this in a visual display in a web browser so that you can see ... Like you said, you had these bubbles that really display two dimensions at the same time. The volume of data, whatever it is that you're looking at for a piece of data. A lot of people use this to visualize changes in errors or throughput or latency but it could also be order volume or really any metric that you want. That's the first dimension. And then the second dimension is the magnitude of changes. Right?

Jeremy: Right.

Buddy: What it helps you do is to zero-in, not just on the things that are red ... Because in environments where folks have thousands, or even tens of thousands for some of our enterprise customers, containers running on any given day, the nature of that design and the fault tolerance inherent in that architecture ensures that on any given day there's going to be stuff that's red. Right?

Jeremy: Right.

Buddy: So if a customer calls in about a problem, you log in, you see some things that are red. Well, some of that stuff was red yesterday. What Lookout helps you do is to focus specifically on those things that changed from healthy to not healthy around the same time as a customer-impacting problem. And then you can see all of the different pieces that also correlate to those changes so you could pull it all out and focus just on the things that matter.

Jeremy: Yeah. That's super helpful because, again, it's one of those things where ... I mean, I've worked as an SRE in the past and you get these constant errors sometimes that keep coming up and they're just kind of there. But sometimes it's the severity of the errors. It's how bad was it yesterday versus how bad is it today? Of course, we wouldn't leave a problem that long. But seeing those changes over time and seeing that growing bit of it, I think is just incredibly helpful from that sort of global view standpoint.

And then the other thing that's part of this, which I think is another really cool representation, is the Navigator piece. And this basically uses a red, yellow, and green sort of ... What is it? A hexagonal or an octagon or something like that. But basically shows these little blocks that show you what's healthy and what's not healthy and then you can dive down into each one of those to see more detail.

Buddy: That's right. And what we did was we designed that view to pack an order of magnitude more information density into a screen compared to the view that we had prior to this. Those views continue to be part of the product, but again, this New Relic Explorer is an in-place upgrade that everyone gets that you can use in addition to all the things that you already have with New Relic. But the first piece is that order of magnitude more information density on a screen. The other thing that it does is it summarizes all of this in a way that you can use it as ... Think of it like the new mission control for New Relic. So for your game day dashboard on the major event that you knew was coming and you wanted to be 100% situationally aware about everything going on in your software when that happens. Whether it's a major advertising campaign that you expect to bring a lot of traffic to your site, or it's a big calendar event, or a major event in the news if you're in media or something like that. That mission control allows you to see everything in a single view.

And then we have some flexibility where you can customize that view or create multiples of them that align to specific teams. We call those workloads. So you can take the different workloads that are running in your architecture, all of their constituent pieces, from the front end components, the back end components, the infrastructure, all of it, and visualize them in a single view that aligns to the specific teams that work on them. So everyone can get their own tailored mission control. And then another piece of all of this ... Because everything that I've talked about so far has been this extremely high altitude look at what's going on in the software, which you need. But as soon as you notice something that requires you to take action, you very quickly need to switch to a lower altitude. So one of the things that we built into New Relic Navigator is this concept of related entities.

An entity for us is any component in your architecture that makes your software go. Whether it's a Docker container or some other piece of infrastructure, or it's a web application that's built in JavaScript, or it's back end logic written in Node or Go or PHP or anything. All of those individual discrete components, we call those entities. They all have a health condition, an alert status. They can have events, logs, and traces associated with all of them. But one of the things that is a really critical piece of data that we have at New Relic that this release helps to expose is the relationships between all of those. So it's not just a simple linear relationship or even a tree structure. It's a connected graph of all of these different components. So if one thing is red, the thing that you click on might not be the root cause. And the impact of it being red might not be limited to just the pieces that connect to that piece. There could be a number of other services that depend on it that are impacted upstream or downstream.

So every time you click on something we show you all of the upstream and downstream relationships so you can follow it. What you used to have to do is say I'm going to go see what's happening in the application tier, and then you sort out what you can sort out and then you had to pop out of that and then go into a different view and look at all of your front end and just stitch this together in your head. In the worst case, developers responding to problems had to load up five or six different tabs in their web browser and click back and forth between all of these different things. The related entities, New Relic Explorer, the hexagons in Navigator, all of that stuff is designed to help people with that problem so that you can go straight to what the root cause is by just navigating the shortest path through that graph instead of having to pop out and start your search over again.

Jeremy: Right. Yeah. And I wish I only had to open four or five tabs. I mean, you see it's a lot more than that. And you're searching through logs and trying to find that. Now, the other thing that's really cool ... And there are a lot of distributed tracing products out there now and it's very, very cool, where you can go and see how data is moving through different components in your applications, which is really, really helpful. But what's crazy, I think, about this visualization in Navigator and Lookout and everything that New Relic has done, it gives you the ability to connect, like you said, through that graph multiple services that might be sharing things or different data coming from different places. And all of that stuff is instrumented pretty much automatically. I mean, depending on which service you're using. But all of that stuff ... It's not like you have to go in and instrument all of these little tiny bits. This data's just being collected, these traces are being done. And then this really cool service just visualizes all of it for you.

Buddy: That's right. Full stack observability, which is where this new functionality lives. Like I mentioned, this isn't a new product, it's an enhancement to an existing one. Full stack observability exists on top of another part of our platform which we call the Telemetry Data Platform. We've been building that for so many years. We only in last July exposed it as its own product. Priced really simply just on ingest. And we have a free tier by the way that anybody can sign up for and you can ingest 100 gigabytes per month for free with no charge. And one of the great things about the Telemetry Data Platform is it's a high volume, scalable place to put all of your telemetry data, agnostic to whether it's in a metric or an event or a log or a trace. You can just put it all into this single platform. Which is the first step that you have to do if you ever want to have a shot at tearing down all these silos between all the different pieces of information.

So take traces for an example. Having the Telemetry Data platform enables us to do things like show logs in context. Because the logs are in the same data store as all of the trace data. So if you click on a trace, and even if you click on a span inside of the trace, then if you generated any log events that happened just during the context of that span of that trace, we can display it inline. What you used to have to do is you had to go into a different tab in your web browser and start over again and maybe try to use some kind of a trace ID or span ID or something like that that you hope was also indexed in your logging tool and that you could go find it there. Having all of it inside of one data store means that if you're looking at something that's a particular type of data like a trace, you can see other types of related data like logs in the same context and in the same view.

Jeremy: Right. Yeah. And I think an important question would be the simplification of this. Everybody wants things to be simpler and have these really simple views and good ways to represent and visualize their data. But one of the things is that if you were an SRE or you were an ops person in the past, you probably were familiar with all these different tools that you were using and you knew exactly what you needed to see and how different things ran and stuff like that. But it's not just ops people or SREs or people who are just always worried about the infrastructure that are impacted now by a lot of these changes because I think we've made a big shift in the way that we develop applications and who's responsible for the lifecycle of those applications. I'd love to talk about that evolving role. I guess we would call them modern developers maybe? That modern developers building for the cloud and building these complex systems. What kinds of responsibilities do you think have shifted to them?

Buddy: It's interesting how the nature of the role of software development has changed. I started my career as a 100% front end developer. And specifically building tools for front end developers to reason about the health of the front end experience. And then you had back end developers. And you could meet someone at these networking events and talk about which parts of the stack you work on. But those lines are fading. There was a report that came out last year. I think it was UBS. It compared how many developers identify as different types of developers: front end engineer, back end engineer, et cetera. And the thing that was remarkable about it is that specifically, 55% of the respondents identified as full stack engineers. So more than half.

Now, in 2015, five years ago, it was only 29%. So it's the majority and also the fastest growing cohort of engineering role. And that's how you end up with the situation we were talking about before where you've got so many tabs open in your browser is because all of these tools have been built for specific slices of the application architecture. Logging tools, front end analysis tools, back end analysis tools, infrastructure analysis tools. All of that. Full stack observability, the product that we offer with New Relic aims at being a full stack analysis tool. So again, in that single tab you can see the relationships between all of these different tiers. And we did that specifically in response to what we saw as this broader trend, both from the analysts but also talking to our own customers and realizing ... New Relic's been in this business now for 13 years. Started in 2008. So we've seen a lot of this evolution firsthand among our customers and they were asking us for this. They wanted simpler views that connected all of these different pieces together because increasingly what happens, somebody gets a notification that there's a problem that they have to solve and it's not just in a slice of the architecture.

They're on a team that is designed to do everything that it takes to deliver a particular part of the customer experience. So if something goes wrong with that experience, whether it's in the infrastructure tier or the application tier or the front end tier, they're accountable for finding it and fixing it. So we're building tools to help people do that better.

Jeremy: Yeah. And I think that's interesting in terms of that evolution where even when ... Let's say AWS started with EC2 and things like that, the virtual machines, back in 2006, 2007, somewhere around there. You started building applications that way and I think you had very traditional ops people setting up the networking for people and setting up an EC2 instance, and I don't think a lot of people were doing CI/CD, at least not like they are now. So you'd take that code that a developer would write and someone would set up that instance or that environment for you to dump the code into. And then as we moved towards things like containers, developers are now responsible for packaging their own containers and requiring the resources they need or the packages they need, things like that. And then moving even further down the line to serverless where in most cases you don't even have an ops person involved, right?

Buddy: Yeah.

Jeremy: I mean, there's nothing for them to set up sometimes. So that change in how we're building applications ... Do you think that that change of how developers are getting closer and closer to the infrastructure, that that's sort of one of those things that prompts a need for this full stack observability?

Buddy: Oh yeah. Yeah, for sure. And it's affecting everyone. It has crossed the chasm. This is not just cloud native startups that are adopting this. Substantially every large enterprise that I talk to, and I usually speak to multiple per week, is somewhere along this journey of cloud migration. And that includes shifting workloads from data centers, or decomposing traditional monoliths into microservices, moving from data centers into cloud, orchestrating all of this with containers on Kubernetes. And increasingly across the board, cloud native and traditional enterprise, like you said, moving to serverless because there are certain economies and efficiencies that you get out of being able to take advantage of that layer of abstraction. Companies of all sizes and across all industries ... Not just gaming and super high tech and media and commerce but also financial services and travel. Just everybody is moving toward this and adopting it, and they're looking for tools that can help them reason about the connections between all of these different pieces that they're now responsible for.

There's another component to this that we have been and continue to work hard on at New Relic, which was a point that you touched on earlier. This notion of when you have so many components that you have to manage and all of these things are changing so rapidly, you don't have time to undertake these expensive manual tasks to create all of this instrumentation. So we've been progressively opening up our platform for over a year now to accept other types of third-party data, not just our own agent technology that we've been building since 2008, but things like Prometheus and OpenTelemetry. You can just point exporters at our endpoints and make it easier to get that data on board. Taking all of our instrumentation logic and making it easy to wrap that in automatable frameworks like Terraform so you could actually build observability in as code and deploy it at scale.
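To make the "point exporters at our endpoints" idea concrete, a Prometheus `remote_write` stanza is the usual shape this takes. This is a hedged sketch: the endpoint URL, query parameter, and license-key placeholder below are illustrative assumptions, not confirmed values from the episode.

```yaml
# prometheus.yml (fragment) -- illustrative sketch only.
# The URL and credential mechanism here are assumptions; check the
# vendor's current docs for the real remote-write endpoint.
remote_write:
  - url: "https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=my-cluster"
    authorization:
      credentials: "<YOUR_LICENSE_KEY>"   # placeholder, not a real key
```

With a fragment like this checked into version control alongside the rest of the infrastructure code, the instrumentation itself becomes something you can review, template, and deploy at scale, which is the "observability as code" point being made above.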

We've been talking about New Relic Explorer, which is our release that we're talking about today, and that's all about the visualization. But it's enabled by a tremendous amount of work that we've done, and that continues to be underway, to simplify the instrumentation. Because as the architectures themselves become more complex, it obviously gets harder and harder to keep up. Frankly, a lot of our customers have issues where they can't instrument fast enough to keep up with the change that's happening in their infrastructure. So as a result they have all of these dark areas of their application that are critical to delivering the experiences to their customers, but they don't have observability into what's going on inside of them. So we've been doing a lot of work to give people tools to get leverage on that problem too so that they can add instrumentation to all the pieces that matter.

Jeremy: Right. And I know that New Relic has done a ton of work on instrumenting Lambda functions for example. Like being able to instrument these things where you can't necessarily run the agents. And I know there's been a lot of really cool innovations in the serverless space around some of that stuff. But I'm curious just from a developer perspective, and maybe you have some experience here of seeing some of your customers do this, how much is your average developer who's maybe building a cloud application working on a team, how much is that developer actually going in and using these observability tools to see what's going on? Is that something where they need to be heavily involved in that or are you still seeing a good separation between the ops team in that regard?

Buddy: It's evolving. We're seeing it change in a couple of dimensions. Developers use observability data and monitoring tools far more than they did years ago. Although, there's always been a segment who needed to do that. One of the things New Relic is known for as a company is being the monitoring platform that is the most developer-friendly. Our CEO, Lew, is a programmer at heart who still writes code on the weekends even as the CEO of a public company. It's part of our DNA. And I think we've always had a natural affinity through that to the types of adopters who are developers, who both write the code and deploy the code. What we're seeing as our company has grown is that the cohort of developers who are responsible for both writing the code and deploying and managing it is exploding in size and scale and in sheer numbers. So we've just been riding that wave, if you will, of developers who continue to be responsible for how all of that stuff actually materializes. And it's now becoming essentially the standard way of operating, like I mentioned before, not just for cloud native startups but at large enterprises as well.

And that line between ops and dev is blurring to the point that it's really difficult to see as we sit here in 2021. That's on the role side. One of the other things that's changing, I think, that's really interesting about just the way people are using telemetry data is the set of use cases that it's relevant to. Historically, with all of this data about what's happening in your software, the canonical use case for when you need it is when something's on fire. Right?

Jeremy: Right.

Buddy: The mean time to resolution. How fast can I get a problem solved? How quickly can I take something that's red and turn it green? It's the classic use case for New Relic and for any other tool in this space. But what we're seeing that's evolving is people are using this telemetry data outside of that context more and more frequently as part of their day-to-day software development. For example, how do you choose where to tune and target your reduction of technical debt? If we can present observability to you that helps you understand not just the parts that are the slowest ... Because sometimes things are slow but they're asynchronous, they don't matter, or whatever. But what are the parts that are the slowest that are actually impacting customer experiences in a way that damages your business or damages your brand? So we're seeing people use that telemetry data. Nothing's on fire. But they want to use it in order to better plan and prioritize their development work.

Or another example is we've seen a lot of development in this field of chaos engineering. And testing resiliency not just by looking at the data and evaluating the architecture or doing things like load tests, but actually intentionally breaking things and then seeing how the system reacts in response to that. A tool like New Relic Navigator is really good for being able to see ... Or actually Lookout might even be the best of the features that we're releasing now that help people with this. Where you can spot these changes that maybe you didn't anticipate, so you didn't set up threshold alerting on them. But you can see in realtime how all of the different pieces of your application change when you go in and you test the resilience of your system by breaking it.

So we're seeing continued convergence of the roles, which brings more and more folks into looking at observability data. But we're also seeing a widening of the number of use cases beyond just the traditional fire fighting. It's a cliché, but it's true. As more and more businesses essentially become digital businesses, the data that describes how your digital experiences are working is taking on more and more strategic value to those companies, at least the forward-thinking ones. And so they're looking for ways to leverage that data in new and creative ways beyond traditional fire fighting.

Jeremy: Yeah. I want to talk to you about resiliency and a little bit about chaos engineering, but before we move on from this role, I'm really curious, you've been doing this for quite some time and as you see this evolve, where's the line for developers? How far should we push them down this getting into the ops role? I know we said it's very blurred, but is it at setting up their own automation? Is it at doing networking or actually touching infrastructure? How far do you think a modern developer really needs to go down that path?

Buddy: Well, it's different for different organizations and there's not a single pattern that you can apply. It's a complicated enough problem that everybody kind of needs to tailor their approach to fit the dynamics of the environment that they operate in. Sometimes things like regulatory compliance come into play and all the rest of that. But I think probably at the highest altitude, the broad trend is we're seeing developers become accountable by default for all of it until they reach the point where they can trade off the management to a third party like a cloud provider. So for example, the networking and things like that, you can trade off to your cloud provider; but when it comes to defining the infrastructure, let's do that in code in an immutable way so that I can automate and deploy and handle all of that as a developer. So developers and operations, we've been saying this for over 10 years now, but it's gotten to the point now as I sit here in 2021 where companies of all sizes and across all industries ... Increasingly I go in and I talk to people and you used to kind of ... In the prep, it's like, is this going to be a meeting with the development group or is this the operations group?

We just don't talk about it that way anymore. People are accountable to all of it. There might be specialties within the teams where maybe somebody has a time bias. They spend a little bit more of their time in one category versus the other. But the fact is that most people I talk to today, they've got some level of accountability across all of that stuff. Which is one of the reasons why ... There's only so much that a human being can do at any given time.

Jeremy: Right. Exactly.

Buddy: So the way that a lot of people are gaining leverage on that is by trading some of those pieces off to third parties like the cloud providers.

Jeremy: Yeah. No, I think that makes a ton of sense. Another thing I think ... You mentioned something about complexity in there. And one of the things we're seeing quite a bit of now, which is a very popular way to develop software, is to go down the microservices route and get rid of those old monoliths. So as people are building more microservices and you have multiple teams, which means they probably don't always follow the same standards, and some might be written in different languages, some might be running in different environments, you get all this complexity in the data that's coming from them, and you have to try to organize all of it. This is just one of those things where something like New Relic, I think, captures that perfectly, right? Where it's like you've got all this chaos and you try to make sense of it. So just your thoughts on microservices and the role of some of these tools now to make sense of all that data.

Buddy: Yeah. It's been, I think, really great for engineering teams who've moved to these architectures in that it allows them to decouple things and move faster. As more and more businesses move their revenue toward digital, they necessarily have to scale up their engineering headcount, which means you now have an organizational problem of how do you keep all of these people in this increasingly large organization productive without creating so many dependencies on each other that they all grind to a halt. So microservices are really great for that. Of course, it's also true that the magnitude of data and complexity that engineers are responsible for and accountable to is scaling at a rate that's faster than headcount. So engineers are having to take on more today and they're having to do more with less. But microservices help them at least manage the dependencies versus the old monolith architecture. This is the reason why organizations are moving away from monoliths and toward microservices today.

It does, like you said, create a whole new set of problems for observability platforms like New Relic One to solve for. The relationships, there are orders of magnitude more of these relationships, orders of magnitude more components. All of that data has to be tracked. It all has to be managed and visualized in a way that allows folks to look at all of this at multiple altitudes so that you can see what's happening overall. The mission control kind of thing that we were talking about earlier. But it can't stop there. You have to be able to get very quickly to the individual metrics, events, logs, traces, and all of that stuff that are happening right around where a problem comes up. For New Relic, in order to continue to serve our customers in the face of all this change, it's required us to change almost everything about our platform. We went from having probably a dozen different discrete products ... We had a real user monitoring product, a synthetic monitoring product, mobile app monitoring, APM, infrastructure, logs. We had all of these different discrete products. In order to keep pace with all this and continue to serve our customers, like we talked about earlier with the evolving role of the engineer toward full stack responsibilities, we had to bring all of that together into a single product.

That change happened because of changes that are happening in the organizations that we serve and in the broader application architectures. Another massive change that we made is we decoupled ... Well, we stopped counting hosts for one thing. We used to price all of this by units that just don't really make sense anymore in the modern era. How do you count up how many hosts that you've got in a world where it's going to be different an hour from now? So we stopped. We switched to ... You do know how many engineers you have. So full stack observability is priced by the seat. And then we decoupled the data. Because, like I said a minute ago, the amount of data that organizations are having to manage is scaling at a rate that's much faster than their headcount. So we took the data and we actually carved that out as a separate thing in our Telemetry Data Platform. Priced it very aggressively. 25 cents a gigabyte and then we give people 100 gigabytes a month for free if they sign up for our free tier. So that you can, in an economically feasible way, track all of this data across all of these different services.
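The pricing described above is easy to sanity-check with a small calculation. This is just a back-of-the-envelope sketch using the figures quoted in the conversation (100 GB free per month, 25 cents per additional gigabyte), not an official billing formula:

```python
def monthly_data_cost(gb_ingested: float,
                      free_gb: float = 100.0,
                      rate_per_gb: float = 0.25) -> float:
    """Estimate monthly telemetry ingest cost from the figures quoted
    in the episode: the first `free_gb` gigabytes are free, and each
    additional gigabyte costs `rate_per_gb` dollars."""
    billable = max(0.0, gb_ingested - free_gb)
    return billable * rate_per_gb

print(monthly_data_cost(80))    # within the free tier -> 0.0
print(monthly_data_cost(500))   # (500 - 100) * 0.25  -> 100.0
```

The point of decoupling data pricing from per-host or per-seat pricing is visible here: the cost scales with gigabytes ingested, which is the dimension that actually grows faster than headcount.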

That goes back to the point that I was making a few minutes ago about the big problem that a lot of organizations have, which is that they just don't have observability across all of their application. There are a couple of reasons for that. One, if the instrumentation is too complex they can't instrument fast enough to keep up, so we're working on that. We've done a number of things to make things simpler and we continue to make investments there. The other is sometimes it's just not economically feasible. So we built our whole pricing model and packaging around making it actually feasible for people to be able to generate all this telemetry. Now, of course, once it shows up in the database, it's incumbent on us to help our customers make sense of all of that, hence things like New Relic Explorer which we're talking about today.

Jeremy: Yeah. And that's one of the things I was going to say. Abstraction is very hard. When you're trying to abstract anything it's hard to find the right level. So how do you approach all of this data without oversimplifying it?

Buddy: Yeah. It's hard. We do some of the things that you would expect. We have, and we've always had, curated views that are informed by our 13 years of experience helping thousands of companies manage their own data. We also happen to be a provider of software at fairly large scale. So we have a lot of experience living in the same problem space obviously as our customers do. So we work hard to give people out-of-the-box views that help them understand what's going on in their software. And then of course we've got the ability for you to create custom dashboards like you would expect. We have a query language that allows you to interact directly with the high cardinality events that we store on behalf of our customers. Not everyone can do this, but every organization that we work with usually has a small number or sometimes a lot, but usually at least a few power users who understand the query language and can get in there with a scalpel and pull exactly what they need out.
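For readers who haven't seen it, the query language referred to here is NRQL (New Relic Query Language). A typical query a power user might run against transaction events looks something like the following; the app name and the exact attribute choices are illustrative assumptions, not taken from the episode:

```sql
-- NRQL sketch: find the slowest endpoints of a (hypothetical) service,
-- ranked by average and 95th-percentile duration over the last hour.
SELECT average(duration), percentile(duration, 95)
FROM Transaction
WHERE appName = 'checkout-service'
FACET name
SINCE 1 hour ago
```

The `FACET` clause is what lets you "get in there with a scalpel": it slices the high-cardinality event data by an attribute (here, the transaction name) instead of forcing you to rely on a pre-aggregated dashboard.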

The thing that we do that I think is unique to New Relic, though, among observability platforms, is we also added a programmability layer about a year and a half ago. And what that allows you to do is to move beyond just dragging and dropping widgets from the palette to create custom dashboards, and it moves beyond query languages and working with raw data, toward the ability to actually write your own code in ReactJS and interact with our data model using GraphQL. So, standards that lots of people know. And you can build truly tailored, bespoke visualizations. So we've had customers do everything from combining operational data with weather data to geographic data for people who have physical points of presence in stores and things like that. You can build your own. We also have an app catalog and an ecosystem where you can go in and install things.

So that's how we manage it at New Relic. We try to bring a point of view. Every company in the world has this problem, and it's a common problem, a balancing act that you will never be done with. You're always working on it. But we try really hard to provide people with out-of-the-box curated views that allow them to be immediately productive, while at the same time affording the flexibility, not just through custom dashboards and things like that, but through an actual platform you can build applications on top of, so that you can visualize the data any way you want. At the same time, you can't make that ecosystem so convoluted to navigate that it's impossible to find the pieces that solve 80% of the problem. You can imagine if we took everything that anybody ever did custom and made all of those first-class objects in the system, it would be such a huge haystack that you wouldn't be able to find those pieces. So we promote that curated experience to people as part of our out-of-the-box experience. But then we give you the flexibility if you want to, and many of our long-time customers have adopted this, to create truly custom applications to see the data exactly how you need to see it.

Jeremy: Right. Yeah. And I think when you have all this data coming in and you're collecting metrics and logs and traces, it's great to be able to look at all of that stuff independently, but you just want to get to that root cause analysis. You want to be able to figure out what that root cause was and be able to jump in. So having those predefined views, I always find those to be helpful, because if you just gave me, "Hey, here's all the data. Just set up the alerts and the graphs and everything that you want," that usually doesn't get you very far unless you can spend days and days digging into it. So having that top-level stuff and letting you dig in, I think, is a really good way to approach it.

Buddy: Yeah. Like I said, we try to bring that point of view. A little bit of a sidebar from our core discussion but for those in your audience who are interested in sort of historical trivia you may recall that New Relic got its start in 2008 building ... Our founder, Lou, built an APM product on top of Ruby on Rails, which was setting the world on fire back in 2008. Twitter was based on Rails. Some very major apps were based on Rails. One of the things that was a very defining characteristic of Rails was that it had a very strong point of view. You do not put your controllers in that directory, you put your controllers in this directory.

I think some of New Relic's early design intent, given the fact that it was built from within that Rails community, was to start with a point of view so that people could get productive as quickly as possible. It's one of the things people loved about Rails was you got that really fast out-of-box experience. So over time, as our business has grown ... And of course, now New Relic does way more than Ruby on Rails, although you still can monitor your Rails application with New Relic. You have all of this additional flexibility. But where the company got its start and one of the things that we were known for really early on was that developer productivity that came from bringing a point of view of once you get the ... Drop in the instrumentation and you log in and you immediately have insights. So we try to hold onto that even as we give people more tools to create all these custom applications and everything on top.

Jeremy: Yeah. And I think an opinionated approach to certain things with some flexibility, sometimes it can steer you in the wrong direction but you're right, it just gets people productive so much faster.

All right. I want to go back to the resiliency and some of that chaos engineering stuff because it seems like Lookout and Navigator, these are those perfect tools like you said for doing those chaos days or things like that. So what are some of your thoughts on building resiliency into these systems and how can New Relic Lookout and the other services underneath that, how can that help you make sure that you're building resilient systems?

Buddy: Yeah. A lot of what goes into New Relic Navigator and Lookout is having the ability to see in realtime what's changing in your application. So like I said, many of our early adopters, for example, when we first started opening this up to a small set of customers before we reached our general availability launch of these features, in addition to using the new features for the production events that were coming from real customer traffic and things like that, were also using them as a way to reason about what was happening in their software architecture when they were intentionally making changes. That was another one of the core use cases that we saw people using this for. And in particular with Lookout, because it's zero config, it's really good at helping people reason about these unknown unknowns in software, which is one of the differentiating characteristics that gave rise to this notion of observability in the first place.

A lot of the distinction that people draw between monitoring and observability is that monitoring was defined by this characteristic of, "I know all of the failure modes, I'm going to instrument them all with threshold-based alerting, and then I want to get a page when something breaks." And one of the defining characteristics of observability was in contrast to the way that people used to do things. More and more of the way that software fails today, oftentimes it fails in a unique fashion, because there are more variables in the equation than you can count, because of microservices and all that stuff. So you have to have a model that allows you to see what is changing, and not just where the changes are but one that actually directs you toward the ones that are causing a customer impact, without relying on you having analyzed all of the possible failure modes in advance.

So since Lookout isn't reliant on prior configuration or thresholds, it's just looking for changes and then correlating all of those changes to each other so you can see where the clusters are. Which is a lot of the problem space that people are solving in AIOps, for example. But this is exploratory and realtime, versus a lot of AIOps being about sending you notifications, which we also do. This is about seeing it when you're actually logged in and exploring what's happening in your software. That makes it highly useful for those situations when you're doing chaos engineering. And it's something that we're seeing. We're far from a state where everybody's doing that today. But again, we're seeing a lot of growth, and increasingly companies are solving for those types of use cases, and it was one of the things that we designed New Relic Lookout to help people do.
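The "detect changes without preconfigured thresholds" idea Buddy describes can be sketched in miniature: compare each new value of a metric against its own trailing baseline and flag sharp deviations, with no fixed threshold on the metric itself. This is an illustrative toy z-score detector, not New Relic's actual algorithm:

```python
from statistics import mean, stdev

def flag_changes(series, window=10, z_cutoff=3.0):
    """Flag indices whose value deviates sharply from the trailing baseline.

    A toy anomaly detector: there is no fixed threshold on the metric,
    only on how far a value strays from its own recent history
    (measured in standard deviations over the last `window` points).
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > z_cutoff:
            flagged.append(i)
    return flagged

# Steady latency (ms) with one sudden spike at index 15,
# e.g. the moment a chaos experiment breaks a dependency.
latencies = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99,
             100, 102, 100, 101, 99, 500]
print(flag_changes(latencies))  # -> [15]
```

The spike is caught without anyone having decided in advance that "latency over X ms is bad," which is the contrast with threshold-based monitoring drawn in the paragraphs above.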

Jeremy: Yeah. Well, I think if you are at the point where you need to start doing chaos engineering, you probably have a lot of applications and a lot of services talking to one another. And I think just convincing some team, "Hey, by the way, we are going to break something in production to test it," if you're going to do that and you can actually convince some team members to let you do that, you better have a pretty good tool that's going to be able to capture and be able to observe what is actually breaking. And especially even if you have to revert quickly, at least be able to see the history of that and be able to go in and see okay, when this broke this particular service was no longer responding or something like that. So, are there any surprising things you found as people started adopting this stuff?

Buddy: Yeah. One of the things that I thought was most interesting when I was going through our feedback from our early access program ... We designed this for engineers to use. For people who are in the work every single day, to help them do their jobs better. It's common for us to see managers and directors and executives engage with our telemetry data, but it's almost always rolled up to a summary that people are using to track things like SLOs and SLAs, and maybe correlate that to some sort of a business outcome like conversion rate for a commerce company or ad impressions for media or something like that. Things that are at a higher altitude. One of the surprising things that we saw with the early access program for New Relic Explorer was a use case from a manager who historically was unable, because of all of the complexity and everything we've talked about so far, to reason about that detail and still do all of their other responsibilities as a manager.

So their job typically was air traffic control, to get managerial leverage on a larger problem. So it's like, "Here's something going on over here. I'm going to send this to the person on my team who's responsible for it and ask them to look into it," as part of their job as a day-to-day manager. What we found in the early access program ... One use case in particular that comes to mind was someone who hadn't actually rolled up their sleeves and done root cause analysis in quite a while. Because of the complexity required and everything, they just didn't have time to do it. And they discovered an issue using New Relic Explorer. But before sending it on to the person on their team responsible for it, they went ahead and clicked in and said, "Let me just see if I can figure out what's going wrong here."

And for the first time in a long time they were actually able to perform root cause analysis on the thing and send it directly to the engineer outside of their team who was responsible for doing the work to actually file the ticket and fix it and all that stuff without having to task it out to an individual on their team to do the investigation. So it probably saved them, what? A day? Two days maybe? At least a day.

Jeremy: That's a lot of time.

Buddy: Yeah. So that was something that we didn't necessarily expect was that it was going to unlock the ability of folks who don't ordinarily do root cause analysis and detailed work to be able to actually navigate to what the root cause was in a way that they'd never been able to do before. It was actually one of the more, I think for me personally, hugely validating points that we had achieved what we had set out to in terms of making an interface that people could use efficiently. When not only the people in your target audience, but also people who weren't necessarily in your target audience were still able to diagnose a problem in realtime because the connections were there in the right place. I mean, the data's always been there. Collecting data's not hard. What's hard is making it all accessible at scale and connecting all of it and delivering insights. Not just piling a bunch of data into a data lake somewhere or something like that.

So when we got that story back that someone was able to, who doesn't ordinarily do this day-to-day, actually get in and diagnose a problem, it was surprising and it was also really validating for us and for the team.

Jeremy: Yeah. And I think that's amazing. No matter what level you are at, whether you're a developer or you're a manager or you're somewhere in between, not only do you reduce mean time to recovery and you can find those problems faster and figure out what the issue is, but that saves a lot of time. I can't tell you how many times I spent days looking through logs and all kinds of things trying to figure out exactly why every 100th time this thing runs something goes wrong. And being able to go and trace that and find that information quickly saves you time, saves you money, saves you mental anguish, I would think, for a lot of these things. So that's pretty cool.

Buddy: Yeah. Sure. We thought so.

Jeremy: Awesome. All right. Well, listen Buddy, I really appreciate you being here and sharing all this stuff about the New Relic Explorer and the New Relic One platform. So if people want to find out more about you, ask you some questions maybe, or they want to find out or sign up for New Relic One and use this new New Relic Explorer, how do they do that?

Buddy: Yeah. Well, for me personally, I'm most active these days on LinkedIn of the social platforms so you can find me there. Just Buddy Brewer. I'll be the one that pops up working at New Relic. And for New Relic, like I mentioned earlier, we have a free tier that is really easy and really the best place for someone to get started who's had no exposure to New Relic. You just go on our website. It's up at the top right. Click on sign up. And what you'll get is 100 gigabytes a month of ingest that you can put into the New Relic platform and one seat license for all of this stuff that we just talked about today. So you can actually ingest 100 gig of your own data every month and just go use New Relic Explorer and all the other parts of full stack observability.

Jeremy: Awesome. And you can find that at newrelic.com. Thanks again, Buddy.

Buddy: Thanks, Jeremy.

2021-03-01

Episode #89: Serverless in a DevOps World with Sarjeel Yusuf

About Sarjeel Yusuf

Engineer turned product manager, Sarjeel Yusuf is greatly interested in how the move to cloud computing and the rise of DevOps are revolutionizing the way we manage and release our software systems. Formerly at Thundra and currently at Atlassian, Sarjeel is focused on bringing DevOps-enabling solutions to incident investigation and resolution in Opsgenie. Drawing on his past experience in serverless monitoring and debugging at Thundra, he believes there is a great opportunity for serverless to unlock the potential of DevOps teams.

In his free time, Sarjeel loves to write about new advancements in the fields of serverless, DevOps, and, more recently, product management strategies. His writing can be found on his personal Medium account as well as in other publications. He would love to get in touch with anyone interested in brainstorming ideas for pushing existing technologies to build amazing products.

Twitter: @SarjeelY
LinkedIn: https://www.linkedin.com/in/syedsarj/
Website: sarjeelyusuf.me
Opsgenie: https://www.atlassian.com/software/opsgenie

Watch this video on YouTube: https://youtu.be/T7eUUUBRZQQ

This episode is sponsored by Epsagon.

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm joined by Sarjeel Yusuf. Hey, Sarjeel, thanks for joining me.

Sarjeel: Hey, Jeremy, thank you so much for having me. I just want to say it's pretty exciting to be here. I've been watching the show for quite a while now, and it's just exciting to be here with you and talk about everything serverless, I guess.

Jeremy: I'm excited to have you here. So, just to introduce yourself. So, you are a product manager at Atlassian. So, I'd love it if you could tell the listeners a little bit about your background and what you do at Atlassian.

Sarjeel: Sure. So, yeah, as you've mentioned, I'm a product manager at Atlassian. Actually, a very new product manager. Just a year ago, I was a software developer within Atlassian, within Opsgenie, and now I'm a product manager at Opsgenie. So, I made the switch to product management very recently, actually.

And so, for those who don't know what Opsgenie is, Opsgenie is basically an on-call incident management tool. It allows you to route your alerts to the right person and make sure that everybody is aware of incidents that may occur. And it helps you all the way from incident awareness to incident investigation and resolution. And my specific role at Opsgenie is basically helping DevOps-practicing teams to better their entire DevOps flow, especially considering incident management in the DevOps pipeline.

Jeremy: Right. So, that's actually what I want to talk to you about today, is just about DevOps. It's such an interesting discipline. And as teams sort of evolve and start using the cloud, it's almost like it's sort of necessary, I think, in order for you to adopt some sort of a DevOps culture.

And working at Atlassian, obviously, Atlassian has Jira, and Opsgenie, and all these other services that help with software development, and the software development lifecycle and things like that. But I think there's a major confusion out there about what exactly we mean by DevOps. And especially when you see companies labeling tools as like, "Hey, here's a DevOps tool." Or you've got DevOps engineers and things like that, that just seems really weird to me, because I don't think of DevOps that way. And maybe we could start there and sort of just set a baseline for the listeners here, and have you explain what exactly is DevOps, and what do we sort of mean by as a practice or as a culture as opposed to a set of tools or engineers?

Sarjeel: Yes. Yeah, that's it, right? The reality is that the word DevOps has become a buzzword. Actually, quite interestingly, I think it was yesterday or a few days ago, I saw a tweet by Patrick Debois that goes something like this: just because an idea has become a buzzword doesn't mean that you should shy away from it. You should still go into it, explore what it is, and learn from it.

That's the problem right now. The industry has been capitalizing on DevOps. Especially a lot of new startups are capitalizing on DevOps, marketing themselves as DevOps tools. So much so that the promise of DevOps is kind of lost, or not fulfilled, when you have all of these DevOps tools or DevOps engineers or DevOps certifications coming up in the industry.

Let's try to understand what exactly DevOps is. I think the person who best explains this, or who captured this, is Jez Humble. He basically describes DevOps as a set of practices, a cultural mindset, not exactly a set of tools. Yes, you can have tools to help with your DevOps practices. I'm not saying that any tool that says it's associated with DevOps is definitely a lie. No, it's not like that.

So, you can have tools to help with your DevOps practices, your DevOps culture. Harboring that culture in your company or in your team. But at the end of the day, it comes down to how you and your team and your entire organization go from the ideation phase all the way to release to production, and then maintaining your product. For example, that's where we, at Opsgenie, operate: incident management. How you maintain your product, and then how you learn from that and then go through that loop again.

So, traditionally, what we saw was that we had all these separate teams, where you had different roles associated with a separate stage in your development flow. For example, the first one would be ideation, where you would see more involvement of product managers and designers and sometimes engineering managers. I'm just talking very generally. You would have build, you would have test, release, monitoring, incident management, feedback. All of these were siloed.

And the problem became that when your software would go from one stage to another, when those involved in one stage would throw it over the wall to those involved in the next stage, there was a communication gap for the people receiving it in the next stage. And what that resulted in was that things just went slower, especially when you would scale your product, and especially when things would go wrong. That's what we see as an incident management tool.

Especially for our customers, when our customers are using Opsgenie and the responders are not necessarily the people who were responsible for building the code, it takes them longer to resolve the incident. That's expected. You are trying to resolve something that you didn't build, that you don't know the nitty gritty details about, and you're trying to find what went wrong. That's what DevOps aims to solve. So, I would say that with DevOps, what you can achieve is that you can go faster. You can increase your velocity while maintaining stability. That's the entire promise of DevOps.

Jeremy: Yeah. I like, basically, that quote of just because it's a buzzword doesn't mean you don't need it. And I feel like the same thing has happened with serverless as well, where everybody just starts slapping the term serverless on their product, or say we do something with serverless. I think it just confuses things more and more. And so, when you say things like DevOps, we need a DevOps tool, or we need a DevOps engineer, it sort of perverts the underlying principles, I guess, of what you're trying to achieve. And so, maybe let's go there for a second. From a principle standpoint or a cultural philosophy, as you had said, what are sort of the main objectives here? What are we trying to achieve with DevOps? Because you mentioned this idea of throwing it over the wall. And that happened all the time, right? I wrote some code, I give it to my ops team. My ops team tries to put it into production. And I'm going back a way. I know you're actually much younger than I am, so good for you. But that, actually, I think, is good, because it gives you a fresh perspective on seeing how things should be working, as opposed to old people like me saying to ourselves like, "Well, we used to do it this way. So, maybe we should keep doing it this way."

So, that idea of throwing things over the wall and having something not work, and then having to just kind of kick it back as opposed to just have a flow that this whole thing gets taken care of. So, what are sort of those principles that sort of enable you to break down those walls or break down those silos and just kind of have your software flow all the way from ideation through to production, and deployment, and then to even monitoring, and troubleshooting, and incident response?

Sarjeel: Yeah, that's actually a very good question. What exactly is the solution? If we say that, "Okay, all those tools that are coming out, or all the certifications that are coming out isn't exactly the solution." Then what can we do to break down those silos? I believe that there are two things that we can do. One is to try to involve everybody across that stream in mostly all the stages. Even as a product manager, I try my best to get involved in all the stages. And then also, even within the ideation phase, get the technical side involved within the ideation phase.

So, it's not only a product/PM group; get everybody around the same table and understand how we can go from ideation to production. And that is one culture, one practice, that you really need to incorporate in your team. Stop thinking about people as just fixed roles, and allow more flexibility, and allow the flow of ideas more. That is one way we can really break down the silos.

Another way is that, okay, now that you have everybody involved in everything, there's the responsibility of the groups. I mean, it's almost impractical to have a single engineer or a group of engineers building everything, and also making sure everything runs, and also maintaining the systems, and getting everything deployed while ensuring its stability. It becomes very difficult. If we still look at traditional practices, having one team do everything would become very difficult. So this is where I believe automation comes in, and automation is key.

Also, while we're talking about automation, we should also try to think of this left-shift culture. Bringing everything closer to either the development team or the ops team, or whoever else, but basically bringing it closer to the build stage. Right now, we are seeing this trend. A lot of people, including me, would say that CI/CD is kind of the backbone of DevOps, because CI/CD is now where a lot of the automation lives. And we see a lot of automation features coming up over there. When you're looking at automation, you're also looking at incident resolution. You see that entire incident resolution, which would sit over here, coming closer to your CI/CD. And eventually, we're also seeing CI/CD tests, and all the automated tests, coming closer to the developers themselves. You see debugging and having all these integrations in the IDE. Being able to locally test your cloud apps and things like that.

Yeah. It's pretty great. We are seeing a left shift, we are seeing an increase in automation. So, it's not only a buzzword. Even though it is perceived that way, the reality is that we are seeing these improvements happen. And we are seeing an increase in DevOps practices, and successful practices, actually.

Jeremy: Yeah. I think automation is a good point, because that's one of those things where sort of like automate all the things. It sounds really, really good. But then it also scares people too. A lot of ops people say, "Wait a minute, if you automate away my job, then what am I supposed to do?" And the answer to that is there's a million more things that you can do, especially around security, around speeding up the pipeline. Again, minimizing your time to recovery, or just things that you can work on. But the idea of automation is a key principle, I think, in DevOps, because it just gets ... It's the idea of getting things from somebody's IDE into production as quickly as possible. And then being able to sort of understand how that change maybe impacted the overall system or whatever, and be able to resolve those things much more quickly.

I remember the days where we used to work for months on a software release, and then we would put the software release out there, and then 80 things would be broken. So, we would decide, "All right, is it bad enough that we have to roll back the whole thing? Or is it okay where we can live with some of these bugs, and then just set out the QA team to start doing some bug hunting?" And you don't want to do that. That's just not the way that rapid software development and modern software development works. So, this idea of deploying very quickly and being able to see if there's any impact that is negative or whatever, and be able to roll back those changes quickly, I think, is super important.

And then the other thing you mentioned about sort of shifting left, or this idea where the developers become more responsible for the code that they write. I think that's actually a really, really good thing, where it's like, "If I'm going to put a piece of code out there that is going to use too many cycles, or it's slowing things down, or it's affecting the latency or whatever it is." I shouldn't rely on some other engineer that's running my system to say, "Hey, I found this problem in your code, can you go fix it?" It should basically be as a team, you're saying, "Okay, I released this code. We're noticing these high latency warnings, or errors, or whatever. I'm the one who is responsible for that. I should go in and I should be the one that fixes that."

Sarjeel: Yeah. That's absolutely true. Okay. You mentioned that at some point, you used to write code, and then you used to interact with the QA engineers and things like that. In that sense, Jeremy, I have been lucky in that when I started my career ... I started my career around 2018. Not that far back. When I started my career in 2018, the first company that I joined was Thundra, actually. You've probably heard of Thundra. I believe you have had ...

Jeremy: Absolutely right.

Sarjeel: You have had Emrah Şamdan over here also talking about serverless observability and debugging, and things like that. I joined Thundra. And then after Thundra, I joined Opsgenie. And both of these companies practiced building software along the principles of DevOps. So, I have never actually seen QA engineers or a specific team just to resolve incidents. For us, it was always like, "Okay, you write the code. You wrote the code. If something goes wrong, you're on call" ... And if you're on call, or even if you're not on call, the person on call would alert you that whatever changes you made, something was wrong. Then they would pull you in as a responder.

And then I look at our customers. Some of our customers still do have these practices. Especially when you're a large enterprise customer, it's a bit harder to change the entire culture. It's a bit slower. When I talk to these customers, and they tell me about these problems, it becomes very difficult for me to relate to them, essentially, because I have never ... But coming back to your point about, if you build it, you run it. And I think that's exactly where I see serverless and serverless offerings as a great opportunity, especially when you're new to DevOps and you're trying to look at DevOps, or you're thinking of adopting DevOps, or your team is thinking of adopting DevOps. I believe this is where serverless comes into play. If we go back to what I previously said about, "Okay, we want to try to reduce ops. We want to see a left shift of you build it, you run it." So, things are coming closer to the people who are building them. We also want to see automation. This is where I believe serverless comes into play.

Jeremy: Right.

Sarjeel: The reason why I say this is because ... So, when I graduated from university and I got my first job as a junior developer, Thundra gave me a perspective of both ... This is cloud computing, right? And I had just graduated, and I was like, "Okay, this is cloud computing now." I had always heard about it in university. In university, you hear about the latest trends and things like that. "Oh, my god, I'll get to work on AWS." I had never interacted with any AWS service before. And I was presented EC2 instances, and I was presented AWS Lambda. And with AWS Lambda, I just got to it, wrote my first lines of code, got it uploaded, and I was able to trigger the Lambda function. With EC2, I spent quite a while trying to get over that learning curve, to the point where I was like, "Oh my god, if I don't get it done by this week, I'll probably be fired."

Jeremy: Well, it's funny, though, that you mentioned the idea of where serverless fits in, in DevOps. I totally agree with you here. And I'll give you a history lesson. And so, I hope I don't sound like an old man yelling at clouds. But essentially, how it used to be was that you would need to maintain a server somewhere. And usually, it was a physical server that ... We weren't even talking about VMs and things like that. It was a physical server, and there was networking, and there's all these other things you had to do with it.

And that was something where there was a clear line between someone who was a developer and was writing code to between someone who was actually installing software patches, and doing the networking and actually plugging in cables in a data center somewhere. So, a lot of that changed in the late aughts, 2008, 2009, when EC2 started to become more popular with AWS, and so forth. And that made it a little bit easier, but you were still thinking about VPCs and trying to do networking and that kind of stuff. It was easier, but still something that you wanted someone to set up for you so that as a developer, I would just have an environment that I could use.

What serverless has changed is that now you just have an environment. And so, you don't have to set up an environment, you just need an AWS account or a Google Cloud account, or IBM or whatever, and you can just go upload some code and have it immediately execute within that environment. And so, that's one of the things for me, where if you try to say to a developer, "Hey, I need you to take responsibility for all of this stuff. And oh, by the way, we're running on EC2 instances, and VPCs, and you need to know the security groups, and you need to understand how all of these things might be able to affect you." That is too much, in my opinion, to ask somebody. But to say, look, you're throwing your code into ... Even if it's a container in Fargate or something like that, or you're doing a Lambda function, that's a pretty isolated environment. It's pretty easy for you to reason about if something is not working. "I'm not able to connect to a service. It's running too slow. It's timing out." Things that are easy, I think, for you to understand and debug, and that just becomes ...

I don't think that's too much of an ask. So, I do think that you're asking developers now to go all the way through that spectrum, and to understand a little bit of the operational aspect of it, but they don't have to understand the deep networking stuff or how packets are routed and some of that stuff. They just need to understand some of the basic cloud principles. I think serverless enables that and really is this huge enabler of companies accepting DevOps.

Sarjeel: Yeah, exactly. That whole point about a lot of the underlying infrastructure being abstracted away to the cloud vendor and becoming the responsibility of the cloud vendor. That in itself is just extremely helpful to anybody trying to practice DevOps, any team trying to practice DevOps. Because all of a sudden, you no longer have to worry about your ENIs or your security groups, as you mentioned. All of that is managed by the cloud vendor that you're using, whether it be AWS or Google Cloud. What that allows you to do, as we have seen quite a bit, and as one of the well-known benefits of serverless, is focus on your business logic more. It not only allows you to focus on your business logic, but another hidden gem, I would say, is that it also allows you to focus on the communication and sharing of code, and getting over that learning curve when other teams are involved.

So, even though you didn't write the code yourself, if you look at somebody else's Lambda function ... Or if you look at somebody else's FaaS functions or Lambdas, let's say, it's easier to understand. It's easier to collaborate on a code base. And, on top of that, it just makes it easier for an entire team going through that spectrum to manage that pipeline, the DevOps pipeline going from ideation to ... In fact, especially as a product manager, I would say that all product managers should also learn how to deploy Lambda functions, especially when you're trying to iterate on an idea.

It's become so easy. You can write throwaway code. It becomes very easy to write. Just code that works, just to test whether an idea works or not. And especially when you're trying to find that perfect product-market fit, just write a bunch of Lambda functions with your engineering manager or your lead engineer, show that to a test group of customers, see if it works, and go back and iterate. It's so easy to do that because, one, serverless functions are cheap, or serverless services are cheap. The pay-as-you-go model. They're very lightweight, they're very easy to get up and running with. You don't need to worry about all that infrastructure that we already talked about. So, even there, just in the ideation phase, it's very easy to go forward.
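Sarjeel's throwaway-prototype idea can be as small as the sketch below. Everything here is hypothetical (the event shape and the "feature" are made up for illustration); the point is how little scaffolding a Lambda handler needs before you can put an idea in front of a test group, for example behind an API Gateway endpoint or a function URL.

```python
import json

# Hypothetical "throwaway" Lambda handler for validating a product idea.
# Assumes an API Gateway proxy-style event with a JSON body.
def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "there")
    # Pretend feature: a personalized response for a beta test group.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hi {name}, thanks for trying the beta!"}),
    }
```

Deploy it, share the URL with a handful of customers, and throw it away once the idea is validated (or not).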

Jeremy: Yeah. I think there are a lot of benefits to just using serverless to do some of these DevOps practices. And I know we haven't really mentioned all of the principles, I guess. We mentioned a couple of the main ones, but I think one of the things a part of the DevOps culture, or at least a part of what you need to do to fully embrace it is this idea of building microservices, right?

I mean, microservices allow individual teams or small groups of people to work on parts of the application independently. And when you start dealing with some massive monolith, and you've got a bunch of different teams all contributing to the same code base, it gets really, really messy. So, being able to break those up into smaller things is super important.

Serverless, I think, has a bunch of really cool things baked in, especially with intercommunication between microservices, and you don't have to set up things like Kafka, or RabbitMQ, or some of these other things that's just another thing to manage. So, what are your thoughts on that? What are some of the tools or the services available as part of the serverless ecosystem that just help with microservices?

Sarjeel: As you mentioned, microservices, we're all familiar with the benefits of microservices.

Jeremy: I hope we are.

Sarjeel: Hopefully. Believe me, I have dealt with a monolith, especially when you look at a front end as a monolith. In many cases, front ends can be considered a monolith: you have this one big front-end code base, and it just becomes very difficult. You really do see the benefits of microservices. It's actually the idea of microservices that really plays well with the entire DevOps culture and practice, where you can have each team working on something, and you can go fast on that. Especially when you look at the stability.

I know we're going off on a tangent over here. We haven't started talking about how serverless bakes into the benefits of building microservices. I just wanted to mention one point that is really amazing that I have seen dealing with monoliths and microservices: when you're looking at it from a DevOps perspective, and you're looking at the stability of your system, just having one part break and not affect the other part. That in itself, I believe, is taken for granted. It's pretty amazing. Being able to decouple all these different aspects, or all these components, of your entire system. And looking at them individually, where one component's failure does not necessarily lead to another component's failure. That, in itself, is pretty amazing with microservices.

However, what that leads to is that when you're thinking of microservice architectures, you also need to think about communication overhead, as you mentioned. Yes, you did decouple all of these, but now you still need all of these to communicate with one another.

Jeremy: And reliably.

Sarjeel: And reliably. Yes, exactly. Reliably. As you mentioned, there's a lot of overhead over there. I personally haven't dealt with Kafka or RabbitMQ.

Jeremy: Consider yourself lucky.

Sarjeel: Yeah. We saw EventBridge, and I think a lot of people would agree with me over here, that EventBridge is definitely the next best thing after AWS Lambda. That's because of all the use cases that it has enabled, and how powerful of a service it is. And it really allows you to think about serverless architectures and event-driven architectures from a whole new perspective.

One of the best things that it allows you to do is reduce all those ops that you would otherwise have to deal with. All that overhead with communication, even things like marshalling and de-marshalling: you're literally just communicating in the form of events. And you're able to leverage other capabilities of EventBridge, such as routing of events based on rules. That in itself has enabled a lot of use cases within your microservice architecture, and also as ancillary services supporting your DevOps pipelines.
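The routing-by-rules idea can be sketched in a few lines. This is a simplified, hypothetical illustration of the core of EventBridge-style matching, not the real implementation (real event patterns also support nesting, prefix matching, and other operators):

```python
# Simplified EventBridge-style rule matching: a rule's event pattern lists
# the allowed values for each field, and an event matches when every
# constrained field holds one of those values.
def matches(pattern, event):
    return all(event.get(field) in allowed for field, allowed in pattern.items())

# A hypothetical rule that routes order events to, say, a shipping service.
order_rule = {"source": ["orders.service"], "detail-type": ["OrderPlaced"]}

event = {"source": "orders.service", "detail-type": "OrderPlaced",
         "detail": {"orderId": "123"}}
```

In AWS itself, you would define the pattern on a rule (for example with `put_rule` in boto3), and EventBridge delivers each matching event on the bus to that rule's targets, such as a Lambda function or a Fargate task.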

Yes, one is definitely EventBridge. Again, when we're mentioning all of these services, it's also good to point out that age-old myth about serverless equating to only Lambda functions. A lot of people, even I, when I began looking at, "Okay, what is this word, serverless?" I started thinking, "Okay, yeah. Serverless equals Lambda functions." Then I realized, no, it's actually a whole set of tools. It's a whole set of services that are available out there.

We mentioned EventBridge. We should also give credit to DynamoDB. Especially when you're looking at it from the point of view of scalability, you can have your entire microservice architecture built using serverless services. But if your data layer isn't scalable, then what's the point?

Jeremy: What's the point? Right.

Sarjeel: Exactly. Having that incorporated also. And then also, having a lot of the responsibility being abstracted away to the cloud vendor, that in itself allows a lot of teams trying to adopt DevOps to go faster.

When you're looking at EventBridge, DynamoDB, then of course, you have your AWS Lambda functions, your Fargate, basically your containers as a service. If you find FaaS services a bit limiting, you can always look at containers as a service. We are seeing a rise in popularity with containers as a service. So, you have this whole set of tools in your cupboard that you can basically bring in, plug and play. And that's what serverless allows you to do. It lets you bring in a service, plug it in, and play it in the entire way your microservice architecture operates.

Jeremy: Right. Yeah. I think you bring up a point too. You mentioned DynamoDB, which has global tables, and all kinds of things that allow you to replicate data to other regions.

The other thing that's cool about DynamoDB, or EventBridge, or Lambda functions, is that it runs in multiple availability zones, even if you're running it in a single region. And that gives you redundancy and resiliency and all these backups that ... Again, speaking of Kafka or RabbitMQ or something like that, where you'd have to have multiple services or multiple systems running in multiple regions or multiple availability zones that were subscribing to all these events and trying to manage all of that complexity.

EventBridge just kind of does that for you. You don't even have to think about it. Same thing with DynamoDB. But DynamoDB global tables are actually something that could get us, maybe not easily, but certainly possibly, to start thinking about active-active regions, where you can actually have your systems running in Europe, and you have them running in the US, and maybe you have them running in Australia or something like that.

So, what are your thoughts on that and where serverless helps get teams to deliver ... Not only to deliver software faster, but to deliver software to more places or more regionally.

Sarjeel: Right. If we take a step back, and if we look at ... You mentioned active-active. Yes, active-active architecture in itself is a whole different topic. And you have one of the great personalities in this field, Adrian Hornsby, who talks about this quite well. He has a great set of resources, blogs, and talks about that. Anybody who's interested and wants to learn anything about active-active can definitely go and refer to those.

But if you look at that architecture from a DevOps point of view, from the standpoint of what we actually want to achieve with this type of architecture. You can trace back a lot of its motivation, not its origins per se, because it's been there in academia for quite a while now, but a lot of the motivation to adopt such an architecture. A lot of it comes from that one horrific story of the Netflix outage.

I just want to mention, as a side note, that we've been talking about this Netflix outage for quite a while. I'm just waiting for the next big outage, because I think we have been overusing this Netflix outage story quite a bit now. Working at an incident management company, we do hear about a lot of outages, but none of them compare to what we saw with Netflix, or on that scale, but regardless. So, we see a lot of that motivation for active-active coming from the outage of Netflix, and we saw Netflix kind of start pushing the idea of resilient architectures. I'm not saying that it wasn't there before. Of course, it was, but we saw Netflix, one of the big tech companies, starting to push and really think about it from a whole new perspective. And the whole point over here is to maintain stability. As we mentioned earlier, that actually is one of the goals of DevOps. Now, when you're thinking about active-active, it's easier said than done.

Jeremy: That's very true.

Sarjeel: Right. It's actually easier said than done. When you start thinking of how serverless tools can come and help with setting up such an architecture, we do see a lot of the burden lifted off. So, for example, you mentioned DynamoDB global tables. Then there's also Route 53. So, we have geo routing, which they recently announced. I believe it's one of the more recent capabilities with Route 53. We have DNS failover with Route 53. Then you have API Gateway, which came up with custom domains, which allow you to now target regional endpoints. So, we can actually build this entire active-active setup with serverless services, where a lot of that responsibility, again, gets abstracted away to the cloud vendor.
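As a rough sketch of the Route 53 piece Sarjeel describes, here is what two latency-based alias records for one domain might look like, each pointing at a regional API Gateway custom domain. All domain names and zone IDs below are placeholders, not real values:

```python
# Hypothetical sketch: two latency-based alias records for the same name,
# one per region. Route 53 answers DNS queries with the lowest-latency
# healthy region, which is the routing half of an active-active setup.
def latency_record(region, target_domain, target_zone_id):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "SetIdentifier": region,           # must differ per record
            "Region": region,                  # latency-based routing key
            "AliasTarget": {
                "DNSName": target_domain,      # regional API Gateway domain
                "HostedZoneId": target_zone_id,
                "EvaluateTargetHealth": True,  # fail over on health checks
            },
        },
    }

changes = [
    latency_record("us-east-1", "d-use1example.execute-api.us-east-1.amazonaws.com", "ZEXAMPLE1"),
    latency_record("eu-west-1", "d-euw1example.execute-api.eu-west-1.amazonaws.com", "ZEXAMPLE2"),
]
```

With boto3, you would submit these changes via `route53.change_resource_record_sets()`, and DynamoDB global tables would handle replicating the data layer between the two regions.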

I know I've said this statement, being abstracted away to the cloud vendor, quite a bit. Simply because I want to stress how important it is to try to reduce the ops to eventually move towards a very successful DevOps-practicing team. Again, a lot of these activities, let's say, are being automated by these managed services, especially when it comes to scalability. A lot of times when you talk about the limitations of serverless, one of them is that it's stateless. A lot of people have difficulty thinking about statelessness: how to think of stateless business logic, and how you would translate your business logic to a stateless architecture. But with active-active, it actually becomes an advantage to have stateless architectures. To have stateless compute services, because you don't want to hold state too long in an active-active setup. And you want to keep on switching between nodes.

Another area where serverless really shines is the fact that it's a pay-as-you-go model. So, if you have nodes that aren't being used, why pay for them? That's another advantage where you can see the benefits of serverless come into play.

Now, there's that, and there's also the scalability. So, all of a sudden, you have a lot of traffic being routed to a specific node. You may not have handled your routing rules very well, considering your traffic or things like that, but it's okay. Your DynamoDB is going to scale. If you have, let's say, a Lambda function over there, or maybe a Fargate instance, it will scale. So, auto-scaling, pay as you go, statelessness, all of these come together to really help you build that active-active architecture and start thinking of how you can build it.

Now, by the way, another thing I want to mention, which I think we missed, was when we were talking about the characteristics of serverless functions, or serverless in general, and how we are using these serverless functions or serverless tools to build microservices. One of the characteristics is that when you think of a serverless architecture, it's event-driven, and that plays very well when you're dealing with microservices. All of a sudden, you now have to start thinking about event-driven architectures. Again, we're coming back to EventBridge, where your EventBridge may be triggering a lot of your Fargate instances or Lambda functions. Just having that constraint of your functions, or your compute resources, being triggered by events makes sure that you think about this architecture in an event-driven fashion.
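
The event routing idea above can be sketched in a few lines. This is a simplified stand-in for EventBridge's pattern matching, not its full rule language: a pattern matches when, for every field it names, the event's value is one of the listed values (nested objects recurse). Rule and event contents here are invented for illustration.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching: every field named in the
    pattern must be present in the event with an allowed value."""
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            if not isinstance(value, dict) or not matches(allowed, value):
                return False
        elif value not in allowed:
            return False
    return True

# A rule like this would route deployment events to its targets
# (a Lambda function, a Fargate task, and so on).
deploy_rule = {"source": ["my.app"], "detail-type": ["DeploymentCompleted"]}
event = {
    "source": "my.app",
    "detail-type": "DeploymentCompleted",
    "detail": {"env": "prod"},
}
```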

There is a possibility, though, I must point out, that you fall into an anti-pattern. Especially when you're trying to adopt serverless, and you're moving from your monolith to this granular architecture, to a serverless architecture, to microservices. It is easy to fall into an anti-pattern where you try to replicate your entire business logic from the monolith exactly into your serverless architecture, where you would have one Lambda function sitting before another Lambda function, which sits before another Lambda function. All of a sudden, you have this anti-pattern, where you can't do things asynchronously. And asynchronicity is, again, another benefit of Lambda functions when you're thinking about microservices. So, there's this anti-pattern you may fall into, but as long as you think about it in an event-driven fashion, as long as you know what you're doing, as long as you do your research before building this architecture, you should be good.
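
To make the contrast concrete, here is a hypothetical sketch (handler and event names are made up) of the chaining anti-pattern versus the event-driven alternative, using a tiny in-memory bus as a stand-in for EventBridge. In the synchronous chain, the first function waits on, and pays for, every downstream step; in the event-driven version each handler does one step and emits an event.

```python
class InMemoryBus:
    """Stand-in for EventBridge: subscribers register for a detail-type."""
    def __init__(self):
        self.subscribers = {}
        self.log = []  # records the order in which events were published

    def subscribe(self, detail_type, handler):
        self.subscribers.setdefault(detail_type, []).append(handler)

    def publish(self, detail_type, detail):
        self.log.append(detail_type)
        for handler in self.subscribers.get(detail_type, []):
            handler(detail)

# Anti-pattern: each "function" synchronously invokes the next, so the
# caller blocks on the whole chain.
def handler_a_sync(order):
    return handler_b_sync({"validated": order})

def handler_b_sync(order):
    return {"charged": order}

# Event-driven: each handler emits an event and moves on; nobody waits
# on the full chain.
bus = InMemoryBus()
bus.subscribe("OrderValidated",
              lambda detail: bus.publish("OrderCharged", {"charged": detail}))

def handler_a_async(order):
    bus.publish("OrderValidated", {"validated": order})
```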

Jeremy: Yeah. I think that anti-pattern is very prevalent, where people unfortunately end up trying to stack too much logic, or chaining functions together in a way that is definitely slower.

We talked a lot about the benefits, I think, from a DevOps perspective, or from a DevOps culture of building things with serverless, and I think that makes a lot of sense. There's still ops work to be done. We still have to clean up development environments, or maybe run some audits or some of these other things. So, I guess from that perspective of ... And maybe this falls more on operations, but I think it's sort of part of the full cycle. Where does serverless fit in there? And what are some of the tools that are available for you to sort of just kind of run the infrastructure beyond just trying to deploy code that is maybe client-facing?

Sarjeel: Yeah. Actually, this is a pretty great question, because when I look at serverless and how serverless can aid in DevOps practices, we've talked about serverless functions and serverless technologies inside your main code base, inside your main infrastructure itself. But yes, then there's a whole set of other use cases, where your serverless functions or serverless technologies can act as ancillaries, helper functions or ancillary services, aiding you to get through that DevOps pipeline.

So, as you mentioned, cleaning up your environment, or even just thinking about deployments, how we're looking at automated deployments throughout the CI/CD stage, and also automated tests. Then again, monitoring and debugging and identifying root causes of incidents and remediation and all of that. All of that can actually be done with serverless functions. There are a lot of tools out there that are trying to help you achieve these things. Atlassian itself is building a lot of tools that help you achieve this, help you automate through this. But having those third-party tools, and having serverless technologies integrated with those tools, really does give you that extra boost to go faster while maintaining the stability that we're always talking about, that we're really trying to go for. I can give you an example.

Jeremy: Absolutely.

Sarjeel: There are actually many examples we can talk over that I would like to point out. One of the examples that I really like is, for example, we recently ... Actually, not recently. About a year ago or so, we built an integration with EventBridge. And basically, the way we saw customers using the Opsgenie-EventBridge integration was, okay, they get an alert from either Datadog or New Relic about some form of configuration drift in their infrastructure. Once they identify this infrastructure drift in the AWS setup, Opsgenie would send you an alert. And that alert acts as a trigger through EventBridge into your AWS infrastructure that can run automated playbooks to correct that configuration drift. I think that was just an amazing use case that we saw some of our customers using.

Another thing was, for example, security compliance, or when you see some suspicious activity in your account. You can use AWS CloudTrail, or you can use any other security monitoring or audit logging tool. You integrate that with Opsgenie. Opsgenie gets that alert. And upon that alert ... Again, this is where I've seen customers leverage the event routing capability of EventBridge. Depending on what the content of the alert is, it's routed to the right area of their infrastructure to immediately remediate that. All of this being done automatically.
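
The routing step being described can be sketched as a tiny dispatcher: inspect the alert's content and hand it to the matching remediation playbook, escalating when nothing matches. The playbook names and alert fields below are invented for illustration, not Opsgenie's actual schema.

```python
# Hypothetical remediation playbooks keyed by alert category.
PLAYBOOKS = {
    "config-drift": lambda alert: f"reapplying template for {alert['resource']}",
    "suspicious-login": lambda alert: f"revoking sessions for {alert['user']}",
}

def route_alert(alert: dict) -> str:
    """Dispatch an alert to the right playbook, or escalate to a human."""
    playbook = PLAYBOOKS.get(alert.get("category"))
    if playbook is None:
        return "escalate-to-human"
    return playbook(alert)
```

In a real setup this dispatch would be an EventBridge rule per alert category, each targeting the Lambda function that runs the corresponding playbook; the sketch just shows the content-based routing idea.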

So, what we're actually seeing is kind of a reduction in the need for SRE teams, the need for infrastructure maintenance, and basically all of ops. The developers themselves can now set this up, because it's so easy. It's so easy to get up and running with EventBridge and Lambda functions, or serverless in general, that you can have your development teams set this up and take responsibility for that ops part also.

Jeremy: Right. Yeah.

Sarjeel: I think in the beginning, you mentioned that sometimes ops can get scared, like, "Oh, what's the point? What are we needed for?" Why not? Let's come together. That's the whole point, of coming together, and helping. If the development team can set it up, that's where I feel that we need to start thinking of ops in a whole different way, especially with the advent of serverless and all of these third-party tools. We need to start thinking of ops in a whole different way: how we can leverage this new technology in the best way possible to accelerate according to the team's development practices, according to the team's cultural practices in building software. Because, again, every team is different. That's also a reason why you can't really say that there's one solution or one tool that would solve all the DevOps problems of the industry. No. Every team is different. Every team is different within an organization. Every organization is different.

Regardless, you can have third-party tools to try to help you bolster your DevOps solutions. But as soon as you see that, "Okay, it's not working," that's where you can fill in the gaps with serverless services. I think that in itself is just pretty amazing.

We are seeing customers do this with Opsgenie, and we're seeing a lot of automation come up. For example, in Opsgenie itself, we're using Lambda functions to replicate customer traffic. You have synthetic monitoring Lambda functions, and you have transactional monitoring Lambda functions. So, these synthetic monitoring functions are hitting our APIs. Then the transactional monitoring Lambda functions are receiving the input, processing it, and sending it to New Relic and the other monitoring tools that we're using. Whenever something is wrong, that Lambda function will automatically send an alert to Opsgenie. Surprise, we use Opsgenie ourselves internally. We get an alert. This way, we manage to catch errors or incidents before our customers can even see them.
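
A synthetic check like the ones described is conceptually very small. Here is a hypothetical sketch; the HTTP call is injected as a parameter so the probe logic can be exercised without a live endpoint (in a real Lambda you might pass an HTTP client function as `fetch`). The URL is a placeholder.

```python
import time

def synthetic_check(url: str, fetch, timeout: float = 5.0) -> dict:
    """Probe `url` once and report status plus latency in milliseconds.
    `fetch(url, timeout)` should return an HTTP status code or raise."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
        ok = 200 <= status < 300
    except Exception:
        status, ok = None, False
    latency_ms = (time.monotonic() - start) * 1000
    return {"url": url, "ok": ok, "status": status, "latency_ms": latency_ms}

# A failing result is the point where the function would raise an alert
# to the incident-management tool instead of just recording metrics.
```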

And remember, this is, again, where you can leverage the characteristics of serverless services or tools, because, again, it's pretty easy to set up, so developers can set this up. I remember going around and playing with a few monitoring Lambda functions myself when I was a developer: setting them up, getting them connected to New Relic, and doing the whole thing. Again, we try to play by that motto, "You build it, you run it," as much as possible. So, for example, when something goes wrong in Opsgenie, we get that alert. We try to investigate it ourselves. All of that is made possible because, at some point, we are using Lambda functions to generate a lot of monitoring data and send that monitoring data over to New Relic.

Jeremy: Yeah. I think you hit the nail on the head in terms of where the SRE team members go after some things become easier. And you mentioned this idea of CI/CD pipelines. So, if it's super easy to set up a CI/CD pipeline, and it's just a matter of a couple of clicks in a dashboard, or you just have to deploy maybe another CloudFormation template or something. If I was an SRE, and I've done roles similar to SRE in the past, I would be really, really tired of setting up another CI/CD pipeline for somebody. If that was my job, just, "Oh, we've got to set up another one of these. Set up another one of these." That is just wasted human capital, where you could be spending that time, like you said, writing a Lambda function that sends synthetic traffic, or getting into chaos engineering.

If you have people who know the ops side of things, and can say, "Hey, what happens if this service can no longer communicate with that service? How does your service react? How does the other service recover, and so forth?" And again, becoming chaos engineers around that, I think, is a hugely important thing that larger teams, and maybe even smaller teams, have got to start doing to understand the nature of distributed systems and what happens when one thing breaks down. So, I do think that there's an evolution here, where the more you can automate, the more your developers can own some of that stack, the more it frees up people to do more important work than things that can just easily be automated.

Sarjeel: Yeah. No, I definitely agree with you. Once you have a lot of automation ... Again, we get back to the same point: when you have automation, you can start thinking about the business logic and start thinking about how you want your company to perform, to scale, and basically work for your customers. Now, when we look at SRE, SRE is now free to start looking at the resiliency of the system. Performing more tests, making sure that we are at the number of nines that we want in terms of resiliency and availability. And SRE gets freed up because of that, right?

Also, when you're talking about automation, you mentioned CI/CD. I wanted to sidetrack to this upcoming concept of GitOps. We've heard quite a bit about it recently. There's a company we've worked with, I believe, that's really pushing the needle on this. They're looking at this quite a bit. Even that, when we think about automation. This entire idea of GitOps, the motivation comes from the rise of Kubernetes, Kubernetes becoming popular, and how you would manage your Kubernetes infrastructure. And they're looking at that property of Kubernetes, because it's a defined architecture. I forgot the term, I'm so sorry. You define it and apply it with your kubectl and everything. You push the infrastructure changes or your code changes, and that's when GitOps would basically automate. You'd go from continuous delivery to continuous deployment. It's a push from continuous delivery to continuous deployment.

Whenever I look at continuous deployment, and whenever I look at GitOps in general, I find it very scary, because, okay, you pushed something, and all of a sudden, all of these things are happening automatically. Your entire infrastructure is just about to change. Your entire code base is about to change. It's always very nice to have somebody in the middle, in a staging environment. You first push to a staging environment, with somebody in the middle.

Jeremy: You test it, right? Yeah, exactly.

Sarjeel: You test it. Exactly. And then there's a little button that says, "Okay, deploy," and it goes to production, and everything is cool. But you're reducing that. At the end of the day, that's what the idea of DevOps is: to try to reduce the manual labor and push for automation. And then I look at GitOps, and I'm like, "Did we just go crazy? Are we going too far?" And that's where I see that GitOps shouldn't only be thought about in terms of automating your deployment; you should also think of it from a perspective of observability. And, again, when we're talking about observability, that's where I believe we can leverage these Lambda functions, because your Lambda functions are very light, and they're just being used for monitoring. So, you're continuously monitoring the actual state as compared to the desired state. And whenever you see the actual state has drifted away from the desired state, again, you can either use Opsgenie as your alert consolidation tool, or you can trigger an event through EventBridge from your Lambda function, send an event to EventBridge, and, again, go and remediate that drift away from the desired state.
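
The desired-versus-actual comparison at the heart of this GitOps-style check can be sketched in a few lines. This is a hypothetical illustration (the state fields are invented): compare what's declared in Git with what a lightweight monitoring function observed, and report only what drifted.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the fields whose actual value differs from, or is missing
    relative to, the desired value. An empty result means no drift."""
    return {
        key: {"desired": value, "actual": actual.get(key)}
        for key, value in desired.items()
        if actual.get(key) != value
    }

# Hypothetical states: Git says three replicas on image 1.4.2; the
# cluster is currently running only two replicas.
desired = {"replicas": 3, "image": "app:1.4.2"}
actual = {"replicas": 2, "image": "app:1.4.2"}
# A non-empty drift report is what would become the alert, or the
# EventBridge event that kicks off automated remediation.
```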

So, this is one way. We are looking at automation, and we are trying to find ways to go faster and faster. And it happens that, okay, as we go faster, we still need to remember to maintain stability, maintain availability. And this is where we see the benefit of, especially, Lambda functions, or Azure Functions, or whatever form of FaaS you're using. Because when we talk about using FaaS functions in production for your actual code base, there are a lot of limitations that everybody talks about, a lot of edge cases that aren't covered by these services. But in this regard, it fits perfectly. It's cost-effective, it's easy to spin up, and it's scalable. At Opsgenie, when we want to increase the traffic on a certain API, we simply have several concurrent Lambda functions bombarding that API with requests, and different kinds of requests. It's scalable. It's easy to spin up.

This is exactly where one of the benefits lies. But again, as I mentioned, in production you may have some limitations, when you're thinking about it in terms of microservices and active-active architectures, as we talked about before. But when you're thinking of them as ancillary services, just helping you go through that DevOps pipeline, when you're thinking of them as glue code especially, it's really beneficial to use serverless functions.

Jeremy: Yeah. I think all that ties together too. I mean, GitOps is something that just ... CI/CD, continuous deployment, is one of those things where, yes, it scares a lot of people, because it's just going so fast. It allows you to move so quickly and make changes so quickly. And I think that if you embrace the whole culture, if you embrace the idea of microservices and serverless, deploying very small units of code; this idea of test-driven development, or being able to have the tests that you need in there; the ability for you to roll back quickly; adding in things like chaos engineering to know that if we put something out there and it breaks, the other things will degrade gracefully. Having that capability and following that whole thing, that's sort of the holy grail of doing this stuff. Because it's okay if you break something sometimes, but it should go through a test process, and there should be a development environment where you're testing these things against other things. But if something does break, you're isolating it, you're minimizing, you're creating those bulkheads that minimize the impact it has on a larger scale.

So, we're running out of time and so before we finish, though, I do want to talk about ... I mean, we've been talking a lot about serverless, and EventBridge, and active-active, and DynamoDB and all these great things. It's not like a team can just go ahead and shift tomorrow and start using all this stuff, right? There are a number of barriers to adoption, some of those being just the cultural change in a company, first of all. But also, just this idea of the learning of these tools, and then maybe even the limitations of some of these tools. So, what are your thoughts on some of the barriers that might exist to people who want to adopt, not only DevOps, but maybe DevOps with serverless?

Sarjeel: I can best answer this with a story of mine, something I experienced when I switched over to Opsgenie from Thundra. Remember, I was at Thundra. At that time, Thundra was a serverless monitoring tool. Now, it's become much more, of course. At that time, we were focused on serverless, and I was just like, "Oh my god, it's an amazing technology." Then I switched over to Opsgenie and I saw, "Okay, we aren't using it that much." And in fact, when I switched over, a major piece of functionality of ours that was initially being built on a serverless architecture, the senior engineers rolled back on the decision and went back to EC2 and other forms of container services.

And I asked, "Why did this happen?" I didn't have that much experience, and I really wanted to know what happened. So, I remember, one of the co-founders actually sat me down. He was a pretty cool guy. He actually took me to a whole different meeting room. He sat me down, like, "Okay, I'm going to teach you something now." I'm like, "Okay." He told me that the service is great, and it is definitely, in some way, the future. But in its current state, we do see a lot of issues. And this was back in late 2018, let's say. Around that time, the maximum run time was five minutes, I believe, for a Lambda function.

Jeremy: Five minutes. Yes.

Sarjeel: Yeah. It was in that same year, or a bit later, that we saw 15 minutes. So, that is a huge improvement, I would say. At that time, we weren't ready. Our use case was not the best use case. Or the way we were looking at our use case was not the best way to adopt serverless. And I think it's very important to understand the limitations of what you can and cannot build, and how you can get around these barriers, because I feel that there's always a way to get around them. Are you just willing to invest in it? It's not like Opsgenie gave up on serverless. We continued. We still use a lot of serverless components in a lot of areas in Opsgenie. Especially in our DevOps pipeline. For example, when we want to spin up emergency instances, we use Fargate. We were using Fargate. I'm not sure if we still are. We're using Fargate for our SRE, in our logging, and other operations, because it's easy to spin up, and it's cost-effective for that use case. But there are definitely limitations.

What I have noticed, Jeremy, even in the small period from 2018, since I graduated, to now, I have seen ... We have all seen major leaps of improvement. Just mentioning five minutes to 15 minutes of runtime, that was a major improvement. I remember there was this whole conversation that I had with Emrah Şamdan about how we were looking at, "Okay, we need to think about runaway cost." We now see that the billing has become more granular for a lot of serverless services. For example, for Lambda functions, it was 100 milliseconds. Now, it's one millisecond. That in itself is just a huge improvement. I think we see the same thing for a lot of serverless services.
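
A quick back-of-the-envelope sketch of why that granularity change matters (the durations are hypothetical; pricing per millisecond is left out since it doesn't affect the ratio): under 100 ms granularity, a duration is rounded up to the next 100 ms block before billing; under 1 ms granularity, to the next millisecond.

```python
import math

def billed_ms(duration_ms: float, granularity_ms: int) -> int:
    """Round a duration up to the billing granularity."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms

# A 12 ms invocation used to be billed as a full 100 ms; with 1 ms
# granularity it is billed as 12 ms, roughly an 8x cost reduction
# for that invocation, all else being equal.
old = billed_ms(12, 100)  # 100
new = billed_ms(12, 1)    # 12
```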

And then there are other things. For example, being able to debug your serverless infrastructure. That in itself is problematic, but we do see a lot of improvements in the industry. And also, a lot of third-party tools are coming up. For example, I've been following Thundra's growth as they went from ... They kind of started encapsulating all of this, enabling cloud developers. Recently, they came up with Thundra Sidekick, which is, I think, a very cool feature. If people haven't seen it, I recommend they go check it out. We're definitely looking at it. And there's this whole community that's coming up to fill in those gaps, and there's still a long way to go. But I still feel that even what we have right now is pretty amazing.

Jeremy: Yeah. No, I agree. And I think that there are limitations. Serverless is not a silver bullet. You're going to run into limitations, but I do see ... I mean, I would recommend to anybody, if you're trying to establish a really good DevOps practice within your organization, or you're just building applications, the services that are serverless, and have those serverless qualities, are going to be the ones that make the most sense for you to choose if you can. If you can't, then don't. But if you can, choose those, because that just gives you all of those benefits we've been talking about through this entire episode. And just that ability for you to really own your code, and get those CI/CD pipelines to the point where you're delivering multiple releases per day and things like that.

So, Sarjeel, listen, thank you so much for joining me and spending this time and sharing your knowledge on DevOps and serverless. If people want to get ahold of you or find out more stuff that you're working on, how do they do that?

Sarjeel: Well, I'm a pretty open guy. You can just contact me through whatever channel you find. Twitter is great. So, Jeremy, I think you are putting ...

Jeremy: Yes, I'll put the stuff in the show notes. Yeah.

Sarjeel: Yeah. You can contact me through Twitter or by email. You can find my email on my website, which I think is also going to be in the show notes.

Jeremy: Yep, sarjeelyusuf.me, right?

Sarjeel: