Originally published January 31, 2019. We are taking a few weeks off. We'll be back soon with new episodes. Artificial intelligence is reshaping every aspect of our lives, from transportation to agriculture to dating. Someday, we may even create a superintelligence: a computer system that is demonstrably smarter than humans. But there is widespread disagreement on
The post Architects of Intelligence with Martin Ford (Summer Break Repeat) appeared first on Software Engineering Daily.
Cruise is an autonomous car company with a development cycle that is highly dependent on testing its cars, both in the wild and in simulation. The testing cycle typically requires cars to drive around gathering data, and that data to subsequently be integrated into a simulated system called Matrix. With COVID-19, the ability to run tests
Grafana is an open source visualization and monitoring tool that is used for creating dashboards and charting time series data. Grafana is used by thousands of companies to monitor their infrastructure. It is a popular component in monitoring stacks, and is often used together with Prometheus, Elasticsearch, MySQL, and other data sources. The engineering complexities
Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow's creation, it has powered the
The post Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor appeared first on Software Engineering Daily.
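As a rough illustration of what "dependency-based" means in the Airflow summary above, here is a minimal sketch in plain Python. It does not use Airflow's API. The task names and the `execution_order` helper are hypothetical, and the ordering comes from the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_events": set(),
    "spark_aggregate": {"extract_events"},
    "hadoop_archive": {"extract_events"},
    "load_snowflake": {"spark_aggregate", "hadoop_archive"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(pipeline)
print(order)
```

A real orchestrator adds scheduling, retries, and distributed execution on top of this ordering step. The sketch only shows how declared dependencies determine which task may run before which.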
The life cycle of data management includes data cleaning, extraction, integration, analysis and exploration, and machine learning models. It would be great if all of this data management could be handled with automation, but unfortunately that is not an option. For most applications, data management requires a human in the loop. A human in the
The post Human in the Loop Data Analytics with Aditya Parameswaran appeared first on Software Engineering Daily.
Kubernetes continues to mature as a platform for infrastructure management. At this point, many companies have well-developed workflows and deployment patterns for working with applications built on Kubernetes. The complexity of some of these deployments may be daunting, and when a new employee joins a company, that employee needs to get quickly onboarded with the
Uber needs to visualize data on a range of different surfaces. A smartphone user sees cars moving around on a map as they wait for their ride to arrive. Data scientists and operations researchers within Uber study the renderings of traffic moving throughout a city. Data visualization is core to Uber, and the company has
A frontend developer issuing a query to a backend server typically requires the developer to issue that query through an ORM or a raw database query. Prisma is an alternative to both of these data access patterns, allowing for easier database access through auto-generated, type-safe query building tailored to an existing database schema. By integrating
The post Prisma: Modern Database Tooling with Johannes Schickling appeared first on Software Engineering Daily.
Machine learning workflows have had a problem for a long time: taking a model from the prototyping step and putting it into production is not an easy task. A data scientist who is developing a model is often working with different tools, or a smaller data set, or different hardware than the environment which that
The post Tecton: Machine Learning Platform from Uber with Kevin Stumpf appeared first on Software Engineering Daily.
Many data sources produce new data points at a very high rate. With so much data, the issue of data quality emerges. Low-quality data can degrade the accuracy of machine learning models that are built around those data sources. Ideally, we would have completely clean data sources, but that's not very realistic. One alternative
The post HoloClean: Data Quality Management with Theodoros Rekatsinas appeared first on Software Engineering Daily.
Server infrastructure traditionally consists of monolithic servers containing all of the necessary hardware to run a computer. These different hardware components are located next to each other, and do not need to communicate over a network boundary to connect the CPU and memory. LegoOS is a model for disaggregated, network-attached hardware. LegoOS disseminates the traditional
Kubernetes has become a highly usable platform for deploying and managing distributed systems. The user experience for Kubernetes is great, but is still not as simple as a full-on serverless implementation, or at least that has been a long-held assumption. Why would you manage your own infrastructure, even if it is Kubernetes? Why not use autoscaling Lambda
Every software company is a distributed system, and distributed systems fail in unexpected ways. This ever-present tendency for systems to fail has led to the rise of failure testing, otherwise known as chaos engineering. Chaos engineering involves the deliberate failure of subsystems within an overall system to ensure that the system itself can be resilient
Brex is a credit card company that provides credit to startups, mostly companies which have raised money. Brex processes millions of transactions, and uses the data from those transactions to assess creditworthiness, prevent fraud, and surface insights for the users of their cards. Brex is full of interesting engineering problems. The high volume of transactions
Devices on the edge are becoming more useful with improvements in the machine learning ecosystem. TensorFlow Lite allows machine learning models to run on microcontrollers and other devices with only kilobytes of memory. Microcontrollers are very low-cost, tiny computational devices. They are cheap, and they are everywhere. The low-energy embedded systems community and the machine
For the last five months, we have been working on a new version of Software Daily, the platform we built to host and present our content. We are creating a platform that integrates the podcast with a set of other features that make it easier to learn from the audio interviews. Software Daily includes the
Over the last 5 years, web development has matured considerably. React has become a standard for frontend component development. GraphQL has seen massive growth in adoption as a data fetching middleware layer. The hosting platforms have expanded beyond AWS and Heroku, to newer environments like Netlify and Vercel. These changes are collectively known as the
Geospatial analytics tools are used to render visualizations for a vast array of applications. Data sources such as satellites and cellular data can gather location data, and that data can be superimposed over a map. A map-based visualization can allow the end user to make decisions based on what they see. ArcGIS is one of
The post ArcGIS: Geographic Information Software with Max Payson appeared first on Software Engineering Daily.
Customer data infrastructure is a type of tool for saving analytics and information about your customers. The company that is best known in this category is Segment, a very popular API company. This customer data is used for making all kinds of decisions around product roadmap, pricing, and design. RudderStack is a company built around
The post RudderStack: Open Source Customer Data Infrastructure with Soumyadeb Mitra appeared first on Software Engineering Daily.
Matterport is a company that builds 3-D imaging for the inside of buildings, construction sites, and other locations that require a "digital twin." Generating digital images of the insides of buildings has a broad spectrum of applications, and there are considerable engineering challenges in building such a system. Matterport's hardware stack involves a camera built
The post Frontend Performance with Anycart's Rafael Sanches appeared first on Software Engineering Daily.
Amazon's virtual server instances have come a long way since the early days of EC2. There is now a wide variety of configuration options for spinning up an EC2 instance, which can be chosen based on the workload that will be scheduled onto a virtual machine. There are also Fargate containers and AWS
A credit score is a rating that allows someone to qualify for a line of credit, which could be a loan such as a mortgage, or a credit card. We are assigned a credit score based on a credit history, which could be related to work history, rental payments, or loan repayments. One problem with
The post International Consumer Credit Infrastructure with Brian Regan and Misha Esipov appeared first on Software Engineering Daily.
A large software company such as Dropbox is at a constant risk of security breaches. These security breaches can take the form of social engineering attacks, network breaches, and other malicious adversarial behavior. This behavior can be surfaced by analyzing collections of log data. Log-based threat response is not a new technique. But how should
The post Grapl: Graph-Based Detection and Response with Colin O'Brien appeared first on Software Engineering Daily.
Infrastructure-as-code tools are used to define the architecture of software systems. Common infrastructure-as-code tools include Terraform and AWS CloudFormation. When infrastructure is defined as code, we can use static analysis tools to analyze that code for configuration mistakes, just as we could analyze a programming language with traditional static analysis tools. When a developer writes
The post Static Analysis for Infrastructure with Guy Eisenkot appeared first on Software Engineering Daily.
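To make the idea of static analysis over infrastructure definitions concrete, here is a minimal sketch in Python. It makes several assumptions: the resource dictionaries, the field names (`public_read`, `encrypted`), and the two rules are hypothetical, and they do not reflect the format of Terraform or AWS CloudFormation or the behavior of any particular analysis tool.

```python
# Hypothetical resource definitions, loosely shaped like parsed infrastructure-as-code.
resources = [
    {"type": "storage_bucket", "name": "logs", "public_read": True, "encrypted": False},
    {"type": "storage_bucket", "name": "backups", "public_read": False, "encrypted": True},
]


def check_resource(resource):
    """Apply two illustrative rules to one resource and return any findings."""
    findings = []
    if resource.get("public_read"):
        findings.append(f"{resource['name']}: bucket is publicly readable")
    if not resource.get("encrypted"):
        findings.append(f"{resource['name']}: encryption is disabled")
    return findings


def analyze(all_resources):
    """Collect findings across every resource without deploying anything."""
    return [finding for r in all_resources for finding in check_resource(r)]


findings = analyze(resources)
for finding in findings:
    print(finding)
```

The point of the sketch is that the checks run against the definition itself, before any infrastructure is created, in the same way a linter inspects source code before it runs.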
Social distancing has been imposed across the United States. We are running an experiment unlike anything before it in history, and it is likely to have a lasting impact on human behavior. By looking at location data of how people are moving around today, we can examine the real-world impacts of social distancing. SafeGraph is
Dropbox is a consumer storage product with petabytes of data. Dropbox was originally started on the cloud, backed by S3. Once there was a high enough volume of data, Dropbox created its own data centers, designing hardware for the express purpose of storing user files. Over the last 13 years, Dropbox's infrastructure has developed hardware,
"Data stream" is a term that can be used in multiple ways. A stream can refer to data in motion or data at rest. When a stream is data in motion, an endpoint is receiving new pieces of data on a continual basis. Each new data point is sent over the wire and captured by
The post Pravega: Storage for Streams with Flavio Junqueira appeared first on Software Engineering Daily.
Redis is an in-memory object storage system that is commonly used as a cache for web applications. This core primitive of in-memory object storage has created a larger ecosystem encompassing a broad set of tools. Redis is also used for creating objects such as queues, streams, and probabilistic data structures. Machine learning systems also need
For many applications, a transactional MySQL database is the source of truth. To make a MySQL database scale, some developers deploy their database using Vitess, a sharding system built on top of Kubernetes. Jiten Vaidya and Anthony Yeh work at PlanetScale, a company that focuses on building and supporting MySQL databases sharded with Vitess. Their
The post Multicloud MySQL with Jiten Vaidya and Anthony Yeh appeared first on Software Engineering Daily.
We are all living in social isolation due to the quarantine from COVID-19. Isolation is changing our habits and our moods, ravaging the economy, and changing how we work. One positive change is that more people have been reconnecting with their friends and family over frequent calls and video chats. Isolation is not a normal
A data warehouse is a system for performing fast queries on large amounts of data. A data lake is a system for storing high volumes of data in a format that is slow to access. A typical workflow for a data engineer is to pull data sets from this slow data lake storage into the
A content management system (CMS) defines how the content on a website is arranged and presented. The most widely used CMS is WordPress, the open source tool that is written in PHP. A large percentage of the web consists of WordPress sites, and WordPress has a huge ecosystem of plugins and templates. Despite the success
The post JAMStack Content Management with Scott Gallant, Jordan Patterson, and Nolan Phillips appeared first on Software Engineering Daily.
A data workflow scheduler is a tool used for connecting multiple systems together in order to build pipelines for processing data. A data pipeline might include a Hadoop task for ETL, a Spark task for stream processing, and a TensorFlow task to train a machine learning model. The workflow scheduler manages the tasks in that
A relational database often holds critical operational data for a company, including user names and financial information. Since this data is so important, a relational database must be architected to avoid data loss. Relational databases need to be a distributed system in order to provide the fault tolerance necessary for production use cases. If a
Python is the most widely used language for data science, and there are several libraries that are commonly used by Python data scientists, including NumPy, pandas, and scikit-learn. These libraries improve the user experience of a Python data scientist by giving them access to high-level APIs. Data science is often performed over huge datasets,
Chatbots became widely popular around 2016 with the growth of chat platforms like Slack and voice interfaces such as Amazon Alexa. As chatbots came into use, so did the infrastructure that enabled chatbots. NLP APIs and complete chatbot frameworks came out to make it easier for people to build chatbots. The first suite of chatbot
Serverless computing is a way of designing applications that do not directly address or deploy application code to servers. Serverless applications are composed of stateless functions-as-a-service and stateful data storage systems such as Redis or DynamoDB. Serverless applications allow for scaling up and down the entire architecture, because each component is naturally scalable. And this
The post Cloudburst: Stateful Functions-as-a-Service with Vikram Sreekanti appeared first on Software Engineering Daily.
NGINX is a web server that can be used to manage the APIs across an organization. Managing these APIs involves deciding on the routing and load balancing across the servers which host them. If the traffic of a website suddenly spikes, the website needs to spin up new replica servers and update the API gateway
Web development has historically had more work being done on the server than on the client. The observability tooling has reflected this emphasis on the backend. Monitoring tools for log management and backend metrics have existed for decades, helping developers debug their server infrastructure. Today, web frontends have more work to do. Detailed components in
Zoom video chat has become an indispensable part of our lives. In a crowded market of video conferencing apps, Zoom managed to build a product that performs better than the competition, scaling with high quality to hundreds of meeting participants, and millions of concurrent users. Zoom's rapid growth in user adoption came from its focus
Facebook applications use maps for showing users where to go. These maps can display businesses, roads, and event locations. Understanding the geographical world is also important for performing search queries that take into account a user's location. For all of these different purposes, Facebook needs up-to-date, reliable mapping data. OpenStreetMap is an open system for
The post Facebook OpenStreetMap Engineering with Saurav Mohapatra and Jacob Wasserman appeared first on Software Engineering Daily.
NGINX is a web server that is also used as a load balancer, an API gateway, and a reverse proxy, among other purposes. Core application servers such as Ruby on Rails are often supported by NGINX, which handles routing the user requests between the different application server instances. This model of routing and load balancing between different
Shopify is a platform for selling products and building a business. It is a large e-commerce company with hundreds of engineers and several different mobile apps. Shopify's engineering culture is willing to adopt new technologies aggressively, trying new tools that might provide significant leverage to the organization. React Native is one of those technologies. React
Ceph is a storage system that can be used for provisioning object storage, block storage, and file storage. These storage primitives can be used as the underlying medium for databases, queueing systems, and bucket storage. Ceph is used in circumstances where the developer may not want to use public cloud resources like Amazon S3. As
Data analysts need to collaborate with each other in the same way that software engineers do. They also need a high quality development environment. These data analysts are not working with programming languages like Java and Python, so they are not using an IDE such as Eclipse. Data analysts predominantly use SQL, and the tooling
When a developer spins up a virtual machine on AWS, that virtual machine could be purchased using one of several types of cost structures. These cost structures include on-demand instances, spot instances, and reserved instances. On-demand instances are often the most expensive, because the developer gets reliable VM infrastructure without committing to long-term pricing. Spot
Machine learning models require the use of training data, and that data needs to be labeled. Today, we have high-quality data infrastructure tools such as TensorFlow, but we don't have large high-quality data sets. For many applications, the state of the art is to manually label training examples and feed them into the
The post Snorkel: Training Dataset Management with Braden Hancock appeared first on Software Engineering Daily.
A workflow is an application that involves more than just a simple request/response communication. For example, consider a session of a user taking a ride in an Uber. The user initiates the ride, and the ride might last for an hour. At the end of the ride, the user is charged for the ride and
Kafka is a distributed stream processing system that is commonly used for storing large volumes of append-only event data. Kafka has been open source for almost a decade, and as the project has matured, it has been used for new kinds of applications. Kafka's pubsub interface for writing and reading topics is not ideal for
The post ksqlDB: Kafka Streaming Interface with Michael Drogalis appeared first on Software Engineering Daily.