[MUSIC] Welcome to Distributed Computing with Spark SQL, produced by UC Davis Continuing and Professional Education in partnership with Databricks. My name is Brooke Wenig, and I lead the Machine Learning Practice team at Databricks. I've been working with Apache Spark for just over six years, and I have a master's degree in computer science from UCLA, focused on distributed machine learning. Fun fact: I enjoy riding bikes and I'm fluent in Mandarin Chinese. I'm accompanied here by my esteemed colleague, Conor Murphy.

>> Hi, I'm Conor Murphy, a Lead Data Scientist at Databricks. I've been focusing my work on Apache Spark and distributed computing for the past five years. Before pivoting my career to technology, I applied my understanding of data in the nonprofit sector with a focus on developing economies. My educational background includes graduate degrees in philosophy and neuroscience. And outside of data engineering and data science problems, I spend much of my time in freefall as a skydiver. Currently, Brooke and I work together at Databricks, which was founded by the original creators of Apache Spark. I wanted to start this course with a bit of high-level motivation, I mean, very high-level motivation. First, why do we care about data anyway? Consider how a good portion of the successes of the human species can be attributed to our advanced use of tools. The computer is one of the most impactful tools we've ever created, and if you know how to write code like SQL queries or perform advanced analytics, you can truly unleash the full potential of these valuable tools. What's more, a rigorous use of data helps us improve our decision making. As humans, we often think about things linearly and locally rather than globally and exponentially. There's far more information in the world than we could realistically hope to process using conventional tools. If we use data well, we can start to step outside of these limitations to improve our decision making, whether we're trying to improve business outcomes, discover a cure for cancer, or build self-driving cars. Data literacy is among the most important 21st-century skills.

>> This course is designed to scale the SQL queries and workloads that you developed in earlier courses in this series. It is designed for students who are already familiar with SQL but want to work on larger datasets, where they have more data than can fit in memory on any single machine. This is where distributed computing and Apache Spark come in. Spark solves the problem of scaling queries to large datasets. Working with large datasets poses a number of unique challenges, and this course will give you the conceptual framework to approach those challenges as well as hands-on experience writing Spark code. Now, let's talk about what you will be able to accomplish by the end of this class. In the first week, we'll cover the core concepts of distributed computing and when and where it is useful. Just because you have more compute doesn't necessarily mean that your queries will finish more quickly. You will also be introduced to the basic data structure in Spark, called the DataFrame. This is a collection of data distributed across a number of machines. We'll also introduce the collaborative Databricks workspace, how to access data, and how to write SQL code that executes against a cluster of networked machines. In module two, we'll discuss the core concepts of Spark, and by the end, you'll be able to apply Spark SQL to optimize your SQL queries.
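To make that concrete, here is a minimal sketch of the pattern the course builds on: loading data into a DataFrame, registering it as a view, and querying it with ordinary SQL that Spark executes in parallel across the cluster. The file path, table name, and column names below are hypothetical placeholders, and in a Databricks notebook the spark session is already created for you.

    # A minimal sketch (hypothetical path, table, and column names) of querying
    # distributed data with Spark SQL. In a Databricks notebook, `spark` already exists.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

    # Read a CSV file into a DataFrame: a table-like collection of rows
    # partitioned across the machines in the cluster.
    df = spark.read.csv("/databricks-datasets/example/sales.csv",
                        header=True, inferSchema=True)

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("sales")

    # Familiar SQL, executed in parallel across the cluster.
    spark.sql("SELECT region, COUNT(*) AS order_count FROM sales GROUP BY region").show()

The SQL itself is the same SQL you already know; what changes is that Spark distributes the work across many machines, which is exactly what module two explores.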
Spark SQL will look very similar to the way you've accessed data in databases in the previous courses, with some key distinctions because of Spark's distributed nature. Spark itself is not a database; it is a computation engine. This means we need to explore how to access data using Spark, as well as discuss how data is partitioned, or subdivided, in memory. A common SQL task involves joining two tables, but joining data works differently in a distributed environment, and we will discuss how to efficiently join distributed data. Another optimization we will employ is caching our data, so that we don't reread our data from the source every time we run a query or continually recompute the same results of a query. Finally, we will examine the Spark user interface to get a better sense of how Spark works under the hood, along with some of the new features in Spark 3.0, including adaptive query execution.

>> In module three, we'll talk about engineering data pipelines. This allows us to go under the hood with how Spark clusters connect to databases using the JDBC protocol, which is a common way of connecting to databases in Java environments. We'll also discuss schemas and types, and why they matter in data pipelines. Certain file formats work well in distributed environments and certain formats do not; we'll discuss some of those trade-offs. We'll also explore best practices for writing data to save the results of our queries. In the final module, we'll cover data warehouses, data lakes, and the new Lakehouse architecture. Lakehouses combine the scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses. We will then dive into the open-source Delta Lake project, no pun intended, to see how we can achieve the best of both worlds. Finally, we will wrap up this course with how to extend your data analysis skills to machine learning, as well as a summary of the key concepts that you have learned so far. All you need to take this class is a working knowledge of SQL, a desire to learn, and internet access. Brooke and I have both dedicated much of our careers to data science and Spark. We hope you'll find the power of the tools and approaches we'll discuss in the coming weeks as exciting and as motivating as we do. And above all else, we hope that they help empower you to be more data driven in whatever domain you choose to apply these powerful tools.