The key concept we'll explore is how data is stored, and therefore how it's processed. There are different abstractions for storing data, and storing data in one abstraction instead of another makes different processes easier or faster. For example, if you store data in a file system, it's easier to retrieve that data by name. If you store data in a database, it's easier to find data using query logic such as SQL. If you store data in a processing system, it's easier and faster to transform the data, not just retrieve it.

The data engineer needs to be familiar with the basic concepts and terminology of data representation. For example, if a problem is described using the terms rows and columns, concepts used in SQL, you might think of a SQL database such as Cloud SQL or Cloud Spanner. If an exam question describes an entity and a kind, which are concepts used in Cloud Datastore, and you don't know what they are, you'll have a difficult time answering the question. You won't have time or resources to look these up during the exam; you need to know them going in. So, an exam tip: it's good to know how data is stored and what purpose or use case each storage or database service is optimized for.

Flat serialized data is easy to work with, but it lacks structure and therefore meaning. If you want to represent data that has meaningful relationships, you need a method that represents not only the data but also the relationships. CSV, which stands for comma-separated values, is a simple file format used to store tabular data. XML, which stands for eXtensible Markup Language, was designed to store and transport data and to be self-descriptive. JSON, which stands for JavaScript Object Notation, is a lightweight data interchange format based on name-value pairs and ordered lists of values, which maps easily to common objects in many programming languages.

Networking transmits serial data as a stream of bits, zeros and ones, and data is stored as bits. That means if you have a data object with a meaningful structure, you need some method to flatten and serialize the data first so that it's just zeros and ones. Then it can be transmitted and stored, and when it's retrieved, the data needs to be deserialized to restore the structure into a meaningful data object. One example of software that does this is Avro. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and it serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. We'll look at a short sketch of this serialize and deserialize round trip in a moment.

It also helps to understand the data types supported in different representation systems. For example, there's a data type in modern SQL called NUMERIC. NUMERIC is similar to floating point; however, it provides 38 digits of precision, with nine of those digits available to the right of the decimal point. NUMERIC is very good at storing the common fractions associated with money. It avoids the rounding error that occurs in a full floating-point representation, so it's used primarily for financial transactions. Now, why did I mention the NUMERIC data type? Because to understand NUMERIC, you have to already know the difference between integer and floating-point numbers, and you have to know about the rounding errors that can occur when performing math on some kinds of floating-point representations. If you understand this, you understand a lot of the other items you ought to know for SQL and data engineering. You should also make sure you're familiar with these basic data types.
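To see the kind of rounding error an exact type like NUMERIC avoids, here's a minimal sketch in Python. Python's decimal module is standing in for a SQL NUMERIC column here; that analogy, and the price values, are illustrative assumptions rather than anything from the exam material.

```python
# A minimal sketch: binary floating point cannot represent 0.10 exactly,
# so repeated math on money values drifts; an exact decimal type
# (the idea behind SQL NUMERIC) does not.
from decimal import Decimal

total_float = sum(0.10 for _ in range(3))               # 0.30000000000000004
total_decimal = sum(Decimal("0.10") for _ in range(3))  # Decimal('0.30')

print(total_float == 0.3)               # False -> rounding error
print(total_decimal == Decimal("0.3"))  # True  -> exact
```

The design point is the same one NUMERIC makes: store money as exact decimal digits rather than as binary fractions.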
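And here's the serialize and deserialize round trip mentioned above, sketched with the third-party fastavro package. fastavro is just one of several Avro libraries, and the schema, field names, and record values are made up for illustration.

```python
# A minimal sketch: define an Avro schema in JSON, serialize structured
# records to compact binary, then deserialize them back into Python dicts.
from io import BytesIO
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "latency_ms", "type": "int"},
    ],
})

records = [{"user_id": "u123", "url": "/home", "latency_ms": 42}]

buf = BytesIO()
writer(buf, schema, records)   # serialize: structured objects -> bytes

buf.seek(0)
for record in reader(buf):     # deserialize: bytes -> structured objects
    print(record)
```

Notice that the schema itself is plain JSON, which matches the point above: Avro uses JSON to define data types and a compact binary format to store and transmit the records.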
Your data in BigQuery is stored in tables inside a dataset. Here's an example of the abstractions associated with a particular technology. You should already know that every resource in GCP exists inside a project, and besides security and access control, a project is what links usage of a resource to a credit card; it's what makes a resource billable. Then, in BigQuery, data is stored inside datasets, datasets contain tables, and tables contain columns. When you process the data, BigQuery creates a job. Often the job runs a SQL query, although some update and maintenance activities are supported using Data Manipulation Language, or DML. Exam tip: know the hierarchy of objects within a data technology and how they relate to one another.

BigQuery is called a columnar store, meaning that it's designed for processing columns, not rows. Column processing is very cheap and fast in BigQuery, and row processing is slow and expensive. Most queries only work on a small number of fields, and BigQuery only needs to read the relevant columns to execute a query. Since each column has data of the same type, BigQuery can compress the column data much more effectively. You can stream-append data easily to BigQuery tables, but you can't easily change existing values. Replicating the data three times also helps the system determine optimal compute nodes to do filtering, mixing, and so forth.

You treat your data in Cloud Dataproc and Spark as a single entity, but Spark knows the truth. Your data is stored in Resilient Distributed Datasets, or RDDs. RDDs are an abstraction that hides the complicated details of how data is located and replicated in a cluster. Spark partitions data in memory across the cluster and knows how to recover the data through an RDD's lineage, should anything go wrong. Spark also has the ability to direct processing to occur where there are processing resources available. Data partitioning, data replication, data recovery, and pipelining of processing are all automated by Spark so you don't have to worry about them. Here's an exam tip: you should know how the different services store data and how each method is optimized for specific use cases, as previously mentioned, but also understand the key value of the approach. In this case, RDDs hide complexity and allow Spark to make decisions on your behalf.

There are a number of concepts that you should know about Cloud Dataflow. Your data in Dataflow is represented in PCollections. The pipeline shown in this example reads data from BigQuery, does a bunch of processing, and writes its output to Cloud Storage. In Dataflow, each step is a transformation, and the collection of transforms makes a pipeline. The entire pipeline is executed by a program called a runner. For development there's a local runner, and for production there's a cloud runner. When the pipeline is running on the cloud, each step, each transform, is applied to a PCollection and results in a new PCollection. So the PCollection is a unit of data that traverses the pipeline, and each step scales elastically. The idea is to write Python or Java code and deploy it to Cloud Dataflow, which then executes the pipeline in a scalable, serverless context.
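Here's a minimal sketch of that kind of pipeline using the Apache Beam Python SDK, which is the SDK that Cloud Dataflow executes. The table name, bucket path, and filter logic are hypothetical placeholders, not anything from the course.

```python
# A minimal sketch of a Beam pipeline: read from BigQuery, transform,
# write to Cloud Storage. Table and bucket names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner and project options for production

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
              query="SELECT name, score FROM `my_project.my_dataset.scores`",
              use_standard_sql=True)
        | "FilterHighScores" >> beam.Filter(lambda row: row["score"] > 90)    # PCollection in, new PCollection out
        | "FormatAsCsv" >> beam.Map(lambda row: f'{row["name"]},{row["score"]}')
        | "WriteToGCS" >> beam.io.WriteToText("gs://my_bucket/output/scores")
    )
```

Each step consumes one PCollection and produces a new one, which is exactly the transform-to-transform flow described above, and switching the runner option from the local runner to the Dataflow runner is what moves the same code from development to production.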
Unlike Cloud Dataproc, there's no need to launch a cluster or scale the cluster; that's handled automatically. Here are some key concepts from Dataflow that a data engineer should know. In a Cloud Dataflow pipeline, all the data is stored in PCollections. The input data is a PCollection. Transformations make changes to a PCollection and then output another PCollection. A PCollection is immutable, meaning you don't modify it; that's one of the secrets of its speed. Every time you pass data through a transformation, it creates another PCollection. You should be familiar with all the information we've covered in the last few slides, but most importantly you should know that a PCollection is immutable and that immutability is one source of the speed of Cloud Dataflow pipeline processing.

Cloud Dataflow is designed to use the same pipeline, the same operations, and the same code for both batch and stream processing. Remember that batch data is also called bounded data, and it's usually a file. Batch data has a finite end. Streaming data is also called unbounded data, and it might be dynamically generated; for example, it might be generated by sensors or by sales transactions. Streaming data just keeps going, day after day, year after year, with no defined end. Algorithms that rely on a finite end won't work with streaming data. One example is a simple average: you add up all the values and divide by the total number of values. That's fine with batch data, because eventually you'll have all the values. But it doesn't work with streaming data, because there may be no end, so you never know when to divide or what number to use. So what Dataflow does is allow you to define a period, or window, and to calculate the average within that window. That's an example of how both kinds of data can be processed with the same single block of code. Filtering and grouping are also supported. Many Hadoop workloads can be run more easily and are easier to maintain with Cloud Dataflow. But PCollections and RDDs are not identical, so existing code has to be redesigned and adapted to run in a Cloud Dataflow pipeline. This can be a consideration, because it can add time and expense to a project.

Your data in TensorFlow is represented in tensors. Where does the name TensorFlow come from? Well, the flow is a pipeline, just like we discussed with Cloud Dataflow, but the data object in TensorFlow is not a PCollection; it's something called a tensor. A tensor is a special mathematical object that unifies scalars, vectors, and matrices. A rank-0 tensor is just a single value, a scalar. A rank-1 tensor is a vector, having direction and magnitude. A rank-2 tensor is a matrix, and a rank-3 tensor is cube-shaped. Tensors are very good at representing certain kinds of math functions, such as coefficients in an equation, and TensorFlow makes it possible to work with tensor data objects of any dimension. TensorFlow is the open source library that you use to create machine learning models. A tensor is a powerful abstraction because it relates different kinds of data types, and there are transformations in tensor algebra that apply to any dimension or rank of tensor, so it makes solving some problems much easier.
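To make the rank idea concrete, here's a minimal sketch using the TensorFlow Python API; the actual values are arbitrary.

```python
# A minimal sketch of tensors of increasing rank in TensorFlow.
import tensorflow as tf

scalar = tf.constant(3.0)                               # rank 0: a single value
vector = tf.constant([1.0, 2.0, 3.0])                   # rank 1: a vector
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])          # rank 2: a matrix
cube = tf.constant([[[1.0], [2.0]], [[3.0], [4.0]]])    # rank 3: a cube shape

for t in (scalar, vector, matrix, cube):
    print(t.shape, tf.rank(t).numpy())  # shape and rank of each tensor
```

Each constant is the same kind of object, a tensor, just at a different rank, which is what lets the same tensor algebra apply at any dimension.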