Once a well-trained machine-learning model has been deployed, the data ingestion pipeline for that model will also be deployed. That pipeline consists of a collection of tools and systems used to fetch, transform, and feed data to the machine-learning system in production. However, that pipeline cannot be finalized during the development of the model it feeds. Finalizing the process of data ingestion before models have been run and your hypotheses about the business use case have been tested often leads to a lot of rework. Early experiments almost always fail, so be careful about investing large amounts of time in building a data ingestion pipeline until there is enough accumulated evidence that a deployed model will help the business.

Instead of building a complete data ingestion pipeline, data scientists will often use sparse matrices during the development and testing of a machine learning model. Sparse matrices represent certain kinds of data, for example word counts from a set of documents, in a way that reduces memory use and processing time. The SciPy package provides Python libraries for working with sparse matrices, and the code block below imports SciPy along with NumPy for the calculations. Sparse matrices offer a middle ground between a comprehensive data warehouse with extensive coverage on one side and a directory of text files and database dumps on the other. Sparse matrices do not work for all data types, but in situations where they are an appropriate technology, you can leverage them even under load in production.

A sparse matrix is one in which most of the values are equal to zero. If the number of zero-valued elements divided by the total size of the matrix is greater than 0.5, the matrix is considered sparse. In this code block, we generate an array of 100,000 random integers between zero and two, reshape that array into a 100 by 1,000 matrix, and then compute the sparsity, which turns out to be 0.5007, so it fits the definition of a sparse matrix.

Very large matrices require significant amounts of memory. For example, if we make a matrix of counts for a document or a book where the features are all known English words, the chances are high that your personal machine doesn't have enough memory to represent it as a dense matrix. Sparse matrices have the additional advantage of getting around the time-complexity issues that arise with operations on large dense matrices.

In this code block, we create a 10 by 100 array of random numbers drawn from a Poisson distribution. Then we cast that dense matrix into a coordinate-format (COO) matrix using coo_matrix, and finally convert it back into a dense matrix. Let's go ahead and look at the different matrices here. First we print matrix A, which is just the dense array of random numbers. Then we cast it into a coordinate matrix, matrix B; notice that this sparse coordinate matrix is made up of the coordinates of the individual nonzero integers and their values. Then, of course, we cast it back into a new dense matrix, matrix C. Take a look at it, and sure enough, there it is: the dense matrix with all of the zeros and ones.
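Here is a minimal sketch combining the three code blocks narrated above. The exact random values will differ from run to run, so the sparsity will be near but not exactly 0.5007, and the Poisson rate of 0.3 is an assumption (any small rate keeps the matrix mostly zeros):

```python
import numpy as np
from scipy import sparse

# 100,000 random integers between zero and two (i.e. 0s and 1s),
# reshaped into a 100 x 1000 matrix
A = np.random.randint(0, 2, 100000).reshape(100, 1000)

# Sparsity: the fraction of elements that are zero
sparsity = 1.0 - (np.count_nonzero(A) / A.size)
print(round(sparsity, 4))  # roughly 0.5, e.g. 0.5007

# A 10 x 100 dense matrix of draws from a Poisson distribution
# (the rate of 0.3 is an assumption)
A = np.random.poisson(0.3, (10, 100))

# Cast the dense matrix into coordinate (COO) format: only the nonzero
# values and their (row, column) coordinates are stored
B = sparse.coo_matrix(A)

# Convert the sparse matrix back into a dense matrix
C = B.todense()

print(A)  # the dense array of random numbers
print(B)  # (row, col) value triples for the nonzero entries
print(C)  # dense again, zeros restored
```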
A csc_matrix: when there are repeated entries in the rows or columns, we can remove the redundancy by indicating the location of the first occurrence of a value and then its increment, instead of storing the full coordinates. When the repeats occur in columns, we can use the compressed sparse column (CSC) format. In this example, we again create a dense 10 by 100 matrix of integers drawn from a Poisson distribution. Then we use the sparse csc_matrix function to cast it into matrix B, which is a sparse matrix, and print it out. Let's do a side-by-side comparison: first, look at what matrix A looks like. By casting this matrix into a sparse csc_matrix, you save a lot of space, because you store only the values that actually appear in the matrix and where they appear.

It's a lot easier to create matrices from coordinates than to fill in row after row and column after column of a dense matrix. Since the coordinate format is easier to create, it's common to build a matrix using coordinates first and then, if you need to, cast that matrix into another, more efficient format or into a dense matrix. Let's first see how to create a matrix from coordinates. In this example, we define row coordinates, column coordinates, and the values that go into those rows and columns. So, for example, row zero, column one will hold the value one; row one, column zero will hold the value two; and so on. We feed these rows, columns, and values into the coo_matrix Python function to create matrix A. Then, if we want, we can cast matrix A into a dense matrix; that is what's printed out here, the entire matrix in dense format. Notice this is a lot easier than having to fill in and keep track of all the zeros and make sure that the zeros and the non-zero integers land in the right places.

We can then cast that matrix into a CSR matrix: we take matrix A and use its tocsr method to create a new CSR matrix, matrix B. Printed as matrix B, we see the coordinates as well as the integers that appear, the one, the two, the one, and the four, each with the coordinates where it appears. It's also important to note that CSR matrices work directly with the scikit-learn machine learning library and other Python libraries, so you'll probably use CSR matrices a lot if you work with the machine learning libraries in Python.

Because this introduction to sparse matrices is applied to data ingestion, we also need to be able to concatenate matrices; for example, you might want to add a new user to a recommender matrix. You'll also, of course, need to be able to read and write matrices to and from disk. Let's look first at concatenating matrices. In this example, we create a matrix C by reshaping a simple array into one row and nine columns and casting it into a CSR matrix. The values that appear here are the shapes of the matrices: matrix B, which we created previously, is nine by nine, and the new matrix C is one by nine.
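A sketch of the steps just described. The narration only reveals a few of the coordinates and values (1 at row 0, column 1; 2 at row 1, column 0; and the values 1, 2, 1, 4 in a 9 by 9 result), so the remaining coordinates and the contents of the 1 by 9 array are assumptions chosen to match those shapes:

```python
import numpy as np
from scipy import sparse

# CSC example: a dense 10 x 100 Poisson matrix cast to
# compressed sparse column format
A = np.random.poisson(0.3, (10, 100))
B = sparse.csc_matrix(A)
print(A)  # the dense matrix
print(B)  # the compressed sparse column view

# Build a matrix directly from coordinates: value 1 at (0, 1),
# value 2 at (1, 0), and so on (the last two coordinates are assumed)
rows = np.array([0, 1, 2, 8])
cols = np.array([1, 0, 2, 8])
vals = np.array([1, 2, 1, 4])
A = sparse.coo_matrix((vals, (rows, cols)))
print(A.todense())  # the dense view, with the zeros filled in for us

# Cast the COO matrix into CSR, the format scikit-learn works with directly
B = A.tocsr()
print(B)        # coordinates and values of the nonzero entries
print(B.shape)  # (9, 9)

# A new 1 x 9 row to append later, e.g. a new user in a recommender matrix
C = sparse.csr_matrix(np.array([0, 1, 0, 0, 2, 0, 0, 0, 1]).reshape(1, 9))
print(B.shape, C.shape)  # (9, 9) (1, 9)
```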
Then we take matrix B, which we created earlier, and the new matrix C, which we can think of as a new entry, and use the sparse vstack function to stack them together into a new matrix D, which we then print as a dense matrix. So here you have the old matrix B, with the new matrix C added as the last row.

In this example, we have a file called sparse_matrix.npz. We save the matrix D that we just created to that file and then load it from the file into a new matrix E. Then we print the shape of E, which is 10 by 9, matching the shape of the stacked matrix we just created with its extra row.
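A sketch of the stacking and round trip to disk, rebuilding B and C under the same assumptions as the previous block (the filename sparse_matrix.npz comes from the narration):

```python
import numpy as np
from scipy import sparse

# Recreate B (9 x 9) and C (1 x 9) from the previous example
rows = np.array([0, 1, 2, 8])
cols = np.array([1, 0, 2, 8])
vals = np.array([1, 2, 1, 4])
B = sparse.coo_matrix((vals, (rows, cols))).tocsr()
C = sparse.csr_matrix(np.array([0, 1, 0, 0, 2, 0, 0, 0, 1]).reshape(1, 9))

# Stack vertically: C becomes the last row of the new matrix D
D = sparse.vstack([B, C])
print(D.todense())

# Write the sparse matrix to disk, then load it back into a new matrix E
sparse.save_npz("sparse_matrix.npz", D)
E = sparse.load_npz("sparse_matrix.npz")
print(E.shape)  # (10, 9)
```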