Now I want to take a little side side jaunt and talk to you about hashes. The key to why I like talking about hashes is I just want to be able to use it in a sentence and people think that hashes are magic, but they're not really magic. There's a lot of science behind hashes. They're critical for things like dictionaries and database indexes, and they're really, really important. So I want to dig in a little bit not so much so you could build a hash function, but so that you know how they're built so that you know they're not magic. So a hash function is basically a function that takes a large amount of data, from one character to a million characters, and performs some computation on that data and ends up with a fixed-length resulting value. And it's not a random value. It is a value that if you take the same data and run it through the hash function, you'll get the same value. But the idea of the output of the hash function is a fixed size. It might be 64 bits, or 128 bits, or 256 bits, but that's the idea. Now, within that, there is a whole bunch of science. Now, why do we need hashes? So, you send a bunch of data over a network, and that network might have some flaw. It might confuse something. It might flip a bit or something. And so what we would do is we would send the message and then we'd compute a hash as we were sending the message. And then we would send a fixed-length 128-bit checksum in the message. And so then we would receive the message, run the same calculation on the data as it comes in. We check the checksum. If something goes wrong, it's thrown away. And the checksum is often done in networking on much lower level protocols like Ethernet or Wi-Fi. They would automatically retransmit. So checksums are not something you're going to see a lot in your use of networks, say, in Python or whatever. But in the underlying media like the Wi-Fi or the Ethernet or even a phone connection, there's a lot of data checking going back and forth. Because Wi-Fi's not perfect, right? Even Ethernet's not perfect and phone modems are also not perfect. And if you're using sort of wireless on your phone, that's not perfect. And so when your data is being sent between your phone and a tower, there's a little checksum. And then what happens is the tower rejects it and says, you know, I didn't get all that because somebody did something, like a car started their engine or caused some interference. And so it ate part of the message, but it knows that because of this checksum. So hashes are a checksum. Another one is a cryptographic hash. And so we can use hashes sort of for basic testing or we can use them for like checking a signature. And in a sense, when we're using a hash for cryptography we hold it to a higher standard. And hash functions are essential to Python dictionaries and database tables and database indexes. And so it's really an important kind of underlying technology concept that I want you to understand. So there's a couple of things that we look for in hash functions. And people spend their whole life researching hash functions, literally. It's an area of research in computer science. We don't have to know it, but we can be thankful that some people build these really good hash functions. So deterministic means that it's not a random number. You put the same large amount of data in and the same thing goes out. Sensitive means if you send 100,000 characters and you change one character, the hash better be different, right? So uniform distribution, is kind of related to sensitivity in that you shouldn't just like generate, if you just take a bunch of random data and compute hashes on all the data, you should hope that it uses all the bits. So if it's like a 64-bit hash, it shouldn't just give you numbers from one to ten. It needs to use all possible combinations of 64 bits. So just uniform distribution. So it's pretty easy to take a bunch of random data, you run it through the hash function, and you kind of see where it happens. And that's so that they don't cluster or they don't collide. Because if you have a bunch of hundred-character strings, you want them spread out in the hash. Sensitive again, any change in input, whether it's adding a character or changing a character, should provide a change in output. One-way means that you should not be able to derive the input looking at the output. You can't go backwards. Now in a sense, if you have a million characters, and it comes down to 64 bits, it's kind of impossible to go backwards. But sometimes you're hashing things that are pretty short, like passwords. And actually hashing passwords is a common technique. so you're not storing the plaintext password, but it can also get you in trouble. Short passwords can become hashes, but then you can go backwards. And this is kind of the one of the reasons that they want passwords to be really long. Because if you have a really long password, then it's longer than the hash and it's harder to reverse it. And so we'll talk a little bit more about cryptographic problems. So I want to show you, I know it's just easy to read the Wikipedia. But I want to actually show you a hash computation. Okay? So I'm going to show you a couple of operators in Python that you probably will never use and after this, you'll probably never see, but I don't care. I think it's awesome. Okay, so what we're going to do here, let me get this nice pointer thing. So remember I was talking about the actual bits, right? So this is the zeros and ones. And this is the zeros and ones are how the computer actually stores the data and then things like 15. So this 11111, so this is 1, 2, 4, 8, 16. So this 15 is 8 plus 4 plus 2 plus 1 that's 11111 and that's 15. And so that's how we represent digital integer numbers. And you can print numbers out. So I got like x equals 15 and y is the ordinal position of H, which is 72. And I'm using this Python format and this 08b to say print this out. Print out x followed by x as an integer. But then I want to see x as bits. And so that's why we see x 15 and then 00001111, right? So I can print these things out. And I'm only showing you this because I want to show you debugging, okay? So these operators shift operate on the bits. So left shift the 1 shifts these bits over. An exclusive OR combines the bits in a way that if the bits are different it becomes a 1 and if the bits are the same it becomes a 0. So 1 plus 0 is 1, 0 plus 1 is 1 and 1 plus 1 is 0. And 0 plus 0 is 0. So if they're different, it's a 1. And it's a way of doing a, it's kind of an addition, but it's a merging addition. So addition makes numbers get bigger all the time. You can exclusive OR two numbers together and it's kind of a merger of the two numbers, but it doesn't get bigger. So if I keep adding 10 to something and so it's a way of taking each letter and sort of merging it into the same width. So I could take just the exclusive OR of a bunch of letters. And AND is what's called masking. So if we take this 15, which is all 1 bits here and 0 bits at the top. So it's four 0 bits and four 1 bits. It's a way to take this x, which is 72 right here. And it basically masks off and forces all the first four bits to be 0 and then it copies those bits. So that's an AND, okay? So that's this and you can take a little time to understand this but then I want to move on and show you how a hash function works. So this is like a very simple hash function in Python and you can grab the code for this. And the only line that really matters is right here. And so we're going to enter a string, we're going to stick with ASCII but it sort of doesn't matter. I'll just get out of this loop if there's nothing there and this hash value is an accumulation. And so what we're going to do is go through each line, each character, and we're to take the old hash value and we're going to left shift it one, I'll show you this over here. Then we're going to exclusive OR it with the ordinal number, the integer number that's equivalent to the letter, and then we're not going to let it get bigger than 32 bits. And so we're going to AND it with 0x, that's a hexadecimal digit, ffffff, six f's, and that just means this thing is going to get bigger, we're going to jam in. So it's like we're taking a character, we're shoving it over to the left, we're merging in the next character, we're moving that to the left, merge the next character, move it to the left, merge the next character, move it to the left, merge the next character, move it to the left. And then what we're doing is we're chopping it off anything that gets bigger than 6 times 4, 24. No. Yeah. Well, I got it. Well it stops at that size. So, and so then what I'm doing is if this hash values out to less than 2000, I'm printing it out in the letter, the bits of the actual letter, the working running hash value, and then the letter itself that we're looking at and a hash value in an integer. So if you look at Hello, so what you see is that the first character is this that's an H, and the hash value is just the first character. Then we get this e. So what happens is the current hash value gets shifted left one and then we combine using exclusive OR into that left shift so the e affects all these bits. So we're like shift, merge, shift, merge, shift, merge and you see this thing growing and eventually it will stop growing because we're only going to let them get as long as this. We are only going to get them this long, okay? But I'm only showing you the first part. So you don't have to fully understand this. You kind of see this shift, merge, shift, merge, shift, merge with a maximum length and part of hash functions is that they're a maximum length. And so if I end up with Hello, the hash result is 1 7 1 1 and this is a sensitivity. I change one character, the uppercase H to the lowercase h, and then the hash result is 11 99, so one character change needs to change the resulting hash. Now if I just flip two characters, e and h, that also changes, so this is the same characters but in slightly different order and so this is 1047. And again, that's why you shift then merge and shift and merge. So if you put an H then shift and merge an e it's quite different than putting in an e and then shifting and merging the H. So order matters, length matters, the characters matter, upper lowercase matters. What I'm showing you is a lousy hash function, okay? But it's one that if you take a look at you can kind of understand, you can run the code. and you see this shift, merge, shift. Shift, merge, mask, shift, merge, mask, shift, merge, mask. Okay? Now, a thing you might want to try is can you come up with two strings, just short short strings, that come up with the same hash? So far I've shown you no collisions, right? But a hashing function is bad if you can take two input strings that come to the same hash value. This one's not too hard because it's a terrible hashing function, but it is the essence of how hashing functions work. Okay? So designing real hash calculations is serious work. There is this organization called the National Institute of Standards and Technology called NIST that runs multi-year competitions when new hashing algorithms are needed. Now, what happens is we have a hashing algorithm, somebody comes up with it, maybe they publish a paper. In the old days they just like here's my hashing function. We'll be like, yeah, that looks pretty good and then everyone starts using it. And then someone would do research on what's wrong with that hashing function and we call these flaws. We're like well, I have these two strings and I run them through your hashing function, so my hashing function is highly flawed. All you have to do to kind of discredit my hashing function, this one here, if you find two strings, and it's not hard. If you find two strings that come up with the same hash that are different then you're like that's a lousy hashing function. And so what happens is this is in some ways some of the purest computer science research because a hashing is a beautiful simple algorithm that's easily expressed. And I can put it out there and then another person can find the flaw in it and they just say here's the two strings that crash, your hashing function is flawed, or I fixed your hashing function and I think mine is better than yours. So what happens is NIST will have competitions where teams from all over the world will propose new hashing functions. And then they have so that everyone has to sort of like put out their hashing function for the new hash, the new technique, and then all the teams go attack each other's hashing function. And so then they like, so you got 30 teams doing it, you're one, you put yours in, and then you go attack 29 of them, right? And so you have 29 people beating on your code to try to find out what's wrong with yours, right? And then they have a round where they, okay, here's the top 10 these 20 are [SOUND] everyone blew those up. So and then what they do is have another round of those 10, so they maybe let them change them a little bit from the learning of all the things that were wrong with them, so they tweak them a little more. And then they have a series of things and this takes more than a year and they have conferences about it. And it's really a super fascinating thing because when it's all said and done it's a little tiny function that we call MD5, okay? So tons of research, lots of people. MD5 is the wrong answer. the next one, SHA-256, is the one that was a multi-year. MD5 is a very early hash that was beloved, right? It's a 128-bit hash, which means you get a 128 bits out no matter what. And so when you go look at hashing, you go start reading up on hashing in Wikipedia, you'll see these pictures, right? And so what we're basically doing here is these are shifts, combinations, and exclusive ORs, okay? And so that's what's going on here, this is like exclusive OR this. So this is like that old hv value before and after because I wrote a loop that goes through every character and this is telling you what to do when you encounter a character. And I think this is going through four characters at a time A B C and D and it's doing something to those characters and shifting it and exclusive ORing it and doing a whole bunch of stuff and so that's what's going on here. And so somebody came up with this. It's a loop just like the one showed you except the middle part's a lot more complex because it's actually going through four characters at a time. And I don't even know the year but this has been around a long time and it's widely implemented. There's a function inside Postgres called MD5 that takes any string and gives you back the shortened version of MD5, we'll show you that in a second. Widely implemented, but it was also ferociously broken for cryptography. Now people will tell you that a MD5 is terrible, and for like if you have a thousand documents and you want to check which ones are unique, MD5 is fine for that. The problem with cryptography because the people making those documents aren't trying to construct the documents to cause a hash collision. If you're just looking at documents that are not intentionally trying to break MD5, it's actually not that easy to break MD5. It turns out there's whole conferences these days on breaking MD5, which is I think wonderfully fascinating. It's all been broken, but now they're trying to break it faster. And if it's cryptography and you are sending data to a trusted source, two trusted sources, and I can grab your data and I can twiddle with it and then move it on and the signature verifies, that's when it's broken for cryptography. So if it's going by, you signed it with MD5. I grab it, I alter it, and I keep the same MD5 signature, then I can fake your data. And so that's why MD5 is broken in those situations where someone with an intention to alter your data in transit can do so. The other problem that happened with MD5 that's kind of funny and fun is MD5 is not technically reversible. But because people used MD5 to hash passwords, so they would take your your password which is puppy123, they would run it through MD5 which gets this random number. They would store that in the database and then when you type your password puppy123 it would hash that and check the database. And that way they're not storing your plaintext password in the database. And so you think oh, that's great because MD5 is not really reversible to do all possible combinations of all possible passwords. But the problem is that the way we do passwords is there's a lot of common things. There's a lot of sequences of numbers like 1, 2, 3, the word puppy, the word puppy with the u turned into something. And so what's happened is the world has collected a lot of plaintext passwords and then run an MD5 computation on those plaintext passwords and ended up with terabytes of data. But if you type in an MD5 that is in the strings that they've pre-calculated, you can in an instant get the string back. So it's not really running the calculation backwards, you can't sort of run the calculation backwards. But if you run the calculation forwards on a large enough set of input strings you can then do a seeming backwards calculation back for those strings and for passwords in particular. So you can Google rainbow tables and you can create an MD5 of a puppy123 and then you can go backwards go to rainbow table and then type in the hash that comes out and it will come back with puppy123. And so again, it's broken for cryptography, whether it's checksums for signatures or whether it is storing data in a table. But it turns out it's not entirely broken for things like uniqueness and so we can use it for things in. Now the hash function that's the SHA-2 family of hashes that started in 2001 and took a couple years to build. The SHA-2 is of various lengths and SHA-256 is a 256-bit version of it. And again, you see this picture of like taking stuff and shifting and masking and ORing and doing and moving and shifting and doing this, that, and the other thing, right? with exclusive ORs and it's very complex and there was many hashes that competed to be the SHA-2. And so SHA-256 is built in and so you'll see it's a longer hash. The MD5, you can do MD5 function on any text that we have inside of Postgres and SHA-256. And this isn't just in Postgres, there's all kinds of places that have functions like MD5 and SHA-256 to compute these things. So that's a little bit of how hashes work and I just want you to basically understand hashes. And up next we're going to talk about how we use hashes and other kinds of index to really speed things up.