Hi, I’m Jabril and welcome to Crash Course AI!
There used to be a time when a group of friends at dinner could ask a question like “is a hot dog a sandwich?” and it would turn into a basic shouting match with lots of gesturing and hypothetical examples.
But now, we have access to a LOT of human knowledge in the palm of our hands… so our friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none of them have a connected bun like hot dogs (disappointed).
Search engines are a huge part of modern life.
They help us access information, find directions to places, shop, and participate in sandwich arguments.
But how does Google find answers to questions?
How are Siri and Alexa so smart but also easily stumped?
How did IBM’s Watson beat the best Jeopardy players in the world?
Well, search engines are just AI systems that are getting better and better at helping us find what we’re looking for.
INTRO
When we talk about search engines, we typically think about the AI systems online, like Google, Bing, DuckDuckGo and Ask Jeeves.
But the basic ideas behind non-AI search engines have existed for centuries.
Essentially, search engines gather data, create organization systems to sort that data, and find results to a question.
For example, when you needed an answer to a question and couldn’t search online, you could go to the library!
Libraries gather data in the form of books and newspapers that are stacked neatly on the shelves.
Librarians have organization systems to help you find what you’re looking for.
Knowing that magazines are on the shelves by the water fountain, while kids’ books are on the second floor, is a kind of organization.
Plus, fiction books are sorted by the author’s last name, while nonfiction has the Dewey Decimal System, and so on.
Once you (or the librarian) have the resources you need, you’ll be able to find results to your question!
Now, rather than looking through books, web search engines look through all the data on the World Wide Web, aka “the Web”.
And instead of asking a human librarian where to find information, we ask an AI like John-Green-bot.
Jabril: Oh, John Green Bot?
[JGB dialup beeps] Alright, John Green Bot, you're all set.
We're going to need that later.
And just so we’re clear, we’re using “Web” throughout this video even though it might sound a little old-fashioned.
That’s because the Internet and the Web are not the same thing.
The Internet is a collection of computers that send messages to each other.
Video services like Netflix that play on your TV, for example, use the Internet, not the Web.
The Web, on the other hand, is part of the Internet and uses the Internet’s connections to send documents and other content in a format that can be displayed by a browser like Chrome or Safari.
As with most AI systems, the first step is to gather lots of data.
To gather data on the Web, we can use a computer program called a Web crawler, which systematically finds and downloads Web pages.
This is a HUGE task and happens before the search engine AI can take any questions.
It starts on some Web page that we pick, called a seed, and downloads that page and finds all its links.
Then, the crawler downloads each of the linked Web pages and finds their links, and so on... until we’ve crawled the whole Web.
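The crawl described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `FAKE_WEB` dictionary, its page names, and the `crawl` function are all made up for illustration and stand in for real downloads and link extraction.

```python
from collections import deque

# Hypothetical stand-in for the Web: each page name maps to the pages it links to.
FAKE_WEB = {
    "genghis-khan": ["mongols", "marco-polo"],
    "mongols": ["genghis-khan"],
    "marco-polo": ["mongols", "water-polo"],
    "water-polo": [],
}

def crawl(seed, web):
    """Visit pages breadth-first from the seed, following every link once."""
    visited = []           # pages in the order we "downloaded" them
    seen = {seed}          # pages already queued, so none is fetched twice
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("genghis-khan", FAKE_WEB))
```

Keeping a `seen` set matters because pages link back to each other; without it, the crawler would loop between the Genghis Khan and Mongols pages forever.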
After we have collected all the data, the AI’s next step is to organize it by building an index, which is a kind of lookup system.
The kind that’s used for organizing Web pages is called an inverted index, which is like the index in the back of a textbook.
For each word, it lists all of the Web pages that contain that word.
Usually, the Web pages are represented by ID numbers so we don’t have a long, messy list of URLs.
Let’s say page 0 is the seed, which happens to be a page about Genghis Khan.
It has a lot of words on it like “the, mongol, Khan, Genghis, who, and is”.
In this inverted index, page 1 is about Marco Polo, but it mentions the word “Genghis” along with words like “the, Marco, Polo, who, are, and is.” Page 2 is about the Mongols, page 3 is a different webpage about Marco Polo, and page 4 is about Water Polo.
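Here is one way an inverted index like this could be built. It is a sketch, not a real search engine's code: the page texts in `PAGES` are invented to roughly match the five pages described, and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical page contents keyed by ID number (the text is invented).
PAGES = {
    0: "who is Genghis Khan the Mongol",
    1: "Marco Polo is the traveler who met heirs of Genghis",
    2: "the Mongols are riders",
    3: "Marco Polo wrote a book",
    4: "water polo is a sport",
}

def build_inverted_index(pages):
    """Map each lowercase word to the sorted list of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index(PAGES)
print(index["genghis"])  # pages that mention "genghis"
print(index["polo"])     # pages that mention "polo"
```

Notice the index is "inverted" because it goes from words to pages, instead of from pages to words.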
So, let’s say we type “Who is Genghis Khan?” into a search engine.
Our AI can use this inverted index to find results, which in this case, are links to Web pages.
The AI will look at the words “who”, “is”, “Genghis”, and “Khan” and use the inverted index to find relevant pages.
Our AI might find that Web pages zero, one, two and five have at least one of the words from the question “who is Genghis Khan?” When Siri says “I found this for you,” the AI is just returning a list of Web pages that contain the same terms as the question.
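The lookup step can be sketched like this. The small `INDEX` below and the `search` function are assumptions made up for this example; the page IDs are illustrative and aren't meant to reproduce the exact results mentioned above.

```python
# Hypothetical inverted index: word -> list of page ID numbers.
INDEX = {
    "who": [0, 1],
    "is": [0, 1, 4],
    "genghis": [0, 1],
    "khan": [0],
    "polo": [1, 3, 4],
}

def search(question, index):
    """Return every page ID that contains at least one word from the question."""
    matches = set()
    for word in question.lower().split():
        word = word.strip("?!.,")            # drop punctuation stuck to the word
        matches.update(index.get(word, []))  # unknown words match nothing
    return sorted(matches)

print(search("Who is Genghis Khan?", INDEX))
```

In this toy index, page 4 comes back only because it contains the word "is", which shows why matching terms alone doesn't guarantee a relevant result.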
Except… most search engines include one more step.
There are millions of pages online that contain the same terms.
So it’s important for search engines to rank Web pages, so that the top result is more likely to be relevant than the tenth result or the hundredth.
Of course, Google and Bing don’t hire “supervisors” to grade each possible question and answer to help their AI systems learn from training data.
That would take forever, and they wouldn’t be able to keep up with all the new content that gets created every day.
Really, regular users like us do this training for free all the time.
Every time we use a search engine, our behavior tells the AI whether or not the results answered our question.
For example, if we type “who is Genghis Khan” into a search engine, and click on a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find Genghis Khan isn’t ANYWHERE in that movie.
So we’ll bounce back to the search results, and try again until we find a page that answers our question.
A bounce indicates a bad result.
But if we click on a Wikipedia article about Genghis Khan and stay for a while reading, that’s a click through, which probably means that we found what we were looking for… so that indicates a good result.
Human behavior like bounces and click throughs gives AI systems the training data they need to learn how to rank search results and better answer our questions.
Data from the Web and data from how we use the Web help make better and better search engines.
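To make the idea of learning from bounces and click throughs concrete, here is a deliberately simple sketch. Real search engines use far more sophisticated learned models; the scoring formula, the page names, and the visit counts below are all assumptions invented for illustration.

```python
def rank_by_feedback(page_ids, clicks, bounces):
    """Order pages by the share of visits that were click throughs, best first."""
    def score(page_id):
        good = clicks.get(page_id, 0)
        bad = bounces.get(page_id, 0)
        # Add-one smoothing, so a page with no visits gets a neutral 0.5.
        return (good + 1) / (good + bad + 2)
    return sorted(page_ids, key=score, reverse=True)

# Hypothetical feedback collected from users who searched "who is Genghis Khan".
clicks = {"wikipedia-genghis-khan": 90, "wrath-of-khan": 5}
bounces = {"wikipedia-genghis-khan": 10, "wrath-of-khan": 95}

results = ["wrath-of-khan", "brand-new-page", "wikipedia-genghis-khan"]
print(rank_by_feedback(results, clicks, bounces))
```

The smoothing is a design choice in this sketch: it keeps a brand-new page with no feedback in the middle of the list instead of at the very top or bottom.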
Now, sometimes we ask our smart devices questions and we want actual answers… not links to Web pages.
When I say “OK Google, what’s the weather like in Indianapolis?” I don’t want to scroll through results.
For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases, which you might remember from our video about Symbolic AI.
A knowledge base encodes information about the universe as relationships between objects like "chocolate donut" and "John Green Bot wears polo".
One of the main problems with knowledge bases is that it’s really hard to write down all of the facts in the universe, especially common sense things that humans take for granted but computers need to be told.
Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University.
In 2010, they created a huge knowledge base called the Never Ending Language Learner or NELL, which was able to extract hundreds of thousands of facts from random Web pages.
The way it works is really clever, so let’s go to the Thought Bubble to see how.
NELL starts with some facts provided by a human, for example, the genre of music that Mozart plays is classical.
Which was represented like this: Mozart.
Similarly, Jimi Hendrix.
And Darth Vader.
Then, NELL gets to work and reads through each Web page one-by-one for words mentioned in those facts.
Maybe it finds the text “Mozart plays the piano.” NELL doesn’t know much about these symbols, but this text matches the same pattern as one of the facts provided by a human, specifically, the “plays” relationship.
So NELL learns a new object: Piano.
And a new fact: Mozart plays piano.
By searching over the entire Web, NELL can learn lots of facts based on just the three original ones that humans gave it!
Some facts might appear hundreds or thousands of times online, like Lenny Kravitz.
But NELL might also find facts that are mentioned SOMEWHERE online and extract them as potentially true.
Like, for example, Darth Vader.
We just don’t know!
Just like how we look for multiple sources when writing a paper, NELL uses repetition and multiple sources to build confidence that the facts it’s finding are actually true.
To consider other relationships, NELL uses the highly confident facts it learned and searches through the Web again.
Only this time, NELL is looking for new relationships.
Maybe it finds the text “Darth Vader cuts off Luke Skywalker’s hand,” and NELL learns a new (very specific) relationship: cutsOffHand.
Over and over again, NELL will use known relationships to find new objects, and known objects to find new relationships -- creating a huge knowledge base.
Thanks, Thought Bubble!
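The Thought Bubble's first pass, using a known relationship to find new objects and counting repeats to build confidence, might look something like the sketch below. This is not NELL's actual code: the sentences, the regular expression, the `min_sources` threshold, and both function names are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for text found on different Web pages.
SENTENCES = [
    "Mozart plays the piano.",
    "Mozart plays the piano.",
    "Mozart plays the piano.",
    "Mozart plays the violin.",
    "Darth Vader cuts off Luke Skywalker's hand.",
]

def extract_facts(sentences, known_subjects, relation="plays"):
    """Count how often each (subject, relation, object) pattern shows up."""
    pattern = re.compile(rf"^(.+?) {re.escape(relation)} (?:the )?(.+?)\.?$")
    counts = Counter()
    for sentence in sentences:
        match = pattern.match(sentence)
        if match and match.group(1) in known_subjects:
            counts[(match.group(1), relation, match.group(2))] += 1
    return counts

def confident_facts(counts, min_sources=2):
    """Keep only facts seen at least min_sources times."""
    return [fact for fact, n in counts.items() if n >= min_sources]

counts = extract_facts(SENTENCES, known_subjects={"Mozart"})
print(confident_facts(counts))
```

Here the violin fact is extracted but held back as low confidence because it was only seen once, and the Darth Vader sentence is skipped because it doesn't match the "plays" pattern.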
AI systems can use huge knowledge bases, like this one extracted by NELL, to answer our questions directly.
Instead of using the words from our questions to search through an inverted index, an AI like Siri can reformulate our questions into incomplete facts and then look for matches in a knowledge base.
Hey, John Green Bot…
John Green Bot: Yes, Jabril?
Jabril: “Who wrote The Bluest Eye?”
His AI could then reformulate that question into an incomplete fact, replacing “who” with a question mark.
If John-Green-bot extracted that information earlier, he can find matches in his knowledge base and return the most confident result.
John-Green-bot: Toni Morrison wrote The Bluest Eye!
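Reformulating a question into an incomplete fact and matching it against a knowledge base can be sketched as follows. Everything here is an assumption for illustration: the `KNOWLEDGE_BASE` entries, their confidence scores (including a deliberately wrong low-confidence entry), and the `answer` function, which only handles simple "who" questions.

```python
# Hypothetical knowledge base: (subject, relation, object) -> confidence score.
KNOWLEDGE_BASE = {
    ("Toni Morrison", "wrote", "The Bluest Eye"): 0.98,
    ("Tony Morris", "wrote", "The Bluest Eye"): 0.10,  # invented noisy fact
    ("Mozart", "plays", "piano"): 0.95,
}

def answer(question, kb):
    """Turn 'Who <relation> <object>?' into (?, relation, object) and match it."""
    words = question.rstrip("?").split()
    if len(words) < 3 or words[0].lower() != "who":
        return None  # this sketch only understands "who" questions
    relation, obj = words[1], " ".join(words[2:])
    candidates = [
        (confidence, subject)
        for (subject, rel, o), confidence in kb.items()
        if rel == relation and o == obj
    ]
    if not candidates:
        return None
    return max(candidates)[1]  # subject of the most confident matching fact

print(answer("Who wrote The Bluest Eye?", KNOWLEDGE_BASE))
```

If no stored fact matches the incomplete fact, the function returns `None`, which is the sketch's version of an assistant being stumped.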
Different words are categorized differently, so an AI like John-Green-bot can tell the difference between questions asking “who” and “when” and “where.” But that gets more complicated, so we’re not going to dive into the details here.
If you want to learn more, you can read about part of speech tagging systems.
Using all these strategies, search engines have become really good at answering common questions.
But questions like “How many trees are in Ohio?” or “How many hot dogs are eaten in the South Sandwich Islands annually?” still stump most AI systems, because not enough people ask them and AI hasn’t learned how to answer them well yet.
It’s also important to watch out for search engine answers to questions like “Who invented the time machine?” because AI systems have a tough time with nuance and incomplete data.
Sorry Doc Brown.
And a big, sort of hidden, problem is that search engine AI systems are influenced by any biases in data online.
For example, if I ask Google for images of “nurses,” it will mostly show pictures of female nurses.
So next time, we’ll talk about how an algorithm can be biased, where bias comes from, and what we can do to address bias in AI.