Preface.
Remember this $100 million worth of AI core code?
while True:
    AI = input('Me:')
    print(AI.replace("Is it?", " ").replace('?', '!'))
The code above is our topic today: rule-based chatbots.
1. Chatbot
A chatbot is a machine or piece of software that mimics human interaction through text. In short, it lets you chat with software in a way that resembles a conversation with a human.
Why try to create a chatbot? Maybe you're interested in a new project, or your company needs one, or you want to go about pulling in investment. Whatever the motivation, this article will try to explain how to create a simple rule-based chatbot.
2. Rule-based chatbots
What is a rule-based chatbot? It is a chatbot that answers text given by humans based on specific rules. Because it follows rules we impose, the responses this chatbot generates are mostly accurate; however, if it receives a query that does not match any rule, the chatbot will not answer. The opposite is a model-based chatbot, which uses a machine learning model to answer a given query. (The difference between the two is that the rule-based one requires us to specify each rule, while the model-based one generates the rules automatically by training the model. Remember from our last "Introduction to Machine Learning" post: "Machine learning provides systems with the ability to automatically learn and improve based on experience without being explicitly programmed.")
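To make the distinction concrete, here is a minimal sketch of a hand-written rule set (the function name and replies are invented for illustration): every response path has to be spelled out by a human, which is exactly what makes a bot "rule-based".

```python
def tiny_rule_bot(text):
    # Each branch below is a rule a human wrote by hand;
    # the bot can only answer what its rules anticipate
    text = text.lower()
    if "hello" in text:
        return "Hi there!"
    if "name" in text:
        return "I am a rule-based chatbot."
    # No rule matched -- a rule-based bot has nothing to fall back on
    return "Sorry, I don't have a rule for that."

print(tiny_rule_bot("Hello!"))
print(tiny_rule_bot("What is your name?"))
print(tiny_rule_bot("weather today"))
```

A model-based chatbot, by contrast, would learn these mappings from example conversations rather than from hand-written branches.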
Rule-based chatbots may be based on rules given by humans, but that doesn't mean we don't use datasets. The main goal of chatbots is still to automate questions asked by humans, so we still need data to formulate specific rules.
In this article, we will use cosine similarity as the basis for developing a rule-based chatbot. Cosine similarity is a similarity measure between vectors (specifically, non-zero vectors in an inner product space) and is commonly used to measure the similarity between two texts.
We will use cosine similarity to make the chatbot answer a query by comparing the similarity between the query and the corpus we build. This is why we need to build a corpus in the first place.
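As a quick illustration of the measure itself (using toy vectors, not our real corpus), cosine similarity is the dot product of two vectors divided by the product of their lengths, so parallel vectors score near 1 and unrelated (orthogonal) vectors score 0:

```python
import numpy as np

# Toy term-count vectors, invented for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, just scaled
c = np.array([0.0, 0.0, 3.0])   # shares no terms with a

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # parallel vectors: approximately 1.0
print(cosine(a, c))  # orthogonal vectors: 0.0
```

Because the measure depends on direction rather than length, a short sentence and a long sentence about the same words can still score as highly similar.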
3. Creating a corpus
For this chatbot example, I want to create a chatbot to answer all questions about cats. In order to collect data about cats, I will grab it from the internet.
import bs4 as bs
import urllib.request

#Open the cat web data page
cat_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Cat').read()

#Find all the paragraph html from the web page
cat_data_paragraphs = bs.BeautifulSoup(cat_data, 'lxml').find_all('p')

#Creating the corpus of all the web page paragraphs
cat_text = ''

#Creating lower text corpus of cat paragraphs
for p in cat_data_paragraphs:
    cat_text += p.text.lower()

print(cat_text)
Using the code above, you will get the collection of paragraphs from the Wikipedia Cat page. Next, the text needs to be cleaned up to remove useless pieces such as bracketed citation numbers and extra whitespace.
import re
cat_text = re.sub(r'\s+', ' ', re.sub(r'\[[0-9]*\]', ' ', cat_text))
The above code removes the bracketed citation numbers and collapses extra whitespace in the corpus. I purposely did not remove other symbols and punctuation, because keeping them makes conversation with the chatbot sound more natural.
Finally, I will create a list of sentences based on the corpus I created earlier.
import nltk
# nltk.download('punkt')  # run once if the sentence tokenizer data is not yet installed
cat_sentences = nltk.sent_tokenize(cat_text)
Our rule is simple: measure the cosine similarity between the query text given to the chatbot and each sentence in the list; whichever sentence yields the closest match (highest cosine similarity) becomes the chatbot's answer.
4. Create a chatbot
Our corpus above is still in text form, and cosine similarity does not accept text data, so the corpus needs to be converted into numeric vectors. This is usually done by converting the text into a bag of words (word counts) or by using the TF-IDF method (term-frequency weights discounted by how common each word is across documents). In our example, we will use TF-IDF.
I will create a function which takes the query text and gives an output based on the cosine similarity in the following code. Let's have a look at the code.
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def chatbot_answer(user_query):

    #Append the query to the sentences list
    cat_sentences.append(user_query)

    #Create the sentences vector based on the list
    vectorizer = TfidfVectorizer()
    sentences_vectors = vectorizer.fit_transform(cat_sentences)

    #Measure the cosine similarity and take the second closest index because the first index is the user query
    vector_values = cosine_similarity(sentences_vectors[-1], sentences_vectors)
    answer = cat_sentences[vector_values.argsort()[0][-2]]

    #Final check to make sure there is a result present. If all the results are 0, the text we input is not captured in the corpus
    input_check = vector_values.flatten()
    input_check.sort()

    if input_check[-2] == 0:
        return "Please Try again"
    else:
        return answer
Finally, create a simple answer interaction using the following code.
print("Hello, I am the Cat Chatbot. What is your meow question?")

while(True):
    query = input().lower()
    if query not in ['bye', 'good bye', 'take care']:
        print("Cat Chatbot: ", end="")
        print(chatbot_answer(query))
        cat_sentences.remove(query)
    else:
        print("See You Again")
        break
The script above will receive queries and process them through the chatbot we developed earlier.
If you try a few queries, the results are mostly acceptable, but some answers are strange. Keep in mind, though, that the results currently come from only a single data source and no optimization has been done. If we improve the bot with additional datasets and rules, it will answer questions much better.
5. Summary
The chatbot project is an exciting data science project because it is useful in many areas. In this article, we took data from a web page and used cosine similarity and TF-IDF in Python to create a simple chatbot, really getting our $100 million project off the ground. There is actually a lot of room for improvement here:
- For the vectorization, besides TF-IDF it is also possible to use word2vec, or even to extract word vectors with a pre-trained BERT.
- The answering step is really a search for the best-matching answer in our corpus by some specific algorithm or rule; the top-1-similarity method used in this article is one of the simplest forms of greedy search.
- To optimize the answer results, a beam-search-like algorithm could also be used to extract candidate answers.
- And so on.
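To sketch the word2vec idea without loading a real model, the toy vectors below stand in for pre-trained embeddings (their values are invented purely for illustration): a common word2vec-style trick is to average the vectors of a sentence's words into one sentence vector, then compare sentence vectors with cosine similarity just as before.

```python
import numpy as np

# Stand-ins for pre-trained word embeddings; a real word2vec or BERT model
# would supply these vectors, and they would be much higher-dimensional
word_vectors = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "cats":  np.array([0.8, 0.2, 0.0]),
    "dog":   np.array([0.1, 0.9, 0.0]),
    "sleep": np.array([0.0, 0.2, 0.8]),
}

def sentence_vector(sentence):
    # Average the embeddings of the known words into one sentence vector
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

q = sentence_vector("cats sleep")
print(cosine(q, sentence_vector("cat sleep")))  # high: near-synonymous words
print(cosine(q, sentence_vector("dog")))        # lower: different topic
```

Unlike TF-IDF, this approach can score "cat" and "cats" as similar even though they are different tokens, because their embeddings point in nearly the same direction.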
Before the rise of end-to-end deep learning, many chatbots ran exactly this way, and there are plenty of real-world examples, so if you want to put together a quick POC demo, this rule-based approach is still very useful.
That concludes this article on creating a rule-based chatbot with Python. For more on building chatbots in Python, see my previous articles, and I hope you will continue to support me in the future!