How does Existor work
We mentioned that Cleverbot actively uses less data than it could due to technical practicalities. There are also social and moral considerations. Cleverbot employs many filtering rules and patterns to determine which rows should make up the active database. The rules favour longer non-repetitive lines and conversations, and largely ban swearing and explicit sexual references so as to prevent Cleverbot from engaging in such exchanges.
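To make the idea concrete, here is a minimal sketch of what such row filtering could look like. The function name, the banned-word list and the thresholds are all illustrative assumptions, not Existor's actual rules, which are far more extensive:

```python
import re

# Hypothetical stand-in for the manually curated string lists described above.
BANNED_WORDS = {"exampleswearword", "exampleexplicitword"}

def keep_row(bot_line: str, user_line: str) -> bool:
    """Return True if this interaction may enter the active database."""
    for line in (bot_line, user_line):
        words = re.findall(r"[a-z']+", line.lower())
        if not words:
            return False                      # empty or punctuation-only lines
        if any(w in BANNED_WORDS for w in words):
            return False                      # swearing / explicit content
        if len(set(words)) < len(words) / 2:
            return False                      # heavily repetitive lines
    return len(user_line.split()) >= 2        # favour longer replies

# Example: keep_row("how are you", "fine thanks and you") -> True
```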
Much of the filtering consists of matches against manually curated lists of strings, which are more completely specified in English than in most other languages. With this in mind, we can present some general statistics about the Cleverbot data, with figures actual and estimated as of 2nd December. Much of our active data was collected before we started storing all unfiltered data, and as a result only a fraction of it was retained unfiltered; on that basis the total overall number of lines can be said to run well into the millions. We have now introduced the Cleverbot data set, including how the data is collected and a brief statistical breakdown.
Now we will address how suitable this data is for machine learning. Thus far all statistics have referred to conversational interactions: the bot says something and the user replies. Each row in the database represents one of these interactions. Some people swear at Cleverbot or try to chat it up, change the topic every line or type complete nonsense.
But all of those are still valid human responses in the current conversational context. A very small percentage may be other bots chatting with Cleverbot, or Cleverbot chatting with itself, but we have various rules in our top-level servers to prevent that kind of usage or learning. Cleverbot usually follows the flow of the conversation and is sometimes strikingly good, but it can also suddenly change topic and may come across as having a poor factual memory: it forgets names and preferences.
We have run most of our own machine learning tests on the data using the second method above, because it effectively doubles the size of the data set and is much easier to work with. We started using modern machine learning techniques to build a new, more intelligent conversational AI. We began with unsupervised learning techniques to build a model that could capture the natural structure of our data at a line and conversation level.
We were inspired by the word-level vector relationships that word2vec reveals. Impressively, these can be plotted, vividly showing the word-level data structures. As described in the introduction, going from word vectors to line vectors (the composition problem) is an open machine learning challenge. We hoped that with sheer quantity of data we could more or less bypass the issue and analyse the relationships between lines directly, without first splitting them into words.
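The classic word-level relationship can be reproduced with an off-the-shelf toolkit such as gensim. The snippet below is a generic illustration using publicly available pretrained vectors, not our own word2vec model:

```python
# Illustrative only: uses small pretrained GloVe vectors downloaded via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# The well-known analogy: vector("king") - vector("man") + vector("woman")
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically [('queen', ...)]
```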
This turned out to be impractical. There are far too many unique lines. Instead we used a simple composition approach, followed by clustering, to reduce the number of unique lines, and then analysed the clustered lines.
All these stages are unsupervised. Our experiment was as follows. We aimed to build an efficient model of Cleverbot data in order to encode line-level relationships. To this end, we implemented the following pipeline. We have tested it on 2 million, 20 million and 50 million lines (1, 10 and 25 million interactions), treating both the bot and user side of the conversation equally.
We first extract the bot and user columns from the raw log files, lower case them, and remove punctuation. We save the output to a long text file, with each line of conversation on one line of the file.
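A minimal sketch of this preprocessing step, assuming a hypothetical tab-separated log layout with the bot line in the first column and the user line in the second (the file names and column positions are assumptions):

```python
import csv
import string

def normalise(text: str) -> str:
    """Lower-case a line and strip punctuation, as described above."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

# Hypothetical layout: tab-separated logs, bot column then user column.
with open("logs.tsv", newline="") as src, open("lines.txt", "w") as out:
    for row in csv.reader(src, delimiter="\t"):
        bot, user = row[0], row[1]
        out.write(normalise(bot) + "\n")
        out.write(normalise(user) + "\n")
```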
We run the resulting text file through word2vec to turn all the words into relatively low-dimensional vectors, which allows the following stages to run faster. We use skip-grams with a context window of 12, which usually encompasses the whole line, as lines are on average about 3 words long. We then compose a vector for each line by summing its word vectors and normalising, and cluster those summed and normalised line vectors. Note that this step is only to reduce the number of unique lines. For lines with several words, these clusters usually contain similar words in different orders.
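Continuing the sketch above, the word2vec training and the sum-and-normalise composition could look roughly like this in gensim; the dimensionality of 50 is an illustrative value for "relatively low":

```python
import numpy as np
from gensim.models import Word2Vec

# Train skip-gram word vectors on the preprocessed file (one line per row).
sentences = [line.split() for line in open("lines.txt")]
model = Word2Vec(sentences, vector_size=50, window=12, sg=1, min_count=5, workers=4)

def line_vector(line: str) -> np.ndarray:
    """Compose a line vector by summing its word vectors and normalising."""
    vecs = [model.wv[w] for w in line.split() if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    summed = np.sum(vecs, axis=0)
    return summed / np.linalg.norm(summed)
```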
For lines with 1 or 2 words, the clusters contain different words used in similar contexts, such as numbers (18, 21, 20) or first names (bob, paul, lisa, jenny). In a data set of 50 million lines, there are about 24 million unique lines, and only 6 million of these occur more than once. We have modified the clustering algorithm to work with unique lines plus line counts.
This results in a slightly higher sum-of-squares error, but means we can run the clustering about 10 times faster. The output is a vector for each cluster. We use these to label the original text file, matching each line to its closest cluster. Choosing the exact number of clusters is difficult, because we are not seeking simply to minimise the sum-of-squares error: rather, we want few enough clusters to find structures, but many enough for those structures to be meaningful.
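One generic way to cluster unique lines weighted by their counts is scikit-learn's k-means with sample weights; this is a sketch under those assumptions (reusing the line_vector helper above), not Existor's own modified implementation:

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

lines = [line.strip() for line in open("lines.txt")]
counts = Counter(lines)                        # unique lines plus their counts
unique_lines = list(counts)
X = np.array([line_vector(l) for l in unique_lines])
weights = np.array([counts[l] for l in unique_lines])

# Weighting each unique line by its count gives the same objective as
# clustering every occurrence, but over far fewer points.
km = KMeans(n_clusters=20000, n_init=1, random_state=0)   # n_clusters illustrative
km.fit(X, sample_weight=weights)

# Label the original text file: each line is tagged with its nearest cluster.
labels = {l: f"C{c}" for l, c in zip(unique_lines, km.predict(X))}
with open("labelled.txt", "w") as out:
    for line in lines:
        out.write(labels[line] + "\n")
```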
As determined by our testing procedure described below, we ended up with 20,000 clusters for 50m lines of data. Then we run word2vec again on the labelled lines. The vocabulary of this instance of word2vec is small, as it just represents the clusters. We found the best results using the continuous bag of words method with a small number of dimensions and a small context window, because we wanted our model to capture more focused information about lines. We realised this is actually a reflection of a larger issue with all word2vec calculations, where the resulting vector is often very close to one of the operands.
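A sketch of this second word2vec pass, where the "words" are now cluster labels grouped back into conversations; the label values, dimensionality and window size are illustrative assumptions:

```python
from gensim.models import Word2Vec

# Each conversation is assumed to be a list of cluster labels, e.g. ["C12", "C7", ...].
conversations = [["C12", "C7", "C431", "C7"]]          # placeholder data

cluster_model = Word2Vec(
    conversations,
    vector_size=25,   # illustrative small dimensionality
    window=2,         # small context window, as described above
    sg=0,             # continuous bag of words
    min_count=1,
)
# cluster_model.wv["C12"] is now a vector describing how that cluster of
# lines tends to be used within conversations.
```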
We came up with a percentage metric to represent this. The graph below is based on 50 million lines of data with 20,000 clusters, showing the first two dimensions after PCA. To label the graph, we worked backwards from the cluster numbers, like C12, and found the most commonly occurring line near to that cluster. Also shown is a 3D representation using WebGL. Click the image to open the visualisation, then use your mouse wheel to zoom in and out and explore the clusters in 3D; note that it only works in Chrome.
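The 2D projection itself is a standard PCA step; continuing the sketch above, it might look like this (the plotting itself is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project the cluster vectors down to two dimensions for the scatter plot.
labels = cluster_model.wv.index_to_key            # e.g. ["C12", "C7", ...]
vectors = np.array([cluster_model.wv[l] for l in labels])

coords = PCA(n_components=2).fit_transform(vectors)
for label, (x, y) in zip(labels, coords):
    print(label, x, y)                             # one point per cluster
```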
We ran the movie subtitle database through our pipeline as well. This hints at the ability to build quite an efficient model of conversational data using an unsupervised pipeline. As we have shown above, this model can be used as it is to answer questions by simple vector operations. This is rather like parts-of-speech tagging, but at the line level.
Parts-of-speech taggers also use small context windows to extract the relationships between words and their immediate neighbours. This pipeline does something similar with lines. To imagine a conversation, you can draw a path between cluster labels. Our current machine learning work involves using the Cleverbot data to train a language model to generate conversational replies.
This work is in its early stages, but we are already seeing some good results. We trained it on just a small portion of the data. Each separate conversation was treated as a single sequence, with a beginning-of-conversation marker and end-of-line markers between lines. The vocabulary was limited to the most common words. We did no programming of our own for this; we simply used available tools and optimised different parameters. One sample exchange included the line: User: do you have something to say?
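A sketch of how conversations might be flattened into single training sequences with the markers described above; the marker tokens themselves are assumptions made for illustration:

```python
# Illustrative marker tokens; the actual tokens used in training are not specified above.
BOC = "<boc>"   # beginning-of-conversation marker
EOL = "<eol>"   # end-of-line marker between lines

def conversation_to_sequence(lines):
    """Flatten one conversation into a single token sequence."""
    tokens = [BOC]
    for line in lines:
        tokens.extend(line.split())
        tokens.append(EOL)
    return tokens

# Example:
# conversation_to_sequence(["hello", "do you have something to say"])
# -> ['<boc>', 'hello', '<eol>', 'do', 'you', 'have', 'something', 'to', 'say', '<eol>']
```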
Two examples that piqued my interest are Storybricks, the gaming AI startup founded by serial entrepreneur Rodolfo Rosini, and Existor, maker of a number of AI-assisted bots and smartphone apps that create natural-language two-way communication between a human and a computer. Meanwhile, Existor is thought to have taken no funding at all. Largely in stealth since it was founded, Storybricks started out life with the super-ambitious mission to create a new browser-based MMO that would let users turn stories into games. Aside from being British, what both Storybricks and Existor have in common is that their AI technology is about understanding context. As more and more data points are created, AI is needed to figure out that a person wants one piece of information or action over another.
How have Cleverbot's programmers equipped it with so much conversational, contextual and factual knowledge? The answer is very simple: crowdsourcing. The chatbot's designer, Rollo Carpenter, explained the approach in a video explainer produced by PopSci.
Since coming online, Cleverbot has engaged in about 65 million conversations with Internet users around the world, who chat with it for fun via the Cleverbot website. Like a human learning appropriate behavior by studying the actions of members of his or her social group, Cleverbot "learns" from these conversations.
It stores them all in a huge database, and in every future conversation its responses to questions and comments mimic past human responses to those same questions and comments. If, for example, you were to ask Cleverbot, "How are you?", it would reply with something a real person once typed in answer to that same question. And, because it is pulling up an answer that a human has typed, the response will sound mostly human, at least in theory.
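A highly simplified sketch of that retrieval idea, with invented data and a toy lookup; real Cleverbot matching is far more sophisticated than an exact-match dictionary:

```python
import random

# Toy "database" mapping things said to the bot onto past human replies.
# The entries are invented purely for illustration.
past_replies = {
    "how are you": ["fine thanks", "good and you", "not bad"],
}

def reply(prompt: str) -> str:
    """Answer by reusing a reply a human once gave to the same prompt."""
    candidates = past_replies.get(prompt.lower().strip("?! ."))
    if not candidates:
        return "tell me more"            # fallback when nothing matches
    return random.choice(candidates)

print(reply("How are you?"))             # e.g. "good and you"
```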