NOTE: This is just rambling and must not be construed as anything quotable, though it may be used to inspire new ideas. In short, quote at your own risk: I do not guarantee that any of the references below are correct, nor that all of the ideas that should be referenced are. This is not material of publishable quality, so it should not be treated as such.
Human problem solving has always been an interesting issue. We consider the ways that we solve problems and then try to encode these methods in computers, but in truth, should we? Traditional Artificial Intelligence deals with this facet of problem solving, trying to encode our knowledge of the methods of learning and problem solving into a computer to achieve some level of machine learning, or seemingly intelligent behavior. Yet, as has so often shown up in research in this field, we are not really sure what it is that allows us to learn. We continually take the methods of learning that Psychology shows us, such as prototype learning, operant (or reinforcement) learning, and even latent learning, and attempt to encode them, with varying levels of success. In each case, though, we do not achieve what we would term perfect performance, as measured against a human, but only something close to it.
Perhaps, though, this is because we are trying for a monolithic view of intelligence as one very complex method, when in fact we should be looking at intelligence as a complex interaction between several very specific problem solvers. This idea has actually been pursued quite often: in the Cyc research (by Lenat), the proposed route to general intelligence is simply to train many expert Cyc systems, each of which solves only a very specific set of problems. This panel-of-experts approach is also used in machine learning proper, as seen in voting systems or in hierarchical neural networks. There is, however, another method for doing this that is slightly different. It is known as Classifier Systems, and was originally created by John Holland in 1978.
These systems rely on a method of learning called genetic algorithms. As the name implies, they are built on some ideas from genetics. The main idea is that a problem solution may be encoded as a string of characters, and that string may then be operated upon using genetic reproduction. Reproduction treats the string as a strand of DNA and combines it with the string of some other solution to the problem in order to create one or more child solutions. Each child is a combination of the parents' DNA strings, combined in some way. The method currently used is crossover.
Any string may be combined with another string by choosing a position and splitting both strings that many characters from the beginning. Swapping the two tail portions then yields two new strings. These strings are related to the two parent strings: each has some number of characters that are exactly the same as one parent, and the remaining characters exactly the same as the other. In this way, however, an offspring cannot introduce anything new into the population, except as emergent phenomena produced by new combinations of existing characters. Yet sometimes in learning we must go in an entirely new direction, something we as humans call innovation, or creativity. Any sufficient learning algorithm must therefore also have a method of creating unique solutions that were previously impossible. Using this insight, we introduce mutation.
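Before moving on to mutation, here is a minimal sketch of the single-point crossover just described. The bit-string encoding and function names are illustrative assumptions for this note, not the encoding of any particular classifier system.

```python
import random

def crossover(parent_a: str, parent_b: str) -> tuple[str, str]:
    """Single-point crossover: split both parents at the same index
    and swap the tail portions to produce two children."""
    assert len(parent_a) == len(parent_b)
    point = random.randrange(1, len(parent_a))  # split point; never 0, so both parents contribute
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

# Example: two candidate solutions encoded as bit strings
a, b = "1101100101", "0010011110"
print(crossover(a, b))
```

Note that every character in either child already existed, at the same position, in one of the parents, which is exactly why crossover alone cannot introduce anything new.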
With some very small random probability, any character in a string may be turned into another character. This is called mutation, and it lets us introduce characters with no antecedent into the population. Taking a human example, consider the sixth finger. In the human species, at various times, unique individuals have been born with some quality that none of their ancestors had, such as a sixth finger. Sometimes these mutations are problematic and do not help the individual survive, like developing internal hair in, say, the heart, where it would impede blood flow and subsequently, probably, kill the creature. A sixth finger, though, may be an innocuous or even enhancing feature. We notice that in nature, features which are not helpful to a creature do not appear in successive generations, except in very rare cases, whereas features that are helpful often re-appear in the next generation or in subsequent generations. This too must appear in any learning system we build on these foundations; if we implement a system with inherent problems, we must deal with those problems.
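A hedged sketch of the mutation operator described above, again over character strings; the alphabet and the per-character rate are illustrative assumptions.

```python
import random

def mutate(individual: str, alphabet: str = "01", rate: float = 0.01) -> str:
    """With a small probability per character, replace that character with a
    randomly chosen one, possibly introducing genes no ancestor carried."""
    chars = []
    for ch in individual:
        if random.random() < rate:
            chars.append(random.choice(alphabet))  # character with no antecedent
        else:
            chars.append(ch)
    return "".join(chars)

print(mutate("1101100101", rate=0.1))
```

In a full genetic algorithm, whether a mutated feature re-appears in later generations is left to selection: unhelpful mutations tend to score poorly on the fitness function and so are rarely chosen as parents.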
More coming...

How is it that people really store and access language? From a biological perspective, all the brain is, is a large number of neurons. Taking an abstract view of this, every word or concept a person ever stores in his or her brain must somehow be a combination of these neurons. Therefore, if we take the simple approach of using this knowledge with a feature structure that resembles a tree, this should be sufficient for modelling language. Somehow, though, this breaks down. Collins and Quillian(1) showed that this structure seems to work, in their semantic networks, but it was later shown (reference goes here) that it was not sufficient, as some concepts are apparently not stored in this manner. Later, (someone?? Reference goes here) showed that something much closer to what happens is to model the network as an unrooted tree: a simple search finds any feature that matches the concept, and then spreading activation takes over (the impulse hits the first neuron it finds and then spreads in all directions, minus the one it came from, in order to find the full concept). Somehow either the path to this synapse, or some path out of it (perhaps to the speech center), must store the information of which word that concept maps to.
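As a rough illustration of the spreading-activation idea (not a claim about how the brain actually does it), here is a sketch over a tiny hand-made concept network. The node names and links are purely illustrative, and a breadth-first search stands in for the outward spread of activation.

```python
from collections import deque

# A tiny concept network as an adjacency list (illustrative nodes and links).
network = {
    "canary": ["bird", "yellow"],
    "bird":   ["canary", "animal", "wings"],
    "animal": ["bird", "skin"],
    "yellow": ["canary"],
    "wings":  ["bird"],
    "skin":   ["animal"],
}

def spread_activation(start: str, target: str):
    """Spread outward from the first matching node in all directions (except
    the one the impulse came from) until the target feature is reached; the
    path taken is what carries the information."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in network[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(spread_activation("canary", "skin"))  # ['canary', 'bird', 'animal', 'skin']
```

The point of the sketch is that the answer returned is a path through the network, not data read out of any single node.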
This is somewhat different from the way knowledge is currently represented, as almost all representations of knowledge store the information in the nodes (which in this case would be the synapses). Given the biological representation of the information, it is unrealistic for anything to be stored in the synapse, because it is only the communication channel between neurons; somehow the information must be stored in the neurons.
This is an ongoing project to discover the true, underlying ways in which information is stored in the brain.
Just how do we make the computer understand language? Part of this is covered above, but more to the point, how could we do anything interesting given that approach? If the data is stored implicitly in the network, how could it be used for programming as we are used to thinking about it?
We could just take all of language and store it in a tree format similar to the WordNet database. Well, this has already been done, and done quite well, so how could we do it more implicitly? One of the arguments against the way computer scientists have been dealing with language is that, in the brain, most of the information is apparently carried by how one gets to a certain neuron and out again. The path actually determines the organism's reaction to the given stimulus, so the information is not, per se, stored anywhere; it is an implicit property of the network.
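For contrast, here is a toy sketch of the explicit, WordNet-style storage mentioned above, where a word's meaning is read off as the path from the word up through its hypernyms. The entries are made up for illustration, not taken from WordNet itself.

```python
# A toy hypernym tree in the spirit of the WordNet noun hierarchy
# (illustrative entries only).
hypernym_of = {
    "poodle": "dog",
    "dog":    "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "entity",
}

def hypernym_path(word: str) -> list[str]:
    """Walk upward from a word to the root; the result is the route through
    the hierarchy rather than data attached to any single node."""
    path = [word]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    return path

print(hypernym_path("poodle"))  # ['poodle', 'dog', 'canine', 'mammal', 'animal', 'entity']
```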
In this project, I am thinking more about how this can be done, and how the network's progress can be monitored in a way that gives useful information to the programmer or user of the network.
Given a document, what is the likelihood that it belongs to a certain category? Further, what should that category be? Many people argue for pre-determined categories: the computer is given documents that belong to categories, and either learns, or is given, criteria that a document must fulfill in order to belong to a category. But what is it that determines the category?
Well, the category is often determined, as stated above, by the person who builds the categorizer. So, what if you let the computer choose the category based on how similar the documents are, or better yet, based on what the content of the documents is? This is often termed document clustering, so, abusing terminology slightly, that is basically what I am after. Right now the research consists of a decision-tree-based approach over the concepts in the document, represented by hierarchies of synonym and hypernym relations.
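A very rough sketch of the content-based grouping idea: documents are reduced to broader concepts through a small synonym/hypernym table and then grouped greedily by concept overlap. This is only an illustration under invented data, and it substitutes simple Jaccard-overlap clustering for the decision-tree approach described above.

```python
# Toy word-to-concept table standing in for synonym/hypernym hierarchies.
concept_of = {
    "dog": "animal", "cat": "animal", "canine": "animal",
    "stock": "finance", "bond": "finance", "market": "finance",
}

def concepts(text: str) -> set:
    """Map the words of a document onto their broader concepts."""
    return {concept_of[w] for w in text.lower().split() if w in concept_of}

def cluster(docs, threshold=0.5):
    """Greedy clustering: a document joins the first cluster whose
    representative shares enough concepts with it (Jaccard overlap)."""
    clusters = []  # list of (representative concept set, member documents)
    for doc in docs:
        c = concepts(doc)
        for rep, members in clusters:
            union = rep | c
            if union and len(rep & c) / len(union) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append((c, [doc]))
    return [members for _, members in clusters]

docs = ["the dog chased the cat", "a canine story",
        "the stock market fell", "bond market news"]
print(cluster(docs))  # two groups: the animal documents and the finance documents
```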
Umm... Well, this is where I am really doing most of my work. It is as yet undefined, but eventually, if I figure out a good way of doing it, it will go with the Document Categorizer above (actually, it is just a part of it) and be modified to compress the document, creating a Document Summarizer, for which there really are no good algorithms currently available.
All work done for Searchbuilder.com was proprietary, and therefore neither the source code, nor intellectual property may be released.
Keyword Extraction -- unfortunately this is still in the planning stages and no executable code has been written yet.
Document Categorization -- This is the most current project that I am working on and mostly involves using document categorization strategies to index web pages, similar to the method employed by search engines. This project uses a method of synonym and hyponym tracking and unification to achieve its goals. This is being continued on personal time, and therefore is no longer Searchbuilder's responsibility or property.
Text Generation -- This project was initially intended to generate web pages. Its specific goal was to generate web pages automatically that have similar content, but use a knowledge base to generate the actual text in such a manner that each page generated is unique in size, in content, and slightly in appearance. The current approach uses a template-filling method which, though simplistic, is surprisingly accurate. The pages produced by this project still have to undergo human review, but generally do not require extensive editing.
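A minimal sketch of the template-filling idea; the templates, slot names, and filler values here are invented for illustration and are not the project's actual knowledge base.

```python
import random

# Illustrative templates and slot fillers (not the real knowledge base).
templates = [
    "Welcome to {site}, your source for {topic}. Read about {item} today.",
    "{site} covers {topic}. Our latest article looks at {item}.",
]
fillers = {
    "site":  ["Example Widgets", "Widget World"],
    "topic": ["widget news", "widget reviews"],
    "item":  ["the new Model X widget", "budget widgets"],
}

def generate_page() -> str:
    """Pick a template and fill each slot with a randomly chosen value, so
    each generated page differs slightly in size and content."""
    template = random.choice(templates)
    return template.format(**{slot: random.choice(values)
                              for slot, values in fillers.items()})

print(generate_page())
```

With larger template and filler pools drawn from a knowledge base, the same mechanism yields pages that share content but vary in length and wording, which is the property the project is after.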
Page Evaluation -- This was the initial project which began the Document Categorization effort. It involved rating a page for how well it belonged to a category. This project began small, and stayed quite small, as it was realized that a more general solution (see Document Categorization above) was a more effective use of time.
References: (To be added)
Have you got ideas to make this page better? Email me! (I need all the help I can get ;)