AVNTK- Basic English AI

Basic English for Artificial Intelligence

Communication is probably one of the greatest challenges of our time. At the technical level, telecommunications is one of the fastest growing industries. However, once physical communication is achieved between individuals, the next step is understanding. This is normally achieved using language, though not always spoken language, as sign language shows. English in particular is the predominant language for human interaction across the whole world today. Two thirds of Internet traffic takes place in English. Most of the top universities, business transactions and entertainment, as well as scientific and technological publications, use the English language. Going by the vocabulary generally used on the Internet, if we also consider scientific and technological words, the required vocabulary for successful interaction comes close to 100,000 words.

An increasingly important aspect of communications is human-machine interaction, not only for command and control of robots or artificial cognition systems but, perhaps more importantly, for knowledge acquisition. We define cognitive systems as natural or artificial information processing systems, including those responsible for perception, learning, reasoning, decision-making, communication and action. Many opportunities lie at the interface between life sciences, social sciences, engineering and physical sciences.

At the root of the concept of language lies the very definition of a word. For centuries words like "virtue" have defied an exact, generally accepted definition, though this is admittedly a subtle and complex term. However, as first noted by Wittgenstein (1953), even simple words like "dog" or "game" can be equally challenging. His suggestion was that there exists a "family resemblance" which allows us to identify a particular instance as a member of a group such as "dog". Following the work of Rosch (1973), who first tried to give these ideas a psychological interpretation, we can conceive of these family resemblances as arising because the human brain holds prototypes or exemplars, and represents the meaning of a particular word in relation to them. She found, through extensive experiments with children, that there are in fact about 400 core concepts in western children, which are used intensively while growing up to interpret meaning. Moreover, Rosch argued that there is a "natural" level of categorization that we tend to use to communicate. This level is known as "basic-level categorization" and has been found to reflect a natural way in which humans categorize the objects in our world. This is important because we know that the processes of acquisition, storage and retrieval in the human brain are inextricably intertwined. Therefore, we can be fairly certain that basic-level categorization is a fundamental aspect of human cognition and communication. Viewed in a more global context, this is not a particularly novel concept in itself.

Chinese writing, for instance, possesses more than 40,000 mainly ideographic signs, but knowledge of four thousand is enough for most purposes. Chinese writing, insofar as it is phonetic, is also monosyllabic, for the very good reason that the words of the language consist of only one syllable. The large number of homophones made it important to have signs that distinguished between them, and so the script avoided being purely phonetic. Even in this case, an early simplification such as the one performed by James Yen in 1923 resulted in a selection of 1,200 of the traditional characters to form what can be called Basic Chinese, enabling illiterate people to read in this system after four months' work. A later refinement by Yuan Chao produced a system of about 2,500 of the traditional characters, which it was claimed can cover basically all of the language. The Japanese resolved the basic linguistic problem by adding hiragana: children are taught 1,200 of the 40,000 symbols, which often contain a Chinese root and suffixes.

Another attempt at devising a simplified version of a language is that of Basic English, as proposed by Charles K. Ogden in the 1920s. The fact that it is possible to say almost everything we normally desire to say with 850 words makes Basic English something more than a mere educational experiment. Eight hundred and fifty words are sufficient for ordinary communication in idiomatic English. Six hundred words form a first stage at which a wide range of simple matter can be provided. With the addition of 100 words required for any general field (such as science, business or military use) and 50 internationally recognized words, a total of 1,000 words enables successful communication. With this vocabulary, the style is brief and has no literary pretensions, but it is clear and precise. Above the 1,000 maximum, we are at the level of standardizing English. Normal vocabulary hovers between the alleged 600 words of Somersetshire farmers (and possibly football hooligans…) and the 12,000 of the average undergraduate, though a cultured person can easily command as many as 50,000.

The grammar of this reduced-vocabulary English is similar to that of Standard English, except that the rules fill one chapter rather than a whole book. There are fewer exceptions. The chief form-changes are those which make the behaviour of verbs and pronouns the same as in normal English, together with 'plurals', -ly for 'adverbs', the degrees of comparison, and the -er, -ing, -ed endings of 300 of the names of things. In this way Basic English is not troubled by a great number of forms and endings which are not regular, and the outcome is simple, natural English. Compound words may be combined from two nouns (milkman) or a noun and a directive (sundown).

The attraction of Wittgenstein’s ideas for artificial intelligence lies in their potential to provide a framework within which a computer could achieve “consciousness”, i.e. be able to “understand” by deducing the meaning of words in relation to all other words. The difficulty lies in the size of the vocabulary. If we consider a vocabulary of 100,000 words and wish to relate all words to each other, the resulting matrix would have 1.0e+10 entries. However, there are a few caveats: we would need about 25 samples per entry to make it statistically significant; 50 words make up more than 43% of English written text; and, unless we obtained all this text from a single individual, there would be the added complication of ambiguity arising from the use each person might give to any one word. All in all, a corpus of somewhere over 1.0e+12 words obtained from a single individual would be needed or, to put it another way, 10,000 times the size of the British National Corpus database of 100 million words! This is an interesting number, as it is about 10 times the number of neurons in a human brain (assuming a value of 100 billion). Bearing in mind that more than a third of the brain is occupied just by vision-related tasks, we can be confident that the human brain does not actually hold this type of information individually but, more than likely, as Wittgenstein suggests, as relative positions in relation to a very limited number of exemplars, possibly in a multivariate space representation of more than 6 and no more than 12 dimensions.
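The figures in this estimate can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it takes the vocabulary size, samples per entry and corpus sizes from the text, and it assumes, as a simplification, that one word of corpus yields roughly one sample for the matrix.

```python
# Back-of-the-envelope check of the corpus-size estimate.
VOCABULARY = 100_000        # words in the full vocabulary (from the text)
SAMPLES_PER_ENTRY = 25      # samples wanted per word pair (from the text)
BNC_WORDS = 100_000_000     # British National Corpus size (from the text)
CORPUS_ESTIMATE = 10 ** 12  # rounded corpus size quoted in the text

# Relating every word to every other word gives a square matrix.
matrix_entries = VOCABULARY ** 2

# Lower bound on samples, assuming (unrealistically) that samples are
# spread evenly over all word pairs.
min_samples = matrix_entries * SAMPLES_PER_ENTRY

# How many BNC-sized corpora the quoted estimate corresponds to.
bnc_multiples = CORPUS_ESTIMATE // BNC_WORDS

print(f"matrix entries : {matrix_entries:.1e}")
print(f"minimum samples: {min_samples:.1e}")
print(f"BNC multiples  : {bnc_multiples:,}")
```

The even-spread lower bound falls below the quoted corpus size, which is consistent with the caveat that a handful of very frequent words dominates written text, so rare word pairs need far more material before they are sampled often enough.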

Therefore, the first step in following this line of reasoning lies in producing an automatic translation system that converts English material, however complex, into a reduced vocabulary while substantially keeping its information content. This is the aim of the SIMPLISH translating tool available on this site. In this case, relating 850 words to each other through a matrix represents a little over 700,000 entries, each taking a few seconds to enter, and so it took only a little over 4 months for one person to provide all the required relations, from which a multidimensional relational space can be derived.
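The effort involved in filling the 850-word matrix can be estimated in the same way. In the sketch below, only the vocabulary size comes from the text; the seconds per entry, the eight-hour working day and the 21 working days per month are assumptions chosen for illustration.

```python
REDUCED_VOCABULARY = 850     # Basic English word count (from the text)
HOURS_PER_DAY = 8            # assumption: one full working day
WORKING_DAYS_PER_MONTH = 21  # assumption

# Every word related to every other word.
entries = REDUCED_VOCABULARY ** 2


def months_of_work(seconds_per_entry: float) -> float:
    """Estimate working months needed to fill the whole matrix by hand."""
    hours = entries * seconds_per_entry / 3600
    return hours / HOURS_PER_DAY / WORKING_DAYS_PER_MONTH


print(f"entries: {entries:,}")
for seconds in (3.0, 3.5, 4.0):  # "a few seconds" is not specified further
    print(f"{seconds} s per entry -> {months_of_work(seconds):.1f} months")
```

Under these assumptions, somewhere between three and four seconds per entry gives a figure in the region of four months, which agrees with the effort reported above.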

Clearly, where complex or ambiguous material is being automatically translated from English into a reduced-vocabulary representation, there will be some loss of semantic content. This is a fundamental difference between this tool and human communication. Where a phrase is being translated between two languages, for instance, any semantic loss is due to a mistake or a lack of equivalence for a particular concept in the target language, e.g. words like zeitgeist, panache, kiosk or monozukuri (literally “the drive to create products of value”). When a person has to express an idea, on the other hand, even using a reduced lexicon, he can choose how to express himself in order to be able to convey some particular subtlety, and so there can be minimal information loss. Most words provide shades of meaning that are not strictly necessary to convey ideas, though they might be very entertaining or revealing, as any avid Oscar Wilde reader will immediately confirm.

In any case, material of a legal, business, scientific or technological nature is normally produced in a way that seeks to be both precise and clear, and is therefore amenable to a reduced-vocabulary representation that substantially maintains semantic content. Thus, although there may be some semantic loss, this is more than compensated for by the increased functionality gained from a machine potentially being able to “understand” natural language.

Ambiguity in language has three aspects that need to be tackled in any automatic system: application, semantic, duality. Ambiguity in application arises out of the area of application so that if the word “amphibian” is used, there is a need to work out if the topic is botany, biology or machines. The second sort of ambiguity lies in words like “charter”… is it a thing, action or right? The third type arises out of the everyday use of scientific words such as “fusion”… do we mean the physics of matter or simply “bringing together”? SIMPLISH is able to deal with all three forms, although it ignores other minor and more subtle forms. In the case of semantic ambiguity, SIMPLISH resolves the issue by inter-relating a set of 50 rough semantic “basic-level categorization” tags. These were derived by combining neuroscience knowledge, linguistics, semantics, and logic in a process of consilience, which is ongoing. Other complexities such as idiomatic phrases, names, places, substances, etc. are treated as special cases in parsing.

Translations using SIMPLISH can sound a little strange, particularly without the extra application-specific 100-word vocabulary, but they do convey the general idea of the original message. It must be borne in mind that in AI applications the purpose is primarily to represent knowledge rather than to be readable, though readability might be nice in human-machine interaction applications! However, the primary objective of reducing language to 1,000 words is not readability or tag-manipulation but the ability to convert similarity of meaning into spatial proximity, by displaying sentences in a multidimensional space conditioned by the inter-relations between the words contained in this reduced vocabulary. This ability enables the system to carry out the six fundamental logic operations on any two sentences (equivalence, similarity, abduction, deduction, induction and metaphor), enabling an artificial cognition system to process knowledge. This is a current area of development for the AI research group within The Goodwill Company, which is attempting to build a general cognition system that uses SIMPLISH to interact with the user. Their work can be followed through www.rachaelrepp.org
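The idea of turning similarity of meaning into spatial proximity can be illustrated with a toy example. The sketch below is not the SIMPLISH method, whose details are not given here: the word coordinates are invented, three dimensions are used instead of the 6 to 12 suggested earlier, and placing a sentence at the average of its word positions is just one simple assumption about how sentences might be located in such a space.

```python
import math

# Hypothetical coordinates for a few reduced-vocabulary words.
# Invented for illustration only; not taken from SIMPLISH.
WORD_COORDS = {
    "dog":   (0.9, 0.1, 0.0),
    "cat":   (0.8, 0.2, 0.0),
    "run":   (0.1, 0.9, 0.0),
    "go":    (0.2, 0.8, 0.1),
    "get":   (0.1, 0.6, 0.4),
    "money": (0.0, 0.1, 0.9),
}


def sentence_point(sentence: str) -> tuple:
    """Place a sentence at the mean position of its known words."""
    points = [WORD_COORDS[w] for w in sentence.lower().split() if w in WORD_COORDS]
    if not points:
        raise ValueError("sentence contains no known words")
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sentence_distance(a: str, b: str) -> float:
    """Euclidean distance between two sentences in the toy space."""
    return math.dist(sentence_point(a), sentence_point(b))


near = sentence_distance("dog run", "cat go")
far = sentence_distance("dog run", "get money")
print(f"'dog run' vs 'cat go'   : {near:.3f}")
print(f"'dog run' vs 'get money': {far:.3f}")
```

In this toy space the two sentences about an animal moving land close together, while the sentence about money lands far away. Averaging ignores word order, so a real system would need something richer to support operations such as deduction or metaphor.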

Bibliography

Graham, E.C. (ed.), "The Basic English Dictionary of Science", The MacMillan Company, New York, 1965.
Moorhouse, A.C., "The Triumph of the Alphabet - A History of Writing", Henry Schuman, New York, 1953.
Ogden, C.K., "Basic English: A General Introduction with Rules and Grammar", Paul Treber & Co., Ltd., London, 1930.
Ogden, C.K. (ed.), "The General Basic English Dictionary", London, 1940.
Rosch, E., "On the Internal Structure of Perceptual and Semantic Categories", in T.E. Moore (ed.), "Cognitive Development and the Acquisition of Language", pp. 111-144, Academic Press, New York, 1973.
Wekker, H., Haegeman, L., "A Modern Course in English Syntax", Routledge, London, 1996.
Wittgenstein, L., "Philosophical Investigations", Blackwell, Oxford, England, 1953.
Yu, Liyang, "Introduction to the Semantic Web and Semantic Web Services", Chapman & Hall/CRC, Boca Raton, USA, 2007.