Within the complex and broad domain of Natural Language Processing (NLP), one of the essential functions of parsers is the decomposition of text into minimal units, or tokens. A program of this kind could, for example, generate the following list of tokens from the phrase "Hello World!": [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33], where each number in the list corresponds to the ASCII (American Standard Code for Information Interchange) code of each of the minimal units of meaning identified in the phrase, in the same order. Of course, the reverse process is also possible: from such a list of tokens, the string that forms the phrase in question can be regenerated. Tokenization is therefore the basic process that makes written natural language amenable to further processing, based on its decomposition into minimal units of information with meaning. Most programming languages provide specific instructions to carry out the tokenization of ordered strings of alphanumeric characters, although the operation can also be implemented by other means those languages offer.

Thus, a program attempting to "read" a text must first "tokenize" it, generating a list of tokens, or minimal lexical units with their own meaning, as identified in the text. It then proceeds to identify larger units of meaning (looking for the presence, as a separator, of ASCII character 32, which corresponds to the blank space), which can be assimilated to "words", and finally ends up identifying units of higher-order meaning: phrases or sentences. Once the different sentences of the "read" text have been differentiated, the parser proceeds to perform the syntactic analysis proper, identifying the constituent parts of those sentences; for this purpose they are compared with previously defined patterns of possible structures, which depend on the language in which the text is written and on the level of complexity of analysis to be attained, because considering all the possible structures of a language, with their many variations, and representing them through a series of rules is by no means a simple task.
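As a minimal sketch of that word-identification step (the predicate name split_words is an assumption, not something defined in the text), a list of character codes can be split into "words" using the blank space, code 32, as separator:

split_words([], [[]]).
split_words([32|Codes], [[]|Words]) :-          % a blank closes the current word
    !,
    split_words(Codes, Words).
split_words([C|Codes], [[C|Word]|Words]) :-     % any other code extends it
    split_words(Codes, [Word|Words]).

% ?- split_words([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100], Words).
% Words = [[72, 101, 108, 108, 111], [87, 111, 114, 108, 100]].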
The detection of the changes of position that each language admits in relation to word order, or the analysis of transformation processes, is carried out through a structural analysis aimed at identifying the deep structure of a sentence in relation to its surface structure. Structural analysis starts from the surface structure (2) of a sentence and, by changing the order of certain words, tries to determine the deep structure (1) from which it may derive:

(1) Deep structure: "Pedro eats an apple"
(2) Surface structure: "Eats an apple Pedro"
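Purely as an illustrative sketch (the predicate name transformation and the representation of sentences as lists of words are assumptions, not part of the original text), the relation between the two structures could be expressed in Prolog as a simple reordering of the subject:

transformation(deep([Subject, Verb|Object]), surface([Verb|Rest])) :-
    append(Object, [Subject], Rest).

% ?- transformation(deep([pedro, eats, an, apple]), S).
% S = surface([eats, an, apple, pedro]).
% ?- transformation(D, surface([eats, an, apple, pedro])).
% D = deep([pedro, eats, an, apple]).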
The implementation of tokenization, apart from the use of specific instructions that directly transform a string of alphanumeric characters into a list of tokens, involves the use of other instructions whose function is to "read", one by one, the individual characters present in whatever input channel (input stream) is active. This will usually be either the computer keyboard, which is the input channel active by default (just as the output channel, the output stream, defaults to the computer monitor), or a text file located at a specified path.
Thus, Prolog provides by default the predicate name(?AtomOrInt, ?String). The AtomOrInt argument is the variable representing the string of alphanumeric characters, or "atom", to be tokenized, while the String argument is the variable representing the resulting list. The symbol "?" indicates that both arguments are reversible, that is, each can act either as an input or as an output variable, although one of them must necessarily be instantiated. It works as follows:

?- name('Hello World', X).
X = [72, 101, 108, 108, 111, 32, 87, 111, 114|...]

The interpreter returns the list of tokens, shown incomplete as indicated by the vertical bar followed by an ellipsis. Other predicates for handling atoms are described in the "Analysing and Constructing Atoms" section of the SWI-Prolog manual.
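Because the arguments of name/2 are reversible, the opposite conversion can be obtained by instantiating the second argument instead, for example:

?- name(X, [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]).
X = 'Hello World'.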
Another way to build a tokenizer in Prolog is to use the predicate get0/1 together with some kind of recursive algorithm that "traverses" all the text present in the active input channel (an external file, for example) and enters the resulting tokens, including blanks (which get/1 and get/2 do not read), into a cumulative list, until a previously defined stop marker is reached (the atom end_of_file is widely used for this purpose).
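A minimal sketch of such a recursive reader follows (the predicate name read_codes is an assumption; note also that get0/1, an alias of get_code/1 in SWI-Prolog, returns the code -1 at end of input, so that value is used here as the stop condition):

read_codes(Codes) :-
    get0(C),
    read_codes(C, Codes).

read_codes(-1, []) :- !.                 % end of input: stop accumulating
read_codes(C, [C|Codes]) :-              % otherwise keep the code, blanks included
    get0(Next),
    read_codes(Next, Codes).

% Example of use with an external file (the file name is hypothetical):
% ?- see('text.txt'), read_codes(Codes), seen.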
The parser, based on the constituents of a sentence (see the principles of Noam Chomsky's generative grammar) and by means of a finite number of rules, tries to determine the grammaticality, or otherwise, of an infinite number of constructions. A parser tries to establish how far a group of words can be fitted to a structure of rules. For example, given the sentence:

Pedro eats an apple

it first generates, through a process of tokenization, a list of the words it contains. If, from this initial list of words, a sub-list corresponding to the noun phrase (NP) of the sentence can be differentiated, and if it can be concatenated with other sub-lists that are verified, according to certain rules, as the verb phrase (VP), the sentence is concluded to be grammatical. What matters in constituent analysis is the order of the words in the sentence. The parser performs the analysis sequentially, word by word, starting from an initial list which, following the example sentence above, would be:
[pedro, eats, an, apple]

The computation of the parser's rules should result in another list, which will be the empty list [] if the initial sentence is grammatical (always with respect to the rules defined for the analyzer). In short, starting from the initial list of words, the parser checks whether it can be subdivided into two sub-lists corresponding, respectively, to the NP and the VP of the sentence.
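The following is a minimal sketch of that constituent check, written as a Prolog DCG; the grammar covers only the example sentence, and its rules and word lists are illustrative assumptions rather than a general grammar:

sentence --> noun_phrase, verb_phrase.

noun_phrase --> proper_noun.
noun_phrase --> determiner, noun.

verb_phrase --> verb, noun_phrase.

proper_noun --> [pedro].
determiner  --> [an].
noun        --> [apple].
verb        --> [eats].

% The parser consumes the initial list word by word; the sentence is
% grammatical with respect to these rules if the remainder is the empty list:
% ?- sentence([pedro, eats, an, apple], []).
% true.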
More information: parsing and semantic analysis.