Monday, March 6, 2006


DCG: notation of grammar rules in Prolog

The DCG grammatical notation (Definite Clause Grammar; in Castilian, "gramática de cláusulas definidas") is a syntactic variant, or extension, of the regular syntax of Prolog whose purpose is to make the implementation of formal grammars, and therefore of parsers written in this language, simpler and more readable, by leaving part of the information handled (namely, a pair of list variables) implicit. It is a shorthand for ordinary Prolog syntax, supported by almost all current parsers and interpreters for this language, which transform it automatically into standard internal clauses. In DCG notation the infix operator ":-" is replaced by the operator "-->", and we can establish, by way of example, the following correspondence between notations, the first being a Prolog clause in ordinary syntax and the second its equivalent in DCG notation:

oracion(S0, S) :- sintagma_nominal(S0, S1), sintagma_verbal(S1, S).
oracion --> sintagma_nominal, sintagma_verbal.

The above Prolog clause is expressed in words as follows:

"There is a prayer between S0 and S if there is a noun phrase between S0 and S1, and there is a verb phrase between S1 and S"

DCG notation is based on the one used for context-free grammars (CFGs, N. Chomsky), which indicate the hierarchical relationships between the constructions of a language and are described by BNF (Backus Normal Form, or Backus-Naur Form), originally created to define the syntactic structure of the ALGOL 60 language (1960, John Backus) and widely used both to define the syntactic structure of programming languages and to define the syntactic structures of languages in general. Using DCG notation it is therefore possible to handle context-free grammars directly in Prolog. Context-free grammars are a simplification of ordinary grammar which, for example, and referring to the Castilian language, could take the following expression in DCG notation:

oracion --> sintagma_nominal, sintagma_verbal.

sintagma_nominal --> determinante, nombre.

sintagma_verbal --> verbo, sintagma_nominal.
sintagma_verbal --> verbo.

determinante --> [el].
determinante --> [la].

nombre --> [hombre].
nombre --> [manzana].

verbo --> [come].
verbo --> [canta].

The Prolog system (in this case SWI-Prolog) automatically translates this grammar into the following set of clauses:

oracion(A, B) :- sintagma_nominal(A, C),
        sintagma_verbal(C, B).

sintagma_nominal(A, B) :- determinante(A, C),
        nombre(C, B).

sintagma_verbal(A, B) :- verbo(A, C),
        sintagma_nominal(C, B).
sintagma_verbal(A, B) :- verbo(A, B).

determinante([el|A], A).
determinante([la|A], A).

nombre([hombre|A], A).
nombre([manzana|A], A).

verbo([come|A], A).
verbo([canta|A], A).

As is easily deduced, a program based on this small and limited grammar will only accept as grammatically correct "el hombre come la manzana" ("the man eats the apple"), "el hombre canta" ("the man sings") and similar combinations, including obviously anomalous sentences such as "la manzana canta" ("the apple sings") or "el hombre come" ("the man eats"), for which it likewise returns an empty list; but it will not accept something like "el hombre baila" ("the man dances"), since the verb "bailar" is not in the program's base of facts or knowledge. It is therefore easy to see that a complete grammar, covering all the possible structures, relationships and combinations of the constituent parts of a language, would be very complex and difficult to implement. In the example above, all the sentences accepted as grammatically correct by the program's grammar can be obtained by posing the following goal to the interpreter:

?- oracion(X, []).
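As a minimal sketch (assuming the small grammar above has been consulted in SWI-Prolog; the helper name todas_las_oraciones/1 is my own, not from the post), the same goal can also be expressed through the standard phrase/2 interface and all solutions collected with findall/3:

todas_las_oraciones(Oraciones) :-
    findall(O, phrase(oracion, O), Oraciones).

% ?- todas_las_oraciones(L), length(L, N).
% N = 40 with this lexicon; the solutions include gender-discordant
% combinations such as [el, hombre, come, el, manzana], which is what the
% extra agreement arguments discussed further below are meant to rule out.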


Meanwhile, a very simple grammar for English, written in DCG notation, takes the following form:

sentence --> noun_phrase, verb_phrase.


noun_phrase --> noun.
noun_phrase --> determiner, noun, rel_clause.

verb_phrase --> verb.
verb_phrase --> verb, noun_phrase.


rel_clause --> [].

rel_clause --> [that], verb_phrase.

determiner --> [the].

determiner --> [a].

noun --> [john].

noun --> [annie].

noun --> [man].

noun --> [men].

noun --> [woman].

noun --> [women].

...



verb --> [like].

verb --> [likes].

...
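By way of illustration (a small sketch, assuming the grammar above has been loaded in a Prolog system that performs the DCG translation), queries such as the following succeed:

?- phrase(sentence, [the, man, that, likes, annie, likes, a, woman]).
% succeeds

?- sentence([annie, likes, a, man], []).
% succeeds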

In DCG notation two parameters are left implicit: an input list and an output list. The first can be represented, for example, by the sentence [the, dog, bites], and the second by the empty list [], which is what remains once the parser in use has checked (and accepted) the syntactic structure of that sentence. Additionally, variables can be added to grammars written in DCG notation, for example to specify gender and number.
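For example (a hedged sketch of my own, not from the post: the plural lexicon entries los, hombres, comen and the start symbol oracion_concordada are illustrative additions), number agreement can be enforced by threading an extra argument through the rules:

oracion_concordada --> sintagma_nominal(Numero), sintagma_verbal(Numero).

sintagma_nominal(Numero) --> determinante(Numero), nombre(Numero).

sintagma_verbal(Numero) --> verbo(Numero), sintagma_nominal(_).
sintagma_verbal(Numero) --> verbo(Numero).

determinante(singular) --> [el].
determinante(plural) --> [los].

nombre(singular) --> [hombre].
nombre(plural) --> [hombres].

verbo(singular) --> [come].
verbo(plural) --> [comen].

% ?- phrase(oracion_concordada, [los, hombres, comen]).    % succeeds
% ?- phrase(oracion_concordada, [el, hombres, comen]).     % fails: number mismatch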

As noted above, DCG notation allows additional arguments to be added to the rules, and also Prolog goals to be incorporated into the body of rules, the latter enclosed in braces {}. Additional arguments make it possible, for example, to introduce grammar rules that respect the agreement conventions of the target language, avoiding mistakes in gender/number agreement, and so on. Likewise, with DCG notation it is also possible to construct programs that represent the parse trees of sentences (showing the grammatical categories involved in the breakdown of their parts) and syntax trees (parse trees) that display graphically, as a tree, the structure assigned by a given grammar. As explained above, parsers are responsible for constructing the parse trees of the sentences of a language, capable of representing their phrase structure. Following the example of the simple Castilian grammar described above, the graphical representation of the parse tree of the sentence "el hombre come la manzana" ("the man eats the apple") is obtained with the following goal, making use of this program:

?- draw(o(sn(d(el), n(hombre)), sv(v(come), sn(d(la), n(manzana))))).

[The ASCII tree drawn by the program is not reproduced here.]

where "o" = sentence (oración), "sn" = noun phrase (sintagma nominal), "d" = determiner, "n" = noun, "sv" = verb phrase (sintagma verbal) and "v" = verb. It should be noted that this program does not build the parse tree itself, which is:

o(sn(d(el), n(hombre)), sv(v(come), sn(d(la), n(manzana))))

it simply produces its graphical representation as a chart of ASCII characters. The source code of the program, originally named draw.swi, was copied into a plain text editor and saved with the ".pl" extension so that it could be run by the SWI-Prolog interpreter, although another option is to associate files with the ".swi" extension with the interpreter's executable. Its author is Bertram Ludäscher, of the University of California, Davis.
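As a sketch of my own (not the draw.swi program, nor code from the post), the same grammar can be extended with one extra argument per rule so that it builds the parse tree while parsing; the resulting term has exactly the shape expected by draw/1 above. It is meant as an independent illustration, not to be loaded together with the agreement sketch shown earlier, since the arity-1 non-terminals would overlap:

oracion(o(SN, SV)) --> sintagma_nominal(SN), sintagma_verbal(SV).

sintagma_nominal(sn(D, N)) --> determinante(D), nombre(N).

sintagma_verbal(sv(V, SN)) --> verbo(V), sintagma_nominal(SN).
sintagma_verbal(sv(V)) --> verbo(V).

determinante(d(el)) --> [el].
determinante(d(la)) --> [la].

nombre(n(hombre)) --> [hombre].
nombre(n(manzana)) --> [manzana].

verbo(v(come)) --> [come].
verbo(v(canta)) --> [canta].

% ?- phrase(oracion(Arbol), [el, hombre, come, la, manzana]).
% Arbol = o(sn(d(el), n(hombre)), sv(v(come), sn(d(la), n(manzana))))
% ?- phrase(oracion(Arbol), [el, hombre, come, la, manzana]), draw(Arbol).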

The examples of DCG notation and Prolog for Castilian are taken from Chapter 9, "Using Grammar Rules", section 9.1, "The Parsing Problem", of W. F. Clocksin and C. S. Mellish's "Programación en Prolog" (2nd ed., Barcelona: Gustavo Gili, 1993, ISBN 84-252-1339-8), the Spanish translation of the original English "Programming in Prolog", to which I refer the reader for a detailed explanation of grammar rules, parsing and DCG notation in Prolog, given that the authors' treatment of the subject is clear and perfectly understandable, accessible even to people not particularly versed in this language.

Meanwhile, the example grammar for English has been obtained from Chapter 17, "Language Processing with Grammar Rules", of the book "Prolog Programming for Artificial Intelligence" (Ivan Bratko, 2nd ed., Addison-Wesley, 1990, ISBN 0-201-41606-9). This is a highly recommended book, more advanced and sophisticated than Clocksin and Mellish's, which is clearly more introductory; as its title indicates, it covers, on the one hand, the Prolog language in general and, on the other, its application to the main techniques in the field of AI (search and classification, expert systems and knowledge representation, NLP, machine learning, etc.). Finally, we may mention another example of practical use of DCG notation and the Prolog programming language: the html_write library [1] [2] [3] of the SWI-Prolog shell and development environment, which translates Prolog terms into HTML, generating output in the latter language, and which for that purpose uses precisely structures written in DCG notation.
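As a rough sketch of how that library is used (the call pattern follows the SWI-Prolog documentation for html//1 and print_html/1, but the exact snippet is my own illustration, not code from the post): html//1 is itself a DCG non-terminal that turns a Prolog term describing the page into a list of HTML tokens, which print_html/1 then writes out.

:- use_module(library(http/html_write)).

hola_html :-
    phrase(html([ h1('Hola'),
                  p('Generado desde Prolog mediante una DCG.')
                ]), Tokens),
    print_html(Tokens).

% ?- hola_html.
% emits the corresponding <h1> and <p> elements on the current output.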



More information (NLP and DCGs in Prolog):




  • Adventure in Prolog - 15. Natural Language.
  • In Learn Prolog Now!: 7. Definite Clause Grammars; 8. More Definite Clause Grammars (P. Blackburn; another location).
  • Perl Style Regular Expressions in Prolog (another location).
  • Introduction to Definite Clause Grammars.
  • CFGs and BNF:
  • Programming Languages: BNF notation and other grammars. What is BNF notation?
  • Backus-Naur Form (in FOLDOC).
  • Context-free grammar (in Wikipedia).
  • Part of the same set of class readings: Four Concepts in Programming Language Description: Syntax, Semantics, Pragmatics and Metalanguage.
  • Grammatical Description of Syntax: The BNF Metalanguage.
  • Derivations, Ambiguity and Semantics.
  • Nonrecursive Syntax: The Regular Expression Metalanguage.
  • Between Lexics and Syntax: Whitespace, Layout, Comment Conventions.

Sunday, February 26, 2006

Parsers and tokenization



Within the complex and broad domain of Natural Language Processing (NLP), one of the essential functions of parsers is the analysis of chains of tokens in search of possible syntax errors (remember that syntax, broadly understood, is the part of grammar that deals with the rules governing the combination of words into larger structures, such as sentences, as well as the relations established between those words). A token can be defined as the smallest unit of information with its own meaning within a sequence of alphanumeric characters. Tokenization is the process of decomposing such character strings into these minimal lexical units, producing a list of tokens.

Thus, a program of this kind could generate the following list of tokens from the phrase "¡Hola Mundo!" ("Hello World!"): [161, 72, 111, 108, 97, 32, 77, 117, 110, 100, 111, 33], where each of the numbers in the list corresponds to the ASCII (American Standard Code for Information Interchange) code of each of the minimal units of meaning identified in the phrase, in the same order. Of course it is also possible to carry out the reverse process and, from that list, generate the character strings that make up the phrase in question. Tokenization is therefore the basic process that makes written natural language tractable for further processing, based on its decomposition into minimal units of information with meaning. Most programming languages provide specific instructions for tokenizing alphanumeric character strings, although the operation can alternatively be implemented by other means those languages offer.

Thus, a program attempting to "read" a text must first "tokenize" it, generating a list of tokens, or minimal lexical units with their own meaning, as identified in the text. It then proceeds to identify larger units of meaning, which could be assimilated to "words" (looking for the presence of the separator, ASCII character 32, which corresponds to the blank space), to finally end up identifying units of meaning of a higher order: phrases or sentences. Once the different sentences of the "read" text have been identified, the parser performs the parsing proper, identifying the constituent parts of those sentences, which for this purpose are compared against previously defined patterns of possible structures; these depend on the language in which the text is written and on the level of complexity of analysis to be attained, because covering all the possible structures of a language and their many variations, and representing them through a series of rules, is by no means a simple task.
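To illustrate the word-identification step just described (a minimal sketch of my own, not taken from the original post), a list of character codes can be split into "words" by using code 32, the blank space, as the separator:

palabras([], [[]]).
palabras([32|Resto], [[]|Palabras]) :- !,
    palabras(Resto, Palabras).
palabras([Codigo|Resto], [[Codigo|Palabra]|Palabras]) :-
    palabras(Resto, [Palabra|Palabras]).

% ?- name('hola mundo', Codigos), palabras(Codigos, Listas).
% Listas = [[104, 111, 108, 97], [109, 117, 110, 100, 111]]
% i.e. the code lists of "hola" and "mundo"; name/2 can turn each back into an atom.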

The detection of the changes of position that each language allows in relation to word order, or the analysis of transformation processes, is carried out through structural analysis aimed at identifying the deep structure of a sentence in relation to its surface structure. Structural analysis starts from the surface structure (2) of a sentence and, by changing the order of certain words, tries to determine its possible transformation into a deep structure (1): (1) deep structure: "Pedro come una manzana" ("Peter eats an apple"); (2) surface structure: "Come una manzana Pedro" ("Eats an apple Peter").

The implementation of the tokenization process, apart from the use of specific instructions that directly transform a string of alphanumeric characters into a list of tokens, involves the use of other instructions whose function is to "read", one by one, the individual characters present in the active input channel (input stream) specified. This will usually be either the computer keyboard, which is the input channel active by default (just as the output channel, the output stream, defaults to the computer monitor), or a text file located in the path specified.

Thus, Prolog provides by default the predicate name(?AtomOrInt, ?String). The argument AtomOrInt is the variable representing the string of alphanumeric characters, or "atom", to be tokenized, while the String argument is the variable representing the resulting list. The symbol "?" indicates that both arguments are reversible, i.e. they can act both as input and as output variables, although one of them must necessarily be instantiated. It works as follows:

?- name('¡Hola Mundo!', X).
X = [161, 72, 111, 108, 97, 32, 77, 117, 110|...]

The interpreter returns the list of tokens, shown incomplete as indicated by the vertical bar followed by an ellipsis. Other predicates for analysing and constructing atoms are described in the section "Analysing and Constructing Atoms" of the SWI-Prolog manual.

Another way of tokenizing atoms in Prolog is to use the predicates get0/1 and get0/2 together with some kind of recursive algorithm that "traverses" all the text present in the active input channel (an external file, for example) and adds the resulting tokens, including blanks (which get/1 and get/2 do not read), to an accumulator list, until a previously defined stop marker is reached (the atom end_of_file, which corresponds to the end of the text, is widely used for this purpose). These predicates actually read, one by one, the byte of each individual alphanumeric character, associated with its corresponding ASCII code.
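As a minimal sketch of that kind of recursive reading loop (my own illustration, not the post's code; the file name texto.txt in the query is hypothetical), the character codes of the current input stream can be accumulated until the end of the input. Note that get0/1 signals the end of input with the code -1, whereas it is read/1 that returns the atom end_of_file mentioned above:

leer_codigos(Codigos) :-
    get0(Codigo),
    leer_codigos(Codigo, Codigos).

leer_codigos(-1, []) :- !.                 % end of input reached
leer_codigos(Codigo, [Codigo|Resto]) :-
    get0(Siguiente),
    leer_codigos(Siguiente, Resto).

% ?- see('texto.txt'), leer_codigos(Codigos), seen, name(Texto, Codigos).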

The parser, based on the constituents of a sentence (see the principles of Noam Chomsky's generative grammar) and using a finite number of rules, tries to determine the grammaticality, or otherwise, of an infinite number of constructions. A parser tries to see how far a group of words can be fitted into a structure of rules. For example, if we have the sentence "Pedro come una manzana" ("Peter eats an apple"), first, through a process of tokenization, a list of the words it contains is generated. From that initial list of words, a sub-list corresponding to the noun phrase (SN) of the sentence is singled out and, if it can be concatenated with other sub-lists which, according to certain rules, are verified as the verb phrase (SV), the sentence is concluded to be grammatical. What matters in constituent analysis is the order of the words in the sentence. The parser performs the analysis sequentially, word by word, from an initial list which, following the example sentence, would be:

[pedro, come, una, manzana]

The computation of the parser's rules should result in another list, which will be the empty list [] if the initial sentence is grammatical (always according to the rules defined for the analyzer). In conclusion, starting from the initial list of words, the parser checks whether it can be subdivided into two sub-lists corresponding, respectively, to the SN and the SV of the sentence. More information: Parsing and semantic analysis.
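As a rough sketch of that subdivision (my own illustration, reusing the sintagma_nominal and sintagma_verbal non-terminals of the Castilian grammar from the post above; note that "pedro" and "una" are not in that small lexicon, so the query below uses the earlier example sentence instead):

gramatical(Palabras) :-
    append(SintagmaNominal, SintagmaVerbal, Palabras),
    phrase(sintagma_nominal, SintagmaNominal),
    phrase(sintagma_verbal, SintagmaVerbal).

% ?- gramatical([el, hombre, come, la, manzana]).
% succeeds, with SintagmaNominal = [el, hombre] and SintagmaVerbal = [come, la, manzana]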

    Sunday, February 19, 2006


Do documentary thesauri have a place?

Although it is not usually my habit to publish full texts on this site (for that purpose, Seen and Read remains open), I will do so on this occasion, given the undoubted interest of the text I reproduce below, on account of the ideas it raises, the practical experiences its author shares, and the debate that all this should trigger. It is a message sent by José Ramón Pérez Agüera (Department of Information Systems and Programming, School of Computing, Complutense University of Madrid) to the IweTel mailing list on 13/02/2006 (to access the original text in the archives section, you need to be subscribed to that discussion forum for professionals of libraries and documentation centres).

[Beginning of Pérez Agüera's text]

    "Although I have to publish notice in Thinkepi , took a few months (in fact any other year) turning to this issue and I would like to have the opinion of the community of filmmakers, beyond my own observations, which this mail is not intended as a note, but give rise to a debate in which the documentary is not n having a say, to develop within the field of computing. Work

    automatic generation of thesauri, which has led me to conduct experiments of automatic indexing and query expansion from hand-made thesaurus. Specifically I used three thesauri: ISOC-Economy, EUROVOC and spin, all known to spare. The collection on which I performed the tests has been the sub-set of Economy and policy news generated by the EFE Agency in 1994 (efe94 is a typical collection experiments information retrieval which consists of a total of 215,738 documents. I've used 23,390 in my experiments to focus on the area of \u200b\u200bpolitics and economics, which are largely covered by the thesaurus above).

In addition, I also have a set of 22 queries with their respective relevance judgments for the domain mentioned, against which to run the experiments. These queries were obtained from the CLEF [Cross-Language Evaluation Forum] conference, which is held every year and focuses on mono- and multilingual information retrieval. As a search engine I have used Lucene, adapted to Spanish by stemming the indexing terms, which is based on the traditional Salton vector space model (a classic, come on).

The purpose of my first experiments was to check how documentary thesauri, like the ones used every day by documentation centres all over the world, affect automated information retrieval. And what was my surprise but to find that, both together and separately, using all or only some of the types of relationships that exist in a thesaurus, by direct or weighted global expansion (how I weighted the thesauri is another story), in every case the thesauri mentioned improved retrieval on this collection by virtually nothing, neither in precision nor in recall (nor in another heap of measures I have implemented following the model proposed by the TREC conference, which has a fairly complete piece of software, trec_eval, for evaluating retrieval); what is more, in some of the experiments, depending on the length of the query, using hand-made documentary thesauri produced worse results.

The next step in my research has been to work with thesauri generated automatically using three basic methodologies: linguistic processing of the collection (POS-tagging, parsing, analysis of dependency trees between terms); co-occurrence analysis for the generation of relationships between terms (Latent Semantic Analysis, Qiu and Frei (and its Spanish version implemented by Zazo, Berrocal and co. in Salamanca), Jing and Croft, etc.); and the use of other linguistic resources (read EuroWordNet, Spanish version, and dictionaries).

The thesauri generated automatically by these methods have indeed provided significant improvements in retrieval. I do not want to make this message heavier with the technical details and figures, but I can pass them on to anyone who wants them.

Anyway, I mentioned this fact to Antonio García Jiménez, who has been working on documentary thesauri for a while, and he gave me some very valuable ideas that may in part explain the results, and that could be summed up (Antonio, if you are out there, correct me if I am wrong) as follows: the thesauri did not adapt well to the collection to which I applied them, and therefore a thesaurus made by hand for the collection I am working with would be needed in order to obtain an improvement based on the use of documentary thesauri.

After this comment I was left ruminating, and I modified the terminology of the thesauri I had, to bring both sets of data as close together as possible and check whether the retrieval capacity of the thesauri improved somewhat, but unfortunately the data have remained fairly discouraging. After all these tests I am left with the following question: do thesauri made by hand, and based on the traditional methodology and standards, have a place in the automated retrieval scenario prevailing today, whether inside or outside the Internet? My answer, at the moment, in the absence of your comments, is that they do not, and that it is urgently necessary to consider several changes in the methodology of thesaurus development that currently exists and of which the ISO standards and the manuals by Aitchison and Gilchrist and by Gil represent the main references.

The main problems of using documentary thesauri in automated information retrieval are:

  • Data dispersion: words constantly appear in the collection that the thesaurus cannot normalize (this problem is not solved by a regular handcrafted update, given the rate at which the collection grows).
  • Semantic ambiguity, even in domain-specific thesauri such as these.
  • Inconsistencies in the structure of the thesaurus.

All these problems are normal considering that thesauri are made and managed by hand, with no more or less automatic mechanism for consistency control (in fact, the mere import of the thesauri into SQL allowed these structural inconsistencies to be detected), beyond MultiTes and other such programs.
Added to this is the fact that thesauri, as they are made today and contrary to what many think, are not useful for the transition to ontologies, owing to basic design requirements (mainly the object-oriented paradigm) that documentary thesauri do not come anywhere close to meeting, which causes serious consistency problems when trying to convert a documentary thesaurus into an ontology. Given these facts, and so as not to take up more of your time on the subject, I would like to know your opinion on this issue (since many of you earn your living with it, I think). To be concrete, the initial questions, without excluding any others you may wish to raise, are:

  • What is the role of documentary thesauri in the context of automated information retrieval in documentation centres?
  • What is the role of documentary thesauri in information retrieval on the Internet?
  • Is it necessary to change the currently prevailing paradigm of thesaurus development? In what sense?

I am not an expert in thesauri, but I have my own opinions, which I will post here if the debate takes off; what interests me, though, is yours.

I hope I have been clear; if you have any questions about what I have written, or something is not clear, do not hesitate to ask. I hope that, with luck and between us all, we can give a push to this problem, which is so purely a documentation matter."

[End of Pérez Agüera's text]

As you know, any comment, correction, contribution, etc., in relation to the issues raised in the preceding text can be sent to the IweTel list, to enrich a debate that is certainly deserved by all the questions raised by Pérez Agüera concerning information retrieval, automated indexing, and the role that thesauri, as a standardized description tool, play in all this...

By Pérez Agüera, and on the issues addressed in his message to the IweTel list, see also "Automation of thesauri and their use in the Semantic Web" (SWAD-Europe, workshop "Introduction to the Semantic Web", June 13, 2004). See also, in general, the SWAD-Europe Reports and Presentations (SWAD stands for Semantic Web Activity: Advanced Development). I also think it worth reviewing the article by Antonio García Jiménez, "Instruments of Knowledge Representation: Ontologies versus Thesauri" (in PDF), in the Journal of Documentation (no. 7, 2004, pp. 1979-1995).
On another note, I take this opportunity to link below to a series of links, references and texts that have caught my attention in recent months (the quotations are taken verbatim from the referenced sites):

Articles, introductions, blog posts:

  • Why Use Prolog? (Jocelyn Paine). A document setting out ten (good) reasons to use (in the author's opinion) the Prolog programming language.
  • "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001 (in PDF; Lillian Lee, Cornell University).
  • Programming using Visual Prolog 6.0 (R. Fuentes Covarrubias, University of Colima, School of Mechanical and Electrical Engineering, Mexico).


  • The legacy of the Reverend Bayes (in Devlin's Angle, February 2000).
  • Two very good basic introductions to the Prolog language: First Steps in Prolog: an easy introduction to this AI language / Free Prologs: a guide to freely available Prolog systems (H. Collingbourne; in Bitwise Magazine).





  • Modeling Decisions for Artificial Intelligence (Tarragona, April 3-5, 2006): "Decision making processes, and information fusion tools at large, are currently embedded in most Artificial Intelligence applications. As a consequence, systems based on decision making and fusion techniques are becoming pervasive. They are currently in use in all kind of environments, from entertainment gadgets to safety-critical or risk management software."



  • 22nd International Conference on Logic Programming (ICLP 2006, August 17-20, 2006).