February 2006

Has the thesaurus to documentaries?

Although not usually my habit entirely publish full text on this site (for those duties remain open Seen and Read), I would do so on this occasion given the undoubted interest of the I reproduce below, in accordance with the ideas it raises, the practical experiences made manifest and shared by the author, and the debate that all this should trigger. This is a message sent by José Ramón Pérez Agüera

(Department of Information Systems and Programming, School of Computing, University Complutense of Madrid) to the mailing list IweTel , last 13/02/2006 (to access the original text in the files section, you need to be signed to that discussion forum for professionals in libraries and documentation centers).

[Begin text Pérez Agüera]

"Although I have to publish notice in Thinkepi , took a few months (in fact any other year) turning to this issue and I would like to have the opinion of the community of filmmakers, beyond my own observations, which this mail is not intended as a note, but give rise to a debate in which the documentary is not n having a say, to develop within the field of computing. Work

automatic generation of thesauri, which has led me to conduct experiments of automatic indexing and query expansion from hand-made thesaurus. Specifically I used three thesauri: ISOC-Economy, EUROVOC and spin, all known to spare. The collection on which I performed the tests has been the sub-set of Economy and policy news generated by the EFE Agency in 1994 (efe94 is a typical collection experiments information retrieval which consists of a total of 215,738 documents. I've used 23,390 in my experiments to focus on the area of \u200b\u200bpolitics and economics, which are largely covered by the thesaurus above).

Besides I also have a set 22 of consultations with their respective relevance judgments for the domain mentioned in the face of the experiments. These consultations have obtained from Congress the CLEF

[Cross-Language Evaluation Forum] which is held every year and focusing on recovery issues mono and multilingual information. As I used the search engine Lucene , adapted into English stemming the terms of indexation, which is based on the traditional model

Salton's vector space ( a classic, come on). The purpose of my first experiments was to check how they affect automated information retrieval thesauri using documentary as those used every day all documentation centers the world. And what was not my surprise to find that both together and separately, using all or some of the types of relationships that exist in the thesaurus, by direct or weighted global expansion (the way I balanced the thesaurus is another history), in each of the cases mentioned thesauri have not improved recovery virtually nothing in the collection, or precision, or recall (or another hill of measures have been implemented based on the model proposed by TREC

Congress has a software called trec_eval completillo enough to evaluate recovery) is more in some of the experiments, depending on the query length documentary using handmade thesauri worse results.

The next step in my research has been working with thesauri automatically generated from three basic methodologies: Linguistic processing collection ( POS-Tagging , parsing, tree analysis, dependency between terms.)

Analysis co-occurrence for the generation of relationships between terms (Latent Semantic analysis

, Qui and Frei (and its English version implemented by Zazo, Berrocal and Cia de Salamanca), Jing and Croft, etc.). Using other linguistic resources (read

EuroWordNet

English version, and dic).

thesauri automatically generated from these methods themselves have provided significant improvements in recovery. No I want to add me count heavier on the technical details and figures but for that I can spend like they are.

Anyway, I mentioned the fact to Antonio García Jiménez, thesaurus that this documentary does a while, and told me some very valuable ideas may in part explain the results, and could be sum (Antonio, if you're out there, correct me if I'm wrong) that the thesaurus did not adapt well to the collection on which I applied and therefore would need a thesaurus fact hand to the collection that I work for an improvement based on the use of thesauri documentaries.

After this comment I left ruminating and modify the library to suit terminological thesauri with which I had to pull together both sets of data as possible and check whether some improvement in the capacity of recovery n thesauri, but unfortunately the data has remained fairly discouraging. After all these tests I got the following question: do take place handmade thesauri, and based on the methodology and traditional standards in the automated recovery scenario prevailing today, whether inside or outside the Internet? My answer at the moment, in the absence of your comments, you do not take place and that it is urgently necessary to consider several changes in the methodology of development of thesauri that currently exists and that the ISO standards Gilchrist's book and Aitchison and Gil White's book, represent the main references.

The main problems of using thesauri documentaries are Automated Information Retrieval:

dispersion of data: That is constantly displayed in the library thesaurus words that can not normalize (this problem is not solved with a regular update crafted in terms of growth of the collection.) Semantic Ambiguity over even domain-specific thesauri such as these.

Inconsistencies in the structure of the thesaurus.

All these problems are normal considering that thesauri are hand made and managed maso no less automatic mechanism control of consistency (in fact the mere import of thesauri to SQL has allowed the detection of these structural inconsistencies) beyond MultiTes and other such programs.
to this is that as thesauri do today, and contrary to what many think, not used to the transition ontologies, due to basic design issues or (mainly object-oriented paradigm) with which the documentary thesauri do not meet anywhere near that causes serious problems of consistency when trying to convert documentary thesaurus in an ontology. Given these facts because I do not give more of my time on the subject, I would like to know your opinion on this issue (as many are doing the bread in it, I think) . For concrete, the initial questions, without excluding other possible you can go by are:
What is the role of thesauri and documentaries in the context of automated information retrieval centers documentation?

Is it necessary to change the paradigm currently prevailing development of thesauri? In what sense?

I am not an expert but I have my opinions thesaurus I'll post here if the debate is successful, but my interests are yours.

I hope I was clear, if you have any questions about what I've written or something does not mean not hesitate to ask, I hope that with luck and together we can give a touch to this problem as purely documentary. "

[End Agüera Pérez text]

As you know, any comment, correction, input etc., In relation to the issues raised in the preceding text, can send it to the IweTel that list, and enrich the debate that certainly deserves all the issues raised by Perez Aguera in relation to information retrieval, automated indexing, and the role thesaurus as a tool for standardized description play in all this ...

De Pérez Aguera, and on the issues addressed in its submission IweTel list, see also: "Automation

(SWAD-Europe
,
, June 13, 2004). See also generally

SWAD-Europe Reports and Presentations

SWAD-Europe. SWAD means Semantic Web Activity: Advanced Development

. I also think it appropriate review, issue of Journal of Documentation

(n º 7, 2004, pp. 1979-1995), the article by Antonio García Jiménez Instruments

Thesaurus "(in
PDF).
In another vein, I take this opportunity to connect and then a series of links, references and texts that have been deserving of my attention in recent months (the quotes are direct quotes taken from referenced sites):

articles, introductions, annotations of "blogs"

Why Use Prolog?

(Jocelyn Paine

). Document which sets out ten (good) reason to (in the opinion of the author) use the programming language Prolog.

"I'm sorry Dave, I'm Afraid I Can not Do That": Linguistics, Statistics, and Natural Language Processing circa 2001 (in PDF , Lillian Lee, Cornell University). Programming using Visual Prolog 6.0 (R. Fuentes Covarrubias, University of Colima, School of Mechanical and Electrical Engineering, Mexico).

The legacy
of the Reverend Bayes (en Devlin's
Angle , febrero 2000).

Dos muy buenas introducciones básicas al lenguaje Prolog: First
Steps in Prolog: an easy introduction to this AI
language /

Free
Prologs: a guide to freely available Prolog systems

(H. Collingbourne;
en

Bitwise Magazine

).

Linear
Logic - Naturally! (en That
Logic Blog ): "Linear logic has enjoyed enormous popularity over the
last couple of decades or so. For those without some training in structural
proof theory, understanding the system remove intimidating Can Be, Especially Because Of The notation funny and weird jargon. In this post, I am going to show you That, in fact-, You Could Have invented linear logic! [...]".
Logic, Math, Deduction (Manuel Ojeda Aciego, Dept. Applied Mathematics, University of Malaga, in PDF ): We present a brief historical perspective of development in parallel and sometimes interlaced, of logic and mathematics, with the ultimate goal of presenting Computational Logic, and in particular, Automated Deduction, as a research area of \u200b\u200btremendous potential practical math, no wonder other prestige known authors claim that the logic is as CA Computer ; calculation is Physics Infinitesimal. ".
pages and websites:
Neural Java: Neural Networks Tutorial with Java Applets ." Neural Java is a series of Exercises and demos. Each Exercise Consists of a short introduction to small demonstration program Written in Java (Java applet), and a series of questions Which Are Intended as an invitation to play with the Programs and explore the Possibilities of Different algorithms. [...]".
Wikibook on Prolog ("[...] This book dog serve as a textbook or tutorial for Anyone Who Wants to learn the programming language Prolog. No prior programming experience is required. Some basic Knowledge of logic dog come in handy. [...]"). Yande. The set of modules designed to help students of logic, implemented in language Prolog. Currently are: TT: Building truth tables for propositional logic. It also includes two predicates to determine a formula as valid, satisfiable, or unsatisfiable, and reasoning as correct or incorrect, ND: construction of natural deduction proofs, CNF: convert formulas conjunctive normal form, or in clausal form. TRIPLE
("[...] an RDF query, inference, and transformation language for the Semantic Web. [...]"). Glossary
Recuperación
de Información en la Web ).

Conferencias, congresos:

Modeling Decisions for
Artificial Intelligence

22nd International Conference
on Logic Programming

Within the complex and broad scope domain of Natural Language Processing (NLP), one of the essential functions of the parser or parsers is

chain analysis tokens for possible syntax errors (remember that the syntax, broadly understood, is that part of grammar that deals with the rules governing the issuing of words

in larger structures such as

prayers, as well as the relations established between them these words.)

A token can be defined as the smallest unit of information with their own meaning in a sequence of alphanumeric characters. These chains of units called lexical information is a process of decomposition, as a list of those strings of tokens

in minimum units. So a program like this could generate the following list of tokens from the phrase "Hello World!" [161, 72, 111, 108, 97, 32, 77, 117 , 110, 100, 111, 33] where each of the numbers on the list corresponds to the ASCII (American Standard Code for Information Interchange) for each one of the smallest units of Significance n identified the phrase in the same order. Of course it is possible to carry out the reverse process, and from that list to generate the tokens chains that form the phrase in question. The tokenization is therefore the basic process that can handle natural language written for further processing, based on its decomposition in minimum units of information with meaning. Most of the programming language provide for specific instructions to carry out the process of tokenization ordered character strings alphanumeric, although this operation can be implemented alternatively by other methods provided by these languages. Thus, a program attempting to "read" a text, you must first "tokenised" generating a list of tokens

or minimal lexical units with their own meaning, as identified in the text. Then proceed to identify larger units of meaning itself (looking for the presence, as a separator, ASCII character

36, which corresponds to the space blank), which could assimilate as "words", to finally end up identifying other units of higher-order meaning, phrases or sentences. Different sentences of the text "read", the parser proceeds to perform the actual parsing, thus identifying for the constituent parts of those sentences that, for this purpose are compared with previously defined patterns of possible structures , which depend on the language of writing the text, and the level of complexity of analysis to be attained, because look at all possible structures of a language and many variations, and represent them through a series of rules, there is a very simple task.

detection of changes in each language supported position in relation to the order of words, or analysis of transformation processes is made by structural analysis aimed at identifying the deep structure of a sentence in relation to its surface structure. Structural analysis based on surface structure (2) a oracióny, changing the order of certain words, try to determine its possible transformation kind of deep structure (1): (1) Deep structure, "Peter eats an apple" (2) surface structure, "Eat an apple Pedro"

The implementation process of tokenization , apart from the use of specific instructions that directly transform a string of alphanumeric characters in a string of tokens

, involves the use of other instructions whose function is to "read" individual, one by one, the characters present in the active group channel or input (input stream ) is specified, it will usually be either your computer keyboard, which is the input channel is active by default (like the channel output data stream output , defaults to monitor computer) or a text file located in the path specified.

Thus, there is the Prolog predicate

default name (? AtomOrInt,? String)

. The argument
AtomOrInt

is the variable that represents the string of alphanumeric characters or "atom" that is want

tokenizer, while the String argument is the variable representing the resulting list. The symbol "? " indicates that both arguments are reversible, ie they can function as both input variables and output variables, although one of them must necessarily be instantiated. Their operation is as follows: ? - Name ('Hello World', X).

X = [161, 72, 111, 108, 97, 32, 77, 117, 110 the interpreter provides the list of tokens , incomplete as indicated by the vertical bar followed by an ellipsis " atoms, as referenced in "Analysing and Constructing Atoms " manual SWI-Prolog . tokenizer Another way is to use atoms in Prolog predicate get0 / 1

and

get0 / 2

and some kind of recursive algorithm to be "traveling" all text in the active channel input ( an external file, for example) and enter the

while

resulting tokens, including blanks (

get / 1 and

get / 2

not read), in a cumulative list, while not being achieved stop particular marker, as defined previously (for this end the atom

widely neglected end_of_file

, which corresponds to the end of text). This predicate actually reads the byte alphanumeric character for each individual, associated with its corresponding ASCII code.

The parser, based on the constituents of a sentence (see principles of generative grammar of Noam Chomsky ) and by a finite number of rules, try to determine grammaticality or otherwise of an infinite number of constructions. A parser try to see how far you can submit a group of words to a structure of rules. For example, if we have the sentence: Peter eats an apple first, and through a process of tokenization , generates a list of words it contains. From this initial list of words, to differentiate a sub-list corresponding to the noun phrase (SN) of the sentence, and if it can be concatenated with other sub-lists according to certain rules that are verified as verb phrase ( SV), the prayer is concluded that it is grammatical. What matters in the constituents is the order of the words of prayer. The parser performs the analysis sequentially, word by word, from an initial list, following the example of prayer exposed, would be:

[pedro, eats one apple ]

computing process Parser rules should result in another list, which will be an empty list [] if the initial sentence is grammatical (always based on rules that have defined the analyzer). In conclusion starting the initial list of words, the parser checks if it can be subdivided into two sub-lists, which correspond, respectively, SN

and SV

prayer. More information: The parsing and semantic analysis .

Constipation 10 Weeks Pregnant

).

Law, Technology And Administrative Modernization

Sunday, February 26, 2006

Effect Of Vodka Impotency

Sunday, February 19, 2006