Sunday, February 19, 2006


Do documentary thesauri have a place?

Although it is not usually my habit to publish full texts on this site (the section Seen and Read remains open for that purpose), I will do so on this occasion, given the undoubted interest of the text I reproduce below, the ideas it raises, the practical experience it reports, and the debate that all this should trigger. It is a message sent by José Ramón Pérez Agüera (Department of Information Systems and Programming, School of Computing, Complutense University of Madrid) to the IweTel mailing list on 13/02/2006 (to access the original text in the list archives, you need to be subscribed to that discussion forum for professionals in libraries and documentation centres).

[Beginning of Pérez Agüera's text]

"Although I have to publish notice in Thinkepi , took a few months (in fact any other year) turning to this issue and I would like to have the opinion of the community of filmmakers, beyond my own observations, which this mail is not intended as a note, but give rise to a debate in which the documentary is not n having a say, to develop within the field of computing. Work

automatic generation of thesauri, which has led me to conduct experiments of automatic indexing and query expansion from hand-made thesaurus. Specifically I used three thesauri: ISOC-Economy, EUROVOC and spin, all known to spare. The collection on which I performed the tests has been the sub-set of Economy and policy news generated by the EFE Agency in 1994 (efe94 is a typical collection experiments information retrieval which consists of a total of 215,738 documents. I've used 23,390 in my experiments to focus on the area of \u200b\u200bpolitics and economics, which are largely covered by the thesaurus above).
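For illustration, the kind of weighted, relationship-aware query expansion described here could look like the minimal sketch below; the toy thesaurus, the relationship types and the weights are assumptions made up for the example, not the actual experimental setup.

# Minimal sketch of thesaurus-based query expansion (toy data).
# Relationship types: UF (used for / synonym), BT (broader term),
# RT (related term); weights down-rank expanded terms so they do
# not swamp the original query.
THESAURUS = {
    "inflation": {"UF": ["price increases"],
                  "BT": ["economic indicators"],
                  "RT": ["monetary policy"]},
}

EXPANSION_WEIGHTS = {"UF": 0.8, "BT": 0.4, "RT": 0.3}

def expand_query(terms):
    """Return (term, weight) pairs: the original terms plus expansions."""
    weighted = [(t, 1.0) for t in terms]
    for t in terms:
        for rel, related in THESAURUS.get(t, {}).items():
            weighted.extend((r, EXPANSION_WEIGHTS[rel]) for r in related)
    return weighted

print(expand_query(["inflation", "spain"]))
# [('inflation', 1.0), ('spain', 1.0), ('price increases', 0.8),
#  ('economic indicators', 0.4), ('monetary policy', 0.3)]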

I also have a set of 22 queries, with their respective relevance judgments, for the domain in question, to use in the experiments. These queries come from CLEF [Cross-Language Evaluation Forum], a conference held every year that focuses on mono- and multilingual information retrieval. As a search engine I used Lucene, adapted to stem the Spanish index terms, which is based on Salton's traditional vector space model (a classic, in short).

The purpose of my first experiments was to check how documentary thesauri, like those used every day in documentation centres all over the world, affect automated information retrieval. To my surprise I found that, whether together or separately, using all or only some of the relationship types present in the thesaurus, with direct or weighted global expansion (how I weighted the thesaurus is another story), in every case the thesauri mentioned improved retrieval on this collection by virtually nothing: neither precision, nor recall, nor a heap of other measures implemented following the model proposed by TREC (that conference distributes a fairly complete piece of software called trec_eval for evaluating retrieval). What is more, in some of the experiments, depending on query length, using hand-made documentary thesauri produced worse results.
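To make the evaluation side concrete, this is roughly the kind of computation trec_eval performs; the judgments and the ranking below are made up for the example, and only precision, recall and average precision are shown.

# Minimal sketch of ranked-retrieval evaluation (toy data).
relevant = {"d3", "d7", "d9"}             # relevance judgments for one query
ranking = ["d7", "d1", "d3", "d4", "d9"]  # system output, best first

def precision_recall(ranking, relevant, k):
    hits = sum(1 for d in ranking[:k] if d in relevant)
    return hits / k, hits / len(relevant)

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i   # precision at each relevant document
    return total / len(relevant)

print(precision_recall(ranking, relevant, k=5))  # (0.6, 1.0)
print(average_precision(ranking, relevant))      # ~0.76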

The next step in my research has been to work with thesauri generated automatically, using three basic methodologies:

  • Linguistic processing of the collection (POS tagging, parsing, analysis of dependency trees between terms).
  • Co-occurrence analysis for generating relationships between terms (Latent Semantic Analysis; Qiu and Frei, and its Spanish adaptation implemented by Zazo, Berrocal and colleagues in Salamanca; Jing and Croft; etc.).
  • Use of other linguistic resources (namely EuroWordNet, Spanish version, and dictionaries).

The thesauri generated automatically by these methods have indeed produced significant improvements in retrieval. I do not want to make this message any heavier with the technical details and figures, but I can pass them on to anyone who wants them.
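As a minimal illustration of the second methodology, terms that repeatedly occur in the same documents can be proposed as related terms; the documents and the threshold below are toy values, not the efe94 setup.

# Minimal sketch of co-occurrence-based thesaurus generation (toy data).
from collections import Counter
from itertools import combinations

docs = [
    {"inflation", "prices", "economy"},
    {"inflation", "prices", "policy"},
    {"economy", "policy", "elections"},
]

pair_counts = Counter()
for doc in docs:
    pair_counts.update(combinations(sorted(doc), 2))

# Propose a relationship for every pair co-occurring in >= 2 documents.
related = [pair for pair, n in pair_counts.items() if n >= 2]
print(related)  # [('inflation', 'prices')]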

Anyway, I mentioned these results to Antonio García Jiménez, who has been working on documentary thesauri for a while, and he offered me some very valuable ideas that may partly explain them. They can be summarised (Antonio, if you are out there, correct me if I am wrong) as follows: the thesauri were not well suited to the collection to which I applied them, so an improvement based on the use of documentary thesauri would require a thesaurus hand-made for the very collection I am working with.

That comment left me ruminating, and I went on to modify the terminology of the thesauri to fit the collection, bringing both sets of data as close together as possible, in order to check whether retrieval capability improved at all; unfortunately, the results remained fairly discouraging. After all these tests I was left with the following question: do hand-made thesauri, built with the traditional methodology and standards, have a place in the automated retrieval scenario prevailing today, whether on or off the Internet? My answer at the moment, pending your comments, is that they do not, and that it is urgently necessary to consider several changes to the current thesaurus-development methodology, for which the ISO standards and the books by Aitchison and Gilchrist and by Gil White are the main references.

The main problems with using documentary thesauri in automated information retrieval are:

  • Data dispersion: words constantly appear in the collection that the thesaurus cannot normalise (a problem that regular hand-crafted updates do not solve, given the rate at which the collection grows); see the coverage sketch after this list.
  • Semantic ambiguity, even in domain-specific thesauri such as these.
  • Inconsistencies in the structure of the thesaurus.
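The dispersion problem is easy to quantify: measure how much of the collection's vocabulary the thesaurus can actually normalise. This is a minimal sketch with a toy descriptor list and a toy vocabulary, not the real data.

# Minimal sketch of a thesaurus coverage check (toy data).
descriptors = {"inflation", "unemployment", "elections"}
collection_vocabulary = {"inflation", "stagflation", "elections",
                         "primaries", "eurozone"}

covered = collection_vocabulary & descriptors
print(f"coverage: {len(covered)}/{len(collection_vocabulary)} terms")
# coverage: 2/5 terms; the rest cannot be normalised by the thesaurus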

  • All these problems are normal considering that thesauri are hand-made and managed without any more or less automatic consistency-control mechanism, beyond MultiTes and other such programs (in fact, merely importing the thesauri into SQL made it possible to detect these structural inconsistencies; a sketch of that kind of check appears after the quoted text).
  • Added to this is the fact that thesauri, as we build them today and contrary to what many think, are of no use for the transition to ontologies, owing to basic design requirements (mainly of the object-oriented paradigm) that documentary thesauri do not come anywhere near meeting, which causes serious consistency problems when one tries to convert a documentary thesaurus into an ontology. Given these facts, and since I cannot get any further with this on my own, I would like to know your opinion on the issue (many of you earn your bread with it, I think). To be concrete, the initial questions, without excluding any others you may wish to raise, are:
  • What is the role of documentary thesauri in automated information retrieval in documentation centres?
  • What is the role of documentary thesauri in information retrieval on the Internet?

Is it necessary to change the currently prevailing thesaurus-development paradigm? In what sense?

I am not an expert in thesauri, but I have my opinions, which I will post here if the debate takes off; my interest in this is the same as yours.

I hope I have been clear. If you have any questions about what I have written, or if something does not make sense, do not hesitate to ask. With luck, and together, I hope we can get a handle on a problem that is so purely documentary."

[End of Pérez Agüera's text]
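The SQL-based consistency check Pérez Agüera mentions is easy to picture. A minimal sketch, assuming a toy broader-term table rather than the actual thesauri: load the BT relations into SQL and look for pairs of terms that are each other's broader term, an obvious structural inconsistency.

# Minimal sketch of a structural consistency check via SQL (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bt (narrow TEXT, broad TEXT)")
conn.executemany("INSERT INTO bt VALUES (?, ?)", [
    ("microeconomics", "economics"),
    ("economics", "social sciences"),
    ("prices", "inflation"),
    ("inflation", "prices"),   # inconsistent: mutual broader terms
])

# A term cannot be both broader and narrower than the same other term.
rows = conn.execute("""
    SELECT a.narrow, a.broad FROM bt AS a
    JOIN bt AS b ON a.narrow = b.broad AND a.broad = b.narrow
    WHERE a.narrow < a.broad
""").fetchall()
print(rows)  # [('inflation', 'prices')]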

As you know, any comment, correction, contribution, etc. on the issues raised in the preceding text can be sent to the IweTel list, enriching a debate that everything Pérez Agüera raises about information retrieval, automated indexing, and the role that thesauri, as a standardised description tool, play in all this certainly deserves...

By Pérez Agüera, and on the issues addressed in his message to the IweTel list, see also: "Thesaurus automation and its use in the Semantic Web" (SWAD-Europe, workshop Introduction to the Semantic Web, June 13, 2004). See also, in general, SWAD-Europe Reports and Presentations (SWAD-Europe: SWAD stands for Semantic Web Activity: Advanced Development).

I also think it worth reviewing, in Anales de Documentación (no. 7, 2004, pp. 79-95), the article by Antonio García Jiménez, "Instruments of knowledge representation: ontologies versus thesauri" (in PDF).
On another note, I take this opportunity to link below a series of links, references and texts that have caught my attention in recent months (quoted passages are taken verbatim from the referenced sites):
Articles, introductions, blog posts

Why Use Prolog? (Jocelyn Paine). A document setting out ten (good, in the author's opinion) reasons to use the Prolog programming language.

"I'm sorry Dave, I'm Afraid I Can not Do That": Linguistics, Statistics, and Natural Language Processing circa 2001 (in PDF , Lillian Lee, Cornell University). Programming using Visual Prolog 6.0 (R. Fuentes Covarrubias, University of Colima, School of Mechanical and Electrical Engineering, Mexico).


The Legacy of the Reverend Bayes (in Devlin's Angle, February 2000).


Two very good basic introductions to the Prolog language: First Steps in Prolog: an easy introduction to this AI language / Free Prologs: a guide to freely available Prolog systems (H. Collingbourne; in Bitwise Magazine).





Modeling Decisions for Artificial Intelligence (Tarragona, April 3-5, 2006): "Decision making processes, and information fusion tools at large, are currently embedded in most Artificial Intelligence applications. As a consequence, systems based on decision making and fusion techniques are becoming pervasive. They are currently in use in all kind of environments, from entertainment gadgets to safety-critical or risk management software."



  • 22nd International Conference on Logic Programming (ICLP 2006, August 17-20, 2006).
