Introduction
Information retrieval (IR) is a field of computing and information science that studies the process of obtaining information resources relevant to an information need from a collection of information resources [1]. It is the science of searching for information in documents, searching for documents themselves, and also searching for the metadata that describes documents. IR is a core technology behind search engines, digital libraries, and other systems that provide access to large document collections.
In information retrieval, a query does not uniquely identify a single object in the collection [1]. Instead, several objects may match the query, perhaps with different degrees of relevancy. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell [1].
Overview
An information retrieval process begins when a user enters a query into the system [1]. The query is analyzed and processed to create a set of query terms that represent the user’s information need. The query terms are then used to retrieve a set of documents from a collection. The documents are then ranked according to their relevance to the query. Finally, the results are presented to the user.
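The pipeline above can be sketched in a few lines of Python. The toy collection, the whitespace tokenizer, and the term-overlap scoring function are all illustrative assumptions, not the implementation of any particular system; real engines use far richer analysis and ranking.

```python
# Minimal retrieval pipeline sketch: analyze the query into terms, fetch
# candidate documents from an inverted index, rank them by how many query
# terms they contain, and return them in relevance order.
from collections import defaultdict

docs = {
    1: "information retrieval finds relevant documents",
    2: "digital libraries rely on search engines",
    3: "search engines rank documents by relevance",
}

def tokenize(text):
    return text.lower().split()

# Build an inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    terms = tokenize(query)  # query analysis step
    candidates = set().union(*(index.get(t, set()) for t in terms))
    # Rank candidates by the number of distinct query terms they contain.
    return sorted(
        candidates,
        key=lambda d: sum(t in tokenize(docs[d]) for t in terms),
        reverse=True,
    )

print(search("relevant documents"))
```

Here document 1 matches both query terms and document 3 only one, so document 1 is ranked first.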
Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy, Desk Set [1]. These systems used manually created indexing vocabularies to tag documents with terms that allowed for keyword-based search.
Models
In general, retrieval models consider a collection of documents to be searched and a search query [1]. Retrieval then proceeds by computing the similarity between the query and the documents, using a model of the document collection and a model of the query.
The models used in information retrieval are generally categorized along two dimensions: their mathematical basis and the properties of the model.
The first dimension: mathematical basis
The mathematical basis of information retrieval models is generally either vector-based or probabilistic. Vector-based models use the vector space model to represent both queries and documents as vectors of identifiers such as words or terms. Probabilistic models use probability theory to represent the relevance of a document to a query.
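A vector-based model can be illustrated with a short sketch: queries and documents become term-count vectors over a shared vocabulary, and relevance is estimated with cosine similarity. The vocabulary and texts below are toy assumptions for illustration.

```python
# Vector space model sketch: represent texts as term-count vectors and
# compare them with cosine similarity (dot product over the product of norms).
import math
from collections import Counter

def vectorize(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["information", "retrieval", "search", "documents"]
doc = vectorize("information retrieval retrieves documents", vocab)
query = vectorize("information retrieval", vocab)
print(round(cosine(query, doc), 3))
```

A similarity of 1.0 would mean the query and document vectors point in the same direction; unrelated texts score near 0. Probabilistic models replace this geometric score with an estimate of the probability that a document is relevant to the query.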
The second dimension: properties of the model
Models without term-interdependencies treat different terms/words as independent [1]. This approach has been used in vector space models and probabilistic models.
Models with term-interdependencies take into account the relationships between terms in the document collection and in the query. Examples of such models include language models and latent semantic analysis.
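One simple way term dependencies can be captured is with a bigram language model, which scores a query by the probability of each term given its predecessor, so term order and adjacency matter, unlike in term-independent models. The training text and the add-one smoothing below are illustrative assumptions, not a standard configuration.

```python
# Bigram language model sketch: estimate P(term | previous term) from a
# training text with add-one smoothing, then score a query as the product
# of its bigram probabilities.
from collections import Counter

def bigram_model(text):
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(query):
        q = query.lower().split()
        p = 1.0
        for prev, cur in zip(q, q[1:]):
            # Add-one smoothed conditional probability P(cur | prev).
            p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        return p

    return prob

score = bigram_model("information retrieval systems retrieve information")
# A word order seen in training scores higher than the reversed order.
print(score("information retrieval") > score("retrieval information"))
```

Because the pair "information retrieval" occurs in the training text and "retrieval information" does not, the model assigns the first ordering a higher probability, something a term-independent model cannot do.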
Applications
Automated information retrieval systems are used to reduce what has been called information overload [1]. This is done by providing users with the ability to quickly and easily locate relevant information resources.
In the 1990s, the Text REtrieval Conference (TREC) was initiated by the National Institute of Standards and Technology (NIST). Its aim was to support the information retrieval community by supplying the infrastructure needed for evaluation of text retrieval methodologies on a very large text collection [1].
Conclusion
Information retrieval technology is an invaluable tool for text-based search. It enables users to quickly and easily locate relevant information resources, helping to reduce information overload. Vector-based and probabilistic models are used to represent queries and documents, and models both with and without term-interdependencies are available. Automated information retrieval systems facilitate the process, and the Text REtrieval Conference (TREC) has been instrumental in advancing the field.