
Posts Tagged ‘information’


UK internet users now spend 64% more time using search engines (31 million hours per month in April 2010) than they did 3 years ago.

(UKOM, May 2010)

We are currently crossing the 1 zettabyte mark for the total amount of the world’s digital information (that’s a million million gigabytes!). Digital information grew by 62% in 2009 to 800,000 petabytes (a petabyte is 1 million gigabytes). This amount could be stored on 75 billion iPads, and is the equivalent of a century’s worth of constant tweeting by every man, woman and child.

(The Guardian, May 2010)

It is estimated that by 2012, 90% of data will be video.

(Cisco as cited by www.readwriteweb.com, July 2010)

It is estimated that globally, over 62 million consumers will have internet access in their cars by 2016. This compares with the 970,000 consumers who had access in 2009.

(iSuppli Corporation as cited by eMarketer, June 2010)



47 per cent of IT professionals watch YouTube videos to research products and potential purchases.

[B2B Marketing Online, March 2010]

1 in 3 organisations are currently incorporating mobile into their overall ad strategies.

[EIAA Ad Barometer: Marketers’ Internet H209, December 2009]

64% of C-level executives conduct six or more searches per day to locate business information.

[Google, Forbes, BtoB, June 2009]

73% of C-suite executives use the Internet daily, according to new research that Google conducted with Forbes among 500 executives at companies with sales of $1 billion or higher.

[Google, Forbes, BtoB, June 2009]



When you talk about the Internet growing to 225 million sites, you’ve got to ask: Who’s parsing all that? How do you make sense of all that stuff?

I mean, who has time to wander all over the Internet?

Tomorrow’s Yahoo! is going to be really tailored. I’m not talking about organization — organizing means that you already know what you want and somebody’s just putting it in shape for you. I’m talking about both smart science and people culling through masses of information on the fly and figuring out what people want to know.

We will be delivering your interests to you. For instance, if you’re a sports fan but have no interest in tennis, we won’t show you tennis. We would know that you do things in a certain sequence, so we’d say, “Here’s your portfolio. Here’s some news you might like. Oh, you went to this movie last week, here’s some other movies you might want to check out.”

I call it the Internet of One. I want it to be mine, and I don’t want to work too hard to get what I need. In a way, I want it to be HAL. I want it to learn about me, to be me, and cull through the massive amount of information that’s out there to find exactly what I want.



It’s massive: the world’s digital content now amounts to roughly a million million gigabytes.

In fact, the planet’s digital content grew by 62% last year, to 800,000 petabytes (a petabyte is a million gigabytes), or 0.8 zettabytes. That is the equivalent of all the information that could be stored on 75bn Apple iPads, which would equal the digital output from a century’s worth of constant tweeting by all of Earth’s inhabitants. Wow!

EMC and IDC first examined the digital universe back in 2007 and estimated that it was equivalent to 161 exabytes, 161,000 petabytes or 161bn gigabytes. At the time they forecast the digital universe would grow to 988 exabytes, just under 1 zettabyte, by this year. The fact that growth has been faster than expected even in that short period of time comes as little surprise to a veteran of the rapidly changing IT industry such as McDonald.
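The unit conversions quoted above are easy to check with a few lines of arithmetic. This is only a small sketch, assuming decimal (SI) units; the constant and function names are made up for illustration.

```python
# Decimal (SI) storage units, expressed in gigabytes.
GB_PER_PETABYTE = 10**6    # 1 petabyte  = a million gigabytes
GB_PER_EXABYTE = 10**9     # 1 exabyte   = a billion gigabytes
GB_PER_ZETTABYTE = 10**12  # 1 zettabyte = a million million gigabytes


def to_gigabytes(amount, gb_per_unit):
    """Convert an amount in some unit to gigabytes."""
    return amount * gb_per_unit


# 2009 figure from the article: 800,000 petabytes.
size_2009_gb = to_gigabytes(800_000, GB_PER_PETABYTE)
# 2007 figure from the article: 161 exabytes.
size_2007_gb = to_gigabytes(161, GB_PER_EXABYTE)

print(size_2009_gb / GB_PER_ZETTABYTE)  # 0.8 (zettabytes)
print(size_2007_gb // GB_PER_PETABYTE)  # 161000 (petabytes)
print(size_2007_gb)                     # 161000000000 (161bn gigabytes)
```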

Check the full article here.



Many people are more comfortable formulating search queries in their own language but have difficulty typing these queries into Google (try typing नमस्ते on a keyboard with English letters). To overcome the difficulty they face in typing in their local language scripts, some people have resorted to copying and pasting from other sites and from online translation tools. But there’s an easier way! Check out this new feature on Google.



(2009)

“IDC has estimated that the overall amount of digital data now equals 487 billion gigabytes… It was reported that this figure will double approximately every 18 months.”

“… during the next couple of years the number of Internet users will increase by 600 million.” (~9% world pop.)

“Researchers calculated that 850 million people will acquire and sell goods and services on the Internet by 2012.”

“Internet commerce will represent a $13 trillion industry, which is twice the value calculated in 2008.”

Source: infoniac.com



Research on Question Answering (QA) systems has been driven by the Text Retrieval Conference (TREC) series since 1999. Almost all QA systems fielded at TREC employ some passage retrieval technique to reduce the size of the relevant document set to a manageable number of passages. Here are several algorithms worth being aware of:

MITRE

The algorithm presented by [Light et al., J. of Natural Lang. Eng., Special Issue on QA, 2001] simply counts the number of terms a passage has in common with the query. Each sentence is treated as a separate passage. This algorithm represents the simplest passage retrieval technique and serves as a good baseline for comparison.
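The term-overlap idea can be sketched in a few lines. This is only an illustration under simple assumptions: the tokenizer and the sentence splitter are naive regular expressions, and the function names are made up for this example rather than taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query, passage):
    """Number of distinct terms the passage shares with the query."""
    return len(set(tokenize(query)) & set(tokenize(passage)))


def rank_sentences(query, document):
    """Treat each sentence as a passage and rank by term overlap."""
    sentences = split_sentences(document)
    return sorted(sentences, key=lambda s: overlap_score(query, s), reverse=True)


doc = "Lucene was created by Doug Cutting. It is written in Java. Paris is a city."
print(rank_sentences("Who created Lucene?", doc)[0])
```

Note that there is no stop-word removal or term weighting here, which is exactly why this kind of scorer is useful only as a baseline.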

Sliding Window scored with Okapi BM25

The Okapi BM25 weighting scheme [Robertson et al., TREC 4] represents the state of the art in document retrieval. It is based on the probabilistic retrieval framework developed by Robertson, Spärck Jones, and others in the 1970s and 1980s. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. A simple passage retrieval algorithm based on a sliding window scored with BM25 could serve as another baseline for comparison.
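A sliding-window baseline of this kind could look roughly like the sketch below. Several things here are assumptions made for illustration: each window is treated as its own "document" when computing term statistics, the idf uses a common non-negative variant of the BM25 formula, and k1 = 1.2 and b = 0.75 are just conventional defaults.

```python
import math
from collections import Counter


def sliding_windows(tokens, size, step):
    """Fixed-size windows starting every `step` tokens.
    Tokens after the last full window are ignored in this sketch."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]


def bm25_scores(query_terms, passages, k1=1.2, b=0.75):
    """Score each passage (a list of tokens) against the query with BM25."""
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Number of passages in which each term occurs at least once.
    doc_freq = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(p) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


tokens = "a b c d e f g h".split()
windows = sliding_windows(tokens, size=4, step=2)
scores = bm25_scores(["g", "h"], windows)
best = max(range(len(windows)), key=lambda i: scores[i])
print(windows[best])  # ['e', 'f', 'g', 'h']
```

Because every window has the same length here, the length normalization has no effect; it starts to matter once passages of different sizes are compared.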

MultiText

The MultiText algorithm [Clarke et al., TREC 9] is a density-based passage retrieval algorithm that favours short passages containing many terms with high idf values. Each passage window in the algorithm starts and ends with a query term, and its score is based on the number of query terms in the passage as well as the window size. Once the highest scoring passage has been identified, the implementation by Tellex et al. creates a new window of the required length around the center point of the original passage.

MultiText incorporates a unique technique for arbitrary passage retrieval. The technique efficiently locates high-scoring passages, where the score of a passage is based on its length and the weights of the terms occurring within it. Passage boundaries are determined by the query, and can start and end at any term position. If a document ranking is required, the score of a document is computed by combining the scores of the passages it contains.

IBM

IBM’s passage retrieval algorithm [Ittycheriah et al., TREC 9] computes a series of distance measures for the passage. The “matching words measure” sums the idf values of words that appear in both the query and the passage. The “thesaurus match measure” sums the idf values of words in the query whose WordNet synonyms appear in the passage. The “mis-match words measure” sums the idf values of words that appear in the query and not in the passage. The “dispersion measure” counts the number of words in the passage between matching query terms, and the “cluster words measure” counts the number of words that occur adjacently in both the question and the passage. These various measures are linearly combined to give the final score for a passage.
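The five measures can be sketched as small functions over token lists. This is a rough illustration, not IBM's implementation: the `synonyms` dictionary is a stand-in for WordNet, the cluster measure is interpreted here as counting adjacent word pairs shared by question and passage, and the signs and unit weights in `passage_score` are hypothetical.

```python
def matching_words(query, passage, idf):
    """Sum of idf values of words that appear in both query and passage."""
    return sum(idf.get(w, 0.0) for w in set(query) & set(passage))


def mismatch_words(query, passage, idf):
    """Sum of idf values of query words missing from the passage."""
    return sum(idf.get(w, 0.0) for w in set(query) - set(passage))


def thesaurus_match(query, passage, idf, synonyms):
    """Sum of idf values of query words with a synonym in the passage."""
    passage_words = set(passage)
    return sum(idf.get(w, 0.0) for w in set(query)
               if synonyms.get(w, set()) & passage_words)


def dispersion(query, passage):
    """Number of passage words lying between matching query terms."""
    query_words = set(query)
    positions = [i for i, w in enumerate(passage) if w in query_words]
    return sum(b - a - 1 for a, b in zip(positions, positions[1:]))


def cluster_words(query, passage):
    """Number of adjacent word pairs found in both query and passage."""
    query_pairs = set(zip(query, query[1:]))
    return sum(1 for pair in zip(passage, passage[1:]) if pair in query_pairs)


def passage_score(query, passage, idf, synonyms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the measures (hypothetical weights and signs)."""
    w_match, w_thes, w_mis, w_disp, w_clus = weights
    return (w_match * matching_words(query, passage, idf)
            + w_thes * thesaurus_match(query, passage, idf, synonyms)
            - w_mis * mismatch_words(query, passage, idf)
            - w_disp * dispersion(query, passage)
            + w_clus * cluster_words(query, passage))


query = ["who", "created", "lucene"]
passage = ["lucene", "was", "created", "by", "doug"]
idf = {"who": 0.5, "created": 2.0, "lucene": 3.0}
print(passage_score(query, passage, idf, synonyms={}))  # 3.5
```

In a real system the weights would be tuned on training questions rather than fixed by hand.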

SiteQ

SiteQ’s passage retrieval algorithm [Lee et al., TREC 10] computes the score of an n-sentence passage by summing the weights of the individual sentences. Sentences are weighted based on query term density. This algorithm weights query terms based on their part of speech.

Alicante

Alicante’s passage retrieval algorithm [Llopis and Vicedo, CLEF 2001] computes the non-length normalized cosine similarity between query terms and the passage. It takes into account the number of appearances of a term in the passage and in the query, along with their idf values.
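A cosine similarity without length normalization is just the dot product of the two term-weight vectors. The sketch below assumes plain tf × idf weights on both sides, which may differ from the exact weighting used by the Alicante system; the function name and the idf table are made up for illustration.

```python
from collections import Counter


def unnormalized_cosine(query_tokens, passage_tokens, idf):
    """Dot product of tf*idf vectors, with no division by vector lengths."""
    query_tf = Counter(query_tokens)
    passage_tf = Counter(passage_tokens)
    return sum(
        (query_tf[t] * idf.get(t, 0.0)) * (passage_tf[t] * idf.get(t, 0.0))
        for t in query_tf
        if t in passage_tf
    )


idf = {"lucene": 2.0, "index": 1.0}
query = ["lucene", "index"]
passage = ["lucene", "builds", "an", "index", "lucene"]
print(unnormalized_cosine(query, passage, idf))  # 9.0
```

Skipping the normalization means longer passages with more repeated query terms score higher, so passages of similar length are the fairest inputs.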

ISI

ISI’s passage retrieval algorithm [Hovy et al., TREC 10] ranks sentences based on their similarity to the question by weighing various features: exact match of proper names, match of query terms, and match of stemmed words. Their passage scoring function includes a term whose sole purpose is to offset scoring performed by the answer extractor.

Voting

[Tellex et al., SIGIR 2003] designed a passage retrieval algorithm by combining the results from their implemented collection of algorithms.

A simple voting scheme was implemented that scored each passage based on its initial rank and on the number of answers the other algorithms returned from the same document.
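A voting scheme along these lines might be sketched as follows. This is not the formula from the paper: the reciprocal-rank term, the `doc_bonus` weight, and the data layout (each algorithm returns a ranked list of `(doc_id, passage)` pairs) are all assumptions chosen for illustration.

```python
def vote(rankings, doc_bonus=0.5):
    """Combine ranked passage lists from several algorithms.

    rankings maps an algorithm name to its ranked list of (doc_id, passage).
    Each passage gets 1/rank, plus doc_bonus for every passage that the
    OTHER algorithms returned from the same document.
    """
    scored = []
    for name, ranking in rankings.items():
        for rank, (doc_id, passage) in enumerate(ranking, start=1):
            support = sum(
                1
                for other_name, other_ranking in rankings.items()
                if other_name != name
                for other_doc, _ in other_ranking
                if other_doc == doc_id
            )
            scored.append((1.0 / rank + doc_bonus * support, doc_id, passage))
    return sorted(scored, key=lambda item: item[0], reverse=True)


rankings = {
    "overlap": [("d1", "p1"), ("d2", "p2")],
    "bm25": [("d2", "p3"), ("d3", "p4")],
}
print(vote(rankings)[0])  # (1.5, 'd2', 'p3')
```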

References: Tellex et al., Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering, SIGIR 2003



Passage Retrieval (PR) is a form of Information Retrieval (IR) in which the system returns short passages in response to a user query. But how should we define the size and style of that short passage? Should it be the paragraph where the answer probably is? Should we retrieve the whole section of the original document? Or should we only care about one sentence, or part of it?

A simple way to define passages is based on the document structure. This entails using author-provided marking (e.g. period, indentation, empty line, etc.) as passage boundaries. Examples of such passages include paragraphs, sections, or sentences.

Passages can also be defined according to the subject or content of the text. The main idea is to divide documents into coherent units, with each unit corresponding to a subtopic. A well-known algorithm for deriving such passages is TextTiling.

The third type of passage is window-based: it consists of a fixed number of words or bytes. Passages in this category may or may not take the logical structure of the document into account. Overlapped windows, as defined by Callan-1994, and non-overlapped windows, as defined by Kaszkiel-2001, do not depend on the text structure, whereas pages in Zobel-1995 and bounded paragraphs in Callan-1994 make use of paragraph boundary information and restrict windows to some minimum length.
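The difference between overlapped and non-overlapped windows is easy to see in code. The sketch below counts words rather than bytes; the function name and the `overlap` parameter are made up for this example and do not reproduce the exact definitions in the cited papers.

```python
def fixed_windows(tokens, size, overlap=0):
    """Split tokens into windows of `size` words.

    overlap=0 gives non-overlapped windows; overlap > 0 makes each window
    share its first `overlap` words with the end of the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


words = "a b c d e f".split()
print(fixed_windows(words, size=3))             # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(fixed_windows(words, size=4, overlap=2))  # [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f']]
```

Neither variant looks at sentence or paragraph boundaries, which is what distinguishes them from structure-aware passages such as bounded paragraphs.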

A more dynamic alternative to windows is arbitrary passages, proposed in Kaszkiel-1997 and Kaszkiel-2001, where the passage can start at any word in the document. Two subclasses are further defined. Fixed-length arbitrary passages resemble overlapped windows but with an arbitrary starting point. Variable-length arbitrary passages can be of any length. Unlike structural, topical, and window passages, which are typically predefined (defined before or at indexing time), arbitrary passages are defined at query time.

References: Passage Retrieval Based On Language Models



What is Lucene?

Lucene is a high-performance, scalable Information Retrieval (IR) library, created originally by Doug Cutting. It provides indexing and searching features to applications. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene is a mature, open-source project implemented in Java, available for free download. Moreover, it’s a member of the popular Apache Jakarta family of projects, licensed under the Apache Software License. One of the key factors behind Lucene’s popularity and success is its simplicity.

We can see that a lot of attention is paid to the index and search API. In fact, we don’t need in-depth knowledge of how Lucene’s information indexing and retrieval work in order to start using it.

The core of its logical architecture is the concept of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others, can all be indexed so long as their textual information can be extracted.

Now, some features of Lucene:

  • ranked search, that is, the best-fitting results are returned first;
  • multiple query types: phrase queries, wildcard queries, proximity queries, range queries and more;
  • field search (e.g., title, author, contents);
  • date-range searching;
  • sorting by field;
  • multiple-index search with merged results;
  • simultaneous update and searching;
  • index-compatible implementations available in multiple programming languages.

However, we have to be aware that Lucene by itself is not a ready-to-use application like a file-search program, a web crawler, or a web site search engine. Lucene is a software toolkit, not a full-featured search application.

A list of Lucene-Based Projects:

But how can we exploit Lucene’s capabilities to build an efficient Question Answering system?

In fact, this tool can be very useful for the passage/document retrieval task. Lucene can support the retrieval of relevant snippets of text for every question.

One simple approach is to create two indexes, one at paragraph level and one at document level. The paragraph index is more precise in terms of relevant text, and is preferred for snippet extraction. However, if the answer is not found in the paragraph index, the document index is used instead. Using the Lucene query engine, we extract a ranked list of snippets for every question.
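The fallback logic between the two indexes can be sketched without Lucene itself. In the toy version below, `search` is a hypothetical stand-in for a Lucene query: each index is just a list of strings, a hit must contain every query term, and hits are ranked by how often the query terms occur.

```python
def search(index, query_terms):
    """Toy stand-in for a Lucene query over a list of texts.
    A text matches only if it contains every query term."""
    hits = []
    for text in index:
        tokens = text.lower().split()
        if set(query_terms) <= set(tokens):
            score = sum(tokens.count(term) for term in set(query_terms))
            hits.append((score, text))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


def retrieve_snippets(query_terms, paragraph_index, document_index):
    """Prefer paragraph-level hits; fall back to whole documents."""
    hits = search(paragraph_index, query_terms)
    if hits:
        return "paragraph", hits
    return "document", search(document_index, query_terms)


paragraphs = ["lucene is a search library", "doug cutting created it"]
documents = ["lucene is a search library doug cutting created it"]

print(retrieve_snippets(["doug", "cutting"], paragraphs, documents)[0])    # paragraph
print(retrieve_snippets(["lucene", "cutting"], paragraphs, documents)[0])  # document
```

The second query shows why the document index is still useful: the terms are spread over two paragraphs, so only the whole document matches.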

The algorithm for answer extraction can be based on the Lucene scores. It is very likely that the paragraph with the highest Lucene score contains the correct answer (although this is not guaranteed).
Of course, there are refinement filters, which increase the Lucene score in the following cases:

  • the paragraph has the focus;
  • the paragraph has some of the named entities (directly proportional to the number of these named entities);
  • if the question answer type is Person, or Organization, etc., we try to identify these types of named entities in the extracted paragraphs (and increase the Lucene score according to the number of them);
  • if the question type is Definition, then we prefer answers with definition form (could be identified by this grammar).

After applying all criteria, each paragraph is awarded a score and the paragraph with the highest score is chosen.
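The refinement step can be sketched as a re-scoring function over already retrieved paragraphs. Everything below is illustrative: the `Paragraph` class, the boost values, and the assumption that named entities were annotated beforehand are hypothetical, the named entities in the second filter are taken to be those of the question, and the definition-form filter is left out because it depends on a grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    lucene_score: float
    # Pre-annotated named entities as (surface text, type) pairs.
    entities: list = field(default_factory=list)


def refined_score(paragraph, focus, question_entities, answer_type,
                  focus_boost=0.5, entity_boost=0.25, type_boost=0.25):
    """Start from the Lucene score and add a boost for each filter that fires."""
    text = paragraph.text.lower()
    score = paragraph.lucene_score
    if focus.lower() in text:
        score += focus_boost
    # Proportional to the number of question entities found in the paragraph.
    score += entity_boost * sum(1 for e in question_entities if e.lower() in text)
    # Proportional to the number of entities of the expected answer type.
    score += type_boost * sum(1 for _, kind in paragraph.entities if kind == answer_type)
    return score


def best_paragraph(paragraphs, focus, question_entities, answer_type):
    return max(paragraphs,
               key=lambda p: refined_score(p, focus, question_entities, answer_type))


candidates = [
    Paragraph("Lucene is a library.", 1.2),
    Paragraph("Doug Cutting created Lucene.", 1.0, [("Doug Cutting", "Person")]),
]
print(best_paragraph(candidates, "created", ["Lucene"], "Person").text)
```

Here the second paragraph wins despite its lower Lucene score, because it contains the focus and an entity of the expected type.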

References: Lucene in Action, Lucene website, Question Answering on English and Romanian Languages



Why choose Passage Retrieval (PR) over Information Retrieval (IR)? Which one is the best painkiller?

Information is essential. However, if we can’t find it, it effectively doesn’t exist. Even if we have all the documents properly stored and indexed, our effort is wasted unless we have reliable and fast mechanisms to look up these documents in order to find the required information.

But first, let’s start with some basics:

Information Retrieval (IR) aims fundamentally to satisfy information needs. IR is the process of automatically locating and retrieving fragments of information (documents, files, passages, metadata) relevant to the user’s need from a body of information. For IR, it does not really matter whether you search for documents, passages, or images. What is important is that, unlike in data retrieval, the information need is rather vague and the retrieval system has to guess what the user really needs.

The user’s information need is presented to the IR system as a query, which usually consists of a string of words. Then, the IR system uses a matching mechanism to decide how closely a document is related to the query [see Xiaoyong Liu]. Hence, we can say that the goal of an IR system is to retrieve all the documents which will, hopefully, satisfy the user’s query while retrieving as few non-relevant documents as possible [see Ribeiro-Neto].

An example of an Information Retrieval library is Lucene, which allows you to add indexing and searching capabilities to your applications.

Following work on IR systems, research has also focused on modeling portions of a document (“passages”). With the fast growth of information resources and the huge number of requests submitted to IR systems, one of the most challenging and important tasks is to retrieve the excerpts most closely related to the user’s questions. Thus, we can eliminate the burden of reading lots of irrelevant documents only to reach the desired answer.

In fact, Passage Retrieval (PR) is concerned with finding passage blocks that could provide better evidence to users than the full document text. This can be especially useful when the documents are long or span different subject areas [Callan, 1994]. This approach is very interesting, as evidence shows that users usually prefer passage-sized answers over whole documents, because passages provide context [Lin, 2003].

Currently, data is digitally distributed and represented in different formats. Therefore, searching within large collections of data, and hence acquiring information, is becoming a big headache. As a result, in order to efficiently process and extract information, reliable, distributed, efficient and high-performance information systems are required.

Which kind of information retrieval system suits you better?

Thanks, Itman, for the help with this post!
