Here are the up-to-date slides for the ICTIR’13 tutorial on quantum mechanics and information retrieval.
Internship: Learning complex textual representations
Representing text and knowledge
Currently, the representation of knowledge is mainly symbolic, i.e. logical, and is still little discussed in the context of numerical and statistical models. However, being able to integrate this knowledge into the representation of objects, so that it can be processed by statistical models, is an important issue. The classic example concerns the representation of textual documents, which is rarely more complex than a simple bag of words. In the case of text, the logical representation is the so-called semantic Web. There are currently no approaches that try to represent “raw” text in the context of given knowledge. Several solutions can be considered:
 The first is to represent textual documents directly by creating new nodes and links in the semantic network;

 The second would be to build a numerical representation of documents with respect to given knowledge.

In this internship, we will follow the latter approach, since the former might not be able to tackle noisy texts as found on the Web. More precisely, we will develop a numerical model of knowledge from current knowledge resources that can be updated in order to represent the knowledge after reading a document, hence representing a document in the context of given prior knowledge. The methods will be based on regularized neural networks, and implemented on real data corpora with two target applications:

Analysis of data composed of very short texts (e.g. Twitter)

Automatic response to MCQ (Multiple Choice Questions)
Prerequisites
 Knowledge of statistical learning (neural networks if possible)
 Good programming skills (C++ in particular)
Contacts
 benjamin.piwowarski@lip6.fr
 ludovic.denoyer@lip6.fr
KQP – first release!
The first public release of the Kernel Quantum Probability library (KQP) is now available for C++, Java and Python.
Slides from the ECIR’12 “Quantum Information Access and Retrieval” tutorial
Follow this link to download the handout of the slides of the tutorial.
The Kernel Quantum Probability Library
The Kernel Quantum Probability library (KQP) provides a generic API to construct “quantum” densities and events, compute and update quantum probabilities (and associated quantities).
You can find more information in
 B. Piwowarski, “The Kernel Quantum Probabilities (KQP) Library,” arXiv 2012.
[Bibtex] @techreport{Piwowarski2012KQP, arxivid = {1203.6005}, author = {Piwowarski, Benjamin}, dateadded = {2012-06-25 10:19:03 +0000}, datemodified = {2012-06-25 10:40:40 +0000}, eprint = {1203.6005}, institution = {arXiv}, link = {http://arxiv.org/abs/1203.6005}, month = {March}, title = {The Kernel Quantum Probabilities (KQP) Library}, year = {2012} }
and on the KQP website.
Internship (Master): Indices for quantum information access (INDEQ)
Context
Quantum probabilities are a formalism based on probability, logic and geometry – three ingredients used by most models of Information Access (IA). This led van Rijsbergen to suggest their use in IA [Rij04] because, unlike “standard” probabilities based on set theory, the “quantum” ones are based on vector spaces (Hilbert spaces) and are therefore more expressive.
This flexibility has begun to be exploited in order to tackle old and new challenges in IA, such as ad hoc information retrieval (IR) [PF+10], contextual IR [Mel08], the problem of diversifying search results [ZA10], and summarization [PA+12]. In most of these studies, information objects are represented in a multidimensional space, i.e. as a set of vectors (probability density) or as a subspace (event).
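To make the density and event vocabulary concrete, here is a minimal sketch in plain Python. It assumes real-valued vectors, a density built as a weighted mixture of unit vectors, and an event given by an orthonormal basis of its subspace; the function names and the toy three-dimensional space are illustrative assumptions and are not taken from any of the cited works or from the KQP library.

```python
import math

def normalize(v):
    """Scale a real vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def density(vectors, weights):
    """Build rho = sum_i w_i |v_i><v_i| (weights are assumed to sum to 1)."""
    dim = len(vectors[0])
    rho = [[0.0] * dim for _ in range(dim)]
    for v, w in zip(vectors, weights):
        u = normalize(v)
        for i in range(dim):
            for j in range(dim):
                rho[i][j] += w * u[i] * u[j]
    return rho

def projector(basis):
    """Projector onto the span of an orthonormal basis: P = sum_k |b_k><b_k|."""
    dim = len(basis[0])
    proj = [[0.0] * dim for _ in range(dim)]
    for b in basis:
        for i in range(dim):
            for j in range(dim):
                proj[i][j] += b[i] * b[j]
    return proj

def probability(rho, proj):
    """Quantum probability of the event: trace(rho P)."""
    dim = len(rho)
    return sum(rho[i][j] * proj[j][i] for i in range(dim) for j in range(dim))

# Hypothetical toy space: the density mixes two vectors with equal weight.
rho = density([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]], [0.5, 0.5])
# Event: the one-dimensional subspace spanned by the first axis.
event = projector([[1.0, 0.0, 0.0]])
p = probability(rho, event)
```

In this toy case the first vector lies entirely inside the event and the second one only partly, so the probability falls strictly between 0.5 and 1.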
A crucial point for the success of quantum IA is to quickly find information objects. However, there is currently no effective technique for finding objects represented in Hilbert spaces directly, without first substantially reducing the number of objects considered with standard techniques. This limits the practical importance of such models and introduces a bias in the obtained results. It is necessary to design data structures and access methods adapted to the problem of quantum IA.
Objectives
This is a six-month internship (starting between March and October 2012) that will take place in the LIP6/CNRS laboratory in Paris, France. The monthly allowance is 420 €.
The theoretical objective of the INDEQ project is to develop index structures for fast object retrieval when objects are represented in Hilbert spaces. To achieve this goal, tree-based indices will be developed that estimate, at each node, the probability distribution of the objects in each branch (given a quantum probability distribution).
The practical objective of this project is to design a solution to recommend information objects (e.g., movies) to a user without exhaustively exploring all the candidate objects. To this end, the project aims at defining a new access method for selecting the object or objects that have the highest probability of matching the user requesting a recommendation.
Contact Information
Please email us if you are interested in this internship.
Benjamin Piwowarski – benjamin@bpiwowar.net
Hubert Naacke – hubert.naacke@lip6.fr
Bibliography
[Mel08] Melucci, M. (2008). A basis for information retrieval in context. ACM Transactions On Information Systems, 26(3), 1–41.
[PF+10] Piwowarski, B., Frommholz, I., Lalmas, M., & van Rijsbergen, K. (2010). What can Quantum Theory bring to IR? In J. Huang, N. Koudas, G. Jones, X. Wu, K. Collins-Thompson, & A. An (Eds.), CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management.
[PA+12] Piwowarski, B., Amini, M.-R., & Lalmas, M. (2012). On using a Quantum Physics formalism for Multidocument Summarisation. Journal of the American Society for Information Science and Technology (accepted paper to be published).
[Rij04] van Rijsbergen, K. (2004). The Geometry of Information Retrieval. Cambridge University Press.
[ZA10] Zuccon, G., & Azzopardi, L. (2010). Using the Quantum Probability Ranking Principle to Rank Interdependent Documents. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, et al. (Eds.), Advances in Information Retrieval (Vol. 5993, pp. 357–369). Springer.
Summarisation and the Quantum Information Access framework
Our paper (with M.-R. Amini and M. Lalmas) has just been accepted for publication in JASIST. It is the second time that the quantum formalism has allowed an existing model to be fully reinterpreted^{1}, but it is the first time that it has proved useful for spotting theoretical shortcomings (in LSA-based extractive summarisation) and has given hints on how to propose a new and better-performing criterion for summarisation. Experiments on DUC datasets show that the QIA framework outperforms LSA (and other state-of-the-art algorithms). Please ask if you want a preprint.
 B. Piwowarski, M.-R. Amini, and M. Lalmas, “On using a quantum physics formalism for multidocument summarization,” Journal of the American Society for Information Science and Technology, vol. 63, iss. 5, pp. 865-888, 2012.
[Bibtex] @article{Piwowarski2012Summarization, author = {Piwowarski, B. and Amini, M.-R. and Lalmas, M.}, dateadded = {2012-06-25 10:21:46 +0000}, datemodified = {2012-06-25 10:40:38 +0000}, doi = {10.1002/asi.21713}, issn = {1532-2890}, journal = {Journal of the American Society for Information Science and Technology}, month = {May}, number = {5}, pages = {865--888}, title = {On using a quantum physics formalism for multidocument summarization}, volume = {63}, year = {2012} }
It will hopefully be followed by a paper about the quantum information access (QIA) framework in all its generality (along with the code for computing “kernel quantum probabilities”).
 ^{1} The first reinterpretation was done by K. van Rijsbergen, for vector space models.
Two new quantum IR publications
Two more works based on the quantum formalism have been accepted.
The first one, a poster at ECIR 2011, deals with how to cope with the representation of the information need when the user reformulates the query, and proposes some heuristics (no evaluation yet):
 I. Frommholz, B. Piwowarski, M. Lalmas, and K. van Rijsbergen, “Processing Queries in Session in a Quantum-inspired IR Framework,” in Proceedings of ECIR 2011, 2011.
[Bibtex] @inproceedings{Frommholz2011Processing, author = {Frommholz, Ingo and Piwowarski, Benjamin and Lalmas, Mounia and van Rijsbergen, Keith}, booktitle = {Proceedings of {ECIR} 2011}, dateadded = {2011-01-03 16:56:38 +0000}, datemodified = {2012-06-25 10:40:41 +0000}, month = {March}, note = {Poster}, title = {Processing Queries in Session in a Quantum-inspired IR Framework}, year = {2011} }
The second one, a paper in the Italian IR 2011 workshop, presents an algebra for information needs that can be used to build structured query representations. The algebra is based on quantum operators that operate on so-called “information need aspects”. In the paper, we propose to use computational linguistics tools (segmentation, dependency parsing) in order to automatically build an algebraic representation of the topic at hand.
 A. Caputo, B. Piwowarski, and M. Lalmas, “A Query Algebra for Quantum Information Retrieval,” in Proceedings of the 2nd Italian Information Retrieval Workshop, 2011.
[Bibtex] @inproceedings{Caputo2011QueryAlgebra, author = {Caputo, Annalina and Piwowarski, Benjamin and Lalmas, Mounia}, booktitle = {Proceedings of the 2nd Italian Information Retrieval Workshop}, dateadded = {2011-01-03 16:59:26 +0000}, datemodified = {2012-06-25 10:40:38 +0000}, month = {January}, title = {A Query Algebra for Quantum Information Retrieval}, year = {2011} }
Some notes on CIKM 2010
Just came back from CIKM 2010! Here is the list of the papers (or rather the presentations) I liked and a brief description of each of them.
 S. H. Yang and H. Zha, “Language Pyramid and Multi-Scale Text Analysis,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Yang2010Language, Author = {Shuang Hong Yang and Hongyuan Zha}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:45:21 +0000}, DateModified = {2010-11-02 11:16:44 +0000}, Doi = {10.1145/1871437.1871520}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Language Pyramid and Multi-Scale Text Analysis}, Year = {2010}}
Extends the BOW representation by using multiresolution techniques from the image processing community. Basically, each text is first represented as a binary 2D matrix (term / position). Using spatial and semantic smoothing, this matrix can be blurred at different levels (until it becomes a matrix with just one value) which gives a multiresolution text representation. Some experiments are performed for classification and retrieval.
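As a rough illustration of the multiresolution idea, the sketch below builds the binary term/position matrix and repeatedly averages adjacent positions. This is a deliberate simplification and not the authors' method: it only blurs along the position axis (no semantic smoothing across terms), so the coarsest level keeps one value per term rather than a single value overall. All function names are hypothetical.

```python
def term_position_matrix(tokens, vocabulary):
    """Binary matrix: one row per term, one column per position in the text."""
    return [[1.0 if tok == term else 0.0 for tok in tokens] for term in vocabulary]

def halve_positions(matrix):
    """Average adjacent position columns, roughly halving the resolution."""
    coarse = []
    for row in matrix:
        pairs = [row[i:i + 2] for i in range(0, len(row), 2)]
        coarse.append([sum(pair) / len(pair) for pair in pairs])
    return coarse

def pyramid(matrix):
    """Levels from finest (one column per position) to coarsest (one column)."""
    levels = [matrix]
    while len(levels[-1][0]) > 1:
        levels.append(halve_positions(levels[-1]))
    return levels

tokens = "the cat sat on the mat".split()
vocabulary = sorted(set(tokens))
levels = pyramid(term_position_matrix(tokens, vocabulary))
widths = [len(level[0]) for level in levels]
```

Each level can then be used as a separate representation of the same text, from exact positions at the finest level down to a position-free summary at the coarsest one.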
 R. Cummins, M. Lalmas, and C. O’Riordan, “Examining the Information Retrieval Process from an Inductive Perspective,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Cummins2010Examining, Author = {Ronan Cummins and Mounia Lalmas and Colm O'Riordan}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:01:59 +0000}, DateModified = {2010-11-02 11:16:44 +0000}, Doi = {10.1145/1871437.1871453}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Examining the Information Retrieval Process from an Inductive Perspective}, Year = {2010}}
Looks at what percentage of the information from the three usual sources (collection, document, and query) is necessary for IR models to perform well.
 F. Raiber and O. Kurland, “On Identifying Representative Relevant Documents,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Raiber2010IdentifyingRepresentative, Author = {Fiana Raiber and Oren Kurland}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-02 11:00:55 +0000}, DateModified = {2010-11-02 11:16:43 +0000}, Doi = {10.1145/1871437.1871454}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {On Identifying Representative Relevant Documents}, Year = {2010}}
Evaluates different strategies to select relevant documents for relevance feedback for BM25 and Language Models.
 L. Zhao and J. Callan, “Term Necessity Prediction,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Zhao2010TermNecessity, Author = {Le Zhao and Jamie Callan}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:04:49 +0000}, DateModified = {2010-11-02 11:16:43 +0000}, Doi = {10.1145/1871437.1871474}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Term Necessity Prediction}, Year = {2010}}
Explores how necessary a term is for a document to be relevant, i.e. P(term in document | document is relevant to q), a probability that is ignored in most models, and proposes machine-learning-based techniques to estimate it for query features.
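For intuition about the quantity itself, the sketch below computes the plain empirical estimate of P(term in document | document is relevant to q) from a set of judged relevant documents. This is only the quantity being targeted, assuming relevance judgements are available; it is not the prediction technique of the paper, and the toy documents and the function name are made up for illustration.

```python
def term_necessity(term, relevant_docs):
    """Fraction of the relevant documents that contain the term."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    containing = sum(1 for doc in relevant_docs if term in doc)
    return containing / len(relevant_docs)

# Hypothetical relevant documents for one query, each reduced to its term set.
relevant = [
    {"quantum", "retrieval", "model"},
    {"quantum", "probability"},
    {"retrieval", "index"},
    {"quantum", "index", "retrieval"},
]
necessity = {t: term_necessity(t, relevant) for t in ("quantum", "retrieval", "index")}
```

A query term with low necessity is missing from many relevant documents, which is one way to see why treating every query term as equally required can hurt recall.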
 K. Min, Z. Zhang, J. Wright, and Y. Ma, “Decomposing Background Topics from Keywords by Principal Component Pursuit,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Min2010Decomposing, Author = {Kerui Min and Zhengdong Zhang and John Wright and Yi Ma}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:20:05 +0000}, DateModified = {2010-11-02 11:16:44 +0000}, Doi = {10.1145/1871437.1871475}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Decomposing Background Topics from Keywords by Principal Component Pursuit}, Year = {2010}}
An interesting extension to LSI, where documents are represented both by a reduced representation (given by LSI) and by an additive term that captures salient features of this document (e.g. low-frequency terms).
 I. Batal and M. Hauskrecht, “Constructing Classification Features Using Minimal Predictive Patterns,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Hauskrecht2010Constructing, Author = {Iyad Batal and Milos Hauskrecht}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:43:19 +0000}, DateModified = {2010-11-02 11:16:44 +0000}, Doi = {10.1145/1871437.1871549}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Constructing Classification Features Using Minimal Predictive Patterns}, Year = {2010}}
Explores how to do feature selection using strategies for mining frequent term sets. Shows substantial improvement over other methods.
 K. Punera and S. Merugu, “The Anatomy of a Click: Modeling User Behavior on Web Information Systems,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Punera2010TheAnatomy, Author = {Kunal Punera and Srujana Merugu}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:47:51 +0000}, DateModified = {2010-11-02 11:16:43 +0000}, Doi = {10.1145/1871437.1871563}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {The Anatomy of a Click: Modeling User Behavior on Web Information Systems}, Year = {2010}}
Models observed and unobserved behaviour of the user on a SERP, using an HMM-like model where a state is a vector of random variables (that can be observed or not) and the output is the observed (click) or unobserved (e.g. explores next/previous document) user action.
 C. Hauff, L. Azzopardi, and D. Kelly, “A Comparison of User and System Query Performance Predictions,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{Hauff2010AComparison, Author = {Claudia Hauff and Leif Azzopardi and Diane Kelly}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:44:12 +0000}, DateModified = {2010-11-02 11:16:43 +0000}, Doi = {10.1145/1871437.1871562}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {A Comparison of User and System Query Performance Predictions}, Year = {2010}}
Compares how different query difficulty predictors are from what users would say.
 R. W. White, P. N. Bennett, and S. T. Dumais, “Predicting Short-Term Interests Using Activity-Based Search Context,” in CIKM’10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management, 2010.
[Bibtex] @inproceedings{White2010Predicting, Author = {Ryen W. White and Paul N. Bennett and Susan T. Dumais}, Booktitle = {CIKM'10: Proceedings of the nineteenth ACM conference on Conference on information and knowledge management}, DateAdded = {2010-11-01 15:47:02 +0000}, DateModified = {2010-11-02 11:16:44 +0000}, Doi = {10.1145/1871437.1871565}, Editor = {Huang, Jimmy and Koudas, Nick and Jones, Gareth and Wu, Xindong and Collins-Thompson, Kevyn and An, Aijun}, Location = {Toronto, Canada}, Publisher = {{ACM}}, ShortBooktitle = {{CIKM}}, ShortTitle = {{CIKM}}, Title = {Predicting Short-Term Interests Using Activity-Based Search Context}, Year = {2010}}
Looks at how to predict how much the context of a query (i.e. queries in the current session) is to be taken into account. Uses a parametrized mixture of language models from the context and the query.
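The mixture itself is easy to sketch. The example below assumes maximum-likelihood unigram models over whitespace tokens and a mixing weight fixed by hand; predicting that weight is the actual subject of the paper and is not shown here. The session queries and function names are hypothetical.

```python
from collections import Counter

def unigram_model(texts):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}

def mix(query_model, context_model, weight):
    """p(w) = weight * p_query(w) + (1 - weight) * p_context(w)."""
    vocab = set(query_model) | set(context_model)
    return {
        w: weight * query_model.get(w, 0.0) + (1 - weight) * context_model.get(w, 0.0)
        for w in vocab
    }

# Hypothetical session: two earlier queries form the context of the current one.
context = unigram_model(["jaguar car price", "jaguar dealer"])
query = unigram_model(["jaguar speed"])
mixed = mix(query, context, weight=0.6)
```

With a weight close to 1 the context is almost ignored, while a weight close to 0 lets the earlier queries of the session dominate the current one.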