Making Sense of Big Data
Today's capacity to generate, collect and store digital data, at unprecedented volumes and speed, is still undermined by the limited human ability to understand the data and extract value from it. As a consequence, the classical ways for users to access data, i.e. through a search engine or by querying a database, have become insufficient when confronted with the new needs of data exploration and interpretation typical of an information-producing society. Dealing with Big Data is now a critical task from both a strategic and a short-term perspective. We are in the era of large, decentralized, distributed environments where the number of devices, the amount of data and their heterogeneity slip a little further out of control every day: Gartner reports that the worldwide information volume is growing at a minimum rate of 59% annually.
The term information overload was already used by Alvin Toffler in his book Future Shock back in 1970. It refers to the difficulty of understanding an issue and making decisions when too much information is available.
We propose INDIANA, a system conceived to support a novel paradigm of database exploration. INDIANA guides users who are interested in gaining insights about a database by involving them in a “conversation”. During the conversation, the system proposes some features of the data that are “interesting” from the statistical viewpoint. The user selects some of these as starting points, and INDIANA reacts by suggesting new features related to the ones selected by the user. A key requirement in this context is the ability to explore transactional databases without the need for complex pre-processing. To this end, we develop a number of novel algorithms to support interactive exploration of the database. We report an in-depth experimental evaluation showing that the proposed system achieves a very good trade-off between accuracy and scalability, and a user study supporting the claim that the system is effective in real-world database-exploration tasks.
One of the techniques that can be used in exploration is based on intensional semantic features. The term intension suggests the idea of denoting objects by means of their properties rather than by exhibiting them. The ambitious objective is to build a novel scientific framework that supports the user in a wide variety of “scouting” activities, such as progressively inspecting, observing, investigating, examining, discovering, searching and surveying, in a way that mimics a discussion with the system. The envisaged result goes well beyond the current state of the art of query systems, search engines, data analysis and recommendation systems, sometimes borrowing methods and techniques from them but developing a brand-new theory for modeling and manipulating intensional set representations.
We should note that in real life we use intensional knowledge very often, since our brain is much better at capturing (and reasoning over) properties of objects than at memorizing long lists of them. While describing reality with 100% accuracy is practically impossible, an approximate description of a collection of data (with an accuracy lower than 100%) is still possible. The ultimate goal of this research is thus to formulate the theory of intensional datasets, encompassing the notion of “semantic relevance”, in order to synthesize and operate on the relevant features of sets by means of a purpose-built algebra that works on approximate, intensional and extensional set (and feature relevance) representations.
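As a rough illustration of describing a set by a property rather than by listing its members, the sketch below measures how well a candidate property captures a target set. The sample data, the function name `intensional_accuracy` and the use of precision and recall as the accuracy measure are assumptions made for this example; they are not the algebra proposed in this research.

```python
def intensional_accuracy(universe, target_ids, prop):
    """Return (precision, recall) of `prop` used as a description of the target set.

    universe   -- list of records (dicts with an "id" key)
    target_ids -- the extensional set we want to describe
    prop       -- a predicate over records: the candidate intensional description
    """
    selected = {record["id"] for record in universe if prop(record)}
    target = set(target_ids)
    hits = len(selected & target)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(target) if target else 0.0
    return precision, recall


# Hypothetical data: five customers, four of whom are frequent buyers.
customers = [
    {"id": 1, "city": "Milan", "age": 34},
    {"id": 2, "city": "Milan", "age": 51},
    {"id": 3, "city": "Turin", "age": 29},
    {"id": 4, "city": "Milan", "age": 45},
    {"id": 5, "city": "Rome", "age": 38},
]
frequent_buyers = {1, 2, 4, 5}

# "Customers from Milan" is a compact but approximate description of the set:
# every selected customer is a frequent buyer, but one frequent buyer is missed.
precision, recall = intensional_accuracy(
    customers, frequent_buyers, lambda record: record["city"] == "Milan"
)
print(precision, recall)
```

The property replaces a list of identifiers with a single condition, at the price of an accuracy below 100%.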
Many interpretations of the notion of context have emerged in various fields of research, such as psychology, philosophy and computer science. Context-aware systems are pervading everyday life; context modelling is therefore becoming a relevant issue and an expanding research field. Context often has a significant impact on the way humans (or machines) act, and on how they interpret things; furthermore, a change in context transforms the experience that is about to be lived.
The word itself, derived from the Latin con (with or together) and texere (to weave), describes a context not just as a profile, but as an active process dealing with the way humans weave their experience within their whole environment, to give it meaning.
Accordingly, while the computer science community initially perceived context as a matter of user location, in the last few years the notion has come to be considered not simply as a state, but as part of a process in which users are involved. Sophisticated and general context models have thus been proposed to support context-aware applications, which use them to: (a) adapt interfaces, (b) determine the set of application-relevant data, (c) increase the precision of information retrieval, (d) discover services, (e) make the user interaction implicit, or (f) build smart environments.
In Information Management, context-aware systems are mainly devoted to determining which information is relevant with respect to the ambient conditions. Indeed, the amount of available data and data sources today requires not only integrating them (still a hard problem), but also filtering (tailoring) their information in order to: 1) provide the user with the appropriately tailored set of data, 2) match devices’ physical constraints, 3) operate on a manageable amount of data (for improving query processing efficiency), and 4) provide the user with time- and location-relevant data (mobile applications). Given this scenario, this research is carried out within the Context-ADDICT project, and has produced:
- a definition of the notion of context in the database field,
- a survey and comparison of the most interesting approaches to context modelling and usage available in the literature,
- a comprehensive evaluation framework, allowing application designers to compare context models with respect to a given target application, in particular for the problem of context-aware data management,
- a design methodology providing systematic support to the designer of data management applications, whether they involve a huge amount of data (e.g., in data warehousing) or a very small one (e.g., in portable, lightweight data management systems), in determining the context-aware database portions to be delivered to each user in each specific context.
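The idea of delivering a context-dependent portion of a database can be sketched as follows. This is a minimal illustration under stated assumptions: the `Context` class, its two dimensions (`role`, `city`), the `tailor` function and the sample rows are all hypothetical and do not reproduce the Context-ADDICT context model or methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    role: str  # hypothetical context dimension: who is asking
    city: str  # hypothetical context dimension: where the user is


RESTAURANTS = [
    {"name": "Da Mario", "city": "Milan", "cuisine": "pizza", "supplier_code": "S-17"},
    {"name": "Il Porto", "city": "Genoa", "cuisine": "fish", "supplier_code": "S-02"},
    {"name": "Navigli Sushi", "city": "Milan", "cuisine": "sushi", "supplier_code": "S-31"},
]


def tailor(rows, ctx):
    """Return the portion of `rows` that is relevant in context `ctx`."""
    # Row filtering: keep only the data for the user's current location.
    local = [row for row in rows if row["city"] == ctx.city]
    # Column filtering: customers do not see management-only attributes.
    if ctx.role == "customer":
        local = [
            {key: value for key, value in row.items() if key != "supplier_code"}
            for row in local
        ]
    return local


view = tailor(RESTAURANTS, Context(role="customer", city="Milan"))
for row in view:
    print(row)
```

Here the context selects both which tuples and which attributes reach the user, so a small device receives only a manageable, relevant slice of the data.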
Further research topics include: context query languages; context-awareness in social media; automatic learning of unknown contexts; representation of context evolution; requirement analysis based on the notion of context; architecture for a context manager; context-aware sensor query languages.
Context-aware, preference-based recommendations
With data tailoring, given a target scenario, in each specific context the system allows users or applications to access only the data view that is relevant in that context. However, this is not always sufficient, since user preferences may still vary according to the context the user is currently in, and a change in context may change the relative importance a user attributes to information. In this case, contextual preferences can be used to further refine the views associated with contexts, by imposing a ranking on the data of a context-aware view. On the other hand, we cannot expect a user to manually specify the long list of preferences that might be applied to all available data when a context becomes active. This is why we propose a methodology and a system, PREMINE, where data mining is used to infer contextual preferences from the user’s previous querying activity: in particular, our approach mines contextual preferences from the past interaction of the user with contextual views over a relational database, gathering knowledge in terms of association rules between each context and the data that is relevant for the user in that context.
For a quick but accurate account of the state of the art in context awareness, and in personalization in general, see the recent special issue of the IEEE Data Engineering Bulletin.
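To make the notion of association rules between contexts and data more concrete, the sketch below derives rules of the form "context ⇒ item" from a log of past accesses, using the standard support and confidence measures. The log format, the thresholds and the function name `mine_contextual_preferences` are assumptions made for illustration; this is not the PREMINE algorithm itself.

```python
from collections import Counter


def mine_contextual_preferences(log, min_support=0.1, min_confidence=0.5):
    """Mine rules `context => item` from a list of (context, item) accesses.

    Returns a dict mapping (context, item) to (support, confidence).
    """
    total = len(log)
    context_counts = Counter(context for context, _ in log)
    pair_counts = Counter(log)
    rules = {}
    for (context, item), count in pair_counts.items():
        support = count / total                      # how often the pair occurs overall
        confidence = count / context_counts[context]  # how often the item is chosen in that context
        if support >= min_support and confidence >= min_confidence:
            rules[(context, item)] = (support, confidence)
    return rules


# Hypothetical access log: which dish category the user looked at in which context.
log = [
    ("lunch", "pizza"), ("lunch", "pizza"), ("lunch", "sushi"), ("lunch", "pizza"),
    ("dinner", "steak"), ("dinner", "steak"), ("dinner", "pizza"), ("dinner", "steak"),
]

rules = mine_contextual_preferences(log)
# Rank the surviving rules by confidence: higher confidence means a stronger preference.
for (context, item), (support, confidence) in sorted(
    rules.items(), key=lambda entry: entry[1][1], reverse=True
):
    print(f"{context} => {item}  support={support:.3f} confidence={confidence:.2f}")
```

The confidence of each rule can then serve as a score for ranking the tuples of the view associated with the active context.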
XML is a rather verbose representation of semistructured data, which may require huge amounts of storage space. We propose a summarized representation of XML data, based on the concept of instance pattern, which can both provide succinct information and be directly queried. The physical representation of instance patterns exploits itemsets or association rules to summarize the content of XML datasets. Instance patterns may be used for (possibly partially) answering queries, either when fast and approximate answers are required, or when the actual dataset is not available, e.g., it is currently unreachable. Experiments on large XML documents show that instance patterns allow a significant reduction in storage space, while preserving almost entirely the completeness of the query result. Furthermore, they provide fast query answers and show good scalability on the size of the dataset, thus overcoming the document size limitation of most current XQuery engines. We also investigate novel data mining algorithms to infer patterns representing summarized and integrated representations of data and service functionalities.
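The following sketch gives a simplified flavour of summarizing an XML document with association rules and answering a query from the summary alone. The sample document, the element names, the confidence threshold and the helper names `mine_patterns` and `approximate_answer` are hypothetical; the actual instance-pattern representation is richer than this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical bibliography document.
DOC = """<bib>
  <paper><author>Rossi</author><venue>VLDB</venue></paper>
  <paper><author>Rossi</author><venue>VLDB</venue></paper>
  <paper><author>Rossi</author><venue>ICDE</venue></paper>
  <paper><author>Bianchi</author><venue>EDBT</venue></paper>
</bib>"""


def mine_patterns(xml_text, lhs, rhs, min_confidence=0.6):
    """Summarize the document as rules `lhs=value => rhs=value` with their confidence."""
    root = ET.fromstring(xml_text)
    lhs_counts = Counter()
    pair_counts = Counter()
    for element in root:
        left, right = element.findtext(lhs), element.findtext(rhs)
        if left is None or right is None:
            continue  # skip elements missing either child
        lhs_counts[left] += 1
        pair_counts[(left, right)] += 1
    patterns = {}
    for (left, right), count in pair_counts.items():
        confidence = count / lhs_counts[left]
        if confidence >= min_confidence:
            patterns[(left, right)] = confidence
    return patterns


def approximate_answer(patterns, lhs_value):
    """Answer 'which rhs values go with lhs_value?' using only the summary."""
    return sorted(right for (left, right) in patterns if left == lhs_value)


patterns = mine_patterns(DOC, "author", "venue")
print(approximate_answer(patterns, "Rossi"))
```

The summary answers the query without touching the document, but the answer is partial: the low-confidence pair (Rossi, ICDE) was dropped when the patterns were mined.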
This work is carried out in collaboration with the database and data mining group of Politecnico di Torino.
RELACS: RElational vioLation Analysis for Constraint Satisfaction
Frequent constraint violations on the data stored in a database may suggest that the represented reality is changing, and thus the database does not reflect it anymore. It is thus desirable to devise methods and tools to support (semi-)automatic schema changes, in order for the schema to mirror the new situation. This work concerns methods and techniques to maintain the database integrity constraints specified at design time, in order to adjust them to the evolutions of the modeled reality that may occur during the database life.
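As a simple illustration of how frequent violations can signal an outdated constraint, the sketch below measures how often a functional dependency is violated and flags it for revision above a threshold. The choice of a functional dependency, the majority-value heuristic, the 10% threshold and the sample data are assumptions made for this example, not the RELACS method.

```python
from collections import defaultdict


def fd_violation_rate(rows, lhs, rhs):
    """Fraction of rows that disagree with the most common `rhs` value for their `lhs` value.

    A rate of 0.0 means the functional dependency lhs -> rhs holds on `rows`.
    """
    if not rows:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    violating = 0
    for values in groups.values():
        majority = max(set(values), key=values.count)
        violating += sum(1 for value in values if value != majority)
    return violating / len(rows)


# Hypothetical data: at design time each department had exactly one manager.
employees = [
    {"name": "Elena", "dept": "R&D", "manager": "Anna"},
    {"name": "Paolo", "dept": "R&D", "manager": "Anna"},
    {"name": "Sara", "dept": "R&D", "manager": "Luca"},
    {"name": "Gino", "dept": "Sales", "manager": "Marco"},
    {"name": "Rita", "dept": "Sales", "manager": "Marco"},
    {"name": "Ugo", "dept": "Sales", "manager": "Marco"},
]

rate = fd_violation_rate(employees, "dept", "manager")
REVISION_THRESHOLD = 0.1  # assumed value, to be tuned per application
needs_revision = rate > REVISION_THRESHOLD
print(f"violation rate: {rate:.2f}, revise constraint: {needs_revision}")
```

A constraint flagged in this way would then be a candidate for relaxation or replacement, with the final decision left to the designer.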
Semantic Data Markets: a Flexible Environment for Knowledge Management
Nyaya is a system for the management of Semantic-Web data which couples a general-purpose and extensible storage mechanism with efficient ontology reasoning and querying capabilities. Nyaya processes large Semantic-Web datasets, expressed in multiple formalisms, by transforming them into a collection of Semantic Data Kiosks. Nyaya uniformly exposes the native meta-data of each kiosk using the Datalog± language, a powerful rule-based modelling language for ontological databases. The kiosks form a Semantic Data Market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. Nyaya is easily extensible and robust to updates of both data and meta-data in the kiosk, and can readily adapt to different logical organizations of the persistent storage. The approach has been evaluated using well-known benchmarks, and compared to state-of-the-art research prototypes and commercial systems. (This work is carried out in collaboration with R. De Virgilio, G. Orsi and R. Torlone.)
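For readers unfamiliar with conjunctive queries, the sketch below evaluates one over a small set of facts by joining the query atoms one at a time. The fact encoding, the `?`-prefixed variable convention and the `evaluate` function are assumptions made for illustration; this is neither Nyaya's interface nor Datalog±, and it performs no ontological reasoning.

```python
def evaluate(query_atoms, facts):
    """Evaluate a conjunctive query over `facts`.

    Atoms and facts are tuples (predicate, arg1, arg2, ...).
    Query arguments starting with '?' are variables; all others are constants.
    Returns the list of variable bindings that satisfy every atom.
    """
    bindings = [{}]
    for predicate, *args in query_atoms:
        extended = []
        for binding in bindings:
            for fact_predicate, *fact_args in facts:
                if fact_predicate != predicate or len(fact_args) != len(args):
                    continue
                candidate = dict(binding)
                matches = True
                for arg, value in zip(args, fact_args):
                    if arg.startswith("?"):
                        # Bind the variable, or check consistency with an earlier binding.
                        if candidate.setdefault(arg, value) != value:
                            matches = False
                            break
                    elif arg != value:
                        matches = False
                        break
                if matches:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Hypothetical facts, as they might be exposed by a single kiosk.
facts = [
    ("type", "alice", "Professor"),
    ("teaches", "alice", "db101"),
    ("type", "bob", "Student"),
    ("teaches", "carol", "ai202"),
    ("type", "carol", "Professor"),
]

# Query: which courses are taught by professors?
query = [("type", "?x", "Professor"), ("teaches", "?x", "?c")]
answers = evaluate(query, facts)
for answer in answers:
    print(answer["?x"], answer["?c"])
```

The shared variable `?x` is what joins the two atoms, which is the defining feature of a conjunctive query.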