Thesis Proposal: Information Storage and Retrieval of Unstructured Data



The main aim is to store unstructured data (possibly in the cloud) so that it can be retrieved easily and cost-effectively, making use of quantum, temporal and spatial information as well as databases. Creating ontological approaches to personalizing queries of unstructured data requires intensive use of XML-based tables and schemas. From the legacy design efforts for CSDL (Roussopoulos, 1979) to the myriad approaches to XML schema development, including XIRQL (Fuhr & Großjohann, 2004), hybrid XML retrieval (Pehcevski, Thom & Vercoustre, 2005) and XML queries (Chien, Tsotras, Zaniolo & Zhang, 2006), the adoption of advanced techniques for unstructured content management is progressing rapidly. Paralleling these research advances is the pervasive adoption of cloud computing platforms, including Software-as-a-Service (SaaS), driven by the growth of the Amazon Web Services platform among others.

The intent of this thesis proposal is to define an XML schema that can aggregate unstructured content and, when combined with the individualized taxonomies and ontological preferences of system users, deliver highly relevant and timely data. The proposed XML Schema Model for Unstructured Content Personalization is shown in Figure 3.
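As an illustrative sketch only (the element names below are hypothetical and are not the schema of Figure 3), unstructured content might be wrapped in an XML record alongside a user's taxonomy terms:

```python
import xml.etree.ElementTree as ET

def build_content_record(doc_id, text, taxonomy_terms):
    """Wrap an unstructured document in a personalization-ready XML record."""
    record = ET.Element("contentRecord", id=doc_id)
    body = ET.SubElement(record, "body")
    body.text = text
    taxonomy = ET.SubElement(record, "taxonomy")
    for term in taxonomy_terms:
        ET.SubElement(taxonomy, "term").text = term
    return record

record = build_content_record("doc-001", "Quarterly compliance summary ...",
                              ["compliance", "audit"])
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

A real schema would constrain these elements with an XSD and carry per-role ontology identifiers; the sketch only shows the aggregation step.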


Information storage and retrieval is the systematic process of collecting and cataloging data so that they can be located and displayed on request. Data processing techniques have made possible the high-speed, selective retrieval of large amounts of information for government, commercial, and academic purposes. There are several basic types of information-storage-and-retrieval systems. Document-retrieval systems store entire documents, which are usually retrieved by title or by keywords associated with the document. In some systems, the text of documents is stored as data; this permits full-text searching, enabling retrieval on the basis of any word in the document. In others, a digitized image of the document is stored, usually on a write-once optical disc.
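The full-text approach described above can be sketched as a minimal inverted index, in which every word of a stored document becomes a retrieval key (the sample documents are invented and the tokenization is deliberately naive):

```python
from collections import defaultdict

documents = {
    "d1": "annual tax compliance report",
    "d2": "audit report for the compliance office",
}

# Build the inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(word):
    """Return the ids of all documents containing the word."""
    return sorted(index.get(word.lower(), set()))

print(search("compliance"))  # -> ['d1', 'd2']
print(search("audit"))       # -> ['d2']
```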

Database systems store the information as a series of discrete records that are, in turn, divided into discrete fields (e.g., name, address, and phone number); records can be searched and retrieved on the basis of the content of the fields (e.g., all people who have a particular telephone area code). The data are stored within the computer, either in main storage or auxiliary storage, for ready access. Reference-retrieval systems store references to documents rather than the documents themselves. Such systems, in response to a search request, provide the titles of relevant documents and frequently their physical locations. Such systems are efficient when large amounts of different types of printed data must be stored. They have proven extremely effective in libraries, where material is constantly changing. The volume of information has been rapidly increasing in the past few decades.
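A minimal sketch of the field-based retrieval just described, with records as named fields and a query matching on field content (the sample records and the area-code format are invented):

```python
# Each record is a set of discrete, named fields.
records = [
    {"name": "A. Smith", "address": "12 Elm St", "phone": "212-555-0100"},
    {"name": "B. Jones", "address": "9 Oak Ave", "phone": "415-555-0199"},
]

def by_area_code(records, code):
    """Select records whose phone field starts with the given area code."""
    return [r for r in records if r["phone"].startswith(code + "-")]

print([r["name"] for r in by_area_code(records, "212")])  # -> ['A. Smith']
```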

While computer technology has played a significant role in encouraging information growth, that growth has in turn had a great impact on the evolution of computer technology for processing data. Historically, many different kinds of databases have been developed to handle information, including the early hierarchical and network models, the relational model, and the more recent object-oriented and deductive databases. However much these databases have improved, they still have their deficiencies. Much information is in textual format, and this unstructured style of data, in contrast to the older structured record format, cannot be managed properly by traditional database models. Furthermore, since so much information is available, storage and indexing are not the only problems; we need to ensure that relevant information can be obtained upon querying the database (Zobel, p. 12). Information retrieval (IR) is the field of research investigating the searching of information in documents, searching for documents, searching for information about documents, and searching within databases, whether stand-alone or networked by hyperlinks, as on the Internet.

Types of data searched include text, audio, video, images, and other complex data types such as programs. An IR system is commonly understood as one that deals with the relationship between objects and queries. Queries are formal statements of information needs addressed to an IR system by the user. An object is an entity that stores information in a data set, known as a document. User queries are matched to documents stored in a document collection. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by pointers. Automated information retrieval systems were initially used to manage the information explosion in the scientific literature of the last few decades. Because the value of information is directly related to its ability to be located and used effectively, search engines form a crucial component in the research and understanding of modern times.

For something so crucial, IR can be a confusing area of study. First, there is an underlying difficulty with the very definition of IR, as there exist the adjacent fields of data, document and text retrieval [1]; knowledge, information and data management [2]; information seeking [3,4]; information science [5,6]; and others, each with its own body of literature, theory and technologies, deeply related to IR and to each other to the point where the boundaries are unclear. Second, IR is a broad interdisciplinary field that draws upon secondary fields such as cognitive science, linguistics, computer science and library science, and it does so in a loosely organized fashion. It is tempting to refer to this conjunction of diverse areas as “search science”. However, due to the ad-hoc techniques used for experimentation in IR and the absence of a general formal language for defining IR concepts, components and results, it cannot yet be called a science. Furthermore, there is no specific definition of search. With the abundance of methods available for finding information, whether through computer applications, libraries and librarians, a combination thereof, or otherwise, a formal definition would need to accommodate a process far more complex than traditional web-based querying through systems like Google. The lack of a general formal specification method for search processes and IR research, and the absence of a strict scientific method underpinning the field, have posed major barriers to the future development and usefulness of research in it [10,11].

This thesis addresses the conceptual, methodological and theoretical foundations of information retrieval with the intention of creating standard principles for understanding search in its various forms, and thereby of enhancing research methodology. Borrowing notions from the mathematical formalisms, operational methods and interpretational mechanisms of Quantum Theory (QT), this work aims to show that the conceptual ambiguities underlying current research methods are responsible for many of the field's research problems. Alternative ways of understanding search are proposed, with corresponding methods of conceptualization, some allowing mathematical formalization. Personalized retrieval assistants are sought to facilitate information access and relieve users of the burden of query formulation and the information discovery process. Information filtering systems in the form of recommendation services (Zhu, Greiner & Haubl 2003; Balabanovic 1997), query expansion assistants (Harman 1988; White & Marchionini 2007) and collaborative agents (Resnick, Iacovou, Suchak, Bergstrom & Riedl 1994), which cater to the needs of like-minded people, have gathered much attention over the years.

However, the role of personalization in retrieval systems has not been explored properly. Although research has focused on improving existing forms of personalization, the drifting nature of user needs has received little attention. Long-term user interests evolve gradually as users interact with new information, and temporal topics may appear unexpectedly. Given current rates of content generation and the maturity of the technology, new classes of personalized and context-sensitive access patterns are possible; for example, personal recommendation assistants can proactively profile searchers and adapt to their changing interests. The development of such techniques and systems is hampered by the lack of proper evaluation methodologies and collections.

There are two components to information filtering systems: a user modelling component, and a recommendation and presentation subsystem. The majority of adaptive information retrieval tools employ relevance feedback (RF) gathering mechanisms to capture user intentions and create a profile (Belkin & Croft 1992), an appropriate snapshot of their information needs. RF techniques commonly rely on explicit ratings of the objects in the information domain; these ratings are then processed to calculate user preferences. Alternatively, implicit feedback gathering techniques (Kelly & Teevan 2003; Nichols 1997) infer document relevance by observing user interactions with the retrieval system.
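One common concrete form of explicit relevance feedback is Rocchio-style query reweighting. The following is a minimal sketch under the assumption that queries and documents are already term-weight vectors; the parameter values are illustrative, and the source does not commit to this particular RF technique:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward rated-relevant documents and away
    from rated-nonrelevant ones (explicit relevance feedback)."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q, 0, None)  # negative term weights are usually dropped

query = np.array([1.0, 0.0, 0.0])          # user asked for term 0 only
relevant = np.array([[0.0, 1.0, 0.0]])     # but rated a term-1 document relevant
updated = rocchio(query, relevant, np.empty((0, 3)))
print(updated)  # term 1 now carries weight in the user's profile
```

Implicit feedback fits the same skeleton: observed interactions (clicks, dwell time) simply replace the explicit ratings when selecting the `relevant` set.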

1. Motivation

Ontological classification of unstructured data is critically important in managing initiatives, programs and strategies. When the specialized requirements of the Internal Revenue Service (IRS), and its requirement to stay in compliance with government regulations, are included, the complexity of its roles and the need for accuracy, auditability and transparency become critically important. The IRS has long been challenged by a lack of transparency and by the difficulty of creating an ontologically based model that can take role-based requirements into account; the proposed XML Schema Model for Unstructured Content Personalization integrates unstructured and structured content into role-based taxonomies. For structured data, the integration of XSLT style sheets and XML into the taxonomies and ontological frameworks is defined. For unstructured data, XML data integration is combined with a Latent Semantic Indexing (LSI) filter that classifies and organizes the content into ontologically defined roles. These two XML integration workflows, from structured and unstructured data respectively, are also used for creating knowledge management structures, systems and processes. In evaluating how the XML Schema Model for Unstructured Content Personalization would accomplish this, the recursive nature of its workflow needs to be seen as a factor driving the accumulation of knowledge, as a result of the velocity of data transactions and the fluidity of communication.
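The LSI step itself can be sketched as a truncated singular value decomposition of a term-document matrix; the matrix values and the choice of two latent concepts below are purely illustrative, and the classification into roles would compare each projected document against role "concept" vectors:

```python
import numpy as np

td = np.array([[2., 0., 1.],   # rows: terms, columns: documents
               [1., 1., 0.],
               [0., 2., 2.]])

# Truncated SVD: keep only the k strongest latent "concepts".
U, s, Vt = np.linalg.svd(td, full_matrices=False)
k = 2
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T   # each document in concept space
print(doc_concepts.shape)                    # (3 documents, 2 concepts)
```

Documents that are close in this reduced concept space can be treated as topically similar even when they share no literal terms, which is what lets the filter group unstructured content under ontologically defined roles.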

1.1 Motivation: Drowning in Data

Much of the data generated each year will stem from formerly analogue sources such as film, voice calls and TV signals, and it is assumed that individuals will be the source of the majority of the data, about 70% of it. In the not too distant future, this could shift toward more system-based data generation, as exemplified by surveillance systems, remote operations, automated control, and automated analysis of biological systems (e.g., genomics, proteomics). Storing all these data may become a serious problem: according to Gantz, in 2007 the amount of digitally created information surpassed the available storage capacity [7]. As illustrated in Figure 1, the expected annual growth in global data generation versus storage capacity shows clearly that not all generated data can be stored. If this trend continues, then by 2012 there will be storage space for only half of the newly generated data. Whether the discrepancy between the amount of data and the storage capacity will widen further will depend on how successful the development of future storage technologies, e.g., holography or even protein-based storage, turns out to be. With the current growth rate in storage capacity at about 37% per year, significantly below the estimated 57% annual increase in information growth, the question of ‘What shall be stored?’ becomes pressing.
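Taking the paragraph's two growth rates as given assumptions (37% per year for storage, 57% per year for data), a quick compound-growth projection shows how fast the gap opens from a common baseline:

```python
# Both rates are assumptions quoted from the text, not independent data.
storage = data = 1.0
for year in range(5):
    storage *= 1.37   # storage capacity grows ~37% per year
    data *= 1.57      # generated data grows ~57% per year

# After 5 years the data volume is roughly (1.57/1.37)^5 ~ 2x the storage.
print(round(data / storage, 2))
```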

1.2 Motivation: Decay Time of Information Value

The time span for which an object should be kept depends on its value to the current user or some future user. Sometimes this time span is specified by law (e.g., in accounting), leading to a stepwise value function: after the mandatory period, the information value drops to zero or becomes very low, and deletion may even be mandated, e.g., for privacy reasons. In other situations the information value may decay gradually, either slowly (e.g., books) or relatively quickly (e.g., weather news and stock market prices). Such a decay time is tightly coupled with the life span of the processes (or persons) that use the information. If a person dies, a company is dissolved, a business process is re-engineered or a law is terminated, the information may no longer be needed. Processes or the legal environment, however, may require storing ‘dead’ information for as long as it may be needed as evidence, in court or otherwise, of what went on in the past. Then the question arises: what should be preserved for historical reasons, to document cultural heritage and to support future research? Materials that may seem to have a very limited lifetime (e.g., commercials) may turn out to be of high value for describing the society and the period in which they were created. Thus, deleting information that is no longer needed for any ‘practical’ purpose may not be wise. The explosion in the amount of data created will require specifying not only how to store information (format, metadata, etc.) but increasingly what must, should, and could be stored, and especially what should definitely not be stored.
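The two value profiles described above, stepwise legally mandated retention and gradual decay, can be sketched as simple functions of time (the retention period and half-life constants are illustrative, not taken from the text):

```python
def stepwise_value(t, mandatory_years=7):
    """Legally mandated retention: full value until the deadline, then none."""
    return 1.0 if t <= mandatory_years else 0.0

def gradual_value(t, half_life=2.0):
    """Gradually decaying value, e.g. news: exponential with a chosen half-life."""
    return 0.5 ** (t / half_life)

print(stepwise_value(5), stepwise_value(8))   # 1.0 0.0
print(round(gradual_value(2.0), 2))           # 0.5 after one half-life
```

A storage policy could compare such a value function against storage cost to decide when an object crosses the 'not worth keeping' threshold.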

One selection criterion could be time, or more exactly, the time frame in which an object has a certain information value. On-the-fly or context-based generation of information may be embedded in, or very tightly coupled with, quickly evolving technologies such as games, blogs or viewers (e.g., Flash). Such objects will represent a major challenge for preservation, as they are not necessarily stored and/or require specific software to access or read them. When the majority of new data can no longer be stored, the only way to utilize them is to process, analyse and combine them with existing information immediately. In this respect the data universe may gradually come to resemble the electricity grid, where the electricity produced must be consumed immediately. Only after further refinement of the raw data into a smaller subset may the resulting information be stored somewhere. In other words, the information value of raw data approaches zero; the raw data will therefore not be kept, and only their aggregation, with a higher information value, will be available in the future.

1.3 Search and Retrieval

With an exponentially growing volume of data, all aspects of search can be expected to become more challenging in the future. If the majority of newly generated data can no longer be stored, but only their aggregation, this has a tremendous effect on the way future search will have to be done: the search engine will have to be at the right place at the right time to index the raw data if these data are to be searchable. This type of search, however, falls outside a search perspective in which data objects are decades or centuries old. Search and retrieval may roughly be divided into three parts: pre-search (preparation), search, and post-search. Post-search in turn can be divided into analysis of search results, aggregation, exchange/presentation, and understanding. Finding the right data is an iterative process in which search results are analyzed and new searches are initiated. The intended use determines the required quality of the search results, and a main obstacle is giving an explicit description of what is actually desired. If all information about a subject must be found, e.g., information about a person, then completeness is a major quality parameter.

As pointed out above, query preparation will have to take time explicitly into account, as the terms in use, say, 40 years ago may differ considerably from today's terms, depending on the subject. A semantic time map generated specifically for a selected field of interest could therefore help identify the correct search terms with respect to time. A schematic representation of data search and retrieval is shown in Figure 2. As time goes by, all parts of the search process will have to take time-related changes more explicitly into account.
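In a minimal hypothetical form, a semantic time map could map a present-day term to the period-specific vocabulary once used for the same concept and expand queries accordingly (the map entries below are invented examples, not an actual historical thesaurus):

```python
# Hypothetical semantic time map: concept -> period -> historical terms.
TIME_MAP = {
    "laptop": {"1980s": ["portable computer"], "1990s": ["notebook computer"]},
}

def expand_query(term, periods):
    """Add period-specific synonyms so older material can be retrieved."""
    terms = [term]
    for period in periods:
        terms += TIME_MAP.get(term, {}).get(period, [])
    return terms

print(expand_query("laptop", ["1980s", "1990s"]))
```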

1.4 Motivation: Retrieval Modeling

One of the key reasons for pursuing formalizations of retrieval is to address the problem of modeling a changing information need (IN), in particular for modeling ostensive retrieval. Ostensive Information Retrieval (OIR) [8] mainly aims to address vague and changing INs. It approaches the problem by defining user-system interaction so as not to require explicit query formulation; change is accounted for by treating certain interactions as more important than others. OIR is relatively new and has been shown to be effective for search, especially image search. However, it is currently ad-hoc by nature, as there exists no theoretically sound justification for the specifics of its query-less interface or for the way it interprets IN change. Instead of requiring “artificial” communication from the user by reducing the interpretation of their IN to a short phrase, OIR attempts to recover the details of an IN that are inevitably lost in that reduction by introducing an interaction method assumed to be more natural with respect to cognitive processes. The idea of querying and feedback by ostension developed in OIR can be seen as a median between explicit feedback/query formulation and implicit feedback as they are traditionally understood. OIR suggests using an ostensive language because (1) it improves the user's understanding of how the system works, and (2) an IN defined in such a language is simpler for the system to interpret, meaning that changes in need become more transparent to the system. Ostensive IR by definition addresses the query formulation problem by removing the need to express a word-based query. The Ostensive Model (OM) of developing information needs also recognizes the dynamic nature of information needs during a search process, providing a simple model for interpreting change.
The OM is limited in many ways relative to HHIR; one limitation is its dependence on the binary probabilistic model [9], which means that only simple relevant/not-relevant interactions by the user are captured, thus limiting user expression. The interactions are interpreted on the basis of assumptions about information need change (a concept pertaining to cognitive phenomena) by means of a simplistic model of cognitive state change. Ostensive IR is user-centric search. A comprehensive investigation into OIR therefore means addressing psychological and cognitive concepts, or at least making assumptions about them, before user interactions and the corresponding expressions of behavior and personality can be adequately understood. There has been no thorough, formal research presenting a cognitive user model for ostensive or interactive search. This thesis challenges the ad-hoc approach of the OM, which limits the aims and potential of OIR, arguing that addressing the above issues, especially that of formal cognition modeling, requires a broader theory of ostensive retrieval. The motivations have been to investigate the expansion and formalization of the OM: (1) the cognitive model of IN change, (2) the interaction model, (3) the uncertain inferences about the IN made from interactions, and (4) how these affect retrieval. This thesis presents a framework in the language of quantum theory that addresses these issues and the modeling of cognitive phenomena pertaining to information need change. In general, it was the advantage of OIR over traditional IR that motivated this investigation into the foundations of IR through a formal approach to OIR.
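The ostensive idea of treating recent interactions as more important than older ones can be sketched as an exponentially discounted term profile built from a user's sequence of document selections. The discount factor and the exact weighting scheme here are assumptions for illustration, not the OM's published formulation:

```python
def ostensive_weight_profile(clicked_term_lists, discount=0.5):
    """Weight terms from a sequence of selected documents, discounting
    older selections exponentially (most recent selection has weight 1)."""
    profile = {}
    for age, terms in enumerate(reversed(clicked_term_lists)):
        w = discount ** age
        for t in terms:
            profile[t] = profile.get(t, 0.0) + w
    return profile

# Three selections in time order; interest has drifted toward "jaguar car".
profile = ostensive_weight_profile(
    [["jaguar", "cat"], ["jaguar", "car"], ["car", "engine"]])
print(profile)  # "car" outweighs "cat", reflecting the drift
```

The point of the sketch is the drift handling: the same ambiguous term ("jaguar") is disambiguated by which of its companions were selected most recently.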

1.5 Motivation: The Need for a “Search Science”

Recent work [6,8] based on ideas borrowed from Quantum Theory (QT) has suggested methods of formalizing aspects of IR, aiming toward a comprehensive theoretical basis in which a search process can be completely defined and reasoned about, and a scientific basis inspired by operational methods in QT. It was subsequently found that QT methods have the potential to play a wider role in resolving the IR issues above (of definition and lack of scientific method) than initially suggested. In addition, it was found that apart from the mathematical formalism of QT, which offers methods for the representation and analysis of IR concepts, its scientific method and operational structure are also very useful. Inspired by these connections, and in attempting to apply these methods and map search to QT, it was found that search needs to be re-examined from a perspective quite different from how it is traditionally perceived (see [11]) in order to establish the feasibility, utility and method of the mapping. Thinking about search in this new way also suggests approaches for redefining the concept of search [12]. The overall goal of our research is to be able to refer formally to IR as “search science”: establishing a specific definition of search and deducing scientific methods for its investigation, so that it can be, in all respects, a science. This addresses the conceptual foundations of information retrieval with the intention of creating standard principles and thereby enhancing research methodology. Borrowing notions from the mathematical formalisms, operational methods and interpretational mechanisms of QT, this work aims to show that conceptual ambiguities underlying current research methods are responsible for many of the field's research problems.
An alternative way of understanding search is proposed, with corresponding methods of conceptualization, some allowing mathematical formalization.

Conclusion

Using constraint-based logic in the LSI filter will ensure that constraint- and rules-based queries are completed, and the use of role-based taxonomies at the presentation layer of portals will provide IRS system users with personalization options. The intention of this thesis proposal is, first, to evaluate the feasibility of the XML Schema Model for Unstructured Content Personalization and, second, to evaluate how effective the proposed model is at becoming self-regenerative and learning over time.

Figure 3. Proposed XML Schema Model for Unstructured Content Personalization


1. D. C. Blair. The data-document distinction in information retrieval. Commun. ACM, 27(4):369–374, 1984.

2. D. C. Blair. Knowledge management: Hype, hope, or help? Journal of the American Society for Information Science and Technology, 53(12):1019–1028, 2002.

3. D. O. Case. Looking for Information: A Survey of Research on Information Seeking, Needs and Behaviour. Emerald Group, 2006.

4. G. Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, 1995.

5. J. Williams. Information science: definition and scope. In J. Williams and T. Carbo, editors, Information Science: Still an Emerging Discipline. Cathedral Publishing, 1997.

6. C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004.

7. T. Mestl, O. Cerrato, J. Ølnes, P. Myrseth, and I.-M. Gustavsen. Time Challenges – Challenging Times for Future Information Search. DNV Research & Innovation, Høvik, Norway, July 2009.

8. I. Campbell. The Ostensive Model of Developing Information Needs. Ph.D. dissertation, Department of Computer Science, University of Glasgow, 2000.

9. N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255, 1992.

10. S. Arafat and C. J. van Rijsbergen. Quantum theory and the nature of search. In Proceedings of the AAAI Symposium on Quantum Interaction, pages 114–121, 2007.

11. S. Arafat, C. J. van Rijsbergen, and J. Jose. Formalising evaluation in information retrieval. In CoLIS Workshop on Evaluating User Studies in Information Access, Fifth International Conference on Conceptions of Library & Information Science, 2005.

12. M. Agosti, F. Crestani, and G. Gradenigo. Towards data modelling in information retrieval. Journal of Information Science, 15(6):307, 1989.

13. A. Azagury, M. E. Factor, Y. S. Maarek, and B. Mandler. A novel navigation paradigm for XML repositories. Journal of the American Society for Information Science and Technology, 53(6):515–525, 2002.

14. M. Benedikt and C. Koch. XPath leashed. ACM Computing Surveys, 41(1):23, 2008.

15. E. Bertino, G. Guerrini, and M. Mesiti. Measuring the structural similarity among XML documents and DTDs. Journal of Intelligent Information Systems, 30(1):55–92, 2008.

16. A. Bonifati, S. Ceri, and S. Paraboschi. Pushing reactive services to XML repositories using active rules. Computer Networks, 39(5):645–660, 2002.

17. F. Bhuta. Put unstructured data in its place. InformationWeek, (1094):21, June 2006.