Adaptive Search

Red Adaptive Search

Everybody has a Knowledgesphere: what they know and what they understand. Most of the time, it is stuck in people’s heads. Red Adaptive Search combines a personal search engine, a Knowledgebase, an Information Gatherer and the intelligence to learn what the user wants. With this combination, users can extend and exchange their Knowledgespheres.

This section outlines what the adaptive search portion of Red Piranha can do. It is a fully working system that demonstrates the core capabilities and can be easily extended. It can be installed by a person with minimal knowledge of the Tomcat web server, and used by anyone able to use a browser and the Google search engine.

Problem

  • You have the information, but you cannot find it.
  • Information is spread across disparate data sources and systems.
  • A lot of the knowledge about what counts as ’good information’ is in people’s heads.

1. Section Layout

1.1. Reminder of what it does

1.2. Sections a-b are business. Sections c-d are technical. It is done this way so that you can evaluate Red FC first, then decide to take the time to investigate further.

1.3. Tech Setup link

1.4. Start with business user overview

Update with simple / adaptive search

Business Case

Red-Piranha is an open source search system that can actually ’learn’ what you are looking for. It lets you go everywhere, find anything, understand everything.

Because it is open source, it can integrate with any system. Because you can use it as a web page, from the command line or as an XML web service, it will work with most languages, including Java, Perl, C#/.Net and PHP. As a Java based program, it will run on any platform including Windows, Linux / Unix and Mac.

SUGGESTED USES
  • Personal Search Engine for your Desktop (Windows, Linux and Mac).
  • Intranet Search Engine - Search your Company or College Intranet.
  • Part of your Development Project - Have search abilities up and running in a few minutes.
  • To provide Search facilities on your website.
  • As a P2P search engine.
  • In conjunction with a wiki, as a knowledge / document management solution.
  • Scan a set of websites for the data you want (e.g. search job sites on an hourly basis).
  • Explore the Semantic Web using RDF.
  • Search RSS feeds for the information you want.
  • Search your company’s systems (including SAP, Oracle or any other database / data source).
  • Provide a back end for searching in your App (Web, Swing, SWT, Flash, Mozilla-XUL, PHP, Perl or even C#/.Net).
  • Document Management for PDF, Word and other Docs.
  • As a Webservice to provide search information.
  • As a command line tool, to give searching power to your scripts.
  • Provide a Search facility for your project documentation.


Red-Piranha allows you to search information with a minimum of effort. With a little effort, it can search *anything*, including Oracle databases, XML web services (including Java/J2EE and .Net web services), RSS and even web based XML feeds such as Google and Amazon.

GETTING STARTED

First of all download Red-Piranha from here. If you’re not sure which one you want, download the ready-to-deploy (bin) file. The other files contain either the ready-to-deploy plus full source (bin_src_lib) or the source only (src).

If you have not already installed Java and Tomcat, you can get them from the Sun Java and Apache Tomcat websites. Red-Piranha should work with Java 1.3 and Tomcat 3, although we recommend Java 1.4 or higher and Tomcat 4/5.

Unzip the file you have downloaded - there should be a file called RP.war. Copy this file into the ’webapps’ folder of your Tomcat installation. Within a number of seconds you should see a new folder called ’RP’ created.

Congratulations - your copy of Red-Piranha has now deployed and is ready to use.

USING RED-PIRANHA

To use Red-Piranha, open your favourite web browser and point it at http://localhost:8080/RP . Within a few seconds, you should see the Red-Piranha start screen. This will have three items of interest:

  • A text box, where we enter the information to add or search for
  • An ’Add information’ button - to tell Red-Piranha about new information
  • A ’Search’ button - to carry out a search.

Before we can search, we must tell Red-Piranha what information we are interested in. This is as easy as putting the piece of information we want to add (e.g. the folder c:\temp\) in the search box and pressing the ’Add information’ button. A message will be displayed saying that your information is being added and will be available to search shortly. For more information, look in the logs at TOMCAT_HOME\Webapps\RP\logs\rp.log

Examples of things we can add to Red-Piranha are

  • A folder (e.g. C:\Temp\). All files in both this folder and *all* its subfolders will be added.
  • An individual file. This file can be text, a web page, a Word document or a PDF document. For binary files (like Word documents, which are not plain text), Red-Piranha will scan the file for recognizable text and add that.
  • A web page. Red-Piranha will add this web page *and* the web pages it links to.
  • A Google search (e.g. http://www.google.com/search?q=some+thing&num=100). Red-Piranha will get the results of the Google search, and add information on the pages it links to.
  • An XML file (including RSS feeds), either on disk or over the web.
  • Favourites / Bookmarks folders - Red-Piranha will index the web pages that these favourites point to.


Adding information can take anything from a few milliseconds upwards, depending on the amount of information being added. Once the information is added, Red-Piranha will check on a regular basis to see if it has changed and re-index it if required. Your information is now available to be searched.

To do a search, put the item you want to search for into the text box and press ’Search’. Red-Piranha will show the search results on the screen. Clicking on the link beside a search result will show you the original information (as long as you have access to it).

From version 0.3 onwards, Red-Piranha can ’learn’ what search results you are interested in and improve your future searches. To give Red-Piranha feedback and help it ’learn’ what you are interested in, click on any of the links on the ’search results’ page. Red-Piranha makes a note of your choice, which is used to adjust the search results later.

Using in the Enterprise

1.5. How it can be extended

1.6. How it can link to the other samples

Differences between simple and adaptive

  • feedback
  • simpleCategoryManager v CategoryManager
  • BareCategoryStore v BasicCategoryStore

1.7.

Screenshot – Default Search Screen


Screenshot – Add Information

Screenshot – Search Results

RUNNING RED-PIRANHA SEARCH

Get the samples binary (as per...)

The steps below can all be done from the command line. If you are using the command line, we’ll presume you know what you’re doing and are able to take the information required from the Eclipse notes below. The process will be similar for other IDEs (NetBeans, WebSphere developer).

Troubleshooting

Security Notes

  • For this simple deploy, there are no restrictions on who can add items to be searched.

Security on documents found during a search is managed outside of the RP application

Red Adaptive Search and Web 2.0

  • Sharing of user knowledge
  • Social Networking (and tools)
  • Service Orientated Architecture
  • Demonstration of mashup
  • Searching RDF
  • User adding value by his / her actions
  • Knowledge Management
  • Unconventional data sources and adaptation to user preferences.

Deployment

  • Install of Red-Piranha Project (http://red-piranha.sourceforge.net)
  • Does not have Web 2.0 interfaces – but the focus is elsewhere
  • User guide: how to use the system to learn
  • Developer guide: how to extend the system to capture new information
  • (i) How to set up the engine against existing data sources, for example an HTML / wiki based knowledge base (ii) How to extend it to handle new sources of data, such as RDF and RSS.

1.8. Download as per chapter x

1.9. Build ant file

1.10. Deploy the attached RP.war onto your web server

  • For Tomcat, this is a copy into the webapps directory

1. Open the RP web page in a browser

2. Add the directory that you wish to search later (e.g. C:\Temp\ or http://www.iib.ie)

3. Wait a few moments to allow for indexing

1.11. Go ahead and Search!

Technical – Behind the Scenes

Here is a list of what Red-Piranha uses under the covers.
  • Spring - a J2EE lite framework that gives us a lot of functionality
  • Tomcat - the Java Web Server we can run in.
  • Lucene - the Apache project to give a searching and indexing engine.
  • RDF / XML - Jena, to store all our information in RDF (aka the Semantic Web)
  • Xerces - for XML manipulation
  • PDFBox - for reading PDF documents.
  • Rp Core (to give ...)


Alternative runtime configurations

How to build and run (or just use simple tests ...)

  • on web
  • from command line
  • in eclipse
  • ADD: Run ... rpCommand line ... <arguments tab>. Arguments: ADD C:\Temp\searchdata .... Working directory: ${workspace_loc:red-piranha/red-adaptive-search/war}
  • SEARCH: Run ... rpCommand line ... <arguments tab>. Arguments: SEARCH enterprise .... Working directory: ${workspace_loc:red-piranha/red-adaptive-search/war}
  • Set the working directory as shown above; otherwise the Spring appcontext will fail to load
  • simple version build(for web)

Project Aims

The main items that Red Piranha adaptive search currently provides

  • All necessary source code including Java, configuration files, JUnit tests and build scripts. All files are marked with a copyright notice as per appendix C.
  • The project can be built and deployed (using build scripts from 2.1) on Apache Tomcat, Java 1.3, running on Windows XP / 2000 / NT and Redhat Linux (Fedora 2).
  • Once deployed on these platforms, all User Stories in this document (Section 3) can be carried out using the system.
  • The system works with users running Internet Explorer versions 4/5/6 and Firefox 1.0 (Section 4 and Appendix B).

User Stories

The user stories list the different ways in which the user can interact with the search application.

  • Story: Application Start

The steps taken when the application is first deployed (Tomcat hot deploy) or when Tomcat is (re)started. There is no user output, only output to log files.
(START) Tomcat is Started

  • Application loads the plugins as stated in PluginManager
  • Get all classes implementing the IPlugin interface from:
      • rp.war (the war file that contains the RP application)
      • Plugins directory (as specified in the directory structure in Section 7)
  • For each plugin that has been loaded:
      • Start a background thread.
      • Call the onLoad method on the plugin.

(END)
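
The per-plugin start-up step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the single-method form of IPlugin, the PluginStarter class and the thread naming are assumptions, and discovery of plugin classes inside rp.war or the plugins directory is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal form of the IPlugin interface; only onLoad() is taken
// from the story above.
interface IPlugin {
    void onLoad();
}

class PluginStarter {

    // Starts one background thread per plugin and calls onLoad() on it.
    static List<Thread> startAll(List<IPlugin> plugins) {
        List<Thread> threads = new ArrayList<>();
        int index = 0;
        for (IPlugin plugin : plugins) {
            Thread thread = new Thread(plugin::onLoad, "rp-plugin-" + index++);
            thread.setDaemon(true);                  // never block Tomcat shutdown
            thread.setPriority(Thread.MIN_PRIORITY); // assumption: start-up work is low priority
            thread.start();
            threads.add(thread);
        }
        return threads;
    }

    // Blocks until every start-up thread has finished; handy in tests.
    static void awaitAll(List<Thread> threads) {
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Running onLoad() off the main thread keeps a slow plugin (for example one that refetches remote data) from delaying the deploy itself.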

  • Story: Show Search Page

The user opens the default url : http://localhost:8080/rp

  • User opens page in browser
  • Show search screen

  • Story: Add Information

Details how the user can add information to the system

(Start) user presses ’Add Information’ button

Get list of Plugins implementing IInterestedInAdd from Plugin Manager.

For Each Plugin …

  • Start low priority thread
  • call add() on interface

Return to Search screen, showing the message "You can continue to search while we add your Information"

Examples of resources / information that can be added to the system are

  • Local Directory in the format C:\SomeDir\SomeSubDir – or other drive letter.
  • Local File in the format C:\SomeDir\SomeSubDir\Somefile.extension
  • Remote file in format http://someurl/somedir/somepage
  • Special files (local or remote) e.g. *.xml , *.html , *.rss
  • Text Files and Binary Files (e.g. *.doc *.pdf)
  • Add the URL of another (remote) RP application. This will (1) do the search on the remote RP and (2) add the search results (HTML page) to the (local) Knowledgebase.
  • Adding the URL of a Google search will index the Google search results page.
  • Add a local directory containing bookmarks (IE / Mozilla format)
  • Add a local directory containing History (IE / Mozilla format)
  • Story: Normal Search

Details how the user can search for information in the Knowledgebase

The user enters a search term and presses the ’Search’ button. The search term can be simple, e.g. (java j2ee x), or as complex as Lucene allows (e.g. java AND j2ee NOT xml).
Get Search Results

  • Get the list of plugins implementing IInterestedInSearch from the Plugin Manager.
  • For each plugin returned:
      • Start a low priority thread
      • Call search() on the interface
      • Loop until either ’isReady’ returns true or the timeout is reached (the timeout is set in the global / plugin properties file)
      • Call getResults() to get the search results
  • Combine into a Collection of Search Results
  • If there are no results, throw an RPException (to display an error message on the search page)
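
The polling loop in this story could look something like the sketch below. The method names search(), isReady() and getResults() come from the story; their signatures, the use of plain strings for results, the 10 ms polling interval and the unchecked stand-in for RPException are all assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the search plugin interface, inferred from the story.
interface IInterestedInSearch {
    void search(String query);
    boolean isReady();
    List<String> getResults();
}

// Stand-in for the project's RPException (unchecked here for brevity).
class RPException extends RuntimeException {
    RPException(String message) {
        super(message);
    }
}

class SearchCoordinator {

    static List<String> searchAll(List<IInterestedInSearch> plugins, String query, long timeoutMillis) {
        // Start every plugin's search on its own low priority thread.
        for (IInterestedInSearch plugin : plugins) {
            Thread thread = new Thread(() -> plugin.search(query));
            thread.setDaemon(true);
            thread.setPriority(Thread.MIN_PRIORITY);
            thread.start();
        }

        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> combined = new ArrayList<>();
        for (IInterestedInSearch plugin : plugins) {
            // Poll until the plugin reports ready or the shared deadline passes.
            while (!plugin.isReady() && System.currentTimeMillis() < deadline) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            if (plugin.isReady()) {
                combined.addAll(plugin.getResults());
            }
        }
        if (combined.isEmpty()) {
            throw new RPException("No results found for: " + query);
        }
        return combined;
    }
}
```

A single deadline shared by all plugins bounds the total wait; a per-plugin timeout read from each plugin's properties file would be an equally reasonable reading of the story.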

Filter Search Results

  • Get the preferred Plugins implementing IInterestedInFilter from Plugin Manager / as set in config. file.
  • For phase 1, this is BasicIntelligence, or its delegates.
  • Use this class to sort search results

Display search results

Display search results(Sample search results).

  • Story: Feedback from Search Results

How the user can help RP ’learn’ what he or she wants. Subsequent searches return different results in line with what the user requests here.

(Start) The user clicks on one of the feedback links / buttons on the screen to trigger feedback. These are detailed in Appendix B, but examples are:

  • (1)Search query (associates terms like Java J2ee together)
  • Search result (main url link) clicked on
  • Positive feedback (I like this)
  • Negative feedback (not for me)
  • (2)More from this category
  • Category X Use More | Use less

Get the plugins implementing IInterestedInFeedback as defined in the global properties file (this will be BasicIntelligence, which then uses other classes as required for phase 1).

  • call giveFeedback / update on Interface , passing in the feedback.
  • Note of the user feedback is made in FeedbackDatastore
  • BasicIntelligence Class update() method , does quick adjustment of score.
  • When the update method completes , does the original search again and displays results.

Note (1): The original search (as per user story 3.4) automatically triggers feedback and a (re)search; the user is unaware of having given feedback.

Note (2): After giving this feedback, only search results coming from the category that the user clicked on will be displayed. These can be identified by Category name, which should be stored via the BasicIndex class.
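
As a rough illustration of how recorded feedback could change the order of later results, the sketch below keeps a running adjustment per result and adds it to the score from the index. The class names, the fixed step of 1.0 per click and the additive scoring are assumptions for the example only; the text does not specify how BasicIntelligence actually adjusts scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A search hit with the score assigned by the index (names are illustrative).
class ScoredResult {
    final String url;
    final double baseScore;

    ScoredResult(String url, double baseScore) {
        this.url = url;
        this.baseScore = baseScore;
    }
}

// Stand-in for FeedbackDatastore: remembers a running adjustment per result.
class FeedbackStore {
    private final Map<String, Double> adjustments = new HashMap<>();

    // Positive feedback raises the adjustment, negative feedback lowers it.
    void giveFeedback(String url, boolean positive) {
        adjustments.merge(url, positive ? 1.0 : -1.0, Double::sum);
    }

    double adjustmentFor(String url) {
        return adjustments.getOrDefault(url, 0.0);
    }
}

class FeedbackSorter {

    // Returns a new list ordered by base score plus feedback adjustment, best first.
    static List<ScoredResult> sort(List<ScoredResult> results, FeedbackStore store) {
        List<ScoredResult> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble(
                (ScoredResult r) -> r.baseScore + store.adjustmentFor(r.url)).reversed());
        return sorted;
    }
}
```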

  • Story (Exceptions)
  • What to do when something goes wrong

(Start)

If a RPException / other Exception is thrown.

1. If it is an RPException, check whether it carries a UserFriendlyMessage(); if so, log it and display it to the user.

2. If it is any other type of exception, log the details and display a generic error message to the user. The generic error message can be configured via the global config file.

(End)
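
The two branches above can be sketched as below. The accessor name getUserFriendlyMessage(), the constructor arguments and the ErrorReporter helper are assumptions; genericMessage stands in for the value that would be read from the global config file.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative version of RPException: a chained exception that also carries
// a message suitable for showing to the user.
class RPException extends Exception {
    private final String userFriendlyMessage;

    RPException(String message, String userFriendlyMessage, Throwable cause) {
        super(message, cause);
        this.userFriendlyMessage = userFriendlyMessage;
    }

    String getUserFriendlyMessage() {
        return userFriendlyMessage;
    }
}

class ErrorReporter {
    private static final Logger LOG = Logger.getLogger("rp");

    // Logs the error and returns the text to display on the search page.
    static String messageFor(Throwable error, String genericMessage) {
        LOG.log(Level.SEVERE, "Request failed", error);
        if (error instanceof RPException) {
            String friendly = ((RPException) error).getUserFriendlyMessage();
            if (friendly != null && !friendly.isEmpty()) {
                return friendly;
            }
        }
        // Any other exception, or an RPException without a friendly message.
        return genericMessage;
    }
}
```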

User Interface

Screens

Search Screen – bare

Search Screen – with results / allowing for feedback.

  • Browser Output is
  • HTML output to be IE 4/5/6 and Mozilla Firefox 1.0 upwards compatible.
  • No JavaScript HTML Pages.
  • HTTP Post / Get Info
  • All interaction with the browser is by HTTP GET, so that the parameters form part of the URL visible in the address bar of the browser.
  • Bookmarking a URL (used to access the RP application) and recalling it later will cause RP to do the same search.
  • Adding the URL of another (remote) RP application will cause (1) the remote RP to do the search and (2) the local RP to add the search results to its knowledgebase.
  • Java API
  • All the functionality of the system is available via a Java API (the main class being KnowledgeBase manager).
  • 3rd party programs can use the RP application as a library via this API. The Javadoc that is provided as part of the product on the KnowledgeBase manager class gives full instructions on how to interact with the system in this manner.
  • Command Line

All the functionality defined for the HTML interface is also available via the command line. A full readme file is available at xxxx giving details of how to drive the RP system via the command line.

Core Classes, Interfaces and Concepts

  • Plugins are the means by which the system can be easily extended. Plugins are dynamic in that they are discovered and reloaded at runtime (i.e. when the system starts). This section defines the various interfaces that a plugin implements.
  • The main plugin interfaces are:

    1. IPlugin – marks a class as being a plugin.
    2. IInterestedInAdd – register to be notified when new info is added.
    3. IInterestedInFeedback – register to be notified when the user gives feedback.
    4. IInterestedInResultsFilter – register as being able to sort and filter search results.
    5. IInterestedInSearch – register as being able to carry out a search.
  • Other (utility) plugins are:
    1. IDataExtractor
    2. IIndexManager
  • Concrete Implementations of Interfaces
  • The following concrete classes are used in managing plugins that implement these interfaces.
    1. KnowledgeSphereManager

First point of contact for the RP system, and the point at which all the user interfaces converge (it is the controller in the MVC pattern). It provides access to all the RP core functionality. As such it does things such as catching exceptions, managing threads etc.

    2. PluginManager

Responsible for locating and loading plugins. On Application startup (inc Deploy of rp.war)

  • Search for classes implementing IPlugin in rp.war
  • Search for classes in Plugins Directory (specified in Section 7)
  • If no plugins found , log the reason, throw RPException.
  • The Diagram below outlines how plugins relate to each other.
[Diagram: how plugins relate to each other]
<<UI>>: Programmatic (Java API) | Command Line | HTML (Servlet)
<<Singleton>> KnowledgeSphereManager <- 1..1 relation -> PluginManager
CategoryManager
<<IPlugin>>
  L-> Core Plugins: IInterestedInAdd, IInterestedInFeedback, IInterestedInResultsFilter, IInterestedInSearch
  L-> Utility Plugins: IDataExtractor, IIndexManager
  • Other Interfaces in System
  • These interfaces are not exposed externally (like the plugin interfaces) but are used internally to ensure a good, configurable design.
  • ICategory – basic unit of info; many categories make up the database.
  • IFeedback – feedback is how the user teaches the system.
  • IBasicCategoryStore – persistent storage of data as part of the system.
  • INewInformation – items that the user is adding to the RP system.
  • ISearchQuery – something the user wants to find.
  • ISearchResult – what RP finds in response to a search query.
  • Plugins implementing the following interfaces:
  • IInterestedInAdd
    1. CategoryManager, using
        1. IDataExtractor – sees which concrete implementation can handle this type of data
        2. BasicCategory
           Holds a handle to the IDataExtractor that formed it.
           Saves data using BasicCategoryStore.
  • IInterestedInFeedback
    1. BasicIntelligence uses
    2. BasicCategory
    3. BasicCategoryStore
    4. BasicIndex
    5. FeedbackDataStore
  • IInterestedInResultsFilter
    1. BasicIntelligence
  • IInterestedInSearch
    1. BasicIndex uses
    2. BasicCategory
    3. BasicCategoryStore
  • IDataExtractor
    1. FileDataExtractor
    2. XmlDataExtractor
    3. UrlDataExtractor
    4. WebQueryDataExtractor
  • IIndexManager
    1. BasicIndex uses BasicCategoryStore
  • Other Core Classes in the System
  • RPException – extensible / chained exception for the RP system. Contains a user friendly message (for example, how to display errors as per Screen 2, Appendix B)
  • RPCommandLine – Command line entry point to the RP system
  • RP Struts Classes needed to implement HTML interface.

Basic Plugin Implementations

  • The previous section detailed the interfaces by which a plugin can extend the system. This section details the plugins currently implemented and supplied as part of phase 1.
  • Additional / modified classes needed for the system to function as specified are also provided.
  • Where background processes are specified , their priority can be set via the config files.
  • User Events

User events and the (main) classes that handle them are:

  • Add Information
  • CategoryManager (delegating to Categories)
  • Search
  • BasicIndex
  • Feedback
  • FeedbackDatastore (and BasicIndex to update)
  • Startup (onLoad)
  • CategoryManager (refreshing / updating Categories)
  • BasicIntelligence (relinking / rescoring Category and FeedbackDataStore information)
  • BasicIndex – reindexing updated information
  • Category Manager

onLoad() method

  • Get all known Categories
  • Check disk in Dir (section 7) and load all the Categories found there.
  • Persistence Mechanism (uses BasicCategoryStore )
  • Refresh the Category Data (as Background process)
  • For each Category found
  • Get data as per add() method below , save into tmp category. (the original url given by the user is stored in the category , so calling add() again is easy)
  • When ready , copy tmp category over old Category.
  • Notify BasicIntelligence to rescore
  • Notify BasicIndex to reindex

add() method

  • Get a list of all available IDataExtractor plugins from PluginManager
  • If no IDataExtractor plugins are returned, throw RPException
  • For each IDataExtractor, call the canHandle() method and make a note of the int value returned
  • Using the IDataExtractor that returned the highest int value:
      • Construct a new BasicCategory class, passing in the IndexManager (one for the entire RP app), the BasicCategoryStore (as IMetaDataStore) and the IDataExtractor
      • Call the construct() method to start the conversion from information pulled by the DataSource to data as stored by BasicCategoryStore (using a common class / interface produced by the DataExtractor and consumed by BasicCategoryStore)
  • BasicCategoryStore also stores Category info, such as the IDataExtractor that created it, the URL provided during ’Add information’ etc.
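
The "highest canHandle() value wins" selection can be sketched as follows. This is an illustration under simplifying assumptions: canHandle() takes a plain String rather than an INewInfo, and standard Java exceptions stand in for RPException.

```java
import java.util.List;

// Simplified IDataExtractor: the real method takes an INewInfo; a String is
// used here to keep the sketch self-contained.
interface IDataExtractor {
    int canHandle(String newInfo); // -1 means "cannot handle"
}

class ExtractorSelector {

    // Returns the extractor reporting the highest suitability for newInfo.
    static IDataExtractor select(List<IDataExtractor> extractors, String newInfo) {
        if (extractors.isEmpty()) {
            // The text above calls for an RPException here.
            throw new IllegalStateException("No IDataExtractor plugins available");
        }
        IDataExtractor best = null;
        int bestScore = -1;
        for (IDataExtractor extractor : extractors) {
            int score = extractor.canHandle(newInfo);
            if (score > bestScore) {
                best = extractor;
                bestScore = score;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No extractor can handle: " + newInfo);
        }
        return best;
    }
}
```

With the return values described later in this section (1 for any readable file, 10 for XML, 12 for HTML or http:// addresses), the more specific extractor outranks the generic file extractor whenever both can handle the same input.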
  • Data Extractor (implements IDataExtractor)
  • Basic tasks
      • Recognises whether it can or cannot handle a new piece of information
      • Converts the original data format into a format that can be stored by BasicCategoryStore (e.g. as Nodes / Tuples)
      • If adding a file / piece of information with the same name as an existing one, just create a new category with this info (e.g. SameName and SameName1)
  • Methods
      • canHandle(INewInfo as added by user): returns an int depending on how suitable the extractor is to handle the information (or –1 if it cannot handle the info)
      • addData(INewInfo as added by user): extracts data from the data source and converts it to the nested tuple class.
  • Where possible, data extractors should be configurable using local / global properties files, e.g. the amount of data per Node / tuple after parsing.
  • Some sample DataExtractor implementations are below. Additional / modified implementations may be needed to fully implement phase 1.
  • File Data Extractor
      • Handles generic text files.
      • canHandle(INewInfo) - returns 1 if it can open the file using the standard Java File() object, -1 if it cannot
      • Converts the text file into tuples / nodes (object, subject, relation) as follows:
      • 1st pass: anything that looks like a URL is converted to Keyword = Name, Value = href
          • Follow to one level (the index file found at this URL, but do not follow any of the links therein)
      • 2nd pass:
          • Tokenise files into words: groups of letters
          • Take only those words of more than 5 letters (configurable via property file), made up of the characters A-z, 0-9, -![ etc. (configurable via property file)
          • Keyword = Generic, Value = actual word.
      • For later display as a summary (make note of this on the screen shot), take the first X characters (as specified in the config file) and save them as part of data extraction.
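
The second pass can be illustrated with a small tokeniser. The character class used here (letters, digits and the hyphen) is an assumption standing in for the configurable character set, and the WordTokenizer name is hypothetical; each returned word would be stored as a tuple with Keyword = Generic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokenizer {
    // Assumed default character set; the text says this is configurable.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9-]+");

    // Returns, in order of appearance, the words longer than minLetters characters.
    static List<String> keywords(String text, int minLetters) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            String word = matcher.group();
            if (word.length() > minLetters) {
                words.add(word);
            }
        }
        return words;
    }
}
```

For example, `WordTokenizer.keywords("Red-Piranha indexes plain text files quickly", 5)` keeps "Red-Piranha", "indexes" and "quickly" and drops the three shorter words.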
  • Xml Data Extractor
  • Handles XML Format files
  • canHandle(INewInfo) – returns 10 if the fileName as represented by INewInfo ends in .xml or .xhtml (configurable – but tell how it is configured)
  • addInfo (INewInfo)
  • Begin to Traverse XML Tree
  • Name of Element becomes Keyword in Tuple and searchable field in index
  • Value of Element becomes Value in Tuple and the value added under the field in the index.
  • Child Elements become nested child tuple
  • Sample XML Data

<Parent-Element>Parent-Value

<Child-Element>Child-Value</Child-Element>
    </Parent-Element>
  • Maps To

Keyword : Parent-Element Value : Parent-Value

Keyword : Child-Element Value : Child-Value
  • The parent-child node relationships are preserved when converting to the BasicCategory / BasicCategoryStore.
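
The element-to-tuple mapping can be sketched with the DOM parser from the standard Java library. The Tuple class and XmlTupleMapper are hypothetical names; the real extractor would work from an INewInfo and hand its output to BasicCategoryStore.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative nested tuple: element name as keyword, element text as value.
class Tuple {
    final String keyword;
    final String value;
    final List<Tuple> children = new ArrayList<>();

    Tuple(String keyword, String value) {
        this.keyword = keyword;
        this.value = value;
    }
}

class XmlTupleMapper {

    // Parses an XML string and maps its root element to a tuple tree.
    static Tuple parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return toTuple(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static Tuple toTuple(Element element) {
        StringBuilder text = new StringBuilder();
        List<Tuple> kids = new ArrayList<>();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                text.append(node.getNodeValue());   // the element's own text
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                kids.add(toTuple((Element) node));  // child element -> child tuple
            }
        }
        Tuple tuple = new Tuple(element.getTagName(), text.toString().trim());
        tuple.children.addAll(kids);
        return tuple;
    }
}
```

Applied to the sample XML above, this gives a parent tuple (Parent-Element, Parent-Value) with one child tuple (Child-Element, Child-Value).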
  • WebDataExtractor
  • Extracts Data from HTML files (be careful of cross-over with xml files)
  • canHandle(INewInfo) – returns 12 if the fileName as represented by INewInfo ends in .html, .htm, .asp etc. or begins with http:// (configurable – but tell how it is configured: global / local properties)
  • Strip out HTML Elements , then parse as per Generic DataExtractor
  • List the HTML elements to be stripped out in config file (so can easily be extended)
  • Configured to work correctly with
  • Google Search pages
  • Rss (as html / xml feed)
  • & give sample on how to configure it
  • For all HTML (and related files like .asp .jsp etc), we follow the link within the HTML file (to 1 level only) and parse the files found at those links (as if the user added them directly)
  • FileTree Data Extractor
  • Add file tree to system
  • Define canHandle() as can open with File() , isDir = true (confirm method)
  • Walk file tree
  • Pass all files found there to GenericDataExtractor
  • Category refers to each file found as if it was added individually by the user.
  • Basic Category
      • Basic unit of organisation within the Knowledgebase.
      • Simple methods getCategoryName(), getDataExtractor(), getCategoryDataStore(), getSearchAgent(), search and giveFeedBack()
  • Constructors
      • Takes Name (unique), DataSource (DataExtractor), Search Agent and the IndexManager used for later indexing
  • Construct() does the actual work of building
      • If the Name is not unique, make it unique by adding an id to it, e.g. Name1, Name2 etc.
      • Create a new CategoryStore in its own subdirectory (using Name as the name of the directory), under the directories specified in section 7.
      • Extract data to tuples / Java classes using the given DataExtractor
      • Store these tuples / Java classes in the newly created directory using the BasicCategoryStore
      • Pass a handle to the newly created BasicCategoryStore to the IndexManager for later indexing.
      • Store the CategoryName / DataExtractor / original String / last updated / other data during add as part of the CategoryDatastore.
  • OnLoad() / Update
      • Work in conjunction with Category Manager to refetch / rescore / reindex all the data.

  • BasicIndex
      • Carries out searches using the system-wide Index manager
      • Index is stored using Lucene
      • Can be deleted, then recreated using data stored in categories.
      • Indexes all available information, against a keyword / field name if possible.
  • Methods
      • ReIndex(nodeId, NodeDetails): quick reindex of a single node after feedback from the user (see BasicIntelligence for more details), using the same principles as the onLoad() method.
      • ReIndex(categoryDetails): quick reindex of a Category after feedback from the user / refetch of the category (see BasicIntelligence for more details)
      • OnLoad(): background re-index of all data
          • If there is no Lucene index, create a new one
          • Get all categories (from CategoryManager)
          • Pick the oldest category
          • Iterate through the Category (tuple tree)
          • Keyword as field name, value as value
          • Search (by nodeId) to see if the item is already in the index; if it is, remove it.
          • Index as a searchable, retrievable Lucene field.
          • Also index: id, parent id (of tree), date, score and all other details.
          • (tie this off against BasicCategory)
  • BasicCategoryStore
  • Provides Persistent storage of Category information to disk.

1.1.1. Storage Format

  • XML / RDF Data Format (using Jena or a similar library)
  • RDF Node : <Subject> -- relation -- <Object>
  • Node Can be nested
  • <Node 1><Node 2/></Node 1>
  • Data needs to be made XML safe – the ’&’ character is converted to &amp; , and similarly for ’<’ , ’>’ and other special XML characters.
  • Sample Format (real rdf format may differ – this is simplified for clarity)

<Category name="someCatName" original_score="1" calc_score="5" >

<Node unique_id="1234" direct_score="1.5" calc_score="15" last-update="ms">

This is some info parsed from the original source
<Other Tags / Attributes to describe piece of information/>

<Link link-to-id="3333"/>

<Link category="someOtherCategoryName" link-to-id="1111"/>

</Node>

<Node unique_id="3333" … other stuff />

</Category>

  • Notes on this sample
  • Category (root node) has name , and other attributes (such as original url / file link) – not shown
  • Nodes (or tuples in rdf speak) represent basic unit of info. The granularity (how big / small they are in character size) is to be set in global config file
  • Both Category and Node have scores : direct_score and calc_score. Both are set during feedback (BasicIntelligence class). direct_score is set by feedback given directly on this node ; calc_score also depends on the nodes that link to this one.
  • Links. In this sample node 1234 links to both node 3333 (same category) and node 1111 (in another category)
  • Category Name is based on the text supplied by the user when the ’add info’ button was pressed – the first X letters , using characters A-Z and 0-9 (specified in config file). The name must be made unique.
  • Category ID is a unique id , based on a hash of the category name (the same name gives the same id).
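
A minimal sketch of three rules from this section: XML-safe text, category names built from the permitted characters, and a repeatable category id. The class and method names are hypothetical. Assumptions made for illustration: lower-case letters are accepted alongside A-Z, the name is filtered first and then cut to X characters, and String.hashCode() stands in for whatever hash the real system uses.

```java
/** Sketch only: helpers for the storage format rules described above. */
final class StorageFormatSketch {

    /** Escapes the five characters that are special in XML text and attribute values. */
    static String xmlEscape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Keeps only A-Z, a-z and 0-9 from the user's text, stopping at maxLength characters. */
    static String categoryName(String userText, int maxLength) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < userText.length() && name.length() < maxLength; i++) {
            char c = userText.charAt(i);
            boolean allowed = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (allowed) {
                name.append(c);
            }
        }
        return name.toString();
    }

    /** Repeatable id: the same name always gives the same id (assumed hash: String.hashCode). */
    static String categoryId(String categoryName) {
        return Integer.toHexString(categoryName.hashCode());
    }
}
```

A 32-bit hash can collide for different names, so a real implementation would probably still rely on the unique category name as the primary key.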

1.1.2. Information Stored

  • All Information contained in the (Basic)Category , including
  • Overall
  • Category Name (unique within system)
  • Node
  • Unique ID (to this file , so each Node + Category Name is unique to the system)
  • ID is derived from the Node contents so it is easily reproduced (2 separate nodes , each from the same source , should have the same ID)
  • Score (original & calc)
  • Links to (unique ID of other node)
  • Date updated
  • Info linking to original url / piece of information
  • Summary Info (first 50 printable chars) , shown as part of search results. (Printable to be defined in global config file.)
  • Other info as required
  • Like BasicCategoryStore , BasicIndex also stores information on disk (using Lucene indexes). The differences between the two storage mechanisms are:
  • BasicCategory
  • Stores all information
  • Stores Data in XML Format on disk
  • XML file is editable on disk
  • Emphasis on completeness / robustness
  • BasicIndex
  • Extracts information from BasicCategory(s)
  • Stores information in Binary format using Apache Lucene Index (cannot be edited by hand)
  • Index file can be dropped and can be recreated by extracting info from BasicCategory(s)
  • Emphasis is on speed of access.
  • Methods (overloaded)

Includes methods to store and retrieve entire Categories and Nodes / Tuples , by ID number or by name : get the node only , or the node and its nested child nodes.
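
Two of the stored items lend themselves to a short sketch: a node ID derived from the node contents, and the summary made of the first printable characters. The names below are hypothetical. SHA-1 is an assumed choice of hash (the document only says the ID is derived from the contents), and "printable" is assumed to mean ASCII 32 to 126 until the global config file defines it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch only: content-derived node id and search-result summary. */
final class NodeInfoSketch {

    /** Hex SHA-1 of the node text; identical content always yields the identical id. */
    static String nodeId(String content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the digests every Java platform is required to provide.
            throw new IllegalStateException(e);
        }
    }

    /** First maxChars printable characters (assumed here: ASCII 32..126). */
    static String summary(String content, int maxChars) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < content.length() && out.length() < maxChars; i++) {
            char c = content.charAt(i);
            if (c >= 32 && c <= 126) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the id depends only on the text , two nodes extracted from the same source at different times get the same id , which is what allows feedback to be re-applied after a refetch.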

  • FeedbackDataStore

Stores all Feedback given by the user on search results for later use by the system.

  • All Feedback given by the user to the system is stored here. It should be possible to reproduce all the calculated scores / links from this store if they were lost.
  • Like BasicDataStore , storage format is RDF-XML.
  • For phase 1 , only have one instance of FileNameAsSpecifiedInPropertiesfile.xml on disk (in the directory specified in section 7).
  • The storage format is XML , so if the file is added to another RP system (e.g. via the add button on the UI) , it can be read and added to that system.
  • Sample Format (real rdf format may differ)

<Category name="specialFeedbackCategory" >

(i)<Node searchquery="true" date="ms"> SearchTerm1 SearchTerm2 SearchTerm3

</Node>

(ii)<Node score="-1" id="1111" categoryName="categoryName" date="ms">

:Node summary as displayed on screen

</Node>

(iii)<Node score="1" categoryName="categoryName" date="ms">

:Original url used when user added this category

</Node>

</Category>

  • Notes on this sample
  • Nodes store both the id (for our own ease of update) and extra (redundant) pieces of info (useful if we pass this feedback.xml to someone else) – e.g. a summary of the node and / or where the data from the category came from.
  • Score is either +1 or –1 (we then use our own +/- weighting from property file)
  • Mapping Node to feedback types from Screen 2 , appendix B

The scores used are marked as either + or -

  • Node (i) is example of feedback based on the original search query. Used by Search Term (3 +)
  • Node (ii) is example of feedback based on node (either click , or ’not-for-me’) - with key linking to category + node. Used by feedback items (6+) ,(10+) and (11-)
  • Node (iii) is example of feedback based on category (more / less from this category) – linking to category in question. Used by feedback items (8+) ,(14+) and (15-)
  • BasicIntelligence

Where most of the intelligence for the system lies.
The main responsibilities carried out by this class (or its delegates) are:

  • Sorting Search Results
  • Feedback , followed by a quick update of scores
  • More comprehensive update of scores during the onLoad() event

The class interacts with BasicCategory / BasicCategoryStore to implement a scoring system as follows.

  • Each node / category has a direct_score (given by feedback directly on the node) and a calc_score (the direct score adjusted by the scores of other nodes that link to it).
  • There is an implied weighting = calc_score / direct_score. A direct_score of 10 and a calc_score of 47 would imply a weighting of 4.7. It is this weighting that is adjusted during the feedback , after which the calc_score is recomputed.
  • Score information is stored in the index. The value stored in the index is calc_score * category_score , which is used for later sorting of the search results.
  • Sorting Search Results – Implements IInterestedInResultsFilter
  • Phase 1 : Just sorting , filtering to the first X(e.g. 100 ,as set in properties file) of the sorted results.
  • Take combined search results
  • Sort Results set by
  • Results score (after category)
  • Lucene Score
  • By Date
  • To meet performance requirements , suggest organising Lucene index to do this for us automatically.
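
The phase 1 sort order above (results score , then Lucene score , then date) can be sketched with a chained comparator. The Result class and its field names are hypothetical. The sketch assumes that higher scores and newer dates come first , and it sorts in memory rather than asking Lucene to do the ordering as suggested above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: sorts combined search results and keeps the first X. */
final class ResultSortSketch {

    /** Hypothetical result holder; not a class from the RP code base. */
    static final class Result {
        final String nodeId;
        final double resultScore;   // calc_score * category_score, as stored in the index
        final float luceneScore;    // relevance reported by the index search
        final long lastUpdatedMs;   // date of last update in milliseconds

        Result(String nodeId, double resultScore, float luceneScore, long lastUpdatedMs) {
            this.nodeId = nodeId;
            this.resultScore = resultScore;
            this.luceneScore = luceneScore;
            this.lastUpdatedMs = lastUpdatedMs;
        }
    }

    /** Highest results score first, then highest Lucene score, then newest date. */
    static final Comparator<Result> ORDER =
            Comparator.comparingDouble((Result r) -> r.resultScore).reversed()
                    .thenComparing(Comparator.comparingDouble((Result r) -> r.luceneScore).reversed())
                    .thenComparing(Comparator.comparingLong((Result r) -> r.lastUpdatedMs).reversed());

    /** Returns a new list holding at most maxResults entries in sorted order. */
    static List<Result> sortAndLimit(List<Result> combined, int maxResults) {
        List<Result> sorted = new ArrayList<>(combined);
        sorted.sort(ORDER);
        return new ArrayList<>(sorted.subList(0, Math.min(maxResults, sorted.size())));
    }
}
```

The maxResults argument corresponds to the "first X" value that the document reads from the properties file.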

  • Feedback (implementing IInterestedInFeedback)

How the class reacts when notified of the following feedback events

Update()

All Feedback events should already have been added to FeedbackDataStore (as a node). This node is passed to the update method.

For node type (x) do the following :

Node (i)

Break into Words (i.e. at spaces / other characters as set in config file)

E.g. Word1: Java Word2:J2EE Word3:Xml

Find Set of nodes (using the Basic index / Basic Category) that contain these words. One set per word (word1:set1).

Calculate the average of each set’s original score

For each set

Get the number (Z) to adjust nodes by. (For Set 1 , this is (set2score + .. + setXscore) * weighting from property file.)

For each node in set1

Get implied weighting of these nodes

Adjust (either +/-) by (Z)

Recalculate the calc_score , save to Category on disk (using BasicCategory), reindex (using BasicIndex)
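
The calculation of Z for a search-query node can be sketched as below. The class and method names are hypothetical , the query is split at whitespace only , and each word's node set is represented simply by the list of its nodes' original scores.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: per-word adjustment (Z) for feedback based on the search query. */
final class SearchTermFeedbackSketch {

    /** Splits a query at whitespace; the real separators would come from the config file. */
    static String[] words(String query) {
        return query.trim().split("\\s+");
    }

    /** Average of a set's scores; an empty set is treated as 0. */
    static double average(List<Double> scores) {
        if (scores.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    /** For each word's set, Z = (sum of the averages of all OTHER sets) * weighting. */
    static Map<String, Double> adjustments(Map<String, List<Double>> scoresByWord, double weighting) {
        Map<String, Double> averages = new LinkedHashMap<>();
        double total = 0.0;
        for (Map.Entry<String, List<Double>> e : scoresByWord.entrySet()) {
            double avg = average(e.getValue());
            averages.put(e.getKey(), avg);
            total += avg;
        }
        Map<String, Double> z = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : averages.entrySet()) {
            z.put(e.getKey(), (total - e.getValue()) * weighting);
        }
        return z;
    }
}
```

For example , with set averages of 2 , 4 and 6 for Java , J2EE and Xml and a weighting of 0.5 , the Java nodes would be adjusted by (4 + 6) * 0.5 = 5.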

Node (ii)

Find nodes , either using the category name + id , or indirectly , by using the category summary.

Get the implied weighting of these nodes.

Adjust (either +/-) by weighting (from global properties file)

Recalculate the calc_score , save to Category on disk (using BasicCategory), reindex (using BasicIndex)
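
A minimal sketch of the adjust-and-recalculate step , taking the implied weighting as calc_score divided by direct_score (which matches the worked example of 10 and 47 giving 4.7). The names are hypothetical , and adding or subtracting a fixed step per feedback event is an assumption about how the weighting from the properties file is applied.

```java
/** Sketch only: adjusts a node's implied weighting after +1 / -1 feedback. */
final class ScoreSketch {

    /** Implied weighting = calc_score / direct_score. */
    static double impliedWeighting(double directScore, double calcScore) {
        if (directScore == 0.0) {
            throw new IllegalArgumentException("direct_score must not be zero");
        }
        return calcScore / directScore;
    }

    /**
     * Moves the implied weighting up or down by one step and recomputes calc_score.
     * feedback is +1 or -1; step is the weighting read from the properties file.
     */
    static double adjustedCalcScore(double directScore, double calcScore, int feedback, double step) {
        double weighting = impliedWeighting(directScore, calcScore) + feedback * step;
        return directScore * weighting;
    }

    /** Value stored in the index for sorting: calc_score * category_score. */
    static double indexScore(double calcScore, double categoryScore) {
        return calcScore * categoryScore;
    }
}
```

With a direct_score of 10 , a calc_score of 47 and a step of 0.3 , positive feedback raises the weighting from 4.7 to 5.0 and the calc_score to about 50.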

Node (iii)

Find nodes , either using the category name + id , or indirectly , by using the category summary.

Get the implied weighting of these nodes.

Adjust (either +/-) by weighting (from global properties file)

Recalculate the calc_score , save to Category on disk (using BasicCategory),

Reindex (using BasicIndex) all nodes within this category (as their score in the index have changed)

OnLoad()

Low priority task , called on system startup : a systematic recalculation of all scores (allows for user editing of the feedback.xml file , and updates by Category Manager). Similar to the update() event , except:

1. We make a copy of all Category information , but with the calc_score set to zero.

2. We iterate through the FeedBackDataStore nodes (loop 1) , scoring the categories / index as per the update() event.

3. When loop 1 is finished , we copy the Category information (with new scores) over to replace the old Categories.

If the process is interrupted (e.g. power off) then we just pick up again half way through the copy , finish , and replace the old data.

Important: Only direct user feedback changes the scores in FeedbackDataStore

Implementation and Technologies

This section outlines what technologies will be used in implementing the project , and in what way.

  • ’Source Code’ includes the following (and any other item) needed to make and deploy a working project from start to finish.
  • Java / JSP and other code
  • build scripts
  • configuration files
  • 3rd party and other libraries
  • unit tests
  • Code Quality
  • Javadoc and Logging

All methods except accessors fully documented and confirmed with Sun Javadoc checker.

Inline comments (//) such that code is understandable without any further documentation.

Logging (Log4j) statements at regular intervals so that running program flow and actions (across and within) methods can be followed using the logs alone.

  • Core Technology
  • Version of Java
Java 1.3 is preferred. Java 1.3 + New IO Libraries , or Java 1.4 , may be considered if agreed in advance and justified by the additional features required.
  • Servlet - Standard JSP tags – no Java on JSP Pages
  • Logging using Log4j
  • Ant Build scripts

Build from source to War file that deploys onto Tomcat

  • Unit tests

JUnit tests will be written testing all main classes (including those specified in this document) and testing all methods on these classes apart from accessors (get / set methods).

JUnit tests will be written demonstrating each of the user stories (section 3) , driving the application via the Java API.



  • Global and Plugin Level Configuration files

General system wide properties will be stored in a global configuration file.
Plugin level configuration files will be stored in the plugin directory , or named so as to be clearly associated with the plugin.

All configuration files will be in XML format and read (at startup only , not dynamically) using a standard 3rd party library.

The configuration files supplied will also have default values , and comments explaining what the alternative values are / what they do.

Apart from configuration items noted in this document , other items are:

  • For all Java (non-JSP) code : no hard coding of any properties or ’magic numbers’. All such values are to be read from the property file.

Examples of the values found in the configuration files are:

  • Number of search results to show
  • ’Dampening Value’ for use by basic intelligence
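
A hedged sketch of reading the two example values from an XML configuration file at startup. It uses the JDK's own java.util.Properties.loadFromXML rather than the 3rd party library mentioned above , so the file must follow the JDK properties DTD. The key names search.maxResults and intelligence.dampening , and the class name , are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Sketch only: global configuration read once at startup from XML. */
final class GlobalConfigSketch {

    private final Properties props = new Properties();

    GlobalConfigSketch(InputStream xml) {
        try {
            props.loadFromXML(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience factory for tests and examples. */
    static GlobalConfigSketch fromString(String xml) {
        return new GlobalConfigSketch(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** No fallback values in code: a missing key is reported rather than defaulted. */
    private String required(String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Missing configuration value: " + key);
        }
        return value.trim();
    }

    /** Number of search results to show (hypothetical key name). */
    int maxSearchResults() {
        return Integer.parseInt(required("search.maxResults"));
    }

    /** Dampening value used by basic intelligence (hypothetical key name). */
    double dampeningValue() {
        return Double.parseDouble(required("intelligence.dampening"));
    }
}
```

Failing on a missing key keeps the defaults in the supplied configuration file , in line with the rule against hard coded values.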



  • 3rd Party Technologies

1.1.2.1. Presentation Layer

MVC from struts (Spring Considered)

1.1.2.2. Index Search

Lucene latest stable version

1.1.2.3. Meta Data Save

IBM RDF Library or Jena RDF library from HP.

1.1.2.4. Other (if these allow quicker implementation than hand coding a simple implementation)

  • Reading properties files
  • URL scraping
  • Plugin Discovery (Eclipse)


  • Performance
  • For a P4 Machine running only the Tomcat Web Server and Mozilla Firefox Browser:
  • All page requests return within 2 seconds from user click to page completion (single user).
  • A 300Kb text file added to the system should be available for searching within 10 seconds (single user , no other requests being made on the system).
  • When the user requests a search (or makes another click on the web page) , 90% of JVM resources should be devoted to the search / page rendering / fulfilling the user request. Background tasks like re-indexing should take up less than 10% of the time available to the JVM.
  • When not handling a user request , the system should make the best use of available resources within the JVM (i.e. use near 100% on indexing / updating tasks).
  • Major performance bottlenecks should be avoided. This includes:
  • The system should avoid blocking on IO requests – where this is impossible (e.g. for network access) , threading should be used to allow progress in other areas.
  • Sensible caching of data in memory and optimisation of data storage (balance size of files v speed of access).
  • Optimisation of loops.
  • Where possible , use optimised 3rd party (open source) libraries over ’home-grown’ code.


  • System Stability
  • The System can be run for more than 7 days , still meeting the performance requirements above.
  • As a stress test , the system should be able to respond to 600 requests per hour (1 every 6 seconds) , over the 7 day period (mix of search , add and other requests) – not subject to the performance requirements.
  • Where an error / exception occurs , the system should be left in a consistent state , with no loss of data. A subsequent user request to the system should be fulfilled as normal.
  • Where the system is killed (i.e. the Tomcat process is killed rather than halted , or power off) , no file data should be lost or corrupted (or it can be recovered , e.g. the index rebuilt).
  • Where the user starts multiple requests (e.g. 2 adds within one second) the system should respond gracefully , either by queuing the requests , or by displaying an appropriate error message. In all cases the data should be uncorrupted.


  • Directory / War file structure

The rp.war file (as built by the build scripts) will be deployed to the webapps directory. Once auto-deployed by Tomcat into the webapps directory , the file structure will be as follows.

  • \RP

Root directory of the deployed application. JSP Pages required by the system are also found here.

  • \RP\WEB-INF

Standard web.xml as required by tomcat , plus other application level configuration / properties files

  • \RP\WEB-INF\classes

Compiled Interfaces , Core Plugins and Classes that make up the RP system.

  • \RP\WEB-INF\lib

3rd Party Libraries , as required by the system. This folder also contains a Readme.txt explaining what each of the Libraries / Jars is , its version , and where it can be obtained from.

  • \RP\WEB-INF\plugins

Additional plugins (as per sections 5/6 of this document ) that can be added by the user and ’discovered’ by the application on startup.

  • \RP\category
Where Data Persisted by BasicCategoryStore is stored under the structure:

\RP\category\SpecialData

\RP\category\CategoryName1

\RP\category\CategoryName2

\RP\category\CategoryName3 etc..

  • \RP\lucene

Where the files for the Lucene index are stored.

  • \RP\logs

Log4J log output


Sequence of Events

1.12. Talk through what is happening behind the scenes during the section ’Using .... in the Enterprise’

Tech sequence of events:

  • Tomcat startup
  • Spring loaded (with spring components)
  • go to url
  • spring dispatcher servlet in

Conclusion

1.13. Show that it does what we said it was going to do in

1.14. Link to next sample
