Engine Architecture:
                  [ Virtual Search Engine ]
                     |            \
                     |          [ Other Database ]
                     |
            [ Database Services ]
              /       |        \
             /        |         \                          [ P
            /         |          \                           H
      [ Index ]  [ Retrieve ]  [ Search ] ----------------   Y
            \         \         /                            S
             \         \       /                             I
              \         \     /                              C
  [Classing    \         \   /                               A
   Engine] ------- [ DOCTYPES ] -- [ Presentation ]          L
                    /          \
                   /            -------------------------    I
                  /                                          N
                 /                                           D
      [ Document Parser ] -----------------------------      E
           /                                                 X ]
          /
    [ Document(s) ]
In contrast to many engines on the market, IB 2.0 was designed from the start to
support multi-protocol applications, especially Z39.50. It is fully object-oriented
and developed in C++.
- A Virtual Index
- is a mapping (collection) of one or more virtual and/or physical databases.
Virtual indexes may be created arbitrarily at search time and may contain one or more
document types.
- A Physical database
- consists of Field Coordinate Tables (the location of the field data), Multiple
Document Tables (the mapping of docid to physical document) and the word index
(the basis for finding words).
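The relationship between virtual and physical databases described above can be pictured with a minimal sketch. The class names, attributes and methods below are hypothetical illustrations and not the IB 2.0 API (the engine itself is written in C++); the sketch only assumes what the text states: a physical database holds a word index, a docid-to-document table and field coordinates, and a virtual database maps onto one or more physical and/or virtual databases. The limit of 255 physical databases is not enforced here.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalDatabase:
    """Hypothetical stand-in for one physical database."""
    name: str
    # Word index: term -> set of docids (the basis for finding words).
    word_index: dict = field(default_factory=dict)
    # Document table: docid -> location of the physical document.
    doc_table: dict = field(default_factory=dict)
    # Field coordinates: (docid, field name) -> (start, end) offsets.
    field_coords: dict = field(default_factory=dict)

    def add(self, docid, path, text):
        self.doc_table[docid] = path
        # This sketch records a single "body" field spanning the whole text.
        self.field_coords[(docid, "body")] = (0, len(text))
        for word in text.lower().split():
            self.word_index.setdefault(word, set()).add(docid)

    def search(self, term):
        docids = self.word_index.get(term.lower(), set())
        return {(self.name, docid) for docid in docids}


@dataclass
class VirtualDatabase:
    """A mapping over physical and/or other virtual databases."""
    members: list

    def search(self, term):
        # Fan the query out to every member and merge the hits.
        hits = set()
        for member in self.members:
            hits |= member.search(term)
        return hits
```

Because a virtual database only needs a `search` method on its members, it can be assembled at search time from any mix of physical and virtual databases, for example `VirtualDatabase([news, VirtualDatabase([mail])])`.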
- High Performance: Fast Index, Append, Modify and Search.
- Uses an implementation that, compared with PAT array search, provides:
- Significantly higher performance.
- Better maintenance (e.g. fast merging).
- Lower demands on the I/O system.
- Suitability for slow optical storage media.
- More robust indexes.
- A good tradeoff of slightly higher disk storage requirements for speed.
- Small indexes: typically require a space overhead of less than
60% (or as little as 1/3) of the size of the source.
- Typical indexing speed (depending upon hardware): from 200 MB/hour to beyond 1 GB/hour
(e.g. indexing rates of 500 MB/hour measured on a 350 MHz P-II system running Linux).
- One can quickly add new records to the database ("append") without need for
re-indexing.
- One can quickly delete or modify records in the database without need for re-indexing.
- Excellent I/O behaviour for high TPM rates.
- Suitable for CD-ROM and hierarchical storage.
- Scalable from low-cost PC hardware to enterprise-class server clusters.
- Designed to support Gigabytes of data.
- Standard (32-bit O/S) Edition supports:
- Up to 32 million records per physical database
- Up to 500 million words per physical database
- Up to 255 physical databases per virtual database
- Special Edition for 64-bit Operating Systems with a significantly larger capacity
is planned.
- Does not require stopwords for indexing.
- Can search for each and every word in a document (e.g. for searches
like "To be or not to be", "Vitamin A" or "the C++ programming
language").
- Stopwords may optionally be specified at indexing time or at search time. Different
lists can be used, including language-dependent lists.
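To illustrate why indexing every word matters for the phrase examples above, here is a minimal sketch of a positional index with an optional stopword list applied only at search time. The function names and the tokenizer are assumptions made for this example and say nothing about how IB 2.0 is implemented internally.

```python
import re


def tokenize(text):
    # Keep "+" and "#" so that "C++" stays distinct from "C".
    return re.findall(r"[a-z0-9+#]+", text.lower())


def build_index(docs):
    """Map every word, including very common ones, to (docid, position) pairs."""
    index = {}
    for docid, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index.setdefault(word, set()).add((docid, pos))
    return index


def phrase_search(index, phrase, stopwords=frozenset()):
    """Return the docids containing the phrase.

    Words in `stopwords` are skipped at search time, but their positions
    still count, so the remaining words must keep their relative spacing.
    """
    terms = [(i, w) for i, w in enumerate(tokenize(phrase)) if w not in stopwords]
    if not terms:
        return set()
    first_offset, first_word = terms[0]
    hits = set()
    for docid, pos in index.get(first_word, set()):
        start = pos - first_offset
        if all((docid, start + i) in index.get(w, set()) for i, w in terms):
            hits.add(docid)
    return hits
```

With an index built this way, a query such as `phrase_search(index, "to be or not to be")` can be answered even though it consists entirely of words that a stopword-based engine would have discarded at indexing time.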
- National language support.
- Multinational Character Set support.
- Fully fielded search.
- Boolean operators: And, Or, Not, Xor, AndNot, Near, Before, After, Adj and unary Not.
- Term and phrase (literal) search.
- Scan search (what some call "word wheels") to let users browse the index for
candidate search terms (also at field level).
- User-selectable term and phrase weighting.
- Phonetic, case-dependent, right- and left-truncated, glob and other
search attributes.
- "Smart" Date range search.
- Advanced queries in RPN and infix notation.
- Field-level relevance feedback.
- Supports user thesauri.
- User-selectable sort of result sets: chronologically or by relevance.
- Workflow aware:
- Robust synchronous update/append: supports simultaneous indexing and searching
(search remains available during indexing).
- Year 2000 compliant? IB 2.x is Option ONE Year-2000 Compliant.
See the BSn IB 2.x Year 2000 Compliance Statement.
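As an illustration of the RPN query notation and the set-based boolean operators listed among the search features above, the following is a minimal sketch of a postfix evaluator over term-to-docid postings. The query syntax shown (whitespace-separated terms with the operator after its operands) is an assumption for this example; the source does not show IB's actual query grammar. Positional operators such as Near, Before, After and Adj would need word positions and are left out.

```python
def eval_rpn(query, index, all_docs):
    """Evaluate a postfix (RPN) boolean query.

    `index` maps a term to the set of docids containing it and
    `all_docs` is the set of all docids (needed for unary Not).
    """
    binary = {
        "and": lambda a, b: a & b,
        "or": lambda a, b: a | b,
        "xor": lambda a, b: a ^ b,
        "andnot": lambda a, b: a - b,
    }
    stack = []
    for token in query.split():
        op = token.lower()
        if op in binary:
            if len(stack) < 2:
                raise ValueError("operator %r needs two operands" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append(binary[op](left, right))
        elif op == "not":
            if not stack:
                raise ValueError("operator %r needs one operand" % token)
            stack.append(all_docs - stack.pop())
        else:
            # Anything that is not an operator is treated as a search term.
            stack.append(set(index.get(op, set())))
    if len(stack) != 1:
        raise ValueError("malformed RPN query")
    return stack[0]
```

In this notation the infix query `(search And engine) AndNot index` is written `search engine and index andnot`. One limitation of the sketch is that the operator names themselves cannot be used as search terms.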
- The Document Types (DOCTYPES)
- are responsible for the details of the native document format. They include
the document parser and the record syntax conversion and presentation services.
- The standard collection includes SGML/XML and over 50 other native generic document formats.
- The structure of the document can be used to qualify the search.
- Document structure is used in almost all doctypes, not just SGML or HTML.
If the document type/format has a structure (which most formats do), then
this structure is available.
- Automatic and transparent detection of implied hyperlinks.
- References, URLs, Internet mail addresses, record links, external containers etc.
- "On-the-fly" with late hyperlink bindings:
- Hypertext conversion and link detection occurs at runtime.
- Full HTML presentation of all document types ("on-the-fly").
- Structured Text (ASCII) presentation of all document types ("on-the-fly").
- Record-type MIME bindings to launch applications.
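The idea of late hyperlink binding, where links are detected when a record is presented rather than when it is stored, can be sketched as follows. The patterns and the function name are assumptions made for this example; they cover only URLs and mail addresses, not the record links or external containers mentioned above, and they are not taken from IB 2.0.

```python
import html
import re

# Deliberately simple patterns: a URL may not end in sentence punctuation.
LINK_PATTERN = re.compile(
    r'(?P<url>\b(?:https?|ftp)://[^\s<>"]+[^\s<>".,;:!?)])'
    r'|(?P<mail>\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+)'
)


def present_as_html(text):
    """Escape stored plain text and wrap detected URLs and mail addresses in anchors."""
    parts = []
    last = 0
    for match in LINK_PATTERN.finditer(text):
        # Escape the plain text between the previous link and this one.
        parts.append(html.escape(text[last:match.start()]))
        target = match.group(0)
        href = target if match.group("url") else "mailto:" + target
        parts.append('<a href="%s">%s</a>' % (html.escape(href), html.escape(target)))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Since the conversion runs at presentation time, the stored document stays untouched, and a change to the detection rules takes effect without re-indexing.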
- The Classing Engine
- is used only at index time, to determine the appropriate
DOCTYPE for a specific document.
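A classing step of this kind can be sketched as an ordered list of content checks applied to the start of a document at index time. The rules and doctype names below are purely hypothetical; the source does not describe how the Classing Engine actually decides, nor what the real doctypes are called.

```python
import re

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    ("XML", lambda head: head.lstrip().startswith("<?xml")),
    ("HTML", lambda head: re.search(r"<html\b|<!doctype\s+html", head, re.I) is not None),
    ("SGML", lambda head: head.lstrip().lower().startswith("<!doctype")),
    ("MAIL", lambda head: re.match(r"(From |Return-Path:|Received:)", head) is not None),
]


def classify(content, default="PLAINTEXT"):
    """Pick a doctype name by inspecting the first 1024 characters of a document."""
    head = content[:1024]
    for doctype, matches in RULES:
        if matches(head):
            return doctype
    return default
```

The order of the rules matters: the HTML check runs before the generic SGML check, so an HTML document with a DOCTYPE declaration is not classed as plain SGML.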
In Preparation:
- Multi-threaded Search.
- Result caching.
- Numerical, geospatial and other object-oriented searching mechanisms.
- More native doctypes: support for word processor formats.
- Distributed objects (CORBA).
Other Information