Integrating Structured Data and Text: Part 2
Build relationally integrated systems to fully leverage your warehouse investments
David Grossman and Ophir Frieder

We expect that as portals grow, there will be a greater demand to integrate both structured data and text. The knee-jerk reaction is to buy two systems - one for each - but our solution is to use a single relational system to attack the problem. The advantage is that you not only get text retrieval, but you can integrate easily with existing data in the warehouse. Yes, you pay a small price in overhead because you are using a general-purpose tool, but the benefits in integration and query functionality often outweigh the cost of computational resources.

In our last column ("Integrating Structured Data and Text,"
Sept. 18, 2001), we presented an
approach to integrating structured data and text by modeling the text as a relational application.
Our presentation was limited to only simple functionality such as a single keyword search. We now
extend our discussion to include multiple keyword searches, threshold searches (TAND), and relevance
ranking. Recall our two sample documents and our ability to model the multivalued relationship
between terms and documents with a

D1: The GDP increased 2 percent this quarter.
D2: The economic slowdown continued this quarter.

MultiTerm Queries

A query such as "Find all documents with the terms slowdown or recession" is an example of a
multiterm query. A query that integrates both structured data and text would ask to "identify those
documents that contain either slowdown or recession and that occur in a year with a significant
change in salaries." This query requires a multiple term search of the unstructured document data as well as a
structured search of the human resource (HR) data. To show how the text part of this is done (the
structured portion is just a straightforward join on year to the HR tables), you would place all
query terms in a
Although you could construct an
A two-term query using this approach looks like this:
Thus, we use the following alternative:
For many queries, when a clustered index on the QUERY
Note that this approach assumes that only one entry for an instance of a term in a document is
stored in the
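
The idea of placing the query terms in a relation and joining it to the term-document data can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the authors' original listing: the table names `doc_term` and `query_terms`, their columns, the `tokenize` helper, and the `search` function are all hypothetical names chosen for the example. The `min_matches` parameter reflects one reading of a threshold search, namely "return documents that contain at least k of the query terms", which is an assumption here.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc_term (
        doc_id INTEGER,
        term   TEXT,
        PRIMARY KEY (doc_id, term)  -- one row per distinct term per document
    );
    CREATE TABLE query_terms (term TEXT PRIMARY KEY);
""")

# The two sample documents from the column.
documents = {
    1: "The GDP increased 2 percent this quarter.",
    2: "The economic slowdown continued this quarter.",
}


def tokenize(text):
    """Lowercase, strip trailing punctuation, and drop duplicate terms."""
    return {word.strip(".,").lower() for word in text.split()}


for doc_id, text in documents.items():
    conn.executemany(
        "INSERT INTO doc_term VALUES (?, ?)",
        [(doc_id, term) for term in tokenize(text)],
    )


def search(terms, min_matches=1):
    """Return ids of documents containing at least min_matches query terms."""
    # Load the query terms into their own relation, then join on term.
    conn.execute("DELETE FROM query_terms")
    conn.executemany(
        "INSERT INTO query_terms VALUES (?)",
        [(term.lower(),) for term in set(terms)],
    )
    rows = conn.execute(
        """
        SELECT d.doc_id
        FROM doc_term AS d
        JOIN query_terms AS q ON d.term = q.term
        GROUP BY d.doc_id
        HAVING COUNT(*) >= ?
        ORDER BY d.doc_id
        """,
        (min_matches,),
    ).fetchall()
    return [row[0] for row in rows]


print(search(["slowdown", "recession"]))               # either term
print(search(["quarter", "slowdown"], min_matches=2))  # both terms required
```

Because each distinct term is stored once per document, `COUNT(*)` in the grouped join counts matching query terms rather than term occurrences, which is what makes the threshold comparison meaningful. The SQL statement stays the same no matter how many terms the query contains; only the contents of `query_terms` change.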