Text Searching In Oracle
Oracle allows us to search upon text functionality includes fuzzy, stemming (search for plural/singulars), wildcard, proximity, results ranking, and keyword highlights. It also allows the use of a thesaurus ie make a search on widget and find documents that contain the word gadget.
Oracle provides this via the use of domain indexes.
What is a domain index?
A domain index is the hook which allows searching by a client application for specific string searches against content stored in the database.
Much in the same way a normal index works against structured data ie specific columns within a table, the domain index allows retrieval against unstructured data held in LOB columns (we use BLOB & CLOB data types in this release). From an application perspective, this is where we store the documents saved by the users or interfaces– eg Word, Quark, PDF or Wire files.
Domain Index Validity
to check the status of a domain index run the follow sql:
SELECT INDEX_NAME, INDEX_TYPE, TABLE_NAME, DOMIDX_STATUS, DOMIDX_OPSTATUS FROM user_indexes WHERE index_type='DOMAIN'
example output
PRTEXTINDEX2 DOMAIN PRITEM VALID INVALID PRTEXTINDEX3 DOMAIN PRITEM VALID VALID
if the index is marked INVALID, a search against it will likely to fail with ORA-29902.
are the domain indexes being updated?
When documents in the base table are inserted, updated, or deleted, their ROWIDs are held in a DML queue until you synchronize the index.
run the following sql as the schema owner, to determine if there are any tokens awaiting indexing. Until a token is indexed, it will not be found in a search.
SELECT COUNT(*) FROM ctx_user_pending;
Synchronising the Index:
Synchronising the index involves processing all pending updates, inserts, and deletes to the base table. You can do this in PL/SQL with the CTX_DDL.SYNC_INDEX procedure.
The following example synchronises the index with 2 megabytes of memory:
BEGIN ctx_ddl.sync_index('myindex', '2M'); END;
In prestige we supply a script to do this called “Prestige5_Oracle_Optimise.sql”
example code extract :
BEGIN FOR x IN ( SELECT owner,index_name FROM all_indexes WHERE index_type='DOMAIN' AND domidx_status='VALID' ORDER BY index_name ) LOOP BEGIN --change the memory allocated to the syncronisation if fragmentation is too excessive. CTX_DDL.SYNC_INDEX(x.owner || '.' || x.index_name,'50M'); -- endlog; EXCEPTION WHEN OTHERS THEN v_errm := v_errm || SQLERRM; END; END LOOP; END;
Domain Index Optimisation
it’s necessary to regularly rebuild the index via the optimise procedure.
In prestige we supply a script to do this called “Prestige5_Oracle_Optimise.sql”
example code extract :
BEGIN FOR x IN ( SELECT owner,index_name FROM all_indexes WHERE index_type='DOMAIN' AND domidx_status='VALID' ORDER BY index_name ) LOOP BEGIN dbms_output.put_line(x.owner || '.' || x.index_name); --change the duration allocated to the optimisation is not concluding in sufficient time. ctxsys.ctx_ddl.optimize_index(x.owner || '.' || x.index_name,'FULL', maxtime => '45'); EXCEPTION WHEN OTHERS THEN v_errm := v_errm || SQLERRM; END; END LOOP; END;
Domain index latency
Every time a new piece of content is saved into Prestige the content is scheduled to be synchronised to the index. This synchronisation doesn’t happen immediately – instead it occurs via a scheduled job – every 2minutes in this example. This means new content whilst visible to the user, can not be searched against until the synchronisation is complete. Two minutes has been determined as a good balance between user requirements and to reduce the overhead on the database. The timing can be changed to suit local site requirements.
This overhead not only causes I/O on the database, but if too frequent can cause a latency on the index through fragmentation, with the knock on that too great a fragmentation level will cause slowdowns to the user when searching against content.
However, there are index maintenance routines to alleviate this fragmentation via an optimisation routine – which we suggest is run once a day. Again, depending on local site conditions, this period can be adjusted.
Whilst Atex deliver standard scripts for both domain index creation and maintenance, like any other part of the infrastructure maintenance, these need to be monitored and maintained post implementation to suit the system usage by the DBA responsible for the system.
Index Description
to find out the index description, inc showing any lexer’s used, run the following sql : sql> select ctx_report.describe_index(‘indexname’) from dual;
eg :
13:43:08 SQL> SELECT ctx_report.describe_index('PRTEXTINDEX1') FROM dual; =========================================================================== INDEX DESCRIPTION =========================================================================== INDEX name: "AESUSER"."PRTEXTINDEX1" INDEX id: 1081 INDEX TYPE: CONTEXT base TABLE: "AESUSER"."PRITEM" PRIMARY KEY COLUMN: ITEM_ID text COLUMN: ITEM_NAME text COLUMN TYPE: VARCHAR2(255) language COLUMN: format COLUMN: charset COLUMN: =========================================================================== INDEX OBJECTS =========================================================================== datastore: DIRECT_DATASTORE filter: NULL_FILTER section GROUP: NULL_SECTION_GROUP lexer: BASIC_LEXER printjoins: _ wordlist: BASIC_WORDLIST stemmer: ENGLISH fuzzy_match: GENERIC stoplist: BASIC_STOPLIST stop_word: Mr stop_word: Mrs stop_word: Ms
Quantity of tokens per index
to determine the number of occurences per work (token) in a given table:
SELECT token_text AS word, SUM (token_count) AS occurrences FROM DR$PRTEXTINDEX5$I WHERE token_text='SOCCER' GROUP BY token_text
change the index name & token_text as appropriate.
Fragmentation
This is when too many rows are used for storing tokens in the index table.
Fragmentation of a domain index occurs:
1) During index build – insufficient memory is allocated to the index causing insert entries to be flushed once the memory is allocated. The qty of memory used can be specified at index creation time or if omitted it picks up the default (12MB).
2) Amendments to the index through synchronisation/delete. This is both via the frequency and the qty of memory allocated to the synchronisation process :(see index build above).
As mentioned earlier, fragmentation can be reduced via the use of an Optimisation script. As such Atex supply a generic package covering both the synchronisation and optimisation. However the DBA must monitor this, and if the fragmentation is too great due to insufficient memory allocation, then change accordingly.
checking fragmentation
SET TIME ON SET TIMING ON spool c:\temp\index5d_spool EXECUTE CTXSYS.CTX_ADM.SET_PARAMETER ('LOG_DIRECTORY','/tmp'); DROP TABLE output; CREATE TABLE output (result CLOB); EXECUTE ctxsys.ctx_output.start_log('PRTEXTINDEX1.lst'); DECLARE x clob := NULL; BEGIN ctx_report.index_stats('AESUSER.PRTEXTINDEX1',x); INSERT INTO output VALUES (x); ctx_output.end_log; COMMIT; dbms_lob.freetemporary(x); END; / / SET LONG 2000000000 SET head off SET pagesize 10000 SELECT * FROM output;
on the output report, look for the section entitled “FRAGMENTATION STATISTICS”
eg
--------------------------------------------------------------------------- FRAGMENTATION STATISTICS --------------------------------------------------------------------------- total size of $I data: 1,214,977 (1.16 MB) $I rows: 138,357 estimated $I rows if optimal: 126,783 **estimated row fragmentation: 8%**
example script to maintain Domain indexes. Applies to all applications – just call it from the relevant schema as a scheduled job.
CREATE OR REPLACE PACKAGE overnight IS -- Please keep these variables up to date. pk_version CONSTANT VARCHAR2(20) := '1.3'; pk_date CONSTANT VARCHAR2(8) := ''; pk_name CONSTANT VARCHAR2(20) := 'pkg uk_overnight'; pk_author CONSTANT VARCHAR2(40) := ''; -- use "select atex.getversion() from dual" to query package details. /*-- ------------------------------------------------------------------------------------------------------------------ Modifications 23 May 2011 1.3 NB */ PROCEDURE optimise_indexes; PROCEDURE ctx_sync_p5; FUNCTION getversion RETURN VARCHAR2; -- Return package version info END; / CREATE OR REPLACE PACKAGE BODY AESUSER.uk_overnight IS PROCEDURE optimise_indexes IS /*--------------------------------------------------------------------------------------------------------------- Name: optimise_indexes Author: nb. Description Optimise text indexes for running users ---------------------------------------------------------------------------------------------------------------*/ BEGIN DECLARE v_submodule VARCHAR2(50) := 'optimise_indexes'; v_errm VARCHAR2(255); v_debug VARCHAR2(100); BEGIN BEGIN FOR x IN ( SELECT owner,index_name FROM all_indexes WHERE index_type='DOMAIN' AND domidx_status='VALID' ORDER BY index_name ) LOOP BEGIN dbms_output.put_line(x.owner || '.' || x.index_name); --change the duration allocated to the optimisation is not concluding in sufficient time. ctxsys.ctx_ddl.optimize_index(x.owner || '.' || x.index_name,'FULL', maxtime => '45'); EXCEPTION WHEN OTHERS THEN v_errm := v_errm || SQLERRM; END; END LOOP; IF ( v_errm IS NOT NULL ) THEN raise_application_error( -20001, 'Errors ' || v_errm ); END IF; END; END; END optimise_indexes; PROCEDURE ctx_sync_p5 IS /*--------------------------------------------------------------------------------------------------------------- Name: ctx_sync_p5 Author: nb Description sync for text search. Remember to grant execute on ctx_ddl to schema owner eg sql> grant execute on ctx_ddl to aesuser; commit; ---------------------------------------------------------------------------------------------------------------*/ BEGIN DECLARE v_stmt VARCHAR2(2000); v_submodule VARCHAR2(50) := 'run_ctx_p5'; v_errm VARCHAR2(255); v_debug VARCHAR2(100); v_lock VARCHAR2(128); ret PLS_INTEGER; BEGIN dbms_lock.allocate_unique ('sync_aesuser_text_index',v_lock); ret := dbms_lock.request (v_lock, timeout=>0, release_on_commit=>FALSE); CASE ret WHEN 0 THEN -- semaphore available FOR x IN ( SELECT owner,index_name FROM all_indexes WHERE index_type='DOMAIN' AND domidx_status='VALID' ORDER BY index_name ) LOOP BEGIN --change the memory allocated to the syncronisation if fragmentation is too excessive. CTX_DDL.SYNC_INDEX(x.owner || '.' || x.index_name,'500M'); EXCEPTION WHEN OTHERS THEN v_errm := v_errm || SQLERRM; END; END LOOP; ret := dbms_lock.RELEASE (v_lock); WHEN 1 THEN -- semaphore busy dbms_output.put_line(SYSTIMESTAMP||' - an optimise or syncronisation job is currently running.'); ELSE dbms_output.put_line(SYSTIMESTAMP||' - Error on semaphore: '||ret); END CASE; IF ( v_errm IS NOT NULL ) THEN raise_application_error( -20001, 'Errors ' || v_errm ); END IF; END; END ctx_sync_p5; -- ----------------------------------------------------------------------------------------------- FUNCTION getversion RETURN VARCHAR2 IS v_version VARCHAR2(256); BEGIN v_version := pk_name||' v'||pk_version||', date '||TO_CHAR(TO_DATE(pk_date,'YYYYMMDD'))||' by '||pk_author; RETURN v_version; END getversion; END uk_overnight; /
Very usefull, thanks for sharing.