Oracle text search

Text Searching In Oracle

Oracle allows us to search upon text functionality includes fuzzy, stemming (search for plural/singulars), wildcard, proximity, results ranking, and keyword highlights. It also allows the use of a thesaurus ie make a search on widget and find documents that contain the word gadget.
Oracle provides this via the use of domain indexes.

What is a domain index?

A domain index is the hook which allows searching by a client application for specific string searches against content stored in the database.
Much in the same way a normal index works against structured data ie specific columns within a table, the domain index allows retrieval against unstructured data held in LOB columns (we use BLOB & CLOB data types in this release). From an application perspective, this is where we store the documents saved by the users or interfaces– eg Word, Quark, PDF or Wire files.

Domain Index Validity

to check the status of a domain index run the follow sql:

SELECT INDEX_NAME, INDEX_TYPE, TABLE_NAME, DOMIDX_STATUS, DOMIDX_OPSTATUS FROM user_indexes WHERE index_type='DOMAIN'

example output

PRTEXTINDEX2    DOMAIN    PRITEM    VALID    INVALID
PRTEXTINDEX3    DOMAIN    PRITEM    VALID    VALID

if the index is marked INVALID, a search against it will likely to fail with ORA-29902.

are the domain indexes being updated?

When documents in the base table are inserted, updated, or deleted, their ROWIDs are held in a DML queue until you synchronize the index.
run the following sql as the schema owner, to determine if there are any tokens awaiting indexing. Until a token is indexed, it will not be found in a search.

SELECT COUNT(*) FROM ctx_user_pending;

Synchronising the Index:

Synchronising the index involves processing all pending updates, inserts, and deletes to the base table. You can do this in PL/SQL with the CTX_DDL.SYNC_INDEX procedure.
The following example synchronises the index with 2 megabytes of memory:

BEGIN
ctx_ddl.sync_index('myindex', '2M');
END;

In prestige we supply a script to do this called “Prestige5_Oracle_Optimise.sql”
example code extract :

BEGIN
    FOR x IN (  SELECT owner,index_name FROM all_indexes  WHERE  index_type='DOMAIN'   AND domidx_status='VALID'  ORDER BY index_name )
    LOOP
        BEGIN
              --change the memory allocated to the syncronisation if fragmentation is too excessive.
                CTX_DDL.SYNC_INDEX(x.owner || '.' || x.index_name,'50M');
               -- endlog;
        EXCEPTION
            WHEN OTHERS THEN
                v_errm := v_errm || SQLERRM;
        END;
    END LOOP;
END;

Domain Index Optimisation

it’s necessary to regularly rebuild the index via the optimise procedure.
In prestige we supply a script to do this called “Prestige5_Oracle_Optimise.sql”

example code extract :

BEGIN
    FOR x IN (  SELECT owner,index_name FROM all_indexes  WHERE  index_type='DOMAIN'  AND domidx_status='VALID'  ORDER BY index_name )
    LOOP
        BEGIN
                dbms_output.put_line(x.owner || '.' || x.index_name);
        --change the duration allocated to the optimisation is not concluding in sufficient time.
                ctxsys.ctx_ddl.optimize_index(x.owner || '.' || x.index_name,'FULL', maxtime => '45');
            EXCEPTION
            WHEN OTHERS THEN
                v_errm := v_errm || SQLERRM;
        END;
    END LOOP;
END;

Domain index latency

Every time a new piece of content is saved into Prestige the content is scheduled to be synchronised to the index. This synchronisation doesn’t happen immediately – instead it occurs via a scheduled job – every 2minutes in this example. This means new content whilst visible to the user, can not be searched against until the synchronisation is complete. Two minutes has been determined as a good balance between user requirements and to reduce the overhead on the database. The timing can be changed to suit local site requirements.
This overhead not only causes I/O on the database, but if too frequent can cause a latency on the index through fragmentation, with the knock on that too great a fragmentation level will cause slowdowns to the user when searching against content.
However, there are index maintenance routines to alleviate this fragmentation via an optimisation routine – which we suggest is run once a day. Again, depending on local site conditions, this period can be adjusted.
Whilst Atex deliver standard scripts for both domain index creation and maintenance, like any other part of the infrastructure maintenance, these need to be monitored and maintained post implementation to suit the system usage by the DBA responsible for the system.

Index Description

to find out the index description, inc showing any lexer’s used, run the following sql : sql> select ctx_report.describe_index(‘indexname’) from dual;
eg :

13:43:08 SQL> SELECT ctx_report.describe_index('PRTEXTINDEX1') FROM dual;
===========================================================================
                             INDEX DESCRIPTION
===========================================================================
INDEX name:                    "AESUSER"."PRTEXTINDEX1"
INDEX id:                      1081
INDEX TYPE:                    CONTEXT
base TABLE:                    "AESUSER"."PRITEM"
PRIMARY KEY COLUMN:            ITEM_ID
text COLUMN:                   ITEM_NAME
text COLUMN TYPE:              VARCHAR2(255)
language COLUMN:
format COLUMN:
charset COLUMN:
===========================================================================
                               INDEX OBJECTS
===========================================================================
datastore:                     DIRECT_DATASTORE
filter:                        NULL_FILTER
section GROUP:                 NULL_SECTION_GROUP
lexer:                         BASIC_LEXER
   printjoins:                    _
wordlist:                      BASIC_WORDLIST
   stemmer:                       ENGLISH
   fuzzy_match:                   GENERIC
stoplist:                      BASIC_STOPLIST
   stop_word:                     Mr
   stop_word:                     Mrs
   stop_word:                     Ms

Quantity of tokens per index

to determine the number of occurences per work (token) in a given table:

SELECT token_text AS word,
            SUM (token_count) AS occurrences
    FROM  DR$PRTEXTINDEX5$I WHERE token_text='SOCCER'
    GROUP  BY token_text

change the index name & token_text as appropriate.

Fragmentation

This is when too many rows are used for storing tokens in the index table.
Fragmentation of a domain index occurs:
1) During index build – insufficient memory is allocated to the index causing insert entries to be flushed once the memory is allocated. The qty of memory used can be specified at index creation time or if omitted it picks up the default (12MB).
2) Amendments to the index through synchronisation/delete. This is both via the frequency and the qty of memory allocated to the synchronisation process :(see index build above).
As mentioned earlier, fragmentation can be reduced via the use of an Optimisation script. As such Atex supply a generic package covering both the synchronisation and optimisation. However the DBA must monitor this, and if the fragmentation is too great due to insufficient memory allocation, then change accordingly.

checking fragmentation

SET TIME ON
SET TIMING ON
spool c:\temp\index5d_spool
EXECUTE CTXSYS.CTX_ADM.SET_PARAMETER ('LOG_DIRECTORY','/tmp');
DROP TABLE output;
CREATE TABLE output (result CLOB);
EXECUTE ctxsys.ctx_output.start_log('PRTEXTINDEX1.lst');
 DECLARE
        x clob := NULL;
      BEGIN
          ctx_report.index_stats('AESUSER.PRTEXTINDEX1',x);
        INSERT INTO output VALUES (x);
        ctx_output.end_log;
        COMMIT;
        dbms_lob.freetemporary(x);
      END;
 /
/

  SET LONG 2000000000
  SET head off
  SET pagesize 10000
  SELECT * FROM output;

on the output report, look for the section entitled “FRAGMENTATION STATISTICS”

eg

---------------------------------------------------------------------------
                         FRAGMENTATION STATISTICS
---------------------------------------------------------------------------
total size of $I data:                                1,214,977 (1.16 MB)
$I rows:                                                          138,357
estimated $I rows if optimal:                                     126,783
**estimated row fragmentation:                                          8%**

example script to maintain Domain indexes. Applies to all applications – just call it from the relevant schema as a scheduled job.

CREATE OR REPLACE PACKAGE overnight
IS

  --  Please keep these variables up to date.
  pk_version CONSTANT VARCHAR2(20) := '1.3';
  pk_date    CONSTANT VARCHAR2(8)  := '';
  pk_name  CONSTANT VARCHAR2(20) := 'pkg uk_overnight';
  pk_author  CONSTANT VARCHAR2(40) := '';
  --  use "select atex.getversion() from dual" to query package details.

/*-- ------------------------------------------------------------------------------------------------------------------
Modifications
23 May 2011 1.3         NB  

*/

PROCEDURE optimise_indexes;

PROCEDURE ctx_sync_p5;

FUNCTION getversion RETURN VARCHAR2; -- Return package version info

END;
/
CREATE OR REPLACE PACKAGE BODY AESUSER.uk_overnight
IS

PROCEDURE optimise_indexes
IS

/*---------------------------------------------------------------------------------------------------------------
 Name: optimise_indexes

 Author: nb.

 Description
 Optimise text indexes for running users

---------------------------------------------------------------------------------------------------------------*/
BEGIN
DECLARE
  v_submodule VARCHAR2(50) := 'optimise_indexes';
  v_errm VARCHAR2(255);
  v_debug VARCHAR2(100);

BEGIN
            BEGIN
                FOR x IN (  SELECT owner,index_name FROM all_indexes  WHERE  index_type='DOMAIN'  AND domidx_status='VALID'  ORDER BY index_name )
                LOOP
                    BEGIN
                            dbms_output.put_line(x.owner || '.' || x.index_name);
                            --change the duration allocated to the optimisation is not concluding in sufficient time.
                            ctxsys.ctx_ddl.optimize_index(x.owner || '.' || x.index_name,'FULL', maxtime => '45');
                        EXCEPTION
                        WHEN OTHERS THEN
                            v_errm := v_errm || SQLERRM;
                    END;
                END LOOP;

    IF ( v_errm IS NOT NULL )
    THEN
        raise_application_error( -20001, 'Errors ' || v_errm );
    END IF;
END;

END;
END optimise_indexes;
PROCEDURE ctx_sync_p5
IS
/*---------------------------------------------------------------------------------------------------------------
 Name: ctx_sync_p5

 Author:  nb

 Description
 sync for text search.  Remember to grant execute on ctx_ddl to schema owner
 eg sql> grant execute on ctx_ddl to aesuser;

commit;

---------------------------------------------------------------------------------------------------------------*/
BEGIN
DECLARE
  v_stmt    VARCHAR2(2000);
  v_submodule VARCHAR2(50) := 'run_ctx_p5';
  v_errm VARCHAR2(255);
  v_debug VARCHAR2(100);
   v_lock VARCHAR2(128);
   ret    PLS_INTEGER;

BEGIN
     dbms_lock.allocate_unique ('sync_aesuser_text_index',v_lock);
     ret := dbms_lock.request (v_lock, timeout=>0, release_on_commit=>FALSE);
  CASE ret
        WHEN 0 THEN -- semaphore available
            FOR x IN (  SELECT owner,index_name FROM all_indexes  WHERE  index_type='DOMAIN'   AND domidx_status='VALID'  ORDER BY index_name )
            LOOP
                BEGIN
                        --change the memory allocated to the syncronisation if fragmentation is too excessive.
                        CTX_DDL.SYNC_INDEX(x.owner || '.' || x.index_name,'500M');
                    EXCEPTION
                    WHEN OTHERS THEN
                        v_errm := v_errm || SQLERRM;
                END;
            END LOOP;
               ret := dbms_lock.RELEASE (v_lock);
        WHEN 1 THEN -- semaphore busy
                dbms_output.put_line(SYSTIMESTAMP||' - an optimise or syncronisation job is currently running.');
              ELSE
                dbms_output.put_line(SYSTIMESTAMP||' - Error on semaphore: '||ret);
    END CASE;

    IF ( v_errm IS NOT NULL )
    THEN
        raise_application_error( -20001, 'Errors ' || v_errm );
    END IF;
END;
END ctx_sync_p5;
-- -----------------------------------------------------------------------------------------------
FUNCTION getversion RETURN VARCHAR2
IS
 v_version VARCHAR2(256);

BEGIN
 v_version := pk_name||' v'||pk_version||', date '||TO_CHAR(TO_DATE(pk_date,'YYYYMMDD'))||' by '||pk_author;

 RETURN v_version;
END getversion;

END uk_overnight;
/

One thought on “Oracle text search

Leave a comment