Extracting file content
Added by IBM on February 11, 2013 | Version 1 (Original)
To speed up the indexing process, you can use a SearchService command that extracts file content in a process that is separate from indexing.
Before you begin
To use SearchService administrative commands, you must use the IBM® WebSphere Application Server wsadmin client. See Starting the wsadmin client.
About this task
The SearchService.startBackgroundFileContentExtraction command performs file content extraction outside of the indexing process. This command iterates over the persisted files seedlists and, for each file, it extracts the file content according to the specified configuration settings. This process is multithreaded, and is the same file content extraction process that occurs when you run the startBackgroundIndex command.
To extract file content outside of the indexing process, complete the following steps.
- Start the wsadmin client from the following directory of the system on which you installed the Deployment Manager:
is the WebSphere
Application Server installation directory and dm_profile_root
is the Deployment Manager profile directory, typically dmgr01.
You must start the client from this directory or subsequent commands that you enter do not execute correctly.
- After the wsadmin command environment has initialized, enter the following command to initialize the Search environment and start the Search script interpreter:
If prompted to specify a service to connect to, type 1 to pick the first node in the list. Most commands can run on any node. If the command writes or reads information to or from a file using a local file path, you must pick the node where the file is stored.
When the command is run successfully, the following message displays:
Search Administration initialized
- Use the following command:
SearchService.startBackgroundFileContentExtraction(persistence dir, components, extracted text dir, thread limit)
Extracts file content for all files referenced in the persisted seedlists in a process that is independent of the indexing task.
This command takes the following parameters:
persistence dir
A string that specifies the location of the persisted files seedlists.
components
A string that specifies the application or applications for which you want to extract file content. The following values are valid: files, wikis.
extracted text dir
A string that specifies the target location for the extracted text. The same directory structure and naming scheme is used for this directory as for the extracted text directory on the deployment: connections shared data/ExtractedText. For example, ExtractedText/121/31/36cdb7a0-92b2-4cf9-91f3-c4e7e527a5e1.
thread limit
The maximum number of seedlist threads.
For example:
SearchService.startBackgroundFileContentExtraction("/bg_index/seedlists", "files", "/bg_index/ExtractedText", 10)
You typically run this command after running a startBackgroundCrawl command, so that the extraction acts on up-to-date seedlists. If no persisted seedlists are available, the command first behaves as the startBackgroundCrawl command does: the seedlists are crawled and persisted before any content is extracted.
- Verify that the target extracted text directory is populated with the extracted file content.
Open some of the extracted text files in a text editor. You can expect to see the typical format, for example, some header information followed by the extracted content.
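The verification step can be partly automated. The following is a minimal sketch in standard Python 3, run outside wsadmin on the system that holds the directory. It assumes the target directory is the /bg_index/ExtractedText path used in the example command; the function name summarize_extracted_text and its parameters are hypothetical and are not part of the product. The sketch counts the files under the directory and collects the first few lines of a handful of them, so that you can check for header information followed by extracted content.

```python
import os


def summarize_extracted_text(root, sample_size=3, preview_lines=5):
    """Count files under root and preview the first lines of a few of them.

    Hypothetical helper for illustration only.
    """
    total = 0
    samples = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            total += 1
            if len(samples) < sample_size:
                path = os.path.join(dirpath, name)
                # The file encoding is an assumption; undecodable bytes are replaced.
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    lines = [line.rstrip("\n")
                             for _, line in zip(range(preview_lines), handle)]
                samples.append((path, lines))
    return total, samples


if __name__ == "__main__":
    total, samples = summarize_extracted_text("/bg_index/ExtractedText")
    print("Extracted text files found: %d" % total)
    for path, lines in samples:
        print("--- " + path)
        for line in lines:
            print(line)
```

A count of zero suggests that the extraction has not run or wrote to a different location. The script only reads the first lines of each sampled file, so you still need to judge for yourself whether the content looks correct.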
What to do next
- Copy the extracted file content to the directory specified by the WebSphere Application Server environment variable EXTRACTED_FILE_STORE. When the extracted content is stored in this directory and the Search application next detects a file update during indexing, Search can skip converting the file again if the update is a metadata change only. For more information about the EXTRACTED_FILE_STORE variable, see WebSphere Application Server environment variables.
- Complete the steps outlined in the topic Creating a background index to create a background index using the extracted file content.
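The copy in the first step above can be scripted. The following is a minimal sketch in standard Python 3.8 or later, assuming that you have already resolved EXTRACTED_FILE_STORE to a file system path. The helper name copy_extracted_text is hypothetical, and both paths in the usage comment are placeholders. The sketch merges the source tree into the target, keeps the directory structure, and returns the number of files in the source tree.

```python
import os
import shutil


def copy_extracted_text(source_dir, target_dir):
    """Merge source_dir into target_dir, keeping the directory layout.

    Hypothetical helper for illustration only. Files that already exist
    in target_dir under the same relative path are overwritten.
    Returns the number of files found in source_dir.
    """
    if not os.path.isdir(source_dir):
        raise ValueError("Source directory not found: %s" % source_dir)
    # dirs_exist_ok=True (Python 3.8+) allows copying into an existing directory.
    shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)
    return sum(len(files) for _, _, files in os.walk(source_dir))


# Usage (placeholder paths; replace the second with the resolved
# value of EXTRACTED_FILE_STORE):
# copied = copy_extracted_text("/bg_index/ExtractedText", "/path/to/extracted_file_store")
# print("Files copied: %d" % copied)
```

Because existing files in the target are overwritten, run the copy only after you have verified the extracted content.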