General remarks
Basics for WCM Seedlist
WCM Seedlist is the output format used by WebSphere Portal to crawl and index WCM content. It is based on the seedlist framework for seedlist v1.0, which is an ATOM feed based format. It not only lists links, but also provides additional metadata (author, category, publish date, access rights etc.). Furthermore, it tells the crawler how to handle the currently processed link: add to index, remove from index, update in index etc.
Intention of this article
This article describes how to extend the WCM Seedlist which is consumed by search crawlers like PSE, Omnifind, Google etc. You learn how to enrich the provided metadata with your own custom metadata: update, delete or add new values. The idea is to use the custom metadata in extended search queries to provide better or specific search results.
Implementation details of WCM Seedlist
The WCM Seedlist uses the Lotus Search Seedlist API framework provided by WebSphere Portal. The framework consists of a set of interfaces that provide links to documents, such as WCM content items, and additional metadata on these links. The seedlist framework is based on OSGI technology and allows you to register several plugins, each of which can describe different links to documents (file system based, WebSphere Portal Portlets, WCM documents, databases etc.).
The current implementation of the WCM Seedlist is a plugin to the Lotus Search Seedlist framework.
Tested Versions of WebSphere Portal
The modifications described in this article are tested on:
- WebSphere Portal 6.1.x
- WebSphere Portal 7.0.0.1
The details given below are entirely for WebSphere Portal 6.1.x. The last section describes necessary changes for WebSphere Portal 7.0.0.1.
Attachments to this article
The following attachments are required to run the sample in your environment:
- jcr_export.zip: The WCM library used as a sample in this article has been exported into the file system and can be imported to test the WCM Seedlist with custom metadata
- projects.zip: The sample Java code used in this article has been exported from Rational Application Developer 7.5 and zipped as project interchange file
Requirements to run the sample code provided
To run the sample you need:
- WebSphere Portal 6.1.0.4 or 6.1.5.1 (the version of WebSphere Portal is required for importing the WCM libraries which is only supported if using the same fix level on source and target server)
- Rational Application Developer 7.5 (which you will use to edit and modify the sample Java source code)
Steps to extend WCM Seedlist
Make use of OSGI
In order to extend the current implementation of the WCM Seedlist, you have to provide a new OSGI plugin which contains your business logic. But you should not re-invent the wheel: make use of the existing business logic to generate the entries in the seedlist. Therefore the sample code wraps the default implementation. Because the default implementation is based on interfaces, it is safe to wrap.
When your plugin is called, first forward the call to the default implementation. After that, take the information returned by the wrapped plugin logic and process the metadata using your own logic.
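The wrap-then-post-process flow described above can be sketched as follows. This is a minimal illustration only: `SeedlistEntry`, `SeedlistService`, `DefaultService` and `WrappingService` are hypothetical stand-ins invented for this sketch, not types of the Lotus Search Seedlist API. The real interfaces and their signatures are in the sample project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one seedlist entry: an id plus metadata fields.
class SeedlistEntry {
    final String id;
    final Map<String, String> fields = new LinkedHashMap<>();

    SeedlistEntry(String id) {
        this.id = id;
    }
}

// Hypothetical stand-in for the service interface that a plugin implements.
interface SeedlistService {
    List<SeedlistEntry> getDocuments(String seedlistId);
}

// Plays the role of the default implementation that is being wrapped.
class DefaultService implements SeedlistService {
    @Override
    public List<SeedlistEntry> getDocuments(String seedlistId) {
        List<SeedlistEntry> entries = new ArrayList<>();
        SeedlistEntry entry = new SeedlistEntry("e9952f80");
        entry.fields.put("Name", "second quarter results");
        entries.add(entry);
        return entries;
    }
}

// The wrapper: delegate first, then enrich what the delegate returned.
class WrappingService implements SeedlistService {
    private final SeedlistService delegate;

    WrappingService(SeedlistService delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<SeedlistEntry> getDocuments(String seedlistId) {
        // Step 1: forward the call unchanged to the wrapped implementation.
        List<SeedlistEntry> entries = delegate.getDocuments(seedlistId);
        // Step 2: post-process the returned entries with custom logic.
        for (SeedlistEntry entry : entries) {
            // A fixed value keeps the sketch short; real logic would compute it.
            entry.fields.put("department", "sales");
        }
        return entries;
    }
}
```

Because the wrapper only depends on the interface, the default implementation can change internally without affecting the custom code.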
Sample scenario
You can already issue advanced search queries to get results only from one author or from a specific date.
Our sample scenario has a simple requirement: Users want to get search results for their query string limited to content items from their department.
The default WCM Seedlist does not know about departments in our WCM library. The SiteAreas for departments are only valid in the sample scenario.
The same requirement in more technical terms:
Some fields are available (publish date, author etc.) on the WCM Seedlist and can be used in queries.
However, specific metadata fields have to be added to the WCM Seedlist to allow search results to be limited to a specific department using a special query like: [user's query string] +department::"marketing"
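A search front end could assemble such a query from the user's input and the department name. The helper below is a minimal sketch: the class and method names are hypothetical, and only the `+department::"..."` syntax is taken from the example above.

```java
// Hypothetical helper that appends a department restriction to a user query.
class DepartmentQuery {
    static String restrictToDepartment(String userQuery, String department) {
        // Drop double quotes so the department value cannot break the quoted term.
        String safeDepartment = department.replace("\"", "").trim();
        return userQuery.trim() + " +department::\"" + safeDepartment + "\"";
    }

    public static void main(String[] args) {
        // prints: quarter results +department::"marketing"
        System.out.println(restrictToDepartment("quarter results", "marketing"));
    }
}
```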
Question:
How to define a new metadata element, which represents a department for every entry in the WCM Seedlist?
Answer:
Extend the WCM Seedlist with a new metadata element called “department”.
Sample WCM library SiteArea structure
The following table describes example WCM content with 5 different departments and 11 content items. The table describes the Site, SiteAreas and content items available. The content items need to be found by the crawler. The SiteAreas (departments) are not crawled, but every content item needs to know its associated department SiteArea.
We use a rather simple structure. It could be more complex if “Marketing” has more sub-SiteAreas. Then in turn the content items beneath this sub-SiteArea need the “department” metadata as well.
A look at the default implementation
URL for access via web browser
The WCM Seedlist is ATOM feed based. Therefore you can access it via a web browser. An ATOM feed is generated containing an entry for each content item, content link and file resource within the Site specified by the SeedlistId request parameter. The URL to access the default WCM Seedlist looks like this:
http://[host]:[port]/seedlist/myserver?SeedlistId=[WCM library]/[WCM Site name]&Source=com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory&Action=GetDocuments
If you use the WCM library provided with this article, the URL would look like:
https://localhost:10041/seedlist/myserver?SeedlistId=scenario1/intranet&Source=com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory&Action=GetDocuments
You can use the URL to open the WCM Seedlist in a web browser. You can check the ATOM feed response for available metadata.
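If you want to generate the URL from code, for example in a test script, the parts shown above can be assembled as in the following sketch. The class `SeedlistUrl` and its method are hypothetical. Only the path, the parameter names and the Source value are taken from the URLs above.

```java
// Hypothetical helper that assembles the seedlist URL from its parts.
class SeedlistUrl {
    static final String DEFAULT_SOURCE =
            "com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory";

    // baseUrl is scheme, host and port, e.g. "https://localhost:10041".
    // Library and site names containing spaces or special characters would
    // need URL encoding, which this sketch leaves out.
    static String build(String baseUrl, String library, String site, String source) {
        return baseUrl + "/seedlist/myserver"
                + "?SeedlistId=" + library + "/" + site
                + "&Source=" + source
                + "&Action=GetDocuments";
    }

    public static void main(String[] args) {
        System.out.println(
                build("https://localhost:10041", "scenario1", "intranet", DEFAULT_SOURCE));
    }
}
```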
WCM Default Seedlist entry for “Second quarter results”
<atom:entry>
<atom:id>e9952f80433187c9b920b9831112d12f
</atom:id>
<atom:link
href="/wps/mypoc/!ut/p/digest!9kaMGHf8cyfSoHruz-yafw/wcm/path%3a%252F..."
rel="via" title="Second quarter results" />
<atom:content
src="/wps/wcm/myconnect/scenario1/intranet/sales/second%20quarter%20results" />
<wplc:securityId>
6QReDe5BEIJQ8663E03QC6J9CGJRCCPHOI3P062BEGJP46H9C43I56IHP0
</wplc:securityId>
<atom:author>
<atom:name>uid=wpsadmin,o=defaultWIMFileBasedRealm
</atom:name>
</atom:author>
<atom:title>Second quarter results</atom:title>
<atom:updated>2010-07-13T09:23:17+02:00
</atom:updated>
<wplc:action do="insert" />
<wplc:acls>
<wplc:acl>uid=wpsadmin,o=defaultwimfilebasedrealm
</wplc:acl>
<wplc:acl>cn=wpsadmins,o=defaultwimfilebasedrealm
</wplc:acl>
</wplc:acls>
<wplc:fieldInfo id="Name" name="Name"
description="This field shows an item's name" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="Owner" name="Owner"
description="This field shows an item's owner" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="ContentPath" name="ContentPath"
description="This field shows an item's content path" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="AuthoringTemplate" name="AuthoringTemplate"
description="This field shows the used authoring template for this item"
type="string" contentSearchable="true" fieldSearchable="true"
parametric="false" returnable="true" sortable="false"
supportsExactMatch="false" />
<wplc:fieldInfo id="Modifier" name="Modifier"
description="This field shows the last modifier of an item" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="EffectiveDate" name="EffectiveDate"
description="This field shows when an item became effective" type="date"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:field id="Name">second quarter results</wplc:field>
<wplc:field id="Owner">uid=wpsadmin,o=defaultWIMFileBasedRealm
</wplc:field>
<wplc:field id="ContentPath">/scenario1/intranet/sales/second
quarter results</wplc:field>
<wplc:field id="AuthoringTemplate">article</wplc:field>
<wplc:field id="Modifier">uid=wpsadmin,o=defaultWIMFileBasedRealm
</wplc:field>
<wplc:field id="EffectiveDate">Jul 13 2010 09:21:48 CEST</wplc:field>
<atom:published>2010-07-13T09:23:17+02:00
</atom:published>
</atom:entry>
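To check the available metadata programmatically rather than by eye, the `wplc:field` elements of an entry can be read with the standard JDK DOM parser. The following is a minimal sketch, assuming the entry XML is available as a string. The class `SeedlistFieldReader` is hypothetical and not part of the sample project. Namespace awareness is deliberately switched off so that no namespace URIs have to be assumed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SeedlistFieldReader {
    // Returns the wplc:field values of an entry, keyed by their id attribute.
    static Map<String, String> readFields(String entryXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Prefixed names are matched literally, so the fragment does not
            // need to carry xmlns declarations.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(entryXml.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("wplc:field");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                // Collapse the line breaks and indentation seen in the feed output.
                String value = field.getTextContent().trim().replaceAll("\\s+", " ");
                fields.put(field.getAttribute("id"), value);
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse seedlist entry", e);
        }
    }
}
```

For the entry above, the returned map would contain, among others, the key `Name` with the value `second quarter results`.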
Running the sample
Import WCM library
Source code & setup
First you have to set up the sample project. Import the sample code into your Rational Application Developer's workspace.
Download projects.zip (Project interchange from RAD workspace - attachment of this article)
Remark: If you are using WebSphere Portal 7.0.0.1 or greater, use projects_RAD803_WP7.zip instead. This file is a project archive from Rational Application Developer 8.0.3 with projects for WebSphere Portal 7.
Click File > Import
In the dialog click on Other > Project interchange
Browse for “projects.zip” provided with this article
Two projects are imported: ExtWCMSeedlist & ExtWCMSeedlistEAR. ExtWCMSeedlist project contains the logic to extend the seedlist. The corresponding ear file is the one deployed to WebSphere Portal. It packages the ExtWCMSeedlist project.
Resolve dependencies
The ExtWCMSeedlist project requires some JAR files from the WebSphere Portal installation directory to resolve its dependencies.
Create a new classpath variable by clicking on:
Window > Preferences > Java > Build Path > Classpath Variables
Click on “New...”
Enter the name “PORTAL_SERVER_INSTALL”
Click on “Folder...” and choose the path to your local WebSphere Portal installation directory
The dependencies should now be resolved. The projects should not display any errors.
Implementation details
The plugin.xml in the WebContent/WEB-INF directory defines which extension point is used. A factory class is configured to be used.
Plugin.xml
<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="custom.wcm.seedlist"
name="Sample Plugin"
version="1.0.0"
provider-name="ISSL">
<extension point="com.ibm.lotus.search.plugins.seedlist.RetrieverFactory" id="CustomWCMRetrieverFactoryImpl"
name="CustomWCMRetrieverFactory">
<Impl classname="com.ibm.custom.wcm.seedlist.CustomWCMRetrieverFactoryImpl">
<property name="RootSeedlistId" value="0"/>
</Impl>
</extension>
</plugin>
The CustomWCMRetrieverFactoryImpl wraps the retriever factory defined by the default WCM Seedlist implementation.
The factory gets the WCMRetrieverFactory and wraps it in turn into the CustomWCMRetrieverService.
The two classes do not extend/inherit the original WCM Seedlist implementation. They adapt the two classes and forward any method calls to the corresponding methods using the parameters given.
The class CustomWCMRetrieverService provides a private method called “processCustomImplementation”. This method is called once the original code has been executed and an entry set is available. The method iterates over the documents in the entry set and calls the custom implementations of the interface IWCMMetadataExtension.
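The iteration performed by “processCustomImplementation” can be pictured roughly as in the following sketch. `SeedlistDocument`, `MetadataExtension` and `ExtensionRunner` are hypothetical stand-ins written for this illustration. In particular, `MetadataExtension` only imitates the role of IWCMMetadataExtension, whose real signature is defined in the sample project and may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document in the entry set.
class SeedlistDocument {
    final String id;
    final Map<String, String> metadata = new LinkedHashMap<>();

    SeedlistDocument(String id) {
        this.id = id;
    }
}

// Hypothetical stand-in for IWCMMetadataExtension.
interface MetadataExtension {
    void getMetadata(SeedlistDocument document);
}

class ExtensionRunner {
    private final List<MetadataExtension> extensions;

    ExtensionRunner(List<MetadataExtension> extensions) {
        this.extensions = extensions;
    }

    // Hands every document of the entry set to every registered extension.
    void process(List<SeedlistDocument> entrySet) {
        for (SeedlistDocument document : entrySet) {
            for (MetadataExtension extension : extensions) {
                try {
                    extension.getMetadata(document);
                } catch (RuntimeException e) {
                    // Assumption of this sketch: log and continue, so that one
                    // failing extension does not break the whole seedlist.
                    System.err.println("Extension failed for " + document.id + ": " + e);
                }
            }
        }
    }
}
```

Catching the exception per extension is a design choice of this sketch. The sample code may handle errors differently.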
Calling the extended seedlist
The URL to generate the WCM Seedlist has now changed, since you have to use the custom plugin (CustomWCMRetrieverFactory) instead of the default one. The new URL for the sample code provided is:
http://localhost:10040/seedlist/myserver?SeedlistId=scenario1/intranet&Source=custom.wcm.seedlist.CustomWCMRetrieverFactoryImpl&Action=GetDocuments
You can use the URL above to open the WCM Seedlist in a web browser. Check the response (which is in ATOM feed format) for your custom metadata.
The Department Metadata sample
The DepartmentMetadata class implements the interface IWCMMetadataExtension. It provides custom logic to find every entry of the WCM Seedlist in the WCM workspace. Once found, the parents of the document are processed bottom up in the SiteArea hierarchy. In our small sample the lookup is quite fast, since there is only one level to be processed. If the SiteArea structure is more complex the lookup will take longer.
A department SiteArea is identified by its default content based on the Authoring Template “index”. Most of the business logic within the DepartmentMetadata class deals with the bottom-up processing and the identification of the “index” template.
If a department SiteArea is found, its name is put into a new metadata field for the seedlist entry.
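The bottom-up lookup can be sketched independently of the WCM API. In the following illustration, `SiteNode` is a hypothetical, simplified model of the SiteArea hierarchy and `DepartmentLookup` is a hypothetical helper. The actual DepartmentMetadata class works on WCM workspace objects instead.

```java
// Hypothetical, simplified model of a node in the SiteArea hierarchy.
class SiteNode {
    final String name;
    final SiteNode parent;                // null for the Site itself
    final String defaultContentTemplate;  // null if there is no default content

    SiteNode(String name, SiteNode parent, String defaultContentTemplate) {
        this.name = name;
        this.parent = parent;
        this.defaultContentTemplate = defaultContentTemplate;
    }
}

class DepartmentLookup {
    static final String DEPARTMENT_TEMPLATE = "index";

    // Walks from the item's parent up to the root and returns the name of the
    // first SiteArea whose default content uses the "index" template, or null
    // if no such SiteArea exists.
    static String findDepartment(SiteNode contentItem) {
        for (SiteNode node = contentItem.parent; node != null; node = node.parent) {
            if (DEPARTMENT_TEMPLATE.equals(node.defaultContentTemplate)) {
                return node.name;
            }
        }
        return null;
    }
}
```

The loop visits one node per hierarchy level, which is why the lookup takes longer for deeper SiteArea structures.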
This is the new entry for the “Second quarter results” document. Compared to the previous version, it contains an additional fieldInfo element and an additional field element, both with the id “department”:
<atom:entry>
<atom:id>e9952f80433187c9b920b9831112d12f
</atom:id>
<atom:link
href="/wps/mypoc/!ut/p/digest!3V2vcGGxKlrLO0k4PLH3zg/wcm/path%3a%252F..."
rel="via" title="Second quarter results" />
<atom:content
src="/wps/wcm/myconnect/scenario1/intranet/sales/second%20quarter%20results" />
<wplc:securityId>
6QReDe5BEIJQ8663E03QC6J9CGJRCCPHOI3P062BEGJP46H9C43I56IHP0
</wplc:securityId>
<atom:author>
<atom:name>uid=wpsadmin,o=defaultWIMFileBasedRealm
</atom:name>
</atom:author>
<atom:title>Second quarter results</atom:title>
<atom:updated>2010-07-13T09:23:17+02:00
</atom:updated>
<wplc:action do="insert" />
<wplc:acls>
<wplc:acl>uid=wpsadmin,o=defaultwimfilebasedrealm
</wplc:acl>
<wplc:acl>cn=wpsadmins,o=defaultwimfilebasedrealm
</wplc:acl>
</wplc:acls>
<wplc:fieldInfo id="Modifier" name="Modifier"
description="This field shows the last modifier of an item" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="EffectiveDate" name="EffectiveDate"
description="This field shows when an item became effective" type="date"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="ContentPath" name="ContentPath"
description="This field shows an item's content path" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="department" name="department"
description="short form of the document id of the parent department SiteArea"
type="string" contentSearchable="true" fieldSearchable="true"
parametric="false" returnable="false" sortable="false"
supportsExactMatch="false" />
<wplc:fieldInfo id="Owner" name="Owner"
description="This field shows an item's owner" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="Name" name="Name"
description="This field shows an item's name" type="string"
contentSearchable="true" fieldSearchable="true" parametric="false"
returnable="true" sortable="false" supportsExactMatch="false" />
<wplc:fieldInfo id="AuthoringTemplate" name="AuthoringTemplate"
description="This field shows the used authoring template for this item"
type="string" contentSearchable="true" fieldSearchable="true"
parametric="false" returnable="true" sortable="false"
supportsExactMatch="false" />
<wplc:field id="department">sales</wplc:field>
<wplc:field id="Name">second quarter results</wplc:field>
<wplc:field id="AuthoringTemplate">article</wplc:field>
<wplc:field id="EffectiveDate">Jul 13 2010 09:21:48 CEST</wplc:field>
<wplc:field id="Owner">uid=wpsadmin,o=defaultWIMFileBasedRealm
</wplc:field>
<wplc:field id="ContentPath">/scenario1/intranet/sales/second
quarter results</wplc:field>
<wplc:field id="Modifier">uid=wpsadmin,o=defaultWIMFileBasedRealm
</wplc:field>
<atom:published>2010-07-13T09:23:17+02:00
</atom:published>
</atom:entry>
Summary: How to add a new field to an entry?
You have to create a new class which implements “IWCMMetadataExtension”.
Implement the method getMetadata
The method getMetadata receives a document object as a parameter, which represents an entry in the seedlist.
Use the document object to access all fields already defined in the entry.
Use the id to look up the corresponding WCM document via the WCM API.
The class WCMUtils in the sample project provides static helper methods to assist you.
Differences to WebSphere Portal 7.0.0.1
In WebSphere Portal 7 the Lotus Seedlist Framework has changed and therefore some changes are required within the source code to run successfully on WebSphere Portal 7.0.0.1.
Required Jars and their location
To compile, the Dynamic Web Application requires the following Jars from the Portal Installation:
- wp.search.seedlist.wcm.jar from /wcm/prereq.wcm/wcm/shared/app/wp.search.seedlist.wcm.jar
- ilel-seedlist.jar from /prereq/prereq.seedlist/shared/app/ilel-seedlist.jar
All other Jars are provided by the WebSphere Portal 7 runtime in Rational Application Developer. If you do not use Rational Application Developer, several other Jars must be added to the project; they are listed in the comments. I recommend using Rational Application Developer.
Packages and classes changed
Some packages and classes have changed:
- from: com.ibm.lotus.search.providers.content.seedlist to com.ibm.ilel.seedlist
- from: com.ibm.lotus.search.providers.content.seedlist.retriever.Request to com.ibm.ilel.seedlist.retriever.RetrieverRequest
Interfaces and inheritance changed
For simplification, the class CustomWCMRetrieverFactoryImpl no longer inherits from BaseFactoryImp. The class only implements RetrieverFactory.
The class CustomWCMRetrieverFactoryImpl must implement the method getVersion.
Hints and tips for development
If you develop the service and need to deploy it often, you can do the following:
- Restart the application Seedlist_Servlet via the WebSphere Application Server Administration Console. If the application is restarted, the classes registered via OSGI are reloaded.
- Modify the ID of the Seedlist plugin within the plugin.xml before deploying to the Portal Server.
Whether the new Seedlist plugin has been registered successfully in WebSphere Portal can be seen either in the SystemOut log or by opening the following URL, which lists all registered Seedlist plugins:
http://[host]:[port]/seedlist/myserver
The development team and contacts
This sample code has been developed together with the development team in Haifa (responsible for Portal Search Engine and Lotus Seedlist framework) and developers from Boeblingen (responsible for WebSphere Portal and WCM Seedlist implementation). If you have any questions on how the solution works, requirements or if it fits your customers' search scenarios you can contact one of the following developers:
About the author
I'm Thomas Spillecke from IBM Software Services for Lotus. For 5 years I have been engaged in different customer projects as an IT Specialist. I assist customers, colleagues and business partners in setting up projects for WebSphere Portal and Web Content Management. I have gained skills and insights in requirements analysis, infrastructure planning and setup, developing and creating deployment scenarios, training colleagues and users, creating concept papers and implementing JEE applications as solutions to customers' requirements. I have had the technical lead in several projects.