Solr Search Application

Last modified by Admin on 2024/10/01 00:20

Allows searching on the wiki using Apache Solr

Type: XAR
Developed by: XWiki Development Team
Rating: 4 Votes
License: GNU Lesser General Public License 2.1
Bundled With: XWiki Standard

Installable with the Extension Manager

Description

searchPage.png

The default search engine is based on Apache Solr.

XWiki used Lucene as the default search engine up until version 5.1 RC1. On these older versions you can enable Solr from the Search administration section, but you have to re-index the content of your wiki manually because the Solr search module did not support automatic indexing prior to version 5.1.

User Interface

By default the search returns document results sorted by relevance. Depending on the selected result type (see the Result Type facet below), you can also sort the results by:

  • document title, last modification date or last author, if the results are documents

    searchDocumentSort.png

  • file name, file size, upload date or uploader, if the results are attachments

    searchAttachmentSort.png

Besides documents and attachments you can also search for objects and object properties:

searchObjectResult.png

searchObjectPropertyResult.png

Each search result highlights the places where the search keywords have been found. Only one match is displayed initially but you can view all the matches by clicking on the "Highlight all matches" link.

searchHighlighting.png

On the right side you have the search facets that will help you drill down the search results. The displayed facets are always relative to the current search results. You can see that the list changes when you select a facet.

searchFacets

The following facets are available:

  • Result Type: filters results based on their type (documents, attachments, objects and object properties); this determines the index fields that are used and the available sort fields; the document type is selected by default
  • Wiki: filters the results from the selected wiki; this facet is currently displayed only on the main wiki and only if you have multiple wikis
  • Location: filters the results based on their location

      searchLocationFacet.png

  • Language: filters the document results that match the selected language; in case of attachments, objects and object properties, it filters the results based on the language of the document that holds them. Please ensure that the languages you need as facet values are also listed in the supported languages setting of the wiki.

      searchLanguageFacet

  • Last Author: filters the document results based on their last author

      searchUserFacet

  • Creator: filters the document results based on their creator
  • Last Modification Date: filters the results based on the last modification date of the corresponding document

      searchDateFacet

    The date facets (last modification date, creation date and upload date) display a list of predefined date intervals and offer the possibility to specify a custom interval. If you specify only one end point of the interval, you get all dates after (or before) the specified date.

  • Creation Date: filters the results based on the creation date of the corresponding document
  • Object Type: filters results based on the type of object (XClass) they have. E.g. "documents that have Blog Posts", "Panel objects", "properties of Java Script Extension"

      searchObjectTypeFacet

  • File Type: filters results based on the attachment file type. E.g. "documents that have attached images", "text attachments"

      searchFileTypeFacet.png

    File types are grouped by category. You can select both an entire category and a specific file type. Categories can be expanded/collapsed.

  • Uploaded By: filters the results based on the user that uploaded the attachments
  • Upload Date: filters the results based on the date when the attachments were uploaded
  • File Size: filters the results based on the size (in bytes) of the attachments

      searchFileSizeFacet.png

    You can choose between 4 ranges:

    • tiny (less than 10KB)
    • small (between 10KB and 500KB)
    • medium (between 500KB and 5MB)
    • large (more than 5MB)
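
As a rough illustration of how these four ranges partition attachment sizes, here is a minimal JavaScript sketch. The helper name fileSizeCategory is hypothetical, and the sketch assumes that 1KB means 1024 bytes and that each lower bound is inclusive; the boundaries actually used by the facet may differ.

```javascript
// Hypothetical helper: maps a size in bytes to one of the four facet ranges.
// Assumptions: 1KB = 1024 bytes, and each lower bound is inclusive.
const KB = 1024;
const MB = 1024 * KB;

function fileSizeCategory(bytes) {
  if (bytes < 10 * KB) return 'tiny';
  if (bytes < 500 * KB) return 'small';
  if (bytes < 5 * MB) return 'medium';
  return 'large';
}

console.log(fileSizeCategory(2 * KB));  // a 2KB file
console.log(fileSizeCategory(3 * MB));  // a 3MB file
```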

The number of displayed facet values is limited to 5 by default but you can see the rest of the values by clicking on the link following the facet values (each click will show 5 more values). You can select multiple values and the selection is preserved if you submit a new search query.

At the bottom of the search results you can find a link to an RSS feed that provides the most recent results that match the current search query and filters. It contains the same type of information that was included in the old Lucene search RSS feed.

The user interface currently has some limitations related to access rights:

  • It filters out the results that are not viewable by the user, but it does not properly clean up the response and, as a consequence, the pagination might be broken (results with "holes"), see XWIKI-8583.
  • Facets don't take into account rights filtering of results, see XWIKI-13089.

The search UI adapts to the screen size. On small screens (phones) the list of search facets is collapsed and displayed before the search results. You can expand it with one tap.

searchPageMobile.png

Search Syntax

By default, the Solr search engine used in XWiki parses the search query (what you type in the search input) with the Extended DisMax Query Parser. See the Solr documentation for details.

We're currently using a value of 2 for "Minimum should match" (i.e. mm in Solr language and minShouldMatch in XWiki language), which means that when the search query has several terms, it needs to match at least 2 of them. Examples:

  • If you search for "A B" (equivalent of "A OR B") then you'll need to have the 2 terms matching to get a result as in "A AND B" (which is counterintuitive). You can force the OR using "+(A B)".
  • If you search for "A B C" (equivalent of "A OR B OR C"), then it'll act as if it were "(A AND B) OR (A AND C) OR (B AND C)".

The reason we use this minShouldMatch of 2 is that, when searching for multiple terms (using plain text, no special search syntax), it can help obtain more relevant results. However, there is an open issue about this.
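
The effect of a "Minimum should match" value of 2 can be imitated with a small JavaScript toy model. This is only a sketch of the counting rule described above, not how Solr evaluates queries internally. The function name is hypothetical, and the sketch assumes that the effective minimum is capped at the number of query terms, so a single-term query still matches.

```javascript
// Toy model of "minimum should match"; NOT Solr's actual query evaluation.
// Assumption: the effective minimum never exceeds the number of query terms.
function matchesMinShouldMatch(queryTerms, documentTerms, mm) {
  const documentSet = new Set(documentTerms);
  const matched = queryTerms.filter((term) => documentSet.has(term)).length;
  const required = Math.max(1, Math.min(mm, queryTerms.length));
  return matched >= required;
}

const sampleDocument = ['a', 'c', 'd'];
console.log(matchesMinShouldMatch(['a', 'b'], sampleDocument, 2));      // only "a" is found
console.log(matchesMinShouldMatch(['a', 'b', 'c'], sampleDocument, 2)); // "a" and "c" are found
console.log(matchesMinShouldMatch(['a'], sampleDocument, 2));           // single-term query
```

With this model the query "A B" is rejected for a document that contains only A, while "A B C" is accepted as soon as any two of the three terms are present.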

Let's see some examples:

  • Boolean operators: AND, OR, NOT (uppercase), “+” and “-”

    +foo +bar "quick fox" -glitch

    is equivalent to

    foo AND bar OR "quick fox" NOT glitch
  • Search for word "foo" in the title field
    title:foo
  • Search for phrase "foo bar" in the title field
    title:"foo bar"
  • Search for phrase "foo bar" in the title field AND the phrase "quick fox" in the content field
    title:"foo bar" AND doccontent:"quick fox"
  • Search for either the phrase "foo bar" in the title field AND the phrase "quick fox" in the content field, or the word "fox" in the title field
    (title:"foo bar" AND doccontent:"quick fox") OR title:fox
  • Search for word "foo" and not "bar" in the title field
    title:foo -title:bar
  • Wildcard to search for any single character (ex: foo or fox)
    fo?
  • Wildcard to search for any word that starts with "foo"
    foo*
  • Search for any word that starts with "foo" and ends with "bar"
    foo*bar
  • Note that Solr doesn't support suffix matching: title:*foo

  • Restrict the results to pages where the XClass property "bar" has the value "foo"
    "quick fox" AND property.Documents.Code.DocumentsClass.bar:foo
  • Range query
    date:[20020101 TO 20030101]
    number:[* TO 100]
    number:[100 TO *]
    number:[* TO *]
  • Pure negative queries (all clauses prohibited) are allowed
    // finds all documents where hidden is not false
    -hidden:false

    // finds all documents without a value for field
    -field:[* TO *]
  • Boosting fields
    (title:foo OR title:bar)^1.5 (doccontent:foo OR doccontent:bar)
  • Search for all pages that have the AWM tag
    property.XWiki.TagClass.tags:AWM
  • Search for all pages that have either the tag1 or the tag2 tag
    +(property.XWiki.TagClass.tags:tag1 property.XWiki.TagClass.tags:tag2)
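
If you assemble such tag queries from code, a small helper avoids typing the field name for every clause. The following JavaScript sketch is an illustration only: the function name anyTagQuery is hypothetical, and the quoting rule (quote any value that is not a plain alphanumeric token) is an assumption rather than a complete implementation of Solr's escaping rules.

```javascript
// Hypothetical helper: builds "+(field:tag1 field:tag2)" for a list of tags.
// Assumption: values that are not plain alphanumeric tokens are sent as quoted phrases.
function anyTagQuery(tags) {
  const field = 'property.XWiki.TagClass.tags';
  const clauses = tags.map((tag) => {
    const needsQuotes = !/^[A-Za-z0-9_]+$/.test(tag);
    // Escape double quotes and backslashes inside a quoted phrase.
    const value = needsQuotes ? '"' + tag.replace(/(["\\])/g, '\\$1') + '"' : tag;
    return field + ':' + value;
  });
  return '+(' + clauses.join(' ') + ')';
}

console.log(anyTagQuery(['tag1', 'tag2']));
```

Calling it with ['tag1', 'tag2'] produces the same query string as the last example in the list above.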

You should also check the schema of the XWiki Solr Index to see what fields can be used in the Solr queries. If you want to query the XWiki Solr Index programmatically then see the Solr Query API available in XWiki.

Search Debug Mode

If you want to debug the search you can add &debug=true to the search URL query string. You'll get the following information:

  • the query parser used
  • the parsed query (see which index fields are used and what is their priority/boost)
  • the filter queries (see which filters/facets are applied and their values)
  • processing time by search component
  • the score for each search result and the way it was computed

searchDebug.png
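
When you build search links from JavaScript, the debug flag can be appended with the standard URL API. This is a minimal sketch: the helper name and the example search URL are hypothetical, only the debug=true parameter comes from the description above.

```javascript
// Hypothetical helper: returns the given search URL with debug=true added.
function withSearchDebug(searchURL) {
  const url = new URL(searchURL);
  url.searchParams.set('debug', 'true');
  return url.toString();
}

// The URL below is a made-up example of a search page URL.
console.log(withSearchDebug('https://wiki.example.org/xwiki/bin/view/Main/Search?text=solr'));
```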

Search UI Options

It's possible to enable/disable highlighting and faceting. Both are expensive operations, so disabling them when you don't need them can significantly speed up the search UI.

searchOptions.png

Search Administration Section

The default folder that stores the Solr index is <permanent directory>/cache/solr/search. You can change it by adding the property solr.embedded.home to the WEB-INF/xwiki.properties configuration file.

In the "Solr search administration" section (or Global Administration > Search > Solr), choose the action you wish to perform on your wiki:

  • add documents to the Solr index
  • remove documents from the Solr index
  • re-index the wiki

SolrActions.png

then click on "Apply".

If you have programming rights you may also limit the list of documents that will be affected by the selected action using a custom query. You can use either XWiki Query Language (XWQL) or Hibernate Query Language (HQL).

SolrCustomQuery.png

The documents are indexed asynchronously, in a background thread, so the "Apply" button only triggers the selected action. You can see how many documents are in the indexing queue and the estimated remaining time. You can use the search function right away but the search results will contain only documents that have been indexed so far.

SolrIndexQueueStatus.png

Advanced Search Suggest Sources

The Search Suggest feature retrieves live search results from various configurable sources. Each source specifies the search engine to use and the search query. Relying on the search query alone doesn't perform well, at least on Solr, because the query changes whenever the input text changes, so the cache is not used efficiently. It is better to rely on the filter cache, but for this we need to be able to specify the filter query.

You can specify more advanced search parameters in the search query and they will be passed directly to the search engine. As an example, the following statement from the 'query' property of a Search Suggest Source

type:DOCUMENT AND (title:(__INPUT__*) OR name:(__INPUT__*))

can be written as

fq=type:DOCUMENT
qf=title^2 name

This way

  • you will be using the filter query which is the same for all search requests to this Search Suggest Source so it will be cached by Solr
  • you will be able to specify the boost for each field you want to search in
  • the query statement used is '__INPUT__' by default, if not specified.

In order to preserve backward compatibility with existing Solr Search Suggest Sources, we use the following convention:

 A line that doesn't start with 'xxx=' specifies the query statement; in other words, existing Solr Search Suggest Sources are specifying only the query statement.

For example:

foo __INPUT__* bar
fq=type:DOCUMENT
qf=title^2 name

means the query statement is 'foo __INPUT__* bar', which is equivalent to:

q=foo __INPUT__* bar
fq=type:DOCUMENT
qf=title^2 name

See the Solr Search Query API documentation for details on what parameters you can pass to the search engine.
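
The convention described above can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the code XWiki uses: the function name is hypothetical, a parameter line is assumed to look like name=value with a simple identifier as the name, and the __INPUT__ placeholder is assumed to be replaced textually with the user input.

```javascript
// Sketch of the convention: "name=value" lines are search parameters, any other
// non-empty line is the query statement, and the statement defaults to '__INPUT__'.
// Assumption: if several statement lines are present, the last one wins.
function parseSuggestSourceQuery(text, input) {
  let statement = null;
  const params = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '') continue;
    const match = /^([A-Za-z][A-Za-z0-9_.]*)=(.*)$/.exec(line);
    if (match && match[1] === 'q') {
      statement = match[2];
    } else if (match) {
      params.push([match[1], match[2]]);
    } else {
      statement = line;
    }
  }
  if (statement === null) statement = '__INPUT__';
  return { q: statement.split('__INPUT__').join(input), params: params };
}

const parsed = parseSuggestSourceQuery('foo __INPUT__* bar\nfq=type:DOCUMENT\nqf=title^2 name', 'wiki');
console.log(parsed.q, parsed.params);
```

Parameters are kept as a list of pairs rather than an object because a name such as fq can legitimately appear more than once.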

Indexing User

XWiki 14.8-rc-1+  

This is a setting for a somewhat obscure detail concerning search result titles. You can leave this setting alone unless you really care about that detail. 

The indexing process, which keeps the search index up to date with the state of the wiki pages, runs in the background and is usually not associated with any user. Normally it does not need this, as it can access all data without permission checks. (The view permission is instead checked when querying the index and showing the search results.)
There is one exception to this: if the title of a page is computed by a sheet, then viewing that sheet must be allowed. If the sheet cannot be viewed by guest users (especially if your wiki is closed to the public), then the computation of the title does not happen for the values stored in the search index.

One case where this effect is visible is the profile pages of the users. If your wiki is closed to guest users, the search hits of user profiles will show the username (usually the login), not "Profile of «firstname» «lastname»".

To show the same title in the search results as it is shown on the page itself, the search indexer needs to access the pages with a user that has view rights to the given sheets (e.g. in case of user profiles to the XWiki.XWikiUserSheet page). This user can be defined in the setting of this section.

search-admin-indexer-user.png
Form to Set the Indexing User

It is highly recommended to create a dedicated, locked user account for this setting, if the setting is used at all. This user should only be a member of the XWikiAllGroup, or maybe of groups with even more restricted view rights, if you have defined such groups. If you use a user with more view rights than some other users of your wiki, then these users can use the search results to access pages that they would otherwise have no access to.
Under no circumstances should an admin account (or the "superuser" account) be used here.

An empty value (the default setting) corresponds to using the guest user.

If you are not completely certain about the implications of this setting, then do not fill in a value here.
It is better to have slightly odd search results than to compromise the security of your wiki.

This value can only be set on the main wiki as the indexing process is a global process, affecting all wikis.

Changing this value does not affect the current state of the index. An explicit reindex of the whole wiki is required to apply the change. This can be done via the "Search" section of the wiki administration, as explained above.

For Developers

What is Solr?

Apache Solr is a search engine. You index a set of documents (e.g. wiki pages) and then you ask Solr to return the set of documents that match the user query.

What is a search index?

The fastest way to retrieve pages in a book related to a keyword is by scanning the index at the back of a book, not by searching every word of every page of the book. If we translate this to XWiki, performing a database search is not the best way to search for some keyword. There is a better way.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).

bookIndex.jpeg
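
To make the page->words versus word->pages inversion concrete, here is a minimal JavaScript sketch that builds an inverted index from a few pages. The page names and contents are made up for the example, and the tokenization (lowercasing and splitting on non-word characters) is far simpler than what Solr does.

```javascript
// Builds a word -> set of page names map from a page name -> content map.
function buildInvertedIndex(pages) {
  const invertedIndex = new Map();
  for (const [pageName, content] of Object.entries(pages)) {
    const words = content.toLowerCase().split(/\W+/).filter(Boolean);
    for (const word of words) {
      if (!invertedIndex.has(word)) invertedIndex.set(word, new Set());
      invertedIndex.get(word).add(pageName);
    }
  }
  return invertedIndex;
}

// Made-up sample pages.
const bookIndex = buildInvertedIndex({
  'Main.WebHome': 'Welcome to the wiki',
  'Help.Search': 'Search the wiki with Solr'
});
console.log([...bookIndex.get('wiki')]);  // pages that contain "wiki"
console.log([...bookIndex.get('solr')]);  // pages that contain "solr"
```

Looking up a keyword is now a single map access instead of a scan over the content of every page.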

How does Solr represent data?

  • In Solr, a document is the unit of search and index.
  • An index consists of one or more documents, and a document consists of one or more fields.
  • In database terminology, a document corresponds to a table row, and a field corresponds to a table column.

You can view the Solr index as a database with a single table.

Solr Schema

The Solr schema describes the list of document fields, their type and how to index and search each of them.

The current schema is defined in the following XML file, which de facto gives a detailed, updated and comprehensive description of the default schema we use: https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwiki-platform-search/xwiki-platform-search-solr/xwiki-platform-search-solr-server/xwiki-platform-search-solr-server-core-search/src/main/resources/conf/managed-schema.xml

Fields can use basic field types (int, boolean, date, string) or complex field types which combine one tokenizer with multiple filters. E.g.:

<analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.SnowballPorterFilterFactory"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
  1. Input: "flip flipped flipping"
  2. After Tokenizer: "flip", "flipped", "flipping"
  3. After Snowball: "flip", "flip", "flip"
  4. After Remove Duplicates: "flip"

The text is analysed at index time but also at query time. Each field type needs to specify the index analyzer and the query analyzer. Most of the time they are the same, but there are cases where we want them to be different.
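
The analysis chain shown above (tokenizer, stemmer, duplicate removal) can be imitated with three small JavaScript functions. This is a toy sketch: naiveStem is a made-up stand-in that only strips "ing" and "ed" and is not the Snowball algorithm, and the duplicate removal follows the simplified description above, whereas the actual Solr filter only drops duplicate tokens that share the same position.

```javascript
// Toy stand-ins for the three analysis steps; the stemmer is NOT Snowball.
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean);

function naiveStem(token) {
  // Strip a trailing "ing" or "ed", then collapse a doubled final consonant ("flipp" -> "flip").
  const stripped = token.replace(/(ing|ed)$/, '');
  return stripped.replace(/([^aeiou])\1$/, '$1');
}

const removeDuplicates = (tokens) => [...new Set(tokens)];

function analyze(text) {
  return removeDuplicates(tokenize(text).map(naiveStem));
}

console.log(tokenize('flip flipped flipping'));
console.log(analyze('flip flipped flipping'));
```

Applying the same analyze function to both the indexed text and the query text is what lets a search for "flipping" find a page that only contains "flipped".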

Defining a field

Here's what a field declaration looks like:

<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
  • name: the name of the field
  • type: the field type (controls how the field is analysed at index / query time)
  • indexed: whether the field should be added to the inverted index or not
  • stored: whether the original value of this field should be stored or not (required for highlighting)
  • multiValued: can this field have multiple values?

You cannot search for a field that is not indexed and you cannot access from the search result the value of a field that was not stored.

Example of a field that is not stored: title_sort. At the moment, XWiki's Solr schema doesn't have fields that are not indexed.

Solr Search Relevancy

Solr (actually Lucene under the hood) uses a scoring model known as tf-idf. This scoring model involves a number of scoring factors:

  • Term Frequency (tf): The frequency with which a term appears in a document. Given a search query, the higher the term frequency, the higher the document score.
  • Inverse Document Frequency (idf): The rarer a term is across all documents in the index, the higher its contribution to the score.
  • Coordination Factor (coord): The more query terms that are found in a document, the higher its score.
  • Field length (fieldNorm): The more words that a field contains, the lower its score. This factor penalizes documents with longer field values.
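
The interplay of term frequency, inverse document frequency and field length can be illustrated with a simplified JavaScript scorer. This is a sketch under loud assumptions: the formula (tf multiplied by a smoothed idf, divided by the square root of the field length) is chosen for readability, it leaves out the coordination factor, and it is not the formula Lucene's similarity implementations use.

```javascript
// Simplified tf-idf: for each document, sum tf * idf over the query terms,
// then divide by sqrt(field length). Illustrative only, not Lucene's formula.
function tfIdfScores(documents, queryTerms) {
  const tokenized = documents.map((text) => text.toLowerCase().split(/\W+/).filter(Boolean));
  return tokenized.map((tokens) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = tokens.filter((token) => token === term).length;
      const documentFrequency = tokenized.filter((other) => other.includes(term)).length;
      const idf = Math.log(1 + tokenized.length / (1 + documentFrequency));
      score += tf * idf;
    }
    return tokens.length > 0 ? score / Math.sqrt(tokens.length) : 0;
  });
}

// Made-up sample documents.
const sampleScores = tfIdfScores(
  ['solr search', 'solr search engine for the wiki', 'wiki pages'],
  ['solr']
);
console.log(sampleScores);
```

The first two documents contain "solr" exactly once, but the shorter one gets the higher score because of the field length factor; the third document does not contain the term and scores zero.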

In addition to the scoring factors mentioned above, the primary method of modifying document scores is by boosting. There are two kinds of boosts: index-time and query-time. Index-time boosts are applied when adding documents, and apply to the entire document or to specific fields. Query-time boosts are applied when constructing a search query, and apply to specific fields.

Query boosts are applied by appending the caret character ^ followed by a positive number to query clauses.

title:foo OR (title:foo AND title:bar)^2.0 OR title:"foo bar"^10

Solr Request Handler and Components

  • Request handlers are responsible for accepting search queries, performing searches and returning the results.
  • They actually delegate the work to a series of components
    •  query, facet, moreLikeThis, highlight, stats, debug, Spell Checking, suggester
  • Request handlers and components are configured in solrconfig.xml. We can override this configuration using query parameters.

Solr and XWiki

  • Solr is the default search engine
  • We index wiki pages, attachments, objects and object properties
  • The index is updated by:
    • Start-up sync between the database and the Solr index
    • saving/deleting XWiki entities
    • manual trigger from the Solr Search Administration Section
  • Solr can be used embedded (default) or as an external service
  • Exposed through the QueryManager (like hql or xwql)
  • Used by search suggest and the main search page

Search UI Configuration

All configuration parameters for the Solr Search UI can be found in Main.SolrSearchConfig. This simplifies the process of customizing the search UI.

For example you can find in it the default weights used for the various search fields:

{{velocity output="false"}}
#set ($__defaultSolrConfig = {
  'queryFields': {
    'DOCUMENT': 'title^10.0 name^10.0
                 doccontent^2.0
                 objcontent^0.4 filename^0.4 attcontent^0.4 doccontentraw^0.4
                 author_display^0.08 creator_display^0.08
                 comment^0.016 attauthor_display^0.016 spaces^0.016',
    'ATTACHMENT': 'filename^5.0 attcontent attauthor_display^0.2',
    'OBJECT': 'objcontent',
    'OBJECT_PROPERTY': 'propertyvalue'
  },
[...]

It also allows application developers to easily create a dedicated search page for their application data. As an example, we updated the FAQ application to use the new configuration parameters:

{{include reference="XWiki.SearchCode"/}}

{{velocity output="false"}}
#if ($searchEngine == 'solr')
  ## Customize the Solr Search UI for the FAQ application.
  #set ($solrConfig = {
    'queryFields': 'title^3 property.FAQCode.FAQClass.answer',
    'facetFields': ['creator', 'creationdate', 'author', 'date', 'mimetype', 'attauthor', 'attdate', 'attsize'],
    'filterQuery': [
      'type:DOCUMENT',
      "wiki:$xcontext.database",
      "space_exact:$doc.space",
      'class:FAQCode.FAQClass'
    ]
  })
#end
{{/velocity}}

{{velocity}}
{{include reference="$searchPage"/}}
{{/velocity}}

Sorting on Object Properties

#set ($solrConfig = {
...
'sortFields': {
    'DOCUMENT': {
      'property.XWiki.AverageRatingsClass.averagevote_sortFloat': 'desc'
    },
...

See the XWiki Solr Index documentation for details.
Don't forget to add an alias for the new sort field in Main.SolrTranslations.

Faceting on Object Properties

The Solr Search Query API describes how you can add a facet that is based on an object property.

You can achieve the same using the Search UI configuration page, Main.SolrSearchConfig:

...
'facetFields': ['type', ..., 'attsize', 'property.Blog.BlogPostClass.publishDate_date'],
...

For example if you wish to add a "Tag" facet for filtering on tags, you can add the property.XWiki.TagClass.tags_string facet field value as in:

...
  'facetFields': ['type', 'wiki', 'space_facet', 'locale', 'author', 'creator', 'date',
    'creationdate', 'class', 'name_exact', 'mimetype', 'attauthor', 'attdate', 'attsize', 'property.XWiki.TagClass.tags_string'],
...

The facet displayer used for an XClass property is determined based on the property type, and can also be configured using either the facetDisplayers or the facetDisplayersByPropertyType configuration parameter. Note that you can even create your own custom facet displayer. Take a look at the existing facet displayers like Main.SolrUserFacet or Main.SolrFileSizeFacet. The facet displayer is a wiki page. The displayer code is in the page content.

Wikis Searchable from Main Wiki

You can restrict the list of wikis that are searchable by default from the main wiki by defining the following Velocity variable in a page that includes Main.SolrSearch:

#set ($wikisSearchableFromMainWiki = ["wiki1", "wiki2", "wiki3"])

Search Globally from a Subwiki

If you want to search globally (in all wikis) from a specific subwiki then you can edit Main.SolrSearchConfig and add this Velocity code at the end:

## Custom Solr Search Configuration.
## Add back the wiki facet (removed above).
#set ($discard = $solrConfig.facetFields.add(1, 'wiki'))
## By setting a filter query we are overwriting the default one.
#set ($discard = $solrConfig.filterQuery.add('wiki:*'))
## Preserve some of the default filter query items.
#if ($xwiki.getUserPreference('displayHiddenDocuments') != 1)
  #set ($discard = $solrConfig.filterQuery.add('hidden:false'))
#end

If you want to search only in a limited list of wikis then you should use something like:

#set ($discard = $solrConfig.filterQuery.add('wiki:(xwiki subwiki1 subwiki2)'))

Note that the changes made to Main.SolrSearchConfig don't affect the Search Suggest. For this you need to edit the search suggest sources (that use the Solr search engine) from that subwiki (using the dedicated administration section) and add this filter query:

fq=wiki:*

or one that looks like:

fq=wiki:(xwiki subwiki1 subwiki2)

Check out the Advanced Search Suggest Sources section above for more details.

Translations

Translations for the default Solr search page can be found in Main.SolrTranslations. To use custom translations in a custom Solr search implementation, create a dedicated translation page (e.g. Main.SolrFAQTranslations) with content like this:

solr.field.property.FAQCode.FAQClass.answer=Answer for question

and load these translations before {{include reference="$searchPage"/}} with the code below:

#set ($discard = $services.localization.use('document', 'Main.SolrFAQTranslations'))

JSON Service

You can send requests to the Solr search service from your JavaScript code to get search results in JSON format:

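require(['jquery'], function($) {
  // URL of the Solr suggest service, called with the 'get' action.
  var solrServiceURL = new XWiki.Document('SuggestSolrService', 'XWiki').getURL('get');
  $.post(solrServiceURL, {
    outputSyntax: 'plain',
    media: 'json',
    // One search parameter per line; __INPUT__ stands for the value of 'input'.
    query: [
      'q=*__INPUT__*',
      'fq=type:DOCUMENT',
      'fq=class:XWiki.XWikiUsers',
      'qf=property.XWiki.XWikiUsers.last_name^10 property.XWiki.XWikiUsers.first_name^5 name^2.5'
    ].join('\n'),
    input: $('.userPicker').val()
  }).done(function(results) {
    // 'results' is expected to be the array of items described below.
    console.log(results);
  });
});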

The response is an array with items that include all the information indexed by Solr. An item looks like this:

{
 "id": "xwiki:XWiki.Admin_",
 "hidden": false,
 "wiki": "xwiki",
 "spaces": ["XWiki"],
 "name": "Admin",
 "locale": "",
 "language": "",
 "type": "DOCUMENT",
 "fullname": "XWiki.Admin",
 "title_": "Profile of Administrator ",
 "doccontentraw_": "",
 "doccontent_": "",
 "version": "3.1",
 "doclocale": "",
 "locales": ["", "en"],
 "lang": ["", "en"],
 "author": "xwiki:XWiki.Admin",
 "author_display": "Administrator",
 "creator": "xwiki:XWiki.Admin",
 "creator_display": "Administrator",
 "creationdate": 1108463850000,
 "date": 1414057869000,
 "property.Dashboard.UserDashboardPreferencesClass.displayOnMainPage_boolean": [false],
 "object.Dashboard.UserDashboardPreferencesClass__": ["false"],
 ...
}

So you have access to all the document metadata, including objects and attachments.
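
As an illustration of how such an item could be consumed on the client side, the following JavaScript sketch extracts a few fields for display. The helper name summarizeResult is hypothetical; the field names are the ones visible in the sample item above, and the sketch assumes that date is a timestamp in milliseconds.

```javascript
// Hypothetical helper: picks a few display fields from one result item.
// Assumption: 'date' is a timestamp in milliseconds since the Unix epoch.
function summarizeResult(item) {
  return {
    reference: item.wiki + ':' + item.fullname,
    title: (item.title_ || item.name || '').trim(),
    lastAuthor: item.author_display,
    modified: new Date(item.date).toISOString()
  };
}

// Subset of the sample item shown above.
const sampleSummary = summarizeResult({
  wiki: 'xwiki',
  fullname: 'XWiki.Admin',
  name: 'Admin',
  title_: 'Profile of Administrator ',
  author_display: 'Administrator',
  date: 1414057869000
});
console.log(sampleSummary);
```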

The service has two main parameters: query and input. See the Advanced Search Suggest Sources section for more details on how to use them.

Dedicated Solr core

It is possible to "reserve" and initialize a dedicated Solr core by implementing the component role org.xwiki.search.solr.SolrCoreInitializer. You can also access a specific core using org.xwiki.search.solr.Solr#getClient(String).

An org.xwiki.search.solr.AbstractSolrCoreInitializer is provided to make it easier to implement org.xwiki.search.solr.SolrCoreInitializer. It comes with the following features:

  • calls getVersion() to know the current specification version of the core
  • calls createSchema() when the schema does not exist yet
  • calls migrateSchema(long cversion) when the schema already exists but is in an older version
  • provides a lot of helper methods to create field types and fields, including support for a virtual Map field.

An org.xwiki.search.solr.SolrUtils component also offers helpers to manipulate Solr cores outside of the creation/initialization phase.

Prerequisites & Installation Instructions

We recommend using the Extension Manager to install this extension (make sure that the text "Installable with the Extension Manager" is displayed at the top right of this page to know whether this extension can be installed with the Extension Manager). Note that installing extensions while offline is currently not supported; you would need to use a more complex manual method.

You can also use the following manual method, which is useful if this extension cannot be installed with the Extension Manager or if you're using an old version of XWiki that doesn't have the Extension Manager:

  1. Log in to the wiki with a user having Administration rights
  2. Go to the Administration page and select the Import category
  3. Follow the on-screen instructions to upload the downloaded XAR
  4. Click on the uploaded XAR and follow the instructions
  5. You'll also need to install all dependent Extensions that are not already installed in your wiki

Dependencies

Dependencies for this extension (org.xwiki.platform:xwiki-platform-search-solr-ui 16.8.0):
