This is about XWATCH-146.

To resume the discussions on the mailing lists about speeding up Watch, here are a few summarized results.

The results in the tables that follow were obtained on three databases of 20,000, 60,000 and 120,000 articles. I need to mention that the 20,000-article database is a "real" database, while the other two are generated. Although I tried to generate data as randomly as possible for those two, the fact remains that their entropy is not quite the same as in real situations (for example, all articles belonging to a feed get inserted into the database at successive positions). On the other hand, I randomized as much as possible the values that are used as keys in the database relations: feed names, feed URLs and article names.

Keep in mind that the results were fairly unstable most of the time; I tried to write down in the tables the 'overall feeling', so that we can get an idea of the improvements these methods could bring to XWiki Watch and of how they scale.

Non-indexed database

Here are the results obtained with no indexes on the tables other than the ones described in the database administration guide.

Since the Lucene results should not be affected by whether or not the database is indexed, there is only one set of Lucene tests on the 120,000-article database, run on the indexed database; see the table in the dedicated section.

(In the tables, 20/60/120 denote the 20,000-, 60,000- and 120,000-article databases; times are in seconds and sizes with no entry were not measured.)

Initial load (first time)
    Sql:            60: 30-40 seconds; 120: can go up to 60s
    Optimized Sql:  60: around 10s; 120: can go up to 60s
    Lucene:         60: up to 20s (15-16)

Initial load
    Sql:            60: ~15 seconds; 120: 20-23s?
    Optimized Sql:  60: 7s; 120: 13-14s
    Lucene:         60: ~4-5s on average (but can go up to 10)

All articles
    Sql:            60: around 7-9 seconds; 120: 17-18s
    Optimized Sql:  60: 5s on average, from 3 to 7; 120: 5-6s but can go as low as under a second
    Lucene:         60: 1s

One feed
    Sql:            60: 3 seconds; 120: 7-8 seconds, up to 10
    Optimized Sql:  60: under a second (0.7-0.8); 120: under a second but can go up to 2-3
    Lucene:         60: from under a second to a couple of seconds

Pagination navigation
    Sql:            60: 2-3 seconds; 120: 4-5 seconds but can go up to 10-15
    Optimized Sql:  60: 2-3s on average (from under one second to 4-5); 120: usually under a second but can go up to 4-5
    Lucene:         60: 1s on average

Keyword search
    Sql:            120: 7-8s on average; as low as 5 but as high as 15
    Optimized Sql:  120: 3-4s on average, from under a second to 5-6

No measurements were recorded on the non-indexed database for tag search, flagged articles or the combined filter (kw + tag + flag).

Indexed database

Note first that the Lucene times should not change if the custom mapping tables for the feeds are indexed: Lucene retrieval only uses the database to fetch the documents, and the object mapping in documents only uses the feed tables' ids, which are primary keys and therefore indexed by default.

And since we can assume that indexing will always be possible for the feeds tables, we focused our research on this case, namely on how fast optimised SQL can get and how well it scales.

The indexes that we used are:

alter table feeds_feedentry add key (fee_feedurl(255)), add key (fee_flag), add key (fee_feedname(255)), add key (fee_date), add key (fee_read);
alter table feeds_feedentry add index (fee_title(255)), add index (fee_author(255)), add index (fee_category(255)), add index (fee_content(1024)), add index (fee_fullcontent(1024));
alter table feeds_aggregatorurl add key (agg_url(255)), add key (agg_name(255));
alter table feeds_aggregatorgroup add key (agg_name(255));
alter table feeds_aggregatorurlgroups add key (agl_value);
alter table feeds_feedentrytags add key (fet_value);
alter table feeds_keyword add key (key_name(255)), add key (key_group(255));

to try to optimise the following queries:

  • default select for getting the list of all articles:
select distinct doc.space, doc.name 
from XWikiDocument as doc, BaseObject as obj, XWiki.FeedEntryClass as feedentry 
where doc.fullName=obj.name and obj.className='XWiki.FeedEntryClass' and obj.id=feedentry.id  and (feedentry.flag>-1 or feedentry.flag is null) and doc.web='Watch' 
order by feedentry.date desc
which translates to the following SQL query:
select distinct xwikidocum0_.XWD_WEB as col_0_0_, xwikidocum0_.XWD_NAME as col_1_0_, feedentryc2_.fee_date as col_2_0_ 
from xwikidoc xwikidocum0_, xwikiobjects baseobject1_, feeds_feedentry feedentryc2_ 
where xwikidocum0_.XWD_FULLNAME=baseobject1_.XWO_NAME and baseobject1_.XWO_CLASSNAME='XWiki.FeedEntryClass' and baseobject1_.XWO_ID=feedentryc2_.fee_id and (feedentryc2_.fee_flag>-1 or feedentryc2_.fee_flag is null) and xwikidocum0_.XWD_WEB='Watch' 
order by feedentryc2_.fee_date desc
  • the optimised query for the articles list:
select obj.name 
from BaseObject as obj, XWiki.FeedEntryClass as feedentry where obj.id=feedentry.id and obj.className='XWiki.FeedEntryClass' and (feedentry.flag>-1 or feedentry.flag is null) and obj.name like 'Watch.%' 
order by feedentry.date desc
which is translated to the following SQL query:
select baseobject0_.XWO_NAME as col_0_0_ 
from xwikiobjects baseobject0_, feeds_feedentry feedentryc1_ 
where baseobject0_.XWO_ID=feedentryc1_.fee_id and baseobject0_.XWO_CLASSNAME='XWiki.FeedEntryClass' and (feedentryc1_.fee_flag>-1 or feedentryc1_.fee_flag is null) and (baseobject0_.XWO_NAME like 'Watch.%') 
order by feedentryc1_.fee_date desc
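As a sanity check that the keys above are actually picked up, the translated query can be run through MySQL's EXPLAIN (a sketch only; the exact plan depends on the MySQL version and on the data distribution):

```sql
-- Check which key MySQL chooses for the optimised articles-list query.
EXPLAIN
select baseobject0_.XWO_NAME as col_0_0_
from xwikiobjects baseobject0_, feeds_feedentry feedentryc1_
where baseobject0_.XWO_ID=feedentryc1_.fee_id
  and baseobject0_.XWO_CLASSNAME='XWiki.FeedEntryClass'
  and (feedentryc1_.fee_flag>-1 or feedentryc1_.fee_flag is null)
  and (baseobject0_.XWO_NAME like 'Watch.%')
order by feedentryc1_.fee_date desc;
-- Ideally the feeds_feedentry row shows key = fee_date (or fee_flag) and the
-- Extra column has no "Using filesort", meaning the ordering uses the index.
```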

A few hacks, made for the sake of performance (since "fast" and "clean" are two words that rarely go together), were to:

  • drop the documents table from the query and get the document names from the object names. The only place where the documents table was used was to test the space, which we replaced with "obj.name like 'Watch.%'". There are cases in which this can return odd results (if the database is MySQL, LIKE is case-insensitive by default), but I think they are rare enough.
  • drop the distinct condition on the returned results: the only situation in which a document name can appear twice is when there are two objects of class XWiki.FeedEntryClass in the same document in the Watch space, which does not happen under normal usage of XWiki Watch. The distinct constraint is costly to evaluate, especially for large result sets. Note that, even if we query with a small limit, the distinct and the ordering are performed on the whole result set by the database engine.
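To sanity-check the two hacks, here is a minimal, self-contained sketch (on SQLite, with made-up table and column names standing in for xwikidoc, xwikiobjects and feeds_feedentry) showing that the optimised shape returns the same article list as the original join on data shaped like Watch's:

```python
import sqlite3

# Simplified stand-ins for the xwikidoc / xwikiobjects / feeds_feedentry tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE doc   (fullname TEXT PRIMARY KEY, web TEXT, name TEXT);
    CREATE TABLE obj   (id INTEGER PRIMARY KEY, name TEXT, classname TEXT);
    CREATE TABLE entry (id INTEGER PRIMARY KEY, flag INTEGER, date TEXT);
""")
docs = [("Watch.Article%d" % i, "Watch", "Article%d" % i) for i in range(5)]
docs.append(("Main.WebHome", "Main", "WebHome"))  # not a feed entry document
cur.executemany("INSERT INTO doc VALUES (?,?,?)", docs)
for i, (fullname, web, _) in enumerate(docs):
    cls = "XWiki.FeedEntryClass" if web == "Watch" else "Other"
    cur.execute("INSERT INTO obj VALUES (?,?,?)", (i, fullname, cls))
    cur.execute("INSERT INTO entry VALUES (?,?,?)", (i, 0, "2008-01-%02d" % (i + 1)))

# Original shape: join through the documents table, DISTINCT, space test on doc.web.
original = cur.execute("""
    SELECT DISTINCT o.name, e.date FROM doc d, obj o, entry e
    WHERE d.fullname = o.name AND o.classname = 'XWiki.FeedEntryClass'
      AND o.id = e.id AND (e.flag > -1 OR e.flag IS NULL) AND d.web = 'Watch'
    ORDER BY e.date DESC
""").fetchall()

# Optimised shape: no documents table, no DISTINCT, space tested via LIKE.
optimized = cur.execute("""
    SELECT o.name FROM obj o, entry e
    WHERE o.id = e.id AND o.classname = 'XWiki.FeedEntryClass'
      AND (e.flag > -1 OR e.flag IS NULL) AND o.name LIKE 'Watch.%'
    ORDER BY e.date DESC
""").fetchall()

print([r[0] for r in original] == [r[0] for r in optimized])  # → True
```

(SQLite's LIKE is also case-insensitive for ASCII by default, so the same odd-match caveat from the first hack applies to this sketch too.)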
The results are the following; note that we focused on the performance on the 20,000-article real database and on the 120,000-article database, to see how the optimisation scales.

(Same legend as above: 20/60/120 denote the database sizes in thousands of articles; sizes with no entry were not measured.)

Initial load (first time)
    Lucene:         120: 4s?

Initial load
    Sql:            20: 8-10s; 60: 15-20s; 120: 30s and higher
    Optimized Sql:  20: 2-3 seconds on average, can go under 1 second; 60: 4s; 120: 3-4s? can go up to 12 though

All articles
    Sql:            20: 6-7s on average; 60: 7-9s; 120: 15s and higher
    Optimized Sql:  20: 1s, can go a lot under; 60: 0.6-0.7s; 120: can be very fast, less than 0.5s, but also up to 3
    Lucene:         120: 1.5s

One feed
    Sql:            60: 2-3s; 120: 1-3s but also 10s
    Optimized Sql:  20: 0.5 seconds on average, never > 1s; 60: 0.5s; 120: under a second but also a lot above, 3-4-5
    Lucene:         120: 0.8s; very low for few articles and as high as 1.5 for 1500+

Pagination navigation
    Lucene:         120: 1s on average

Keyword search
    Sql:            20: 5s; 60: 5s; 120: 8-9s?
    Optimized Sql:  20: 0.8-1s; 60: 1s?; 120: 1.2s?
    Lucene:         120: 0.5s

Tag search (see note)
    Sql:            20: 6-7s; 60: 2-3s; 120: 6s?
    Optimized Sql:  20: 0.8-1s; 60: 0.9s?; 120: 1.2s

Flagged articles
    Sql:            20: 0.5s?; 60: 0.6s; 120: 1-2s
    Optimized Sql:  20: 0.3s; 60: 0.3s; 120: 0.5s
    Lucene:         120: 0.8s

Combined filter
    Sql:            20: 0.5s?; 60: from under a second to 3-4s, depends on the filter; 120: ?
    Optimized Sql:  20: 0.3-0.4s; 60: very low, 0.1s, depends on the criteria too; 120: as low as 0.2-0.3 but also depends on the filter

Note that the inconsistency of the tag search results between the three sizes might come from the fact that the tag search on the 20,000-article database reflects the search for a real tag, sufficiently distributed over the database and sufficiently used. For the other two databases, the tag is an artificial one, not very widespread across the database and potentially not well distributed. TODO: write tag-generating code
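As a possible starting point for that TODO (a sketch only; the tag names, the number of tags per article and the distribution parameters are all made up), tags could be assigned to generated articles with a skewed, Zipf-like frequency, so that a few of them end up as common and as spread out over the database as real tags:

```python
import random

def assign_tags(article_count, tags, seed=0):
    """Assign 0-3 tags to each generated article, with Zipf-like frequencies."""
    rng = random.Random(seed)  # fixed seed so generated databases are reproducible
    # The k-th tag is roughly 1/k as frequent as the first one.
    weights = [1.0 / (k + 1) for k in range(len(tags))]
    # Draws are with replacement; duplicates are possible but rare and harmless
    # for load-test data.
    return [rng.choices(tags, weights=weights, k=rng.randint(0, 3))
            for _ in range(article_count)]

# Made-up tag names for illustration.
tags = ["xwiki", "watch", "performance", "lucene", "sql"]
assignments = assign_tags(1000, tags)
# The first tags should end up both frequent and spread across the database,
# so searching for them behaves more like searching for a real tag.
```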

Remarks

  • in the indexed table we notice a substantial improvement of the optimized SQL over the "default" SQL, and its relative proximity to the results obtained with Lucene. The most interesting results, from the speed improvement point of view, are the ones on the 20,000-article database, which is a real database with "real entropy"
  • even if the table might not show it, the optimised SQL times often drop under 1 second for basic operations such as listing all the articles of a feed and basic filtering, and these times scale up pretty well.

Conclusion

Created by Vincent Massol on 2011/10/10 22:39