Wiki source code of Speeding up XWatch

Last modified by Thomas Mortagne on 2017/03/24 12:27

Show last authors
1 This is about [XWATCH-146 > https://jira.xwiki.org/browse/XWATCH-146].
2
3 To resume the [discussions on the mailing lists > http://xwiki.markmail.org/search/?q=XWATCH-116#query:XWATCH-116+page:1+mid:mk53oepppru7dava+state:results] about speeding up Watch, here are a couple of summarized results.
4
5 The results in the tables that follow were obtained for 3 databases of 20, 60 and 120 thousands articles. I need to mention that the 20 000 articles database is a "real" database while the other two are generated databases. Even if I tried to generate data as random as possible for the two, the fact remains that the entropy is not quite the same as in real situations (for example, all articles belonging to a feed get inserted in the database at successive positions). On the other hand, I tried to randomize as much as possible the values that are used as keys in the database relations: feed names, feed urls, article names.
6
7 Keep in mind that most of the time results were pretty unstable, I tried to write down in the tables the 'overall feeling', so that we can get an idea about the improvements these methods could bring on XWiki Watch and their scalability.
8
9 1.1 Non-indexed database
10
11 Here are the results obtained with no indexes on the tables other than the ones in the database administration guide.
12
13 #info("Since the results of the Lucene querying should not be affected by the database indexing or not, there is only one set of tests for Lucene on the 120 000 articles database, on the indexed database. See the table in the dedicated section.")
14
15 {pre}
16 <style type="text/css">
17 table tr.c20 {
18 background-color: #FFF8A5;
19 }
20 table tr.c60 {
21 background-color: #FFD66C;
22 }
23 table tr.c120 {
24 background-color: #FF9849;
25 }
26 table table {
27 width: 100%;
28 }
29
30 tr.c20 td, tr.c60 td, tr.c120 td {
31 color: #000000;
32 }
33 table.results td {
34 color: #000000;
35 }
36 </style>
37 <table class="results">
38 <tr><td>Operation</td> <td> Sql </td> <td>Optimized Sql</td> <td>Lucene</td></tr>
39 <tr>
40 <td> Initial load (first time)</td>
41 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>30-40 seconds </td></tr><tr class="c120"><td>120:</td><td>can go up to 60s</td></tr></table></td>
42 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>around 10</td></tr><tr class="c120"><td>120:</td><td>can go up to 60s</td></tr></table></td>
43 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>up to 20 (15-16)</td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
44 </tr>
45 <tr>
46 <td>Initial load</td>
47 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>~15 seconds</td></tr><tr class="c120"><td>120:</td><td>20-23?</td></tr></table></td>
48 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>7s</td></tr><tr class="c120"><td>120:</td><td>13-14s</td></tr></table></td>
49 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>~4-5 on average (but can go up to 10)</td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
50 </tr>
51 <tr>
52 <td>All articles</td>
53 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>around 7-8-9 seconds </td></tr><tr class="c120"><td>120:</td><td>17-18</td></tr></table></td>
54 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>5 on average, from 3 to 7</td></tr><tr class="c120"><td>120:</td><td>5-6 but can go as low as under a second</td></tr></table></td>
55 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>1</td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
56 </tr>
57 <tr>
58 <td>One feed</td>
59 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>3 seconds</td></tr><tr class="c120"><td>120:</td><td>7-8 seconds, up to 10</td></tr></table></td>
60 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>under a second (0.7-0.8)</td></tr><tr class="c120"><td>120:</td><td>under a second but can go up to 2-3</td></tr></table></td>
61 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>from under a second to a
62 couple seconds </td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
63 </tr>
64 <tr>
65 <td>Pagination navigation</td>
66 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>2-3 seconds </td></tr><tr class="c120"><td>120:</td><td>4-5 seconds but can go up to 10-15</td></tr></table></td>
67 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>2-3 on average (from under one second to 4-5)</td></tr><tr class="c120"><td>120:</td><td>usually under a second but can go to 4-5</td></tr></table></td>
68 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>1s on average</td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
69 </tr>
70 <tr>
71 <td>Keyword search</td>
72 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>7-8 on average; as low as 5 but as high as 15</td></tr></table></td>
73 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>3-4 on average. From under a second to 5-6</td></tr></table></td>
74 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
75 </tr>
76 <tr>
77 <td>Tag search</td>
78 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
79 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
80 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
81 </tr>
82 <tr>
83 <td>Flagged articles</td>
84 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
85 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
86 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
87 </tr>
88 <tr>
89 <td>Combined filter (kw + tag + flag)</td>
90 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
91 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
92 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
93 </tr>
94 </table>
95 {/pre}
96
97 1.1 Indexed database
98
99 Note first that Lucene times should not change if the custom mapping tables for the feeds are indexed, since the Lucene retrieval does not use the database but for getting the documents, and object mapping in documents only uses the feeds tables ids, which are primary keys therefore indexed by default.
100
101 And since we can assume that indexing will always be possible for feeds tables, we focused our research on this case, namely to see how good can optimised sql get and how does it scale.
102
103 The indexes that we used are:
104 {code}
105 alter table feeds_feedentry add key (fee_feedurl(255)), add key (fee_flag), add key (fee_feedname(255)), add key (fee_date), add key (fee_read);
106 alter table feeds_feedentry add index (fee_title(255)), add index (fee_author(255)), add index (fee_category(255)), add index (fee_content(1024)), add index (fee_fullcontent(1024));
107 alter table feeds_aggregatorurl add key (agg_url(255)), add key (agg_name(255));
108 alter table feeds_aggregatorgroup add key (agg_name (255));
109 alter table feeds_aggregatorurlgroups add key (agl_value);
110 alter table feeds_feedentrytags add key (fet_value);
111 alter table feeds_keyword add key (key_name (255)), add key (key_group (255));
112 {code}
113
114 to try to optimise the following queries:
115 * default select for getting the list of all articles:
116 {code}
117 select distinct doc.space, doc.name
118 from XWikiDocument as doc, BaseObject as obj, XWiki.FeedEntryClass as feedentry
119 where doc.fullName=obj.name and obj.className='XWiki.FeedEntryClass' and obj.id=feedentry.id and (feedentry.flag>-1 or feedentry.flag is null) and doc.web='Watch'
120 order by feedentry.date desc
121 {code}
122 which translates to the following sql query:
123 {code}
124 select distinct xwikidocum0_.XWD_WEB as col_0_0_, xwikidocum0_.XWD_NAME as col_1_0_, feedentryc2_.fee_date as col_2_0_
125 from xwikidoc xwikidocum0_, xwikiobjects baseobject1_, feeds_feedentry feedentryc2_
126 where xwikidocum0_.XWD_FULLNAME=baseobject1_.XWO_NAME and baseobject1_.XWO_CLASSNAME='XWiki.FeedEntryClass' and baseobject1_.XWO_ID=feedentryc2_.fee_id and (feedentryc2_.fee_flag>-1 or feedentryc2_.fee_flag is null) and xwikidocum0_.XWD_WEB='Watch'
127 order by feedentryc2_.fee_date desc
128 {code}
129 * the optimised query for the articles list:
130 {code}
131 select obj.name
132 from BaseObject as obj, XWiki.FeedEntryClass as feedentry where obj.id=feedentry.id and obj.className='XWiki.FeedEntryClass' and (feedentry.flag>-1 or feedentry.flag is null) and obj.name like 'Watch.%'
133 order by feedentry.date desc
134 {code}
135 which is translated to the following query:
136 {code}
137 select baseobject0_.XWO_NAME as col_0_0_
138 from xwikiobjects baseobject0_, feeds_feedentry feedentryc1_
139 where baseobject0_.XWO_ID=feedentryc1_.fee_id and baseobject0_.XWO_CLASSNAME='XWiki.FeedEntryClass' and (feedentryc1_.fee_flag>-1 or feedentryc1_.fee_flag is null) and (baseobject0_.XWO_NAME like 'Watch.%')
140 order by feedentryc1_.fee_date desc
141 {code}
142
143 A few hacks, for the sake of performance (since "fast" and "clean" are two words that rarely go together) were to:
144 * drop the documents table from the query and get the document names from the object names. The only place where the documents table was used was to test the space, which we replaced by "obj.name like 'Watch.%'". There are cases in which this can return odd results (if database is mysql then like is case insensitive by default) but I think they're rare enough.
145 * drop the distinct condition for the returned results: the only situation when a document name can appear twice is when there are two objects of class XWiki.FeedEntryClass in the same document in the Watch space, which does not happen by normal usage of XWiki Watch. The distinct constraint is costfull to evaluat especially for large result sets. Note that, even if we query with a small limit, the distinct and the ordering are performed for the whole result set by the database engine.
146
147 The results are the following, noting that we focused on the performance on the 20 000 articles real database and the 120 000 articles database, to see how the optimisation scales.
148
149 {pre}
150 <table class="results">
151 <tr><td>Operation</td> <td> Sql </td> <td>Optimized Sql</td> <td>Lucene</td></tr>
152 <tr>
153 <td> Initial load (first time)</td>
154 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
155 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
156 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>4s?</td></tr></table></td>
157 </tr>
158 <tr>
159 <td>Initial load</td>
160 <td><table><tr class="c20"><td>20:</td><td>8-10s</td></tr><tr class="c60"><td>60:</td><td>15-20s</td></tr><tr class="c120"><td>120:</td><td>30s and higher</td></tr></table></td>
161 <td><table><tr class="c20"><td>20:</td><td>2-3 seconds on average; can go under 1 second</td></tr><tr class="c60"><td>60:</td><td>4s</td></tr><tr class="c120"><td>120:</td><td>3-4s? can go up to 12 though</td></tr></table></td>
162 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
163 </tr>
164 <tr>
165 <td>All articles</td>
166 <td><table><tr class="c20"><td>20:</td><td>6-7s on average</td></tr><tr class="c60"><td>60:</td><td>7-8-9s</td></tr><tr class="c120"><td>120:</td><td>15s and higher</td></tr></table></td>
167 <td><table><tr class="c20"><td>20:</td><td>1s; can go a lot under</td></tr><tr class="c60"><td>60:</td><td>0.6-0.7s</td></tr><tr class="c120"><td>120:</td><td>can be very fast, less than 0.5s but also up to 3</td></tr></table></td>
168 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>1.5s</td></tr></table></td>
169 </tr>
170 <tr>
171 <td>One feed</td>
172 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td>2-3s</td></tr><tr class="c120"><td>120:</td><td>1-3s but also 10s</td></tr></table></td>
173 <td><table><tr class="c20"><td>20:</td><td>0.5 seconds on average; never > 1s</td></tr><tr class="c60"><td>60:</td><td>0.5s</td></tr><tr class="c120"><td>120:</td><td>under a second but also a lot above, 3-4-5</td></tr></table></td>
174 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>0.8, very low for few articles and as high as 1.5 for 1500+</td></tr></table></td>
175 </tr>
176 <tr>
177 <td>Pagination navigation</td>
178 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
179 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
180 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>1s on average</td></tr></table></td>
181 </tr>
182 <tr>
183 <td>Keyword search</td>
184 <td><table><tr class="c20"><td>20:</td><td>5s</td></tr><tr class="c60"><td>60:</td><td>5s</td></tr><tr class="c120"><td>120:</td><td>8-9s?</td></tr></table></td>
185 <td><table><tr class="c20"><td>20:</td><td>0.8-1s</td></tr><tr class="c60"><td>60:</td><td>1s?</td></tr><tr class="c120"><td>120:</td><td>1.2s?</td></tr></table></td>
186 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>0.5s</td></tr></table></td>
187 </tr>
188 <tr>
189 <td>Tag search (see note) </td>
190 <td><table><tr class="c20"><td>20:</td><td>6-7s</td></tr><tr class="c60"><td>60:</td><td>2-3s</td></tr><tr class="c120"><td>120:</td><td>6s?</td></tr></table></td>
191 <td><table><tr class="c20"><td>20:</td><td>0.8-1s</td></tr><tr class="c60"><td>60:</td><td>0.9s?</td></tr><tr class="c120"><td>120:</td><td>1.2s</td></tr></table></td>
192 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
193 </tr>
194 <tr>
195 <td>Flagged articles</td>
196 <td><table><tr class="c20"><td>20:</td><td>0.5s?</td></tr><tr class="c60"><td>60:</td><td>0.6s</td></tr><tr class="c120"><td>120:</td><td>1-2s</td></tr></table></td>
197 <td><table><tr class="c20"><td>20:</td><td>0.3s</td></tr><tr class="c60"><td>60:</td><td>0.3s</td></tr><tr class="c120"><td>120:</td><td>0.5s</td></tr></table></td>
198 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td>0.8</td></tr></table></td>
199 </tr>
200 <tr>
201 <td>Combined filter</td>
202 <td><table><tr class="c20"><td>20:</td><td>0.5s?</td></tr><tr class="c60"><td>60:</td><td>from under a second to 3-4s, depends on the filter</td></tr><tr class="c120"><td>120:</td><td>?</td></tr></table></td>
203 <td><table><tr class="c20"><td>20:</td><td>0.3s-0.4s</td></tr><tr class="c60"><td>60:</td><td>very low, 0.1. depends on the criteria too.</td></tr><tr class="c120"><td>120:</td><td>as low as 0.2-0.3 but also depends on filter</td></tr></table></td>
204 <td><table><tr class="c20"><td>20:</td><td></td></tr><tr class="c60"><td>60:</td><td></td></tr><tr class="c120"><td>120:</td><td></td></tr></table></td>
205 </tr>
206 </table>
207 {/pre}
208
209 #info("Note that the inconsistency of the results for the tag search between the three sizes might come from the fact that tag search on the 20 000 articles database reflects the search of a real tag, enough distributed over the database and enough used in the database. For the other two databases, the tag is an artificial one, not very spread across the database and potentially not well distributed. TODO: write tags generating code")
210
211 1.1.1 Remarks
212
213 * in the indexed table we notice a substantial improvement of the optimized sql over the "default" sql and its relative proximity to the results brought by lucene. The interesting results, from the speed improvement point of view are the ones one the 20 000 articles database, which is a real database with "real entropy"
214 * even if the table might not show it, the optimised sql results go often enough under 1 second for basic operations such as listing all articles for a feed and basic filtering and these times scale up pretty well.
215
216 1.1 Conclusion

Get Connected