Hi there,
I'm trying to dig into the internals of LucenePostingFormat to see what I
can do to improve query performance. I'm curious about the term block stats
that I get from checkIndex -verbose.
What do the following mean:
index FST bytes - is this the FST size, i.e. the field's portion of the .tip
file?
num of terms - reported as 2M, although the Luke interface shows me 8M. How
come?
term / index FST bytes - summing these bytes over all my fields doesn't get
me close to the size of the .tim / .tip files. How come?
blocks - these are the suffix blocks (.tim files), which are implemented as
burst tries, right?
block types - where can I find information about the different types?
As background, my main performance issue is (random?) read-miss IO while
looking up terms in the BlockTreeTerm (.tim files, right?) on term-heavy
queries, so my optimization goal is to avoid IOs. That said, is there any
reason that finding the right block would require more than <segment_count>
IOs (of 4kB each)?
Should a certain distribution of prefix lengths across block types alarm me
in some way?
field "text_txt"
  index FST:
    18300 nodes
    45779 arcs
    583438 bytes
  term:
    2053393 terms
    25597203 bytes (12.5 bytes/term)
  blocks:
    66086 blocks
    51870 terms-only blocks
    47 sub-block-only blocks
    14169 mixed blocks
    13599 floor blocks
    22862 non-floor blocks
    43224 floor sub-blocks
    18289568 term suffix bytes (276.8 suffix-bytes/block)
    4174480 term stats bytes (63.2 stats-bytes/block)
    7632796 other bytes (115.5 other-bytes/block)
    by prefix length:
      0: 1
      1: 683
      2: 10782
      3: 17133
      etc...
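
For what it's worth, here is a minimal sketch (plain Python, numbers copied by
hand from the output above) that I used to check the counts against each
other. The variable names are my own, and the idea that non-floor blocks plus
floor sub-blocks account for every block is only my assumption from the
arithmetic, not something I found documented.

```python
# Numbers hand-copied from the checkIndex output for field "text_txt".
terms = 2053393
term_bytes = 25597203
blocks = 66086
terms_only, sub_block_only, mixed = 51870, 47, 14169
floor, non_floor, floor_sub = 13599, 22862, 43224
suffix_bytes, stats_bytes, other_bytes = 18289568, 4174480, 7632796


def ratio(numerator, denominator):
    """Average rounded to one decimal place, as the report prints it."""
    return round(numerator / denominator, 1)


# The three block types (terms-only, sub-block-only, mixed) should
# partition the total block count.
by_type = terms_only + sub_block_only + mixed

# Assumption: every block is either a non-floor block or a floor sub-block.
by_floor = non_floor + floor_sub

print("blocks by type: ", by_type, by_type == blocks)
print("blocks by floor:", by_floor, by_floor == blocks)
print("bytes/term:         ", ratio(term_bytes, terms))
print("suffix-bytes/block: ", ratio(suffix_bytes, blocks))
print("stats-bytes/block:  ", ratio(stats_bytes, blocks))
print("other-bytes/block:  ", ratio(other_bytes, blocks))
print("avg terms per block:", ratio(terms, blocks))
```

Both block counts come out to 66086, and the per-term and per-block averages
match the ones in the report, so the numbers at least seem internally
consistent.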
Thanks a lot,
Manuel