To save disk space for the on-disk indices, we compress the individual blocks using Huffman coding. Depending on the data values and the sorting order of the index, we achieve a compression rate of ≈ 90%. Although compression has a marginal impact on performance, we deem that the benefits of saved disk space for large index files outweigh the slight performance dip.
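The idea of Huffman-coding index blocks can be sketched briefly. This is a minimal illustration, not the paper's implementation: it builds a per-block code table with `heapq` and packs the encoded bits into bytes. Sorted index blocks tend to have highly skewed byte distributions (long runs of shared prefixes), which is why the compression rate is so sensitive to the index's sort order.

```python
import heapq
from collections import Counter


def build_codes(data: bytes) -> dict:
    """Build a Huffman code table (byte -> bit string) for one block."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreak, symbol-or-subtree).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate block: a single distinct byte value
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a byte value
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def compress_block(block: bytes):
    """Huffman-encode one index block.

    Returns (code table, packed payload, bit length); the table and
    bit length would be needed again to decode the block.
    """
    codes = build_codes(block)
    bits = "".join(codes[b] for b in block)
    payload = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big") if bits else b""
    return codes, payload, len(bits)
```

For a block of repetitive, sorted-looking entries, the encoded payload comes out far smaller than the raw block, at the cost of an encode/decode pass on every block read, which matches the slight lookup-time penalty reported for the compressed files.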
Figure 4 shows the correspondence between block size and lookup time, and also shows the impact of Huffman coding on the lookup performance; block sizes are measured pre-compression. The average lookup time for a data file with 100k entries (random lookups for all subjects in the index) using a 64k block size is approximately 1.1 ms for the uncompressed and 1.4 ms for the compressed data file. For 90k random lookups over a 7 GB data file with 420 million synthetically generated triples (more on that dataset in Section 7), we achieve an average seek time of 23.5 ms.
I still wonder how the DERI guys can claim this as their indexing scheme, especially given that Kowari was open sourced before YARS or the original paper came out. Maybe it comes down to who publishes first? See Paul's previous discussion about it in 2005 (under the title "Indexing"). It does bother me that this hasn't been properly attributed, as I'd like Paul and anyone else involved to get the credit they deserve. On the other hand, I'm glad that people are taking this idea and running with it.
It's good to see that text searching on literals now seems like a standard feature too. They used a sparse index to create all 6 indices. They also hint at how reasoning is going to be performed by linking to "Unifying Reasoning and Search to Web Scale", which suggests a tradeoff between time and trust.
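For readers unfamiliar with the sparse-index approach mentioned above, here is a rough sketch of the general technique, under my own assumptions rather than either system's actual on-disk layout: only the first key of each sorted block is held in memory, so a lookup binary-searches those keys, jumps to one block, and scans from there. The same structure works for each of the 6 triple orderings (SPO, POS, OSP, etc.) simply by sorting on a different component order.

```python
from bisect import bisect_right


class SparseIndex:
    """Sparse in-memory index over sorted, fixed-size blocks of triples.

    Illustrative only: real systems keep the blocks on disk and the
    first-key array in memory, so one lookup costs roughly one block read.
    """

    def __init__(self, triples, block_size=64):
        ordered = sorted(triples)  # sort by this index's order, e.g. SPO
        self.blocks = [ordered[i:i + block_size]
                       for i in range(0, len(ordered), block_size)]
        # The sparse part: remember only each block's first key.
        self.first_keys = [blk[0] for blk in self.blocks]

    def lookup(self, subject):
        """Return all triples whose first component equals `subject`."""
        # Last block whose first key could precede this subject.
        i = bisect_right(self.first_keys, (subject,)) - 1
        hits = []
        for blk in self.blocks[max(i, 0):]:
            if blk[0][0] > subject:  # sorted order lets us stop early
                break
            hits.extend(t for t in blk if t[0] == subject)
        return hits
```

Because matching triples are contiguous in sorted order, a lookup touches at most a couple of blocks regardless of index size, which is what keeps lookup times in the low milliseconds even over hundreds of millions of triples.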