Saturday, December 31, 2005

RETE Rebuked

A recent blog I started reading after criticizing SPARQL. This time its criticizing an entry Paul made about the scalability of the RETE algorithm: "I am a bit confused by the statement that RETE does not scale. This is contrary to a mountain of papers by researchers and developers around the world. From the paragraph, a couple of things come to mind. "(loading indexes) does not need to be done often" tells me the author doesn't understand the purpose and goal of RETE. RETE was designed to solve machine learning problems where data changes rapidly and reasoning is a continuous process. What the author wants is something closer to BitMap indexes used in OLAP products."

A lot of the points raised, like RETE being for changing data, are mentioned in a previous post under Meeting. Drools was chosen as a starting to point to see what kind of system needed to be developed in Kowari.

Also mentioned, bitmap indexing: "Given Tucana is indexing everything, they might as well adapt Bitmap indexing and get better than linear performance. The problem described by the blog is a well understood problem in the OLAP world."

An previous entry, "Relational theory, RETE and Derby" points to some interesting articles about bitmap indexes (available in Oracle 9) and high scalability requirements: "In a large financial institution like a mutual fund company, they may have 1-20 million customers. If each customer has an average of 20-30 positions (aka specific holding of an equity) that means the potential dataset for firm wide compliance rule could involve 20million+ rows. Doing this within 2-5 seconds is rather hard, so it requires using lots of different techniques. In the extreme cases, a company might have 20 million accounts, which means the potential dataset is 600 million rows."

From the OTN article: "B-tree indexes are usually used when columns are unique or near unique; bitmap indexes should be used, or at least considered, in all other cases. While you would not generally use a b-tree index when retrieving 40 percent of the rows of a table, a bitmap index is often still faster than doing a full table scan. This is seemingly in violation of the 80/20 rule, which is to generally use an index when retrieving 20 percent or less of the rows and do a full table scan when retrieving more. Bitmap indexes are smaller and work differently from the 80/20 rule. You can effectively use bitmap indexes even when retrieving large percentages (20 to 80 percent) of a table. Bitmaps can also be used to retrieve conditions based on nulls (since nulls are also indexed) and for "not equal" conditions."

It would appear that this would be suitable for predicate indexation but not generally as both subjects and objects are near unique.

No comments: