Showing posts with label sql.

Wednesday, April 15, 2009

MapReduce vs SQL Databases

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
...we present the results of running the benchmark on a 100-node cluster to execute each task. We tested the publicly available open-source version of MapReduce, Hadoop [1], against two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor.

First, as we demonstrate in Section 4, at 100 nodes the two parallel DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce on a variety of analytic tasks. While MR may indeed be capable of scaling up to 1000s of nodes, the superior efficiency of modern DBMSs alleviates the need to use such massive hardware on datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node has a total disk capacity of 2PB). For example, eBay’s Teradata configuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage approximately 2.4PB of relational data. As another example, Fox Interactive Media’s warehouse is implemented using a 40-node Greenplum DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500GB disks, and 16 GB RAM (1PB total disk space) [7]. Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MR users really need 1,000 nodes.


In section 3.1 there are some points made about the advantages of databases over MR in relation to data integrity: "...a MR framework and its underlying distributed storage system has no knowledge of these rules, and thus allows input data to be easily corrupted with bad data. By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf."

They mention that "all DBMSs require that data conform to a well-defined schema, whereas MR permits data to be in any arbitrary format. Other differences also include how each system provides indexing and compression optimizations, programming models, the way in which data is distributed, and query execution strategies."

If you strip it away, they are talking about text processing versus indexed data structures (and other parts of a DBMS).

For loading, "Without using either block- or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each datafile from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster." The obvious difference to me would be that the SQL databases are creating "...a hash partitioned across all nodes on the salient attribute for that particular table, and then sorted and indexed on different attributes..."
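To make that difference concrete, here's a hedged sketch of the load-time work the parallel DBMSs do and Hadoop skips (MySQL-style partitioning syntax; the Rankings columns come from the benchmark, but the partition key and index are my own guesses):

-- hash-style partitioning across nodes on a salient attribute
CREATE TABLE Rankings (
    pageURL     VARCHAR(100),
    pageRank    INT,
    avgDuration INT
)
PARTITION BY KEY (pageURL)
PARTITIONS 100;

-- then sorted/indexed on other attributes before any query runs
CREATE INDEX rankings_rank_idx ON Rankings (pageRank);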

For text processing, they note that the main problems with Hadoop are the start-up costs (10-25 seconds before all Map tasks start) and, during the Reduce phase, the cost of combining many small files. When you compare a fully indexed system against text processing you would expect the indexed system to be faster. Compression was also considered an advantage in systems like Vertica over Hadoop, where it actually reduced performance. Whether compression is worth its overhead obviously depends on the work being done - it's not explained why compression was a negative for Hadoop.

They also talk about the problems in setting up and configuring the parallel databases over Hadoop, which is not an insignificant difference when you are scaling to 100s and 1000s of nodes.

In the summary they talk about 25 years of database development and the advantages of B-Trees and column stores. It begs the question: why, then, wasn't a similar system used on the Hadoop infrastructure? MR is really more like distributed processing, not an indexed querying system.

If you took away the distributed layer, what they are doing is comparing something like grep (a really bad implementation of grep) with Lucene or MySQL. Would anyone be surprised by the results then? A better comparison would've been against HBase or other distributed, indexed data stores like Hive or Cloudbase.

Update: There's a good followup on the hadoop list by Jonathan Gray: "Hadoop is not suited for random access, joins, dealing with subsets of your data; ie. it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data it's processing. The emphasis is on task scheduling, fault tolerance, and very large datasets, low-latency has not been a priority. There are no "indexes" to speak of, it's completely orthogonal to what it does, so of course there is an enormous disparity in cases where that makes sense. Yes, B-Tree indexes are a wonderful breakthrough in data technology". He suggested Pig, Hive and Cascading would be more suitable for comparison.

Friday, July 13, 2007

SPASQL

SPASQL
FeDeRate [FED] provides a mapping between RDF queries and SQL queries over conventional relational databases. SPASQL provides similar functionality, but eliminates the query rewriting phase by parsing the SPARQL [SPARQL] directly in the database server. This approach is efficient, and allows applications linked to currently deployed MySQL client libraries to execute SPARQL queries and get the results back through the conventional MySQL protocol.


This looks like a nice, pragmatic approach to solving the legacy SQL data problem. There's some interesting discussion about the mismatch between UNION in SQL and SPARQL.

Sunday, March 25, 2007

News from Oracle

Oracle 11g Gains Native OWL Support and Oracle 11g to support some OWL inferencing. Following the links leads to the "Semantic Technologies Center" and an announcement about TopQuadrant TopBraid Composer supporting Oracle. Apart from the native OWL inferencing, faster querying and faster bulk loading, the referenced PDF talks about the scalability achieved (600 million triples), clustering, failover, concurrent read/write, and query features such as the semantic relationship operators sem_related and sem_distance.

Friday, March 16, 2007

Radiant RDF

A new and hopefully readable article on XML.com, "A Relational View of the Semantic Web". The main reason is to elicit some feedback and to make a more understandable version of things like this. It also discusses whether or not to use blank nodes when mapping RDF to the relational model (which is something I first published in my thesis). It includes my take, no doubt influenced by Simon and working on Kowari, on the DISTINCT issue amongst others.

Update: Well if any feedback is good feedback, having positive feedback is awesome: "But it begs the question: given the relative proximity of RDF/SPARQL to the relational model, why are the Semantic Web standards not closer both syntactically and semantically to it? This is, for me, the major reason which prevents its adoption...Nice to see such a straightforward and well-referenced article outlining the Semantic Web and relational worlds in any case. Thank you."

The question raised in the comment, why not keep pushing SQL, is one I wish I had a short answer for. SPARQL syntax has often been chosen to look like SQL, but obviously it's still different enough because it is a different domain. I still favour a clean break from SQL for that reason.

I wrote about SQL not being relational enough quite a while ago now (nearly 3 years). At the time, I think I was unaware of some of the operations in SQL like INTERSECT and EXCEPT (set difference).
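For reference (my own example, not from the original post), those operators work on whole result sets:

SELECT name FROM managers     -- hypothetical tables
INTERSECT                     -- rows appearing in both results
SELECT name FROM engineers;

SELECT name FROM managers
EXCEPT                        -- set difference: managers who are not engineers
SELECT name FROM engineers;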

Update 2: Also noted by Nova Spivack, "Excellent Overview of Benefits of RDF and SPARQL".

Tuesday, October 03, 2006

SPARQL to SQL

Semantics Preserving SPARQL-to-SQL Query Translation for Optional Graph Patterns "...we proposed: (1) an efficient algorithm, BGPtoSQL, that generates an SQL query equivalent to an input SPARQL basic graph pattern to be evaluated over the triple store, and (2) a generic query translation strategy and an efficient algorithm, SPARQLtoSQL, that translates an input SPARQL query with a basic graph pattern or an optional graph pattern to an equivalent SQL query. To our best knowledge, our algorithmic solution for the optional graph pattern query mapping problem is the first published in the literature."
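To give a flavour of what such a translation produces (a hedged sketch - the triples table and names are my assumption, not the paper's schema): each triple pattern becomes an alias over a (subject, predicate, object) table, shared variables become join conditions, and an optional graph pattern becomes a left outer join.

-- SPARQL:
--   SELECT ?name ?email
--   WHERE { ?p <hasName> ?name .
--           OPTIONAL { ?p <hasEmail> ?email } }
SELECT t1.object AS name,
       t2.object AS email
FROM triples t1
LEFT OUTER JOIN triples t2          -- OPTIONAL maps to a left outer join
    ON  t2.subject = t1.subject     -- shared variable ?p
    AND t2.predicate = 'hasEmail'
WHERE t1.predicate = 'hasName';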

Tuesday, May 30, 2006

Shiny Languages

Reasons Why Your Startup Should Use Ruby On Rails (RoR) "When tempted to adopt the newest “silver bullet” technology that promises immense gains in programmer productivity (which RoR does), understand what the tradeoffs are. To get even keener insight, study a little bit of history and find out what has come before. It’s possible that in your situation RoR is indeed the best choice, but likely not for the reasons above. If you find yourself agreeing with one of the eight reasons above, we really need to talk."

Another related article, "Why Ruby is an acceptable LISP".

And, "SQL On Rails".

Via Javalobby.

Tuesday, May 09, 2006

Relations in my Language

A good interview with Ted Neward (from chapter 8) about the differences and benefits of LINQ and why it's not SQL in your language.

Thursday, March 30, 2006

Sudoku SQL

Solving Sudoku with SQL "To make it even more fun for myself, I embarked on an exercise to write a program that solves Sudoku puzzles. And to make it even more challenging I decided not to write the program in the popular object-oriented fashion (Java, C++, C#, etc.) or in any of the old-fashioned procedural programming languages (Pascal, C, Basic etc); but in Transact SQL, within SQL Server 2000. Basically, I wanted to see how the features of T-SQL can be used to develop something like a Sudoku puzzle solution. I have learnt some useful things from the exercise, which I’m eager to pass on to my fellow programmers.

"T-SQL is rich in in-built programming functions and features. Far from being just for holding and manipulating data, T-SQL is a programming language in its own right. Many algorithm-based problems that used to be solved with mainstream procedural or object-oriented languages can now be dealt with completely within SQL Server using T-SQL, because not only does it have the usual programming constructs such as ‘While…End’; ‘Case’ and ‘Ifs’; it also, of course, has SQL."
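As a minimal, hypothetical taste of those constructs (my own fragment, not code from the article; SQL Server 2000 era T-SQL):

DECLARE @row INT
SET @row = 1
WHILE @row <= 9
BEGIN
    IF @row % 3 = 0
        PRINT 'Finished a band of three Sudoku rows'
    SET @row = @row + 1
END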

Related to the previous solution using OWL: Now for my Tax Return.

Monday, January 16, 2006

Query Anything

Access In-memory Objects Easily with JoSQL "Once in a while something comes along that is so simple, so straightforward, and so obvious that it's amazing that nobody did it long ago."

An example query:
SELECT *
FROM java.io.File
WHERE name $LIKE "%.html"
AND lastModified BETWEEN toDate('01-12-2004')
AND toDate('31-12-2004')


I've blogged previously, Using RDF to improve Object-First Development. This is an example of the same principle as Kowari's resolvers (or any idea of smushing metadata, like Gnowsis) - allowing anything to be treated as RDF (or in JoSQL's case, SQL).

Thursday, December 15, 2005

Outer Joins aren't Primitive

Optional data in SPARQL seems to be equivalent to a left outer join in SQL. As it turns out, outer joins can be composed of disjunctions. This is similar to the original MAYBE function suggested to be added to Kowari (although that suggestion is a good deal simpler). The below paper outlines algorithms to do outer join queries more efficiently. They require computing the anti-join of certain relations - an anti-join being the set difference between two tables (or MINUS operation). Here is a good explanation of semi-joins and anti-joins.
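A hedged sketch of the idea with hypothetical emp/dept tables: the left outer join is exactly the inner join plus the anti-join rows padded with NULLs.

-- the outer join...
SELECT e.name, d.dept_name
FROM emp e LEFT OUTER JOIN dept d ON e.dept_id = d.id;

-- ...as a disjunction: the join, plus the anti-join (emp rows with no match)
SELECT e.name, d.dept_name
FROM emp e JOIN dept d ON e.dept_id = d.id
UNION ALL
SELECT e.name, NULL
FROM emp e
WHERE NOT EXISTS (SELECT 1 FROM dept d WHERE d.id = e.dept_id);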

Outerjoins as Disjunctions "The outerjoin operator is currently available in the query language of several major DBMSs, and it is included in the proposed SQL2 standard draft. However, “associativity problems” of the operator have been pointed out since its introduction. In this paper we propose a shift in the intuition behind outerjoin: Instead of computing the join while also preserving its arguments, outerjoin delivers tuples that come either from the join or from the arguments. Queries with joins and outerjoins deliver tuples that come from one out of several joins, where a single relation is a trivial join. An advantage of this view is that, in contrast to preservation, disjunction is commutative and associative, which is a significant property for intuition, formalisms, and generation of execution plans. Based on a disjunctive normal form, we show that some data merging queries cannot be evaluated by means of binary outerjoins, and give alternative procedures to evaluate those queries. We also explore several evaluation strategies for outerjoin queries, including the use of semijoin programs to reduce base relations."

Also related, Outer Join in Edutella where each part of the outer query is done individually.

Tuesday, December 13, 2005

A Really Interactive Query Language

SQLBuilder "SQLBuilder uses clever overriding of operators to make Python expressions build SQL expressions -- so long as you start with a Magic Object that knows how to fake it."

An example:
>>> from SQLBuilder import *
>>> person = table.person
# person is now equivalent to the Person.q object from the SQLObject
# documentation
>>> person
person
>>> person.first_name
person.first_name
>>> person.first_name == 'John'
person.first_name = 'John'

Via, SQL API "I'd rather SQLObject be built on some ORM-neutral layer, where you can move down to that layer when SQLObject doesn't fit your problem; as opposed to now, where you kind of have to work around SQLObject."

This is almost exactly like something I was thinking about to prevent semantically incorrect SQL queries. Add an AJAX interface on top of this and it would be cool and useful.

Saturday, November 26, 2005

Doomed to Repeat History

Detecting Semantic Errors in SQL Queries, for example: "SELECT * FROM EMP WHERE JOB = 'CLERK' AND JOB = 'MANAGER'". Obviously, JOB cannot be two values at once - yet you are allowed to express it. This and other examples are queries that are simply wrong and should be detected as such. It seems this entire area of research exists because it's possible to write semantically incorrect queries in the first place. While the first example above may only require better feedback at the command line, other examples involving HAVING and DISTINCT exist because of the language design. It would be neat to have an SQL interpreter that prevented these incorrect queries from being submitted (similar to IntelliJ or Word). The paper also covers solving problems with subqueries (using Skolemization) and null values.
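Another flavour of legal-but-suspect query the paper catches (my own illustration, reusing the EMP table above): a HAVING condition with no aggregate, which is valid SQL but really belongs in WHERE.

SELECT JOB, COUNT(*)
FROM EMP
GROUP BY JOB
HAVING JOB <> 'MANAGER';  -- runs, but filtering before grouping (WHERE) is what was meant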

From the SQLLint page.

Sunday, July 10, 2005

Querying for Hierarchies

Storing Hierarchical Data in a Database "Whether you want to build your own forum, publish the messages from a mailing list on your Website, or write your own cms: there will be a moment that you’ll want to store hierarchical data in a database. And, unless you’re using a XML-like database, tables aren’t hierarchical; they’re just a flat list. You’ll have to find a way to translate the hierarchy in a flat file."

"If you want to display the tree using a table with left and right values, you’ll first have to identify the nodes that you want to retrieve. For example, if you want the ‘Fruit’ subtree, you’ll have to select only the nodes with a left value between 2 and 11. In SQL, that would be:

SELECT * FROM tree WHERE lft BETWEEN 2 AND 11;"
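To make the left and right values concrete, here's a cut-down reconstruction of the article's example tree (the values are my own, chosen to be consistent with the 2-11 range above):

CREATE TABLE tree (name VARCHAR(20), lft INT, rgt INT);

INSERT INTO tree VALUES ('Food',   1, 12);
INSERT INTO tree VALUES ('Fruit',  2, 11);
INSERT INTO tree VALUES ('Red',    3, 6);
INSERT INTO tree VALUES ('Cherry', 4, 5);
INSERT INTO tree VALUES ('Yellow', 7, 10);
INSERT INTO tree VALUES ('Banana', 8, 9);

-- every node of the 'Fruit' subtree has a lft between 2 and 11
SELECT * FROM tree WHERE lft BETWEEN 2 AND 11;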

It struck me how complicated both of these solutions are.

In iTQL it would be done using walk:

select $subject
...
where walk($subject <rdfs:subClassOf> <food:fruit>
  and $subject <rdfs:subClassOf> $object);

Or in Sesame's SeRQL (concrete example here):

SELECT DISTINCT _fruit
FROM _fruit serql:directSubClassOf <food:fruit>

They aren't quite the same query - Sesame is inferring new statements and walk is not. In designing iTQL we were always deciding between magical predicates or functions (like walk) and sometimes slavishly sticking to triple patterns. Combining the walk and trans operations using a predicate, like in SeRQL, seems like a good approach to take.

Also interesting in a fairly recent Sesame release are the ANY and ALL, EXISTS and MINUS features.

Friday, November 26, 2004

RDF on Lambda

RDF and Databases "Some RDF research dropped me to a nice paper (PDF) from IBM discussing RDF with relational databases. This combination can replace half-baked application data mechanisms. These crop up regularly in my consulting work. Think nested directories of Windows INI files and brittle, binary files breaking on minor design iterations. The pain, the pain."

"There are several projects in this domain. My favorite so far is OpenRDF Sesame. It supports querying at the semantic level. It seems more mature than others, having derived from previous efforts, and works with both PostgreSQL and MySQL as well as Oracle. An abstraction layer called SAIL makes Sesame database-agnostic. Sesame even sports a stand-alone b-tree system, or in-memory operation, if you don't want an external database."

Tuesday, July 13, 2004

Nature nurturing Oracle

A presentation by Nature Publishing Group looking at Oracle's NDM as an RDF store; from the last slide:

"NPG and Oracle investigating the suitability of the NDM to store and query RDF-encoded information
o Storage looks OK
o Can hold directed labelled graphs
o Allows URIs, literals and blank nodes
o Can include provenance information

Querying may need more development:
o Can extract sub-graphs but performance and scalability need to be tested
o RDF/XML import and export would be desirable
o Support for RDFS- and OWL-based inferencing"

Mentions Urchin.

Thursday, July 08, 2004

Oracle's RDF Store

Create a logical Network Data Model in Oracle and it would be a great way to store RDF. That was what I thought after reading "Re: Chemistry and the Semantic Web".

The attached document is based on articles available from Oracle. An in-depth description is available in "Network Data Model Overview" (free registration required).

There are various schemas defined for storing networks, which include:
"NODE_NAME VARCHAR2(32) Name of the node.
NODE_TYPE VARCHAR2(24) User-defined string to identify the node type."

The schema is obviously not designed to store RDF, unlike other RDBMS mappings.

One difference is their flexibility in storing different graphs and giving links a "cost".

Another difference is their nodes and links are typed as strings; this looks like it would limit the effectiveness of data type operations. Querying for all nodes that are numbers between two values or dates between two ranges is going to be costly compared to dedicated data type handling. That's apart from the obvious difficulty in trying to put everything into a VARCHAR2(24).
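A hedged illustration of that cost (a hypothetical node table following the NDM-style schema quoted above): every typed comparison needs a per-row conversion, so no index on the raw string column can help.

SELECT NODE_NAME
FROM my_network_node$                          -- hypothetical NDM node table
WHERE TO_NUMBER(NODE_NAME) BETWEEN 10 AND 20;  -- converts every row before comparing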

Unless they have optimised the query layer specifically for the task, which might be the case, it will also incur the costs of joining against the same table many 10s or 100s of times.

It does have some neat operations (like shortest-path), a Java API, PL/SQL integration and of course it integrates well with existing Oracle databases.

Sunday, June 06, 2004

Not Relational Enough

Two recent articles about XML and database technology. One of the FAQs about Kowari is with respect to relational databases. Taking the pedantic view, the two databases mentioned are not relational but SQL. I've previously posted links about XML management systems too.

In, XML, the New Database Heresy "One of the major benefits of using XML in relational databases is that it is a lot easier to deal with fluid schemas or data with sparse entries with XML. When the shape of the data tends to change or is not fixed the relational model is simply not designed to deal with this. Constantly changing your database schema is simply not feasible and there is no easy way to provide the extensibility of XML where one can say "after the X element, any element from any namespace can appear". How would one describe the capacity to store “any data” in a traditional relational database without resorting to an opaque blob?

"I do tend to agree that some people are going overboard and trying to model their data hierarchically instead of relationally, which experience has taught us is a bad idea."

Edd Dumbill wrote in, Ron Bourret on XML and databases "My guess is that everything will pick up on this front in a year or two, with companies moving towards what I consider the holy grail of XML support in relational databases: native storage behind a first-class XML data type, XQuery support with extensions for (a) including relational data or SQL queries and (b) updates, SQL/XML support with extensions for embedded XQuery queries, and support for JSR 225 (see below)."

One of the aspects we found in implementing TKS/Kowari was that it pays to stick closely to the relational model. Everything comes down to tuples, and I think the problem with putting RDF on top of SQL databases is that they aren't relational, or "relational enough" (whatever that means).

Actually, the DAWG is asking for comments, maybe ensuring that the queries are expressed in relational algebra. That should prevent things like NULL getting in there.