Most SQL systems were designed for transactional data processing. Many of these OLTP systems work with data one tuple at a time and are unsuitable for processing large amounts of data especially when analytical queries are involved. Analytic RDBMS products - mostly those that have a columnar architecture - generally do a much better job but are still plagued by the limitations of SQL when it comes to processing this kind of data. Rigid database schemas get in the way of application development and ongoing schema maintenance becomes a burden. Due to schema rigidity, data needs to be denormalized and stored in different tables. This introduces another complexity cost while performance also suffers.
To come around these limitations IT engineers frequently combine document databases or other NoSQL solutions with RDBMS products in order to store and process denormalized data. Unfortunately most of these document database systems suffer from the same issue that OLTP systems have when it comes to analytical type workloads. Some don't support any analytical queries at all and the sole purpose of using a NoSQL solution is to be able to easily store semi-structured data.
One of the primary goals of Raijin is to natively support semi-structured data to provide the flexibility that made document databases popular.
A big limitation of traditional SQL systems is schema rigidity. The Raijin database does not require a schema to be defined up-front allowing you to cope with data variety as records may contain different fields. The database does not store NULL values in order to efficiently deal with sparse data.
If you want schema enforcement, explicitly listing the column names and types in the CREATE TABLE statement will also work in order to remain compatible and ensure that your existing SQL code would still work. The difference from traditional SQL systems is that Raijin will accept records with field names that were not defined beforehand and it will allow using this data as if it were a regular column.
This allows denormalized data to be painlessly stored and processed.
SQL is a great declarative query language that all data analysts are familiar with.
The Raijin Database supports SQL as its primary query language while lifting some of SQL's limitations. It tries to adhere to the PostgreSQL syntax and also supports some other dialects that we considered useful.
In addition to lifting the requirement of having to define a schema, Raijin will also allow operations on some types that are forbidden in standard SQL. For example when an operation is attempted on a string (i.e. varchar) type, a conversion is done to deduce the required type and proceed with the operation instead of giving an error. The side-effect of being schemaless allows the same field (i.e. column) to contain different types and this automatic type conversion behavior aims to help with the veracity of the data. This makes Raijin's SQL dialect similar to a weakly typed language.
To be able to process large amounts of data, Raijin uses cache-aware algorithms and data structures to exploit the capabilities of modern CPUs. Instead of processing data one tuple at a time, it operates on data blocks. Using vectorized execution backed by optimized SIMD instructions, Raijin ensures that your CPUs are not wasting cycles. The code contains optimizations for the following two SIMD instruction set flavors:
- SSE2- supported by most x86 processors.
-- AVX2 - available in recent Intel and AMD CPUs.
Raijin is written in C++ using C++11 features and is compiled into a native binary instead of using an interpreted language that is inefficient at processing large amounts of data.
Most NoSQL solutions are inefficient at or totally lack support for analytical queries. Raijin DB supports group by and aggregations using standard SQL syntax.
Vectorized execution and cache-friendly algorithms allow large amounts of data to be operated on. Backed by optimized SIMD instructions and a modern compressed hybrid columnar storage layer Raijin strives to bring the most out of your hardware. This ensures that analytical queries touching millions of records will complete in seconds not hours.
Raijin can store data in a compressed format. Data compression not only saves disk space but provides a performance boost with modern CPUs. JSON is only an external transport format which can be used to import and export data and is not used internally. Instead, data is stored in a more space efficient binary format which can be compressed using block level data compression.
Raijin also employs data specific compression algorithms such as dictionary, run-length and delta encoding.
Compression can be enabled on a per-table basis.
The JSON data format has quickly become the defacto standard for data interchange in recent years.
In order to deal with sparse data, the Raijin Database uses a flat JSON representation for the data records. This is natively supported when loading and querying data unlike in other popular SQL solutions where this got introduced as a bolt-on afterthought. The JSON format is used both in query output as well as when loading data. Note that only flat JSON (i.e. a list of key-value pairs) is supported, unfortunately Raijin can't deal with nested data structures such as arrays and maps. For compatibility with existing SQL code, the traditional CSV format is also supported when loading data.
The support for the JSON format coupled with the REST API allows developers to get data in and out of the database easily, especially when interfacing with web-based applications.
The Raijin database provides an HTTP interface where queries and data can be submitted to. Queries using the standard SQL syntax are simply wrapped in the HTTP request and the database will return a result using JSON. Together with native JSON support this should enable rapid web-application development.
The same HTTP interface can be accessed from a web browser in order to submit SQL statements using a web-form so there is no need to install any client side tools.
Being able to crunch data as fast as possible - even if ultra-fast vectorized SIMD execution is used - will still waste precious CPU cycles if the data is uselessly processed. Our query optimizer employs the following techniques in order to avoid processing data that does not need to in order to make queries even faster:
- Block level operations and data skipping,
- Use of block-level metadata to answer queries,
- Other secret sauce and black magic.
Some of the above is still work in progress. As we enable these features, we expect the database to gain significant performance boost for some queries on large data sets.
The query optimizer in the Raijin Database engine is heavily integrated with the data storage and indexing layer, otherwise all this would be not possible as is the case with several well-known database solutions offering pluggable storage engines.
The Raijin Database engine is nowhere near complete. There is still an extreme amount of work to be done on features currently missing that we intend to implement, including but not limited to the following:
On the other hand there are features found in other SQL solutions that are not in-line with the design goals of Raijin and will probably not be implemented in the near future: