The MariaDB project is pleased to announce a special preview release of MariaDB 10.0.9 with significant performance gains on FusionIO devices. This is a beta-quality preview release.
Download MariaDB 10.0.9-FusionIO preview
Background

The latest work between MariaDB and FusionIO has focused on dramatically improving the performance of MariaDB on the high-end SSD drives produced by FusionIO, while at the same time delivering much better endurance for the drives themselves. Furthermore, FusionIO flash memory solutions increase transactional database performance. MariaDB includes specialized improvements for FusionIO devices, leveraging a feature of the NVMFS filesystem on these popular, high-performance solid state disks. Using this feature, MariaDB 10 can eliminate some of the overhead within the InnoDB storage engine when used with FusionIO devices.
Figure 1 below shows the legacy SSD architecture on the left and the FusionIO architecture on the right.

Figure 1: Legacy architecture on left and new FusionIO architecture on right.

Doublewrite buffer
When InnoDB writes to the filesystem, there is generally no guarantee that a given write operation will be complete (not partial) in the case of a power-off event, or if the operating system crashes at the exact moment the write is being done.
Without detection or prevention of partial writes, the integrity of the database can be compromised after recovery. Therefore, InnoDB has long had a mechanism to detect and ignore partial writes: the InnoDB doublewrite buffer (InnoDB page checksums can also be used to detect a partial write).
Doublewrite, controlled by the innodb_doublewrite system variable, comes with its own set of problems. Especially on SSDs, writing each page twice can have detrimental effects, since it doubles the write amplification. The endurance of an SSD is also at stake: a drive can handle only a limited number of writes before it needs to be replaced, and by writing everything twice the expected lifetime is cut in half.
Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write (causing a torn page condition), InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.
A better solution is to directly ask the filesystem to provide an atomic (all-or-nothing) write guarantee. Currently this is only available with the NVMFS filesystem on FusionIO devices, which provides atomic write functionality. This functionality is supported by MariaDB's XtraDB and InnoDB storage engines. To use atomic writes instead of the doublewrite buffer, add:
innodb_use_atomic_writes = 1

to the my.cnf config file.
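A minimal my.cnf sketch, assuming the behavior described in the MariaDB knowledge base, where enabling atomic writes also makes the doublewrite buffer redundant:

[mysqld]
# Atomic writes need a filesystem/device with atomic write
# support, e.g. NVMFS on a FusionIO card
innodb_use_atomic_writes = 1
# The doublewrite buffer becomes redundant; per the MariaDB KB it
# is switched off automatically when atomic writes are enabled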
For more information about this feature, see https://mariadb.com/kb/en/fusionio-directfs-atomic-write-support/
InnoDB compressed tables

By using the InnoDB table options for compression, you can create tables where the data is stored in compressed form. Compression can help to improve both raw performance and scalability: less data is transferred between disk and memory, and the data takes up less space on disk and in memory. The benefits are increased for tables with secondary indexes, because index data is compressed as well. Compression is especially important for SSD storage devices, because their capacity is smaller and more expensive per gigabyte than that of hard drives, and every avoided write also preserves their limited endurance.
InnoDB stores uncompressed data in 16K pages, and these 16K pages are compressed into a fixed compressed page size of 1K, 2K, 4K, or 8K. This compressed page size is chosen at table creation time using the KEY_BLOCK_SIZE parameter. Compression is performed using a regular software compression library (zlib).
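As an illustration, a classic compressed table with a 4K compressed page size could be created like this (table and column names are made up for the example):

SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = 'Barracuda';

CREATE TABLE compressed_example (
  id INT NOT NULL PRIMARY KEY,
  payload VARCHAR(255),
  KEY payload_idx (payload)   -- secondary index data is compressed too
) ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=4;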
Because pages are frequently updated, B-tree pages require special treatment. It is essential to minimize the number of times B-tree nodes are split, as well as to minimize the need to uncompress and recompress their content. Therefore, InnoDB maintains some system information in the B-tree node in uncompressed form, thus facilitating certain in-place updates. For example, this allows rows to be delete-marked and deleted without any compression operation.
Furthermore, InnoDB attempts to avoid unnecessary uncompression and recompression of index pages when they are changed. Within each B-tree page, the system keeps an uncompressed "modification log" to record changes made to the page. Updates and inserts of small records may be written to this modification log without requiring the entire page to be completely reconstructed.
When the space for the modification log runs out, InnoDB uncompresses the page, applies the changes and recompresses the page. If recompression fails, the B-tree nodes are split and the process is repeated until the update or insert succeeds.
To avoid frequent compression failures in write-intensive workloads, such as for OLTP applications, InnoDB reserves some empty space (padding) in the page, so that the modification log fills up sooner and the page is recompressed while there is still enough room to avoid splitting it. The amount of padding space left in each page varies as the system keeps track of the frequency of page splits.
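This padding behavior is tunable; here is a my.cnf sketch using the standard compression tuning variables from MySQL 5.6/MariaDB 10.0 (the values shown are the documented defaults):

[mysqld]
# Increase padding once more than this percentage of
# compression attempts for a table fail
innodb_compression_failure_threshold_pct = 5
# Never reserve more than this percentage of a page as padding
innodb_compression_pad_pct_max = 50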
However, this classic compression approach has clear drawbacks:
- Memory: both the uncompressed and compressed copies of a page are stored in the buffer pool
- Access: updates are applied to both copies in memory
- CPU consumption: software compression libraries are used (decompress on read from disk, recompress on split), and pages must be split, recompressed and rebalanced when the modification log overflows
- Capacity benefit: the fixed compressed page size sets a bound on the compression benefit, and the modification log and padding take space, decreasing the benefit
- Poor adoption: the code is very complex, and the performance decrease compared to uncompressed tables is significant

Solution: Page compression

Instead of storing both compressed and uncompressed pages in the buffer pool, only the uncompressed 16KB pages are stored there. This avoids the very complex logic of deciding when a page needs to be recompressed or when a change should be added to the modification log. Similarly, there is no need for page splits, etc. Before creating a page compressed table, make sure the innodb_file_per_table configuration option is enabled, and innodb_file_format is set to Barracuda.
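As a sketch, creating a page compressed table could look like this (this assumes the PAGE_COMPRESSED table option as documented for MariaDB; the exact syntax in this preview may differ, and the table name is illustrative):

SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = 'Barracuda';

CREATE TABLE page_compressed_example (
  id INT NOT NULL PRIMARY KEY,
  payload VARCHAR(255)
) ENGINE=InnoDB
  PAGE_COMPRESSED=1;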
This work was done in co-operation with FusionIO, especially with:
Dhananjoy Das
Torben Mathiasen
When a page is modified, it is compressed just before it is written (in the fil layer), and only the compressed size (aligned to a sector boundary) is written. If compression fails, we write the uncompressed page to the file space. We then trim the unused 512B sectors in the compressed page with:
fallocate(file, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, trim_len);

This leads to a situation where the NVMFS file system reports that less space is used on the media. If this fallocate call fails, an error is reported in the error log, trim is no longer used, and the server continues normally.
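For illustration, here is a self-contained C sketch of that trim-and-fall-back logic (a hypothetical helper, not the actual server source):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>      /* fallocate, FALLOC_FL_* */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int trim_enabled = 1;  /* disabled after the first failure */

/* Punch a hole over the unused 512B sectors at the end of a
   compressed page so the filesystem can report the space as free. */
static void trim_unused_sectors(int fd, off_t off, off_t trim_len)
{
    if (!trim_enabled || trim_len == 0)
        return;

    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  off, trim_len) != 0) {
        /* Log the error, stop using trim, and let the server
           continue normally, as described above. */
        fprintf(stderr, "fallocate(PUNCH_HOLE) failed: %s; disabling trim\n",
                strerror(errno));
        trim_enabled = 0;
    }
}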
When a page is read, it is decompressed before it is placed in the buffer pool.