
Significant performance boost with new MariaDB page compression on FusionIO


The MariaDB project is pleased to announce a special preview release of MariaDB 10.0.9 with significant performance gains on FusionIO devices. This is a beta-quality preview release.

Download MariaDB 10.0.9-FusionIO preview

Background

The latest work between MariaDB and FusionIO has focused on dramatically improving the performance of MariaDB on the high-end SSD drives produced by FusionIO, while at the same time delivering much better endurance for the drives themselves. FusionIO flash memory solutions also increase transactional database performance. MariaDB includes specialized improvements for FusionIO devices, leveraging a feature of the NVMFS filesystem on these popular, high-performance solid state disks. Using this feature, MariaDB 10 can eliminate some of the overhead within the InnoDB storage engine when used with FusionIO devices.

Figure 1 below shows the legacy SSD architecture on the left and the new FusionIO architecture on the right.


Figure 1: Legacy architecture on the left and the new FusionIO architecture on the right.

Doublewrite buffer

When InnoDB writes to the filesystem, there is generally no guarantee that a given write operation will be complete (not partial) if a power-off event occurs, or if the operating system crashes at the exact moment a write is being done.

Without detection or prevention of partial writes, the integrity of the database can be compromised after recovery. Therefore, InnoDB has long had a mechanism to detect and ignore partial writes: the InnoDB doublewrite buffer (page checksums, controlled by innodb_checksums, can also be used to detect a partial write).

The doublewrite buffer, controlled by the innodb_doublewrite system variable, comes with its own set of problems. Especially on SSDs, writing each page twice has detrimental effects (write amplification). The endurance of an SSD is also at stake, since there is a maximum number of writes it can handle before it needs to be replaced; writing everything twice cuts the expected lifetime in half.

Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write (causing a torn page condition), InnoDB can later find a good copy of the page in the doublewrite buffer during recovery.

A better solution is to directly ask the filesystem to provide an atomic (all-or-nothing) write guarantee. Currently, this is only available with the NVMFS filesystem on FusionIO devices, which provides atomic write functionality. It is supported by MariaDB’s XtraDB and InnoDB storage engines. To use atomic writes instead of the doublewrite buffer, add:

innodb_use_atomic_writes = 1

to the my.cnf config file.
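For illustration, a minimal my.cnf fragment might look like the following. The datadir path is hypothetical, and the last two settings are shown only for clarity, since MariaDB is documented to switch off the doublewrite buffer and use O_DIRECT automatically once atomic writes are enabled:

[mysqld]
# Data files must live on an NVMFS mount that supports atomic writes
datadir                  = /mnt/nvmfs/mysql   # hypothetical mount point
innodb_use_atomic_writes = 1
# Redundant when atomic writes are on; shown here only for clarity:
innodb_doublewrite       = 0
innodb_flush_method      = O_DIRECT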

For more information about this feature see https://mariadb.com/kb/en/fusionio-directfs-atomic-write-support/

InnoDB compressed tables

By using the InnoDB table options for compression, you can create tables where the data is stored in compressed form. Compression can help to improve both raw performance and scalability: less data is transferred between disk and memory, and the data takes up less space on disk and in memory. The benefits increase for tables with secondary indexes, because index data is compressed as well. Compression is especially relevant for SSD storage devices, because it reduces the amount of data written and thereby the wear on the device.

InnoDB stores uncompressed data in 16K pages, and these 16K pages are compressed into a fixed compressed page size of 1K, 2K, 4K, or 8K. This compressed page size is chosen at table creation time using the KEY_BLOCK_SIZE parameter. Compression is performed using regular software compression libraries (zlib).
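For illustration, a compressed table using this row format might be created as follows; the table and column names are hypothetical (this requires innodb_file_per_table=1 and innodb_file_format=Barracuda):

CREATE TABLE orders (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer VARCHAR(255),
  notes TEXT
) ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;  -- 16K pages are compressed into fixed 8K blocks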

Because pages are frequently updated, B-tree pages require special treatment. It is essential to minimize the number of times B-tree nodes are split, as well as to minimize the need to uncompress and recompress their content. Therefore, InnoDB maintains some system information in the B-tree node in uncompressed form, thus facilitating certain in-place updates. For example, this allows rows to be delete-marked and deleted without any compression operation.

Furthermore, InnoDB attempts to avoid unnecessary uncompression and recompression of index pages when they are changed. Within each B-tree page, the system keeps an uncompressed “modification log” to record changes made to the page. Updates and inserts of small records may be written to this modification log without requiring the entire page to be completely reconstructed.

When the space for the modification log runs out, InnoDB uncompresses the page, applies the changes and recompresses the page. If recompression fails, the B-tree nodes are split and the process is repeated until the update or insert succeeds.
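How often these recompressions fail in practice can be observed in the INFORMATION_SCHEMA.INNODB_CMP table, which tracks compression attempts per compressed page size; a sketch:

-- compress_ops minus compress_ops_ok approximates the number of
-- compression failures that forced a page split and retry
SELECT page_size,
       compress_ops,
       compress_ops_ok,
       compress_ops - compress_ops_ok AS failed_ops
  FROM information_schema.INNODB_CMP;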

To avoid frequent compression failures in write-intensive workloads, such as for OLTP applications, InnoDB reserves some empty space (padding) in the page, so that the modification log fills up sooner and the page is recompressed while there is still enough room to avoid splitting it. The amount of padding space left in each page varies as the system keeps track of the frequency of page splits.
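The padding heuristic is governed by two server variables; as a sketch, the values shown below are the documented defaults in MariaDB 10.0 / MySQL 5.6:

-- Recompression-failure rate (in percent) above which InnoDB starts
-- reserving padding in compressed pages:
SET GLOBAL innodb_compression_failure_threshold_pct = 5;
-- Upper bound (in percent) on the padding reserved per page:
SET GLOBAL innodb_compression_pad_pct_max = 50;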

However, all of this comes with clear drawbacks:

- Memory: both the uncompressed and the compressed copy of a page are stored in the buffer pool.
- Access: updates are applied to both copies in memory.
- CPU consumption: software compression libraries run on every read from disk and on every recompression after a split, and pages must be split, recompressed and rebalanced when the modification log overflows.
- Capacity benefit: the fixed compressed page size sets a bound on the compression benefit, and the modification log and padding take space, decreasing the benefit further.
- Poor adoption: the code is very complex, and the performance decrease compared to uncompressed tables is significant.

Solution: Page compression

Instead of storing both compressed and uncompressed pages in the buffer pool, store only the uncompressed 16KB pages there. This avoids the very complex logic around when a page needs to be recompressed or when a change should be added to the modification log, and there is no need for page splits and the like. Before creating a page-compressed table, make sure the innodb_file_per_table configuration option is enabled and innodb_file_format is set to Barracuda.
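As an illustrative sketch, creating a page-compressed table might look like this. The PAGE_COMPRESSED and PAGE_COMPRESSION_LEVEL table options are taken from the syntax later documented for MariaDB (the preview's exact syntax may differ), and the table name is hypothetical:

SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = 'Barracuda';

CREATE TABLE events (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(2048)
) ENGINE=InnoDB
  PAGE_COMPRESSED=1          -- compress at the fil layer just before write
  PAGE_COMPRESSION_LEVEL=6;  -- zlib level: trade CPU for compression ratio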

This work was done in co-operation with FusionIO, especially with:

Dhananjoy Das

Torben Mathiasen

When a page is modified, it is compressed just before it is written (at the fil layer), and only the compressed size (aligned to a sector boundary) is written. If compression fails, the uncompressed page is written to the file space instead. The unused 512B sectors in a compressed page are then trimmed with:

/* Punch a hole over the unused tail so the filesystem can release it */
fallocate(file, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, trim_len);

As a result, the NVMFS filesystem reports that less space is used on the media. If this fallocate call fails, an error is reported in the error log, trimming is disabled from then on, and the server continues normally.

When a page is read, it is decompressed before it is placed in the buffer pool.
