System performance tuning¶
Primary parameters¶
There are a couple of primary parameters that limit the performance clients can achieve with backup and restore with the Backup service. These include:
- Host CPU
- Available CPU cycles on the client host
- Available AES-acceleration on the client host's CPU
- Host RAM
- Available RAM on client host
- Network IO
- Available underlying network bandwidth between the client host and the server
- Available network throughput (TCP/IP) between the client host and the server
- Disk IO
- Available underlying disk system ~random IOPS
- Available underlying disk system streaming throughput
In order to determine the limit to how fast a current system can perform, the performance of these factors can be baselined. It may also interest the designer of a new system, to factor for the parameters given in this guide.
File backup vs. Image backup¶
When working with file backups, TSM will store each individual file as it changes over time, on the backup server. With image backups, an entire raw device can be backed up. Raw image backup is much faster since the underlying storage can deliver basically streaming sequential IO to service the read requests. But there is a trade-off in recovery time, since a users restore request for a single email or word document implies the backup administrator must restore and mount the image in order to restore the document, which can take time. Thus, image backups optimizes for minimizing speed of doing a backup at the expense of time to restore an individual file, which is the typical backup restore request to an IT department.
Conversely, file backups have a much less streaming sequential IO pattern from the client host, and optimize more towards more simply restoring a simple file, than the restore of an entire block device. Restoring from a file backup implies to work on top of a file system, which has a bit less sequential IO pattern than simply writing to the underlying block device from start to finish.
File backups is the normal mode of operation on typical servers including file servers, and what we recommend. Image backups may be a fit virtual environments where the VM in itself is a functional unit and does not necessarily contain individualized data.
Host CPU / RAM on the client host¶
CPU cycles¶
AES-acceleration¶
RAM¶
Network IO¶
Underlying network bandwidth¶
Point-to-point TCP/IP throughput¶
Disk IO¶
Reference system (a year 2010 system):
- Host OS: Debian GNU/Linux, version Wheezy
- Kernel version: 3.2.0-4-amd64 (standard)
- Drive systems:
- SATA (4x1TB 7200 RPM) Linux software RAID-5
- PATA (10x300GB 7200 RPM) Linux software RAID-5
- Logical storage construct:
- Both RAID volumes in one LVM2 volume
- dmcrypt block level AES-256 encryption per individual logical volumes
- Filesystem: EXT4
- CPU: AMD Phenom™ II X4 965 Processor - Quad-core 3.4 GHz
- RAM: 16 GB
Underlying disk system ~random IOPS¶
A regular backup run on a file system backed up with the service involves
iterating over all files of the protected file system(s) and performing
stat()
on them, in order to compare with what the backup server has stored.
The comparisons are made on file's attributes such as modify-time, size, etc.
Since the change pattern of files approach random, and all files need to be
accessed, it is relevant to base-line a storage systems
Base-lining the random IOPS of a storage system¶
Using the file IO tester fio
, we can base-line a storage system:
root@system:/usr/local/src/fio-tests# fio No jobs(s) defined 2.0.8 fio [options] [job options] <job file(s)> --debug=options Enable debug logging. May be one/more of: process,file,io,mem,blktrace,verify,random,parse, diskutil,job,mutex,profile,time,net --output Write output to file --timeout Runtime in seconds --latency-log Generate per-job latency logs --bandwidth-log Generate per-job bandwidth logs --minimal Minimal (terse) output --version Print version info and exit --terse-version=x Set terse version output format to 'x' --help Print this page --cmdhelp=cmd Print command help, "all" for all of them --enghelp=engine Print ioengine help, or list available ioengines --enghelp=engine,cmd Print help for an ioengine cmd --showcmd Turn a job file into command line options --eta=when When ETA estimate should be printed May be "always", "never" or "auto" --readonly Turn on safety read-only checks, preventing writes --section=name Only run specified section in job file --alloc-size=kb Set smalloc pool to this size in kb (def 1024) --warnings-fatal Fio parser warnings are fatal --max-jobs=nr Maximum number of threads/processes to support --server=args Start a backend fio server --daemonize=pidfile Background fio server, write pid to file --client=hostname Talk to remote backend fio server at hostname Fio was written by Jens Axboe <jens.axboe@oracle.com> Jens Axboe <jaxboe@fusionio.com>
It's required to
- locate a directory where tests can be done, as it uses files to perform its tests, and
- define a test description to the program.
Fio is very versatile, so we decide to benchmark on random reads, using 4k block size, 1G test files, 1 job running, and a couple of different IO queue depths to check how a presumed increased concurrency of the backup client may affect performance:
[global] write_bw_log write_lat_log write_iops_log runtime=30 time_based directory=/mnt/backups/ size=2G direct=1 ioengine=libaio bs=4k wait_for_previous rw=randread [1] iodepth=1 [2] iodepth=2 [4] iodepth=4 [8] iodepth=8 [16] iodepth=16 [32] iodepth=32 [64] iodepth=64 [128] iodepth=128
The test configuration is stored in a file called system-randread-4k.ini
.
Now prepare and launch Fio:
root@system:/usr/local/src/fio-tests/# mkdir -p system/randread-4k root@system:/usr/local/src/fio-tests/# cd system/randread-4k root@system:/usr/local/src/fio-tests/system/randread-4k# fio --latency-log --bandwidth-log --output system-randread-4k.log ../../system-randread-4k.ini
Fio will now run and create 8x2GiB files to test against, and then perform the tests in sequence with queue depths according to the test configuration file above. Direct IO means to bypass the Linux kernel's memory page cache, and perform each IO operation directly against the underlying storage.
Once completed, Fio will have generated a log file, system-randread-4k.log
,
which contains information about the jobs run:
1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=2 4: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4 8: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=8 16: (g=4): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16 32: (g=5): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32 64: (g=6): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 128: (g=7): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=128
And then much more detailed information about each run:
1: (groupid=0, jobs=1): err= 0: pid=9877 read : io=16024KB, bw=546897 B/s, iops=133 , runt= 30003msec slat (usec): min=16 , max=286 , avg=81.11, stdev=12.73 clat (usec): min=236 , max=33898 , avg=7395.16, stdev=3127.19 lat (usec): min=321 , max=34004 , avg=7477.98, stdev=3127.41 clat percentiles (usec): | 1.00th=[ 410], 5.00th=[ 2704], 10.00th=[ 3600], 20.00th=[ 4576], | 30.00th=[ 5536], 40.00th=[ 6432], 50.00th=[ 7328], 60.00th=[ 8256], | 70.00th=[ 9152], 80.00th=[10048], 90.00th=[11328], 95.00th=[12224], | 99.00th=[14272], 99.50th=[15168], 99.90th=[25472], 99.95th=[26752], | 99.99th=[34048] bw (KB/s) : min= 392, max= 595, per=100.00%, avg=534.05, stdev=31.05 lat (usec) : 250=0.05%, 500=1.15%, 750=0.75%, 1000=0.30% lat (msec) : 2=0.45%, 4=10.68%, 10=66.15%, 20=20.24%, 50=0.22% cpu : usr=0.32%, sys=1.48%, ctx=4104, majf=0, minf=145 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=4006/w=0/d=0, short=r=0/w=0/d=0
Fio will also have generated log files that on a system with gnuplot and some
font libraries installed, you can run fio_generate_plot
to get some gnuplot
images of the test traces such as:
- IO Bandwidth
- IO Operations
- IO Latency
- IO Submission Latency
- IO Completion Latency
For more Fio test examples, check out
The Fio HOWTO manual is at
Base-lining time to stat() a file system with many files¶
The file system /mnt/backups
on the reference host is 394 GiB large, out of
which 388 GiB are used, and there are ~2M files on it.
- First, get and compile the recursive-stat program:
root@system:/usr/local/src# git clone https://github.com/IPnett/cloud-BaaS Cloning into 'cloud-BaaS'... remote: Counting objects: 241, done. remote: Compressing objects: 100% (22/22), done. remote: Total 241 (delta 6), reused 0 (delta 0) Receiving objects: 100% (241/241), 46.05 KiB, done. Resolving deltas: 100% (96/96), done. root@system:/usr/local/src# cd cloud-BaaS/unix/baselining/recursive-stat/ root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat# make gcc -O2 -o recursive-stat recursive-stat.c root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat
- Second, get IOPS values from the underlying disk system before running the test:
This shows that there have been
root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat# grep 'md.' /proc/diskstats 9 2 md2 4663599 0 155657288 0 2888861 0 116705128 0 0 0 0 9 3 md3 17369972 0 791542623 0 177633 0 1382800 0 0 0 0
4663599 + 17369972 = 22033571
read IO ops on this system until now. - Third, empty the Linux dentry cache to make sure there are no cached direntries etc.
root@system:~# echo 3 > /proc/sys/vm/drop_caches
- Fourth, run the recursive stat program on the file system under test:
root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat# ./recursive-stat /mnt/backups/ Files stat()'ed: 1999531 Listing completed: 1999531 files stat()'ed 409.54 seconds elapsed 4882.38 files stat()'ed per second
-
Fifth, get the after-test IO ops values from the underlying disk system:
Apparently, now after the run (other background activity at an absolute minimum), there are9 2 md2 4718083 0 156093160 0 2888861 0 116705128 0 0 0 0 9 3 md3 17658828 0 793853471 0 177633 0 1382800 0 0 0 0
4718083 + 17658828 = 22376911
read IO ops registered by the kernel. This implies that22376911 - 22033571 = 343340
read IOPS were issued in order to traverse all folders on the path (there are ~many) and run stat() in them, or1999531 / 343340 = 5.82
files stat()'ed per read IO op.This also means an average of
343340 / 409.54 = 838.4
read IOPS were delivered from the underlying disk system.XXX: COMPARE WITH fio random read test.
-
Sixth, run the benchmark once again to view the impact of a dentry cache on the file server:
Not a single read request were issued to the underlying disk system, and the CPU time involved in iterating through all directories,root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat# ./recursive-stat /mnt/backups/ Files stat()'ed: 1999531 Listing completed: 1999531 files stat()'ed 11.78 seconds elapsed 169699.97 files stat()'ed per second root@system:/usr/local/src/cloud-BaaS/unix/baselining/recursive-stat# grep 'md.' /proc/diskstats 9 2 md2 4718083 0 156093160 0 2888861 0 116705128 0 0 0 0 9 3 md3 17658828 0 793853471 0 177633 0 1382800 0 0 0 0
stat()
'ing all files and hitting the dentry cache as well as progress updating the terminal, is at a minimum.
Underlying disk system streaming throughput¶
TBD ...