This is a follow-up to my previous post on how to correctly snapshot databases on ZFS. Snapshotting MySQL any other way can leave you with a corrupt database state, essentially making your backups useless.
Here is my script that I use to snapshot our MySQL database. It uses my zBackup.rb script for the automated backup rotation.
mysql -h fab2 -u usr -ppass -e 'flush tables;flush tables with read lock;'
/usr/bin/ruby /opt/zbackup.rb rpool/mydata 7
mysql -h fab2 -u usr -ppass -e 'unlock tables;'
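One thing to watch: MySQL drops a `FLUSH TABLES WITH READ LOCK` lock when the client session that took it disconnects, so separate `mysql` invocations may not hold the lock across the snapshot. Below is a minimal Ruby sketch of a single-session variant. It assumes the mysql command-line client's `system` command is available (Unix builds) and works when the batch is fed on stdin. The helper names `locked_snapshot_script` and `run_locked_snapshot` are hypothetical.

```ruby
# Build a mysql batch script in which the read lock, the snapshot, and the
# unlock all happen inside ONE client session.
# ASSUMPTION: the mysql client's `system` command (Unix builds) is usable in
# batch mode; it hands the rest of its line to the shell.
def locked_snapshot_script(snapshot_cmd)
  [
    "FLUSH TABLES;",
    "FLUSH TABLES WITH READ LOCK;",
    "system #{snapshot_cmd}",
    "UNLOCK TABLES;"
  ].join("\n") + "\n"
end

# Hypothetical helper: feed the script to a single mysql process on stdin.
def run_locked_snapshot(mysql_argv, snapshot_cmd)
  IO.popen(mysql_argv, "w") { |io| io.write(locked_snapshot_script(snapshot_cmd)) }
  $?.success?
end

if __FILE__ == $0
  # Dry run: print the batch instead of talking to a server.
  puts locked_snapshot_script("/usr/bin/ruby /opt/zbackup.rb rpool/mydata 7")
end
```

To run it for real, something like `run_locked_snapshot(["mysql", "-h", "fab2", "-u", "usr", "-ppass"], "/usr/bin/ruby /opt/zbackup.rb rpool/mydata 7")` would feed the batch to one client process.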
I have a very non-standard storage setup at home: a 3x500G raidz array on ZFS hosted by OSX. For the longest time I could not get files to copy over Samba on ZFS. Files would stream just fine, but copies would abort at the 99% transfer point. Well, I have finally found the fix for it: turn off extended attributes!
smb.conf
vfs objects = notify_kqueue,darwinacl
; The darwin_streams module gives us named streams support.
stream support = no
ea support = no
; Enable locking coherency with AFP.
darwin_streams:brlm = no
As Charlton Heston would say, "You can have my ZFS when you pry it from my cold dead hands."
viva ZFS on OSX!
CouchDB was made for next generation filesystems such as ZFS and BTRFS. First off, unlike PostgreSQL or MySQL, CouchDB can be snapshot while in production without any flushing or locking trickery since it uses an append only B-Tree storage approach. That alone makes it a compelling database choice on ZFS/BTRFS.
Second, CouchDB works hand-in-hand with ZFS’s block level compression. ZFS can compress blocks of data as they are being written out to the disk. However, it only does it for new blocks and not retroactively. Now, the awesome part, CouchDB on compaction writes out a brand new database file which can utilize the new gzip compression settings on ZFS. This means you can try out different gzip compression settings just by compacting your CouchDB.
Some tips on running CouchDB on ZFS:
1. Use automated snapshots to prevent $admin error, it is painless with ZFS and CouchDB loves being snapshot
You can give my little ruby script a try for daily snapshots; I use it both on Mac OSX and Solaris for automated ZFS snapshot goodness.
2. Try out various gzip compression schemes on your CouchDB workload, re-compact the database to use the new gzip compression settings. I personally use the gzip-4 compression for our workload which strikes the perfect balance between space and cpu utilization.
3. Set the ZFS dataset to 4k block record size and turn off atime. Yes the B-Tree append only approach is elastic on writes but you can have near perfect tiny writes with a small 4k block record size.
zfs set recordsize=4k rpool/couchdb
zfs set atime=off rpool/couchdb
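Tip 2 comes down to two steps: change the dataset's compression property (for example `zfs set compression=gzip-4 rpool/couchdb`), then ask CouchDB to compact so the database file is rewritten into newly compressed blocks. A minimal Ruby sketch of the compaction request follows. The helper names, host, and port are assumptions, and a secured CouchDB would also need admin credentials.

```ruby
require "net/http"
require "uri"

# Build (but do not send) the HTTP request that asks CouchDB to compact a
# database: POST /{db}/_compact with a JSON content type.
def compaction_request(db_name)
  req = Net::HTTP::Post.new("/#{URI.encode_www_form_component(db_name)}/_compact")
  req["Content-Type"] = "application/json"
  req
end

# Hypothetical helper: send the request. Host and port are assumptions
# (5984 is CouchDB's default port).
def compact!(db_name, host = "localhost", port = 5984)
  Net::HTTP.start(host, port) { |http| http.request(compaction_request(db_name)) }
end
```

Calling `compact!("mydb")` after each compression change lets you compare the resulting on-disk size per setting.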
I recently had a WD Raptor drive die in a server that hosted our PostgreSQL database. I had a ZFS snapshot strategy set up that sent ZFS snapshots of the live database to a ZFS mirror for backup purposes. Looked good in theory, right? Except I forgot to do one critical thing: test my backups. Long story short, I had a bunch of snapshots that were useless. Luckily I had offsite nightly PostgreSQL dumps, which I did test and which were used to seed my development database. So in the end I avoided catastrophic data loss.
With that lesson in mind, I reconfigured our backup system to do it correctly after re-reading the PostgreSQL documentation.
Prerequisite: You must have WAL archiving on and have the archive directory under your database directory. For example if your database is under /rpool/pgdata/db1 configure your archive directory under /rpool/pgdata/db1/archives
Completely optional, but I highly suggest you automate your backups; my zbackup Ruby script is pretty simple to set up.
This is what my /rpool/pgdata/db1 looks like:
victori@opensolaris:/# ls /rpool/data/db1
archives      pg_clog        pg_multixact  pg_twophase  postmaster.log
backup_label  pg_hba.conf    pg_stat_tmp   PG_VERSION   postmaster.opts
base          pg_ident.conf  pg_subtrans   pg_xlog      postmaster.pid
global        pg_log         pg_tblspc     postgresql.conf
Source for my pgsnap.sh script.
PGPASSWORD="mypass" psql -h fab2 postgres -U myuser -c "select pg_start_backup('nightly',true);"
/usr/bin/ruby /opt/zbackup.rb rpool/pgdata 7
PGPASSWORD="mypass" psql -h fab2 postgres -U myuser -c "select pg_stop_backup();"
rm /rpool/pgdata/db1/archives/*
The process is quite simple. First, you issue a command to initiate the backup process so PostgreSQL goes into "backup mode." Second, you take the ZFS snapshot; in this case I am using my zbackup Ruby script. Third, you issue another SQL command to PostgreSQL to get out of backup mode. Lastly, since you have the database snapshot, you can safely delete your previous WAL archives.
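If the snapshot step fails, the shell version stops before `pg_stop_backup()` runs and leaves the server in backup mode. Here is a minimal Ruby sketch of the same flow that always leaves backup mode. The `pg_snapshot` helper and its injected `runner` are hypothetical, and passing the password (for example via `PGPASSWORD`) is left out.

```ruby
# Run the three backup steps, making sure pg_stop_backup() is issued even if
# the snapshot step fails. `runner` is any callable that executes one command
# (an argv array) and returns true on success; it is injected so the flow can
# be dry-run.
def pg_snapshot(runner, psql_argv, snapshot_argv)
  started = runner.call(psql_argv + ["-c", "select pg_start_backup('nightly', true);"])
  return false unless started
  begin
    runner.call(snapshot_argv)
  ensure
    runner.call(psql_argv + ["-c", "select pg_stop_backup();"])
  end
end

if __FILE__ == $0
  # Dry run: print each command instead of executing it.
  dry_run = ->(argv) { puts argv.join(" "); true }
  psql = ["psql", "-h", "fab2", "postgres", "-U", "myuser"]
  pg_snapshot(dry_run, psql, ["/usr/bin/ruby", "/opt/zbackup.rb", "rpool/pgdata", "7"])
end
```

Swapping the dry-run lambda for `->(argv) { system(*argv) }` would execute the commands for real.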
Now, this is all nice and dandy but you should *TEST* your backups, before assuming your backup strategy actually worked.
postgres --single -D /rpool/pgtest/db1 mydb
Basically you clone the snapshot and test it by running it under PostgreSQL in single-user mode. Once in single-user mode, test your backup to make sure it is readable; you can issue a few SQL queries to confirm that all is fine with the backup.
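The clone-and-test routine can be sketched as a short command list. This is only an illustration under stated assumptions: snapshots are named `<dataset>@<year>-<month>-<day>` the way zbackup.rb names them, the clone is called `rpool/pgtest` and mounts at `/rpool/pgtest`, and the stale `postmaster.pid` copied from the live server has to be removed before PostgreSQL will start on the clone. The helper name is hypothetical and it only builds the commands.

```ruby
# Commands for test-restoring one snapshot. Built as strings, not executed.
# ASSUMPTIONS: snapshot naming follows zbackup.rb, and the clone dataset
# (default rpool/pgtest) mounts at /<clone>.
def restore_test_commands(dataset, date, clone = "rpool/pgtest", db = "mydb")
  snap = "#{dataset}@#{date.year}-#{date.month}-#{date.day}"
  datadir = "/#{clone}/db1"
  [
    "zfs clone #{snap} #{clone}",
    # the copied pid file still names the live server, so drop it from the clone
    "rm -f #{datadir}/postmaster.pid",
    "postgres --single -D #{datadir} #{db}",
    # throw the clone away once the check is done
    "zfs destroy #{clone}"
  ]
end

if __FILE__ == $0
  puts restore_test_commands("rpool/pgdata", Time.now)
end
```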
ZFS you rock my world
I needed something like zfs-auto-snapshot written by Tim Foster but portable so it works on all systems that support ZFS. I reviewed a few scripts on github and was unhappy with what was out there so I decided to write my own.
With zbackup.rb you can define what to snapshot and how many rotation days you want to go back.
So say you want a month of snapshots:
Simple, no?
#!/usr/bin/env ruby
# Create snapshots for a 7 day rotation.
#   ./zbackup.rb iraidz/zWork 7
#
# Add to crontab
#   crontab -e
#   0 2 * * * /usr/bin/zbackup.rb iraidz/zWork 7

pool      = ARGV[0]
days_back = ARGV[1].to_i # nil.to_i is 0, so a missing argument is caught below

if pool.nil? or pool.empty?
  puts "\nDefine the pool you want to snapshot:"
  puts "\tex: zbackup.rb iraidz/zWork 7\n\n"
  exit 1
end

if days_back < 1
  puts "\nDefine how many days for your rotation:"
  puts "\tex: zbackup.rb iraidz/zWork 7\n\n"
  exit 1
end

# response from zfs list (-H drops the header line)
curr_snaps = `zfs list -H -t snapshot -o name`.split(/\n/)

# days back limit variable
date_back = Time.now - (86400 * days_back)

# match only this dataset's dated snapshots, e.g. iraidz/zWork@2010-3-14
snap_re = /\A#{Regexp.escape(pool)}@(\d+)-(\d+)-(\d+)\z/

curr_snaps.each do |pline|
  if m = pline.match(snap_re)
    if date_back >= Time.local(m[1].to_i, m[2].to_i, m[3].to_i)
      # -r because the snapshot below is taken recursively
      system("zfs", "destroy", "-r", pline)
    end
  end
end

# take snapshot for this run if needed.
now   = Time.now
today = "#{pool}@#{now.year}-#{now.month}-#{now.day}"
unless curr_snaps.include?(today)
  system("zfs", "snapshot", "-r", today)
end
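To see which snapshots a given rotation would drop without touching ZFS, the date check can be pulled into a pure function. This is a minimal sketch of that logic, not part of zbackup.rb; `expired_snapshots` is a hypothetical name.

```ruby
# Return the snapshot names that fall outside the rotation window.
# `names` is a list such as `zfs list -H -t snapshot -o name` would print.
def expired_snapshots(names, pool, days_back, now = Time.now)
  cutoff = now - (86400 * days_back)
  pattern = /\A#{Regexp.escape(pool)}@(\d+)-(\d+)-(\d+)\z/
  names.select do |name|
    m = pattern.match(name)
    m && Time.local(m[1].to_i, m[2].to_i, m[3].to_i) <= cutoff
  end
end

if __FILE__ == $0
  names = ["iraidz/zWork@2010-6-1", "iraidz/zWork@2010-6-15", "iraidz/other@2010-1-1"]
  # With a 7 day rotation evaluated on 2010-6-20, only the 2010-6-1 snapshot is expired.
  p expired_snapshots(names, "iraidz/zWork", 7, Time.local(2010, 6, 20, 12, 0, 0))
end
```

Snapshots of other datasets are ignored because the pattern is anchored to the exact dataset name.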
Update: The following information could be beneficial to some; however, my issues were actually caused by Caviar Black drives shipping with TLER disabled. You need to pay Western Digital a premium for their "RAID" drives with TLER enabled. So for anyone reading this, avoid consumer Western Digital drives if you plan on using them for RAID.
zfs_vdev_max_pending
I can't believe how long I have been tolerating horrible concurrent IO performance on OpenSolaris running ZFS. Whenever IO-intensive writes are happening, the whole system slows to a crawl for any further IO. Running "ls" on an uncached directory is just painful.
victori@opensolaris:/opt# iostat -xnz 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   87.0    0.0 2878.1  0.0  0.0    0.0    0.4   0 100 c4t0d0
    0.0   83.0    0.0 2878.1  0.0  0.1    0.2    0.7   1  50 c4t1d0
    1.0    0.0   28.0    0.0  0.0  0.0    0.0    5.4   0   1 c4t2d0
Notice c4t0d0 is 100% busy (the %b column). If a disk is pegged at 100%, good luck getting it to do any other operations such as reads.
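Spotting a saturated disk can be automated. This minimal Ruby sketch assumes data rows in the `iostat -xn` layout shown above (ten numeric columns followed by the device name); `busy_devices` and its default threshold are hypothetical.

```ruby
# Pick out devices whose %b (percent busy) column is at or above a threshold.
# Expects data rows in `iostat -xn` layout: ten numeric columns, then the
# device name. Header and banner lines are skipped.
def busy_devices(iostat_output, threshold = 90)
  iostat_output.each_line.map(&:split).select { |f|
    f.length == 11 && f[0] =~ /\A\d+(\.\d+)?\z/ && f[9].to_i >= threshold
  }.map { |f| f[10] }
end

if __FILE__ == $0
  # Reads iostat output from stdin, e.g.: iostat -xnz 1 5 | ruby busy.rb
  puts busy_devices($stdin.read)
end
```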
SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing; this is an important distinction. OpenSolaris/Solaris seems to be optimized for the latter, with a 32-wide command queue set by default. This completely saturates SATA disks with IO commands, which in turn makes the system unusable for short periods of time.
Dynamically set the ZFS command queue to 1 to optimize for NCQ.
echo zfs_vdev_max_pending/W0t1 | mdb -kw
And add to /etc/system
set zfs:zfs_vdev_max_pending=1
Enjoy your OpenSolaris server on cheap SATA disks!
Recently a primary boot disk went bad on our server and I got blindsided by a non-bootable secondary mirror disk. All the data was intact but I could not boot from it. This required a slow re-installation and migration process that took a very long time.
• ZFS attach automatically partitions the drive as EFI.
• ZFS send/recv transfers of gzip-compressed datasets are slow.
Here is the correct way of getting both the disks in the ZFS mirror to boot.
Plug the new drive that you want to add to the ZFS mirror into the server. If you're hot swapping or adding a new drive while the server is still on, you need to use cfgadm to configure it.
Now that the drive is configured and seen by the server you need to repartition it with format so it can be used as a bootable device.
AVAILABLE DISK SELECTIONS:
0. c4t0d0
/pci@0,0/pci8086,346c@1f,2/disk@0,0
1. c4t1d0
/pci@0,0/pci8086,346c@1f,2/disk@1,0
2. c4t2d0
/pci@0,0/pci8086,346c@1f,2/disk@2,0
* select your new drive *
# fdisk
* use fdisk to remove the EFI partition and add a solaris2 partition. *
Select the partition type to create:
1=SOLARIS2 2=UNIX 3=PCIXOS 4=Other
5=DOS12 6=DOS16 7=DOSEXT 8=DOSBIG
9=DOS16LBA A=x86 Boot B=Diagnostic C=FAT32
D=FAT32LBA E=DOSEXTLBA F=EFI 0=Exit?
This step is very important. If you do not repartition your drive, zfs attach will default the drive back to an EFI partition table, which is not bootable.
c4t0d0s2 — primary drive.
c4t1d0s2 — new drive that we are setting up.
You should now be able to attach the secondary drive to your mirror using the identical slice.
Once the mirror is done synchronizing you need to install the bootloader on the drive.
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 267 sectors starting at 50 (abs 16115)
stage1 written to master boot sector
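For reference, the label copy, attach, and bootloader steps are commonly done with `prtvtoc`/`fmthard`, `zpool attach`, and `installgrub`. The Ruby sketch below only builds those command lines. It assumes the pool is named `rpool` and that the root slice is s0 on both disks, neither of which is stated above; `mirror_boot_commands` is a hypothetical helper.

```ruby
# Build (never execute) a typical command sequence for adding a bootable
# second disk to a ZFS root mirror on x86 Solaris/OpenSolaris.
# ASSUMPTIONS: the pool is named rpool and the root slice is s0 on both disks.
def mirror_boot_commands(primary, new_disk, pool = "rpool")
  [
    # copy the slice table (VTOC) from the primary to the new disk
    "prtvtoc /dev/rdsk/#{primary}s2 | fmthard -s - /dev/rdsk/#{new_disk}s2",
    # attach the matching root slice to the mirror
    "zpool attach #{pool} #{primary}s0 #{new_disk}s0",
    # once the resilver has finished, install GRUB on the new disk
    "installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/#{new_disk}s0"
  ]
end

if __FILE__ == $0
  puts mirror_boot_commands("c4t0d0", "c4t1d0")
end
```

Review the printed commands against your own device names before running any of them by hand.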
Troubleshooting
raw device must be a root slice (not s2)
You did not re-partition the drive to a solaris2 partition. EFI partitions can’t be made bootable. Use the format tool to reconfigure the drive with a solaris2 partition.
cannot open/stat device /dev/rdsk/c1t0d0s0
You did not copy your label information from your primary to your secondary disk with prtvtoc and fmthard.
