Letsgetdugg

Random tech jargon

Browsing the tag zfs

This is a follow-up to my previous post on how to correctly snapshot databases on ZFS. Snapshotting a live MySQL database without flushing and locking its tables first can leave the snapshot in a corrupt state, essentially making your backups useless.

Here is my script that I use to snapshot our MySQL database. It uses my zBackup.rb script for the automated backup rotation.

#!/bin/sh

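# The read lock is released as soon as the session that took it exits, so
# the snapshot runs from inside the same mysql session via the client's \! command.
mysql -h fab2 -u usr -ppass <<'EOF'
FLUSH TABLES;
FLUSH TABLES WITH READ LOCK;
\! /usr/bin/ruby /opt/zbackup.rb rpool/mydata 7
UNLOCK TABLES;
EOF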

I have a very non-standard storage setup at home: a 3x500G raidz array on ZFS hosted by OSX. For the longest time I could not get files to copy over Samba on ZFS. Files would stream just fine, but copies would abort at the 99% transfer point. Well, I have finally found the fix for it: turn off extended attributes!

smb.conf

#vfs objects = notify_kqueue,darwinacl,darwin_streams
vfs objects = notify_kqueue,darwinacl

; The darwin_streams module gives us named streams support.
stream support = no
ea support = no

; Enable locking coherency with AFP.
darwin_streams:brlm = no

As Charlton Heston would say, you can have my ZFS when you pry it from my cold dead hands.

Viva ZFS on OSX!


CouchDB was made for next generation filesystems such as ZFS and BTRFS. First off, unlike PostgreSQL or MySQL, CouchDB can be snapshotted while in production without any flushing or locking trickery, since it uses an append-only B-tree storage approach. That alone makes it a compelling database choice on ZFS/BTRFS.

Second, CouchDB works hand-in-hand with ZFS’s block level compression. ZFS can compress blocks of data as they are being written out to the disk. However, it only does so for new blocks, not retroactively. Now for the awesome part: on compaction, CouchDB writes out a brand new database file, which picks up whatever gzip compression setting is currently in effect on the ZFS dataset. This means you can try out different gzip compression settings just by compacting your CouchDB.
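Compaction is triggered over CouchDB's HTTP API with a POST to the database's `_compact` endpoint. Here is a minimal Ruby sketch of that call. The base URL `http://localhost:5984`, the database name `mydb`, and the helper names `compaction_request` and `compact!` are assumptions for illustration, and a secured server will also want admin credentials, which this sketch leaves out.

```ruby
require 'net/http'
require 'uri'

# Build the POST request that asks CouchDB to compact one database.
# CouchDB expects a JSON content type on this endpoint.
def compaction_request(base_url, db_name)
  uri = URI.join(base_url, "/#{URI.encode_www_form_component(db_name)}/_compact")
  req = Net::HTTP::Post.new(uri)
  req['Content-Type'] = 'application/json'
  req
end

# Send the request; returns true when CouchDB accepts the job (HTTP 202).
def compact!(base_url, db_name)
  req = compaction_request(base_url, db_name)
  res = Net::HTTP.start(req.uri.host, req.uri.port) { |http| http.request(req) }
  res.code == '202'
end

# Usage (hypothetical server and database):
#   compact!('http://localhost:5984', 'mydb')
```

Compaction runs in the background, so a 202 response only means the job was accepted, not that the new file is already written.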

Some tips on running CouchDB on ZFS:

1. Use automated snapshots to guard against $admin error. It is painless with ZFS, and CouchDB loves being snapshotted ;-)

You can give my little ruby script a try for daily snapshots; I use it both on Mac OSX and Solaris for automated ZFS snapshot goodness.

zfs snapshot rpool/couchdb@mysnapshot-tuesday

2. Try out various gzip compression levels on your CouchDB workload, re-compacting the database each time so it is rewritten with the new setting. I personally use gzip-4 for our workload, which strikes a good balance between space and CPU utilization.

zfs set compression=gzip-4 rpool/couchdb

3. Set the ZFS dataset to a 4k record size and turn off atime. Yes, the append-only B-tree approach is forgiving on writes, but a small 4k record size lets ZFS handle CouchDB's tiny writes nearly perfectly.

zfs set recordsize=4k rpool/couchdb
zfs set atime=off rpool/couchdb
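
To compare compression levels, you can read the dataset's `compressratio` property after each compaction. This is a small Ruby sketch of that check; the helper names `compressratio_command` and `parse_compressratio` are made up for illustration, and it assumes `zfs get -H -o value compressratio` prints a value such as `1.85x`.

```ruby
# Command that prints only the ratio value for one dataset.
def compressratio_command(dataset)
  ['zfs', 'get', '-H', '-o', 'value', 'compressratio', dataset]
end

# Turn the "1.85x" style value into a Float.
def parse_compressratio(text)
  m = text.strip.match(/\A(\d+(?:\.\d+)?)x\z/)
  raise ArgumentError, "unexpected compressratio: #{text.inspect}" unless m
  m[1].to_f
end

# Usage on a machine with the pool:
#   out = IO.popen(compressratio_command('rpool/couchdb'), &:read)
#   puts parse_compressratio(out)
```

Keep in mind that the ratio covers everything the dataset still holds, so blocks kept alive by older snapshots are counted too.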

I recently had a WD Raptor drive die in a server that hosted our PostgreSQL database. I had a ZFS snapshot strategy set up that sent snapshots of the live database to a ZFS mirror for backup purposes. Looked good in theory, right? Except I forgot to do one critical thing: test my backups. Long story short, I had a bunch of snapshots that were useless. Luckily I had offsite nightly PostgreSQL dumps that I did test, since they were used to seed my development database. So in the end I avoided catastrophic data loss.

With that lesson in mind, I reconfigured our backup system to do it correctly after re-reading the PostgreSQL documentation.

Prerequisite: You must have WAL archiving on and have the archive directory under your database directory. For example if your database is under /rpool/pgdata/db1 configure your archive directory under /rpool/pgdata/db1/archives

Completely optional, but I highly suggest you automate your backups. My zbackup ruby script is pretty simple to set up.

This is what my /rpool/pgdata/db1 looks like:

victori@opensolaris:/# ls /rpool/pgdata/db1
archives      pg_clog        pg_multixact  pg_twophase      postmaster.log
backup_label  pg_hba.conf    pg_stat_tmp   PG_VERSION       postmaster.opts
base          pg_ident.conf  pg_subtrans   pg_xlog          postmaster.pid
global        pg_log         pg_tblspc     postgresql.conf

Source for my pgsnap.sh script.

#!/bin/sh

PGPASSWORD="mypass" psql -h fab2 postgres -U myuser -c "select pg_start_backup('nightly',true);"

/usr/bin/ruby /opt/zbackup.rb rpool/pgdata 7

PGPASSWORD="mypass" psql -h fab2 postgres -U myuser -c "select pg_stop_backup();"

rm /rpool/pgdata/db1/archives/*

The process is quite simple. First, you issue a command to initiate the backup process so PostgreSQL goes into “backup mode.” Second, you take the ZFS snapshot; in this case I am using my zbackup ruby script. Third, you issue another SQL command to PostgreSQL to get out of backup mode. Lastly, since you now have the database snapshot, you can safely delete your previous WAL archives.
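
One thing the shell script does not guard against is the snapshot step failing and leaving PostgreSQL stuck in backup mode. Here is a minimal Ruby sketch of the same ordering with that guard. The function name `snapshot_in_backup_mode` and the injected `run` callable are assumptions for illustration (in practice `run` would shell out to psql or zbackup.rb), and the WAL archive cleanup is left out.

```ruby
# Run the backup steps in order. `run` is any callable that executes one
# step and raises on failure; it is injected so the ordering can be
# exercised without a database or a pool.
def snapshot_in_backup_mode(run, label: 'nightly', dataset: 'rpool/pgdata', days: 7)
  run.call(:sql, "select pg_start_backup('#{label}', true);")
  begin
    run.call(:shell, "/usr/bin/ruby /opt/zbackup.rb #{dataset} #{days}")
  ensure
    # Always leave backup mode, even if the snapshot step failed.
    run.call(:sql, 'select pg_stop_backup();')
  end
end

# Usage with a runner that only prints what it would do:
#   snapshot_in_backup_mode(->(kind, cmd) { puts "#{kind}: #{cmd}" })
```

The `ensure` block is the whole point: the stop call runs whether or not the snapshot succeeded, and any error from the snapshot still propagates to the caller.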

Now, this is all nice and dandy but you should *TEST* your backups, before assuming your backup strategy actually worked.

zfs clone rpool/pgdata@2010-6-17 rpool/pgtest
postgres --single -D /rpool/pgtest/db1 mydb

Basically, you clone the snapshot and test it by running it under PostgreSQL in single-user mode. Once in single-user mode, issue a few SQL queries against your data to confirm that the backup is readable and all is fine.
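
If you want to make the check repeatable, the clone, query, and cleanup steps can be generated as a list. This is a rough Ruby sketch under a few assumptions: the helper name `restore_check_commands` is made up, the query is piped into single-user mode on stdin, and the clone is destroyed afterwards, which the manual steps above do not do.

```ruby
require 'shellwords'

# Commands for a restore check: clone the snapshot, run one query through
# PostgreSQL in single-user mode, then drop the clone again.
def restore_check_commands(snapshot, clone, db_dir, db_name, query)
  [
    "zfs clone #{snapshot} #{clone}",
    "echo #{Shellwords.escape(query)} | postgres --single -D #{db_dir} #{db_name}",
    "zfs destroy #{clone}"
  ]
end

# Usage: print the commands for the example snapshot from this post.
#   puts restore_check_commands('rpool/pgdata@2010-6-17', 'rpool/pgtest',
#                               '/rpool/pgtest/db1', 'mydb',
#                               'select count(*) from pg_class;')
```

The sketch only builds the commands; run them by hand the first few times so you can read what PostgreSQL prints.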

ZFS you rock my world ;-)


I needed something like zfs-auto-snapshot, written by Tim Foster, but portable, so it works on all systems that support ZFS. I reviewed a few scripts on GitHub and was unhappy with what was out there, so I decided to write my own.

With zbackup.rb you can define what to snapshot and how many rotation days you want to go back.

So say you want a month of snapshots:

/usr/bin/zbackup.rb iraidz/zWork 30

Simple, no? ;-)

#!/usr/bin/env ruby

# Create snapshots for a 7 day rotation.
# ./zbackup.rb iraidz/zWork 7
#
# Add to crontab
# crontab -e
# 0 2 * * * /usr/bin/zbackup.rb iraidz/zWork 7

pool = ARGV[0]
days_back = ARGV[1].to_i

if pool.nil? or pool.empty?
  puts "\nDefine the pool you want to snapshot:"
  puts "\tex: zbackup.rb iraidz/zWork 7\n\n"
  exit 1 # non-zero so cron reports the failure
end

# ARGV[1].to_i is 0 when the argument is missing or not a number.
if days_back < 1
  puts "\nDefine how many days for your rotation:"
  puts "\tex: zbackup.rb iraidz/zWork 7\n\n"
  exit 1
end

# response from zfs list
curr_snaps = `zfs list -t snapshot -o name`
# days back limit variable
date_back = Time.now - (86400*days_back)

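curr_snaps.split(/\n/).each do |pline|
  # Anchor the match so only this dataset's own dated snapshots qualify.
  if m = pline.match(/\A#{Regexp.escape(pool)}@([0-9]+)-([0-9]+)-([0-9]+)\s*\z/)
    if date_back >= Time.local(m[1],m[2],m[3])
      # Snapshots are taken with -r, so destroy them recursively as well.
      `zfs destroy -r #{pline.strip}`
    end
  end
end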

# take snapshot for this run if needed.
month = Time.now.month
day = Time.now.day
year = Time.now.year
if curr_snaps !~ /^#{Regexp.escape(pool)}@#{year}-#{month}-#{day}$/
  `zfs snapshot -r #{pool}@#{year}-#{month}-#{day}`
end
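
The rotation rule is easier to reason about, and to test, as a pure function that never touches the pool. Here is a minimal Ruby sketch of that idea; `expired_snapshots` is a hypothetical helper, not part of zbackup.rb, and it assumes the same `dataset@YEAR-MONTH-DAY` naming the script uses.

```ruby
require 'date'

# Return the snapshot names for `dataset` whose date suffix is at least
# `days_back` days before `today`. Lines that do not look like
# dataset@YEAR-MONTH-DAY (the header, child datasets, other pools) are ignored.
def expired_snapshots(lines, dataset, days_back, today = Date.today)
  cutoff = today - days_back
  pattern = /\A#{Regexp.escape(dataset)}@(\d+)-(\d+)-(\d+)\z/
  lines.map(&:strip).select do |name|
    m = pattern.match(name)
    m && Date.valid_date?(m[1].to_i, m[2].to_i, m[3].to_i) &&
      Date.new(m[1].to_i, m[2].to_i, m[3].to_i) <= cutoff
  end
end

# Usage on a machine with the pool:
#   names = `zfs list -H -t snapshot -o name`.lines
#   puts expired_snapshots(names, 'iraidz/zWork', 7)
```

Printing the list first is a cheap dry run before you let anything call zfs destroy.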


Update: The following information could be beneficial to some; however, my issues were actually caused by Caviar Black drives shipping with TLER disabled. You need to pay Western Digital a premium for their “RAID” drives with TLER enabled. So for anyone reading this, avoid consumer Western Digital drives if you plan on using them for RAID.

zfs_vdev_max_pending

I can’t believe how long I have been tolerating horrible concurrent IO performance on OpenSolaris running ZFS. Whenever IO intensive writes are happening, the whole system slows to a crawl for any further IO. Running “ls” on an uncached directory is just painful.

victori@opensolaris:/opt# iostat -xnz 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   87.0    0.0 2878.1  0.0  0.0    0.0    0.4   0 100 c4t0d0
    0.0   83.0    0.0 2878.1  0.0  0.1    0.2    0.7   1  50 c4t1d0
    1.0    0.0   28.0    0.0  0.0  0.0    0.0    5.4   0   1 c4t2d0

Notice that c4t0d0 is 100% busy (the %b column). If a disk is 100% busy, good luck getting it to do any other operations, such as reads.
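
Spotting the saturated disk by eye gets old quickly. Here is a small Ruby sketch that pulls the busy devices out of `iostat -xn` style output; `busy_devices` and the 90% default threshold are my own choices for illustration, and it assumes the 11-column layout with `%b` in the tenth column.

```ruby
# Return [device, busy_percent] pairs from `iostat -xn` style text for
# devices whose %b column is at or above `threshold`.
def busy_devices(iostat_text, threshold = 90)
  rows = iostat_text.each_line.map do |line|
    fields = line.split
    # Data rows have 11 columns: ten numeric values and the device name.
    next unless fields.size == 11 && fields[0] =~ /\A\d/
    busy = fields[9].to_i
    [fields[10], busy] if busy >= threshold
  end
  rows.compact
end

# Usage on a live system (one-second sample, two reports):
#   puts busy_devices(`iostat -xnz 1 2`).inspect
```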

SATA disks do Native Command Queuing (NCQ) while SAS disks do Tagged Command Queuing (TCQ); this is an important distinction. OpenSolaris/Solaris seems to be optimized for the latter, with a deep per-device command queue set by default (zfs_vdev_max_pending, 35 on the builds of that era). This completely saturates SATA disks with IO commands, in turn making the system unusable for short periods of time.

Dynamically set the ZFS command queue to 1 to optimize for NCQ.

echo zfs_vdev_max_pending/W0t1 | mdb -kw

And add to /etc/system

set zfs:zfs_vdev_max_pending=1

Enjoy your OpenSolaris server on cheap SATA disks!


Recently a primary boot disk went bad on our server and I got blindsided by a non-bootable secondary mirror disk. All the data was intact, but I could not boot from it. This required a slow re-installation and migration process that took a very long time.

• EFI partitioned drives are not ZFS bootable.
• ZFS attach automatically partitions the drive as EFI.
• ZFS send/recv transfers on gzip compressed data slices are slow.

Here is the correct way of getting both the disks in the ZFS mirror to boot.

Plug the new drive that you want to add to the ZFS mirror into the server. If you're hot swapping or adding the drive while the server is still on, you need to use cfgadm to configure it.

victori@solaris:~# cfgadm -c configure sata1/1

Now that the drive is configured and seen by the server you need to repartition it with format so it can be used as a bootable device.

victori@solaris:~# format

AVAILABLE DISK SELECTIONS:
0. c4t0d0
/pci@0,0/pci8086,346c@1f,2/disk@0,0
1. c4t1d0
/pci@0,0/pci8086,346c@1f,2/disk@1,0
2. c4t2d0
/pci@0,0/pci8086,346c@1f,2/disk@2,0

* select your new drive *

# fdisk

* use fdisk to remove the EFI partition and add a solaris2 partition. *

Select the partition type to create:
1=SOLARIS2 2=UNIX 3=PCIXOS 4=Other
5=DOS12 6=DOS16 7=DOSEXT 8=DOSBIG
9=DOS16LBA A=x86 Boot B=Diagnostic C=FAT32
D=FAT32LBA E=DOSEXTLBA F=EFI 0=Exit?

This step is very important. If you do not repartition your drive, zpool attach will default the drive back to an EFI partition table, which is not bootable.

c4t0d0s2 — primary drive.
c4t1d0s2 — new drive that we are setting up.

victori@solaris:~# prtvtoc /dev/rdsk/c4t0d0s2 | fmthard -s - /dev/rdsk/c4t1d0s2

You should now be able to attach the secondary drive to your mirror using the identical slice.

zpool attach rpool c4t0d0s0 c4t1d0s0

Once the mirror is done synchronizing you need to install the bootloader on the drive.

victori@solaris:~# installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t1d0s0
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 267 sectors starting at 50 (abs 16115)
stage1 written to master boot sector
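
Since the order of these steps matters, it can help to generate them for a given pair of disks. This is a minimal Ruby sketch that only builds the command strings from the walkthrough above; `boot_mirror_commands` is a hypothetical helper, and the fdisk repartitioning still has to be done by hand first.

```ruby
# Build the ordered command list for adding `new_disk` to the root mirror.
# Slice numbers follow the walkthrough: s2 for the label copy, s0 for the pool.
def boot_mirror_commands(primary, new_disk, pool = 'rpool')
  [
    "prtvtoc /dev/rdsk/#{primary}s2 | fmthard -s - /dev/rdsk/#{new_disk}s2",
    "zpool attach #{pool} #{primary}s0 #{new_disk}s0",
    # Run this one only after `zpool status` shows the resilver has finished.
    "installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/#{new_disk}s0"
  ]
end

# Usage: print the commands for the disks used in this post.
#   puts boot_mirror_commands('c4t0d0', 'c4t1d0')
```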

Troubleshooting

victori@solaris:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t1d0s0

raw device must be a root slice (not s2)

You did not re-partition the drive to a solaris2 partition. EFI partitions can’t be made bootable. Use the format tool to reconfigure the drive with a solaris2 partition.

zpool attach rpool c4t0d0s0 c4t1d0s0

cannot open/stat device /dev/rdsk/c1t0d0s0

You did not copy your label information from your primary to your secondary disk with prtvtoc and fmthard.
