
Disk drive options for your OpenZFS TrueNAS Core® server

ATA, SCSI, SATA, SAS, NL-SAS, SSD, PFP, 512n, 4Kn: confusing abbreviations, drive parameters and options for your ZFS installation at a glance.

Disk drive options for TrueNAS server

Disks are crucial building blocks of any storage solution, as the characteristics of both the disk and the filesystem it hosts can greatly impact the solution's performance. When choosing disks for your ZFS storage you should define the following parameters:

  • Physical Interface
  • Communication Protocol
  • Type
  • Capacity
  • MTBF
  • Capabilities
  • Power consumption and noise levels (relevant to HDDs only)

It is important to distinguish between the type of the disk’s physical interface and the communication protocol it uses. It is also important to understand how the Form Factor, combined with the physical interface and the used protocol, can affect the available storage capacities and the technical capabilities of a disk such as power failure protection and endurance.

Interface: SATA, SAS, U.2 and PCIe

Decide on the type of physical interface you will use. This will affect many other factors such as chassis backplane, individual disk IOPS, latency, and overall storage node throughput.

SAS and SATA drives are the standard these days. They use serial communication, and SAS drives can even send and receive data at the same time over the same cable (full duplex). SSDs are relative newcomers; some of them provide dazzling read and write speeds, but the price tag is still a deterrent for most high-density storage applications. That doesn’t mean you should immediately write SSDs off your shopping list because, when used smartly, they can greatly improve your OpenZFS server’s performance.

Pre-SATA – A bit of history

Before SATA we had Parallel ATA (PATA) and Small Computer Systems Interface (SCSI) disks. Electromagnetic interference within the individual server and higher manufacturing costs were important reasons why the industry abandoned them and moved on to the newer SATA/SAS cabling designs.

Parallel cables were flat, ribbon-like cables with 40-pin connectors on each end and a wire count of either 40 or 80 – the 80-wire version interleaving a ground wire between the signal wires to reduce interference.

Due to the high number of wires, the older parallel PATA cables were wide and flat and, as is the nature of parallel communication, could transfer many bits at once across their numerous wires. This meant they could move more data per transfer, but they had several main negative aspects:

  • They were relatively large and occupied more space within the server case (also cost more to manufacture)
  • PATA is Half Duplex. Data transmission and reception cannot happen at the same time.
  • They suffered from in-cable electromagnetic noise
  • PATA disks required setting jumpers to identify primary and secondary devices for boot
  • PATA ports were large. An average PC mainboard could have a maximum of 2 ports
  • PATA cables could carry a maximum of 2 disks, translating to a total of only 4 disks per PC

     

The electrical currents that carry data create faint magnetic fields around the wires they run through. Although each field is weak, when two data wires run parallel to each other their magnetic fields interact, disrupting the data travelling through the wires. This can cause complete transmission interruption (in-flight data loss) or, at the very least, increased transmission latency.

PATA being half duplex also meant longer interactions between the OS and the drive – since data transmission and reception cannot happen at the same time, one party had to wait for the other.


Two devices communicating over PATA

 

The move to SATA

Serial communication and cabling solved those issues by:

  • Using a seven-pin data cable
  • Isolating and shielding wires
  • Introducing Native Command Queuing (NCQ) of incoming commands
  • Giving each direction its own dedicated transmit and receive wire pair (the SATA protocol itself remains half duplex – see below)
  • SATA ports are much smaller and a PC board can easily have up to 16 of them.

     

SATA cables have far fewer wires (7) and are thus much thinner, giving manufacturers more design flexibility. The cables have internal dividers and shielding that help prevent magnetic field interference. They are also cheaper to manufacture and easier to plug in and out.

Serial communication also introduced Native Command Queuing (NCQ) as an extension of the Serial ATA protocol. This function is relevant to rotating HDDs. It basically allows the storage drive to select which read/write job, from the incoming queue of jobs, it will execute first. This helps the drive optimise the order in which the received read/write commands are executed, resulting in less unnecessary drive-head movement. NCQ translated to better performance because efficient head movement means less wasted time.
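As a quick sanity check, on a Linux-based host (e.g. TrueNAS SCALE) the kernel exposes the negotiated command queue depth of a drive in sysfs. The sketch below is a minimal, hedged example; the device name sda is an assumption – substitute your own drive.

```python
# Minimal sketch: read the negotiated command queue depth of a SATA/SAS drive
# on a Linux host. A depth of 1 means NCQ is effectively off; SATA NCQ allows
# up to 32 outstanding commands.
from pathlib import Path

DEVICE = "sda"  # assumption: adjust to your drive

depth_file = Path(f"/sys/block/{DEVICE}/device/queue_depth")
if depth_file.exists():
    depth = int(depth_file.read_text().strip())
    print(f"{DEVICE}: queue depth = {depth} ({'NCQ active' if depth > 1 else 'no NCQ'})")
else:
    print(f"{DEVICE}: no queue_depth attribute found (not an ATA/SCSI block device?)")
```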

SATA cabling also resulted in a higher signalling rate, which meant faster data throughput. SATA signalling is differential, with separate transmit and receive pairs, but the protocol remains half duplex.


Two devices communicating over SATA. A device either sends or receives over a specific path

 

One great feature newer SATA controllers offer is the Advanced Host Controller Interface (AHCI) which, when enabled in your BIOS settings, activates a host of useful features such as hot-swapping of SATA drives and, on most server boards, the ability to use the onboard controller to build software RAID arrays from your SATA drives. AHCI also offers higher performance than the legacy IDE mode.


SATA controller on a server mainboard with the AHCI mode and disk hot-plugging enabled.

 

SCSI, the grandfather

SCSI (Small Computer System Interface) was an old parallel-based transmission technology that served as an alternative to PATA/IDE. It was used to connect devices to computers. The “Small” part of the name comes from the fact that it was designed for personal and home/office computer systems, which were far smaller than the enterprise mainframe ones.

SCSI used its own command set, and its interfaces allowed users to connect up to 16 different devices (storage devices, optical devices, printers, scanners, etc.) to the same SCSI controller using a single cable. Precise termination was a big issue with SCSI: the last device on the cable had to be terminated so that the controller would know where the signal ends. Termination was either internal (embedded into the device) or external, via a physical terminator connected to the end of the SCSI cable. If the termination was improper or got messed up for some reason, the entire string of devices would be unusable until the termination or cabling was fixed.

State of the art at the time

One big advantage of SCSI disks was that they tended to spin at higher RPMs than IDE disks. While IDE disks came with 5400 RPM mechanisms, SCSI disks usually came with 10K and 15K ones. Faster spinning meant faster reads and writes, but it also meant more noise.

The other big advantage of SCSI was that you could connect multiple disks to the same cable. IDE, on the other hand, allowed for only two disks per cable, and IDE-capable mainboards could handle a maximum of only four devices.

SCSI disks were also more expensive and usually used in pricier business machines whose users needed the benefits of SCSI. Back in the day (the 80s and 90s) most Apple computers came with SCSI disks and peripheral devices.


Right to left: SCSI, PATA, and SATA cables. The black device on the right is a SCSI Terminator.

 

SAS is the new SCSI

SAS stands for “Serial Attached SCSI”. It is a step forward from the old SCSI’s parallel-based transmission. SAS is quite similar to SATA since it uses cabling with the same number of wires.

So, what is the difference between SAS and SATA cabling?

In a SATA cable, all data wires are grouped in one cable run. In a SAS cable, the 4 data wires are separated into 2 groups of 2 wires each – one incoming and one outgoing path per group. This segregation allows us to connect more devices to each other.

  • With SATA cables, you can only connect the motherboard directly to the disk. True, you could attach an expansion device, but that costs money and takes up room inside the server. SATA cables also only run for a short distance of up to one metre.
  • With SAS cables, you can daisy-chain devices – just like old SCSI. You can connect the motherboard to both a disk and another piece of hardware that has SAS connectors. SAS also supports longer cable lengths of up to 20 metres.

Which is the better disk drive option, SAS or SATA?

The answer greatly depends on your use case.

Comparing the mechanical version of both disk types (more on SSDs later), SATA drives have great storage capacities at a reasonable price. But, because of their relatively basic performance levels, SATA disks are more suitable for dense storage installations where you usually have tens, hundreds or even thousands of them. This way you can harvest the collective performance of the drives. On the other hand, SAS drives spin faster, have twice the bandwidth and IOPS of SATA, and are a bit more expensive. They also consume a bit more power, run hotter, and are noisier.

Looking at the spinning (mechanical) versions only, we can see the following (a rough aggregate estimate follows the two lists):

SATA disks generally offer the following:

  • ~80 IOPS on average
  • 7200 RPM
  • Larger capacity at the best price per terabyte
  • Good cache buffer (64 and 128 MB are common; enterprise versions have up to 512 MB)
  • Reasonable MTBF
  • Average power consumption
  • Lower operating noise and temperature

SAS disks generally offer the following:

  • ~150 IOPS on average
  • 10K and 15K RPM
  • Large capacities, although at a higher price per terabyte
  • Bigger cache buffer (256 and 512 MB are the norm)
  • Power consumption slightly higher than SATA
  • Operating noise levels also higher than SATA disks, due to the higher RPMs
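To put those per-drive numbers in context, here is a rough back-of-the-envelope estimate of the aggregate random IOPS you could expect from a shelf of drives, using the average figures quoted above. The drive counts are arbitrary example values, not a recommendation.

```python
# Rough estimate of aggregate random IOPS for a set of spinning drives,
# using the average per-drive figures quoted above (SATA ~80 IOPS, SAS ~150 IOPS).
# Real-world results depend on RAIDZ layout, record size, caching, and workload.
PER_DRIVE_IOPS = {"SATA 7.2K": 80, "SAS 10K/15K": 150}

def aggregate_iops(drive_type: str, drive_count: int) -> int:
    """Naive aggregate: per-drive IOPS multiplied by the number of drives."""
    return PER_DRIVE_IOPS[drive_type] * drive_count

for drive_type, count in (("SATA 7.2K", 24), ("SAS 10K/15K", 12)):
    print(f"{count:>2} x {drive_type}: ~{aggregate_iops(drive_type, count)} random IOPS")
```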

 

NL-SAS

Near-line SAS is a SAS variant spinning at 7200 RPM. NL-SAS offers a middle ground between enterprise SAS and enterprise SATA disks. They are basically 7200 RPM enterprise SATA drives with a SAS interface, making the only difference between them the protocol they use to speak with the server. Under the hood, and except for the board/interface, they are mechanically identical. This makes them an attractive option for many users, since they offer the high capacities and the good performance of SATA – paired with a SAS interface – at a lower price than 10K/15K RPM SAS.

SAS JBODS

One other advantage of SAS is that SAS controllers can have External ports to which you can attach JBOD enclosures. Those enclosures contain only backplane boards and can host additional drives to help increase your server’s capacity. No PC mainboard, CPU or RAM required, which means less expenditure. 

Time Limited Error Recovery

Time Limited Error Recovery (TLER), Error Recovery Control (ERC), and Command Completion Time Limit (CCTL) are three different names for the same technology. It is a function available in the firmware of NAS and enterprise-grade storage drives.

When your hard drive encounters an error while reading from or writing to its platters, the drive will automatically keep trying to recover from that error. While the disk is trying to recover, the system has to wait on the disk to respond. If the disk takes too long, the system will slow down. If the disk never completes the recovery, the system might freeze or even halt – forcing you to reboot with your fingers crossed that your data will still be accessible.

TLER improves this situation by limiting the disk recovery attempts to a defined number of seconds. 7 seconds is the recognised industry standard here, which should be enough for the disk to recover from a still readable/writable block. If the disk cannot complete the recovery within the given 7 seconds it will abort the recovery and move forward. Any corruption will then be corrected using the redundant data within the OpenZFS RAIDZ array.  

One critical difference between how OpenZFS and conventional RAID controllers handle TLER-enabled drives: with a RAID controller, when an error occurs on the disk, the controller may stop using that disk altogether, mark it as faulty, and may even send the entire array into recovery mode – forcing you to replace the drive and imposing downtime.

OpenZFS, on the other hand, will recognise the issues with the drive, log the errors, and continue to use the drive – while keeping an eye on the errors. If the drive continues to produce errors ZFS will eventually mark it as degraded, and alert you to replace it – completely leaving the decision to you.

Some desktop drives’ firmware supports TLER but it is not enabled at the factory, and there is plenty of information online on how to enable the feature. Just remember that, even if you manage to enable TLER on those drives, that will not make them any better suited for storage servers, because they are still rated for only 8 hours of daily load.
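If you want to check whether a drive exposes TLER/ERC, smartmontools can query and set the timeout (expressed in tenths of a second, so 70 = 7.0 s). Below is a small sketch that shells out to smartctl; the device path is only an example, and not every desktop drive will accept the set command or remember it across power cycles.

```python
# Sketch: query (and optionally set) the Error Recovery Control / TLER timeout
# of a drive using smartmontools. Values are in tenths of a second (70 = 7.0 s).
# Requires root privileges and a smartctl binary; /dev/ada0 is just an example path.
import subprocess

DEVICE = "/dev/ada0"  # FreeBSD-style example; on Linux use e.g. /dev/sda

# Show the current ERC/TLER read and write timeouts (if the drive supports them)
subprocess.run(["smartctl", "-l", "scterc", DEVICE], check=False)

# Uncomment to set both read and write recovery limits to 7 seconds.
# subprocess.run(["smartctl", "-l", "scterc,70,70", DEVICE], check=False)
```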

 

Avoid the SMR trap!

In mechanical disks, data is stored in concentric tracks on the surface of the spinning platters – with a microscopic distance separating each track from the next. This conventional layout (Conventional/Perpendicular Magnetic Recording, CMR/PMR) is the standard in enterprise storage drives.

SMR drives, on the other hand, use a Shingled layout to shrink the area those data tracks occupy thus increasing the total number of tracks and the overall amount of data that could be written on a single platter – thus, increasing disk capacity without having to increase the number of platters. A smart idea which could add up to 20% more capacity to the disk, except that the physics of the universe does not like it much. 

Shingling means that tracks must overlap each other, just like an array of roof tiles. During a write to an area where data has already been saved, the overlapped neighbouring tracks must be rewritten for the new data to go in. To avoid data loss, the disk must first read all the data stored in that area (the SMR zone) and save it to its cache before writing the new data. When the drive is idle, it retrieves the data stored in its cache and starts writing it back to the physical platters. This process takes time and results in slower disk performance. It is also why SMRs have such attractive cache amounts that the manufacturers make sure you can see on the label 😉

SMR drives have the worst I/O characteristics of all, especially for random reads and writes. And resilvering (OpenZFS’s term for a RAID rebuild) will take much longer, especially when your pool is full.

SMR disks may make sense in archival systems. But in demanding storage applications SMR is not worth the few cents you think you will save by purchasing it; soon you will have to throw those drives away to rescue your server’s deteriorating I/O. Do your research before buying a disk and watch out for disk manufacturers submarining SMR drives as non-SMR ones.

Shingling means that the tracks overlap when writing.
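One partial check on a Linux host is the kernel's zoned-device attribute in sysfs, sketched below. Note the big caveat: only host-managed and host-aware SMR drives report themselves as zoned; drive-managed SMR usually reports "none", so checking the manufacturer's datasheet for the exact model number remains the only reliable method.

```python
# Sketch: list block devices and their "zoned" attribute on a Linux host.
# "host-managed" or "host-aware" indicates an SMR drive; "none" does NOT prove
# the drive is conventional, because drive-managed SMR hides its zones.
from pathlib import Path

for dev in sorted(Path("/sys/block").iterdir()):
    zoned_file = dev / "queue" / "zoned"
    if zoned_file.exists():
        print(f"{dev.name}: zoned = {zoned_file.read_text().strip()}")
```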

Select the mechanical disk with the correct endurance rating

All mechanical drives have a time-based endurance rating. This is usually advertised in an hour/day format. For a storage server that is intended to stay on year-round you should only use drives that are rated for 24 hours, seven days a week – because those drives are manufactured to sustain such load for the duration of their warranty. If you plug consumer grade drives in your server they will fail more often and can cause data loss should more drives fail during data rebuild. See MTBF section below.

 

Solid-State Disks (SSDs)

Although SSDs have been around for quite a while, they are relative newcomers to the high-capacity storage market. This is because of their much higher cost per terabyte and the relatively limited capacities most of them offer.

SSDs offer many advantages over HDDs. Some of those advantages are:

  • SSDs have no moving mechanical parts. Instead of platters they use flash memory (NAND) chips to store the data.
  • The use of flash NAND translates to much higher IO performance and much lower latencies than their mechanical siblings. This is very beneficial in demanding applications such as VM storage.
  • Consume much less power than HDDs.
  • Emit much less heat.
  • Do not produce noise (unless they have a faulty component or you can hear electricity running through the board)

The biggest disadvantage of SSDs is their price.  SSDs cost way more per TB than the HDDs currently available on the market. If your specific deployment will not benefit from pure-SSD setup, then it makes sense to use HDDs.

OpenZFS allows for the use of fast SSDs as cache (L2ARC) and log (SLOG) devices to help improve your storage pool’s performance. You also have the ability to construct multiple HDD-based and SSD-based pools within the same TrueNAS Core® server and dedicate each pool to the appropriate use case. This way you can have a high-capacity pool for your archive data and a faster pool for your VMs.

Consider those options if your use case can benefit from the extra IOPS.
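As an illustration of the point above, OpenZFS lets you attach fast SSDs to an existing HDD pool as separate log (SLOG) or cache (L2ARC) devices with zpool add. The sketch below simply shells out to zpool; the pool name "tank" and the device names are assumptions, and on TrueNAS Core® you would normally do this through the web UI rather than the command line.

```python
# Sketch: attach SSDs to an existing HDD pool as SLOG and L2ARC devices.
# Pool name "tank" and the device names are examples only; on TrueNAS Core the
# web UI is the recommended way to do this. Run as root.
import subprocess

POOL = "tank"                   # assumption: your pool name
SLOG_MIRROR = ["ada4", "ada5"]  # assumption: two PFP-equipped SSDs for a mirrored SLOG
L2ARC_DEV = "ada6"              # assumption: one SSD for read cache

# Mirrored SLOG: protects in-flight synchronous writes even if one SSD fails
subprocess.run(["zpool", "add", POOL, "log", "mirror", *SLOG_MIRROR], check=True)

# L2ARC read cache: losing it is harmless, so a single device is fine
subprocess.run(["zpool", "add", POOL, "cache", L2ARC_DEV], check=True)
```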

 

Which interfaces do SSDs come with?

SSDs are available with a wider set of interfaces than their mechanical alternatives. Those interfaces include:

  • SATA
  • SAS
  • U.2 (NVMe)
  • PCIe (NVMe)
  • M.2 (SFF SATA and NVMe)

SATA SSDs run on a 6 Gbit/s interface and offer up to about 550 MB/s of throughput. SAS SSDs have 12 Gbit/s of bandwidth and about 1,200 MB/s of throughput – twice the bandwidth of SATA.

PCIe/NVMe disks, on the other hand, offer roughly 1 GB/s per lane (for PCIe 3.0). This means that if you plug your NVMe disk into a 4x PCIe slot you can get about 4 GB/s of throughput. The same logic applies to 8x and 16x cards/slots.
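The arithmetic is simple enough to put in a couple of lines; the ~1 GB/s figure is the approximate per-lane rate of PCIe 3.0, so treat the results as ballpark numbers for that generation.

```python
# Ballpark NVMe throughput per slot width, assuming ~1 GB/s per PCIe 3.0 lane
# (newer PCIe generations roughly double this per lane).
PER_LANE_GBPS = 1.0  # approximate usable GB/s per PCIe 3.0 lane

for lanes in (1, 4, 8, 16):
    print(f"x{lanes:<2} slot: ~{lanes * PER_LANE_GBPS:.0f} GB/s")
```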

U.2 is a relative newcomer (announced in 2015). Although the U.2 bay connector is mechanically compatible with the SAS one, it uses the NVMe/PCI Express bus instead of SATA or SAS – through the use of dedicated pins.

PCIe and NVMe drives are faster because they enjoy a shorter, and accelerated, path to the CPU.

And there are no NL-SAS SSDs.


All common SSD interfaces and form factors. Some of these were pre-production test samples. Exclusive access to such samples is one of the perks of working at an established storage server supplier.

 

SSDs and CPU utilisation

It is important to keep this point in mind: because they are much faster than HDDs, SSDs require a faster CPU that can keep up with their speeds and the resulting load. If you have a slow CPU, your storage server will not perform as expected, even if it is fully loaded with SSDs. This negative impact will be even more noticeable if you enable AES encryption, and will worsen if your CPU does not natively support it. So, make sure your processor has enough cores, enough speed, and supports the AES-NI instructions.

Are SSDs reliable?

The most straightforward answer is: for storage purposes, enterprise SSDs are reliable. Stay away from consumer-grade SSDs when buying disks for your TrueNAS Core®. The reason lies in how an SSD is built and what mechanisms it uses to protect your data. Here is a bit of insight into how enterprise SSDs help better protect your data:

Error Correction

SSD controllers use Error Correction Code (ECC) technology to detect and correct most of the errors that can affect “data in flight” – the data moving between the host computer and the SSD along this path. The flash memory stores additional error-correction information alongside the data, and the NAND chips themselves also have error-detection mechanisms in place to help detect any corruption that might happen as the data moves from the controller to the chip.

Reserve Blocks and Overprovisioning

Ever wondered why your Apple MacBook Pro SSD was reporting that weird 120GB of usable capacity? That is because it used more than 8GB of Reserve Blocks.

Data is stored in writable blocks within the cells of the NAND chips. When writing data to a magnetic mechanical disk, the write head simply overwrites existing data by re-magnetising the write area – similar to re-recording on a cassette tape. Unlike magnetic disks, NAND cells must be erased before new data can be written to them. This erase/write cycle wears out the cells the more you write data to them. Very similar to the bad-block phenomenon of mechanical disks, when a NAND cell gets worn out it is no longer able to store or present data. This is the biggest disadvantage of NAND.

To work around this issue, SSD manufacturers add extra NAND chips to provide “hidden” spare blocks on their SSDs. When a NAND block dies, the disk controller marks it as dead, deactivates it, and activates and remaps one of the spare blocks to replace it. This process is blazing fast, happens on the fly, and is not detectable by the user or the operating system – which means your OS and any applications running on top of it will not hang. Not for a second.

The number of those Reserve Blocks is much higher in enterprise SSDs than in consumer ones. Some manufacturers add more than 20% of the blocks to that reserve to guarantee a longer life span.

There is also the ability to overprovision an SSD disk. Overprovisioning is the process of adjusting the OS-detectable usable capacity of an SSD disk to a lower value, allowing for even more Reserve Blocks. This is usually done by a special software utility provided by the SSD manufacturer.

Drive Writes per Day (DWPD) and Total Bytes Written (TBW)

Finally, all SSDs have an endurance rating, expressed either as Drive Writes per Day (DWPD) or as Total Bytes Written (TBW). DWPD states how many times the drive’s full capacity can be written every day for the duration of its warranty, while TBW states the total amount of data that can be written to the SSD over its lifetime. This is a critical number to consider when choosing your SSDs. Just like their mechanical counterparts, SSDs wear out and eventually fail; while rotating drives have a time-based warranty, SSDs have a writes-based one. As explained above, NAND cells must be erased before new data can be written to them, and this erase/write cycle wears the cells out the more you write.

When looking for SSDs, consider your specific application and choose the DWPD rating accordingly. If you are going to use those disks as OpenZFS SLOGs, then you want the highest possible DWPD value. L2ARC generally is not as write-intensive as SLOG.

The original DWPD rating, combined with some smart manual overprovisioning of the SSD, can help you extend the life cycle, and the performance, of your SSD well beyond the manufacturer specifications. For instance, overprovisioning a 512 GB SSD down to 256 GB should allow that SSD to serve for roughly twice as long.
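The relationship between DWPD, TBW and overprovisioning is simple arithmetic, sketched below under the usual assumption that TBW ≈ DWPD × drive capacity × warranty days. The 512 GB / 1 DWPD / 5-year figures are example values, not a specific product's datasheet.

```python
# Sketch of the usual endurance arithmetic: TBW ~ DWPD x capacity x warranty days.
# The figures below are example values, not a specific product's datasheet.
CAPACITY_TB = 0.512      # nominal drive capacity in TB
DWPD = 1.0               # rated drive writes per day
WARRANTY_YEARS = 5

tbw = DWPD * CAPACITY_TB * 365 * WARRANTY_YEARS
print(f"Rated endurance: ~{tbw:.0f} TB written over {WARRANTY_YEARS} years")

# Overprovisioning to half the usable capacity leaves the same total NAND
# (and the same TBW) behind half as much exposed space, so the daily writes
# the *usable* capacity can absorb roughly double.
usable_tb = CAPACITY_TB / 2
effective_dwpd = tbw / (usable_tb * 365 * WARRANTY_YEARS)
print(f"After overprovisioning to {usable_tb * 1000:.0f} GB: "
      f"~{effective_dwpd:.1f} DWPD relative to the usable capacity")
```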


The boards of an enterprise and a consumer-grade SSD. You can clearly see the consumer board’s limited number of NAND chips, which means a lower number of Reserve Blocks.

 

Power Failure Protection (PFP)

Capacitors are electrical components that can hold an electrical charge, just like a battery. But, unlike batteries, they are much faster to charge – a capacitor can be charged in a fraction of a second. The downside is that most ordinary capacitors can hold only a tiny charge, which is not nearly enough to power an SSD. But there is a newer class of capacitors that can hold a relatively large charge in a small form factor – small enough to be deployed inside an SSD. Those are called Super Capacitors.

Enterprise SSDs have many of those tiny Super Capacitors on board. Their function is to act as an uninterruptible power supply (UPS) for the SSD components in the event of a power outage. They are usually positioned around the SSD controller and the NAND chips. This helps keep the controller and the NAND chips alive long enough to safely write any data in flight to the NAND chips, thus protecting against data loss or corruption.


A controller in an enterprise Seagate SSD, surrounded by (yellow) Tantalum super capacitors to provide power failure protection.

 

This is exactly how your bicycle’s Standlicht (stand light) works. If you ever open one of those lights, you will find a small circular, coin-sized component inside with markings defining its voltage. As you pedal your bicycle, the dynamo in your wheel sends an electric current to the Standlicht’s circuit, charging the embedded capacitor on board. When you stop, at a traffic light or an intersection, that electric current stops and the charge stored inside the capacitor immediately gets poured into the circuit to power your Standlicht and help keep you visible and safe from oncoming traffic.

SuperCap under ZFS

Since we use SSDs as cache and log devices in our OpenZFS solutions, it is important to have PFP. This is to make sure any data going through the log device gets safely written to the SSD’s NAND chips, because losing in-flight log (SLOG) data during a power cut can mean losing the most recent synchronous writes to your pool.

In addition to the above, enterprise drives (both HDD and SSD) come with a five-year manufacturer warranty. This, in combination with the different data protection mechanisms provided by the disk hardware and by OpenZFS itself, provides peace of mind and comprehensive protection for your data.


A Standlicht circuit. The red LED is powered by the small coin-shaped 5.5 V/1.0 F Super Capacitor after charging it for 5 seconds using an 18650 battery.

UPS for Raspberry Pi

Are you interested in a cheap Raspberry Pi UPS that uses a few super capacitors? Here is a brief write-up on how you can build one yourself. If you are not familiar with electronics and soldering, or just need a bit of help with the project, our engineers are regular contributors to the TeckLab, our local HackSpace in Kirchheim – operating under the umbrella of Linde.

Keep in mind that M.2 SSDs rarely have PFP, for two main reasons: first, due to their limited capacities, M.2 drives are hardly used as storage media in enterprise environments; second, implementing PFP requires board space, and M.2 drives are so small that adding good PFP to them becomes challenging and expensive.


The Micron 2200 is one of the few M.2 SSDs with PFP

 

Unlike mechanical hard drives, there are usually no early indicators that an SSD is going to fail. There is also the added complexity of data recovery from SSDs, which requires specialised labs and costs more than with HDDs. In spite of all the advantages and assurances of modern SSDs, I cannot stress enough the importance of having a strict and current backup plan.

512n/512e/4Kn

512-byte native, 512-byte emulated, and 4K native sector sizes – or: why the need for a bigger sector size?

Today’s SATA and SAS disks (both HDD and SSD variants) can come with two different block sizes – 512 bytes and 4 KB. Those numbers refer to the size of the disk’s physical sectors, the smallest amount of data that can be written to it.

Disks with a physical sector size of 512 bytes have been around for decades, while disks with the larger 4 KB sector size are newer, with the first PC models available from January 2011. Those disks are commonly labeled as Advanced Format drives.

The original 512-byte native (512n) sector size was fine for the types and amounts of data users needed to save to their disks in the early days of personal computers – after all, it was mainly text files and rudimentary databases. But with the emergence of newer software applications, larger types of data files such as multimedia, advanced text formats, and modern databases came the issue of wasted disk space. That wasted space was due to the fact that a 512-byte sector had less than 90% format efficiency, because every tiny sector carries its own gap, sync/address mark and error-correction overhead.

The biggest issue with the move to the newer, more efficient 4K sector size was that it was introduced after decades of firmware and software stacks built around the 512-byte block size. Existing operating systems and legacy software could not handle the new 4K format – imagine attaching a set of 4K disks to a RAID controller with firmware that expects 512n drives. That situation led disk manufacturers to come up with the 512-byte emulated format (512e). With this format, the disk platters have a 4K physical sector size, but the disk presents itself to the OS as a 512-byte one using a firmware emulation technique. This emulation is carried out on the fly by the disk controller. Ideally, the OS and the software stack running atop it should still be aware of the emulation so that partitions can be aligned correctly.

512 Byte anachronism

The downside of this workaround was the accompanying performance degradation caused by the continuous physical-to-logical size translation load on the disk controller – which is more pronounced in demanding storage applications. Read and write commands are issued to a 512e drive in the same format as a legacy 512n drive. However, during a read, a 512e drive must load the entire 4096-byte physical sector containing the requested 512-byte data into the disk’s buffer memory so the emulation firmware can extract and re-format that data into a 512-byte chunk and send it to the host OS. Things are even more complex for write operations, resulting in a perceptible negative impact when random write activity is involved.

There was also the partition misalignment issue, which commonly occurred when a 512e disk was used in a system with a software stack that is not 512e-aware. The most pronounced symptom of this issue was degraded disk performance.

In short, a 512e disk is a 4K disk pretending to be a 512n one and wasting away some of its power in the process. 512e was intended to be a low-cost “transitional format” allowing disk manufacturers, and end users, to benefit from the 4K sector size without breaking compatibility with legacy software that expected 512-bytes sectors. With most modern software, as with recent OpenZFS/TrueNAS Core® there is no need to use 512e. Just go for the 4Kn disks.

Operating systems and applications were slow to adopt the new 4Kn Advanced Format drives, with Mac OS X introducing support in version 10.8.2 Mountain Lion (2012). Windows did not support it until Windows 8 and Server 2012, while FreeBSD and Linux were quicker to include support – the Linux kernel has had 4K support since version 2.6.31 (2009).


A single 4K sector can host 8 times more data than the legacy 512-byte one – with 97.3% efficiency.

 

Advantages of 4K physical sector size included:

  • Better error detection.
  • Smarter correction algorithms.
  • Better platter areal density, which translates to lower costs.

So, which disk drive option to use with OpenZFS, 512e or 4Kn?

Just use 4Kn disks. Recent OpenZFS and the TrueNAS Core® OS both support native 4K drives.
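To verify what a disk reports before building a pool, and to pin the pool's allocation size to 4K, something like the sketch below can help on a Linux host. The device names, the pool name "tank", and the RAIDZ2 layout are placeholder assumptions; recent OpenZFS normally detects 4Kn drives and selects ashift=12 on its own, so the explicit option is just a belt-and-braces measure.

```python
# Sketch: check reported logical/physical sector sizes (Linux sysfs) and create
# a pool with a 4K allocation size (ashift=12). Pool name and devices are
# placeholders; recent OpenZFS usually picks ashift=12 automatically for 4Kn disks.
import subprocess
from pathlib import Path

DEVICES = ["sda", "sdb", "sdc", "sdd"]  # assumption: adjust to your drives

for dev in DEVICES:
    q = Path(f"/sys/block/{dev}/queue")
    logical = (q / "logical_block_size").read_text().strip()
    physical = (q / "physical_block_size").read_text().strip()
    kind = "4Kn" if logical == "4096" else "512e" if physical == "4096" else "512n"
    print(f"{dev}: logical={logical} B, physical={physical} B ({kind})")

# Explicitly pin the pool to 4K sectors when creating it (run as root):
# subprocess.run(["zpool", "create", "-o", "ashift=12", "tank", "raidz2",
#                 *(f"/dev/{d}" for d in DEVICES)], check=True)
```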

Important: You cannot use SAS drives in a SATA chassis!

Unfortunate but true. Due to the slight difference between the SAS and SATA interfaces, you will not be able to plug a SAS drive into a SATA backplane or cable. The opposite is possible, though. So, make sure you get the correct enclosure or backplane for your disks.


SAS vs SATA interfaces on SSD drives. Notice the notch in the middle.

 

Mean Time Between Failures (MTBF)

MTBF is the average predicted time between failures of a mechanical or electronic disk. These averages are calculated based on various design and manufacturing factors and are usually stated in hours. Daily operating hours (or disk operating duty) is a critical number to consider when choosing a disk for your OpenZFS server.

Consumer-grade disks are usually rated for a maximum of 8 operating hours per day. This number is based on the expected average use of such machines. They also usually come with a shorter manufacturer warranty, ranging between one and two years.

Server and enterprise-grade disks are rated for 24 operating hours per day for the duration of their manufacturer warranty which is almost standardised at 5 years. This means: a server-grade drive should be able to run 24/7 for 1,825 days without failing.
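For a feel of what these hour figures translate to, the short sketch below converts a warranty period into operating hours and turns a quoted MTBF into an approximate annualised failure rate using the standard approximation AFR ≈ 1 − e^(−8760/MTBF). The 2,000,000-hour MTBF is just an example value, not a specific drive's rating.

```python
# Sketch: convert warranty length to operating hours and turn a quoted MTBF
# into an approximate annualised failure rate. The 2,000,000 h MTBF is an
# example value, not a specific drive's rating.
import math

WARRANTY_YEARS = 5
HOURS_PER_YEAR = 24 * 365            # 8,760 h
warranty_hours = WARRANTY_YEARS * HOURS_PER_YEAR
print(f"24/7 operation for {WARRANTY_YEARS} years = {warranty_hours:,} hours "
      f"({WARRANTY_YEARS * 365:,} days)")

MTBF_HOURS = 2_000_000               # example enterprise-class rating
afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)
print(f"MTBF {MTBF_HOURS:,} h -> annualised failure rate of roughly {afr:.2%}")
```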

But it is not that simple, because the health, and the possible failure, of a disk depends on multiple factors such as its workload rating, start/stop cycles (relevant only to HDDs), operating temperature, and the amount of reserve blocks (relevant only to SSDs). Let me refer you here to this excellent article on MTBF/MTTF written by Rainer Kaese, a Senior Manager Business Development Storage Products at Toshiba Electronics Europe. There is also a brief video from Rainer where he explains the subject eloquently.

 

Conclusion

There is a wide variety of disk options out there and the final choice depends on your target application. When you are looking for capacity at the lowest possible cost then SATA drives are the way to go. If you want more read/write performance then SAS is the better option – with NL-SAS as an economical step between SATA and SAS. If you intend to run critical applications that require the lowest possible latencies, such as database and AI image processing, then SSDs it is.

A SATA server backplane cannot handle SAS drives. If the current budget allows only for SATA drives but there is a chance you might want to upgrade to SAS in the future, make sure you buy servers with SAS backplanes. This way you will only need to obtain new SAS drives and will not have to decommission your servers.

Consider external SAS JBODs if you want to increase the storage capacity of your existing servers, as they can save you some money. Always research the type and the model of the disk before committing to purchasing it. Avoid non-enterprise and SMR drives at all costs, and make sure your SSDs have a high enough DWPD rating to sustain your target workload for the duration of their warranty.

 

TL;DR

SATA versus SAS

  • There are some factors to consider when choosing a disk for your ZFS: Interface, type, capacity, performance, power consumption, and noise level.
  • SATA disks offer reasonable performance if you do not need much IOPS or throughput.
  • SAS offers better IOPS performance and higher throughput from a lower number of disks.
  • SSD is the way to go if your application requires low latency and high IOPS. Database and virtualization backend storage are two main scenarios where you might want them.
  • PATA/IDE was a parallel-based technology used mostly to connect internal disk drives and optical devices.
  • PATA/IDE cabling was large and limited to two devices per cable.
  • SCSI was an old parallel-based transmission technology that allowed users to string up to 16 devices to the same cable. Users could use SCSI interfaces to daisy-chain and easily connect internal and external devices.
  • SCSI was sensitive to, and required, precise termination.
  • SATA solved many of PATA/IDE problems and offered much better performance levels
  • Currently SATA is the most common storage technology.
  • SATA offers high capacities at relatively cheaper prices.
  • SAS is a merge of SCSI and SATA technologies with thinner cables and faster I/O
  • SAS, just like SCSI, allows daisy-chaining of devices.
  • NL-SAS is a middle ground between SATA and SAS with affordable pricing and good performance
  • For most home and small office users SATA is the better option
  • SAS is when you need more IOPS and can afford the higher cost – and tolerate the noise

SSDs in practice

  • SSD is your best option for low-latency use cases – but it is also way more expensive
  • If you intend to deploy your server at home or in an office, check the power consumption and noise values of the components. This point is largely irrelevant in datacentres.
  • Legacy disks were physically formatted using a 512-byte sector size – these drives were called 512 native (512n).
  • Newer disks have 4K-bytes sector size but can present themselves to the OS as 512-byte sector size ones. This is done via firmware emulation – these drives are called 512 emulated (512e).
  • The need for emulation was due to the slow transition of operating systems and many popular applications towards the newer sector format.
  • Modern disks have 4K-bytes sector size and present themselves to the OS as such, with no emulation involved – these drives are called 4K native (4Kn).
  • TLER/ERC is a useful firmware function which limits a disk’s error recovery to 7 seconds, thus helping prevent the disk from being dropped by the system and averting a possible system freeze.
  • Watch out for SMR HDDs. Do your research and make sure the disk you buy for your OpenZFS server is PMR.
  • SSDs do wear out. Manufacturers add hidden Reserve Blocks to replace dying NAND cells.
  • Advanced error correction, extra reserve blocks, PFP, overprovisioning and extended warranties are some of the benefits of purchasing enterprise-grade SSDs.
  • PFP allows data to be saved into the NAND chips before the SSD shuts down.
  • SSDs do fail. Even worse, they fail without a warning. Always have an up-to-date backup for your data.
  • Do not use SSD drives that do not have PFP in your OpenZFS server.
  • MTBF is the average predicted time between failures of a mechanical disk or SSD.
  • Consumer drives are usually rated for 8 working hours per day while server grade ones are rated for 24 hours.
  • Watch the video linked above.
  • Interested in a cheap Raspberry Pi UPS that uses a few super capacitors? Check the link above.
Enterprise Storage Solutions Team
Engineering

Our experts are of course also experts in Linux, Ceph and ZFS