This document is a WORK IN PROGRESS.
This is just a quick personal cheat sheet: treat its contents with caution!

Smartmontools

Smartmontools is a set of tools to monitor storage systems and to provide advanced warning of disk degradation.

Table of contents

Install
Config
Use
- Smartmontools with hardware RAID
- Troubleshooting

Install

emerge / pacman / apt / yum / dnf

# emerge -a sys-apps/smartmontools

# pacman -S smartmontools

# apt install smartmontools

# yum install smartmontools

# dnf install smartmontools

Config

Reference(s)

$ man smartd.conf

Configure the SMART daemon:

# vi /etc/smartd.conf
    > ...
    > # DEVICESCAN # ⚠️ Comment this line ⚠️
    > ...
    > # Monitors...
    > # ...SMART health status ("-H")
    > # ...SMART log errors and selftest ("-l error" and "-l selftest")
    > # ...failure of any 'usage' attributes ("-f")
    > # &
    > # Start self-test ("-s") of type 'short' every week on
    > # Friday at 3:00 a.m. ("S/../../5/03").
    > # &
    > # Send a test mail every time the SMART daemon starts up ("-M test") and a
    > # mail for every report of a problem (with a daily reminder in this case).
    >
    > /dev/sda -m alerts@example.com -M test -H -l error -l selftest -f -s S/../../5/03
    > /dev/sdb -m alerts@example.com -M test -H -l error -l selftest -f -s S/../../5/03
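The "-s" schedule used above is a regular expression rather than a cron line. As a rough sketch, assuming smartd compares it against a string of the form T/MM/DD/d/HH (test type, month, day, day of week with 1 = Monday, hour) and requires the whole string to match, a pattern can be checked by hand like this. The function names `schedule_matches` and `now_stamp` are made up for this example:

```shell
#!/bin/sh
# Sketch only: mimics how a smartd "-s" regex is assumed to be matched
# against a T/MM/DD/d/HH string (d: 1 = Monday ... 7 = Sunday).

# schedule_matches REGEX STAMP
#   Succeeds if REGEX matches the whole STAMP, e.g. "S/06/14/5/03".
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# now_stamp TYPE
#   Prints the stamp for the current hour (TYPE: S, L, C or O).
now_stamp() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}
```

For example, `schedule_matches 'S/../../5/03' "$(now_stamp S)" && echo "short test due now"` only prints something on Fridays between 03:00 and 03:59.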

The following entry:

    > /dev/sdb -m @script,usr1@add1 -M test -H -l error -l selftest -f -s S/../../5/03

will also execute /usr/share/smartmontools/smartd_warning.d/script before sending any mail to usr1@add1.

The following entry:

    > /dev/sdb -m root -M exec /path/to/script -H -l error -l selftest -f -s S/../../5/03

will execute /path/to/script instead of sending a mail. From the script you can access smartd environment variables:

STDIN
SMARTD_MAILER: set to the argument of -M exec if present, or else to 'mail'.
SMARTD_DEVICE: set to the device path (e.g.: /dev/sda).
SMARTD_DEVICETYPE: set to the device type specified by -d directive or 'auto' if none.
SMARTD_DEVICESTRING: set to the device description.
SMARTD_DEVICEINFO: set to device identify information (most of the info in smartctl -i).
SMARTD_FAILTYPE: set to the reason for the warning or message email. Possible values are:
- EmailTest: this is an email test message.
- Health: the SMART health status indicates imminent failure.
- Usage: a usage Attribute has failed.
- SelfTest: the number of self test failures has increased.
- ErrorCount: the number of errors in the ATA error log has increased.
- CurrentPendingSector: one or more disk sectors could not be read and are marked to be reallocated (replaced with spare sectors).
- OfflineUncorrectableSector: during off-line testing, or self testing, one or more disk sectors could not be read.
- Temperature: Temperature reached critical limit (see -W directive).
- FailedHealthCheck: the SMART health status command failed.
- FailedReadSmartData: the command to read SMART Attribute data failed.
- FailedReadSmartErrorLog: the command to read the SMART error log failed.
- FailedReadSmartSelfTestLog: the command to read the SMART self test log failed.
- FailedOpenDevice: the open() command to the device failed.
SMARTD_ADDRESS: set to the address argument ADD of the -m Directive.
SMARTD_MESSAGE: set to the one sentence summary warning email message string from smartd.
SMARTD_FULLMESSAGE: set to the contents of the entire email warning message string from smartd.
SMARTD_TFIRST: set to the time and date at which the first problem of this type was reported.
SMARTD_TFIRSTEPOCH: set to an integer, the Unix epoch for SMARTD_TFIRST.
SMARTD_PREVCNT: set to an integer specifying the number of previous messages sent.
SMARTD_NEXTDAYS: set to an integer specifying the number of days until the next message will be sent.
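A minimal sketch of what a "-M exec" script could look like, using only variables from the list above. The one-line format, the `format_alert` name and the `smartd-exec` syslog tag are assumptions made for this example, and it relies on a `logger` command being installed:

```shell
#!/bin/sh
# Hypothetical "-M exec" handler: condenses the smartd environment into
# one line and hands it to syslog. Format and tag are illustrative only.

# format_alert: prints "[FAILTYPE] DEVICE (DEVICETYPE): MESSAGE".
format_alert() {
    printf '[%s] %s (%s): %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_DEVICETYPE:-auto}" \
        "${SMARTD_MESSAGE:-no message}"
}

# Only act when smartd (or a manual test) has set SMARTD_FAILTYPE.
if [ -n "${SMARTD_FAILTYPE:-}" ]; then
    logger -t smartd-exec "$(format_alert)"
fi
```

It can be tried without smartd by exporting the variables by hand, e.g. `SMARTD_FAILTYPE=EmailTest SMARTD_DEVICE=/dev/sda SMARTD_MESSAGE=test /path/to/script`.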

Start the SMART daemon and add it to the init system:

OpenRC / Runit / SysVinit / systemd

# /etc/init.d/smartd start
# rc-update add smartd default

# ln -s /etc/runit/sv/smartd /run/runit/service/
# sv up smartd

# service smartd start
# chkconfig smartd on

# systemctl start smartd
# systemctl enable smartd

If you would rather use another mailer than the default ("mail") with the SMART daemon, Neomutt for example:

# vi /etc/smartd_warning.sh
    > # Default mailer
    > os_mailer="neomutt"

Use

Print S.M.A.R.T. information:

# smartctl -i /dev/sda # print some drive information
# smartctl -a /dev/sda # print all drive information

See S.M.A.R.T. attributes details for more details about drive information.

Use smartctl:

# smartctl --info /dev/sda | grep 'SMART support is:'
    > SMART support is: Available - device has SMART capability.
    > SMART support is: Enabled

# smartctl -s on /dev/sda # enable SMART if not already enabled
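The check and the enable step above can be chained in a script. This is only a sketch: `smart_enabled` is a helper name invented here, and it assumes the two "SMART support is:" lines look like the output shown above:

```shell
#!/bin/sh
# smart_enabled: reads `smartctl --info` output on stdin and succeeds only
# if SMART is reported as both available and enabled.
smart_enabled() {
    awk '
        /SMART support is:[[:space:]]*Available/ { available = 1 }
        /SMART support is:[[:space:]]*Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}
```

Possible use (as root): `smartctl --info /dev/sda | smart_enabled || smartctl -s on /dev/sda`.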

# smartctl -c /dev/sda # print SMART capabilities
# smartctl -t short /dev/sda # run SMART short test, other possible tests are:
                             # offline (default test, w/o logs: just visible with '-l error' opt)
                             # long (extended test)
                             # conveyance (test intended to identify damage after transportation)
# smartctl -X /dev/sda # abort test

# smartctl -l selftest /dev/sda # print test results

# smartctl -l error /dev/sda # print errors if any

# smartctl -H /dev/sda # print SMART health status

# smartctl -A /dev/sda # print vendor-specific SMART attributes

Print messages only if there are errors:

# smartctl -q errorsonly -H -l selftest /dev/sda
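With quiet output, the exit status of smartctl carries most of the information: it is a bit mask, described in the smartctl man page. The sketch below decodes it. `decode_smartctl_status` is a made-up helper and the wording of each bit is paraphrased from the man page:

```shell
#!/bin/sh
# decode_smartctl_status CODE: prints one line per bit set in a smartctl
# exit status (bits 0 to 7); prints nothing when CODE is 0.
decode_smartctl_status() {
    code=$1
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number $bit of the status code.
        if [ $(( (code >> bit) & 1 )) -eq 1 ]; then
            printf 'bit %d: %s\n' "$bit" "$meaning"
        fi
        bit=$((bit + 1))
    done
}
```

Possible use: `smartctl -q errorsonly -H -l selftest /dev/sda; decode_smartctl_status $?`.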

Print in a neat, individual, predictably separate way (for script parsing):
```
# smartctl -i /dev/sda
```
Note: see https://blog.inf.ed.ac.uk/chris/smartctl-and-megaraid/

Get disk temperature:

# smartctl -a /dev/sda | grep Temp | cut -d" " -f 2,37
# smartctl -A /dev/sda | grep Temperature_Celsius
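Counting space-separated fields with cut is fragile because the column padding varies. A slightly more robust sketch, assuming the drive exposes an attribute named Temperature_Celsius and that the raw value is the 10th whitespace-separated column of `smartctl -A` (`temp_from_attrs` is a name invented for this example):

```shell
#!/bin/sh
# temp_from_attrs: reads `smartctl -A` output on stdin and prints the raw
# value (10th column) of the first Temperature_Celsius attribute, if any.
temp_from_attrs() {
    awk '$2 == "Temperature_Celsius" { print $10; exit }'
}
```

Possible use: `smartctl -A /dev/sda | temp_from_attrs`. Drives that report temperature under another attribute name print nothing.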

Smartmontools with hardware RAID

Reference(s)

https://www.admin-linux.fr/smart-test-des-disques-sur-controleur-lsiperc-serveur-dell/

Get all info on hardware LSI RAID (on physical disk 4):
```
# smartctl -a -d megaraid,4 /dev/sda
```
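To check several physical disks behind the same controller, the command above can be looped. This is only a sketch: it assumes the disks are numbered 0 to COUNT-1, which is not guaranteed on every controller, and both `check_megaraid` and the `SMARTCTL` override are names invented here:

```shell
#!/bin/sh
# check_megaraid DEVICE COUNT: prints the health status of physical disks
# 0..COUNT-1 behind a MegaRAID controller.
# Set SMARTCTL=echo to preview the commands instead of running them.
check_megaraid() {
    dev=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        "${SMARTCTL:-smartctl}" -H -d "megaraid,$i" "$dev"
        i=$((i + 1))
    done
}
```

Possible use (as root): `check_megaraid /dev/sda 4`, or `SMARTCTL=echo check_megaraid /dev/sda 4` for a dry run.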
For other types of hardware RAID controllers, see https://www.smartmontools.org/wiki/Supported_RAID-Controllers

Troubleshooting

"Read failure" status: if a test gives this kind of result:

# smartctl -l selftest /dev/sda # print test results
    > ...
    > == START OF READ SMART DATA SECTION ==
    > SMART Self-test log structure revision number 1
    > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    > # 1  Extended offline    Completed: read failure       40%     32017         1374022944
    > ...
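As a first step, the failing address can be pulled out of the log for later inspection. A small sketch, assuming the self-test log has the column layout shown above, with LBA_of_first_error as the last column (`failed_lbas` is a name invented for this example):

```shell
#!/bin/sh
# failed_lbas: reads `smartctl -l selftest` output on stdin and prints the
# LBA_of_first_error (last column) of every test that ended in a read failure.
failed_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}
```

Possible use: `smartctl -l selftest /dev/sda | failed_lbas`.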

🚧 WIP 🚧