Make your own cloud server - Testing the disk

Testing the disks

As the hard disk drive (hdd) is old, the first thing I did was connect it to my laptop and run every test I could. The laptop runs openSUSE Leap 42.2 and has one internal hdd (/dev/sda), so the external hdd should show up as /dev/sdb once it is plugged in. Beware that this can differ on your system if you have another distribution and/or more disks. I present my results as a guide, not as something written in stone. Be patient and careful, and remember that the internet is full of information. To begin with, do not plug in the external hdd yet.


We need some tools, and as I like the command line more than the GUI, all commands are run from bash as root. So check that your seat belt is fastened and open a root console.

Check that smartmontools (www.smartmontools.org) is installed. If it is not, use yast or zypper to install it.

earth:~ # rpm -qa | grep smartmontools
smartmontools-6.5-121.1.x86_64
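If you prefer to script this check, here is a minimal sketch. The function name ensure_smartmontools is my own invention, and it assumes an rpm-based system with zypper, as on openSUSE; adapt the install command for other distributions.

```shell
# ensure_smartmontools is a hypothetical helper name, not part of any package.
# It installs smartmontools only when the rpm database does not list it.
ensure_smartmontools() {
    if rpm -q smartmontools >/dev/null 2>&1; then
        echo "smartmontools already installed"
    else
        # --non-interactive makes zypper answer its own prompts
        zypper --non-interactive install smartmontools
    fi
}

# Usage (as root):
#   ensure_smartmontools
```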

Before checking any disk, always update the SMART drive database first

earth:~ # /usr/sbin/update-smart-drivedb
/usr/share/smartmontools/drivedb.h updated from branches/RELEASE_6_5_DRIVEDB

The first thing to do, before plugging in any hdd, is to look at which devices the OS recognizes as disks

earth:~ # ls -ls /dev/sd*
0 brw-rw---- 1 root disk 8, 0 Aug 23 10:42 /dev/sda
0 brw-rw---- 1 root disk 8, 1 Aug 23 10:42 /dev/sda1
0 brw-rw---- 1 root disk 8, 10 Aug 23 10:42 /dev/sda10
0 brw-rw---- 1 root disk 8, 2 Aug 23 10:42 /dev/sda2
0 brw-rw---- 1 root disk 8, 3 Aug 23 10:42 /dev/sda3
0 brw-rw---- 1 root disk 8, 5 Aug 23 10:42 /dev/sda5
0 brw-rw---- 1 root disk 8, 6 Aug 23 10:42 /dev/sda6
0 brw-rw---- 1 root disk 8, 7 Aug 23 10:42 /dev/sda7
0 brw-rw---- 1 root disk 8, 8 Aug 23 10:42 /dev/sda8
0 brw-rw---- 1 root disk 8, 9 Aug 23 10:42 /dev/sda9

Then plug in the external hdd and see how it shows up

earth:~ # ls -ls /dev/sd*
0 brw-rw---- 1 root disk 8, 0 Aug 23 10:42 /dev/sda
0 brw-rw---- 1 root disk 8, 1 Aug 23 10:42 /dev/sda1
0 brw-rw---- 1 root disk 8, 10 Aug 23 10:42 /dev/sda10
0 brw-rw---- 1 root disk 8, 2 Aug 23 10:42 /dev/sda2
0 brw-rw---- 1 root disk 8, 3 Aug 23 10:42 /dev/sda3
0 brw-rw---- 1 root disk 8, 5 Aug 23 10:42 /dev/sda5
0 brw-rw---- 1 root disk 8, 6 Aug 23 10:42 /dev/sda6
0 brw-rw---- 1 root disk 8, 7 Aug 23 10:42 /dev/sda7
0 brw-rw---- 1 root disk 8, 8 Aug 23 10:42 /dev/sda8
0 brw-rw---- 1 root disk 8, 9 Aug 23 10:42 /dev/sda9
0 brw-rw---- 1 root disk 8, 16 Aug 23 11:10 /dev/sdb
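Comparing the two listings by eye works fine with one disk, but it can also be done mechanically. The following is only a sketch: new_block_devices is a made-up helper name, and the /tmp file names in the usage lines are placeholders.

```shell
# new_block_devices BEFORE AFTER
# Prints the lines of file AFTER that do not appear in file BEFORE.
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
new_block_devices() {
    grep -Fxv -f "$1" "$2"
}

# Usage (as root; file names are placeholders):
#   ls /dev/sd* > /tmp/before.txt
#   ... plug in the external hdd and wait a few seconds ...
#   ls /dev/sd* > /tmp/after.txt
#   new_block_devices /tmp/before.txt /tmp/after.txt
```

With the two listings above, the only line printed would be /dev/sdb.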

The next step is to discover what kind of disk is inside the external enclosure, and what shape it is in

earth:~ # smartctl -i /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:   Seagate Barracuda 7200.10
Device Model:   ST3320820A
Serial Number:  12345678
Firmware Version: 3.AAD
User Capacity: 320,072,933,376 bytes [320 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 (minor revision not indicated)
Local Time is: Wed Aug 23 17:55:13 2017 CEST
SMART support is: Available - device has SMART capability
SMART support is: Enabled

So inside the box lies a Seagate hard disk drive, and its SMART support is both available and enabled. Now let us see the health status of the drive, by asking for the health assessment it currently reports.

earth:~ # smartctl -H /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
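For use in a script, the verdict can be pulled out of that output. This is a sketch that assumes the exact line format shown above for this ATA drive; health_status is a name I made up, and other device types may word the line differently.

```shell
# health_status is a hypothetical helper name.
# Reads `smartctl -H` output on stdin and prints only the verdict word,
# assuming the "SMART overall-health self-assessment test result:" line
# format shown in the transcript above.
health_status() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Usage (as root):
#   smartctl -H /dev/sdb | health_status
```

Keep in mind that this verdict is the drive's own summary; the self-tests that follow look at the disk in more detail.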

That was short but informative, so now ask for the capabilities of the drive

earth:~ # smartctl -c /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 115) minutes.

Among the tests that the hardware (and firmware) of this external disk can run are a short and an extended self-test. Let us run the short one first

earth:~ # smartctl -t short /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Wed Aug 23 10:10:10 2017

Use smartctl -X to abort test.

We wait for a minute or so, and then

earth:~ # smartctl -l selftest /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline     Completed without error 00%       2086            -
# 2 Short offline     Completed without error 00%       1414            -
# 3 Short offline     Completed without error 00%       1394            -
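The LifeTime(hours) column counts power-on hours, which are not easy to picture. Here is a small sketch to convert them into days and approximate years; hours_to_age is a hypothetical name, and a year is taken as 365 days (8760 hours).

```shell
# hours_to_age HOURS
# hours_to_age is a hypothetical helper name.
# Prints power-on hours as days + hours and as approximate years,
# using integer arithmetic only (one decimal, truncated, 8760 hours per year).
hours_to_age() {
    hours=$1
    days=$((hours / 24))
    rest=$((hours % 24))
    tenths=$((hours * 10 / 8760))
    echo "${days} days + ${rest} hours (~$((tenths / 10)).$((tenths % 10)) years)"
}

hours_to_age 2086     # prints: 86 days + 22 hours (~0.2 years)
hours_to_age 50000    # prints: 2083 days + 8 hours (~5.7 years)
```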

This disk has been running for 2086 hours, or about 87 days. That is a very short time compared to a few disks I saw at work with nearly 50000 hours (almost 6 years of continuous operation). So now it is time to see how it behaves on the long test.

earth:~ # smartctl -t long /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 115 minutes for test to complete.
Test will complete after Wed Aug 23 14:14:14 2017

Use smartctl -X to abort test.
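Instead of watching the clock, the drive can be polled until the test is over. This is a sketch under assumptions: the function names are mine, and it relies on `smartctl -c` printing "Self-test routine in progress" in the self-test execution status while a test runs, which is the wording I would expect from smartctl but is not shown in the transcripts here.

```shell
# selftest_running and wait_for_selftest are hypothetical helper names.

# Reads `smartctl -c` output on stdin; succeeds while a self-test is running.
# Assumption: the status text contains "Self-test routine in progress".
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# wait_for_selftest DEVICE [INTERVAL_SECONDS]
# Polls the drive until it no longer reports a running self-test.
# The default interval of 300 seconds is an arbitrary choice.
wait_for_selftest() {
    while smartctl -c "$1" | selftest_running; do
        sleep "${2:-300}"
    done
}

# Usage (as root):
#   smartctl -t long /dev/sdb
#   wait_for_selftest /dev/sdb && smartctl -l selftest /dev/sdb
```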

This took a while, and in the meantime I did a bit of unboxing of the R3 (see next section) and then came back. Once the time has passed, another good command is to ask for e(x)tended (a)ll information. By now we also know the device type, which is SAT, so we can pass it explicitly.

earth:~ # smartctl --device=sat --xall /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.79-18.26-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3320820A
Serial Number:    12345678
Firmware Version: 3.AAD
User Capacity:    320,072,933,376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Aug 25 15:15:15 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 115) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   101   093   006    -    89908515
  3 Spin_Up_Time            PO----   094   090   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    867
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
  7 Seek_Error_Rate         POSR--   081   060   030    -    132019117
  9 Power_On_Hours          -O--CK   098   098   000    -    2113
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   099   099   020    -    1470
187 Reported_Uncorrect      -O--CK   092   092   000    -    8
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   054   050   045    -    46 (Min/Max 27/48)
194 Temperature_Celsius     -O---K   046   050   000    -    46 (0 17 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   059   052   000    -    58061254
197 Current_Pending_Sector  -O--C-   100   100   000    -    1
198 Offline_Uncorrectable   ----C-   100   100   000    -    1
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    3
200 Multi_Zone_Error_Rate   ------   100   253   000    -    0
202 Data_Address_Mark_Errs  -O--CK   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       GPL,SL  R/O      1  Summary SMART error log
0x02       GPL,SL  R/O      5  Comprehensive SMART error log
0x03       GPL,SL  R/O      5  Ext. Comprehensive SMART error log
0x06       GPL,SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09       GPL,SL  R/W      1  Selective self-test log
0x20       GPL,SL  R/O      1  Streaming performance log [OBS-8]
0x21       GPL,SL  R/O      1  Write stream error log
0x22       GPL,SL  R/O      1  Read stream error log
0x23       GPL,SL  R/O      1  Delayed sector log [OBS-8]
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0       GPL,SL  VS       1  Device vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL,SL  VS     101  Device vendor specific log
0xa8       GPL,SL  VS      20  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer
0xff       GPL     -    23552  Reserved

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 8
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 [7] occurred at disk power-on lifetime: 2103 hours (87 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 1f 58 33 d8 33 e0 00  Error: UNC at LBA = 0x1f5833d833 = 134623778867

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 da 00 08 00 1f 58 00 d8 30 e0 00     02:59:21.577  READ DMA EXT
  25 00 da 00 08 00 1f 58 00 d8 30 e0 00     02:59:21.575  READ DMA EXT
  25 00 da 00 08 00 1f 58 00 d8 28 e0 00     02:59:21.571  READ DMA EXT
  25 00 da 00 10 00 1f 59 00 d8 98 e0 00     02:59:21.568  READ DMA EXT
  25 00 da 00 f0 00 1f 58 00 d8 a8 e0 00     02:59:21.566  READ DMA EXT

Error 7 [6] ...
Error 6 [5] ...
...
Error 1 [0] occurred at disk power-on lifetime: 2103 hours (87 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 1f 58 33 d8 33 e0 00  Error: UNC at LBA = 0x1f5833d833 = 134623778867

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 da 00 f0 00 1f 58 00 d8 00 e0 00     02:59:21.577  READ DMA EXT
  25 00 da 00 20 00 1f 57 00 d8 e0 e0 00     02:59:21.575  READ DMA EXT
  25 00 da 00 f0 00 1f 56 00 d8 f0 e0 00     02:59:21.571  READ DMA EXT
  25 00 da 00 f0 00 1f 56 00 d8 00 e0 00     02:59:21.568  READ DMA EXT
  25 00 da 00 20 00 1f 55 00 d8 e0 e0 00     02:59:21.566  READ DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2109         29464009922611
# 2  Extended offline    Completed: read failure       90%      2086         27616703817874
# 3  Short offline       Completed without error       00%      2086         -
# 4  Short offline       Completed without error       00%      1414         -
# 5  Short offline       Completed without error       00%      1394         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11) not supported
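That is a lot of output to read. As a rough filter, the sketch below prints only a handful of attributes whose raw value is not zero. The selection (reallocated, reported uncorrectable, pending and offline uncorrectable sectors) is my own choice of counters that are commonly treated as warning signs, not an official list, and nonzero_warning_attributes is a made-up name. It assumes the attribute table layout printed by --xall above, where the raw value is the eighth column.

```shell
# nonzero_warning_attributes is a hypothetical helper name.
# Reads `smartctl --xall` output on stdin and prints NAME=RAW for the
# watched attributes whose raw value (8th column in this layout) is not 0.
nonzero_warning_attributes() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable)$/ && $8 + 0 != 0 {
        print $2 "=" $8
    }'
}

# Usage (as root):
#   smartctl --device=sat --xall /dev/sdb | nonzero_warning_attributes
```

Fed with the attribute table above, it would print Reported_Uncorrect=8, Current_Pending_Sector=1 and Offline_Uncorrectable=1, one per line.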

Well, this is not really good news, because something went wrong during the long test. So I will probably have to open my wallet and buy a brand new hdd to make this project a stable one. But in the meantime, I will play along with this one.
Read the previous entry, about which hardware we should pick [←].
Read the next entry [→].
I will also update this blog as soon as the new drive is in place :)
 
