Test whether a SCSI/SATA/SSD hard disk on a Linux server is failing
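One of our regular readers asked: How do I test whether my hard disk is failing? I am seeing some errors in the /var/log/messages file.
I/O errors in /var/log/messages indicate that something is wrong with the hard disk and that it may be failing. You can check the disk for errors with the smartctl command, a control and monitoring utility for SMART disks on Linux and UNIX-like operating systems.
smartctl controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests.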
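smartctl for servers
smartctl is a command-line utility that performs SMART tasks such as printing the SMART self-test and error logs, enabling and disabling SMART automatic testing, and starting device self-tests. First, make sure SMART support is enabled in the BIOS. Next, run the following command to find out whether your hard disk supports SMART: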
# smartctl -i /dev/sdb
To enable SMART, run:
# smartctl -s on -d ata /dev/sdb
Sample output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
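The support check and the enable step above can be scripted. What follows is a minimal sketch rather than anything from smartmontools itself: the function name `smart_state` is hypothetical, and it assumes the `SMART support is:` lines are worded as in the smartctl reports shown in this article. It reads `smartctl -i` output on standard input, so it can be tried on saved output without touching a disk.

```shell
#!/bin/sh
# Classify SMART support from `smartctl -i` output read on stdin.
# Prints one of: enabled, disabled, unsupported.
# smart_state is an illustrative helper, not part of smartmontools.
smart_state() {
    awk '
        /SMART support is: +Enabled/  { state = "enabled" }
        /SMART support is: +Disabled/ { state = "disabled" }
        END {
            if (state == "") { state = "unsupported" }
            print state
        }
    '
}

# Demo on two lines copied from a sample report; on a real system run:
#   smartctl -i /dev/sdb | smart_state
printf '%s\n' \
    'SMART support is: Available - device has SMART capability.' \
    'SMART support is: Enabled' | smart_state
```

If the result is `disabled`, the `smartctl -s on` command shown above turns SMART on.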
To run the overall health self-assessment test, enter:
# smartctl -d ata -H /dev/sdb
Sample output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Sample output for a failing hard disk:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 110 58 25)
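As this sample shows, a drive can report PASSED while an attribute is marked FAILING_NOW, so the overall result alone is not enough. A rough sketch of a stricter check follows. `health_verdict` is a hypothetical helper name, and the patterns assume ATA-style output as shown above (SCSI drives word their health line differently).

```shell
#!/bin/sh
# Read `smartctl -H` output on stdin and print OK, MARGINAL, or NOT_PASSED.
# health_verdict is an illustrative helper, not part of smartmontools.
health_verdict() {
    awk '
        /overall-health self-assessment test result: PASSED/ { passed = 1 }
        /FAILING_NOW/ { failing = 1 }
        END {
            if (!passed) { print "NOT_PASSED" }
            else if (failing) { print "MARGINAL" }
            else { print "OK" }
        }
    '
}

# Demo using two lines from the failing-disk sample above.
# Real use: smartctl -H /dev/sdb | health_verdict
printf '%s\n' \
    'SMART overall-health self-assessment test result: PASSED' \
    '190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 110 58 25)' \
    | health_verdict
```

For the sample above this prints `MARGINAL`, which is a cue to look at the full attribute table before trusting the PASSED line.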
The following command gives more information about the disk failure:
# smartctl --attributes --log=selftest /dev/sda
Sample output:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 098 092 006 Pre-fail Always - 238320363
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 587
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9
7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 51672328
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4805
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 586
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 417
188 Unknown_Attribute 0x0032 100 099 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 094 094 000 Old_age Always - 6
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 122 58 25)
194 Temperature_Celsius 0x0022 056 067 000 Old_age Always - 56 (0 23 0 0)
195 Hardware_ECC_Recovered 0x001a 043 026 000 Old_age Always - 238320363
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 49
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 49
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 172082159686339
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 2155546016
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3048586928
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 4789 1746972641
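In a table like this one, the rows that most directly point at bad sectors are the reallocation and pending-sector counters. The sketch below pulls those rows out when their raw value is above zero. The choice of attribute IDs (5, 187, 197, 198) is a heuristic assumption, since attribute meanings are vendor specific, and `nonzero_counters` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `smartctl --attributes` output on stdin and print NAME=RAW for
# sector-health counters whose raw value is above zero.
# IDs: 5 Reallocated_Sector_Ct, 187 Reported_Uncorrect,
#      197 Current_Pending_Sector, 198 Offline_Uncorrectable.
nonzero_counters() {
    awk '$1 ~ /^(5|187|197|198)$/ && $3 ~ /^0x/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Demo on rows taken from the sample table above.
# Real use: smartctl --attributes /dev/sda | nonzero_counters
printf '%s\n' \
    '5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9' \
    '197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 49' \
    '199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0' \
    | nonzero_counters
```

An empty result means none of the selected counters has a non-zero raw value. It does not by itself prove the disk is healthy.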
You can read more data from the hard disk by entering:
# smartctl -d ata -a /dev/sdb
Sample output:
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD2500YS-01SHB0
Serial Number: WD-WCANY1729333
Firmware Version: 20.06C03
User Capacity: 251,000,193,024 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Jul 4 15:04:38 2007 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (7800) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 92) minutes.
Conveyance self-test routine
recommended polling time: ( 6) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 190 187 021 Pre-fail Always - 5500
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 24
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 6382
10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
194 Temperature_Celsius 0x0022 127 096 000 Old_age Always - 23
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0009 200 200 051 Pre-fail Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
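A note about RAID controllers
To look at ATA disks behind a 3ware SCSI RAID controller, the syntax is:
# smartctl -a -d 3ware,2 /dev/sda
# smartctl -a -d 3ware,0 /dev/twe0
For more information, see how to use the smartctl command to check disks behind Adaptec RAID controllers and hard disks behind 3Ware RAID cards.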
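To check every disk behind a 3ware controller, the same kind of command can be repeated once per port. The sketch below only prints the commands so they can be reviewed before anything runs. `threeware_cmds` is a hypothetical helper, and the port count of 4 is a placeholder that needs adjusting to the actual controller.

```shell
#!/bin/sh
# Print one `smartctl -H` command per 3ware port, numbered from 0.
# $1 = controller device, $2 = number of ports (both supplied by the caller).
threeware_cmds() {
    port=0
    while [ "$port" -lt "$2" ]; do
        printf 'smartctl -H -d 3ware,%d %s\n' "$port" "$1"
        port=$((port + 1))
    done
}

# Dry run: list the commands for a controller assumed to have 4 ports.
threeware_cmds /dev/twe0 4
```

Once the list looks right, it can be executed as root by piping it to `sh`.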
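Task: Run an extended self-test on a drive
Suppose you need to start an extended self-test on the drive /dev/sdb. You can issue this command on a running system. When the test finishes, the results appear in the self-test log, which you can display with the "-l selftest" option: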
# smartctl -d ata -t long /dev/sdb
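Because the extended test runs in the background, its result has to be collected afterwards from the self-test log. One way to script that check is sketched here. `last_selftest_ok` is a hypothetical helper, and it assumes that entry `# 1` is the most recent test and that a clean run is logged as `Completed without error`, which is how smartctl normally words it.

```shell
#!/bin/sh
# Succeed only if the newest entry (# 1) in `smartctl -l selftest` output,
# read on stdin, is logged as "Completed without error".
last_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Demo on the failing entry from the sample log in this article.
# Real use: smartctl -l selftest /dev/sdb | last_selftest_ok
if printf '%s\n' '# 1 Extended offline Completed: read failure 90% 4789 1746972641' \
    | last_selftest_ok
then
    echo "last self-test passed"
else
    echo "last self-test reported a problem"
fi
```

The function returns a normal exit status, so it can be used directly in a cron job or monitoring script.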
Example of a detailed report for a failing hard disk
Enter the smartctl command as follows:
# smartctl -a /dev/sda
Sample output:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: ST31500341AS
Serial Number: 9VS0TG4B
Firmware Version: CC1H
User Capacity: 1,500,301,910,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Oct 26 21:16:15 2009 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 617) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 098 092 006 Pre-fail Always - 238338845
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 587
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 9
7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 51672525
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4806
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 586
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 417
188 Unknown_Attribute 0x0032 100 099 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 094 094 000 Old_age Always - 6
190 Airflow_Temperature_Cel 0x0022 044 033 045 Old_age Always FAILING_NOW 56 (96 126 58 25)
194 Temperature_Celsius 0x0022 056 067 000 Old_age Always - 56 (0 23 0 0)
195 Hardware_ECC_Recovered 0x001a 043 026 000 Old_age Always - 238338845
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 49
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 49
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 107168023974595
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 2155546480
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 3048590512
SMART Error Log Version: 1
ATA Error Count: 416 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 416 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:55:03.917 READ DMA EXT
27 00 00 00 00 00 e0 00 00:55:03.818 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:55:03.798 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:55:03.779 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:55:03.658 READ NATIVE MAX ADDRESS EXT
Error 415 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:55:00.927 READ DMA EXT
27 00 00 00 00 00 e0 00 00:55:00.837 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:55:00.817 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:55:00.800 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:55:00.747 READ NATIVE MAX ADDRESS EXT
Error 414 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:57.903 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:57.807 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:57.787 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:57.757 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:57.637 READ NATIVE MAX ADDRESS EXT
Error 413 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:54.862 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:54.767 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:54.746 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:54.728 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:54.677 READ NATIVE MAX ADDRESS EXT
Error 412 occurred at disk power-on lifetime: 4786 hours (199 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 00:54:51.838 READ DMA EXT
27 00 00 00 00 00 e0 00 00:54:51.736 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:54:51.716 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:54:51.685 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:54:51.566 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 4789 1746972641
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
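Restore from backup
If any of the tests report errors, replace the hard disk and restore your data from backup.
Configure smartd on the server to receive email warnings when a problem is detected
smartd is a daemon that monitors hard disks and attempts to enable SMART monitoring on them. It polls the disks, including SCSI devices, for health data every 30 minutes (the interval is configurable). It logs SMART errors and changes in SMART attributes through the SYSLOG interface. The default location of these SYSLOG notices and warnings depends on the system (typically /var/log/messages or /var/log/syslog). In addition to logging to a file, smartd can be configured to send email warnings when it detects a problem. Depending on the type of problem, you may want to run self-tests on the disk, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable disk sectors. For more information, see how to install and configure smartd.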
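In smartd's configuration file, each monitored disk gets one line of directives. As a rough illustration, the snippet below prints such a line using the `-a` directive (monitor all SMART properties) and `-m` (mail recipient for warnings). The device, the address, and the helper name `smartd_directive` are placeholders chosen for this example. It writes to standard output only, so nothing under /etc is changed.

```shell
#!/bin/sh
# Print a candidate smartd.conf line for one disk.
# $1 = device to monitor, $2 = address that should receive warnings.
# -a: monitor all SMART properties; -m: send warning mail to this address.
smartd_directive() {
    printf '%s -a -m %s\n' "$1" "$2"
}

# Placeholder values; replace them with your own disk and mailbox.
smartd_directive /dev/sda admin@example.com
```

The printed line can then be added to smartd's configuration file (commonly /etc/smartd.conf, though the path varies by distribution) before restarting the daemon.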
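GNOME Disk Utility
Most Linux and Unix-like operating systems (such as FreeBSD or OpenBSD) come with a GUI tool called Disk Utility. It works only if you are running a GNOME-based desktop or laptop system. To start Disk Utility, go to: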
Applications > System Tools > Disk Utility
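Click the hard disk:
To view the details, click SMART Data:
Example of a healthy hard disk:
Say hello to GSmartControl
GSmartControl is a hard disk drive health inspection tool and a graphical user interface for the smartctl command. The tool has the following features:
- Automatically reports and highlights any anomalies;
- Allows enabling and disabling SMART;
- Allows enabling and disabling Automatic Offline Data Collection, a short self-check that the drive performs automatically every four hours with no impact on performance;
- Supports configuration of global and per-drive options for smartctl;
- Performs SMART self-tests;
- Displays drive identity information, capabilities, attributes, and self-test/error logs;
- Can read smartctl output from a saved file, interpreting it as a read-only virtual device;
- Works on most smartctl-supported operating systems, such as *BSD and various Linux distributions;
- Has extensive help information.
You can install it on Debian or Ubuntu based systems with the apt-get command as follows: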
$ sudo apt-get install gsmartcontrol
If you are using RHEL or CentOS Linux, install it with the yum command or the dnf command as follows:
# yum install gsmartcontrol
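Fig. 01: GSmartControl in action
Fig. 02: GSmartControl displaying hard disk information
Click the Perform Tests tab to run a short or long hard disk test:
Summary
You now know how to check for a failing hard disk (SSD/SATA) under Linux using both GUI and CLI tools. For more details, read the manual pages with the man command or the help command: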
$ man smartctl
$ man smartd