RAS(一)介紹
寫在開篇之前
近期收到了公司大禮包,想著在找工作期間把Linux RAS整理一下,寫成系列文章。畢竟作為OS RAS負責人兼開發,為阿里云X86和倚天710 RAS落地了很多RAS增強和解決方案,對阿里云服務器穩定性做出些許貢獻。期間也有不少其他團隊過來請教過RAS事項,所以想著記錄下來,對以后計劃了解和學習RAS的Linux愛好者有所幫助。另外個人視角主要從Linux內核出發,梳理Linux RAS涉及的組件、功能、特性都有哪些,也會介紹內核RAS涉及的硬件。
RAS背景
隨著云時代的到來,各個公司都在將產品、服務等遷移上云,云為數字化建設提供了極大的便利。國內外云服務器廠商如雨后春筍般出現,比如谷歌云、阿里云、騰訊云等等。近些年,全球范圍內云服務出現多次宕機事件,云的穩定性越來越受到大家關注。據《中國數據災備產業白皮書暨數據災備建設調研報告》中描述,業務宕機1分鐘,平均會使運輸業損失15萬美元,銀行業損失27萬美元,通信業損失35萬美元,制造業損失42萬美元,證券業損失45萬美元,同時公司聲譽等無形資產損失更是無法估量。
在嵌入式領域,越來越多的國產自研硬件商用發布,包括內存、硬盤、GPU、CPU等,應用于PC主機、汽車電子、工業控制等市場。隨著這些硬件使用數量的飛速上升,穩定性問題也逐步開始暴露出來。比如筆者在華為內核團隊負責對接某ARM嵌入式產品業務,在使用國產中發現使用國產內存條出現硬件問題的概率大大超過國外某大廠品牌。
總體來說,隨著軟件技術的成熟和完善,因為軟件導致的問題占比逐年減少。此消彼長下,硬件問題占比逐年突顯,比如硬件故障導致服務器異常或宕機問題已逐步成為云服務器Top1問題。
RAS定義
服務器硬件穩定性,主要體現在RAS上。RAS指機器的可靠性(Reliability)、可用性(Availability)和可服務性(Serviceability)。Linux Kernel對Reliability,Availability,Serviceability定義如下
Reliability
is the probability that a system will produce correct outputs.
?Generally measured as Mean Time Between Failures (MTBF)
?Enhanced by features that help to avoid, detect and repair hardware faults
Availability
is the probability that a system is operational at a given time
?Generally measured as a percentage of downtime per a period of time
?Often uses mechanisms to detect and correct hardware faults in runtime;
Serviceability (or maintainability)
is the simplicity and speed with which a system can be repaired or maintained
?Generally measured on Mean Time Between Repair (MTBR)
RAS目標是使系統盡可能長期可靠的運行而不停機,減少系統downtime;提供硬件檢測上報機制,以便在硬件錯誤引起數據丟失或宕機之前能夠通知管理員及時更換硬件;提供硬件錯誤恢復機制,并盡可能糾正錯誤,使系統可持續可靠的運行。
RAS涉及的硬件包括且不限于:CPU、Memory、IO、PCIe、硬盤和其他外設
?CPU – detect errors at instruction execution and at L1/L2/L3 caches;
?Memory – add error correction logic (ECC) to detect and correct errors;
?I/O – add CRC checksums for tranfered data;
?Storage – RAID, journal file systems, checksums, Self-Monitoring, Analysis and Reporting Technology (SMART).
通常來說,硬件錯誤分為CE、UE、Fatal Error、Non-fatal Error,定義如下
?Correctable Error (CE)?- the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
?Uncorrected Error (UE)?- the amount of errors happened above the error correction threshold, and the system was unable to auto-correct.
?Fatal Error?- when an UE error happens on a critical component of the system (for example, a piece of the Kernel got corrupted by an UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
?Non-fatal Error?- when an UE error happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
但是實際這個定義比較寬泛且簡陋,比如還有Defferred Error(DE)
Deferred error
The error was detected, was not corrected, and was deferred. The error has not been silently propagated. The error might be latent in the system. It is IMPLEMENTATION DEFINED whether the error continues to infect the state of the node or whether it has been deferred to the consumer. The node continues to operate. If the error might have been silently propagated, it must be reported as an Uncorrected error.
又比如Intel將軟件可恢復的UC Error定義為UCR(Uncorrected Recoverable) Error,下面又分為SRAR、SRAO、UCNA等。
RAS基本框圖
RAS基本流程框圖如上,硬件發生故障后,通過硬件RAS能力觸發中斷或異常,通知到Firmware/OS,軟件收到通知后采取相應的策略,比如Panic、執行Recover actions或者通知到用戶。
隨著RAS功能不斷更新迭代以及架構不同,RAS體系開始呈現多樣性,因不同使用場景所有不同,體現在:
1.通知方式多樣
通知方式細分下來包括IRQ、Exception、Poll、SEA、SDEI、GPIO等方式。
2.Mode多樣
硬件故障先通知到Firmware,然后Firmware帶外處理或再通知到OS的方式,稱為Firmware First Mode;
硬件故障通知到OS,OS處理硬件故障的方式,稱為Kernel First Mode;
這兩種方式還可以支持混合使用,各有優劣,要學會因地制宜。比如對于CE來說,服務器經常發生大量CE事件,就會產生CE Irq風暴,CPU長時間在處理這些Irq,就會導致其他任務得不到調度,影響整體性能。
3.芯片架構、硬件多樣性
隨著近些年芯片行業發展,芯片架構越來越多樣性,包括Intel、AMD、ARM、RISC等,不同芯片架構下硬件組成也有些許差異。
4.軟件多樣性
對于Linux驅動來說,包括mce驅動、apei驅動、edac驅動等;
對于用戶態RAS服務來說,包括mcelog、rasdaemon、perf event通知等;
總體來說,RAS是一個復雜的體系,不同芯片架構、不同硬件RAS功能各不相同,作為RAS開發要根據不同業務場景采取對應的RAS方案。
RAS故障處理流程
以Intel服務器為例,
1.Intel服務器內存發生CE故障后,硬件觸發CMCI中斷,執行OS注冊的中斷處理函數;
2.該函數調用EDAC驅動代碼,讀取MCA狀態寄存器來獲取硬件故障信息,比如故障級別、故障硬件位置、故障地址等等。EDAC驅動會將信息保存在/dev/mcelog;
3.Mcelog是一個用戶態的服務程序,通過解析/dev/mcelog信息,將其保存在/var/log/mcelog。用戶可以通過查看該文件了解此服務器是否發生過硬件故障以及故障發生的時間、硬件信息、是否恢復等關鍵信息;
RAS硬件故障舉例
如下是x86服務器注入內存CE故障的日志,EDAC驅動會打印故障發生所在的硬件(Memory)、Addr、Processor、類型(CE)、memory channel/dimm等信息。
?
C++ [22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR [22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090 [22715.834759] EDAC sbridge MC3: TSC 0 [22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86 [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0 [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - ?area:DRAM err_code0090 socket:0 channel_mask:1 rank:0) |
?
圖片來源:v2-8e4986144a6a70301ee1a30c60c5ffad_720w.webp (720×378) (zhimg.com)
編輯:黃飛
?
評論
查看更多