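1. Yitian 710 DDR5 Subsystem
The Yitian 710 supports state-of-the-art DDR5 DRAM, providing high memory bandwidth for cloud computing and HPC. It has 8 DDR5 channels, 4 on each die. Each channel serves the system's memory requests independently and supports DDR5-4400 at 1DPC (DIMM Per Channel) and DDR5-4000 at 2DPC.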
1.1 DDR5 Architecture
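A major change in DDR5 is the new DIMM channel structure (see Channel Architecture in Fig 2). A DDR4 DIMM has a 72-bit bus, made up of 64 data bits and 8 ECC bits. Each DDR5 DIMM has two independent sub-channels, each with a 40-bit bus: 32 data bits and 8 ECC bits. Although DDR4 and DDR5 have the same total data width (64 bits), the two independent sub-channels improve memory access efficiency and reduce latency. A single channel can only read or write at any one time, whereas the two sub-channels of DDR5 allow a read and a write to proceed simultaneously.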
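1.2 DDR5 Theoretical Bandwidth
The theoretical bandwidth of DDR5-4000 at 2DPC on the Yitian 710 is:
- 4000 MHz * 32 bit / 8 * 8 * 2 = 128 * 10^9 * 2 bytes/s = 128 GB/s * 2 = 256 GB/s
- effective memory frequency (4000 MHz) * sub-channel width (32 bit) / 8 * sub-channels per die (8) * dies (2)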
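The arithmetic above can be reproduced with a short Python sketch. The function and parameter names are illustrative only and do not come from the source; the values plugged in are the ones used in the formula above.

```python
def theoretical_bandwidth_bytes_per_s(effective_mhz, subchannel_bits,
                                      subchannels_per_die, dies):
    """Peak bandwidth = transfers/s * bytes per transfer * sub-channels * dies."""
    transfers_per_s = effective_mhz * 1_000_000
    bytes_per_transfer = subchannel_bits // 8  # data bits only, ECC excluded
    return transfers_per_s * bytes_per_transfer * subchannels_per_die * dies


# DDR5-4000 at 2DPC: 32-bit sub-channels, 8 sub-channels per die, 2 dies
peak = theoretical_bandwidth_bytes_per_s(4000, 32, 8, 2)
print(peak / 1e9, "GB/s")
```

Passing `dies=1` gives the per-die figure of 128 GB/s from the first step of the calculation.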
Note the difference between GB and GiB:
- 1 GB = 1000000000 bytes (= 1000^3 B = 10^9 B)
- 1 GiB = 1073741824 bytes (= 1024^3 B = 2^30 B).
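As a quick illustration of why the distinction matters, the sketch below converts a decimal GB figure into GiB. The helper name `gb_to_gib` is made up for this example.

```python
GB = 1000 ** 3   # 10^9 bytes
GIB = 1024 ** 3  # 2^30 bytes


def gb_to_gib(gigabytes):
    """Convert decimal gigabytes to binary gibibytes."""
    return gigabytes * GB / GIB


# The 256 GB/s theoretical peak expressed in GiB/s
print(round(gb_to_gib(256), 2))
```

The same bandwidth therefore looks roughly 7% smaller when a tool reports it in GiB/s instead of GB/s.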
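2. Yitian 710 DDRSS PMU
The Yitian 710 DDRSS implements an independent PMU for each sub-channel, used for performance and functional debugging. Each sub-channel PMU contains 16 general-purpose counters.
Bandwidth is calculated as: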
- DRAM Read Bandwidth = perf_hif_rd *DDRC_WIDTH *DDRC_Freq / DDRC_Cycle
- DRAM Write Bandwidth = (perf_hif_wr + perf_hif_rmw) *DDRC_WIDTH *DDRC_Freq / DDRC_Cycle
- DDRC_WIDTH: Units of 64 bytes
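The two formulas can be written as small Python helpers. This is a minimal sketch: the function names are illustrative, DDRC_WIDTH is taken as 64 bytes per counted request following the note above, and the counter values and the 500 MHz controller clock in the usage line are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
DDRC_WIDTH_BYTES = 64  # bytes per counted request, per the DDRC_WIDTH note


def read_bandwidth(perf_hif_rd, ddrc_freq_hz, ddrc_cycles):
    """Read bandwidth of one sub-channel in bytes per second."""
    return perf_hif_rd * DDRC_WIDTH_BYTES * ddrc_freq_hz / ddrc_cycles


def write_bandwidth(perf_hif_wr, perf_hif_rmw, ddrc_freq_hz, ddrc_cycles):
    """Write bandwidth of one sub-channel; read-modify-writes count as writes."""
    writes = perf_hif_wr + perf_hif_rmw
    return writes * DDRC_WIDTH_BYTES * ddrc_freq_hz / ddrc_cycles


# Hypothetical sample: 1,000,000 reads over 500,000,000 cycles at an ASSUMED
# controller clock of 500 MHz, which corresponds to one second of sampling.
print(read_bandwidth(1_000_000, 500e6, 500_000_000) / 1e6, "MB/s")
```

DDRC_Cycle / DDRC_Freq is the length of the sampling window in seconds, so when the window is exactly one second the formula reduces to the event count multiplied by 64 bytes.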
3. Cloud-kernel Support for the DDRSS PMU
#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 128
Socket(s): 1
NUMA node(s): 2
...
The test environment has 1 socket with 2 dies, exposed as two NUMA nodes.
#numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 257416 MB
node 0 free: 187991 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 257014 MB
node 1 free: 194504 MB
node distances:
node 0 1
0: 10 15
1: 15 10
Each NUMA node has 256 GB of memory.
#dmidecode|grep -P -A5 "Memory\s+Device"|grep Size|grep -v Range
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: 32 GB
Size: No Module Installed
...
#dmidecode -t memory | grep Speed:
Speed: 4000 MHz
Configured Clock Speed: 4000 MHz
The system is configured as 2DPC with 16 DIMMs installed in total, 8 per die, running at an effective frequency of 4000 MHz.
#ls /sys/bus/event_source/devices/ | grep drw
ali_drw_21000
ali_drw_21080
ali_drw_23000
ali_drw_23080
ali_drw_25000
ali_drw_25080
ali_drw_27000
ali_drw_27080
ali_drw_40021000
ali_drw_40021080
ali_drw_40023000
ali_drw_40023080
ali_drw_40025000
ali_drw_40025080
ali_drw_40027000
ali_drw_40027080
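With 2DPC fully populated there are 16 PMU devices in total. ali_drw_21000 and ali_drw_21080 are the two sub-channels of the same DDR5 channel on Die 0. Devices named ali_drw_2X000 belong to Die 0, and devices named ali_drw_4002X000 belong to Die 1.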
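The naming pattern can be decoded programmatically. The sketch below is an interpretation of the device list shown above, not a documented driver interface: it assumes that the hexadecimal suffix is an address, that Die 1 devices carry the 0x40000000 offset, and that bit 0x80 distinguishes the second sub-channel of a pair.

```python
PREFIX = "ali_drw_"
DIE1_BASE = 0x40000000  # assumption: Die 1 devices carry this address offset
SUBCHANNEL_BIT = 0x80   # assumption: set on the second sub-channel of a pair


def classify(device):
    """Return (die, channel_base_address, sub_channel) for a PMU device name."""
    addr = int(device[len(PREFIX):], 16)
    die = 1 if addr >= DIE1_BASE else 0
    sub_channel = 1 if addr & SUBCHANNEL_BIT else 0
    return die, addr & ~SUBCHANNEL_BIT, sub_channel


for name in ("ali_drw_21000", "ali_drw_21080", "ali_drw_40027080"):
    die, base, sub = classify(name)
    print(f"{name}: die {die}, channel base {base:#x}, sub-channel {sub}")
```

Grouping devices by `(die, channel base)` pairs up the sub-channels, which is convenient when summing counters per channel or per die.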
4. DDR 帶寬準確性驗證
4.1 TL;DR
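Bandwidth unit: MB/s
Across the four tests below, the bandwidth derived from the DDR PMU differs from the bw_mem result by less than 1%. For the principle behind the test, see "Yitian 710 Performance Monitoring: CMN Flit Traffic Trace with Watchpoint Event".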
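The error figure can be checked with a few lines of Python. The numbers are copied from the PMU calculations and bw_mem outputs in sections 4.2 to 4.5; the helper name and the choice of bw_mem as the reference value are assumptions made for this sketch.

```python
# (test, PMU-derived MB/s, bw_mem MB/s), copied from sections 4.2 to 4.5
results = [
    ("C0M0 rd", 20561.675, 20507.82),
    ("C1M1 rd", 20516.303, 20492.53),
    ("C0M0 fwr", 21969.226, 21936.50),
    ("C1M1 fwr", 21966.889, 21934.51),
]


def relative_error_percent(measured, reference):
    """Absolute difference as a percentage of the reference value."""
    return abs(measured - reference) / reference * 100.0


for test, pmu, bw_mem in results:
    print(f"{test}: {relative_error_percent(pmu, bw_mem):.2f}%")
```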
4.2 C0M0 rd
# First, run bw_mem as a background workload
# numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd
# Then run perf command in another console
perf stat -e ali_drw_21000/perf_hif_wr/ -e ali_drw_21000/perf_hif_rd/ -e ali_drw_21000/perf_hif_rmw/ -e ali_drw_21000/perf_cycle/ -e ali_drw_21080/perf_hif_wr/ -e ali_drw_21080/perf_hif_rd/ -e ali_drw_21080/perf_hif_rmw/ -e ali_drw_21080/perf_cycle/ -e ali_drw_23000/perf_hif_wr/ -e ali_drw_23000/perf_hif_rd/ -e ali_drw_23000/perf_hif_rmw/ -e ali_drw_23000/perf_cycle/ -e ali_drw_23080/perf_hif_wr/ -e ali_drw_23080/perf_hif_rd/ -e ali_drw_23080/perf_hif_rmw/ -e ali_drw_23080/perf_cycle/ -e ali_drw_25000/perf_hif_wr/ -e ali_drw_25000/perf_hif_rd/ -e ali_drw_25000/perf_hif_rmw/ -e ali_drw_25000/perf_cycle/ -e ali_drw_25080/perf_hif_wr/ -e ali_drw_25080/perf_hif_rd/ -e ali_drw_25080/perf_hif_rmw/ -e ali_drw_25080/perf_cycle/ -e ali_drw_27000/perf_hif_wr/ -e ali_drw_27000/perf_hif_rd/ -e ali_drw_27000/perf_hif_rmw/ -e ali_drw_27000/perf_cycle/ -e ali_drw_27080/perf_hif_wr/ -e ali_drw_27080/perf_hif_rd/ -e ali_drw_27080/perf_hif_rmw/ -e ali_drw_27080/perf_cycle/ -a -- sleep 1
Performance counter stats for 'system wide':
12398 ali_drw_21000/perf_hif_wr/
40160751 ali_drw_21000/perf_hif_rd/
743 ali_drw_21000/perf_hif_rmw/
500620725 ali_drw_21000/perf_cycle/
12252 ali_drw_21080/perf_hif_wr/
40161013 ali_drw_21080/perf_hif_rd/
767 ali_drw_21080/perf_hif_rmw/
500619340 ali_drw_21080/perf_cycle/
11960 ali_drw_23000/perf_hif_wr/
40159522 ali_drw_23000/perf_hif_rd/
737 ali_drw_23000/perf_hif_rmw/
500613505 ali_drw_23000/perf_cycle/
12044 ali_drw_23080/perf_hif_wr/
40159066 ali_drw_23080/perf_hif_rd/
773 ali_drw_23080/perf_hif_rmw/
500607620 ali_drw_23080/perf_cycle/
12698 ali_drw_25000/perf_hif_wr/
40160138 ali_drw_25000/perf_hif_rd/
709 ali_drw_25000/perf_hif_rmw/
500601240 ali_drw_25000/perf_cycle/
12521 ali_drw_25080/perf_hif_wr/
40160169 ali_drw_25080/perf_hif_rd/
727 ali_drw_25080/perf_hif_rmw/
500594755 ali_drw_25080/perf_cycle/
12171 ali_drw_27000/perf_hif_wr/
40159404 ali_drw_27000/perf_hif_rd/
706 ali_drw_27000/perf_hif_rmw/
500589945 ali_drw_27000/perf_cycle/
12290 ali_drw_27080/perf_hif_wr/
40157620 ali_drw_27080/perf_hif_rd/
710 ali_drw_27080/perf_hif_rmw/
500583305 ali_drw_27080/perf_cycle/
1.000923276 seconds time elapsed
>>> 40159522*8*64/1000/1000.0
20561.675
# set CPU and memory to the same NUMA node
numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd
40960.00 20507.82
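The calculation above multiplies the count of one sub-channel by 8. As an alternative sketch, the counters of every sub-channel can be summed directly from the perf stat text. The excerpt embedded below contains only the first two sub-channels from the output above, and dividing by the elapsed time reported by perf is an assumption of this example (the calculation above simply treats the sample as one second). The function names are illustrative.

```python
import re
from collections import defaultdict

# Excerpt of the perf stat output above (first two sub-channels only)
SAMPLE = """\
12398 ali_drw_21000/perf_hif_wr/
40160751 ali_drw_21000/perf_hif_rd/
743 ali_drw_21000/perf_hif_rmw/
500620725 ali_drw_21000/perf_cycle/
12252 ali_drw_21080/perf_hif_wr/
40161013 ali_drw_21080/perf_hif_rd/
767 ali_drw_21080/perf_hif_rmw/
500619340 ali_drw_21080/perf_cycle/
"""

LINE = re.compile(r"^\s*([\d,]+)\s+(ali_drw_[0-9a-f]+)/(\w+)/")


def sum_events(text):
    """Sum each event's count across all ali_drw devices found in the text."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            count, _device, event = match.groups()
            totals[event] += int(count.replace(",", ""))
    return dict(totals)


def bandwidth_mb_per_s(totals, elapsed_s, width_bytes=64):
    """Return (read, write) bandwidth in MB/s over the sampling interval."""
    reads = totals.get("perf_hif_rd", 0)
    writes = totals.get("perf_hif_wr", 0) + totals.get("perf_hif_rmw", 0)
    scale = width_bytes / elapsed_s / 1e6
    return reads * scale, writes * scale


totals = sum_events(SAMPLE)
read_mb, write_mb = bandwidth_mb_per_s(totals, elapsed_s=1.000923276)
print(f"read {read_mb:.1f} MB/s, write {write_mb:.1f} MB/s (two sub-channels)")
```

Feeding the full output of all eight sub-channels to `sum_events` gives the per-die total, and the same parser works for the Die 1 devices because it matches any `ali_drw_` name.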
4.3 C1M1 rd
# First, run bw_mem as a background workload
# numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd
# Then run perf command in another console
perf stat -e ali_drw_40021000/perf_hif_wr/ -e ali_drw_40021000/perf_hif_rd/ -e ali_drw_40021000/perf_hif_rmw/ -e ali_drw_40021000/perf_cycle/ -e ali_drw_40021080/perf_hif_wr/ -e ali_drw_40021080/perf_hif_rd/ -e ali_drw_40021080/perf_hif_rmw/ -e ali_drw_40021080/perf_cycle/ -e ali_drw_40023000/perf_hif_wr/ -e ali_drw_40023000/perf_hif_rd/ -e ali_drw_40023000/perf_hif_rmw/ -e ali_drw_40023000/perf_cycle/ -e ali_drw_40023080/perf_hif_wr/ -e ali_drw_40023080/perf_hif_rd/ -e ali_drw_40023080/perf_hif_rmw/ -e ali_drw_40023080/perf_cycle/ -e ali_drw_40025000/perf_hif_wr/ -e ali_drw_40025000/perf_hif_rd/ -e ali_drw_40025000/perf_hif_rmw/ -e ali_drw_40025000/perf_cycle/ -e ali_drw_40025080/perf_hif_wr/ -e ali_drw_40025080/perf_hif_rd/ -e ali_drw_40025080/perf_hif_rmw/ -e ali_drw_40025080/perf_cycle/ -e ali_drw_40027000/perf_hif_wr/ -e ali_drw_40027000/perf_hif_rd/ -e ali_drw_40027000/perf_hif_rmw/ -e ali_drw_40027000/perf_cycle/ -e ali_drw_40027080/perf_hif_wr/ -e ali_drw_40027080/perf_hif_rd/ -e ali_drw_40027080/perf_hif_rmw/ -e ali_drw_40027080/perf_cycle/ -a -- sleep 1
Performance counter stats for 'system wide':
2329 ali_drw_40021000/perf_hif_wr/
40071983 ali_drw_40021000/perf_hif_rd/
58 ali_drw_40021000/perf_hif_rmw/
500572165 ali_drw_40021000/perf_cycle/
2374 ali_drw_40021080/perf_hif_wr/
40071737 ali_drw_40021080/perf_hif_rd/
39 ali_drw_40021080/perf_hif_rmw/
500569615 ali_drw_40021080/perf_cycle/
2330 ali_drw_40023000/perf_hif_wr/
40071063 ali_drw_40023000/perf_hif_rd/
74 ali_drw_40023000/perf_hif_rmw/
500565635 ali_drw_40023000/perf_cycle/
2372 ali_drw_40023080/perf_hif_wr/
40070344 ali_drw_40023080/perf_hif_rd/
54 ali_drw_40023080/perf_hif_rmw/
500561355 ali_drw_40023080/perf_cycle/
2362 ali_drw_40025000/perf_hif_wr/
40070906 ali_drw_40025000/perf_hif_rd/
45 ali_drw_40025000/perf_hif_rmw/
500557480 ali_drw_40025000/perf_cycle/
2385 ali_drw_40025080/perf_hif_wr/
40070168 ali_drw_40025080/perf_hif_rd/
46 ali_drw_40025080/perf_hif_rmw/
500552550 ali_drw_40025080/perf_cycle/
2333 ali_drw_40027000/perf_hif_wr/
40069233 ali_drw_40027000/perf_hif_rd/
28 ali_drw_40027000/perf_hif_rmw/
500548745 ali_drw_40027000/perf_cycle/
2211 ali_drw_40027080/perf_hif_wr/
40068227 ali_drw_40027080/perf_hif_rd/
30 ali_drw_40027080/perf_hif_rmw/
500544450 ali_drw_40027080/perf_cycle/
1.000863258 seconds time elapsed
>>> 40070906*8*64/1000/1000.0
20516.303
numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd
40960.00 20492.53
4.4 C0M0 fwr
# First, run bw_mem as a background workload
# numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr
# Then run perf command in another console
perf stat -e ali_drw_21000/perf_hif_wr/ -e ali_drw_21000/perf_hif_rd/ -e ali_drw_21000/perf_hif_rmw/ -e ali_drw_21000/perf_cycle/ -e ali_drw_21080/perf_hif_wr/ -e ali_drw_21080/perf_hif_rd/ -e ali_drw_21080/perf_hif_rmw/ -e ali_drw_21080/perf_cycle/ -e ali_drw_23000/perf_hif_wr/ -e ali_drw_23000/perf_hif_rd/ -e ali_drw_23000/perf_hif_rmw/ -e ali_drw_23000/perf_cycle/ -e ali_drw_23080/perf_hif_wr/ -e ali_drw_23080/perf_hif_rd/ -e ali_drw_23080/perf_hif_rmw/ -e ali_drw_23080/perf_cycle/ -e ali_drw_25000/perf_hif_wr/ -e ali_drw_25000/perf_hif_rd/ -e ali_drw_25000/perf_hif_rmw/ -e ali_drw_25000/perf_cycle/ -e ali_drw_25080/perf_hif_wr/ -e ali_drw_25080/perf_hif_rd/ -e ali_drw_25080/perf_hif_rmw/ -e ali_drw_25080/perf_cycle/ -e ali_drw_27000/perf_hif_wr/ -e ali_drw_27000/perf_hif_rd/ -e ali_drw_27000/perf_hif_rmw/ -e ali_drw_27000/perf_cycle/ -e ali_drw_27080/perf_hif_wr/ -e ali_drw_27080/perf_hif_rd/ -e ali_drw_27080/perf_hif_rmw/ -e ali_drw_27080/perf_cycle/ -a -- sleep 1
Performance counter stats for 'system wide':
42910737 ali_drw_21000/perf_hif_wr/
108397 ali_drw_21000/perf_hif_rd/
495 ali_drw_21000/perf_hif_rmw/
500708510 ali_drw_21000/perf_cycle/
42911223 ali_drw_21080/perf_hif_wr/
117280 ali_drw_21080/perf_hif_rd/
515 ali_drw_21080/perf_hif_rmw/
500706780 ali_drw_21080/perf_cycle/
42910038 ali_drw_23000/perf_hif_wr/
109179 ali_drw_23000/perf_hif_rd/
516 ali_drw_23000/perf_hif_rmw/
500702100 ali_drw_23000/perf_cycle/
42911620 ali_drw_23080/perf_hif_wr/
111038 ali_drw_23080/perf_hif_rd/
523 ali_drw_23080/perf_hif_rmw/
500697340 ali_drw_23080/perf_cycle/
42910435 ali_drw_25000/perf_hif_wr/
111748 ali_drw_25000/perf_hif_rd/
469 ali_drw_25000/perf_hif_rmw/
500692500 ali_drw_25000/perf_cycle/
42908786 ali_drw_25080/perf_hif_wr/
110177 ali_drw_25080/perf_hif_rd/
456 ali_drw_25080/perf_hif_rmw/
500686595 ali_drw_25080/perf_cycle/
42908903 ali_drw_27000/perf_hif_wr/
114093 ali_drw_27000/perf_hif_rd/
490 ali_drw_27000/perf_hif_rmw/
500681405 ali_drw_27000/perf_cycle/
42908156 ali_drw_27080/perf_hif_wr/
109668 ali_drw_27080/perf_hif_rd/
489 ali_drw_27080/perf_hif_rmw/
500676420 ali_drw_27080/perf_cycle/
1.001100811 seconds time elapsed
>>> (42908156+489)*8*64/1000/1000.0
21969.226
numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr
40960.00 21936.50
4.5 C1M1 fwr
# First, run bw_mem as a background workload
# numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr
# Then run perf command in another console
perf stat -e ali_drw_40021000/perf_hif_wr/ -e ali_drw_40021000/perf_hif_rd/ -e ali_drw_40021000/perf_hif_rmw/ -e ali_drw_40021000/perf_cycle/ -e ali_drw_40021080/perf_hif_wr/ -e ali_drw_40021080/perf_hif_rd/ -e ali_drw_40021080/perf_hif_rmw/ -e ali_drw_40021080/perf_cycle/ -e ali_drw_40023000/perf_hif_wr/ -e ali_drw_40023000/perf_hif_rd/ -e ali_drw_40023000/perf_hif_rmw/ -e ali_drw_40023000/perf_cycle/ -e ali_drw_40023080/perf_hif_wr/ -e ali_drw_40023080/perf_hif_rd/ -e ali_drw_40023080/perf_hif_rmw/ -e ali_drw_40023080/perf_cycle/ -e ali_drw_40025000/perf_hif_wr/ -e ali_drw_40025000/perf_hif_rd/ -e ali_drw_40025000/perf_hif_rmw/ -e ali_drw_40025000/perf_cycle/ -e ali_drw_40025080/perf_hif_wr/ -e ali_drw_40025080/perf_hif_rd/ -e ali_drw_40025080/perf_hif_rmw/ -e ali_drw_40025080/perf_cycle/ -e ali_drw_40027000/perf_hif_wr/ -e ali_drw_40027000/perf_hif_rd/ -e ali_drw_40027000/perf_hif_rmw/ -e ali_drw_40027000/perf_cycle/ -e ali_drw_40027080/perf_hif_wr/ -e ali_drw_40027080/perf_hif_rd/ -e ali_drw_40027080/perf_hif_rmw/ -e ali_drw_40027080/perf_cycle/ -a -- sleep 1
Performance counter stats for 'system wide':
42906048 ali_drw_40021000/perf_hif_wr/
33939 ali_drw_40021000/perf_hif_rd/
76 ali_drw_40021000/perf_hif_rmw/
500629355 ali_drw_40021000/perf_cycle/
42905967 ali_drw_40021080/perf_hif_wr/
34018 ali_drw_40021080/perf_hif_rd/
63 ali_drw_40021080/perf_hif_rmw/
500631900 ali_drw_40021080/perf_cycle/
42905422 ali_drw_40023000/perf_hif_wr/
33843 ali_drw_40023000/perf_hif_rd/
75 ali_drw_40023000/perf_hif_rmw/
500628540 ali_drw_40023000/perf_cycle/
42905547 ali_drw_40023080/perf_hif_wr/
33858 ali_drw_40023080/perf_hif_rd/
68 ali_drw_40023080/perf_hif_rmw/
500623970 ali_drw_40023080/perf_cycle/
42905230 ali_drw_40025000/perf_hif_wr/
34028 ali_drw_40025000/perf_hif_rd/
56 ali_drw_40025000/perf_hif_rmw/
500620630 ali_drw_40025000/perf_cycle/
42904734 ali_drw_40025080/perf_hif_wr/
34141 ali_drw_40025080/perf_hif_rd/
61 ali_drw_40025080/perf_hif_rmw/
500615840 ali_drw_40025080/perf_cycle/
42903390 ali_drw_40027000/perf_hif_wr/
33712 ali_drw_40027000/perf_hif_rd/
84 ali_drw_40027000/perf_hif_rmw/
500610635 ali_drw_40027000/perf_cycle/
42903975 ali_drw_40027080/perf_hif_wr/
33916 ali_drw_40027080/perf_hif_rd/
106 ali_drw_40027080/perf_hif_rmw/
500606645 ali_drw_40027080/perf_cycle/
1.000953335 seconds time elapsed
>>> (42903975+106)*8*64/1000/1000.0
21966.889
#numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr
40960.00 21934.51