Some background first: I am currently hacking on the code from the following paper:

https://github.com/QipengGuo/GraphWriter-DGL

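Each baseline experiment takes 5 hours to finish, and my own model takes about twice as long, which makes hyperparameter tuning and debugging painful. So I started trying multi-GPU training to speed things up.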

torch.nn.DataParallel ==> DP for short

torch.nn.parallel.DistributedDataParallel ==> DDP for short

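I first tried DP. It did not work, because DGL packs the nodes of every graph in a batch into one batched graph that cannot be split, whereas torch.nn.DataParallel works by slicing each batch into smaller chunks, one per GPU. DP also speeds training up less than DDP does, so I started converting the code to DDP.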
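To see why the two designs clash, here is a minimal sketch. It assumes two graphs with 3 and 5 nodes whose node features have been concatenated into one tensor, the way a batched graph stores them. The sizes and names are made up for illustration, and torch.chunk only stands in for the way DP slices its inputs along the first dimension.

```python
import torch

# Hypothetical batch: two graphs with 3 and 5 nodes, 4 features per node.
graph_sizes = [3, 5]
node_feats = torch.arange(sum(graph_sizes) * 4, dtype=torch.float32).view(-1, 4)

# DP-style split: cut the first dimension into one chunk per GPU.
chunks = node_feats.chunk(2, dim=0)
chunk_sizes = [c.shape[0] for c in chunks]

# The first chunk holds all of graph one plus one node of graph two,
# so neither chunk is a whole number of graphs.
print(chunk_sizes, "vs graph sizes", graph_sizes)
```

With DDP each process builds its own batched graph from its own samples, so nothing has to be cut apart.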

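In addition, the author's sampler inherits from torch.utils.data.Sampler, because text lengths in the AGENDA dataset are severely imbalanced.

To make training finish faster, texts of similar length are packed into the same batch (friendly reminder: torchtext has a related class, BucketIterator^[1]). The sampler looks roughly like this: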

class BucketSampler(torch.utils.data.Sampler):
    def __init__(self, data_source, batch_size=32):
        self.data_source = data_source
        self.batch_size = batch_size 

    def __iter__(self):
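        # basesampler is the original author's helper (not shown here). Judging by the
        # names, it returns the bucketed index order, the per-sample lengths, an empty
        # batch list, and the batch-size caps for medium and long texts.
        idxs, lens, batch, middle_batch_size, long_batch_size = basesampler(self.data_source, self.batch_size)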
        for idx in idxs:
            batch.append(idx)
            mlen = max([0]+[lens[x] for x in batch])
            #if (mlen<100 and len(batch) == 32) or (mlen>100 and mlen<220 and len(batch) >= 24) or (mlen>220 and len(batch)>=8) or len(batch)==32:
            if (mlen<100 and len(batch) == self.batch_size) or (mlen>100 and mlen<220 and len(batch) >= middle_batch_size) or (mlen>220 and len(batch)>=long_batch_size) or len(batch)==self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0:
            yield batch

    def __len__(self):
        return (len(self.data_source)+self.batch_size-1)//self.batch_size
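The batching rule can be tried in isolation. The following is a simplified sketch in plain Python, not the author's sampler: it sorts indices by length instead of shuffling the three buckets, and the function name bucket_batches is made up. The thresholds 100 and 220 and the 0.75 and 0.5 caps are the ones that appear in this article's code.

```python
def bucket_batches(lens, batch_size=32):
    """Group indices into batches whose size cap shrinks as texts get longer."""
    middle_cap = int(batch_size * 0.75)  # cap for medium-length texts
    long_cap = int(batch_size * 0.5)     # cap for long texts
    order = sorted(range(len(lens)), key=lambda i: lens[i])

    batches, batch = [], []
    for idx in order:
        batch.append(idx)
        mlen = max(lens[i] for i in batch)  # longest text in the current batch
        if mlen < 100:
            cap = batch_size
        elif mlen < 220:
            cap = middle_cap
        else:
            cap = long_cap
        if len(batch) >= cap:
            batches.append(batch)
            batch = []
    if batch:  # leftover samples form a final, smaller batch
        batches.append(batch)
    return batches


# Toy lengths: 8 short, 6 medium and 4 long texts.
lens = [10] * 8 + [150] * 6 + [300] * 4
batches = bucket_batches(lens, batch_size=4)
print([len(b) for b in batches])
```

The number of batches depends on the length distribution, which is why __len__ in the sampler is only an estimate.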

That is the background.

Writing bugs, step one: a hole-ridden subclass of DistributedSampler

I naively copy-pasted the author's sampler source and changed only this line:

class DDPBaseBucketSampler(torch.utils.data.distributed.DistributedSampler):

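I soon found several problems:

the dataloader does not dispatch batches;
the dataloader sends every process the complete dataset, when by rights each process should get 1/n of the data, where n is the number of GPUs you configured;

So I went and read the source^[2], and quickly found this: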

    def __iter__(self) -> Iterator[T_co]:
        if self.shuffle:
            # deterministically shuffle based on epoch and seed
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
        else:
            indices = list(range(len(self.dataset)))  # type: ignore

        if not self.drop_last:
            # add extra samples to make it evenly divisible
            padding_size = self.total_size - len(indices)
            if padding_size <= len(indices):
                indices += indices[:padding_size]
            else:
                indices += (indices * math.ceil(padding_size / len(indices)))[:padding_size]
        else:
            # remove tail of data to make it evenly divisible.
            indices = indices[:self.total_size]
        assert len(indices) == self.total_size

        # subsample
        indices = indices[self.rank:self.total_size:self.num_replicas]  # this step ensures each process gets different data
        assert len(indices) == self.num_samples

        return iter(indices)
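The padding and subsampling steps are easy to check by hand. Below is a minimal sketch in plain Python of the non-shuffled, drop_last=False path for the case where the padding is no longer than the dataset. The function name shard_indices is made up; the arithmetic follows the source quoted above.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the indices one rank would see, without shuffling."""
    num_samples = math.ceil(dataset_len / num_replicas)  # samples per rank
    total_size = num_samples * num_replicas              # padded length
    indices = list(range(dataset_len))
    # Pad by repeating the first few indices so the list divides evenly.
    indices += indices[: total_size - len(indices)]
    # Strided slice: rank r takes positions r, r + n, r + 2n, ...
    return indices[rank:total_size:num_replicas]


shards = [shard_indices(10, 3, r) for r in range(3)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With 10 samples and 3 replicas the list is padded to 12 entries, so indices 0 and 1 are each seen by two ranks.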

What is the key issue here? First, in torch.utils.data.distributed.DistributedSampler the dataset attribute is called self.dataset, not data_source. Second, unlike torch.utils.data.Sampler, which forces you to override __iter__:

def __iter__(self) -> Iterator[T_co]:
        raise NotImplementedError

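the parent class DistributedSampler already implements part of it. If you do not account for that part, every process naturally ends up with all of the data.

So I rewrote my DDPBaseBucketSampler class: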

import random

import torch


def basesampler(lens, indices, batch_size):
    # the magic number comes from the author's code
    t1 = []
    t2 = []
    t3 = []
    for i, l in enumerate(lens):
        if (l<100):
            t1.append(indices[i])
        elif (l>100 and l<220):
            t2.append(indices[i])
        else:
            t3.append(indices[i])
    datas = [t1,t2,t3]
    random.shuffle(datas)
    idxs = sum(datas, [])
    batch = []

    # To avoid running out of GPU memory, cap the batch size for longer texts
    middle_batch_size = min(int(batch_size * 0.75) , 32)
    long_batch_size = min(int(batch_size * 0.5) , 24)

    return idxs, batch, middle_batch_size, long_batch_size

class DDPBaseBucketSampler(torch.utils.data.distributed.DistributedSampler):
    '''
    Keep this in sync with the single-GPU sampler class.
    '''
    def __init__(self, dataset, num_replicas, rank, shuffle=True, batch_size=32):
        super(DDPBaseBucketSampler, self).__init__(dataset, num_replicas, rank, shuffle)
        self.batch_size = batch_size

    def __iter__(self):
        # deterministically shuffle based on epoch
        g = torch.Generator()
        g.manual_seed(self.epoch)
        #print('here is pytorch code and you can delete it in the /home/lzk/anaconda3/lib/python3.7/site-packages/torch/utils/data')
        if self.shuffle:
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        # add extra samples to make it evenly divisible
        indices += indices[:(self.total_size - len(indices))]
        assert len(indices) == self.total_size

        indices = indices[self.rank:self.total_size:self.num_replicas]
        assert len(indices) == self.num_samples

        # I also need the length of every sample (the selected subset differs per rank)
        lens = torch.Tensor([len(x) for x in self.dataset])

        idxs, batch, middle_batch_size, long_batch_size = basesampler(lens[indices], indices, self.batch_size)
        
        for idx in idxs:
            batch.append(idx)
            mlen = max([0]+[lens[x] for x in batch])
            #if (mlen<100 and len(batch) == 32) or (mlen>100 and mlen<220 and len(batch) >= 24) or (mlen>220 and len(batch)>=8) or len(batch)==32:
            if (mlen<100 and len(batch) == self.batch_size) or (mlen>100 and mlen<220 and len(batch) >= middle_batch_size) or (mlen>220 and len(batch)>=long_batch_size) or len(batch)==self.batch_size:
                yield batch
                batch = []
        # print('should appear twice if there are 2 processes')
        if len(batch) > 0:
            yield batch

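    def __len__(self):
        # Estimate based on this rank's share (num_samples), not the full dataset;
        # the true count can be higher because long-text batches are smaller.
        return (self.num_samples + self.batch_size - 1) // self.batch_size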

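After this, each process could finally run on its own share of the data (1/n, where n = number of processes = number of GPUs, on a single machine).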
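An alternative that avoids copying the parent's shuffling and padding logic is to call super().__iter__() and batch whatever it returns. The sketch below is a simplified illustration, not the bucket sampler from this article: ShardedBatchSampler is a made-up name, it uses fixed-size batches, and the dataset is a plain list of integers. Passing num_replicas and rank explicitly means no process group is needed to try it.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


class ShardedBatchSampler(DistributedSampler):
    """Yield fixed-size batches drawn only from this rank's shard."""

    def __init__(self, dataset, num_replicas, rank, batch_size=4, shuffle=False):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
        self.batch_size = batch_size

    def __iter__(self):
        # The parent already handles shuffling, padding and the per-rank slice.
        shard = list(super().__iter__())
        for start in range(0, len(shard), self.batch_size):
            yield shard[start:start + self.batch_size]

    def __len__(self):
        return (self.num_samples + self.batch_size - 1) // self.batch_size


dataset = list(range(10))  # toy dataset; each item is its own index
sampler = ShardedBatchSampler(dataset, num_replicas=2, rank=0, batch_size=4)
# A sampler that yields lists of indices goes in as batch_sampler, not sampler.
loader = DataLoader(dataset, batch_sampler=sampler, num_workers=0)
batches = [b.tolist() for b in loader]
print(batches)
```

When shuffle=True, sampler.set_epoch(epoch) should be called at the start of every epoch so that all ranks reshuffle with the same seed.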

Then the next problem arrived: after training finished normally, the main process could not get out of mp.spawn().

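Writing bugs, step two: the master process cannot exit normally

With the DataLoader's num_workers set, PyTorch DDP could not exit normally. Specifically, the function passed to mp.spawn ran to completion, but the master process kept holding the GPU and never exited. At first I suspected the way the sampler hands out batches. What do I mean? Each process gets different data, and because I pack texts of similar length together, the master process might end up with 100 iterations while the slave has only 80. So I tested that right away: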

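▲ The __iter__ method of the DDPBucketSampler(torch.utils.data.distributed.DistributedSampler) class

▲ Everything prints normally, which shows that __iter__ is not the problem

The batch counts differed only slightly, and the program got past all of these prints in the end, so a mismatch in the number of batches should not be the cause. (Incidentally, the sampler packs the batches very early on.)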
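A small difference in batch counts is harmless only as long as no rank waits on a collective operation that the others never reach. As a precaution, every rank can be truncated to the smallest count. The sketch below shows the idea in plain Python with made-up data; per_rank_batches and truncate_to_common_length are hypothetical names. In a real DDP run each rank only knows its own count, so the minimum would have to be agreed on with something like dist.all_reduce using ReduceOp.MIN.

```python
def truncate_to_common_length(per_rank_batches):
    """Cut every rank's batch list down to the shortest one."""
    common = min(len(batches) for batches in per_rank_batches)
    return [batches[:common] for batches in per_rank_batches]


# Hypothetical result of length-aware batching on two ranks.
per_rank_batches = [
    [[0, 2], [4, 6], [8]],  # rank 0: three batches
    [[1, 3], [5, 7, 9]],    # rank 1: two batches
]
equalized = truncate_to_common_length(per_rank_batches)
print([len(batches) for batches in equalized])
```

The cost is that a few samples are skipped in each epoch, which reshuffling across epochs evens out.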

Destroying the process group did not help either:

if args.is_ddp:
    dist.destroy_process_group()
    print('rank destroy_process_group: ', rank)

After that I could only force it to quit:

File "train.py", line 322, in <module>
    main(args.gpu, args)
  File "/home/lzk/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/lzk/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/lzk/anaconda3/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/home/lzk/anaconda3/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
TypeError: keyboard_interrupt_handler() takes 1 positional argument but 2 were given
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/lzk/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
TypeError: keyboard_interrupt_handler() takes 1 positional argument but 2 were given

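References for this part: 基于Python初探Linux下的僵尸進程和孤兒進程(三)^[3] (a first look at zombie and orphan processes on Linux with Python, part 3) and Multiprocessing in python blocked^[4]

Evidently the PyTorch master process had deadlocked and was hanging like a zombie process (strictly speaking it was still alive and holding the GPU, so it was not a zombie in the OS sense).

Digging further, I found that the program exits normally when the dataloader's num_workers is set to 0. By commenting things out I found that even if the whole body of for _i, batch in enumerate(dataloader) is replaced with pass, the master still fails to exit. So the problem is pinned on the dataloader. Reference: nero: PyTorch DataLoader初探^[5]

Another possibility was that mp.spawn itself was at fault. A process started with the spawn method only runs the code related to the target argument or the run() method. Windows supports only this method, and it is the default there. Compared with the other two start methods (fork and forkserver), spawn is the slowest at starting processes. Reference: Python設置進程啟動的3種方式^[6]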
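Which start methods a platform offers can be checked with the standard library. This short sketch only inspects the options and creates a spawn context; it does not start any process.

```python
import multiprocessing as mp

# On Linux this typically lists fork, spawn and forkserver; on Windows only spawn.
available = mp.get_all_start_methods()

# A context pins one start method without changing the global default.
ctx = mp.get_context("spawn")
print(available, ctx.get_start_method())
```

Processes created through ctx.Process would then use spawn, whatever the platform default is.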

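Now let's try bypassing mp.spawn and launching DDP from a shell script, to see whether the error goes away:

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="192.168.1.201" --master_port=23456 my_script.py

Parameter notes:

nnodes: this is a single machine with multiple GPUs, so it is 1, and node_rank can therefore only be 0
local_rank: at run time the launcher passes local_rank to each process through args to identify the process index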
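On the script side, the injected argument has to be declared before it can be read. A minimal sketch, assuming the script may be started either with --local_rank on the command line or with the LOCAL_RANK environment variable that the launcher also sets; parse_local_rank is a made-up helper name.

```python
import argparse
import os


def parse_local_rank(argv=None, environ=os.environ):
    """Read the process index from --local_rank, else from LOCAL_RANK, else 0."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores the script's other options in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(environ.get("LOCAL_RANK", 0))


print(parse_local_rank(["--local_rank=1"]))
```

The returned value is what would be passed to torch.cuda.set_device or used to pick the GPU for that process.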

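After some changes things improved. The most obvious difference was that it ran much faster!! I no longer had the parent-process problem, but the program still could not exit normally after everything had run.

My code had run up to the point shown in a screenshot that is not reproduced here.

That code is the main function. Both processes (master and slave) get past the barrier; the slave exits cleanly, but the master never does.

Pressing Ctrl+C at this point produced a traceback (also not reproduced here).

Following the path in the error to torch/distributed/launch.py, line 239, leads to this code: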

def main():
    args = parse_args()

    # world size in terms of number of processes
    dist_world_size = args.nproc_per_node * args.nnodes

    # set PyTorch distributed related environmental variables
    current_env = os.environ.copy()
    current_env["MASTER_ADDR"] = args.master_addr
    current_env["MASTER_PORT"] = str(args.master_port)
    current_env["WORLD_SIZE"] = str(dist_world_size)

    processes = []

    if 'OMP_NUM_THREADS' not in os.environ and args.nproc_per_node > 1:
        current_env["OMP_NUM_THREADS"] = str(1)
        print("*****************************************\n"
              "Setting OMP_NUM_THREADS environment variable for each process "
              "to be {} in default, to avoid your system being overloaded, "
              "please further tune the variable for optimal performance in "
              "your application as needed. \n"
              "*****************************************".format(current_env["OMP_NUM_THREADS"]))

    for local_rank in range(0, args.nproc_per_node):
        # each process's rank
        dist_rank = args.nproc_per_node * args.node_rank + local_rank
        current_env["RANK"] = str(dist_rank)
        current_env["LOCAL_RANK"] = str(local_rank)

        # spawn the processes
        if args.use_env:
            cmd = [sys.executable, "-u",
                   args.training_script] + args.training_script_args
        else:
            cmd = [sys.executable,
                   "-u",
                   args.training_script,
                   "--local_rank={}".format(local_rank)] + args.training_script_args

        process = subprocess.Popen(cmd, env=current_env)
        processes.append(process)

    for process in processes:
        process.wait()  # wait for the process to finish
        if process.returncode != 0:
            raise subprocess.CalledProcessError(returncode=process.returncode,
                                                cmd=cmd)
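The environment each child receives can be reproduced without launching anything. The sketch below mirrors the rank arithmetic in the launcher code above; worker_envs is a made-up helper, and the address and port are the ones from the command used in this article.

```python
def worker_envs(nproc_per_node, nnodes, node_rank, master_addr, master_port):
    """Build the distributed environment variables for every local process."""
    world_size = nproc_per_node * nnodes
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank: processes on earlier nodes come first.
            "RANK": str(nproc_per_node * node_rank + local_rank),
            "LOCAL_RANK": str(local_rank),
        })
    return envs


envs = worker_envs(2, 1, 0, "192.168.1.201", 23456)
for env in envs:
    print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

With init_method='env://', init_process_group reads MASTER_ADDR and MASTER_PORT from these variables, and also RANK and WORLD_SIZE when they are not passed explicitly.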

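Damn, what on earth does the master have to do with the dataloader...

This problem was finally solved yesterday (2020/12/22), and it is almost funny. In my left hand I had the GraphWriter DDP implementation, which could not exit normally; in my right hand a minimal MNIST DDP example, which could. So I started deleting things. I swapped out the dataset and the model, then let the dataloader spin idle, and still found nothing. Step by step I closed in, until I commented out this one line of my own code and the program could finally exit normally: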

def main(args):
    ############################################################
    print('local_rank : ' , args.local_rank )
    if args.is_ddp:
        dist.init_process_group(
            backend='nccl',
            init_method='env://',
            world_size=args.world_size,
            rank=args.local_rank
        )
    ############################################################
    # torch.multiprocessing.set_sharing_strategy('file_system')  # the root of all evil

    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["CUDA_VISIBLE_DEVICES"].split(',')[args.local_rank]
    args.device = torch.device(0) 
    ...

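Why had I added that line in the first place? While tuning num_workers (I was young and assumed bigger is better, so I set num_workers = cpu.count()), the system raised an error saying the limit on open files had been exceeded. In torch.multiprocessing the default sharing strategy (see the PyTorch Chinese documentation^[7]) is file_descriptor, which uses file descriptors as shared-memory handles. Whenever a storage is moved to shared memory, a file descriptor obtained from shm_open is cached with the object. The documentation at the time also said:

If your system has a limit on the number of open file descriptors and you cannot raise it, you should use the file_system strategy.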
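Before giving up on the default strategy, it is worth checking whether the limit really cannot be raised. A minimal sketch using the standard resource module (Unix only); the target of 4096 is an arbitrary illustrative value and raise_open_file_limit is a made-up helper. An unprivileged process may raise its soft limit up to the hard limit.

```python
import resource


def raise_open_file_limit(desired=4096):
    """Raise the soft open-file limit toward `desired`, bounded by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


before = resource.getrlimit(resource.RLIMIT_NOFILE)
after = raise_open_file_limit()
print("soft/hard before:", before, "after:", after)
```

Lowering num_workers is the other obvious way to stay under the limit.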

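So I switched to torch.multiprocessing.set_sharing_strategy('file_system'), but overlooked the warning in the documentation about shared-memory leaks. Perhaps the leak itself is not a serious problem: the documentation says that with this strategy torch.multiprocessing spawns a daemon named torch_shm_manager that keeps track of the shared-memory allocations and cleans them up after the processes connected to it have exited.

It is also possible that what I have been calling the master process was actually this torch_shm_manager, since destroying the process group never managed to end process 0.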
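Before changing the setting, it helps to check which strategy is active and which ones the platform supports. A short sketch that only reads the values and changes nothing:

```python
import torch.multiprocessing as mp

current = mp.get_sharing_strategy()
supported = mp.get_all_sharing_strategies()

# On Linux both file_descriptor and file_system are usually available.
print("current:", current, "supported:", sorted(supported))
```

If the output already shows file_descriptor and the open-file limit is not being hit, there is no reason to call set_sharing_strategy at all.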

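This bug is over, I am so happy, and I look forward to the next one arriving soon.

Editor: xj

Original title: Pytorch翻車記錄：單卡改多卡踩坑記！ (PyTorch crash log: pitfalls of going from one GPU to many)

Source: WeChat official account 深度學習自然語言處理 (WeChat ID: zenRRan); follows are welcome. Please credit the source when reposting.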
