
LLM Training Trick Series: Rejection Sampling

深度学习自然语言处理 · Source: NLP工作站 · 2023-08-21 15:07

Background

Rejection sampling is a Monte Carlo algorithm for drawing samples from a complex ("hard-to-sample") distribution with the help of a proxy (proposal) distribution.

What does Monte Carlo mean? Any method or algorithm that uses random numbers to solve a problem is classified as a Monte Carlo method. In rejection sampling, the Monte Carlo component, i.e. the randomness, is what implements the acceptance criterion of the algorithm. As for sampling, a core idea present in almost all Monte Carlo methods is: if you cannot sample from your target distribution function, use another distribution function instead (hence the name proposal function).


The figure above illustrates the Monte Carlo idea: random points are thrown onto a rectangle, and the fraction that lands inside the circle is used to estimate the circle's area and the value of π.

However, the sampling procedure must "follow the target distribution": we should obtain samples in proportion to how likely they are to occur. Put simply, high-probability regions should yield more samples.

This also means that when we use a proposal function, we must introduce the necessary correction so that the sampling procedure still follows the target distribution function. That "correction" then takes the form of an acceptance criterion.

The main idea behind the method is this: if we want to sample from a distribution p(x), we use another, easier distribution q(x) to help. The only restriction is that p(x) < M q(x) for all x, for some M > 1. The method is mainly used when the form of p(x) makes direct sampling difficult, but p(x) can still be evaluated at any point x.

The algorithm breaks down as follows:

  1. Sample x from q(x).
  2. Sample y from U(0, Mq(x)) (a uniform distribution).
  3. If y < p(x), accept x as a sample from p(x); otherwise return to step 1.

The method works because the uniform draw scales the "envelope" M q(x) down to the target density p(x). Another way to see it: the probability of obtaining a particular point x0 is proportional to the probability of drawing x0 from q(x) multiplied by the fraction of times it is accepted, and that acceptance fraction is exactly the ratio p(x0) / (M q(x0)).


In the figure above, once we have drawn a sample from q(x) (here, x = 2), we draw from a uniform distribution whose range equals the height of Mq(x). If the draw falls within the height of the target density, we accept it (shown in green); otherwise we reject it.
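To make the three steps above concrete, here is a minimal, self-contained NumPy sketch of the accept/reject loop. The target p(x) (a two-component Gaussian mixture), the proposal q(x) (a wide Gaussian), and the envelope constant M are toy choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density p(x): a bimodal mixture of two unit-variance Gaussians.
# Easy to evaluate pointwise, but assume (for illustration) we cannot sample it directly.
def p(x):
    return 0.5 * np.exp(-0.5 * (x - 2) ** 2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x + 2) ** 2) / np.sqrt(2 * np.pi)

# Proposal q(x): a single zero-mean Gaussian with std 3, which we can sample directly.
def q(x):
    return np.exp(-0.5 * (x / 3) ** 2) / (3 * np.sqrt(2 * np.pi))

M = 3.0  # envelope constant, chosen so that p(x) <= M * q(x) everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(0.0, 3.0)          # step 1: draw x ~ q
        y = rng.uniform(0.0, M * q(x))    # step 2: draw y ~ U(0, M*q(x))
        if y < p(x):                      # step 3: accept if y falls under p(x)
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(10_000)
print(draws.mean(), draws.std())          # roughly 0 and about 2.2 for this mixture
```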

In the context of generative models, the rejection-sampling fine-tuning discussed here usually means the following: starting from an already fine-tuned model (either SFT-tuned or further tuned with PPO, etc.), we draw K samples per prompt; a reject/accept function then filters the generated samples to keep those that match our target distribution, and the model is fine-tuned on the filtered data.
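Translated into that setting, one pass of rejection-sampling data collection looks roughly like the sketch below. This is only a schematic: generate, accept, and the downstream trainer call are placeholder names for whatever sampling routine, reward threshold or heuristic, and fine-tuning code a given project actually uses.

```python
from typing import Callable, List, Tuple

def collect_rejection_samples(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # placeholder: draw K responses from the current model
    accept: Callable[[str, str], bool],         # placeholder: reward threshold, heuristic rule, or verifier
    k: int = 16,
) -> List[Tuple[str, str]]:
    """Keep only the (prompt, response) pairs that pass the accept function."""
    dataset = []
    for prompt in prompts:
        for response in generate(prompt, k):    # K samples per prompt
            if accept(prompt, response):        # the rejection/acceptance step
                dataset.append((prompt, response))
    return dataset

# The surviving pairs are then fed to an ordinary SFT step, e.g. trainer.train(dataset)
# (trainer is likewise a placeholder for whatever training loop is in use).
```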

Related Work

Rejection sampling is a simple yet effective fine-tuning augmentation technique, and it is also used for aligning LLMs with human preferences.

WebGPT: Browser-assisted question-answering with human feedback

Rejection sampling (best-of-n). We sampled a fixed number of answers (4, 16 or 64) from either the BC model or the RL model (if left unspecified, we used the BC model), and selected the one that was ranked highest by the reward model. We used this as an alternative method of optimizing against the reward model, which requires no additional training, but instead uses more inference-time compute.


Even though both rejection sampling and RL optimize against the same reward model, there are several possible reasons why rejection sampling outperforms RL:

  • 1. It may help to have many answering attempts, simply to make use of more inference-time compute.
  • 2. The environment is unpredictable: with rejection sampling, the model can try visiting many more websites, and then evaluate the information it finds with the benefit of hindsight.
  • 3. The reward model was trained primarily on data collected from BC and rejection sampling policies, which may have made it more robust to over optimization by rejection sampling than by RL.

In short, WebGPT uses rejection sampling only at inference time; it does not fine-tune on rejection-sampled data. The authors compare RL against rejection sampling, find that rejection sampling performs better, and offer several explanations. The most convincing one is that, unlike the RL algorithm, rejection sampling requires no hyperparameter tuning and is therefore more robust.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Rejection Sampling (RS) with a 52B preference model, where samples were generated from a 52B context-distilled LM. In this case the number k of samples was a parameter, but most often we used k = 16.

We also test our online models' performance during training (Figure 15), and compare various levels of rejection sampling.


In Figure 36 we show helpfulness Elo scores for a 52B context distilled model with rejection sampling (utilizing a 52B preference model trained on pure helpfulness) for k = 1, 4, 16, 64, showing that higher values of k clearly perform better. Note that the context distilled model and the preference models discussed here were trained during an earlier stage of our research with different datasets and settings from those discussed elsewhere in the paper, so they are not directly comparable with other Elo results, though very roughly and heuristically, our online models seem to perform about as well or better than k = 64 rejection sampling. Note that k = 64 rejection sampling corresponds to DKL = log(64) ≈ 4.2.
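The quoted D_KL value follows from the standard estimate that best-of-k re-ranking moves the policy at most log k nats away from the base model; for k = 64:

```latex
D_{\mathrm{KL}}\!\left(\pi_{\text{best-of-}k}\,\big\|\,\pi_{\text{base}}\right) \le \log k,
\qquad \log 64 = 6\log 2 \approx 4.16 \approx 4.2 .
```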


To summarize: this work also uses rejection sampling only at inference time; the larger the number of samples k, the better the results; and the online RLHF models appear to perform about as well as, or better than, best-of-64 rejection sampling.

Aligning Large Language Models through Synthetic Feedback

An important additional component is that we leverage the synthetic RM from the previous stage to ensure the quality of the model-to-model conversations with rejection sampling over the generated outputs (Ouyang et al., 2022). We train LLaMA-7B on the synthetic demonstrations (SFT) and further optimize the model with rewards from the synthetic RM, namely, Reinforcement Learning from Synthetic Feedback (RLSF).

To ensure a more aligned response from the assistant, we suggest including the synthetic RM, trained in the first stage, in the loop, namely Reward-Model-guided Self-Play (RMSP). In this setup, the assistant model, LLaMA-30B-Faithful-3shot, first samples N responses for a given conversational context. Then, the RM scores the N responses, and the best-scored response is chosen as the final response for the simulation, i.e., the RM performs rejection sampling (best-of-N sampling) (Nakano et al., 2021; Ouyang et al., 2022). Other procedures are the same as the Self-Play. Please see Figure 8 for the examples.


Unlike the previous two papers, the data obtained via rejection sampling is used for fine-tuning here. ICL is used to have models of different capability levels generate responses to prompts; under the assumption that larger models answer better than smaller ones, this yields preference data on which an RM is trained. Rejection sampling is then performed with this RM to pick the highest-scoring response for each prompt, and the resulting training set is used for SFT.

Llama 2: Open Foundation and Fine-Tuned Chat Models


This process begins with the pretraining of Llama 2 using publicly available online sources. Following this, we create an initial version of Llama 2-Chat through the application of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.


Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.

The two RL algorithms mainly differ in:

  • Breadth — in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done for PPO.
  • Depth — in PPO, during training at step t the sample is a function of the updated model policy from t − 1 after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental differences between the two RL algorithms are less pronounced.

To summarize, the RLHF baselines used are PPO and Rejection Sampling (RS) fine-tuning (similar to best-of-N sampling). PPO is the most popular on-policy RL algorithm (essentially trial-and-error learning). The key point is the passage quoted above: "Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward."

This shows that Llama 2 runs SFT on the samples selected by RM-based rejection sampling, performing gradient updates on the policy model; in addition, the rejection-sampled outputs are treated as gold data for retraining the RM from an earlier checkpoint, strengthening its reward signal. In the author's view, rejection-sampling fine-tuning here therefore iterates on both the SFT model and the RM.
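A rough sketch of one such iteration is given below; model.generate, reward_model.score, and model.finetune are assumed placeholder interfaces for illustration, not Llama 2's actual code.

```python
def rejection_sampling_finetune_round(model, reward_model, prompts, k=16):
    """One schematic round of rejection-sampling fine-tuning in the Llama 2 style.

    model.generate, reward_model.score, and model.finetune are hypothetical
    interfaces used only to illustrate the procedure.
    """
    new_gold = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(k)]    # breadth: K samples per prompt
        scores = [reward_model.score(prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]               # highest-reward sample = new gold standard
        new_gold.append((prompt, best))
    model.finetune(new_gold)   # SFT-style update on the re-ranked samples
    # In the iterative setup, the reward model is also refreshed on newly collected
    # preference data before the next round, so it stays in distribution.
    return model
```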

SCALING RELATIONSHIP ON LEARNING MATHEMATICAL REASONING WITH LARGE LANGUAGE MODELS

To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% and outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.


The figure shows that the RFT model clearly outperforms the SFT model on GSM8K.

In short, to augment training data and improve model performance without any human effort, the paper proposes Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as an augmented fine-tuning dataset. With augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning more, and it brings larger gains for less capable LLMs. Combining rejection samples from multiple models pushes LLaMA-7B to 49.3% accuracy, significantly outperforming the 35.9% SFT accuracy. Notably, unlike the works above that use an RM to perform rejection sampling and select the best response, here the model's response is compared directly against the ground-truth answer, and only results with correct reasoning are kept.
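In code, the accept function of RFT is therefore an answer check against the ground truth rather than a reward model, plus deduplication so that only distinct correct reasoning paths are kept. The sketch below is a simplification: sample_fn is a placeholder for sampling one chain of thought from the supervised model, the final answer is extracted with a crude "last number" heuristic, and whitespace normalization stands in for the paper's more careful deduplication of distinct reasoning paths.

```python
import re
from typing import Callable, List, Tuple

def extract_final_answer(path: str) -> str:
    """Crude heuristic: take the last number in the generated text as the final answer."""
    numbers = re.findall(r"-?\d+\.?\d*", path)
    return numbers[-1] if numbers else ""

def rft_collect(sample_fn: Callable[[str], str],
                problems: List[Tuple[str, str]],
                k: int = 100) -> List[Tuple[str, str]]:
    """Collect distinct correct reasoning paths as an augmented fine-tuning set (schematic)."""
    augmented = []
    for question, gold_answer in problems:
        seen = set()
        for _ in range(k):
            path = sample_fn(question)                      # one sampled reasoning path
            if extract_final_answer(path) != str(gold_answer):
                continue                                    # reject paths with a wrong final answer
            key = " ".join(path.split())                    # stand-in for distinct-path deduplication
            if key not in seen:
                seen.add(key)
                augmented.append((question, path))
    return augmented
```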

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment of generative models, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models more effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently assembles a streaming dataset. This dataset serves as the basis for aligning the generative model and can be employed under both offline and online settings. Notably, the sample generation process within RAFT is gradient-free, rendering it compatible with black-box generators. Through extensive experiments, we demonstrate that our proposed algorithm exhibits strong performance in the context of both large language models and diffusion models.


Summary and Thoughts

Rejection sampling passes the output distribution of the SFT model through a reject/accept function (which can be a reward model or a heuristic rule), yielding a distribution of high-quality answers and improving the final results; in general, the larger the number of samples K, the better. Within the RLHF framework, rejection-sampling fine-tuning serves two purposes. First, it can improve the SFT model itself; since PPO needs the new policy to stay close to the old one, a stronger SFT model to initialize PPO also matters for PPO itself. Second, the rejection-sampled data can be used to iterate on the old reward model and strengthen its reward signal, which is likewise important for PPO's final performance and iteration. Finally, for chain-of-thought ability, rejection sampling provides more reasoning paths for the model to learn from, which is also very important.




Original title: LLM大模型训练Trick系列之拒绝采样

Source: WeChat official account 深度学习自然语言处理 (WeChat ID: zenRRan). Please credit the source when reposting.
