樸素貝葉斯（ NB ）是一種簡單但功能強大的概率分類技術，具有良好的并行性，可以擴展到大規模數據集。

如果您一直從事數據科學中的文本處理任務，您就會知道機器學習模型可能需要很長時間來訓練。在這些模型上使用 GPU 加速計算通常可以顯著提高時間性能， NB 分類器也不例外。

通過使用 CUDA 加速操作，根據使用的 NB 模型，我們實現了從 5 到 20 倍的性能提升。對稀疏數據的智能利用使其中一個模型的速度提高了 120 倍。

在本文中，我們介紹了 RAPIDS cuML 中 NB 實現的最新升級，并將其與 Scikit-learn 在 CPU 上的實現進行了比較。我們提供基準測試來演示性能優勢，并通過算法的每個支持變量的簡單示例來幫助您確定哪個最適合您的用例。

什么是樸素貝葉斯？

NB 使用 Bayes’ theorem （圖 1 ）對如下所示的條件概率分布進行建模，以預測給定一些輸入特征（ x ）的標簽或類別（ y ）。在其最簡單的形式中，貝葉斯定理使用特征和可能標簽之間的聯合概率以及特征在所有可能標簽上出現的邊際概率來計算條件概率。

圖 1.貝葉斯定理表示由一組特征（ x ）產生的標簽（ y ）的概率作為條件概率。它是使用每個標簽與特征集發生的聯合概率和特征在所有可能標簽上發生的邊際概率來計算的

NB 算法在文本分類用例中表現良好。它們通常用于過濾垃圾郵件等任務；預測推特、網頁、博客帖子、用戶評分和論壇帖子的類別和情感；或對文檔和網頁進行排名。

NB 算法通過使每個特征（例如，輸入向量 x 中的每列）在統計上獨立于所有其他特征來簡化條件概率分布 naive assumption 。這使得該算法很棒，因為這種天真的假設提高了算法的并行化能力。此外，計算特征和類標簽之間簡單共生概率的一般方法使模型能夠進行增量訓練，支持不適合內存的數據集。

NB 有幾種變體，它們對各種類別標簽的聯合分布或共同出現的特征進行了某些假設。

樸素貝葉斯假設

為了預測未知輸入特征集的類別，關于聯合分布的不同假設使算法具有幾種不同的變體，該算法通過學習不同概率分布的參數來建模特征分布。

表 1 模擬了一個簡單的文檔/術語矩陣，該矩陣可以來自文本文檔集合。列中的術語代表一個詞匯表。一個簡單的詞匯表可能會將一個文檔分解為一組在所有文檔中出現的唯一單詞。

表 1 。包含沿行文檔和沿列出現在每個文檔中的詞匯的文檔/術語矩陣

在表 1 中，每個元素可以是一個計數，如此處所示， 0 或 1 表示特征的存在，或其他一些值，如在整個文檔集上出現的每個項的比率、擴散或離散度。

在實踐中，滑動窗口通常在整個文檔或術語上運行，將它們進一步劃分為小塊的單詞序列，稱為 n-grams 。對于下圖的第一個文檔， 2-gram （或 bigram ）將是“ I love ”和“ love dogs ”。這類數據集中的詞匯通常會顯著增大并變得稀疏。預處理步驟通常在詞匯表上執行，以過濾噪聲，例如，通過刪除大多數文檔中出現的常見術語。

將文檔轉換為文檔項矩陣的過程稱為矢量化。有一些工具可以加速這個過程，例如 CountVectorizer cuML 中的 CountVectorizer 、 TdfidfVectorizer 或 RAPIDS 估計器對象。

多項式和伯努利分布

表 1 表示一組文檔，這些文檔已矢量化為術語計數，結果矩陣中的每個元素表示特定單詞在其相應文檔中出現的次數。這種簡單的表示方法可以有效地用于分類任務。

由于特征代表頻率分布，多項式樸素貝葉斯變體可以有效地將特征及其相關類別的聯合分布建模為多項式分布。

可以通過合并色散度量來增強每個項的頻率分布，例如項頻率逆文檔頻率（ TF-IDF ），它考慮了每個項中出現的文檔數量。這可以通過對出現在較少文檔中的術語賦予更多權重來顯著提高性能，從而提高其識別能力。

雖然多項式分布在直接與項頻率一起使用時效果很好，但它也被證明在分數值上有很好的性能，如 TF-IDF 值。多項式樸素貝葉斯變體涵蓋了大量用例，因此往往是使用最廣泛的。類似的變體是伯努利樸素貝葉斯，它模擬每個項的簡單出現，而不是它們的頻率，從而得到 0 和 1 的矩陣（伯努利分布）。

不等階級分布

在現實世界中，經常會發現不平衡的數據集。例如，您可能有有限的垃圾郵件和惡意活動的數據樣本，但有豐富的正常和良性樣本。

補碼樸素貝葉斯變體通過在訓練期間為每個類使用聯合分布的補碼，例如，在所有其他類的樣本中出現特征的次數，有助于減少不平等類分布的影響。

分類分布

你也可以為你的每一個特征創建存儲箱，可能通過將一些頻率量化到多個存儲桶中，使得 0-5 的頻率進入存儲桶 0 ， 6-10 的頻率進入存儲桶 1 ，等等。

另一種選擇是將幾個術語合并到一個功能中，可能是為“動物”和“假日”創建桶，其中“動物”可能有三個桶，零個用于貓科動物，一個用于犬科動物，兩個用于嚙齒動物。“假日”可能有兩個桶，零用于個人假日，如生日或結婚紀念日，一個用于聯邦假日。

分類樸素貝葉斯變體假設特征遵循分類分布。樸素假設在這種情況下效果很好，因為它允許每個特征都有一組不同的類別，并且它使用（您猜對了）分類分布對聯合分布進行建模。

連續分布

最后，當特征是連續的時，高斯樸素貝葉斯變體非常有效，可以假設每個類別中的特征分布可以用高斯分布建模，即用簡單的均值和方差。

雖然這種變體在 TF-IDF 歸一化后可能在某些數據集上表現出良好的性能，但它在一般機器學習數據集上也很有用。

表 2.不同 NB 算法的比較 s

真實世界的端到端示例

如表 2 所示，為了證明每種算法變體的優點，我們逐步瀏覽了每種算法變體的示例筆記本。有關包含所有示例的全面端到端筆記本，請參閱 news_aggregator_a100.ipynb 。

我們使用新聞聚合器數據集來演示 NB 變體的性能。該數據集可從 Kaggle 公開獲取，由來自多個新聞來源的 422K 條新聞標題組成。每個標題都標有四個可能的標簽之一：商業、科技、娛樂和健康。使用 cuDF RAPIDS 將數據直接加載到 GPU 上，并繼續執行針對每個 NB 變體的預處理步驟。

高斯樸素貝葉斯

從高斯樸素貝葉斯，開始，我們運行 TD-IDF 矢量器將文本數據轉換為可用于訓練的實值向量。

通過指定ngram_range=（1，3），我們表示我們將學習單字以及 2-3-gram 。這顯著增加了要學習的術語或功能的數量，從 15K 個單詞增加到 180 萬個組合。由于大多數術語不會出現在大多數標題中，因此生成的矩陣稀疏，許多值等于零。 cuML 支持特殊結構來表示這樣的數據。

Gaussian NB

Transform the text through a TF-IDF vectorizer and iterate through the dataset to do multiple partial fits of Gaussian naive Bayes.

In[6]:

						vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3), min_df=5)
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)

def dataset_traversal(X, Y, partial_function):
    chunk_size = 12000
    classes = cp.unique(Y)
    lower = 0
    for upper in iter(range(chunk_size, X.shape[0], chunk_size)):
        partial_function(X[lower:upper], Y[lower:upper], classes)
        lower = upper
    partial_function(X[upper:], Y[upper:], classes)

mnb = GaussianNB()
%time dataset_traversal(x_train,\
                        y_train,\
                        lambda x,y, c: mnb.partial_fit(x, y, c))

%time dataset_traversal(x_test,\
                        y_test,\
                        lambda x, y, c: print(mnb.score(x, y)))

					

CPU times: user 12.3 s, sys: 2.23 s, total: 14.5 s
Wall time: 22 s
0.8769999742507935
0.8840000033378601
0.878083348274231
0.8805833458900452
0.8756666779518127
0.8796666860580444
0.8786666393280029
0.8777499794960022
0.8823529481887817
CPU times: user 4.36 s, sys: 2.74 s, total: 7.1 s
Wall time: 22.8 s

In[7]:

						vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3), min_df=5)
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)
x_train_np, x_test_np = x_train.get(), x_test.get()
y_train_np, y_test_np = y_train.to_numpy(), y_test.to_numpy()

def dataset_traversal(X, Y, partial_function):
    chunk_size = 5000
    classes = np.unique(Y)
    lower = 0
    for upper in iter(range(chunk_size, X.shape[0], chunk_size)):
        partial_function(X[lower:upper], Y[lower:upper], classes)
        lower = upper
    partial_function(X[upper:], Y[upper:], classes)

mnb = GaussianNB_sk()
%time dataset_traversal(x_train_np,\
                        y_train_np,\
                        lambda x, y, c: mnb.partial_fit(x.toarray(), y, c))

%time dataset_traversal(x_test_np,\
                        y_test_np,\
                        lambda x, y, c: print(mnb.score(x.toarray(), y)))

					

CPU times: user 2min 47s, sys: 1min 29s, total: 4min 17s
Wall time: 4min 17s
0.885
0.8736
0.8802
0.8828
0.8836
0.8738
0.8806
0.881
0.8832
0.8784
0.8714
0.879
0.8754
0.8782
0.8816
0.8844
0.875
0.8764
0.877
0.8864
0.8796
0.8842975206611571
CPU times: user 3min 8s, sys: 2min 7s, total: 5min 16s
Wall time: 5min 16s

NB 分類器的另一個優點是，可以使用partial_fit方法對Estimator對象進行增量訓練。這種技術適用于可能無法一次性放入內存或必須分布在多個 GPU 中的大規模數據集。

我們的第一個示例演示了使用高斯樸素貝葉斯的增量訓練，方法是在使用 TF-IDF 預處理為連續特征后，將數據分割成多個塊。高斯樸素貝葉斯的 cuML 版本在訓練方面比 Scikit 學習快 21 倍，在推理方面快 72 倍。

伯努利樸素貝葉斯

下一個示例演示了伯努利樸素貝葉斯，無需增量訓練，使用表示每個項存在或不存在的二進制特征。CountVectorizer對象可以通過設置binary=True來實現這一點。在本例中，我們發現比 Scikit learn 快 14 倍。

Bernoulli + CountVectorizer

In the Bernoulli variant, the feature vector is binarized. That's why using a CountVectorizer transformer is useful: You're more interested in the presence of the word rather than it's frequency.

In[241]:

						vec = CountVectorizer(stop_words='english', binary=True, ngram_range=(1,3))

x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)

bnb = BernoulliNB()
%time bnb.fit(x_train, y_train)
%time bnb.score(x_test, y_test)

					

CPU times: user 44.4 ms, sys: 12.1 ms, total: 56.5 ms
Wall time: 56.5 ms
CPU times: user 14.9 ms, sys: 19.6 ms, total: 34.5 ms
Wall time: 34.2 ms

Out[241]:

0.8568723201751709

In[247]:

						vec = CountVectorizer(stop_words='english', binary=True, ngram_range=(1,3))
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)
x_train_np, x_test_np = x_train.get(), x_test.get()
y_train_np, y_test_np = y_train.to_numpy(), y_test.to_numpy()

bnb = BernoulliNB_sk()
%time bnb.fit(x_train_np, y_train_np)
%time bnb.score(x_test_np, y_test_np)

					

CPU times: user 293 ms, sys: 72.1 ms, total: 365 ms
Wall time: 365 ms
CPU times: user 141 ms, sys: 90.9 ms, total: 232 ms
Wall time: 231 ms

Out[247]:

0.8568817764310402

多項式樸素貝葉斯

多項式樸素貝葉斯是最通用和最廣泛使用的變體，如以下示例所示。我們使用 TF-IDF 矢量器而不是CountVectorizer來實現比 Scikit learn 快 5 倍的速度。

TF-IDF + Multinomial

Transform the text through a TF-IDF vectorizer, and run a multinomial naive Bayes model.

In[10]:

						vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)

mnb = MultinomialNB()
%time mnb.fit(x_train, y_train)
%time mnb.score(x_test, y_test)

					

CPU times: user 55.4 ms, sys: 7.57 ms, total: 63 ms
Wall time: 63 ms
CPU times: user 20.3 ms, sys: 8.16 ms, total: 28.4 ms
Wall time: 28.2 ms

Out[10]:

0.9248046875

In[11]:

						vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)
x_train_np, x_test_np = x_train.get(), x_test.get()
y_train_np, y_test_np = y_train.to_numpy(), y_test.to_numpy()

mnb = MultinomialNB_sk()
%time mnb.fit(x_train_np, y_train_np)
%time mnb.score(x_test_np, y_test_np)

					

CPU times: user 264 ms, sys: 67.6 ms, total: 332 ms
Wall time: 332 ms
CPU times: user 31.8 ms, sys: 27.9 ms, total: 59.7 ms
Wall time: 59.4 ms

Out[11]:

0.9248046967473131

補碼樸素貝葉斯

我們使用CountVectorizer證明了補碼樸素貝葉斯的威力，并表明在我們的不平衡數據集上，它比伯努利和多項式 NB 變體產生了更好的分類分數。

CountVectorizer + Complement

Complement naive Bayes models should be coupled with a CountVectorizer to have the best results.

In[19]:

						# First let's visualize the class imbalance

dataset['CATEGORY'].value_counts().to_pandas().plot(kind='bar', title='histogram of the class distributions')
dataset['CATEGORY'].value_counts() / len(dataset)

Out[19]:

3    0.360943
1    0.274531
2    0.256485
4    0.108042
Name: CATEGORY, dtype: float64

In[252]:

						vec = CountVectorizer(stop_words='english', ngram_range=(1,3))
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)

cnb = ComplementNB()
%time cnb.fit(x_train, y_train)
%time cnb.score(x_test, y_test)

					

CPU times: user 56.8 ms, sys: 12.3 ms, total: 69.1 ms
Wall time: 69.2 ms
CPU times: user 22.4 ms, sys: 7.27 ms, total: 29.7 ms
Wall time: 29.8 ms

Out[252]:

0.9502959251403809

In[253]:

						vec = CountVectorizer(stop_words='english', ngram_range=(1,3))
x_train = vec.fit_transform(X_train_text)
x_test = vec.transform(X_test_text)
x_train_np, x_test_np = x_train.get(), x_test.get()
y_train_np, y_test_np = y_train.to_numpy(), y_test.to_numpy()

cnb = ComplementNB_sk()
%time mnb.fit(x_train_np, y_train_np)
%time mnb.score(x_test_np, y_test_np)

					

CPU times: user 67.5 ms, sys: 31.8 ms, total: 99.3 ms
Wall time: 99.5 ms
CPU times: user 26.6 ms, sys: 11.4 ms, total: 38 ms
Wall time: 37.7 ms

Out[253]:

0.9449836611747742

范疇樸素貝葉斯

最后但絕對不是最不重要的是一個分類樸素貝葉斯的例子，我們使用 k-means 和之前在另一個 NB 變體上訓練的模型對其進行矢量化，以根據相似項對結果類的貢獻將其分組到相同的類別中。

我們發現，與 Scikit 相比，使用 315K 條新聞標題訓練模型的速度提高了 126 倍，使用 23 倍的速度進行推理和計算模型的準確性。

Categorical

To transform the text to categorical data, you can apply a clustering technique to merge the terms that are similar.

To create these clusters, you could reuse a previously fitted naive Bayes model but just for the purpose of clustering those words.

Preprocessing

In[14]:

						# First fit a TfIdf on the train dataset
tfidfvec = TfidfVectorizer(stop_words='english', min_df=10)
x_train = tfidfvec.fit_transform(X_train_text)

# Fit a Multinomial on the TdIdf data
mnb = MultinomialNB().fit(x_train, y_train)

# Use a KMeans algorithm to cluster on what the Multinomial NB learned of the TfIdf.
# This means that the words that contribute similarly to a category will be clustered together
km = KMeans(n_clusters=1000, random_state=1)
feature_to_cluster = km.fit_predict(mnb.feature_log_prob_.T)
feats2cluster = OneHotEncoder().fit_transform(feature_to_cluster)

# Print statistics on the repartition of the words in the clusters
print(cudf.Series(feats2cluster.sum(0)[0]).describe())

					

count    1000.000000
mean       14.967000
std        19.519501
min         1.000000
25%         6.000000
50%        11.000000
75%        18.000000
max       254.000000
dtype: float64

Here each cluster holds in average around 15 words

In[225]:

						# Lets plot the repartition of the words in each cluster
# And print the words in a few clusters
plt.hist(feature_to_cluster.get(), bins='auto')
print(tfidfvec.vocabulary_[cp.where(feature_to_cluster == 127)[0]])
print("\n")
print(tfidfvec.vocabulary_[cp.where(feature_to_cluster == 632)[0]])

					

47               117
1597            beam
2114       broadband
2435        carriers
2618         charter
3788       defective
4056            dire
4406            dual
5367           fixes
8365       materials
9072      networking
10900    recognition
11466        rollout
13666        tracker
14088      unveiling
Name: token, dtype: object


3293           core
3603          cyber
4751     enterprise
5719         gaming
8738         models
9801             pc
10074      platform
14338       virtual
Name: token, dtype: object

In[226]:

						# For Categorical Naive Bayes, the count of words is transformed into a count of cluster
vocab = tfidfvec.vocabulary_
countvec = CountVectorizer(stop_words='english')
countvec.vocabulary_ = vocab

x_train = countvec.transform(X_train_text)
x_test = countvec.transform(X_test_text)
print(x_train.shape)
print(feats2cluster.shape)

x_train_cluster = (x_train @ feats2cluster)
x_test_cluster = (x_test @ feats2cluster)

# For each cluster we will have:
# - 0: absence of those wprds.
# - 1: presence of those words
# - 2: multiple presence of those words (2+)

x_train_cluster.data[x_train_cluster.data > 2] = 2
x_test_cluster.data[x_test_cluster.data > 2] = 2

					

(316814, 14967)
(14967, 1000)

Little hack to make sure that if a cluster's max number is 1 in training, it is also 1 in testing

In[227]:

						max_one = cp.where(x_train_cluster.max(0).todense() == 1)[1]
for cluster in max_one:
    samples = (x_test_cluster[:, cluster] > 1)
    if samples.nnz == 0:
        continue
    samples = cp.where(samples.todense())[0]
    x_test_cluster[samples, cluster] = 1

					

Categorical model training

Now that the preprocessing is done we can train the Categorical model and see how it performs on these clusters

In[239]:

						%time cnb = CategoricalNB().fit(x_train_cluster, y_train)
%time cnb.score(x_test_cluster, y_test)

/dev/shm/rapids22.04_env/lib/python3.8/site-packages/cuml/naive_bayes/naive_bayes.py:1498: UserWarning: X dtype is not int32. X will be converted, which will increase memory consumption
  warnings.warn("X dtype is not int32. X will be "
/dev/shm/rapids22.04_env/lib/python3.8/site-packages/cupyx/scipy/sparse/compressed.py:545: UserWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  warnings.warn('Changing the sparsity structure of a '
/dev/shm/rapids22.04_env/lib/python3.8/site-packages/cuml/naive_bayes/naive_bayes.py:1516: UserWarning: X dtype is not int32. X will be converted, which will increase memory consumption
  warnings.warn("X dtype is not int32. X will be "

CPU times: user 110 ms, sys: 4.87 ms, total: 115 ms
Wall time: 112 ms
CPU times: user 64.7 ms, sys: 127 ms, total: 191 ms
Wall time: 193 ms

Out[239]:

0.9256380200386047

In[240]:

						x_train_cluster_np = x_train_cluster.get().todense()
x_test_cluster_np = x_test_cluster.get().todense()
y_train_np, y_test_np = y_train.to_numpy(), y_test.to_numpy()

%time cnb = CategoricalNB_sk().fit(x_train_cluster_np, y_train_np)
%time cnb.score(x_test_cluster_np, y_test_np)

					

/dev/shm/rapids22.04_env/lib/python3.8/site-packages/sklearn/utils/validation.py:593: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
  warnings.warn(

CPU times: user 13.7 s, sys: 434 ms, total: 14.2 s
Wall time: 14.2 s

/dev/shm/rapids22.04_env/lib/python3.8/site-packages/sklearn/utils/validation.py:593: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
  warnings.warn(

CPU times: user 4.4 s, sys: 110 ms, total: 4.51 s
Wall time: 4.51 s

Out[240]:

0.9256379906254438

基準

圖 2 中的圖表比較了 RAPIDS cuML 和 Scikit learn 之間的 NB 訓練和推理的性能，以及本文中概述的所有變體。

基準測試是在a2-highgpu-8g谷歌云平臺（ GCP ）實例上執行的，該實例配備了 NVIDIA Tesla A100 GPU 和 96 Intel Cascade Lake v CPU ，頻率為 2.2Ghz 。

圖 2.Scikit learn （藍色）和 cuML （綠色）之間的性能比較

GPU 加速樸素貝葉斯

我們能夠使用 CuPy 在 Python 中實現所有 NB 變體，這是一種 GPU 加速，幾乎可以替代 NumPy 和 SciPy 。 CuPy 還提供了用 Python 編寫自定義 CUDA 內核的功能。當 Python 應用程序運行時，它使用 NVRTC 的即時（ JIT ）編譯功能在 GPU 上編譯和執行它們。

所有 NB 變體的核心是兩個使用 CuPy 的 JIT 編寫的簡單原語，用于匯總和計算每個類的特征。

當單個文檔項矩陣過大而無法在單個 GPU 上處理時， Dask 庫可以利用增量訓練功能將處理擴展到多個 GPU 和多個節點。目前，多項式變量可以在 cuML 中與 Dask 一起分布。

結論

NB 算法應該在每個數據科學家的工具包中。使用 RAPIDS cuML ，您可以在 GPU 上加速 NB 的實現，而無需大幅更改代碼。這些強大而基本的算法，再加上 cuML 的加速，提供了您必須在超大或稀疏數據集上執行分類的一切。

關于作者

Mickael Ide 是 NVIDIA RAPIDS 團隊的機器學習工程師，專注于開發 GPU 加速的機器學習算法。米克爾擁有計算機科學碩士學位。

Corey Nolet 是 NVIDIA 的 RAPIDS ML 團隊的數據科學家兼高級工程師，他專注于構建和擴展機器學習算法，以支持光速下的極端數據負載。在 NVIDIA 工作之前， Corey 花了十多年時間為國防工業的 HPC 環境構建大規模探索性數據科學和實時分析平臺。科里持有英國理工學士學位計算機科學碩士學位。他還在攻讀博士學位。在同一學科中，主要研究圖形和機器學習交叉點的算法加速。科里熱衷于利用數據更好地了解世界。

審核編輯：郭婷

聲明：本文內容及配圖由入駐作者撰寫或者入駐合作網站授權轉載。文章觀點僅代表作者本人，不代表電子發燒友網立場。文章及其配圖僅供工程師學習之用，如有內容侵權或者其他違規問題，請聯系本站處理。舉報投訴

cpu

cpu

+關注

關注
68

文章
10863

瀏覽量
211765
NVIDIA

NVIDIA

+關注

關注
14

文章
4986

瀏覽量
103058
gpu

gpu

+關注

關注
28

文章
4740

瀏覽量
128949

機器學習的樸素貝葉斯講解

秦剛剛的機器學習成長之路之樸素貝葉斯法

發表于 05-15 14:41

樸素貝葉斯法的優缺點

樸素貝葉斯法(1) 之基礎概念

發表于 08-05 11:32

樸素貝葉斯法的惡意留言過濾

樸素貝葉斯法(2) 之惡意留言過濾

發表于 08-26 14:40

常用的分類方法：樸素貝葉斯法

統計學習方法樸素貝葉斯法

發表于 11-05 09:24

對樸素貝葉斯算法的理解

我對樸素貝葉斯算法的理解

發表于 05-15 14:13

機器學習之樸素貝葉斯應用教程

方法輸出的是某一類的概率，其取值范圍在 0-1 之間，樸素貝葉斯在做文本分類，或者說垃圾郵件識別的時候非常有效。

發表于 11-25 12:49 ?1386次閱讀

機器學習之<b class='flag-5'>樸素</b><b class='flag-5'>貝</b><b class='flag-5'>葉</b><b class='flag-5'>斯</b>應用教程

基于概率的常見的分類方法--樸素貝葉斯

方法輸出的是某一類的概率，其取值范圍在 0-1 之間，樸素貝葉斯在做文本分類，或者說垃圾郵件識別的時候非常有效。

發表于 02-03 14:37 ?5237次閱讀

基于概率的常見的<b class='flag-5'>分類</b>方法--<b class='flag-5'>樸素</b><b class='flag-5'>貝</b><b class='flag-5'>葉</b><b class='flag-5'>斯</b>

樸素貝葉斯NB經典案例

貝葉斯分類算法是統計學的一種分類方法，其分類原理就是利用貝

發表于 02-28 10:17 ?2次下載

一種新型樸素貝葉斯文本分類算法

針對在文本分類中先驗概率的計算比較費時而且對分類效果影響不大、后驗概率的精度損失影響分類準確率的現象，對經典樸素貝

發表于 03-05 11:19 ?0次下載

機器學習之樸素貝葉斯

學習過概率的人一定知道貝葉斯定理，在信息領域內有著無與倫比的地位。貝葉斯算法是基于貝葉斯定理的一類算法，主要用來解決分類和回歸問題。人工智能之機器學習中最為廣泛的兩種

發表于 05-29 09:01 ?894次閱讀

樸素貝葉斯算法詳細總結

樸素貝葉斯法是基于貝葉斯定理與特征條件獨立假設的分類方法，是經典的機器學習算法之一，處理很多問題時直接又高效，因此在很多領域有著廣泛的應用，

發表于 07-01 08:37 ?3.5w次閱讀

如何使用Spark計算框架進行分布式文本分類方法的研究

，分別在單機、Map Reduce和Spark三種不同的計算框架下測試了文本分類的效率，并使用控制變量的方法在Spark計算框架下設計對照實驗。實驗證明，Spark計算框架下的樸素貝葉

發表于 12-18 14:19 ?3次下載

一種特征假期樸素貝葉斯文本分類算法

樸素貝葉斯（NB）算法應用于文本分類時具有簡單性和高效性，但算法中屬性獨立性與重要性一致的假設，使其在精確度方面存在瓶頸。針對該問題，提出一

發表于 05-28 11:30 ?4次下載

樸素貝葉斯分類樸素貝葉斯算法的優點

。雖然這個簡化方式在一定程度上降低了貝葉斯分類算法的分類效果，但是在實際的應用場景中，極大地簡化了貝

發表于 10-02 17:14 ?9330次閱讀

PyTorch教程22.9之樸素貝葉斯

電子發燒友網站提供《PyTorch教程22.9之樸素貝葉斯.pdf》資料免費下載

發表于 06-06 09:22 ?0次下載

在线观看www成人影院-在线观看www日本免费网站-在线观看www视频-在线观看操-欧美18在线-欧美1级

搜索歷史

使用樸素貝葉斯和GPU進行更快的文本分類

Gaussian NB

Bernoulli + CountVectorizer

TF-IDF + Multinomial

CountVectorizer + Complement

Categorical

Preprocessing

Categorical model training

評論

機器學習的樸素貝葉斯講解

樸素貝葉斯法的優缺點

樸素貝葉斯法的惡意留言過濾

常用的分類方法：樸素貝葉斯法

對樸素貝葉斯算法的理解

機器學習之樸素貝葉斯應用教程

基于概率的常見的分類方法--樸素貝葉斯

樸素貝葉斯NB經典案例

一種新型樸素貝葉斯文本分類算法

機器學習之樸素貝葉斯

樸素貝葉斯算法詳細總結

如何使用Spark計算框架進行分布式文本分類方法的研究

一種特征假期樸素貝葉斯文本分類算法

樸素貝葉斯分類樸素貝葉斯算法的優點

PyTorch教程22.9之樸素貝葉斯