Text Classification Based on Bayesian Methods

Code Overview

  1. Data preprocessing

    • Convert all sentences to lowercase

    • Remove punctuation

  2. Split sentences into word tokens

  3. Remove stop words

  4. Remove digits

  5. Two normalization options: stemming and lemmatization

    Comparison of results:

    [Figure: comparison of stemming vs. lemmatization output]

    The output shows that stemming directly reduces words to a truncated stem, while lemmatization better preserves grammatically valid word forms.

  6. Convert the training texts into vectors with a bag-of-words model (a dictionary with word ids starting from 0); a runnable sketch of steps 1–6 appears right after this outline.

  7. Classify with a naive Bayes model

    Two model variants are used:

    • Multi-variate Bernoulli Naive Bayes

    • Multinomial Naive Bayes (Term Frequency)

7.1 Compute the prior probability:

$$P(c)=\frac{N(c,\text{text})}{N(\text{text})}$$
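In code the prior is a single line. A minimal sketch, assuming `category_sents` is a NumPy array holding the number of training documents per class (the same counts reused in the Bernoulli denominator later):

```python
import numpy as np

# P(c) = (# training documents in class c) / (total # training documents)
p_c = category_sents / np.sum(category_sents)
```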

7.2 Compute the likelihood $P(w_i \mid c)$ with Laplace smoothing

  • Bernoulli:

$$P(d_{i} \mid c)=\frac{N(C_{d_{i}})+1}{N(C)+2}$$

$N(C_{d_{i}})$ is the number of documents of this class that contain word $d_i$, and $N(C)$ is the total number of documents of this class.

  • Term Frequency:

$$P(w_{i} \mid c)=\frac{N(w_{i} \text{ in } W_{c})+1}{N(w_{c})+m}$$

$N(w_{i} \text{ in } W_{c})$ is the number of occurrences of the word in documents of this class (its term frequency), $N(w_{c})$ is the total number of words in this class, and $m$ is the size of the dictionary.
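As a worked example with hypothetical counts: if a word never occurs in class $c$, whose documents contain $N(w_c)=50000$ tokens, and the dictionary has $m=30000$ entries, smoothing yields $P(w_i \mid c)=\frac{0+1}{50000+30000}=1.25\times 10^{-5}$ rather than zero; an unsmoothed zero would annihilate the entire product formed in 7.3.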

7.3 Compute the joint probability

$$P(c \cap w_i)=P(w_i \mid c)\,P(c)$$

  • Bernoulli:

$$P(\text{text} \mid c)=\prod_{i=1}^{m} P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}, \quad d_{i} \in D, \quad b=\begin{cases}1 & \text{if } d_{i} \in \text{text} \\ 0 & \text{else}\end{cases}$$

  • Term Frequency:

$$P(\text{text} \mid c)=\prod_{i=1}^{n} P(w_{i} \mid c), \quad w_{i} \in D$$

7.4 The naive Bayes optimization problem (taking logarithms turns the product into a sum and avoids floating-point underflow from multiplying many small probabilities; a toy worked example follows the two variants below):

$$\operatorname{argmax}_{c \in C} \prod_{i=1}^{n} P(w_{i} \mid c)\, P(c)$$

  • Bernoulli:

$$c_{\text{text}}=\operatorname{argmax}_{c \in C}\left[\ln P(c)+\sum_{i=1}^{m} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}\right]$$

  • Term Frequency:

$$c_{\text{text}}=\operatorname{argmax}_{c \in C}\left[\ln P(c)+\sum_{i=1}^{n} \ln P(w_{i} \mid c)\right]$$
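A toy worked example with made-up numbers for the Term Frequency variant: take two classes with $P(c_1)=P(c_2)=0.5$ and a three-word document with likelihoods $P(w_i \mid c_1)=0.01,\,0.02,\,0.005$ and $P(w_i \mid c_2)=0.002,\,0.03,\,0.01$. The log scores are

$$\ln 0.5+\ln 0.01+\ln 0.02+\ln 0.005 \approx -14.51 \quad\text{vs.}\quad \ln 0.5+\ln 0.002+\ln 0.03+\ln 0.01 \approx -15.02,$$

so the document is assigned to $c_1$.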

  8. Test and compute the accuracy
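The following is a minimal sketch of the preprocessing pipeline (steps 1–6), assuming NLTK; `preprocess` and `train_sentences` are illustrative names, not identifiers from the original source:

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt', 'stopwords' and 'wordnet' data packages
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(sentence, mode='stem'):
    # step 1: lowercase and strip punctuation
    sentence = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    # step 2: split into word tokens
    words = word_tokenize(sentence)
    # steps 3-4: drop stop words and digits
    words = [w for w in words if w not in stop_words and not w.isdigit()]
    # step 5: stemming or lemmatization
    if mode == 'stem':
        return [stemmer.stem(w) for w in words]
    return [lemmatizer.lemmatize(w) for w in words]

# step 6: bag-of-words dictionary with word ids starting from 0
dictionary = {}
for words in (preprocess(s) for s in train_sentences):  # train_sentences: raw training texts
    for w in words:
        dictionary.setdefault(w, len(dictionary))
```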

Key Code

Multi-variate Bernoulli Naive Bayes (Bernoulli)

  1. Compute the likelihood $P(d_{i} \mid c)=\frac{N(C_{d_{i}})+1}{N(C)+2}$

    ```python
    p_stat = (sents_count + 1) / (category_sents + 2)
    ```
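    For context, a sketch of how `sents_count` (per-class document-occurrence counts, shape dictionary size × number of classes) and `category_sents` (documents per class) could be accumulated; `train_x` and `train_y` are assumed names for the tokenized training texts and their labels:

    ```python
    import numpy as np

    sents_count = np.zeros((len(dictionary), len(categories)))  # N(C_{d_i}) per word and class
    category_sents = np.zeros(len(categories))                  # N(C) per class

    for words, label in zip(train_x, train_y):
        c = categories[label]
        category_sents[c] += 1
        for word in set(words):  # set(): each document counts a word at most once
            if word in dictionary:
                sents_count[dictionary[word], c] += 1
    ```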
  2. Test

    1. Compute $\ln P(c)$

      ```python
      p_c = np.log(p_c)
      ```
    2. Compute $\sum_{i=1}^{m} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}$

      Approach:

      1. Record which dictionary words occur in the document, producing a matrix $\mathbf{B}$ with the same shape as `p_stat` (dictionary size × number of classes).

        Words that occur in the document are marked 1; the rest are marked 0.

      2. If the word occurs, $b=1$ and $\sum_{i \in \text{text}} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}=\sum_{i \in \text{text}} \ln P(d_{i} \mid c)$.

        If the word does not occur, $b=0$ and $\sum_{i \notin \text{text}} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}=\sum_{i \notin \text{text}} \ln \left(1-P(d_{i} \mid c)\right)$.

      3. Using matrix operations:

        1. First compute $\ln P(d_{i} \mid c)$ and $\ln \left(1-P(d_{i} \mid c)\right)$ elementwise for every dictionary word.

        2. $\sum_{i \in \text{text}} \ln P(d_{i} \mid c)=\sum_{i=1}^{m} \ln P(d_{i} \mid c) * \mathbf{B}$

          $\sum_{i \notin \text{text}} \ln \left(1-P(d_{i} \mid c)\right)=\sum_{i=1}^{m} \ln \left(1-P(d_{i} \mid c)\right) * (1-\mathbf{B})$

        3. $\sum_{i=1}^{m} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}=\sum_{i=1}^{m}\left[\ln P(d_{i} \mid c) * \mathbf{B}+\ln \left(1-P(d_{i} \mid c)\right) * (1-\mathbf{B})\right]$

      ```python
      # compute p, i.e. the expression inside the argmax
      for i, words in enumerate(data_x):
          # ln P(text|c)P(c) under the Bernoulli model is accumulated in p
          # b marks which dictionary words appear in the document
          b = np.zeros(len(dictionary))
          for word in set(words):
              # skip out-of-vocabulary words
              if word in dictionary:
                  word_id = dictionary[word]
                  b[word_id] = 1
          # expand b to (dictionary size x number of classes)
          b = np.expand_dims(b, 1).repeat(len(categories), 1)
          _b = np.ones(b.shape) - b

          p = np.log(p_stat) * b + np.log(1 - p_stat) * _b
          p = np.sum(p, axis=0)
          p += p_c
      ```
    3. Compute $c_{\text{text}}=\operatorname{argmax}_{c \in C}\left[\ln P(c)+\sum_{i=1}^{m} \ln P(d_{i} \mid c)^{b}\left(1-P(d_{i} \mid c)\right)^{1-b}\right]$ and compare against the ground-truth label.

      ```python
      if np.argmax(p) == categories[data_y[i]]:
          count += 1
      ```

Multinomial Naive Bayes (Term Frequency)

  1. Compute the likelihood $P(w_{i} \mid c)=\frac{N(w_{i} \text{ in } W_{c})+1}{N(w_{c})+m}$

    ```python
    p_stat = (words_frequency + 1) / (category_words + len(dictionary))
    ```
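    Analogously, a sketch of how `words_frequency` (per-class term frequencies) and `category_words` (tokens per class) could be accumulated, again assuming the hypothetical `train_x`/`train_y`:

    ```python
    import numpy as np

    words_frequency = np.zeros((len(dictionary), len(categories)))  # N(w_i in W_c)
    category_words = np.zeros(len(categories))                      # N(w_c)

    for words, label in zip(train_x, train_y):
        c = categories[label]
        for word in words:  # every occurrence counts, unlike the Bernoulli model
            if word in dictionary:
                words_frequency[dictionary[word], c] += 1
                category_words[c] += 1
    ```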
  2. Test

    1. Compute $\ln P(c)$

      ```python
      p_c = np.log(p_c)
      ```
    2. Compute $\sum_{i=1}^{n} \ln P(w_{i} \mid c)$

      ```python
      for i, words in enumerate(data_x):
          p = np.zeros(len(categories))
          # ln P(text|c)P(c) under the TF model is accumulated in p
          for word in words:
              # skip out-of-vocabulary words
              if word in dictionary:
                  word_id = dictionary[word]
                  p += np.log(p_stat[word_id, :])
          p += p_c
      ```
    3. Compute $c_{\text{text}}=\operatorname{argmax}_{c \in C}\left[\ln P(c)+\sum_{i=1}^{n} \ln P(w_{i} \mid c)\right]$ and compare against the ground-truth label.

      ```python
      if np.argmax(p) == categories[data_y[i]]:
          count += 1
      ```
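As an optional cross-check (not part of the lab code), scikit-learn ships both model variants; with the same vocabulary and Laplace smoothing (`alpha=1.0`) it should produce comparable accuracies. Note that `CountVectorizer`'s default tokenizer drops single-character tokens, so small deviations are expected. `train_x`/`train_y`/`data_x`/`data_y` are assumed to hold the tokenized documents and labels as above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# reuse the lab's dictionary (word -> id) as a fixed vocabulary
vectorizer = CountVectorizer(vocabulary=dictionary)
X_train = vectorizer.transform(' '.join(ws) for ws in train_x)
X_test = vectorizer.transform(' '.join(ws) for ws in data_x)

for model in (MultinomialNB(alpha=1.0), BernoulliNB(alpha=1.0)):
    model.fit(X_train, train_y)
    print(type(model).__name__, model.score(X_test, data_y))
```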

Results

| Preprocessing | TF (train) | TF (test) | Bernoulli (train) | Bernoulli (test) |
| --- | --- | --- | --- | --- |
| stemmer | 9479/10208 (92.86%) | 2227/2552 (87.26%) | 9484/10208 (92.91%) | 2227/2552 (87.26%) |
| lemmatizer | 9550/10208 (93.55%) | 2226/2552 (87.23%) | 9551/10208 (93.56%) | 2227/2552 (86.87%) |

These results are consistent with those in the lab manual.

Effect of Omitting Individual Preprocessing Steps

All runs below use the stemmer pipeline; each row omits one preprocessing step.

| Omitted step | TF (train) | TF (test) | Bernoulli (train) | Bernoulli (test) |
| --- | --- | --- | --- | --- |
| Lowercasing | 9480/10208 (92.87%) | 2228/2552 (87.30%) | 9480/10208 (92.87%) | 2231/2552 (87.42%) |
| Punctuation removal | 9463/10208 (92.70%) | 2235/2552 (87.58%) | 9470/10208 (92.77%) | 2229/2552 (87.34%) |
| Stop-word removal | 9451/10208 (92.58%) | 2220/2552 (86.99%) | 9449/10208 (92.56%) | 2216/2552 (86.83%) |
| Digit removal | 9500/10208 (93.06%) | 2224/2552 (87.15%) | 9504/10208 (93.10%) | 2221/2552 (87.03%) |
| Stemming/lemmatization | 9587/10208 (93.92%) | 2233/2552 (87.50%) | 9595/10208 (93.99%) | 2230/2552 (87.38%) |
| All of the above | 9569/10208 (93.74%) | 2227/2552 (87.26%) | 9574/10208 (93.79%) | 2231/2552 (87.42%) |

**Conclusion:** omitting individual preprocessing steps has only a modest effect on the results; accuracy shifts by at most about one percentage point in every case.

Source code: https://github.com/laptype/Machine-Learning-Code/tree/main/Task 3_Bayes