Topic Discovery and Clustering for Online Journals Based on LDA Algorithm
Authors: YANG Chuanchun, ZHANG Bingxue, LI Rende, GUO Qiang

    Abstract:

    As smart devices become widespread, demand for text topic mining keeps growing across many domains. Topic modeling is the core of text topic mining. The latent Dirichlet allocation (LDA) generative model is a probabilistic model built on the Bayesian framework; it relies on semantic association and handles the extraction of latent topics from text well. This paper describes and analyzes in depth the key techniques of the text clustering process, including the LDA generative model, data sampling, and model evaluation. Topic discovery and clustering experiments were carried out on 2,794 learning journals from an online education platform, and a lexicon of 3,800 terms was built. Topic clustering was solved in two steps, using the k-means algorithm and the union vector method (UVM). A general method for text mining experiments is proposed, together with an improvement to the text distance algorithm used in hierarchical clustering. The experimental results show that the overall topic similarity of the platform's journals is fairly high, but the topics are so concentrated that the content of many journals is hard to tell apart, which makes it harder for users to locate topics.
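The abstract names the LDA generative model and data sampling as core techniques but does not give the sampler. As a rough illustration only, the sketch below shows a standard collapsed Gibbs sampler for LDA in plain Python. The function name `lda_gibbs`, the hyperparameter defaults (`alpha`, `beta`, `n_iters`), and the assumption that documents arrive as token lists are all chosen for this example and are not taken from the paper.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    docs is a list of token lists. Returns (theta, vocab, topic_word), where
    theta[d][k] is the estimated share of topic k in document d and
    topic_word[k][v] counts how often vocabulary word v was assigned to topic k.
    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]   # word v in topic k
    topic_total = [0] * n_topics                            # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized full conditional p(z = j | all other assignments).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta)
                    / (topic_total[j] + n_words * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, vocab, topic_word
```

Each row of `theta` is a document's topic vector, which is the kind of representation a clustering step such as k-means could consume. How the paper itself builds its vectors and how UVM merges them is not stated in the abstract.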

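Cite this article

YANG Chuanchun, ZHANG Bingxue, LI Rende, GUO Qiang. Topic Discovery and Clustering for Online Journals Based on LDA Algorithm[J]. Journal of University of Shanghai for Science and Technology, 2019, 41(3): 273-280.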

History
  • Received: 2018-05-30
  • Published online: 2019-08-07