DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset

Bibliographic Details
Main Authors: Xingzhou Liang, Xiaochen Zhou, Hui Zou, Yi Lu, Jingjing Qu
Format: Article
Language: English
Published: Tsinghua University Press 2025-06-01
Series: Journal of Social Computing
Online Access: https://www.sciopen.com/article/10.23919/JSC.2025.0007
Description
Summary: In this paper, we propose and implement a systematic pipeline for automatically classifying AI-related documents extracted from large-scale literature databases. The pipeline produces an AI-related literature dataset named DeepDiveAI. It integrates expert knowledge with the capabilities of advanced models and is structured into two primary stages. In the first stage, expert-curated classification datasets are used to train a Long Short-Term Memory (LSTM) model, which performs coarse-grained classification of AI-related records from large-scale datasets. In the second stage, a large language model, specifically Qwen2.5 Plus, annotates a random 10% sample of the coarsely classified AI-related records. These annotated records are then used to train a binary classifier based on Bidirectional Encoder Representations from Transformers (BERT), which refines the coarse set to produce the final DeepDiveAI dataset. Evaluation results indicate that the proposed pipeline is both accurate and efficient in identifying AI-related literature from large-scale datasets.
ISSN: 2688-5255
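
The two-stage flow described in the summary can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: every function name here is hypothetical, and the toy keyword functions merely stand in for the LSTM coarse classifier, the Qwen2.5 Plus annotator, and the BERT-based binary classifier, none of which the record specifies in detail.

```python
import random
from typing import Callable, List, Sequence, Tuple

Record = str
Classifier = Callable[[Record], bool]


def sample_fraction(records: Sequence[Record], fraction: float, seed: int) -> List[Record]:
    """Draw a reproducible random sample of roughly `fraction` of the records."""
    if not records:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    return rng.sample(list(records), k)


def build_dataset(
    records: Sequence[Record],
    coarse_clf: Classifier,   # stand-in for the stage-1 LSTM model
    annotate: Classifier,     # stand-in for the LLM annotator
    train_fine: Callable[[List[Tuple[Record, bool]]], Classifier],  # stand-in for BERT training
    fraction: float = 0.10,   # the 10% sample mentioned in the summary
    seed: int = 0,            # assumption: a fixed seed for reproducibility
) -> List[Record]:
    # Stage 1: coarse-grained filtering of the full corpus.
    coarse = [r for r in records if coarse_clf(r)]
    # Stage 2: annotate a random sample of the coarse set ...
    labeled = [(r, annotate(r)) for r in sample_fraction(coarse, fraction, seed)]
    # ... train a binary classifier on those labels, and refine the coarse set.
    fine_clf = train_fine(labeled)
    return [r for r in coarse if fine_clf(r)]


# Toy stand-ins (hypothetical), used only so the sketch runs end to end.
def toy_coarse(record: Record) -> bool:
    return "network" in record.lower()


def toy_annotate(record: Record) -> bool:
    return "neural" in record.lower()


def toy_train_fine(labeled: List[Tuple[Record, bool]]) -> Classifier:
    # Keep words seen only in positive examples as cues for the refined classifier.
    pos = {w for text, y in labeled if y for w in text.lower().split()}
    neg = {w for text, y in labeled if not y for w in text.lower().split()}
    cues = pos - neg
    return lambda r: any(w in cues for w in r.lower().split())


if __name__ == "__main__":
    corpus = ["Neural network pruning", "Road network planning", "Protein folding assay"] * 10
    refined = build_dataset(corpus, toy_coarse, toy_annotate, toy_train_fine)
    print(len(refined), "records kept after refinement")
```

The point of the design is that the costly annotator only sees the sampled fraction of the coarse set, while the trained classifier is applied to the whole coarse set.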