A Five-Step Workflow to Manually Annotate Unstructured Data into Training Dataset for Natural Language Processing
Publication Name
Studies in health technology and informatics
Abstract
Natural Language Processing (NLP) is a powerful technique for extracting valuable information from unstructured electronic health records (EHRs). However, NLP depends on the availability of high-quality annotated datasets. To date, there is a lack of effective methods to guide the manual annotation of unstructured datasets, which can hinder NLP performance. Therefore, this study develops a five-step workflow for manually annotating unstructured datasets, comprising (1) annotator training and familiarisation with the text corpus, (2) vocabulary identification, (3) annotation schema development, (4) annotation execution, and (5) result validation. The workflow was then applied to annotate agitation symptoms in the unstructured EHRs of 40 Australian residential aged care facilities. The annotated corpus achieved an accuracy rate of 96%. This suggests that the proposed annotation workflow can be used in manual data processing to develop annotated training corpora for NLP algorithms.
Open Access Status
This publication may be available as open access
Volume
310
First Page
109
Last Page
113