A Five-Step Workflow to Manually Annotate Unstructured Data into Training Dataset for Natural Language Processing

Publication Name

Studies in health technology and informatics

Abstract

Natural Language Processing (NLP) is a powerful technique for extracting valuable information from unstructured electronic health records (EHRs). However, NLP requires high-quality annotated datasets. To date, there is a lack of effective methods to guide the manual annotation of unstructured datasets, which can hinder NLP performance. Therefore, this study develops a five-step workflow for manually annotating unstructured datasets: (1) annotator training and familiarisation with the text corpus, (2) vocabulary identification, (3) annotation schema development, (4) annotation execution, and (5) result validation. This workflow was then applied to annotate agitation symptoms in the unstructured EHRs of 40 Australian residential aged care facilities. The annotated corpus achieved an accuracy rate of 96%. This suggests that the proposed annotation workflow can be used in manual data processing to develop annotated training corpora for NLP algorithms.
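The result-validation step (5) reports an accuracy rate, but the abstract does not say how that rate was computed. As a minimal sketch, assuming accuracy means the share of reviewed records whose annotated label matches a reviewer's label, the calculation could look like the following. The `Annotation` class, the `annotation_accuracy` function, the record IDs, and the label names are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    record_id: str  # hypothetical identifier for one EHR note
    label: str      # hypothetical label, e.g. "agitation" or "none"


def annotation_accuracy(annotated, reference):
    """Return the share of reference records whose annotated label
    matches the reviewer's label.

    Assumption: a record missing from `annotated` counts as incorrect.
    """
    if not reference:
        raise ValueError("reference set is empty")
    annotated_by_id = {a.record_id: a.label for a in annotated}
    correct = sum(
        1 for r in reference if annotated_by_id.get(r.record_id) == r.label
    )
    return correct / len(reference)


# Illustrative data only; these are not records from the study.
annotated = [
    Annotation("note-1", "agitation"),
    Annotation("note-2", "none"),
    Annotation("note-3", "agitation"),
    Annotation("note-4", "none"),
]
reference = [
    Annotation("note-1", "agitation"),
    Annotation("note-2", "none"),
    Annotation("note-3", "none"),
    Annotation("note-4", "none"),
]

accuracy = annotation_accuracy(annotated, reference)
print(f"accuracy = {accuracy:.2f}")
```

Treating missing records as incorrect is a conservative choice; a validation protocol could equally exclude them, which would change the denominator.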

Open Access Status

This publication may be available as open access

Volume

310

First Page

109

Last Page

113

Link to publisher version (DOI)

http://dx.doi.org/10.3233/SHTI230937