
POET 2 / Faircode Project

These are projects that I am currently working on with Dr. John Hurdle. They are designed to investigate the effects of pre-processing techniques in NLP pipelines. There have been three separate projects thus far.


This was work done with Dr. John Hurdle, Jianlin Shi, Sean Igo, and Dr. Charlene Weir, with assistance from Yijun Shoa.

POETenceph is an extension of UtahPOET. Identifying inpatients with encephalopathy is important because the disorder is prevalent, often missed, and puts patients at risk. We describe POETenceph, a natural language processing pipeline that ranks clinical notes by the extent of the evidence indicating that the patient had encephalopathy. We use a realist ontology of the entities and relationships relating clinical notes to the diagnosis of encephalopathy. POETenceph includes a passage rank algorithm, which takes identified disorders; matches them to the ontology; calculates the diffuseness, centrality, and length of the matched entry; adds the scores; and returns the ranked documents. We evaluate it on a corpus of clinical documents classified by the amount of evidence related to delirium found by human annotators. Of the bottom-scoring documents, 65% had little or no evidence; of the top-scoring documents, 70% had good evidence. POETenceph can effectively rank clinical documents for their evidence of encephalopathy as characterized by delirium. Portions of the ontology for acute change in mental status (ACMS) as characterized by delirium are pictured below.

The top level of the ACMS ontology


Ontologic elements within the clinical pictures associated with ACMS
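The final steps of the passage rank algorithm described above (adding the scores and returning the ranked documents) can be illustrated with a minimal sketch. It assumes that each ontology match already carries numeric diffuseness, centrality, and length values; how POETenceph actually computes those values is not described here, so the `Match` class, the function names, and the toy numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One disorder matched to an ontology entry (hypothetical structure)."""
    diffuseness: float
    centrality: float
    length: float


def document_score(matches):
    """Add the three component scores over every match in one document."""
    return sum(m.diffuseness + m.centrality + m.length for m in matches)


def rank_documents(doc_matches):
    """Return (doc_id, score) pairs sorted from highest to lowest score."""
    scored = [(doc_id, document_score(ms)) for doc_id, ms in doc_matches.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy input: the values are invented for illustration only.
doc_matches = {
    "note_a": [Match(0.5, 1.0, 2.0), Match(0.2, 0.3, 1.0)],
    "note_b": [],
    "note_c": [Match(0.1, 0.2, 1.0)],
}
ranking = rank_documents(doc_matches)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

In this sketch a document with no ontology matches scores zero and falls to the bottom of the ranking, which mirrors the idea that notes with little or no evidence should rank lowest.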

This work was supported in part by a grant from the NLM, R01-LM010981. This material is based upon work supported by the Department of Veterans Affairs, Veterans Health Administration, Office of Research and Development, Biomedical Laboratory Research and Development: Veterans Health Administration Health Services Research & Development: # CRE 12-321. 


This work was done with Dr. John Hurdle, Jianlin Shi, and Sean Igo.

The pipeline of the UtahPOET system

The UtahPOET system (pictured above) was developed for SemEval 2015 Task 14. UtahPOET is a cognitively inspired system designed to extract semantic content from general clinical texts. We find that our system performs much better on the context slot-filling aspects of Tasks 2A and 2B than on the disorder CUI mapping of Tasks 1 and 2B or the body location CUI mapping of Task 2B. Our problems with CUI mapping suggest several possible system improvements. An alteration in the correspondence between the system architecture and psycholinguistic findings is also indicated.

Identification of Prose and Nonprose in Clinical Notes

This was work overseen by Dr. John Hurdle in collaboration with Sean Igo.

The ideas first introduced in this work were incorporated into UtahPOET. This investigation is an initial foray into the application of dual-process theories (i.e., shallow to deep processing theories of language comprehension) to clinical natural language processing (cNLP). In it we present a shallow pre-processing step to distinguish ungrammatical (nonprose) text segments from grammatical (prose) segments. We argue that the current state-of-the-art in cNLP named entity recognition is treating clinical text as de facto ungrammatical text. We show that ungrammatical text can be easily identified for separate processing, leaving the remaining prose text available for future extraction of information based on grammatical rules. Thus, leveraging the prose/nonprose distinction can improve cNLP performance. 
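To make the prose/nonprose distinction concrete, here is a minimal sketch of a shallow filter. It is not the method used in this work: the rule it applies (enough word tokens, at least one common function word, and sentence-final punctuation) is an assumption chosen only for illustration, as are the names `FUNCTION_WORDS`, `is_prose`, and `split_prose_nonprose`.

```python
import re

# Hypothetical, deliberately small list of function words.
FUNCTION_WORDS = {"the", "a", "an", "is", "was", "were", "has", "had",
                  "and", "of", "to", "with", "in"}


def is_prose(segment, min_tokens=5):
    """Guess whether a text segment is grammatical prose (assumed heuristic)."""
    tokens = re.findall(r"[A-Za-z']+", segment.lower())
    if len(tokens) < min_tokens:
        return False
    has_function_word = any(t in FUNCTION_WORDS for t in tokens)
    ends_like_sentence = segment.rstrip().endswith((".", "?", "!"))
    return has_function_word and ends_like_sentence


def split_prose_nonprose(lines):
    """Route each non-empty line to a prose or nonprose list."""
    prose, nonprose = [], []
    for line in lines:
        if not line.strip():
            continue
        (prose if is_prose(line) else nonprose).append(line)
    return prose, nonprose


note = [
    "The patient was confused and disoriented overnight.",
    "BP 120/80 HR 72",
    "Na 140 K 4.1 Cl 101",
]
prose, nonprose = split_prose_nonprose(note)
```

Here the narrative sentence is routed to the prose list and the two vital-sign and lab lines to the nonprose list, so each group could be passed to a different downstream processor.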

Automatic Clustering of Clinical Documents by Specialty and Setting

This work built on work by Dr. Olga Patterson. The work was done with Dr. John Hurdle and Sean Igo.

This is a set of studies designed to identify sublanguages in documents for domain-specific processing across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Natural Language Processing (NLP) pipelines are successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems, a natural focus would be on sublanguages. Sublanguages are identified by shared lexical and semantic features.[1] Patterson and Hurdle[2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. Using a clinical NLP pipeline and a new document corpus from the University of Pittsburgh (UPitt), we assigned new documents to clusters based on the minimum cosine distance to a Utah cluster centroid. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the specialty groups fell within the expected clusters. We find that clustering encounters difficulty due to documents with mixed sublanguages, naming convention differences across institutions, and document types used across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions. Click here for the original paper.

Click here for our paper, which is to be presented at DTMBIO 2013.
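The cluster-assignment step described above, in which a new document goes to the Utah cluster whose centroid has the minimum cosine distance, can be sketched as follows. The feature vectors, cluster names, and function names are hypothetical; the actual lexical and semantic features used in the study are not reproduced here.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_nearest_centroid(doc_vector, centroids):
    """Return the name of the centroid closest to doc_vector."""
    return min(centroids,
               key=lambda name: cosine_distance(doc_vector, centroids[name]))


# Toy centroids standing in for clusters learned at one institution.
centroids = {
    "cardiology": [0.9, 0.1, 0.0],
    "neurology": [0.1, 0.8, 0.3],
}
# Toy feature vector for a document from another institution.
new_doc = [0.2, 0.7, 0.1]
assigned = assign_to_nearest_centroid(new_doc, centroids)
```

Cosine distance depends only on the direction of the vectors and not on their magnitude, so a long note and a short note with the same proportions of features fall into the same cluster.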


My work on the POET2 projects was funded by:

POET-2: High-performance computing for advanced clinical narrative preprocessing. PI: John Hurdle