1
Biomedical corpora for natural language processing: What’s a nice Librarian like you doing in a project like this?
  • K. Bretonnel Cohen
  • Lynne Fox (lynne.fox@uchsc.edu)
  • Philip V. Ogren
  • Lawrence Hunter
  • University of Colorado Health Sciences Center
2
Today’s talk
  • The Center for Computational Pharmacology and its work
  • Define key terms
  • Illustrate corpus creation using our project as an example
  • Summarize findings of research on other biomedical corpora
  • Describe my role (10% appointment)
  • Encourage you to get involved at your institution
3
 
4
Center for Computational Pharmacology
http://compbio.uchsc.edu/
  • In the School of Medicine, Department of Pharmacology
  • Purpose: Creating novel algorithms and knowledge-based tools for the analysis and interpretation of high-throughput molecular biology data
  • Test and prove theories about biomedical linguistics using computer programs.
    (Can we create a natural language processing program that recognizes named entities [locations, proteins, etc.], or identifies a physiological process, or finds the co-occurrence of two proteins?)
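A minimal sketch of the co-occurrence question, assuming a tiny hand-made lexicon (the `PROTEINS` set and the `cooccurring_pairs` helper are hypothetical names for illustration, not part of the Center's software). It counts how often two known names appear in the same sentence; matching ignores case, which is a simplification because case can distinguish a gene from its protein.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon; a real system would draw on a curated resource.
PROTEINS = {"Bif", "Msn", "Met28"}


def cooccurring_pairs(sentences, lexicon=PROTEINS):
    """Count how often each pair of lexicon names shares a sentence."""
    canonical = {name.lower(): name for name in lexicon}
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[A-Za-z0-9]+", sentence)
        # Case-insensitive lookup (a simplifying assumption).
        found = sorted({canonical[w.lower()] for w in words if w.lower() in canonical})
        counts.update(combinations(found, 2))
    return counts


sentences = [
    "Bifocal (Bif) is a component of the Msn pathway.",
    "bif displays strong genetic interaction with msn.",
    "Met28 binds to DNA.",
]
pair_counts = cooccurring_pairs(sentences)
print(pair_counts)
```

Dictionary lookup like this misses synonyms and ambiguous names, which is one reason annotated corpora are needed to measure how well a program really does.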
5
Why biologists (and librarians) care about NLP

  • High-throughput data is generated (DNA microarray chips)
  • Literature searches need to be more automated (A+B, A+C, A+D . . . )
  • Annotation of data improves access via computer algorithms (More refined searching of relationships)
  • Knowledgebase creation provides access to new knowledge and discoveries


6
Definitions
7
Corpus  (plural: corpora)
  • A body of data in digital form that is used to build and test natural language processing systems and that may include:
    • Textual data (naturally occurring in any source – transcripts of spoken communication, textbooks, articles, etc.)
    • Annotation (“mark-up” can be automated or manual)
    • The ability to compare the results from computer programs to “right” and “wrong” answers in the corpus
8
Types of annotations in corpora:
  • Sentence segmentation:  <sentence>Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. </sentence> <sentence>bif displays strong genetic interaction with msn. </sentence>
  • Tokenization: Divides the text into its smallest units (usually words), typically removing punctuation. Challenge: What should be done with punctuation that has linguistic meaning?
    • Negative charge (Cl-)
    • Absence of symptom (-fever)
    • Knocked-out gene (Ski-/-)
    • Gene name (IL-2 – mediated)
    • Plus, “syntactic” uses (insulin-dependent)
  • Part-of-speech tagging: Met28/NOUN binds/VERB to/TO DNA/NOUN ./.
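The tokenization challenge above can be sketched with two regular expressions. This is an illustration under stated assumptions, not the tokenizer used in the project: `naive_tokenize`, `biomedical_tokenize`, and the pattern itself are hypothetical, and the rules cover only the example cases on this slide.

```python
import re


def naive_tokenize(text):
    """Keep runs of letters and digits; all punctuation is discarded."""
    return re.findall(r"[A-Za-z0-9]+", text)


# Assumed rules, covering only the cases listed on the slide.
TOKEN_PATTERN = re.compile(
    r"""
    [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*        # word, allowing internal hyphens (IL-2-mediated)
        (?:-/-|[+-](?![A-Za-z0-9]))?      # optional knockout marker (-/-) or charge sign
    | -[A-Za-z]+                          # leading minus, as in -fever
    | [^\sA-Za-z0-9]                      # any other punctuation mark, kept as its own token
    """,
    re.VERBOSE,
)


def biomedical_tokenize(text):
    """Tokenize while keeping punctuation that carries meaning."""
    return TOKEN_PATTERN.findall(text)


text = "Ski-/- mice lack IL-2-mediated signaling (Cl-), -fever."
naive_tokens = naive_tokenize(text)
bio_tokens = biomedical_tokenize(text)
print(naive_tokens)
print(bio_tokens)
```

The naive version reduces "Ski-/-" to "Ski" and "Cl-" to "Cl", losing the knockout and the charge. The second version keeps them, but it would also treat an ordinary trailing hyphen as a charge sign, so rules like these need checking against annotated text.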
9
 
10
Our Corpus
  • Experts manually annotate Gene References Into Function sentences from Entrez Gene
  • NLP researchers use the corpus to compare the results of tasks executed by their computer programs to right and wrong answers and “train” the programs to improve their results
  • The goal is to automatically process the texts of NCBI GeneRIFs (National Center for Biotechnology Information Entrez Gene References Into Function) into a formal knowledge base of the current understanding of protein transport.
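Comparing a program's output to the "right" answers in a corpus is commonly scored with precision, recall, and F1. The sketch below assumes that scoring scheme (the slides do not name one) and uses invented sentence IDs and annotations; the `score` function is a hypothetical helper.

```python
def score(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: (sentence id, entity) pairs marked by human experts.
gold = {("s1", "APOE"), ("s2", "Bif"), ("s2", "Msn"), ("s3", "Met28")}
# Invented output from a program under test.
predicted = {("s1", "APOE"), ("s2", "Bif"), ("s3", "DNA")}

precision, recall, f1 = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the program's three answers are right (precision 2/3) and it finds two of the four expert annotations (recall 1/2). Scores like these only mean something if the manual annotation is reliable.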
11
GRIFs
  • A sentence about a specific gene and its function, usually taken from a PubMed abstract
    • APOE does not play an important role in susceptibility to Parkinson's disease, but may play a role in dementia associated with Parkinson's disease  (PMID: 11863377, 2002)
12
 
13
How can we learn from extant corpora? Our research
  • What we wanted to know:
    • What corpora exist?
    • What design features did they include?
    • How did that impact use?
  • How did we find out?
    • Citation index search
    • Manually examined papers
    • Contacted corpus designers
14
Which corpora did we study?
15
Our Findings
  • Format of corpora determined use (portability of markup method)
  • Manual curation matters (the gold standard and training set)
  • Size doesn’t matter (within limits)
  • Named entity markup matters
  • Annotation style matters (linguistic and structural)
16
My Role
I’m familiar with user needs and I know why it’s tough to find information
  • Mediate in the process of creating data and using data
  • Consult on PubMed searching, MEDLINE structure, and controlled vocabularies (ontologies)
  • Draw attention to ways commercial databases mine data
  • Assist with Gene Reference Into Function (GRIF) corpus annotation project
  • Idea generation
  • Copyright “buzz-kill”


17
What’s Next For Me?
  • Continue GRIF annotation project, corpus creation & design research
  • Grant application to bring my participation to 8 hours/week, 20% time
  • Consult on projects, search strategy, MEDLINE
  • KEEP LEARNING!


18
What can you do?
  • Network, network, network
    • Connect with folks on your campus or at a university in your area
      • Look for departments of linguistics, biocomputing, bioinformatics
      • Look for units “hidden” within research divisions
      • Electronic medical records managers
    • Attend AMIA, Association for Computational Linguistics, Intelligent Systems for Molecular Biology, Pacific Symposium on Biocomputing
  • Suggest grant ideas, especially for clinical data mining
  • Ask for funding for participation


19
Further Reading

  • Fuller SS, Revere D, Bugni PF, Martin GM. A knowledgebase system to enhance scientific discovery: Telemakus. Biomedical Digital Libraries. 1(1):2, 21 September 2004.
  • Rzhetsky A, Iossifov I, Koike T, et al. GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. Journal of Biomedical Informatics. 37(1):43-53, February 2004.
  • Srinivasan P, Hristovski D. Distilling conceptual connections from MeSH co-occurrences. Medinfo. 11(Pt 2):808-12, 2004.
  • Watson LA, Hersh WR. Information retrieval: A health and biomedical perspective. Springer; 2nd edition, 2003, ISBN: 0387955224.


20
Summary
  • Librarians and computational linguists have common goals and can benefit from mutual cooperation. Librarians have unique skills that can contribute to the study of natural language processing in biomedicine.
  • Go forth and find NLPeople!
21
Acknowledgements
  • NIH grant R01-LM008111 to Larry Hunter
  • Lynette Hirschman, Alex Morgan, Kristofer Franzén (Comments on study)
  • Christian Blaschke, Mark Craven, Lorraine Tanabe, Kristofer Franzén (Corpus designers)