|
1
|
- K. Bretonnel Cohen
- Lynne Fox (lynne.fox@uchsc.edu)
- Philip V. Ogren
- Lawrence Hunter
- University of Colorado Health Sciences Center
|
|
2
|
- The Center for Computational Pharmacology and its work
- Define key terms
- Illustrate corpus creation using our project as an example
- Summarize findings of research on other biomedical corpora
- Describe my role (10% appointment)
- Encourage you to get involved at your institution
|
|
3
|
|
|
4
|
- In the School of Medicine, Department of Pharmacology
- Purpose: Creating novel algorithms and knowledge-based tools for the
analysis and interpretation of high-throughput molecular biology data
- Test and prove theories about biomedical linguistics using computer
programs.
(Can we create a natural language processing program that
recognizes named entities [locations, proteins, etc.], or identifies a
physiological process, or finds the co-occurrence of two proteins?)
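The last question, finding the co-occurrence of two proteins, can be sketched in a few lines. This is a minimal illustration and not the Center's software: the function name `cooccurrences` and the whole-word matching rule are assumptions, and the two sentences are the Bif/Msn example used later in this talk.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, names):
    """Count how often each pair of names appears in the same sentence.

    Matching is case-insensitive and on whole words only, so "Bif"
    does not match inside "Bifocal". This is an assumed simplification.
    """
    counts = Counter()
    for sentence in sentences:
        present = sorted(
            name for name in names
            if re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE)
        )
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


sentences = [
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting.",
    "bif displays strong genetic interaction with msn.",
]

counts = cooccurrences(sentences, ["Bif", "Msn"])
print(counts[("Bif", "Msn")])
```

Real systems need far more than string matching (synonyms, ambiguous gene symbols), which is where annotated corpora come in.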
|
|
5
|
- High-throughput data is generated (DNA microarray chips)
- Literature searches need to be more automated (A+B, A+C, A+D ...)
- Annotation of data improves access via computer algorithms (More refined
searching of relationships)
- Knowledgebase creation provides access to new knowledge and discoveries
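The "A+B, A+C, A+D" pattern is the kind of repetitive searching that is easy to automate. A minimal sketch, assuming a generic Boolean `AND` syntax rather than any particular database's query language; `pairwise_queries` is a hypothetical helper name:

```python
def pairwise_queries(anchor, others):
    """Build one Boolean query per (anchor, other) pair."""
    return [f'"{anchor}" AND "{other}"' for other in others]


queries = pairwise_queries("A", ["B", "C", "D"])
for query in queries:
    print(query)
```

With thousands of genes coming off a microarray, the list of pairs quickly becomes too long to search by hand.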
|
|
6
|
|
|
7
|
- A body of data in digital form, used to build and test natural
language processing systems, which may include:
- Textual data (naturally occurring in any source – transcripts of spoken
communication, textbooks, articles, etc.)
- Annotation (“mark-up” can be automated or manual)
- The ability to compare the results from computer programs to “right”
and “wrong” answers in the corpus
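Comparing program output to the "right" answers in a corpus is usually reported as precision, recall, and F-measure. The sketch below assumes the simplest case, where both the program's output and the manually annotated answers are plain sets of strings; the entity names in the example are made-up test data, not results from any real system.

```python
def score(predicted, gold):
    """Return (precision, recall, F1) of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four gold-standard entities, four program guesses,
# two of which are correct.
gold = {"Bif", "Msn", "APOE", "Met28"}
predicted = {"Bif", "Msn", "DNA", "pathway"}

precision, recall, f1 = score(predicted, gold)
print(precision, recall, f1)
```

Precision asks how many of the program's answers were right; recall asks how many of the right answers the program found.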
|
|
8
|
- Sentence segmentation: <sentence>Here,
we show that Bifocal (Bif), a putative cytoskeletal regulator, is a
component of the Msn pathway for regulating R cell growth targeting. </sentence>
<sentence>bif displays strong genetic interaction with msn. </sentence>
- Tokenization: Divides the text into its smallest units (usually words),
removing punctuation. Challenge: What should be done with punctuation
that has linguistic meaning?
- Negative charge (Cl-)
- Absence of symptom (-fever)
- Knocked-out gene (Ski-/-)
- Gene name (IL-2 – mediated)
- Plus, “syntactic” uses (insulin-dependent)
- Part-of-speech tagging: Met28/NOUN binds/VERB to/TO DNA/NOUN ./.
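The three kinds of markup above can be sketched with regular expressions. This is an illustration only: the patterns and function names are assumptions, not the tools used in the project. The naive sentence splitter would fail on abbreviations such as "et al.", and the tokenizer handles only the punctuation cases listed on this slide.

```python
import re


def split_sentences(text):
    """Naive segmentation: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def naive_tokenize(text):
    """Keep only runs of letters and digits, discarding all punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


# A word, optional internal hyphenated parts (IL-2, insulin-dependent),
# then an optional trailing -/- (knockout) or +/- (charge).
# Any other non-space character becomes a token of its own.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:-/-|[+-])?|\S")


def tokenize(text):
    """Tokenize while keeping punctuation that carries meaning."""
    return TOKEN.findall(text)


def parse_tagged(line):
    """Turn 'word/TAG word/TAG' into a list of (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in line.split()]


text = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)
sentences = split_sentences(text)
print(len(sentences))

for example in ["Cl-", "-fever", "Ski-/-", "IL-2", "insulin-dependent"]:
    print(example, naive_tokenize(example), tokenize(example))

print(parse_tagged("Met28/NOUN binds/VERB to/TO DNA/NOUN ./."))
```

Note that the second sentence begins with a lowercase gene name, so a splitter that waits for a capital letter would miss the boundary. The naive tokenizer turns "Ski-/-" into "Ski" and "IL-2" into "IL" and "2", losing exactly the information a biologist cares about.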
|
|
9
|
|
|
10
|
- Experts manually annotate Gene References Into Function sentences from
Entrez Gene
- NLP researchers use the corpus to compare the results of tasks executed
by their computer programs to right and wrong answers and “train” the
programs to improve their results
- The goal is to automatically process the texts of NCBI GeneRIFs
(National Center for Biotechnology Information Entrez Gene References
Into Function) into a formal knowledgebase of the current
understanding of protein transport.
|
|
11
|
- A sentence about a specific gene and the function of the gene, usually
taken from a sentence in a PubMed abstract
- APOE does not play an important role in susceptibility to Parkinson's
disease, but may play a role in dementia associated with Parkinson's
disease (PMID: 11863377, 2002)
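For programming purposes, a GeneRIF can be treated as a small record: a gene, a sentence, and a pointer back to the source article. The field names below are hypothetical and are not the Entrez Gene schema; the values come from the APOE example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRIF:
    """One Gene Reference Into Function entry (illustrative fields only)."""
    gene: str   # gene symbol the sentence is about
    text: str   # the functional statement itself
    pmid: int   # PubMed ID of the source article
    year: int   # publication year of the source article

    def mentions_gene(self):
        """True if the gene symbol appears verbatim in the sentence."""
        return self.gene.lower() in self.text.lower()


rif = GeneRIF(
    gene="APOE",
    text=(
        "APOE does not play an important role in susceptibility to "
        "Parkinson's disease, but may play a role in dementia associated "
        "with Parkinson's disease"
    ),
    pmid=11863377,
    year=2002,
)
print(rif.gene, rif.pmid, rif.mentions_gene())
```

Keeping the PMID with every sentence is what lets annotations be traced back to the literature.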
|
|
12
|
|
|
13
|
- What we wanted to know:
- What corpora exist?
- What design features did they include?
- How did that impact use?
- How did we find out?
- Citation index search
- Manually examined papers
- Contacted corpus designers
|
|
14
|
|
|
15
|
- Format of corpora determined use (portability of markup method)
- Manual curation matters (the gold standard and training set)
- Size doesn’t matter (within limits)
- Named entity markup matters
- Annotation style matters (linguistic and structural)
|
|
16
|
- Mediate in the process of creating data and using data
- Consult on PubMed searching, MEDLINE structure, and controlled
vocabularies (ontologies)
- Draw attention to ways commercial databases mine data
- Assist with Gene References Into Function (GRIF) corpus annotation
project
- Idea generation
- Copyright “buzz-kill”
|
|
17
|
- Continue GRIF annotation project, corpus creation & design research
- Grant application to bring my participation to 8 hours/week, 20% time
- Consult on projects, search strategy, MEDLINE
- KEEP LEARNING!
|
|
18
|
- Network, network, network
- Connect with folks on your campus or at a university in your area
- Look for departments of linguistics, biocomputing, bioinformatics
- Look for units “hidden” within research divisions
- Electronic medical records managers
- Attend AMIA, Association for Computational Linguistics, Intelligent
Systems for Molecular Biology, Pacific Symposium on Biocomputing
- Suggest grant ideas, especially for clinical data mining
- Ask for funding for participation
|
|
19
|
- Fuller SS, Revere D, Bugni PF, Martin GM. A knowledgebase system to enhance
scientific discovery: Telemakus. Biomedical Digital Libraries. 1(1):2,
September 21, 2004.
- Rzhetsky A, Iossifov I, Koike T, et al. GeneWays: a system for
extracting, analyzing, visualizing, and integrating molecular pathway
data. Journal of Biomedical Informatics. 37(1):43-53, February 2004.
- Srinivasan P, Hristovski D. Distilling conceptual connections from MeSH
co-occurrences. Medinfo. 11(Pt 2):808-12, 2004.
- Watson LA, Hersh WR. Information retrieval: A health and biomedical
perspective. Springer; 2nd edition, 2003, ISBN: 0387955224.
|
|
20
|
- Librarians and computational linguists have common goals and can benefit
from mutual cooperation; librarians have unique skills that can
contribute to the study of natural language processing in biomedicine.
- Go forth and find NLPeople!
|
|
21
|
- NIH grant R01-LM008111 to Larry Hunter
- Lynette Hirschman, Alex Morgan, Kristofer Franzén (Comments on study)
- Christian Blaschke, Mark Craven, Lorraine Tanabe, Kristofer Franzén
(Corpus designers)
|