Google Docs Session Notes
In this two-hour workshop we will give an overview of basic concepts in text and data mining (TDM), with a focus on open source implementations in R. The examples will be deliberately trivial, so that the underlying principles are easy to follow. We will pick out some published examples from the biological and chemical literature to show how TDM techniques have been successfully applied. The session will end with an open discussion on the theme "Why aren't more researchers using text and data mining?" and the open access policy issues that come with this question.
Objectives: Give an overview of the power, scalability, and utility of TDM techniques
Who should be interested: People who do not think of themselves as computer scientists
What attendees are expected to learn:
* Some of what current TDM methods can and cannot do
* The significant differences between "title, abstract, and keyword" mining and full-text mining
* De-mystification of TDM jargon like document-term matrix (DTM), tokenization, part-of-speech (POS) tagging, named entity recognition...
* Why open access papers must be licensed to permit public reposting, modification, and commercial use (a defence of CC BY licensing from the TDM point of view)
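
To give a flavour of two of the jargon terms listed above, tokenization and the document-term matrix (DTM), here is a minimal sketch in base R. The two toy sentences, the `tokenize` helper, and the variable names are assumptions made up for illustration; they are not taken from the workshop materials, and real analyses would more likely use a dedicated text-mining package.

```r
# Two toy "documents" (hypothetical examples, not workshop data)
docs <- c(doc1 = "Text mining finds patterns in text.",
          doc2 = "Data mining finds patterns in data.")

# Tokenization: lower-case the text and split on anything that is not a letter
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[words != ""]  # drop any empty strings left by the split
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct token seen in any document
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# each cell holding how often that term occurs in that document
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)
```

In this sketch each row of `dtm` is a document and each column is a term, so the cell for `doc1` and `text` holds 2 because "text" occurs twice in the first sentence.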