A Guide to Text and Data Mining at Indiana University Bloomington

Image 1: Graphs from biomedical literature
Kim D, Yu H (2011) Figure Text Extraction in Biomedical Literature. PLoS ONE 6(1): e15338. doi:10.1371/journal.pone.0015338

Text and data mining of academic databases are becoming increasingly popular ways to conduct research. They can allow scholars to make connections not previously discovered, or find solutions more quickly and efficiently. Such research has also gotten some researchers into trouble for alleged copyright and contract violations, when practiced without due diligence into existing legal restrictions.

For IU researchers interested in accessing the Libraries’ digital journals, databases, special collections (specifically, HathiTrust), and other subscription content for the purposes of text or data mining, we’ve put together a quick-and-dirty guide to text and data mining at IUB. Check it out and let us know what you think in the comments.

Welcome to a new year! IUScholarWorks Services

Welcome back for the 2012-2013 school year!  We’d like to remind our faculty and students of the services provided by IUScholarWorks, the open access publishing program of the IU Libraries:

  • New this year: Data Services: Indiana University Bloomington’s Data Management Service provides consultations on data management plans mandated by funding agencies, as well as on data storage, access, and preservation options, all offered free of charge to campus researchers. Visit the IU Bloomington Data Management Service webpage for more information.
  • Journal Publishing:  We support IU faculty and graduate students who run electronic journals with their editorial needs such as author submissions, peer review, and journal website.  Please visit the IUScholarWorks Journals website or our recent blog post that showcases our publishing services.
  • Scholarly Research Archive:  Faculty can use our free, secure storage as a place for their Open Access research materials. The archive supports working papers, technical reports, media files, published articles, book chapters, and data: large and small.  Visit the archive, check it out, and contact us to learn more.
  • Graduate student theses and dissertations:  We actively collect PhD and EdD theses in the scholarly research archive.  A variety of departments also use the archive to showcase their masters theses.
  • Teaching: The Libraries’ Scholarly Communication department staff is available to lead workshops and guest lectures regarding our services, scholarly communication issues relevant to the disciplines, and topics related to intellectual property and author rights.  Check out our workshops pages (here and here) to see the latest offerings.

Visit the IUScholarWorks website to learn more about our services or to contact our staff.

(#18) Standing on the Shoulders of Giants

We began last month exploring why copyright plays an important role in scholarly communication by looking at one case – publishing.  Copyright also plays important roles in other ways that scholars do their work.  As Clifford Lynch has said, “The most fundamental part of research, teaching, and scholarly discourse is the ability to build upon both evidence and prior scholarship.” (Center for Intellectual Property Handbook, p. 154) This building upon requires both access to the material and the ability to use portions of it to build your own case, make an argument against it, or perhaps establish a common understanding within your field.  This ability to use other works is not just important; it is, to use Cliff’s word, fundamental.

Many of the questions that I receive are about the use of others’ copyrighted work.  Can I include this image in my paper?  Can I show this film clip in my class?  Many of these questions rely on the limitations I touched upon last time.  To recap: Section 106 of the copyright law defines the exclusive rights of authors and creators.  Sections 107 through 122 set limits on those exclusive rights.  These are not exceptions to copyright law.  They are statute-defined limitations on the exclusive rights of the author.  I believe this is a very important distinction, as exceptions are generally thought of as things to be gotten rid of, while defined limitations have a very different connotation.

Scholars rely on many of these limitations in order to do their work.  The most frequently used sections of the copyright law in higher education are Section 107 (the fair use section), Section 108 (specifically for libraries and archives), and Section 110 (which deals with teaching).  These are not the only parts of the law that scholars rely on, but they are probably the most heavily used.  More on each of these in future posts.  For now, let’s just think about how important it is to have the ability to build upon the scholarship of others, teach about developments in a field, or freely read about areas of our choosing.

(#16) Beyond the PDF

As the Digital Publishing Librarian, I am frequently asked what format a researcher should use to publish their materials in our open access institutional repository or in our open access journals.  Leaving other media aside for now, I will focus only on text files in this post.

The truth is, I spend a lot of time thinking about how to best direct researchers with this question and no great answer seems readily available.  My default response is, if you want to use PDF, please use archival PDF/A-1.  In my position I recognize how important it is to authors and editors that a document look its best, but I also need to think about how to best preserve it, digitally, for a long time.  I’m not a preservationist by a long shot, but these matters still keep me up some nights.

We have experimented with a few projects that stray from the PDF.  For example, The Medieval Review generates its articles in XML and supplies them to our repository, where we in turn transform them into HTML.  We hold on to the XML files primarily because we think they could be useful to our preservation strategy.

We’ve also worked with Museum Anthropology Review (see volume 5, issues 1-2), as time and staff help permit, to create HTML versions of their PDF articles using a template a crafty student created.  While these files in particular are great HTML files, they take quite a bit of time to create, as I learned one Friday afternoon last month when the editor and I sat down to try to create them ourselves!

Yes, time is a large part of the crux of the problem.  Staff expertise as well.  I have asked other library staff doing similar work about their editorial practices and their support for creating well-formed, preservable articles, and our general response boils down to this:  we’re a shoestring shop, trying to get by and do good work without spending a lot of time and money on the format of the output, and so we resort to what seems good enough and what people like:  the PDF.

In our spare time, folks like me keep abreast of the NISO Journal Article Tag Suite – Standardized Markup for Journal Articles.  We play around with Annotum, an open-source, open-process, open-access scholarly authoring and publishing platform based on WordPress which allows for the easy creation of XML-based articles.  We try to create XML templates in Microsoft Word.  If you read into these projects you may notice many of them focus on scientific publications, and I thank these developers for venturing into these arenas.  Most of the publications I support to date are humanities-based, and I am hopeful as humanists continue to explore viable options – ones that are easy for authors, editors, and peer reviewers to use and, of course, that readers like to read.  I look forward to the possibility of discussing these questions at venues such as ThatCamp Publishing 2012.

This post is as much a call for response as it is an attempt to point others wondering about these matters to useful resources.  I thank people like Michael Fenner at PLoS and Matthew Gold at CUNY for delving into these matters as well.

(#15) Copyright as the Center of the (Scholarly Communication) Universe

Whenever people think about Scholarly Communication, the first thing that comes to mind is probably not copyright.  They might think about the rising cost of journal subscriptions or new publishing methods or even think of putting their work on the web or in an institutional repository.  And while these are all valid first thoughts, copyright is generally only thought of after the fact.  Copyright should be a lot higher on the list for consideration.  Why?  Because copyright is the center of the scholarly communication universe!  And I’m not just saying that as the intellectual property librarian.  Ok, well maybe a little.  So, why is copyright so important in the scholarly communication universe?

Under US Copyright Law (US Code, Title 17), Section 106 grants authors certain exclusive rights as soon as they put their work into a fixed, tangible medium.  Sections 107 through 122 go about setting limitations on those exclusive rights, but for now, let’s just focus on the exclusive rights.  These rights include the ability to reproduce the copyrighted work, prepare derivative works, distribute the work, and display or perform the work publicly.  An author may share their work publicly on a web page or at a meeting or in any venue on their own as a means of spreading their work.  However, many authors want to publish their research or creative activity in a book or a journal in order to share their work.

For a publisher to make this work available, they must get permission from the author in order to reproduce (make copies) and distribute (in print or electronically) the author’s work as these are, up to this point, the exclusive rights belonging to the author.  This is where publishing agreements and copyright transfer agreements come in.  Some publishers have a policy in place that says that by sending them the article you are agreeing to have it published by them.  Thus, you are giving them permission.  Some publishers require not only permission, but an exclusive transfer of the copyright to them.  The transfer is exclusive in that the right was transferred and not retained by the author.  Exclusive transfers of copyright must be in writing, which is why it is important to read the copyright transfer agreement carefully before you sign it.  Once an author exclusively transfers these rights, the author no longer has them.  This has many implications for the author and should be carefully considered.  For example, the author would generally no longer be able to reproduce or distribute their own paper without permission from the copyright owner (who is now the publisher) or without relying on a statutory limitation such as fair use.

There are arguments as to why an exclusive transfer is a good thing, and also arguments as to why it is a bad thing.  I’m not going there at the moment as this is a blog post and not a dissertation (although this is getting rather long).  Suffice it to say that ALL publishing requires at a minimum the permission of the original author.  This fact alone is why copyright plays such a central role in the way that scholars share their work from the publishing end of things.  Copyright plays important roles in other ways that scholars do their work, not just in publishing.  Think teaching, research, and other means of scholarly discourse.  I’ll be exploring these roles in a series of blog posts the first Monday of the month because copyright is just that important!  I hope you’ll join in the discussion.

(#13) IUScholarWorks Journals

IUScholarWorks includes a service for managing and publishing IU faculty and graduate student edited journals.  If you’re interested in getting a handle on the editorial workflow process (i.e., less email in your personal inbox!) or if you’re interested in pursuing an open access publishing business model for your journal, please contact us to talk about the possibilities.

We support the OJS software platform.  OJS stands for Open Journal Systems and is a product of the Public Knowledge Project.  The OJS software is a robust content management system for managing the editorial work of the journal.  It includes support for author submissions (including agreement to the journal’s copyright policy), peer review (including reviewer forms), and the editorial work for sections.  At its core is a large database that keeps track of all the communications between the involved scholars as well as all the article versions produced along the way.

OJS can also publish your journal if it is based on an open access publishing model – meaning free and available to the world on the internet. OJS provides RSS feeds for tables of contents to readers and you can allow readers to make comments on the content.

Please review the journals that publish with IUScholarWorks Journals.  Please know that we can support the editorial work if you publish with another publisher.  We can also address archiving open access backfiles if a journal could benefit from such a service.  No matter what option you choose, if you partner with IUScholarWorks Journals your content will be highly discoverable by search engines, including Google Scholar.  The IU Libraries, along with our partners in the Digital Library Program, will take measures to preserve the content for the foreseeable future, and we will provide article-level use statistics that are of value to both authors and editors.

(#11) FAR and IUScholarWorks

Have you noticed that in the Faculty Annual Report (FAR), there is a check-box labeled ScholarWorks?  This check-box appears when you record your publications, creative activities, conference presentations, and even service activities.  By checking the ScholarWorks button, you are indicating that you are interested in placing the corresponding publication, presentation, etc. into IUScholarWorks, the digital repository hosted by the IU Libraries (see https://scholarworks.iu.edu/dspace/).  By placing your work in the repository, you will gain increased visibility for your research and the work will be assigned a permanent, stable URL for easy linking and dissemination.  In order for a paper, PowerPoint presentation, poster, or other material to be deposited into the repository, you must own the copyright to the material or have permission to make it available.  Note that for journal publications, this often means that the publisher’s PDF copy of the article is not allowed, although a pre-print version may be.  If you are interested in increasing access to your work, while having the libraries be responsible for its long-term maintenance, then check the ScholarWorks button when you fill out this year’s FAR.  A librarian will contact you to discuss your work and to get copies of the work for deposit.  If you have any questions regarding this process or about IUScholarWorks, please feel free to contact us at IUSW@indiana.edu.

(#6) Searching WorldCat for Open Access Publications

If you’re interested in using one source to find Open Access publications in repositories around the world, I invite you to check out WorldCat.

WorldCat is the world’s largest network of library content and services. Perform your searches for books, articles, photographic images, audio, video, etc. in WorldCat and discover materials in libraries worldwide.  You can also discover freely available digital materials found in repositories worldwide, such as HathiTrust, the Internet Archive, and institutional repositories like IUScholarWorks.

How?
Once you’ve performed your search, use the refinement tools on the left navigation side bar to narrow your results to ‘Internet Resources.’  From here, you will notice that many of the records come from a database called ‘OAIster’ (a play on OAI, the Open Archives Initiative) and have an orange ‘View Now’ link associated.  Certainly take a look at the full record by clicking on the item’s title, but the ‘View Now’ link takes you to a repository that is storing the material openly for the world to access.

Check it out and please ask us or your library’s reference staff for help if you have questions.  WorldCat is a remarkable search engine.  Be sure to take advantage of creating an account and managing your resources within WorldCat.

(#2) What is an institutional repository?

I’d like to introduce you to IUScholarWorks Repository and explain what it can do for you, the IU researcher.

A definition of institutional repository (IR) by Clifford Lynch, Director of the Coalition for Networked Information:

“a university-based institutional repository is a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members. It is most essentially an organizational commitment to the stewardship of these digital materials, including long-term preservation where appropriate, as well as organization and access or distribution.” (2003; http://www.arl.org/bm~doc/br226ir.pdf)

IUScholarWorks Repository is an open access institutional repository and serves as a place to permanently archive research materials in any format such as:

  • Previously published materials (articles, book chapters, etc.)
  • Conference works and unpublished scholarly works
  • Lectures
  • Data files and databases

Understanding open access. Peter Suber, an independent policy strategist for open access to research, provides a useful definition:

“Open-access (OA) literature is digital, online, free of charge, and free of most copyright and licensing restrictions.” (2004, revised 2010; http://www.earlham.edu/~peters/fos/overview.htm)

How does a researcher get started with the IUScholarWorks Repository?
IU researchers who are interested in depositing their research materials should contact the IUScholarWorks administrator (me, Jennifer Laherty) via email at iusw@indiana.edu or jlaherty@indiana.edu.  Together, and often with assistance from Sherri Michaels, the Intellectual Property Librarian at IU Bloomington, we will determine if you have the rights to deposit your research materials, or if we need to seek permission from the rightsholder in order to make the deposit.  For each item submitted to the repository, the rightsholder must agree to the non-exclusive IUScholarWorks Repository license.

Although it may seem that the author is the rightsholder to the material, this is often not the case for materials already published, such as articles and book chapters.  In most cases, an author transfers a set of rights to their publisher in a copyright transfer agreement.  It is important to understand which rights were transferred in order to determine if the author has the right to post their work to an open access institutional repository.  We can help you answer this question.  Students who wish to deposit their research may do so with the permission of their academic department.

Once the copyright situation is figured out, research may be deposited.  Here’s a very short list of some interesting materials in IUScholarWorks:

Some words about access and preservation
IUScholarWorks Repository makes your research freely and broadly available to a worldwide audience (open access); it uses technology (DSpace) and metadata standards (the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH) to ensure your works are more findable on the Internet; and the Libraries take care to archive and preserve your works for future generations.  IUScholarWorks is privileged to have support from the IU Digital Library Program, a collaborative effort of the IU Libraries and University Information Technology Services, in its efforts to achieve its mission.
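
For readers curious about what OAI-PMH harvesting looks like in practice, here is a minimal sketch in Python. It builds a ListRecords request URL and reads identifiers and titles out of a Dublin Core response. The base URL, the set name, and the sample record below are hypothetical and used only for illustration; they are not the actual IUScholarWorks endpoint or data. The verb, the metadataPrefix parameter, and the XML namespaces come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, for illustration only. A real repository
# publishes its own OAI-PMH base URL.
BASE_URL = "https://repository.example.edu/oai/request"

# XML namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optionally restrict the harvest to one set
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


# A made-up response with one record, shaped like an oai_dc ListRecords reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1234/1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Working Paper</dc:title>
          <dc:creator>Researcher, Example</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for identifier, title in parse_records(SAMPLE_RESPONSE):
        print(identifier, "|", title)
```

Aggregators that harvest repositories work along these lines: they request records in a common metadata format and index what comes back, which is one reason deposited works become easier to find. A real harvester would also need to fetch the URL over the network and follow resumption tokens for large result sets, both of which this sketch leaves out.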