Jailbreaking the PDF

DESCRIPTION
Join us in Montpellier for a one-day event to hack on scholarly PDFs! Currently, the bulk of peer-reviewed scientific knowledge is locked up in PDF documents, from which information is difficult to extract. We want to change that. If you’re interested in hacking on PDFs and exploring modern ways to access scholarly data, this hackathon is for you.
Right now we have 6 projects with 8 participants.

Project
AMI2 is a prototype of an electronic scholar's assistant (amanuensis) designed for collaboration. The long-term idea is that AMI can read the literature and make semantic enhancements. UPDATE: examples are in recent posts on http://blogs.ch.cam.ac.uk/pmr
The current version concentrates on turning scholarly PDFs into semantic form. Attention has been given to unusual fonts (Greek, symbols, maths, etc.) and the management of equations. Tables and diagrams can be analysed and data extracted from both (prototyped on chemistry and phylogenetic trees, but extensible to other fields). Very happy to collaborate with other projects and share code.
Code: http://bitbucket.org/petermr/svg2xml-dev
To try it, download svg2xml-rev63.jar and run it from the command line: java -jar svg2xml-rev63.jar <files>
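The command-line invocation above can be scripted when many files need converting. What follows is only a minimal sketch, assuming Java is on the PATH, the jar sits in the working directory, and the inputs are PDF files; the helper names `build_command` and `convert` are hypothetical and not part of svg2xml.

```python
import subprocess


def build_command(jar, files):
    """Build the argument list for: java -jar <jar> <files>."""
    return ["java", "-jar", str(jar), *[str(f) for f in files]]


def convert(jar, files):
    """Run the jar on the given files.

    Raises subprocess.CalledProcessError if java exits with a non-zero status.
    """
    return subprocess.run(build_command(jar, files), check=True)
```

For example, `convert("svg2xml-rev63.jar", ["paper1.pdf", "paper2.pdf"])` would run the same command as typing it by hand; the file names here are placeholders.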
Team members
@petermr
PROJECT PROGRESS
0%
Project
This project is based on preliminary work building a document triage application for the Mouse Genome Informatics system (see http://www.youtube.com/watch?v=NIUkYF-x5Gk). We propose to use this Flex webapp as a front end to the LAPDF-Text system, which extracts text from files based on rule files. Administrators will first load the PDF files into the system and parse them (this takes a little time). Users of the website will then be able to define their own rule files to extract text from these documents and see the results in the browser. I'll be coding this from California, so I'm 9 hours behind you and this will be an all-nighter for me. Let the hacking begin!
Team members
@GullyAPCBurns
PROJECT PROGRESS
0%
Project
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system processes documents in PDF format and extracts:
- the document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue
- parsed bibliographic references
- the structure of the document's sections, section titles and paragraphs
CERMINE is based on a modular workflow whose architecture ensures that individual workflow steps can be maintained separately. As a result, it is easy to evaluate, train, improve or replace one step's implementation without changing other parts of the workflow. Most step implementations use supervised and unsupervised machine-learning techniques, which increases the maintainability of the system as well as its ability to adapt to new document layouts. CERMINE is available as a Java library and as a web service.
Limitations
This is an experimental service, and results may not be accurate. Uploaded files are used only for metadata extraction; we do not store them. Accepted file format: *.pdf; maximum file size: 5 MB.
License
CERMINE is licensed under the GNU Affero General Public License version 3.
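The modular workflow idea described for CERMINE can be pictured with a toy pipeline in which each step is an independent function and any one step can be swapped without touching the others. This is an illustrative sketch only: it is not CERMINE's API, and every name in it (`run_workflow`, `segment_lines`, `first_line_title`, `longest_line_title`) is hypothetical.

```python
from typing import Callable, Dict, List

# A step takes the document state and returns an updated copy of it.
Step = Callable[[Dict], Dict]


def run_workflow(steps: List[Step], document: Dict) -> Dict:
    """Apply each step in order, passing the result to the next step."""
    for step in steps:
        document = step(document)
    return document


def segment_lines(doc: Dict) -> Dict:
    """Toy segmentation step: split raw text into non-empty, stripped lines."""
    lines = [line.strip() for line in doc["text"].splitlines() if line.strip()]
    return {**doc, "lines": lines}


def first_line_title(doc: Dict) -> Dict:
    """Toy title step: take the first line as the title."""
    return {**doc, "title": doc["lines"][0]}


def longest_line_title(doc: Dict) -> Dict:
    """Alternative toy title step: take the longest line as the title."""
    return {**doc, "title": max(doc["lines"], key=len)}
```

Replacing `first_line_title` with `longest_line_title` in the list passed to `run_workflow` changes only the title step; the segmentation step is reused unchanged, which is the property the description attributes to the modular design.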
Team members
@ebattaglia
PROJECT PROGRESS
0%
Project
CiTalO is a tool that automatically infers the nature of citations by means of Semantic Web technologies and NLP techniques.
Team members
@essepuntato
@andriry
PROJECT PROGRESS
0%
Project
I am a researcher and read a lot of PDFs every day. It is really frustrating not to be able to make a hyperlink to a PDF segment, and therefore not to be able to refer precisely to a part of a scientific paper in my various brainstorming and authoring software. A solution would be hyper-annotation: a URL that points to a PDF segment. The potential for bibliographical work would be great. For instance, one can tag a corpus of PDFs while reading it, and later display all the bits of text related to a specific tag. When authoring a scientific paper, one can add a link to the precise passages of the paper s/he is citing, etc.
In this project the idea would be to:
- define a format for such PDF hyper-annotations
- develop a tool to create and access such hyper-annotations
- develop a first application to illustrate the potential of the approach (e.g. an application to 1/ tag segments of a PDF, and 2/ display PDF segments related to a tag).
Recently, a small application called PDFoo appeared and goes in the right direction, but there is still a lot to do: http://youtu.be/53eNifYR2vQ
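One possible shape for such a hyper-annotation is a URL whose fragment carries the page, a character range and an optional tag. The sketch below is only an assumption made for illustration: the project has not defined a format, and the parameter names (`page`, `start`, `end`, `tag`) and the helpers `Segment`, `make_link`, `parse_link` and `links_for_tag` are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


@dataclass(frozen=True)
class Segment:
    page: int      # page number within the PDF
    start: int     # first character offset on that page (assumed convention)
    end: int       # last character offset on that page (assumed convention)
    tag: str = ""  # optional reader-defined tag


def make_link(pdf_url: str, seg: Segment) -> str:
    """Encode a segment into the fragment part of the PDF's URL."""
    params = {"page": seg.page, "start": seg.start, "end": seg.end}
    if seg.tag:
        params["tag"] = seg.tag
    return urlunsplit(urlsplit(pdf_url)._replace(fragment=urlencode(params)))


def parse_link(link: str):
    """Split a hyper-annotation link back into (pdf_url, Segment)."""
    parts = urlsplit(link)
    fields = parse_qs(parts.fragment)
    seg = Segment(
        page=int(fields["page"][0]),
        start=int(fields["start"][0]),
        end=int(fields["end"][0]),
        tag=fields.get("tag", [""])[0],
    )
    return urlunsplit(parts._replace(fragment="")), seg


def links_for_tag(links, tag):
    """Return the links whose segment carries the given tag."""
    return [link for link in links if parse_link(link)[1].tag == tag]
```

Keeping the segment in the fragment means the link still opens the PDF in tools that ignore the extra parameters, while an annotation-aware tool can read them; `links_for_tag` corresponds to the "display segments related to a tag" use case.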
Team members
@solanki
@Amaury
Looking for
Designer
Developer
PROJECT PROGRESS
0%
Project
A web interface for evaluating output from various PDF extraction tools.
Team members
@caseyamcl
PROJECT PROGRESS
0%