Winfried Neun
Scientific Information Systems Department
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
neun_AT_zib.de
Abstract:
The aim of the Math-Net project (under the aegis of the
International Mathematical Union) is to build up a pool
of high-quality information on mathematical research and
mathematicians worldwide.
In the framework of this project we at ZIB are harvesting
pages with mathematical content from the Web.
These pages contain, besides plain text, mathematical
formulae or keywords. The formulae are traditionally
encoded in LaTeX, but with emerging standards such as
MathML, OpenMath and OMDoc we expect to encounter more and
more web pages that use the new encodings. Our goal is to
retrieve as much semantic information as possible in a
mechanized way, independent of the encoding style used for
the formulae, by providing extensions to the Harvest
software. Ultimately we want to classify the mathematical
information in a web page based on the types of formulae
it contains, complemented by mathematical keywords.
In this talk we discuss some problems in the automatic
detection of semantics that are caused by the encoding
schemes. One well-known example is MathML, which defines
two different encoding types (presentation and content
markup) serving different user needs, as well as mixed
forms. Some of our attempts to overcome these problems are
based on heuristics.
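
To illustrate the kind of heuristic meant here, the following minimal
sketch guesses whether a MathML fragment uses presentation markup,
content markup, or a mixture of both, simply by looking at the element
names that occur in it. It is not the Harvest extension itself: the
function name classify_mathml and the two tag sets are assumptions made
for this example, and the tag sets are small, incomplete samples of the
MathML vocabulary.

```python
import xml.etree.ElementTree as ET

# Small, deliberately incomplete samples of MathML element names
# (assumption: chosen for illustration only).
PRESENTATION_TAGS = {"mi", "mn", "mo", "mrow", "msup", "msub", "mfrac", "msqrt"}
CONTENT_TAGS = {"apply", "ci", "cn", "csymbol", "plus", "times", "power", "eq"}


def local_name(tag):
    """Drop a namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def classify_mathml(fragment):
    """Guess the encoding type of a well-formed MathML fragment (hypothetical helper)."""
    root = ET.fromstring(fragment)
    names = {local_name(element.tag) for element in root.iter()}
    has_presentation = bool(names & PRESENTATION_TAGS)
    has_content = bool(names & CONTENT_TAGS)
    if has_presentation and has_content:
        return "mixed"
    if has_presentation:
        return "presentation"
    if has_content:
        return "content"
    return "unknown"


# x squared, once in each encoding type.
presentation_sample = "<math><msup><mi>x</mi><mn>2</mn></msup></math>"
content_sample = "<math><apply><power/><ci>x</ci><cn>2</cn></apply></math>"

print(classify_mathml(presentation_sample))
print(classify_mathml(content_sample))
```

Such a test only looks at surface vocabulary; it says nothing yet about
the mathematical meaning of the formula, and it fails on pages whose
markup is not well-formed XML, which harvested web pages often are not.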
DOWNLOAD PRESENTATION (pdf)
DOWNLOAD PAPER (pdf)