
4. Sample Files

4.1 SOIF

This is a sample SOIF file:


@FILE { http://harvest.sourceforge.net/
update-time{10}:        1065602907
full-text{718}: Harvest: A Distributed Search System
Harvest: A Distributed Search System
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
Harvest
Harvest Homepage
Miscellaneous Documents and Presentations
Directory Index of Work in Progress Version of Harvest
SourceForge: Project Info - Harvest
Stable Version of Harvest
Homepage of stable Version
Historic Versions of Harvest
Harvest User's Manual 1.4.pl2 (January 31, 1996)
Harvest User's Manual 1.4.pl2 as PostScript
Developers
Kang-Jin Lee
Javier Masa Marin
Harald Weinreich
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS

headings{111}:  Harvest: A Distributed Search System
Harvest
Stable Version of Harvest
Historic Versions of Harvest
Developers

title{37}:      Harvest: A Distributed Search System

url-references{836}:    harvest/doc/index.html
harvest/doc/sites.html
harvest/doc/download.html
harvest/contrib/index.html
harvest/doc/todo.html
harvest/doc/links.html
harvest/doc/CONTRIBUTORS
harvest/doc/html/manual.html
harvest/doc/html/FAQ.html
harvest/INSTALL.harvest
harvest/ChangeLog
harvest/NEWS
harvest/doc/index.html
misc/
wip/
http://sourceforge.net/projects/harvest/
harvest-1.8/doc/index.html
harvest-1.4.pl2-docs/
harvest-1.4.pl2-docs/user-manual.ps.gz
developers/lee/
developers/masa/
http://www.weinreichs.de/
harvest/doc/index.html
harvest/doc/sites.html
harvest/doc/download.html
harvest/contrib/index.html
harvest/doc/todo.html
harvest/doc/links.html
harvest/doc/CONTRIBUTORS
harvest/doc/html/manual.html
harvest/doc/html/FAQ.html
harvest/INSTALL.harvest
harvest/ChangeLog
harvest/NEWS
http://harvest.sourceforge.net/
http://sourceforge.net/

keywords{595}:  Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
Harvest Homepage
Miscellaneous Documents and Presentations
Directory Index of Work in Progress Version
     of Harvest
SourceForge:
     Project Info - Harvest
Homepage of stable
     Version
Harvest User's Manual 1.4.pl2
     (January 31, 1996)
Harvest User's
     Manual 1.4.pl2 as PostScript
Kang-Jin Lee
Javier Masa Marin
Harald Weinreich
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS

md5{32}:        2ba0877c91bbc00e6db037d5604ea860
uri{31}:        http://harvest.sourceforge.net/
file-size{4}:   3142
type{4}:        HTML
gatherer-version{6}:    1.9.10
gatherer-host{10}:      dyn214.tab
gatherer-name{37}:      Contents of the dyn214.tab WWW server
refresh-rate{6}:        604800
time-to-live{7}:        2419200
last-modification-time{10}:     1039163872
description{37}:        Harvest: A Distributed Search System

}
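
The number in braces after each attribute name is the size of the value in bytes, so a reader can take values by length instead of searching for a terminator. The following is a minimal sketch of such a reader, not the parser used by Harvest. It assumes that the name and value are separated by a colon and a single tab (the page rendering above shows spaces, but the filter in section 4.3 matches a tab) and that the data holds exactly one object. The function name parse_soif and the shortened sample are made up for illustration.

```python
import re

# Header of one attribute: name{byte-count}: followed by a tab (assumed delimiter).
ATTR_HEADER = re.compile(rb"([A-Za-z0-9_.-]+)\{(\d+)\}:\t")


def parse_soif(data: bytes) -> tuple[str, str, dict[str, bytes]]:
    """Parse a single SOIF object into (template type, URL, attributes)."""
    first = re.match(rb"@([A-Za-z0-9_.-]+) \{ (\S+)\n", data)
    if first is None:
        raise ValueError("not a SOIF object")
    template, url = first.group(1).decode(), first.group(2).decode()
    pos = first.end()
    attrs: dict[str, bytes] = {}
    while True:
        # Skip the newlines that separate one attribute from the next.
        while data[pos:pos + 1] == b"\n":
            pos += 1
        if data[pos:pos + 1] == b"}":
            break  # closing brace of the object
        header = ATTR_HEADER.match(data, pos)
        if header is None:
            raise ValueError(f"malformed attribute at byte {pos}")
        size = int(header.group(2))
        start = header.end()
        # Take exactly `size` bytes; the value may itself contain newlines.
        attrs[header.group(1).decode().lower()] = data[start:start + size]
        pos = start + size
    return template, url, attrs


# Shortened copy of the sample above, with tabs restored.
sample = (
    b"@FILE { http://harvest.sourceforge.net/\n"
    b"title{37}:\tHarvest: A Distributed Search System\n"
    b"\n"
    b"file-size{4}:\t3142\n"
    b"}\n"
)

template, url, attrs = parse_soif(sample)
print(template, url, attrs["file-size"])
```

Note that the 37 bytes of the title include a trailing newline, which is why a blank line follows the title in the sample.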

4.2 XML

This is the XML file that soif2gils.pl produces from the sample SOIF file above.


<gils>

<availability>
  <linkage>
    http://harvest.sourceforge.net/
  </linkage>
</availability>

<dateOfLastModification>
  1039163872
</dateOfLastModification>

<abstract>
  Harvest: A Distributed Search System
</abstract>

<author>

</author>

<localSubjectIndex>
  <localSubjectTerm>
  Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
Harvest Homepage
Miscellaneous Documents and Presentations
Directory Index of Work in Progress Version
     of Harvest
SourceForge:
     Project Info - Harvest
Homepage of stable
     Version
Harvest User's Manual 1.4.pl2
     (January 31, 1996)
Harvest User's
     Manual 1.4.pl2 as PostScript
Kang-Jin Lee
Javier Masa Marin
Harald Weinreich
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
  </localSubjectTerm>
</localSubjectIndex>

<supplementalInformation>
  <bytes>
  3142
  </bytes>
</supplementalInformation>

<crossReference>
  <linkage>
  harvest/doc/index.html
harvest/doc/sites.html
harvest/doc/download.html
harvest/contrib/index.html
harvest/doc/todo.html
harvest/doc/links.html
harvest/doc/CONTRIBUTORS
harvest/doc/html/manual.html
harvest/doc/html/FAQ.html
harvest/INSTALL.harvest
harvest/ChangeLog
harvest/NEWS
harvest/doc/index.html
misc/
wip/
http://sourceforge.net/projects/harvest/
harvest-1.8/doc/index.html
harvest-1.4.pl2-docs/
harvest-1.4.pl2-docs/user-manual.ps.gz
developers/lee/
developers/masa/
http://www.weinreichs.de/
harvest/doc/index.html
harvest/doc/sites.html
harvest/doc/download.html
harvest/contrib/index.html
harvest/doc/todo.html
harvest/doc/links.html
harvest/doc/CONTRIBUTORS
harvest/doc/html/manual.html
harvest/doc/html/FAQ.html
harvest/INSTALL.harvest
harvest/ChangeLog
harvest/NEWS
http://harvest.sourceforge.net/
http://sourceforge.net/
  </linkage>
</crossReference>

<title>
  Harvest: A Distributed Search System
</title>

<Body-of-text>
Harvest: A Distributed Search System
Harvest: A Distributed Search System
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
Harvest
Harvest Homepage
Miscellaneous Documents and Presentations
Directory Index of Work in Progress Version of Harvest
SourceForge: Project Info - Harvest
Stable Version of Harvest
Homepage of stable Version
Historic Versions of Harvest
Harvest User's Manual 1.4.pl2 (January 31, 1996)
Harvest User's Manual 1.4.pl2 as PostScript
Developers
Kang-Jin Lee
Javier Masa Marin
Harald Weinreich
Home
Sites using Harvest
Download
Contributed Code
Todo List
Links
Contributors
User's Manual
FAQ
Installation
ChangeLog
NEWS
</Body-of-text>

</gils>
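
Because the converted record is plain XML, any XML library can read it back. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the record. The helper text_of is a made-up name, and splitting the cross-reference linkage on whitespace is an assumption that holds only as long as the URLs contain no spaces.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the converted record shown above.
record = """<gils>
<availability>
  <linkage>
    http://harvest.sourceforge.net/
  </linkage>
</availability>
<supplementalInformation>
  <bytes>
  3142
  </bytes>
</supplementalInformation>
<crossReference>
  <linkage>
  harvest/doc/index.html
harvest/doc/sites.html
  </linkage>
</crossReference>
<title>
  Harvest: A Distributed Search System
</title>
</gils>"""

root = ET.fromstring(record)


def text_of(path: str) -> str:
    """Return the stripped text of the first element matching path, or ''."""
    return (root.findtext(path) or "").strip()


url = text_of("availability/linkage")
size = int(text_of("supplementalInformation/bytes"))
# One reference per line in the record; split on whitespace to get a list.
links = text_of("crossReference/linkage").split()

print(url, size, links)
```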

4.3 Zebra Input Filter used to parse SOIF

This is the soif.flt input filter from Zebra. The filter itself is unusable, but it shows the mapping from SOIF attributes to GILS elements.


# Crude input-filter for SOIF records -- one record per file.
# Author: Peter Valkenburg / TERENA (valkenburg@terena.nl)
# Version 0.2 (09/09/1998).
# This sort of follows the Nordic Web Index convention of GILS attribute use.
# Modified by Kang-Jin Lee (lee@arco.de)
# 07/10/1999

# We'll use GILS structured records.
BEGIN                                   { begin record gils }

# URL will be GILS' availability/linkage
/^@[A-Za-z](-|[.A-Za-z_])* { / BODY /$/ {
                                          begin element availability
                                          data -element linkage $1
                                          end element
                                        }

# Type will be GILS' availability/linkageType
/^[tT]ype{[0-9]+}:\t/ BODY /$/          {
                                          begin element availability
                                          data -element linkageType $1
                                          end element
                                        }

# Last modification time will be Bib-1 Use Attribute 1012
/^[lL]ast-[mM]odification-[tT]ime{[0-9]+}:\t/ BODY /$/  {
                                          data -element dateOfLastModification $1
                                        }

# The MD5 checksum is used as a unique identifier under Bib-1 Use Attribute 1007
/^[mM][dD]5{[0-9]+}:\t/ BODY /$/        { data -element controlIdentifier $1 }

# Description will be Bib-1 Use Attribute 62
/^[dD]escription{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/ {
                                          data -element abstract $1
                                          unread 2
                                        }

# Author will be Bib-1 Use Attribute 1003 (if gils.abs maps originator to it!!)
/^[aA]uthor{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/    {
                                          data -element author $1
                                          unread 2
                                        }

# Keywords will be GILS' localSubjectIndex/localSubjectTerm
/^[kK]eywords{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/  {
                                          begin element localSubjectIndex
                                          data -element localSubjectTerm $1
                                          unread 2
                                          end element
                                        }

# File-size will be GILS' supplementalInformation/bytes
/^[fF]ile-[sS]ize{[0-9]+}:\t/ BODY /$/  {
                                          begin element supplementalInformation
                                          data -element bytes $1
                                          unread 2
                                          end element
                                        }

# Update-Time will be GILS' supplementalInformation/lastChecked
/^[uU]pdate-[tT]ime{[0-9]+}:\t/ BODY /$/        {
                                          begin element supplementalInformation
                                          data -element lastChecked $1
                                          unread 2
                                          end element
                                        }

# url-references will be GILS' crossReference/linkage
/^[uU]rl-[rR]eferences{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/ {
                                          begin element crossReference
                                          data -element linkage $1
                                          unread 2
                                          end element
                                        }

# Title will be Bib-1 Use Attribute 4
/^[tT]itle{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/     {
                                          data -element Title $1
                                          unread 2
                                        }

# Body and Partial-Text will be Bib-1 Use Attribute 1010
# Is Body really commonly used in SOIF? Anyway, Full-Text is used by Harvest.
#/^[bB]ody{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/     {
#                                         data -element sampleText $1
#                                         unread 2
#                                       }
/^[fF]ull-[tT]ext{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/ {
                                          data -element sampleText $1
                                          unread 2
                                        }
/^[pP]artial-[tT]ext{[0-9]+}:\t/ BODY /^((-|[._A-Za-z0-9])+{[0-9]+}:\t.*|})$/ {
                                          data -element sampleText $1
                                          unread 2
                                        }

/^(-|[a-zA-Z0-9])+{[0-9]+}:\t/  BODY /^((-|[_A-Za-z0-9])+{[0-9]+}:\t.*|})$/ {
                                          unread 2
                                         }

END                                     { end record }
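
The mapping expressed by the filter rules can be written down as a table. The following sketch applies that table to an already parsed SOIF object. It is an illustration of the mapping only, not a replacement for the filter or for soif2gils.pl, and the names SOIF_TO_GILS and soif_to_gils are made up. Attributes without a rule are skipped, which appears to be what the final catch-all rule of the filter does.

```python
import xml.etree.ElementTree as ET

# SOIF attribute (lower-cased) -> path of GILS elements, read off the rules above.
SOIF_TO_GILS = {
    "type": ("availability", "linkageType"),
    "last-modification-time": ("dateOfLastModification",),
    "md5": ("controlIdentifier",),
    "description": ("abstract",),
    "author": ("author",),
    "keywords": ("localSubjectIndex", "localSubjectTerm"),
    "file-size": ("supplementalInformation", "bytes"),
    "update-time": ("supplementalInformation", "lastChecked"),
    "url-references": ("crossReference", "linkage"),
    "title": ("Title",),
    "full-text": ("sampleText",),
    "partial-text": ("sampleText",),
}


def soif_to_gils(url: str, attrs: dict[str, str]) -> ET.Element:
    """Build a <gils> element tree from a SOIF URL and its attributes."""
    root = ET.Element("gils")
    # The URL of the object becomes availability/linkage.
    availability = ET.SubElement(root, "availability")
    ET.SubElement(availability, "linkage").text = url
    for name, value in attrs.items():
        path = SOIF_TO_GILS.get(name.lower())
        if path is None:
            continue  # no rule for this attribute, so it is dropped
        parent = root
        for tag in path:
            # Like the filter, open a new element for every mapped attribute.
            parent = ET.SubElement(parent, tag)
        parent.text = value
    return root


doc = soif_to_gils(
    "http://harvest.sourceforge.net/",
    {"file-size": "3142", "Title": "Harvest", "gatherer-host": "dyn214.tab"},
)
print(ET.tostring(doc, encoding="unicode"))
```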

4.4 BIB-1 Attribute Set

This is the BIB-1 attribute set. It is a subset of the GILS attribute set.


# $Id: bib1.att,v 1.1 2002/10/22 12:51:09 adam Exp $
# Bib-1 Attribute Set
name bib1
reference Bib-1

att 1               Personal-name
att 2               Corporate-name
att 3               Conference-name
att 4               Title
att 5               Title-series
att 6               Title-uniform
att 7               ISBN
att 8               ISSN
att 9               LC-card-number
att 10              BNB-card-number
att 11              BGF-number
att 12              Local-number
att 13              Dewey-classification
att 14              UDC-classification
att 15              Bliss-classification
att 16              LC-call-number
att 17              NLM-call-number
att 18              NAL-call-number
att 19              MOS-call-number
att 20              Local-classification
att 21              Subject-heading
att 22              Subject-Rameau
att 23              BDI-index-subject
att 24              INSPEC-subject
att 25              MESH-subject
att 26              PA-subject
att 27              LC-subject-heading
att 28              RVM-subject-heading
att 29              Local-subject-index
att 30              Date
att 31              Date-of-publication
att 32              Date-of-acquisition
att 33              Title-key
att 34              Title-collective
att 35              Title-parallel
att 36              Title-cover
att 37              Title-added-title-page
att 38              Title-caption
att 39              Title-running
att 40              Title-spine
att 41              Title-other-variant
att 42              Title-former
att 43              Title-abbreviated
att 44              Title-expanded
att 45              Subject-precis
att 46              Subject-rswk
att 47              Subject-subdivision
att 48              Number-natl-biblio
att 49              Number-legal-deposit
att 50              Number-govt-pub
att 51              Number-music-publisher
att 52              Number-db
att 53              Number-local-call
att 54              Code-language
att 55              Code-geographic
att 56              Code-institution
att 57              Name-and-title
att 58              Name-geographic
att 59              Place-publication
att 60              CODEN
att 61              Microform-generation
att 62              Abstract
att 63              Note
att 1000            Author-title
att 1001            Record-type
att 1002            Name
att 1003            Author
att 1004            Author-name-personal
att 1005            Author-name-corporate
att 1006            Author-name-conference
att 1007            Identifier-standard
att 1008            Subject-LC-childrens
att 1009            Subject-name-personal
att 1010            Body-of-text
att 1011            Date/time-added-to-db
att 1012            Date/time-last-modified
att 1013            Authority/format-id
att 1014            Concept-text
att 1015            Concept-reference
att 1016            Any                 1016,4,1005,62
att 1017            Server-choice
att 1018            Publisher
att 1019            Record-source
att 1020            Editor
att 1021            Bib-level
att 1022            Geographic-class
att 1023            Indexed-by
att 1024            Map-scale
att 1025            Music-key
att 1026            Related-periodical
att 1027            Report-number
att 1028            Stock-number
att 1030            Thematic-number
att 1031            Material-type
att 1032            Doc-id
att 1033            Host-item
att 1034            Content-type
att 1035            Anywhere
att 1036            Author-Title-Subject

4.5 GILS Attribute Set

This is the GILS attribute set, which will be used in Harvest to store the summarized objects.


# $Id: gils.att,v 1.1 2002/10/22 12:51:09 adam Exp $
name gils
reference GILS-attset
include bib1.att

att 2000         Distributor
att 2001         Distributor-Name
att 2002         Index-Terms                            # Subject-Terms-Contr.
att 2003         Purpose
att 2004         General-Access-Constraints
att 2005         Use-Constraints
att 2006         Distributor-Organization
att 2007         Distributor-Street-Address
att 2008         Distributor-City
att 2009         Distributor-State-or-Province
att 2010         Distributor-Zip-or-Postal-Code
att 2011         Distributor-Country
att 2012         Distributor-Network-Address
att 2013         Distributor-Hours-of-Service
att 2014         Distributor-Telephone
att 2015         Distributor-Fax
att 2016         Resource-Description
att 2017         Order-Information
att 2018         Technical-Prerequisites
att 2019         Available-Time-Structured
att 2020         Available-Time-Textual
att 2021         Linkage
att 2022         Linkage-Type
att 2023         Contact-Name
att 2024         Contact-Organization
att 2025         Contact-Street-Address
att 2026         Contact-City
att 2027         Contact-State-or-Province
att 2028         Contact-Zip-or-Postal-Code
att 2029         Contact-Country
att 2030         Contact-Network-Address
att 2031         Contact-Hours-of-Service
att 2032         Contact-Telephone
att 2033         Contact-Fax
att 2034         Agency-Program
att 2035         Sources-of-Data
att 2036         Subject-Thesaurus
att 2037         Methodology
att 2038         West-Bounding-Coordinate
att 2039         East-Bounding-Coordinate
att 2040         North-Bounding-Coordinate
att 2041         South-Bounding-Coordinate
att 2042         Place-Keyword
att 2043         Place-Keyword-Thesaurus
att 2044         Time-Period-Structured
att 2045         Time-Period-Textual
att 2046         Cross-Reference-Title
att 2047         Cross-Reference-Linkage
att 2049         Original-Control-Identifier
att 2050         Supplemental-Information
att 2051         Record-Review-Date
att 2052         Originator-Dissemination-Control
att 2053         Security-Classification-Control
att 2054         Cost
att 2055         Cost-Information
att 2056         Schedule-Number
att 2057         Controlled-Subject-Index
att 2058         Uncontrolled-Term
att 2059         Spatial-Domain
att 2060         Bounding-Coordinates
att 2061         Place
att 2062         Time-Period
att 2063         Availability
att 2064         Order-Process
att 2065         Available-Time-Period
att 2066         Access-Constraints
att 2067         Point-of-Contact
att 2068         Cross-Reference
att 2069         Available-Linkage
att 2070         Cross-Reference-Relationship
att 2071         Language-of-Record
att 2072         Beginning-Date
att 2073         Ending-Date
att 2074         Controlled-Term
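
Both attribute set files use the same line-oriented layout: a directive (name, reference, include or att) followed by its arguments, with # starting a comment. The sketch below reads that layout into a dictionary. It covers only what is visible in the two listings; the function name parse_attset is made up, and any field after the attribute name (such as the list on att 1016 in bib1.att) is ignored because its meaning is not described here.

```python
def parse_attset(text: str) -> dict:
    """Parse the contents of an .att file into name, reference, includes and attributes."""
    result = {"name": None, "reference": None, "include": [], "att": {}}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        fields = line.split()
        directive, args = fields[0], fields[1:]
        if directive in ("name", "reference") and args:
            result[directive] = args[0]
        elif directive == "include":
            result["include"].extend(args)
        elif directive == "att" and len(args) >= 2:
            # Attribute number -> attribute name; extra fields are ignored.
            result["att"][int(args[0])] = args[1]
    return result


# Shortened copy of gils.att as shown above.
sample = """# $Id: gils.att,v 1.1 2002/10/22 12:51:09 adam Exp $
name gils
reference GILS-attset
include bib1.att

att 2002         Index-Terms                            # Subject-Terms-Contr.
att 2021         Linkage
"""

attset = parse_attset(sample)
print(attset["name"], attset["include"], attset["att"])
```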

4.6 GILS File

This is a sample GILS file from Zebra.


<gils>

<Title>
UTAH EARTHQUAKE EPICENTERS
<Acronym>
UUCCSEIS
</Acronym>
</Title>

<Originator>
UTAH GEOLOGICAL AND MINERAL SURVEY
</Originator>

<Local-Subject-Index>
APPALACHIAN VALLEY; EARTHQUAKE; EPICENTER; SEISMOLOGY; UTAH
</Local-Subject-Index>

<Abstract>
Five files of epicenter data arranged by date comprise this data set.  These
files are searchable by magnitude and longitude/latitude.  Hardcopy of listing
and plot of requested area available.  Epicenter location and date, magnitude,
and focal depth available.
<Format>
DIGITAL DATA SETS
</Format>

<Data-Category>
TERRESTRIAL
</Data-Category>

<Comments>
Data are supplied by the University of Utah Seismograph Station. The Utah
Geological and Mineral Survey (UGMS) is merely a clearinghouse of the data.
</Comments>
</Abstract>

<Spatial-Domain>

<Geographic-Coverage>
US STATE
</Geographic-Coverage>

<Coverage-Description>
UTAH
</Coverage-Description>

<Bounding-Coordinates>

<West-Bounding-Coordinate>
-114
</West-Bounding-Coordinate>

<East-Bounding-Coordinate>
-109
</East-Bounding-Coordinate>

<North-Bounding-Coordinate>
42
</North-Bounding-Coordinate>

<South-Bounding-Coordinate>
37
</South-Bounding-Coordinate>
</Bounding-Coordinates>
</Spatial-Domain>

<Time-Period>

<Time-Period-Textual>
-PRESENT
</Time-Period-Textual>
</Time-Period>

<Availability>

<Distributor>

<Organization>
UTAH GEOLOGICAL AND MINERAL SURVEY
</Organization>

<Street-Address>
606 BLACK HAWK WAY
</Street-Address>

<City>
SALT LAKE CITY
</City>

<State>
UT
</State>

<Zip-Code>
84108
</Zip-Code>

<Country>
USA
</Country>

<Telephone>
(801) 581-6831
</Telephone>
</Distributor>

<Resource-Description>
UTAH EARTHQUAKE EPICENTERS
</Resource-Description>

<Technical-Prerequisites>

<Data-Set-Type>
AUTOMATED
</Data-Set-Type>

<Access-Method>
BATCH
</Access-Method>

<Number-of-Records>
8,700
</Number-of-Records>

<Computer-Type>
PC NETWORK
</Computer-Type>

<Computer-Location>
SALT LAKE CITY, UT
</Computer-Location>
</Technical-Prerequisites>
</Availability>

<Access-Constraints>

<Documentation>
NONE
</Documentation>
</Access-Constraints>

<Use-Constraints>

<Status>
OPERATIONAL
</Status>
</Use-Constraints>

<Point-of-Contact>

<Name>
BILL CASE
</Name>

<Organization>
UTAH GEOLOGICAL AND MINERAL SURVEY
</Organization>

<Street-Address>
606 BLACK HAWK WAY
</Street-Address>

<City>
SALT LAKE CITY
</City>

<State>
UT
</State>

<Zip-Code>
84108
</Zip-Code>

<Country>
USA
</Country>

<Telephone>
(801) 581-6831
</Telephone>
</Point-of-Contact>

<Control-Identifier>
ESDD0006
</Control-Identifier>

<Record-Source>
UTAH GEOLOGICAL AND MINERAL SURVEY
</Record-Source>

<Date-of-Last-Modification>
198903
</Date-of-Last-Modification>
</gils>
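
The sample record nests its elements, for example the bounding coordinates inside Spatial-Domain. The following sketch reads a few nested values from a shortened copy of the record. It assumes the record can be treated as well-formed XML, which is true of the sample as printed here but is not something this section states about GILS files in general.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample GILS record shown above.
record = """<gils>
<Title>
UTAH EARTHQUAKE EPICENTERS
<Acronym>
UUCCSEIS
</Acronym>
</Title>
<Spatial-Domain>
<Bounding-Coordinates>
<West-Bounding-Coordinate>
-114
</West-Bounding-Coordinate>
<East-Bounding-Coordinate>
-109
</East-Bounding-Coordinate>
<North-Bounding-Coordinate>
42
</North-Bounding-Coordinate>
<South-Bounding-Coordinate>
37
</South-Bounding-Coordinate>
</Bounding-Coordinates>
</Spatial-Domain>
</gils>"""

root = ET.fromstring(record)

# .text holds only the text before the nested <Acronym> element.
title = root.find("Title").text.strip()
acronym = root.findtext("Title/Acronym").strip()

# float() tolerates the newlines that surround each number.
box = {
    side: float(
        root.findtext(f"Spatial-Domain/Bounding-Coordinates/{side}-Bounding-Coordinate")
    )
    for side in ("West", "East", "North", "South")
}

print(title, acronym, box)
```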

