[ PENN TREEBANK SAMPLE ]
http://www.cis.upenn.edu/~treebank/home.html

This is a ~5% fragment of Penn Treebank, (C) LDC 1995.  It is made
available under fair use for the purposes of illustrating NLTK tools
for tokenizing, tagging, chunking and parsing.  This data is for
non-commercial use only.

Contents: raw, tagged, parsed and combined data from Wall Street
Journal for 1650 sentences (99 treebank files wsj_0001 .. wsj_0099).
For details about each of the four types, please see the other README
files included in the treebank sample directory.  Examples of the four
types are shown below:

----raw----
Pierre Vinken, 61 years old, will join the board as a nonexecutive
director Nov. 29.
----tagged----
[ Pierre/NNP Vinken/NNP ]
,/, 
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB 
[ the/DT board/NN ]
as/IN 
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./. 
----parsed----
( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
		   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
		     (NP a nonexecutive director))
	     (NP-TMP Nov. 29)))
     .))
----combined----
( (S 
    (NP-SBJ 
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,) 
      (ADJP 
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will) 
      (VP (VB join) 
        (NP (DT the) (NN board) )
        (PP-CLR (IN as) 
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

-----------------------------------------------------------------------

[ README file for (POS-)tagged files ]

This hierarchy contains files tagged for Part of Speech.  There are 2499
files organized in 25 directories numbered 00 to 24.  Note that not all
files found here have a corresponding parsed file in this release.

Originally, each of the texts was run through PARTS (Ken Church's stochastic
part-of-speech tagger) and then corrected by a human annotator.  The square
brackets surrounding phrases in the texts are the output of a stochastic NP
parser that is part of PARTS and are best ignored.  Estimated error rate
for the POS tags is about 3%.

Words are separated from their part-of-speech tag by a forward slash.  In
cases of uncertainty concerning the proper part-of-speech tag, words are
given alternate tags, which are separated from one another by a vertical
bar.  The order in which the alternate tags appear is not significant, but
has not been standardized.  When they are part of the original text and
have not been added in the course of the tagging process, the delimiter
characters forward slash and vertical bar are quoted by being preceded by a
backward slash.

The part-of-speech tags used are described in detail in the POS tagging
guide (see the doc/ directory), with some changes as described below.
There has been no thorough revision of the WSJ material since the last
release, but a number of technical errors have been fixed along with a few
tagging errors.  The small ATIS-3 sample has not been previously released.


DIFFERENCES FROM THE TAGGING GUIDE

Several POS tags have changed names, to avoid conflicts with bracket tags
with the same name.  The most recent version of the guide reflects this
change, but earlier editions and papers may use the old names.

    old name  new name   description

	NP	NNP	Proper Noun
	NPS	NNPS	Proper Noun, plural 
	PP	PRP	Personal Pronoun (I, they, it, etc.)
	PP$	PRP$	Possessive Personal Pronoun (my, their, its, etc.)

The Guide lacks a description of the punctuation tags.  They are:

  $      Dollar sign, also US$, NZ$, etc.
  #      Pound sign (usually British currency)
  ``     Left (double or single) quote
  ''     Right (double or single) quote
  (      Left parenthesis (round, square, curly or angle bracket)
  )      Right parenthesis (round, square, curly or angle bracket)
  ,      Comma
  .      Sentence-final punctuation (. ! ?)
  :      Mid-sentence punctuation (: ; ... -- -)


CHANGES SINCE LAST RELEASE

This is a summary of changes to the WSJ tagged files since the 1992 CDROM1
release of the material.  The changes are described in greater detail, with
specific files and words, in CHANGES.LOG.

In general, most of the corrected errors were discovered during the parsing
of the material, so the corrections are concentrated in those files for
which there is a current parsed version.  However, many changes were
applied uniformly to all files.

  Missing/Doubled Words
	Due to various errors in the creation of these files, some words or
	portions thereof were lost, and other words (such as 'em or '87)
	were doubled.  Most of these errors have been corrected, usually
	with some sort of systematic check.

  Tokenization
	The tokenization of ellipsis dots has been made uniform:
		final ellipsis (4 dots) == .../: ./.
		medial ellipsis (3 dots) == .../:

	Most sentence-ending periods have been split, and some
	end-of-abbrev. periods have been joined to the abbrev.

	Policy has been to split "'s"'s from the preceding word, so that
	they can be separately tagged as VBZ or POSsessive, as appropriate.
	Many errors in this process, especially with words in ALL CAPS,
	have been corrected.

	Initials are done as separate tokens (W./NNP E./NNP B./NNP
	DuBois/NNP), but abbreviations are done as a single token
	(U.S.A./NNP, G.m.b.H./NNP).  The final period of an abbreviation is
	frequently split at the end of a sentence, however: U.S/NNP ./.

	Places where there should be a dash (--) in the rawtext have been
	split accordingly, even when the rawtext has a hyphen (-) or no
	boundary at all.

	When "its" is misspelled -- used as "it is" -- it is split to allow
	the sentence to have a separate subject and verb, just as "it's"
	and other such contractions have always been split.  (But when
	possessive its is misspelled as "it's", it is generally handled as
	it/PRP 's/POS.)

	More generally, word boundaries have been changed to make the text
	make more sense and to allow correct bracketing.

  Quotes
	Double quotes in the rawtext are automatically converted to
	matching `` and '' in our POS files.  Some of the errors in this
	process have been corrected by hand.
	  Also, there has been some attempt to regularize the tokenization
	of quotes acting as abbreviation markers, as in "rock 'n' roll".

  NP Brackets ([])
	The NP brackets added by PARTS were usually left alone, unless a
	sentence boundary was added in the middle, in which case they were
	adjusted appropriately.  However, some brackets had spurious POS
	tags; these were removed.

  Tagging
	Some tags have been corrected, either because they were near some
	other change or because they were strikingly wrong and somebody
	noticed.  No systematic correction of tags was performed, however.

  Errors in Rawtext
	There are various errors in our rawtexts, some from the initial
	source and some from the extraction of the text from that source.
	In particular, text from tables in the original source was supposed
	to not be extracted, but occasionally it was, usually as some sort
	of sentence fragment.  Many of these have been removed.  Similarly,
	one piece of text was misidentified as a table and came out as a
	fragment; the missing words were restored from another published
	version of the same text.
	   In addition, the source text contains a number of typographical
	errors, such as misspellings and repeated words.  Most of these
	remain uncorrected, since the "correct" version is not always clear
	and, after all, such typos are part of any real text.

  Lost Final Tokens
	Due to an error in preprocessing, the final token of many files had
	been lost.  About 2/3 of these final tokens were final close
	quotes.  All such tokens were restored automatically to the POS
	files as described in CHANGES.LOG.  However, there was not enough
	time to restore the missing text to all of the bracketed files.

  Double Tags

	Various bugs in preprocessing occasionally caused a word to have
	more than one tag, for example 5050/NNP/CD or n't/NNP/CD.  These
	have been searched out and corrected in the Treebank II material,
	but the Treebank I material may still have a few of these, limited
	to the NNP/CD case, in which case it is always correct to take the
	second tag.

-----------------------------------------------------------------------

[ README FROM ORIGINAL CDROM ]

This is the Penn Treebank Project: Release 2 CDROM, featuring a million
words of 1989 Wall Street Journal material annotated in Treebank II style.
This bracketing style, which is designed to allow the extraction of simple
predicate-argument structure, is described in doc/arpa94 and the new
bracketing style manual (in doc/manual/).  In addition, there is a small
sample of ATIS-3 material, also annotated in Treebank II style.  Finally,
there is a considerably cleaner version of the material released on the
previous Treebank CDROM (Preliminary Release, Version 0.5, December 1992),
annotated in Treebank I style.

There should also be a file of updates and further information available
via anonymous ftp from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.
This file will also contain pointers to a gradually expanding body of
relatively technical suggestions on how to extract certain information from
the corpus.

We're also planning to create a mailing list for users to discuss the Penn
Treebank, focusing on this release.  We hope that the discussions will
include interesting research that people are doing with Treebank data, as
well as bugs in the corpus and suggested bug patches.  The mailing list is
tentatively named treebank-users@linc.cis.upenn.edu; send email to
treebank-users-request@linc.cis.upenn.edu to subscribe.

For questions that are not of general interest, please write to
treebank@unagi.cis.upenn.edu.


		      INVENTORY and DESCRIPTIONS

The directory structure of this release is similar to the previous release.

doc/                    --Documentation.
			This directory contains information about who
			the annotators of the Penn Treebank are and
			what they did as well as LaTeX files of the
			Penn Treebank's Guide to Parsing and Guide to
			Tagging. 

parsed/			--Parsed Corpora.
			These are skeletal parses, without part-of-speech
			tagging information.  To reflect the change in
			style from our last release, these files now have
			the extension of .prd.

  atis/  		--Air Travel Information System transcripts.
  April 1994		Approximately 5000 words of ATIS3 material.
			The material has a limited number of sentence
			types.  It was created by Don Hindle's Fidditch and
			corrected once by a human annotator (Grace Kim).

  wsj/			--1989 Wall Street Journal articles.
  November 1993		Most of this material was processed from our      
   -October 1994	previous release using tgrep "T" programs.
			However, the 21 files in the 08 directory and the
			file wsj_0010 were initially created using the
			FIDDITCH parser (partially as an experiment, and
			partly because the previous release of these files
			had significant technical problems).
			                                                  
			All of the material was hand-corrected at least
			once, and about half of it was revised and updated
			by a different annotator.  The revised files are
			likely to be more accurate, and there is some
			individual variation in accuracy.  The file
			doc/wsj.wha lists who did the correction and
			revision for each directory.


tagged/			--Tagged Corpora.

  atis/			--Air Travel Information System transcripts.
  April 1994		The part-of-speech tags were inserted by Ken
			Church's PARTS program and corrected once by a
			human annotator (Robert MacIntyre).
  
  wsj			--'88-'89 Wall Street Journal articles.
  Winter		These files have not been reannotated since the
   -Spring 1990		previous release.  However, a number of technical
			bugs have been fixed and a few tags have been
			corrected.  See tagged/README.pos for details.


combined/		--Combined Corpora.
			These corpora have been automatically created by
			inserting the part of speech tags from a tagged
			text file (.pos file) into a parsed text file (.prd
			file).  The tags are inserted as nodes immediately
			dominating the terminals.  See README.mrg for more
			details.


tgrepabl/		--Tgrepable Corpora.
			These are encoded corpora designed for use with
			version 2.0 of tgrep, included with this release.
			The (skeletally) parsed Treebank II WSJ material is
			in wsj_skel.crp, while the combined version, with
			part-of-speech tagging information included, is in
			wsj_mrg.crp.  See the README in tools/tgrep/ for
			more information.


raw/			--Rawtexts.
			These are source files for Treebank II annotated
			material.  Some buggy text has been changed or
			eliminated; tb1_075/ has the original versions.


tools/			--Source Code for Various Programs.
			This directory contains the "tgrep" tree-searching
			(and tree-changing) package, in a compressed tar
			archive.  It also contains the program used to make
			the combined files.  All programs are designed to
			be run on UNIX machines.


tb1_075/		--"Version 0.75" of Treebank I.
			This directory contains a substantially cleaner
			version of the Preliminary Release (Version 0.5).
			Combining errors and unbalanced parentheses should
			now be eliminated in the Brown and WSJ corpora, the
			tgrepable corpora are free of fatal errors, many
			technical errors in the POS-tagged files have been
			fixed, and some errors and omissions in the
			documentation have been corrected.  However, the
			material has NOT been reannotated since the
			previous release, with the exception of the WSJ
			parsed material, most of which has undergone
			substantial revision.


The new work in this release was funded by the Linguistic Data Consortium.
Previous versions of this data were primarily funded by DARPA and AFOSR
jointly under grant No. AFOSR-90-006, with additional support by DARPA
grant No. N0014-85-K0018 and by ARO grant No. DAAL 03-89-C0031 PRI.  Seed
money was provided by the General Electric Corporation under grant
No. J01746000.  We gratefully acknowledge this support.

Richard Pito deserves special thanks for providing the tgrep tool, which
proved invaluable both for preprocessing the parsed material and for
checking the final results.

We are also grateful to AT&T Bell Labs for permission to use Kenneth
Church's PARTS part-of-speech labeller and Donald Hindle's Fidditch parser.

Finally, we are very grateful to the exceptionally competent technical
support staff of the Computer and Information Science Department at the
University of Pennsylvania, including Mark-Jason Dominus, Mark Foster, and
Ira Winston.

-----------------------------------------------------------------------

[ COPYRIGHT ]

The following copyright applies to all datafiles on this CDROM:

              Copyright (C) 1995 University of Pennsylvania

      This release was annotated in Treebank II style by the Penn
      Treebank Project. We would appreciate if research using this
      corpus or based on it would acknowledge that fact.

The following articles are appropriate references for this release:

    Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A.,
    Ferguson, M., Katz, K. and Schasberger, B.  "The Penn Treebank:
    Annotating Predicate Argument Structure", in {\it Proceedings of the
    Human Language Technology Workshop}, Morgan Kaufmann Publishers Inc.,
    San Francisco, CA, March 1994.
    (This article gives an overview of the Treebank II bracketing scheme.
     A LaTeX version is included in this release, as doc/arpa94.tex.)

    Marcus, M., Santorini, B., Marcinkiewicz, M.A., 1993.
    Building a large annotated corpus of English: the Penn Treebank.
    {\it Computational Linguistics}, Vol 19.
    (This article describes the Penn Treebank in general terms.  A LaTeX
     version is included in this release, as doc/cl93.tex.)
