Wednesday, March 13, 2013

The REVISED SCRIPT


I. Introduction

The Perl script below, which extracts the lexical data from the Areopagitica, is divided into three parts.

1. The first takes a raw ASCII file, which one can get from Project Gutenberg or several other Milton sites, and formats a word list.

http://www.constitution.org/milton/areopagitica.htm : This site has an ASCII file of the Areopagitica with no markup, ready to load into your word processor.

http://www.gutenberg.org/ebooks/608 : The Gutenberg site has the Areopagitica in several formats: plain text, Kindle, EPUB, Mobi, and others.

The source in this case had minimal markup, really just a few paragraph markers for 18,000 words. The only required pre-processing was to mark each period that actually was a sentence cusp. That task was not just a global replace: it required inspecting each period and excluding abbreviations, as well as numerals such as the V. in Leo V. Another problem was that questions and their answers appear as one sentence. My strategy was to mark each sentence-final period with an XXX, crude but effective.
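A first pass along these lines can propose the cusps before the manual inspection. This is only a sketch; the abbreviation list is illustrative, not the one actually used, and question marks and quote-final sentences are left to the eye:

#!/usr/bin/perl
# First-pass cusp marker (sketch). Appends an XXX line after periods
# that look sentence-final, skipping a hand-built abbreviation list.
# Every hit still has to be checked by eye.
use strict;
use warnings;

my %abbrev = map { $_ => 1 } qw(Mr Mrs St Dr V);   # extend as found

while (my $line = <>) {
    $line =~ s{(\w+)\.(\s+)}
              { $abbrev{$1} ? "$1.$2" : "$1.\nXXX\n" }ge;
    print $line;
}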

Excerpt from xae.txt, directly below. The pre-processing involves adding XXX as sentence cusps:

"They who to States and Governours of the Commonwealth direct their Speech, High Court of Parlament, or wanting such accesse in a private condition, write that which they foresee may advance the publick good; I suppose them as at the beginning of no meane endeavour, not a little alter'd and mov'd inwardly in their mindes: Some with doubt of what will be the successe, others with fear of what will be the censure; some with hope, others with confidence of what they have to speake.
XXX
And me perhaps each of these dispositions, as the subject was whereon I enter'd, may have at other times variously affected; and likely might in these formost expressions now also disclose which of them sway'd most, but that the very attempt of this addresse thus made, and the thought of whom it hath recourse to, hath got the power within me to a passion, farre more welcome then incidentall to a Preface.
XXX
Which though ..."

NOTE: each sentence is separated by a cusp. The counting and printing of page numbers comes late in the process. For now, each processing run starts with the raw cusp file. The point is to allow adjustments to the initial input source without elaborate copying routines each time the source changes; working from the word list and updating it whenever the source changes just adds unnecessary confusion.

During the analysis and re-reading, printed pages do have a place, at least in my world.

The word list puts each word on a line with its sequential number in the text:

1 - They
2 - who
3 - to
4 - States
  ...
84 - to
85 - speake.

words in sen - 85
86 - And

Each word is on its own line with its sequential number in the text, e.g. "1 - They", followed by "2 - who" on the next line. Sentence cusps are marked: the previous sentence is closed, and the total number of words in the sentence is appended. The beginnings of sentences are also marked in traditional markup.
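A minimal sketch of how part 1 can produce that list, assuming the cusp file arrives on stdin (the real script does more bookkeeping):

#!/usr/bin/perl
# Cusp file in, numbered word list out: one word per line with its
# sequential number, and a per-sentence total at each XXX cusp.
use strict;
use warnings;

my $n   = 0;   # running word number in the text
my $sen = 0;   # word count within the current sentence

while (my $line = <>) {
    if ($line =~ /^XXX/) {                # sentence cusp
        print "\nwords in sen - $sen\n";
        $sen = 0;
        next;
    }
    for my $word (split ' ', $line) {
        $n++; $sen++;
        print "$n - $word\n";
    }
}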

There are several reasons why I generally start with a numbered word list when I process a text. The text has to be brought under the control of an algorithm; there are many ways of doing that, and one is generally no better than the next. For me the idea of one word per line comes from the old Unix days, when one could execute a small script at the command line that replaced each space with a line feed. That list could be passed to the uniq filter, which removes duplicate lines and counts them, and the counts could then be passed to a sort that puts the result in descending order. Thus, the command "step1" would create output along these lines:

320 - and
205 - the
190 - a
...
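In Perl the whole pipeline collapses into a few lines. This is the moral equivalent of tr / sort / uniq -c / sort -rn, not the script's actual code:

#!/usr/bin/perl
# Word frequency list, highest count first.
use strict;
use warnings;

my %count;
while (<>) {
    $count{$_}++ for split ' ';
}
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$word} - $word\n";
}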

From there it was just another small script to collect all the sequential numbers of the individual words, producing a vector through the text.

dich,1
258692,
dichtungen,1
086119,
die,9768
000030, 000122, 000130, 000133, 000149, 000162, 000177, 000186, 000189, 000221, 000241
000248, 000253, 000257, 000270, 000293, 000302, 000344, 000368, 000372, 000390, 000401
000406, 000463, 000505, 000520, 000524, 000541, 000576, 000581, 000590, 000641, 000697
...

In this example from the text of Husserl's Logical Investigations, three words are shown, each with its total and the vector that follows it. In the case of "die" - generally an article (the) but with other grammatical functions - there are close to 10,000 instances, and the vector starts at word 30. This set of numbers, derived from the sequential word list, can lead to many different statistical calculations.
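A minimal sketch of such a collector, assuming the raw text on stdin and folding everything to lower case (the real output wraps each vector into rows of eleven numbers):

#!/usr/bin/perl
# For each word, collect the zero-padded sequential numbers of all
# its occurrences: the vector through the text.
use strict;
use warnings;

my (%positions, $n);
while (<>) {
    for my $word (split ' ') {
        push @{ $positions{lc $word} }, sprintf "%06d", ++$n;
    }
}
for my $word (sort keys %positions) {
    my @p = @{ $positions{$word} };
    print "$word," . scalar(@p) . "\n" . join(", ", @p) . "\n";
}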

This particular function does not really interest us here with the Areopagitica. It would require too much pre-processing given the orthographic irregularities. Consequently, I plan to stick to the bi-labials for now.

2. The second part of the script extracts five different views of the BP's (bilabial plosives).

a. the individual sentences, for example, sentence 4:

213 - liberty
217 - hope,
234 - expect;
235 - but
237 - complaints
241 - deeply
244 - speedily
250 - bound
253 - liberty
SEN 004 BP 09 STOT 052 PCT BP/STOT 0.17

Each sentence's listing has three features: the opening, the data, and a closing summary giving the sentence number, the number of BP's [9], the total number of words [52], and the percentage of BP's [17%].
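A sketch of how that view can be computed in a single pass over the cusp file. The BP test here (any word containing a b or p) is an assumption on my part; the actual script's criterion may be narrower:

#!/usr/bin/perl
# Per-sentence BP view: print each BP word with its number, then a
# summary line at each cusp.
use strict;
use warnings;

my ($n, $sen, $stot, $bp) = (0, 1, 0, 0);
while (<>) {
    if (/^XXX/) {
        printf "SEN %03d BP %02d STOT %03d PCT BP/STOT %.2f\n\n",
               $sen, $bp, $stot, $stot ? $bp / $stot : 0;
        $sen++; ($stot, $bp) = (0, 0);
        next;
    }
    for my $word (split ' ') {
        $n++; $stot++;
        if ($word =~ /[bp]/i) {           # assumed BP criterion
            $bp++;
            print "$n - $word\n";
        }
    }
}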

b. a summation of the data on each sentence, for example, sentences 221 to 227:


SEN 221 BP 03 STOT 030 PCT BP/STOT 0.10
SEN 222 BP 06 STOT 033 PCT BP/STOT 0.18
SEN 223 BP 08 STOT 055 PCT BP/STOT 0.15
SEN 224 BP 13 STOT 047 PCT BP/STOT 0.28
SEN 225 BP 14 STOT 087 PCT BP/STOT 0.16
SEN 226 BP 13 STOT 088 PCT BP/STOT 0.15
SEN 227 BP 06 STOT 036 PCT BP/STOT 0.17


This output merely tabulates the summaries from the previous output.

c. a reformat of the summaries to allow easy sorting by the percentage of BP's:

[here sorted by highest percentage on top]

0.31% SEN 102 BP 04 STOT 013
0.31% SEN 038 BP 15 STOT 049
0.31% SEN 023 BP 10 STOT 032
0.30% SEN 091 BP 07 STOT 023
0.29% SEN 105 BP 06 STOT 021
0.28% SEN 224 BP 13 STOT 047
0.27% SEN 163 BP 06 STOT 022
0.27% SEN 114 BP 10 STOT 037
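The reformat is what makes the sort trivial: once the percentage leads the line, a reverse numeric sort does the rest. A sketch, reading the summary lines from view b:

#!/usr/bin/perl
# View c: reorder the summary fields so PCT comes first, then sort
# descending on it.
use strict;
use warnings;

my @rows;
while (<>) {
    push @rows, [$4, $1, $2, $3]
        if /SEN (\d+) BP (\d+) STOT (\d+) PCT BP\/STOT ([\d.]+)/;
}
for my $r (sort { $b->[0] <=> $a->[0] } @rows) {
    printf "%.2f%% SEN %03d BP %02d STOT %03d\n", @$r;
}

The same trick with STOT as the key gives view d.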

d. a reformat of the summaries to allow easy sorting by the length of sentences:

[here the five sentences with the most words; the columns are word count, BP's, percentage, and sentence number]
197 29 0.15 - 186
165 31 0.19 - 292
154 19 0.12 - 335
153 21 0.14 - 183
152 21 0.14 - 243

e. the summaries in HTML form:

[HTML table "Sentences and Data in Table Format" not preserved in this copy.]
3. The third and last part of the script collects data in arrays and prints out views that depend on calculations. The previous parts generated their output while the processing occurred; the final part parks the data in arrays and then extracts information and makes calculations on data that was not ready until the sequential processing was complete.

For example, the output below shows, for each BP, its distance to the previous BP, its distance to the next BP, and their sum. A small combined figure thus pinpoints a run of three consecutive BP's. By following short or long gaps one can investigate dense or sparse occurrences of BP's. It may well be possible to do some vector razzle-dazzle; in this case the rubber meets the road in rereading passages to see if some sound pattern emerges.


16, 012, 000012, 004, Speech,
11, 004, 000016, 007, Parlament,
17, 007, 000023, 010, private
13, 010, 000033, 003, publick
08, 003, 000036, 005, suppose
22, 005, 000041, 017, beginning
21, 017, 000058, 004, doubt
13, 004, 000062, 009, be
14, 009, 000071, 005, be
14, 005, 000076, 009, hope,
12, 009, 000085, 003, speake.

The table above shows the first sentence. The columns are the combined gap, the distance from the previous BP, the word number, the distance to the next BP, and the word itself. The first BP sits at word 12, so its distance from the beginning of the file is 12; the second BP is at word 16, the third at 23. Thus the distance from the word "private" (word 23) to its two neighbors (words 16 and 33) is 7 + 10 = 17. That combined figure is the gap.
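A sketch of the gap computation, assuming a simple intermediate input of one BP per line (word number, then the word):

#!/usr/bin/perl
# Gap table: for each BP, the distance to its previous and next BP
# neighbors and their sum, in the column order described above.
use strict;
use warnings;

my (@pos, @word);
while (<>) {
    my ($p, $w) = split ' ', $_, 2;
    chomp $w;
    push @pos, $p;
    push @word, $w;
}
for my $i (0 .. $#pos) {
    my $prev = $i == 0     ? $pos[$i]          : $pos[$i] - $pos[$i-1];
    my $next = $i == $#pos ? 0                 : $pos[$i+1] - $pos[$i];
    printf "%02d, %03d, %06d, %03d, %s\n",
           $prev + $next, $prev, $pos[$i], $next, $word[$i];
}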



10, 006, 000213, 004, liberty
21, 004, 000217, 017, hope,
18, 017, 000234, 001, expect;
03, 001, 000235, 002, but
06, 002, 000237, 004, complaints
07, 004, 000241, 003, deeply
09, 003, 000244, 006, speedily
09, 006, 000250, 003, bound
16, 003, 000253, 013, liberty

The table above shows sentence four. There you can see some dense occurrences. The point of this table is to sort on the basis of the gap.


SEN 4
diff 6 213 - liberty
diff 4 217 - hope,
diff 17 234 - expect;
diff 1 235 - but
diff 2 237 - complaints
diff 4 241 - deeply
diff 3 244 - speedily
diff 6 250 - bound
diff 3 253 - liberty
SEN 004 BP 09 STOT 052 PCT BP/STOT 0.17





Above you can see the series of differences: 1, 2, 4, 3, 6, 3. The arrays can be tapped for any number of views, depending on the questions that arise during the reading. It is clear that at this early stage this one script can give insight only into the distribution of BP's. That is why it was written.
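As one illustration of tapping the arrays, here is a sketch of the diff view just shown. The parallel arrays are stand-ins for whatever the real script accumulates, and the BP at word 207 (with a placeholder word) is inferred from the opening diff of 6, not taken from the text:

#!/usr/bin/perl
# Diff view for one sentence, drawn from parked parallel arrays.
use strict;
use warnings;

my @pos  = (207, 213, 217, 234);            # BP word numbers; 207 assumed
my @sen  = (3,   4,   4,   4);              # sentence of each BP
my @word = ('(?)', 'liberty', 'hope,', 'expect;');   # '(?)' is a placeholder

sub diff_view {
    my $want = shift;
    print "SEN $want\n";
    my $prev = 0;
    for my $i (0 .. $#pos) {
        printf "diff %d %d - %s\n", $pos[$i] - $prev, $pos[$i], $word[$i]
            if $sen[$i] == $want;
        $prev = $pos[$i];                   # track across sentence borders
    }
}

diff_view(4);   # diff 6 213 - liberty / diff 4 217 - hope, / diff 17 234 - expect;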


II. THE SCRIPT


[The script listing is missing from this copy of the post.]