Digital Humanities vs. Prof. Stanley Fish: March 2012

Friday, March 9, 2012

APPENDIX C: The perl Programs

A (belated) NOTE: My plans for this programming project have not worked out, and I just found out that there were problems with the perl code displayed in this post. I have rethought the problem and posted the "revised" code with further explanations in a new series of posts. At present, however, textareas are not a happy solution. Naively, I thought a "code" tag would take care of everything. That is how I would have done it. It does some things but cannot deal with the angle brackets of a while loop. All robots should keep their skinny fingers off the text inside the "code" tag. Who did away with that anyway? Alas. Anyway, my apologies to all twelve persons who clicked this post. The problem is fixed, but it was a lot of trouble. ANYWAY, everything about last year's blogs still stands; I have learned much about Milton and will continue to post. cheers, PB

NOTE: A future blog will present some discussion of the programs. In this blog it is important to appreciate the applications of "programming" to one of the classics of the English language, the Areopagitica. The format for the presentation of the six perl programs will be:

a. INPUT, FILENAME, INPUT DATA SAMPLE,
b. PROGRAM STATEMENTS, PROGRAM LOGIC,
c. OUTPUT, FILENAME, OUTPUT DATA SAMPLE.

1. Program ONE - Creating a list of words one word per line. Counting sentences.

INPUT: The text of the Areopagitica. Minor pre-processing was done to mark sentence cusps with XXX. This is done as a convenience and aid for modern readers, e.g. me, since the 17c. prose style requires intense parsing of individual sentences. (Sample only, the actual program parses all sentences, ca. 300; sample contains: SENTENCES 1 to 3 ... and LAST SENTENCE.)

FILENAME: ae.txt

START INPUT FOR PROGRAM ONE

They who to States and Governours of the Commonwealth direct their Speech, High Court of Parlament, or wanting such accesse in a private condition, write that which they foresee may advance the publick good; I suppose them as at the beginning of no meane endeavour, not a little alter'd and mov'd inwardly in their mindes: Some with doubt of what will be the successe, others with fear of what will be the censure; some with hope, others with confidence of what they have to speake.
XXX
And me perhaps each of these dispositions, as the subject was whereon I enter'd, may have at other times variously affected; and likely might in these formost expressions now also disclose which of them sway'd most, but that the very attempt of this addresse thus made, and the thought of whom it hath recourse to, hath got the power within me to a passion, farre more welcome then incidentall to a Preface.
XXX
Which though I stay not to confesse ere any aske, I shall be blamelesse, if it be no other, then the joy and gratulation which it brings to all who wish and promote their Countries liberty; whereof this whole Discourse propos'd will be a certaine testimony, if not a Trophey.
XXX
.
.
.
But of these Sophisms and Elenchs of marchandize I skill not: This I know, that errors in a good government and in a bad are equally almost incident; for what Magistrate may not be mis-inform'd, and much the sooner, if liberty of Printing be reduc't into the power of a few; but to redresse willingly and speedily what hath bin err'd, and in highest autority to esteem a plain advertisement more then others have done a sumptuous bribe, is a vertue (honour'd Lords and Commons) answerable to Your highest actions, and whereof none can participat but greatest and wisest men.
XXX

END OF INPUT FOR PROGRAM ONE

START PROGRAM BELOW ("###" indicates comments on the statements to the left.)


#!c:\perl                     ###Conventional first line: says where perl lives.

open (H, "ae.txt");           ###Queue up text file for processing.
open (OUT, ">wordlist.txt");  ###Queue empty file for the wordlist.

$ln=0;               ###Sentence count is 0.

while ($line = <H>){ #########Start reading TEXT - get line.

chomp $line;   ###Remove trailing line feed - don't worry about this.
               ###Line feeds, i.e. characters indicating a new line,
               ###get in the way. Remove them from the line and add
               ###them later when you need them. The character is "\n".
               ###Look four lines below, exacting but not that weird.
               ###A place for everything and everything in its place.

if ($line =~ /XXX/){          ###Check whether the line is
                              ###a sentence cusp.
         
$ln++;                        ###Increment sentence count.
print OUT ("$ln-XXX\n");      ###Output: sentence count, dash,
                              ###sentence cusp and line feed "\n".
next;                         ###Get next line.
}

my @f = split(/ /,$line);     ##Split line into individual
                              ##words in the array @f.

foreach $f (@f){     ###Process each word in array @f.
print OUT "$f\n";    ###Put word on its own line
                     ###into the output file >wordlist.   
}                    ###End of word loop - get next word.
}                    ###End of line loop - get next line.

close H;             ###Shut down shop.
close OUT;           ###Shut down output file.


END PROGRAM ABOVE (TWO LINES)






PROGRAM STATEMENTS WITH INDENTS TO HIGHLIGHT THE LOGIC: 

#!c:\perl    

open (H, "ae.txt");                  #OPEN INPUT FILE
open (OUT, ">wordlist.txt");         #OPEN OUTPUT FILE

$ln=0;

while ($line = <H>){                    #START WHILE                 

     chomp $line; 

     if ($line =~ /XXX/){            #START IF 
          $ln++;           
          print OUT ("$ln-XXX\n");                             
          next;                         
     }                               #END IF

     my @f = split(/ /,$line);       #SPLIT line

     foreach $f (@f){                #START FOREACH
          print OUT "$f\n";    
   
     }                               #END FOREACH

}                                    #END WHILE

close H; 
close OUT;

NOTE 1: WHILE picks up one line at a time "while" there are lines in the INPUT file. It stops after the last one.

NOTE 2: IF tests whether the line is a cusp XXX or a line of text. If it is a cusp, it increments the sentence count and prints out the cusp with the sentence number. The sentence number appears AFTER the words in the sentence.

NOTE 3: SPLIT takes a line of the text of the Areopagitica and splits it into a "stack" of words, literally a stack. It is called an ARRAY; the name of the array is "@f", but it could be any name with an "@" in front.

NOTE 4: FOREACH processes each item in the stack (above) one at a time. In this case it just adds it to the bottom of the OUTPUT file with a line feed, "\n".
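
The four notes can also be tried out without the file ae.txt. What follows is only a sketch, not the program that produced the results in this post: the subroutine name words_from_lines is made up for illustration, it works on lines held in memory, and it splits on ' ' (any run of blanks) rather than on a single blank, which is a swapped-in choice so that a double space does not produce an empty "word".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): turn lines of
# text into a one-word-per-line list, numbering each XXX sentence cusp.
sub words_from_lines {
    my @lines = @_;
    my $ln    = 0;                  # sentence count
    my @out;
    foreach my $line (@lines) {
        chomp $line;                # remove the trailing "\n"
        if ($line =~ /XXX/) {       # sentence cusp
            $ln++;
            push @out, "$ln-XXX";
            next;
        }
        # split ' ' breaks the line at any run of whitespace
        push @out, split(' ', $line);
    }
    return @out;
}

# Abbreviated sample input; note the double space before "States".
my @list = words_from_lines(
    "They who to  States\n",
    "XXX\n",
    "And me perhaps\n",
    "XXX\n",
);
print "$_\n" for @list;
```

The sample prints the seven words one per line, with 1-XXX after "States" and 2-XXX after "perhaps", the same layout as wordlist.txt.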








OUTPUT: List of words from the Areopagitica, sentences one to three and the last sentence, sentence cusps marked and counted. (Sample only shown, dots indicate gaps; the program produces the complete list, 18,000 words on 18,000 lines in 357 sentences.) This is considered a tiny dataset. A small first step.



FILENAME: wordlist.txt



START OUTPUT PROGRAM ONE

They
who
to
States
and
Governours
of
the
Commonwealth
.
.
.
beginning
of
no
meane
endeavour,
not
a
little
alter'd
and
mov'd
inwardly
in
their
mindes:
Some
.
.
.
others
with
confidence
of
what
they
have
to
speake.
1-XXX
And
me
perhaps
each
of
these
dispositions,
as
.
.
.
power
within
me
to
a
passion,
farre
more
welcome
then
incidentall
to
a
Preface.
2-XXX
Which
though
.
.
.
be
a
certaine
testimony,
if
not
a
Trophey.
3-XXX
.
.
.
[ed: last sentence below]
But
of
these
Sophisms
and
Elenchs
of
marchandize
I
skill
not:
This
I
know,
that
.
.
.
sooner,
if
liberty
of
Printing
be
reduc't
into
the
power
of
a
few;
but
to
redresse
willingly
.
.
.
whereof
none
can
participat
but
greatest
and
wisest
men.
357-XXX



END OF OUTPUT PROGRAM ONE








2. Program TWO - counting the words



INPUT: List of words from the Areopagitica, sentence one to three, sentence cusps marked and counted.



FILENAME: wordlist.txt (SAMPLE ONLY, shortened for this text . . .)



START INPUT

They

who

to

States

and

Governours

of

the

Commonwealth

direct

their

Speech,

High

Court

of

Parlament,

or

.

.

.

confidence

of

what

they

have

to

speake.

1-XXX

And

me

perhaps

each

of

these

dispositions,

as

the

subject

was

whereon

I

enter'd,

.

.

.

welcome

then

incidentall

to

a

Preface.

2-XXX

Which

though

I

stay

not

to

confesse

.

.

.

will

be

a

certaine

testimony,

if

not

a

Trophey.

3-XXX

END OF INPUT






START PROGRAM BELOW

#!c:\perl

open (H, "wordlist.txt");
open (OUT, ">wordnum.txt");

$n=1;                      ###Running word position in the whole text.
$m=0;                      ###Word count within the current sentence.

while ($line = <H>){
if ($line =~ /XXX/){       ###START IF - If it is a sentence
                           ###cusp, do this below.

foreach $senlist (@senlist){        ###Start processing senlist array:
                                    ###walk through the stack @senlist
                                    ###one stored word at a time.
                                    ###We are practicing manipulating arrays.
print OUT ("$n","-","$senlist");    ###Put out the word and its position,
$n++;                               ###increment the running word counter,
$m++;                               ###increment the words-in-sentence counter, and
}                                   ###drop out of sentence cusp "if".
                                    ###Do not process the "else";
                                    ###we are practicing "if - else".
}else{                    ###Else it is a word - do this below:
push (@senlist, $line);   ###push each word of a sentence on a stack,
next;                     ###get the next word.
}                         ###END OF IF
print OUT ("$m","-","$line"); ###OUT: number of words in sen, dash and marker
                              ###XXX. Line already has the number of sen.
                              ###You will get here only if you drop through
                              ###the IF. The ELSE will take you up to
                              ###get the next line. That is where the test
                              ###"cusp or word" takes place; there are no
                              ###words "XXX". If you test positive for XXX in
                              ###IF, you process the array @senlist, i.e.
                              ###you count the words and when finished with
                              ###the sentence you end up here. All a bit
                              ###cryptic, I know, but it works.
$m=0;                         ###Reset the words-in-sentence counter.
undef @senlist;               ###Reset the array for the words of the next sen.
}                             ###END OF WHILE LOOP - get next line if there
                              ###is one, else, all done close up shop.
close OUT;
close H;

END PROGRAM ABOVE






PROGRAM STATEMENTS WITH INDENTS TO HIGHLIGHT THE LOGIC: 

#!c:\perl

open (H, "wordlist.txt");
open (OUT, ">wordnum.txt");

$n=1;

while ($line = <H>){                         #START WHILE

     if ($line =~ /XXX/){                    #START IF

          foreach $senlist (@senlist){       #START FOREACH
               print OUT ("$n","-","$senlist");
               $n++;
               $m++;
          }                                  #END FOREACH

     }else{                                  #END IF    START ELSE
          push (@senlist, $line);
          next;
     }                                       #END ELSE

     print OUT ("$m","-","$line");
     $m=0;
     undef @senlist;

}                                            #END OF WHILE LOOP
close OUT;
close H;

NOTE 1: WHILE picks up a line from the file "wordlist.txt" and stops after the last cusp XXX.

NOTE 2: IF tests whether the line is a cusp or a word.

NOTE 3: FOREACH starts IF the line is a cusp XXX. That means take the array "@senlist" and print each item in the stack (first in, first out) with the word counter "$n" in front. The cusp line itself is then printed with the number of words in the sentence in front, viz. 85-1-XXX.

NOTE 4: ELSE (i.e. NOT IF) pushes the current line (from WHILE) onto the array (stack) "@senlist".

The idea is to hold the sentence stacked in the array and, at the next cusp, take the words out one at a time (first in first out, second in second out) and then go get the next line from WHILE.
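
The hold-and-release idea can be tried on a handful of words held in memory. This is a hedged sketch rather than the program itself: the subroutine name number_words is made up, and the words-in-sentence count is taken from the size of the array instead of being incremented in the loop, which gives the same number.

```perl
use strict;
use warnings;

# Hypothetical helper: number the words of a word list and prefix each
# cusp line with the word count of the sentence just finished.
sub number_words {
    my @lines = @_;
    my $n = 1;              # running position in the whole text
    my @senlist;            # words of the current sentence, in order
    my @out;
    foreach my $line (@lines) {
        chomp $line;
        if ($line =~ /XXX/) {
            # foreach walks the array from first to last, so the
            # words come out in the order they went in
            foreach my $word (@senlist) {
                push @out, "$n-$word";
                $n++;
            }
            my $m = scalar @senlist;    # words in this sentence
            push @out, "$m-$line";      # e.g. 85-1-XXX
            @senlist = ();              # empty the array for the next sentence
        } else {
            push @senlist, $line;
        }
    }
    return @out;
}

# Abbreviated sample: two very short "sentences".
print "$_\n" for number_words("They", "who", "1-XXX", "And", "2-XXX");
```

The sample prints 1-They, 2-who, 2-1-XXX, 3-And, 1-2-XXX, one per line: the running position keeps counting across the cusp while the sentence count starts over.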




OUTPUT: The first 100 words - sentence cusp marked and counted at position 85, for example.



FILENAME: wordnum.txt



START OUTPUT

1-They

2-who

3-to

4-States

5-and

6-Governours

7-of

8-the

9-Commonwealth

10-direct

11-their

12-Speech,

13-High

14-Court

15-of

16-Parlament,

17-or

18-wanting

19-such

20-accesse

21-in

22-a

23-private

24-condition,

25-write

26-that

27-which

28-they

29-foresee

30-may

31-advance

32-the

33-publick

34-good;

35-I

36-suppose

37-them

38-as

39-at

40-the

41-beginning

42-of

43-no

44-meane

45-endeavour,

46-not

47-a

48-little

49-alter'd

50-and

51-mov'd

52-inwardly

53-in

54-their

55-mindes:

56-Some

57-with

58-doubt

59-of

60-what

61-will

62-be

63-the

64-successe,

65-others

66-with

67-fear

68-of

69-what

70-will

71-be

72-the

73-censure;

74-some

75-with

76-hope,

77-others

78-with

79-confidence

80-of

81-what

82-they

83-have

84-to

85-speake.

85-1-XXX

86-And

87-me

88-perhaps

89-each

90-of

91-these

92-dispositions,

93-as

94-the

95-subject

96-was

97-whereon

98-I

99-enter'd,

100-may (file continues to 18,000)

END OUTPUT










3. Program THREE - Extracting the words with bi-labials.



INPUT: The first 100 words - sentence cusp marked and counted at position 85.



FILENAME: wordnum.txt (SAMPLE ONLY)



START INPUT

1-They

2-who

3-to

4-States

5-and

6-Governours

7-of

.

.

.

53-in

54-their

55-mindes:

56-Some

57-with

58-doubt

59-of

60-what

61-will

62-be

63-the

64-successe,

65-others

66-with

67-fear

68-of

.

.

.

79-confidence

80-of

81-what

82-they

83-have

84-to

85-speake.

85-1-XXX

86-And

87-me

88-perhaps

89-each

90-of

91-these

92-dispositions,

93-as

94-the

95-subject

96-was

97-whereon

98-I

99-enter'd,

100-may

END INPUT






START PROGRAM BELOW

#!c:\perl

open (H, "wordnum.txt");      ###Queue up list of words and their position.
open (OUT, ">pbnum.txt");     ###Queue empty file for the bi-labials.
$i=0;                         ###Start bi-labial counter at 0.
while ($line = <H>){          ###Start reading file with words and positions.
                              ###Get a word or a cusp marker.
if ($line =~ /XXX/) {            ###Check whether the line is
                                 ###a sentence cusp.
print OUT ("$i","-","$line");    ###IF cusp, OUTPUT bi-lab count and
                                 ###the line with the cusp marker, which
                                 ###also carries the number of words in the
                                 ###sentence and the sen. number, viz.
                                 ###85-1-XXX becomes 11-85-1-XXX.
                                 ###READ: 11 bi-labs, 85 words, sen 1.
$i=0;                            ###Reset bi-lab counter for next sen.
next;                            ###Get next line, word or cusp.
}
if ($line =~ /b/) {              ###The next statements test for the various
                                 ###forms of bi-labials. The format is the
                                 ###same for each: 1. find it, 2. increment
                                 ###counter, 3. print out line with word
                                 ###containing bi-lab and its position
                                 ###in text, 4. get next line.
$i++;
print OUT ("$line");
next;
}
if ($line =~ /p/) {
$i++;
print OUT ("$line");
next;
}
if ($line =~ /B/) {
$i++;
print OUT ("$line");
next;
}
if ($line =~ /P/) {
$i++;
print OUT ("$line");
next;
}
}                                ###END of WHILE LOOP - get next line
                                 ###if the previous word did not have
                                 ###a bi-lab. Had it had a bi-lab, a
                                 ###"next" in one of the IF's would
                                 ###have gotten the next line.
                                 ###Not that weird - algorithmic.
close OUT;
close H;

END PROGRAM ABOVE
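
The four tests for b, p, B and P can also be written as a single pattern. The following is a sketch only, with a made-up subroutine name (extract_pb) and lines held in memory instead of read from wordnum.txt; the character class [bp] with the /i switch is a swapped-in shorthand that should match the same lines as the four separate tests.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the numbered words that contain b or p
# (either case) and prefix each cusp line with the count for its sentence.
sub extract_pb {
    my @lines = @_;
    my $i = 0;                      # pb words in the current sentence
    my @out;
    foreach my $line (@lines) {
        if ($line =~ /XXX/) {       # cusp: report the count, then reset
            push @out, "$i-$line";
            $i = 0;
            next;
        }
        if ($line =~ /[bp]/i) {     # b, p, B or P anywhere in the line
            $i++;
            push @out, $line;
        }
    }
    return @out;
}

# Abbreviated sample taken from the numbered word list.
print "$_\n" for extract_pb("12-Speech,", "13-High", "33-publick", "85-1-XXX");
```

The sample keeps 12-Speech, and 33-publick, drops 13-High, and turns the cusp into 2-85-1-XXX. Like the original, this matches letters, not sounds, which is why words such as "doubt" and "Trophey." appear in the list.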






OUTPUT: Words with bp are extracted and formatted with their position in the text. Sentence cusp marked: number of pb words, total words in sentence, sentence number, cusp marker XXX.



FILENAME: pbnum.txt



START OUTPUT

12-Speech,

16-Parlament,

23-private

33-publick

36-suppose

41-beginning

58-doubt

62-be

71-be

76-hope,

85-speake.

11-85-1-XXX

88-perhaps

92-dispositions,

95-subject

113-expressions

122-but

126-attempt

144-power

149-passion,

157-Preface.

9-72-2-XXX

170-be

171-blamelesse,

174-be

184-brings

190-promote

193-liberty;

198-propos'd

200-be

207-Trophey.

9-50-3-XXX

END OUTPUT






4. Program FOUR - Formatting the Bi-labials for further study



INPUT: words containing b or p, with numerical position in the text. Sentence cusps with number of pb words, total number of words, sentence number, cusp marker XXX.



FILENAME: pbnum.txt (SAMPLE ONLY)



START INPUT

12-Speech,

16-Parlament,

23-private

33-publick

36-suppose

41-beginning

58-doubt

62-be

71-be

76-hope,

85-speake.

11-85-1-XXX

88-perhaps

92-dispositions,

95-subject

113-expressions

122-but

126-attempt

144-power

149-passion,

157-Preface.

9-72-2-XXX

170-be

171-blamelesse,

174-be

184-brings

190-promote

193-liberty;

198-propos'd

200-be

207-Trophey.

9-50-3-XXX

213-liberty

217-hope,

234-expect;

235-but

237-complaints

241-deeply

244-speedily

250-bound

253-liberty

9-52-4-XXX

END INPUT






START PROGRAM

#!c:\perl

open (G, "pbnum.txt");
open (OUT, ">pbnum2.txt");

my $flag = "one";

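while ($line = <G>){          ###Get a pb word or a cusp marker.

if ($line =~ /XXX/) {         ###Cusp lines pass through unchanged;
print OUT ("$line");          ###the line still carries its own "\n".
next;
}

my @numb = split(/-/, $line); ###Split "12-Speech," at the dash.

$hold=$numb[0];               ###Position of the word in the text.
$word=$numb[1];               ###The word itself (not used further).

if ($flag eq "one"){          ###First pb word: measure from word 1.
$flag = "two";
$diff =($hold - 1);
$prev = $hold;
} else {                      ###Otherwise: distance to previous pb word.
$diff=$hold-$prev;
$prev=$hold;
}

print OUT ("diff ");
print OUT ("$diff ");
print OUT ("$line");
}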
close OUT;
close G;

END






OUTPUT: The label "diff", the distance in words to the previous pb word, then the pb word with its numerical position in the text, as in INPUT. Sentence cusps marked as before in INPUT.



FILENAME: pbnum2.txt



START OUTPUT

diff 11 12-Speech,

diff 4 16-Parlament,

diff 7 23-private

diff 10 33-publick

diff 3 36-suppose

diff 5 41-beginning

diff 17 58-doubt

diff 4 62-be

diff 9 71-be

diff 5 76-hope,

diff 9 85-speake.

11-85-1-XXX

diff 3 88-perhaps

diff 4 92-dispositions,

diff 3 95-subject

diff 18 113-expressions

diff 9 122-but

diff 4 126-attempt

diff 18 144-power

diff 5 149-passion,

diff 8 157-Preface.

9-72-2-XXX

diff 13 170-be

diff 1 171-blamelesse,

diff 3 174-be

diff 10 184-brings

diff 6 190-promote

diff 3 193-liberty;

diff 5 198-propos'd

diff 2 200-be

diff 7 207-Trophey.

9-50-3-XXX

END OUTPUT
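
The "diff" column is simply the position of a pb word minus the position of the pb word before it. A minimal sketch of that step, assuming the dash-separated lines of pbnum.txt; the subroutine name add_diffs is made up, and starting the first gap from word 1 follows what the program above does.

```perl
use strict;
use warnings;

# Hypothetical helper: given lines like "12-Speech,", return lines like
# "diff 11 12-Speech,". The first gap is measured from word 1.
# Cusp lines (XXX) pass through unchanged and do not reset the gap.
sub add_diffs {
    my @lines = @_;
    my $prev = 1;                       # position of the previous pb word
    my @out;
    foreach my $line (@lines) {
        if ($line =~ /XXX/) {
            push @out, $line;
            next;
        }
        my ($pos) = split /-/, $line;   # the number before the first dash
        my $diff = $pos - $prev;
        $prev = $pos;
        push @out, "diff $diff $line";
    }
    return @out;
}

# The first three pb words of the text.
print "$_\n" for add_diffs("12-Speech,", "16-Parlament,", "23-private");
```

For these three words the gaps come out as 11, 4 and 7.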










5. Program FIVE - Calculate ratio of PB words for each sentence.



INPUT: The sentence cusps hold the total words in each sentence and the pb words. We are practicing programming.



FILENAME: pbnum2.txt (SAMPLE ONLY)



START INPUT

diff 11 12-Speech,

diff 4 16-Parlament,

diff 7 23-private

diff 10 33-publick

diff 3 36-suppose

diff 5 41-beginning

diff 17 58-doubt

diff 4 62-be

diff 9 71-be

diff 5 76-hope,

diff 9 85-speake.

11-85-1-XXX

diff 3 88-perhaps

diff 4 92-dispositions,

diff 3 95-subject

diff 18 113-expressions

diff 9 122-but

diff 4 126-attempt

diff 18 144-power

diff 5 149-passion,

diff 8 157-Preface.

9-72-2-XXX

diff 13 170-be

diff 1 171-blamelesse,

diff 3 174-be

diff 10 184-brings

diff 6 190-promote

diff 3 193-liberty;

diff 5 198-propos'd

diff 2 200-be

diff 7 207-Trophey.

9-50-3-XXX

END INPUT






START PROGRAM BELOW

#!c:\perl

open (W, "pbnum2.txt");
open (OUT, ">pbstat2.txt");


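while ($line = <W>){
if ($line =~ /XXX/) {                   ###Only the cusp lines carry the counts.
my @numb = split(/-/,$line);            ###Split "11-85-1-XXX" at the dashes.
$pb=$numb[0];                           ###Number of pb words.
$stot=$numb[1];                         ###Total words in the sentence.
$sennum=$numb[2];                       ###Sentence number.
if ($pb == 0){                          ###No pb words: mark the sentence ZERO.
print OUT ("ZERO, $stot, $sennum\n");
next;
} else {
$percent=($pb/$stot);                   ###Ratio of pb words to total words.
printf OUT ("%3.2f",$percent);          ###Two decimal places, e.g. 0.13.
print OUT (", $pb, $stot, $sennum\n");
}
}
}
close OUT;
close W;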

END PROGRAM






OUTPUT: Ratio pb/total, pb words, total words in sentence, sentence number.



FILENAME: pbstat2.txt



START OUTPUT

0.13, 11, 85, 1

0.13, 9, 72, 2

0.18, 9, 50, 3

0.17, 9, 52, 4

0.10, 8, 78, 5

0.14, 10, 69, 6

0.15, 19, 123, 7

0.17, 7, 42, 8

0.16, 15, 94, 9

0.14, 11, 78, 10

0.10, 8, 78, 11

0.10, 5, 51, 12

0.08, 4, 48, 13

0.19, 6, 31, 14

0.14, 18, 129, 15

0.19, 19, 99, 16

0.26, 11, 42, 17

0.18, 16, 88, 18

0.22, 11, 49, 19

0.15, 13, 89, 20

0.21, 6, 28, 21

0.09, 5, 53, 22

0.31, 10, 32, 23

0.03, 1, 39, 24

0.13, 12, 89, 25

END OUTPUT
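
Each output line comes from one cusp line: 11-85-1-XXX means 11 pb words out of 85, and 11/85 rounds to 0.13. A sketch of that single step, with a made-up subroutine name (cusp_to_ratio); the third sample line, a sentence without pb words, is invented to show the ZERO case.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a cusp line such as "11-85-1-XXX" into
# "0.13, 11, 85, 1" (ratio, pb words, total words, sentence number).
sub cusp_to_ratio {
    my ($line) = @_;
    my ($pb, $stot, $sennum) = split /-/, $line;
    return "ZERO, $stot, $sennum" if $pb == 0;     # numeric test for zero
    return sprintf("%3.2f, %d, %d, %d", $pb / $stot, $pb, $stot, $sennum);
}

# Two cusp lines from the data, plus one invented line with no pb words.
print cusp_to_ratio($_), "\n" for ("11-85-1-XXX", "9-52-4-XXX", "0-20-9-XXX");
```

Because the ratio is rounded to two decimal places when it is written out, any later selection by percentage works on the rounded figure, not on the exact one.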








6. Program SIX - Sort the percentages in descending order. Also select sentences over 14%.



INPUT: File with ratios or percentages of pb words.



FILENAME: pbstat2.txt (sample only)



START INPUT

0.13, 11, 85, 1

0.13, 9, 72, 2

0.18, 9, 50, 3

0.17, 9, 52, 4

0.10, 8, 78, 5

0.14, 10, 69, 6

0.15, 19, 123, 7

0.17, 7, 42, 8

0.16, 15, 94, 9

0.14, 11, 78, 10

0.10, 8, 78, 11

0.10, 5, 51, 12

0.08, 4, 48, 13

0.19, 6, 31, 14

0.14, 18, 129, 15

0.19, 19, 99, 16

0.26, 11, 42, 17

0.18, 16, 88, 18

0.22, 11, 49, 19

0.15, 13, 89, 20

0.21, 6, 28, 21

0.09, 5, 53, 22

0.31, 10, 32, 23

0.03, 1, 39, 24

0.13, 12, 89, 25

END INPUT






START PROGRAM

#!c:\perl

open (W, "pbstat2.txt");
open (OUT, ">pbstat3.txt");
open (SM, ">pbstat3sm.txt");

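while ($line = <W>){

if ($line =~ /Z/){                   ###Skip the ZERO sentences.
next;
}
push (@pblist, $line);               ###Keep every line for the full sort.
my @g = split(/,/, $line);           ###$g[0] is the ratio.
if ($g[0] > .14){                    ###Numeric test: ratio over 14%.
push (@pblistsm, $line);
}
}
@sortpb = sort {$b <=> $a} @pblist;  ###Descending; <=> reads the leading
                                     ###number of each line, i.e. the ratio.

print OUT @sortpb;                   ###Each line still ends in "\n".
print SM @pblistsm;                  ###Selected lines, in sentence order.
close OUT;
close SM;
close W;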

END
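
The sort and the selection both depend only on the first field of each line. A sketch that makes this explicit, using four lines from pbstat2.txt held in memory; the helper name ratio_of is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first comma-separated field as a number.
sub ratio_of {
    my ($line) = @_;
    my ($ratio) = split /,/, $line;
    return $ratio + 0;
}

# Four lines from the sample data (sentences 1, 3, 5 and 6).
my @pblist = (
    "0.13, 11, 85, 1",
    "0.18, 9, 50, 3",
    "0.10, 8, 78, 5",
    "0.14, 10, 69, 6",
);

# Sort on the ratio only, highest first.
my @sorted = sort { ratio_of($b) <=> ratio_of($a) } @pblist;

# Numeric comparison: keep sentences whose ratio is above 0.14.
my @over = grep { ratio_of($_) > 0.14 } @pblist;

print "$_\n" for @sorted;
print "over 0.14: $_\n" for @over;
```

Here the sorted order is 0.18, 0.14, 0.13, 0.10, and only sentence 3 passes the test, since 0.14 itself is not over 0.14. A string comparison ("gt") would agree on these data because every ratio is written as 0.xx, but the numeric ">" does not depend on that formatting.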






OUTPUT 1: Sorted Percentages.



FILENAME: pbstat3.txt



START OUTPUT 1

0.31, 15, 49, 38

0.31, 10, 32, 23

0.31, 4, 13, 105

0.30, 7, 23, 94

0.29, 6, 21, 108

0.28, 7, 25, 56

0.28, 13, 47, 228

0.27, 10, 37, 117

0.27, 6, 22, 168

0.26, 11, 42, 17

0.26, 18, 70, 349

0.26, 9, 35, 171

0.25, 7, 28, 54

0.25, 9, 36, 131

0.25, 3, 12, 146

0.25, 7, 28, 46

0.25, 2, 8, 321

0.24, 8, 34, 212

0.24, 6, 25, 58

0.24, 4, 17, 291

0.24, 8, 34, 356

0.24, 14, 58, 254

0.24, 8, 34, 172

0.23, 10, 44, 151

0.23, 5, 22, 154

END OUTPUT 1



OUTPUT 2: Percentages over 14.



FILENAME:  pbstat3sm.txt



START OUTPUT 2

0.18, 9, 50, 3

0.17, 9, 52, 4

0.15, 19, 123, 7

0.17, 7, 42, 8

0.16, 15, 94, 9

0.19, 6, 31, 14

0.19, 19, 99, 16

0.26, 11, 42, 17

0.18, 16, 88, 18

0.22, 11, 49, 19

0.15, 13, 89, 20

0.21, 6, 28, 21

0.31, 10, 32, 23

0.19, 13, 67, 26

0.15, 6, 39, 27

.

.

.

0.23, 8, 35, 206

0.17, 5, 29, 208

0.18, 5, 28, 209

0.24, 8, 34, 212

0.17, 14, 80, 213

0.15, 5, 33, 217

0.15, 6, 40, 218

0.22, 15, 67, 222

0.17, 8, 46, 223

0.22, 11, 51, 224

0.18, 6, 33, 226

0.15, 8, 55, 227

0.28, 13, 47, 228

0.16, 14, 87, 229

0.15, 13, 88, 230

0.17, 6, 36, 231

0.18, 5, 28, 233

0.19, 8, 43, 234

0.21, 6, 28, 236

0.15, 6, 40, 237

0.21, 4, 19, 239

0.18, 14, 76, 243

0.15, 13, 87, 244

0.15, 10, 68, 250

0.19, 18, 95, 251

0.24, 14, 58, 254

0.21, 21, 99, 257

0.15, 14, 95, 258

0.17, 7, 42, 260

0.21, 7, 33, 262

0.17, 8, 46, 264

0.16, 12, 76, 266

0.19, 9, 48, 272

0.15, 4, 27, 273

0.15, 5, 33, 276

0.22, 13, 58, 277

0.17, 5, 29, 282

0.15, 13, 88, 288

0.24, 4, 17, 291

END OUTPUT 2

Wednesday, March 7, 2012

APPENDIX B: Programming! huh! What is it good for ...

Julian Alps, Slovenia

In the previous post I attempted a guided tour of how an extremely modest humanities programming project would start. It is important to remember that the structure here is still: "individual facing a text." One person in front of one text with a computer as an aid. The computer in this case is an inexpensive laptop running a free programming environment called perl. The computing skill level required is modest; advanced skills are required only when entering statistical work or database functions, which may involve other programs.

This presentation is not designed for the experts who may do it differently, better, and quicker. There are quite a few packages that deliver all-inclusive services and many people have toolboxes of their own in the language of their choice. Here, one of the questions is about programming in general. Is it worth the investment to learn it, given that one has the intellectual capital to invest? Gotta have extra smarts. I maintain that mastering the techniques of programming and actually conceiving and implementing programs gives one important insight into the digital world. Talking about it does nothing; it is like talking about sex to a Shaker. No need to blog on this. This is for humanists who essentially agree with Fish, they can't read so they run the numbers.

One can think of it as learning to ride a quarter horse and using a lariat to cut a heifer out of the herd. It is learning a craft and one gains great insight into the constitution of texts. I see text programming as a form of craftsmanship.

Previously, I have attempted a guided interior monologue that considers the indictment that should wipe digital humanities from the windshield of the 18 wheeler driven by Prof. Fish.

I have tried to show that, even if we don't know exactly what a chiasmus is, and even if we can be browbeaten into accepting "Bishops and Presbyters" as a chiasmus, which we do only under muttered protest accompanied by much whining and sniffling, we can deal with the bi-labials. In the time it takes to parse four of Milton's sentences in the 70 words category, I can have a complete list of bi-labials, with the ratios to the total words in each sentence formatted in a few different views. That should be worth something.

I realize that the lists do not give a peak experience as Prof. Fish had and as he reported. However, having had the experience, and wanting to give the experience a more permanent form, this quick work of several hours could answer the question of the title: "What is it good for?"

I have explained that Fish is using the rhetorical mode of speaking from authority: his example, his authority, his insight, in your face. The result is that digital humanities is cast in the role of the half-grown stripling who can't sing, can't dance. OK, I accept on behalf of digital humanists everywhere.

But we also have a story to tell that is certainly as old as Prof. Fish is, if not as old as Milton commentary. Thus I will interpose one more theoretical section before showing some actual programs.

* * *

The purpose here is to put some meat on the bones of one tiny aspect of digital humanities, manipulating text streams with perl, extracting and displaying features. In this exercise I will present a few perl programs useful for all sorts of texts. The idea is to build a toolbox that can be used in various situations. My main experience is with philosophy texts and literature. These techniques can also be used to generate queries of databases and display formatted output from the web to the web by adding a few skill sets.

The purpose is not to offer an interpretation of the Areopagitica with this list. True, numbers have been run. Some output was shown in the previous post. That does not mean that Prof. Fish can now rip apart the list of words or the numbers. There is nothing to rip apart. Sentence four has 52 words of which 9 are bp. Sentence 5 has 78 and 8, respectively. That means 17% of the words in sentence 4 are bp and 10% of sentence 5. It will turn out that three sentences have a percentage of 31 and one of 30 and so on and so forth.

The way to attack and destroy is to consider this strategy of approaching a text silly and irrelevant. But if I had made some big deal in the electronic NYTimes about bi-labial stops in Milton, I would at least glance at the list briefly and casually over the top of my glasses.

Basically I am saying: you want bi-labials, here they are. You think your eye/ear is better than the computer, ok maybe, you spent a life on it. But let us look at it sentence by sentence to determine if it seems likely Milton has made a phonetic composition. Basically, I don't trust you. Don't trash talk lit. crit. at me, put the ball in the basket. Where does it begin, this bp-experience? Where does it end? Let's find that out for starters. Let us see if that can happen for everybody or if yours is the only instrument that can hear these sounds, which I am prepared to consider: literature mediated through Fish.

* * *

However, before we start, [Note: feel free to skip ahead, this is an old rhetorical form called repetition: I am going to go through the whole thing again in case some were not listening], some more words must be written on the question of "Programming: What is it good for?"

In some circles of on-line discussion among humanists, a recurring theme is: do we really have to learn programming, do we really? The discussion generally dissolves at some point. The main line humanists think that e-mail is as far as they need to go. The programmers point to tools they have created, and the support people think of clever pedagogy to get all sorts of people involved at whatever level. I could think of similar discussions in the middle ages: reading, what good is it? Or in today's middle schools: ritin, whafo?

There should be no illusion, learning to read as an adult is hard. Learning to program is hard unless you are ten. Why is that? The scientists of the field, the CS people, think computing is simple: "process, iteration, test."

At some point they say that this kind of automation was all invented by simple weavers in the 19th century, and looking at a picture on the computer is no different than looking at a design woven in cloth. That may all be true, but it does not help. Learning how to weave was hard. Ok, one may have learned how to throw the shuttle back and forth, but programming a loom to make a picture is no easy thing. There are several levels from design through set-up through execution that require considerable initiation and applied talent.

Process - iteration - test is the simple formula. Of course, before that there is data. For the weavers it was string - string horizontal, string vertical, a matrix of different colors of different lengths makes a pattern. If you are clever enough, that pattern could be the City of London, date 1529. It could be a woven page from the Areopagitica, date 1644, although I pity the poor weaver setting up the loom.

In our case, the data is words. Words arranged in sequential sentences, separated by punctuation, arranged from one to n - n being the last word on the last page. That means we do not know the number of the last page till we count them. We also do not know the number of the last sentence till we count them, or the last word, ditto.

I say programming text is hard at the start, unless you get what I just wrote, unless it makes sense, there is no reason to argue and juxtapose some other view. It is what it is; it is a different way to look at text.

1. The Areopagitica has 18,000 words.
2. The Areopagitica has 350 sentences.
3. The number of pages has become uninteresting except when writing to others, at which point any text will have to be located on a given page or on a given line in a given edition. That is how one could find things in texts 50 years ago, the only way, and it is still around. Explains a lot really. So forget the division into pages; that is merely an arbitrary presentation by the printer.

[Note: I should add that division into pages and reading from pages has real problems. 1. it is a completely arbitrary structure which habituates the reader. 2. certain parts of a "verso-recto" or "even-odd" double page presentation may be consistently short-changed by the attention efficiency of the eye-ball contact. I do notice that in myself.]

The single page scroll in a computer window is more elegant, especially if the chunks of text are presented sensibly divided (not pages). The addiction to pages is not easy to shake, been there.

That also means we have to find a way to harmonize the old way of citations with a new way that does not hang on different page numbers in different editions. Total nonsense really, printing up multiple paginations going back to some 16th century folio. One new way that I have used is to simply not quote at all and encourage the reader to put the string in quote marks and let Google find the reference. Works. Consider it. Everything required is linked.

This is hard for humanists to accept, because the fact that there are 18,000 words and 350 sentences seems irrelevant information to the typical humanist. But for the programmer, there is enough there already to weave a story.

For example: Of the 18,000 words, which is the most frequent word that carries some semantic meaning? Give up? books 70, book 24, booke 2, bookes 2, writings 6, writt'n 10, folio 1, and a few more.
Milton is difficult since the spelling is not standardized. But that can be overcome with lists that will be used to search as if they were one word.
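The idea of a list that is searched "as if it were one word" can be sketched in a few lines of perl. This is only a minimal sketch under stated assumptions: the variant list and the sample sentence are made up for illustration, and it is not the code of the programs in the next post.

```perl
use strict;
use warnings;

# Hypothetical variant list: spellings to be treated as one lexical item.
my @book_variants = qw(book books booke bookes);
my %is_variant    = map { $_ => 1 } @book_variants;

# Made-up sample sentence, not a quotation from the Areopagitica.
my $sample = "Books are not dead things; a good Booke is precious, and bookes endure.";

# Lower-case every word so that "Books" and "books" count as the same form.
my @words = map { lc($_) } ( $sample =~ /[A-Za-z']+/g );

my %count;
for my $word (@words) {
    $count{$word}++ if $is_variant{$word};
}

# Add up the individual spellings to get one figure for the whole list.
my $total = 0;
$total += $_ for values %count;

print "$_ $count{$_}\n" for sort keys %count;
print "all variants together: $total\n";
```

The per-spelling counts stay visible, so nothing is lost by also reporting the combined figure.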

So we could ask: "What other questions could we ask?" How many "and" are there? 800. I have a hunch that a census of and-pairs could be interesting, it always is.
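A census of and-pairs could start from something like the following. It is a rough sketch, assuming that a "pair" is simply the word before and the word after each "and"; the sample sentence is made up, and punctuation is ignored, so some of the pairs found this way would have to be discarded by hand.

```perl
use strict;
use warnings;

# Made-up sample, not a quotation from the text.
my $sample = "States and Governours write of doubt and fear, of hope and confidence, and of truth.";

my @words = map { lc($_) } ( $sample =~ /[A-Za-z']+/g );

# Assumption: an and-pair is the word before "and" plus the word after it.
my @pairs;
for my $i ( 1 .. $#words - 1 ) {
    next unless $words[$i] eq 'and';
    push @pairs, join( ' ', @words[ $i - 1 .. $i + 1 ] );
}

print scalar(@pairs), " and-pairs found\n";
print "$_\n" for @pairs;
```

The last pair in this sample ("confidence and of") is not a real pair in any rhetorical sense; a list like this is an entry point for rereading, not a result.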

Programming is hard for humanists because it is so different from sequential reading. A sequential reading of the 350 sentences of the Areopagitica, given problems with the references, with the language, with the grammar, with the meaning of words, with fatigue and more could take days. After that investment, it is not really certain what one actually has put into the bank. Certain temptations arise to achieve some closure with the text. One could go to some reference work to find a convenient place to hang the text and forget all this parsing. Get a modern edition or a German translation. One may have already done that tracking down references. One could flip through articles in JSTOR. One could find the imprimatured monographs on the text and on Milton. Or, one could make an outline and reading notes and let it go. Or get the Cliff Notes.

With programming, on the other hand, real processes have to be developed that can be iterated, tested and reiterated. So what does that mean, how does one do that? Of course there is Prof. Fish's reading of the P's and B's. I could take all the sentences that contain the two letters and calculate the ratio of words with P's and B's to the total words in the sentence.
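The ratio just described could be computed roughly as follows. This is a sketch, not the actual program: it assumes that sentence cusps are marked with XXX, as in the pre-processed input for Program ONE, and that a "pb word" is any word containing the letter p or b in either case. The sample text is made up.

```perl
use strict;
use warnings;

# Made-up sample; sentence boundaries are marked with XXX.
my $text = "Bishops and Presbyters dispute. XXX The sun rose. XXX A bold pamphlet appeared.";

my @ratios;    # one record per sentence: [ number, pb words, total words, ratio ]
my $n = 0;
for my $sentence ( split /\s*XXX\s*/, $text ) {
    my @words = ( $sentence =~ /[A-Za-z']+/g );
    next unless @words;    # skip empty chunks
    $n++;

    # Assumption: a "pb word" is any word containing p or b.
    my $pb    = grep { /[pb]/i } @words;
    my $total = scalar @words;

    push @ratios, [ $n, $pb, $total, $pb / $total ];
    printf "sentence %d: %d/%d = %.0f%%\n", $n, $pb, $total, 100 * $pb / $total;
}
```

Sorting the records by the ratio would then bring the sentences with the densest p's and b's to the top of the list.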

OK, that could be interesting (skip to the next blog to see the programs - use a new window!):

The highest ratio is 31%. Three sentences have that: 23, 38 and 105. The ratios are 10/32, 15/49 and 4/13. In sentence 228 we have 13 pb's, 34 non-pb's and 47 total words in a substantial sentence. In sentence 221 (the Bishops and Presbyters sentence) we have 15 with and 120 total in another monster sentence. So I would be tempted to look for sentences around 214 to the end with high ratios. Of course all ratios will show up and have to be interpreted.

You can see, Fish and I are going separate ways. Fish, expert on Milton, is reading the Areopagitica; he glimpses a phonetic pattern, or better, a phonetic pattern ambushes him, or as he would want it, his acute senses, finely tuned to Milton's prose have spotted a pattern and he now launches a full blown interpretation full of the villains that Milton is smiting with phonemes, intentionally. [Aside: He does smite right and left, but I would like to focus on the inspirational, not on the martial. The Areopagitica has not been read through history for Milton's view of church hierarchies.]

Milton's weapons are the sounds b and p, bi-labial stops. These sounds are meant to hurt [according to Fish] Bishops by reminding them how similar their phonetic profile is to that of Presbyters. It is meant to wound Presbyters by the same token. [Note: forms of the word "presbyter" occur only five times. Other carriers of the phonemes, e.g. "prelate" ca. 14 times, reinforce the meaning carried in the phonemes, according to Fish, but are enumerated only casually by Fish.]

I have constructed some programs that count the words and letters in the Areopagitica and I am trying to find out if there are any signs of anomalous occurrences of b and p that would point to intention on Milton's part. So far the jury is still out. At present I am merely formulating a strategy for some iterative processes and tests that will yield some statistical output, some text entry points that could yield a serious second, third reading and some additional poring over the text.

Why am I doing this? Prof. Fish has chosen to attack a view of digital humanities that dismisses the field because they run the numbers first and interpret based on what they find. Well, yes, before an empiricist says: "I think this yellow effluent is sulfur," some tests will be run; it may be orange juice, it may be water-based acrylic, it may be poly-chloryl-hydro-phenol, the favorite food of lake-trout which happens to turn yellow when mixed with water. There are many examples in science where ideas were rejected because a scenario full of hope did not produce the numbers.

This case is more complicated. Once Prof. Fish has an idea, there are no more checks and balances. Completely off-the-wall pronouncements come to substantiate his claims. Since "Bishops and Presbyters" are lumped together in one sentence the term "opposite" in one instance 10 sentences later attains special status. What the ... is that all about?

Of course, this all has an air of plausibility, one is tempted to just accept, Milton, out of my field, sounds good, and move on. However, there is a human sensibility more acute and sensitive than any Prof. Fish can bring to bear on bi-labials in Milton: the suspicion of being "made to look foolish," "to be made fun of," "to be laughed at." The cook and the waiter are looking through the round windows in the swinging door to the kitchen and the cook chuckles: "Look, they are eating it."

Once that suspicion raises its head, the acuity of the interpretation goes on a full-court press. Legs apart, knees bent, palms of the hand give a quick slap to the floor to indicate serious business, body up, hands active; no sloppy pass will make it through the waving hands and quick feet, back-court trap.

"...the effort to block free expression “meets for the most part with an event utterly opposite to the end which it drives at.” The stressed word in this climactic sentence is “opposite.” Can it be an accident that a word signifying difference has two “p’s” facing and mirroring each other across the weak divide of a syllable break? Opposite superficially, but internally, where it counts, the same." [Fish 1-26-2012]

This is where I get off the bus: "opposite" becomes "sameness" because of the syllables "op" and "po." Even granted that the word may be significant, there must be a better way to explain it. The next sentence kills me: "Opposite superficially, but internally, where it counts, the same." If that does not make you laugh then you obviously have not studied literature. Hence all the head-bobbing from the computer people. I can only ask: "Are you going to eat this just because it comes from somewhere where you assume there is a kitchen, and it has been brought by someone who could be confused with a waiter?"

Of course, Fish does pose the proposition (pos - pro - pos get it?) as a question. If you answer yes, you will surely be on "Candid Camera."

Below is the "climactic sentence":

|p230
Although their own late arguments and defences against the Prelats might remember them that this obstructing violence meets for the most part with an event utterly opposite to the end which it drives at: instead of suppressing sects and schisms, it raises them and invests them with a reputation: The punishing of wits enhaunces their autority, saith the Vicount St Albans, and a forbidd'n writing is thought to be a certain spark of truth that flies up in the faces of them who seeke to tread it out.

I quote the sentence from Milton [directly above] to give the reader a chance to work on the puzzle. Milton uses the term "Prelats," juxtaposed to "their ... arguments" that should "remind them." The "them" here are "Presbyters." Here is where Fish makes the point that they are the same by phonetic inference, added to Milton manifestly just saying so plainly. Paraphrase: Remember how useless censorship was in the hands of the Prelates against your own sect-building.

The term "opposite" is curious. The house "opposite" is really also a house, they are "opposite" spatially, although they may be the "same" structurally.

However, "an event utterly opposite" has a less ambiguous meaning.

For example: 1. We had hoped to have a happy, relaxed two week vacation in Transoxmenistan, snowboarding the local pistes. 2. We were swept up in a wave of kidnappings and suffered from acute cholera before the Marines rescued us out of a dank cave on the last day before our plane took us back to the States. No similarity here. Utterly "opposite" experiences.

Below you can read Milton's only other use of the term "opposite." Note the sentence number. I suspect there is a reference to the retrograde motion of the planets, a spatial use of opposite.

|p264
Who can discern those planets that are oft Combust, and those stars of brightest magnitude that rise and set with the Sun, untill the opposite motion of their orbs bring them to such a place in the firmament, where they may be seen evning or morning.

To finish this section I cite the five sentences before the first "opposite" reference to make sure you can see what the antecedent to "them" might be.

|p225
To startle thus betimes at a meer unlicenc't pamphlet will after a while be afraid of every conventicle, and a while after will make a conventicle of every Christian meeting.
|p226
But I am certain that a State govern'd by the rules of justice and fortitude, or a Church built and founded upon the rock of faith and true knowledge, cannot be so pusillanimous.
|p227
While things are yet not constituted in Religion, that freedom of writing should be restrain'd by a discipline imitated from the Prelats, and learnt by them from the Inquisition to shut us up all again into the brest of a licencer, must needs give cause of doubt and discouragement to all learned and religious men.
|p228
Who cannot but discern the finenes of this politic drift, and who are the contrivers; that while Bishops were to be baited down, then all Presses might be open; it was the peoples birthright and priviledge in time of Parlament, it was the breaking forth of light.
|p229
But now the Bishops abrogated and voided out of the Church, as if our Reformation sought no more, but to make room for others into their seats under another name, the Episcopall arts begin to bud again, the cruse of truth must run no more oyle, liberty of Printing must be enthrall'd again under a Prelaticall commission of twenty, the privilege of the people nullify'd, and which is wors, the freedom of learning must groan again, and to her old fetters; all this the Parlament yet sitting.

In my view: With the Bishops gone, reformed out of the church, Presbyters are now taking up the Episcopal ways and the openings on the commission of 20 with the mission of censorship are filled by eager Protestants. All that is clear enough if you can fight through the irony.

I will let you try to parse the sentences above to see if something jumps out.

The curious thing is that the whole exercise on the professor's part is to demonstrate to digital humanists the superior "statistical pattern recognition" of the human mind, the mind of Fish. This is coupled with a sentence by Fish that says, since there are only 26 letters in the alphabet, computers are bound to find repeating patterns.

To counter this characterization, and I probably have to agree that many people who have not really thought very much about the topic might come to just this conclusion: they let the computer do the work for them, I am trying to flesh out the "run the numbers" part.

I am taking some pains to explain the difference between humanistic procedures and programming. I am trying to explain the difficult task, hard to learn for a humanist, of holding back on data, not jumping to conclusions, waiting for data to sprout. "Pretending" at least, that one does not have all the answers. Suspend the certainty. Why is this hard? Much of reading produces no data, it produces lived experience. It can be affected by environmental variables, alcohol, chocolate cake, a rainy day, jack-hammers on the sidewalk. All that is left is a memory of having had an experience, perhaps a memorable experience. The reflex is to judge.

In subsequent encounters with that text, will the strategy be to recreate that experience, or will the task be to find something new, something that has not yet risen to awareness? At what point of the text (what page or paragraph or sentence) will you enter to find something new? Start at the beginning again? Look for margin notes? Post-its of different colors? Make an outline? Given that ideas are bubbling up between the ears, it may be difficult to be open for something different. All these questions can benefit from a suite of perl programs, an index and retrieval engine, even just a simple "find" in a word processor. This requires a stop in the digital humanities store to pick up some Milton texts.

We are at an early stage in this process from my side. I have the Areopagitica in an index retrieval engine now and I can see that there are only two references to "Presbyters" specifically, and none after sentence 220. I do have all sorts of references to all forms of "prelate," 14 in all - that could be considered significant compared to five "Presbyters."

If Prof. Fish had a digital humanist at his side, he would have been counseled to look at those entry points into the text.

The point is that I will not fight my way through to an interpretation of the Areopagitica today or tomorrow. I may do a thorough semantic inventory, I may find some anomalous features, and it still may not be enough for an "original" interpretation. I will, however be able to talk to someone who knows the Areopagitica well and compare notes. I could probably teach the Areopagitica even to graduate students, at least to get them to assert power over text by using the virtual memory with pinpoint access to lexical items instead of rhetorical handwaving and the penetrating stares to make the extemporized more permanent. We already have the notes and the paraphrases and the historical contexts. We know we can get off on phonemes. (off - ffo, get it?) What else is there? Ours is a gentle art. We let the words go through processes, perform iterations, and the processes lead us to sections we can reread. At some point, some inescapable "aha experience" manifests.

We are hunters, we have studied the movement of words, we know where they congregate. And when we find them, they are really there, all of them, not just an impression of them, and we always find them. This is an introduction, an attempt to set out the prerequisites of programming for humanists. Just because a humanist has never learned programming does not make programming uninteresting. Quite the opposite, a humanist who has learned programming can be very interesting and build spectacular monuments to the power of algorithmic systems displayed on an international network.

My bet on the Areopagitica lies with the "and-pairs." Of course everybody can work the "and-pairs," there are 800. Repeated phrases are also a promising avenue. Sentence length could be important here. In any case, the chase is never over until it is over, when the feature is in the bag. If it will not jump in the bag, it is a burger joint for supper and live to hunt another day.

* * *

The programs I will present, next post, would be called verbose in computer jargon. Long ago, when computing was expensive, an ethic of minimal, elegant efficiency evolved. Today, with programming essentially free and without memory limits, exceptions noted, that is no longer an issue; still programmers will look at code and roll their eyes. I have stuck with a verbose style because it saves me time trying to figure out just how clever I was the last time I looked at the program. Verbose is also good so that novices can follow the logic; much of this work could be done with a single line. Gives you something to shoot for.

Yet I find perl (and php) excellent to show the steps of the algorithm. First we open a file of words, we take a word, we count it (give it its number e.g. the 34th word in the Areopagitica) then we print out the number 34 and the word. Next line.
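Those steps can be sketched in the same verbose style. This is a minimal illustration, not the actual Program ONE: the input is held in a string so that the sketch runs on its own, where the real programs would open a file such as ae.txt, and the rule for what counts as a word (letters and apostrophes) is an assumption.

```perl
use strict;
use warnings;

# Stand-in for the input file; these are the opening words of the text.
my $text = "They who to States and Governours\nof the Commonwealth direct their Speech";

# Open the string as if it were a file.
open( my $in, '<', \$text ) or die "cannot open input: $!";

my @numbered;
my $position = 0;
while ( my $line = <$in> ) {

    # Take each word in turn, give it its number, keep number and word.
    for my $word ( $line =~ /[A-Za-z']+/g ) {
        $position++;
        push @numbered, "$position $word";
    }
}
close $in;

# One numbered word per line.
print "$_\n" for @numbered;
```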

Once we have a list of the words with their position number, 1st, 2nd, 3rd ...18,000th, we can start work. For example: we can take the word "friends" (and foes) and go through the list and pick out all the place numbers for the word "friends." Then we print out the word friends and the sequence of numbers - that is a vector that can be graphed or compared to other vectors: friends 14, 69, 1120, 1125, 1130, 2400 etc. This is just an example, not data. [Note: when I say "print" in the context of perl, I do not mean print on paper, I mean "print to a file." Much of programming jargon evolved from analogies to words on paper, virtually.]
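Picking out the place numbers for one word might look like this. Again a sketch only: the word list is made up and stands in for the full numbered list, so the positions it prints are illustrative, not data.

```perl
use strict;
use warnings;

# Made-up word list standing in for the numbered list; not data from the text.
my @words  = qw(friends and foes alike are friends to truth and friends to none);
my $target = 'friends';

my @positions;
my $position = 0;
for my $word (@words) {
    $position++;    # 1st, 2nd, 3rd ...
    push @positions, $position if $word eq $target;
}

# The word followed by its vector of positions.
print "$target ", join( ', ', @positions ), "\n";
```

The array @positions is the vector; it can be written to a file, graphed, or compared with the vector for another word.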

Since we cannot count on consistent spelling with all texts it helps to go through the first word list and count and remove all duplicates. Then you have a list that may look like this: fried 2, friend 14, friends 23, frown 2, frowns 1 etc.
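Counting and removing duplicates is a standard job for a perl hash. A minimal sketch, with a made-up word list in place of the real one:

```perl
use strict;
use warnings;

# Made-up word list; the real input would be the one-word-per-line file.
my @words = qw(friend frown friends fried friend friends frowns fried friend);

# Each distinct spelling becomes one key; the value is how often it occurred.
my %count;
$count{$_}++ for @words;

# Alphabetical order puts variant spellings next to each other.
for my $word ( sort keys %count ) {
    print "$word $count{$word}\n";
}
```

Because the list is sorted, near-identical spellings end up as neighbours, which makes it easier to spot the variants by eye.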

One can also range the multiple positions in the text for a single word or a list of words.

For example: the three references to "fried" would look like this:

fried, 3, 483, 4500, 8220

The data structure would be A. lexical item B. total in text (3) C. three positions 483, 4500, 8220 (a vector).
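One way to hold such records in perl is a hash that maps each lexical item to its list of positions; the total is then just the length of the list. This is a sketch under that assumption, with made-up (position, word) pairs; the names %positions_of and @records are illustrative only.

```perl
use strict;
use warnings;

# Made-up (position, word) pairs; the positions echo the example above.
my @numbered = ( [ 483, 'fried' ], [ 1200, 'frown' ], [ 4500, 'fried' ], [ 8220, 'fried' ] );

# A. lexical item => C. its positions (the vector).
my %positions_of;
for my $pair (@numbered) {
    my ( $position, $word ) = @$pair;
    push @{ $positions_of{$word} }, $position;
}

# One record per item: item, B. total in text, then the positions.
my @records;
for my $word ( sort keys %positions_of ) {
    my @vector = @{ $positions_of{$word} };
    push @records, join( ', ', $word, scalar @vector, @vector );
}

print "$_\n" for @records;
```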

Such a file can be used to compile all the members of a semantic category. So you can find - friend, friends, comrades, colleagues, or foe, foes, enemy, enemies as well as non-standard spellings. Then you are in a position to search the whole semantic category rather than a lexical item.
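Searching a whole semantic category then amounts to merging the position vectors of its members. The sketch below assumes a position index like the one described above; the categories, the forms in them and all the numbers are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical position index: word => list of positions (made-up numbers).
my %index = (
    friend  => [ 14, 1120 ],
    friends => [ 69, 2400 ],
    foe     => [1125],
    foes    => [70],
);

# Hypothetical categories; non-standard spellings would simply be added here.
my %category = (
    friend => [qw(friend friends comrades colleagues)],
    foe    => [qw(foe foes enemy enemies)],
);

my %category_positions;
for my $name ( sort keys %category ) {
    my @positions;
    for my $form ( @{ $category{$name} } ) {

        # Forms that never occur in the text contribute nothing.
        push @positions, @{ $index{$form} || [] };
    }

    # Merge into one vector in text order.
    $category_positions{$name} = [ sort { $a <=> $b } @positions ];
    print "$name: ", join( ', ', @{ $category_positions{$name} } ), "\n";
}
```

With both vectors in hand, positions where a "friend" form and a "foe" form sit close together are natural entry points for rereading.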

The usefulness of this is that you can find, in rapid succession, the phrase "friends and foes" and you can find out in what context Milton is trying to build bridges to his former hated opponents in his anti-censorship crusade. "Friend nor foe" is a one-off, too bad, but I know that now.

Of course this can also be done by reading and remembering, but finding the exact entry points into the text is very helpful, especially in subsequent readings when precision is required. In this case "friends and foes" looks like a single reference, but there may be other constructions that say the same thing. The point is, and it is an important one - I know friends is a single reference. I can stop wondering.

It is really as simple as that. It is easy to understand for researchers who are actually working on a text or a series of texts. These researchers are holding their texts in memory as they are formulating descriptive analyses or whatever. When you approach them with an indexing engine with their text loaded, they immediately start to pepper you with questions. It starts with, 'Can you show me this word? and this one, and this one.' What they want is to make their mental image of the text more precise, more reliable. And they are terrified of having missed something important.

That, in short, is the "discrete situation." Limited mortal human with a short term memory of 5 words at a time and a path to long term memory that is as certain as crossing the Medina in Marrakech with 8 dozen eggs and four bottles of Oulmes. Good luck finding Bab Doukkala to deposit whatever you are trying to remember and good luck finding it again in a week.

The fact of the matter is that not everybody is working on texts, I mean really poring over a text for months. Many are simply working with intellectual constructs that will not benefit from a program that provides entry points to a text based on semantic information. However, I do think that it is important to understand what this is all about. I maintain that if one understands, one will demand to have such a tool in hand when one is studying a text. That is becoming increasingly easy.

Why is this not the case, generally, in the profession? Why can Prof. Fish get up and make fun of text searching? Perhaps literary studies is a kind of athletic contest where researchers pit their memory against each other. Sort of like wrestlers with ideas who want to pin an opponent to the ground. It may always have been like that, a battle of rhetoric. Perhaps indexing is unsporting; it is cheating the existential situation; it is pitting the Samurai against automatic weapons. I personally do not put the contest above all. I have observed the impact of tools on science. I do not think literary studies will be harmed by tools helping to locate semantic information in electronic versions of texts.

coming next, the programs ...