APPENDIX B: Programming! huh! What is it good for ...
[Image: Julian Alps, Slovenia]
In the previous post I attempted a guided tour of how an extremely modest humanities programming project would start. It is important to remember that the structure here is still: "individual facing a text." One person in front of one text with a computer as an aid. The computer in this case is an inexpensive laptop running a free programming environment called perl. The computing skill level required is modest; advanced skills are required only when entering statistical work or database functions, which may involve other programs.
This presentation is not designed for the experts who may do it differently, better, and quicker. There are quite a few packages that deliver all-inclusive services and many people have toolboxes of their own in the language of their choice. Here, one of the questions is about programming in general. Is it worth the investment to learn it, given that one has the intellectual capital to invest? Gotta have extra smarts. I maintain that mastering the techniques of programming, and actually conceiving and implementing programs, gives one important insight into the digital world. Talking about it does nothing; it is like talking about sex to a Shaker. No need to blog on this. This is for humanists who essentially agree with Fish, they can't read so they run the numbers.
One can think of it as learning to ride a quarter horse and using a lariat to cut a heifer out of the herd. It is learning a craft and one gains great insight into the constitution of texts. I see text programming as a form of craftsmanship.
Previously, I have attempted a guided interior monologue that considers the indictment that should wipe digital humanities from the windshield of the 18 wheeler driven by Prof. Fish.
I have tried to show that, even if we don't know exactly what a chiasmus is, and even if we can be browbeaten into accepting "Bishops and Presbyters" as a chiasmus, which we do only under muttered protest accompanied by much whining and sniffling, we can deal with the bi-labials. In the time it takes to parse four of Milton's sentences in the 70 words category, I can have a complete list of bi-labials, with the ratios to the total words in each sentence formatted in a few different views. That should be worth something.
I realize that the lists do not give a peak experience as Prof. Fish had and as he reported. However, having had the experience, and wanting to give the experience a more permanent form, this quick work of several hours could answer the question of the title: "What is it good for?"
I have explained that Fish is using the rhetorical mode of speaking from authority: his example, his authority, his insight, in your face. The result is that digital humanities is cast in the role of the half-grown stripling who can't sing, can't dance. OK, I accept on behalf of digital humanists everywhere.
But we also have a story to tell that is certainly as old as Prof. Fish is, if not as old as Milton commentary. Thus I will interpose one more theoretical section before showing some actual programs.
* * *
The purpose here is to put some meat on the bones of one tiny aspect of digital humanities, manipulating text streams with perl, extracting and displaying features. In this exercise I will present a few perl programs useful for all sorts of texts. The idea is to build a toolbox that can be used in various situations. My main experience is with philosophy texts and literature. These techniques can also be used to generate queries of databases and display formatted output from the web to the web by adding a few skill sets.
The purpose is not to offer an interpretation of the Areopagitica with this list. True, numbers have been run. Some output was shown in the previous post. That does not mean that Prof. Fish can now rip apart the list of words or the numbers. There is nothing to rip apart. Sentence 4 has 52 words of which 9 are bp. Sentence 5 has 78 and 8 respectively. That means 17% of the words in sentence 4 are bp and 10% of sentence 5. It will turn out that three sentences have a percentage of 31 and one of 30 and so on and so forth.
The way to attack and destroy is to consider this strategy of approaching a text silly and irrelevant. But if I had made some big deal in the electronic NYTimes about bi-labial stops in Milton, I would at least glance at the list briefly and casually over the top of my glasses.
Basically I am saying: you want bi-labials, here they are. You think your eye/ear is better than the computer, ok maybe, you spent a life on it. But let us look at it sentence by sentence to determine if it seems likely Milton has made a phonetic composition. Basically, I don't trust you. Don't trash talk lit. crit. at me, put the ball in the basket. Where does it begin, this bp-experience? Where does it end? Let's find that out for starters. Let us see if that can happen for everybody or if yours is the only instrument that can hear these sounds, which I am prepared to consider: literature mediated through Fish.
* * *
However, before we start, [Note: feel free to skip ahead, this is an old rhetorical form called repetition: I am going to go through the whole thing again in case some were not listening], some more words must be written on the question of "Programming: What is it good for?"
In some circles of on-line discussion among humanists, this is a recurring theme about do we really have to learn programming, do we really. The discussion generally dissolves at some point. The main line humanists think that e-mail is as far as they need to go. The programmers point to tools they have created, and the support people think of clever pedagogy to get all sorts of people involved at whatever level. I could think of similar discussions in the middle ages: reading, what good is it? Or in today's middle schools: ritin, whafo?
There should be no illusion, learning to read as an adult is hard. Learning to program is hard unless you are ten. Why is that? The scientists of the field, the CS people, think computing is simple: "process, iteration, test."
At some point they say that this kind of automation was all invented by simple weavers in the 19th century, and looking at a picture on the computer is no different than looking at a design woven in cloth. That may all be true, but it does not help. Learning how to weave was hard. Ok, one may have learned how to throw the shuttle back and forth, but programming a loom to make a picture is no easy thing. There are several levels from design through set-up through execution that require considerable initiation and applied talent.
Process - iteration - test is the simple formula. Of course, before that there is data. For the weavers it was string - string horizontal, string vertical, a matrix of different colors of different lengths makes a pattern. If you are clever enough, that pattern could be the City of London, date 1529. It could be a woven page from the Areopagitica, date 1644, although I pity the poor weaver setting up the loom.
In our case, the data is words. Words arranged in sequential sentences, separated by punctuation, arranged from one to n - n being the last word on the last page. That means we do not know the number of the last page till we count them. We also do not know the number of the last sentence till we count them, or the last word, ditto.
I say programming text is hard at the start. Unless you get what I just wrote, unless it makes sense, there is no reason to argue and juxtapose some other view. It is what it is; it is a different way to look at text.
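The counting just described can be sketched in a few lines. This is only a minimal sketch, written in Python rather than perl purely for illustration. The function names are mine, and the two rules it relies on are assumptions: a sentence ends at a period, question mark or exclamation mark, and a word is a run of letters with apostrophes allowed inside (for spellings like "unlicenc't"). The sample is two sentences from the Areopagitica, standing in for the whole text.

```python
import re

# Two sentences from the Areopagitica, standing in for the full text.
SAMPLE = (
    "To startle thus betimes at a meer unlicenc't pamphlet will after a "
    "while be afraid of every conventicle, and a while after will make a "
    "conventicle of every Christian meeting. "
    "But I am certain that a State govern'd by the rules of justice and "
    "fortitude, or a Church built and founded upon the rock of faith and "
    "true knowledge, cannot be so pusillanimous."
)

def split_sentences(text):
    """Assumption: a sentence ends at . ? or ! followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def split_words(text):
    """Assumption: a word is a run of letters, apostrophes allowed inside."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

sentences = split_sentences(SAMPLE)
print(len(sentences), "sentences,", len(split_words(SAMPLE)), "words")

# Number the sentences from one to n and report the length of each.
for number, sentence in enumerate(sentences, start=1):
    print(number, len(split_words(sentence)))
```

We only learn n, the number of the last sentence or the last word, by running through the text to the end, which is the point made above.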
1. The Areopagitica has 18,000 words.
2. The Areopagitica has 350 sentences.
3. The number of pages has become uninteresting except when writing to others, at which point any text will have to be located on a given page or on a given line in a given edition. That is how one could find things in texts 50 years ago, the only way, and it is still around. Explains a lot really. So forget the division into pages, that is merely an arbitrary presentation by the printer.
[Note: I should add that division into pages and reading from pages has real problems. 1. it is a completely arbitrary structure which habituates the reader. 2. certain parts of a "verso-recto" or "even-odd" double page presentation may be consistently short-changed by the attention efficiency of the eye-ball contact. I do notice that in myself.]
The single page scroll in a computer window is more elegant, especially if the chunks of text are presented sensibly divided (not pages). The addiction to pages is not easy to shake, been there.
That also means we have to find a way to harmonize the old way of citations with a new way that does not hang on different page numbers in different editions. Total nonsense really, printing up multiple paginations going back to some 16th century folio. One new way that I have used is to simply not quote at all and encourage the reader to put the string in quote marks and let Google find the reference. Works. Consider it. Everything required is linked.
This is hard for humanists to accept, because the fact that there are 18,000 words and 350 sentences seems irrelevant information to the typical humanist. But for the programmer, there is enough there already to weave a story.
For example: Of the 18,000 words, which is the most frequent word that carries some semantic meaning? Give up? books 70, book 24, booke 2, bookes 2, writings 6, writt'n 10, folio 1, and a few more.
Milton is difficult since the spelling is not standardized. But that can be overcome with lists that will be used to search as if they were one word.
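As an illustration of searching a list of spellings as if they were one word, here is a minimal sketch in Python (not perl, and not the program used for the counts above). The variant list, the function name and the sample string are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical variant list: spellings to be counted as one lexical item.
VARIANTS = {
    "book": ["book", "books", "booke", "bookes"],
}

def count_group(text, variants):
    """Return {headword: total occurrences of any listed spelling}."""
    tokens = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    )
    # A Counter returns 0 for spellings that never occur.
    return {head: sum(tokens[s] for s in spellings)
            for head, spellings in variants.items()}

# Invented sample, not a quotation.
sample = "A good Booke is precious; bad bookes and good books differ. Book on!"
print(count_group(sample, VARIANTS))
```

Adding a newly discovered spelling means adding one entry to the list; the counting code does not change.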
So we could ask: "What other questions could we ask?" How many "and" are there? 800. I have a hunch that a census of and-pairs could be interesting, it always is.
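A census of and-pairs could start from something like the sketch below, again in Python for illustration only. It assumes that an "and-pair" means the word immediately before and the word immediately after each "and", and it ignores punctuation, so a pair that straddles a comma or a sentence boundary is counted too. Both are simplifying assumptions of mine.

```python
import re
from collections import Counter

def and_pairs(text):
    """Return (left, right) word pairs joined directly by 'and'."""
    words = [w.lower()
             for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)]
    return [(words[i - 1], words[i + 1])
            for i in range(1, len(words) - 1)
            if words[i] == "and"]

# One sentence from the Areopagitica as sample input.
sentence = (
    "But I am certain that a State govern'd by the rules of justice and "
    "fortitude, or a Church built and founded upon the rock of faith and "
    "true knowledge, cannot be so pusillanimous."
)

# The census: each pair with the number of times it occurs.
for pair, count in Counter(and_pairs(sentence)).most_common():
    print(pair, count)
```

The third pair found here, "faith and true", shows the limit of the crude rule: the real pairing is "faith and true knowledge", which is where reading takes over from counting.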
Programming is hard for humanists because it is so different from sequential reading. A sequential reading of the 350 sentences of the Areopagitica, given problems with the references, with the language, with the grammar, with the meaning of words, with fatigue and more could take days. After that investment, it is not really certain what one actually has put into the bank. Certain temptations arise to achieve some closure with the text. One could go to some reference work to find a convenient place to hang the text and forget all this parsing. Get a modern edition or a German translation. One may have already done that tracking down references. One could flip through articles in JSTOR. One could find the imprimatured monographs on the text and on Milton. Or, one could make an outline and reading notes and let it go. Or get the Cliff Notes.
With programming, on the other hand, real processes have to be developed that can be iterated, tested and reiterated. So what does that mean, how does one do that? Of course there is Prof. Fish's reading of the P's and B's. I could take all the sentences that contain the two letters and calculate the ratio of words with P's and B's to the total words in the sentence.
OK, that could be interesting (skip to the next blog to see the programs - use a new window!):
The highest ratio is 31%. Three sentences have that: 23, 38 and 105. The ratios are: 10/32 15/49 4/13 - in sentence 228 we have 13 pb's 34 non-pb's and 47 total words in a substantial sentence. In sentence 221 (the Bishops and Presbyters sentence) we have 15 with and 120 total in another monster sentence. So I would be tempted to look for sentences around 214 to the end with high ratios. Of course all ratios will show up and have to be interpreted.
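The ratio itself is simple arithmetic, and a minimal sketch may make it concrete. It is in Python rather than perl, and it assumes that a "pb word" is any word containing the letter p or b. That is letters, not phonemes, so a word like "philosophy" would count. The function name is mine.

```python
import re

def bp_ratio(sentence):
    """Return (bp_words, total_words, percent) for one sentence."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", sentence)
    # Assumption: a bp word is any word containing the letter b or p.
    bp = [w for w in words if re.search(r"[bp]", w, re.IGNORECASE)]
    percent = round(100 * len(bp) / len(words)) if words else 0
    return len(bp), len(words), percent

# Sentence 228 of the Areopagitica.
SENTENCE_228 = (
    "Who cannot but discern the finenes of this politic drift, and who "
    "are the contrivers; that while Bishops were to be baited down, then "
    "all Presses might be open; it was the peoples birthright and "
    "priviledge in time of Parlament, it was the breaking forth of light."
)

bp_words, total, percent = bp_ratio(SENTENCE_228)
print(bp_words, "of", total, "words:", percent, "%")
```

Under these assumptions the sketch finds 13 pb words out of 47 in sentence 228, the same counts reported above. The same rounding turns 10/32, 15/49 and 4/13 into 31%.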
You can see, Fish and I are going separate ways. Fish, expert on Milton, is reading the Areopagitica; he glimpses a phonetic pattern, or better, a phonetic pattern ambushes him, or as he would want it, his acute senses, finely tuned to Milton's prose have spotted a pattern and he now launches a full blown interpretation full of the villains that Milton is smiting with phonemes, intentionally. [Aside: He does smite right and left, but I would like to focus on the inspirational, not on the martial. The Areopagitica has not been read through history for Milton's view of church hierarchies.]
Milton's weapons are the sounds b and p, bi-labial stops. These sounds are meant to hurt [according to Fish] Bishops by reminding them how similar their phonetic profile is to that of Presbyters. It is meant to wound Presbyters by the same token. [Note: although forms of the word "presbyter" occur only five times. Other carriers of the phonemes e.g. "prelate" ca. 14 times, reinforce the meaning carried in the phonemes, according to Fish, but are enumerated only casually by Fish.]
I have constructed some programs that count the words and letters in the Areopagitica and I am trying to find out if there are any signs that there are any anomalous occurrences of b and p that would point to intention on Milton's part. So far the jury is still out. At present I am merely formulating a strategy for some iterative processes and tests that will yield some statistical output, some text entry points that could yield a serious second, third reading and some additional poring over the text.
Why am I doing this? Prof. Fish has chosen to put forward a view of digital humanities that dismisses the field because its practitioners run the numbers first and interpret based on what they find. Well, yes, before an empiricist says: "I think this yellow effluent is sulfur," some tests will be run. It may be orange juice, it may be water-based acrylic, it may be poly-chloryl-hydro-phenol, the favorite food of lake-trout which happens to turn yellow when mixed with water. There are many examples in science where ideas were rejected because a scenario full of hope did not produce the numbers.
This case is more complicated. Once Prof. Fish has an idea, there are no more checks and balances. Completely off-the-wall pronouncements come to substantiate his claims. Since "Bishops and Presbyters" are lumped together in one sentence, the term "opposite" in one instance 10 sentences later attains special status. What the ... is that all about?
Of course, this all has an air of plausibility, one is tempted to just accept, Milton, out of my field, sounds good, and move on. However, there is a human sensibility more acute and sensitive than any Prof. Fish can bring to bear on bi-labials in Milton: the suspicion of being "made to look foolish," "to be made fun of," "to be laughed at." The cook and the waiter are looking through the round windows in the swinging door to the kitchen and the cook chuckles: "Look, they are eating it."
Once that suspicion raises its head, the acuity of the interpretation goes on a full-court press. Legs apart, knees bent, palms of the hand give a quick slap to the floor to indicate serious business, body up, hands active; no sloppy pass will make it through the waving hands and quick feet, back-court trap.
"...the effort to block free expression “meets for the most part with an event utterly opposite to the end which it drives at.” The stressed word in this climactic sentence is “opposite.” Can it be an accident that a word signifying difference has two “p’s” facing and mirroring each other across the weak divide of a syllable break? Opposite superficially, but internally, where it counts, the same." [Fish 1-26-2012]
This is where I get off the bus: "opposite" becomes "sameness" because of the syllables "op" and "po." Even granted that the word may be significant, there must be a better way to explain it. The next sentence kills me: "Opposite superficially, but internally, where it counts, the same." If that does not make you laugh then you obviously have not studied literature. Hence all the head-bobbing from the computer people. I can only ask: "Are you going to eat this just because it comes from somewhere where you assume there is a kitchen, and it has been brought by someone who could be confused with a waiter?"
Of course, Fish does pose the proposition (pos - pro - pos get it?) as a question. If you answer yes, you will surely be on "Candid Camera."
Below is the "climactic sentence":
|p230
Although their own late arguments and defences against the Prelats might remember them that this obstructing violence meets for the most part with an event utterly opposite to the end which it drives at: instead of suppressing sects and schisms, it raises them and invests them with a reputation: The punishing of wits enhaunces their autority, saith the Vicount St Albans, and a forbidd'n writing is thought to be a certain spark of truth that flies up in the faces of them who seeke to tread it out.
I quote the sentence from Milton [directly above] to give the reader a chance to work on the puzzle. Milton uses the term "Prelats," juxtaposed with "their own late arguments," which "might remember them" (that is, should remind them). The "them" here are "Presbyters." Here is where Fish makes the point that they are the same by phonetic inference, added to Milton manifestly just saying so plainly. Paraphrase: Remember how useless censorship was in the hands of the Prelates against your own sect-building.
The term "opposite" is curious. The house "opposite" is really also a house, they are "opposite" spatially, although they may be the "same" structurally.
However, "an event utterly opposite" has a less ambiguous meaning.
For example: 1. We had hoped to have a happy, relaxed two week vacation in Transoxmenistan, snowboarding the local pistes. 2. We were swept up in a wave of kidnappings and suffered from acute cholera before the Marines rescued us out of a dank cave on the last day before our plane took us back to the States. No similarity here. Utterly "opposite" experiences.
Below you can read Milton's only other use of the term "opposite." Note the sentence number. I suspect this is a reference to the retrograde motion of the planets, a spatial use of opposite.
|p264
Who can discern those planets that are oft Combust, and those stars of brightest magnitude that rise and set with the Sun, untill the opposite motion of their orbs bring them to such a place in the firmament, where they may be seen evning or morning.
To finish this section I cite the five sentences before the first "opposite" reference to make sure you can see what the antecedent to "them" might be.
|p225
To startle thus betimes at a meer unlicenc't pamphlet will after a while be afraid of every conventicle, and a while after will make a conventicle of every Christian meeting.
|p226
But I am certain that a State govern'd by the rules of justice and fortitude, or a Church built and founded upon the rock of faith and true knowledge, cannot be so pusillanimous.
|p227
While things are yet not constituted in Religion, that freedom of writing should be restrain'd by a discipline imitated from the Prelats, and learnt by them from the Inquisition to shut us up all again into the brest of a licencer, must needs give cause of doubt and discouragement to all learned and religious men.
|p228
Who cannot but discern the finenes of this politic drift, and who are the contrivers; that while Bishops were to be baited down, then all Presses might be open; it was the peoples birthright and priviledge in time of Parlament, it was the breaking forth of light.
|p229
But now the Bishops abrogated and voided out of the Church, as if our Reformation sought no more, but to make room for others into their seats under another name, the Episcopall arts begin to bud again, the cruse of truth must run no more oyle, liberty of Printing must be enthrall'd again under a Prelaticall commission of twenty, the privilege of the people nullify'd, and which is wors, the freedom of learning must groan again, and to her old fetters; all this the Parlament yet sitting.
In my view: With the Bishops gone, reformed out of the church; Presbyters are now taking up the Episcopal ways and the openings on the commission of 20 with the mission of censorship are filled by eager Protestants. All that is clear enough if you can fight through the irony.
I will let you try to parse the sentences above to see if something jumps out.
The curious thing is that the whole exercise on the professor's part is to demonstrate to digital humanists the superior "statistical pattern recognition" of the human mind, the mind of Fish. This is coupled with a sentence by Fish that says, since there are only 26 letters in the alphabet, computers are bound to find repeating patterns.
To counter this characterization, and I probably have to agree that many people who have not really thought very much about the topic might come to just this conclusion: they let the computer do the work for them, I am trying to flesh out the "run the numbers" part.
I am taking some pains to explain the difference between humanistic procedures and programming. I am trying to explain the difficult task, hard to learn for a humanist, of holding back on data, not jumping to conclusions, waiting for data to sprout. "Pretending" at least, that one does not have all the answers. Suspend the certainty. Why is this hard? Much of reading produces no data, it produces lived experience. It can be affected by environmental variables, alcohol, chocolate cake, a rainy day, jack-hammers on the sidewalk. All that is left is a memory of having had an experience, perhaps a memorable experience. The reflex is to judge.
In subsequent encounters with that text, will the strategy be to recreate that experience, or will the task be to find something new, something that has not yet risen to awareness? At what point of the text (what page or paragraph or sentence) will you enter to find something new? Start at the beginning again? Look for margin notes? Post-its of different colors? Make an outline? Given that ideas are bubbling up between the ears, it may be difficult to be open for something different. All these questions can benefit from a suite of perl programs, an index and retrieval engine, even just a simple "find" in a word processor. This requires a stop in the digital humanities store to pick up some Milton texts.
We are at an early stage in this process from my side. I have the Areopagitica in an index retrieval engine now and I can see that there are only two references to "Presbyters" specifically, and none after sentence 220. I do have all sorts of references to all forms of "prelate," 14 in all - that could be considered significant compared to five "Presbyters."
If Prof. Fish had a digital humanist at his side, he would have been counseled to look at those entry points into the text.
The point is that I will not fight my way through to an interpretation of the Areopagitica today or tomorrow. I may do a thorough semantic inventory, I may find some anomalous features, and it still may not be enough for an "original" interpretation. I will, however, be able to talk to someone who knows the Areopagitica well and compare notes. I could probably teach the Areopagitica even to graduate students, at least to get them to assert power over text by using the virtual memory with pinpoint access to lexical items instead of rhetorical handwaving and the penetrating stares to make the extemporized more permanent. We already have the notes and the paraphrases and the historical contexts. We know we can get off on phonemes. (off - ffo, get it?) What else is there? Ours is a gentle art. We let the words go through processes, perform iterations, and the processes lead us to sections we can reread. At some point, some inescapable "aha experience" manifests.
We are hunters, we have studied the movement of words, we know where they congregate. And when we find them, they are really there, all of them, not just an impression of them, and we always find them. This is an introduction, an attempt to set out the prerequisites of programming for humanists. Just because a humanist has never learned programming does not make programming uninteresting. Quite the opposite, a humanist who has learned programming can be very interesting and build spectacular monuments to the power of algorithmic systems displayed on an international network.
My bet on the Areopagitica lies with the "and-pairs." Of course everybody can work the "and-pairs," there are 800. Repeating phrases is also a promising avenue. Sentence length could be important here. In any case, the chase is never over until it is over, when the feature is in the bag. If it will not jump in the bag, it is a burger joint for supper and live to hunt another day.
* * *
The programs I will present, next post, would be called verbose in computer jargon. Long ago, when computing was expensive, an ethic of minimal, elegant efficiency evolved. Today, with programming essentially free and without memory limits, exceptions noted, that is no longer an issue; still programmers will look at code and roll their eyes. I have stuck with a verbose style because it saves me time trying to figure out just how clever I was the last time I looked at the program. Verbose is also good so that novices can follow the logic; much of this work could be done with a single line. Gives you something to shoot for.
Yet I find perl (and php) excellent to show the steps of the algorithm. First we open a file of words, we take a word, we count it (give it its number e.g. the 34th word in the Areopagitica) then we print out the number 34 and the word. Next line.
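Those steps can be sketched as follows. This is a minimal, deliberately verbose illustration in Python rather than perl. The function name is mine, and `io.StringIO` stands in for an opened text file so the example runs on its own.

```python
import io
import re

# Assumption: a word is a run of letters, apostrophes allowed inside.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def numbered_words(lines):
    """Take each word in turn, give it its number, and hand back both."""
    position = 0
    for line in lines:                  # next line
        for word in WORD.findall(line): # take a word
            position += 1               # count it
            yield position, word

# io.StringIO stands in for an opened file (hypothetical input).
text_file = io.StringIO(
    "To startle thus betimes\nat a meer unlicenc't pamphlet\n"
)
for position, word in numbered_words(text_file):
    print(position, word)               # print the number and the word
```

Because the function accepts any sequence of lines, a real file handle can be passed in place of the stand-in without changing anything else.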
Once we have a list of the words with their position number, 1st, 2nd, 3rd ...18,000th, we can start work. For example: we can take the word "friends" (and foes) and go through the list and pick out all the place numbers for the word "friends." Then we print out the word friends and the sequence of numbers - that is a vector that can be graphed or compared to other vectors: friends 14, 69, 1120, 1125, 1130, 2400 etc. This is just an example, not data. [Note: when I say "print" in the context of perl, I do not mean print on paper, I mean "print to a file." Much of programming jargon evolved from analogies to words on paper, virtually.]
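Picking out the place numbers for one word might look like the sketch below (Python for illustration; the names and the sample string are invented, and the result is an example, not data).

```python
import re

def word_list(text):
    """Lower-cased words in text order; position 1 is the first word."""
    return [w.lower()
            for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)]

def position_vector(words, target):
    """All place numbers at which target occurs, counting from 1."""
    return [position
            for position, word in enumerate(words, start=1)
            if word == target]

# Invented sample text, not a quotation.
words = word_list("Friends and foes alike; old friends, new friends.")
print("friends", position_vector(words, "friends"))
```

The list that comes back is the vector: it can be graphed, or compared with the vector for "foes" to see where the two words come close to each other.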
Since we cannot count on consistent spelling with all texts it helps to go through the first word list and count and remove all duplicates. Then you have a list that may look like this: fried 2, friend 14, friends 23, frown 2, frowns 1 etc.
One can also range the multiple positions in the text for a single word or a list of words.
For example: the three references to "fried" would look like this:
fried, 3, 483, 4500, 8220
The data structure would be A. lexical item B. total in text (3) C. three positions 483, 4500, 8220 (a vector).
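One way to hold that A, B, C structure is sketched below, in Python for illustration. The dictionary-of-lists layout and the function names are my assumptions about a convenient representation, not a description of the actual programs.

```python
from collections import defaultdict

def build_index(words):
    """Map each lexical item to the list of its positions (from 1)."""
    index = defaultdict(list)
    for position, word in enumerate(words, start=1):
        index[word.lower()].append(position)
    return dict(index)

def record(index, item):
    """Return (A. lexical item, B. total in text, C. positions)."""
    positions = index.get(item, [])
    return item, len(positions), positions

def format_record(rec):
    """Render a record as one comma-separated line."""
    item, total, positions = rec
    return ", ".join([item, str(total)] + [str(p) for p in positions])

# Invented sample text, not data.
index = build_index("the fried fish and the fried egg".split())
print(format_record(record(index, "fried")))
```

The total is never stored separately. It is always the length of the position list, so the two cannot drift apart.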
Such a file can be used to compile all the members of a semantic category. So you can find - friend, friends, comrades, colleagues, or foe, foes, enemy, enemies as well as non-standard spellings. Then you are in a position to search the whole semantic category rather than a lexical item.
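Searching a whole category rather than a single item could be sketched like this (Python, for illustration). The index contents and position numbers are invented, and the category lists simply repeat the examples above; non-standard spellings would be added to the same lists.

```python
def category_hits(index, members):
    """Return sorted (position, word) pairs for every member found."""
    hits = [(position, word)
            for word in members
            for position in index.get(word, [])]
    return sorted(hits)

# Hypothetical index (word -> positions); the numbers are not data.
INDEX = {
    "friend": [14],
    "friends": [69, 1120],
    "foes": [1122],
    "enemy": [300],
}

# Hypothetical category lists.
FRIEND = ["friend", "friends", "comrades", "colleagues"]
FOE = ["foe", "foes", "enemy", "enemies"]

print(category_hits(INDEX, FRIEND))
print(category_hits(INDEX, FOE))
```

Sorting by position puts the hits back into text order, so places where the two categories come close together (here 1120 and 1122) can be seen by comparing the lists.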
The usefulness of this is that you can find, in rapid succession, the phrase "friends and foes" and you can find out in what context Milton is trying to build bridges to his former hated opponents in his anti-censorship crusade. "Friend nor foe" is a one-off, too bad, but I know that now.
Of course this can also be done by reading and remembering, but finding the exact entry points into the text is very helpful, especially in subsequent readings when precision is required. In this case "friends and foes" looks like a single reference, but there may be other constructions that say the same thing. The point is, and it is an important one - I know "friends and foes" is a single reference. I can stop wondering.
It is really as simple as that. It is easy to understand for researchers who are actually working on a text or a series of texts. These researchers are holding their texts in memory as they are formulating descriptive analyses or whatever. When you approach them with an indexing engine with their text loaded, they immediately start to pepper you with questions. It starts with, 'Can you show me this word? and this one, and this one.' What they want is to make their mental image of the text more precise, more reliable. And they are terrified of having missed something important.
That, in short, is the "discrete situation." Limited mortal human with a short term memory of 5 words at a time and a path to long term memory that is as certain as crossing the Medina in Marrakech with 8 dozen eggs and four bottles of Oulmes. Good luck finding Bab Doukkala to deposit whatever you are trying to remember and good luck finding it again in a week.
The fact of the matter is that not everybody is working on texts, I mean really poring over a text for months. Many are simply working with intellectual constructs that will not benefit from a program that provides entry points to a text based on semantic information. However, I do think that it is important to understand what this is all about. I maintain that if one understands, one will demand to have such a tool in hand when one is studying a text. That is becoming increasingly easy.
Why is this not the case, generally, in the profession? Why can Prof. Fish get up and make fun of text searching? Perhaps literary studies is a kind of athletic contest where researchers pit their memory against each other. Sort of like wrestlers with ideas who want to pin an opponent to the ground. It may always have been like that, a battle of rhetoric. Perhaps indexing is unsporting; it is cheating the existential situation; it is pitting the Samurai against automatic weapons. I personally do not put the contest above all. I have observed the impact of tools on science. I do not think literary studies will be harmed by tools helping to locate semantic information in electronic versions of texts.
coming next, the programs ...