I have a dream. The complete Puritan Corpus updated.

Status
Not open for further replies.

davejonescue

Puritan Board Freshman
Disregard post. I thought I had a way to use Word macros to complete this task with help. But running it on a single EEBO-TCP doc made it crash, lol. I can dream though, I can dream.
 

davejonescue

Puritan Board Freshman
Oo!

Oh...

Honestly this is already an amazing resource. But some day this has got to work!
I don't know if it is my computer or just the sheer number of misspelled words in older Puritan documents, but I did find a macro that would transfer all misspelled words to another document. Doing this would theoretically allow one to make a "master list" of potential misspelled words, and then one should be able to run a macro correcting them in every Puritan document. But it kept crashing on me. If the Lord ever blesses me financially, I think I would pay to have software created for this specific purpose: to upload a document and auto-convert these texts to proper spelling. But He hasn't, so it is on the shelf as of right now.
 

CathH

Puritan Board Freshman
That would have been very cool. There is not just the irregular spellings but also variability between spellings. The txt files I've used from archive.org are very messy from that point of view and need a lot of manual work.
 

davejonescue

Puritan Board Freshman
True, but that is kind of the vision I had with the "master dictionary," or DOCX with all potential word misspellings and variations, which would show as misspellings. Personally, I am not averse to thees and thous, nor archaic wording, as long as those words are spelled correctly. And to be honest with you, I can read 99.9% of EEBO-TCP text fine, and I honestly think many people can as well. But it would go a long way in easing people's preconceived apprehensions if the barrier of grammatical discrepancies were removed. Maybe one day the Lord will put it on the heart of a software engineer to do something like this, but until then, I am forever indebted to God, and to EEBO-TCP, for making what they have made available free of charge.
eebo-typo2.jpg
 

davejonescue

Puritan Board Freshman
Forgive me if this is a pipe dream, but hear me out. It is these macros that make me feel that the entire Puritan/Non-Conformist EEBO-TCP Corpus can be updated to at least correct spelling. The macro I post below claims, if I am reading it correctly, to be able to take a folder of Word docs, scan each document in full, then put a list of the misspelled words in a separate doc/folder. If this is the case, the entire process would theoretically be this simple:

1. Take all existing EEBO-TCP Puritan works and put them in a Word document.
2. Run the macro on all said documents.
3. Compile a list of all misspelled words in all the possible documents.
4. Auto-alphabetize that list of misspelled words (or find a macro to do it) and get rid of duplicates.
5. Create a macro correcting all said misspelled words.
6. Run all EEBO-TCP Puritan works through the correction macro.
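
Steps 2 through 4 could also be sketched outside of Word. What follows is only a minimal Python sketch, not the macro itself. It assumes the works have been saved as plain-text (.txt) files in one folder and that a set of accepted modern words is supplied by whoever runs it; the names `unknown_words_in_text`, `collect_unknown_words`, and `known_words` are hypothetical and chosen just for illustration.

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

def unknown_words_in_text(text, known_words):
    """Return the set of lowercased words in text that are not in known_words."""
    found = {w.lower().strip("'") for w in WORD_RE.findall(text)}
    found.discard("")  # a stray apostrophe leaves an empty string behind
    return found - known_words

def collect_unknown_words(folder, known_words):
    """Scan every .txt file in folder; return a sorted list without duplicates."""
    master = set()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        master |= unknown_words_in_text(text, known_words)
    return sorted(master)
```

Because the words are gathered into a set, a spelling that shows up thirty times in one work still lands in the master list only once, and the final `sorted` call takes care of the alphabetizing.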

As I was thinking about how I could do this myself if no one wants to partner, it came to me that I don't have to work on a single computer. I can get 10 computers up to spec and try to bust out 10 works a day, which, with 5,000 works, is at most 500 days. Now the macro's write-up said it ran 8 docs of 100 pages in 8 minutes. We then come to the problem of redundancy of misspelled words. But according to the write-up:

"Another thing to keep in mind is that the output can be quite long and seem rather redundant. This is because misspelled words can appear multiple times in the .SpellingErrors collection. For instance, let's say you have a document that contains the word "Cftype," which is obviously flagged as being misspelled. If the word is used 30 times in the document, it will be flagged 30 times and therefore end up 30 times in the misspelling list. Although it is beyond the scope of this tip, you could modify the macro to check if a word was previously flagged as misspelled and then add it only if it is unique misspelling."

So please, if anybody on here is good with macros, I sure could use that extension to it; it would save a bunch of time.

With that being said, on my end, utilizing Kutools, I can already have most of the EEBO-TCP Corpus put into individual Word docs, since I already have about 4,500 of them in 171 Word docs separated by heading, per title (for the Logos Personal Book project I am doing). Kutools allows one to split docs by heading, and it will auto-create individual docs from them. No joking, y'all, I am dead serious about this. This looks like a totally doable blueprint: a year, two years tops.

I apologize from the gate if this offends those who make their money publishing physical Puritan books; that is not my intent. My intent in doing this would be to release all of them back into the public domain, or under a Creative Commons license for free usage. The reason for this is I believe if a revival is going to happen, God is going to use the Puritans to do it. I think about all the lower-income nations who are being bombarded with false theologies, who beg for people to send them Puritan works on these Reformed book sites; but no one can because of the cost of shipping. If we (or I) can correct the spelling of these works and use Calibre to freely turn them into EPUBs/ebooks, we can email them, or put an entire library on a flash drive, so that they can be read on phones. PDFs are hard to read on a phone because the text doesn't auto-size the way EPUB does.

I would love help; heck, I would love someone to take the lead (as I am not a leader), but it has to be for free first. Once the works are in the public domain, people can publish all the physical books they want from them; I don't care. But my goal is to make Puritan literature available to the world without the damper of cost. And, just to add icing to the cake: if we can get these works into correct spelling, auto-translation software like DeepL means we can translate them for many nations around the world for $15 a month.

Sorry if I am ranting. Just wanted to show this dream isn't a figment, or impossible; it is very much possible, and I look forward to doing it if the Lord permits. Would anybody care to join?

 

Charles Johnson

Puritan Board Sophomore
I made software that does this a couple years ago and I made a basic list of words that should be respelled with a few thousand entries. Even so, any book passed through the software has to be checked by a human reviewer, because, for various reasons, the software cannot perform with 100% accuracy. It is much faster than retyping though.
 

davejonescue

Puritan Board Freshman
I would really like to talk to you more about this, and see if there is a way I could possibly license the software from you. With this macro, the potential is to create a "master list" of all known misspellings in all of the known EEBO-TCP Puritan documents by running them all through the macro and disposing of duplicate misspellings. If it is possible to add that list to your software without you having to reprogram anything, this could be groundbreaking. Once that software plus the master list is created, it could be used not only for EEBO-TCP literature, but possibly as the key to collecting an entire Puritan corpus: hunt down all known texts, digitize them with the same method EEBO-TCP used for their documents (which I want to say is listed open-source online), then run them through "the key," and potentially (now I know this is dreaming) have a correct current spelling of all known Puritan works.

Of course, I know there is another way to do it besides creating a "master list." That is if the software you created has an option to "add new list" of misspelled words, then the docs could singly be run through the macro, create a list of misspelled words for that specific doc, put into your program, and re-edited individually. My goal would still be the same though, and that would be to put them into the public domain for universal access. But, this would also allow other people to pick up the helm and "tidy up" the works for e-publishing, because replacing the spelling is the biggest time consumer. Next would be figuring out incomplete words (which only constitute a fraction of the texts in their entirety.)

I am very serious about this though. My email is [email protected] if you would like to talk privately; but I am willing to pay (if it is within my means) to use it. I am going to go to bed for now, got to get up for work in a few. But this is very exciting. Thank you so much for reaching out, and hope to hear from you soon. God Bless.

edit*** And really just to reiterate: this dream is only for the good of the Church. I envision a website where people can come and download a PDF/EPUB of the entire known Puritan Corpus for free. And with tools like DeepL, quite possibly in multiple languages such as French, Spanish, Chinese, Dutch, German; and the list goes on. Of all the docs I have, all 4,200 or so in DOC format, that is only 625 MB. Hosting the entire corpus should not be hard, even on something like an $8.99 annual Google Sites plan.
 

Logan

Puritan Board Senior
Is there a particular reason you're tied to the .doc format? I would suggest that anything in plain text format will be far more flexible and easy to process.

For example, as a test I used Python and a few lines of code from one of its built-in libraries to grab the html of the full-text webpage for one of the works, which I now have saved locally as an html file on my computer:

pythonw_Nw1aNt0Uy6.png

To do all of them you'd just stick this in a loop and iterate over all the URLs. Produces this text:

notepad++_1RQ1xMQpTp.png

Which can be viewed in a browser as:

chrome_YsCYt1Dl5Q.png

I would suggest that working with the html markup is going to be more flexible than anything in Word. It would be very easy to go from html to epub (epub uses html tags for its formatting) or even to PDF. But you're going to be able to write a script to go through all the files (and even edit them) far more easily in a text format like this than using a Word macro. Combine that with regular expressions and I'm confident you can automate all of this.
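
Since the code above survives only as a screenshot, here is a rough stand-in for the "stick this in a loop" idea. It is a sketch under stated assumptions, not the script from the screenshot: it uses Python's built-in `urllib.request`, assumes the pages are UTF-8, and takes a plain dict of file names to URLs. The helper names `fetch_html` and `save_all`, and the one-second pause, are hypothetical choices for illustration.

```python
import time
import urllib.request
from pathlib import Path

def fetch_html(url, timeout=30):
    """Download one page and return it as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def save_all(urls_by_name, out_dir, fetch=fetch_html, delay_seconds=1.0):
    """Fetch each URL and write it to out_dir/<name>.html; return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in urls_by_name.items():
        html = fetch(url)
        path = out / f"{name}.html"
        path.write_text(html, encoding="utf-8")
        saved.append(path)
        time.sleep(delay_seconds)  # pause between requests so the server is not hammered
    return saved
```

The `fetch` parameter is there so the loop can be tried with a stand-in function before pointing it at a real site, and the pause is a guess at a polite request rate rather than a documented limit.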
 

Logan

Puritan Board Senior
Oh, and then it would be fairly straightforward to write some additional code to compile a list of misspelled words and eliminate duplicates. Then search and replace through all the html files which have been locally stored.

I might build a dictionary of sorts: moste = most, selues = selves, etc and then apply that dictionary to each file in sequence. I anticipate that an approach like this could process all the files in a day.
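
A dictionary like that could be applied with one regular expression. This is a minimal sketch, assuming the dictionary keys are lowercase and that only whole words should be touched; `build_replacer` is a hypothetical helper name, and the capitalization handling is a simple guess at what would be wanted.

```python
import re

def build_replacer(spelling_map):
    """Return a function that respells whole words found in spelling_map.

    Assumption: the keys of spelling_map are lowercase.
    """
    if not spelling_map:
        return lambda text: text

    # \b on both sides keeps "moste" from matching inside "almoste".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in spelling_map) + r")\b",
        re.IGNORECASE,
    )

    def fix(match):
        old = match.group(0)
        new = spelling_map[old.lower()]
        if old.isupper() and len(old) > 1:
            return new.upper()       # MOSTE -> MOST
        if old[0].isupper():
            return new.capitalize()  # Moste -> Most
        return new

    return lambda text: pattern.sub(fix, text)

respell = build_replacer({"moste": "most", "selues": "selves"})
print(respell("The moste of them selues."))
```

Compiling the pattern once and reusing the returned function means the same dictionary can be applied to each file in sequence without rebuilding anything.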

I will warn though that such a thing would be prone to mistakes because there was no standardized spelling and printers often varied the spelling of words (or the incorporation of shortcuts) even on the same page. So I suspect you will have multiple words with the same spelling and correcting one will make the other worse. But you might get 90% of the way there with an approach like this.
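
One way to find the risky entries before running anything is to split the list in two. The sketch below is only an illustration: it flags an old spelling for human review if the combined lists map it to more than one modern word, or if the old spelling is itself a valid modern word (for instance "bee" standing for "be"). The function name `find_risky_entries` and the `modern_words` set are hypothetical.

```python
from collections import defaultdict

def find_risky_entries(pairs, modern_words):
    """Split (old, new) pairs into safe replacements and ones needing review."""
    targets = defaultdict(set)
    for old, new in pairs:
        targets[old.lower()].add(new.lower())

    safe, review = {}, {}
    for old, news in targets.items():
        if len(news) > 1 or old in modern_words:
            review[old] = sorted(news)     # ambiguous: leave for a human
        else:
            safe[old] = next(iter(news))   # exactly one target, not a modern word
    return safe, review
```

Only the `safe` half would go into the automatic search and replace; the `review` half stays out and gets corrected in context.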
 

NaphtaliPress

Administrator
Staff member
Logan is correct on the issue of variation and lack of standardization. It's not the misspellings so much (and it's encouraging to hear of automation and scripts for those), but that every individual EEBO typist (grad students?) made their own errors in dropping letters or words they could not read due to page image imperfections, which vary from work to work based on the example chosen to film, often dating to as early as the 1940s. And of course they omit Latin, Greek and Hebrew, and EEBO TCP and Oxford handle footnotes differently: the first as separate links, while Oxford embeds them. So if they used a very badly captured text, it is not like they spent any time struggling or trying to figure it out; they drop in their symbol and move on. Then there are their actual mistakes in accidentally skipping words, or the text between two occurrences of the same word (a common error with a Greek name). If you want an accurate text, you still have to compare to the original page images.
 

Logan

Puritan Board Senior
Proof of concept:
First a spreadsheet containing a list of files to be processed, presumably pretty easy to generate since you have that information in your "creator" column. The fulltext URL can probably be generated automatically from the URL you list in your database.

EXCEL_8038UgX3NC.png

Another spreadsheet with a list of dictionary items to replace:

EXCEL_AvAKZovyAt.png

The code to iterate over it all:
pythonw_FbdbZKfRF8.png

The output:

explorer_l9fCGjWLOB.png

Old:
chrome_M7xLMiaFHI.png

New:
chrome_NHsvd0D8bb.png


The above four works took about 16 seconds total to download, process, replace archaic words, and write to files on my 7-year-old machine. The time will obviously go up the more words there are to find and replace, but I think this is a good proof of concept.

I'd be happy to help, but the first thing I need is some kind of spreadsheet or table with author name, title, and URL. Doing the word replacement could be done at a later point in time.
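
For anyone putting such a table together, here is a sketch of how the two spreadsheets might be read once exported as CSV. The column names `author`, `title`, and `url`, the two-column layout of the dictionary file, and the function names are all assumptions for illustration, not the layout used in the screenshots above.

```python
import csv

def load_works(csv_path):
    """Read rows with 'author', 'title' and 'url' columns into a list of dicts."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            {
                "author": row["author"].strip(),
                "title": row["title"].strip(),
                "url": row["url"].strip(),
            }
            for row in csv.DictReader(handle)
        ]

def load_spelling_map(csv_path):
    """Read two-column rows (old spelling, new spelling) into a dict."""
    mapping = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank or incomplete rows rather than failing on them.
            if len(row) >= 2 and row[0].strip():
                mapping[row[0].strip().lower()] = row[1].strip()
    return mapping
```

Keeping both lists in plain CSV means they can still be edited in Excel and then handed straight to a script.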
 

NaphtaliPress

Administrator
Staff member
You do realize that EEBO's goal was not completely error-free texts but reasonably complete ones? So you see <o> or some other symbol a lot where text (Greek, Hebrew, Latin) was omitted. And speaking from experience, each file is only as good as that typist and the condition of the page images. EEBO typists make mistakes as well, leaving out words and the text between similar words (type "thereupon," look up and see "thereupon" on the next line, and continue, not knowing text was skipped; a common mistake, and there's a Greek word for it). You'll need to make a huge proviso on any end product so folks know these are still uncorrected texts as far as any mistakes and omissions by EEBO. The only way to find and correct those is to compare each text with a good original of the edition or the facsimile used by EEBO.
 

NaphtaliPress

Administrator
Staff member
Logan, are the notes (what were originally marginal notes in EEBO texts) embedded in the html, and can they be pulled out with a script to automatically make Word footnotes? They use asterisks and letters, depending on whether the original had letters numbering the notes. I realize you are talking html for Dave's project, but I need to create a base text for several Rutherford works for Works of Samuel Rutherford for RHB, and it looks like I have to start with an EEBO text for at least one of them for 2023.
 

Logan

Puritan Board Senior
Chris, perhaps an option for your base text would be to output a pure text (.txt) file, but wherever it runs into a footnote, put in a couple of empty lines, insert the word "footnote", add in the footnote text, then a couple more empty lines, and continue the text. Something like this from Bownd:
notepad++_oXK5S01YmZ.png

But yes, it looks like the footnotes are links to separate pages so you'd just have to follow that link (like I'm already doing to grab the main text) and grab the footnote. Should definitely be possible. Going to Word from that is going to be trickier but that might be possible too.
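
The plain-text layout described above could be produced along these lines. This is only a sketch: it assumes the footnote texts have already been gathered into a dict and that the body carries a placeholder at each footnote position. The `{{note:...}}` marker format and the name `inline_footnotes` are made up for this example and are not anything EEBO or Oxford uses.

```python
import re

# Hypothetical placeholder left in the body wherever a footnote link stood.
NOTE_MARKER = re.compile(r"\{\{note:(\w+)\}\}")

def inline_footnotes(body, notes):
    """Replace each marker with a visibly separated 'footnote' block."""
    def expand(match):
        note_text = notes.get(match.group(1), "[missing note]")
        # Two empty lines, the word "footnote", the note, two more empty lines.
        return "\n\n\nfootnote\n" + note_text + "\n\n\n"
    return NOTE_MARKER.sub(expand, body)
```

A note whose text was never fetched shows up as "[missing note]" rather than vanishing, so gaps are easy to spot when reading through the output.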
 

NaphtaliPress

Administrator
Staff member
Please see if the Oxford is embedded rather than separate files below; but it may simply be pulling them from the links.
Rutherford, Survey of Spiritual Antichrist (Oxford TCP)

Rutherford, SSA (EEBO TCP).
 

davejonescue

Puritan Board Freshman
Thank you so much for your help; this is truly a blessing. I will get working on that spreadsheet beginning today. I think I have a way to handle the multiple-spelling issue: if we keep all the words that were replaced, look over them, and see which words are duplicated, we can then go into a program like Word (sorry, I'm not stuck on Word; it's just all I know), auto find-highlight those words, and correct them in context. But again, thank you so much, and I will let you know when I have that spreadsheet done.
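
The "correct them in context" step does not strictly need Word. As a rough sketch (the function name `contexts_for` and the default width of 30 characters are arbitrary choices for illustration), a few lines of Python can list every place a doubtful word occurs along with a little of the text on either side:

```python
import re

def contexts_for(word, text, width=30):
    """Return short snippets around each whole-word occurrence of word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Square brackets mark the word under review.
        snippets.append(
            text[start:match.start()] + "[" + match.group(0) + "]" + text[match.end():end]
        )
    return snippets

for snippet in contexts_for("bee", "To bee or not; the Bee flies.", width=5):
    print(snippet)
```

A reviewer can then skim the snippets and decide case by case which occurrences need respelling.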
 

ZackF

Puritan Board Professor
Is that code in a Jupyter notebook?
 

davejonescue

Puritan Board Freshman
Yes sir, I understand this; but my intent for this is not really for scholars, or for crowds where perfection is necessary. I do not mind the Greek and Hebrew not being included, or the Latin not being translated. The goal is more devotional than it is scholarly; and truthfully, the intended audience is those who are probably not fluent in Greek or Hebrew to begin with, like a large majority of the church. Putting in a "warning" is no problem. My only goal is to correct the spelling of the EEBO-TCP texts for now and put them in the public domain, as well as creating EPUBs/PDFs. In the end, it is one huge step toward getting the Puritan Corpus into current English. But I definitely see your point.
 

Pilgrim72

Puritan Board Junior
I did this as well. Just checked my word swap list, and I have about 7000 words so far.
I bet if we compiled our lists, it would be pretty big. If anyone wants to do this, let me know. :up:
 

davejonescue

Puritan Board Freshman
Yes, this is awesome. And I was just thinking that with Excel, I am pretty sure you can auto-alphabetize. So it should be pretty straightforward to do that, see the words that might get jumbled in correction, take them out of the dictionary, and leave them for editing within Word so they can be corrected in context. This is awesome. Thank you so much for your contribution!!
 

Logan

Puritan Board Senior
@NaphtaliPress I can extract the margin notes pretty easily from the Oxford page. The other one not so much because it requires following and gathering data from over 2000 links.

Unfortunately, it looks like footnotes in Word are a very manual process. So I've generated a Word document for you that simply breaks out those marginal notes in a very obvious way:

Before:
1664557693511.png

After:
WINWORD_5MDz2NYU90.png


The Word document is in the attached zip file, if you find that useful. Seems like it would be at least a good start for copying/pasting.
 

Attachments

  • Rutherford.zip
    936.4 KB · Views: 0

Logan

Puritan Board Senior
When you created your Word files, did you worry about the footnotes? Since EEBO seems to have them all as links to separate pages, this could be problematic, not just for speed but if they start blocking hundreds of thousands of web-page requests from the same IP.

If you're just concerned about the base text, that would make this a whole lot simpler.
 

NaphtaliPress

Administrator
Staff member
Thanks Logan; really appreciate it. Still will be a slog but that sort of processing will help I think.
 

Logan

Puritan Board Senior
Let me know if there is something else that might help. I'm sure you'd much rather your time be spent doing something more productive than repetitive formatting or parsing. Do you do all the footnotes in Word and then import a Word document into InDesign?

Word is a good tool for being able to see some things visually but it's hard to automate (maybe David can help with some macros). But if InDesign can import a different format then that might be helpful.
 

davejonescue

Puritan Board Freshman
No sir, I am not worried about the footnotes. As I was creating my index, I simply went to "full text" copied all, and pasted in a Word doc. I completely disregarded the footnotes, in fact, I think I got rid of them all by deleting all hyperlinks which did away with the "page numbers" in between the text, and the footnote links within the text. Not having footnotes doesn't bother me at all.
 

NaphtaliPress

Administrator
Staff member
Logan, Dave: when you start out with the premise that all the Latin, Greek, and Hebrew text, and the marginal notes, comments and headings (which can be an extensive part of a work), will be omitted, and that the rest of the text itself may vary, with missing words and sentences the typists omitted in error, what sort of collection is this? The question is, is it good enough, and does it fairly represent the author's original work? If this were print, there is no question no one would ever put texts out like this; they would be bad texts; it does an injustice to the authors' works. App based? I hate to see the debasing of expectations of good texts of an author's work. Maybe in some cases with an uncomplicated text this will work; in other cases surely not. But that undercuts the mass-processing idea of treating them all the same, doing no more than using what EEBO has, with all its faults and intentional omissions.
 