USE AT YOUR OWN RISK! USER AGREES BY DOWNLOADING AND/OR USING TO ACCEPT ALL LIABILITY AND HOLDS THE CREATOR HARMLESS.
GUTENBERG CONVERSION HOME LIBRARY PROCESS (Beta .001)
In using Gutenberg eTexts I found the file names cryptic and not compatible with my home e-library format. This led to creating a process whereby the original Gutenberg files would be renamed to a more friendly format and one that fit my library standard.
The process renames Project Gutenberg (GP) files to a more user-friendly "home library" format. It changes a file name from the GP etext number or earlier file name to one of the following format:
"author last name (30 characters), author first name (40 characters, including any extras such as dates) - title (first 80 characters)-etext number-<prior file name>.*"
FROM "waldn10.txt" TO "Thoreau, Henry David - Walden-205-waldn10.txt"
FROM "8453.txt" TO "Hugo, Victor - Actes et Paroles, Volume 2 Pendant l'exil 1852-1870-8453.txt"
FROM "7gwrl10.txt" TO "Hope, Laura Lee - The Outdoor Girls at Wild Rose Lodge or, the Hermit of Moonlight Falls-8211-7gwrl10.txt"
FROM "1.txt" TO "United States - United States Declaration of Independence-1.txt"
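For readers who want to see how a name in this format is put together, here is a minimal Python sketch. It is not the code used by this process. The function name build_name is made up for illustration, and the rule that the etext number is written only once when the old file name is just that number (as in the "8453.txt" and "1.txt" examples) is an assumption read off the examples above.

```python
import os

def build_name(last, first, title, etext_no, old_name):
    """Assemble 'Last, First - Title-etext-oldname' (illustrative only)."""
    # Length limits taken from the format description above.
    last = last.strip()[:30]
    first = first.strip()[:40]
    title = title.strip()[:80]
    # Some "authors" (e.g. United States) have no first name.
    author = f"{last}, {first}" if first else last
    stem = os.path.splitext(old_name)[0]
    if stem == str(etext_no):
        # Assumption: old name is already the etext number, so do not repeat it.
        return f"{author} - {title}-{old_name}"
    return f"{author} - {title}-{etext_no}-{old_name}"

print(build_name("Thoreau", "Henry David", "Walden", 205, "waldn10.txt"))
```

The sketch does not try to handle characters that are not allowed in file names or the rough conversion of non-English characters.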
The process is free to non-commercial users.
The process is NOT PERFECT. It works from data derived from GP files. There has been no correction for errors, omissions or inconsistencies in that data (example: for "A Horseman in the Sky" by Ambrose Bierce, the author information is inconsistent and contains "A Horseman in the Sky"). In addition, of the first 10k files, it leaves about 80 unchanged. For non-English documents, the translation of non-English characters in names and titles (ex: umlauts) is very rough. Internal document text is not changed, only the name of the file. It has been tested with MS Windows XP and should work with operating systems that support long file names.
There is also an index of all files. It can be used to create an MS Access database or MS Excel spreadsheet. It is "delimited" by "~" (tilde). It includes the old file name, new file name, file type, author last name, author first name and title.
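If you would rather read the index with a script than with Access or Excel, a sketch along these lines may help. It assumes one record per line, no header row, and the fields in the order listed above. The field names and the function name parse_index are my own labels, not part of the index.

```python
import csv

# Assumed field order, taken from the description of the index above.
FIELDS = ["old_name", "new_name", "file_type",
          "author_last", "author_first", "title"]

def parse_index(lines):
    """Turn tilde-delimited lines into a list of dictionaries."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text.
    reader = csv.reader(lines, delimiter="~", quoting=csv.QUOTE_NONE)
    return [dict(zip(FIELDS, row)) for row in reader if row]

# Example record built from the Walden rename shown earlier.
sample = ["waldn10.txt~Thoreau, Henry David - Walden-205-waldn10.txt"
          "~txt~Thoreau~Henry David~Walden\n"]
for rec in parse_index(sample):
    print(rec["author_last"], "|", rec["title"])
```

To read the real index, pass an open file object instead of the sample list.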
*** If you are not comfortable with this process DON'T USE IT! It is unsupported, supplied as is.
*** NEVER work on your only copy! Make a BACKUP! This process is NOT reversible!
Phase 0 - Read and Understand
a. Read and understand all steps before starting.
Phase 1 - Obtain the GP files
Summary: Use "wget" in a command prompt window to download the desired Gutenberg files (See documentation). Depending on what you select, this part can take days to complete!
a. create a new folder (aka directory) (ex: C:/gutenraw1)
b. open a command prompt window
c. in the command window type: cd c:/gutenraw1
d. type in the desired wget line. The following will get all text and html type files:
e. wait until completed
f. close the command window
g. create a new folder c:/gutenout2
Phase 2 - Unpack & Create Backup Copy
In this phase you will unpack all of the zip files in the folder you just downloaded. Zipghost is a utility that will do this for you. It is strongly advised that you exit all other programs for this and all following steps.
a. Install zipghost (download the latest copy if desired)
b. Start Zipghost and close the "tip" window
c. Say "no" to the "Zipghost is not associated..." message.
d. Click "Batch" tab
e. Click "Batch extract" option
f. Click "Process One Folder"
g. Click the "Browse" button and select the folder that contains the downloaded files (c:/gutenraw1)
h. Check the "Process all subfolders (Batch procession)" box. THIS IS IMPORTANT.
i. Check the following: "All files";
"skip" (or other as desired);
"Extract all inner path info in archives"
"Using archive filename for folder"
DO NOT CHECK "Delete source..."
j. Choose "Save to user appointed path"
k. Click "Browse" and enter the desired output folder (c:/gutenout2)
l. UNCHECK "Rebuild folder structure of source path"
m. Click "Next"
n. Click "Start"
o. wait for it to finish. This can take from many minutes to hours depending on the number of files you downloaded.
p. copy the contents of the "Exceptions" folder to your output folder (c:/gutenout2)
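If Zipghost is not available, the unpacking can be roughly approximated with a short Python script. This is a substitute technique, not what Zipghost does internally, and I have not compared the two outputs. The sketch assumes each archive should be unpacked into a folder named after the archive, that files already present are skipped, and that the source zip files are left in place. The function name extract_all is hypothetical.

```python
import zipfile
from pathlib import Path

def extract_all(src_root, out_root):
    """Unpack every .zip under src_root into out_root/<archive name>/."""
    src_root, out_root = Path(src_root), Path(out_root)
    extracted = []
    # rglob walks all subfolders, like the "Process all subfolders" box.
    for archive in sorted(src_root.rglob("*.zip")):
        target = out_root / archive.stem
        with zipfile.ZipFile(archive) as zf:
            for member in zf.infolist():
                if member.is_dir():
                    continue
                dest = target / member.filename
                if dest.exists():
                    continue  # "skip": never overwrite an existing file
                zf.extract(member, target)
                extracted.append(dest)
    return extracted  # the source archives are not deleted

if __name__ == "__main__":
    done = extract_all("c:/gutenraw1", "c:/gutenout2")
    print(len(done), "files extracted")
```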
Phase 3 - Backup Data
It is STRONGLY RECOMMENDED that this process not be performed on the original copy of data. It is NOT reversible.
This process can take from many minutes to multiple hours to complete.
a. create a new folder c:/gutenrename3
b. open the folder c:/gutenout2, select all files, right click and "copy"
c. open the folder c:/gutenrename3 and "paste". Keep C:/gutenout2 as your safety copy.
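The copy in steps a through c can also be scripted. The sketch below is only one way to do it, using Python's standard shutil.copytree. It refuses to run if the working folder already exists, and it counts the files on both sides so you can confirm the copy is complete before renaming anything. The function name make_working_copy is made up for this example.

```python
import shutil
from pathlib import Path

def make_working_copy(safety_dir, work_dir):
    """Copy safety_dir to a new work_dir; return (source, copy) file counts."""
    safety_dir, work_dir = Path(safety_dir), Path(work_dir)
    if work_dir.exists():
        # Never copy over an existing working folder.
        raise FileExistsError(f"{work_dir} already exists")
    shutil.copytree(safety_dir, work_dir)
    src_count = sum(1 for p in safety_dir.rglob("*") if p.is_file())
    dst_count = sum(1 for p in work_dir.rglob("*") if p.is_file())
    return src_count, dst_count

if __name__ == "__main__":
    src, dst = make_working_copy("c:/gutenout2", "c:/gutenrename3")
    print("files in safety copy:", src, "files in working copy:", dst)
```

If the two counts differ, do not go on to the rename phase.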
Phase 4 - Rename Copy
In this phase, the renaming of the files takes place. Again, this is not reversible. Hint: It is much simpler to eliminate any unwanted files (ex: 7 bit ASCII files) before renaming, because after renaming they will no longer sort together by type but by author last name, making it much more time-consuming to pick them out.
a. open folder c:/gutenrename3
b. remove any files that you do not wish to rename (ex: 7 bit ASCII files)
c. copy file zRename.bat into c:/gutenrename3
d. you can click on zRename.bat or run it from a command window
e. this can take from minutes to hours; you can watch the progress as files are renamed.
f. wait for the operation to complete (you can stop it with a "control C")
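To show the idea behind this phase, here is a hedged Python sketch of applying a list of old-name/new-name pairs to a folder. It is not zRename.bat and does not read it. The function name apply_renames and the pair list are assumptions for illustration. Unlike the batch file, it quietly records files that are not present instead of printing an error for each one, and its default is a dry run that changes nothing.

```python
from pathlib import Path

def apply_renames(folder, pairs, dry_run=True):
    """Rename old -> new for each pair that is present in folder."""
    folder = Path(folder)
    renamed, missing, blocked = [], [], []
    for old, new in pairs:
        src, dst = folder / old, folder / new
        if not src.exists():
            missing.append(old)      # file was never downloaded
        elif dst.exists():
            blocked.append(old)      # new name already taken (duplicate)
        else:
            if not dry_run:
                src.rename(dst)
            renamed.append((old, new))
    return renamed, missing, blocked

if __name__ == "__main__":
    # Hypothetical pair list; the first pair is the Walden example above.
    pairs = [("waldn10.txt", "Thoreau, Henry David - Walden-205-waldn10.txt")]
    done, missing, blocked = apply_renames("c:/gutenrename3", pairs)
    print(len(done), "would be renamed;", len(missing), "not present")
```

Run it once as a dry run, check the counts, and only then call it again with dry_run=False.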
Phase 5 - Cleanup
The output folder C:/gutenrename3 will contain the files that were renamed to the author, title format as well as files it could not rename (duplicates, not in the Gutenberg catalog, etc.). The renamed files will be grouped together, so it is simple at this time to cut and paste them to another folder if desired.
Note that many of the 7xxxxxxx.xxx files are duplicated under 8xxxxxxx.xxx.
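The 7/8 pairs noted above can be listed with a few lines of Python. This sketch assumes that a "7" file and its "8" twin differ only in the first character of the name. It skips names that are purely numeric (such as 7123.txt), since those are etext numbers and not 7 bit versions. The function name seven_bit_duplicates is made up for this example.

```python
import os

def seven_bit_duplicates(names):
    """Return the '7...' names that also exist as '8...' names."""
    present = set(names)
    dups = []
    for name in present:
        stem = os.path.splitext(name)[0]
        if not name.startswith("7") or stem.isdigit():
            continue  # not a 7 bit name, or a plain etext number
        if "8" + name[1:] in present:
            dups.append(name)
    return sorted(dups)

if __name__ == "__main__":
    # List the candidates only; nothing is deleted or moved.
    for name in seven_bit_duplicates(os.listdir("c:/gutenrename3")):
        print(name)
```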
I hope you find that this process makes the great works provided by the Gutenberg people even more enjoyable and accessible. Nanoflyer
Q: What if I do not have all the files?
A: The renaming process can only rename what is present in the folder. It will show errors (ex: The system cannot find the file specified), but they can be ignored.
Q: What about the files that were not renamed?
A: They are left with the original GP file name.
Q: How does this affect the contents of the files?
A: It does not.
Q: There are a lot of .txt files beginning with "7" (ex: 7xxxxxx.txt) that are not renamed, why?
A: The files beginning with 7 are 7 bit ASCII, a format with little use these days. Most "7" files are available in "8" format.
Q: Can you provide other formats?
A: Custom programming is very expensive.
Q: You said this is not perfect, how is that?
A: For example, .jpg files were not named with a unique name that identifies which text they are associated with. These exceptions had to be checked and handled manually, and manual processes are especially prone to error. Also, some files do not seem to have a valid entry in the catalog, so they are not renamed. This can be because of a misspelling or such. In my tests only a few dozen files out of 10,000+ were in need of manual renaming. Exceptions that were found are in the "Exceptions" folder.
Q: There are many, many error messages such as "The system cannot find the file specified" during the renaming process.
A: Yes, the rename file lists many files that you are likely never to download (there are over 60,000 lines in the rename file), including old versions and in-process versions. Every file you do not have will create an error message. This is not normally an issue.
Q: Will the process keep the pictures related to an etext with the etext?
A: Yes. The pictures will be renamed to something similar to the etext. They will naturally sort next to the etext, making them simple to find.