Internet Archive

Item Upload Procedures

(as of 3/2017)

These procedures describe how to load individual items to the Internet Archive. Batch loading is available for larger projects using Perl or Python.

1. Have your metadata ready and prepare the pdf. The pdf source file name should be a unique identifier such as a barcode (do not use a prefix). Using a call number is not advised as it is not truly a stable match point between systems. If you must use a call number, replace the spaces with underscores:

Barcode example: 31072001519406.pdf

Call Number example: SCR Folio N376 B14 becomes SCR_Folio_N376_B14.pdf

It is important to follow the file naming conventions because Primo uses the URL syntax to dynamically link to content on the IA platform.
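The naming rule in step 1 can be sketched in Python. This is only an illustration and not part of the NYARC workflow; the function name pdf_filename is hypothetical.

```python
def pdf_filename(identifier: str) -> str:
    """Build a pdf file name from a barcode or a call number.

    Runs of spaces (as found in call numbers) become a single underscore;
    barcodes contain no spaces and pass through unchanged.
    Hypothetical helper, for illustration only.
    """
    cleaned = "_".join(identifier.split())
    return f"{cleaned}.pdf"


print(pdf_filename("31072001519406"))      # 31072001519406.pdf
print(pdf_filename("SCR Folio N376 B14"))  # SCR_Folio_N376_B14.pdf
```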

2. Go to the library’s page on Internet Archive:

https://archive.org/details/brooklynmuseumlibrary

https://archive.org/details/frickartreferencelibrary

https://archive.org/details/museumofmodernart

3. Log in using the NYARC Internet Archive account.

4. Click the Upload icon > Choose file to upload and browse to the pdf.

5. Add metadata using the template.

Tips:

- Do not copy and paste directly from the Arcade display; this adds stray HTML formatting to the text. If you copy the metadata from a browser, paste it into Notepad first to remove all formatting.
- Watch diacritics – they often don’t transfer properly and must be edited.

Page Title: Replace the barcode with the title of the item.

Page URL: Add the library’s prefix to the page URL before the barcode.

- Brooklyn: add bml-
- Frick: add frick-

(yes, include the dash)
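As a rough sketch, the prefix rule can be written as a small Python helper. The names PREFIXES, item_identifier and item_url are hypothetical, only the two prefixes listed above are included, and the sketch assumes the item page lives at https://archive.org/details/ followed by the prefixed barcode.

```python
# Prefixes taken from the list above; other libraries are not covered here.
PREFIXES = {"brooklyn": "bml-", "frick": "frick-"}

# Assumed base of an item page URL on the Internet Archive.
DETAILS_BASE = "https://archive.org/details/"


def item_identifier(library: str, barcode: str) -> str:
    """Return the prefixed identifier, dash included."""
    return PREFIXES[library.lower()] + barcode


def item_url(library: str, barcode: str) -> str:
    """Return the page URL built from the prefixed identifier."""
    return DETAILS_BASE + item_identifier(library, barcode)


print(item_identifier("Brooklyn", "31072001519406"))  # bml-31072001519406
print(item_url("Frick", "31072001519406"))
```

A library that is not in PREFIXES raises a KeyError, which is intentional here: the sketch should not guess a prefix.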

Subject Tags: Keywords are separated by commas. If you paste in LC subjects, you’ll need to remove commas in strings. A workaround is to add subjects in the full metadata editor after you upload the item. This allows for commas to be included in subject strings. To do this:

Go to the item > choose Edit > Change the information > add new field (at bottom of page), type subject, and then add your subject string.
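If you prefer to prepare the Subject Tags field before pasting, the comma clean-up can be sketched in Python. The function name subjects_to_keywords is hypothetical and the sample subject strings are made up for illustration.

```python
def subjects_to_keywords(subjects):
    """Join subject strings into one comma-separated keyword field.

    Commas inside a subject string are removed so that each subject
    stays a single keyword. Hypothetical helper, for illustration only.
    """
    cleaned = [" ".join(s.replace(",", " ").split()) for s in subjects]
    return ", ".join(c for c in cleaned if c)


# Made-up sample subjects in the style of LC subject strings.
sample = ["Painting, Italian -- Exhibitions", "Art museums -- New York"]
print(subjects_to_keywords(sample))
# Painting Italian -- Exhibitions, Art museums -- New York
```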

Language: If you want the item OCR’d, do not choose English (handwritten). Choose English if you have included a typed index for manuscript or image materials.

Collection: Make sure the collection includes the specific contributing institution.

6. Once the pdf and metadata are uploaded, the pdf is processed to generate all the additional files. This could take a while (under an hour), so be patient!

If you make a mistake and need to delete a file (for example, it wasn’t named properly) or edit the collection name (as I did in the screenshot above), you must email info@archive.org.

7. Create a bibliographic record for the uploaded item using the Internet Archive link as the 856|u to submit to WorldCat and Sierra.

Batch Uploading Digital Files and Metadata (Perl)

Requirements: Perl (Strawberry Perl can be downloaded at http://strawberryperl.com/) and command line access

This process uses the IAS3 Bulk Uploader. Instructions for using the uploader, including necessary files and metadata guidelines, can be found at https://github.com/kngenie/ias3upload.

1. Create a directory for the file uploads and download ias3upload.pl into the directory. ias3upload.pl is the Perl script file.

2. Load the pdfs to transfer to IA into the same directory.

3. Extract the following fields from Sierra (use saved export “Internet archive extract”):

245 |ab

Author

Language

260|b

Date One

Subject

Note

Barcode

4. Reformat the data according to the IAS3 Bulk Uploader specifications using the JSON script for OpenRefine (see “ia_metadata_json.txt”).

The following fields are created:

item

file

mediatype*

collection*

title

creator

language

publisher

date

description

subject[0]

subject[1]

contributor*

sponsor* - sponsor is added for externally funded projects. This must be manually added. The script does not include the column or value for sponsor.

*indicates constant data fields

Additional metadata can be added as needed; consult the IAS3 Bulk Uploader documentation for specifics. Also see the example metadata.csv file.

The script removes extraneous punctuation at the end of fields and renames columns. Take special note of serial and multi-volume titles: each pdf needs a complete row in the metadata.csv file.

5. Save the metadata.csv into the same directory as the files to upload.

6. When ready to upload, open up the command line and navigate to the directory that contains all of your files.

7. Run the script with the following command:

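perl ias3upload.pl -k ACCESS_KEY:SECRET_KEY

Substitute the account's access key and secret key, separated by a colon, and do not record the actual keys in shared documentation. This initiates the file transfer to IA.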

8. Troubleshoot the transfer: If any required metadata fields are missing, the upload will stop. Fix the metadata.csv and run the script again. Diacritics may not transfer correctly; they can be edited manually in the IA interface.
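A pre-flight check along the following lines could catch missing fields and missing rows before the script is run. It is a sketch under stated assumptions: the REQUIRED column list is a guess (confirm it against the IAS3 Bulk Uploader instructions), the file column is assumed to hold the file name exactly as it appears in the directory, and the function name check_metadata is hypothetical.

```python
import csv
from pathlib import Path

# Assumed set of required columns; confirm against the uploader's instructions.
REQUIRED = ("item", "file", "mediatype", "collection", "title")


def check_metadata(directory):
    """Return a list of problems found in <directory>/metadata.csv."""
    directory = Path(directory)
    problems = []
    with open(directory / "metadata.csv", newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    listed = set()
    for number, row in enumerate(rows, start=2):  # row 1 is the header
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append(f"row {number}: missing {column}")
        name = (row.get("file") or "").strip()
        if name:
            listed.add(name)
            if not (directory / name).is_file():
                problems.append(f"row {number}: file not found: {name}")

    # Every pdf in the directory needs its own row.
    for pdf in sorted(directory.glob("*.pdf")):
        if pdf.name not in listed:
            problems.append(f"no metadata row for {pdf.name}")
    return problems
```

Calling check_metadata(".") from the upload directory returns an empty list when nothing is flagged; otherwise each entry names the row or file to fix.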

OpenRefine ia_metadata_json script: This script transforms Sierra export metadata to the IA specification for bulk import. Copy and paste the script below into OpenRefine.

[
  {
    "op": "core/column-rename",
    "description": "Rename column 245|ab to title",
    "oldColumnName": "245|ab",
    "newColumnName": "title"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column AUTHOR to creator",
    "oldColumnName": "AUTHOR",
    "newColumnName": "creator"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column LANG to language",
    "oldColumnName": "LANG",
    "newColumnName": "language"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column 260|b to publisher",
    "oldColumnName": "260|b",
    "newColumnName": "publisher"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column 008 Date One to date",
    "oldColumnName": "008 Date One",
    "newColumnName": "date"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column SUBJECT to subject[0]",
    "oldColumnName": "SUBJECT",
    "newColumnName": "subject[0]"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column NOTE to description",
    "oldColumnName": "NOTE",
    "newColumnName": "description"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column BARCODE to file",
    "oldColumnName": "BARCODE",
    "newColumnName": "file"
  },
  {
    "op": "core/column-addition",
    "description": "Create column item at index 8 based on column file using expression grel:\"frick-\"+value",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "item",
    "columnInsertIndex": 8,
    "baseColumnName": "file",
    "expression": "grel:\"frick-\"+value",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column creator using expression grel:value.replace(/\\.$/, '')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "creator",
    "expression": "grel:value.replace(/\\.$/, '')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column title using expression grel:value.replace(/\\.$/, '')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "title",
    "expression": "grel:value.replace(/\\.$/, '')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column publisher using expression grel:value.replace(/\\,$/, '')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "publisher",
    "expression": "grel:value.replace(/\\,$/, '')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column description using expression grel:value.replace(/\\;/, '').replace(/\\\"/, ' ')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "description",
    "expression": "grel:value.replace(/\\;/, '').replace(/\\\"/, ' ')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-move",
    "description": "Move column file to position 0",
    "columnName": "file",
    "index": 0
  },
  {
    "op": "core/column-move",
    "description": "Move column item to position 0",
    "columnName": "item",
    "index": 0
  },
  {
    "op": "core/column-addition",
    "description": "Create column mediatype at index 1 based on column item using expression grel:\"texts\"",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "mediatype",
    "columnInsertIndex": 1,
    "baseColumnName": "item",
    "expression": "grel:\"texts\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-move",
    "description": "Move column file to position 1",
    "columnName": "file",
    "index": 1
  },
  {
    "op": "core/column-addition",
    "description": "Create column collection at index 3 based on column mediatype using expression grel:\"frickartreferencelibrary\"",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "collection",
    "columnInsertIndex": 3,
    "baseColumnName": "mediatype",
    "expression": "grel:\"frickartreferencelibrary\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column contributor at index 4 based on column collection using expression grel:\"Frick Art Reference Library\"",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "contributor",
    "columnInsertIndex": 4,
    "baseColumnName": "collection",
    "expression": "grel:\"Frick Art Reference Library\"",
    "onError": "set-to-blank"
  }
]
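For readers who want to see what the text transforms in the script do, here is a rough Python equivalent. It is a sketch, not a replacement for the OpenRefine script: the function name clean_row is hypothetical, the sample row is made up, and it assumes the GREL replace calls act on every match of their pattern.

```python
import re


def clean_row(row):
    """Apply cleanups equivalent in intent to the script's text transforms."""
    cleaned = dict(row)
    # creator and title: drop a trailing period.
    for column in ("creator", "title"):
        cleaned[column] = re.sub(r"\.$", "", cleaned[column])
    # publisher: drop a trailing comma.
    cleaned["publisher"] = re.sub(r",$", "", cleaned["publisher"])
    # description: remove semicolons and turn double quotes into spaces.
    cleaned["description"] = (
        cleaned["description"].replace(";", "").replace('"', " ")
    )
    return cleaned


# Made-up sample row using the renamed column names.
sample = {
    "creator": "Smith, Jane.",
    "title": "Catalogue of paintings.",
    "publisher": "The Library,",
    "description": "Includes index; bound with supplement",
}
print(clean_row(sample))
```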

Python and Command-Line Interface

This method has not been used within NYARC, but give it a try! It has more features than the Perl method, and the Perl script is no longer supported.

https://github.com/jjjake/internetarchive
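Since this method has not been tried within NYARC, the following is only a starting sketch. It assumes the internetarchive package is installed (pip install internetarchive) and that credentials were saved beforehand with the ia configure command. The helper names build_metadata and upload_pdf are hypothetical; the field names follow the batch process described above.

```python
def build_metadata(title, creator, collection, contributor, subjects=()):
    """Assemble item metadata using the field names from the batch process."""
    metadata = {
        "mediatype": "texts",
        "collection": collection,
        "title": title,
        "creator": creator,
        "contributor": contributor,
    }
    if subjects:
        # A list keeps each subject string whole, commas included.
        metadata["subject"] = list(subjects)
    return metadata


def upload_pdf(identifier, pdf_path, metadata):
    """Upload one pdf to the item named by identifier.

    Returns the list of responses from the library so the caller can
    inspect them. Untested within NYARC; treat as a starting point.
    """
    # Imported here so build_metadata works without the package installed.
    from internetarchive import upload

    return upload(identifier, files=[pdf_path], metadata=metadata)
```

A call might look like upload_pdf("frick-31072001519406", "31072001519406.pdf", build_metadata(...)), with the metadata values taken from the Sierra export.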
