Excavating the Library of Congress MARC Records - Maps Over Time

Mar 6, 2019 10:11 · 1263 words · 6 minutes read

For this week’s Artists in the Archive homework assignment we were to excavate the Library of Congress MARC records database and visualize something we discovered in an interesting way.

I learned a lot in the last class, in particular that it's best to start with a question when trying to extract something from data.

Jer showing us different ways to look at data.

I've personally always been infatuated by maps: as a kid I bought books of maps whenever I was in a new place and wanted to get an understanding of it, and I navigated with Thomas Guides when my family went on road trips. So I thought of a question to ask:

How has the creation of maps changed over time? What can be discovered about the perception of the world from patterns in these changes?

Implementation/Technologies

All of the code can be seen in my GitHub repo.

I had a difficult time running the existing code that downloaded the XML and converted it to an object, because the library xml2obj requires Node version 6.0 and I wanted to use a current version of Node with ES7 features. So I wrote my own conversion code using the library xml-stream. You can see this in tagCounts.ts, which extracts counts by tag, and records.ts, which eventually creates a JSON file with the extracted information to use for visualization.
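The heart of the streaming approach looks roughly like this. This is a hedged sketch rather than the actual contents of tagCounts.ts, and maps.xml is a placeholder file name:

import * as fs from 'fs';
// xml-stream is a CommonJS module without bundled types
const XmlStream = require('xml-stream');

const counts = new Map<string, number>();

// Parse incrementally so the full database never has to fit in memory
const xml = new XmlStream(fs.createReadStream('maps.xml'));

// MARCXML datafields look like <datafield tag="245">...</datafield>;
// xml-stream exposes attributes on the element's $ property
xml.on('endElement: datafield', (field: any) => {
  const tag = field.$.tag;
  counts.set(tag, (counts.get(tag) || 0) + 1);
});

xml.on('end', () => {
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  console.log(sorted);
});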

For technologies, I used TypeScript, because I find the instant feedback incredibly useful when working with new libraries I don't understand. On the front end, I used React, D3, and TypeScript, because I loved the aesthetics of the D3 demos on Observable and thought it would be easy to modify them for my specific use. I chose React to render SVG elements with attributes generated by D3, because the rendering code is more readable that way than with long chains of D3 rendering commands.
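As a rough illustration of that split (a hypothetical component, not code from the repo), D3 computes the coordinate mappings and React owns the SVG markup:

import * as React from 'react';
import { scaleLinear } from 'd3-scale';

interface Point { year: number; value: number; }

export const Dots = ({ data }: { data: Point[] }) => {
  // D3 computes the pixel mappings...
  const x = scaleLinear().domain([1900, 2016]).range([0, 600]);
  const y = scaleLinear().domain([0, 1000]).range([400, 0]);

  // ...and React renders the elements, instead of chained D3 .append() calls
  return (
    <svg width={600} height={400}>
      {data.map((d, i) => (
        <circle key={i} cx={x(d.year)} cy={y(d.value)} r={2} />
      ))}
    </svg>
  );
};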

Discovering Statistically Significant Data

I wanted to find which datafield tags in the maps database have the highest occurrence, to know which types of data would be statistically significant enough to dig into. This code parses a MARC database XML file and gets the total counts by tag. Here are the top tags and their counts, as printed by the program:

[[ '500', 970743 ],
[ '651', 410810 ],
[ '052', 348608 ],
[ '245', 302647 ],
[ '010', 302647 ],
[ '040', 302572 ],
[ '300', 302547 ],
[ '050', 302140 ],
[ '260', 297330 ],
[ '650', 247535 ],
[ '110', 246766 ],
[ '255', 240712 ],
[ '034', 240420 ],
[ '246', 91793 ],
[ '710', 84087 ],
[ '250', 80846 ],
[ '020', 70015 ],
[ '507', 65916 ],
[ '740', 55647 ],
[ '700', 52085 ],
[ '041', 41960 ],
[ '100', 36657 ],
[ '505', 32668 ],
[ '655', 31030 ],
[ '072', 30307 ],
... 64 more items ]

I searched around the LOC website and found out what the top 15 of these meant. I then printed out a number of datafields and subfields with these tags and put together this list of tags with useful information (a sketch of the subfield lookup helper appears after the list):

245-a has title
260-c has publish year
260-a has publish city
650-a has category
255-a has map scale
034-b has geographic scale
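
The extraction code that follows relies on a getSubFieldFromRecord helper from the repo. Here is a hedged sketch of it, assuming a simplified MarcRecord shape:

// Assumed, simplified shapes; the repo's MarcRecord type may differ
interface SubField { code: string; value: string; }
interface DataField { tag: string; subFields: SubField[]; }
interface MarcRecord { dataFields: DataField[]; }

// Returns the value of the first matching subfield, or null if absent
export function getSubFieldFromRecord(
  tag: string,
  subTag: string,
  record: MarcRecord,
): string | null {
  const field = record.dataFields.find(f => f.tag === tag);
  if (!field) return null;
  const sub = field.subFields.find(s => s.code === subTag);
  return sub ? sub.value : null;
}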

Attempt 1 - Change in map scale over time

I thought it would be interesting to see how map scales change over time. To do that, I needed to parse the publish year records and the scale records.

The publish year proved challenging because it came in a bunch of different formats. Often it would be something like 195-, where the dash indicated that the exact year wasn't known. I wrote a regex to search the year fields for two- to four-digit numbers with one or two dashes on the end. Then I converted each dash to a 5, which approximates the middle of the possible range:

export function getAndParseYear(record: MarcRecord) {
  const tag = '260';
  const subTag = 'c';

  const yearString = getSubFieldFromRecord(tag, subTag, record);

  if (!yearString) {
    return null;
  }

  const approxYear = parseYear(yearString);

  if (!approxYear) {
    return null;
  }

  const year = approximateToMiddleYear(approxYear);

  if (!isValidYear(year)) {
    console.log('not four digits', year, yearString);
    return null;
  }
  return year;
}

// Match a four-digit year, or two or three digits with one or two
// trailing dashes (e.g. "195-", "19--"), anywhere in the field
const yearRegex = /\d{4}|\d{3}-|\d{2}--/;

function parseYear(yearString: string) {
  const yearMatches = yearRegex.exec(yearString);
  return yearMatches ? yearMatches[0] : null;
}

function isValidYear(yearString: string) {
  if (!yearString) return null;

  return yearString.match(yearRegex);
}

// String.replace with a string pattern only replaces the first occurrence,
// so calling it twice handles up to two dashes ("19--" becomes "1955")
function approximateToMiddleYear(approxYear: string) {
  return approxYear.replace('-', '5').replace('-', '5');
}
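
A few illustrative spot checks (not from the repo) show how the odd formats normalize:

['1984.', '[195-]', 'c19--'].forEach(raw => {
  const match = parseYear(raw);
  console.log(raw, '->', match ? approximateToMiddleYear(match) : null);
});
// 1984.  -> 1984
// [195-] -> 1955
// c19--  -> 1955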

Then it was time to parse the scales. These came in a bunch of different formats, such as:

Scale [1:12,000]. "1ʺ = 1,000ʹ"
Scale [ca. 1:14,500].
Scale [1:24,000]. 1 in. equals 2,000 ft.
Scale 1:25,000. 1 cm. = 0.25 km.
Not drawn to scale.

For this visualization, I just considered the number to the right of the colon as the upper scale and assumed the lower scale is 1. I wrote some code that found the first colon and scanned to the right until it reached the last digit. The string between the colon and the last digit was considered the scale:

export function getAndParseScale(record: MarcRecord) {
  const sizeString = getSubFieldFromRecord('255', 'a', record);

  if (sizeString && isValidSizeString(sizeString)) {
    const upperScale = getUpperScale(sizeString);

    if (upperScale) {
      // strip thousands separators before parsing, e.g. "24,000" -> 24000
      const parsed = parseInt(stripCommas(upperScale), 10);
      if (!isNaN(parsed)) {
        return parsed;
      }
    }
  }

  return null;
}

const commaRegex = /,/g

function stripCommas(value: string) {
  const stripped = value.replace(commaRegex, '');
  return stripped;
}

const isValidSizeString = (sizeString: string) => (
    // todo: make regex
    !sizeString.includes('Not') && !sizeString.includes('not'))

const endingBracketCharacters = ']. ';
function getUpperScale(sizeString: string) {
  const colonLocation = sizeString.indexOf(':');

  // indexOf returns -1 when there is no colon (a plain truthiness check
  // would also wrongly reject a colon at position 0)
  if (colonLocation === -1) return null;

  // find either a closing bracket, period, or space after the colon to
  // mark the end of the number
  const rightSide = sizeString.substring(colonLocation + 1, sizeString.length);

  for (let i = 0; i < rightSide.length; i++) {
    const character = rightSide[i];
    if (endingBracketCharacters.includes(character)) {
      return rightSide.substring(0, i);
    }
  }
  return null;
}
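
And some illustrative spot checks (again, not from the repo) against the formats listed above:

console.log(getUpperScale('Scale [1:24,000]. 1 in. equals 2,000 ft.')); // "24,000"
console.log(getUpperScale('Scale 1:25,000. 1 cm. = 0.25 km.'));         // "25,000"
console.log(getUpperScale('Not drawn to scale.'));                      // null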

Unfortunately, the visualizations I attempted with this data didn't reveal anything meaningful. First, I tried an x/y scatter plot, but there were so many points that it slowed my computer to a halt. I ended up getting the average scale for each year and plotting that instead. Sadly I no longer have this plot, as my code has changed, but there were no interesting patterns in it. I then tried the d3-hexbin visualization tool to group nearby points into hexes. This didn't produce anything that interesting either:

Attempt at visualizing map scales over time, from year 1900 to 2016. This didn't produce anything interesting as most maps were distributed over the last 20 years. It was probably the wrong visualization for the task.
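
Since the original averaging code is gone, here is a hedged reconstruction of what the per-year averaging step could have looked like (averageScaleByYear is a hypothetical name):

import { mean } from 'd3-array';

interface ParsedRecord { year: number; scale: number; }

// Collapse all records for a year into one (year, average scale) point
function averageScaleByYear(records: ParsedRecord[]): [number, number][] {
  const byYear = new Map<number, number[]>();
  for (const r of records) {
    const scales = byYear.get(r.year) || [];
    scales.push(r.scale);
    byYear.set(r.year, scales);
  }
  return [...byYear.entries()]
    .map(([year, scales]) => [year, mean(scales) as number] as [number, number])
    .sort((a, b) => a[0] - b[0]);
}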

Final result

I wanted to come out with something meaningful, so the next thing I looked at was map categories over time. Maybe this would show what society was interested in?

I thought an interesting way to show this would be using d3-shape, based on this Observable demo of a Stacked Area Chart:

To do this, I took all record categories, got the top ones by count, and then counted the occurrences of each of those categories per year. The code for this can be seen in Visualization.tsx. These counts got grouped into areas that were drawn:

I could not figure out how to get the right-side legend to properly show the scale, but it should read over 1000 at the top. If I had more time I'd fix this.
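
For reference, the stacking step with d3-shape looks roughly like this. This is a sketch assuming rows shaped like { year, [category]: count }, not the exact code in Visualization.tsx:

import { stack, area } from 'd3-shape';
import { scaleLinear } from 'd3-scale';

type Row = { year: number } & Record<string, number>;

function stackedAreaPaths(rows: Row[], categories: string[]) {
  const x = scaleLinear().domain([1900, 2016]).range([0, 600]);
  const y = scaleLinear().domain([0, 1000]).range([400, 0]);

  // stack() turns each row into [y0, y1] pairs, one series per category
  const series = stack<Row>().keys(categories)(rows);

  const toArea = area<any>()
    .x((d: any) => x(d.data.year))
    .y0((d: any) => y(d[0]))
    .y1((d: any) => y(d[1]));

  // One SVG path string per category; React can render each as <path d={...} />
  return series.map(s => toArea(s));
}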

The application can be viewed on Glitch.