Adaptive Learning Models for Efficient and Standardized Archival Processes

Project Title: Adaptive Learning Models for Efficient and Standardized Archival Processes

Duration

January 2020 – Ongoing

Institution

Carl Albert Congressional Research and Studies Center Archives

Project Overview

This project addresses the increasing demands placed on archival systems by large-scale digitization efforts and the complex metadata requirements that accompany them. By developing adaptive learning pipelines, the project enhances efficiency, accuracy, and repeatability in managing digitized historical documents, especially those relevant to American Indian policy, congressional records, and tribal sovereignty.

The model extracts meaningful metadata and enriches archival content using Natural Language Processing (NLP) and machine learning in an iterative, feedback-controlled environment.


Technical Innovation

A core innovation of this project is its adaptive loop system that combines multiple modules:

  • Preprocessing: Automated OCR cleanup, diacritic analysis, format normalization
  • Text Extraction & Entity Recognition: Using AWS Textract, spaCy, and custom models
  • Controlled Vocabulary Matching: Real-time lookup against dynamic dictionaries
  • Feedback Loops: Enable metadata correction and refinement of the trained models
  • Contextual Matching Algorithms: Enable inference of names, tribes, and themes

Figure 1. Controls diagram: adaptive controls for vocabulary and dictionary development.
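
The preprocessing, vocabulary matching, and feedback modules listed above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `normalize` function, the `Vocabulary` class, the 0.85 similarity cutoff, and the sample terms are all hypothetical, and folding diacritics is used here only to build lookup keys (the preferred term keeps its original spelling).

```python
import re
import unicodedata
from difflib import get_close_matches
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Build a lookup key: rejoin hyphenated line breaks, fold diacritics,
    collapse whitespace, and lowercase."""
    text = re.sub(r"-\s*\n\s*", "", text)  # rejoin words split across lines
    text = unicodedata.normalize("NFKD", text)  # split letters from accents
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip().lower()


class Vocabulary:
    """Hypothetical controlled vocabulary: variant spelling -> preferred term."""

    def __init__(self, terms: Dict[str, str]):
        self.variants = {normalize(k): v for k, v in terms.items()}

    def lookup(self, token: str, cutoff: float = 0.85) -> Optional[str]:
        """Exact match on the normalized key, then a fuzzy fallback for OCR noise."""
        key = normalize(token)
        if key in self.variants:
            return self.variants[key]
        close = get_close_matches(key, list(self.variants), n=1, cutoff=cutoff)
        return self.variants[close[0]] if close else None

    def add_correction(self, variant: str, preferred: str) -> None:
        """Feedback step: a reviewer's correction becomes a known variant."""
        self.variants[normalize(variant)] = preferred
```

In this sketch an OCR misreading such as "Creek Natlon" still resolves through the fuzzy fallback, while a spelling the dictionary has never seen returns `None` until a reviewer records it with `add_correction`. That reviewer step is the part that makes the dictionary "dynamic" in the sense used above.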

Tools & Technologies

  • NLP Libraries: spaCy, TextBlob, NLTK
  • OCR and Text Processing: AWS Textract, Tesseract, Gensim
  • Machine Learning: Torch, Transformers
  • Python APIs: OpenAI, boto3, re, os, pandas
  • Custom Classifiers: For tribal affiliation, government functions, correspondence metadata
  • Feedback Loop Engines: Performance-aware batch revision triggers

Figure: Model comparison table. Model evolution: comparative summary of core elements and performance.
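
One way to read "performance-aware batch revision triggers" is as a threshold check on reviewed batches. The sketch below assumes that reading; the `BatchResult` structure, its field names, and the 5% threshold are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BatchResult:
    """Hypothetical summary of one processed batch after human review."""
    batch_id: str
    assignments: int  # metadata assignments made in the batch
    corrections: int  # assignments a reviewer later corrected

    @property
    def error_rate(self) -> float:
        return self.corrections / self.assignments if self.assignments else 0.0


def batches_to_revise(results: Iterable[BatchResult],
                      max_error_rate: float = 0.05) -> List[str]:
    """Return IDs of batches whose reviewed error rate exceeds the threshold."""
    return [r.batch_id for r in results if r.error_rate > max_error_rate]
```

Batches flagged this way would be queued for reprocessing once the dictionaries or models have been updated, so revision effort goes only where measured performance has dropped.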

Metrics and Performance

Metric      | Purpose                                     | Example
Accuracy    | Evaluate model precision and recall         | F1 Score ≈ 0.941 on 100-page test set
Speed       | Pages processed per minute                  | 20 pages/minute on 1,000-page batch
Error Rate  | % of incorrect assignments                  | 2.5% on 1,000 document test case
Scalability | Handles small to large datasets dynamically | Maintains >0.88 F1 score on 10,000-page loads

Figure: Metric table. Performance metrics and model assessment framework.
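
For reference, the three quantitative metrics above follow from standard definitions. The sketch below shows those formulas; the function names are illustrative, and the project's own evaluation scripts may count true and false positives differently.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def pages_per_minute(pages: int, elapsed_seconds: float) -> float:
    """Throughput: pages processed divided by elapsed minutes."""
    return pages / (elapsed_seconds / 60)


def error_rate(incorrect: int, total: int) -> float:
    """Share of assignments that were incorrect."""
    return incorrect / total
```

Read against the reported figures, 20 pages/minute means a 1,000-page batch takes about 50 minutes, and a 2.5% error rate on a 1,000-document test case corresponds to 25 incorrect assignments.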

Use Case: “Pleasant Porter” Entity Assignment

This example illustrates how the adaptive model linked a historical reference to Pleasant Porter with the correct tribe, region, and congressional records. It did so without direct keyword matches, by triangulating date, sender location, and prior content relationships.

Figure: Pleasant Porter matching logic. Conditional model logic for recognizing complex entity matches.
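
The triangulation described in this use case can be approximated by counting how many independent signals agree with each candidate. The sketch below is an illustration only: the `Candidate` structure, the two-signal minimum, and the sample values used to exercise it are hypothetical placeholders, not the project's authority records or its actual matching logic.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Candidate:
    """Hypothetical authority record for a person who may be referenced."""
    name: str
    tribe: str
    active_years: range
    locations: Set[str]
    related_terms: Set[str] = field(default_factory=set)


def score(c: Candidate, year: int, sender_location: str, terms: Set[str]) -> int:
    """Count agreeing signals: date, sender location, related content."""
    return (int(year in c.active_years)
            + int(sender_location in c.locations)
            + int(bool(c.related_terms & terms)))


def resolve(candidates: Iterable[Candidate], year: int, sender_location: str,
            terms: Set[str], min_signals: int = 2) -> Optional[Candidate]:
    """Return the unique best candidate, or None if evidence is weak or tied."""
    scored = [(score(c, year, sender_location, terms), c) for c in candidates]
    if not scored:
        return None
    best = max(s for s, _ in scored)
    winners = [c for s, c in scored if s == best]
    if best < min_signals or len(winners) > 1:
        return None  # leave the document for human review
    return winners[0]
```

Requiring at least two agreeing signals and a unique winner keeps the matcher conservative: a document that fits several people equally well stays unassigned rather than receiving a guessed name.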

Resources and Tribal Authority Integration

The project integrates language-specific dictionaries and subject matchers, including:

  • Tribal Directories
  • Historic Treaties
  • Language Dictionaries
  • Culturally significant terminology mapping

Figure: Tribal data reference. Language and tribal data integration resources used in contextual enrichment.

Outcomes

  • Created standardized metadata for over 75,000 records
  • Developed 3 evolving model pipelines tested on real collections
  • Automated tribal recognition and subject assignment with high accuracy
  • Released public training materials and scripts via GitHub
  • Framework adopted by the Congressional Portal Project and others