Adaptive Learning Models for Efficient and Standardized Archival Processes

Duration

January 2020 – Ongoing

Institution

Carl Albert Congressional Research and Studies Center Archives

Project Overview

This project addresses the increasing demands placed on archival systems by large-scale digitization efforts and the complex metadata requirements that accompany them. By developing adaptive learning pipelines, the project enhances efficiency, accuracy, and repeatability in managing digitized historical documents, especially those relevant to American Indian policy, congressional records, and tribal sovereignty.

The model extracts meaningful metadata and enriches archival content using Natural Language Processing (NLP) and machine learning in an iterative, feedback-controlled environment.

Technical Innovation

A core innovation of this project is its adaptive loop system that combines multiple modules:

Preprocessing: Automated OCR cleanup, diacritic analysis, format normalization
Text Extraction & Entity Recognition: Using AWS Textract, spaCy, and custom models
Controlled Vocabulary Matching: Real-time lookup against dynamic dictionaries
Feedback Loops: Metadata correction and refinement of the trained models
Contextual Matching Algorithms: Inference of names, tribes, and themes from context
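
The preprocessing and controlled vocabulary modules listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the function names normalize and match_vocabulary and the CONTROLLED_VOCAB entries are assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical controlled vocabulary: variant spelling -> preferred term.
# These entries are illustrative only, not the project's dictionaries.
CONTROLLED_VOCAB = {
    "creek nation": "Muscogee (Creek) Nation",
    "muscogee nation": "Muscogee (Creek) Nation",
    "choctaw nation": "Choctaw Nation",
}


def normalize(text):
    """Clean common OCR noise and normalize the format of a text snippet."""
    text = unicodedata.normalize("NFC", text)  # compose diacritics consistently
    text = re.sub(r"-\s*\n\s*", "", text)      # rejoin words hyphenated across lines
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


def match_vocabulary(text, vocab):
    """Return the preferred terms whose variants occur in the text."""
    lowered = normalize(text).lower()
    found = []
    for variant, preferred in vocab.items():
        pattern = r"\b" + re.escape(variant) + r"\b"
        if re.search(pattern, lowered) and preferred not in found:
            found.append(preferred)
    return found


page = "Letter regarding the Creek  Na-\ntion and the Choctaw Nation."
print(match_vocabulary(page, CONTROLLED_VOCAB))
```

In a fuller pipeline the dictionary would presumably be loaded from a maintained source and updated by the feedback loop rather than hard-coded.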

Figure 1 (Controls Diagram). Adaptive controls for vocabulary and dictionary development.

Tools & Technologies

NLP Libraries: spaCy, TextBlob, NLTK
OCR and Text Processing: AWS Textract, Tesseract, Gensim
Machine Learning: Torch, Transformers
Python APIs: OpenAI, boto3, re, os, pandas
Custom Classifiers: For tribal affiliation, government functions, correspondence metadata
Feedback Loop Engines: Performance-aware batch revision triggers
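
One way to picture a performance-aware batch revision trigger is sketched below. It assumes that reviewer corrections are counted per batch and that a batch is reprocessed when its correction rate passes a threshold. The BatchResult class, the function names, and the 5% default threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    batch_id: str
    total: int      # metadata assignments made in the batch
    corrected: int  # assignments a reviewer had to correct

    @property
    def error_rate(self):
        # Guard against empty batches to avoid division by zero.
        return self.corrected / self.total if self.total else 0.0


def needs_revision(result, max_error_rate=0.05):
    """Flag a batch for reprocessing when corrections exceed the threshold."""
    return result.error_rate > max_error_rate


def apply_corrections(vocab, corrections):
    """Return a copy of the dictionary with reviewer corrections folded in."""
    updated = dict(vocab)
    updated.update(corrections)
    return updated


batches = [
    BatchResult("batch-001", total=200, corrected=4),
    BatchResult("batch-002", total=200, corrected=30),
]
for batch in batches:
    print(batch.batch_id, needs_revision(batch))
```

Returning a new dictionary from apply_corrections, rather than mutating the old one, keeps earlier vocabulary versions available for comparison between model iterations.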

Model Comparison Table. Model evolution: comparative summary of core elements and performance.

Metrics and Performance

Metric	Purpose	Example
Accuracy	Evaluate model precision and recall	F1 score ≈ 0.941 on a 100-page test set
Speed	Pages processed per minute	20 pages/minute on a 1,000-page batch
Error Rate	Percentage of incorrect assignments	2.5% on a 1,000-document test set
Scalability	Handles small to large datasets dynamically	F1 score > 0.88 maintained on 10,000-page loads

Metric Table. Performance metrics and model assessment framework.
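
For reference, the metrics in the table follow standard definitions and can be computed as in the sketch below. The function names are assumptions, and the counts passed to precision_recall_f1 are made up for illustration; only the 2.5% error rate and 20 pages/minute figures correspond to examples in the table.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw assignment counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def error_rate(incorrect, total):
    """Fraction of assignments that were incorrect."""
    return incorrect / total


def pages_per_minute(pages, elapsed_seconds):
    """Throughput of a processing run."""
    return pages / (elapsed_seconds / 60)


# Illustrative counts only, not project results.
p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=10)
print(round(p, 3), round(r, 3), round(f1, 3))

# 25 incorrect assignments out of 1,000 documents is a 2.5% error rate.
print(error_rate(25, 1000))

# 1,000 pages in 50 minutes (3,000 seconds) is 20 pages per minute.
print(pages_per_minute(1000, 3000))
```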

Use Case: “Pleasant Porter” Entity Assignment

This example illustrates how the adaptive model linked a historical reference to Pleasant Porter with the correct tribe, region, and congressional records. No direct keyword match was available, so the model triangulated the document date, the sender's location, and relationships found in previously processed content.

Pleasant Porter Matching Logic. Conditional model logic for recognizing complex entity matches.
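
The triangulation described in this use case might look something like the sketch below, which counts how many contextual signals (date, sender location, overlap with previously seen terms) agree with each candidate entity. It is a simplified assumption about how such logic could work, not the project's model: the Candidate class, the scoring rule, the min_score threshold, and every candidate value are illustrative placeholders rather than project data.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    tribe: str
    active_years: tuple  # (first_year, last_year), inclusive
    locations: set = field(default_factory=set)
    related_terms: set = field(default_factory=set)


def context_score(candidate, doc_year, sender_location, prior_terms):
    """Count how many of the three contextual signals agree with a candidate."""
    first, last = candidate.active_years
    signals = [
        first <= doc_year <= last,                    # date falls in active period
        sender_location in candidate.locations,       # sender location is known
        bool(candidate.related_terms & prior_terms),  # overlap with prior content
    ]
    return sum(signals)


def best_match(candidates, doc_year, sender_location, prior_terms, min_score=2):
    """Return the top-scoring candidate, or None if the evidence is too weak."""
    if not candidates:
        return None

    def score(c):
        return context_score(c, doc_year, sender_location, prior_terms)

    top = max(candidates, key=score)
    return top if score(top) >= min_score else None


# Illustrative candidates only.
candidates = [
    Candidate("Pleasant Porter", "Muscogee (Creek) Nation", (1899, 1907),
              {"Okmulgee"}, {"allotment", "Creek council"}),
    Candidate("Hypothetical Correspondent", "Unspecified", (1880, 1890),
              {"Washington"}, {"appropriations"}),
]

match = best_match(candidates, doc_year=1902, sender_location="Okmulgee",
                   prior_terms={"allotment"})
print(match.name if match else "no confident match")
```

Requiring at least two agreeing signals is one simple way to avoid assigning an entity on the strength of a single coincidence; a production system would more likely use weighted or learned scores.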

Resources and Tribal Authority Integration

The project integrates language-specific dictionaries and subject matchers, including:

Tribal Directories
Historic Treaties
Language Dictionaries
Culturally significant terminology mapping

Tribal Data Reference. Language and tribal data integration resources used in contextual enrichment.

Outcomes

Created standardized metadata for over 75,000 records
Developed three evolving model pipelines tested on real collections
Automated tribal recognition and subject assignment with high accuracy
Released public training materials and scripts via GitHub
Framework adopted by the Congressional Portal Project and others