Introduction
This tutorial shows how to text-mine a large inventory spreadsheet to locate every instance of specific names and their variations. The technique is useful in archival research, genealogy, historical documentation, and data analysis.
Project Overview
Step-by-Step Process
Create a dictionary mapping each person's primary identifier to all their name variations, including alternate spellings, titles, and shortened versions.
Import your Excel file using pandas and examine its structure to identify relevant columns.
Iterate through every row, checking for any name variations using case-insensitive pattern matching.
For each match, capture the row number, matched variation, and relevant metadata (Series, Box Name, Folder Name, etc.).
Create an Excel workbook with summary statistics and detailed results for analysis.
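The first four steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `find_matches` helper, the sample rows, and the listed variations are invented for the example, and an in-memory DataFrame stands in for the spreadsheet that would normally be loaded with `pd.read_excel`. The `+ 2` row offset assumes the spreadsheet has a single header row and that row numbers should match what Excel displays.

```python
import re
import pandas as pd

# Step 1: primary identifier -> name variations (illustrative entries).
names_to_search = {
    "Clarke, Mary Frances": ["Mary Frances Clarke", "Clarke, Mary Frances"],
    "DeCock, Mary": ["Mary DeCock", "DeCock, Mary"],
}

# Step 2: stand-in for pd.read_excel(...); these rows are made up.
df = pd.DataFrame({
    "Series": ["BVM Constitutions", "General Administration", "Photographs"],
    "Box Name": ["Resource Materials", "Sisters' Correspondence - D", "Campus views"],
    "Box": [8, 2, 1],
    "Folder Name": ["mary frances clarke & successors", "DeCock, Mary", "Buildings"],
})

METADATA_COLUMNS = ["Series", "Box Name", "Box", "Folder Name"]


def find_matches(df, names, metadata_columns):
    """Steps 3-4: scan every row and record one match per person per row."""
    matches = []
    for position, (_, row) in enumerate(df.iterrows()):
        # Join all non-empty cells into one searchable string.
        row_text = " ".join(str(v) for v in row.values if pd.notna(v))
        for person, variations in names.items():
            for variation in variations:
                if re.search(re.escape(variation), row_text, re.IGNORECASE):
                    record = {
                        "Name": person,
                        "Matched Variation": variation,
                        # Assumption: +1 for the header row, +1 for 1-based rows.
                        "Row Number": position + 2,
                        "Context": row_text[:200],
                    }
                    for col in metadata_columns:
                        record[col] = row.get(col)
                    matches.append(record)
                    break  # stop at the first variation found for this person
    return matches


matches = find_matches(df, names_to_search, METADATA_COLUMNS)
for m in matches:
    print(m["Name"], "| row", m["Row Number"], "|", m["Matched Variation"])
```

Searching the joined row text keeps the sketch short; restricting the search to selected columns would be an equally reasonable design.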
Complete Python Script
Below is the full script used for this text mining analysis. It requires Python 3 with the pandas and openpyxl libraries.
Installation Requirements
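Both libraries are typically installed with `pip install pandas openpyxl`. As a small sketch, the check below confirms that they can be imported before the script runs; the `missing_packages` helper is a name chosen for this example.

```python
import importlib.util


def missing_packages(required=("pandas", "openpyxl")):
    """Return the names of required packages that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("pandas and openpyxl are available.")
```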
The Script
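A condensed sketch of the reporting stage is given here under stated assumptions; it is an illustration, not a reproduction of the script used for the analysis. The `write_report` helper, the sheet names, the output file name, and the header color are choices made for the example. The two sample records reuse values from the detailed results shown later in this tutorial. The function assumes at least one match was found.

```python
import os
import tempfile

import pandas as pd
from openpyxl.styles import Font, PatternFill
from openpyxl.utils import get_column_letter


def write_report(matches, output_path):
    """Write a Summary sheet and a Detailed Results sheet to one workbook."""
    details = pd.DataFrame(matches)
    summary = (
        details.groupby("Name")
        .size()
        .reset_index(name="Total Instances")
        .sort_values("Total Instances", ascending=False)
    )
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Summary", index=False)
        details.to_excel(writer, sheet_name="Detailed Results", index=False)
        for sheet in writer.sheets.values():
            # Color the header row (the fill color is an arbitrary choice).
            for cell in sheet[1]:
                cell.font = Font(bold=True, color="FFFFFF")
                cell.fill = PatternFill(fill_type="solid", start_color="4472C4",
                                        end_color="4472C4")
            # Widen each column to fit its longest value, capped at 60.
            for idx, column in enumerate(sheet.columns, start=1):
                longest = max(len(str(c.value)) for c in column if c.value is not None)
                sheet.column_dimensions[get_column_letter(idx)].width = min(longest + 2, 60)
    return summary


# Sample records; in practice these come from the row-by-row search.
matches = [
    {"Name": "Clarke, Mary Frances", "Matched Variation": "Mary Frances Clarke",
     "Row Number": 233, "Series": "BVM Constitutions"},
    {"Name": "DeCock, Mary", "Matched Variation": "Mary DeCock",
     "Row Number": 1543, "Series": "General Administration"},
]

# Hypothetical output location; a temp directory keeps the example tidy.
output_path = os.path.join(tempfile.gettempdir(), "name_search_results.xlsx")
summary = write_report(matches, output_path)

# Reading the workbook back uses the same call that loads an inventory file.
print(pd.read_excel(output_path, sheet_name="Summary"))
```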
Results Summary
Names Found
| Name | Total Instances | Status |
|---|---|---|
| Clarke, Mary Frances (Sister/Mother) | 41 | Found |
| DeCock, Mary (Sister) | 35 | Found |
| Garvey, Helen Maher (Robert Joseph)(Sister/President) | 8 | Found |
| Mann, Margaret | 3 | Found |
| Byrne, Catherine (Sister) | 1 | Found |
| Dougherty, Mary Cecilia (Mother) | 1 | Found |
| Farrell, Carolyn Lester (Sister) | 1 | Found |
| Baschnagel, Josita (Mother) | 1 | Found |
Names Not Found
The following names were searched but no instances were found in the inventory:
Sample Detailed Results
Each match includes comprehensive location information:
Matched Variation: Mary Frances Clarke
Row Number: 233
Series: BVM Constitutions
Box Name: BVM Constitutions Committees Vol. XII, Vol. XIII Resource Materials, Indexed
Box: 8.0
Folder Name: Mary Frances Clarke & successors
Matched Variation: Mary DeCock
Row Number: 1543
Series: General Administration
Box Name: Sisters' Correspondence - D
Box: 2.0
Folder Name: DeCock, Mary
Key Features
✓ Case-Insensitive Matching
Finds names regardless of capitalization (e.g., "mary frances clarke" matches "Mary Frances Clarke")
✓ Multiple Name Variations
Each person can have any number of name variations, including formal names, nicknames, titles, and alternate spellings
✓ Comprehensive Metadata
Captures row numbers and all relevant columns (Series, Box Name, Box, Folder Name) for easy document retrieval
✓ Duplicate Prevention
Counts each person only once per row, even if multiple variations appear in the same row
✓ Professional Excel Output
Generates formatted reports with color-coded headers and optimized column widths
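The matching and counting behavior listed above can be illustrated with a short sketch. The `count_rows_per_person` helper, the "Mother Clarke" variation, and the second and third sample rows are assumptions made for the example.

```python
import re

names_to_search = {
    "Clarke, Mary Frances": ["Mary Frances Clarke", "Mother Clarke"],
}

rows = [
    "mary frances clarke & successors",                      # lowercase still matches
    "Mother Clarke letters; see also Mary Frances Clarke",   # two variations, one row
    "Committee minutes",                                     # no match
]


def count_rows_per_person(rows, names):
    """Count, for each person, the rows in which any variation appears."""
    counts = {person: 0 for person in names}
    for text in rows:
        for person, variations in names.items():
            if any(re.search(re.escape(v), text, re.IGNORECASE) for v in variations):
                counts[person] += 1  # at most one increment per row
    return counts


print(count_rows_per_person(rows, names_to_search))
# → {'Clarke, Mary Frances': 2}
```

The second row contains two variations but adds only one to the total, which is the duplicate-prevention behavior described above.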
Applications
This text mining technique is useful for:
- Archival Research: Locate all references to historical figures across large document collections
- Genealogy: Track family members through records with varying name formats
- Data Quality: Identify inconsistent naming conventions in databases
- Historical Documentation: Create indices of people mentioned in archives
- Legal Discovery: Find all mentions of individuals in document collections
- Academic Research: Analyze frequency and context of historical figures in source materials
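For the data quality use case, one possible approach is to tally which variation was matched for each person, which shows how inconsistently a name is recorded. The sketch below assumes match records shaped like the detailed results in this tutorial; the `tally_variations` helper and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def tally_variations(matches):
    """Group match records by person and count each matched variation."""
    tallies = defaultdict(Counter)
    for match in matches:
        tallies[match["Name"]][match["Matched Variation"]] += 1
    return tallies


# Made-up records for illustration only.
sample_matches = [
    {"Name": "Clarke, Mary Frances", "Matched Variation": "Mary Frances Clarke"},
    {"Name": "Clarke, Mary Frances", "Matched Variation": "Mother Clarke"},
    {"Name": "Clarke, Mary Frances", "Matched Variation": "Mary Frances Clarke"},
]

tallies = tally_variations(sample_matches)
for person, counter in tallies.items():
    for variation, count in counter.most_common():
        print(f"{person}: {variation!r} x {count}")
```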
Customizing for Your Project
Adapting the Name List
Modify the names_to_search dictionary to include your specific people and their variations:
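A dictionary of this kind might look like the sketch below. The keys reuse identifiers from the results table, "Mary Frances Clarke" and "Mary DeCock" are the variations shown in the sample results, and the remaining variations are illustrative guesses. The `normalize_variations` helper is an optional addition assumed here to keep the list clean.

```python
# Primary identifier -> every form of the name worth searching for.
names_to_search = {
    "Clarke, Mary Frances (Sister/Mother)": [
        "Mary Frances Clarke",
        "Clarke, Mary Frances",
        "Mother Clarke",          # illustrative variation
    ],
    "DeCock, Mary (Sister)": [
        "Mary DeCock",
        "DeCock, Mary",
    ],
}


def normalize_variations(names):
    """Strip whitespace and drop blank or case-insensitive duplicate variations."""
    cleaned = {}
    for person, variations in names.items():
        seen = set()
        kept = []
        for variation in variations:
            text = variation.strip()
            key = text.lower()
            if text and key not in seen:
                seen.add(key)
                kept.append(text)
        cleaned[person] = kept
    return cleaned


names_to_search = normalize_variations(names_to_search)
```

Because matching is case-insensitive, variations that differ only in capitalization are redundant, which is why the helper treats them as duplicates.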
Changing Column Names
Update the column references to match your spreadsheet structure:
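One way to manage this, sketched under the assumption that the metadata columns are kept in a single list, is to check the list against the spreadsheet's headers before searching. The `METADATA_COLUMNS` name, the `resolve_columns` helper, and the one-row sample frame are hypothetical.

```python
import pandas as pd

# Columns to copy into each result; edit to match your spreadsheet's headers.
METADATA_COLUMNS = ["Series", "Box Name", "Box", "Folder Name"]


def resolve_columns(df, wanted):
    """Split the wanted column names into those present and those missing."""
    # Strip stray whitespace from headers (note: this renames df's columns in place).
    df.columns = [str(c).strip() for c in df.columns]
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return present, missing


# Sample frame with a trailing space in one header and no "Box Name" column.
df = pd.DataFrame({
    "Series ": ["General Administration"],
    "Box": [2],
    "Folder Name": ["DeCock, Mary"],
})

present, missing = resolve_columns(df, METADATA_COLUMNS)
print("Using:", present)
print("Not in spreadsheet:", missing)
```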
Adjusting Search Sensitivity
For more flexible matching, you can modify the search pattern:
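As one possibility, the pattern can tolerate different separators between the parts of a name, so that extra spaces, commas, periods, or hyphens do not prevent a match. This is a sketch of that idea; the `flexible_pattern` helper and the sample strings are assumptions for the example.

```python
import re

SEPARATORS = r"[\s,.\-]+"


def flexible_pattern(variation):
    """Compile a pattern that tolerates varying separators between name parts."""
    parts = [p for p in re.split(SEPARATORS, variation.strip()) if p]
    body = SEPARATORS.join(re.escape(p) for p in parts)
    # \b keeps the match from starting or ending inside a longer word.
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


pattern = flexible_pattern("Mary Frances Clarke")
for text in ["MARY  FRANCES   CLARKE papers", "Mary-Frances Clarke", "Mary F. Clarke"]:
    print(repr(text), "->", bool(pattern.search(text)))
```

Looser patterns find more rows but also more false positives, so the results are worth spot-checking after any change.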
Sample Output Files
View the actual report generated from this analysis to see the complete results with all 91 matches, formatted tables, and detailed metadata.
Tips & Best Practices
- Test with a subset first: Run the script on a smaller sample to verify accuracy before processing large files
- Include common variations: Think about how names might appear (with/without titles, first name only, last name only, etc.)
- Check for typos: Include common misspellings if known
- Review the context column: The first 200 characters can help verify matches are correct
- Consider punctuation: "O'Brien" vs "OBrien" or "St. Mary" vs "St Mary"
- Use word boundaries: Be careful with short names that might match partial words
- Document your variations: Keep a record of why you included each variation
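The punctuation and word-boundary tips can be combined in a single pattern builder. The sketch below is one way to do it, not the only one; the `bounded_pattern` helper and the test strings are assumptions made for the example.

```python
import re


def bounded_pattern(variation):
    """Match the variation as whole words, with apostrophes and periods optional."""
    pieces = []
    for ch in variation:
        if ch in "'.":
            pieces.append(re.escape(ch) + "?")  # "O'Brien" also matches "OBrien"
        elif ch.isspace():
            pieces.append(r"\s+")
        else:
            pieces.append(re.escape(ch))
    # Lookarounds reject matches that sit inside a longer word.
    return re.compile(r"(?<!\w)" + "".join(pieces) + r"(?!\w)", re.IGNORECASE)


print(bool(bounded_pattern("Mann").search("Mann, Margaret")))   # whole word
print(bool(bounded_pattern("Mann").search("Hermann papers")))   # partial word
print(bool(bounded_pattern("O'Brien").search("OBrien file")))   # missing apostrophe
print(bool(bounded_pattern("St. Mary").search("St Mary")))      # missing period
```

The second check is the only one that reports no match, since "Mann" appears there only as part of a longer word.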
Conclusion
This text mining approach provides an efficient, scalable method for locating individuals across large document collections. By automating the search and capturing location metadata for every match, researchers can quickly identify relevant materials and build thorough indices of their archival holdings.
The script is highly adaptable and can be modified for various types of text mining projects beyond name searches, including keyword analysis, topic identification, and pattern detection in structured data.