Text Mining Tutorial

How to Search for Multiple Names and Their Variations in Large Spreadsheets

Introduction

This tutorial demonstrates how to perform text mining on large inventory spreadsheets to locate all instances of specific names and their variations. This technique is invaluable for archival research, genealogy, historical documentation, and data analysis.

Use Case: Searching through 5,567 rows of archival inventory data to find 17 different people, each with multiple name variations (e.g., "Clarke, Mary Frances", "Mother Clarke", "Mary Frances Clarke").

Project Overview

5,567
Rows Searched
17
Names Tracked
91
Matches Found
8
Names Found

Step-by-Step Process

1 Prepare Your Name List

Create a dictionary mapping each person's primary identifier to all their name variations, including alternate spellings, titles, and shortened versions.

2 Load the Spreadsheet

Import your Excel file using pandas and examine its structure to identify relevant columns.

3 Search Each Row

Iterate through every row, checking for any name variations using case-insensitive pattern matching.

4 Record Matches

For each match, capture the row number, matched variation, and relevant metadata (Series, Box Name, Folder Name, etc.).

5 Generate Report

Create an Excel workbook with summary statistics and detailed results for analysis.

Complete Python Script

Below is the full script used for this text mining analysis. It requires Python 3 with pandas and openpyxl libraries.

Installation Requirements

pip install pandas openpyxl

The Script

import pandas as pd import re from collections import defaultdict from openpyxl import Workbook from openpyxl.styles import Font, PatternFill, Alignment # Load the spreadsheet file_path = 'October_2_Master_Inventory.xlsx' df = pd.read_excel(file_path, sheet_name=0) # Define the names to search for with all their variations names_to_search = { 'Clarke, Mary Frances (Sister/Mother)': [ 'Clarke, Mary Frances', 'Clarke, Frances', 'Mother Mary Frances Clarke', 'Mary Frances Clarke', 'Frances Clarke', 'Mother Clarke' ], 'Byrne, Catherine (Sister)': [ 'Byrne, Catherine', 'Sister Catherine Byrne', 'Catherine Byrne' ], 'Kelly, Eliza (Sister)': [ 'Kelly, Eliza', 'Kelly, Elizabeth', 'Sister Eliza', 'Eliza Kelly', 'Elizabeth Kelly' ], 'Mann, Margaret': [ 'Mann, Margaret', 'Sister Mary Margaret', 'Margaret Mann' ], 'Dougherty, Mary Cecilia (Mother)': [ 'Dougherty, Cecilia', 'Mother Mary Dougherty', 'Dougherty, Mary Cecilia', 'Cecilia Dougherty' ], # ... add all other names following the same pattern } # Search for all instances results = [] for idx, row in df.iterrows(): # Combine all text from the row for searching row_text = ' '.join([str(val) for val in row.values if pd.notna(val)]) # Check each name for primary_name, variations in names_to_search.items(): for variation in variations: # Case-insensitive search if re.search(re.escape(variation), row_text, re.IGNORECASE): results.append({ 'Primary Name': primary_name, 'Matched Variation': variation, 'Row Number': idx + 2, # +2 for Excel (1-indexed + header) 'Series': row.get('Series', ''), 'Box Name': row.get('Box Name', ''), 'Box': row.get('Box', ''), 'Folder Name': row.get('Folder Name', ''), 'Context': row_text[:200] # First 200 chars }) break # Only count once per row per person # Create summary statistics summary = defaultdict(int) for result in results: summary[result['Primary Name']] += 1 # Convert to DataFrames results_df = pd.DataFrame(results) summary_df = pd.DataFrame([ {'Name': name, 'Total Instances': count} for name, count in sorted(summary.items(), key=lambda x: x[1], reverse=True) ]) # Add names with 0 instances all_names = set(names_to_search.keys()) found_names = set(summary.keys()) missing_names = all_names - found_names for name in sorted(missing_names): summary_df = pd.concat([summary_df, pd.DataFrame([{ 'Name': name, 'Total Instances': 0 }])], ignore_index=True) # Save to Excel with formatting output_path = 'Name_Text_Mining_Report.xlsx' with pd.ExcelWriter(output_path, engine='openpyxl') as writer: # Write Summary sheet summary_df.to_excel(writer, sheet_name='Summary', index=False) # Write Detailed Results sheet results_df.to_excel(writer, sheet_name='Detailed Results', index=False) # Format the sheets workbook = writer.book # Format Summary sheet summary_sheet = workbook['Summary'] summary_sheet.column_dimensions['A'].width = 60 summary_sheet.column_dimensions['B'].width = 20 for cell in summary_sheet[1]: cell.font = Font(bold=True, size=12, color='FFFFFF') cell.fill = PatternFill(start_color='366092', end_color='366092', fill_type='solid') # Format Detailed Results sheet detail_sheet = workbook['Detailed Results'] detail_sheet.column_dimensions['A'].width = 60 detail_sheet.column_dimensions['B'].width = 30 detail_sheet.column_dimensions['C'].width = 15 detail_sheet.column_dimensions['D'].width = 40 detail_sheet.column_dimensions['E'].width = 40 detail_sheet.column_dimensions['F'].width = 15 detail_sheet.column_dimensions['G'].width = 50 for cell in detail_sheet[1]: cell.font = Font(bold=True, size=12, color='FFFFFF') cell.fill = PatternFill(start_color='366092', end_color='366092', fill_type='solid') print(f"Report created successfully! Total matches: {len(results)}")

Results Summary

Names Found

Name Total Instances Status
Clarke, Mary Frances (Sister/Mother) 41 Found
DeCock, Mary (Sister) 35 Found
Garvey, Helen Maher (Robert Joseph)(Sister/President) 8 Found
Mann, Margaret 3 Found
Byrne, Catherine (Sister) 1 Found
Dougherty, Mary Cecilia (Mother) 1 Found
Farrell, Carolyn Lester (Sister) 1 Found
Baschnagel, Josita (Mother) 1 Found

Names Not Found

The following names were searched but no instances were found in the inventory:

Barwick, Ramona (Simone) (Sister) Burrows, Joanne (Sister SC - not BVM) Collins, Donard (Sister) Cox, Joyce (Petrine) (Sister) Donaghoe, Terrence J., Rev. (Father) Dunn, Catherine (Sister) Hadro, Teri (Sister) Kelly, Eliza (Sister) Lilly, Mary Ascension (Mother)

Sample Detailed Results

Each match includes comprehensive location information:

Primary Name: Clarke, Mary Frances (Sister/Mother)
Matched Variation: Mary Frances Clarke
Row Number: 233
Series: BVM Constitutions
Box Name: BVM Constitutions Committees Vol. XII, Vol. XIII Resource Materials, Indexed
Box: 8.0
Folder Name: Mary Frances Clarke & successors
Primary Name: DeCock, Mary (Sister)
Matched Variation: Mary DeCock
Row Number: 1543
Series: General Administration
Box Name: Sisters' Correspondence - D
Box: 2.0
Folder Name: DeCock, Mary

Key Features

✓ Case-Insensitive Matching

Finds names regardless of capitalization (e.g., "mary frances clarke" matches "Mary Frances Clarke")

✓ Multiple Name Variations

Each person can have unlimited name variations, including formal names, nicknames, titles, and alternate spellings

✓ Comprehensive Metadata

Captures row numbers and all relevant columns (Series, Box Name, Box, Folder Name) for easy document retrieval

✓ Duplicate Prevention

Counts each person only once per row, even if multiple variations appear in the same row

✓ Professional Excel Output

Generates formatted reports with color-coded headers and optimized column widths

Applications

This text mining technique is useful for:

  • Archival Research: Locate all references to historical figures across large document collections
  • Genealogy: Track family members through records with varying name formats
  • Data Quality: Identify inconsistent naming conventions in databases
  • Historical Documentation: Create indices of people mentioned in archives
  • Legal Discovery: Find all mentions of individuals in document collections
  • Academic Research: Analyze frequency and context of historical figures in source materials

Customizing for Your Project

Adapting the Name List

Modify the names_to_search dictionary to include your specific people and their variations:

names_to_search = { 'Last, First (Title)': [ 'Last, First', 'First Last', 'Nickname', 'Title First Last' ], # Add more people... }

Changing Column Names

Update the column references to match your spreadsheet structure:

'Series': row.get('Your_Column_Name', ''), 'Box Name': row.get('Your_Column_Name', ''),

Adjusting Search Sensitivity

For more flexible matching, you can modify the search pattern:

# Exact match (default) if re.search(re.escape(variation), row_text, re.IGNORECASE): # Partial word match (use with caution) if variation.lower() in row_text.lower(): # Word boundary match (whole words only) if re.search(r'\b' + re.escape(variation) + r'\b', row_text, re.IGNORECASE):

Sample Output Files

View the actual report generated from this analysis to see the complete results with all 91 matches, formatted tables, and detailed metadata.

Tips & Best Practices

  • Test with a subset first: Run the script on a smaller sample to verify accuracy before processing large files
  • Include common variations: Think about how names might appear (with/without titles, first name only, last name only, etc.)
  • Check for typos: Include common misspellings if known
  • Review the context column: The first 200 characters can help verify matches are correct
  • Consider punctuation: "O'Brien" vs "OBrien" or "St. Mary" vs "St Mary"
  • Use word boundaries: Be careful with short names that might match partial words
  • Document your variations: Keep a record of why you included each variation

Conclusion

This text mining approach provides an efficient, scalable method for locating individuals across large document collections. By automating the search process and capturing comprehensive metadata, researchers can quickly identify relevant materials and build comprehensive indices of their archival holdings.

The script is highly adaptable and can be modified for various types of text mining projects beyond name searches, including keyword analysis, topic identification, and pattern detection in structured data.