
A Python Pipeline for Federal Anti-Trans Bills





Purpose:

In this project, I designed and implemented a Python-based data pipeline to address a critical gap in accessibility and structure around U.S. federal anti-trans legislation. While TransLegislation.com does vital advocacy work by cataloging bills that impact trans communities, the site doesn't provide downloadable data or consistent formatting for further computational analysis. My goal was to transform this information into a centralized, clean, and analysis-ready dataset that could support research, advocacy, and public accountability.

Project Overview:
  • Scraped TransLegislation.com to extract metadata, including bill titles, descriptions, and source links
  • Built direct XML links to retrieve full bill text from congress.gov
  • Parsed and cleaned raw XML using BeautifulSoup to remove legal markup
  • Exported structured metadata and cleaned text for use in further analysis

Results & Next Steps:

The final product is a centralized, analysis-ready dataset: structured metadata for each cataloged federal bill alongside its raw and cleaned full text, stored as TXT files. That output already supports the exploratory NLP work shown below, and next steps include automating the pipeline end to end and applying machine learning to the cleaned corpus.

Skills Developed:

This project strengthened skills in web scraping, text extraction, data wrangling with pandas, and XML/HTML parsing using BeautifulSoup. It also involved constructing programmatic links, validating edge cases, and designing a pipeline structure suitable for future automation and machine learning applications.

Tools Used:

Python, scikit-learn, BeautifulSoup, pandas, GitHub, Jupyter Notebook, Matplotlib






View Project on GitHub






Data Collection


Using requests and BeautifulSoup, I collected and parsed the HTML content of 2024 and 2025 bill listings. For each card, I extracted the title, category, description, and any external links. When LegiScan links were missing or inconsistent (especially in 2025), I relied on congress.gov as the primary source.
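
A minimal sketch of this step (the listing URL and CSS class names here are assumptions, not the site's actual markup):

import requests
from bs4 import BeautifulSoup

url = "https://translegislation.com/bills/2025/US"  # assumed listing URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

bills = []
for card in soup.find_all("div", class_="bill-card"):  # hypothetical class name
    bills.append({
        "title": card.h3.get_text(strip=True) if card.h3 else None,
        "description": card.p.get_text(strip=True) if card.p else None,
        "links": [a["href"] for a in card.find_all("a", href=True)],
    })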



Metadata Extraction and Enrichment

From the scraped URLs, I used regular expressions and a slug-to-abbreviation mapping to identify key fields:

  • Session number
  • Bill number (e.g., HR 3215)
  • Bill type abbreviation (e.g., hr, sres)
Example mapping code:

map_code = {
    'house-bill': 'hr',
    'senate-bill': 's',
    'house-resolution': 'hres',
    'senate-resolution': 'sres',
    'house-joint-resolution': 'hjres',
    'senate-joint-resolution': 'sjres',
}
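
Combined with a regular expression, that mapping yields all three fields. Here's a sketch, assuming congress.gov bill pages follow the usual /bill/{session}th-congress/{type-slug}/{number} URL shape:

import re

# Assumed URL shape: https://www.congress.gov/bill/118th-congress/house-bill/3215
BILL_URL = re.compile(r"congress\.gov/bill/(\d+)\w*-congress/([a-z-]+)/(\d+)")

def parse_bill_url(url):
    match = BILL_URL.search(url)
    if match is None:
        return None
    session, slug, number = match.groups()
    return {
        "session": session,
        "bill_type": map_code.get(slug),  # map_code defined above
        "bill_number": number,
    }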




Text Retrieval and Cleaning

Each bill’s XML was parsed using BeautifulSoup to extract only the meaningful body content, removing headers, footnotes, and tags. Both the raw and cleaned text were stored as TXT files for use in NLP workflows.
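
A sketch of that cleaning pass (the input filename and the element names stripped here are assumptions about the bill XML rather than a fixed schema):

from bs4 import BeautifulSoup  # the "xml" parser also requires lxml

with open("raw_bill.xml", encoding="utf-8") as f:  # filename is illustrative
    soup = BeautifulSoup(f.read(), "xml")

# Strip non-substantive elements before pulling body text
for tag in soup.find_all(["metadata", "form"]):
    tag.decompose()

cleaned = soup.get_text(separator=" ", strip=True)
with open("cleaned_text.txt", "w", encoding="utf-8") as out:
    out.write(cleaned)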


I constructed direct XML URLs using this format:
https://www.congress.gov/{session}/bills/{bill_type}{bill_number}/BILLS-{session}{bill_type}{bill_number}ih.xml
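
As a function, this template is a single f-string ("ih" is the code for the introduced-in-House bill version; the session number in the usage note is illustrative):

def build_xml_url(session, bill_type, bill_number, version="ih"):
    # "ih" = introduced in House; Senate bills use "is"
    return (
        f"https://www.congress.gov/{session}/bills/{bill_type}{bill_number}/"
        f"BILLS-{session}{bill_type}{bill_number}{version}.xml"
    )

# build_xml_url(118, "hr", 3215)
# -> https://www.congress.gov/118/bills/hr3215/BILLS-118hr3215ih.xml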



Initial NLP Analysis: Word Cloud Phrasing

To explore recurring language, I generated two word clouds: one from single-word frequencies and one from 2- and 3-word n-grams. The stopword list was extended to cover legal and congressional boilerplate terms.

from sklearn.feature_extraction.text import CountVectorizer

# Count 2- and 3-word phrases, skipping the extended stopword list
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words=custom_stopwords)
X = vectorizer.fit_transform([cleaned_text])
frequencies = {word: X.sum(axis=0)[0, idx] for word, idx in vectorizer.vocabulary_.items()}
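
The cloud itself can then be rendered from those phrase counts with the wordcloud package; a minimal sketch, with illustrative styling:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=1200, height=600, background_color="white")
wc.generate_from_frequencies(frequencies)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()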




The resulting visualizations help reveal rhetorical patterns and thematic repetition across the bills.









Single-word frequency cloud from cleaned_text.txt









Phrase-based (bigram and trigram) word cloud, which surfaces more policy-relevant language