AutoTagger: A Streamlit App That Generates Blog Tags Using NLP + Keyword Extraction

How I built a simple web app to extract clean, SEO-friendly blog tags using Python, NLP, and keyword extraction algorithms

🧠 The Problem

Bloggers and content creators often struggle with the final (but critical) step of publishing: adding the right tags.

Tags affect:

SEO visibility
Topic clustering
Internal search/navigation
Reader discovery via recommended posts

Yet many people still add them manually — a guessing game prone to inconsistency and repetition.

There had to be a better way.

💡 The Solution

I built AutoTagger, a data-driven Streamlit app that accepts any blog post URL and returns structured, clean tag suggestions based on actual content.

Just paste your blog link — and the app automatically:

Extracts the main text body
Cleans + preprocesses it using NLP
Runs 3 independent keyword extraction models
Returns three sets of tag suggestions, each based on a different data method

✅ What the App Does

AutoTagger is more than just a web scraper. It integrates multiple layers of data processing, filtering, and ranking logic to ensure the tags you get are meaningful and useful.

Here’s what it does step-by-step:

1. HTML Parsing + Text Extraction

Uses requests + BeautifulSoup to download and parse article content.
Extracts the <article> tag or common blog containers (e.g. .post-content, body).
Rejects pages with fewer than 100 words or known error messages like “404” or “not found”.
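
The extraction step above can be sketched roughly as follows. This is a minimal sketch rather than the app's actual code: the function names, the order in which containers are tried, and the 10-second timeout are assumptions for illustration. The 100-word threshold and the two error markers come from the description above.

```python
# Hypothetical helpers; names and selector order are illustrative assumptions.
MIN_WORDS = 100
ERROR_MARKERS = ("404", "not found")  # markers quoted in the description


def extract_main_text(html: str) -> str:
    """Return text from <article>, else .post-content, else <body>."""
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("article") or soup.select_one(".post-content") or soup.body
    if container is None:
        return ""
    return container.get_text(separator=" ", strip=True)


def is_usable(text: str) -> bool:
    """Reject very short pages and pages that look like error pages."""
    if len(text.split()) < MIN_WORDS:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)


def fetch_article_text(url: str) -> str:
    """Download a page and return its main text, or raise ValueError."""
    import requests  # third-party

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    text = extract_main_text(response.text)
    if not is_usable(text):
        raise ValueError("Page is too short or looks like an error page")
    return text
```

Note that a plain substring check is crude: a long article that happens to mention “not found” would also be rejected, so a real app may want to look only at the title or the first few lines.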

2. Preprocessing with NLP

Tokenizes and lowercases the entire text
Removes stopwords using nltk
Lemmatizes tokens using WordNetLemmatizer
Filters out special characters, digits, 1-letter words, and noise
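
A rough sketch of this preprocessing pipeline is shown below. It assumes a simple regex tokenizer in place of whatever tokenizer the app really uses, and the function names are hypothetical. The stopword set and the lemmatizer are passed in as arguments so the cleaning logic can be tried without NLTK data installed; `load_nltk_resources` shows how they could be loaded from `nltk`, assuming the `stopwords` and `wordnet` corpora are already available.

```python
import re


def load_nltk_resources():
    """Return NLTK's English stopwords and a WordNet lemmatize function.

    Assumes the 'stopwords' and 'wordnet' corpora are already installed.
    """
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    return set(stopwords.words("english")), WordNetLemmatizer().lemmatize


def preprocess(text, stop_words, lemmatize):
    """Lowercase, tokenize, drop noise and stopwords, then lemmatize."""
    # Keep alphabetic runs only: this drops digits and special characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = []
    for token in tokens:
        # Skip 1-letter words and stopwords.
        if len(token) < 2 or token in stop_words:
            continue
        cleaned.append(lemmatize(token))
    return cleaned


# Typical use (requires NLTK data):
# stop_words, lemmatize = load_nltk_resources()
# tokens = preprocess(article_text, stop_words, lemmatize)
```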

3. Three Tag Generation Methods

Each method contributes to a separate, ranked list of up to 10 relevant tags:

🔢 Method 1: Frequency-Based Tags

Extracts top terms using collections.Counter
Filters out duplicates and short terms
Lemmatizes results and deduplicates intelligently

Use case: Best for direct topics mentioned often in the article.
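
As an illustration, frequency-based tagging with `collections.Counter` can be as small as the sketch below. The function name and the `min_length` default are assumptions; the input is expected to be the lemmatized token list from the preprocessing step.

```python
from collections import Counter


def frequency_tags(tokens, top_n=10, min_length=2):
    """Return the most frequent tokens, skipping very short terms."""
    counts = Counter(t for t in tokens if len(t) >= min_length)
    # most_common sorts by count, highest first; Counter keys are unique,
    # so the result contains no duplicate tags.
    return [term for term, _ in counts.most_common(top_n)]
```

Because the tokens are lemmatized before counting, variants such as “costs” and “cost” are counted as the same term.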

🧠 Method 2: RAKE (Rapid Automatic Keyword Extraction)

Uses the rake-nltk library
Identifies multi-word phrases based on word co-occurrence
Limits to phrases with 1–2 words for tag readability
Filters out messy characters, one-word noise, and repetition

Use case: Captures compound keyword phrases like “crypto investing” or “zero waste”.
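
Here is a hedged sketch of how the RAKE step and its cleanup could fit together. `rake_phrases` uses the `rake-nltk` `Rake` class with its `max_length` option (it needs the NLTK stopwords and Punkt data to run). `clean_phrases` is a hypothetical post-filter; its exact rules are assumptions based on the list above.

```python
import re


def rake_phrases(text, max_words=2):
    """Rank candidate phrases with rake-nltk (needs NLTK stopwords + punkt)."""
    from rake_nltk import Rake  # third-party: rake-nltk

    rake = Rake(max_length=max_words)  # cap phrase length in words
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases()  # highest score first


def clean_phrases(phrases, top_n=10, max_words=2):
    """Keep short, letters-only phrases and drop repeats, preserving rank."""
    seen, tags = set(), []
    for phrase in phrases:
        phrase = " ".join(phrase.lower().split())
        if not 1 <= len(phrase.split()) <= max_words:
            continue
        # Reject phrases with stray symbols or digits, and tiny fragments.
        if re.search(r"[^a-z ]", phrase) or len(phrase) < 3:
            continue
        if phrase in seen:
            continue
        seen.add(phrase)
        tags.append(phrase)
        if len(tags) == top_n:
            break
    return tags
```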

🤖 Method 3: KeyBERT (BERT + Cosine Similarity)

Uses keybert to extract semantic keywords from the cleaned article text
Based on sentence-transformers under the hood
Captures meaningful and contextually relevant tags, even if terms appear rarely

Use case: Great for abstract topics, technical writing, or posts with synonyms.
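
To make the “BERT + cosine similarity” idea concrete: KeyBERT embeds the document and each candidate phrase, then ranks candidates by how close their vectors are to the document vector. The sketch below shows only that ranking step. The 3-dimensional vectors are invented for illustration and stand in for real sentence-transformers embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(doc_vector, candidate_vectors, top_n=10):
    """Rank candidate phrases by similarity to the document embedding."""
    scored = [
        (phrase, cosine_similarity(doc_vector, vec))
        for phrase, vec in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy "embeddings", made up for this example only.
doc = [0.9, 0.1, 0.3]
candidates = {
    "budgeting tips": [0.8, 0.2, 0.3],
    "urban life": [0.3, 0.1, 0.9],
    "cat videos": [0.0, 1.0, 0.0],
}
ranked = rank_candidates(doc, candidates, top_n=2)
```

This is why a phrase can rank highly even if it appears only once in the post: the score depends on meaning, not on counts.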

📊 Output Overview

Once processed, the app displays:

A word cloud of cleaned tokens
The raw + cleaned word count
A table of the top 10 keywords/phrases from each method
Final tag suggestions in markdown format (comma-separated)
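
For the table and the comma-separated output, a small sketch like the one below would do. The helper names and column labels are assumptions, not the app's real ones.

```python
from itertools import zip_longest


def build_tag_table(frequency, rake, keybert):
    """Line up the three ranked lists as rows, padding shorter lists with ''."""
    rows = zip_longest(frequency, rake, keybert, fillvalue="")
    return [
        {"Frequency": f, "RAKE": r, "KeyBERT": k}
        for f, r, k in rows
    ]


def tags_as_markdown(tags):
    """Render tags as one comma-separated line."""
    return ", ".join(tags)
```

In a Streamlit script, the rows could then be handed to `st.dataframe` and the string to `st.markdown`.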

All tags are:

Cleaned (no “– forever” or stray symbols)
Deduplicated across tokens
Filtered for length, clarity, and relevance
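
One way to read “deduplicated across tokens” is that a tag is dropped when every word in it is already covered by a higher-ranked tag. The sketch below implements that interpretation. It is an assumption about the behavior, with a hypothetical function name and thresholds.

```python
import re


def finalize_tags(candidates, top_n=10, min_chars=2):
    """Clean ranked candidates and drop tags that add no new word."""
    seen_tokens, tags = set(), []
    for candidate in candidates:
        tag = " ".join(candidate.lower().split())  # normalize case and spaces
        # Reject empty or tiny tags and anything with stray symbols or digits.
        if len(tag) < min_chars or re.search(r"[^a-z ]", tag):
            continue
        tokens = set(tag.split())
        if tokens <= seen_tokens:  # every word already covered earlier
            continue
        seen_tokens |= tokens
        tags.append(tag)
        if len(tags) == top_n:
            break
    return tags
```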

🧱 Tech Stack

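Here’s what powers the app:

Streamlit for the web interface
requests + BeautifulSoup for fetching and parsing pages
nltk for stopword removal and lemmatization (WordNetLemmatizer)
collections.Counter for frequency-based tags
rake-nltk for RAKE phrase extraction
keybert (built on sentence-transformers) for semantic keywords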

🧪 Example Use Case

Let’s say you publish a blog post titled “How I Save Money Living in KL Without a Car”.

Paste the URL into AutoTagger, and you might get:

Top Frequency Tags: money, save, kl, cost, living
RAKE Tags: public transport, affordable rent, grocery shopping
KeyBERT Tags: budgeting tips, urban life, lifestyle inflation

This gives you up to 30 tag candidates (up to 10 from each method, with possible overlap between the lists) that you can cherry-pick or combine depending on your SEO intent.

🔧 How I Built It

The logic centers around keyword filtering, deduplication, and phrase scoring. Here’s a simplified view of how KeyBERT keywords are filtered:

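import re
from keybert import KeyBERT

kw_model = KeyBERT()  # assumption: default embedding model
# text holds the cleaned article text; request 50 candidates, keep letters-only ones
keybert_raw = kw_model.extract_keywords(text, top_n=50, stop_words='english')
keybert_cleaned = [
    kw[0].lower().strip()
    for kw in keybert_raw
    if len(kw[0]) > 1 and not re.search(r"[^a-zA-Z\s]", kw[0])
]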

🌐 Try It Yourself

👉 Launch AutoTagger on Streamlit
No login. No API key. Just paste your blog post URL and get instant tags.

📦 Folder Structure

autotagger/
│
├── nltk_data/                 # Local NLTK data folder to avoid download errors
│   └── tokenizers/
│       └── punkt/             # Contains pre-downloaded Punkt tokenizer for sentence splitting
├── app.py                     # Main Streamlit app script
└── requirements.txt           # Python dependencies for the app
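
To make NLTK actually use the bundled `nltk_data/` folder, the app needs to add it to NLTK's search path before loading any tokenizer or corpus. A minimal sketch, assuming the folder sits next to `app.py` as in the tree above (the helper names are hypothetical):

```python
from pathlib import Path


def local_nltk_dir(app_file):
    """Return the nltk_data folder that sits next to the given script."""
    return Path(app_file).resolve().parent / "nltk_data"


def use_local_nltk_data(app_file):
    """Put the bundled folder first on NLTK's search path."""
    import nltk  # third-party

    data_dir = str(local_nltk_dir(app_file))
    if data_dir not in nltk.data.path:
        nltk.data.path.insert(0, data_dir)
    return data_dir


# At the top of app.py, before any NLTK resource is loaded:
# use_local_nltk_data(__file__)
```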

🚀 What’s Next?

Some features I’m planning to add:

🧾 CSV export of tag results
🖇 Tag co-occurrence matrix
📊 Monthly trend tracker (for repeat URLs)
🗃 Upload local .txt or .md files instead of URLs

🎯 Final Thoughts

AutoTagger is my experiment in combining SEO relevance with natural language processing. If you write content or manage blogs, this app can save time and improve your tagging strategy — all while giving you insight into what your content is really about.

Give it a try, fork it, improve it — and let me know what you think!