AutoTagger: A Streamlit App That Generates Blog Tags Using NLP + Keyword Extraction


How I built a simple web app to extract clean, SEO-friendly blog tags using Python, NLP, and keyword extraction algorithms

🧠 The Problem

Bloggers and content creators often struggle with the final (but critical) step of publishing: adding the right tags.

Tags affect:

  • SEO visibility
  • Topic clustering
  • Internal search/navigation
  • Reader discovery via recommended posts

Yet many people still add them manually, a guessing game prone to inconsistency and repetition.

There had to be a better way.


💡 The Solution

I built AutoTagger, a data-driven Streamlit app that accepts any blog post URL and returns structured, clean tag suggestions based on actual content.

Just paste your blog link and the app automatically:

  • Extracts the main text body
  • Cleans + preprocesses it using NLP
  • Runs 3 independent keyword extraction models
  • Returns three sets of tag suggestions, each based on a different data method

✅ What the App Does

AutoTagger is more than just a web scraper. It integrates multiple layers of data processing, filtering, and ranking logic to ensure the tags you get are meaningful and useful.

Here’s what it does step-by-step:

1. HTML Parsing + Text Extraction

  • Uses requests + BeautifulSoup to download and parse article content.
  • Extracts the <article> tag or common blog containers (e.g. .post-content, body).
  • Rejects pages with fewer than 100 words or with known error messages such as “404” or “not found”.
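
The extraction step might look roughly like the sketch below. It starts from HTML that has already been downloaded (for example with `requests.get(url).text`) and assumes a fallback order of `<article>`, then `.post-content`, then `<body>`. The function names, the constants, and that ordering are illustrative assumptions, not the app's actual code.

```python
from bs4 import BeautifulSoup

# Assumed values, taken from the description above: a 100-word minimum
# and the error phrases "404" and "not found".
MIN_WORDS = 100
ERROR_PHRASES = ("404", "not found")


def extract_main_text(html: str) -> str:
    """Return the text of the most specific content container found."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style elements so their contents do not leak into the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Try containers from most to least specific (this ordering is a guess).
    container = (
        soup.find("article")
        or soup.select_one(".post-content")
        or soup.body
        or soup
    )
    return container.get_text(separator=" ", strip=True)


def looks_like_article(text: str) -> bool:
    """Reject very short pages and pages that look like error pages."""
    if len(text.split()) < MIN_WORDS:
        return False
    lowered = text.lower()
    return not any(phrase in lowered for phrase in ERROR_PHRASES)
```

A plain substring check like this is crude: a legitimate post that happens to mention “not found” would also be rejected, so the real app may scope the check more narrowly.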

2. Preprocessing with NLP

  • Tokenizes and lowercases the entire text
  • Removes stopwords using nltk
  • Lemmatizes tokens using WordNetLemmatizer
  • Filters out special characters, digits, 1-letter words, and noise
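
As a rough illustration of these four steps, here is a dependency-free sketch. The stopword set is a tiny stand-in and the lemmatizer defaults to a no-op so the snippet runs anywhere; in the app these would presumably be nltk's English stopword list and `WordNetLemmatizer().lemmatize`. The function name and parameters are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch); the app
# uses nltk's list, e.g. set(nltk.corpus.stopwords.words("english")).
SAMPLE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on"}


def preprocess(text, stopwords=SAMPLE_STOPWORDS, lemmatize=lambda token: token):
    """Lowercase, tokenize, drop noise and stopwords, then lemmatize.

    `lemmatize` defaults to a no-op; passing
    nltk.stem.WordNetLemmatizer().lemmatize would match the step described above.
    """
    # Keep alphabetic runs only, which drops digits and special characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        lemmatize(token)
        for token in tokens
        if len(token) > 1 and token not in stopwords
    ]
```

Passing the stopwords and the lemmatizer in as arguments keeps the cleaning logic testable without downloading any NLTK data.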

3. Three Tag Generation Methods

Each method contributes to a separate, ranked list of up to 10 relevant tags:


🔢 Method 1: Frequency-Based Tags

  • Extracts top terms using collections.Counter
  • Filters out duplicates and short terms
  • Lemmatizes results and deduplicates intelligently

Use case: Best for direct topics mentioned often in the article.
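
A minimal sketch of this method, assuming the tokens have already been cleaned. The function name, the `min_length` default, and the choice to lemmatize before counting (so that variants such as "tag" and "tags" merge into one entry) are assumptions, not the app's exact logic.

```python
from collections import Counter


def frequency_tags(tokens, max_tags=10, min_length=2, lemmatize=lambda t: t):
    """Return the most frequent tokens, skipping very short ones."""
    # Counting lemmatized forms merges variants into a single entry,
    # which also takes care of duplicates.
    counts = Counter(lemmatize(t) for t in tokens if len(t) >= min_length)
    # most_common sorts by count; ties keep first-seen order.
    return [token for token, _count in counts.most_common(max_tags)]
```

For example, `frequency_tags(["money", "save", "money", "kl"])` puts "money" first because it appears twice.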


🧠 Method 2: RAKE (Rapid Automatic Keyword Extraction)

  • Uses the rake-nltk library
  • Identifies multi-word phrases based on word co-occurrence
  • Limits to phrases with 1–2 words for tag readability
  • Filters out messy characters, one-word noise, and repetition

Use case: Captures compound keyword phrases like “crypto investing” or “zero waste”.
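
To show what "based on word co-occurrence" means, here is a simplified, dependency-free re-implementation of the RAKE idea. It is not the rake-nltk code the app calls: the function name and the `max_words` parameter are hypothetical, and the scoring uses the common degree-to-frequency ratio.

```python
import re
from collections import defaultdict


def rake_phrases(text, stopwords, max_words=2):
    """Rank candidate phrases by the summed degree/frequency of their words."""
    # 1. Split the text into candidate phrases at punctuation and stopwords.
    phrases = []
    for fragment in re.split(r"[^a-zA-Z\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Score each word: degree counts the words it co-occurs with
    #    (itself included), frequency counts how often it appears.
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    # 3. A phrase's score is the sum of its word scores; keep short phrases only.
    scores = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in phrases
        if len(phrase) <= max_words
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

With rake-nltk itself, the rough equivalent would be `Rake(max_length=2)`, then `extract_keywords_from_text(text)` and `get_ranked_phrases()`.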


🤖 Method 3: KeyBERT (BERT + Cosine Similarity)

  • Uses keybert to extract semantic keywords from the cleaned article text
  • Based on sentence-transformers under the hood
  • Captures meaningful and contextually relevant tags, even if terms appear rarely

Use case: Great for abstract topics, technical writing, or posts with synonyms.
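
The "cosine similarity" part can be illustrated without loading a model: candidates whose embedding points in nearly the same direction as the document embedding rank highest. The 3-dimensional vectors below are invented toy values (real sentence-transformers embeddings have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(doc_vector, candidate_vectors, top_n=10):
    """Return (phrase, score) pairs, most similar to the document first."""
    scored = [
        (phrase, cosine_similarity(doc_vector, vector))
        for phrase, vector in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy "embeddings", made up purely for illustration.
doc_vector = [0.9, 0.1, 0.3]
candidates = {
    "urban life": [0.3, 0.1, 0.9],
    "budgeting tips": [0.8, 0.2, 0.3],
    "recipe": [0.0, 1.0, 0.0],
}
ranked = rank_by_similarity(doc_vector, candidates)
```

Because the score depends on direction rather than word counts, a phrase can rank highly even if it appears only once in the text.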


📊 Output Overview

Once processed, the app displays:

  • A word cloud of cleaned tokens
  • The raw + cleaned word count
  • A table of the top 10 keywords/phrases from each method
  • Final tag suggestions in markdown format (comma-separated)

All tags are:

  • Cleaned (no “– forever” or stray symbols)
  • Deduplicated across tokens
  • Filtered for length, clarity, and relevance
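
A final clean-up pass along these lines could be sketched as follows. The function name and defaults are hypothetical; the sketch rejects any tag containing a non-letter character (mirroring the KeyBERT filter shown later in this post) and only removes exact duplicates, which may be simpler than what the app does.

```python
import re


def format_tags(tags, max_tags=10):
    """Clean, deduplicate, and join tags into a comma-separated string."""
    seen = set()
    cleaned = []
    for tag in tags:
        # Normalize case and collapse repeated whitespace.
        tag = " ".join(tag.lower().split())
        # Skip one-character tags and tags with digits or stray symbols.
        if len(tag) < 2 or re.search(r"[^a-z\s]", tag):
            continue
        if tag in seen:
            continue
        seen.add(tag)
        cleaned.append(tag)
    return ", ".join(cleaned[:max_tags])
```

The returned string can be pasted straight into a CMS tag field or rendered as markdown.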

🧱 Tech Stack

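Here’s what powers the app:

  • Streamlit for the web interface
  • requests + BeautifulSoup for downloading and parsing pages
  • nltk for stopwords, lemmatization (WordNetLemmatizer), and the Punkt tokenizer
  • collections.Counter for frequency-based tags
  • rake-nltk for RAKE phrases
  • keybert (built on sentence-transformers) for semantic keywords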


🖼 Preview


🧪 Example Use Case

Let’s say you have a blog post titled “How I Save Money Living in KL Without a Car”.

Paste the URL into AutoTagger, and you might get:

  • Top Frequency Tags: money, save, kl, cost, living
  • RAKE Tags: public transport, affordable rent, grocery shopping
  • KeyBERT Tags: budgeting tips, urban life, lifestyle inflation

This gives you up to 30 tag candidates (up to 10 from each method) that you can cherry-pick or combine depending on your SEO intent.


🔧 How I Built It

The logic centers around keyword filtering, deduplication, and phrase scoring. Here’s a simplified view of how KeyBERT keywords are filtered:

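import re
from keybert import KeyBERT

kw_model = KeyBERT()  # assumed setup; uses KeyBERT's default embedding model
# Request extra candidates (top_n=50) so enough survive the filter below
keybert_raw = kw_model.extract_keywords(text, top_n=50, stop_words='english')
keybert_cleaned = [
    kw.lower().strip()
    for kw, _score in keybert_raw
    # keep keywords longer than one character that contain only letters and whitespace
    if len(kw) > 1 and not re.search(r"[^a-zA-Z\s]", kw)
]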

🗂 Output Specifications


🌐 Try It Yourself

👉 Launch AutoTagger on Streamlit
No login. No API key. Just paste your blog post URL and get instant tags.


📦 Folder Structure

autotagger/
│
├── nltk_data/            # Local NLTK data folder to avoid download errors
│   └── tokenizers/
│       └── punkt/        # Contains pre-downloaded Punkt tokenizer for sentence splitting
├── app.py                # Main Streamlit app script
└── requirements.txt      # Python dependencies for the app

🚀 What’s Next?

Some features I’m planning to add:

  • 🧾 CSV export of tag results
  • 🖇 Tag co-occurrence matrix
  • 📊 Monthly trend tracker (for repeat URLs)
  • 🗃 Upload local .txt or .md files instead of URLs

🎯 Final Thoughts

AutoTagger is my experiment in combining SEO relevance with natural language processing. If you write content or manage blogs, this app can save time and improve your tagging strategy, all while giving you insight into what your content is really about.

Give it a try, fork it, improve it, and let me know what you think!
