How I built a simple web app to extract clean, SEO-friendly blog tags using Python, NLP, and keyword extraction algorithms
The Problem
Bloggers and content creators often struggle with the final (but critical) step of publishing: adding the right tags.
Tags affect everything from:
- SEO visibility
- Topic clustering
- Internal search/navigation
- Reader discovery via recommended posts
Yet many people still add them manually: a guessing game prone to inconsistency and repetition.
There had to be a better way.
The Solution
I built AutoTagger, a data-driven Streamlit app that accepts any blog post URL and returns structured, clean tag suggestions based on actual content.
Just paste your blog link, and the app automatically:
- Extracts the main text body
- Cleans + preprocesses it using NLP
- Runs 3 independent keyword extraction models
- Returns three sets of tag suggestions, each produced by a different method
What the App Does
AutoTagger is more than just a web scraper. It integrates multiple layers of data processing, filtering, and ranking logic to ensure the tags you get are meaningful and useful.
Here's what it does, step by step:
1. HTML Parsing + Text Extraction
- Uses `requests` + `BeautifulSoup` to download and parse article content
- Extracts the `<article>` tag or common blog containers (e.g. `.post-content`, `body`)
- Rejects pages with <100 words or known error messages like "404" or "not found"
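The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `extract_article_text` is a hypothetical name, and it takes an already-downloaded HTML string rather than fetching the URL itself.

```python
import re
from bs4 import BeautifulSoup

def extract_article_text(html):
    """Pull the main text body out of raw HTML, or return None if unusable."""
    soup = BeautifulSoup(html, "html.parser")
    # Prefer the semantic <article> tag, then fall back to common blog containers.
    node = soup.find("article") or soup.select_one(".post-content") or soup.body
    if node is None:
        return None
    text = node.get_text(separator=" ", strip=True)
    # Reject thin pages and obvious error pages.
    if len(text.split()) < 100 or re.search(r"\b404\b|not found", text, re.IGNORECASE):
        return None
    return text
```

In the real app the HTML would come from `requests.get(url).text` before being handed to this function.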
2. Preprocessing with NLP
- Tokenizes and lowercases the entire text
- Removes stopwords using `nltk`
- Lemmatizes tokens using `WordNetLemmatizer`
- Filters out special characters, digits, 1-letter words, and noise
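A simplified, dependency-free sketch of that pipeline. The app itself uses nltk's full English stopword list and `WordNetLemmatizer`; here a tiny inline stopword set and a crude plural-stripping rule stand in so the example runs without NLTK data downloads.

```python
import re

# Tiny stand-in stopword list; the app uses nltk's full English list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are", "it", "this"}

def preprocess(text):
    """Tokenize, lowercase, drop stopwords/digits/1-letter tokens, crudely lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    cleaned = []
    for tok in tokens:
        if tok in STOPWORDS or len(tok) < 2:
            continue
        # Naive plural stripping in place of WordNetLemmatizer.
        if tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
            tok = tok[:-1]
        cleaned.append(tok)
    return cleaned
```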
3. Three Tag Generation Methods
Each method contributes to a separate, ranked list of up to 10 relevant tags:
Method 1: Frequency-Based Tags
- Extracts top terms using `collections.Counter`
- Filters out duplicates and short terms
- Lemmatizes results and deduplicates intelligently
Use case: Best for direct topics mentioned often in the article.
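The core of the frequency method fits in a few lines. A minimal sketch, assuming `tokens` comes from the preprocessing step (the real app also lemmatizes and deduplicates the results):

```python
from collections import Counter

def frequency_tags(tokens, top_n=10):
    """Rank preprocessed tokens by raw frequency and return the top candidates."""
    counts = Counter(t for t in tokens if len(t) > 1)  # drop 1-letter noise
    return [term for term, _ in counts.most_common(top_n)]
```

`Counter.most_common` breaks ties by first-encountered order, which keeps the output stable across runs.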
Method 2: RAKE (Rapid Automatic Keyword Extraction)
- Uses the `rake-nltk` library
- Identifies multi-word phrases based on word co-occurrence
- Limits phrases to 1-2 words for tag readability
- Filters out messy characters, one-word noise, and repetition
Use case: Captures compound keyword phrases like "crypto investing" or "zero waste".
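To make RAKE's heuristic concrete, here is a toy re-implementation of its core idea (the app uses `rake-nltk`, not this code): split text into candidate phrases at stopwords and punctuation, then score each phrase by the degree/frequency ratio of its words.

```python
import re
from collections import defaultdict

# Tiny stand-in stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "for", "with", "how"}

def rake_phrases(text, max_words=2, top_n=10):
    """Mini-RAKE: candidate phrases between stopwords, scored by sum(degree/freq)."""
    phrases = []
    for fragment in re.split(r"[^a-z\s]+", text.lower()):  # break at punctuation/digits
        current = []
        for w in fragment.split():
            if w in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(w)
        if current:
            phrases.append(current)
    # degree(w) = total phrase length summed over w's phrases; freq(w) = occurrences
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    scored = {
        " ".join(p): sum(degree[w] / freq[w] for w in p)
        for p in phrases if len(p) <= max_words
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```

Words that keep appearing inside longer phrases get a high degree, so multi-word combinations outrank isolated terms.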
Method 3: KeyBERT (BERT + Cosine Similarity)
- Uses `keybert` to extract semantic keywords from the cleaned article text
- Based on `sentence-transformers` under the hood
- Captures meaningful and contextually relevant tags, even if terms appear rarely
Use case: Great for abstract topics, technical writing, or posts with synonyms.
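The ranking principle behind KeyBERT is plain cosine similarity between the document embedding and each candidate's embedding. A toy illustration with hand-made 3-dimensional vectors (in reality both come from `sentence-transformers`):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy "embeddings" chosen for the example; not real model output.
doc_vec = [0.9, 0.1, 0.2]
candidates = {
    "budgeting tips": [0.8, 0.2, 0.1],
    "urban life":     [0.4, 0.7, 0.3],
    "random noise":   [0.0, 0.1, 0.9],
}

ranked = sorted(candidates, key=lambda k: cosine(candidates[k], doc_vec), reverse=True)
```

Candidates closest in embedding space to the whole document win, which is why KeyBERT can surface a tag even if the literal phrase appears only once.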
Output Overview
Once processed, the app displays:
- A word cloud of cleaned tokens
- The raw + cleaned word count
- A table of the top 10 keywords/phrases from each method
- Final tag suggestions in markdown format (comma-separated)
All tags are:
- Cleaned (no stray symbols or truncated fragments)
- Deduplicated across tokens
- Filtered for length, clarity, and relevance
Tech Stack
Here's what powers the app:

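In `requirements.txt` form, the libraries mentioned throughout this post look roughly like this (unpinned; a sketch, not the app's actual pin list):

```
streamlit
requests
beautifulsoup4
nltk
rake-nltk
keybert
sentence-transformers  # pulled in by keybert; listed for clarity
wordcloud              # assumed for the word-cloud view
```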
Preview

Example Use Case
Let's say you run a blog post on "How I Save Money Living in KL Without a Car".
Paste the URL into AutoTagger, and you might get:
- Top Frequency Tags: money, save, kl, cost, living
- RAKE Tags: public transport, affordable rent, grocery shopping
- KeyBERT Tags: budgeting tips, urban life, lifestyle inflation
This gives you up to 30 unique tag candidates (10 from each method) that you can cherry-pick or combine depending on your SEO intent.
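Combining the three lists into one deduplicated candidate pool is trivial; a sketch (`merge_candidates` is a hypothetical helper, not from the app):

```python
def merge_candidates(*tag_lists):
    """Merge tag lists from several methods into one deduplicated candidate pool,
    preserving the order in which tags first appear."""
    seen, merged = set(), []
    for tags in tag_lists:
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged
```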
How I Built It
The logic centers around keyword filtering, deduplication, and phrase scoring. Here's a simplified view of how KeyBERT keywords are filtered:
```python
import re
from keybert import KeyBERT

kw_model = KeyBERT()
keybert_raw = kw_model.extract_keywords(text, top_n=50, stop_words='english')
keybert_cleaned = [
    kw[0].lower().strip()
    for kw in keybert_raw
    if len(kw[0]) > 1 and not re.search(r"[^a-zA-Z\s]", kw[0])
]
```
Output Specifications

Try It Yourself
Launch AutoTagger on Streamlit
No login. No API key. Just paste your blog post URL and get instant tags.
Folder Structure
```
autotagger/
├── nltk_data/           # Local NLTK data folder to avoid download errors
│   └── tokenizers/
│       └── punkt/       # Pre-downloaded Punkt tokenizer for sentence splitting
├── app.py               # Main Streamlit app script
└── requirements.txt     # Python dependencies for the app
```
What's Next?
Some features I'm planning to add:
- CSV export of tag results
- Tag co-occurrence matrix
- Monthly trend tracker (for repeat URLs)
- Upload local `.txt` or `.md` files instead of URLs
Final Thoughts
AutoTagger is my experiment in combining SEO relevance with natural language processing. If you write content or manage blogs, this app can save time and improve your tagging strategy, all while giving you insight into what your content is really about.
Give it a try, fork it, improve it, and let me know what you think!



