Back to Home
Extract Plain Text from Medium Posts for RAG and Search Indexes

Extract Plain Text from Medium Posts for RAG and Search Indexes

B
Blizine Admin
·2 min read·0 views

Sebastian Casvean Posted on May 30 Extract Plain Text from Medium Posts for RAG and Search Indexes # ai # rag # llm # api Chunk clean article content for embeddings, summarization, and full-text search—skip nav, clap bars, and scripts. Extract Plain Text from Medium Posts for RAG and Search Indexes HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags. Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB. Pipeline Discover ids via user feed or search . GET /article/{id}/content → plain text. Optional: GET /article/{id} for title, tags, author metadata. Chunk → embed → upsert vector store. Query in your chat UI or internal search. Ingest script const API = ' https://api.zenndra.com ' ; const headers = { Authorization : `Bearer ${ process . env . ZENNDRA_API_KEY } ` }; export async function fetchArticleText ( articleId ) { const [ contentRes , metaRes ] = await Promise . all ([ fetch ( ` ${ API } /article/ ${ articleId } /content` , { headers }), fetch ( ` ${ API } /article/ ${ articleId } ` , { headers }), ]); const { content } = await contentRes . json (); const meta = await metaRes . json (); return { id : articleId , title : meta . title , tags : meta . tags , text : content , }; } export function chunkText ( text , { size = 800 , overlap = 100 } = {}) { const words = text . split ( / \s +/ ); const chunks = []; for ( let i = 0 ; i < words . length ; i += size - overlap ) { chunks . push ( words . slice ( i , i + size ). join ( ' ' )); } return chunks . filter ( Boolean ); } Enter fullscreen mode Exit fullscreen mode Wire chunkText to OpenAI embeddings , Ollama , or your host’s model—swap the vector client, keep the ingest shape. Chunking tips Include title + tags in the embedding preamble for better retrieval. Store article_id and chunk_index in metadata for citations. Deduplicate re-ingest with content hash if posts are edited rarel

📰Dev.to — dev.to

Comments