As large language models mature, building a practical chat application around them has become a common project. This article walks through building a modern AI chat app from scratch that integrates RAG (Retrieval-Augmented Generation), multi-turn conversation management, and local vector storage.
Live Demo: https://chat.erishen.cn
Core Features
Intelligent Conversation
– Streaming responses: Real-time streaming via Vercel AI SDK
– Multi-turn dialogue: Complete conversation history and context retention
– Auto-titling: AI-generated conversation titles based on content
– Persistence: Conversation history survives page refreshes
RAG Document Retrieval
– Document upload: Supports TXT and MD files
– Smart chunking: Auto-splits long documents into semantic chunks
– Client-side vectorization: Generates embeddings locally in the browser with @xenova/transformers
– Semantic search: Ranks document chunks by cosine similarity to the query embedding
– Context enhancement: Combines retrieved documents for more accurate responses
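To make the context enhancement step concrete, here is a minimal sketch of how retrieved chunks might be folded into the prompt. The RetrievedChunk shape, the function name buildAugmentedPrompt, and the instruction wording are all assumptions for illustration, not the app's actual code.

```typescript
// Hypothetical shape of a retrieved chunk; the field names are assumptions.
interface RetrievedChunk {
  title: string
  content: string
  similarity: number
}

// Places the retrieved chunks ahead of the user's question in a single prompt.
function buildAugmentedPrompt(question: string, chunks: RetrievedChunk[]): string {
  // Nothing retrieved: fall back to the plain question.
  if (chunks.length === 0) return question

  // Number each chunk so the model (and the reader) can refer back to a source.
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.title}\n${chunk.content}`)
    .join('\n\n')

  return [
    'Answer the question using the context below. If the context is not relevant, say so.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n')
}
```

Keeping the question last and numbering the chunks is one common layout; the right wording depends on the model being used.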
Modern UI/UX
– Responsive design: Adapts to desktop and mobile screens
– Theme switching: Light/dark/system themes
– Component architecture: Tailwind CSS-based design system
– Fluid animations: Polished interaction experience
Technical Architecture
Frontend Stack
Next.js 15 + React 19 + TypeScript, Tailwind CSS 4, Vercel AI SDK
RAG Architecture (Hybrid)
– Client side: Document processing + vectorization + local storage
– Server side: Semantic search + context generation
Core Libraries
@xenova/transformers (client ML), localStorage (vector DB), cosine similarity, React Hooks
Key Implementations
Multi-Turn Conversation Management
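import { useEffect, useState } from 'react'
// Assumption: messages come from the Vercel AI SDK's useChat hook
import { useChat } from '@ai-sdk/react'

// Conversation and conversationManager are defined elsewhere in the project
export function useMultiTurnChat() {
  const { messages } = useChat()
  const [currentConversationId, setCurrentConversationId] = useState<string | null>(null)
  const [conversations, setConversations] = useState<Conversation[]>([])

  // Persist the active conversation whenever its messages change
  useEffect(() => {
    if (!currentConversationId || messages.length === 0) return
    conversationManager.updateConversation(currentConversationId, {
      messages: messages.map(msg => ({
        id: msg.id,
        role: msg.role as 'user' | 'assistant',
        content: msg.content,
        // Keep the original creation time instead of re-stamping every message on each save
        timestamp: msg.createdAt ?? new Date(),
      }))
    })
  }, [messages, currentConversationId])

  // createNewConversation, switchConversation and deleteConversation are omitted here
  return { messages, conversations, createNewConversation, switchConversation, deleteConversation }
}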
RAG Document Processing Pipeline
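class DocumentProcessor {
  async processDocument(file: File): Promise<ProcessedDocument> {
    // Read the raw text, split it into overlapping chunks, then embed each chunk
    const content = await this.readFileContent(file)
    const chunks = await this.chunkText(content, { chunkSize: 500, chunkOverlap: 50 })
    const embeddings = await this.generateEmbeddings(chunks)

    // Build the document once so the stored and returned values match
    // (assumes ProcessedDocument has the shape passed to addDocument)
    const processedDocument: ProcessedDocument = {
      id: generateId(), title: file.name, content,
      chunks: chunks.map((chunk, index) => ({
        id: generateId(), content: chunk, embedding: embeddings[index]
      }))
    }
    await vectorStore.addDocument(processedDocument)
    return processedDocument
  }
}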
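The chunkText call above is not shown in the article. As a rough illustration of what chunkSize: 500 and chunkOverlap: 50 mean, here is a minimal fixed-window sketch. It splits purely by character count and does not attempt the semantic or structure-aware splitting the app describes; the name chunkTextSketch and the ChunkOptions type are assumptions.

```typescript
interface ChunkOptions {
  chunkSize: number    // maximum characters per chunk
  chunkOverlap: number // characters shared between neighbouring chunks
}

// Splits text into fixed-size windows, each starting (chunkSize - chunkOverlap)
// characters after the previous one.
function chunkTextSketch(text: string, { chunkSize, chunkOverlap }: ChunkOptions): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize')
  }
  const step = chunkSize - chunkOverlap
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}
```

With these settings a 1,000-character document yields three chunks starting at offsets 0, 450, and 900, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk as long as it is shorter than the overlap.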
Semantic Search with Cosine Similarity
class LocalStorageVectorStore {
  async search(queryEmbedding: number[], topK: number): Promise<SearchResult[]> {
    const allChunks = this.getAllChunks()
    return allChunks
      .map(chunk => ({
        ...chunk,
        similarity: this.cosineSimilarity(queryEmbedding, chunk.embedding)
      }))
      // Drop weak matches first so only relevant chunks are sorted
      .filter(result => result.similarity > 0.5)
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK)
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error('Embedding dimensions do not match')
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0)
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0))
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0))
    // A zero vector has no direction; return 0 instead of dividing by zero (NaN)
    if (magnitudeA === 0 || magnitudeB === 0) return 0
    return dotProduct / (magnitudeA * magnitudeB)
  }
}
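One possible refinement of the similarity code above: if every embedding is scaled to unit length once, when it is stored, cosine similarity reduces to a plain dot product and the two square roots per comparison disappear. The sketch below shows the idea; normalizeVector and dotProduct are hypothetical helper names, not part of the app.

```typescript
// Scales a vector to unit length. A zero vector has no direction,
// so it is returned as an unchanged copy.
function normalizeVector(v: number[]): number[] {
  const magnitude = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0))
  if (magnitude === 0) return v.slice()
  return v.map(x => x / magnitude)
}

// For unit-length inputs this value equals the cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Usage: normalize at indexing time, then compare with a dot product at query time.
const storedEmbedding = normalizeVector([3, 4])
const queryVector = normalizeVector([4, 3])
const similarityScore = dotProduct(storedEmbedding, queryVector)
```

This only holds if both the stored vectors and the query vector are normalized; mixing normalized and raw vectors would give scores that are no longer comparable to a fixed threshold such as 0.5.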
Performance Optimizations
– Client-side vectorization: Reduces server load; Web Workers keep embedding work off the UI thread; a local cache avoids recomputing embeddings
– Smart chunking: 500 characters per chunk with a 50-character overlap, preserving document structure
– Memory management: Lazy loading, an LRU cache, and releasing references to unused data so it can be garbage-collected
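The article does not show its LRU cache, so the following is only a minimal sketch of the technique. It relies on the fact that a JavaScript Map iterates keys in insertion order, which makes the first key the least recently used one. The class name LruCache and its API are assumptions.

```typescript
// Least-recently-used cache: when full, the entry that was read or written
// longest ago is evicted to make room for a new one.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>()
  private readonly capacity: number

  constructor(capacity: number) {
    if (capacity < 1) throw new Error('capacity must be at least 1')
    this.capacity = capacity
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined
    const value = this.entries.get(key) as V
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key)
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K
      this.entries.delete(oldest)
    }
    this.entries.set(key, value)
  }

  get size(): number {
    return this.entries.size
  }
}
```

A cache like this could, for example, map a chunk's text (or a hash of it) to its embedding so that re-uploading the same document does not recompute vectors.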
Problems Solved
– Page refresh data loss: Auto-save messages to localStorage on every change via useEffect
– Vector similarity precision: Normalized embeddings, dynamic thresholds, multi-tier ranking
– Large document performance: Web Workers process documents off the main thread so the UI stays responsive
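As an illustration of the page-refresh fix, here is a minimal sketch of saving and restoring messages through the Web Storage API. The storage key, the StoredMessage shape, and the function names are assumptions. The functions take the store as a parameter so they can be exercised outside the browser; in the app, window.localStorage would be passed in.

```typescript
// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
}

interface StoredMessage {
  id: string
  role: 'user' | 'assistant'
  content: string
  timestamp: string // ISO 8601 string, because JSON has no Date type
}

const CONVERSATIONS_KEY = 'chat:conversations' // hypothetical key name

// Reads every saved conversation, tolerating missing or corrupted data.
function loadAllConversations(store: KeyValueStore): Record<string, StoredMessage[]> {
  const raw = store.getItem(CONVERSATIONS_KEY)
  if (raw === null) return {}
  try {
    return JSON.parse(raw) as Record<string, StoredMessage[]>
  } catch {
    // Corrupted JSON: start fresh rather than crash on page load.
    return {}
  }
}

function saveMessages(store: KeyValueStore, conversationId: string, messages: StoredMessage[]): void {
  const all = loadAllConversations(store)
  all[conversationId] = messages
  store.setItem(CONVERSATIONS_KEY, JSON.stringify(all))
}

function loadMessages(store: KeyValueStore, conversationId: string): StoredMessage[] {
  return loadAllConversations(store)[conversationId] ?? []
}
```

Calling saveMessages from the useEffect that watches the message list, and loadMessages once on mount, is enough for history to survive a refresh. Storing dates as ISO strings avoids the surprise of Date objects coming back from JSON.parse as plain strings.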
Performance Metrics
– First load: < 2s
– Response time: < 500ms
– Document processing: < 3s for 1MB docs
– Search latency: < 100ms
Key Takeaways
Building this project gave me a working understanding of the full AI application pipeline, from UI/UX design and RAG implementation to performance optimization. Implementing the RAG system in particular deepened my understanding of vector databases, semantic search, and context generation.