Natural Language Processing: Text Processing Fundamentals in Go
Hey linguist! 👋
Want to teach computers to understand human language? Analyze customer reviews, extract keywords from documents, or determine the sentiment of comments?
Natural Language Processing (NLP) is the magic that transforms unstructured text into useful data. And yes, you can do this in Go!
Let’s explore NLP fundamentals, practical algorithms, and real examples of text processing in Go 🚀
1. What is NLP and why do you need it?
NLP Definition
Natural Language Processing is a field of artificial intelligence and machine learning that teaches computers to understand, interpret, and generate human language.
NLP task examples:
- Sentiment analysis - positive or negative review?
- Entity extraction - find names, dates, places in text
- Text classification - spam or not spam?
- Machine translation - Russian to English
- Text generation - automatic article writing
Real-world applications
// Examples of NLP usage in business
type NLPApplication struct {
Name string
Description string
Value string
}
var applications = []NLPApplication{
{"Review analysis", "Determine customer sentiment", "Product improvement"},
{"Chatbots", "Automatic question responses", "Reduce support load"},
{"Document search", "Semantic search in knowledge base", "Fast information access"},
{"Content moderation", "Automatic toxicity detection", "Safe environment"},
}
2. Text Processing Fundamentals
Tokenization - breaking into parts
package nlp
import (
"regexp"
"strings"
"unicode/utf8"
)
type Tokenizer struct {
wordRegex *regexp.Regexp
}
func NewTokenizer() *Tokenizer {
return &Tokenizer{
// \w in Go's regexp is ASCII-only, so match Unicode letters and digits
// to keep non-Latin scripts (e.g. the Russian words used below) intact.
wordRegex: regexp.MustCompile(`[\p{L}\p{N}]+`),
}
}
func (t *Tokenizer) TokenizeWords(text string) []string {
text = strings.ToLower(text)
return t.wordRegex.FindAllString(text, -1)
}
func (t *Tokenizer) TokenizeSentences(text string) []string {
sentences := regexp.MustCompile(`[.!?]+`).Split(text, -1)
result := make([]string, 0, len(sentences))
for _, sentence := range sentences {
sentence = strings.TrimSpace(sentence)
if sentence != "" {
result = append(result, sentence)
}
}
return result
}
Text normalization
func (t *Tokenizer) Normalize(text string) string {
// Convert to lowercase
text = strings.ToLower(text)
// Remove punctuation
text = regexp.MustCompile(`[^\p{L}\p{N}\s]+`).ReplaceAllString(text, "")
// Remove extra spaces
text = regexp.MustCompile(`\s+`).ReplaceAllString(text, " ")
return strings.TrimSpace(text)
}
// Stop words removal
var stopWords = map[string]bool{
// Russian stop words
"и": true, "в": true, "на": true, "с": true, "по": true,
"для": true, "не": true, "от": true, "до": true, "из": true,
// English stop words
"the": true, "and": true, "or": true, "but": true, "in": true,
"on": true, "at": true, "to": true, "for": true, "of": true,
}
func RemoveStopWords(tokens []string) []string {
result := make([]string, 0, len(tokens))
for _, token := range tokens {
// Count runes, not bytes: len() over-counts multi-byte (e.g. Cyrillic) characters.
if !stopWords[token] && utf8.RuneCountInString(token) > 2 {
result = append(result, token)
}
}
return result
}
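To see how these pieces fit together, here's a quick pipeline sketch; the sample sentences are made up, and it assumes the snippets above live in one package with fmt imported:

func main() {
	tokenizer := NewTokenizer()

	// Normalize, tokenize, and drop stop words / short tokens.
	text := "The delivery was fast, and the product quality is excellent!"
	tokens := RemoveStopWords(tokenizer.TokenizeWords(tokenizer.Normalize(text)))
	fmt.Println(tokens) // [delivery was fast product quality excellent]

	// Sentence splitting works independently of word tokenization.
	fmt.Println(tokenizer.TokenizeSentences("Great product. Would buy again!"))
	// [Great product Would buy again]
}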
3. Sentiment Analysis
Simple dictionary-based approach
type SentimentAnalyzer struct {
positiveWords map[string]int
negativeWords map[string]int
}
func NewSentimentAnalyzer() *SentimentAnalyzer {
return &SentimentAnalyzer{
positiveWords: map[string]int{
// Russian
"хорошо": 2, "отлично": 3, "прекрасно": 3, "замечательно": 2,
// English
"good": 2, "great": 3, "excellent": 3, "amazing": 3,
"love": 2, "perfect": 3, "awesome": 3,
},
negativeWords: map[string]int{
// Russian
"плохо": -2, "ужасно": -3, "отвратительно": -3, "кошмар": -2,
// English
"bad": -2, "terrible": -3, "awful": -3, "hate": -3,
"horrible": -3, "disgusting": -3,
},
}
}
type SentimentResult struct {
Score int
Sentiment string
Confidence float64
}
func (sa *SentimentAnalyzer) Analyze(text string) SentimentResult {
tokenizer := NewTokenizer()
tokens := tokenizer.TokenizeWords(text)
tokens = RemoveStopWords(tokens)
score := 0
wordCount := 0
for _, token := range tokens {
if value, exists := sa.positiveWords[token]; exists {
score += value
wordCount++
}
if value, exists := sa.negativeWords[token]; exists {
score += value
wordCount++
}
}
sentiment := "neutral"
confidence := 0.0
if score > 0 {
sentiment = "positive"
confidence = float64(score) / float64(len(tokens))
} else if score < 0 {
sentiment = "negative"
confidence = float64(-score) / float64(len(tokens))
}
// Cap at 1.0: a single strongly weighted word can otherwise push confidence above 1.
if confidence > 1 {
confidence = 1
}
return SentimentResult{
Score: score,
Sentiment: sentiment,
Confidence: confidence,
}
}
Using sentiment analysis
func main() {
analyzer := NewSentimentAnalyzer()
reviews := []string{
"This product is just excellent! Very satisfied with purchase.",
"Terrible quality, money wasted. Don't recommend.",
"Regular product, nothing special.",
"Amazing product! Love it so much!",
}
for _, review := range reviews {
result := analyzer.Analyze(review)
fmt.Printf("Review: %s\n", review)
fmt.Printf("Sentiment: %s (%.2f)\n\n", result.Sentiment, result.Confidence)
}
}
4. TF-IDF for keyword extraction
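The idea: a word matters for a document if it is frequent in that document but rare across the whole collection. Term frequency TF(t, d) is the count of t in d divided by the total number of words in d; inverse document frequency IDF(t) is ln(N / df), where N is the number of documents and df is how many of them contain t; the keyword score is simply TF × IDF.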
TF-IDF implementation
import "math"
type TFIDFAnalyzer struct {
documents [][]string
vocabulary map[string]int
}
func NewTFIDFAnalyzer() *TFIDFAnalyzer {
return &TFIDFAnalyzer{
documents: make([][]string, 0),
vocabulary: make(map[string]int),
}
}
func (tfidf *TFIDFAnalyzer) AddDocument(tokens []string) {
tfidf.documents = append(tfidf.documents, tokens)
// Update vocabulary
for _, token := range tokens {
tfidf.vocabulary[token]++
}
}
func (tfidf *TFIDFAnalyzer) CalculateTF(tokens []string) map[string]float64 {
tf := make(map[string]float64)
totalWords := len(tokens)
for _, token := range tokens {
tf[token]++
}
for token := range tf {
tf[token] = tf[token] / float64(totalWords)
}
return tf
}
func (tfidf *TFIDFAnalyzer) CalculateIDF(term string) float64 {
documentsWithTerm := 0
for _, doc := range tfidf.documents {
for _, token := range doc {
if token == term {
documentsWithTerm++
break
}
}
}
if documentsWithTerm == 0 {
return 0
}
return math.Log(float64(len(tfidf.documents)) / float64(documentsWithTerm))
}
func (tfidf *TFIDFAnalyzer) GetTopKeywords(tokens []string, topN int) []KeywordScore {
tf := tfidf.CalculateTF(tokens)
scores := make([]KeywordScore, 0)
for term, tfScore := range tf {
idf := tfidf.CalculateIDF(term)
tfidfScore := tfScore * idf
scores = append(scores, KeywordScore{
Word: term,
Score: tfidfScore,
})
}
// Sort by descending score
sort.Slice(scores, func(i, j int) bool {
return scores[i].Score > scores[j].Score
})
if len(scores) > topN {
scores = scores[:topN]
}
return scores
}
type KeywordScore struct {
Word string
Score float64
}
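A minimal usage sketch (the corpus is illustrative, and the tokenizer helpers from section 2 are assumed to be in the same package):

func main() {
	tokenizer := NewTokenizer()
	tfidf := NewTFIDFAnalyzer()

	// Build a tiny corpus so IDF has something to compare against.
	docs := []string{
		"fast delivery, excellent product quality",
		"delivery was late and the packaging was damaged",
		"excellent support, quick answers",
	}
	tokenized := make([][]string, 0, len(docs))
	for _, doc := range docs {
		tokens := RemoveStopWords(tokenizer.TokenizeWords(doc))
		tfidf.AddDocument(tokens)
		tokenized = append(tokenized, tokens)
	}

	// Top keywords of the first document.
	for _, kw := range tfidf.GetTopKeywords(tokenized[0], 3) {
		fmt.Printf("%s: %.3f\n", kw.Word, kw.Score)
	}
}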
5. Named Entity Recognition (NER)
Simple regex-based approach
type EntityExtractor struct {
patterns map[string]*regexp.Regexp
}
func NewEntityExtractor() *EntityExtractor {
return &EntityExtractor{
patterns: map[string]*regexp.Regexp{
"email": regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`),
"phone": regexp.MustCompile(`\b\d{3}-\d{3}-\d{4}\b|\+1\d{10}\b`),
"date": regexp.MustCompile(`\b\d{1,2}[./]\d{1,2}[./]\d{4}\b`),
"money": regexp.MustCompile(`\$\d+(?:,\d{3})*(?:\.\d{2})?|\d+\s*(?:dollars|USD)`),
// Naive heuristic: any two consecutive capitalized words.
"person": regexp.MustCompile(`\b[A-Z][a-z]+\s+[A-Z][a-z]+\b`),
},
}
}
type Entity struct {
Type string
Value string
Start int
End int
}
func (ee *EntityExtractor) Extract(text string) []Entity {
entities := make([]Entity, 0)
for entityType, pattern := range ee.patterns {
matches := pattern.FindAllStringIndex(text, -1)
for _, match := range matches {
entities = append(entities, Entity{
Type: entityType,
Value: text[match[0]:match[1]],
Start: match[0],
End: match[1],
})
}
}
return entities
}
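A quick run over a made-up sentence (map iteration order is random, so the entity types may print in any order):

func main() {
	extractor := NewEntityExtractor()

	text := "John Smith paid $1,299.99 on 01/15/2025, contact: john.smith@example.com"
	for _, entity := range extractor.Extract(text) {
		fmt.Printf("%-7s %q [%d:%d]\n", entity.Type, entity.Value, entity.Start, entity.End)
	}
}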
6. Text Classification
Naive Bayes classifier
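The idea: for each class, sum the log-probabilities of the document's words given that class, add the log-prior of the class (its share of training documents), and pick the class with the highest total. Laplace (add-one) smoothing, (count(word, class) + 1) / (totalWords(class) + vocabulary size), keeps unseen words from zeroing out the whole score.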
type NaiveBayesClassifier struct {
classes map[string]int
wordCounts map[string]map[string]int
totalWords map[string]int
vocabulary map[string]bool
}
func NewNaiveBayesClassifier() *NaiveBayesClassifier {
return &NaiveBayesClassifier{
classes: make(map[string]int),
wordCounts: make(map[string]map[string]int),
totalWords: make(map[string]int),
vocabulary: make(map[string]bool),
}
}
func (nb *NaiveBayesClassifier) Train(text string, class string) {
tokenizer := NewTokenizer()
tokens := tokenizer.TokenizeWords(text)
tokens = RemoveStopWords(tokens)
nb.classes[class]++
if nb.wordCounts[class] == nil {
nb.wordCounts[class] = make(map[string]int)
}
for _, token := range tokens {
nb.wordCounts[class][token]++
nb.totalWords[class]++
nb.vocabulary[token] = true
}
}
func (nb *NaiveBayesClassifier) Predict(text string) string {
tokenizer := NewTokenizer()
tokens := tokenizer.TokenizeWords(text)
tokens = RemoveStopWords(tokens)
bestClass := ""
bestScore := math.Inf(-1)
totalDocs := 0
for _, count := range nb.classes {
totalDocs += count
}
for class := range nb.classes {
score := math.Log(float64(nb.classes[class]) / float64(totalDocs))
for _, token := range tokens {
wordCount := nb.wordCounts[class][token]
totalWordsInClass := nb.totalWords[class]
vocabularySize := len(nb.vocabulary)
// Laplace smoothing
probability := float64(wordCount+1) / float64(totalWordsInClass+vocabularySize)
score += math.Log(probability)
}
if score > bestScore {
bestScore = score
bestClass = class
}
}
return bestClass
}
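A minimal train-and-predict sketch with a made-up spam/ham dataset:

func main() {
	classifier := NewNaiveBayesClassifier()

	// A tiny illustrative training set.
	classifier.Train("Win a free prize now, click here", "spam")
	classifier.Train("Limited offer, free money, act now", "spam")
	classifier.Train("Meeting moved to Thursday afternoon", "ham")
	classifier.Train("Please review the attached report", "ham")

	fmt.Println(classifier.Predict("free prize, click now"))       // likely "spam"
	fmt.Println(classifier.Predict("report for Thursday meeting")) // likely "ham"
}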
7. Practical Example: Review Analysis
Complete analysis system
type ReviewAnalyzer struct {
sentiment *SentimentAnalyzer
classifier *NaiveBayesClassifier
extractor *EntityExtractor
tfidf *TFIDFAnalyzer
}
func NewReviewAnalyzer() *ReviewAnalyzer {
return &ReviewAnalyzer{
sentiment: NewSentimentAnalyzer(),
classifier: NewNaiveBayesClassifier(),
extractor: NewEntityExtractor(),
tfidf: NewTFIDFAnalyzer(),
}
}
type ReviewAnalysis struct {
Text string
Sentiment SentimentResult
Category string
Keywords []KeywordScore
Entities []Entity
}
func (ra *ReviewAnalyzer) AnalyzeReview(text string) ReviewAnalysis {
tokenizer := NewTokenizer()
tokens := tokenizer.TokenizeWords(text)
cleanTokens := RemoveStopWords(tokens)
return ReviewAnalysis{
Text: text,
Sentiment: ra.sentiment.Analyze(text),
Category: ra.classifier.Predict(text),
Keywords: ra.tfidf.GetTopKeywords(cleanTokens, 5),
Entities: ra.extractor.Extract(text),
}
}
// Usage example
func main() {
analyzer := NewReviewAnalyzer()
// Train classifier
analyzer.classifier.Train("Excellent product, fast delivery", "positive")
analyzer.classifier.Train("Poor quality, don't recommend", "negative")
// Add documents for TF-IDF
tokenizer := NewTokenizer()
doc1 := RemoveStopWords(tokenizer.TokenizeWords("Excellent product"))
analyzer.tfidf.AddDocument(doc1)
// Analyze review
review := "Ordered on 01/15/2025, product arrived quickly. Quality is excellent! Recommend to everyone. My email: test@example.com"
result := analyzer.AnalyzeReview(review)
fmt.Printf("Review analysis:\n")
fmt.Printf("Sentiment: %s (%.2f)\n", result.Sentiment.Sentiment, result.Sentiment.Confidence)
fmt.Printf("Category: %s\n", result.Category)
fmt.Printf("Found entities:\n")
for _, entity := range result.Entities {
fmt.Printf(" %s: %s\n", entity.Type, entity.Value)
}
}
8. Performance and Optimization
Benchmarks
func BenchmarkTokenization(b *testing.B) {
tokenizer := NewTokenizer()
text := "This is a long text for testing tokenization performance in Go"
b.ResetTimer()
for i := 0; i < b.N; i++ {
tokenizer.TokenizeWords(text)
}
}
func BenchmarkSentimentAnalysis(b *testing.B) {
analyzer := NewSentimentAnalyzer()
text := "Excellent product, very satisfied with purchase, recommend to everyone"
b.ResetTimer()
for i := 0; i < b.N; i++ {
analyzer.Analyze(text)
}
}
// Results:
// BenchmarkTokenization-8 1000000 1200 ns/op
// BenchmarkSentimentAnalysis-8 500000 2400 ns/op
Optimizations
// Result caching
type CachedAnalyzer struct {
analyzer *SentimentAnalyzer
cache map[string]SentimentResult
mu sync.RWMutex
}
func (ca *CachedAnalyzer) Analyze(text string) SentimentResult {
ca.mu.RLock()
if result, exists := ca.cache[text]; exists {
ca.mu.RUnlock()
return result
}
ca.mu.RUnlock()
result := ca.analyzer.Analyze(text)
ca.mu.Lock()
ca.cache[text] = result
ca.mu.Unlock()
return result
}
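One thing to watch: the cache map has to be initialized before use, since writing to a nil map panics. A minimal constructor:

func NewCachedAnalyzer() *CachedAnalyzer {
	return &CachedAnalyzer{
		analyzer: NewSentimentAnalyzer(),
		// Initialize the map up front; a nil map would panic on the first write.
		cache: make(map[string]SentimentResult),
	}
}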
9. External API Integration
Using Google Translate API
type TranslationService struct {
apiKey string
client *http.Client
}
func (ts *TranslationService) Translate(text, targetLang string) (string, error) {
url := fmt.Sprintf("https://translation.googleapis.com/language/translate/v2?key=%s", ts.apiKey)
payload := map[string]interface{}{
"q": text,
"target": targetLang,
"format": "text",
}
jsonData, err := json.Marshal(payload)
if err != nil {
return "", err
}
resp, err := ts.client.Post(url, "application/json", bytes.NewBuffer(jsonData))
if err != nil {
return "", err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return "", fmt.Errorf("translation API returned status %d", resp.StatusCode)
}
var result struct {
Data struct {
Translations []struct {
TranslatedText string `json:"translatedText"`
} `json:"translations"`
} `json:"data"`
}
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return "", err
}
if len(result.Data.Translations) > 0 {
return result.Data.Translations[0].TranslatedText, nil
}
return "", fmt.Errorf("no translation found")
}
Conclusion: NLP in Go is real and effective
What we learned:
🔤 Tokenization - breaking text into parts
😊 Sentiment analysis - determining emotions
🔍 TF-IDF - keyword extraction
🏷️ NER - named entity recognition
📊 Classification - text categorization
Go advantages for NLP:
- High performance - fast processing of large text volumes
- Easy deployment - single binary file
- Excellent concurrency support - parallel document processing (see the sketch after this list)
- Rich standard library - regexp, strings, unicode
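As a quick illustration of the concurrency point, here is a minimal fan-out sketch (the function name and the reviews slice are illustrative; Analyze only reads the shared dictionaries, so calling it from multiple goroutines is safe):

func analyzeAll(analyzer *SentimentAnalyzer, reviews []string) []SentimentResult {
	results := make([]SentimentResult, len(reviews))
	var wg sync.WaitGroup
	for i, review := range reviews {
		wg.Add(1)
		go func(i int, review string) {
			defer wg.Done()
			// Each goroutine writes to its own index, so no extra locking is needed.
			results[i] = analyzer.Analyze(review)
		}(i, review)
	}
	wg.Wait()
	return results
}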
Next steps:
- Explore libraries: prose, go-nlp, gse
- Try ML model integration via gRPC
- Implement word embeddings with Word2Vec
P.S. What NLP tasks are you solving? Share your experience! 🚀
// Additional resources:
// - "Speech and Language Processing" - Jurafsky & Martin
// - Go NLP libraries: github.com/jdkato/prose
// - Stanford NLP Course: cs224n.stanford.edu