Building a Modern Site Search Engine with Laravel, Elasticsearch, and AI

February 3, 2025

projects development

When I set out to build SiteSearch, I wanted to create more than just another search tool. The goal was to build a powerful, scalable search solution that could handle multiple websites while providing intelligent, relevant results. Here's a deep dive into how I built it using Laravel, Elasticsearch, and OpenAI.

The Challenge

Building a site search engine isn't just about matching keywords. Modern search needs to understand context, handle real-time updates, and provide relevant results across different types of content. The key challenges were:

  1. Content Synchronization: Pages change frequently, requiring real-time index updates
  2. Multi-tenancy: Each site needs its own isolated search index
  3. Content Structure: Different sites have varying HTML structures
  4. Performance: Search needs to be fast, even with large datasets
  5. Intelligence: Search results need to understand context beyond simple keyword matching

Getting Started

Before diving into the advanced features, let's set up the basic requirements. We'll need Laravel as our foundation, Elasticsearch for powerful search capabilities, Sanctum for API authentication, and Guzzle for making HTTP requests to external services.

Alongside the packages, two environment variables point the application at its external services:

# Create a new Laravel app, then install the required packages
composer create-project laravel/laravel sitesearch
cd sitesearch
composer require elasticsearch/elasticsearch
composer require laravel/sanctum
composer require guzzlehttp/guzzle

# Set up environment variables
ELASTICSEARCH_HOST=localhost:9200
OPENAI_API_KEY=your_api_key

Basic Configuration

The configuration file sets up our core services. We separate these settings to make the application easily configurable across different environments:

// config/sitesearch.php
return [
    'elasticsearch' => [
        'host' => env('ELASTICSEARCH_HOST', 'localhost:9200'),
        'ssl_verify' => env('ELASTICSEARCH_SSL_VERIFY', true),
    ],
    'crawl' => [
        'max_pages_per_site' => env('MAX_PAGES_PER_SITE', 1000),
        'request_delay' => env('CRAWL_REQUEST_DELAY', 1), // seconds
    ],
    'rate_limits' => [
        'search' => env('SEARCH_RATE_LIMIT', 60), // per minute
        'openai' => env('OPENAI_RATE_LIMIT', 100), // per minute
    ]
];

The Architecture

The system is built with several key components working together:

Core Search Service with Elasticsearch

The ElasticsearchService is the backbone of our search functionality. It handles client configuration, document indexing, and query execution.

The indexPage method ensures real-time indexing of content, while searchPages implements a sophisticated multi-field search strategy:

use Elastic\Elasticsearch\ClientBuilder; // elasticsearch/elasticsearch v8
use Illuminate\Support\Facades\Log;

class ElasticsearchService
{
    protected $client;

    public function __construct()
    {
        $this->client = ClientBuilder::create()
            ->setHosts([config('sitesearch.elasticsearch.host')])
            ->setRetries(2)
            ->setSSLVerification(config('sitesearch.elasticsearch.ssl_verify'))
            ->build();
    }

    // Exposed so health checks can reach the underlying client
    public function getClient()
    {
        return $this->client;
    }

    public function indexPage($page)
    {
        try {
            $this->client->index([
                'index' => $page->site->site_index_name,
                'id'    => $page->uuid,
                'body'  => $page->toElasticArray(),
                'refresh' => true  // Ensure immediate searchability
            ]);
        } catch (\Exception $e) {
            Log::error('Elasticsearch indexing failed', [
                'page_id' => $page->id,
                'error' => $e->getMessage()
            ]);
            throw $e;
        }
    }

    public function searchPages($query, $siteId, $offset, $perPage)
    {
        $site = Site::find($siteId);
        $params = [
            'index' => $site->site_index_name,
            'body'  => [
                'from' => $offset,
                'size' => $perPage,
                'query' => [
                    'bool' => [
                        'must' => [
                            ['multi_match' => [
                                'query' => $query,
                                'fields' => [
                                    'title^3',
                                    'ai_title^4',
                                    'description^2',
                                    'ai_description^3',
                                    'open_graph^1.5',
                                    'keywords^2',
                                    'ai_keywords^2.5',
                                    'text'
                                ],
                                'type' => 'best_fields',
                                'tie_breaker' => 0.3,
                                'minimum_should_match' => '80%'
                            ]],
                        ]
                    ]
                ],
                'highlight' => [
                    'fields' => [
                        'title' => new \stdClass(),
                        'description' => new \stdClass(),
                        'text' => [
                            'fragment_size' => 150,
                            'number_of_fragments' => 3
                        ]
                    ]
                ]
            ]
        ];

        return $this->client->search($params);
    }
}
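Search relevance also depends on the index mapping, which isn't shown in this post. A minimal sketch of creating a per-site index, assuming the field names used in searchPages plus the content_hash field used later for duplicate collapsing, could look like this:

public function createIndex(Site $site)
{
    // Minimal mapping sketch: text fields for everything searchPages
    // queries, plus a keyword field for duplicate collapsing. A real
    // mapping would also configure analyzers.
    $this->client->indices()->create([
        'index' => $site->site_index_name,
        'body'  => [
            'mappings' => [
                'properties' => [
                    'title'          => ['type' => 'text'],
                    'ai_title'       => ['type' => 'text'],
                    'description'    => ['type' => 'text'],
                    'ai_description' => ['type' => 'text'],
                    'keywords'       => ['type' => 'text'],
                    'ai_keywords'    => ['type' => 'text'],
                    'open_graph'     => ['type' => 'text'],
                    'text'           => ['type' => 'text'],
                    'content_hash'   => ['type' => 'keyword'],
                ]
            ]
        ]
    ]);
}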

AI-Powered Content Enhancement

The OpenAI integration adds an intelligent layer to our search engine. It generates keywords and summaries for each page, which are indexed alongside the original content as the ai_* fields weighted in the search query above.

The service uses GPT-4; error handling and rate limiting are layered on separately (covered later in this post):

class OpenAIService
{
    protected $client;

    public function __construct()
    {
        $this->client = new Client([
            'base_uri' => 'https://api.openai.com/v1/',
            'headers' => [
                'Authorization' => 'Bearer ' . env('OPENAI_API_KEY'),
                'Content-Type' => 'application/json',
            ],
        ]);
    }

    public function generateKeywords($content)
    {
        $response = $this->client->post('chat/completions', [
            'json' => [
                'model' => 'gpt-4',
                'messages' => [
                    [
                        'role' => 'system', 
                        'content' => 'You are a helpful assistant. You will just give me a list of keywords, comma separated, that I can plug into a database.'
                    ],
                    [
                        'role' => 'user', 
                        'content' => "Generate keywords for the following content:\n\n" . $content
                    ],
                ],
                'max_tokens' => 60,
            ],
        ]);

        // Decode the HTTP response before reading the completion text
        $data = json_decode((string) $response->getBody(), true);

        return explode(', ', trim($data['choices'][0]['message']['content']));
    }

    public function generateAISummary($content)
    {
        // Similar implementation for summaries
    }
}
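The summary method mirrors generateKeywords; here's a sketch of what it might look like (the system prompt and token limit here are my assumptions):

public function generateAISummary($content)
{
    $response = $this->client->post('chat/completions', [
        'json' => [
            'model' => 'gpt-4',
            'messages' => [
                [
                    'role' => 'system',
                    'content' => 'You are a helpful assistant. Reply with one short paragraph summarizing the content.'
                ],
                [
                    'role' => 'user',
                    'content' => "Summarize the following content:\n\n" . $content
                ],
            ],
            'max_tokens' => 150,
        ],
    ]);

    $data = json_decode((string) $response->getBody(), true);

    return trim($data['choices'][0]['message']['content']);
}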

Real-time Updates with Observers

The observer pattern keeps our search index synchronized with content changes. When a page is saved:

  1. Checks if AI enrichment is needed (based on content changes)
  2. Processes the content through OpenAI if required
  3. Updates the search index
  4. Handles failures gracefully to ensure index consistency

This approach ensures users always get the most up-to-date search results:

class PageObserver
{
    protected $elasticsearchService;
    protected $pageEnrichmentService;

    public function __construct(
        ElasticsearchService $elasticsearchService,
        PageEnrichmentService $pageEnrichmentService
    ) {
        $this->elasticsearchService = $elasticsearchService;
        $this->pageEnrichmentService = $pageEnrichmentService;
    }

    public function saved(Page $page)
    {
        if ($this->shouldEnrich($page)) {
            try {
                $enrichedData = $this->pageEnrichmentService->enrichPage($page);

                // updateQuietly avoids re-triggering this observer on save
                $page->updateQuietly([
                    'ai_keywords' => $enrichedData['keywords'],
                    'ai_summary' => $enrichedData['summary'],
                    'last_enriched_at' => now(),
                ]);

                dispatch(new IndexPageJob($page));

            } catch (\Exception $e) {
                Log::error('AI enrichment failed', [
                    'page_id' => $page->id,
                    'error' => $e->getMessage()
                ]);

                $this->elasticsearchService->indexPage($page);
            }
        } else {
            $this->elasticsearchService->indexPage($page);
        }
    }
    protected function shouldEnrich(Page $page): bool
    {
        // A sketch of the check: re-enrich when the text changed on this
        // save, or when the page has never been enriched
        return $page->wasChanged('text') || is_null($page->last_enriched_at);
    }
}
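None of this fires unless the observer is registered, typically in a service provider (assuming the usual App\Models and App\Observers namespaces):

// app/Providers/AppServiceProvider.php
use App\Models\Page;
use App\Observers\PageObserver;

public function boot(): void
{
    // Hook PageObserver into every Page model event
    Page::observe(PageObserver::class);
}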

Content Discovery and Crawling

The crawling system is designed to be resilient: it tracks a crawl status on each site, falls back from sitemap-driven crawling to plain URL crawling, and notifies the site owner when a crawl finishes, whether it succeeded or failed.

The CrawlerService manages the entire crawling process:

class CrawlerService
{
    protected $httpClient;
    protected $processedSitemaps = [];
    protected $pageScrapeService;

    public function crawlSite(Site $site)
    {
        $site->update(['crawl_status' => CrawlStatus::IN_PROGRESS]);

        try {
            if ($site->sitemap_url) {
                $this->crawlBySitemap($site, $site->sitemap_url);
            } else {
                $this->crawlByUrl($site);
            }

            $site->update([
                'crawl_status' => CrawlStatus::COMPLETED, 
                'last_crawled_at' => Carbon::now()
            ]);

        } catch (\Exception $e) {
            Log::error("Crawl failed for " . $site->url . 
                      " with error: " . $e->getMessage());
            $site->update(['crawl_status' => CrawlStatus::FAILED]);
        } finally {
            Notification::send($site->user, new CrawlFinished($site));
        }
    }
}

Intelligent Sitemap Detection

The sitemap detection system tries to:

  1. Find standard sitemap locations (sitemap.xml)
  2. Handle various sitemap formats
  3. Fall back to URL-based crawling when needed

This makes the system work with virtually any website structure:

class SitemapService
{
    public function guessAndTestSitemapUrl($url)
    {
        $parsedUrl = parse_url($url);
        $sitemapUrl = $parsedUrl['scheme'] . '://' . 
                      $parsedUrl['host'] . '/sitemap.xml';

        try {
            $response = (new Client())->get($sitemapUrl);
            if ($response->getStatusCode() == 200) {
                return $sitemapUrl;
            }
        } catch (\Exception $e) {
            // Handle failure gracefully
        }

        return null;
    }
}
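Note the $processedSitemaps property on CrawlerService: sitemaps are often sitemap index files pointing at further sitemaps, and that list guards against re-processing (or cycling through) nested indexes. A sketch of how crawlBySitemap might handle both formats (the scrape method on PageScrapeService is an assumed name):

protected function crawlBySitemap(Site $site, $sitemapUrl)
{
    // Skip sitemaps we've already seen to avoid loops in nested indexes
    if (in_array($sitemapUrl, $this->processedSitemaps)) {
        return;
    }
    $this->processedSitemaps[] = $sitemapUrl;

    $xml = simplexml_load_string(
        (string) $this->httpClient->get($sitemapUrl)->getBody()
    );

    if ($xml === false) {
        return; // unparseable sitemap; this sketch just skips it
    }

    if ($xml->getName() === 'sitemapindex') {
        // Sitemap index format: recurse into each child sitemap
        foreach ($xml->sitemap as $child) {
            $this->crawlBySitemap($site, (string) $child->loc);
        }
        return;
    }

    // Standard urlset format: scrape each page, respecting the crawl delay
    foreach ($xml->url as $entry) {
        $this->pageScrapeService->scrape($site, (string) $entry->loc);
        sleep(config('sitesearch.crawl.request_delay'));
    }
}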

Rate Limiting and Error Handling

Rate limiting is crucial for protecting the search API from abuse and keeping OpenAI costs predictable.

API Rate Limiting

Search requests use Laravel's built-in RateLimiter with a per-user key and a one-minute decay window:

class SearchController extends Controller
{
    public function search(Request $request)
    {
        $key = 'search-' . $request->user()->id;
        $maxAttempts = config('sitesearch.rate_limits.search');

        if (RateLimiter::tooManyAttempts($key, $maxAttempts)) {
            $seconds = RateLimiter::availableIn($key);
            return response()->json([
                'error' => 'Too many requests',
                'retry_after' => $seconds
            ], 429);
        }

        RateLimiter::hit($key, 60); // Reset after 60 seconds

        // Proceed with search
    }
}
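A design note: the same limit can be expressed declaratively with a named limiter and the throttle middleware instead of hand-rolling the check, which keeps the controller thin:

// In a service provider's boot() method
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\RateLimiter;

RateLimiter::for('search', function (Request $request) {
    return Limit::perMinute(config('sitesearch.rate_limits.search'))
        ->by($request->user()->id);
});

// routes/api.php
Route::middleware(['auth:sanctum', 'throttle:search'])
    ->get('/search', [SearchController::class, 'search']);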

OpenAI Rate Limiting

A separate rate limiter keeps OpenAI traffic within the per-minute budget from config/sitesearch.php, implemented as a simple Redis counter:

class OpenAIRateLimiter
{
    protected $redis;
    protected $maxRequestsPerMinute;

    public function __construct()
    {
        // Assumes the default Redis connection; the budget comes from config
        $this->redis = Redis::connection();
        $this->maxRequestsPerMinute = config('sitesearch.rate_limits.openai');
    }

    public function acquire(): bool
    {
        $key = 'openai_requests:' . now()->format('Y-m-d-H-i');
        $count = $this->redis->incr($key);

        if ($count === 1) {
            $this->redis->expire($key, 60);
        }

        return $count <= $this->maxRequestsPerMinute;
    }
}
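A queued enrichment job (the EnrichPageJob here is hypothetical) can then check acquire() before touching the API and re-queue itself when the budget is spent:

// Inside a hypothetical EnrichPageJob using the standard Queueable traits
public function handle(OpenAIRateLimiter $limiter, PageEnrichmentService $enrichment)
{
    if (! $limiter->acquire()) {
        // Over this minute's budget: put the job back on the queue
        $this->release(30);
        return;
    }

    $enrichment->enrichPage($this->page);
}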

Error Handling and Retry Strategies

The retry system implements exponential backoff to handle transient failures: network timeouts, Elasticsearch hiccups, and OpenAI rate-limit responses.

This makes the system more resilient:

trait RetryableOperation
{
    protected function withRetry(callable $operation, $maxAttempts = 3)
    {
        $attempt = 1;

        while ($attempt <= $maxAttempts) {
            try {
                return $operation();
            } catch (\Exception $e) {
                Log::warning("Attempt {$attempt} failed", [
                    'error' => $e->getMessage()
                ]);

                if ($attempt === $maxAttempts) {
                    throw $e;
                }

                sleep(pow(2, $attempt - 1)); // Exponential backoff
                $attempt++;
            }
        }
    }
}
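Any service can pull the trait in and wrap its fragile calls. For example, the enrichment service might retry the OpenAI round-trips like this (a sketch; the $openAI property is an assumption):

class PageEnrichmentService
{
    use RetryableOperation;

    public function enrichPage(Page $page): array
    {
        // Retry both OpenAI calls together, up to three attempts with backoff
        return $this->withRetry(function () use ($page) {
            return [
                'keywords' => $this->openAI->generateKeywords($page->text),
                'summary'  => $this->openAI->generateAISummary($page->text),
            ];
        });
    }
}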

Performance Optimization

Elasticsearch Query Optimization

Query optimization focuses on request-level caching, collapsing duplicate documents, and handling deep pagination:

protected function optimizeQuery($params)
{
    // Add query cache
    $params['request_cache'] = true;

    // Add field collapsing for duplicate content
    $params['body']['collapse'] = [
        'field' => 'content_hash'
    ];

    // Add pagination optimization
    if (($params['body']['from'] ?? 0) > 1000) {
        // Switch to search_after for deep pagination; note that
        // search_after requires a deterministic sort on the query
        $params['body']['search_after'] = $this->getSearchAfterParams();
        unset($params['body']['from']);
    }

    return $params;
}

Caching Strategy

The caching system caches results per site and query for an hour, with a popularity check deciding which queries deserve a cache slot at all:

class SearchService
{
    public function search($query, $site)
    {
        $cacheKey = "search:{$site->id}:{$query}";

        // Only cache queries popular enough to be worth a cache slot
        if (! $this->shouldCache($query)) {
            return $this->performSearch($query, $site);
        }

        return Cache::remember($cacheKey, now()->addMinutes(60), function () use ($query, $site) {
            return $this->performSearch($query, $site);
        });
    }

    protected function shouldCache($query): bool
    {
        // Cache popular searches more aggressively
        $popularity = SearchMetric::getPopularity($query);
        return $popularity > config('sitesearch.cache_threshold');
    }
}
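Cached results also need to be invalidated when pages are re-indexed. With a taggable cache store like Redis, tagging entries per site makes that a one-liner (a sketch; Cache::tags does not work with the file or database stores):

// Tag each site's cached results so they can be flushed together
return Cache::tags(["site:{$site->id}"])
    ->remember($cacheKey, now()->addMinutes(60), function () use ($query, $site) {
        return $this->performSearch($query, $site);
    });

// After a page is (re)indexed, drop that site's cached results
Cache::tags(["site:{$page->site->id}"])->flush();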

Monitoring and Analytics

Search Analytics Dashboard

Analytics tracking helps surface popular queries, spot searches that return nothing, and measure response times:

class SearchMetric extends Model
{
    protected $fillable = [
        'site_id',
        'query',
        'results_count',
        'response_time',
        'user_clicked',
        'position_clicked'
    ];

    public static function recordSearch($query, $results, $duration)
    {
        return self::create([
            'site_id' => $results['site_id'],
            'query' => $query,
            'results_count' => $results['total'],
            'response_time' => $duration,
            // created_at records the time automatically
        ]);
    }

    public static function getPopularQueries($siteId, $limit = 10)
    {
        return self::where('site_id', $siteId)
            ->select('query', DB::raw('count(*) as count'))
            ->groupBy('query')
            ->orderByDesc('count')
            ->limit($limit)
            ->get();
    }
}
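The user_clicked and position_clicked columns exist for click-through tracking; a small sketch of how a click endpoint might write them back to the original search row:

public static function recordClick($metricId, $position)
{
    // Mark which result the user actually clicked, and where it ranked
    return self::where('id', $metricId)->update([
        'user_clicked' => true,
        'position_clicked' => $position,
    ]);
}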

Performance Monitoring

Health checks confirm that the Elasticsearch cluster is reachable and its shards are in a usable state:

class ElasticsearchHealthCheck
{
    public function check()
    {
        try {
            $client = app(ElasticsearchService::class)->getClient();
            $health = $client->cluster()->health();

            return [
                'healthy' => in_array($health['status'], ['green', 'yellow']),
                'status' => $health['status'],
                'nodes' => $health['number_of_nodes'],
                'active_shards' => $health['active_shards']
            ];
        } catch (\Exception $e) {
            return [
                'healthy' => false,
                'error' => $e->getMessage()
            ];
        }
    }
}
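Running the check on a schedule keeps problems visible. A sketch using Laravel 11's routes/console.php (older versions would use the console Kernel's schedule method instead):

use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    $status = app(ElasticsearchHealthCheck::class)->check();

    if (! $status['healthy']) {
        // Surface cluster problems in the log / alerting pipeline
        Log::critical('Elasticsearch cluster unhealthy', $status);
    }
})->everyFiveMinutes();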

Scaling Considerations

  1. Horizontal Scaling

    • Use Elasticsearch's built-in clustering
    • Implement read replicas for search queries
    • Scale Laravel queue workers for background processing
  2. Content Processing

    • Implement batch processing for large sites
    • Use queued jobs for AI enrichment
    • Implement progressive loading for large result sets
  3. Cost Management

    • Cache expensive OpenAI calls
    • Implement tiered pricing based on usage
    • Optimize index storage and replication

Troubleshooting Guide

Common issues and solutions:

  1. Indexing Issues

    • Check Elasticsearch cluster health
    • Verify index mappings
    • Monitor bulk indexing jobs
  2. Search Relevance

    • Adjust field weights
    • Review AI enrichment quality
    • Analyze search logs
  3. Performance Issues

    • Monitor query response times
    • Check cache hit rates
    • Review resource utilization

Conclusion

Building a modern search engine is complex, but the combination of Laravel's elegant architecture, Elasticsearch's powerful search capabilities, and OpenAI's intelligence creates a robust and scalable solution. The key is finding the right balance between features, performance, and maintainability.