Building a Modern Site Search Engine with Laravel, Elasticsearch, and AI

February 3, 2025

projects development

When I set out to build SiteSearch, I wanted to create more than just another search tool. The goal was to build a powerful, scalable search solution that could handle multiple websites while providing intelligent, relevant results. Here's a deep dive into how I built it using Laravel, Elasticsearch, and OpenAI.

The Challenge

Building a site search engine isn't just about matching keywords. Modern search needs to understand context, handle real-time updates, and provide relevant results across different types of content. The key challenges were:

  1. Content Synchronization: Pages change frequently, requiring real-time index updates
  2. Multi-tenancy: Each site needs its own isolated search index
  3. Content Structure: Different sites have varying HTML structures
  4. Performance: Search needs to be fast, even with large datasets
  5. Intelligence: Search results need to understand context beyond simple keyword matching

Getting Started

Before diving into the advanced features, let's set up the basic requirements. We'll need Laravel as our foundation, Elasticsearch for powerful search capabilities, Sanctum for API authentication, and Guzzle for making HTTP requests to external services.

Alongside the packages, two environment variables point the application at its external services:

# Create a new Laravel app, then install the required packages
composer create-project laravel/laravel sitesearch
cd sitesearch
composer require elasticsearch/elasticsearch
composer require laravel/sanctum
composer require guzzlehttp/guzzle

# Set up environment variables
ELASTICSEARCH_HOST=localhost:9200
OPENAI_API_KEY=your_api_key

Basic Configuration

The configuration file sets up our core services. We separate these settings to make the application easily configurable across different environments:

// config/sitesearch.php
return [
    'elasticsearch' => [
        'host' => env('ELASTICSEARCH_HOST', 'localhost:9200'),
        'ssl_verify' => env('ELASTICSEARCH_SSL_VERIFY', true),
    ],
    'crawl' => [
        'max_pages_per_site' => env('MAX_PAGES_PER_SITE', 1000),
        'request_delay' => env('CRAWL_REQUEST_DELAY', 1), // seconds
    ],
    'rate_limits' => [
        'search' => env('SEARCH_RATE_LIMIT', 60), // per minute
        'openai' => env('OPENAI_RATE_LIMIT', 100), // per minute
    ]
];

The Architecture

The system is built with several key components working together:

Core Search Service with Elasticsearch

The ElasticsearchService is the backbone of our search functionality. It handles client configuration, document indexing, and query execution.

The indexPage method ensures real-time indexing of content, while searchPages implements a sophisticated multi-field search strategy:

use Elastic\Elasticsearch\ClientBuilder; // elasticsearch/elasticsearch v8
use Illuminate\Support\Facades\Log;

class ElasticsearchService
{
    protected $client;

    public function __construct()
    {
        $this->client = ClientBuilder::create()
            ->setHosts([config('sitesearch.elasticsearch.host')])
            ->setRetries(2)
            ->setSSLVerification(config('sitesearch.elasticsearch.ssl_verify'))
            ->build();
    }

    // Exposed so health checks can reach the underlying client
    public function getClient()
    {
        return $this->client;
    }

    public function indexPage($page)
    {
        try {
            $this->client->index([
                'index' => $page->site->site_index_name,
                'id'    => $page->uuid,
                'body'  => $page->toElasticArray(),
                'refresh' => true  // Ensure immediate searchability
            ]);
        } catch (\Exception $e) {
            Log::error('Elasticsearch indexing failed', [
                'page_id' => $page->id,
                'error' => $e->getMessage()
            ]);
            throw $e;
        }
    }

    public function searchPages($query, $siteId, $offset, $perPage)
    {
        $site = Site::find($siteId);
        $params = [
            'index' => $site->site_index_name,
            'body'  => [
                'from' => $offset,
                'size' => $perPage,
                'query' => [
                    'bool' => [
                        'must' => [
                            ['multi_match' => [
                                'query' => $query,
                                'fields' => [
                                    'title^3',
                                    'ai_title^4',
                                    'description^2',
                                    'ai_description^3',
                                    'open_graph^1.5',
                                    'keywords^2',
                                    'ai_keywords^2.5',
                                    'text'
                                ],
                                'type' => 'best_fields',
                                'tie_breaker' => 0.3,
                                'minimum_should_match' => '80%'
                            ]],
                        ]
                    ]
                ],
                'highlight' => [
                    'fields' => [
                        'title' => new \stdClass(),
                        'description' => new \stdClass(),
                        'text' => [
                            'fragment_size' => 150,
                            'number_of_fragments' => 3
                        ]
                    ]
                ]
            ]
        ];

        return $this->client->search($params);
    }
}
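Search relevance also depends on the index mapping, which isn't shown in this post. A minimal sketch of creating a per-site index, assuming the field names used in searchPages plus the content_hash field used later for duplicate collapsing, could look like this:

public function createIndex(Site $site)
{
    // Minimal mapping sketch: text fields for everything searchPages
    // queries, plus a keyword field for duplicate collapsing. A real
    // mapping would also configure analyzers.
    $this->client->indices()->create([
        'index' => $site->site_index_name,
        'body'  => [
            'mappings' => [
                'properties' => [
                    'title'          => ['type' => 'text'],
                    'ai_title'       => ['type' => 'text'],
                    'description'    => ['type' => 'text'],
                    'ai_description' => ['type' => 'text'],
                    'keywords'       => ['type' => 'text'],
                    'ai_keywords'    => ['type' => 'text'],
                    'open_graph'     => ['type' => 'text'],
                    'text'           => ['type' => 'text'],
                    'content_hash'   => ['type' => 'keyword'],
                ]
            ]
        ]
    ]);
}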

AI-Powered Content Enhancement

The OpenAI integration adds an intelligent layer to our search engine. It generates keywords and summaries for each page, which are indexed alongside the original content as the ai_* fields weighted in the search query above.

The service uses GPT-4; error handling and rate limiting are layered on separately (covered later in this post):

class OpenAIService
{
    protected $client;

    public function __construct()
    {
        $this->client = new Client([
            'base_uri' => 'https://api.openai.com/v1/',
            'headers' => [
                'Authorization' => 'Bearer ' . env('OPENAI_API_KEY'),
                'Content-Type' => 'application/json',
            ],
        ]);
    }

    public function generateKeywords($content)
    {
        $response = $this->client->post('chat/completions', [
            'json' => [
                'model' => 'gpt-4',
                'messages' => [
                    [
                        'role' => 'system', 
                        'content' => 'You are a helpful assistant. You will just give me a list of keywords, comma separated, that I can plug into a database.'
                    ],
                    [
                        'role' => 'user', 
                        'content' => "Generate keywords for the following content:\n\n" . $content
                    ],
                ],
                'max_tokens' => 60,
            ],
        ]);

        // Decode the HTTP response before reading the completion text
        $data = json_decode((string) $response->getBody(), true);

        return explode(', ', trim($data['choices'][0]['message']['content']));
    }

    public function generateAISummary($content)
    {
        // Similar implementation for summaries
    }
}
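The summary method mirrors generateKeywords; here's a sketch of what it might look like (the system prompt and token limit here are my assumptions):

public function generateAISummary($content)
{
    $response = $this->client->post('chat/completions', [
        'json' => [
            'model' => 'gpt-4',
            'messages' => [
                [
                    'role' => 'system',
                    'content' => 'You are a helpful assistant. Reply with one short paragraph summarizing the content.'
                ],
                [
                    'role' => 'user',
                    'content' => "Summarize the following content:\n\n" . $content
                ],
            ],
            'max_tokens' => 150,
        ],
    ]);

    $data = json_decode((string) $response->getBody(), true);

    return trim($data['choices'][0]['message']['content']);
}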

Real-time Updates with Observers

The observer pattern keeps our search index synchronized with content changes. When a page is saved:

  1. Checks if AI enrichment is needed (based on content changes)
  2. Processes the content through OpenAI if required
  3. Updates the search index
  4. Handles failures gracefully to ensure index consistency

This approach ensures users always get the most up-to-date search results:

class PageObserver
{
    protected $elasticsearchService;
    protected $pageEnrichmentService;

    public function __construct(
        ElasticsearchService $elasticsearchService,
        PageEnrichmentService $pageEnrichmentService
    ) {
        $this->elasticsearchService = $elasticsearchService;
        $this->pageEnrichmentService = $pageEnrichmentService;
    }

    public function saved(Page $page)
    {
        if ($this->shouldEnrich($page)) {
            try {
                $enrichedData = $this->pageEnrichmentService->enrichPage($page);

                // updateQuietly avoids re-triggering this observer on save
                $page->updateQuietly([
                    'ai_keywords' => $enrichedData['keywords'],
                    'ai_summary' => $enrichedData['summary'],
                    'last_enriched_at' => now(),
                ]);

                dispatch(new IndexPageJob($page));

            } catch (\Exception $e) {
                Log::error('AI enrichment failed', [
                    'page_id' => $page->id,
                    'error' => $e->getMessage()
                ]);

                $this->elasticsearchService->indexPage($page);
            }
        } else {
            $this->elasticsearchService->indexPage($page);
        }
    }
    protected function shouldEnrich(Page $page): bool
    {
        // A sketch of the check: re-enrich when the text changed on this
        // save, or when the page has never been enriched
        return $page->wasChanged('text') || is_null($page->last_enriched_at);
    }
}
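None of this fires unless the observer is registered, typically in a service provider (assuming the usual App\Models and App\Observers namespaces):

// app/Providers/AppServiceProvider.php
use App\Models\Page;
use App\Observers\PageObserver;

public function boot(): void
{
    // Hook PageObserver into every Page model event
    Page::observe(PageObserver::class);
}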

Content Discovery and Crawling

The crawling system is designed to be resilient: it tracks a crawl status on each site, falls back from sitemap-driven crawling to plain URL crawling, and notifies the site owner when a crawl finishes, whether it succeeded or failed.

The CrawlerService manages the entire crawling process:

class CrawlerService
{
    protected $httpClient;
    protected $processedSitemaps = [];
    protected $pageScrapeService;

    public function crawlSite(Site $site)
    {
        $site->update(['crawl_status' => CrawlStatus::IN_PROGRESS]);

        try {
            if ($site->sitemap_url) {
                $this->crawlBySitemap($site, $site->sitemap_url);
            } else {
                $this->crawlByUrl($site);
            }

            $site->update([
                'crawl_status' => CrawlStatus::COMPLETED, 
                'last_crawled_at' => Carbon::now()
            ]);

        } catch (\Exception $e) {
            Log::error("Crawl failed for " . $site->url . 
                      " with error: " . $e->getMessage());
            $site->update(['crawl_status' => CrawlStatus::FAILED]);
        } finally {
            Notification::send($site->user, new CrawlFinished($site));
        }
    }
}

Intelligent Sitemap Detection

The sitemap detection system tries to:

  1. Find standard sitemap locations (sitemap.xml)
  2. Handle various sitemap formats
  3. Fall back to URL-based crawling when needed

This makes the system work with virtually any website structure:

class SitemapService
{
    public function guessAndTestSitemapUrl($url)
    {
        $parsedUrl = parse_url($url);
        $sitemapUrl = $parsedUrl['scheme'] . '://' . 
                      $parsedUrl['host'] . '/sitemap.xml';

        try {
            $response = (new Client())->get($sitemapUrl);
            if ($response->getStatusCode() == 200) {
                return $sitemapUrl;
            }
        } catch (\Exception $e) {
            // Handle failure gracefully
        }

        return null;
    }
}
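Note the $processedSitemaps property on CrawlerService: sitemaps are often sitemap index files pointing at further sitemaps, and that list guards against re-processing (or cycling through) nested indexes. A sketch of how crawlBySitemap might handle both formats (the scrape method on PageScrapeService is an assumed name):

protected function crawlBySitemap(Site $site, $sitemapUrl)
{
    // Skip sitemaps we've already seen to avoid loops in nested indexes
    if (in_array($sitemapUrl, $this->processedSitemaps)) {
        return;
    }
    $this->processedSitemaps[] = $sitemapUrl;

    $xml = simplexml_load_string(
        (string) $this->httpClient->get($sitemapUrl)->getBody()
    );

    if ($xml === false) {
        return; // unparseable sitemap; this sketch just skips it
    }

    if ($xml->getName() === 'sitemapindex') {
        // Sitemap index format: recurse into each child sitemap
        foreach ($xml->sitemap as $child) {
            $this->crawlBySitemap($site, (string) $child->loc);
        }
        return;
    }

    // Standard urlset format: scrape each page, respecting the crawl delay
    foreach ($xml->url as $entry) {
        $this->pageScrapeService->scrape($site, (string) $entry->loc);
        sleep(config('sitesearch.crawl.request_delay'));
    }
}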

Rate Limiting and Error Handling

Rate limiting is crucial for protecting the search API from abuse and keeping OpenAI costs predictable.

API Rate Limiting

Search requests use Laravel's built-in RateLimiter with a per-user key and a one-minute decay window:

class SearchController extends Controller
{
    public function search(Request $request)
    {
        $key = 'search-' . $request->user()->id;
        $maxAttempts = config('sitesearch.rate_limits.search');

        if (RateLimiter::tooManyAttempts($key, $maxAttempts)) {
            $seconds = RateLimiter::availableIn($key);
            return response()->json([
                'error' => 'Too many requests',
                'retry_after' => $seconds
            ], 429);
        }

        RateLimiter::hit($key, 60); // Reset after 60 seconds

        // Proceed with search
    }
}
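A design note: the same limit can be expressed declaratively with a named limiter and the throttle middleware instead of hand-rolling the check, which keeps the controller thin:

// In a service provider's boot() method
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\RateLimiter;

RateLimiter::for('search', function (Request $request) {
    return Limit::perMinute(config('sitesearch.rate_limits.search'))
        ->by($request->user()->id);
});

// routes/api.php
Route::middleware(['auth:sanctum', 'throttle:search'])
    ->get('/search', [SearchController::class, 'search']);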

OpenAI Rate Limiting

A separate rate limiter keeps OpenAI traffic within the per-minute budget from config/sitesearch.php, implemented as a simple Redis counter:

class OpenAIRateLimiter
{
    protected $redis;
    protected $maxRequestsPerMinute;

    public function __construct()
    {
        // Assumes the default Redis connection; the budget comes from config
        $this->redis = Redis::connection();
        $this->maxRequestsPerMinute = config('sitesearch.rate_limits.openai');
    }

    public function acquire(): bool
    {
        $key = 'openai_requests:' . now()->format('Y-m-d-H-i');
        $count = $this->redis->incr($key);

        if ($count === 1) {
            $this->redis->expire($key, 60);
        }

        return $count <= $this->maxRequestsPerMinute;
    }
}
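A queued enrichment job (the EnrichPageJob here is hypothetical) can then check acquire() before touching the API and re-queue itself when the budget is spent:

// Inside a hypothetical EnrichPageJob using the standard Queueable traits
public function handle(OpenAIRateLimiter $limiter, PageEnrichmentService $enrichment)
{
    if (! $limiter->acquire()) {
        // Over this minute's budget: put the job back on the queue
        $this->release(30);
        return;
    }

    $enrichment->enrichPage($this->page);
}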

Error Handling and Retry Strategies

The retry system implements exponential backoff to handle transient failures: network timeouts, Elasticsearch hiccups, and OpenAI rate-limit responses.

This makes the system more resilient:

trait RetryableOperation
{
    protected function withRetry(callable $operation, $maxAttempts = 3)
    {
        $attempt = 1;

        while ($attempt <= $maxAttempts) {
            try {
                return $operation();
            } catch (\Exception $e) {
                Log::warning("Attempt {$attempt} failed", [
                    'error' => $e->getMessage()
                ]);

                if ($attempt === $maxAttempts) {
                    throw $e;
                }

                sleep(pow(2, $attempt - 1)); // Exponential backoff
                $attempt++;
            }
        }
    }
}
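Any service can pull the trait in and wrap its fragile calls. For example, the enrichment service might retry the OpenAI round-trips like this (a sketch; the $openAI property is an assumption):

class PageEnrichmentService
{
    use RetryableOperation;

    public function enrichPage(Page $page): array
    {
        // Retry both OpenAI calls together, up to three attempts with backoff
        return $this->withRetry(function () use ($page) {
            return [
                'keywords' => $this->openAI->generateKeywords($page->text),
                'summary'  => $this->openAI->generateAISummary($page->text),
            ];
        });
    }
}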

Performance Optimization

Elasticsearch Query Optimization

Query optimization focuses on request-level caching, collapsing duplicate documents, and handling deep pagination:

protected function optimizeQuery($params)
{
    // Add query cache
    $params['request_cache'] = true;

    // Add field collapsing for duplicate content
    $params['body']['collapse'] = [
        'field' => 'content_hash'
    ];

    // Add pagination optimization
    if (($params['body']['from'] ?? 0) > 1000) {
        // Switch to search_after for deep pagination; note that
        // search_after requires a deterministic sort on the query
        $params['body']['search_after'] = $this->getSearchAfterParams();
        unset($params['body']['from']);
    }

    return $params;
}

Caching Strategy

The caching system caches results per site and query for an hour, with a popularity check deciding which queries deserve a cache slot at all:

class SearchService
{
    public function search($query, $site)
    {
        $cacheKey = "search:{$site->id}:{$query}";

        // Only cache queries popular enough to be worth a cache slot
        if (! $this->shouldCache($query)) {
            return $this->performSearch($query, $site);
        }

        return Cache::remember($cacheKey, now()->addMinutes(60), function () use ($query, $site) {
            return $this->performSearch($query, $site);
        });
    }

    protected function shouldCache($query): bool
    {
        // Cache popular searches more aggressively
        $popularity = SearchMetric::getPopularity($query);
        return $popularity > config('sitesearch.cache_threshold');
    }
}
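Cached results also need to be invalidated when pages are re-indexed. With a taggable cache store like Redis, tagging entries per site makes that a one-liner (a sketch; Cache::tags does not work with the file or database stores):

// Tag each site's cached results so they can be flushed together
return Cache::tags(["site:{$site->id}"])
    ->remember($cacheKey, now()->addMinutes(60), function () use ($query, $site) {
        return $this->performSearch($query, $site);
    });

// After a page is (re)indexed, drop that site's cached results
Cache::tags(["site:{$page->site->id}"])->flush();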

Monitoring and Analytics

Search Analytics Dashboard

Analytics tracking helps surface popular queries, spot searches that return nothing, and measure response times:

class SearchMetric extends Model
{
    protected $fillable = [
        'site_id',
        'query',
        'results_count',
        'response_time',
        'user_clicked',
        'position_clicked'
    ];

    public static function recordSearch($query, $results, $duration)
    {
        return self::create([
            'site_id' => $results['site_id'],
            'query' => $query,
            'results_count' => $results['total'],
            'response_time' => $duration,
            // created_at records the time automatically
        ]);
    }

    public static function getPopularQueries($siteId, $limit = 10)
    {
        return self::where('site_id', $siteId)
            ->select('query', DB::raw('count(*) as count'))
            ->groupBy('query')
            ->orderByDesc('count')
            ->limit($limit)
            ->get();
    }
}
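The user_clicked and position_clicked columns exist for click-through tracking; a small sketch of how a click endpoint might write them back to the original search row:

public static function recordClick($metricId, $position)
{
    // Mark which result the user actually clicked, and where it ranked
    return self::where('id', $metricId)->update([
        'user_clicked' => true,
        'position_clicked' => $position,
    ]);
}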

Performance Monitoring

Health checks confirm that the Elasticsearch cluster is reachable and its shards are in a usable state:

class ElasticsearchHealthCheck
{
    public function check()
    {
        try {
            $client = app(ElasticsearchService::class)->getClient();
            $health = $client->cluster()->health();

            return [
                'healthy' => in_array($health['status'], ['green', 'yellow']),
                'status' => $health['status'],
                'nodes' => $health['number_of_nodes'],
                'active_shards' => $health['active_shards']
            ];
        } catch (\Exception $e) {
            return [
                'healthy' => false,
                'error' => $e->getMessage()
            ];
        }
    }
}
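Running the check on a schedule keeps problems visible. A sketch using Laravel 11's routes/console.php (older versions would use the console Kernel's schedule method instead):

use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    $status = app(ElasticsearchHealthCheck::class)->check();

    if (! $status['healthy']) {
        // Surface cluster problems in the log / alerting pipeline
        Log::critical('Elasticsearch cluster unhealthy', $status);
    }
})->everyFiveMinutes();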

Scaling Considerations

  1. Horizontal Scaling

    • Use Elasticsearch's built-in clustering
    • Implement read replicas for search queries
    • Scale Laravel queue workers for background processing
  2. Content Processing

    • Implement batch processing for large sites
    • Use queued jobs for AI enrichment
    • Implement progressive loading for large result sets
  3. Cost Management

    • Cache expensive OpenAI calls
    • Implement tiered pricing based on usage
    • Optimize index storage and replication

Troubleshooting Guide

Common issues and solutions:

  1. Indexing Issues

    • Check Elasticsearch cluster health
    • Verify index mappings
    • Monitor bulk indexing jobs
  2. Search Relevance

    • Adjust field weights
    • Review AI enrichment quality
    • Analyze search logs
  3. Performance Issues

    • Monitor query response times
    • Check cache hit rates
    • Review resource utilization

Conclusion

Building a modern search engine is complex, but the combination of Laravel's elegant architecture, Elasticsearch's powerful search capabilities, and OpenAI's intelligence creates a robust and scalable solution. The key is finding the right balance between features, performance, and maintainability.