Building a Modern Site Search Engine with Laravel, Elasticsearch, and AI
February 3, 2025
When I set out to build SiteSearch, I wanted to create more than just another search tool. The goal was to build a powerful, scalable search solution that could handle multiple websites while providing intelligent, relevant results. Here's a deep dive into how I built it using Laravel, Elasticsearch, and OpenAI.
The Challenge
Building a site search engine isn't just about matching keywords. Modern search needs to understand context, handle real-time updates, and provide relevant results across different types of content. The key challenges were:
- Content Synchronization: Pages change frequently, requiring real-time index updates
- Multi-tenancy: Each site needs its own isolated search index
- Content Structure: Different sites have varying HTML structures
- Performance: Search needs to be fast, even with large datasets
- Intelligence: Search results need to understand context beyond simple keyword matching
Getting Started
Before diving into the advanced features, let's set up the basic requirements. We'll need Laravel as our foundation, Elasticsearch for powerful search capabilities, Sanctum for API authentication, and Guzzle for making HTTP requests to external services.
The environment variables are crucial for configuration:
- ELASTICSEARCH_HOST: Points to your Elasticsearch instance
- OPENAI_API_KEY: Required for AI-powered content enhancement
# Create a new Laravel application and install the required packages
composer create-project laravel/laravel sitesearch
cd sitesearch
composer require elasticsearch/elasticsearch
composer require laravel/sanctum
composer require guzzlehttp/guzzle
# Set up environment variables in .env
ELASTICSEARCH_HOST=localhost:9200
OPENAI_API_KEY=your_api_key
Basic Configuration
The configuration file sets up our core services. We separate these settings to make the application easily configurable across different environments:
- Elasticsearch settings control connection and SSL verification
- Crawl settings prevent overwhelming target sites
- Rate limits protect our API and manage OpenAI costs
// config/sitesearch.php
return [
    'elasticsearch' => [
        'host' => env('ELASTICSEARCH_HOST', 'localhost:9200'),
        'ssl_verify' => env('ELASTICSEARCH_SSL_VERIFY', true),
    ],
    'crawl' => [
        'max_pages_per_site' => env('MAX_PAGES_PER_SITE', 1000),
        'request_delay' => env('CRAWL_REQUEST_DELAY', 1), // seconds
    ],
    'rate_limits' => [
        'search' => env('SEARCH_RATE_LIMIT', 60), // per minute
        'openai' => env('OPENAI_RATE_LIMIT', 100), // per minute
    ],
];
The Architecture
The system is built with several key components working together:
Core Search Service with Elasticsearch
The ElasticsearchService is the backbone of our search functionality. It handles:
- Index management: Creating and updating search indices for each site
- Document indexing: Converting web pages into searchable documents
- Search operations: Executing complex queries with relevance scoring
The indexPage method ensures real-time indexing of content, while searchPages implements a sophisticated multi-field search strategy:
- Title fields get higher weight (^3, ^4) because they're typically more relevant
- AI-enhanced fields (ai_title, ai_description) get extra weight for better context
- The tie_breaker helps when the same term appears in multiple fields
- minimum_should_match ensures quality results by requiring 80% term matches
use App\Models\Site;
use Elastic\Elasticsearch\ClientBuilder; // v8 client (v7 uses Elasticsearch\ClientBuilder)
use Illuminate\Support\Facades\Log;

class ElasticsearchService
{
    protected $client;

    public function __construct()
    {
        $this->client = ClientBuilder::create()
            ->setHosts([config('sitesearch.elasticsearch.host')])
            ->setRetries(2)
            ->setSSLVerification(config('sitesearch.elasticsearch.ssl_verify'))
            ->build();
    }

    public function indexPage($page)
    {
        try {
            $this->client->index([
                'index' => $page->site->site_index_name,
                'id' => $page->uuid,
                'body' => $page->toElasticArray(),
                'refresh' => true, // Ensure immediate searchability
            ]);
        } catch (\Exception $e) {
            Log::error('Elasticsearch indexing failed', [
                'page_id' => $page->id,
                'error' => $e->getMessage(),
            ]);
            throw $e;
        }
    }

    public function searchPages($query, $siteId, $offset, $perPage)
    {
        $site = Site::find($siteId);
        $params = [
            'index' => $site->site_index_name,
            'body' => [
                'from' => $offset,
                'size' => $perPage,
                'query' => [
                    'bool' => [
                        'must' => [
                            ['multi_match' => [
                                'query' => $query,
                                'fields' => [
                                    'title^3',
                                    'ai_title^4',
                                    'description^2',
                                    'ai_description^3',
                                    'open_graph^1.5',
                                    'keywords^2',
                                    'ai_keywords^2.5',
                                    'text',
                                ],
                                'type' => 'best_fields',
                                'tie_breaker' => 0.3,
                                'minimum_should_match' => '80%',
                            ]],
                        ],
                    ],
                ],
                'highlight' => [
                    'fields' => [
                        'title' => new \stdClass(),
                        'description' => new \stdClass(),
                        'text' => [
                            'fragment_size' => 150,
                            'number_of_fragments' => 3,
                        ],
                    ],
                ],
            ],
        ];

        return $this->client->search($params);
    }
}
AI-Powered Content Enhancement
The OpenAI integration adds an intelligent layer to our search engine. It:
- Generates relevant keywords that might not be in the original content
- Creates AI-powered summaries for better context
- Enhances search relevance by understanding content meaning
The service uses GPT-4 for keyword generation; error handling and rate limiting for these calls are covered later in this post:
use GuzzleHttp\Client;

class OpenAIService
{
    protected $client;

    public function __construct()
    {
        $this->client = new Client([
            'base_uri' => 'https://api.openai.com/v1/',
            'headers' => [
                'Authorization' => 'Bearer ' . env('OPENAI_API_KEY'),
                'Content-Type' => 'application/json',
            ],
        ]);
    }

    public function generateKeywords($content)
    {
        $response = $this->client->post('chat/completions', [
            'json' => [
                'model' => 'gpt-4',
                'messages' => [
                    [
                        'role' => 'system',
                        'content' => 'You are a helpful assistant. You will just give me a list of keywords, comma separated, that I can plug into a database.'
                    ],
                    [
                        'role' => 'user',
                        'content' => "Generate keywords for the following content:\n\n" . $content
                    ],
                ],
                'max_tokens' => 60,
            ],
        ]);

        // Decode the response body before reading the completion
        $data = json_decode($response->getBody()->getContents(), true);

        return explode(', ', trim($data['choices'][0]['message']['content']));
    }

    public function generateAISummary($content)
    {
        // Similar implementation for summaries
    }
}
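The summary method can mirror the keyword one almost exactly. Here's a minimal sketch to fill in the stub; the prompt wording and token budget are illustrative, not the original:

public function generateAISummary($content)
{
    $response = $this->client->post('chat/completions', [
        'json' => [
            'model' => 'gpt-4',
            'messages' => [
                [
                    'role' => 'system',
                    'content' => 'Summarize the following web page content in two or three sentences.'
                ],
                ['role' => 'user', 'content' => $content],
            ],
            'max_tokens' => 150,
        ],
    ]);

    $data = json_decode($response->getBody()->getContents(), true);

    return trim($data['choices'][0]['message']['content']);
}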
Real-time Updates with Observers
The observer pattern keeps our search index synchronized with content changes. When a page is saved:
- It checks if AI enrichment is needed (based on content changes)
- Processes the content through OpenAI if required
- Updates the search index
- Handles failures gracefully to ensure index consistency
This approach ensures users always get the most up-to-date search results:
class PageObserver
{
    protected $elasticsearchService;
    protected $pageEnrichmentService;

    public function __construct(
        ElasticsearchService $elasticsearchService,
        PageEnrichmentService $pageEnrichmentService
    ) {
        $this->elasticsearchService = $elasticsearchService;
        $this->pageEnrichmentService = $pageEnrichmentService;
    }

    public function saved(Page $page)
    {
        if ($this->shouldEnrich($page)) {
            try {
                $enrichedData = $this->pageEnrichmentService->enrichPage($page);
                // updateQuietly avoids re-triggering this observer
                $page->updateQuietly([
                    'ai_keywords' => $enrichedData['keywords'],
                    'ai_summary' => $enrichedData['summary'],
                    'last_enriched_at' => now(),
                ]);
                dispatch(new IndexPageJob($page));
            } catch (\Exception $e) {
                Log::error('AI enrichment failed', [
                    'page_id' => $page->id,
                    'error' => $e->getMessage(),
                ]);
                // Index anyway so the search index stays consistent
                $this->elasticsearchService->indexPage($page);
            }
        } else {
            $this->elasticsearchService->indexPage($page);
        }
    }
}
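Two pieces are implied but not shown: the shouldEnrich check and the observer registration. A minimal sketch of the check, assuming content_hash (the same field used for field collapsing later in this post) is a stored hash of the page body:

protected function shouldEnrich(Page $page): bool
{
    // Enrich new pages, and re-enrich only when the content
    // actually changed in this save
    return $page->last_enriched_at === null
        || $page->wasChanged('content_hash');
}

And the observer needs to be registered once, typically in a service provider:

// app/Providers/AppServiceProvider.php
public function boot(): void
{
    Page::observe(PageObserver::class);
}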
Content Discovery and Crawling
The crawling system is designed to be:
- Respectful of target sites (with rate limiting)
- Efficient (using sitemaps when available)
- Scalable (handling sites of any size)
- Reliable (with comprehensive error handling)
The CrawlerService manages the entire crawling process:
class CrawlerService
{
    protected $httpClient;
    protected $processedSitemaps = [];
    protected $pageScrapeService;

    public function crawlSite(Site $site)
    {
        $site->update(['crawl_status' => CrawlStatus::IN_PROGRESS]);

        try {
            if ($site->sitemap_url) {
                $this->crawlBySitemap($site, $site->sitemap_url);
            } else {
                $this->crawlByUrl($site);
            }

            $site->update([
                'crawl_status' => CrawlStatus::COMPLETED,
                'last_crawled_at' => Carbon::now(),
            ]);
        } catch (\Exception $e) {
            Log::error("Crawl failed for " . $site->url .
                " with error: " . $e->getMessage());
            $site->update(['crawl_status' => CrawlStatus::FAILED]);
        } finally {
            Notification::send($site->user, new CrawlFinished($site));
        }
    }
}
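crawlBySitemap itself isn't shown above. Here's a minimal sketch of how it could work, reusing the class's processedSitemaps guard and the crawl config from earlier; the scrapeAndStorePage helper on the page scrape service is an assumption:

protected function crawlBySitemap(Site $site, string $sitemapUrl)
{
    // Guard against loops: sitemap indexes can reference each other
    if (in_array($sitemapUrl, $this->processedSitemaps)) {
        return;
    }
    $this->processedSitemaps[] = $sitemapUrl;

    $xml = simplexml_load_string(
        (string) $this->httpClient->get($sitemapUrl)->getBody()
    );

    // Sitemap index files: recurse into each child sitemap
    foreach ($xml->sitemap as $child) {
        $this->crawlBySitemap($site, (string) $child->loc);
    }

    // Regular sitemap files: scrape each URL, respecting the limits
    foreach ($xml->url as $entry) {
        if ($site->pages()->count() >= config('sitesearch.crawl.max_pages_per_site')) {
            break;
        }
        $this->pageScrapeService->scrapeAndStorePage($site, (string) $entry->loc);
        sleep(config('sitesearch.crawl.request_delay')); // be polite to the target site
    }
}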
Intelligent Sitemap Detection
The sitemap detection system tries to:
1. Find standard sitemap locations (sitemap.xml)
2. Handle various sitemap formats
3. Fall back to URL-based crawling when needed
This makes the system work with virtually any website structure:
class SitemapService
{
    public function guessAndTestSitemapUrl($url)
    {
        $parsedUrl = parse_url($url);
        $sitemapUrl = $parsedUrl['scheme'] . '://' .
            $parsedUrl['host'] . '/sitemap.xml';

        try {
            $response = (new Client())->get($sitemapUrl);
            if ($response->getStatusCode() == 200) {
                return $sitemapUrl;
            }
        } catch (\Exception $e) {
            // Handle failure gracefully and fall through
        }

        return null;
    }
}
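Many sites also advertise their sitemap in robots.txt, so a natural extension is to check there before falling back to URL-based crawling. A sketch using the same Guzzle client; the method name is mine:

protected function findSitemapInRobots(string $url): ?string
{
    $parsedUrl = parse_url($url);
    $robotsUrl = $parsedUrl['scheme'] . '://' . $parsedUrl['host'] . '/robots.txt';

    try {
        $body = (string) (new Client())->get($robotsUrl)->getBody();

        // robots.txt may contain one or more "Sitemap: <url>" lines
        if (preg_match('/^Sitemap:\s*(\S+)/mi', $body, $matches)) {
            return $matches[1];
        }
    } catch (\Exception $e) {
        // No robots.txt or unreachable; fall through to null
    }

    return null;
}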
Rate Limiting and Error Handling
Rate limiting is crucial for:
- Protecting the API from abuse
- Managing costs (especially for AI operations)
- Ensuring fair resource distribution
API Rate Limiting
The search rate limiting uses Laravel's built-in rate limiter with:
- Per-user limits
- Configurable windows
- Clear error responses
class SearchController extends Controller
{
    public function search(Request $request)
    {
        $key = 'search-' . $request->user()->id;
        $maxAttempts = config('sitesearch.rate_limits.search');

        if (RateLimiter::tooManyAttempts($key, $maxAttempts)) {
            $seconds = RateLimiter::availableIn($key);
            return response()->json([
                'error' => 'Too many requests',
                'retry_after' => $seconds,
            ], 429);
        }

        RateLimiter::hit($key, 60); // Counter resets after 60 seconds

        // Proceed with search
    }
}
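Continuing where "// Proceed with search" leaves off, here's a minimal sketch of the rest of the method, wired to the ElasticsearchService and SearchMetric pieces from this post; the validation rules and response shape are my assumptions:

$validated = $request->validate([
    'q' => 'required|string|max:200',
    'site_id' => 'required|exists:sites,id',
    'page' => 'sometimes|integer|min:1',
]);

$perPage = 10;
$offset = (($validated['page'] ?? 1) - 1) * $perPage;

$start = microtime(true);
$results = app(ElasticsearchService::class)->searchPages(
    $validated['q'], $validated['site_id'], $offset, $perPage
);
$durationMs = (microtime(true) - $start) * 1000;

SearchMetric::recordSearch($validated['q'], [
    'site_id' => $validated['site_id'],
    'total' => $results['hits']['total']['value'],
], $durationMs);

return response()->json($results['hits']);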
OpenAI Rate Limiting
A separate rate limiter for OpenAI operations helps:
- Control API costs
- Prevent quota exhaustion
- Maintain consistent performance
use Illuminate\Support\Facades\Redis;

class OpenAIRateLimiter
{
    protected $redis;
    protected $maxRequestsPerMinute;

    public function __construct()
    {
        $this->redis = Redis::connection();
        $this->maxRequestsPerMinute = config('sitesearch.rate_limits.openai');
    }

    public function acquire(): bool
    {
        // One counter per minute window; the key expires with the window
        $key = 'openai_requests:' . now()->format('Y-m-d-H-i');
        $count = $this->redis->incr($key);

        if ($count === 1) {
            $this->redis->expire($key, 60);
        }

        return $count <= $this->maxRequestsPerMinute;
    }
}
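The enrichment pipeline can then check the limiter before each OpenAI call. A sketch of how a queued enrichment job might use it; the job class itself is assumed, and release() comes from Laravel's InteractsWithQueue trait:

// Inside a queued enrichment job
public function handle(OpenAIRateLimiter $limiter, OpenAIService $openai)
{
    if (! $limiter->acquire()) {
        // Over budget for this minute: put the job back on the queue
        $this->release(30); // retry in 30 seconds
        return;
    }

    $keywords = $openai->generateKeywords($this->page->text);
    // ... store keywords, generate the summary, reindex
}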
Error Handling and Retry Strategies
The retry system implements exponential backoff to handle:
- Temporary network issues
- API rate limits
- Service unavailability
- Transient errors
This makes the system more resilient:
trait RetryableOperation
{
    protected function withRetry(callable $operation, $maxAttempts = 3)
    {
        $attempt = 1;

        while ($attempt <= $maxAttempts) {
            try {
                return $operation();
            } catch (\Exception $e) {
                Log::warning("Attempt {$attempt} failed", [
                    'error' => $e->getMessage(),
                ]);

                if ($attempt === $maxAttempts) {
                    throw $e;
                }

                sleep(pow(2, $attempt - 1)); // Exponential backoff: 1s, 2s, 4s...
                $attempt++;
            }
        }
    }
}
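Using the trait is then a one-liner around any flaky call, for example wrapping the Elasticsearch indexing from earlier (illustrative wiring, not necessarily how the production code is composed):

class ElasticsearchService
{
    use RetryableOperation;

    public function indexPageWithRetry($page)
    {
        // Transient failures are retried up to 3 times with backoff
        return $this->withRetry(fn () => $this->indexPage($page));
    }
}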
Performance Optimization
Elasticsearch Query Optimization
Query optimization focuses on:
- Caching frequently used results
- Removing duplicate content
- Handling deep pagination efficiently
- Maintaining fast response times
protected function optimizeQuery($params)
{
    // Add query cache
    $params['request_cache'] = true;

    // Add field collapsing for duplicate content
    $params['body']['collapse'] = [
        'field' => 'content_hash',
    ];

    // Add pagination optimization ('from' lives inside the request body)
    if (($params['body']['from'] ?? 0) > 1000) {
        // Switch to search_after for deep pagination
        // (requires a deterministic sort on the query)
        $params['body']['search_after'] = $this->getSearchAfterParams();
        unset($params['body']['from']);
    }

    return $params;
}
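getSearchAfterParams isn't shown in the post. The idea behind search_after is to carry the sort values of the previous page's last hit as a cursor. A sketch of the pattern, assuming a deterministic sort with the page uuid as tie-breaker:

// The query needs a deterministic sort for search_after to work
$params['body']['sort'] = [
    ['_score' => 'desc'],
    ['uuid' => 'asc'], // tie-breaker keeps the order stable
];

$response = $this->client->search($params);
$hits = $response['hits']['hits'];

if (! empty($hits)) {
    // The last hit's sort values become the cursor for the next page
    $nextCursor = end($hits)['sort'];
    // Next request: $params['body']['search_after'] = $nextCursor;
}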
Caching Strategy
The caching system is intelligent:
- Popular queries are cached longer (via the shouldCache helper below)
- Invalidation is automatic: entries expire on a TTL, and a site's entries can be flushed on reindex (see the sketch after the code)
- Cached queries skip Elasticsearch entirely, keeping memory usage bounded and response times low
class SearchService
{
    public function search($query, $site)
    {
        // Hash the query so the key is safe for any cache store
        $cacheKey = "search:{$site->id}:" . md5($query);

        // Popular queries get a longer TTL than one-off searches
        $ttl = $this->shouldCache($query)
            ? now()->addMinutes(60)
            : now()->addMinutes(5);

        return Cache::remember($cacheKey, $ttl, function () use ($query, $site) {
            return $this->performSearch($query, $site);
        });
    }

    protected function shouldCache($query): bool
    {
        // Cache popular searches more aggressively
        $popularity = SearchMetric::getPopularity($query);
        return $popularity > config('sitesearch.cache_threshold');
    }
}
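The invalidation side can piggyback on the indexing pipeline: whenever a page is (re)indexed, flush that site's cached results. A minimal sketch, assuming a cache store that supports tags (Redis or Memcached):

// In SearchService::search, tag each entry with its site:
return Cache::tags(["site-search:{$site->id}"])
    ->remember($cacheKey, $ttl, fn () => $this->performSearch($query, $site));

// In ElasticsearchService::indexPage, after a successful index call:
Cache::tags(["site-search:{$page->site->id}"])->flush();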
Monitoring and Analytics
Search Analytics Dashboard
Analytics tracking helps:
- Understand user behavior
- Optimize search relevance
- Monitor system performance
- Guide feature development
use Illuminate\Support\Facades\DB;

class SearchMetric extends Model
{
    protected $fillable = [
        'site_id',
        'query',
        'results_count',
        'response_time',
        'user_clicked',
        'position_clicked',
    ];

    public static function recordSearch($query, $results, $duration)
    {
        return self::create([
            'site_id' => $results['site_id'],
            'query' => $query,
            'results_count' => $results['total'],
            'response_time' => $duration,
            // created_at is set automatically by Eloquent
        ]);
    }

    public static function getPopularity($query)
    {
        // Recency-based popularity, as used by SearchService::shouldCache
        // (assumed implementation: searches in the last 7 days)
        return self::where('query', $query)
            ->where('created_at', '>=', now()->subDays(7))
            ->count();
    }

    public static function getPopularQueries($siteId, $limit = 10)
    {
        return self::where('site_id', $siteId)
            ->select('query', DB::raw('count(*) as count'))
            ->groupBy('query')
            ->orderByDesc('count')
            ->limit($limit)
            ->get();
    }
}
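The user_clicked and position_clicked columns imply click tracking, which the post doesn't show. A minimal sketch of a click endpoint that updates a metric row; the route, payload, and the idea of returning a metric_id with each search response are my assumptions:

// routes/api.php
Route::post('/search/click', function (Request $request) {
    // The search response is assumed to include the metric_id it created
    SearchMetric::whereKey($request->input('metric_id'))->update([
        'user_clicked' => true,
        'position_clicked' => $request->input('position'),
    ]);

    return response()->noContent();
});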
Performance Monitoring
Health checks ensure:
- System availability
- Cluster health
- Resource utilization
- Early problem detection
class ElasticsearchHealthCheck
{
    public function check()
    {
        try {
            $client = app(ElasticsearchService::class)->getClient();
            $health = $client->cluster()->health();

            return [
                'healthy' => in_array($health['status'], ['green', 'yellow']),
                'status' => $health['status'],
                'nodes' => $health['number_of_nodes'],
                'active_shards' => $health['active_shards'],
            ];
        } catch (\Exception $e) {
            return [
                'healthy' => false,
                'error' => $e->getMessage(),
            ];
        }
    }
}
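To get the early-warning part, the check can run on the scheduler. A sketch that logs critically on failure; wiring an actual notification to an on-call channel is left out:

// app/Console/Kernel.php
protected function schedule(Schedule $schedule): void
{
    $schedule->call(function () {
        $status = (new ElasticsearchHealthCheck())->check();

        if (! $status['healthy']) {
            Log::critical('Elasticsearch health check failed', $status);
        }
    })->everyFiveMinutes();
}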
Scaling Considerations
Horizontal Scaling
- Use Elasticsearch's built-in clustering
- Implement read replicas for search queries
- Scale Laravel queue workers for background processing
Content Processing
- Implement batch processing for large sites (see the sketch after this list)
- Use queued jobs for AI enrichment
- Implement progressive loading for large result sets
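For the batch-processing point above, Laravel's job batching pairs naturally with the IndexPageJob used by the observer. A sketch, assuming the job_batches table has been migrated and IndexPageJob uses the Batchable trait:

use Illuminate\Support\Facades\Bus;

// Reindex a large site in chunks of 500 pages per batch
$site->pages()->chunkById(500, function ($pages) use ($site) {
    Bus::batch(
        $pages->map(fn ($page) => new IndexPageJob($page))->all()
    )->name("reindex-site-{$site->id}")->dispatch();
});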
Cost Management
- Cache expensive OpenAI calls
- Implement tiered pricing based on usage
- Optimize index storage and replication
Troubleshooting Guide
Common issues and solutions:
Indexing Issues
- Check Elasticsearch cluster health
- Verify index mappings (both checks are sketched after this list)
- Monitor bulk indexing jobs
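The first two checks are quick to run from artisan tinker. A sketch against the Elasticsearch PHP client; the getClient accessor is assumed, as in the health check earlier:

$client = app(ElasticsearchService::class)->getClient();

// Cluster health: anything but 'red' can still serve searches
dump($client->cluster()->health()['status']);

// Index mappings: confirm the ai_* fields exist with the expected types
dump($client->indices()->getMapping(['index' => $site->site_index_name]));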
Search Relevance
- Adjust field weights
- Review AI enrichment quality
- Analyze search logs
Performance Issues
- Monitor query response times
- Check cache hit rates
- Review resource utilization
Conclusion
Building a modern search engine is complex, but the combination of Laravel's elegant architecture, Elasticsearch's powerful search capabilities, and OpenAI's intelligence creates a robust and scalable solution. The key is finding the right balance between features, performance, and maintainability.