Building a Hybrid Search System with Laravel, OpenAI, and PostgreSQL

February 28, 2025

development

Ever searched your own site and gotten a bunch of irrelevant results? Traditional search sucks at understanding what users actually want. Let's fix that by building something better: a hybrid search system that combines old-school keyword search with AI-powered semantic understanding.

Note: This article focuses on implementing hybrid search with PostgreSQL. If you're interested in an Elasticsearch-based approach, check out my guide on building a modern site search engine with Laravel, Elasticsearch, and AI.

I'll show you how to build a killer search system using Laravel, OpenAI, and PostgreSQL with pgvector. By the end of this guide, you'll have a search engine that actually understands your users' intent, not just their exact words.

What is Hybrid Search?

Hybrid search combines two powerful search methodologies:

  1. Full-text search: Traditional keyword-based search that excels at finding exact matches and variations of words.
  2. Vector search: AI-powered semantic search that understands the meaning and context of content.

By combining these approaches, hybrid search delivers results that are both precise (matching specific keywords) and semantically relevant (understanding the intent behind the search).

Comparing Search Approaches

| Feature | Traditional Full-Text Search | Vector Search | Hybrid Search |
|---|---|---|---|
| Keyword Matching | ✅ Excellent | ⚠️ Limited | ✅ Excellent |
| Semantic Understanding | ❌ Poor | ✅ Excellent | ✅ Excellent |
| Handling Synonyms | ⚠️ Requires manual configuration | ✅ Built-in | ✅ Built-in |
| Context Awareness | ❌ None | ✅ Strong | ✅ Strong |
| Implementation Complexity | ⭐ Low | ⭐⭐⭐ High | ⭐⭐⭐⭐ Very High |
| Query Speed | ⚡ Fast | ⚡⚡ Moderate | ⚡⚡ Moderate |
| Resource Requirements | 💻 Low | 💻💻💻 High | 💻💻💻 High |
| Relevance for Natural Language | ⚠️ Limited | ✅ Excellent | ✅ Excellent |

Why Hybrid Search Matters

Consider a user searching for "how to secure my website." A traditional keyword search might miss relevant content about "web application security best practices" if those exact words aren't present. Vector search would understand the semantic relationship but might miss exact keyword matches that are highly relevant.

Hybrid search gives you the best of both worlds: the keyword precision of full-text search and the semantic understanding of vector search.

Real-World Example: Hybrid Search in Action

Let's look at a concrete example of how hybrid search outperforms traditional approaches:

User Query: "protecting my online store from hackers"

Traditional Search Results (keyword-based):

  1. "10 Ways to Protect Your E-commerce Store from Hackers" (100% match - contains keywords)
  2. "Hacker Protection for Online Businesses" (80% match - contains keywords)
  3. "Online Store Security Guide" (60% match - contains keywords)

Vector Search Results (semantic-based):

  1. "Cybersecurity Best Practices for E-commerce" (90% match - semantically similar)
  2. "Implementing SSL Certificates on Your Website" (85% match - semantically similar)
  3. "Data Breach Prevention Strategies" (80% match - semantically similar)

Hybrid Search Results (combined approach):

  1. "10 Ways to Protect Your E-commerce Store from Hackers" (95% match - keywords + semantics)
  2. "Cybersecurity Best Practices for E-commerce" (90% match - semantics)
  3. "Implementing SSL Certificates for Online Store Security" (85% match - partial keywords + semantics)
  4. "Data Breach Prevention for Digital Retailers" (80% match - semantics)

Notice how the hybrid approach surfaces both exact keyword matches and semantically relevant content that traditional search might miss. This provides a more comprehensive set of results that better addresses the user's intent.

Implementation Comparison: For a real-world implementation using Elasticsearch instead of PostgreSQL, check out my SiteSearch project overview which demonstrates similar concepts with different technology.

The Technology Stack

Our hybrid search implementation uses:

  • Laravel as the application framework, handling crawling, indexing, and the search API
  • OpenAI for vector embeddings (text-embedding-ada-002) and AI-generated metadata
  • PostgreSQL with the pgvector extension for both vector similarity and built-in full-text search

Alternative Stack: If you prefer using Elasticsearch for your search infrastructure, check out my detailed guide on building a site search engine with Laravel and Elasticsearch.

Setting Up the Foundation

1. Installing Required Extensions

First, we need to enable the vector extension in PostgreSQL:

CREATE EXTENSION IF NOT EXISTS vector;

In Laravel, we'll add this to a migration:

DB::statement('CREATE EXTENSION IF NOT EXISTS vector');
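For context, here's a minimal sketch of the full migration around that statement (this assumes the anonymous-class migration style of Laravel 9+):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Support\Facades\DB;

return new class extends Migration
{
    public function up(): void
    {
        // Enable pgvector (the extension must be installed on the server first)
        DB::statement('CREATE EXTENSION IF NOT EXISTS vector');
    }

    public function down(): void
    {
        DB::statement('DROP EXTENSION IF EXISTS vector');
    }
};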

2. Database Schema

We need to store both vector embeddings and searchable text. Here's how we set up our migration:

// Add vector column for embeddings
DB::statement('ALTER TABLE pages ADD COLUMN IF NOT EXISTS embedding vector(1536)');

// Add tsvector column for full-text search
DB::statement('ALTER TABLE pages ADD COLUMN IF NOT EXISTS searchable_text tsvector');

// Create indexes: IVFFLAT for vector similarity, GIN for full-text search
// (vector_cosine_ops matches the <=> cosine distance operator we query with)
DB::statement('CREATE INDEX IF NOT EXISTS pages_embedding_idx ON pages USING ivfflat (embedding vector_cosine_ops)');
DB::statement('CREATE INDEX IF NOT EXISTS pages_searchable_text_idx ON pages USING GIN (searchable_text)');

The embedding column stores 1536-dimensional vectors from OpenAI's embedding model, while searchable_text stores a preprocessed, weighted tsvector for full-text search.

Step-by-Step Implementation Guide

1. Project Setup

Start by creating a new Laravel project or using an existing one:

composer create-project laravel/laravel sitesearch
cd sitesearch

2. Install Required Packages

Add the necessary packages to your project:

composer require pgvector/pgvector
composer require laravel/prompts guzzlehttp/guzzle

3. Configure PostgreSQL

Make sure your .env file is configured for PostgreSQL:

DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=sitesearch
DB_USERNAME=postgres
DB_PASSWORD=your_password

4. Create Migrations

Create the necessary migrations for your database schema:

php artisan make:migration create_sites_table
php artisan make:migration create_pages_table
php artisan make:migration add_postgres_search_columns_to_pages

5. Create Models

Create the Site and Page models:

php artisan make:model Site
php artisan make:model Page
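
A minimal Page model might look like the following. The Vector cast comes from the pgvector-php package; the fillable fields are assumptions based on the schema used throughout this article:

use Illuminate\Database\Eloquent\Model;
use Pgvector\Laravel\Vector;

class Page extends Model
{
    protected $fillable = ['site_id', 'url', 'title', 'description', 'text', 'embedding'];

    // Cast the pgvector column so it hydrates as a Vector object
    protected $casts = [
        'embedding' => Vector::class,
    ];
}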

6. Create Services

Create the necessary service classes for your application:

mkdir -p app/Services
touch app/Services/OpenAIService.php
touch app/Services/PostgresSearchService.php
touch app/Services/CrawlerService.php

Implementing the AI Integration

At the heart of our hybrid search system is the AI integration that powers both the semantic understanding and content enrichment. We leverage OpenAI's powerful language models in two key ways: generating vector embeddings for semantic search and enhancing content with AI-generated metadata.

Vector embeddings are numerical representations of text that capture semantic meaning. When text is converted to embeddings, similar concepts end up closer together in vector space, even if they use different words. For example, "automobile" and "car" would have similar embeddings because they represent the same concept. This is what enables our search to understand meaning beyond simple keyword matching. Each piece of content in our database is represented by a high-dimensional vector (1536 dimensions with OpenAI's ada-002 model), allowing us to perform similarity searches based on the meaning of content rather than just matching words.
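
To make "closer together in vector space" concrete, here's a small standalone sketch of cosine similarity, the measure underlying the <=> cosine distance operator used later (in production PostgreSQL computes this for us; distance is simply 1 minus similarity):

// Cosine similarity: 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}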

1. OpenAI Service

We created a service to interact with OpenAI's API:

use GuzzleHttp\Client;

class OpenAIService
{
    protected ?Client $client = null;

    // Lazily build a Guzzle client authenticated against the OpenAI API
    // (the config key is a convention; adjust to wherever you store the key)
    protected function getClient(): Client
    {
        return $this->client ??= new Client([
            'base_uri' => 'https://api.openai.com/v1/',
            'headers' => [
                'Authorization' => 'Bearer ' . config('services.openai.key'),
            ],
        ]);
    }

    // Generate embeddings for text
    public function createEmbedding(string $text): array
    {
        $response = $this->getClient()->post('embeddings', [
            'json' => [
                'model' => 'text-embedding-ada-002',
                'input' => $text
            ]
        ]);

        $data = json_decode($response->getBody(), true);
        return $data['data'][0]['embedding'];
    }

    // Additional methods for AI-enhanced content
    public function generateKeywords(string $content): array
    {
        // Implementation using OpenAI's chat completion
    }

    public function generateAISummary(string $content): string
    {
        // Implementation using OpenAI's chat completion
    }
}

This service handles:

  • Creating vector embeddings for search
  • Generating AI-enhanced metadata like keywords and summaries
  • Creating AI-generated titles and descriptions

2. AI-Enhanced Content

Our system doesn't just use AI for search. It also enriches each page with AI-generated metadata: summaries, keywords, and improved titles and descriptions.

This enriched content improves search quality and provides better snippets in search results.
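
As an illustration, generateKeywords might be implemented with a chat completion along these lines. This is a sketch, not the exact production code: the model choice, prompt, and response parsing are all assumptions:

public function generateKeywords(string $content): array
{
    $response = $this->getClient()->post('chat/completions', [
        'json' => [
            'model' => 'gpt-4o-mini', // assumption: any chat model works here
            'messages' => [
                ['role' => 'system', 'content' => 'Extract 5-10 search keywords from the text. Respond with a comma-separated list only.'],
                ['role' => 'user', 'content' => $content],
            ],
        ],
    ]);

    $data = json_decode($response->getBody(), true);
    $raw = $data['choices'][0]['message']['content'] ?? '';

    // Split the comma-separated response into a clean keyword array
    return array_values(array_filter(array_map('trim', explode(',', $raw))));
}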

The Hybrid Search Implementation

The core of our system is the PostgreSQL-based hybrid search service:

use Pgvector\Laravel\Vector;

class PostgresSearchService implements SearchInterface
{
    public function searchPages(string $query, int $siteId, int $offset = 0, int $perPage = 10): array
    {
        $site = Site::findOrFail($siteId);

        // Get embedding for the search query; the Vector wrapper
        // serializes it into a pgvector literal when bound
        $embedding = new Vector($this->getEmbedding($query));

        // Base query: only pages that have been fully indexed
        $baseQuery = Page::query()
            ->where('site_id', $site->id)
            ->whereNotNull('embedding')
            ->whereNotNull('searchable_text');

        $total = (clone $baseQuery)->count();

        // Combine vector similarity search with full-text search
        $results = $baseQuery
            ->orderByRaw('
                (
                    -- Normalize vector similarity score (convert cosine distance to similarity)
                    (1.0 - (embedding <=> ?) / 2.0) * 0.5 +
                    -- Use normalized text search rank with cover density
                    ts_rank_cd(searchable_text, plainto_tsquery(\'english\', ?), 32) * 0.5
                ) DESC
            ', [$embedding, $query])
            ->offset($offset)
            ->limit($perPage)
            ->get();

        // Return results with pagination info
        return [
            'results' => $results,
            'total' => $total,
            'per_page' => $perPage,
            'current_page' => intdiv($offset, $perPage) + 1
        ];
    }
}

Let's break down what's happening:

  1. We convert the search query into a vector embedding using OpenAI
  2. We perform a hybrid search that combines:
    • Vector similarity using the <=> operator (cosine distance)
    • Full-text search using ts_rank_cd with cover density
  3. We normalize and weight both scores (50% each) to get a combined relevance score
  4. Results are ordered by this combined score

This approach gives us the best of both worlds—semantic understanding and keyword precision.
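
The getEmbedding helper used in the service above isn't shown; a plausible sketch simply delegates to the OpenAI service, reusing the cached variant described later in this article:

// A minimal sketch: delegate query embedding to the OpenAI service,
// caching so repeated searches don't re-hit the API
protected function getEmbedding(string $query): array
{
    return app(OpenAIService::class)->createEmbeddingWithCache($query);
}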

Fine-Tuning the Hybrid Search Formula

The formula we use for hybrid search is:

(1.0 - (embedding <=> ?) / 2.0) * 0.5 + ts_rank_cd(searchable_text, plainto_tsquery('english', ?), 32) * 0.5

You can adjust the weights (currently 0.5 for each component) to favor either vector similarity or text search, depending on your specific needs.

For example, to favor vector search more heavily:

(1.0 - (embedding <=> ?) / 2.0) * 0.7 + ts_rank_cd(searchable_text, plainto_tsquery('english', ?), 32) * 0.3
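
If you find yourself tuning these often, it can help to pull the weights into configuration instead of hard-coding them. A minimal sketch, assuming a hypothetical config/search.php file (keys and env names are illustrative):

// config/search.php
return [
    'vector_weight' => env('SEARCH_VECTOR_WEIGHT', 0.5),
    'text_weight' => env('SEARCH_TEXT_WEIGHT', 0.5),
];

// In PostgresSearchService: cast to float before interpolating into SQL
$vectorWeight = (float) config('search.vector_weight');
$textWeight = (float) config('search.text_weight');

$results = $baseQuery->orderByRaw("
    (
        (1.0 - (embedding <=> ?) / 2.0) * {$vectorWeight} +
        ts_rank_cd(searchable_text, plainto_tsquery('english', ?), 32) * {$textWeight}
    ) DESC
", [$embedding, $query]);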

The AI-Enriched Web Crawler

To populate our search index, we built an AI-enriched web crawler that:

  1. Discovers content: Either through sitemaps or direct URL crawling
  2. Extracts content: Parses HTML to extract meaningful content
  3. Enriches with AI: Generates embeddings, summaries, and keywords
  4. Indexes for search: Stores both vector embeddings and searchable text

The crawler is implemented as a Laravel service:

class CrawlerService
{
    public function crawlSite(Site $site)
    {
        $site->update(['crawl_status' => CrawlStatus::IN_PROGRESS]);

        try {
            if ($site->sitemap_url) {
                $this->crawlBySitemap($site, $site->sitemap_url);
            } else {
                $this->crawlByUrl($site);
            }

            $site->update([
                'crawl_status' => CrawlStatus::COMPLETED,
                'last_crawled_at' => Carbon::now()
            ]);

        } catch (\Exception $e) {
            $site->update(['crawl_status' => CrawlStatus::FAILED]);
            throw $e;
        }
    }

    // Implementation details for sitemap crawling
}

For each page, we:

  1. Extract content, title, description, and metadata
  2. Generate AI embeddings for vector search
  3. Create optimized text vectors for full-text search
  4. Generate AI-enhanced metadata (summaries, keywords)
  5. Store everything in the database

Indexing Content for Hybrid Search

The indexing process combines both vector embeddings and full-text search preparation:

public function indexPage(Page $page): void
{
    $openai = app(OpenAIService::class);

    // Create embedding for the page content
    $content = $page->title . ' ' . $page->description . ' ' . $page->text;
    $embedding = $openai->createEmbedding($content);

    // Update the page with embedding and searchable text
    $page->update([
        'embedding' => new Vector($embedding),
        'searchable_text' => DB::raw("
            setweight(to_tsvector('english', " . DB::getPdo()->quote($page->title ?? '') . "), 'A') ||
            setweight(to_tsvector('english', " . DB::getPdo()->quote($page->description ?? '') . "), 'B') ||
            setweight(to_tsvector('english', " . DB::getPdo()->quote($page->text ?? '') . "), 'C')
        ")
    ]);
}

Note how we:

  1. Create a vector embedding using OpenAI
  2. Create a weighted tsvector for PostgreSQL full-text search
  3. Assign different weights to title (A), description (B), and content (C)

This weighting ensures that matches in titles are considered more important than matches in the body text.

Implementation Challenges and Solutions

During our implementation, we encountered several challenges. Here's how we addressed them:

Implementation Comparison: For a comparison with Elasticsearch-based error handling and retry strategies, check out the error handling section in my Elasticsearch search guide.

1. Handling Large Content Volumes

Challenge: OpenAI's API has token limits for embedding generation.

Solution: We implemented content chunking to break down large pages into manageable segments, then combined the embeddings using a weighted average approach. For very large pages, we prioritized the most important content sections.

// Example of content chunking
public function createEmbeddingForLargeContent(string $content): array
{
    // Split content into chunks of approximately 4000 tokens
    $chunks = $this->splitContentIntoChunks($content, 4000);

    // Get embeddings for each chunk
    $embeddings = [];
    foreach ($chunks as $index => $chunk) {
        $embeddings[] = [
            'embedding' => $this->createEmbedding($chunk),
            'weight' => $this->getChunkWeight($index, count($chunks))
        ];
    }

    // Combine embeddings with weights
    return $this->combineEmbeddings($embeddings);
}
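
The combineEmbeddings helper referenced above could be a weighted average that is re-normalized to unit length, since cosine distance cares about direction rather than magnitude. A sketch, assuming each entry carries an 'embedding' array and a 'weight':

protected function combineEmbeddings(array $embeddings): array
{
    $dimensions = count($embeddings[0]['embedding']);
    $combined = array_fill(0, $dimensions, 0.0);
    $totalWeight = array_sum(array_column($embeddings, 'weight'));

    // Weighted average across all chunk embeddings
    foreach ($embeddings as $item) {
        foreach ($item['embedding'] as $i => $value) {
            $combined[$i] += $value * ($item['weight'] / $totalWeight);
        }
    }

    // Re-normalize to unit length so cosine distances stay comparable
    $norm = sqrt(array_sum(array_map(fn ($v) => $v ** 2, $combined)));

    return array_map(fn ($v) => $v / $norm, $combined);
}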

2. Rate Limiting and API Costs

Challenge: The OpenAI API has rate limits and can become expensive at high volumes.

Solution: We implemented a queuing system with retry logic and exponential backoff, plus caching of embeddings for identical content.

// Example of caching embeddings
public function createEmbeddingWithCache(string $text): array
{
    $cacheKey = 'embedding_' . md5($text);

    return Cache::remember($cacheKey, now()->addDays(30), function () use ($text) {
        return $this->createEmbedding($text);
    });
}
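
The retry logic with exponential backoff can lean on Laravel's built-in queue primitives. A minimal job sketch (the class name and delays are illustrative, and it assumes indexPage lives on the search service):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class GenerateEmbeddingJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Retry up to five times before failing for good
    public int $tries = 5;

    public function __construct(public Page $page) {}

    // Exponential backoff between retries, in seconds
    public function backoff(): array
    {
        return [10, 30, 60, 120];
    }

    public function handle(PostgresSearchService $search): void
    {
        $search->indexPage($this->page);
    }
}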

3. Balancing Vector and Text Search

Challenge: Finding the right balance between vector similarity and text relevance.

Solution: We experimented with different weights and implemented an A/B testing framework to optimize the formula based on user interactions.

4. Handling Multiple Languages

Challenge: PostgreSQL's full-text search is language-specific.

Solution: We implemented language detection and used appropriate text search configurations for different languages.

// Example of language-specific text search
// Returns a SQL expression for use with DB::raw; $language must come from a
// whitelist of PostgreSQL text search configurations, never from raw user input
public function createSearchableText(string $text, string $language = 'english'): string
{
    return "to_tsvector('{$language}', " . DB::getPdo()->quote($text) . ")";
}
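
Upstream of that, detection might map ISO language codes onto a whitelist of PostgreSQL configurations. A sketch, assuming a hypothetical languageDetector dependency (swap in whichever detection library you prefer):

public function detectSearchConfig(string $text): string
{
    // Hypothetical detector returning ISO codes like 'en', 'fr', 'de'
    $code = $this->languageDetector->detect($text);

    // Whitelist mapping keeps the configuration name safe to interpolate
    return match ($code) {
        'fr' => 'french',
        'de' => 'german',
        'es' => 'spanish',
        default => 'english',
    };
}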

Performance Considerations

Hybrid search can be computationally intensive. Here are some optimizations we implemented:

  1. Efficient indexes: Using GIN indexes for text search and IVFFLAT indexes for vector search (see the tuning sketch after this list)
  2. Query caching: Caching common search queries and their embeddings
  3. Batch processing: Processing crawl operations in chunks with appropriate delays
  4. Asynchronous indexing: Using Laravel's queue system for background processing
  5. Selective embedding: Only creating embeddings for meaningful content
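
On the vector side, IVFFLAT performance depends heavily on its parameters. A sketch of tuning them, with starting-point values rather than recommendations: the lists option belongs on the index creation from the schema migration earlier, and probes is a per-session setting:

// More lists = faster queries but slower builds; a common heuristic is rows / 1000
DB::statement('CREATE INDEX IF NOT EXISTS pages_embedding_idx ON pages USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)');

// Probing more lists at query time trades speed for recall (the default is 1)
DB::statement('SET ivfflat.probes = 10');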

Deployment Considerations

When deploying a hybrid search system, consider:

  1. Database sizing: Vector operations require more memory and CPU
  2. API costs: OpenAI API usage costs can add up for large sites
  3. Crawl scheduling: Implement appropriate crawl frequencies and rate limiting
  4. Fallback mechanisms: Have a fallback to traditional search if vector search fails

Troubleshooting Common Issues

Here are some common issues you might encounter and how to solve them:

1. Vector Dimension Mismatch

Problem: PostgreSQL error about vector dimensions not matching.

Solution: Ensure all vectors have the same dimensions (1536 for OpenAI's ada-002 model). Implement validation:

if (count($embedding) !== 1536) {
    throw new \Exception("Invalid embedding dimensions: expected 1536, got " . count($embedding));
}

2. Slow Query Performance

Problem: Hybrid search queries taking too long.

Solution:

  • Ensure proper indexing
  • Limit the scope of searches (e.g., by site or category)
  • Consider using materialized views for common queries
  • Implement query timeouts and fallbacks

// Example of query with timeout
DB::statement('SET statement_timeout = 5000'); // 5 second timeout
try {
    // Run your hybrid search query
} catch (\Exception $e) {
    // Fall back to a simpler full-text-only query
} finally {
    DB::statement('SET statement_timeout = 0'); // Restore the default
}

3. Content Extraction Issues

Problem: Web crawler fails to extract meaningful content.

Solution: Implement more robust HTML parsing with fallbacks:

public function extractContent($html)
{
    try {
        // Primary extraction method
        $content = $this->primaryExtractor->extract($html);

        if (empty($content)) {
            // Fallback method
            $content = $this->fallbackExtractor->extract($html);
        }

        return $content;
    } catch (\Exception $e) {
        // Log error and use basic extraction
        return strip_tags($html);
    }
}

Future Improvements

We're continuously working to enhance our hybrid search system. Here are some improvements we're planning:

  1. Adaptive weighting: Dynamically adjust the weights between vector and text search based on query characteristics
  2. User feedback loop: Incorporate user interactions to improve search relevance over time
  3. Multi-modal search: Extend the system to handle image and video search
  4. Newer embedding models: Experiment with newer, more efficient embedding models as they become available
  5. Query understanding: Implement more sophisticated query parsing to better understand user intent
  6. Personalized search: Incorporate user preferences and history for more personalized results

Results and Benefits

After implementing our hybrid search system, we observed the improvements this article has been building toward: natural-language queries now surface semantically related content, while exact keyword matches continue to rank highly.

Conclusion

Hybrid search represents the future of content discovery, combining the precision of traditional search with the semantic understanding of AI. By implementing this system with Laravel, OpenAI, and PostgreSQL, we've created a powerful search solution that understands both what users are looking for and the context behind their queries.

The combination of full-text search and vector embeddings provides a robust foundation that can be extended with additional AI capabilities as needed. As language models continue to improve, the potential for enhancing search experiences will only grow.

Explore More: If you're interested in comparing different search implementations, see my article on building a site search engine with Laravel and Elasticsearch for an alternative approach to intelligent search.

If you're looking to implement a similar system or have questions about our approach, feel free to reach out. We welcome contributions and feedback.

Resources

  • pgvector: https://github.com/pgvector/pgvector
  • pgvector-php: https://github.com/pgvector/pgvector-php
  • OpenAI embeddings guide: https://platform.openai.com/docs/guides/embeddings
  • PostgreSQL full-text search: https://www.postgresql.org/docs/current/textsearch.html

Happy searching!