Full-Text Search

Difficulty: Advanced
Generated on: 2025-07-10 02:36:08
Category: SQL Cheatsheet for Database Development

SQL Full-Text Search: Advanced Cheatsheet

1. Quick Overview

Full-Text Search (FTS) is a technique for quickly and efficiently searching large amounts of text data. Instead of relying on LIKE clauses (which can be slow), FTS uses specialized indexes to enable fast keyword-based searches. It’s ideal for:

Content Management Systems (CMS): Searching articles, blog posts, and other content.
E-commerce: Finding products based on descriptions, titles, and categories.
Knowledge Bases: Locating relevant documentation based on keywords.
Log Analysis: Searching log files for specific events or errors.

When to use: When you need to search large text columns for specific words or phrases, and performance is critical. Avoid using LIKE for these scenarios.

2. Syntax

The exact syntax varies depending on the database system. Here’s a breakdown for major platforms:

MySQL:

-- Creating a FULLTEXT index
ALTER TABLE articles ADD FULLTEXT INDEX article_content_idx (title, content);

-- Searching using MATCH ... AGAINST
SELECT id, title, content,
       MATCH(title, content) AGAINST ('search term' IN NATURAL LANGUAGE MODE) AS relevance
FROM articles
WHERE MATCH(title, content) AGAINST ('search term' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC;

-- Boolean Mode (more control)
SELECT id, title, content
FROM articles
WHERE MATCH(title, content) AGAINST ('+important -unwanted' IN BOOLEAN MODE);

-- Query Expansion (WITH QUERY EXPANSION)
SELECT id, title, content
FROM articles
WHERE MATCH(title, content) AGAINST ('search term' WITH QUERY EXPANSION);

PostgreSQL:

-- Creating a tsvector column and GIN index (recommended for speed)
ALTER TABLE articles ADD COLUMN search_vector tsvector;
UPDATE articles SET search_vector = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));
CREATE INDEX article_search_idx ON articles USING GIN (search_vector);


-- Creating a GIST index (alternative to GIN, generally slower for reads, faster for writes)
-- CREATE INDEX article_search_idx ON articles USING GIST (search_vector);

-- Searching using to_tsquery and @@ operator
SELECT id, title, content, ts_rank_cd(search_vector, query) AS relevance
FROM articles, to_tsquery('english', 'search & term') AS query
WHERE search_vector @@ query
ORDER BY relevance DESC;

-- Updating the tsvector column (important to keep it synchronized)
CREATE TRIGGER articles_search_vector_update
BEFORE INSERT OR UPDATE
ON articles
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger('search_vector', 'pg_catalog.english', 'title', 'content');


-- Creating a function if tsvector_update_trigger doesn't exist
CREATE OR REPLACE FUNCTION public.tsvector_update_trigger()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
 NEW.search_vector := to_tsvector(COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,''));
 RETURN NEW;
END
$function$
;

SQL Server:

-- Creating a FULLTEXT catalog
CREATE FULLTEXT CATALOG ArticleCatalog;

-- Creating a FULLTEXT index
CREATE FULLTEXT INDEX ON articles(title, content)
KEY INDEX PK_articles -- Replace with your primary key index name
ON ArticleCatalog;

-- Searching using CONTAINS
SELECT id, title, content
FROM articles
WHERE CONTAINS((title, content), 'search term');

-- Searching using FREETEXT
SELECT id, title, content
FROM articles
WHERE FREETEXT((title, content), 'search term');

-- Using CONTAINSTABLE for ranking
SELECT a.id, a.title, a.content, k.RANK
FROM articles AS a
INNER JOIN CONTAINSTABLE(articles, (title, content), 'search term') AS k
ON a.id = k.[KEY]
ORDER BY k.RANK DESC;

Oracle:

-- Creating a CTXSYS user and granting necessary privileges
-- (This is a one-time setup for the database)
-- CONNECT SYS AS SYSDBA
-- CREATE USER ctxsys IDENTIFIED BY password;
-- GRANT CONNECT, RESOURCE, CTXAPP to ctxsys;

-- Creating a preference for your language (important for stemming)
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'BASE_LETTER', 'YES'); -- Remove accents
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'PRINTJOINS', 'YES'); -- Treat hyphenated words as one
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'INDEX_STEMS', 'ENGLISH'); --Stemming
END;
/

-- Creating a FULLTEXT index
CREATE INDEX article_content_idx ON articles(title, content)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('lexer my_lexer');  -- Use the created preference here

-- Searching using CONTAINS
SELECT id, title, content, SCORE(1) as relevance
FROM articles
WHERE CONTAINS(title, content, 'search term', 1) > 0
ORDER BY relevance DESC;

-- Creating a Section Group (for structured documents)
BEGIN
   CTX_DDL.CREATE_SECTION_GROUP ('my_section_group', 'NULL_SECTION_GROUP');
   CTX_DDL.ADD_FIELD_SECTION ('my_section_group', 'title', 'TITLE');
   CTX_DDL.ADD_FIELD_SECTION ('my_section_group', 'content', 'CONTENT');
END;
/

-- Re-create index with the section group
CREATE INDEX article_content_idx ON articles(title, content)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('lexer my_lexer section group my_section_group');

-- Searching with within section (example)
SELECT id, title, content, SCORE(1) as relevance
FROM articles
WHERE CONTAINS(title, content, 'search term WITHIN title', 1) > 0
ORDER BY relevance DESC;

3. Common Use Cases

Searching Product Catalogs: Allows users to quickly find products based on keywords in descriptions, names, and attributes.
- Example: Searching for “red cotton t-shirt” in an e-commerce database.
Finding Relevant Articles in a Knowledge Base: Helps users quickly locate documentation or articles that address their specific needs.
- Example: Searching for “troubleshooting network connectivity” in a support article database.
Analyzing Log Files: Enables developers to quickly identify specific events or errors by searching log files for keywords or patterns.
- Example: Searching for “exception” or “error” in application logs.
Searching Job Postings: Allows job seekers to find relevant positions based on skills, experience, and location.
- Example: Searching for “Java developer” or “data scientist” in a job board database.
Content Moderation: Identifying potentially offensive or inappropriate content by searching for specific keywords or phrases.
- Example: Flagging comments or posts containing profanity or hate speech.

4. Best Practices

Choose the Right Index Type: GIN (PostgreSQL) and FULLTEXT (MySQL, SQL Server) are generally preferred for read-heavy workloads. GIST (PostgreSQL) can be faster for writes but slower for reads. Oracle’s CTXSYS.CONTEXT is specifically designed for full-text indexing.
Use Stopwords: Configure your full-text index to ignore common words like “the,” “a,” “is,” etc. This reduces index size and improves search accuracy.
Stemming and Lemmatization: Use stemming (reducing words to their root form) and lemmatization (converting words to their dictionary form) to improve search relevance. MySQL’s ft_min_word_len variable and Oracle’s INDEX_STEMS attribute control stemming behavior. PostgreSQL uses dictionaries for stemming and lemmatization.
Regularly Update Indexes: When data changes frequently, update the full-text index to ensure search results are accurate. Use triggers (PostgreSQL) or scheduled tasks to automate this process.
Optimize Query Performance:
- Use appropriate search modes: NATURAL LANGUAGE MODE (MySQL) is suitable for general searches, while BOOLEAN MODE provides more control. CONTAINS and FREETEXT (SQL Server) offer different search semantics.
- Limit the number of columns in the index: Indexing only the necessary columns improves performance.
- Use ranking functions: MATCH ... AGAINST (MySQL), ts_rank_cd (PostgreSQL), and CONTAINSTABLE (SQL Server) provide ranking scores to prioritize relevant results. SCORE() in Oracle.
- Partition large tables: Partitioning can improve query performance by allowing the database to search only relevant partitions.
Security Considerations:
- Sanitize User Input: Always sanitize user input to prevent SQL injection attacks.
- Limit Access: Restrict access to the full-text index to authorized users.
- Beware of Query Expansion: WITH QUERY EXPANSION (MySQL) can be useful, but be cautious as it can broaden the search significantly and potentially return unexpected results. Review the expanded query carefully.

5. Examples

Sample Data (articles table):

id	title	content
1	Introduction to SQL	SQL is a powerful language for managing and querying databases. It is widely used in various applications.
2	Advanced SQL Techniques	This article explores advanced SQL techniques such as window functions, common table expressions (CTEs), and stored procedures.
3	Optimizing SQL Queries	Learn how to optimize SQL queries for performance. Techniques include indexing, query rewriting, and using appropriate data types.
4	SQL Server Full-Text Search	This document details the functionality of Full-Text Search in SQL Server.
5	PostgreSQL Full-Text Search Capabilities	PostgreSQL offers robust Full-Text Search capabilities, including stemming, stop words, and ranking.

MySQL Examples:

-- Find articles containing "SQL" and order by relevance
SELECT id, title, content, MATCH(title, content) AGAINST ('SQL' IN NATURAL LANGUAGE MODE) AS relevance
FROM articles
WHERE MATCH(title, content) AGAINST ('SQL' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC;

-- Result: Articles 1, 2, 3, 4, 5 (ordered by relevance)

-- Find articles that MUST contain "SQL" and MUST NOT contain "Server"
SELECT id, title, content
FROM articles
WHERE MATCH(title, content) AGAINST ('+SQL -Server' IN BOOLEAN MODE);

-- Result: Articles 1, 2, 3, 5

PostgreSQL Examples:

-- Find articles containing "SQL" and "performance" and order by relevance
SELECT id, title, content, ts_rank_cd(search_vector, query) AS relevance
FROM articles, to_tsquery('english', 'SQL & performance') AS query
WHERE search_vector @@ query
ORDER BY relevance DESC;

-- Result: Article 3 (highest relevance), then Article 2

-- Search for articles containing a phrase
SELECT id, title, content, ts_rank_cd(search_vector, query) AS relevance
FROM articles, to_tsquery('english', '''Full-Text Search''') AS query -- Note the triple quotes
WHERE search_vector @@ query
ORDER BY relevance DESC;

-- Result: Articles 4, 5

SQL Server Examples:

-- Find articles containing "SQL"
SELECT id, title, content
FROM articles
WHERE CONTAINS((title, content), 'SQL');

-- Result: Articles 1, 2, 3, 4, 5

-- Find articles containing "SQL" and rank them
SELECT a.id, a.title, a.content, k.RANK
FROM articles AS a
INNER JOIN CONTAINSTABLE(articles, (title, content), 'SQL') AS k
ON a.id = k.[KEY]
ORDER BY k.RANK DESC;

Oracle Examples:

-- Find articles containing "SQL"
SELECT id, title, content, SCORE(1) AS relevance
FROM articles
WHERE CONTAINS(title, content, 'SQL', 1) > 0
ORDER BY relevance DESC;

-- Result: Articles 1, 2, 3, 4, 5 (ordered by relevance)

-- Find articles containing "Full-Text Search" in the title
SELECT id, title, content, SCORE(1) AS relevance
FROM articles
WHERE CONTAINS(title, content, 'Full-Text Search WITHIN title', 1) > 0
ORDER BY relevance DESC;

-- Result: Articles 4, 5

6. Common Pitfalls

Forgetting to Update Indexes: Failing to update the full-text index after data changes leads to inaccurate search results.
Ignoring Stopwords: Not using stopwords can significantly degrade search performance and relevance.
Incorrect Language Configuration: Using the wrong language setting for stemming and lemmatization can lead to incorrect results.
Not Sanitizing User Input: Failing to sanitize user input can expose your database to SQL injection attacks.
Over-Indexing: Indexing too many columns can negatively impact write performance. Index only the columns that are frequently searched.
Performance Issues with Large Datasets: Full-text search on very large datasets can be slow if not properly optimized. Consider partitioning, using appropriate index types, and tuning query parameters.
Incorrectly interpreting relevance scores: The relevance score is database-specific. Don’t directly compare scores across different database platforms.

7. Database Variations

MySQL:
- Uses MATCH ... AGAINST syntax.
- Supports NATURAL LANGUAGE MODE, BOOLEAN MODE, and WITH QUERY EXPANSION.
- Configured using variables like ft_min_word_len and ft_stopword_file.
- FULLTEXT indexes can only be created on MyISAM and InnoDB tables.
PostgreSQL:
- Uses tsvector data type and to_tsquery function.
- Requires creating a GIN or GIST index on the tsvector column.
- Uses dictionaries for stemming and lemmatization.
- Requires triggers to automatically update the tsvector column.
SQL Server:
- Uses CONTAINS and FREETEXT predicates.
- Requires creating a FULLTEXT CATALOG and FULLTEXT INDEX.
- Uses CONTAINSTABLE to retrieve relevance rankings.
Oracle:
- Uses the CTXSYS schema and CONTEXT index type.
- Requires creating a preference for the lexer and stemmer.
- Uses the CONTAINS operator with the SCORE function for ranking.
- Utilizes Section Groups for searching specific parts of a document.