Full-Text Search
Difficulty: Advanced
Generated on: 2025-07-10 02:36:08
Category: SQL Cheatsheet for Database Development
SQL Full-Text Search: Advanced Cheatsheet
Section titled “SQL Full-Text Search: Advanced Cheatsheet”1. Quick Overview
Section titled “1. Quick Overview”Full-Text Search (FTS) is a technique for quickly and efficiently searching large amounts of text data. Instead of relying on LIKE clauses (which can be slow), FTS uses specialized indexes to enable fast keyword-based searches. It’s ideal for:
- Content Management Systems (CMS): Searching articles, blog posts, and other content.
- E-commerce: Finding products based on descriptions, titles, and categories.
- Knowledge Bases: Locating relevant documentation based on keywords.
- Log Analysis: Searching log files for specific events or errors.
When to use: When you need to search large text columns for specific words or phrases, and performance is critical. Avoid using LIKE for these scenarios.
2. Syntax
Section titled “2. Syntax”The exact syntax varies depending on the database system. Here’s a breakdown for major platforms:
MySQL:
-- Creating a FULLTEXT indexALTER TABLE articles ADD FULLTEXT INDEX article_content_idx (title, content);
-- Searching using MATCH ... AGAINSTSELECT id, title, content, MATCH(title, content) AGAINST ('search term' IN NATURAL LANGUAGE MODE) AS relevanceFROM articlesWHERE MATCH(title, content) AGAINST ('search term' IN NATURAL LANGUAGE MODE)ORDER BY relevance DESC;
-- Boolean Mode (more control)SELECT id, title, contentFROM articlesWHERE MATCH(title, content) AGAINST ('+important -unwanted' IN BOOLEAN MODE);
-- Query Expansion (WITH QUERY EXPANSION)SELECT id, title, contentFROM articlesWHERE MATCH(title, content) AGAINST ('search term' WITH QUERY EXPANSION);PostgreSQL:
-- Creating a tsvector column and GIN index (recommended for speed)ALTER TABLE articles ADD COLUMN search_vector tsvector;UPDATE articles SET search_vector = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));CREATE INDEX article_search_idx ON articles USING GIN (search_vector);
-- Creating a GIST index (alternative to GIN, generally slower for reads, faster for writes)-- CREATE INDEX article_search_idx ON articles USING GIST (search_vector);
-- Searching using to_tsquery and @@ operatorSELECT id, title, content, ts_rank_cd(search_vector, query) AS relevanceFROM articles, to_tsquery('english', 'search & term') AS queryWHERE search_vector @@ queryORDER BY relevance DESC;
-- Updating the tsvector column (important to keep it synchronized)CREATE TRIGGER articles_search_vector_updateBEFORE INSERT OR UPDATEON articlesFOR EACH ROWEXECUTE PROCEDURE tsvector_update_trigger('search_vector', 'pg_catalog.english', 'title', 'content');
-- Creating a function if tsvector_update_trigger doesn't existCREATE OR REPLACE FUNCTION public.tsvector_update_trigger() RETURNS trigger LANGUAGE plpgsqlAS $function$BEGIN NEW.search_vector := to_tsvector(COALESCE(NEW.title,'') || ' ' || COALESCE(NEW.content,'')); RETURN NEW;END$function$;SQL Server:
-- Creating a FULLTEXT catalogCREATE FULLTEXT CATALOG ArticleCatalog;
-- Creating a FULLTEXT indexCREATE FULLTEXT INDEX ON articles(title, content)KEY INDEX PK_articles -- Replace with your primary key index nameON ArticleCatalog;
-- Searching using CONTAINSSELECT id, title, contentFROM articlesWHERE CONTAINS((title, content), 'search term');
-- Searching using FREETEXTSELECT id, title, contentFROM articlesWHERE FREETEXT((title, content), 'search term');
-- Using CONTAINSTABLE for rankingSELECT a.id, a.title, a.content, k.RANKFROM articles AS aINNER JOIN CONTAINSTABLE(articles, (title, content), 'search term') AS kON a.id = k.[KEY]ORDER BY k.RANK DESC;Oracle:
-- Creating a CTXSYS user and granting necessary privileges-- (This is a one-time setup for the database)-- CONNECT SYS AS SYSDBA-- CREATE USER ctxsys IDENTIFIED BY password;-- GRANT CONNECT, RESOURCE, CTXAPP to ctxsys;
-- Creating a preference for your language (important for stemming)BEGIN CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER'); CTX_DDL.SET_ATTRIBUTE('my_lexer', 'BASE_LETTER', 'YES'); -- Remove accents CTX_DDL.SET_ATTRIBUTE('my_lexer', 'PRINTJOINS', 'YES'); -- Treat hyphenated words as one CTX_DDL.SET_ATTRIBUTE('my_lexer', 'INDEX_STEMS', 'ENGLISH'); --StemmingEND;/
-- Creating a FULLTEXT indexCREATE INDEX article_content_idx ON articles(title, content)INDEXTYPE IS CTXSYS.CONTEXTPARAMETERS ('lexer my_lexer'); -- Use the created preference here
-- Searching using CONTAINSSELECT id, title, content, SCORE(1) as relevanceFROM articlesWHERE CONTAINS(title, content, 'search term', 1) > 0ORDER BY relevance DESC;
-- Creating a Section Group (for structured documents)BEGIN CTX_DDL.CREATE_SECTION_GROUP ('my_section_group', 'NULL_SECTION_GROUP'); CTX_DDL.ADD_FIELD_SECTION ('my_section_group', 'title', 'TITLE'); CTX_DDL.ADD_FIELD_SECTION ('my_section_group', 'content', 'CONTENT');END;/
-- Re-create index with the section groupCREATE INDEX article_content_idx ON articles(title, content)INDEXTYPE IS CTXSYS.CONTEXTPARAMETERS ('lexer my_lexer section group my_section_group');
-- Searching with within section (example)SELECT id, title, content, SCORE(1) as relevanceFROM articlesWHERE CONTAINS(title, content, 'search term WITHIN title', 1) > 0ORDER BY relevance DESC;3. Common Use Cases
Section titled “3. Common Use Cases”-
Searching Product Catalogs: Allows users to quickly find products based on keywords in descriptions, names, and attributes.
- Example: Searching for “red cotton t-shirt” in an e-commerce database.
-
Finding Relevant Articles in a Knowledge Base: Helps users quickly locate documentation or articles that address their specific needs.
- Example: Searching for “troubleshooting network connectivity” in a support article database.
-
Analyzing Log Files: Enables developers to quickly identify specific events or errors by searching log files for keywords or patterns.
- Example: Searching for “exception” or “error” in application logs.
-
Searching Job Postings: Allows job seekers to find relevant positions based on skills, experience, and location.
- Example: Searching for “Java developer” or “data scientist” in a job board database.
-
Content Moderation: Identifying potentially offensive or inappropriate content by searching for specific keywords or phrases.
- Example: Flagging comments or posts containing profanity or hate speech.
4. Best Practices
Section titled “4. Best Practices”-
Choose the Right Index Type: GIN (PostgreSQL) and FULLTEXT (MySQL, SQL Server) are generally preferred for read-heavy workloads. GIST (PostgreSQL) can be faster for writes but slower for reads. Oracle’s CTXSYS.CONTEXT is specifically designed for full-text indexing.
-
Use Stopwords: Configure your full-text index to ignore common words like “the,” “a,” “is,” etc. This reduces index size and improves search accuracy.
-
Stemming and Lemmatization: Use stemming (reducing words to their root form) and lemmatization (converting words to their dictionary form) to improve search relevance. MySQL’s
ft_min_word_lenvariable and Oracle’sINDEX_STEMSattribute control stemming behavior. PostgreSQL uses dictionaries for stemming and lemmatization. -
Regularly Update Indexes: When data changes frequently, update the full-text index to ensure search results are accurate. Use triggers (PostgreSQL) or scheduled tasks to automate this process.
-
Optimize Query Performance:
- Use appropriate search modes:
NATURAL LANGUAGE MODE(MySQL) is suitable for general searches, whileBOOLEAN MODEprovides more control.CONTAINSandFREETEXT(SQL Server) offer different search semantics. - Limit the number of columns in the index: Indexing only the necessary columns improves performance.
- Use ranking functions:
MATCH ... AGAINST(MySQL),ts_rank_cd(PostgreSQL), andCONTAINSTABLE(SQL Server) provide ranking scores to prioritize relevant results.SCORE()in Oracle. - Partition large tables: Partitioning can improve query performance by allowing the database to search only relevant partitions.
- Use appropriate search modes:
-
Security Considerations:
- Sanitize User Input: Always sanitize user input to prevent SQL injection attacks.
- Limit Access: Restrict access to the full-text index to authorized users.
- Beware of Query Expansion:
WITH QUERY EXPANSION(MySQL) can be useful, but be cautious as it can broaden the search significantly and potentially return unexpected results. Review the expanded query carefully.
5. Examples
Section titled “5. Examples”Sample Data (articles table):
| id | title | content |
|---|---|---|
| 1 | Introduction to SQL | SQL is a powerful language for managing and querying databases. It is widely used in various applications. |
| 2 | Advanced SQL Techniques | This article explores advanced SQL techniques such as window functions, common table expressions (CTEs), and stored procedures. |
| 3 | Optimizing SQL Queries | Learn how to optimize SQL queries for performance. Techniques include indexing, query rewriting, and using appropriate data types. |
| 4 | SQL Server Full-Text Search | This document details the functionality of Full-Text Search in SQL Server. |
| 5 | PostgreSQL Full-Text Search Capabilities | PostgreSQL offers robust Full-Text Search capabilities, including stemming, stop words, and ranking. |
MySQL Examples:
-- Find articles containing "SQL" and order by relevanceSELECT id, title, content, MATCH(title, content) AGAINST ('SQL' IN NATURAL LANGUAGE MODE) AS relevanceFROM articlesWHERE MATCH(title, content) AGAINST ('SQL' IN NATURAL LANGUAGE MODE)ORDER BY relevance DESC;
-- Result: Articles 1, 2, 3, 4, 5 (ordered by relevance)
-- Find articles that MUST contain "SQL" and MUST NOT contain "Server"SELECT id, title, contentFROM articlesWHERE MATCH(title, content) AGAINST ('+SQL -Server' IN BOOLEAN MODE);
-- Result: Articles 1, 2, 3, 5PostgreSQL Examples:
-- Find articles containing "SQL" and "performance" and order by relevanceSELECT id, title, content, ts_rank_cd(search_vector, query) AS relevanceFROM articles, to_tsquery('english', 'SQL & performance') AS queryWHERE search_vector @@ queryORDER BY relevance DESC;
-- Result: Article 3 (highest relevance), then Article 2
-- Search for articles containing a phraseSELECT id, title, content, ts_rank_cd(search_vector, query) AS relevanceFROM articles, to_tsquery('english', '''Full-Text Search''') AS query -- Note the triple quotesWHERE search_vector @@ queryORDER BY relevance DESC;
-- Result: Articles 4, 5SQL Server Examples:
-- Find articles containing "SQL"SELECT id, title, contentFROM articlesWHERE CONTAINS((title, content), 'SQL');
-- Result: Articles 1, 2, 3, 4, 5
-- Find articles containing "SQL" and rank themSELECT a.id, a.title, a.content, k.RANKFROM articles AS aINNER JOIN CONTAINSTABLE(articles, (title, content), 'SQL') AS kON a.id = k.[KEY]ORDER BY k.RANK DESC;Oracle Examples:
-- Find articles containing "SQL"SELECT id, title, content, SCORE(1) AS relevanceFROM articlesWHERE CONTAINS(title, content, 'SQL', 1) > 0ORDER BY relevance DESC;
-- Result: Articles 1, 2, 3, 4, 5 (ordered by relevance)
-- Find articles containing "Full-Text Search" in the titleSELECT id, title, content, SCORE(1) AS relevanceFROM articlesWHERE CONTAINS(title, content, 'Full-Text Search WITHIN title', 1) > 0ORDER BY relevance DESC;
-- Result: Articles 4, 56. Common Pitfalls
Section titled “6. Common Pitfalls”- Forgetting to Update Indexes: Failing to update the full-text index after data changes leads to inaccurate search results.
- Ignoring Stopwords: Not using stopwords can significantly degrade search performance and relevance.
- Incorrect Language Configuration: Using the wrong language setting for stemming and lemmatization can lead to incorrect results.
- Not Sanitizing User Input: Failing to sanitize user input can expose your database to SQL injection attacks.
- Over-Indexing: Indexing too many columns can negatively impact write performance. Index only the columns that are frequently searched.
- Performance Issues with Large Datasets: Full-text search on very large datasets can be slow if not properly optimized. Consider partitioning, using appropriate index types, and tuning query parameters.
- Incorrectly interpreting relevance scores: The relevance score is database-specific. Don’t directly compare scores across different database platforms.
7. Database Variations
Section titled “7. Database Variations”-
MySQL:
- Uses
MATCH ... AGAINSTsyntax. - Supports
NATURAL LANGUAGE MODE,BOOLEAN MODE, andWITH QUERY EXPANSION. - Configured using variables like
ft_min_word_lenandft_stopword_file. - FULLTEXT indexes can only be created on
MyISAMandInnoDBtables.
- Uses
-
PostgreSQL:
- Uses
tsvectordata type andto_tsqueryfunction. - Requires creating a
GINorGISTindex on thetsvectorcolumn. - Uses dictionaries for stemming and lemmatization.
- Requires triggers to automatically update the
tsvectorcolumn.
- Uses
-
SQL Server:
- Uses
CONTAINSandFREETEXTpredicates. - Requires creating a
FULLTEXT CATALOGandFULLTEXT INDEX. - Uses
CONTAINSTABLEto retrieve relevance rankings.
- Uses
-
Oracle:
- Uses the
CTXSYSschema andCONTEXTindex type. - Requires creating a preference for the lexer and stemmer.
- Uses the
CONTAINSoperator with theSCOREfunction for ranking. - Utilizes Section Groups for searching specific parts of a document.
- Uses the