Search Query Generation

This script utilizes a Large Language Model (LLM) to generate a diverse list of search engine queries based on a user-provided topic. It aims to produce queries that facilitate comprehensive and relevant information retrieval.

Purpose

The primary goal of search-queries.py is to automate the creation of varied search queries for a given subject. This includes general overview queries, specific queries targeting authoritative sources, question-based queries, alternative phrasings, and queries using advanced search operators.

Usage

Run the script from the command line:

```
python search-queries.py <input_json> [output_json]
```
  • <input_json>: (Required) Path to the input JSON file containing the topic and configuration.
  • [output_json]: (Optional) Path to save the output JSON file. If not provided, a path is generated automatically based on the script name and a UUID.

The script uses the handle_command_args utility function to parse these arguments.
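As a rough illustration of what that parsing entails (the actual implementation lives in utils.py; this sketch assumes a simple positional-argument scheme):

```python
import sys

def handle_command_args(usage: str):
    # Require <input_json>; [output_json] is optional.
    if len(sys.argv) < 2:
        print(usage)
        sys.exit(1)
    input_path = sys.argv[1]
    output_path = sys.argv[2] if len(sys.argv) > 2 else None
    return input_path, output_path
```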

Input Files

The script expects a JSON input file with the following structure:

  • topic: (String, Required) The subject for which to generate search queries.
  • model: (String, Optional) The identifier for the LLM to use. Defaults to "gemma3" if not provided.
  • parameters: (Object, Optional) Any additional parameters to pass to the LLM during the chat interaction (e.g., temperature, top_p).

Example (examples/search-queries-in.json):

```json
{
  "topic": "Sustainable agriculture practices",
  "model": "gemma3",
  "parameters": {
    "temperature": 0.7
  }
}
```

Key Functions

  • generate_search_queries(input_data): Takes the loaded input data, constructs prompts, interacts with the LLM via chat_with_llm, processes the response using parse_llm_json_response, and returns a list of generated queries.
  • main(): Handles command-line arguments using handle_command_args, loads the input JSON using load_json, calls generate_search_queries, prepares metadata using create_output_metadata and get_output_filepath, structures the final output data, and saves it using save_output.
  • Utility Functions (from utils.py):
    • load_json: Loads data from a JSON file.
    • save_output: Saves data to a JSON file.
    • chat_with_llm: Manages the interaction with the specified LLM.
    • parse_llm_json_response: Parses the LLM’s response, attempting to interpret it as JSON or splitting it into lines if parsing fails.
    • create_output_metadata: Generates standard metadata (script name, timestamp, UUID).
    • get_output_filepath: Determines the appropriate output file path.
    • handle_command_args: Parses command-line arguments for input and output file paths.
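A condensed sketch of how main() ties these utilities together (the argument order and return values of the utility functions are assumptions here; consult utils.py for the real signatures):

```python
from utils import (
    create_output_metadata, get_output_filepath, handle_command_args,
    load_json, save_output,
)

def main():
    input_path, specified_output = handle_command_args(
        "Usage: python search-queries.py <input_json> [output_json]"
    )
    input_data = load_json(input_path)
    queries = generate_search_queries(input_data)

    # Standard metadata: script name, timestamp, UUID.
    metadata = create_output_metadata("Search Queries")
    # Falls back to an auto-generated path when no output path was given.
    output_path = get_output_filepath("search-queries", specified_output)

    output_data = {
        "metadata": metadata,
        "topic": input_data["topic"],
        "queries": queries,
    }
    save_output(output_data, output_path)
```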

LLM Interaction

  1. System Prompt Construction: A detailed system message instructs the LLM to act as a search query generation assistant. It specifies the goal: generate multiple high-quality queries for a given topic, covering various types (general, specific, question-based, alternative phrasings, advanced operators like site:, filetype:, intitle:).
  2. Output Format Request: The system prompt explicitly asks the LLM to output the queries in a simple list format, with each query separated by a single newline, and without any numbers, symbols, or extra formatting.
  3. User Prompt Construction: A simple user message is created, providing the topic from the input file to the LLM.
  4. LLM Call: The chat_with_llm function is called with the specified model, the constructed system and user messages, and any optional parameters.
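The exact prompt wording and the chat_with_llm signature are assumptions; a sketch of steps 1–4 might look like:

```python
from utils import chat_with_llm

def generate_search_queries(input_data):
    topic = input_data["topic"]
    model = input_data.get("model", "gemma3")
    parameters = input_data.get("parameters", {})

    # System prompt: role, the query types to cover, and the plain-list format.
    system_message = (
        "You are a search query generation assistant. For the given topic, "
        "generate multiple high-quality search queries: general overviews, "
        "queries targeting authoritative sources, question-based queries, "
        "alternative phrasings, and queries using operators such as site:, "
        "filetype:, and intitle:. Output one query per line, with no numbers, "
        "symbols, or extra formatting."
    )
    user_message = f"Topic: {topic}"

    response = chat_with_llm(model, system_message, user_message, parameters)
    # (Response parsing follows; see Output Processing below.)
```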

Output Processing

  1. Parsing: The raw text response from the LLM is passed to the parse_llm_json_response utility function with include_children=False. This function first attempts to parse the response as a JSON list; if that fails, it splits the raw text by newlines, yielding a list of strings where each string is a query.
  2. Validation & Fallback: The script checks if the result from parse_llm_json_response is a list. If it’s not a list (indicating parsing or processing failed to produce the expected structure), it defaults to a list containing a single fallback message: ["No valid search queries could be generated"].
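Continuing the sketch from the previous section, the parsing and fallback logic might look like this (parse_llm_json_response's exact behavior is defined in utils.py):

```python
    # Inside generate_search_queries, after the chat_with_llm call:
    queries = parse_llm_json_response(response, include_children=False)

    # If parsing did not produce a list, substitute the fallback message.
    if not isinstance(queries, list):
        queries = ["No valid search queries could be generated"]
    return queries
```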

Output

The script generates a JSON output file containing:

  • metadata: An object with information about the script execution (script name, timestamp, UUID).
  • topic: The original topic string provided in the input file.
  • queries: A list containing the search queries generated by the LLM and processed by parse_llm_json_response. If generation or parsing failed, this list will contain the fallback message.

Example (examples/search-queries-out.json structure):

```json
{
  "metadata": {
    "script": "Search Queries",
    "timestamp": "YYYY-MM-DDTHH:MM:SS.ffffff",
    "uuid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  },
  "topic": "Sustainable agriculture practices",
  "queries": [
    "sustainable agriculture definition",
    "benefits of sustainable farming",
    "types of sustainable agriculture practices",
    "organic farming vs sustainable agriculture",
    "site:fao.org sustainable agriculture",
    "filetype:pdf sustainable agriculture techniques",
    "\"regenerative agriculture\" principles",
    "how does sustainable agriculture help the environment?",
    "challenges facing sustainable agriculture",
    "intitle:\"sustainable farming\" case studies"
    // … more queries
  ]
}
```