ChromaDBQueryEngine(host='localhost', port=8000, settings=None, tenant=None, database=None, embedding_function=None, metadata=None, llm=None, collection_name=None)
This engine leverages ChromaDB to persist document embeddings in a named collection and LlamaIndex's VectorStoreIndex to index and retrieve documents efficiently and generate answers to natural language queries. A collection can be regarded as an abstraction over a group of documents in the database.
It expects a ChromaDB server to be running and accessible at the specified host and port; refer to the ChromaDB documentation for running ChromaDB in a Docker container. If the host and port are not provided, the engine will create an in-memory ChromaDB client.
Initializes the ChromaDBQueryEngine with the given connection settings, embedding function, and LLM.
PARAMETER | DESCRIPTION |
host | The host address of the ChromaDB server. TYPE: Optional[str] DEFAULT: 'localhost' |
port | The port number of the ChromaDB server. TYPE: Optional[int] DEFAULT: 8000 |
settings | A dictionary of settings to communicate with the Chroma server. TYPE: Optional[Settings] DEFAULT: None |
tenant | The tenant to use for this client. Defaults to the default tenant. TYPE: Optional[str] DEFAULT: None |
database | The database to use for this client. Defaults to the default database. TYPE: Optional[str] DEFAULT: None |
embedding_function | A callable that converts text into vector embeddings. The default uses the Sentence Transformers model all-MiniLM-L6-v2. For more embedding functions that ChromaDB supports, refer to embeddings. TYPE: Optional[EmbeddingFunction[Any]] DEFAULT: None |
metadata | A dictionary of configuration parameters for the ChromaDB collection, typically used to configure the HNSW indexing algorithm. Defaults to {"hnsw:space": "ip", "hnsw:construction_ef": 30, "hnsw:M": 32}. For more details, refer to HNSW configuration. TYPE: Optional[dict[str, Any]] DEFAULT: None |
llm | The LLM used by LlamaIndex for query processing. Supported LLMs are listed at LLM. TYPE: Optional[LLM] DEFAULT: None |
collection_name | The unique name for the ChromaDB collection. If omitted, a constant name is used. Populate this to reuse previously ingested data. TYPE: Optional[str] DEFAULT: None |
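For illustration, here is a minimal construction sketch. It assumes a ChromaDB server reachable at localhost:8000 and, for the default gpt-4o LLM, an OpenAI API key in the environment; the collection name is a placeholder.
# Hypothetical usage sketch; the import path follows the source location shown below.
from autogen.agentchat.contrib.rag.chroma_db_query_engine import ChromaDBQueryEngine

engine = ChromaDBQueryEngine(
    host="localhost",
    port=8000,
    collection_name="my_docs",  # placeholder; reuse this name later to query the same data
)
Passing host=None or port=None instead falls back to an ephemeral, in-memory client, per the constructor source below.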
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def __init__( # type: ignore[no-any-unimported]
self,
host: Optional[str] = "localhost",
port: Optional[int] = 8000,
settings: Optional["Settings"] = None,
tenant: Optional[str] = None,
database: Optional[str] = None,
embedding_function: "Optional[EmbeddingFunction[Any]]" = None,
metadata: Optional[dict[str, Any]] = None,
llm: Optional["LLM"] = None,
collection_name: Optional[str] = None,
) -> None:
"""
Initializes the ChromaDBQueryEngine with the given connection settings, embedding function, and LLM.
Args:
host: The host address of the ChromaDB server. Default is localhost.
port: The port number of the ChromaDB server. Default is 8000.
settings: A dictionary of settings to communicate with the Chroma server. Default is None.
tenant: The tenant to use for this client. Defaults to the default tenant.
database: The database to use for this client. Defaults to the default database.
embedding_function: A callable that converts text into vector embeddings. The default uses the Sentence Transformers model all-MiniLM-L6-v2.
For more embedding functions that ChromaDB supports, please refer to [embeddings](https://docs.trychroma.com/docs/embeddings/embedding-functions)
metadata: A dictionary containing configuration parameters for the ChromaDB collection.
This metadata is typically used to configure the HNSW indexing algorithm. Defaults to `{"hnsw:space": "ip", "hnsw:construction_ef": 30, "hnsw:M": 32}`
For more details about the default metadata, please refer to [HNSW configuration](https://cookbook.chromadb.dev/core/configuration/#hnsw-configuration)
llm: LLM model used by LlamaIndex for query processing.
You can find more supported LLMs at [LLM](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/)
collection_name: The unique name for the ChromaDB collection. If omitted, a constant name will be used. Populate this to reuse previously ingested data.
"""
self.llm: LLM = llm or OpenAI(model="gpt-4o", temperature=0.0) # type: ignore[no-any-unimported]
if not host or not port:
logger.warning(
"Can't connect to remote Chroma client without host or port not. Using an ephemeral, in-memory client."
)
self.client = None
else:
try:
self.client = HttpClient(
host=host,
port=port,
settings=settings,
tenant=tenant if tenant else DEFAULT_TENANT, # type: ignore[arg-type, no-any-unimported]
database=database if database else DEFAULT_DATABASE, # type: ignore[arg-type, no-any-unimported]
)
except Exception as e:
raise ValueError(f"Failed to connect to the ChromaDB client: {e}")
self.db_config = {"client": self.client, "embedding_function": embedding_function, "metadata": metadata}
self.collection_name = collection_name if collection_name else DEFAULT_COLLECTION_NAME
llm instance-attribute
llm = llm or OpenAI(model='gpt-4o', temperature=0.0)
client instance-attribute
client = HttpClient(host=host, port=port, settings=settings, tenant=tenant if tenant else DEFAULT_TENANT, database=database if database else DEFAULT_DATABASE)
db_config instance-attribute
db_config = {'client': client, 'embedding_function': embedding_function, 'metadata': metadata}
collection_name instance-attribute
collection_name = collection_name if collection_name else DEFAULT_COLLECTION_NAME
init_db
init_db(new_doc_dir=None, new_doc_paths_or_urls=None, *args, **kwargs)
Initialize the database with the input documents or records. It overwrites the existing collection in the database.
It takes the following steps: 1. Set up ChromaDB and LlamaIndex storage. 2. Insert documents and build indexes on them.
PARAMETER | DESCRIPTION |
new_doc_dir | A directory of input documents used to create the records in the database. TYPE: Optional[Union[Path, str]] DEFAULT: None |
new_doc_paths_or_urls | A sequence of input documents used to create the records in the database. A document can be a path to a file or a URL. TYPE: Optional[Sequence[Union[Path, str]]] DEFAULT: None |
*args | Any additional arguments TYPE: Any DEFAULT: () |
**kwargs | Any additional keyword arguments TYPE: Any DEFAULT: {} |
RETURNS | DESCRIPTION |
bool | True if initialization is successful TYPE: bool |
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def init_db(
self,
new_doc_dir: Optional[Union[Path, str]] = None,
new_doc_paths_or_urls: Optional[Sequence[Union[Path, str]]] = None,
*args: Any,
**kwargs: Any,
) -> bool:
"""Initialize the database with the input documents or records.
It overwrites the existing collection in the database.
It takes the following steps:
1. Set up ChromaDB and LlamaIndex storage.
2. Insert documents and build indexes on them.
Args:
new_doc_dir: A directory of input documents used to create the records in the database.
new_doc_paths_or_urls:
A sequence of input documents used to create the records in the database.
A document can be a path to a file or a URL.
*args: Any additional arguments
**kwargs: Any additional keyword arguments
Returns:
bool: True if initialization is successful
"""
self._set_up(overwrite=True)
documents = self._load_doc(input_dir=new_doc_dir, input_docs=new_doc_paths_or_urls)
self.index = VectorStoreIndex.from_documents(documents=documents, storage_context=self.storage_context)
return True
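As a usage sketch (the paths are illustrative and assume the engine constructed earlier):
# Hypothetical example: overwrite the collection and ingest documents from a local directory.
engine.init_db(new_doc_dir="./docs")
# Or pass explicit files and URLs instead:
engine.init_db(new_doc_paths_or_urls=["./docs/guide.md", "https://example.com/faq.html"])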
connect_db
connect_db(*args, **kwargs)
Connect to the database. It does not overwrite the existing collection in the database.
It takes the following steps: 1. Set up ChromaDB and LlamaIndex storage. 2. Create the LlamaIndex vector store index for querying or inserting docs later.
PARAMETER | DESCRIPTION |
*args | Any additional arguments TYPE: Any DEFAULT: () |
**kwargs | Any additional keyword arguments TYPE: Any DEFAULT: {} |
RETURNS | DESCRIPTION |
bool | True if connection is successful TYPE: bool |
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def connect_db(self, *args: Any, **kwargs: Any) -> bool:
"""Connect to the database.
It does not overwrite the existing collection in the database.
It takes the following steps:
1. Set up ChromaDB and LlamaIndex storage.
2. Create the LlamaIndex vector store index for querying or inserting docs later.
Args:
*args: Any additional arguments
**kwargs: Any additional keyword arguments
Returns:
bool: True if connection is successful
"""
self._set_up(overwrite=False)
self.index = VectorStoreIndex.from_vector_store(
vector_store=self.vector_store, storage_context=self.storage_context
)
return True
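A sketch of reusing previously ingested data (the collection name is a placeholder and must match the one used at ingestion time):
# Hypothetical example: attach to an existing collection without overwriting it.
engine = ChromaDBQueryEngine(host="localhost", port=8000, collection_name="my_docs")
engine.connect_db()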
add_docs
add_docs(new_doc_dir=None, new_doc_paths_or_urls=None, *args, **kwargs)
Add new documents to the underlying database and add to the index.
PARAMETER | DESCRIPTION |
new_doc_dir | A directory of input documents used to create the records in the database. TYPE: Optional[Union[Path, str]] DEFAULT: None |
new_doc_paths_or_urls | A sequence of input documents used to create the records in the database. A document can be a path to a file or a URL. TYPE: Optional[Sequence[Union[Path, str]]] DEFAULT: None |
*args | Any additional arguments TYPE: Any DEFAULT: () |
**kwargs | Any additional keyword arguments TYPE: Any DEFAULT: {} |
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def add_docs(
self,
new_doc_dir: Optional[Union[Path, str]] = None,
new_doc_paths_or_urls: Optional[Sequence[Union[Path, str]]] = None,
*args: Any,
**kwargs: Any,
) -> None:
"""Add new documents to the underlying database and add to the index.
Args:
new_doc_dir: A directory of input documents used to create the records in the database.
new_doc_paths_or_urls: A sequence of input documents used to create the records in the database. A document can be a path to a file or a URL.
*args: Any additional arguments
**kwargs: Any additional keyword arguments
"""
self._validate_query_index()
documents = self._load_doc(input_dir=new_doc_dir, input_docs=new_doc_paths_or_urls)
for doc in documents:
self.index.insert(doc)
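A sketch of incremental ingestion (the path is illustrative; init_db or connect_db must have been called first, since the method validates the query index):
# Hypothetical example: append a new document to the existing collection and index.
engine.add_docs(new_doc_paths_or_urls=["./docs/changelog.md"])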
query
query(question)
Retrieve information from indexed documents by processing a query using the engine's LLM.
PARAMETER | DESCRIPTION |
question | A natural language query string used to search the indexed documents. TYPE: str |
RETURNS | DESCRIPTION |
str | A string containing the response generated by the LLM. |
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def query(self, question: str) -> str:
"""
Retrieve information from indexed documents by processing a query using the engine's LLM.
Args:
question: A natural language query string used to search the indexed documents.
Returns:
A string containing the response generated by the LLM.
"""
self._validate_query_index()
self.query_engine = self.index.as_query_engine(llm=self.llm)
response = self.query_engine.query(question)
if str(response) == EMPTY_RESPONSE_TEXT:
return EMPTY_RESPONSE_REPLY
return str(response)
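A sketch of querying (the question is illustrative; documents must already be ingested):
# Hypothetical example: ask a natural language question over the indexed documents.
answer = engine.query("What do the documents say about configuring HNSW?")
print(answer)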
get_collection_name
get_collection_name()
Get the name of the collection used by the query engine.
RETURNS | DESCRIPTION |
str | The name of the collection. |
Source code in autogen/agentchat/contrib/rag/chroma_db_query_engine.py
def get_collection_name(self) -> str:
"""
Get the name of the collection used by the query engine.
Returns:
The name of the collection.
"""
if self.collection_name:
return self.collection_name
else:
raise ValueError("Collection name not set.")
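A small sketch of reading the collection name back, e.g. for logging or for reconnecting later:
# Hypothetical example: record which collection this engine is bound to.
name = engine.get_collection_name()
print(f"Active collection: {name}")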