Capabilities

Pipeline Lifecycle

This is a high-level overview of the lifecycle of a pipeline in the Pipelines system.

A storage location is created (optional)

This step is optional and only needed for accessing data external storage.

Data for processing can be stored within the database in a table or in an external storage location. If you want to use an external storage location, you must create a storage location to access the data. This storage location can be an S3 bucket or a local file system.

The storage locations can be used to create a volume, suitable for a retriever to use to access the data it contains.

A model is registered

A model is registered with the Pipelines system. This model can be a machine learning model, a deep learning model, or any other type of model that can be used for AI tasks.

A retriever is registered

A retriever is registered with the Pipelines system. A retriever is a function that retrieves data from a table or volume and returns it in a format that can be used by the model.

By default, a retriever only needs:

  • a name
  • the name of a registered model to use

If the retriever is for a table, it also needs:

  • the name of the source table
  • the name of the column in the source table that contains the data
  • the data type of the column

If, on the other hand, the retriever is for a volume, it needs:

  • the name of the volume
  • the name of the column in the volume that contains the data

When a retriever is registered, by default it will create a vector table to store the embeddings of the data that is retrieved. This table will have a column to store the embeddings and a column to store the key of the data.

The name of the vector table and the name of the vector column and the key column can be specified when the retriever is registered; this is useful if you are migrating to aidb and want to use an existing vector table.

Embeddings are created

Embedding sees the data being retrieved from the source table or volume and encoded into a vector datatype. That vector data is then stored in the vector table.

If the source table already has data/rows at the time where the retriever is created, then a manual "bulk embedding" call must be made. This generates the embeddings for all the existing data in the source table.

Auto embedding can then be activated to keep the embeddings in sync going forward. Auto embedding uses Postgres triggers to detect insertions and updates to the source table and automatically generates embeddings for the new data.

Data is queried

The embedded data can be queried using the retriever. The retriever can return the key to the data or the data itself, depending on the query. The data can be queried using a text query or an image query, depending on the type of data that is being retrieved.

Next steps

While auto-embedding is enabled, the embeddings are always up-to-date and applications can use the retriever to query the data as needed.

Cleanup

If the embeddings are no longer required, the retriever can be unregistered, the vector table can be dropped and the model can be unregistered too.


Could this page be better? Report a problem or suggest an addition!