SimilaritySearchKit
SimilaritySearchKit is a Swift package that enables on-device text embeddings and semantic search for iOS and macOS applications in just a few lines of code. Emphasizing speed, extensibility, and privacy, it supports a variety of built-in state-of-the-art NLP models and similarity metrics, and integrates with bring-your-own models and metrics.
Use Cases
Some potential use cases for SimilaritySearchKit include:
Privacy-focused document search engines: Create a search engine that processes sensitive documents locally, without exposing user data to external services. (See example project “ChatWithFilesExample” in the Examples directory.)
Offline question-answering systems: Implement a question-answering system that finds the most relevant answers to a user’s query within a local dataset.
Document clustering and recommendation engines: Automatically group and organize documents based on their textual content on the edge.
By leveraging SimilaritySearchKit, developers can easily create powerful applications that keep data close to home without major tradeoffs in functionality or performance.
Installation
To install SimilaritySearchKit, add it as a dependency to your Swift project using the Swift Package Manager. I personally recommend the Xcode method:
File → Add Packages... → Search or Enter Package URL → https://github.com/ZachNagengast/similarity-search-kit.git
Xcode should then prompt you to choose which models you’d like to add (see Available Models below for help choosing).
If you want to add it via Package.swift, add the package URL above to your dependencies array, then add the appropriate target dependency to the desired target.
If you only want to use a subset of the available models, you can omit the corresponding dependency. This will reduce the size of your final binary.
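As a rough sketch of the Package.swift route, a manifest might look like the following. The package URL comes from the instructions above; the app name `MyApp`, the `from: "0.0.1"` version requirement, and the product name `SimilaritySearchKit` are assumptions made for illustration, so check them against the repository before copying.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical app/package name
    dependencies: [
        // URL taken from the installation instructions; the version is an assumption.
        .package(
            url: "https://github.com/ZachNagengast/similarity-search-kit.git",
            from: "0.0.1"
        )
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Assumed core product name. Per-model products would be listed here
                // too, using the names Xcode shows in its package dialog.
                .product(name: "SimilaritySearchKit", package: "similarity-search-kit")
            ]
        )
    ]
)
```

Leaving a model's product out of the target's dependencies is what keeps it out of the final binary.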
Usage
To use SimilaritySearchKit in your project, first import the framework:
Next, create an instance of SimilarityIndex with your desired distance metric and embedding model (see below for options):
Then, add the text that you want to make searchable to the index:
Finally, query the index for the most similar items to a given query:
Which outputs a SearchResult array:
[SearchResult(id: "id1", score: 0.86216, metadata: ["source": "example.pdf"])]
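To illustrate the create, add, and query flow end to end, here is a minimal self-contained sketch. It deliberately does not use SimilaritySearchKit's own types: `ToyIndex`, `ToyResult`, and the word-count "embedding" are hypothetical stand-ins invented for this example, and a real embedding model would return dense float vectors instead.

```swift
import Foundation

// Hypothetical stand-ins for illustration only; these are NOT the library's types.
struct ToyResult {
    let id: String
    let score: Double
    let metadata: [String: String]
}

struct ToyIndex {
    var items: [(id: String, vector: [String: Double], metadata: [String: String])] = []

    // Toy "embedding": lowercase word counts stored as a sparse vector.
    static func embed(_ text: String) -> [String: Double] {
        var counts: [String: Double] = [:]
        let words = text.lowercased().split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        for word in words {
            counts[String(word), default: 0] += 1
        }
        return counts
    }

    // Cosine similarity between two sparse vectors; 0 if either is empty.
    static func cosine(_ a: [String: Double], _ b: [String: Double]) -> Double {
        var dot = 0.0
        for (key, value) in a {
            dot += value * (b[key] ?? 0)
        }
        let normA = a.values.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
        let normB = b.values.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
        guard normA > 0, normB > 0 else { return 0 }
        return dot / (normA * normB)
    }

    // Embed the text once at insertion time and keep it with its metadata.
    mutating func addItem(id: String, text: String, metadata: [String: String]) {
        items.append((id: id, vector: Self.embed(text), metadata: metadata))
    }

    // Embed the query, score every item, and return the best `top` matches.
    func search(_ query: String, top: Int = 3) -> [ToyResult] {
        let queryVector = Self.embed(query)
        let scored = items.map {
            ToyResult(id: $0.id, score: Self.cosine(queryVector, $0.vector), metadata: $0.metadata)
        }
        return Array(scored.sorted { $0.score > $1.score }.prefix(top))
    }
}

var index = ToyIndex()
index.addItem(id: "id1", text: "Metal was released in June 2014.", metadata: ["source": "example.pdf"])
index.addItem(id: "id2", text: "Swift is a programming language.", metadata: ["source": "notes.txt"])

let results = index.search("When was Metal released?")
print(results.first?.id ?? "none") // id1
```

The scores from this toy will not match the `SearchResult` score shown above, since that value depends on the embedding model actually in use.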
Examples
The Examples directory contains multiple sample iOS and macOS applications that demonstrate how to use SimilaritySearchKit to its fullest extent.
BasicExample
PDFExample
ChatWithFilesExample
Available Models
NaturalLanguage
MiniLMAll
Distilbert
MiniLMMultiQA
Models conform to the EmbeddingProtocol and can be used interchangeably with the SimilarityIndex class.
Available Metrics
DotProduct
CosineSimilarity
EuclideanDistance
Metrics conform to the DistanceMetricProtocol and can be used interchangeably with the SimilarityIndex class.
Bring Your Own
All the main parts of the SimilarityIndex can be overridden with custom implementations that conform to the following protocols:
EmbeddingProtocol
Accepts a string and returns an array of floats representing the embedding of the input text.
DistanceMetricProtocol
Accepts a query embedding vector and a list of embedding vectors and returns a tuple of the distance metric score and index of the nearest neighbor.
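As a hedged illustration of what such a metric computes, the sketch below implements the three listed metrics as free functions and ranks stored embeddings against a query. The function names and the `(score, index)` tuple shape are assumptions for this example; the protocol's actual requirements may differ.

```swift
// Hypothetical helpers for illustration; not the library's protocol requirements.
func dotProduct(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for (x, y) in zip(a, b) {
        sum += x * y
    }
    return sum
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let denominator = dotProduct(a, a).squareRoot() * dotProduct(b, b).squareRoot()
    return denominator > 0 ? dotProduct(a, b) / denominator : 0
}

func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for (x, y) in zip(a, b) {
        let diff = x - y
        sum += diff * diff
    }
    return sum.squareRoot()
}

// Score every stored embedding against the query with cosine similarity and
// return the best `top` matches as (score, index) tuples, highest score first.
func nearestNeighbors(query: [Float], embeddings: [[Float]], top: Int) -> [(score: Float, index: Int)] {
    let scored = embeddings.enumerated().map {
        (score: cosineSimilarity(query, $0.element), index: $0.offset)
    }
    return Array(scored.sorted { $0.score > $1.score }.prefix(top))
}

let stored: [[Float]] = [[1, 0], [0, 1], [1, 1]]
let best = nearestNeighbors(query: [1, 0], embeddings: stored, top: 2)
print(best.map { $0.index }) // [0, 2]
```

Note that similarity scores rank highest-first, whereas a Euclidean distance ranks lowest-first, so a distance-based variant would flip the sort.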
TextSplitterProtocol
Splits a string into chunks of a given size, with a given overlap. This is useful for splitting long documents into smaller chunks for embedding. It returns the list of chunks and an optional list of token IDs for each chunk.
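A minimal sketch of overlapping chunking follows, under a simplifying assumption: chunk size and overlap are counted in space-separated words rather than model tokens, and no token IDs are returned. The function name `splitIntoChunks` is hypothetical.

```swift
// Hypothetical helper, not the library's splitter: sizes are counted in
// space-separated words instead of tokens.
func splitIntoChunks(_ text: String, chunkSize: Int, overlap: Int) -> [String] {
    precondition(chunkSize > 0 && overlap >= 0 && overlap < chunkSize,
                 "overlap must be non-negative and smaller than chunkSize")
    let words = text.split(separator: " ").map(String.init)
    var chunks: [String] = []
    var start = 0
    while start < words.count {
        let end = min(start + chunkSize, words.count)
        chunks.append(words[start..<end].joined(separator: " "))
        if end == words.count { break }  // last chunk reached
        start += chunkSize - overlap     // step forward, keeping `overlap` words
    }
    return chunks
}

let chunks = splitIntoChunks("one two three four five six seven", chunkSize: 3, overlap: 1)
print(chunks) // ["one two three", "three four five", "five six seven"]
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough.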
TokenizerProtocol
Tokenizes and detokenizes text. Use this for custom models that need a tokenizer other than those available in the current list.
VectorStoreProtocol
Save and load index items. The default implementation uses JSON files, but this can be overridden to use any storage mechanism.
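To make the JSON-file idea concrete, here is a small round-trip sketch built on Foundation's `Codable` support. The `StoredItem` type, its fields, and the `saveItems`/`loadItems` functions are hypothetical names chosen for this example; the library's stored item and protocol methods may look different.

```swift
import Foundation

// Hypothetical item type for illustration; the library's item may carry other fields.
struct StoredItem: Codable, Equatable {
    let id: String
    let text: String
    let embedding: [Float]
    let metadata: [String: String]
}

// Encode all items as one JSON array and write it atomically.
func saveItems(_ items: [StoredItem], to url: URL) throws {
    let data = try JSONEncoder().encode(items)
    try data.write(to: url, options: .atomic)
}

// Read the file back and decode the JSON array into items.
func loadItems(from url: URL) throws -> [StoredItem] {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode([StoredItem].self, from: data)
}

let fileURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("toy-index.json")
let items = [
    StoredItem(id: "id1",
               text: "Metal was released in June 2014.",
               embedding: [0.5, 0.25],
               metadata: ["source": "example.pdf"])
]

var restored: [StoredItem] = []
do {
    try saveItems(items, to: fileURL)
    restored = try loadItems(from: fileURL)
    print("Restored \(restored.count) item(s)")
} catch {
    print("Round trip failed: \(error)")
}
```

Swapping in another storage mechanism would mean replacing only these two functions, for example with a database write and read, while the rest of the index stays unchanged.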
Acknowledgements
Many parts of this project were derived from existing code, either already in Swift or translated into Swift thanks to ChatGPT. These are some of the main projects that were referenced:
Motivation
This project was inspired by the incredible advancements in natural language services and applications that have come about with the emergence of ChatGPT. While these services have unlocked a whole new world of powerful text-based applications, they often rely on cloud services. Specifically, many “Chat with Data” services require users to upload their data to remote servers for processing and storage. Although this works for some, it might not be the best fit for those in low-connectivity environments or those handling confidential or sensitive information. While Apple does bundle the NaturalLanguage library for similar tasks, the CoreML model conversion process opens up a much wider array of models and use cases. With this in mind, SimilaritySearchKit aims to provide a robust, on-device solution that enables developers to create state-of-the-art NLP applications within the Apple ecosystem.
Future Work
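Here’s a short list of some features that are planned for future releases:
In-memory indexing
Disk-backed indexing: For large datasets that don’t fit in memory
All-around performance improvements
Swift-DocC website
HNSW / Annoy indexing options
Querying filters: Only return results with specific metadata
Sparse/Dense hybrid search: Use sparse search to find candidate results, then rerank with dense search
Can be used to merge several query results into one, and clean up irrelevant text
Metal acceleration for distance calculations
I’m curious to see how people use this library and what other features would be useful, so please don’t hesitate to reach out on Twitter @ZachNagengast or email znagengast (at) gmail (dot) com.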