1 and 3 are quite similar.
One may argue that the text with the id 2 is somewhat similar to the texts with the ids 1 and 3. However, because the example above uses 4-shingles to measure the similarity of the texts, no intersections are found for the text pairs 1 and 2, or 3 and 2. The similarity index for these text pairs is therefore 0.
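The effect of the shingle size can be reproduced outside of Trino with exact sets. The following is a minimal Python sketch, not Trino code: the three texts are hypothetical stand-ins for the ones used in the example, and the helper names shingles and jaccard_index are chosen only for this illustration.

```python
def shingles(text, n):
    """Return the set of n-word shingles (runs of n consecutive words) of a text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_index(a, b):
    """Exact Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical texts, assumed for illustration only.
texts = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "The quick and the lazy",
    3: "The quick brown fox jumps over the dog",
}

# One set of 4-shingles per text id.
shingle_sets = {text_id: shingles(text, 4) for text_id, text in texts.items()}

for left, right in [(1, 3), (1, 2), (3, 2)]:
    index = jaccard_index(shingle_sets[left], shingle_sets[right])
    print(left, right, index)
```

With these stand-in texts, texts 1 and 3 share several 4-shingles, while text 2 shares none with either of them, even though it has individual words in common with both. A smaller shingle size would make text 2 register as partly similar.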
Data structures
Trino implements Set Digest data sketches by encapsulating the following components: a HyperLogLog structure and a MinHash structure. The HyperLogLog structure is used to approximate the number of distinct elements in the original set. The MinHash structure is used to store a low memory footprint signature of the original set. The similarity of any two sets is estimated by comparing their signatures. The Trino type for this data structure is called setdigest.
Trino offers the ability to merge multiple Set Digest data sketches.
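The idea of a compact, mergeable signature can be illustrated with a small Python sketch. This is an assumption-laden illustration and not Trino's implementation: it keeps the k smallest hash values of a set under a single hash function, uses SHA-256 from the standard library as a stand-in hash, and the names hash_value, signature, merge, and estimate_jaccard are hypothetical.

```python
import hashlib


def hash_value(value):
    """Map a value to a 64-bit integer (SHA-256 is a stand-in hash for this sketch)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(values, k=64):
    """Signature of a set: its k smallest distinct hash values, sorted ascending."""
    return sorted({hash_value(v) for v in values})[:k]


def merge(sig_a, sig_b, k=64):
    """Signature of the union of two sets, computed from their signatures alone."""
    return sorted(set(sig_a) | set(sig_b))[:k]


def estimate_jaccard(sig_a, sig_b, k=64):
    """Estimate the Jaccard index as the share of the union's k smallest hashes
    that appear in both signatures."""
    union_sample = merge(sig_a, sig_b, k)
    if not union_sample:
        return 0.0
    a, b = set(sig_a), set(sig_b)
    shared = sum(1 for h in union_sample if h in a and h in b)
    return shared / len(union_sample)


sig_a = signature(range(0, 1000))
sig_b = signature(range(500, 1500))
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true index, 500 / 1500
```

Merging two signatures gives the same result as building a signature from the union of the original sets, which is what makes it possible to combine sketches without revisiting the underlying data. For sets with fewer than k distinct elements, the estimate in this sketch is exact.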
Serialization
Data sketches can be serialized to and deserialized from varbinary. This allows them to be stored for later use.
Functions
make_set_digest
Composes the input values into a setdigest, for example a setdigest corresponding to a bigint array or to a varchar array.
merge_set_digest
Returns the setdigest of the aggregate union of the individual setdigest Set Digest structures.
cardinality
Returns the cardinality of the set digest from its internal HyperLogLog component.
Examples:
intersection_cardinality
jaccard_index
x and y must be of type setdigest.
Examples:
hash_counts
Returns a map of the hashed values to the count of their occurrences within the internal MinHash structure belonging to x.
x must be of type setdigest.
Examples: