跳至内容

Presto 0.289 文档

KHyperLogLog 函数

KHyperLogLog 函数¶

Presto 实现 KHyperLogLog 算法和数据结构。 KHyperLogLog 数据结构可以通过 khyperloglog_agg() 创建。

数据结构¶

KHyperLogLog 是一个数据草图，它紧凑地表示两列的关联。它在 Presto 中实现为一个两级数据结构，由一个 MinHash 结构组成，其条目映射到 HyperLogLog。

序列化¶

KHyperLogLog 草图可以转换为 varbinary，反之亦然。这使得它们可以存储以供将来使用。

函数¶

khyperloglog_agg(x, y) -> KHyperLogLog()¶: 返回表示列 x 和 y 之间关系的 KHyperLogLog 草图。MinHash 结构总结了 x，而 HyperLogLog 草图表示与 x 值关联的 y 值。

cardinality(khll) -> bigint(): 这计算了 MinHash 草图的基数，即 x 的基数。

intersection_cardinality(khll1, khll2) -> bigint()¶: 返回由 khll1 和 khll2 的 MinHash 结构表示的数据的集合交集基数。

jaccard_index(khll1, khll2) -> double()¶: 返回由 khll1 和 khll2 的 MinHash 结构表示的数据的 Jaccard 指数。

uniqueness_distribution(khll) -> map<bigint,double>()¶: 对于某个值 x'，唯一性是指在源数据集中有多少个 y' 值与它相关联。这是通过映射自与 x' 相对应的 MinHash 桶的 HyperLogLog 的基数获得的。此函数返回一个直方图，它表示唯一性分布，X 轴是 uniqueness，Y 轴是 x 值的相对频率。

uniqueness_distribution(khll, histogramSize) -> map<bigint,double>()¶: 返回具有给定桶数的唯一性直方图。如果省略，则该值默认为 256。所有大于 histogramSize 的 uniqueness 值都会累积在最后一个桶中。

reidentification_potential(khll, threshold) -> double()¶: 重新识别可能性是在给定 threshold 下具有低于 threshold 的 uniqueness 的 x 值的比率。

merge(khll) -> KHyperLogLog(): 返回各个 KHyperLogLog 结构的聚合并集的 KHyperLogLog。

merge_khll(array(khll)) -> KHyperLogLog()¶: 返回一个 KHyperLogLog 结构数组的并集的 KHyperLogLog。