en.wikipedia.org/wiki/Count%E2%80%93min_sketch
2 corrections found
where the error in answering a query is within an additive factor of ε with probability 1 − δ
The standard count–min sketch bound is not an additive error of just ε. The point-query error is additive in the stream size: at most εN (equivalently ε‖a‖₁) with probability 1−δ.
Full reasoning
This sentence drops the dependence on the total mass of the stream. In the standard count–min sketch guarantee, choosing width and depth from ε and δ yields a point-query estimate that is at most the true count plus εN, where N is the total number of items seen so far (equivalently ε‖a‖₁ in vector notation), with probability 1−δ.
Saying the error is merely "within an additive factor of ε" is incorrect because it changes the units and makes the bound far smaller than the actual guarantee. The same article later states the correct bound as â_i ≤ a_i + εN.
2 sources
- Spark 4.1.2 ScalaDoc - org.apache.spark.util.sketch.CountMinSketch
With probability `delta`, the estimate of this frequency is within the range `true frequency <= estimate <= true frequency + eps * N`, where `N` is the total count of items have appeared the data stream so far.
- Count-Min Sketch
required to answer point queries with error ε∥a∥1 with probability at least 1 −δ is given by ...
Let a ⋅ b ^ j = ∑ k = 0 w c o u n t a [ j , k ] ⋅ c o u n t b [ j , k ]
The summation limits are off by one. A sketch row has w columns, so the rowwise inner-product sum should cover exactly w entries, e.g. k=1..w or 0..w−1, not 0..w.
Full reasoning
Earlier the article defines the sketch as a two-dimensional array with w columns and d rows. A sum from k = 0 to w includes w+1 positions, which is inconsistent with a row that has only w columns.
The original count–min sketch paper defines the rowwise inner-product estimator over exactly w counters, using indices k = 1 to w. If one prefers zero-based indexing, the equivalent range would be k = 0 to w−1. As written, 0 through w is an off-by-one error.
2 sources
- Journal of Algorithms 55 (2005) 58–75
(a ⊙ b)_j = Σ_{k=1}^w count_a[j,k] * count_b[j,k].
- Count-Min Sketch
The Count-Min sketch can also be used to estimate the inner product between two vectors; in fact, this can be seen by considering the Count-Min sketch as a collection of d vectors of length w, and finding the minimum inner product between corresponding rows.