斯坦福算法专项课程Course2 week4习题解析

这一部分将带来Course2 week4的习题解析。

选择题

选择题 1

以下哪一项不是您期望精心设计的hash函数具有的属性？

hash函数应该“展开”大多数（即“非病理”）数据集（跨散列表的桶/槽）
hash函数应该易于计算（恒定时间或接近它）
hash函数应该易于存储（恒定空间或接近它）
hash函数应该“展开”每个数据集（跨散列表的桶/槽）

第四个选项的意思是希望hash函数将元素映射到每个bucket中，错误。

选择题 2

假设我们使用哈希函数$H$将$n$个不同的键映射到长度为$m$的数组$T$中。假设简单的uniform hashing——也就是说，每个键独立且均匀地映射到随机桶中，那么映射到第一个桶的预期键数是多少？更准确地说，该集合元素数量$\{k:h(k)=1\}$

$m/(2n)$
$1/n$
$n/m$
$1/m$
$n/(2m)$
$m/n$

每个元素映射到第一个桶的概率为$\frac 1 m$，因为有$n$个元素，所以答案为$n/m$，第三项正确

选择题 3

假设我们使用哈希函数$H$将$n$个不同的键映射到长度为$m$的数组$T$中。如果$h(x)= h(y),x\neq y$，那么称这两个键冲突。假设简单的uniform hashing——也就是说，每个键独立且均匀地映射到随机桶中，那么$x\neq y$冲突的概率是多少？

$1/m^2$
$1/n^2$
$1/(m-1)$
$1/n$
$1/m$

$x$的选择是任意的，$x,y$冲突当且仅当$y$落在$x$的桶中，所以概率为

$\frac 1 m$

最后一项正确

选择题 4

假设我们使用哈希函数$H$将$n$个不同的键映射到长度为$m$的数组$T$中。如果$h(x)= h(y),x\neq y$，那么称这两个键冲突。假设简单的uniform hashing——也就是说，每个键独立且均匀地映射到随机桶中，那么不同键冲突的数量的数学期望为？

$n/m^2$
$n^2/m$
$n/m$
$n(n-1)/m$
$n(n-1)/2m$

令

$\begin{aligned} z_{ij}=\begin{cases} 1, & h(i)=h(j),i\neq j\\ 0, & 其余情况 \end{cases} \end{aligned}$

那么不同键冲突的数量为

$Z=\sum_{i=1}^n \sum_{j>i} z_{ij}$

注意

$P(z_{ij}=1)= \frac 1 m$

所以取数学期望可得

$E[Z] =\sum_{i=1}^n \sum_{j>i} E[z_{ij}]=\frac 1 m C_n^2 =\frac{n(n-1)}{2m}$

最后一项正确

选择题 5

为了解释我们在演讲中对Bloom Filters 的启发式分析，我们打算在在Bloom Filters中对每个对象使用8位空间的情况。假设我们愿意使用两倍的空间（每个对象16位）。你能说出相应的 false positive rate是多少（假设$k$个hash表是最优的）？[选择最强的真实陈述。]

小于$1\%$
小于$0.1\%$
小于$0.01\%$
小于$0.001\%$

将$b=16$代入$\epsilon \approx (\frac 1 2 )^{b\ln 2 }$可得

$\epsilon \approx 0.00045863850789649714\\ \epsilon < 0.1\%$

第二项正确

编程题

参考链接

代码如下：

import numpy as np

#搜索函数
def Binary_search(num, data, l):
    left = 0
    right = l - 1
    while (right - left) > 1:
        if data[(right + left) // 2] > num:
            right = (right + left) // 2
        else:
            left = (right + left) // 2
    return left, right

filename = 'data.txt'
data = []

#读取数据
with open(filename) as f:
    for i in f.readlines():
        num = int(i)
        data.append(num)
        
#排序
data = np.sort(data)

#计算结果
l = len(data)
res = []
for num in data:
    n1 = -10000 - num
    n2 = 10000 - num
    l1, r1 = Binary_search(n1, data, l)
    l2, r2 = Binary_search(n2, data, l)
    #判断索引位置
    if (r1 <= l2):
        res += list(data[r1: (l2 + 1)] + num)

res = set(res)
print(len(res))

思考题

Recall that a set $H$ of hash functions (mapping the elements of a universe $U$ to the buckets $\{0,1,2,\ldots,n-1\}$) is universal if for every distinct $x,y∈U$, the probability $Prob[h(x) = h(y)]$ that $x$ and $y$ collide, assuming that the hash function $h$ is chosen uniformly at random from $H$, is at most $1/n$. In this problem you will prove that a collision probability of $1/n$ is essentially the best possible. Precisely, suppose that $H$ is a family of hash functions mapping $U$ to $\{0,1,2,\ldots,n-1\}$, as above. Show that there must be a pair $x,y∈U$ of distinct elements such that, if $h$ is chosen uniformly at random from $H$, then $\operatorname{Prob}[h(x)=h(y)] \geq \frac{1}{n}-\frac{1}{|U|}$.

参考资料：https://blog.asarkar.org/algorithms-design-analysis/hw-6-opt/

记：

$\begin{aligned} H&= \{h_1,\ldots, h_k\}\\ N & = |U| \end{aligned}$

首先固定哈希函数$h$，记

$\begin{aligned} n_i &=\# \{x: x\in U , h(x)= i\} \end{aligned}$

显然我们有

$\sum_{i=0}^{n- 1} n_i = N$

冲突的数量为

$\begin{aligned} L_h&=\sum_{i=0}^{n-1} \binom{n_i}{2} \\ &=\sum_{i=0}^{n-1} \frac{n_i^2}{2} - \sum_{i=0}^{n-1}\frac{n_i}{2}\\ &\ge\frac{(\sum_{i=0}^{n-1}n_i)^2}{2 n } -\frac{N}{2} \\ &=\frac{N^2}{ 2n} -\frac{N}{2} \end{aligned}$

考虑随机变量

$X_i(x,y) = \begin{cases} 1 \text{ if } h_i(x) = h_i(y) \\ 0 \text{ otherwise } \end{cases}$

记

$X=\sum_{x\neq y} X_i(x,y)$

那么

$\begin{aligned} E[X] &=\sum_{i=1}^k\sum_{x\neq y} E[X_i(x,y)]\\ &=\sum_{i=1}^k\sum_{x\neq y} P[h_i(x) =h_i(y)]\\ &\ge \frac{(\frac{N^2}{ 2n} -\frac{N}{2})k}{\frac{N(N-1)}{2} k} \\ &\ge \frac{\frac{N}{n} - 1}{N-1} \\ &\ge \frac{\frac{N}{n} - 1}{ N}\\ &= \frac 1 {n} -\frac{1}{N} \end{aligned}$

所以必然存在$x,y, h $，使得

$P[h(x) =h(y)] \ge \frac 1 n -\frac 1 N$