CS50 Introduction to Artificial Intelligence with Python Lecture 7

The topic of Lecture 7 is Language. This post summarizes the lecture and the seventh project.

Course page: https://cs50.harvard.edu/ai/

Note: all figures are taken from the course slides.

Natural Language Processing

Context-Free Grammar

A context-free grammar specifies the syntactic rules of a language, for example:

NP → N | D N
N → she | city | car | Harry | …
D → the | a | an | …
V → saw | ate | walked | …
P → to | on | over | …
ADJ → blue | busy | old | …

Parsing a sentence with a context-free grammar produces a parse tree (shown as a figure in the slides).

nltk
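The lecture demonstrates this kind of parsing with the nltk library. A minimal sketch, assuming the small rule fragment above (the S and VP rules are added here so a whole sentence can be parsed; they are not part of the listed fragment):

import nltk

# a small grammar in the spirit of the rules above
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N
    VP -> V | V NP
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "Harry"
    V -> "saw" | "ate" | "walked"
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["she", "saw", "the", "city"]):
    tree.pretty_print()   # draw the parse tree as ASCII art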

$n$-gram

A contiguous sequence of $n$ items from a text, for example $n$ consecutive words or characters.
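A quick sketch of extracting word trigrams (toy sentence invented for illustration):

from nltk.util import ngrams

tokens = "she saw the red car in the city".split()
# contiguous word trigrams (n = 3)
print(list(ngrams(tokens, 3))[:2])
# [('she', 'saw', 'the'), ('saw', 'the', 'red')]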

tokenization

The task of splitting a sequence of characters into pieces (tokens); common cases are word tokenization and sentence tokenization.
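For example, word and sentence tokenization with nltk (the punkt tokenizer data must be downloaded once):

import nltk
# nltk.download("punkt")  # needed once for the tokenizer models

text = "NLP is fun. It has many tasks."
print(nltk.sent_tokenize(text))  # ['NLP is fun.', 'It has many tasks.']
print(nltk.word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'It', ...]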

bag-of-words model

A model that represents text as an unordered collection of its words.
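A bag of words is essentially a word-to-count mapping with order discarded; a minimal sketch:

from collections import Counter

doc = "the car passed the old car in the city"
bag = Counter(doc.split())
print(bag)  # Counter({'the': 3, 'car': 2, 'passed': 1, ...})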

Naive Bayes

Consider a text classification problem, for example sentiment analysis: deciding whether a piece of text (such as a product review) expresses a positive or negative opinion.

Naive Bayes assumes conditional independence of the words given the label, i.e.

$$P(\text{label} \mid w_1, \dots, w_n) \propto P(\text{label}) \prod_{i=1}^{n} P(w_i \mid \text{label})$$

where the factors are estimated from counts, for example

$$P(\text{label}) = \frac{\#\{\text{samples with label}\}}{\#\{\text{samples}\}}, \qquad P(w_i \mid \text{label}) = \frac{\#\{\text{samples with label that contain } w_i\}}{\#\{\text{samples with label}\}}$$

additive smoothing (Laplace smoothing)

Note that the count-based parameter estimates above produce many zeros (a word that never co-occurs with a label gets probability $0$, which zeroes out the whole product), so the estimates need to be smoothed. A common approach is to add $\alpha$ to every count; $\alpha = 1$ gives Laplace smoothing.
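A minimal sketch of a Naive Bayes classifier with additive smoothing (toy data invented for illustration; this sketch estimates P(word | label) from word counts, whereas the lecture counts documents containing the word, but the smoothing idea is the same):

import math
from collections import Counter

def train(docs):
    # docs: list of (word_list, label) pairs
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(words, label_counts, word_counts, vocab, alpha=1.0):
    # pick the label maximizing log P(label) + sum_i log P(w_i | label)
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total = sum(word_counts[label].values())
        for w in words:
            # additive smoothing: add alpha to every count (alpha = 1 is Laplace smoothing)
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

docs = [("loved it so much fun".split(), "positive"),
        ("great purchase would buy again".split(), "positive"),
        ("overpriced and boring".split(), "negative")]
print(classify("so much fun".split(), *train(docs)))  # expected: positive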

information retrieval

Information retrieval is the task of finding documents relevant to a user's query.

topic modeling

Topic modeling refers to models for discovering the topics of a set of documents.

term frequency

Term frequency is the number of times a term appears in a document.

function words

Words that carry little meaning on their own but connect other words grammatically, e.g. am, by, do, is, which, with, yet, …

content words

Words that carry meaning independently, e.g. algorithm, category, computer, …

inverse document frequency

Inverse document frequency measures how common or rare a word is across a set of documents; it is computed as

$$\mathrm{idf}(w) = \ln\frac{\text{TotalDocuments}}{\text{NumDocumentsContaining}(w)}$$

tf-idf

Ranks how important a word is in a document by multiplying its term frequency (TF) by its inverse document frequency (IDF).
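A minimal sketch of scoring a document's words by tf-idf (tiny invented corpus; natural logarithm for idf, matching the project code further below):

import math
from collections import Counter

corpus = {
    "doc1": "the cat sat on the mat".split(),
    "doc2": "the dog sat on the log".split(),
    "doc3": "cats chase dogs".split(),
}

def idf(word):
    # idf(w) = ln(total documents / number of documents containing w)
    containing = sum(1 for words in corpus.values() if word in words)
    return math.log(len(corpus) / containing)

def tfidf(doc):
    tf = Counter(corpus[doc])  # term frequency: raw counts in the document
    return {w: tf[w] * idf(w) for w in tf}

# highest-scoring words of doc1
print(sorted(tfidf("doc1").items(), key=lambda kv: -kv[1])[:3])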

Word Representation
one-hot representation
  • he [1, 0, 0, 0]
  • wrote [0, 1, 0, 0]
  • a [0, 0, 1, 0]
  • book [0, 0, 0, 1]

One-hot representations cannot express relationships between words and do not scale well: adding a new word requires re-encoding the whole vocabulary.

word2vec

The method builds on the idea that a word is strongly characterized by its context. A neural network is trained on this objective, and the resulting vectors show interesting regularities (the slides give examples; the classic one is that vector arithmetic such as king − man + woman lands near queen).
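The lecture demonstrates this with pre-trained vectors; as a rough sketch of training and querying such embeddings yourself, assuming the third-party gensim library (the toy corpus is invented, so the resulting vectors are meaningless in practice):

from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens
sentences = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "a man walked to the city".split(),
    "a woman walked to the city".split(),
]

# skip-gram word2vec on the toy corpus (gensim >= 4 uses `vector_size`)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["king"][:5])                   # the first few vector components
print(model.wv.most_similar("king", topn=3))  # nearest words by cosine similarity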

Project

Parser

import nltk

def judge(word):
    # return True if the word contains at least one alphabetic character
    flag = False
    for s in word:
        if s.isalpha():
            flag = True
            break

    return flag

def preprocess(sentence):
    """
    Convert `sentence` to a list of its words.
    Pre-process sentence by converting all characters to lowercase
    and removing any word that does not contain at least one alphabetic
    character.
    """
    words = nltk.wordpunct_tokenize(sentence)
    res = []
    for word in words:
        word_lower = word.lower()
        if judge(word_lower):
            res.append(word_lower)

    return res

def np_chunk_helper(tree, List):
    if tree is None:
        return
    # count the NP subtrees (tree.subtrees() includes the tree itself)
    cnt = 0
    tmp = []
    for t in tree.subtrees():
        if t.label() == "NP":
            tmp.append(t)
            cnt += 1
    # exactly one NP: it contains no other NP, so record it
    if cnt == 1:
        t = tmp[0]
        # avoid duplicates
        if t not in List:
            List.append(t)
        return
    elif cnt > 1:
        # more than one NP: recurse into the proper subtrees
        for t in tree.subtrees():
            # skip the tree itself
            if t != tree:
                np_chunk_helper(t, List)

        return

    return

def np_chunk(tree):
    """
    Return a list of all noun phrase chunks in the sentence tree.
    A noun phrase chunk is defined as any subtree of the sentence
    whose label is "NP" that does not itself contain any other
    noun phrases as subtrees.
    """
    List = []
    np_chunk_helper(tree, List)

    return List
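For context, a hedged sketch of how these helpers are driven when parsing (the real parser.py distribution code defines its own NONTERMINALS and TERMINALS rules as part of the assignment; the tiny grammar here is hypothetical):

import nltk

NONTERMINALS = """
S -> NP VP
NP -> N | Det N
VP -> V | V NP
"""
TERMINALS = """
Det -> "the" | "a"
N -> "holmes" | "city"
V -> "saw"
"""

grammar = nltk.CFG.fromstring(NONTERMINALS + TERMINALS)
parser = nltk.ChartParser(grammar)

tokens = preprocess("Holmes saw the city.")   # lowercase words containing letters
for tree in parser.parse(tokens):
    tree.pretty_print()
    for np in np_chunk(tree):                 # minimal NP subtrees
        print(" ".join(np.flatten()))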

Questions

import functools
import os
import string

import nltk
import numpy as np

def load_files(directory):
    """
    Given a directory name, return a dictionary mapping the filename of each
    `.txt` file inside that directory to the file's contents as a string.
    """
    res = dict()
    filenames = os.listdir(directory)
    for filename in filenames:
        path = os.path.join(directory, filename)
        sentence = ""
        with open(path, "r", encoding='UTF-8') as file:
            for f in file.readlines():
                sentence += f.strip() + " "
        res[filename] = sentence

    return res

def tokenize(document):
    """
    Given a document (represented as a string), return a list of all of the
    words in that document, in order.

    Process document by converting all words to lowercase, and removing any
    punctuation or English stopwords.
    """
    tmp = nltk.word_tokenize(document)
    words = [t.lower() for t in tmp]
    stopwords = set(nltk.corpus.stopwords.words("english"))
    res = []
    for word in words:
        if word not in string.punctuation and word not in stopwords:
            res.append(word)

    return res


def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of documents to a list
    of words, return a dictionary that maps words to their IDF values.

    Any word that appears in at least one of the documents should be in the
    resulting dictionary.
    """
    res = dict()
    n = len(documents)
    tmp_documents = dict()
    all_words = set()
    for filename in documents:
        tmp_documents[filename] = set(documents[filename])
        all_words.update(tmp_documents[filename])

    for word in all_words:
        # number of documents containing the word
        cnt = 0
        for filename in tmp_documents:
            if word in tmp_documents[filename]:
                cnt += 1
        res[word] = np.log(n / cnt)

    return res

def compute_tf(word, words):
    # term frequency: number of occurrences of `word` in `words`
    cnt = 0
    for w in words:
        if w == word:
            cnt += 1

    return cnt

def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` (a dictionary mapping names of
    files to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the filenames of the `n` top
    files that match the query, ranked according to tf-idf.
    """
    tmp = []
    for filename in files:
        score = 0
        for word in query:
            if word in idfs:
                score += compute_tf(word, files[filename]) * idfs[word]
        tmp.append((filename, score))

    tmp.sort(key=lambda x: -x[1])
    res = []
    for i in range(n):
        res.append(tmp[i][0])

    return res

def compute_cnt(word, sentence):
    # query term density contribution: occurrences of `word` / sentence length
    score = 0
    for s in sentence:
        if s == word:
            score += 1

    return score / len(sentence)

# comparator: larger idf score first; ties broken by larger query term density
def cmp(a, b):
    if a[1] != b[1]:
        return b[1] - a[1]
    else:
        return b[2] - a[2]

def top_sentences(query, sentences, idfs, n):
    """
    Given a `query` (a set of words), `sentences` (a dictionary mapping
    sentences to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the `n` top sentences that match
    the query, ranked according to idf. If there are ties, preference should
    be given to sentences that have a higher query term density.
    """
    tmp = []
    for sentence in sentences:
        s1 = 0
        s2 = 0
        for word in query:
            if word in sentences[sentence]:
                s1 += idfs[word]
                s2 += compute_cnt(word, sentences[sentence])
        tmp.append((sentence, s1, s2))

    tmp = sorted(tmp, key=functools.cmp_to_key(cmp))

    res = []
    for i in range(n):
        res.append(tmp[i][0])

    return res
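For context, a hedged sketch of how these functions fit together (the real questions.py distribution code handles command-line arguments and its own constants; the corpus path and query below are hypothetical):

import nltk

files = load_files("corpus")                        # filename -> raw text
file_words = {name: tokenize(text) for name, text in files.items()}
file_idfs = compute_idfs(file_words)

query = set(tokenize("what is a neural network"))
best_files = top_files(query, file_words, file_idfs, n=1)

# split the top file(s) into sentences and rank those by idf / term density
sentences = dict()
for name in best_files:
    for sentence in nltk.sent_tokenize(files[name]):
        words = tokenize(sentence)
        if words:
            sentences[sentence] = words

sentence_idfs = compute_idfs(sentences)
print(top_sentences(query, sentences, sentence_idfs, n=1))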

Title: CS50 Introduction to Artificial Intelligence with Python Lecture 7

Author: Doraemonzzz

Published: 2020-05-06 12:59:28

Last updated: 2020-05-07 16:47:46

Original link: http://doraemonzzz.com/2020/05/06/CS50 Introduction to Artificial Intelligence with Python Lecture 7/

License: CC BY-NC-ND 4.0 International. Please keep the original link and author attribution when reposting.