homework
In [1]:
## Import packages and define helper functions
import re
import numpy as np
import scipy.sparse as sparse
from sklearn import preprocessing


def net_to_csr(row, col, i):
    # Build an i x i sparse adjacency matrix in CSR format from parallel
    # lists of source and target indices (one edge per (row, col) pair).
    Mat = sparse.coo_matrix((np.ones(len(row)), (row, col)), shape=(i, i)).tocsr()
    return Mat


def mat_normalize(Mat):
    # L1-normalize each row, then transpose: the result is column-stochastic,
    # i.e. each column sums to 1 (all-zero rows of dangling nodes stay zero).
    Mat = preprocessing.normalize(Mat, norm='l1').T
    return Mat


def calculate_pagerank(netMat, escape=0.85, rho=1e-5, max_iter=1000, details=0):
    # Damped power iteration: x <- p * M x + (1 - p), where M = netMat is
    # column-stochastic. `escape` is the damping factor p; iteration stops
    # once the L1 change between successive iterates drops below rho.
    i = netMat.shape[1]
    x = np.ones(i)
    p = escape
    a = 1 - p
    for j in range(max_iter):
        x_n = p * netMat.dot(x) + a
        err = np.sum(np.abs(x - x_n))
        if details:
            print("iter:{}     res-error:{}".format(j, err))
        if err < rho:
            break
        x = x_n
    return x_n

def topprint(ranked, num=20):
    # Print the top `num` entries of a ranked list of (name, score) pairs.
    for i in range(num):
        print(ranked[i])
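
For reference, calculate_pagerank is the standard damped power iteration. Writing $M$ for the column-stochastic matrix produced by mat_normalize, each step computes

$$x^{(k+1)} = p\,M x^{(k)} + (1-p)\,\mathbf{1},$$

stopping once $\lVert x^{(k+1)} - x^{(k)} \rVert_1 < \rho$. Note that the teleport term here is $(1-p)$ per node rather than $(1-p)/N$, so the scores below sum to roughly the number of nodes $N$ instead of 1; the ranking is unaffected by this scaling.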

Data preprocessing

We first put the data into a uniform format for later processing. To balance speed and flexibility, we use a dictionary to store the metadata and a list of tuples to store the citation relations.
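
For orientation, the two regular expressions below assume records shaped like the following (illustrative lines only; the values are made up):

id = {A00-1001}
author = {Last, First; Last2, First2}
title = {Some Paper Title}
venue = {ACL}
year = {2000}

and, in acl.txt, one citation per line:

A00-1001 ==> A00-1002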

In [2]:
## Read the metadata from acl-metadata.txt.
with open('acl-metadata.txt', encoding='iso-8859-16') as f:
    paperid = list()
    authors = list()
    title = list()
    venue = list()
    year = list()
    n = 0
    # Each metadata line has the form: label = {value}
    arrow = re.compile('(.*) = {(.*)}')
    for line in f:
        match = arrow.search(line)
        if match:
            label = match.group(1)
            item = match.group(2)
            if label == 'id':
                n = n + 1
                paperid.append(item)
            elif label == 'author':
                authors.append(item)
            elif label == 'title':
                title.append(item)
            elif label == 'venue':
                venue.append(item)
            else:  # the only remaining label is 'year'
                year.append(item)
    metadata = dict()
    for i in range(0, n):
        metadata[paperid[i]] = {'author': authors[i], 'title': title[i],
                                'venue': venue[i], 'year': year[i]}

# Read the citation data from acl.txt
with open('acl.txt') as g:
    citedata = list()
    # Each citation line has the form: citing_id ==> cited_id
    arrow = re.compile('(.*) ==> (.*)')
    for line in g:
        match = arrow.search(line)
        if match:
            citedata.append((match.group(1), match.group(2)))

## From here on, everything works from metadata and citedata as read above.
# metadata is a dict keyed by paper ID; each value is a dict of the form
# {'author': author, 'title': title, 'venue': venue, 'year': year}
# citedata is a list of 2-tuples of the form (paper_id_A, paper_id_B)
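
With both files parsed, a quick sanity check (a sketch; it just reports how many records were read):

print(len(metadata), 'papers,', len(citedata), 'citations')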

Computing the paper PageRank

Using the data above, the paper PageRank is built in the following order:

  • First, initialize a paper adjacency matrix. (The adjacency matrix represents a directed multigraph.)
  • Then, read the citation relations one by one: each A ==> B adds an edge from A to B.
  • Once all edges are added, compute PageRank from this matrix.

After computing, we print the 20 papers with the highest PageRank.

In [8]:
## Compute the paper PageRank

# Initialize the adjacency matrix: map each paper ID to a matrix index
papers = dict()
j = 0
for paper in metadata:
    if paper not in papers:
        papers[paper] = j
        j = j + 1
paper_mat = sparse.dok_matrix((j, j))

# Add edges: each citation A ==> B adds one edge from A to B
for cite in citedata:
    paper_mat[papers[cite[0]], papers[cite[1]]] += 1

# Normalize and iterate
paper_mat = mat_normalize(paper_mat.tocsr())
p_r = calculate_pagerank(paper_mat, escape=0.85, rho=1e-5, max_iter=1000, details=0)

# Sort by score (papers are keyed by title here)
paper_rank = dict()
for paper in papers:
    paper_rank[metadata[paper]['title']] = p_r[papers[paper]]
paper_rank = sorted(paper_rank.items(), key=lambda d: d[1], reverse=True)
topprint(paper_rank)
('A Stochastic Parts Program And Noun Phrase Parser For Unrestricted Text', 244.15283643104439)
('Finding Clauses In Unrestricted Text By Finitary And Stochastic Methods', 209.45708639080212)
('A Stochastic Approach To Parsing', 141.65469852327084)
('A Statistical Approach To Machine Translation', 119.71976953801889)
('The Contribution Of Parsing To Prosodic Phrasing In An Experimental Text-To-Speech System', 89.885370739846238)
('Building A Large Annotated Corpus Of English: The Penn Treebank', 82.465999402086624)
('The Mathematics Of Statistical Machine Translation: Parameter Estimation', 80.539415150271111)
('Attention Intentions And The Structure Of Discourse', 60.098641049948824)
('Deterministic Parsing Of Syntactic Non-Fluencies', 45.059238731941527)
('A Statistical Approach To Language Translation', 43.507364731513967)
('Class-Based N-Gram Models Of Natural Language', 41.933398095649849)
('Word-Sense Disambiguation Using Statistical Methods', 41.916640760669345)
('A Maximum Entropy Approach To Natural Language Processing', 40.985294464218583)
('Bleu: A Method For Automatic Evaluation Of Machine Translation', 34.775296237631672)
('Aligning Sentences In Parallel Corpora', 33.374603312297836)
('The Semantics Of Grammar Formalisms Seen As Computer Languages', 32.87845609235098)
('Grammatical Category Disambiguation By Statistical Optimization', 30.97523171298554)
('The Text System For Natural Language Generation: An Overview', 30.556023028266456)
('A Maximum Entropy Model For Part-Of-Speech Tagging', 29.603371914735625)
('A Procedure For Quantitatively Comparing The Syntactic Coverage Of English Grammars', 28.887702505766203)
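
Since papers maps IDs to matrix indices, an individual paper's score can also be read off directly (a minimal sketch; the ID below is a placeholder for any key of metadata):

pid = 'A00-1001'  # placeholder paper ID
print(metadata[pid]['title'], p_r[papers[pid]])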

Computing the venue PageRank

Using the data above, the venue PageRank is built in the following order:

  • First, initialize a venue adjacency matrix. (The adjacency matrix represents a directed multigraph.)
  • Then, read the citation relations one by one: for each A ==> B, look up the venues a and b of papers A and B, and add an edge from a to b.
  • Once all edges are added, compute PageRank from this matrix.
In [9]:
## Compute the venue PageRank

# Initialize the adjacency matrix: map each venue name to a matrix index
venues = dict()
j = 0
for paper in metadata:
    venue_name = metadata[paper]['venue']
    if venue_name not in venues:
        venues[venue_name] = j
        j = j + 1
# The venue graph is small, so a dense matrix is fine here
venue_mat = np.zeros([len(venues), len(venues)])

# Add edges: each citation A ==> B adds an edge from A's venue to B's venue
for cite in citedata:
    venue_mat[venues[metadata[cite[0]]['venue']], venues[metadata[cite[1]]['venue']]] += 1

# Normalize and iterate
venue_mat = mat_normalize(venue_mat)
v_r = calculate_pagerank(venue_mat, escape=0.85, rho=1e-5, max_iter=1000, details=0)

# Sort by score
venues_rank = dict()
for venue in venues:
    venues_rank[venue] = v_r[venues[venue]]
venues_rank = sorted(venues_rank.items(), key=lambda d: d[1], reverse=True)
topprint(venues_rank)
('ACL', 70.706727495417852)
('COLING', 42.832950071596649)
('CL', 39.292400111036997)
('EMNLP', 23.672341295691211)
('NAACL', 17.615968452252787)
('HLT', 13.645554708209707)
('CoNLL', 12.132319386741433)
('ANLP', 9.8833201022790131)
('EACL', 9.3469224441660597)
('Workshop On Speech And Natural Language', 8.0671489511315819)
('IJCNLP', 4.7168682538169575)
('Workshop on Statistical Machine Translation', 3.522085497796732)
('MUC', 3.3004139785694417)
('SIGDIAL', 2.4753753426338241)
('AJCL', 2.3698505578285762)
('Workshop on Semantic Evaluations (SemEval)', 2.2079089278500286)
('2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora', 1.9271831936510022)
('TINLP', 1.9138077985689321)
('LREC', 1.6114779926229823)
('INLG', 1.3633644958808695)
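
One caveat: within-venue citations land on the diagonal of venue_mat and are the most common kind, so they contribute heavily to these scores. To rank venues by cross-venue influence only, the diagonal could be zeroed right before the mat_normalize call (an optional variant, not used for the numbers above):

# Optional: drop within-venue citations before normalizing
np.fill_diagonal(venue_mat, 0)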

Computing the author PageRank

Using the data above, the author PageRank is built in the following order:

  • First, initialize an author adjacency matrix. (The adjacency matrix represents a directed multigraph.)
  • Then, read the citation relations one by one: for each A ==> B, look up the authors a and b of papers A and B, and add an edge from a to b.
  • Once all edges are added, compute PageRank from this matrix.
In [12]:
## Compute the author PageRank

# Initialize the adjacency matrix: map each author string to a matrix index
# (note: the full author field is used as-is, so a co-author list is one node)
authors = dict()
j = 0
for paper in metadata:
    author = metadata[paper]['author']
    if author not in authors:
        authors[author] = j
        j = j + 1
author_mat = sparse.dok_matrix((j, j))

# Add edges: each citation A ==> B adds an edge from A's authors to B's authors
for cite in citedata:
    author_mat[authors[metadata[cite[0]]['author']], authors[metadata[cite[1]]['author']]] += 1

# Normalize and iterate
author_mat = mat_normalize(author_mat.tocsr())
a_r = calculate_pagerank(author_mat, escape=0.85, rho=1e-5, max_iter=1000, details=0)

# Sort by score
author_rank = dict()
for author in authors:
    author_rank[author] = a_r[authors[author]]
author_rank = sorted(author_rank.items(), key=lambda d: d[1], reverse=True)
topprint(author_rank)
('Church, Kenneth Ward', 163.54365554539066)
('Sampson, Geoffrey', 147.38837027979844)
('Brown, Peter F.; Cocke, John; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Jelinek, Frederick; Lafferty, John D.; Mercer, Robert L.; Roossin, Paul S.', 124.44001676127448)
('Shieber, Stuart M.', 86.499271733491057)
('Marcus, Mitchell P.; Marcinkiewicz, Mary Ann; Santorini, Beatrice', 84.580137744282467)
('Gale, William A.; Church, Kenneth Ward', 77.592677708877204)
('Yarowsky, David', 76.694580199697626)
('Hindle, Donald', 72.858655740437513)
('Brown, Peter F.; Della Pietra, Vincent J.; Della Pietra, Stephen A.; Mercer, Robert L.', 70.925518126752465)
('Brill, Eric', 66.189598967087605)
('Grosz, Barbara J.; Sidner, Candace L.', 58.801841522216399)
('Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Mercer, Robert L.', 58.45993158615051)
('Collins, Michael John', 56.6784687660966)
('Hobbs, Jerry R.', 54.105476751101804)
('Church, Kenneth Ward; Hanks, Patrick', 52.437286930051258)
('Och, Franz Josef; Ney, Hermann', 48.294724564180989)
('Brown, Peter F.; Cocke, John; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Jelinek, Frederick; Mercer, Robert L.; Roossin, Paul S.', 45.332514324773499)
('Brown, Peter F.; Lai, Jennifer C.; Mercer, Robert L.', 45.166240145564657)
('Baker, Collin F.; Fillmore, Charles J.; Lowe, John B.', 42.081894458806062)
('Calzolari, Nicoletta', 40.379912551712962)
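
As the list above shows, the author field is used verbatim, so a co-author string like 'Gale, William A.; Church, Kenneth Ward' is ranked as a node distinct from 'Church, Kenneth Ward' alone. Below is a minimal sketch of a per-person variant, assuming '; ' consistently separates co-authors (the names persons and person_mat are mine, not part of the assignment):

# Map each individual person to an index by splitting the author field
persons = dict()
k = 0
for paper in metadata:
    for name in metadata[paper]['author'].split('; '):
        if name not in persons:
            persons[name] = k
            k = k + 1

# Each citation A ==> B adds an edge from every author of A to every author of B
person_mat = sparse.dok_matrix((k, k))
for src, dst in citedata:
    for a in metadata[src]['author'].split('; '):
        for b in metadata[dst]['author'].split('; '):
            person_mat[persons[a], persons[b]] += 1

person_r = calculate_pagerank(mat_normalize(person_mat.tocsr()))
person_rank = sorted(((name, person_r[idx]) for name, idx in persons.items()),
                     key=lambda d: d[1], reverse=True)
topprint(person_rank)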