Jack's Blog

A flowing heart cannot be held back, and the blowing wind cannot be stopped.

Python--Inverted Index

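I. Objectives

  1. Master the basic operations on lists, sets, and dictionaries (definition, assignment, and use), and become familiar with the general workflow for handling complex data types.
  2. Become familiar with the common functions and techniques for lists, sets, and dictionaries.
  3. Test the ability to process text flexibly and to apply sorting.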

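II. Lab Content

An inverted index (倒排索引, also often called 反向索引, literally "reverse index") is an indexing method that records which documents each word appears in. It is the most commonly used data structure in information retrieval systems. Given a word, an inverted index makes it possible to quickly retrieve the list of documents that contain it.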
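To make the idea concrete, here is a minimal Python 3 sketch that inverts a forward mapping (document ID to text) into a word-to-documents mapping. The toy corpus and the function name `invert` are hypothetical and exist only for illustration.

```python
# Hypothetical toy corpus: document ID -> text.
docs = {
    1: "the cat sat",
    2: "the dog sat",
    3: "the cat ran",
}

def invert(docs):
    """Map each word to the set of document IDs that contain it."""
    inverted = {}
    for doc_id, text in docs.items():
        for word in text.split():
            # setdefault creates the empty set on a word's first occurrence.
            inverted.setdefault(word, set()).add(doc_id)
    return inverted

inverted = invert(docs)
print(sorted(inverted["cat"]))  # [1, 3]
print(sorted(inverted["sat"]))  # [1, 2]
```

Looking up a word is then a single dictionary access, instead of a scan over every document.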

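This lab implements the following four functions:

(1). Build the index: first read 100 lines of text, which are used to build the inverted index. Each line consists of words made up entirely of lowercase letters, with no punctuation, separated by spaces. Read the words one by one and build a dictionary of <word, set of line numbers on which the word appears> entries, with line numbers starting from 1.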

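(2). Print the index: output each word and the positions where it appears, in alphabetical order of the words, with the line numbers of each word in ascending order. For example, if "created" appears on lines 3 and 20, and "dead" appears on lines 14, 20, and 22, the output is as follows (each colon and comma is followed by one space, and line numbers are not repeated):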

created: 3, 20

dead: 14, 20, 22

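(3). Search: next, read the query strings, one query per line. Each query consists of several keywords separated by spaces, all of them lowercase words. For each query, output on one line the numbers of the lines that contain all of the keywords, in ascending order. If any keyword never appears in any line, or if no line contains all of the keywords, output "None". An empty line marks the end of the query input. For the index built above, the query "created" outputs "3, 20", the query "created dead" outputs "20", and the query "abcde dead" outputs "None".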

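(4). Advanced search: if a query begins with "AND:", perform an AND search, which outputs the lines that contain all of the keywords. If a query begins with "OR:", perform an OR search, in which a line qualifies as soon as it contains at least one of the keywords. By default (the query begins with neither "AND:" nor "OR:"), perform an AND search.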

Complete the functions above in order (name the submitted program "StudentID_Name_5.py").
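The four functions above can be sketched in Python 3 as a few small functions. This is a minimal sketch, not the assignment's reference solution: the names `build_index`, `format_index`, and `search` are hypothetical, and the functions take and return plain values rather than reading from standard input. Treating a keyword that is missing from the index as matching no lines in an OR search is also an assumption, since the text only defines the missing-keyword case for the AND search.

```python
from collections import defaultdict

def build_index(lines):
    """(1) Map each word to the set of 1-based line numbers it appears on."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            index[word].add(line_no)
    return dict(index)

def format_index(index):
    """(2) Return 'word: n1, n2' strings, words and numbers both sorted."""
    return [
        word + ": " + ", ".join(str(n) for n in sorted(index[word]))
        for word in sorted(index)
    ]

def search(index, query):
    """(3)/(4) AND search by default; an 'AND:' or 'OR:' prefix picks the mode."""
    mode = "AND"
    for prefix in ("AND:", "OR:"):
        if query.startswith(prefix):
            mode, query = prefix[:-1], query[len(prefix):]
            break
    # Assumption: a keyword absent from the index matches no lines.
    sets = [index.get(word, set()) for word in query.split()]
    if not sets:
        return "None"
    rows = set.union(*sets) if mode == "OR" else set.intersection(*sets)
    return ", ".join(str(n) for n in sorted(rows)) or "None"

# Hypothetical three-line input (the assignment uses 100 lines).
lines = ["dead end", "created today", "created and dead"]
index = build_index(lines)
print(format_index(index))
print(search(index, "created dead"))      # 3
print(search(index, "OR: created dead"))  # 1, 2, 3
print(search(index, "abcde dead"))        # None
```

Keeping the input handling out of these functions makes each step easy to check on its own against the examples in the assignment text.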


The code follows (Python 2):

    # Part 1: build the index

    dict = {}  # word -> set of line numbers (note: shadows the built-in dict)
    n = 100    # number of input lines
    for row in range(0, n):

        information = raw_input()

        # Split the line into words on whitespace.
        line_words = information.split()

        for word in line_words:

            # First occurrence of the word: start a new set of line numbers.
            if word not in dict:
                item = set()
                item.add(row + 1)  # line numbers start at 1
                dict[word] = item

            # The word has appeared before: add the current line number.
            else:
                dict[word].add(row + 1)

    # print dict  # uncomment to inspect the raw index
      
                  
    # Part 2: print the index

    # Sort the (word, line-number set) pairs alphabetically by word.
    word_list = sorted(dict.items(), key=lambda item: item[0])

    for word, row in word_list:

        # Sort the line numbers and convert them to strings for joining.
        list_row = [str(r) for r in sorted(row)]

        # Print in the required "word: n1, n2" format.
        print word + ':', ', '.join(list_row)
      
      
    # Part 3: query

    # Return 1 if every keyword of the query is in the index, else 0.
    def judger(dict, query):
        list_query = query.split()
        for word in list_query:
            if word not in dict:
                return 0  # at least one keyword is missing from the index
        return 1  # all keywords are in the index

    query_list = []

    # Read queries until an empty line is entered.
    while True:
        query = raw_input()
        if query == '':
            break
        query_list.append(query)

    # Process each query in turn.
    for list_query in query_list:

        # Some keyword never appears in any line.
        if judger(dict, list_query) == 0:
            print 'None'

        else:
            list_query = list_query.split()

            # Start from the union of the rows of every keyword, ...
            query_set = set()
            for isquery in list_query:
                query_set = query_set | dict[isquery]

            # ... then intersect to keep only the rows containing every keyword.
            for isquery in list_query:
                query_set = query_set & dict[isquery]

            # No line contains all of the keywords.
            if len(query_set) == 0:
                print 'None'

            else:
                query_result = [str(m) for m in sorted(query_set)]
                print ', '.join(query_result)
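Part 3 above reaches the AND result in two passes: a union over the rows of every keyword, followed by an intersection. As a side note, the intersection alone gives the same rows. A minimal Python 3 sketch using the example index from the assignment text; `rows_two_pass` and `rows_one_call` are hypothetical names:

```python
index = {"created": {3, 20}, "dead": {14, 20, 22}}

def rows_two_pass(index, keywords):
    """Union every keyword's rows, then intersect, as in Part 3."""
    rows = set()
    for word in keywords:
        rows |= index[word]
    for word in keywords:
        rows &= index[word]
    return rows

def rows_one_call(index, keywords):
    """Intersect the row sets directly; assumes at least one keyword."""
    return set.intersection(*(index[word] for word in keywords))

print(sorted(rows_two_pass(index, ["created", "dead"])))  # [20]
print(sorted(rows_one_call(index, ["created", "dead"])))  # [20]
```

The two-pass form has one practical difference: it also tolerates an empty keyword list, for which it returns an empty set.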

Python 3 version:

word_dic = {}  # word -> set of line numbers
line_len = 3   # number of input lines; the assignment specifies 100

# Build the index
for i in range(line_len):
    line = input()
    words = line.split()
    for word in words:
        if word in word_dic:
            word_dic[word].add(i+1)
        else:
            word_set = set()
            word_set.add(i+1)
            word_dic[word] = word_set

# Print the index, with words in alphabetical order
word_list = sorted(word_dic.items())

for word, rows in word_list:
    # Sort the line numbers and convert them to strings for joining.
    list_line = [str(n) for n in sorted(rows)]
    # Print once per word, with exactly one space after the colon.
    print(word + ': ' + ', '.join(list_line))

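# Search and advanced search
def judger(index, keywords):
    # Return True only if every keyword is in the index.
    for word in keywords:
        if word not in index:
            return False
    return True

# Read queries until an empty line is entered.
query_list = []
while True:
    query = input()
    if query == '':
        break
    query_list.append(query)

for query in query_list:
    # A leading "AND:" or "OR:" selects the mode; the default is AND.
    mode = 'AND'
    if query.startswith('AND:'):
        query = query[len('AND:'):]
    elif query.startswith('OR:'):
        mode = 'OR'
        query = query[len('OR:'):]
    list_query = query.split()

    query_set = set()
    if mode == 'OR':
        # Union: a line qualifies if it contains any keyword.
        # A keyword missing from the index is treated as matching no lines.
        for isquery in list_query:
            query_set = query_set | word_dic.get(isquery, set())
    elif list_query and judger(word_dic, list_query):
        # Intersection: a line qualifies only if it contains every keyword.
        query_set = set(word_dic[list_query[0]])
        for isquery in list_query[1:]:
            query_set = query_set & word_dic[isquery]

    if len(query_set) == 0:
        print('None')
    else:
        query_result = [str(m) for m in sorted(query_set)]
        print(', '.join(query_result))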