National Computer Rank Examination (NCRE) Level 2 Python, Set 16 - Comprehensive Application - 46 - Comprehensive

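Count word frequencies and write out the result. The requirements are:
(1) Segment the text in "红楼梦.txt" into words and normalize character names. Normalize only the following: 凤姐, 凤姐儿 and 凤丫头 become 凤姐; 宝玉, 二爷 and 宝二爷 become 宝玉; 黛玉, 颦儿, 林妹妹 and 黛玉道 become 黛玉; 宝钗 and 宝丫头 become 宝钗; 贾母 and 老祖宗 become 贾母; 袭人 and 袭人道 become 袭人; 贾政 and 贾政道 become 贾政; 贾琏 and 琏二爷 become 贾琏.
(2) Do not count any word that appears in the file "停用词.txt".
(3) Extract the character names that appear at least 40 times and save each name with its number of appearances to result.csv, sorted by count in descending order; names with the same count are ordered by the character order of the name. For example: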
宝玉,123
凤姐,101
...
The name and the count are separated by an ASCII comma with no spaces, one record per line.
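
The requirements above can be sketched on a small, hand-made token list without running the segmenter. This is only an illustration under stated assumptions: the helper names `count_names` and `to_csv_lines` are hypothetical, the alias table is copied from requirement (1), and ties are ordered by name in ascending code-point order, which is one possible reading of "character order".

```python
from collections import Counter

# Alias -> canonical name, copied from requirement (1).
ALIASES = {
    "凤姐儿": "凤姐", "凤丫头": "凤姐",
    "二爷": "宝玉", "宝二爷": "宝玉",
    "颦儿": "黛玉", "林妹妹": "黛玉", "黛玉道": "黛玉",
    "宝丫头": "宝钗",
    "老祖宗": "贾母",
    "袭人道": "袭人",
    "贾政道": "贾政",
    "琏二爷": "贾琏",
}


def count_names(tokens, stop_words=frozenset()):
    """Hypothetical helper: count tokens after filtering and normalization."""
    counts = Counter()
    for tok in tokens:
        # Skip single characters and stop words, as the task requires.
        if len(tok) < 2 or tok in stop_words:
            continue
        # Map an alias to its canonical name; other tokens are kept as they are.
        counts[ALIASES.get(tok, tok)] += 1
    return counts


def to_csv_lines(counts, min_count):
    """Hypothetical helper: 'name,count' rows, count descending, ties by name ascending."""
    rows = [(name, n) for name, n in counts.items() if n >= min_count]
    rows.sort(key=lambda r: (-r[1], r[0]))
    return ["{},{}".format(name, n) for name, n in rows]


# Toy input standing in for the segmenter output (made up for illustration).
tokens = ["宝玉", "二爷", "凤丫头", "凤姐", "的", "黛玉道", "一个"]
counts = count_names(tokens, stop_words={"一个"})
lines = to_csv_lines(counts, min_count=2)
print("\n".join(lines))
```

Negating the count in the sort key keeps the name in ascending order for ties. The reference answer below instead sorts the `(count, name)` pair with `reverse=True`, which orders tied names in descending order.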

Reference answer

    import jieba

    f = "红楼梦.txt"
    sf = "停用词.txt"

    # Read the novel text
    with open(f, "r", encoding="utf-8") as fi:
        data = fi.read()

    # Read the stop-word file; `words` is the raw file content,
    # so `i in words` below is a substring test on that text
    with open(sf, "r", encoding="utf-8") as fo:
        words = fo.read()

    # Word segmentation
    ls = jieba.lcut(data)
    d = {}
    # Additional frequent words that are not character names
    word = ["一个", "如今", "一面", "众人", "说道", "只见", "不知", "两个", "起来",
            "二人", "今日", "听见", "不敢", "不能", "东西", "只得", "心中", "回来",
            "几个", "原来", "进来", "出去", "一时", "银子", "起身", "答应", "回去"]

    for i in ls:
        if len(i) < 2 or i in words or i in word:
            continue  # not counted
        # Normalize character names
        if i in ["凤姐", "凤姐儿", "凤丫头"]:
            i = "凤姐"
        elif i in ["宝玉", "二爷", "宝二爷"]:
            i = "宝玉"
        elif i in ["黛玉", "颦儿", "林妹妹", "黛玉道"]:
            i = "黛玉"
        elif i in ["宝钗", "宝丫头"]:
            i = "宝钗"
        elif i in ["贾母", "老祖宗"]:
            i = "贾母"
        elif i in ["袭人", "袭人道"]:
            i = "袭人"
        elif i in ["贾政", "贾政道"]:
            i = "贾政"
        elif i in ["贾琏", "琏二爷"]:
            i = "贾琏"

        d[i] = d.get(i, 0) + 1

    items = list(d.items())
    # Sort in descending order by count; ties are ordered by name, also descending
    items.sort(key=lambda x: (x[1], x[0]), reverse=True)

    with open("result.csv", "w") as out:
        for name, count in items:
            if count < 40:
                break  # the list is sorted, so every remaining count is below 40
            out.write("{},{}\n".format(name, count))

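
In the reference answer, `words` holds the entire stop-word file as one string, so `i in words` is a substring test and not a whole-word lookup. The sketch below contrasts the two behaviors. It assumes the stop words are separated by whitespace (for example one per line), which the source does not state; `parse_stop_words` is a hypothetical helper and the sample text is made up.

```python
def parse_stop_words(text):
    """Hypothetical helper: split raw file content into a set of stop words.

    Assumes the entries are separated by whitespace, e.g. one word per line.
    """
    return set(text.split())


# Made-up sample standing in for the content of the stop-word file.
raw = "一个人\n不知道\n"
stop_words = parse_stop_words(raw)

# Substring test on the raw text: "一个" is found inside "一个人".
print("一个" in raw)           # True
# Set membership: only whole entries match.
print("一个" in stop_words)    # False
print("一个人" in stop_words)  # True
```

A set also makes each lookup constant time on average, which matters when the token list covers a whole novel.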