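How to extract word frequencies with Python

Published: 2022-04-30 17:54:28

❶ I have a txt document that has already been segmented with jieba. How can I use Python to count word frequencies in this segmented document? Looking for a script ...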

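#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os, random

# Assume the file to read is named aa.txt and sits in the current directory
filename = 'aa.txt'
dirname = os.getcwd()
f_n = os.path.join(dirname, filename)
# The commented-out section below is for testing the script: it generates 20 lines of data,
# each line holding a random 1-20 numbers, each number a random value from 1 to 20
'''
test = ''
for i in range(20):
    for j in range(random.randint(1, 20)):
        test += str(random.randint(1, 20)) + ' '
    test += '\n'
with open(f_n, 'w') as wf:
    wf.write(test)
'''
with open(f_n) as f:
    s = f.readlines()

# Strip leading/trailing spaces and the newline from each line, split on whitespace,
# and collect everything into one flat list
words = []
for line in s:
    words.extend(line.strip().split())

# Format one output row: the first and last columns are 8 wide, the middle one 18 wide
def geshi(a, b, c):
    return alignment(str(a)) + alignment(str(b), 18) + alignment(str(c)) + '\n'

# Mixed Chinese/English alignment, see http://bbs.fishc.com/thread-67465-1-1.html, second reply
# (padding with str.format misaligns because a Chinese character is two columns wide)
# alignment() pads mixed Chinese/English text to a fixed display width,
# measured as the byte length of the GB2312 encoding
def alignment(str1, space=8, align='left'):
    length = len(str1.encode('gb2312'))
    space = space - length if space >= length else 0
    if align in ['left', 'l', 'L', 'Left', 'LEFT']:
        str1 = str1 + ' ' * space
    elif align in ['right', 'r', 'R', 'Right', 'RIGHT']:
        str1 = ' ' * space + str1
    elif align in ['center', 'c', 'C', 'Center', 'CENTER', 'centre']:
        str1 = ' ' * (space // 2) + str1 + ' ' * (space - space // 2)
    return str1

w_s = geshi('No.', 'Word', 'Count')
# Build a list of (word, count) tuples, sorted by count descending and then by word ascending
# (a multi-key sort with one key descending and one ascending)
wordcount = sorted([(w, words.count(w)) for w in set(words)], key=lambda l: (-l[1], l[0]))
# Each output row is: index (8 wide), word (18 wide), count (8 wide) + '\n',
# where index = List.index(element) + 1
for (w, c) in wordcount:
    w_s += geshi(wordcount.index((w, c)) + 1, w, c)
# Write the result to the file ar.txt
writefile = 'ar.txt'
w_n = os.path.join(dirname, writefile)
with open(w_n, 'w') as wf:
    wf.write(w_s)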
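
The script above calls words.count(w) once per distinct word, which rescans the whole list each time. Below is a minimal sketch of a single-pass alternative built on collections.Counter. It assumes the segmented file is UTF-8 and that tokens are separated by whitespace; top_words and its parameters are hypothetical names, not part of the original answer.

```python
from collections import Counter


def top_words(path, n=20, encoding="utf-8"):
    """Return the n most frequent whitespace-separated tokens in a file."""
    counts = Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            # split() with no argument ignores repeated spaces and the newline
            counts.update(line.split())
    # Same ordering as the script above: count descending, then word ascending
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    import os
    if os.path.exists("aa.txt"):  # file name taken from the script above
        for index, (word, count) in enumerate(top_words("aa.txt"), start=1):
            print(index, word, count)
```

Calling top_words("aa.txt", n=20) would return the twenty most frequent tokens, most frequent first.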

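❷ Python! I have two Chinese txt documents: 1 is a text passage, 2 is a list of words (separated by single spaces). How do I count how often the words in 2 appear in 1?

You need to segment the text into words and then match.

For segmentation, the third-party library jieba works well and is easy to install.

Use jieba to split document 1 and write an intermediate file, mid.txt. I forget whether mid.txt should hold a dict or a list...

Then iterate over it and count.

Note that jieba has several segmentation modes; choose the one that produces no overlapping words. Take the sentence "好久不見" as an example.

Different modes may split it in the following ways:

  1. ["好", "久", "不", "見", "好久", "不見", "久不", ...]

  2. ["好久", "不見"]

If you pick the wrong mode, matching "好" can give different results.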
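
The matching step can be sketched as follows. This is only an illustration: count_targets is a hypothetical helper, and the token list is written out by hand so the example runs without jieba. In practice the tokens would presumably come from jieba.lcut(text), whose default precise mode yields non-overlapping words like split 2 above.

```python
from collections import Counter


def count_targets(tokens, targets):
    """Count how often each target word occurs among the segmented tokens."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, so no special case is needed
    return {word: counts[word] for word in targets}


if __name__ == "__main__":
    # Hand-written stand-in for the segmented document 1 (assumption for the demo)
    tokens = ["好久", "不見", ",", "好久", "沒", "聯繫"]
    # Document 2: words separated by single spaces
    targets = "好久 不見 好".split(" ")
    print(count_targets(tokens, targets))
```

With non-overlapping tokens, "好" is counted zero times here because it only occurs inside "好久". A mode that also emits single characters would report it twice.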

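❸ How to use a Python crawler to find the most frequent words

That is entirely doable.
You could look up the video on Python crawlers for related search terms to learn the basics first.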

❹ How to count the frequency of CET-6 vocabulary with Python

I don't know what material you are using as the source for the statistics.
This example shows one way to count how often each word occurs in a text string with Python, shared here for reference. The implementation is as follows:
# word frequency in a text
# originally tested with Python24 vegaseat 25aug2005
# (print statements and key sorting updated here for Python 3)
# Chinese wisdom ...
str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
print("Original string:")
print(str1)
print()
# create a list of words separated at whitespaces
wordList1 = str1.split(None)
# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
    # last character of each word
    lastchar = word1[-1:]
    # use a list of punctuation marks
    if lastchar in [",", ".", "!", "?", ";"]:
        word2 = word1.rstrip(lastchar)
    else:
        word2 = word1
    # build a wordList of lower case modified words
    wordList2.append(word2.lower())
print("Word list created from modified string:")
print(wordList2)
print()
# create a wordfrequency dictionary
# start with an empty dictionary
freqD2 = {}
for word2 in wordList2:
    freqD2[word2] = freqD2.get(word2, 0) + 1
# create a list of keys and sort the list
# all words are lower case already
keyList = sorted(freqD2.keys())
print("Frequency of each word in the word list (sorted):")
for key2 in keyList:
    print("%-10s %d" % (key2, freqD2[key2]))

I hope this helps with your Python programming.

❺ How to extract Chinese keywords with Python

Remove non-Chinese characters
Segment the text into words
Count
Extract
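
The four steps above can be sketched roughly as below. The segmenter is passed in as a parameter so the example runs without third-party packages; in practice something like jieba.lcut would be supplied. The name top_keywords, the min_len filter and the CJK range \u4e00-\u9fff are assumptions made for this illustration.

```python
import re
from collections import Counter

# Anything outside the basic CJK Unified Ideographs block counts as "non-Chinese" here
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def top_keywords(text, tokenize, top_k=5, min_len=2):
    """Strip non-Chinese characters, segment, count, and return the top_k words."""
    # Step 1: replace every run of non-Chinese characters with a space
    cleaned = NON_CHINESE.sub(" ", text)
    # Step 2: segment each remaining run with the supplied tokenizer
    tokens = [tok for chunk in cleaned.split() for tok in tokenize(chunk)]
    # Step 3: count, ignoring very short tokens such as single characters
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    # Step 4: extract the most frequent words
    return counts.most_common(top_k)


if __name__ == "__main__":
    # Trivial stand-in tokenizer: treat each run of Chinese characters as one word
    whole_run = lambda chunk: [chunk]
    print(top_keywords("你好, world! 你好 Python 世界", whole_run, top_k=2))
```

Raw frequency is only one possible notion of "keyword". Weighted schemes such as TF-IDF rank words differently.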

❻ How to count word frequency with Python

Code:

passage="""Editor』s Note: Looking through VOA's listener mail, we came across a letter that asked a simple question. "What do Americans think about China?" We all care about the perceptions of others. It helps us better understand who we are. VOA Reporter Michael Lipin begins a series providing some answers to our listener's question. His assignment: present a clearer picture of what Americans think about their chief world rival, and what drives those perceptions.

Two common American attitudes toward China can be identified from the latest U.S. public opinion surveys published by Gallup and Pew Research Center in the past year.

First, most of the Americans surveyed have unfavorable opinions of China as a whole, but do not view the country as a threat toward the United States at the present time.

Second, most survey respondents expect China to pose an economic and military threat to the United States in the future, with more Americans worried about the perceived economic threat than the military one.

Most Americans view China unfavorably

To understand why most Americans appear to have negative feelings about China, analysts interviewed by VOA say a variety of factors should be considered. Primary among them is a lack of familiarity.

"Most Americans do not have a strong interest in foreign affairs, Chinese or otherwise," says Robert Daly, director of the Kissinger Institute on China and the United States at the Washington-based Wilson Center.

Many of those Americans also have never traveled to China, in part because of the distance and expense. "That means that like most human beings, they take short cuts to understanding China," Daly says.

Rather than make the effort to regularly consume a wide range of U.S. media reports about China, analysts say many Americans base their views on widely-publicized major events in China's recent history."""

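passage = (passage.replace(",", " ").replace(".", " ").replace(":", " ").replace("』", "'")
           .replace('"', " ").replace("?", " ").replace("!", " ").replace("\n", " "))  # turn punctuation and line breaks into spaces
passagelist = passage.split(" ")  # split into individual words
pc = passagelist.copy()  # make a copy to iterate over
for i in range(len(pc)):
    pi = pc[i]  # this string
    if pi.count(" ") == len(pi):  # if it is empty or consists only of spaces
        passagelist.remove(pi)  # remove this item
worddict = {}
for j in range(len(passagelist)):
    pj = passagelist[j]  # this word
    if pj not in worddict:  # not counted yet
        worddict[pj] = 1  # add the word with a count of 1
    else:  # already counted
        worddict[pj] += 1  # increase the count by 1
output = ""  # alphabetical order, tab-separated
worddictlist = list(worddict.keys())  # collect all the words
worddictlist.sort()  # sort (upper and lower case cause problems here)
worddict2 = {}
for k in worddictlist:
    worddict2[k] = worddict[k]  # dictionary in sorted order
print("Word\t\tCount")
for m in worddict2:  # iterate and print
    tabs = (23 - len(m)) // 8  # number of tabs depends on the word length; if you paste the output into a spreadsheet, change this line to tabs=2
    print("%s%s%d" % (m, "\t" * tabs, worddict[m]))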

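Note: the bold part (the text assigned to passage) is the passage to be counted; replace it with your own. On my machine the output looks like this: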

American 1

Americans 9

Center 2

China 10

China's 1

Chinese 1

Daly 2

Editor's 1

First 1

Gallup 1

His 1

Institute 1

It 1

Kissinger 1

Lipin 1

Looking 1

Many 1

Michael 1

Most 2

Note 1

Pew 1

Primary 1

Rather 1

Reporter 1

Research 1

Robert 1

S 2

Second 1

States 3

That 1

To 1

Two 1

U 2

United 3

VOA 2

VOA's 1

Washington-based1

We 1

What 1

Wilson 1

a 10

about 6

across 1

affairs 1

all 1

also 1

among 1

an 1

analysts 2

and 5

answers 1

appear 1

are 1

as 2

asked 1

assignment 1

at 2

attitudes 1

base 1

be 2

because 1

begins 1

beings 1

better 1

but 1

by 2

came 1

can 1

care 1

chief 1

clearer 1

common 1

considered 1

consume 1

country 1

cuts 1

director 1

distance 1

do 3

drives 1

economic 2

effort 1

events 1

expect 1

expense 1

factors 1

familiarity 1

feelings 1

foreign 1

from 1

future 1

have 4

helps 1

history 1

human 1

identified 1

in 5

interest 1

interviewed 1

is 1

lack 1

latest 1

letter 1

like 1

listener 1

listener's 1

mail 1

major 1

make 1

many 1

means 1

media 1

military 2

more 1

most 4

negative 1

never 1

not 2

of 10

on 2

one 1

opinion 1

opinions 1

or 1

others 1

otherwise 1

our 1

part 1

past 1

perceived 1

perceptions 2

picture 1

pose 1

present 2

providing 1

public 1

published 1

question 2

range 1

recent 1

regularly 1

reports 1

respondents 1

rival 1

say 2

says 2

series 1

short 1

should 1

simple 1

some 1

strong 1

survey 1

surveyed 1

surveys 1

take 1

than 2

that 2

the 16

their 2

them 1

they 1

think 2

those 2

threat 3

through 1

time 1

to 7

toward 2

traveled 1

understand 2

understanding 1

unfavorable 1

unfavorably 1

us 1

variety 1

view 2

views 1

we 2

what 2

who 1

whole 1

why 1

wide 1

widely-publicized1

with 1

world 1

worried 1

year 1

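(The columns should line up, but the alignment is lost here.)

Note: problems that are currently hard to solve

1. Case: the script cannot tell which words must always be capitalized and which are only capitalized at the start of a sentence.

2. The 's problem: a word that contains 's can currently only be counted as a single word together with the suffix.

3. Sorting: it is hard to sort by number of occurrences.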
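
These three limitations can be partly worked around with the standard library. The sketch below is one possible approach, not the original author's method. It lowercases everything (so proper nouns lose their capitals), strips a trailing 's so that "China's" is counted as "china", and sorts by count. The name word_frequencies and the regular expressions are assumptions for this example.

```python
import re
from collections import Counter

# A word: letters, optionally joined by internal hyphens or apostrophes
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    # Issue 1: fold case so "Most" and "most" are counted together
    normalised = text.replace("\u2019", "'").lower()
    # Issue 2: drop a trailing 's (this also turns "it's" into "it")
    normalised = re.sub(r"'s\b", "", normalised)
    counts = Counter(WORD.findall(normalised))
    # Issue 3: sort by count descending, then by word
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "China's view. Americans view China; most Americans agree."
    for word, count in word_frequencies(sample):
        print("%-12s %d" % (word, count))
```

Fixed-width formatting with %-12s also avoids the tab-based alignment breaking on long words such as "Washington-based".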

❼ How can I use Python to extract the twenty most frequent words in a txt file and output them from highest to lowest?

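❽ How to extract all consecutive words from a text in Python

Extracting keywords from text with Python is a common step in text analysis. In practice the volume of text is large, and a single process is slow, so it is worth considering multiple processes.
Multiprocessing in Python only requires the multiprocessing module. When many processes are needed, use its process pool, Pool, and let the workers handle tasks asynchronously with apply_async.

Test corpus: message.txt holds 581 lines of text, 7 MB of data in total, and 100 keywords are extracted from each line.
The code is as follows: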

# coding: utf-8
import sys
from multiprocessing import Pool, Queue, Process
import multiprocessing as mp
import time, random
import os
import codecs
import jieba.analyse
jieba.analyse.set_stop_words("yy_stop_words.txt")

def extract_keyword(input_string):
    #print("Do task by process {proc}".format(proc=os.getpid()))
    tags = jieba.analyse.extract_tags(input_string, topK=100)
    #print("key words:{kw}".format(kw=" ".join(tags)))
    return tags

#def parallel_extract_keyword(input_string,out_file):
def parallel_extract_keyword(input_string):
    #print("Do task by process {proc}".format(proc=os.getpid()))
    tags = jieba.analyse.extract_tags(input_string, topK=100)
    #time.sleep(random.random())
    #print("key words:{kw}".format(kw=" ".join(tags)))
    #o_f = open(out_file,'w')
    #o_f.write(" ".join(tags)+"\n")
    return tags

if __name__ == "__main__":

    data_file = sys.argv[1]
    # UTF-8 is assumed here; the Python 2 original forced it with sys.setdefaultencoding
    with codecs.open(data_file, encoding="utf-8") as f:
        lines = f.readlines()

    out_put = data_file.split('.')[0] + "_tags.txt"
    t0 = time.time()
    for line in lines:
        parallel_extract_keyword(line)
        #parallel_extract_keyword(line,out_put)
        #extract_keyword(line)
    print("Serial processing took {t}s".format(t=time.time()-t0))

    # use about 70% of the cores, but always at least one worker
    pool = Pool(processes=max(1, int(mp.cpu_count()*0.7)))
    t1 = time.time()
    #for line in lines:
        #pool.apply_async(parallel_extract_keyword,(line,out_put))
    # keep the results so they can easily be written to a file
    res = pool.map(parallel_extract_keyword, lines)
    #print("Print keywords:")
    #for tag in res:
        #print(" ".join(tag))

    pool.close()
    pool.join()
    print("Parallel processing took {t}s".format(t=time.time()-t1))

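Run:
python data_process_by_multiprocess.py message.txt
message.txt contains one document per line: 581 lines, 7 MB of data.

Run time:

Not suspending the processes with sleep, that is, commenting out time.sleep(random.random()), saves a great deal of run time.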
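
The answer mentions apply_async, while the listing ends up using pool.map. Below is a minimal sketch of the apply_async variant. To keep it runnable without jieba, a plain word count stands in for the keyword extractor; top_words_in_line, extract_all and the worker count of 2 are hypothetical choices for this illustration.

```python
from collections import Counter
from multiprocessing import Pool


def top_words_in_line(line, top_k=3):
    """Stand-in for keyword extraction: the top_k most frequent tokens in one line."""
    counts = Counter(line.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


def extract_all(lines, workers=2):
    """Submit one task per line with apply_async and collect results in input order."""
    with Pool(processes=workers) as pool:
        # apply_async returns immediately with an AsyncResult handle
        pending = [pool.apply_async(top_words_in_line, (line,)) for line in lines]
        # get() blocks until the corresponding task has finished
        return [job.get() for job in pending]


if __name__ == "__main__":
    docs = ["a b a c", "x y y y z"]
    print(extract_all(docs))
```

Collecting the AsyncResult handles in a list keeps the results in the same order as the input lines, which pool.map does automatically.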

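❾ How can I get the frequency of a sound in Python (other languages are fine too)?

First obtain the time-domain signal, then apply a Fourier transform to get the spectrum.
It sounds like you are already comfortable with Python? Then there is no need to switch languages. A quick search on Baidu or Google will certainly turn up a Fourier transform library for Python.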
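
As a rough illustration of that idea, the sketch below finds the strongest frequency in a sampled signal with a naive discrete Fourier transform from the standard library. It assumes the samples are already available as a list of numbers at a known sample rate; dominant_frequency is a hypothetical name. For real audio a library FFT such as numpy.fft.rfft would be far faster than this O(n²) loop.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component of the signal."""
    n = len(samples)
    best_bin, best_magnitude = 0, 0.0
    # Bins 1 .. n/2 cover the frequencies up to the Nyquist limit; bin 0 (DC) is skipped
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    # Bin k corresponds to k * sample_rate / n Hz
    return best_bin * sample_rate / n


if __name__ == "__main__":
    rate = 800  # samples per second (assumed for the demo)
    tone = [math.sin(2 * math.pi * 50 * t / rate) for t in range(160)]  # 50 Hz sine
    print(dominant_frequency(tone, rate))
```

The frequency resolution is sample_rate / n, so longer recordings give finer estimates.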

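❿ How to extract keywords with Python

What exactly are the keywords?
Plain string comparison is enough.
For HTML, use BeautifulSoup or regular expressions.
For JSON it is even simpler.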
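
A minimal sketch of the string-comparison and JSON cases, using only the standard library. keyword_hits and keyword_hits_in_json are hypothetical helpers, and the field name "body" is an assumption about how the JSON is structured.

```python
import json


def keyword_hits(text, keywords):
    """Count non-overlapping occurrences of each keyword in a plain string."""
    return {keyword: text.count(keyword) for keyword in keywords}


def keyword_hits_in_json(raw_json, field, keywords):
    """Parse a JSON object and count keywords inside one of its text fields."""
    record = json.loads(raw_json)
    return keyword_hits(str(record.get(field, "")), keywords)


if __name__ == "__main__":
    raw = '{"title": "word count", "body": "count words with python, python!"}'
    print(keyword_hits_in_json(raw, "body", ["python", "jieba"]))
```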
