Getting a web page's encoding in Python, and how a Python crawler gets the source of a dynamic web page

1. Fetching a GB2312-encoded page with Python returns nothing but mojibake. How can I turn it into normal HTML?

Try the code below (Python 2):

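#!/usr/bin/env python
# -*- coding: utf8 -*-
# Python 2 code: urllib2 and the unicode() built-in do not exist in Python 3.

import urllib2

req = urllib2.Request("http://www..com/")
res = urllib2.urlopen(req)
html = res.read()
res.close()

# Decode the GB2312 bytes to unicode, then re-encode them as UTF-8 for printing.
html = unicode(html, "gb2312").encode("utf8")
print html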
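On Python 3 the same transcoding step can be sketched without urllib2. This is a minimal sketch rather than the original answer's code: the helper name `transcode` and the sample bytes are made up for illustration, and `errors="replace"` is an assumption about how undecodable bytes should be handled.

```python
def transcode(raw: bytes, src: str = "gb2312", dst: str = "utf-8") -> bytes:
    """Decode `raw` from `src` and re-encode it as `dst` (hypothetical helper)."""
    # bytes -> str: undecodable bytes become U+FFFD instead of raising.
    text = raw.decode(src, errors="replace")
    # str -> bytes in the target encoding.
    return text.encode(dst)


# Stand-in for the bytes that res.read() would return from a GB2312 page.
sample = "<p>中文</p>".encode("gb2312")
utf8_html = transcode(sample)
print(utf8_html.decode("utf-8"))
```

In Python 3, print accepts str directly, so in practice only the decode step is needed before printing.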

2. Character-encoding problem when printing a fetched page in Python

string.decode('gbk').encode('utf8')

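3. Why does Chinese text display incorrectly when I fetch page source with requests in Python?

What are you using? Python 2? urllib? In Python 3 all you need is html = urllib.request.urlopen('http://music..com').read(), nothing hard about it. Note that read() returns bytes, so decode them with the page's encoding to get text. After that mojibake is rarely a problem (unless the site itself is badly built).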
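A hedged Python 3 sketch of that one-liner with the decoding made explicit. It takes the charset from the response's Content-Type header when one is present; the helper name `fetch_text` and the UTF-8 fallback are assumptions, not part of the original answer.

```python
import urllib.request


def fetch_text(url: str, fallback: str = "utf-8") -> str:
    """Fetch `url` and decode the body using the charset the server declares."""
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()  # bytes, not str
        # Charset from the Content-Type header, or None if the server sent none.
        charset = resp.headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")
```

If the header carries no charset, the charset declared in the page's own meta tag is the next place to look.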

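4. How does a Python crawler get the source of a dynamic web page?

A month ago my internship supervisor gave me a task: use a web crawler to fetch the rainfall data published by the Shenzhen Meteorological Bureau. The page looks like this:

I figured a crawler would not be hard; back when zjb and I scraped the "boring pictures" (read: girl pictures) from jandan.net, I felt rather pleased with myself. The month after I took the task was packed with exams and homework, the supervisor did not push, and I was in no hurry either.

But if the supervisor waited a month and still left it for me to write, it had to be hard... Opening the site today confirmed it. The site is built on Ajax and loads its data dynamically, so the data cannot be obtained by downloading the page source and parsing it.

I took inspiration from an example some delinquent wrote for scraping Taobao models: in cases like this you can generally drive a browser yourself. PhantomJS and CasperJS are both good options.

The supervisor wanted the hourly rainfall of every station in every district of Shenzhen for the past year. That requires the history query shown in the figure above, i.e. querying by a time value, and that value is stored in an input tag of type hidden. You can of course use a JS statement to change it to type text and then use send_keys and similar operations. However, I failed. The time could be changed, but the result was as shown in the figure below.

So I scraped only the real-time data. I chose Python's selenium to drive a browser and simulate human actions such as clicks, so that the data gets generated and can be read. A major advantage of selenium is that it can return the rendered page source, i.e. the source after those actions have run. Ordinary fetching by URL only returns the data in the initial response and cannot interact with the page. selenium works on the rendered source and offers a rich set of lookup tools; I personally find find_element_by_xpath("xxx") the handiest. Once an element is found you can fire click, input and other events on it, which send requests to the server and bring back the data you need.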


# coding=utf-8
# Python 2 script using the Selenium 3 era find_element_by_* API.
from testString import *
from selenium import webdriver
import string
import os
from selenium.webdriver.common.keys import Keys
import time
import sys

# Python 2 only: force the default string encoding to UTF-8.
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

district_navs = ['nav2', 'nav1', 'nav3', 'nav4', 'nav5', 'nav6', 'nav7', 'nav8', 'nav9', 'nav10']
district_names = ['福田區', '羅湖區', '南山區', '鹽田區', '寶安區', '龍崗區', '光明新區', '坪山新區', '龍華新區', '大鵬新區']
flag = 1
while flag > 0:
    driver = webdriver.Chrome()
    driver.get("hianCe/")
    # Select rainfall
    driver.find_element_by_xpath("//span[@id='fenqu_H24R']").click()
    filename = time.strftime("%Y%m%d%H%M", time.localtime(time.time())) + '.txt'
    # Create the output file
    output_file = open(filename, 'w')
    # Select each administrative district in turn
    for i in range(len(district_navs)):
        driver.find_element_by_xpath("//div[@id='" + district_navs[i] + "']").click()
        # print driver.page_source
        timeElem = driver.find_element_by_id("time_shikuang")
        # Write the timestamp and the district name
        output_file.write(timeElem.text + ',')
        output_file.write(district_names[i] + ',')
        elems = driver.find_elements_by_xpath("//span[@onmouseover='javscript:changeTextOver(this)']")
        # Write each station's data as: station name, one-hour rainfall, accumulated rainfall for the day
        for elem in elems:
            output_file.write(AMonitorRecord(elem.get_attribute("title")) + ',')
        # End the line for this district (assumed to be a newline; the posted code shows ' ')
        output_file.write('\n')
    output_file.close()
    driver.close()
    # Scrape once an hour
    time.sleep(3600)
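The find_element_by_* helpers used above were deprecated and later removed in Selenium 4 in favour of find_element(By.XPATH, ...). A rough sketch of one district lookup in that style follows. It takes an already created driver as a parameter; the function name `station_titles` is hypothetical, the element ids and XPath are copied from the script above, and the locator strings "xpath" and "id" are the values of Selenium's By.XPATH and By.ID constants, written out so that the sketch itself needs no import.

```python
def station_titles(driver, district_nav_id):
    """Click one district tab and return (timestamp, [title, ...]).

    `driver` is any Selenium WebDriver, e.g. selenium.webdriver.Chrome().
    """
    # "xpath" and "id" equal By.XPATH and By.ID from selenium.webdriver.common.by.
    driver.find_element("xpath", "//div[@id='%s']" % district_nav_id).click()
    timestamp = driver.find_element("id", "time_shikuang").text
    spans = driver.find_elements(
        "xpath", "//span[@onmouseover='javscript:changeTextOver(this)']"
    )
    # The station readings live in each span's title attribute.
    return timestamp, [span.get_attribute("title") for span in spans]
```

With a real browser this would be called as station_titles(driver, 'nav1') once driver.get(...) has loaded the page. An explicit wait (selenium.webdriver.support.ui.WebDriverWait) may be needed if the Ajax data arrives late.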
The testString file imported by the script only adjusts the output format and extracts the useful data.


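# Encoding=utf-8
def OnlyCharNum(s, oth=''):
    # Strip every character that is not part of a number or a separator.
    s2 = s.lower()
    # Assumption: digits belong in the keep-list; the posted code lists only ',.',
    # which would delete the readings themselves.
    fomart = '0123456789,.'
    for c in s2:
        if c not in fomart:
            s = s.replace(c, '')
    return s

def AMonitorRecord(str):
    # "station:readings" -> "station,cleaned readings"
    str = str.split(":")
    return str[0] + "," + OnlyCharNum(str[1])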
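As a rough Python 3 illustration of what this helper is meant to do, the sketch below keeps only digits, commas and dots from the part after the colon. The input format shown ("StationA:1.5mm,20.0mm") is an assumption, since the page's real title text is not reproduced here, and `monitor_record` is a hypothetical name.

```python
import re


def monitor_record(title: str) -> str:
    """Turn 'name:readings' into 'name,readings' with units stripped."""
    name, _, readings = title.partition(":")
    # Drop everything except digits, commas and dots.
    cleaned = re.sub(r"[^0-9,.]", "", readings)
    return name + "," + cleaned


print(monitor_record("StationA:1.5mm,20.0mm"))  # StationA,1.5,20.0
```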

The data is scraped once an hour; the result looks like this:

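5. New to Python: encoding problems when fetching Chinese pages, and no conversion seems to work

You can actually use a ready-made framework such as scrapy, which already handles encoding for you.

If you really want to write it yourself, first check the encoding of the site you are scraping. Most pages declare it, for example on Baidu Zhidao:

which shows that the page is GBK-encoded.

# str is the page content you fetched (a Python 2 byte string)
text = str.decode("gbk")

This gives you a unicode object, Python's internal text type. If you then want to encode it as UTF-8:

text.encode("utf8")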
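Looking up the declared encoding can also be automated. The sketch below, written for Python 3, searches the first bytes of a page for a charset=... declaration and decodes with it. The function name `decode_by_meta`, the 2048-byte search window and the UTF-8 fallback are illustrative assumptions.

```python
import re

# Matches charset=gbk, charset="utf-8", charset='gb2312' in raw HTML bytes.
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def decode_by_meta(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode `raw` using the charset declared in its <meta> tag, if any."""
    match = _CHARSET_RE.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The page declared an encoding name Python does not know.
        return raw.decode(fallback, errors="replace")


page = '<meta http-equiv="content-type" content="text/html;charset=gbk" />你好'.encode("gbk")
print(decode_by_meta(page))
```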


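6. How do I read HTML code with request in Python?

An example of saving text to a file under Python 3 (the snippet imports requests but only shows the file-handling part):

import requests  # imported in the original answer, not used below

# Copy test.txt to testt.txt one line at a time, both as UTF-8.
with open('test.txt', encoding='utf-8') as f, open('testt.txt', 'w', encoding='utf-8') as ff:
    for line in f:
        ff.write(line)

This demonstrates reading a txt file one line at a time and saving it to another txt file.

Printing each line to the command line produced encoding errors for Chinese text, so each line is written to another file instead to check that reading works. (Note that the encoding is specified when calling open.)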
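For the case in the question itself, fetching a page with requests and saving it, a minimal sketch could look like this. It assumes the requests package is installed; `save_page`, the timeout and the output path are illustrative choices.

```python
import requests


def save_page(url: str, path: str, timeout: float = 10.0) -> int:
    """Download `url` and write its text to `path` as UTF-8; return chars written."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    # Naming the file encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as out:
        return out.write(response.text)
```

Writing with an explicit encoding is what keeps Chinese text intact on systems whose default file encoding is not UTF-8.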

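7. Why does Chinese text display incorrectly when I fetch page source with requests in Python?

Check the page's encoding; if it is GBK, for example, set r.encoding = 'gbk'. The following is taken from the requests documentation:

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'iso-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML can specify their encoding in their body. In situations like this, you should use r.content to find the encoding and then set r.encoding accordingly. This lets you use r.text with the correct encoding.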
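The behaviour described in this excerpt can be put together as follows. This is a sketch, not code from the requests documentation: `fetch_html` is a made-up name, and falling back to `r.apparent_encoding` (requests' guess from the body) when the headers give no usable charset is one possible policy among several.

```python
import requests


def fetch_html(url, encoding=None, timeout=10):
    """Return the page text, decoded with `encoding` if one is given."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    if encoding is not None:
        # Explicit override, e.g. "gbk"; r.text uses it from now on.
        r.encoding = encoding
    elif r.encoding is None or r.encoding.lower() == "iso-8859-1":
        # No charset in the headers (requests falls back to ISO-8859-1
        # for text/* responses), so guess from the body instead.
        r.encoding = r.apparent_encoding
    return r.text
```

Calling fetch_html(url, encoding="gbk") corresponds to the r.encoding = 'gbk' advice at the top of this answer.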

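8. In Python, the page source I get with requests differs from what I see with right-click "View source". Help! My code is below and I can't see what is wrong

After requests fetches url = 'https://www..com/s?wd=周傑倫', print(res.text) prints only the response body returned by that single request to url = 'https://www..com/s?wd=周傑倫'.

As the figure below shows, the page source you see by right-clicking is the page as finally assembled from the URL you requested plus the other requests made inside the page (JS requests, static resources such as images, CSS and so on), so the two differ.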

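9. Looking for Python code to fetch a web page

In Python 3.x, use the urllib.request module to fetch page source. urllib.request.urlopen fetches the page and returns a stream; read() reads the data out of it, and decode() then decodes the resulting binary data. The encoding can be found in the page source, in <meta http-equiv="content-type" content="text/html;charset=gbk" />, which is GBK in the example below. That gives you the page's source.

The example below fetches the source of this page:

import urllib.request

# The URL is missing from the original post; put the page address between the quotes.
html = urllib.request.urlopen('').read().decode('gbk')  # decode using the page's own encoding
print(html)

The urllib.request.urlopen function is documented as follows: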

urllib.request.urlopen(url,
data=None, [timeout, ]*, cafile=None, capath=None,
cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.

urllib.request module uses HTTP/1.1 and includes Connection:close header
in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.

Raises URLError on errors.

Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.

Changed in version 3.2: cafile
and capath were added.

Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be
an iterable object.

Changed in version 3.3: cadefault
was added.

Changed in version 3.4.3: context
was added.
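To make the description of the data parameter concrete, here is a small sketch that builds a POST request the way the text describes: urlencode a mapping, encode the result to bytes, and name the charset in Content-Type. The URL and form fields are placeholders, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def build_post(url: str, fields: dict) -> urllib.request.Request:
    """Build a form-encoded POST request for `url` (not sent)."""
    # urlencode returns str; the data parameter must be bytes.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
    }
    return urllib.request.Request(url, data=body, headers=headers)


req = build_post("http://example.com/search", {"q": "python", "page": 1})
print(req.get_method())  # POST, because data is not None
print(req.data)
```

Passing req to urllib.request.urlopen(req, timeout=10) would send it; urllib.error.URLError is raised if the request fails.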

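10. What to do when the page encoding and Python's encoding do not match

Web pages come in many encodings, such as UTF-8, GBK and GB2312. Press F12 on the page and use Ctrl+F to search for charset to see which one a page uses; CSDN, for example, has charset="utf-8". When fetching pages with Python, encoding mismatches often make the program crash with an error or leave you with a pile of undecoded bytes, which makes the software brittle. One workaround is to detect the encoding with the third-party library chardet and then decode the data with the detected encoding.

(1) Install chardet
chardet is a third-party library and must be installed before use. The simple way is to open a command prompt, change to the Scripts directory under the Python installation path (which contains the pip script) and run "pip install chardet". You may need to upgrade pip first; just run the command the prompt suggests.

(2) Import chardet and define a function
Import the chardet library in your Python project ("import chardet") and define the following function:

import urllib.request
import chardet

def get_url_context(url):
    content = urllib.request.urlopen(url).read()  # fetch the page body as bytes
    encode = chardet.detect(content)  # dict describing the detected encoding; its 'encoding' key holds the name
    # Decode with the detected encoding (UTF-8 if detection returned None), ignoring undecodable bytes.
    return content.decode(encode['encoding'] or 'utf-8', 'ignore')

The return value is the decoded page content; for most pages the encoding is recognised and converted without further effort. Note the 'ignore' argument when decoding: without it, a page that mixes encodings may make the program raise an error.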
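If installing chardet is not an option, a much cruder alternative is to try a short list of likely encodings in order. Note that this is a different technique from chardet's statistical detection, and the candidate list below (UTF-8, then GB18030, which is a superset of GBK and GB2312) is only an assumption suited to Chinese pages.

```python
def decode_best_effort(raw: bytes, candidates=("utf-8", "gb18030")) -> str:
    """Decode with the first candidate encoding that accepts all bytes."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character and never fails.
    return raw.decode("latin-1")
```

UTF-8 is tried first because its strict decoder rejects most GBK byte sequences, whereas GB18030 accepts many UTF-8 sequences and silently produces the wrong characters.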
