Getting the source of a given web page with Python: handling scraped page source for beginners

A. How to parse a web page with Python and get its actual source

For Python 2.7, the code is:

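#!/usr/bin/env python
# -*- coding:utf8 -*-
import urllib

# Placeholder address: replace with the URL of the page you want, as a string
addr1 = 'http://example.com/'
# Python 2 only; in Python 3 the equivalent is urllib.request.urlopen
response1 = urllib.urlopen(addr1)
text1 = response1.read()
response1.close()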

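text1 now holds the page's source and can be printed for inspection. The utf8 coding line declares the encoding of the script file itself, so that Chinese characters written in the script are read correctly; the bytes returned by read() still have to be decoded with the page's own encoding.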

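B. Fetching page source with Python fails no matter what I try

Some sites throttle traffic, so failing to get a response is quite normal.
Try leaving an interval between two consecutive requests, for example sleeping for 10 seconds; that usually helps.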
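
The advice above can be sketched roughly as follows. This is a minimal illustration, not a complete crawler: `fetch_politely` is a hypothetical helper name, and `fetch` stands for whatever function you already use to download one URL.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=10, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that downloads a single URL (an assumption of
    this sketch); `sleep` is injectable so the pause can be replaced in tests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

With two or more URLs, each request after the first is preceded by a pause of `delay_seconds`; 10 seconds matches the interval suggested in the answer, but the right value depends on the site.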

C. How do I read HTML with requests in Python?

An example for Python 3 (the answer imports requests, but the code shown only covers reading and writing files):

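import requests  # imported in the original answer, not used by this file-copy demo

# Copy test.txt to testt.txt one line at a time, stating the encoding explicitly
with open('test.txt', encoding='utf-8') as f, \
        open('testt.txt', 'w', encoding='utf-8') as ff:
    for line in f:
        ff.write(line)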

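This demonstrates reading a txt file one line at a time and saving it to another txt file.

Printing each line to the command line caused encoding errors with Chinese text, so each line is written to a second file instead, to check that the reading works. (Note that the encoding is specified when calling open.)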
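
Since the question asks about requests itself, here is a hedged sketch of the missing half: fetching a page and writing its text to a file. `save_page` is a hypothetical helper, and the `get` parameter exists only so the function can be exercised without network access; by default it falls back to `requests.get`.

```python
def save_page(url, path, get=None, timeout=10):
    """Fetch `url` and write the decoded page text to `path` as UTF-8.

    Returns the number of characters written. `get` defaults to requests.get.
    """
    if get is None:
        import requests  # third-party package; only needed for real downloads
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    with open(path, "w", encoding="utf-8") as out:
        out.write(response.text)
    return len(response.text)
```

A call such as `save_page(url, 'page.html')` would then download the page; writing to a file rather than printing sidesteps the console encoding problem described above.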

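D. How can a Python crawler get the source of a dynamic web page?

A month ago my internship supervisor asked me to use a web crawler to collect the rainfall data published by the Shenzhen Meteorological Bureau. The page looked like this:

[screenshot not preserved]

I figured a crawler wouldn't be hard; back when zjb and I scraped the "boring" (read: girl) pictures from Jandan (煎蛋網), we felt pretty smug. With exams and a pile of homework in the month after I took the task, the supervisor didn't push and I didn't hurry.

Still, a task the supervisor was willing to wait a month for me to do had to be a hard one, and opening the site today confirmed it. The site is built on Ajax and loads its data dynamically, so the data cannot be obtained by downloading the page source and parsing it.

Taking inspiration from an example somebody wrote for scraping Taobao model photos, the usual approach in this situation is to drive a browser of your own. PhantomJS and CasperJS are both good choices.

The supervisor wanted the hourly rainfall for every station in every district of Shenzhen over the past year. That requires the site's historical query (shown in the screenshot above), which searches by a time value stored in an input tag of type hidden. You can of course use a JavaScript statement to change it to type text and then use something like send_keys. I failed, though: the time could be changed, but the result was as in the screenshot (not preserved).

So I only scrape the real-time data. I chose Python's selenium, which drives a real browser and simulates human actions such as clicks to generate and collect the data. A major advantage of selenium is that it gives you the page source after rendering, that is, after your actions have run. Fetching and parsing a URL directly only returns the data the server initially sends and cannot interact with the page. selenium works on the rendered source and offers a rich set of lookup tools; the one I find handiest is find_element_by_xpath("xxx"). Once an element is found you can click it, type into it, and so on, which sends requests to the server and retrieves the data you need.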

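# coding=utf-8
# Written for Python 2 and an older Selenium: the find_element_by_* helpers
# used below were removed in Selenium 4.3.
from testString import *
from selenium import webdriver
import string
import os
from selenium.webdriver.common.keys import Keys
import time
import sys

# Python 2 idiom: make utf-8 the default for implicit str/unicode conversions
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

district_navs = ['nav2', 'nav1', 'nav3', 'nav4', 'nav5', 'nav6', 'nav7', 'nav8', 'nav9', 'nav10']
# Futian, Luohu, Nanshan, Yantian, Bao'an, Longgang, Guangming, Pingshan, Longhua, Dapeng
district_names = ['福田區', '羅湖區', '南山區', '鹽田區', '寶安區', '龍崗區', '光明新區', '坪山新區', '龍華新區', '大鵬新區']
flag = 1
while flag > 0:
    driver = webdriver.Chrome()
    driver.get("hianCe/")  # URL as it appears in the source (truncated)
    # Select the rainfall view
    driver.find_element_by_xpath("//span[@id='fenqu_H24R']").click()
    filename = time.strftime("%Y%m%d%H%M", time.localtime(time.time())) + '.txt'
    # Create the output file
    output_file = open(filename, 'w')
    # Click through each administrative district
    for i in range(len(district_navs)):
        driver.find_element_by_xpath("//div[@id='" + district_navs[i] + "']").click()
        # print driver.page_source
        timeElem = driver.find_element_by_id("time_shikuang")
        # Write the timestamp and the district name
        output_file.write(timeElem.text + ',')
        output_file.write(district_names[i] + ',')
        elems = driver.find_elements_by_xpath("//span[@onmouseover='javscript:changeTextOver(this)']")
        # Write each station's data as: station name, one-hour rainfall, cumulative rainfall for the day
        for elem in elems:
            output_file.write(AMonitorRecord(elem.get_attribute("title")) + ',')
        # One line per district; the separator was garbled in the source, a newline is assumed
        output_file.write('\n')
    output_file.close()
    driver.close()
    time.sleep(3600)  # scrape once an hour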
The testString module imported above only reformats the output and extracts the useful data.

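# Encoding=utf-8
def OnlyCharNum(s, oth=''):
    # Keep only digits, commas and periods; drop everything else (units, spaces, ...).
    # The source shows fomart=',.' alone, which would also strip the digits;
    # including the digits is an assumption based on the stated output format.
    fomart = '0123456789,.'
    for c in s:
        if c not in fomart:
            s = s.replace(c, '')
    return s


def AMonitorRecord(record):
    # "name:values" -> "name,cleaned values"
    parts = record.split(":")
    return parts[0] + "," + OnlyCharNum(parts[1])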
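
To make the intent of the helper concrete, here is a small Python 3 sketch of the same cleaning step. The sample title string is hypothetical (the real format of the site's title attribute is not shown in the post), and `clean_record` is an illustrative name rather than the author's function.

```python
ALLOWED = "0123456789,."


def clean_record(title):
    """Turn 'name:raw values' into 'name,values' keeping only digits, ',' and '.'."""
    name, _, values = title.partition(":")
    kept = "".join(ch for ch in values if ch in ALLOWED)
    return name + "," + kept


# Hypothetical sample of a station's title attribute
sample = "StationA:1.5mm,12.0mm"
cleaned = clean_record(sample)
```

Building the result with a generator expression avoids modifying the string while looping over it, which is the main difference from the original helper.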

The data is scraped once an hour; the results were as follows: [screenshot not preserved]

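E. Why does Chinese text display incorrectly when fetching page source with requests in Python?

Check the page's encoding; if it is gbk, for example, set r.encoding='gbk'. The following is taken from the requests documentation.

Requests automatically decodes content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'iso-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you access r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.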
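
The last step described above, finding the encoding in the raw bytes, might look like the sketch below. It uses only the standard library and a deliberately simple regular expression; `sniff_charset` and `decode_page` are hypothetical helper names, and real pages can declare their encoding in ways this pattern does not cover.

```python
import re

# Matches charset=... inside a <meta> tag, e.g.
# <meta charset="utf-8"> or <meta http-equiv="content-type" content="text/html;charset=gbk" />
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in the first 4 KB of raw HTML bytes, else `default`."""
    match = _META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw, default="utf-8"):
    """Decode raw HTML bytes using the charset the page declares for itself."""
    return raw.decode(sniff_charset(raw, default), errors="replace")
```

With requests, the same idea would be `r.encoding = sniff_charset(r.content)` before reading `r.text`.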

F. How to scrape a page's source with Python

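#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import urllib3

if __name__ == '__main__':
    http = urllib3.PoolManager()
    # 'IP' stands for the target address, as in the original answer
    r = http.request('GET', 'IP')
    # r.data is bytes; decode it with the page's encoding (gbk here)
    print(r.data.decode("gbk"))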

This fetches the page correctly. It requires urllib3 to be installed; the Python version used was 3.4.3.

G. Handling scraped page source in Python as a beginner

Locate the element by its id first; once you have it, use get_attribute to read its value.
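
The answer most likely refers to a browser-automation call such as Selenium's get_attribute. For HTML that has already been downloaded as text, the same "find by id, then read an attribute" idea can be sketched with the standard library's html.parser instead. The class and function names, and the id `token` in the usage line, are hypothetical.

```python
from html.parser import HTMLParser


class AttributeById(HTMLParser):
    """Remember one attribute of the first element with a given id."""

    def __init__(self, element_id, attribute):
        super().__init__()
        self.element_id = element_id
        self.attribute = attribute
        self.result = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.result is None and attributes.get("id") == self.element_id:
            self.result = attributes.get(self.attribute)


def attribute_by_id(html, element_id, attribute="value"):
    """Return the attribute of the element with id `element_id`, or None if absent."""
    parser = AttributeById(element_id, attribute)
    parser.feed(html)
    return parser.result
```

For example, `attribute_by_id(page_html, "token")` would return the value attribute of the element whose id is token, if the page has one.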

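H. How to view a web page's source with Python

To view a page's source with Python:

1. Import the requests package with the import statement

import requests

2. Call the package's get() method with the URL of the page you want to view, and assign the result to the variable x

x = requests.get(url='http://www.hao123.com')

3. Print the page content as text with print(x.text)

print(x.text)

The complete code:

import requests

x = requests.get(url='http://www.hao123.com')
print(x.text)

Running it prints the page's HTML source.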


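I. Looking for Python code to fetch a web page

In Python 3.x, use the urllib.request module to fetch a page's source. urllib.request.urlopen fetches the page and returns a stream; read() reads the data out as bytes, and decode() then decodes those bytes. The encoding can be found in the page source, for example in <meta http-equiv="content-type" content="text/html;charset=gbk" />; the example below uses gbk. The result is the page's source as text.

For example, to fetch this page's source: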

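import urllib.request

# The URL was lost from the source; put the address of a gbk-encoded page here
url = ''
# After fetching, decode the bytes with the page's own encoding
html = urllib.request.urlopen(url).read().decode('gbk')
print(html)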
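
Hard-coding gbk only works for pages that really use it. A slightly more general sketch, assuming the server declares the charset in its Content-Type header, reads the encoding from the response headers and falls back to a default otherwise. `fetch_html` is a hypothetical helper name.

```python
import urllib.request


def fetch_html(url, fallback_encoding="utf-8", timeout=10):
    """Fetch `url` and decode the body with the charset from the response headers.

    Falls back to `fallback_encoding` when the headers declare no charset.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        charset = response.headers.get_content_charset()
    # errors="replace" keeps going if a few bytes do not match the declared charset
    return raw.decode(charset or fallback_encoding, errors="replace")
```

When the header carries no charset, the encoding declared in the page's meta tag is the next thing to check, as described in the answer above.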

The documentation for urllib.request.urlopen follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object and in that case Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.
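
As a rough illustration of the two paragraphs above, the sketch below builds (but does not send) a POST request from a mapping. `build_form_request` and the field names in the usage line are hypothetical; the calls themselves are the standard urllib.parse.urlencode and urllib.request.Request.

```python
import urllib.parse
import urllib.request


def build_form_request(url, fields, charset="utf-8"):
    """Build a form-encoded request; supplying data makes it a POST."""
    # urlencode returns a str in application/x-www-form-urlencoded format;
    # it must be encoded to bytes before being used as data
    body = urllib.parse.urlencode(fields, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=" + charset
    }
    return urllib.request.Request(url, data=body, headers=headers)
```

Passing the result to urllib.request.urlopen would send it, for example `urlopen(build_form_request(url, {"q": "rain"}))`.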

urllib.request module uses HTTP/1.1 and includes Connection:close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager and has methods such as

geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed

info(): return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

getcode(): return the HTTP status code of the response.

Raises URLError on errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.

Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

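J. How to fetch the current page's HTML content with Python

That works too, of course, but a more general approach is to use the BeautifulSoup library to locate the element with id=phoneCodestatus.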
