py爬網頁源碼_python怎麼看源碼進行網路爬蟲

㈠如何用python提取網頁中框架的源代碼

簡單的做個例子，框架路徑可以自己修改，調用像網路等網站時無法讀取其中源碼，涉及到一些安全問題，所以路徑要求是合法的允許訪問的路徑 function GetFrameInnerHtml(objIFrame) { var iFrameHTML = ""; if (objIFrame.contentDocument) { //針...

㈡怎麼使用python查看網頁源代碼

使用python查看網頁源代碼的方法：

1、使用「import」命令導入requests包

import requests

2、使用該包的get()方法，將要查看的網頁鏈接傳遞進去，結果賦給變數x

x = requests.get(url='http://www.hao123.com')

3、用「print (x.text)」語句把網頁的內容以text的格式輸出

print(x.text)

完整代碼如下：

執行結果如下：

更多Python知識，請關註：Python自學網！！

㈢求python抓網頁的代碼

python3.x中使用urllib.request模塊來抓取網頁代碼，通過urllib.request.urlopen函數取網頁內容，獲取的為數據流，通過read()函數把數字讀取出來，再把讀取的二進制數據通過decode函數解碼（編號可以通過查看網頁源代碼中<meta http-equiv="content-type" content="text/html;charset=gbk" />得知，如下例中為gbk編碼。），這樣就得到了網頁的源代碼。

如下例所示，抓取本頁代碼：

importurllib.request

html=urllib.request.urlopen('
).read().decode('gbk')#注意抓取後要按網頁編碼進行解碼
print(html)

以下為urllib.request.urlopen函數說明：

urllib.request.urlopen(url,
data=None, [timeout, ]*, cafile=None, capath=None,
cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.

urllib.request mole uses HTTP/1.1 and includes Connection:close header
in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.

Raises URLError on errors.

Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.

Changed in version 3.2: cafile
and capath were added.

Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be
an iterable object.

Changed in version 3.3: cadefault
was added.

Changed in version 3.4.3: context
was added.

㈣求助，python 解析爬取的網頁源碼中的json部分

我用re把json的部分截取出來了，也用json.loads()解析成了字典，現在的問題是裡面需要的信息那部分是有一些是unicode 編碼的，求解。。。。
{"pageName":"mainsrp","mods":{"shopcombotip":{"status":"hide","export":false},"shopstar":{"status":"hide","export":false},"navtablink":{"status":"hide","export":false},"personalbar":{"status":"show","data":{"metisData":{"nickname":"","query":"秋季打底衫","shopItems":[{"text":"黃鑽愛買店鋪","count":"500+","url":"/search?q\u003d秋季打底衫\u0026tab\u003dmysearch\u0026filter_rectype\u003d44\u0026stats_click\u003dms_from:44","trace":"metis44"},{"text":"回頭客愛買店鋪","count":"500+","url":"/search?q\u003d秋季打底衫\u0026tab\u003dmysearch\

㈤ Python怎麼爬取證才通這家網站的源碼

不知道你是用框架還是用 Selenium 爬的內容， iframe 里的內容實際上就是另一個網頁了。
你只是爬它的源碼是爬不到的，你要提取 iframe 里的 src 所指向的網址，重新打開它，然後才爬他的源碼。或者如果你用框架，裡面應該有另外提供方法，讀取 iframe 中的內容

㈥ python爬蟲怎麼獲取動態的網頁源碼

一個月前實習導師布置任務說通過網路爬蟲獲取深圳市氣象局發布的降雨數據，網頁如下：

心想，爬蟲不太難的，當年跟zjb爬煎蛋網無（mei）聊（zi）圖的時候，多麼清高。由於接受任務後的一個月考試加作業一大堆，導師也不催，自己也不急。

但是，導師等我一個月都得讓我來寫意味著這東西得有多難吧。。。今天打開一看的確是這樣。網站是基於Ajax寫的，數據動態獲取，所以無法通過下載源代碼然後解析獲得。

從某不良少年寫的抓取淘寶mm的例子中收到啟發，對於這樣的情況，一般可以同構自己搭建瀏覽器實現。phantomJs，CasperJS都是不錯的選擇。

導師的要求是獲取過去一年內深圳每個區每個站點每小時的降雨量，執行該操作需要通過如上圖中的歷史查詢實現，即通過一個時間來查詢，而這個時間存放在一個hidden類型的input標簽里，當然可以通過js語句將其改為text類型，然後執行send_keys之類的操作。然而，我失敗了。時間可以修改設置，可是結果如下圖。

為此，僅抓取實時數據。選取python的selenium，模擬搭建瀏覽器，模擬人為的點擊等操作實現數據生成和獲取。selenium的一大優點就是能獲取網頁渲染後的源代碼，即執行操作後的源代碼。普通的通過 url解析網頁的方式只能獲取給定的數據，不能實現與用戶之間的交互。selenium通過獲取渲染後的網頁源碼，並通過豐富的查找工具，個人認為最好用的就是find_element_by_xpath("xxx")，通過該方式查找到元素後可執行點擊、輸入等事件，進而向伺服器發出請求，獲取所需的數據。

[python]view plain

#coding=utf-8
fromtestStringimport*
fromseleniumimportwebdriver
importstring
importos
fromselenium.webdriver.common.keysimportKeys
importtime
importsys
default_encoding='utf-8'
ifsys.getdefaultencoding()!=default_encoding:
reload(sys)
sys.setdefaultencoding(default_encoding)
district_navs=['nav2','nav1','nav3','nav4','nav5','nav6','nav7','nav8','nav9','nav10']
district_names=['福田區','羅湖區','南山區','鹽田區','寶安區','龍崗區','光明新區','坪山新區','龍華新區','大鵬新區']
flag=1
while(flag>0):
driver=webdriver.Chrome()
driver.get("hianCe/")
#選擇降雨量
driver.find_element_by_xpath("//span[@id='fenqu_H24R']").click()
filename=time.strftime("%Y%m%d%H%M",time.localtime(time.time()))+'.txt'
#創建文件
output_file=open(filename,'w')
#選擇行政區
foriinrange(len(district_navs)):
driver.find_element_by_xpath("//div[@id='"+district_navs[i]+"']").click()
#printdriver.page_source
timeElem=driver.find_element_by_id("time_shikuang")
#輸出時間和站點名
output_file.write(timeElem.text+',')
output_file.write(district_names[i]+',')
elems=driver.find_elements_by_xpath("//span[@onmouseover='javscript:changeTextOver(this)']")
#輸出每個站點的數據，格式為：站點名，一小時降雨量，當日累積降雨量
foreleminelems:
output_file.write(AMonitorRecord(elem.get_attribute("title"))+',')
output_file.write(' ')
output_file.close()
driver.close()
time.sleep(3600)
文件中引用的文件testString只是修改輸出格式，提取有效數據。

[python]view plain

#Encoding=utf-8
defOnlyCharNum(s,oth=''):
s2=s.lower()
fomart=',.'
forcins2:
ifnotcinfomart:
s=s.replace(c,'')
returns
defAMonitorRecord(str):
str=str.split(":")
returnstr[0]+","+OnlyCharNum(str[1])

一小時抓取一次數據，結果如下：

㈦ python如何抓取網頁源代碼中的字元串

使用正則匹配，列：

importrequests
importre

req=requests.get(url)
r=re.findall('<scriptsrc="(.*?)"></script>',req.text)#(.*?)非貪婪匹配
print(r)

自己網上找找python正則方面的知識

㈧ python怎麼爬取網頁源代碼

#!/usr/bin/env python3
#-*- coding=utf-8 -*-

import urllib3

if __name__ == '__main__':
http=urllib3.PoolManager()
r=http.request('GET','IP')
print(r.data.decode("gbk"))

可以正常抓取。需要安裝urllib3,py版本3.43

㈨ python，求一個簡單的selenium+re的網頁源碼爬取

網頁爬取不一定要用Selenium，Selenium是為了注入瀏覽器獲取點擊行為的調試工具，如果網頁無需人工交互就可以抓取，不建議你使用selenium。要使用它，你需要安裝一個工具軟體，使用Chrome瀏覽器需要下載chromedriver.exe到system32下，如使用firefox則要下載geckodriver.exe到system32下。下面以chromedriver驅動chrome為例：

#-*-coding:UTF-8-*-
fromseleniumimportwebdriver
frombs4importBeautifulSoup
importre
importtime

if__name__=='__main__':

	options=webdriver.ChromeOptions()
	options.add_argument('user-agent="Mozilla/5.0(Linux;Android4.0.4;GalaxyNexusBuild/IMM76B)AppleWebKit/535.19(KHTML,likeGecko)Chrome/18.0.1025.133MobileSafari/535.19"')
	driver=webdriver.Chrome()
	driver.get('url')#你要抓取網路文庫的URL，隨便找個幾十頁的替換掉

	html=driver.page_source
	bf1=BeautifulSoup(html,'lxml')
	result=bf1.find_all(class_='rtcspage')
	bf2=BeautifulSoup(str(result[0]),'lxml')
	title=bf2.div.div.h1.string
	pagenum=bf2.find_all(class_='size')
	pagenum=BeautifulSoup(str(pagenum),'lxml').span.string
	pagepattern=re.compile('頁數：(d+)頁')
	num=int(pagepattern.findall(pagenum)[0])
	print('文章標題：%s'%title)
	print('文章頁數：%d'%num)


	whileTrue:
		num=num/5.0
		html=driver.page_source
		bf1=BeautifulSoup(html,'lxml')
		result=bf1.find_all(class_='rtcspage')
		foreach_resultinresult:
			bf2=BeautifulSoup(str(each_result),'lxml')
			texts=bf2.find_all('p')
			foreach_textintexts:
				main_body=BeautifulSoup(str(each_text),'lxml')
				foreachinmain_body.find_all(True):
					ifeach.name=='span':
						print(each.string.replace('xa0',''),end='')
					elifeach.name=='br':
						print('')
			print('
')
		ifnum>1:
			page=driver.find_elements_by_xpath("//div[@class='page']")
			driver.execute_script('arguments[0].scrollIntoView();',page[-1])#拖動到可見的元素去
			nextpage=driver.find_element_by_xpath("//a[@data-fun='next']")
			nextpage.click()
			time.sleep(3)
		else:
			break

執行代碼，chromedriver自動為你打開chrome瀏覽器，此時你翻頁到最後，點擊閱讀更多，然後等一段時間後關閉瀏覽器，代碼繼續執行。

㈩ python怎麼看源碼進行網路爬蟲

在我們日常上網瀏覽網頁的時候，經常會看到一些好看的圖片，我們就希望把這些圖片保存下載，或者用戶用來做桌面壁紙，或者用來做設計的素材。
我們最常規的做法就是通過滑鼠右鍵，選擇另存為。但有些圖片滑鼠右鍵的時候並沒有另存為選項，還有辦法就通過就是通過截圖工具截取下來，但這樣就降低圖片的清晰度。好吧～！其實你很厲害的，右鍵查看頁面源代碼。
我們可以通過python 來實現這樣一個簡單的爬蟲功能，把我們想要的代碼爬取到本地。下面就看看如何使用python來實現這樣一個功能。

一，獲取整個頁面數據

首先我們可以先獲取要下載圖片的整個頁面信息。
getjpg.py

#coding=utf-8
import urllib

def getHtml(url):
page = urllib.urlopen(url)
html = page.read()
return html

html = getHtml("http://tieba..com/p/2738151262")

print html

Urllib 模塊提供了讀取web頁面數據的介面，我們可以像讀取本地文件一樣讀取www和ftp上的數據。首先，我們定義了一個getHtml()函數:
urllib.urlopen()方法用於打開一個URL地址。
read()方法用於讀取URL上的數據，向getHtml()函數傳遞一個網址，並把整個頁面下載下來。執行程序就會把整個網頁列印輸出。

二，篩選頁面中想要的數據

Python 提供了非常強大的正則表達式，我們需要先要了解一點python 正則表達式的知識才行。
http://www.cnblogs.com/fnng/archive/2013/05/20/3089816.html

假如我們網路貼吧找到了幾張漂亮的壁紙，通過到前段查看工具。找到了圖片的地址，如：src=」https://gss0..com/70cFfyinKgQFm2e88IuM_a/forum......jpg」pic_ext=」jpeg」

修改代碼如下：

import re
import urllib

def getHtml(url):
page = urllib.urlopen(url)
html = page.read()
return html

def getImg(html):
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)
return imglist

html = getHtml("http://tieba..com/p/2460150866")
print getImg(html)

我們又創建了getImg()函數，用於在獲取的整個頁面中篩選需要的圖片連接。re模塊主要包含了正則表達式：
re.compile() 可以把正則表達式編譯成一個正則表達式對象.
re.findall() 方法讀取html 中包含 imgre（正則表達式）的數據。
運行腳本將得到整個頁面中包含圖片的URL地址。

三，將頁面篩選的數據保存到本地

把篩選的圖片地址通過for循環遍歷並保存到本地，代碼如下：

#coding=utf-8
import urllib
import re

def getHtml(url):
page = urllib.urlopen(url)
html = page.read()
return html

def getImg(html):
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)
x = 0
for imgurl in imglist:
urllib.urlretrieve(imgurl,'%s.jpg' % x)
x+=1

html = getHtml("http://tieba..com/p/2460150866")

print getImg(html)

這里的核心是用到了urllib.urlretrieve()方法，直接將遠程數據下載到本地。
通過一個for循環對獲取的圖片連接進行遍歷，為了使圖片的文件名看上去更規范，對其進行重命名，命名規則通過x變數加1。保存的位置默認為程序的存放目錄。
程序運行完成，將在目錄下看到下載到本地的文件。

導航:首頁 > 源碼編譯 > py爬網頁源碼

py爬網頁源碼

與py爬網頁源碼相關的資料