
Downloading Web Page Source with Python

Published: 2022-05-31 20:41:55

㈠ How do I view a web page's source code with Python?

To view a page's source code using the requests library:

1. Import the requests package with an import statement

import requests

2. Call the package's get() method, passing the URL of the page you want to view, and assign the result to a variable x

x = requests.get(url='http://www.hao123.com')

3. Output the page content as text with print(x.text)

print(x.text)

The complete code:

import requests
x = requests.get(url='http://www.hao123.com')
print(x.text)


㈡ Sample Python code for fetching a web page

In Python 3.x, use the urllib.request module to fetch a page's code. Call urllib.request.urlopen() to open the URL, read the response as a byte stream with read(), and then decode those bytes with decode() using the page's character encoding. You can find the encoding in the page source, e.g. <meta http-equiv="content-type" content="text/html;charset=gbk" /> indicates gbk encoding, as in the example below. The decoded result is the page's source code.
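As a small sketch of that encoding lookup, the charset can be sniffed from the raw bytes before decoding; detect_charset below is a helper name invented here for illustration, not part of urllib:

```python
import re

def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """Search the first bytes of a page for a charset=... declaration."""
    head = raw[:2048].decode("ascii", errors="ignore")
    m = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return m.group(1) if m else default

raw = b'<meta http-equiv="content-type" content="text/html;charset=gbk" />'
print(detect_charset(raw))  # gbk
```

With this helper, raw.decode(detect_charset(raw), errors="replace") decodes a page without hard-coding gbk.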

For example, fetching a page's code:

import urllib.request

# The URL in the original post was elided; substitute the page you want to fetch
html = urllib.request.urlopen('http://www.example.com').read().decode('gbk')  # decode using the page's declared encoding
print(html)

The urllib.request.urlopen function is documented as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)


Open the URL url, which can be either a string or a Request object.


data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.


data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.
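For instance, urllib.parse.urlencode can prepare such a data argument; the field names below are made up for illustration:

```python
import urllib.parse

# Hypothetical form fields; real names depend on the form you are posting to
fields = {"city": "shenzhen", "hour": "08"}
data = urllib.parse.urlencode(fields).encode("utf-8")  # urlopen requires bytes
print(data)  # b'city=shenzhen&hour=08'
```

Passing this data object to urlopen() then turns the request into a POST, as described above.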


The urllib.request module uses HTTP/1.1 and includes a Connection:close header
in its HTTP requests.


The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.


If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.


The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().


The cadefault parameter is ignored.


For http and https urls, this function returns a http.client.HTTPResponse object, which has the methods described under HTTPResponse Objects.


For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as


geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() — return the HTTP status code of the response.


Raises URLError on errors.
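A minimal sketch of catching that error; the file:// URL below is deliberately nonexistent so the failure can be demonstrated without any network access:

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the body of url as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        print("fetch failed:", e.reason)
        return None

# A file:// URL for a path that does not exist raises URLError locally
print(fetch("file:///no/such/path/anywhere"))
```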


Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).


In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.


The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.



Changed in version 3.2: cafile
and capath were added.



Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).



New in version 3.2: data can be
an iterable object.



Changed in version 3.3: cadefault
was added.



Changed in version 3.4.3: context
was added.

㈢ How do I save a page's code to a file with a Python crawler?

The steps are:
1. Send a request
2. Work out the request's URL and parameters
3. Build the parameters, send the request, and get the response
4. Save the response to a file
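With the standard library, those steps might look like this; save_page and its gbk usage below are assumptions for illustration, not the answerer's code:

```python
import urllib.request

def save_page(url, path, encoding="utf-8"):
    """Fetch url (steps 1-3) and write its decoded source to path (step 4)."""
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()                       # get the response bytes
    text = raw.decode(encoding, errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)                           # save it to a file

# e.g. save_page("http://www.hao123.com", "page.html", encoding="gbk")
```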

㈣ Extracting links and titles from a page with Python

To extract all the links, use a loop:

urls = driver.find_elements_by_xpath("//a")
for url in urls:
    print(url.get_attribute("href"))

If get_attribute raises an error, no <a> element was found. If you are sure the elements exist, the page may simply not have finished loading: Selenium does not wait for an element to appear by default, so add a wait before locating it. Also, if the element sits inside an iframe, you must switch into the iframe first before you can find it.
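If the page is static, the same extraction works without a browser using the standard-library html.parser; LinkCollector is a class name invented here as a sketch:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from the <a> tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the currently open <a>, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = LinkCollector()
p.feed('<a href="/a">First</a> <a href="/b">Second</a>')
print(p.links)  # [('/a', 'First'), ('/b', 'Second')]
```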

㈤ How do I get the source of a dynamic page with a Python crawler?

A month ago my internship advisor asked me to use a web crawler to fetch the rainfall data published by the Shenzhen Meteorological Bureau, on a page like the one shown:

I figured crawlers weren't hard, remembering how confident I had been scraping picture boards with a friend years before. After taking the task I had a month of exams and assignments, my advisor didn't push, and I didn't hurry.

But if my advisor was willing to wait a month for me, the job had to be hard, and opening the site today confirmed it: the page is built on Ajax and fetches its data dynamically, so you cannot simply download the source and parse it.

Inspired by an example someone wrote for scraping Taobao, the usual approach in this situation is to drive a browser yourself; PhantomJS and CasperJS are both good choices.

The requirement was hourly rainfall for every station in every Shenzhen district over the past year, which the site exposes through its history query: you query by a time, and that time is stored in an <input> element of type hidden. In principle you can change the element's type to text with a JavaScript statement and then use send_keys on it. However, that failed for me; the time could be set, but the query still returned the result shown below.

So I settled for scraping only the real-time data, using Python's Selenium to drive a browser and simulate human clicks. A major advantage of Selenium is that it gives you the source after the page has been rendered, i.e. after your actions have executed; parsing a plain URL only yields the initial payload and allows no interaction with the page. Selenium also offers rich locators, of which I personally find find_element_by_xpath("xxx") the most convenient. Once an element is located you can click it, type into it, and so on, which sends the requests to the server that generate the data you need.


㈥ How to parse a page with Python and get its real source

Look into driving a WebKit engine from Python. What you describe is not JS encryption; the content is loaded dynamically by JavaScript, so the page must be rendered by a browser engine such as WebKit before the real source is available.

㈦ How do I fetch a page's source code with Python?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import urllib3

if __name__ == '__main__':
    http = urllib3.PoolManager()
    r = http.request('GET', 'IP')  # the original post elided the URL here
    print(r.data.decode("gbk"))

This fetches the page fine. urllib3 must be installed; the original was run on Python 3.4.3.

㈧ How to quickly download a page's content with Python

In Python 2 a single call does it: urllib.urlopen(url).read() returns the content of the page at that address. In Python 3 the equivalent is urllib.request.urlopen(url).read().

㈨ How do I download files from a web page with Python?

You need to analyze the page, extract the links in it, and then download those links.
Python's built-in urllib2 and urllib can process pages, but it is tedious and you have to write a lot of code yourself. With a library like Beautiful Soup, handling HTML becomes much easier; its documentation has a Chinese translation (I won't paste the link, since the forum tends to block it). Work through a few examples from the docs and you will be able to handle your own task easily.
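As a sketch of the "extract then download" idea using only the standard library; download_links and its extension filter are assumptions invented for illustration:

```python
import os
import urllib.parse
import urllib.request

def download_links(hrefs, dest_dir=".", exts=(".pdf", ".zip")):
    """Download every link whose path ends in one of exts; return the saved paths."""
    saved = []
    for href in hrefs:
        path = urllib.parse.urlparse(href).path
        if path.lower().endswith(exts):
            name = os.path.basename(path) or "download"
            target = os.path.join(dest_dir, name)
            urllib.request.urlretrieve(href, target)  # also accepts file:// URLs
            saved.append(target)
    return saved
```

In practice the hrefs list would come from a link extractor such as Beautiful Soup, as suggested above.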

㈩ How does a Python crawler grab a block of HTML from a page?

Match a generous range, like this:

re.findall('(<div class="moco-course-wrap".*?</div>)', source, re.S)

Note that the non-greedy .*? stops at the first </div>, so a block containing nested <div> tags will be cut short.

You can also look at this:

http://blog.csdn.net/tangdou5682/article/details/52596863
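A runnable sketch of that pattern on some made-up HTML; the class name follows the answer above:

```python
import re

source = """
<div class="moco-course-wrap"><h3>Course A</h3></div>
<div class="other">not a course</div>
<div class="moco-course-wrap"><h3>Course B</h3></div>
"""

# re.S lets . match newlines, so blocks spanning several lines are captured too
blocks = re.findall(r'(<div class="moco-course-wrap".*?</div>)', source, re.S)
print(len(blocks))  # 2
print(blocks[0])  # <div class="moco-course-wrap"><h3>Course A</h3></div>
```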
