Python web scraping examples

Published: 2022-04-14 10:54:15

‘1’ How to scrape a site that requires login in Python: code example

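// This answer is a Java JDBC fragment, not Python. The method header was missing
// in the original; the name getConnection is assumed here.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

static Connection getConnection() throws ClassNotFoundException, SQLException {
    final String url = "jdbc:oracle:thin:@localhost:1521:ORCL";
    final String user = "store";
    final String password = "store_password";
    Class.forName("oracle.jdbc.driver.OracleDriver");
    Connection con = DriverManager.getConnection(url, user, password);
    return con;
}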

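‘2’ How to scrape data from a web page with Python

Fetching a page with the built-in packages amounts to imitating a browser visiting the page and then parsing the data out of the response. The whole thing can be viewed as a single request.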
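To make the "request, then parse" idea concrete, here is a minimal sketch that uses only the standard library. The names TitleParser, parse_title and fetch are made up for illustration, and UTF-8 is only an assumed default encoding; real pages may need a different one.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html_text):
    """Parse step: pull the page title out of already-downloaded HTML."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def fetch(url, encoding="utf-8", timeout=10):
    """Request step: one HTTP request, much like a browser loading the page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")
```

With a reachable address, parse_title(fetch(some_url)) would run both steps. Keeping them separate lets the parsing be checked on a plain string without touching the network.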

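‘3’ How to scrape a web database with Python

The simplest option is urllib. Its usage differs between Python 2.x and Python 3.x; taking Python 2.x as the example:
import urllib
html = urllib.urlopen(url)
text = html.read()
For more complex cases, use the requests library, which supports all request types as well as cookies, headers, and so on.
For still more complex cases, use selenium, which can capture text generated by JavaScript.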
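For comparison, a rough Python 3.x counterpart of the snippet above might look like the following sketch. It also sets a request header, which is one of the things the answer recommends requests for. The helper names and the User-Agent string are assumptions chosen for illustration.

```python
import urllib.request


def build_request(url, user_agent="example-scraper/0.1"):
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, encoding="utf-8", timeout=10):
    """Send the request and decode the body (UTF-8 is only an assumed default)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")
```

In Python 3 the Python 2 function urllib.urlopen is gone, and urllib.request.urlopen takes its place.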

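‘4’ How to scrape the content of this page with Python

If the page contains dynamic content, consider Selenium, a browser automation testing framework. Paying someone to do it is also an option.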

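‘5’ How to extract information from a web page with Python

Use the requests library plus regular expressions, a DOM library, an XPath library, and so on.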
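As a small illustration of two of these extraction styles, the sketch below pulls the same links out of a sample document, once with a regular expression and once with the limited XPath subset in the standard library's xml.etree.ElementTree. The sample markup and function names are invented. ElementTree only accepts well-formed markup, so for real-world HTML a dedicated library such as lxml is the more likely choice.

```python
import re
import xml.etree.ElementTree as ET

# A small, well-formed sample document (made up for this example).
SAMPLE = """<html><body>
<ul id="news">
  <li><a href="/a.html">First</a></li>
  <li><a href="/b.html">Second</a></li>
</ul>
</body></html>"""


def links_with_regex(html_text):
    """Regex approach: short, but tied to the exact way the tags are written."""
    return re.findall(r'<a\s+href="([^"]+)">([^<]*)</a>', html_text)


def links_with_xpath(html_text):
    """XPath approach: selects by document structure rather than by spelling."""
    root = ET.fromstring(html_text)
    anchors = root.findall(".//ul[@id='news']/li/a")
    return [(a.get("href"), a.text) for a in anchors]
```

Both functions return a list of (href, text) pairs for the sample above.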

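‘6’ Has anyone used Python's re to scrape web pages? Could you give an example? Thanks

This is a very simple page-scraping script I wrote. It collects every link address on a given URL and fetches the title of each link.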

===========geturls.py================
#coding:utf-8
# Python 2 script (uses urllib.urlopen, urlparse and the print statement)
import urllib
import urlparse
import re
import socket
import threading

# Regular expressions for links and titles
urlre = re.compile(r"href=[\"']?([^ >\"']+)")
titlere = re.compile(r"<title>(.*?)</title>",re.I)

# Set the timeout to 10 seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# Maximum number of threads
max = 10
# Current number of threads
current = 0

def gettitle(url):
    global current
    try:
        content = urllib.urlopen(url).read()
    except:
        current -= 1
        return
    if titlere.search(content):
        title = titlere.search(content).group(1)
        try:
            title = title.decode('gbk').encode('utf-8')
        except:
            title = title
    else:
        title = "No title"
    print "%s: %s" % (url,title)
    current -= 1
    return

def geturls(url):
    global current,max
    ts = []
    content = urllib.urlopen(url)
    # Use a set to remove duplicates
    result = set()
    for eachline in content:
        if urlre.findall(eachline):
            temp = urlre.findall(eachline)
            for x in temp:
                # For a site-internal link, prepend the url
                if not x.startswith("http:"):
                    x = urlparse.urljoin(url,x)
                # Do not record js and css files
                if not x.endswith(".js") and not x.endswith(".css"):
                    result.add(x)
    threads = []
    for url in result:
        t = threading.Thread(target=gettitle,args=(url,))
        threads.append(t)
    i = 0
    # Start the threads, waiting whenever the maximum is already running
    while i < len(threads):
        if current < max:
            threads[i].start()
            i += 1
            current += 1
        else:
            pass

geturls("http://www..com")

Regular expressions (re) can only handle fairly simple or mechanical tasks. If you need more powerful page analysis, try beautiful soup or pyquery. Hope this helps.
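The script above targets Python 2. As a hedged sketch of how the same idea could look in Python 3, the version below keeps the two regular expressions but replaces the hand-rolled thread counter with concurrent.futures.ThreadPoolExecutor. This is a substitute technique, not the original author's method. The function names and the fetch_page parameter are assumptions; the parameter exists so that the logic can be tried without network access.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
from urllib.request import urlopen

URL_RE = re.compile(r"href=[\"']?([^ >\"']+)")
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.I | re.S)


def extract_links(base_url, html_text):
    """Return the set of absolute links in html_text, skipping .js and .css files."""
    links = set()
    for href in URL_RE.findall(html_text):
        absolute = urljoin(base_url, href)
        if not absolute.endswith((".js", ".css")):
            links.add(absolute)
    return links


def extract_title(html_text, default="No title"):
    """Return the contents of <title>, or a default when the page has none."""
    match = TITLE_RE.search(html_text)
    return match.group(1).strip() if match else default


def fetch(url, timeout=10):
    """Download a page as text (UTF-8 is an assumption, not detected)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def titles_for(base_url, fetch_page=fetch, max_workers=10):
    """Map every link found on base_url to the title of the page it points to."""
    links = sorted(extract_links(base_url, fetch_page(base_url)))

    def title_of(link):
        try:
            return extract_title(fetch_page(link))
        except Exception:
            return None  # the page could not be fetched

    # The pool keeps at most max_workers downloads running at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(links, pool.map(title_of, links)))
```

The executor bounds concurrency by itself, so no shared counter has to be updated from several threads.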

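‘7’ Looking for Python code to scrape a web page

In Python 3.x, use the urllib.request module to fetch a page's source. The urllib.request.urlopen function returns the page content as a stream; read() reads the data out of it, and decode() then decodes the resulting bytes. (The encoding can be found in the page source, in <meta http-equiv="content-type" content="text/html;charset=gbk" />; in the example below it is gbk.) This yields the page's source code.

As the example below shows, to fetch this page's source: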

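import urllib.request

# url: address of the page to fetch (the address itself is missing from the original)
html = urllib.request.urlopen(url).read().decode('gbk')  # decode using the page's own encoding
print(html)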

The following is the documentation for the urllib.request.urlopen function:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)


Open the URL url, which can be either a string or a Request object.


data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object and in that case Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.


data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.


urllib.request module uses HTTP/1.1 and includes Connection:close header in its HTTP requests.


The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.


If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.


The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().


The cadefault parameter is ignored.


For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods.


For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager and has methods such as

geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed

info(): return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

getcode(): return the HTTP status code of the response.


Raises URLError on errors.


Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).


In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.


The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.



Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.
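The documentation quoted above says that passing data turns the request into a POST and that the data should be urlencoded and converted to bytes. A minimal sketch of that behaviour follows. The helper name build_post and the example field names are assumptions, and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_post(url, fields, charset="utf-8"):
    """Build a POST request whose body is form-encoded from a dict of fields."""
    # urlencode returns a str of percent-escapes; it must be bytes before use as data.
    body = urllib.parse.urlencode(fields, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=" + charset
    }
    return urllib.request.Request(url, data=body, headers=headers)
```

Passing the resulting object to urllib.request.urlopen would send it. A Request built without data stays a GET.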

‘8’ How to scrape page content with a Python crawler

First, install requests and BeautifulSoup4, then run the following code.

import requests
from bs4 import BeautifulSoup

iurl = 'http://news.sina.com.cn/c/nd/2017-08-03/doc-ifyitapp0128744.shtml'

res = requests.get(iurl)

res.encoding = 'utf-8'

# print(len(res.text))

soup = BeautifulSoup(res.text, 'html.parser')

# Title
H1 = soup.select('#artibodyTitle')[0].text

# Time and source
time_source = soup.select('.time-source')[0].text


# Source (the selector reads '#artibodyp' in the original; the space was presumably stripped)
origin = soup.select('#artibody p')[0].text.strip()

# Original title
oriTitle = soup.select('#artibody p')[1].text.strip()

# Body text
raw_content = soup.select('#artibody p')[2:19]
content = []
for paragraph in raw_content:
    content.append(paragraph.text.strip())
# Join the paragraphs with '@' and keep the result
article = '@'.join(content)
# Editor in charge
ae = soup.select('.article-editor')[0].text

That is all it takes.

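‘9’ How to scrape a page with Python and perform submit operations

First, find the login element: select the account input field -> right-click -> Inspect.

Then look through the page source for that part and read the submitted form parameters off the tags. One point to emphasize:

The form tag and the input tags under it are very important. The form tag's action attribute is the request URL, and each input tag's name attribute is the KEY of a submitted parameter.
Reference code:
import requests
url = "URL"  # the form's action attribute
params = {
    "source": "index_nav",       # name of an input tag
    "form_email": "xxxxxx",      # name of an input tag
    "form_password": "xxxxxx"    # name of an input tag
}
html = requests.post(url, data=params)
print(html.text)

After running it, the account is logged in. This is equivalent to submitting the login form.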
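The manual inspection step described above (reading action from the form tag and name from each input) can also be done in code. Below is a minimal sketch using the standard library's html.parser. FormParser and login_payload are hypothetical names, and the sketch looks only at the first form on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormParser(HTMLParser):
    """Records the action of the first <form> and the name/value of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._inside = False
        self._seen = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._seen:
            self._inside = self._seen = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._inside and attrs.get("name"):
            # Hidden inputs often carry values the server expects back.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def login_payload(page_url, html_text, credentials):
    """Return (post_url, fields): the form's defaults overlaid with the credentials."""
    parser = FormParser()
    parser.feed(html_text)
    fields = dict(parser.fields)
    fields.update(credentials)
    return urljoin(page_url, parser.action or ""), fields
```

The returned pair maps onto the url and params variables in the reference code above. For pages behind the login, a requests.Session is usually used for the later requests so that the login cookies are sent again.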

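‘10’ How to scrape specific content from a web page with Python

Python is quite good for data processing, and it is a good choice if you want to write a crawler. It has many ready-made packages that can do complex jobs with a single call. Everything in this article is based on the BeautifulSoup package.
1 Getting a page's content (its source code) with Python
import urllib2  # Python 2 only
page = urllib2.urlopen(url)
contents = page.read()
# contents now holds the whole page, that is, its source code
print(contents)
url is the page address, contents is the source code at that address, and urllib2 is the package required. The statements above are enough to fetch a page's entire source code.
2 Extracting the content you want from the page (first get the page source, then analyze it to find the corresponding tags, then extract the content of those tags)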
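The text for step 2 is cut off in this excerpt, so the sketch below is not the article's own BeautifulSoup code. As a stand-in, it uses the standard library's html.parser to show the idea of finding a tag and extracting what it contains. ElementText and find_text are hypothetical names.

```python
from html.parser import HTMLParser


class ElementText(HTMLParser):
    """Collects the text inside every <tag> whose attributes include the given ones."""

    def __init__(self, tag, attrs=None):
        super().__init__()
        self.tag = tag
        self.wanted = attrs or {}
        self.depth = 0       # nesting level inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag and all(
            dict(attrs).get(key) == value for key, value in self.wanted.items()
        ):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def find_text(html_text, tag, attrs=None):
    """Return the stripped text of each matching element, in document order."""
    parser = ElementText(tag, attrs)
    parser.feed(html_text)
    return [text.strip() for text in parser.results]
```

For example, find_text(contents, "div", {"id": "content"}) would return the text inside a div with that id, assuming the page has one.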
