『壹』 python如何實現爬取需要登錄的網站代碼實例
final String url = "jdbc:oracle:thin:@localhost:1521:ORCL";
final String user = "store";
final String password = "store_password";
Class.forName("oracle.jdbc.driver.OracleDriver");
Connection con = DriverManager.getConnection(url, user, password);
return con;
}
『貳』 如何用python抓取網頁上的數據
使用內置的包來抓取,就是在模仿瀏覽器訪問頁面,再把頁面的數據給解析出來,也可以看做是一次請求。
『叄』 如何用python抓取網頁資料庫
最簡單可以用urllib,python2.x和python3.x的用法不同,以python2.x為例:
import
urllib
html
=
urllib.open(url)
text
=
html.read()
復雜些可以用requests庫,支持各種請求類型,支持cookies,header等
再復雜些的可以用selenium,支持抓取javascript產生的文本
『肆』 如何用python抓取這個網頁的內容
如果包含動態內容可以考慮使用Selenium瀏覽器自動化測試框架,當然找人有償服務也可以
『伍』 python如何提取網頁信息
requests庫+ 正則表達式/dom庫/xpath庫等
『陸』 誰用過python中的re來抓取網頁,能否給個例子,謝謝
這是我寫的一個非常簡單的抓取頁面的腳本,作用為獲得指定URL的所有鏈接地址並獲取所有鏈接的標題。
===========geturls.py================
#coding:utf-8
import urllib
import urlparse
import re
import socket
import threading
#定義鏈接正則
urlre = re.compile(r"href=[\"']?([^ >\"']+)")
titlere = re.compile(r"<title>(.*?)</title>",re.I)
#設置超時時間為10秒
timeout = 10
socket.setdefaulttimeout(timeout)
#定義最高線程數
max = 10
#定義當前線程數
current = 0
def gettitle(url):
global current
try:
content = urllib.urlopen(url).read()
except:
current -= 1
return
if titlere.search(content):
title = titlere.search(content).group(1)
try:
title = title.decode('gbk').encode('utf-8')
except:
title = title
else:
title = "無標題"
print "%s: %s" % (url,title)
current -= 1
return
def geturls(url):
global current,max
ts = []
content = urllib.urlopen(url)
#使用set去重
result = set()
for eachline in content:
if urlre.findall(eachline):
temp = urlre.findall(eachline)
for x in temp:
#如果為站內鏈接,前面加上url
if not x.startswith("http:"):
x = urlparse.urljoin(url,x)
#不記錄js和css文件
if not x.endswith(".js") and not x.endswith(".css"):
result.add(x)
threads = []
for url in result:
t = threading.Thread(target=gettitle,args=(url,))
threads.append(t)
i = 0
while i < len(threads):
if current < max:
threads[i].start()
i += 1
current += 1
else:
pass
geturls("http://www..com")
使用正則表達式(re)只能做到一些比較簡單或者機械的功能,如果需要更強大的網頁分析功能,請嘗試一下beautiful soup或者pyquery,希望能幫到你
『柒』 求python抓網頁的代碼
python3.x中使用urllib.request模塊來抓取網頁代碼,通過urllib.request.urlopen函數取網頁內容,獲取的為數據流,通過read()函數把數字讀取出來,再把讀取的二進制數據通過decode函數解碼(編號可以通過查看網頁源代碼中<meta http-equiv="content-type" content="text/html;charset=gbk" />得知,如下例中為gbk編碼。),這樣就得到了網頁的源代碼。
如下例所示,抓取本頁代碼:
importurllib.request
html=urllib.request.urlopen('
).read().decode('gbk')#注意抓取後要按網頁編碼進行解碼
print(html)
以下為urllib.request.urlopen函數說明:
urllib.request.urlopen(url,
data=None, [timeout, ]*, cafile=None, capath=None,
cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.
data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.
urllib.request mole uses HTTP/1.1 and includes Connection:close header
in its HTTP requests.
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.
If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.
The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().
The cadefault parameter is ignored.
For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.
For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as
geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)
getcode() – return the HTTP status code of the response.
Raises URLError on errors.
Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.
The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.
Changed in version 3.2: cafile
and capath were added.
Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).
New in version 3.2: data can be
an iterable object.
Changed in version 3.3: cadefault
was added.
Changed in version 3.4.3: context
was added.
『捌』 如何用Python爬蟲抓取網頁內容
首先,你要安裝requests和BeautifulSoup4,然後執行如下代碼.
importrequests
frombs4importBeautifulSoup
iurl='http://news.sina.com.cn/c/nd/2017-08-03/doc-ifyitapp0128744.shtml'
res=requests.get(iurl)
res.encoding='utf-8'
#print(len(res.text))
soup=BeautifulSoup(res.text,'html.parser')
#標題
H1=soup.select('#artibodyTitle')[0].text
#來源
time_source=soup.select('.time-source')[0].text
#來源
origin=soup.select('#artibodyp')[0].text.strip()
#原標題
oriTitle=soup.select('#artibodyp')[1].text.strip()
#內容
raw_content=soup.select('#artibodyp')[2:19]
content=[]
forparagraphinraw_content:
content.append(paragraph.text.strip())
'@'.join(content)
#責任編輯
ae=soup.select('.article-editor')[0].text
這樣就可以了
『玖』 怎麼用python抓取網頁並實現一些提交操作
首先我們找到登錄的元素,在輸入賬號處選中–>右鍵–>檢查
然後直接查詢網頁源代碼去找到上面的部分,根據標簽來觀察提交的表單參數,這里強調一下:
form標簽和form標簽下的input標簽非常重要,form標簽中的action屬性代表請求的URL,input標簽下的name屬性代表提交參數的KEY。
代碼參考如下:
import requests
url="網址" #action屬性
params={
"source":"index_nav", #input標簽下的name
"form_email":"xxxxxx", #input標簽下的name
"form_password":"xxxxxx" #input標簽下的name
}
html=requests.post(url,data=params)
print(html.text)
運行後發現已登錄賬號,相當於一個提交登陸的操作
『拾』 如何用python抓取網頁特定內容
Python用做數據處理還是相當不錯的,如果你想要做爬蟲,Python是很好的選擇,它有很多已經寫好的類包,只要調用,即可完成很多復雜的功能,此文中所有的功能都是基於BeautifulSoup這個包。
1 Pyhton獲取網頁的內容(也就是源代碼)
page = urllib2.urlopen(url)
contents = page.read()
#獲得了整個網頁的內容也就是源代碼 print(contents)
url代表網址,contents代表網址所對應的源代碼,urllib2是需要用到的包,以上三句代碼就能獲得網頁的整個源代碼
2 獲取網頁中想要的內容(先要獲得網頁源代碼,再分析網頁源代碼,找所對應的標簽,然後提取出標簽中的內容)