Scraping web news with Python: how to fetch web page content with a Python crawler

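⑴ When scraping a Sina news page with Python, why did the class name used to split the code disappear?

Try looking the element up by attribute instead, for example:
soup.find(attrs={'class': 'feed-card-item'})
What does "replace" in Figure 3 mean? Please add more detail.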
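
The attribute lookup suggested above can be sketched as follows. This is a minimal illustration, not the asker's code: the markup in SAMPLE_HTML and the helper names card_titles and first_card_link are made up, and only the class name feed-card-item comes from the answer. One possible reason a class seems to "disappear" is that the HTML the script downloads differs from what the browser shows, so the sketch checks for None before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a news list page.
SAMPLE_HTML = """
<div class="feed-card-item"><h2><a href="/a/1.shtml">First headline</a></h2></div>
<div class="feed-card-item extra"><h2><a href="/a/2.shtml">Second headline</a></h2></div>
<div class="sidebar">Not a news card</div>
"""


def card_titles(html):
    """Return the text of every element whose class list contains feed-card-item."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={'class': ...} matches when the value is one of the element's classes.
    cards = soup.find_all(attrs={"class": "feed-card-item"})
    return [card.get_text(strip=True) for card in cards]


def first_card_link(html):
    """Return the href of the first card's link, or None if no card is present."""
    soup = BeautifulSoup(html, "html.parser")
    card = soup.find(attrs={"class": "feed-card-item"})
    if card is None:  # the class is absent from the HTML that was parsed
        return None
    link = card.find("a")
    return link["href"] if link else None
```

If first_card_link returns None on a real page, printing the downloaded HTML and searching it for the class name is a quick way to see whether the element is there at all.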

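⑵ How to scrape Tencent news content with a Python web crawler

Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. It is similar to a program imitating what a browser such as IE does: the URL is sent to the server as the content of an HTTP request, and the resource in the server's response is then read back. In Python 2 we use the urllib2 module to fetch web pages (in Python 3 its role is taken by urllib.request). u...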
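
The request-and-save cycle described above might look like this in Python 3. It is a sketch under assumptions: the function names fetch and save, the User-Agent value, and the default encoding are illustrative choices, and urllib.request is used in place of the Python 2 urllib2 module named in the answer.

```python
import urllib.request


def fetch(url, encoding="utf-8", timeout=10):
    """Request url and return the response body decoded as text."""
    # The browser-like User-Agent is an assumption; some sites reject the default one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()  # bytes read from the response stream
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(encoding, errors="replace")


def save(text, path, encoding="utf-8"):
    """Write the fetched text to a local file."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
```

A typical call would be save(fetch(some_url), "page.html"), where some_url is whichever news page is being scraped.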

⑶ How to scrape a news website with python3

# coding=utf-8
# Python 3 port of the original Python 2 script: urllib2 becomes urllib.request,
# and reload(sys)/sys.setdefaultencoding/codecs are replaced by open(..., encoding=...).
import re              # regular expressions
import urllib.request  # network access (urllib2 in Python 2)

import bs4             # Beautiful Soup 4 parser

import News            # user-defined news structure (News.py, listed below)

# Link matching rule; the host name is elided in the source and is kept as-is
LINK_PATTERN = re.compile(r'http://\w+\.jia\.\.com/article/\w+')

url_set = set()  # URLs waiting to be crawled
url_old = set()  # URLs already crawled

NewsCount = 0     # number of articles saved so far
MaxNewsCount = 3  # stop after this many articles


# Collect all article links from the home page
def GetAllUrl(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=LINK_PATTERN):
        url_set.add(link['href'])


# Crawl queued URLs until MaxNewsCount articles have been saved
def GetNews():
    global NewsCount
    while url_set and NewsCount < MaxNewsCount:
        # Take the next link and mark it as visited
        url = url_set.pop()
        url_old.add(url)
        try:
            # Download and parse the page
            html = urllib.request.urlopen(url).read().decode('utf8')
            soup = bs4.BeautifulSoup(html, 'html.parser')

            # Queue links that have not been crawled yet
            for link in soup.find_all('a', href=LINK_PATTERN):
                if link['href'] not in url_old:
                    url_set.add(link['href'])

            # Extract the article fields
            article = News.News()
            article.url = url                                               # URL
            page = soup.find('div', {'id': 'page'})
            article.title = page.find('h1').get_text()                      # title
            info = page.find('div', {'class': 'article-info'})
            article.author = info.find('a', {'class': 'name'}).get_text()   # author
            article.date = info.find('span', {'class': 'time'}).get_text()  # date
            article.about = page.find('blockquote').get_text()
            pnode = page.find('div', {'class': 'article-detail'}).find_all('p')
            article.content = ''
            for node in pnode:                             # article paragraphs
                article.content += node.get_text() + '\n'  # append each paragraph

            SaveNews(article)
        except Exception as e:
            # Skip pages that fail to download or lack the expected layout
            print(e)
            continue
        else:
            print(article.title)
            NewsCount += 1
            print(NewsCount)


def SaveNews(article):
    outfile.write("【" + article.title + "】" + "\t")
    outfile.write(article.author + "\t" + article.date + "\n")
    outfile.write(article.content + "\n" + "\n")


home = 'http://jia..com/'  # starting point (host name elided in the source)

GetAllUrl(home)

# Open the output file as UTF-8 text to avoid encoding errors
outfile = open("D:\\test.txt", "a+", encoding="utf-8")

GetNews()

outfile.close()

News article structure (News.py):

# coding: utf-8
# Article class definition
class News(object):
    def __init__(self):
        self.url = None
        self.title = None
        self.author = None
        self.date = None
        self.about = None
        self.content = None

⑷ How to fetch web page content with a Python crawler

First install requests and BeautifulSoup4, then run the following code.

import requests
from bs4 import BeautifulSoup

iurl = 'http://news.sina.com.cn/c/nd/2017-08-03/doc-ifyitapp0128744.shtml'

res = requests.get(iurl)

res.encoding = 'utf-8'

# print(len(res.text))

soup = BeautifulSoup(res.text, 'html.parser')

# Title
H1 = soup.select('#artibodyTitle')[0].text

# Publication time and source
time_source = soup.select('.time-source')[0].text

# Source ('#artibody p' selects the <p> tags inside the element with id artibody)
origin = soup.select('#artibody p')[0].text.strip()

# Original title
oriTitle = soup.select('#artibody p')[1].text.strip()

# Content
raw_content = soup.select('#artibody p')[2:19]
content = []
for paragraph in raw_content:
    content.append(paragraph.text.strip())
# Join the paragraphs into one string (the original discarded this result)
content = '@'.join(content)

# Responsible editor
ae = soup.select('.article-editor')[0].text

That is all it takes.
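
Indexing with [0] as above raises IndexError when a selector matches nothing, for instance after a page redesign. A more defensive variant might look like the sketch below. The markup in SAMPLE_HTML and the helper names text_of and parse_article are hypothetical; only the selectors #artibodyTitle, #artibody p and .article-editor come from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure the selectors expect.
SAMPLE_HTML = """
<h1 id="artibodyTitle">Sample headline</h1>
<div id="artibody">
  <p>Source: Example Daily</p>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""


def text_of(soup, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_article(html):
    """Extract the title, editor and body paragraphs without assuming they exist."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.select("#artibody p")]
    return {
        "title": text_of(soup, "#artibodyTitle"),
        "editor": text_of(soup, ".article-editor"),  # missing in the sample
        "paragraphs": paragraphs,
    }
```

Fields that are absent come back as None, so the caller can decide whether to skip the article or save it incomplete.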

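⑸ A simple way to scrape the text of the first five pages of Tencent News with Python

You can use BeautifulSoup, a Python HTML parsing library that makes it easy to extract data from pages. A crawler first needs the URLs of the pages, then downloads their source, and then pulls out the required content with regular expressions or some other method. In practice you work against the page source: inspect it to see which parts hold the data you need, then use BeautifulSoup to extract the contents of the specific HTML tags. There is plenty of related material online.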
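
The workflow above (build the page URLs, get the source, pull out the text of specific tags) can be sketched as follows. To stay free of third-party packages, this sketch uses the standard library's html.parser in place of BeautifulSoup. The names ParagraphText, extract_paragraphs and page_urls are illustrative, and the URL template is a placeholder, since the real paging scheme has to be read off the site itself.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts of an HTML document."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


def page_urls(template, pages=5):
    """Fill a URL template containing {page} for pages 1..pages."""
    return [template.format(page=n) for n in range(1, pages + 1)]
```

Each URL from page_urls would be downloaded in turn and its source passed to extract_paragraphs.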

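⑹ How to scrape web page content with Python

One way to scrape web page content with Python is Scrapy, a crawler framework. It is simple and takes only three steps:

Define the item class
Develop the spider class
Develop the pipeline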

To study crawlers in more depth, the book 《瘋狂python講義》 is a useful resource.

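⑺ How to analyze a website you are about to scrape with a Python crawler

Before you scrape a website, work through the following.

Identify what kind of site it is (news, forum, Tieba board, and so on).

Decide which part of the data you need.

Work out what expressions you will write to parse the data you need.

Expect to run into anti-scraping measures of every kind; dealing with them mostly comes down to searching online for a workaround to each one. Once the cost of scraping exceeds the value of the data, give up.

Use whatever languages you know to solve the problems you meet: request the target URL with that language's HTTP client component, get the HTML back, parse out the data you want with regular expressions or XPath, and then store it in a database using SQL.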
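
The last step, parsing with an expression and storing the result with SQL, can be sketched with the standard library alone. The sample markup, the regular expression, and the table layout are assumptions chosen for illustration; a real page needs an expression written against its own source.

```python
import re
import sqlite3

# Hypothetical markup standing in for a downloaded news list.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/1.html">Markets rally</a></li>
  <li><a href="/news/2.html">Storm warning issued</a></li>
</ul>
"""

# Captures the href and the link text of simple <a href="..."> tags.
LINK_RE = re.compile(r'<a\s+href="([^"]+)">([^<]+)</a>')


def parse_links(html):
    """Return (url, title) pairs found in the HTML."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(html)]


def store(rows, db_path=":memory:"):
    """Insert the rows into an SQLite table, ignoring URLs already stored."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn
```

Making the URL the primary key means that re-running the crawler does not store the same article twice.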

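⑻ How to scrape a news website with python3

Requirement:

Scrape news from a portal site and save each article's title, author, time, and content to a local txt file.

Python modules used:

import re       # regular expressions
import bs4      # Beautiful Soup 4 parser
import urllib2  # network access (Python 2 name; urllib.request in Python 3)
import News     # user-defined news structure
import codecs   # key to the encoding problem: open files with codecs.open (Python 3's built-in open accepts an encoding argument)
import sys      # 1: used with reload(sys) and sys.setdefaultencoding to handle pages in different encodings (Python 2 only)

bs4 has to be installed separately; for instructions, see the article "Windows命令行下pip安裝python whl包" (installing Python whl packages with pip from the Windows command line).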

Program: the same script and News class as listed under ⑶ above.

The script keeps a running count of the articles it has crawled.
