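Installing Scrapy on Linux: which Python crawler tutorial is best

1. Which Python crawler tutorial is best

You can take a look at this tutorial: (web link)

The tutorial uses three crawler case studies to introduce the Scrapy framework, its architecture, and each of its modules.

Outline of the tutorial:

1. Introduction to Scrapy.

Key points: Scrapy's architecture and how data flows through it.

2. Setting up the development environment.

Key points: installing Scrapy on Windows and Linux.

3. Using Scrapy Shell and Scrapy Selectors.

4. Crawling a website with Scrapy.

Key points: creating a Scrapy project (scrapy startproject), defining the structured data to extract (Item), writing a Spider that crawls the site and extracts the structured data (Item), and writing Item Pipelines to store the extracted Items (that is, the structured data).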
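
As a rough illustration of the last step: an Item Pipeline is a plain Python class with a process_item(item, spider) method and optional open_spider and close_spider hooks. The sketch below stores each item as one JSON line. The class name JsonLinesPipeline, the output path, and the sample item are assumptions made for illustration, and the block deliberately imports nothing from Scrapy so that it can be tried on its own.

```python
import json


class JsonLinesPipeline:
    """Hypothetical Item Pipeline: writes every item as one JSON line."""

    def __init__(self, path="items.jl"):
        # The output path is an assumption for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for dict-like item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline
```

In a real project a class like this would be switched on through the ITEM_PIPELINES setting in the project's settings.py.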

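2. How to install Scrapy on Ubuntu

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Official website: http://www.scrapy.org/
1. Install the following packages

sudo apt-get install build-essential
sudo apt-get install python-dev
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
sudo apt-get install python-setuptools
2. Install Scrapy

sudo easy_install Scrapy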
wang@ubuntu:/usr/local/lib/python2.7/dist-packages$ sudo easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 0.16.1
Processing Scrapy-0.16.1-py2.7.egg
Scrapy 0.16.1 is already the active version in easy-install.pth
Installing scrapy script to /usr/local/bin

Using /usr/local/lib/python2.7/dist-packages/Scrapy-0.16.1-py2.7.egg
Processing dependencies for Scrapy
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 3.0.1
Downloading http://pypi.python.org/packages/source/l/lxml/lxml-3.0.1.tar.gz#md5=
Processing lxml-3.0.1.tar.gz
Running lxml-3.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-qibAzL/lxml-3.0.1/egg-dist-tmp-mSvUVN
Building lxml version 3.0.1.
Building without Cython.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
warning: no files found matching '*.txt' under directory 'src/lxml/tests'
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__getFilenameForFile’:
src/lxml/lxml.etree.c:26310:7: warning: variable ‘__pyx_clineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26309:15: warning: variable ‘__pyx_filename’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26308:7: warning: variable ‘__pyx_lineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c: In function ‘__pyx_pf_4lxml_5etree_4XSLT_18__call__’:
src/lxml/lxml.etree.c:132608:81: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__XSLT’:
src/lxml/lxml.etree.c:133997:79: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: At top level:
src/lxml/lxml.etree.c:12128:13: warning: ‘__pyx_f_4lxml_5etree_displayNode’ defined but not used [-Wunused-function]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile’:
src/lxml/lxml.etree.c:86715:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDoc’:
src/lxml/lxml.etree.c:86403:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseUnicodeDoc’:
src/lxml/lxml.etree.c:86093:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFilelike’:
src/lxml/lxml.etree.c:86925:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
Adding lxml 3.0.1 to easy-install.pth file

Installed /usr/local/lib/python2.7/dist-packages/lxml-3.0.1-py2.7-linux-x86_64.egg
Searching for w3lib>=1.2
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-ZAXTgy/w3lib-1.2/egg-dist-tmp-aU3vpc
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file

Installed /usr/local/lib/python2.7/dist-packages/w3lib-1.2-py2.7.egg
Searching for Twisted>=8.0
Reading http://pypi.python.org/simple/Twisted/
Reading http://www.twistedmatrix.com
Reading http://twistedmatrix.com/products/download
Reading http://twistedmatrix.com/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/9.0/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/10.0/
Reading http://twistedmatrix.com/projects/core/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.2/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.1/
Best match: Twisted 12.2.0
Downloading http://pypi.python.org/packages/source/T/Twisted/Twisted-12.2.0.tar.bz2#md5=
Processing Twisted-12.2.0.tar.bz2
Running Twisted-12.2.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kw897y/Twisted-12.2.0/egg-dist-tmp-sZWFYb
In file included from /usr/include/python2.7/Python.h:8:0,
from twisted/internet/_sigchld.c:9:
/usr/include/python2.7/pyconfig.h:1161:0: warning: "_POSIX_C_SOURCE" redefined [enabled by default]
/usr/include/features.h:215:0: note: this is the location of the previous definition
twisted/internet/_sigchld.c: In function ‘got_signal’:
twisted/internet/_sigchld.c:15:13: warning: variable ‘ignored_result’ set but not used [-Wunused-but-set-variable]
Adding Twisted 12.2.0 to easy-install.pth file
Installing mailmail script to /usr/local/bin
Installing conch script to /usr/local/bin
Installing pyhtmlizer script to /usr/local/bin
Installing twistd script to /usr/local/bin
Installing lore script to /usr/local/bin
Installing tkconch script to /usr/local/bin
Installing tapconvert script to /usr/local/bin
Installing ckeygen script to /usr/local/bin
Installing tap2rpm script to /usr/local/bin
Installing manhole script to /usr/local/bin
Installing trial script to /usr/local/bin
Installing cftp script to /usr/local/bin
Installing tap2deb script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg
Finished processing dependencies for Scrapy
This output indicates that the installation succeeded.

3. Test

scrapy shell http://ziki.cn
Get all <a> tags:

hxs.select('//a').extract()
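
The XPath expression //a matches every a element on the page, and in the Scrapy shell the same query can also be written as response.xpath('//a').extract() on releases that provide response.xpath. To show what such a query amounts to without a Scrapy installation, here is a minimal sketch built on the standard library's html.parser; it is not Scrapy's selector, it collects only the href values rather than the full element markup, and the names LinkCollector, extract_links, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="/one">1</a> and <a href="http://example.com/two">2</a></p>'
print(extract_links(sample))
```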
References

http://doc.scrapy.org/en/latest/intro/install.html
http://doc.scrapy.org/en/latest/intro/tutorial.html

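3. How to install Scrapy with Python 3 support on Linux

window) history has been taken over by tmux, so what Shift+PgUp/PgDn in the original console/terminal shows is not the history of the current window. You therefore have to press C-b [ to enter copy-mode, and only then can you move around with PgUp/PgDn, the arrow keys, Ctrl-S, and so on.
To scroll the window contents with the mouse wheel, press C-b : and then type
setw mode-mouse on
That is all. To enable it for all windows:
setw -g mode-mouse on
(In tmux 2.1 and later the mode-mouse option was replaced by a single mouse option, enabled with set -g mouse on.)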

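4. How to remove Scrapy on Linux

A Scrapy that was installed with pip can be removed with pip uninstall Scrapy. The notes below cover installation.

I. Installing Scrapy
pip install Scrapy. Because Scrapy has many dependencies, you may run into the following problems during installation:
1. ImportError: No module named w3lib.http
Fix: pip install w3lib
2. ImportError: No module named twisted
Fix: pip install twisted
3. ImportError: No module named lxml.html
Fix: pip install lxml
4. error: libxml/xmlversion.h: No such file or directory
Fix: apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
5. ImportError: No module named cssselect
Fix: pip install cssselect
6. ImportError: No module named OpenSSL
Fix: pip install pyOpenSSL
These cover most of the dependency problems that can come up during installation; anything missing will be added once it is found.

Run scrapy version; if it prints version information, the installation succeeded.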
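
Rather than waiting for each ImportError in turn, the dependencies listed above can be checked in one pass. The following is a minimal sketch, assuming Python 3 (importlib.util.find_spec); the function name missing_packages and the REQUIRED mapping are illustrative, with the import names and pip package names taken from the list of fixes above.

```python
import importlib.util

# Import name -> pip package name, taken from the fixes listed above.
REQUIRED = {
    "w3lib": "w3lib",
    "twisted": "twisted",
    "lxml": "lxml",
    "cssselect": "cssselect",
    "OpenSSL": "pyOpenSSL",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    missing = []
    for module_name, pip_name in required.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(pip_name)
    return missing


if __name__ == "__main__":
    for name in missing_packages():
        print("missing, try: pip install " + name)
```

This only checks that the modules can be located. A failure such as the missing libxml/xmlversion.h header appears at build time and still has to be fixed through the system package manager.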

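5. How to deploy Scrapy on a cloud server so that it keeps running

As a Linux server administrator, you often log in to remote Linux machines over SSH to carry out time-consuming operations.
You may have logged in to a remote Linux host over telnet or SSH and started some programs. If those programs need to run for a long time (several hours) and a network failure or a client failure occurs along the way, the connection between the client and the remote server is broken, and the commands on the remote server that have not finished are forcibly terminated.
Or, say, you SSH to a host and start a batch of scp commands: if the SSH session drops, the scp processes are interrupted. Or a time-consuming job is running on the remote server, the work is not finished, and it is almost time to leave; logging out would interrupt it. What can you do?
The screen command solves this problem nicely: programs keep running on the server after the SSH connection is closed.
So what is the screen command?
Screen is described as a full-screen window manager; it makes it easy to get the effect of several virtual terminals on one physical terminal.
What Screen does:
Put simply, Screen is a window manager that multiplexes one physical terminal among several processes, which means you can run several terminal applications from a single terminal window. Screen has the notion of a session: within one screen session you can create several screen windows, and each screen window behaves like a real telnet/SSH connection window.
Screen command syntax:
screen [-AmRvx -ls -wipe][-d <session name>][-h <lines>][-r <session name>][-s <shell>][-S <session name>]
Screen command options:
-A -[rR] Resize all windows to the size of the current terminal.
-c filename Use the given file instead of screen's configuration file '.screenrc'.
-d [pid.tty.host] Detach a screen session (the session must be in the Attached state, meaning a user is connected to it). Sessions are normally named in the form pid.tty.host (screen -list shows their state).
-D [pid.tty.host] Same as -d, except that on success it also kicks out the user who was in the session and logs them out.
-h <lines> Set the number of lines in the window's scrollback buffer.
-ls or -list List all current screen sessions.
-m Force creation of a new screen session even when already inside one.
-p number or name Preselect a window.
-r [pid.tty.host] Resume a detached screen session; if there are several detached sessions, [pid.tty.host] must be given.
-R Try to resume a detached session first; if none is found, create a new screen session.
-s shell Set the shell to run when a new window is created.
-S <session name> Give the screen session a name (it can be used in place of the [pid.tty.host] name, which simplifies things).
-v Print version information.
-wipe Check all current screen sessions and remove those that can no longer be used.
-x Attach to a screen session that is not detached (multi-display mode).
Common uses of the screen command:
screen -d -r: attach to a screen session; if it is attached elsewhere, detach the remote user first.
screen -D -r: attach to a screen session; if it is attached elsewhere, detach the remote user and log them out first.
screen -ls or screen -list: list existing screen sessions; a frequently used command.
screen -m: inside a screen session, the shortcut Ctrl+a c or simply typing screen creates a new window; screen -m creates a new screen session instead.
screen -dm: create a new screen session in detached mode, so you are not attached to it after it is created.
screen -p number or name: preselect a window.
Simple steps for running a program in the background with Screen:
1> Before starting the operation, create a Screen session:
Code:
[linux@user~]$ screen -S test1
2> You can then work inside it. If you have to leave before the task is done, detach and keep the session:
Code:
[linux@user~]$ Ctrl+a d #press Ctrl+a, then d, to detach and keep the session
[detached] #this message shows that the session has been kept
If your work is finished, just type:
Code:
[linux@user~]$ exit #this exits the session
[screen is terminating]
3> If you kept a session last time, you can list it with:
Code:
[linux@user~]$ screen -ls
There is a screen on:
9649.test1 (Detached)
To resume the session, use:
Code:
[linux@user~]$ screen -r test1 (or 9649)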
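
For a crawler, the interactive steps above can be collapsed into a single non-interactive launch by combining the -d, -m, and -S options described earlier. The following is a minimal sketch of that idea. The helper names screen_command and start_detached, the session name crawl1, and the spider name myspider are placeholders, and it assumes that screen and Scrapy are installed on the server and that the command is run from the Scrapy project directory.

```python
import shlex
import subprocess


def screen_command(session, command):
    """Build the argument list for: screen -dmS <session> <command...>."""
    # -d -m starts the session detached; -S gives it a name.
    return ["screen", "-dmS", session] + shlex.split(command)


def start_detached(session, command):
    # Launches screen and returns its exit status.
    return subprocess.call(screen_command(session, command))


if __name__ == "__main__":
    # "myspider" is a placeholder spider name; only the command is printed here.
    print(" ".join(screen_command("crawl1", "scrapy crawl myspider")))
```

The session ends when the command it runs exits, so a finished crawl leaves nothing to clean up; while it is running, screen -r crawl1 attaches to it.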
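Shortcut keys used in Screen
Ctrl+a c: create a window
Ctrl+a w: list windows
Ctrl+a n: next window
Ctrl+a p: previous window
Ctrl+a 0-9: switch to window number 0 through 9
Ctrl+a K (uppercase): close the current window and switch to the next one (when the last window is closed, the screen session ends and you return to the original shell)
exit: close the current window and switch to the next one (when the last window is closed, the screen session ends and you return to the original shell)
Ctrl+a d: detach from the current session and return to the shell from which screen was started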
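Multiple windows
Like many window managers, screen supports multiple windows. This is useful for handling several tasks without opening new sessions. As a system administrator I often have four or five SSH sessions open at once, and in each shell I may be handling two or three tasks. Without screen that would take 15 SSH sessions, 15 logins, 15 windows, and so on. With screen, each system gets a single session, and I use screen to manage the different jobs on that system.
To open a new window, just press "Ctrl-A" "c". The new window shows a default command prompt. For example, I can run top and then open a new window to do other work; top keeps running there. Try it yourself: start screen and run top. (Note: the screens are truncated to save space.)
Start top
Code:
Mem: 506028K av, 500596K used, 5432K free, 0K shrd, 11752K buff
Swap: 1020116K av, 53320K used, 966796K free 393660K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %ME
6538 root 25 0 1892 1892 596 R 49.1 0.3
6614 root 16 0 1544 1544 668 S 28.3 0.3
7198 admin 15 0 1108 1104 828 R 5.6 0.2
Now open a new window with "Ctrl-A" "c"
Code:
[admin@ensim admin]$
To get back to top, use "Ctrl-A" "n"
Mem: 506028K av, 500588K used, 5440K free, 0K shrd, 11960K buff
Swap: 1020116K av, 53320K used, 966796K free 392220K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %ME
6538 root 25 0 1892 1892 596 R 48.3 0.3
6614 root 15 0 1544 1544 668 S 30.7 0.3
You can create several windows and then switch to the next one with "Ctrl-A" "n" or go back to the previous one with "Ctrl-A" "p". While you work in one window, the programs in the other windows keep running.
Exiting screen
There are two ways to leave screen. The first is the same as logging out of a shell: terminate a window with "Ctrl-A" "K" or "exit". The current window is closed; if you have several windows open you are taken to one of the others, and if it was the only window you leave screen.
The other way is to detach. This simply closes the view while the processes keep running. If you have a process that you know will run for a long time and you need to close your SSH program, use "Ctrl-A" "d" to detach. This returns you to the shell. All the screen windows stay where they are, and you can reattach to them later. (Translator's note: this is much like minimizing a window or running a program in the background.)
Reattaching to a session
Suppose you have spent a long time compiling a program inside screen and your connection suddenly drops. There is no need to worry: the compilation keeps running inside screen. After logging in again, use screen's list option to see which sessions are running:
Code:
[root@gigan root]# screen -ls
There are screens on:
31619.ttyp2.gigan (Detached)
4731.ttyp2.gigan (Detached)
2 Sockets in /tmp/screens/S-root.
Here I have two different screen sessions. To reattach to one of them, use the resume command:
Code:
[root@gigan root]# screen -r 31619.ttyp2.gigan
Just pass -r followed by the session name and you are back at the screen you left. Better still, you can reattach from anywhere: whether from the office or from another client, you can use screen to start a job and then leave.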
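
When a deployment script has to decide whether a crawl session is still alive, the listing printed by screen -ls can be parsed. The following is a minimal sketch, assuming the line format shown above (pid.name followed by a state in parentheses) and tolerating an extra timestamp column that some screen versions print. The name parse_screen_ls is made up, and states other than Attached and Detached are not handled.

```python
import re

# Matches lines such as "31619.ttyp2.gigan (Detached)".
SESSION_RE = re.compile(r"^\s*(\d+)\.(\S+)\s+.*\((Attached|Detached)\)\s*$")


def parse_screen_ls(output):
    """Return a list of (pid, name, state) tuples from `screen -ls` text."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_RE.match(line)
        if match:
            pid, name, state = match.groups()
            sessions.append((int(pid), name, state))
    return sessions


sample = """There are screens on:
31619.ttyp2.gigan (Detached)
4731.ttyp2.gigan (Detached)
2 Sockets in /tmp/screens/S-root.
"""
print(parse_screen_ls(sample))
```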

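6. How to debug Scrapy with PyCharm on Linux (Linux blog)

Method/steps: First open PyCharm and check whether git is installed. Run git version on the command line; if it prints a result, git is installed, and you can then download the code with git. After importing the code into PyCharm you will see a notice in the top-right corner saying that the path to git cannot be found and that it cannot parse ...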

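7. How to install scapy pyx on Linux Ubuntu

I have been learning about web crawlers recently. I had long heard that writing crawlers in Python is a pleasure (Python users seem to say that about everything in Python, but it is true that Python's libraries are so rich that you rarely have to reinvent the wheel), and that there is a powerful framework called Scrapy, so I decided to give it a try.
The first step in using Scrapy is of course installing it. I tried installing on both Windows and Ubuntu; this article covers Ubuntu first, which is far simpler than Windows. When I find time I will also describe the Windows installation in detail.
According to the official documentation, a number of dependencies must be installed before Scrapy.
* Python 2.7: Scrapy is a Python framework, so Python must be installed first. At the time of writing Scrapy supported only Python 2.7, so make sure that is the version you have
* lxml: most Linux distributions ship with lxml
* OpenSSL: provided on all systems except Windows
* Python Package: pip and setuptools. pip now depends on setuptools, so installing pip installs setuptools automatically
As the list above shows, installing Scrapy's dependencies on non-Windows systems is fairly simple: you only need to install pip, and Scrapy itself is installed with pip.
Checking whether Scrapy's dependencies are installed
If you are not sure that the dependencies described above as already present are really on your machine, you can check as follows. This article uses Ubuntu 14.04.
Check the Python version
$ python --version
If the command prints a version, Python is installed; here it shows Python 2.7.6, a 2.7 release that meets the requirement. If not, install Python yourself by following one of the many guides online; this article does not cover installing Python.

Check whether lxml and OpenSSL are installed
Assuming Python is installed, type python in the console to enter the interactive interpreter.

Then enter import lxml and import OpenSSL in turn. If neither raises an error, both dependencies are installed.

Install python-dev and libevent
python-dev is an important package for Python development on Linux. You need it when:
* you install a Python library from outside the distribution's repositories that contains C/C++ files which call the Python API and have to be compiled
* a program you wrote needs to link against libpythonXX.(a|so)
libevent is a high-performance, event-driven networking library; many frameworks use it under the hood.
Both must be installed, or errors will occur later. Install them with:
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev
Install pip
Because Scrapy is easy to install with pip, install pip first with:
$ sudo apt-get install python-pip
Install Scrapy with pip
Install Scrapy with:
$ sudo pip install scrapy
Be sure to run this with root privileges; otherwise the installation fails with an error.

Scrapy is now installed. Check that the installation succeeded with:
$ scrapy version
If it prints a version, the installation succeeded; the version installed here is 1.02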
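
The final check can also be scripted, for example as part of a provisioning step. The following is a minimal sketch, assuming Python 3 and relying only on the scrapy version command shown above; the function name scrapy_version is made up for illustration.

```python
import shutil
import subprocess


def scrapy_version(executable="scrapy"):
    """Return the output of `<executable> version`, or None if not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = scrapy_version()
    print(version if version else "scrapy was not found on PATH")
```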

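8. Can Scrapy be installed on CentOS 6.6 with Python 2.6

Yes, by installing Python 2.7 alongside the system Python. The steps are as follows:

Installation steps:
1. Download Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz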

[root@zxy-websgs ~]# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt

[root@zxy-websgs opt]# tar xvf Python-2.7.3.tgz

[root@zxy-websgs Python-2.7.3]# ./configure

[root@zxy-websgs Python-2.7.3]# make && make install

Verify the Python 2.7 installation
[root@zxy-websgs Python-2.7.3]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

2. Install setuptools: http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
[root@zxy-websgs ~]# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
[root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz
[root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py install

3. Install Twisted
[root@zxy-websgs setuptools-0.6c11]# easy_install Twisted
......
Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg
......
Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg

Twisted requires zope.interface. Both can be downloaded from the following addresses:
zope.interface:http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz
twisted:http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2
5. Install w3lib

[root@zxy-websgs setuptools-0.6c11]# easy_install -U w3lib
Searching for w3lib
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file

Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
Processing dependencies for w3lib
Finished processing dependencies for w3lib

w3lib:http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz
6. Install libxml2, or install lxml with easy_install
[root@zxy-websgs lxml-3.1.0]# easy_install lxml

Verify the lxml installation
[root@zxy-websgs lxml-3.1.0]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
>>> exit()

You can also install libxml2. The official site recommends version 2.6.28 or later, but I could not find it there. I first installed version 2.6.9, and running scrapy then produced the following error

Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
execute()
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute
cmds = _get_commands_dict(inproject)
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
cmds = _get_commands_from_module('scrapy.commands', inproject)
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
for cmd in _iter_command_classes(module):
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
for module in walk_modules(module_name):
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
submod = __import__(fullpath, {}, {}, [''])
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
from scrapy.shell import Shell
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>
from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
from scrapy.selector.libxml2sel import *
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
from .factories import xmlDoc_from_html, xmlDoc_from_xml
File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

Upgrading to version 2.6.21 resolved it.
libxml2 2.6.21:ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
7. Install pyOpenSSL (optional; mainly so that Scrapy can support https)
easy_install pyOpenSSL fetched pyOpenSSL 0.13, which failed to install, so I downloaded version 0.11 manually and installed that.
[root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
[root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz
[root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install

pyOpenSSL:http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz
8. Install Scrapy
[root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy

Verify the installation

[root@zxy-websgs pyOpenSSL-0.11]# scrapy
Scrapy 0.16.4 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

scrapy:http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
Summary:
If pyOpenSSL fails to install on its own, you can download and install pyOpenSSL 0.11 first and then run easy_install -U Scrapy to complete the installation.

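9. Does Python's Scrapy need to be installed separately

I. A first look at Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), and it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
This document introduces the concepts behind Scrapy so that you understand how it works and can decide whether it is what you need.
When you are ready to start your project, you can refer to the getting-started tutorial.

II. Installing Scrapy
Platform and supporting tools for running the Scrapy framework
Python 2.7 (the latest Python release was 3.5; 2.7 is used here)
Python Package: pip and setuptools. pip now depends on setuptools and installs it automatically if it is missing.
lxml. Most Linux distributions ship with lxml. If it is missing, see
OpenSSL. Provided on all systems except Windows (see the platform installation guide).
You can install Scrapy with pip (pip is the recommended way to install Python packages).
pip install Scrapy
Installation on Windows:
1. After installing Python 2.7, you need to modify the PATH environment variable to add the Python executable and additional scripts to the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
Besides this, you can also use c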
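
Whether the two directories from step 1 are already on PATH can be checked from Python before editing the variable. The following is a minimal sketch; the helper name missing_from_path is made up, and the sample PATH value in the demonstration line is an assumption for illustration rather than output from a real machine.

```python
import os


def missing_from_path(required, path_value=None, sep=os.pathsep):
    """Return the entries of `required` that are absent from a PATH string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(entry):
        # Compare without regard to case or trailing slashes.
        return entry.strip().rstrip("\\/").lower()

    present = {norm(p) for p in path_value.split(sep) if p.strip()}
    return [r for r in required if norm(r) not in present]


wanted = ["C:\\Python27\\", "C:\\Python27\\Scripts\\"]
# Demonstration with a made-up PATH value that lacks the Scripts directory.
print(missing_from_path(wanted, "C:\\Windows;C:\\Python27", sep=";"))
```

Called without a path_value, the function reads the PATH of the current process, so it reports the state of the shell it is run from.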

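10. What Python crawler tutorials and Python learning paths are there

So many people are keen on crawler technology today because crawlers can do a lot for us, for example in search engines, data collection, and ad filtering. Taking Python as an example, Python crawlers can feed data analysis and play a major role in data acquisition.
That does not mean that knowing the Python language alone makes you proficient in crawling. There is a good deal more to learn, including but not limited to HTML, the basics of the HTTP/HTTPS protocols, regular expressions, databases, common packet-capture tools, and crawler frameworks. Large-scale crawling additionally calls for an understanding of distributed systems, message queues, common data structures and algorithms, caching, and even applications of machine learning; large systems are backed by many technologies.
How does a complete beginner learn crawling? For a beginner who feels lost, the most important thing at the start is to settle on a learning path and a learning method. With those and good study habits, the systematic study that follows will be far more effective.
To write crawlers in Python you first need to know Python: understand the basic syntax and know how to use functions, classes, and the common methods of data structures such as list and dict; that counts as basic proficiency. For crawling you also need to understand the basic principles of HTTP. The HTTP specification is more than one book could cover, but the deeper material can be left for later; combining theory with practice makes later study progressively easier. I have outlined the concrete steps for learning crawling in the parts below for reference:
Web crawler basics:
Definition of a crawler
What crawlers are used for
The HTTP protocol
Using a basic packet-capture tool (Fiddler)
Implementing crawlers with Python modules:
Overview of what the urllib3, requests, lxml, and bs4 modules do
Fetching static page data with the requests module using GET
Fetching static page data with the requests module using POST
Fetching AJAX dynamic page data with the requests module
Simulating a website login with the requests module
CAPTCHA recognition with Tesseract
The Scrapy framework and Scrapy-Redis:
Overview of the Scrapy crawler framework
The Scrapy Spider class
Scrapy Item and Pipeline
The Scrapy CrawlSpider class
Distributed crawling with Scrapy-Redis
Crawling data with automated testing tools and browsers:
Selenium + PhantomJS: introduction and a simple example
Selenium + PhantomJS: logging in to a website
Selenium + PhantomJS: crawling dynamic page data
Crawler project in practice:
Distributed crawler + Elasticsearch to build a search engine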
