Installing Scrapy on Linux: which Python crawler tutorial is best

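1. Which Python crawler tutorial is best

You can take a look at this tutorial: [web link]

The tutorial uses three crawler case studies to introduce the Scrapy framework, its architecture, and its individual modules.

Outline of the tutorial:

1. Introduction to Scrapy.

Key points: Scrapy's architecture and workflow.

2. Setting up the development environment.

Key points: installing Scrapy on Windows and Linux.

3. Using the Scrapy Shell and Scrapy Selectors.

4. Crawling website data with Scrapy.

Key points: creating a Scrapy project (scrapy startproject), defining the structured data to extract (Item), writing a Spider that crawls the site and extracts the structured data (Item), and writing Item Pipelines to store the extracted Items (the structured data).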
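
As a rough illustration of the last step, an Item Pipeline can be a plain class with a `process_item(item, spider)` method, optionally with `open_spider` and `close_spider` hooks. The sketch below stores each item as one JSON line. The class name `JsonLinesPipeline`, the file name `items.jl`, and the dict-shaped item are assumptions made for illustration and do not come from the tutorial.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every scraped item to a JSON Lines file."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and return it so later pipelines still receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

In a project created with scrapy startproject, a class like this would typically be enabled through the ITEM_PIPELINES setting in settings.py.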

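2. How to install Scrapy on Ubuntu

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Official website: http://www.scrapy.org/
1. Install the following packages

sudo apt-get install build-essential;
sudo apt-get install python-dev;
sudo apt-get install libxml2-dev;
sudo apt-get install libxslt1-dev;
sudo apt-get install python-setuptools;
2. Install Scrapy

sudo easy_install Scrapy;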
wang@ubuntu:/usr/local/lib/python2.7/dist-packages$ sudo easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 0.16.1
Processing Scrapy-0.16.1-py2.7.egg
Scrapy 0.16.1 is already the active version in easy-install.pth
Installing scrapy script to /usr/local/bin

Using /usr/local/lib/python2.7/dist-packages/Scrapy-0.16.1-py2.7.egg
Processing dependencies for Scrapy
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 3.0.1
Downloading http://pypi.python.org/packages/source/l/lxml/lxml-3.0.1.tar.gz#md5=
Processing lxml-3.0.1.tar.gz
Running lxml-3.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-qibAzL/lxml-3.0.1/egg-dist-tmp-mSvUVN
Building lxml version 3.0.1.
Building without Cython.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
warning: no files found matching '*.txt' under directory 'src/lxml/tests'
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__getFilenameForFile’:
src/lxml/lxml.etree.c:26310:7: warning: variable ‘__pyx_clineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26309:15: warning: variable ‘__pyx_filename’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26308:7: warning: variable ‘__pyx_lineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c: In function ‘__pyx_pf_4lxml_5etree_4XSLT_18__call__’:
src/lxml/lxml.etree.c:132608:81: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__XSLT’:
src/lxml/lxml.etree.c:133997:79: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: At top level:
src/lxml/lxml.etree.c:12128:13: warning: ‘__pyx_f_4lxml_5etree_displayNode’ defined but not used [-Wunused-function]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile’:
src/lxml/lxml.etree.c:86715:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDoc’:
src/lxml/lxml.etree.c:86403:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseUnicodeDoc’:
src/lxml/lxml.etree.c:86093:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFilelike’:
src/lxml/lxml.etree.c:86925:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
Adding lxml 3.0.1 to easy-install.pth file

Installed /usr/local/lib/python2.7/dist-packages/lxml-3.0.1-py2.7-linux-x86_64.egg
Searching for w3lib>=1.2
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-ZAXTgy/w3lib-1.2/egg-dist-tmp-aU3vpc
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file

Installed /usr/local/lib/python2.7/dist-packages/w3lib-1.2-py2.7.egg
Searching for Twisted>=8.0
Reading http://pypi.python.org/simple/Twisted/
Reading http://www.twistedmatrix.com
Reading http://twistedmatrix.com/products/download
Reading http://twistedmatrix.com/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/9.0/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/10.0/
Reading http://twistedmatrix.com/projects/core/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.2/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.1/
Best match: Twisted 12.2.0
Downloading http://pypi.python.org/packages/source/T/Twisted/Twisted-12.2.0.tar.bz2#md5=
Processing Twisted-12.2.0.tar.bz2
Running Twisted-12.2.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kw897y/Twisted-12.2.0/egg-dist-tmp-sZWFYb
In file included from /usr/include/python2.7/Python.h:8:0,
from twisted/internet/_sigchld.c:9:
/usr/include/python2.7/pyconfig.h:1161:0: warning: "_POSIX_C_SOURCE" redefined [enabled by default]
/usr/include/features.h:215:0: note: this is the location of the previous definition
twisted/internet/_sigchld.c: In function ‘got_signal’:
twisted/internet/_sigchld.c:15:13: warning: variable ‘ignored_result’ set but not used [-Wunused-but-set-variable]
Adding Twisted 12.2.0 to easy-install.pth file
Installing mailmail script to /usr/local/bin
Installing conch script to /usr/local/bin
Installing pyhtmlizer script to /usr/local/bin
Installing twistd script to /usr/local/bin
Installing lore script to /usr/local/bin
Installing tkconch script to /usr/local/bin
Installing tapconvert script to /usr/local/bin
Installing ckeygen script to /usr/local/bin
Installing tap2rpm script to /usr/local/bin
Installing manhole script to /usr/local/bin
Installing trial script to /usr/local/bin
Installing cftp script to /usr/local/bin
Installing tap2deb script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg
Finished processing dependencies for Scrapy
This output indicates that the installation succeeded.

3. Test

scrapy shell http://ziki.cn
Get all a tags:

hxs.select('//a').extract()
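
To show what the `//a` expression selects without opening a Scrapy shell, the sketch below uses the standard library's `xml.etree.ElementTree`. It supports a limited XPath subset in which `.//a` plays the role of `//a`. The sample markup and the `extract_links` helper are assumptions for illustration. Unlike Scrapy's selectors, this sketch only handles well-formed markup.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html><body>
<p>See <a href="/docs">the docs</a> and <a href="/faq">the FAQ</a>.</p>
</body></html>"""


def extract_links(markup):
    """Return (href, text) for every <a> element in well-formed markup."""
    root = ET.fromstring(markup)
    # './/a' matches <a> elements at any depth below the root.
    return [(a.get("href"), a.text) for a in root.findall(".//a")]


print(extract_links(SAMPLE))
```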
References

http://doc.scrapy.org/en/latest/intro/install.html
http://doc.scrapy.org/en/latest/intro/tutorial.html

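3. How to install Scrapy with Python 3 support on Linux

The history of the window is now managed by tmux, so what the console/terminal's own Shift+PgUp/PgDn keys show is not the history of the current window. To scroll, press C-b [ to enter copy-mode; PgUp/PgDn, the arrow keys, Ctrl-S and similar keys then move around within copy-mode.
To scroll the window content with the mouse wheel, press C-b : and then enter
setw mode-mouse on
To enable this for all windows:
setw -g mode-mouse on
In tmux 2.1 and later the mode-mouse option was replaced by a single mouse option (set -g mouse on).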

4. How to remove Scrapy on Linux

I. Installing Scrapy
pip install Scrapy
Scrapy has many dependencies, so the following problems may come up during installation:
1. ImportError: No module named w3lib.http
Fix: pip install w3lib
2. ImportError: No module named twisted
Fix: pip install twisted
3. ImportError: No module named lxml.html
Fix: pip install lxml
4. error: libxml/xmlversion.h: No such file or directory
Fix: apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
5. ImportError: No module named cssselect
Fix: pip install cssselect
6. ImportError: No module named OpenSSL
Fix: pip install pyOpenSSL
These cover most of the dependency problems that can occur during installation; any omissions will be added as they are found.
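
Instead of hitting these ImportErrors one at a time, the modules can be checked up front. The following is a minimal sketch, assuming Python 3.4 or later (for `importlib.util.find_spec`). The `REQUIRED` mapping simply restates the list above, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Import name -> pip package name, restating the list above.
REQUIRED = {
    "w3lib.http": "w3lib",
    "twisted": "twisted",
    "lxml.html": "lxml",
    "cssselect": "cssselect",
    "OpenSSL": "pyOpenSSL",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose modules cannot be found."""
    missing = []
    for module_name, package in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(package)
    return missing


if __name__ == "__main__":
    for package in missing_packages():
        print("pip install " + package)
```

Each printed line is the pip command that addresses one missing module. The header error in item 4 is a build-time problem and is not detected by this check.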

Run scrapy --version; if version information is displayed, the installation succeeded.

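5. How to deploy Scrapy on a cloud server so that it keeps running

As a Linux server administrator, you often log in to remote Linux machines over SSH to perform long-running operations.
You may have logged in to a remote Linux host over telnet or SSH and run programs that take a long time (several hours). If a network failure or client failure occurs while they are running, the connection between the client and the remote server is interrupted, and commands on the server that have not finished are forcibly terminated.
Likewise, if you SSH into a host and start a batch of scp commands, the scp processes are interrupted when the SSH session drops. Or a long job is running on the remote server, the work is not finished, and it is almost time to leave; logging out would interrupt it. What can be done?
The screen command solves this problem nicely: programs keep running on the server after SSH is disconnected.
So what is the screen command?
Screen is a full-screen window manager that makes it easy to get the effect of multiple virtual terminals on one physical terminal.
What Screen does:
Put simply, Screen is a window manager that multiplexes one physical terminal among several processes, which means multiple terminal applications can run inside a single terminal window. Screen has the concept of sessions: a user can create several screen windows within one screen session, and each window behaves like a real telnet/SSH connection window.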
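Screen command syntax:
screen [-AmRvx -ls -wipe][-d <session name>][-h <lines>][-r <session name>][-s <shell>][-S <session name>]
Screen command options:
-A -[rR] Resize all windows to the size of the current terminal.
-c filename Use the given file instead of screen's configuration file '.screenrc'.
-d [pid.tty.host] Detach a screen session (the session must be in the Attached state, i.e. a user is connected to it). A session name normally has the form pid.tty.host (screen -list shows the state).
-D [pid.tty.host] Same as -d, except that on success it also kicks out the user attached to the session and logs them out.
-h <lines> Set the number of lines in the window's scrollback buffer.
-ls or -list List all current screen sessions.
-m Force creation of a new screen session even when already running inside one.
-p number or name Preselect a window.
-r [pid.tty.host] Resume a detached screen session; if several sessions are detached, [pid.tty.host] must be given.
-R Try to resume a detached session first; if none is found, create a new screen session.
-s shell Specify the shell to run when a new window is created.
-S <session name> Name the screen session (replaces the [pid.tty.host] naming and simplifies later commands).
-v Show version information.
-wipe Check all screen sessions and remove those that are no longer usable.
-x Attach to a session that is not detached (multi-display mode).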
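Common uses of the screen command:
screen -d -r: Attach to a screen session; if it is currently attached elsewhere, detach the remote user first.
screen -D -r: Attach to a screen session; if it is currently attached elsewhere, detach the remote user, log them out, and then attach.
screen -ls or screen -list: List the existing screen sessions; used frequently.
screen -m: Inside a screen session, the shortcut Ctrl+a c or typing screen creates a new window; screen -m creates a new screen session instead.
screen -dm: Create a new screen session in detached mode, i.e. without attaching to it after creation.
screen -p number or name: Preselect a window.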
Simple steps for running a program in the background with Screen:
1> Before starting a task, create a Screen session:
Code:
[linux@user~]$ screen -S test1
2> Work inside it as usual. If you have to leave before the task is finished, detach the session:
Code:
[linux@user~]$ Ctrl+a d #press Ctrl+a, then d, to detach the session
[detached] #this message confirms that the session has been detached
When the work is finished, just type:
Code:
[linux@user~]$ exit #the session exits
[screen is terminating]
3> To see sessions detached earlier:
Code:
[linux@user~]$ screen -ls
There is a screen on:
9649.test1 (Detached)
To resume the session:
Code:
[linux@user~]$ screen -r test1 (or 9649)
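
Applied to the question in the heading, the same options can start a crawl in a session that is detached from the beginning. This is a sketch under assumptions: the session name `crawl1` and the spider name `myspider` are placeholders, and the command is only built and printed here, not executed.

```python
import shlex


def screen_command(session, spider):
    """Build the argument list that starts `scrapy crawl` in a detached screen."""
    # -d -m: start the session detached; -S: give it a name.
    return ["screen", "-dmS", session, "scrapy", "crawl", spider]


cmd = screen_command("crawl1", "myspider")
print(" ".join(shlex.quote(part) for part in cmd))
# To actually launch it (requires screen and a Scrapy project directory):
# import subprocess; subprocess.run(cmd, check=True)
```

The session can then be inspected with screen -ls and reattached with screen -r crawl1, as in the steps above.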
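Keyboard shortcuts used in Screen
Ctrl+a c: create a window
Ctrl+a w: list windows
Ctrl+a n: next window
Ctrl+a p: previous window
Ctrl+a 0-9: switch to window 0 through 9
Ctrl+a K (uppercase): close the current window and switch to the next one (closing the last window terminates the session and returns to the original shell)
exit: close the current window and switch to the next one (closing the last window terminates the session and returns to the original shell)
Ctrl+a d: detach from the current session and return to the shell from which screen was started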
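Multiple windows
Like many window managers, screen supports multiple windows. This is useful for handling several tasks without opening new sessions. As a system administrator, I often have four or five SSH sessions open at once, and in each shell I may be working on two or three tasks. Without screen that takes 15 SSH sessions, 15 logins, 15 windows, and so on. With screen, each system gets its own session, and I manage the different jobs on that system through screen.
To open a new window, just press "Ctrl-A" "c". The new window shows a default command prompt. For example, I can run top and then open a new window to do other work; top keeps running. Try it yourself: start screen and run top. (Note: the screens below are truncated to save space.)
Start top
Code:
Mem: 506028K av, 500596K used, 5432K free,
0K shrd, 11752K buff
Swap: 1020116K av, 53320K used, 966796K free
393660K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %ME

6538 root 25 0 1892 1892 596 R 49.1 0.3
6614 root 16 0 1544 1544 668 S 28.3 0.3
7198 admin 15 0 1108 1104 828 R 5.6 0.2
Now open a new window with "Ctrl-A" "c"
Code:
[admin@ensim admin]$
To get back to top, use "Ctrl-A" "n"
Mem: 506028K av, 500588K used, 5440K free,
0K shrd, 11960K buff
Swap: 1020116K av, 53320K used, 966796K free
392220K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %ME

6538 root 25 0 1892 1892 596 R 48.3 0.3
6614 root 15 0 1544 1544 668 S 30.7 0.3
You can create several windows and switch to the next one with "Ctrl-A" "n" or back to the previous one with "Ctrl-A" "p". While you work in one window, the programs in the other windows keep running.
Exiting screen
There are two ways to leave screen. The first is like logging out of a shell: terminate a window with "Ctrl-A" "K" or "exit". This closes the current window; if several windows are open you are moved to one of the remaining ones, and if it was the only window you leave screen.
The other way is to detach. This simply closes the view while the processes keep running. If you have a long-running process and need to close the SSH client, detach with "Ctrl-A" "d". This returns you to the shell. All screen windows remain, and you can reattach to them later. (Translator's note: this is much like minimizing a window or running a program in the background.)
Reattaching to a session
Suppose you have been compiling a program in screen for a long time and your connection suddenly drops. Don't worry: the compilation keeps running inside screen. After logging in again, use screen's list option to see which sessions are running:
Code:
[root@gigan root]# screen -ls
There are screens on:
31619.ttyp2.gigan (Detached)
4731.ttyp2.gigan (Detached)
2 Sockets in /tmp/screens/S-root.
Here there are two screen sessions. To reattach to one of them, use the resume command:
Code:
[root@gigan root]# screen -r 31619.ttyp2.gigan
Just use the -r option followed by the session name, and you are back at the screen you left. Better still, you can reattach from anywhere: whether at the office or on another client, you can start a job in screen and then log out.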
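
When a script has to pick the session to reattach, the listing can be parsed. The sketch below handles only the output format shown above. Other screen versions may add columns such as a timestamp, which this pattern does not cover. `parse_screen_list` and the embedded `SAMPLE` text are illustrative assumptions.

```python
import re

SAMPLE = """There are screens on:
        31619.ttyp2.gigan       (Detached)
        4731.ttyp2.gigan        (Detached)
2 Sockets in /tmp/screens/S-root.
"""

# pid, a dot, the session name, then the state in parentheses.
SESSION_LINE = re.compile(r"^\s*(\d+)\.(\S+)\s+\((\w+)\)\s*$")


def parse_screen_list(output):
    """Return (pid, name, state) tuples from `screen -ls` style output."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_LINE.match(line)
        if match:
            pid, name, state = match.groups()
            sessions.append((int(pid), name, state))
    return sessions


detached = [s for s in parse_screen_list(SAMPLE) if s[2] == "Detached"]
print(detached)
```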

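6. How to debug Scrapy with PyCharm on Linux

Steps: First open PyCharm and check whether git is installed by running git version on the command line; if it prints a result, git is installed. Then download the code with git and import it into PyCharm. A prompt appears in the upper-right corner saying that the path to git cannot be found and the code cannot be resolved.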

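7. How to install scapy pyx on Linux Ubuntu

I have been learning about crawlers recently. I had long heard that writing crawlers in Python is a pleasure (Python users seem to say that about everything, but it is true that Python's libraries are rich enough that there is no need to reinvent the wheel) and that there is a powerful framework called Scrapy, so I decided to try it.
The first step in using Scrapy is, of course, installing it. I tried both Windows and Ubuntu; this article covers Ubuntu, where installation is far simpler than on Windows. I will describe the Windows installation in detail when I find the time.
According to the official documentation, a number of dependencies must be installed before Scrapy:
* Python 2.7: Scrapy is a Python framework, so Python has to be installed first. At the time of writing Scrapy supports only Python 2.7, so make sure that is the version you have.
* lxml: most Linux distributions ship with lxml
* OpenSSL: provided on all systems except Windows
* Python packages: pip and setuptools. pip now depends on setuptools, so installing pip installs setuptools automatically
As the list shows, installing Scrapy's dependencies outside Windows is fairly simple: only pip needs to be installed, and Scrapy itself is then installed with pip.
Checking that Scrapy's dependencies are installed
If you are unsure whether the dependencies listed as already present are really installed on your machine, you can check them as follows. This article uses Ubuntu 14.04.
Checking the Python version
$ python --version
If the command reports a 2.7 version, Python is installed and meets the requirement; mine shows Python 2.7.6. If it does not, search online for how to install Python; this article does not cover it (there are plenty of guides).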

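Checking whether lxml and OpenSSL are installed
With Python installed, type python in the console to enter the interactive interpreter.

Then enter import lxml and import OpenSSL. If neither raises an error, both dependencies are installed.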

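Installing python-dev and libevent
python-dev is an important package for Python development on Linux. You need it when
* you install a Python library from outside the distribution's repositories that contains C/C++ files which call the Python API and must be compiled
* a program you wrote needs to link against libpythonXX.(a|so) when it is compiled
libevent is a high-performance, event-driven networking library that many frameworks use under the hood.
Both need to be installed, otherwise errors occur later. Install them with:
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev
Installing pip
Scrapy is easy to install with pip, so pip has to be installed first:
$ sudo apt-get install python-pip
Installing Scrapy with pip
Install Scrapy with the following command.
$ sudo pip install scrapy
Be sure to run it with root privileges; otherwise the installation fails with an error.

Scrapy is now installed. Check the installation with:
$ scrapy version
If it prints the version, the installation succeeded; the version installed here is 1.0.2.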
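
The same check can be scripted, for instance as part of a provisioning step. This is a minimal sketch assuming Python 3.7 or later (for `capture_output`). `scrapy_version` is a hypothetical helper that shells out to the `scrapy version` command used above.

```python
import shutil
import subprocess


def scrapy_version(executable="scrapy"):
    """Return the output of `<executable> version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


version = scrapy_version()
print(version if version else "scrapy not found on PATH")
```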

8. Can Scrapy be installed on CentOS 6.6 with Python 2.6

Yes, by installing Python 2.7 first. The steps are as follows:

Installation steps:
1. Download Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz

[root@zxy-websgs ~]# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt

[root@zxy-websgs opt]# tar xvf Python-2.7.3.tgz

[root@zxy-websgs Python-2.7.3]# ./configure

[root@zxy-websgs Python-2.7.3]# make && make install

Verify the Python 2.7 installation
[root@zxy-websgs Python-2.7.3]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

2. Install setuptools: http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
[root@zxy-websgs ~]# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
[root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz
[root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py install

3. Install Twisted
[root@zxy-websgs setuptools-0.6c11]# easy_install Twisted
......
Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg
......
Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg

Twisted requires zope.interface, which can be downloaded from:
zope.interface: http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz
twisted: http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2
5. Install w3lib

[root@zxy-websgs setuptools-0.6c11]# easy_install -U w3lib
Searching for w3lib
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file

Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
Processing dependencies for w3lib
Finished processing dependencies for w3lib

w3lib: http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz
6. Install libxml2, or install lxml with easy_install
[root@zxy-websgs lxml-3.1.0]# easy_install lxml

Verify the lxml installation
[root@zxy-websgs lxml-3.1.0]# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
>>> exit()

Alternatively, install libxml2. The official site recommends version 2.6.28 or later, but I could not find it there. I first installed version 2.6.9, and running scrapy failed with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR + \
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

Upgrading to version 2.6.21 fixed the problem.
libxml2 2.6.21: ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
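
The version numbers here are easy to misread: 2.6.21 is newer than 2.6.9, although still below the recommended 2.6.28. A small sketch of comparing such versions numerically follows. `version_tuple` and `meets_minimum` are hypothetical helpers that assume purely numeric, dot-separated version strings.

```python
def version_tuple(text):
    """Turn '2.6.21' into (2, 6, 21) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# Compared as strings, "2.6.9" sorts after "2.6.21"; as tuples it does not.
print("2.6.9" > "2.6.21")                                # True
print(version_tuple("2.6.9") > version_tuple("2.6.21"))  # False
print(meets_minimum("2.6.21", "2.6.28"))                 # False
```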
7. Install pyOpenSSL (optional; it mainly lets Scrapy support HTTPS)
easy_install pyOpenSSL installs pyOpenSSL 0.13, which failed to install, so I downloaded version 0.11 manually and installed that.
[root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
[root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz
[root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install

pyOpenSSL: http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz
8. Install Scrapy
[root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy

Verify the installation

[root@zxy-websgs pyOpenSSL-0.11]# scrapy
Scrapy 0.16.4 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

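scrapy: http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
Summary:
Installing pyOpenSSL on its own with easy_install does not succeed; download and install pyOpenSSL 0.11 first, then run easy_install -U Scrapy to install everything else.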

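9. Does Python's Scrapy need to be installed separately

I. Scrapy at a glance
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
This document introduces the concepts behind Scrapy to give you an idea of how it works and help you decide whether Scrapy is what you need.
When you are ready to start your project, you can refer to the introductory tutorial.

II. Installing Scrapy
Platform and supporting tools required by the Scrapy framework:
Python 2.7 (the latest Python release is 3.5; version 2.7 is used here)
Python packages: pip and setuptools. pip now depends on setuptools and installs it automatically if it is missing.
lxml. Most Linux distributions ship with lxml. If it is missing, see the lxml installation documentation.
OpenSSL. Provided on all systems except Windows (see the platform installation guide).
You can install Scrapy with pip (pip is the recommended way to install Python packages).
pip install Scrapy
Installation on Windows:
1. After installing Python 2.7, modify the PATH environment variable so that the Python executable and the additional scripts are on the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
In addition, you can also use c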
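
Whether the two directories from step 1 are already on PATH can be checked with a few lines of Python. This is a sketch only: `missing_path_entries` is a hypothetical helper, and the case-insensitive comparison with trailing separators ignored is an assumption about how Windows treats these paths.

```python
import os


def missing_path_entries(required, path_value=None, sep=os.pathsep):
    """Return the directories from `required` that are absent from PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Ignore case and trailing separators when comparing entries.
    present = {p.rstrip("\\/").lower() for p in path_value.split(sep) if p}
    return [d for d in required if d.rstrip("\\/").lower() not in present]


REQUIRED = ["C:\\Python27\\", "C:\\Python27\\Scripts\\"]
# Example PATH value in which only the first directory is present.
print(missing_path_entries(REQUIRED, "C:\\Windows;C:\\Python27", sep=";"))
```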

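10. What Python crawler tutorials and Python learning paths are there

So many people are keen on crawlers because crawlers can do a lot: search engines, data collection, ad filtering, and more. In Python's case, crawlers feed data analysis and play a major role in data collection.
That does not mean that knowing the Python language alone is enough to master crawling. There is a lot more to learn, including but not limited to HTML, the basics of HTTP/HTTPS, regular expressions, databases, common packet-capture tools, and crawler frameworks. Large-scale crawling additionally requires distributed systems concepts, message queues, common data structures and algorithms, caching, and even applications of machine learning; large systems rest on many technologies.
How should a complete beginner learn crawling? For a beginner who feels lost, the most important thing at the start is a clear learning path and the right learning method; with good study habits, later systematic study then becomes far more productive.
To write crawlers in Python you first need to know Python: understand the basic syntax and know how to use functions, classes, and the common methods of data structures such as list and dict. To get started with crawlers you also need the basic principles of HTTP. The HTTP specification cannot be covered in a single book, but the deeper material can wait; combining theory with practice makes later learning easier. The steps I would suggest fall roughly into the following parts:
Crawler fundamentals:
Definition of a crawler
What crawlers are used for
The HTTP protocol
Using a basic packet-capture tool (Fiddler)
Implementing crawlers with Python modules:
Overview of the urllib3, requests, lxml, and bs4 modules
Fetching static page data with requests using GET
Fetching static page data with requests using POST
Fetching AJAX dynamic page data with requests
Simulating website login with requests
Recognizing CAPTCHAs with Tesseract
The Scrapy framework and Scrapy-Redis:
Overview of the Scrapy crawler framework
The Scrapy Spider class
Scrapy Items and pipelines
The Scrapy CrawlSpider class
Building a distributed crawler with Scrapy-Redis
Crawling data with automated testing tools and browsers:
Selenium + PhantomJS: introduction and simple examples
Selenium + PhantomJS: logging in to a website
Selenium + PhantomJS: crawling dynamic page data
Hands-on crawler project:
Distributed crawler + Elasticsearch to build a search engine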
