PHP crawler regex: implementing a web crawler in PHP

① Regular expressions for a Python web crawler

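import re

# The saved page is GBK-encoded; read it in one go and let `with` close the file
with open('xx.htm', 'r', encoding='gbk') as file:
    xx = file.read()

# Grab every table cell that starts with an aligned <div> (lazy match, one cell at a time)
a = re.findall(r'<td><div align="[\s\S]*?</td>', xx)
# print('\n'.join(a))
for i in a:
    # Within a cell, pick out numbers, YYYY-MM-DD dates, and runs of Chinese characters
    parts = re.findall(r'\d+[.]?\d*</div>?|\d{4}-\d{2}-\d{2}</div>?|[\u4e00-\u9fa5]+<?', i)
    print('\n'.join(parts))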
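The patterns in the snippet above keep the trailing `</div>` or `<` in each match. As a minimal sketch of one alternative, a capture group can return only the cell value, and `re.fullmatch` can then label it. The sample HTML and the `classify` helper below are made up for illustration; they are not from the original page.

```python
import re

# Hypothetical sample standing in for the contents of xx.htm
sample = (
    '<td><div align="center">北京</div></td>'
    '<td><div align="center">12.5</div></td>'
    '<td><div align="center">2020-01-02</div></td>'
)

# The capture group keeps only the text between the tags
cells = re.findall(r'<td><div align="[^"]*">([\s\S]*?)</div></td>', sample)


def classify(value):
    """Label a cell using the same three shapes as the original patterns."""
    if re.fullmatch(r'\d{4}-\d{2}-\d{2}', value):
        return 'date'
    if re.fullmatch(r'\d+[.]?\d*', value):
        return 'number'
    if re.fullmatch(r'[\u4e00-\u9fa5]+', value):
        return 'text'
    return 'other'


labelled = [(cell, classify(cell)) for cell in cells]
print(labelled)
```

Checking the date pattern first matters here, because a value such as `2020-01-02` would otherwise fall through to `other` rather than being tried against the stricter shape.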

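② How to write a web crawler in PHP

PHP is actually quite convenient for crawling. Its regular-expression functions make it easy to collect the links on a page, and fopen, file_get_contents, and the libcurl functions make downloading page content straightforward.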

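③ Looking for a crawler written in PHP, one that can get around restrictions.

Based on the asker's requirements, I spent two hours typing out this code by hand. Take it, no thanks needed.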
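from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import os
import requests

class Huaban():

    def get_picture_url(self, content):
        # `driver` is not defined in this excerpt; it is expected to be a
        # module-level WebDriver (for example webdriver.Chrome()) created elsewhere.
        global path
        # Raw string so the backslashes in the Windows path are kept literally
        path = os.path.join(r"E:\spider\pictures\huaban", content)

        if not os.path.exists(path):
            os.makedirs(path)
        url = "http://huaban.com"

        driver.maximize_window()
        driver.get(url)
        time.sleep(8)

        # Fill in the login form; find_element raises if a field is missing
        try:
            driver.find_element(By.XPATH, '//input[@name="email"]').send_keys('your Huaban account')
            print('user success!')
        except Exception:
            print('user error!')
        time.sleep(3)
        try:
            driver.find_element(By.XPATH, '//input[@name="password"]').send_keys('your password')
            print('pw success!')
        except Exception:
            print('pw error!')
        time.sleep(3)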

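④ How to write a PHP crawler that automatically scrapes Baidu Zhidao

Write it with curl: simulate the login, fetch the page, parse the tags, use regular expressions to match the content you want, then store the data. That is roughly the workflow.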
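The workflow above (fetch with a session cookie, match with a regular expression, store the result) can be sketched in a few lines. This is only an illustration in Python using the standard library, not the answerer's code: the `question-title` markup, the `questions` table, and all function names are hypothetical, and the fetch step is defined but not called so the example runs offline.

```python
import re
import sqlite3
import urllib.request


def fetch(url, cookie=None):
    """Download a page; a session cookie stands in for the simulated login."""
    headers = {"User-Agent": "Mozilla/5.0"}
    if cookie:
        headers["Cookie"] = cookie
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical markup: each question is an <a class="question-title"> link
QUESTION_RE = re.compile(r'<a class="question-title" href="([^"]+)">([^<]+)</a>')


def extract_questions(html):
    """Return (url, title) pairs matched by the regular expression."""
    return QUESTION_RE.findall(html)


def store(conn, rows):
    """Save the pairs, skipping URLs that are already in the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO questions VALUES (?, ?)", rows)
    conn.commit()


# Offline demonstration with a canned page instead of a live request
page = (
    '<a class="question-title" href="/q/1">How to crawl?</a>'
    '<a class="question-title" href="/q/2">Which regex?</a>'
)
conn = sqlite3.connect(":memory:")
store(conn, extract_questions(page))
saved = conn.execute("SELECT url, title FROM questions ORDER BY url").fetchall()
print(saved)
```

Using the URL as the primary key means a page can be crawled again without creating duplicate rows.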

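⑤ How to get all the links on a page with a cURL crawler in PHP

This article follows on from the previous two. Its example calls the functions from those articles to build a simple URL collector. PHP code usually collects network data with file_get_contents, file, or cURL. cURL is said to be faster and more professional than file_get_contents and file, and better suited to scraping, so today we will try using cURL to get all the links on a page. The example follows: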

<?php
/*
 * Use cURL to collect all the links on hao123.com.
 */
include_once('function.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.hao123.com/');
// Include the HTTP response headers in the returned output
curl_setopt($ch, CURLOPT_HEADER, 1);
// Uncomment to skip the body; left disabled because the links are in the body
// curl_setopt($ch, CURLOPT_NOBODY, 1);
// Return the result as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
$info = curl_getinfo($ch);
if ($html === false) {
    echo "cURL Error: " . curl_error($ch);
    $html = '';
}
curl_close($ch);
$linkarr = _striplinks($html);
// Base URL, used to complete relative links
$host = 'http://www.hao123.com/';
$linkresult = array();
if (is_array($linkarr)) {
    foreach ($linkarr as $k => $v) {
        $linkresult[$k] = _expandlinks($v, $host);
    }
}
printf("<p>All links on this page:</p><pre>%s</pre>\n", var_export($linkresult, true));
?>

The contents of function.php are as follows (the two functions from the previous two articles combined):

<?php
function _striplinks($document) {
    preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1) (.*?)\\1 | ([^\s\>]+))'isx", $document, $links);
    // Concatenate the non-empty matches from the conditional subpattern.
    // foreach replaces each(), which was removed in PHP 8.
    $match = array();
    foreach ($links[2] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    foreach ($links[3] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    // return the links
    return $match;
}
/*===================================================================*
Function: _expandlinks
Purpose: expand each link into a fully qualified URL
Input: $links the links to qualify
$URI the full URI to get the base from
Output: $expandedLinks the expanded links
*===================================================================*/
function _expandlinks($links, $URI)
{
    $URI_PARTS = parse_url($URI);
    $host = $URI_PARTS["host"];
    // Drop the query string, then the file name and trailing slash, to get the base directory
    preg_match("/^[^\?]+/", $URI, $match);
    $match = preg_replace("|/[^\/\.]+\.[^\/\.]+$|", "", $match[0]);
    $match = preg_replace("|/$|", "", $match);
    $match_part = parse_url($match);
    $match_root = $match_part["scheme"] . "://" . $match_part["host"];
    $search = array(
        "|^http://" . preg_quote($host, "|") . "|i",
        "|^(\/)|i",
        "|^(?!http://)(?!mailto:)|i",
        "|/\./|",
        "|/[^\/]+/\.\./|"
    );
    $replace = array(
        "",
        $match_root . "/",
        $match . "/",
        "/",
        "/"
    );
    $expandedLinks = preg_replace($search, $replace, $links);
    return $expandedLinks;
}
?>
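For comparison, the job `_expandlinks` does (turning relative links into fully qualified URLs against a base) is what `urljoin` in Python's standard library provides. The sketch below is not a port of the PHP function, and apart from the hao123.com base URL taken from the example above, the link values are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://www.hao123.com/"

# Hypothetical links of the kinds a page typically contains
links = [
    "/news/index.html",             # root-relative
    "soft/list.html",               # relative to the current directory
    "http://example.com/x",         # already absolute
    "mailto:someone@example.com",   # non-HTTP scheme, left untouched
]

expanded = [urljoin(base, link) for link in links]
for url in expanded:
    print(url)
```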

⑥ Implementing a web crawler in PHP

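Use pcntl_fork or swoole_process for multi-process concurrency. If each page takes 500 ms to fetch, 200 processes can fetch 400 pages per second.
Use curl to fetch pages; setting cookies lets you simulate a login.
Use simple_html_dom to parse pages and work with the DOM.
To simulate a browser, you can use CasperJS, wrapped with the swoole extension as a service interface that the PHP layer calls.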

There is a crawler system here built on the approach above that fetches tens of millions of pages every day.
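The throughput figure can be checked with a little arithmetic. The sketch below assumes every process fetches pages back to back with no idle time, so it gives an upper bound; the function name is made up for illustration.

```python
def pages_per_second(processes, seconds_per_page):
    """Upper bound on throughput when every process fetches back to back."""
    return processes / seconds_per_page


rate = pages_per_second(200, 0.5)   # 200 processes, 500 ms per page
per_day = rate * 24 * 60 * 60       # seconds in a day

print(rate)     # 400.0
print(per_day)  # 34560000.0
```

At 400 pages per second the daily ceiling is about 34.5 million pages, which is consistent with the "tens of millions" figure quoted above.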

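⑦ Is there any difference between a crawler written in PHP and one written in Python?

No fundamental difference. They are programs with the same functionality written in different languages.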

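⑧ Has anyone here written regular expressions? I have a question: how do I match the content I need in the pages my crawler has downloaded?

In short:
Use regular expressions, or a library dedicated to parsing HTML, to extract it.

In detail:
You are in luck. I wrote about this before in great detail, explaining the principles and giving worked examples and source code. Read it and everything should be clear:
How to scrape static web pages and simulate website login with Python, C#, and other languages

(I won't give the post's address here; search for the title on Google and you will find it.)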
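The two options named in the short answer can be put side by side. This is a minimal sketch using only Python's standard library; the sample HTML and the `LinkCollector` class are made up for illustration and are not taken from the post mentioned above.

```python
import re
from html.parser import HTMLParser

# Hypothetical downloaded page
SAMPLE = '<p><a href="/a.html">A</a> <a class="x" href="http://example.com/b">B</a></p>'

# Option 1: a regular expression, workable when the markup is simple and predictable
regex_links = re.findall(r'<a\s[^>]*?href="([^"]*)"', SAMPLE)


# Option 2: an HTML parser, which does not depend on attribute order or quoting style
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE)
parser_links = collector.links

print(regex_links)
print(parser_links)
```

Both approaches return the same two links for this sample; the parser version tends to hold up better once the markup becomes irregular.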

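⑨ How do I use a regex in PHP to match a div whose class has two values?

It is better not to use a regex for this. It sounds like you are writing a crawler; if so, I suggest a class selector or an XPath selector, either of which is simpler than a regex.
Those are the two tools I use whenever I run into this kind of problem. Feel free to ask me directly if you have questions.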
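To show why a selector is simpler than a regex here: class names may appear in any order and alongside others. Python's standard library has no CSS or full XPath selector engine, so the sketch below approximates a class selector such as `div.item.active` with `html.parser` by comparing sets of class names. The `DivByClasses` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class DivByClasses(HTMLParser):
    """Collect the text of every <div> whose class attribute contains all wanted names."""

    def __init__(self, *wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0       # greater than 0 while inside a matching div
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a match
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


html = (
    '<div class="item">skip</div>'
    '<div class="active item">keep <div>nested</div></div>'
    '<div class="item active extra">also</div>'
)
finder = DivByClasses("item", "active")
finder.feed(html)
matches = finder.results
print(matches)
```

The second div matches even though its classes are written as `active item`, a case that a literal regex on `class="item active"` would miss.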

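⑩ What is the difference between a PHP crawler and a command-line Python crawler?

Both PHP and Python can handle simple scraping jobs, but Python is comparatively better and more convenient. It has many ready-made libraries and methods for parsing sites directly and picking out the data you need, whereas in PHP you end up doing most of the matching with regular expressions, which is tedious.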
