PHP web page body-text extraction: scraping specific content from a web page with PHP

A. Scraping specific content from a web page with PHP

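<?php
/*
 * A rather crude approach:
 * fetch the page with PHP and extract the content with a regular expression,
 * then use JavaScript to reload the current page every 5 minutes (that is, fetch the page again).
 *
 * Notes:
 * - In $mode, replace <title></title> with whatever you need
 *   (for example, $mode = "#<a(.*?)</a>#is"; captures all links).
 * - In window.location.href="http://localhost//refesh.php"; replace
 *   http://localhost//refesh.php with your own URL. Its job is to reload the current page.
 * - setInterval(ref, 300000); calls ref() every 300000 ms
 *   (5 * 60 * 1000 ms, i.e. 5 minutes).
 * - print_r($arr); prints everything that was matched. $arr is an array, so you
 *   can print just part of it (for example, echo $arr[1][0];).
 * - To output the whole page instead, remove
 *     $mode = "#<title>(.*?)</title>#is";
 *     if (preg_match_all($mode, $content, $arr)) {
 *         print_r($arr);
 *         echo "<br/>";
 *         echo $arr[1][0];
 *     }
 *   and add echo $content;
 */
$url = "http://www..com"; // target site

// file_get_contents() returns false on failure; @ suppresses the warning
$content = @file_get_contents($url);
if ($content === false) {
    die("Timed out");
}

// Non-greedy, case-insensitive, and . also matches newlines
$mode = "#<title>(.*?)</title>#is";
if (preg_match_all($mode, $content, $arr)) {
    //print_r($arr);
    echo "<br/>";
    echo $arr[1][0];
}
?>
<script type="text/javascript">
<!--
function ref() {
    window.location.href = "http://localhost//refesh.php";
}
setInterval(ref, 300000);
//-->
</script>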
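
To make the shape of $arr easier to see, here is a small sketch that runs the same kind of pattern against an inline string instead of a live page. The sample HTML in $content is made up for illustration.

```php
<?php
// Made-up sample page, standing in for the result of file_get_contents($url)
$content = '<html><head><title>Demo page</title></head>'
         . '<body><a href="/x">X</a><a href="/y">Y</a></body></html>';

// Title: a single match is enough, so preg_match() will do
preg_match('#<title>(.*?)</title>#is', $content, $titleMatch);
$title = $titleMatch[1];

// Links: preg_match_all() collects every match
preg_match_all('#<a[^>]*>(.*?)</a>#is', $content, $arr);

// $arr[0] holds the full matches, $arr[1] holds the first capture group
print_r($arr);
echo $title, "\n";
```

So $arr[1][0] is the text captured by the first pair of parentheses in the first match.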

B. Is there a way to fetch web page content through a proxy in PHP?

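Yes, there is.

Use the Snoopy class. Snoopy.class.php is available online; search for it yourself.
The Snoopy class lets you set $proxy_host to the proxy host and $proxy_port to the proxy port. Download a copy; there are plenty of tutorials online, and it should be clear once you read through them.

As for reading proxy.txt and rotating IPs: if you don't have many working proxies, picking one at random is good enough. The site you are scraping will record the IP of your proxy server.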
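
The random selection could look roughly like this. It is only a sketch: pickRandomProxy() is a hypothetical helper, and it assumes proxy.txt holds one host:port entry per line, which the question does not actually specify.

```php
<?php
// Hypothetical helper: parse "host:port" lines and pick one entry at random.
function pickRandomProxy(array $lines): ?array
{
    $proxies = [];
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip blank lines and anything that is not host:port
        if ($line === '' || !preg_match('/^([^:\s]+):(\d+)$/', $line, $m)) {
            continue;
        }
        $proxies[] = ['host' => $m[1], 'port' => (int) $m[2]];
    }
    if ($proxies === []) {
        return null; // no usable proxy
    }
    return $proxies[array_rand($proxies)];
}

// In-memory sample; with a real file you would pass
// file('proxy.txt', FILE_IGNORE_NEW_LINES) instead.
$proxy = pickRandomProxy(["10.0.0.1:8080", "", "10.0.0.2:3128"]);
```

The chosen host and port would then go into Snoopy's $proxy_host and $proxy_port before fetching.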

C. Extracting content from HTML with PHP regular expressions

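In that case you don't need a regular expression at all!
PHP has a built-in function for this: strip_tags()
It takes two parameters.
First: the string to filter, which here is the HTML you mention. This parameter is required.
Second: the HTML tags to keep, i.e. the tags you do not want stripped. This parameter is optional!

If the second parameter is omitted, all HTML tags are stripped!

So why bother with regex???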
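
A quick sketch of both forms, using a made-up HTML fragment:

```php
<?php
// Made-up fragment for illustration
$html = '<div><p>Hello <b>world</b></p><a href="#">link</a></div>';

// No second argument: every tag is removed
$plain = strip_tags($html);

// Second argument lists the tags to keep
$keepBold = strip_tags($html, '<b>');

echo $plain, "\n";     // Hello worldlink
echo $keepBold, "\n";  // Hello <b>world</b>link
```

Note that strip_tags() only removes the tags; it does not insert spaces where they were, which is why "world" and "link" end up joined.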

D. Fetching a given web page with PHP

function getRemoteRes($url, $postfields = NULL, $timeout = 60) {
	$ci = curl_init();
	curl_setopt($ci, CURLOPT_URL, $url);
	curl_setopt($ci, CURLOPT_HEADER, FALSE);
	curl_setopt($ci, CURLOPT_RETURNTRANSFER, TRUE);
	// Certificate verification is switched off here; avoid this on untrusted networks
	curl_setopt($ci, CURLOPT_SSL_VERIFYPEER, 0);
	curl_setopt($ci, CURLOPT_SSL_VERIFYHOST, 0);
	curl_setopt($ci, CURLOPT_TIMEOUT, $timeout);
	// Send a POST only when fields are supplied; otherwise this is a plain GET
	if (is_array($postfields)) {
		curl_setopt($ci, CURLOPT_POST, TRUE);
		// http_build_query() URL-encodes the fields and joins them as k1=v1&k2=v2
		curl_setopt($ci, CURLOPT_POSTFIELDS, http_build_query($postfields));
	}
	$response = curl_exec($ci);
	// Treat transport errors and any non-200 status as a failure
	$failed = curl_errno($ci) || 200 !== curl_getinfo($ci, CURLINFO_HTTP_CODE);
	// Always release the handle, including on failure
	curl_close($ci);
	return $failed ? 'ERRNO!' : $response;
}
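First fetch the page with the function above, then parse the data you want out of the response. You can use a regular expression for the extraction, but that depends on the source of the page you are fetching. Since I don't know that yet, the above is only meant to give you an approach.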
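
As a rough illustration of the parsing step, the sketch below pulls the first <h1> out of a response string. extractHeading() is a hypothetical helper, the sample markup is invented, and $response stands in for what getRemoteRes($url) would return.

```php
<?php
// Hypothetical helper: return the text of the first <h1>, or null when the
// fetch failed or the page has no heading.
function extractHeading(string $response): ?string
{
    if ($response === 'ERRNO!') {   // failure marker returned by getRemoteRes()
        return null;
    }
    if (preg_match('#<h1[^>]*>(.*?)</h1>#is', $response, $m)) {
        // Drop nested tags and surrounding whitespace
        return trim(strip_tags($m[1]));
    }
    return null;
}

// Stand-in for: $response = getRemoteRes($url);
$response = '<html><body><h1 class="t"> Daily <em>report</em> </h1></body></html>';
$heading = extractHeading($response);
echo $heading, "\n"; // Daily report
```

Checking for the 'ERRNO!' marker first keeps a failed request from being mistaken for a page without a match.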

E. Extracting page content with PHP regular expressions

<?php
$theurl = "http://www.kitco.cn/cn/";
// file_get_contents() returns false on failure
if (!($contents = file_get_contents($theurl)))
{
    echo 'Could not open URL';
    exit;
}

/*
$contents = preg_replace('/<.+?>/', '', $contents);
*/

// Match from the "原油价格" (crude oil price) header cell to the end of its table row
if (preg_match("/<td class=\"tableHeader\" align=\"left\">原油价格(.*?)<\/tr>/us", $contents, $matches))
{
    print "A match was found:" . strip_tags($matches[0]);
} else {
    print "A match was not found.<br />";
}
?>

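Try it this way.
------------------------------------
The code above already comments out that line of yours. First locate the one unique fragment and pull out what you want, and only then strip the tags. Run it and see.
Output:
A match was found:原油价格 68.11 +0.95
That should be the result you want, right?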
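
The same order of operations (match first, strip tags afterwards) can be tried offline. This is only a sketch: the table markup below is invented, and the extra whitespace handling is my own addition so that adjacent cells do not run together.

```php
<?php
// Invented sample table, standing in for the downloaded page
$contents = '<table><tr><td class="tableHeader" align="left">Crude oil</td>'
          . '<td>68.11</td><td>+0.95</td></tr>'
          . '<tr><td>Gold</td><td>950.20</td></tr></table>';

$text = null;
// Step 1: isolate the one row we care about while the tags are still there
if (preg_match('#<td class="tableHeader" align="left">Crude oil(.*?)</tr>#s', $contents, $matches)) {
    // Step 2: put a space before each tag, strip the tags, then collapse whitespace
    $spaced = str_replace('<', ' <', $matches[0]);
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($spaced)));
}
echo $text, "\n"; // Crude oil 68.11 +0.95
```

If the tags were stripped from the whole page first, the class="tableHeader" marker that makes this row unique would be gone before the match runs.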

F. Question: extracting content from a web page with PHP

$str = '<script id="js_load" language="javascript" src="/match_data_xml/loaddm.php?md=2010-01-01&1374176264778"></script>';
// Capture the src value; \1 makes the closing quote match the opening one
if (preg_match('/src=([\'"])(.*?)\1/is', $str, $arr)) {
	$str_need = $arr[2];
	echo $str_need;
}
// Store $str_need in the database

G. How to extract web page content with regular expressions in PHP

If you want all the source between <div class="nav" monkey="nav"> and <div class="head-ad">, preg_match is enough and you don't need preg_match_all. If you want the contents of every <li></li> tag inside it, use preg_match_all.

// Extract the whole block
$pattern = '/<div class="nav" monkey="nav">(.+?)<div class="head-ad">/is';
preg_match($pattern, $string, $match);
// $match[0] is the full match, including both <div> tags;
// $match[1] is only the source between them
echo $match[0];

// Then extract the <li>...</li> items
$pattern = '/<li\b[^>]*>(.+?)<\/li>/is';

preg_match_all($pattern, $match[0], $results);
// $results[0] holds the complete <li> elements; drop duplicates
$new_arr = array_unique($results[0]);

foreach ($new_arr as $kkk) {
	echo $kkk;
}
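
The snippet above assumes $string already holds the page source. For a version you can run as is, here is a sketch with an invented $string; it uses $match[1] and $results[1] to get only the inner content rather than the surrounding tags.

```php
<?php
// Invented sample markup, standing in for the real page source
$string = '<div class="nav" monkey="nav"><ul>'
        . '<li><a href="/a">A</a></li><li class="x">B</li><li class="x">B</li>'
        . '</ul></div><div class="head-ad"></div>';

// One block is wanted, so preg_match() is enough
preg_match('/<div class="nav" monkey="nav">(.+?)<div class="head-ad">/is', $string, $match);

// Many items are wanted, so preg_match_all() is used
preg_match_all('/<li\b[^>]*>(.+?)<\/li>/is', $match[1], $results);

// $results[1] holds the content inside each <li>; remove duplicates and reindex
$items = array_values(array_unique($results[1]));
print_r($items);
```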

H. How to scrape article content from different websites by URL in PHP

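I have written my own scraping system, in both PHP and Python. The idea is to write one file per site you want to scrape and store a few variables in it. For example, if the content you want sits in a DIV on site A but in a p on site B, you create two files to hold these values that cannot be known in advance, and a shared download program then substitutes them in. If you have a better approach, I'd be glad to exchange ideas.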
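
A minimal sketch of that idea, not the actual system: the per-site values are kept in one array here for brevity (in the scheme described, each entry would live in its own file), and the host names, patterns, and extractArticle() are all made up for illustration.

```php
<?php
// Hypothetical per-site rules, keyed by host name
$siteRules = [
    'a.example' => ['pattern' => '#<div class="article">(.*?)</div>#is'],
    'b.example' => ['pattern' => '#<p class="content">(.*?)</p>#is'],
];

// Shared extraction routine: look up the rule for the URL's host and apply it
function extractArticle(string $url, string $html, array $siteRules): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || !isset($siteRules[$host])) {
        return null; // no rule for this site
    }
    if (preg_match($siteRules[$host]['pattern'], $html, $m)) {
        return trim(strip_tags($m[1]));
    }
    return null;
}

// $html would normally come from the shared download code
$article = extractArticle(
    'http://a.example/post/1',
    '<div class="article"> <b>Hi</b> there </div>',
    $siteRules
);
echo $article, "\n"; // Hi there
```

Adding a new site then means adding one more rule, while the download and extraction code stays the same.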

I. How do I change the text on a web page and extract the image with PHP?

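<?php
$text1 = "First sentence";
$text2 = "Second sentence";
$text3 = "Third sentence";
// The {text1}..{text3} placeholders are replaced with URL-encoded text below
$url = 'http://tp.388g.com/aosbegin00006.php?id=736&text1={text1}&text2={text2}&text3={text3}&text4=undefined&text5=undefined&rnd=';
$url = str_replace('{text1}', urlencode($text1), $url);
$url = str_replace('{text2}', urlencode($text2), $url);
$url = str_replace('{text3}', urlencode($text3), $url);
// rnd gets the current Unix timestamp
$url = $url . time();
$content = file_get_contents($url);
// # is used as the delimiter so the / in </outputimg> needs no escaping
if ($content !== false && preg_match('#<outputimg>.(.*?)</outputimg>#is', $content, $matched)) {
	$imgUrl = 'http://tp.388g.com' . $matched[1];
	echo $imgUrl;
}
exit;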

J. Extracting variable text from a web page with PHP

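You can use file_get_contents, or you can use curl. Either way, the text you want to extract first has to be wrapped in some marker, for example <span class="mark">some text</span>. Then you can cut it out based on class="mark", using substr or similar functions.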
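
A sketch of the substr approach follows. between() is a hypothetical helper and the sample HTML is invented; in practice $html would come from file_get_contents or curl.

```php
<?php
// Hypothetical helper: return the text between $start and $end,
// or null if either marker is missing.
function between(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);           // move past the opening marker
    $to = strpos($html, $end, $from);  // look for the closing marker after it
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

$html = '<p>Price: <span class="mark">68.11</span> USD</p>';
$value = between($html, '<span class="mark">', '</span>');
echo $value, "\n"; // 68.11
```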
