A PHP function for cleaning up scraped data
Date: October 4, 2023 / Source: Internet / Editor: Anonymous
The function:
function xxfseo_body($body){
    // Strip all attributes from every tag except <img>, which must keep its src.
    $body = preg_replace('~<(?!img)(\w+)\s+[^>]*>~i','<$1>', $body);
    // Remove embedded, script, and form-control elements along with their content.
    $body = preg_replace("/<(iframe.*?)>(.*?)<(\/iframe.*?)>/si", "", $body);
    $body = preg_replace("/<(object.*?)>(.*?)<(\/object.*?)>/si", "", $body);
    $body = preg_replace("/<(script.*?)>(.*?)<\/script>/si", "", $body);
    $body = preg_replace("~<(|/)form([^>]*)>~i", "", $body);
    $body = preg_replace("~<input([^>]*)>~i", "", $body);
    $body = preg_replace("/<(textarea.*?)>(.*?)<\/textarea>/si", "", $body);
    // Fixed typo: the tag is "button", not "botton".
    $body = preg_replace("/<(button.*?)>(.*?)<\/button>/si", "", $body);
    $body = preg_replace("/<(select.*?)>(.*?)<\/select>/si", "", $body);
    // Unwrap layout and inline tags, keeping their inner text.
    $body = preg_replace("~<(|/)div([^>]*)>~i", "", $body);
    $body = preg_replace("~<(|/)span([^>]*)>~i", "", $body);
    $body = preg_replace("~<(|/)font([^>]*)>~i", "", $body);
    // \b keeps this from also matching <abbr>, <article>, <audio>, and so on.
    $body = preg_replace('~</?a\b[^>]*>~i', "", $body);
    // No U modifier here: combined with "?" it would make (.*?) greedy.
    $body = preg_replace("~<style[^>]*>(.*?)</style>~is", "", $body);
    $body = preg_replace("~<xml[^>]*>(.*?)</xml>~is", '', $body);
    $body = preg_replace("~<(|/)b>~i", "", $body);
    // Remove conditional comments first, then ordinary ones (lazy, spanning lines).
    $body = preg_replace('~<!--\[if [^\]]+\]>(.*?)<!\[endif\]-->~is','', $body);
    $body = preg_replace('~<!--(.*?)-->~s','', $body);
    // Drop elements left empty by the steps above.
    $body = preg_replace('~<(\w+)[^>]*>\s*</\\1>~Us', '', $body);
    // Remove line breaks and any whitespace that follows a tag.
    $body = preg_replace("~[\r\n]+~",'', $body);
    $body = preg_replace("~>\s*~",'>', $body);
    // Clean up any stray closing </object> tags.
    $body = str_replace('</object>','', $body);
    return trim($body);
}
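Two regex pitfalls are worth keeping in mind when writing tag-stripping patterns like these. The sketch below uses made-up sample strings (the variable names are illustrative only) to show, first, that a bare tag name such as `a` also matches longer names like `abbr` unless a `\b` word boundary follows it, and second, that PCRE's `U` modifier inverts greediness, so `(.*?)` under `U` becomes greedy.

```php
<?php
// Hypothetical sample input, not taken from any real page.
$html = '<abbr>HTML</abbr> <a href="#">link</a>';

// Without a word boundary, "a" also matches the start of "abbr".
$loose = preg_replace('~</?a[^>]*>~i', '', $html);
// With \b, only the tags of the <a> element are removed.
$strict = preg_replace('~</?a\b[^>]*>~i', '', $html);

echo $loose, "\n";   // HTML link
echo $strict, "\n";  // <abbr>HTML</abbr> link

// Two <style> blocks with content between them that should survive.
$css = '<style>a{}</style><p>keep</p><style>b{}</style>';

// Under U, "(.*?)" turns greedy and swallows everything up to the last </style>.
$greedy = preg_replace('~<style[^>]*>(.*?)</style>~iUs', '', $css);
// Without U, ".*?" is lazy and each block is removed separately.
$lazy = preg_replace('~<style[^>]*>.*?</style>~is', '', $css);

var_dump($greedy);   // string(0) ""
echo $lazy, "\n";    // <p>keep</p>
```

If text between two `<style>`, `<xml>`, or comment blocks ever goes missing after cleaning, a greedy match of this kind is a likely cause.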
Very handy, and well suited to cleaning up fairly complex HTML after scraping.
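For comparison, here is a minimal sketch of a different technique, an allowlist built on PHP's built-in `strip_tags()`, rather than the blocklist of patterns used above. The function name `allowlist_clean`, its default tag list, and the sample string are all assumptions made for illustration. Note that `strip_tags()` keeps the text inside removed tags, so elements whose content must also go (such as `<script>`) are removed with a regex first, and that attributes on allowed tags are left untouched.

```php
<?php
// Hypothetical helper: keep only an allowlist of tags, drop everything else.
function allowlist_clean(string $html, string $allowed = '<p><br><img><ul><ol><li><strong><em>'): string
{
    // Remove elements together with their content; strip_tags() alone
    // would leave the script or style source behind as plain text.
    $html = preg_replace('~<(script|style|iframe|object)\b[^>]*>.*?</\1>~is', '', $html);
    // Strip every tag that is not on the allowlist, keeping inner text.
    return trim(strip_tags($html, $allowed));
}

// Made-up sample input.
$sample = '<div class="x"><p>Hello <a href="/x">world</a></p><script>alert(1)</script></div>';
echo allowlist_clean($sample), "\n"; // <p>Hello world</p>
```

An allowlist tends to fail safe when a page uses a tag nobody anticipated, whereas a blocklist lets unknown tags through; which trade-off fits depends on how varied the scraped sources are.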