[Repost] Scraping Douban movie info with puppeteer
The code is pleasant to work with. Most importantly, it also collects the poster images, and as jpg files at that.
More projects: https://github.com/zhentaoo/puppeteer-deep
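const puppeteer = require('puppeteer')

const url = `https://movie.douban.com/tag/#/?sort=T&range=0,10&tags=2020`

// Resolve after `time` milliseconds. The promise has to be returned,
// otherwise `await sleep(...)` does not wait at all.
const sleep = time => new Promise(resolve => {
  setTimeout(resolve, time)
});

/* Equivalent form with function expressions:
const sleep = function (time) {
  return new Promise(function (resolve) {
    setTimeout(resolve, time)
  })
}
*/

(async () => {
  try {
    console.log('start visit the target page')
    const browser = await puppeteer.launch({
      // args: extra arguments passed to the Chrome instance, e.g.
      // "--ash-host-window-bounds=1024x768" to set the browser window size
      args: ['--no-sandbox'], // run Chrome without its sandbox
      // dumpio: whether to pipe the browser's stdout and stderr into
      // process.stdout and process.stderr; defaults to false
      dumpio: false,
      // headless: true runs without opening a browser window; defaults to true
      headless: false
    });
    const page = await browser.newPage();
    await page.goto(url, {
      // wait until the network is nearly idle (no more than 2 connections
      // for at least 500 ms), i.e. the page has finished loading
      waitUntil: 'networkidle2'
    });
    await sleep(3000)
    // wait for the "more" button; without this, elements that are loaded
    // asynchronously may not exist yet
    await page.waitForSelector('.more')
    for (let i = 0; i < 3; i++) {
      await sleep(3000)
      await page.click('.more') // one click per iteration loads one more batch
    }
    // page.evaluate runs the callback inside the page, where the DOM can be
    // inspected; each `.list-wp a` link is one movie card
    const result = await page.evaluate(() => {
      // use the jQuery instance that the page itself exposes as window.$
      var $ = window.$
      var items = $('.list-wp a')
      var links = []
      // only iterate when the list is non-empty
      if (items.length >= 1) {
        items.each((index, item) => {
          let it = $(item)
          console.log(it) // logged in the browser console, not in Node
          // jQuery >= 1.4.3: .data('id') reads the div's data-id attribute
          let doubanID = it.find('div').data('id')
          let title = it.find('.title').text()
          let rate = Number(it.find('.rate').text())
          // swap the small poster for the large one; the original post
          // misspelled the pattern as 's_retio', which matched nothing, so the
          // sample output below still shows s_ratio_poster URLs
          let poster = (it.find('img').attr('src') || '').replace('s_ratio', 'l_ratio')
          links.push({
            doubanID,
            title,
            rate,
            poster
          })
        })
      }
      return links
    })
    await browser.close()
    console.log(result)
  } catch (err) {
    console.log(err)
  }
})();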
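A note on the `sleep` helper, since the pauses between clicks depend on it: an arrow function with a braced body only hands back its promise if it says `return`. The sketch below contrasts the two forms. The names `brokenSleep` and `demo` are made up for illustration and are not part of the original script.

```javascript
// Concise arrow body: the promise is the return value, so callers can await it.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Braced body without `return`: the function yields undefined,
// so `await brokenSleep(ms)` continues right away.
const brokenSleep = (ms) => {
  new Promise((resolve) => setTimeout(resolve, ms));
};

// Measure how long each variant actually pauses (illustrative only).
async function demo() {
  const start = Date.now();
  await brokenSleep(200);
  const afterBroken = Date.now() - start;
  await sleep(200);
  const afterFixed = Date.now() - start;
  return { afterBroken, afterFixed };
}

demo().then((timings) => console.log(timings));
```

With the broken form the script would still run, but the three clicks on `.more` would fire back to back instead of three seconds apart.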
Tested the code; it works.
Scraped output:
start visit the target page
[
{
doubanID: 26754233,
title: '八佰',
rate: 7.5,
poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg'
},
{
doubanID: 35051512,
title: '我和我的家乡',
rate: 7,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620453443.jpg'
},
{
doubanID: 34477588,
title: '弥留之国的爱丽丝 第一季',
rate: 8,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2624050592.jpg'
},
{
doubanID: 35096844,
title: '送你一朵小红花',
rate: 7.2,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2618247457.jpg'
},
{
doubanID: 33440244,
title: '一直游到海水变蓝',
rate: 6.7,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2679356779.jpg'
},
{
doubanID: 25907124,
title: '姜子牙',
rate: 6.6,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2621219978.jpg'
},
{
doubanID: 30128916,
title: '夺冠',
rate: 7.1,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620083313.jpg'
},
{
doubanID: 33447642,
title: '沉默的真相',
rate: 9,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620780603.jpg'
},
{
doubanID: 30444960,
title: '信条',
rate: 7.6,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2612061299.jpg'
},
{
doubanID: 24733428,
title: '心灵奇旅',
rate: 8.7,
poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2626308994.jpg'
},
{
doubanID: 30171424,
title: '拆弹专家2',
rate: 7.5,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2621379901.jpg'
},
{
doubanID: 30306570,
title: '囧妈',
rate: 5.9,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2581835383.jpg'
},
{
doubanID: 35155748,
title: '金刚川',
rate: 6.5,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2623301908.jpg'
},
{
doubanID: 33404425,
title: '隐秘的角落',
rate: 8.8,
poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2609064048.jpg'
},
{
doubanID: 33432655,
title: '困在时间里的父亲',
rate: 8.6,
poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2628877926.jpg'
},
{
doubanID: 34894753,
title: '沐浴之王',
rate: 6,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2627788612.jpg'
},
{
doubanID: 26357307,
title: '花木兰',
rate: 4.8,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2590336843.jpg'
},
{
doubanID: 35069506,
title: '一点就到家',
rate: 6.5,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2621101922.jpg'
},
{
doubanID: 30466931,
title: '波斯语课',
rate: 8.1,
poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2588101332.jpg'
},
{
doubanID: 30323687,
title: '夜间小屋',
rate: 6.3,
poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2637498114.jpg'
}
]
[Finished in 7.9s]
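The records above can be tidied up after scraping. The following is a minimal sketch, not part of the original script: `toLargePoster` and `parseRate` are hypothetical helpers, and it is an assumption that Douban serves an `l_ratio_poster` variant at the same path as `s_ratio_poster`. The two sample records are copied from the output above.

```javascript
// Hypothetical helper: point a poster URL at the large variant.
// Assumption: an `l_ratio_poster` image exists next to `s_ratio_poster`.
function toLargePoster(posterUrl) {
  return posterUrl.replace('s_ratio_poster', 'l_ratio_poster');
}

// Hypothetical helper: Number('') evaluates to 0, so a card without rating
// text would look like a 0-rated movie. Map empty or non-numeric text to null.
function parseRate(text) {
  const trimmed = String(text).trim();
  if (trimmed === '') return null;
  const value = Number(trimmed);
  return Number.isNaN(value) ? null : value;
}

// Two records copied from the scraped output.
const sample = [
  {
    doubanID: 26754233,
    title: '八佰',
    rate: 7.5,
    poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg',
  },
  {
    doubanID: 33447642,
    title: '沉默的真相',
    rate: 9,
    poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620780603.jpg',
  },
];

// Rewrite the poster URLs and sort by rating, highest first.
const normalized = sample
  .map((movie) => ({ ...movie, poster: toLargePoster(movie.poster) }))
  .sort((a, b) => b.rate - a.rate);

console.log(normalized);
```

Inside `page.evaluate`, `parseRate(it.find('.rate').text())` could stand in for the bare `Number(...)` call, provided the helper is defined inside the callback, since functions from the Node side are not visible in the page.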
Updated: 2022-04-25 07:07:50