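[Repost] Scraping Douban movie information with Puppeteer
The code is pleasant to work with. Best of all, it also captures the poster images, and they come back as JPGs.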
More projects: https://github.com/zhentaoo/puppeteer-deep
const puppeteer = require('puppeteer');

const url = `https://movie.douban.com/tag/#/?sort=T&range=0,10&tags=2020`;

// Resolve after `time` milliseconds. The promise has to be returned,
// otherwise `await sleep(...)` continues immediately.
const sleep = time => new Promise(resolve => setTimeout(resolve, time));

/* Equivalent long form:
const sleep = function (time) {
  return new Promise(function (resolve) {
    setTimeout(resolve, time)
  })
}
*/

(async () => {
  try {
    console.log('start visit the target page');
    const browser = await puppeteer.launch({
      // args: extra switches passed to the Chrome instance; for example,
      // '--ash-host-window-bounds=1024x768' sets the browser window size.
      args: ['--no-sandbox'], // run without the sandbox
      // dumpio: whether to pipe the browser process's stdout and stderr
      // into process.stdout and process.stderr. Defaults to false.
      dumpio: false,
      // headless: true runs without opening a browser window (the default).
      headless: false
    });

    const page = await browser.newPage();
    await page.goto(url, {
      waitUntil: 'networkidle2' // wait until the network has gone quiet, i.e. the page has loaded
    });
    await sleep(3000);

    // The "more" button is loaded asynchronously; wait for it before clicking.
    await page.waitForSelector('.more');
    for (let i = 0; i < 3; i++) {
      await sleep(3000);
      await page.click('.more'); // click the button once to load more entries
    }

    // page.evaluate() runs the callback inside the page, where the DOM can be
    // queried; here it collects every link under .list-wp.
    const result = await page.evaluate(() => {
      // Use the jQuery instance that the page itself exposes on window.
      var $ = window.$;
      var items = $('.list-wp a');
      var links = [];
      // Only iterate when the list actually has entries.
      if (items.length >= 1) {
        items.each((index, item) => {
          let it = $(item);
          console.log(it); // logs to the browser console, not to Node
          let doubanID = it.find('div').data('id'); // jQuery >= 1.4.3 reads the div's data-id attribute
          let title = it.find('.title').text();
          let rate = Number(it.find('.rate').text());
          // Ask for the large poster instead of the small one.
          let poster = it.find('img').attr('src').replace('s_ratio', 'l_ratio');
          links.push({ doubanID, title, rate, poster });
        });
      }
      return links;
    });

    await browser.close();
    console.log(result);
  } catch (err) {
    console.log(err);
  }
})();
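The script spaces out its clicks with `await sleep(3000)`. `await` only pauses when the helper hands back a promise; an arrow function with a braced body and no `return` yields `undefined`, and `await undefined` carries on immediately. The following is a minimal sketch of that behaviour with no browser involved. `runSteps` is a hypothetical name used only for this illustration, and the delays are shortened to 100 ms.

```javascript
// Returns a promise that resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the click loop: wait, then record elapsed time.
async function runSteps(count, delayMs) {
  const stamps = [];
  const start = Date.now();
  for (let i = 0; i < count; i++) {
    await sleep(delayMs); // pauses here because sleep() returns a promise
    stamps.push(Date.now() - start);
  }
  return stamps;
}

// Prints three elapsed times, each roughly delayMs later than the last.
runSteps(3, 100).then((stamps) => console.log(stamps));
```

If the printed timestamps are all close to zero, the helper is not returning its promise.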
I tested the code and it works.
Scrape results:
start visit the target page
[
  { doubanID: 26754233, title: '八佰', rate: 7.5, poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg' },
  { doubanID: 35051512, title: '我和我的家乡', rate: 7, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620453443.jpg' },
  { doubanID: 34477588, title: '弥留之国的爱丽丝 第一季', rate: 8, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2624050592.jpg' },
  { doubanID: 35096844, title: '送你一朵小红花', rate: 7.2, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2618247457.jpg' },
  { doubanID: 33440244, title: '一直游到海水变蓝', rate: 6.7, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2679356779.jpg' },
  { doubanID: 25907124, title: '姜子牙', rate: 6.6, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2621219978.jpg' },
  { doubanID: 30128916, title: '夺冠', rate: 7.1, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620083313.jpg' },
  { doubanID: 33447642, title: '沉默的真相', rate: 9, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2620780603.jpg' },
  { doubanID: 30444960, title: '信条', rate: 7.6, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2612061299.jpg' },
  { doubanID: 24733428, title: '心灵奇旅', rate: 8.7, poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2626308994.jpg' },
  { doubanID: 30171424, title: '拆弹专家2', rate: 7.5, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2621379901.jpg' },
  { doubanID: 30306570, title: '囧妈', rate: 5.9, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2581835383.jpg' },
  { doubanID: 35155748, title: '金刚川', rate: 6.5, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2623301908.jpg' },
  { doubanID: 33404425, title: '隐秘的角落', rate: 8.8, poster: 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2609064048.jpg' },
  { doubanID: 33432655, title: '困在时间里的父亲', rate: 8.6, poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2628877926.jpg' },
  { doubanID: 34894753, title: '沐浴之王', rate: 6, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2627788612.jpg' },
  { doubanID: 26357307, title: '花木兰', rate: 4.8, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2590336843.jpg' },
  { doubanID: 35069506, title: '一点就到家', rate: 6.5, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2621101922.jpg' },
  { doubanID: 30466931, title: '波斯语课', rate: 8.1, poster: 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2588101332.jpg' },
  { doubanID: 30323687, title: '夜间小屋', rate: 6.3, poster: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2637498114.jpg' }
]
[Finished in 7.9s]
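The poster URLs in the listing above all contain the `s_ratio_poster` path segment, which is the small rendition. A minimal sketch of rewriting a collected URL to the large one as a standalone step is shown here. `toLargePoster` is a hypothetical helper name, and it rests on the assumption (not confirmed in this post) that Douban serves the same file under `l_ratio_poster`.

```javascript
// Hypothetical helper: swap the size segment of a Douban poster URL.
// Assumption: the image host serves the same file under /l_ratio_poster/.
// URLs without the small-poster segment are returned unchanged.
function toLargePoster(url) {
  return url.replace('/s_ratio_poster/', '/l_ratio_poster/');
}

const sample = 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg';
console.log(toLargePoster(sample));
// prints https://img9.doubanio.com/view/photo/l_ratio_poster/public/p2615992304.jpg
```

Matching the whole path segment, slashes included, keeps the rewrite from touching any other part of the URL.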
Updated: 2022-04-25 07:07:50