I want to do some scraping.
So I'll scrape with Nuxt on Docker. On the Node side, puppeteer seems to be the recommended library for scraping, so let's do some quick scraping from Nuxt's serverMiddleware.
I was raised not to cause trouble for other people, so the target will be my own site. (No login required, feel free to use it ♪) toribure | Simple is the best brainstorming tool that can be used alone or as a team
(A little self-promotion there.) The top page shows a cute bird image (from Irasutoya), so this time I'll scrape that image and display it.
There are plenty of other articles on this topic as well, so have a look at those too. By the way, the environment was:
$ docker -v
Docker version 19.03.13-beta2, build ff3fbc9d55
$ docker-compose -v
docker-compose version 1.26.2, build eefe0d31
$ docker run --rm -it -w /app -v `pwd`:/app node yarn create nuxt-app scraping
? Project name: scraping
? Programming language: JavaScript
? Package manager: Yarn
? UI framework: None
? Nuxt.js modules: Axios
? Linting tools:
? Testing framework: None
? Rendering mode: Universal (SSR / SSG)
? Deployment target: Server (Node.js hosting)
? Development tools:
Of the modules, only axios will be used later, so I made a point of including it.
By the way, at the time of writing, the node:latest image was at 14.9.0, create-nuxt-app at v3.2.0, and nuxt at 2.14.0.
From here on, the newly created scraping/ directory is the working directory.
$ cd scraping
Dockerfile
FROM node

ENV HOME=/app \
    LANG=C.UTF-8 \
    TZ=Asia/Tokyo \
    HOST=0.0.0.0

WORKDIR ${HOME}

RUN apt-get update \
    && apt-get install -y wget gnupg \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \
       --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

COPY package.json ${HOME}
COPY yarn.lock ${HOME}
RUN yarn install
COPY . ${HOME}

EXPOSE 3000

CMD ["yarn", "run", "dev"]
The RUN apt-get ... part follows the Troubleshooting guide in the puppeteer repository: if a browser and the necessary fonts are not available inside the container, puppeteer will fail with an error when it tries to launch.
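As an aside, since the image now contains Google Chrome, you could in principle point puppeteer at that system browser instead of the Chromium it downloads during installation. This is only a hypothetical variant, not something this article relies on; the executablePath below assumes the default location used by the google-chrome-stable package.

```js
// Hypothetical variant: launch the system Chrome installed by the Dockerfile
// instead of puppeteer's bundled Chromium. Not used in the rest of this article.
const browser = await puppeteer.launch({
  executablePath: '/usr/bin/google-chrome-stable',
  args: ['--no-sandbox', '--disable-dev-shm-usage']
})
```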
docker-compose.yml
version: "3"
services:
  nuxt:
    build: .
    volumes:
      - .:/app
    ports:
      - 3000:3000
Once this is done, build the container once.
$ docker-compose build
Then add puppeteer with yarn.
$ docker-compose run --rm nuxt yarn add puppeteer
We will use serverMiddleware. Referring to the officially provided express-template, we'll expose the scraping logic as an API through serverMiddleware.
nuxt.config.js
export default {
  // ...
  serverMiddleware: {
    '/api': '~/api'
  }
}
This routes /api requests to ~/api/index.js, so let's create the files.
$ mkdir api
$ touch api/index.js api/scraping.js
I made two files: index.js is the receiver, and scraping.js does the actual processing. Note that index.js uses express, so if express is not already among the project's dependencies, add it with yarn add express as well.
api/index.js
const app = require('express')()
const scraping = require('./scraping')

// GET /api/get_image: scrape the image URL and return it
app.get('/get_image', async (req, res) => {
  const image = await scraping.getImage()
  res.send(image)
})

module.exports = {
  path: '/api',
  handler: app
}
When /api/get_image is accessed, this calls the getImage() method of scraping.js and returns the result.
api/scraping.js
const puppeteer = require('puppeteer')

async function getImage() {
  // --no-sandbox and --disable-dev-shm-usage are needed to run Chromium
  // inside the Docker container
  const browser = await puppeteer.launch({
    args: [
      '--no-sandbox',
      '--disable-dev-shm-usage'
    ]
  })
  const page = await browser.newPage()
  await page.goto("https://toribure.herokuapp.com/")
  // Run in the page context and pull out the src of the first img under main
  const image = await page.evaluate(() => {
    return document.getElementsByTagName("main")[0].getElementsByTagName("img")[0].src
  })
  // Close the browser so each request doesn't leak a Chromium process
  await browser.close()
  return image
}

module.exports = {
  getImage
}
It almost follows the puppeteer official README.
You can get at and manipulate page elements with page.evaluate.
If you inspect the HTML structure of the scraping target (https://toribure.herokuapp.com/) with the developer tools, you'll see there is exactly one img element under the main element, and that is the bird image we're after. (The markup is a bit messy, but the bird is cute.)
Once you know that, all that's left is to grab the element just as you would in ordinary JavaScript.
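For instance, the same extraction could also be written with querySelector; this is just a hypothetical equivalent, assuming the page structure described above.

```js
// Hypothetical equivalent of the extraction in scraping.js, using querySelector
const image = await page.evaluate(() => {
  const img = document.querySelector('main img')
  return img ? img.src : null
})
```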
This is the end of coding on the API side.
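If you want a quick sanity check before touching the front end, you could hit the endpoint directly with a throwaway script like the one below. This is a hypothetical helper, not part of the article's code, and it assumes the dev server is already running and reachable at localhost:3000.

```js
// check-api.js: hypothetical helper, not part of the article's code.
const http = require('http')

http.get('http://localhost:3000/api/get_image', (res) => {
  let body = ''
  res.on('data', (chunk) => { body += chunk })
  res.on('end', () => console.log('image URL:', body))
}).on('error', (err) => console.error('request failed:', err.message))
```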
I'm getting a bit tired at this point, so the front end will be minimal: press a button and the image appears.
pages/index.vue
<template>
  <div>
    <button @click="showBird">Scraping!!</button>
    <br>
    <img v-if="src" :src="src">
  </div>
</template>

<script>
export default {
  data() {
    return {
      src: ""
    }
  },
  methods: {
    async showBird() {
      this.src = await this.$axios.$get("/api/get_image")
    }
  }
}
</script>
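As a small optional refinement that is not in the original component, showBird() could be wrapped in a try/catch so a failed scrape doesn't fail silently; a minimal sketch:

```js
// Hypothetical variant of showBird() with basic error handling
async showBird() {
  try {
    this.src = await this.$axios.$get('/api/get_image')
  } catch (e) {
    console.error('Failed to fetch image:', e)
  }
}
```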
**Complete!!**
A cute bird came out ♪
Once you've gotten this far, the rest is just DOM manipulation: if you understand the structure of the target page and write the JavaScript for it, you can scrape pretty much anything. Some sites prohibit scraping, so please keep that in mind while you bring your ideas to life!
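For example, extending the same pattern to pull several elements at once is just more DOM code inside page.evaluate. The sketch below is hypothetical and assumes an arbitrary page with links on it.

```js
// Hypothetical sketch: collect the text and href of every link on a page,
// the same way the single image URL was extracted above
const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map((a) => ({
    text: a.textContent.trim(),
    href: a.href
  }))
})
```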
- [Nuxt] Scraping with Puppeteer - From data acquisition to display (serverMiddleware) - 7839
- [[Procedure explanation] Easy scraping with JavaScipt! - Take a screenshot with Puppeteer | ProgLearn](https://blog.proglearn.com/2019/06/20/javascipt%E3%81%A7%E7%B0%A1%E5%8D%98%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0%EF%BC%81-puppeteer%E3%81%A7%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7/)
- Scraping with Puppeteer | grgr-dkrk's blog
- I tried scraping with Docker + docker-compose + puppeteer - Qiita
- Data acquisition by Axios with Nuxt.js - Qiita