A Python super beginner tries scraping

What is scraping?

When people say "scraping", they usually mean one of two things: "crawling" or "scraping". I kept mixing them up, so let me sort them out once.

Crawling
Following links on published web pages and downloading the target pages
Scraping
Extracting the information you want from (part of) a downloaded web page

For example, pulling the titles held by my favorite shogi player out of the Shogi Federation's site is the kind of thing that counts as "scraping".
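
To make the distinction concrete, here is a minimal sketch of the scraping half using only Python's standard library. The HTML string stands in for a page that a crawler has already downloaded. `TitleParser`, `scrape_title` and `SAMPLE_HTML` are names made up for this illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside <title>.
        if self._in_title:
            self.title += data


def scrape_title(html):
    """Scraping step: extract one piece of information from downloaded HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical page content; a crawler would normally have fetched this.
SAMPLE_HTML = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"
print(scrape_title(SAMPLE_HTML))
```

Crawling would be the part that decides which URLs to fetch and downloads them; it is left out here so the sketch runs offline.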

Scrapy

Let's actually try scraping. Come to think of it, I had only ever used PHP for this, painstakingly pulling the information I wanted out of pages with Goutte and the like.

Then I learned that Python, which I picked up recently, has a library (framework?) called Scrapy that makes scraping very easy.

This time I will use it to collect information about my favorite shogi players from the Shogi Federation's site.

Installation

$ pip install scrapy

Done

Tutorial

I am a super beginner who doesn't understand Python at all, so I will work through the tutorial step by step to get a feel for it.

The documentation has a tutorial section: https://docs.scrapy.org/en/latest/intro/tutorial.html

It's in English, but it's manageable.

The order of work described in the tutorial

Create a new Scrapy project
Write a spider to crawl a site and extract the data you need
Output the extracted information from the command line
Change the spider to follow links (I couldn't make sense of the English)
Use spider arguments

I'd like to work through it in this order.

1. Create a new Scrapy project

scrapy startproject tutorial

That seems to be all it takes.

[vagrant@localhost test]$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/lib64/python3.5/site-packages/scrapy/templates/project', created in:
    /home/vagrant/test/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
    
[vagrant@localhost test]$ ll
total 0
drwxr-xr-x 3 vagrant vagrant 38 Apr 16 04:15 tutorial

A directory named tutorial was created!

There are various things inside it; according to the documentation, each file has the following role.

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

The deploy configuration file is the only one I understood, lol

2. Write a spider to crawl a site and extract the data you need

Create a file named "quotes_spider.py" under "tutorial/spiders/" and paste in the following code.

[vagrant@localhost tutorial]$ vi tutorial/spiders/quotes_spider.py

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name
The spider's identifier. It must be unique within a project.
start_requests()
Defines where crawling starts. It must return an iterable of Requests (a list, or a generator as in the code above).
parse()
The callback that is invoked with the response once each page has been downloaded.
The response argument is an instance of TextResponse (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse).
It provides methods such as css() and xpath() for selecting and extracting elements on the page.

3. Output the extracted information from the command line

scrapy crawl quotes

Apparently this is how you run it.

After some log output, "quotes-1.html" and "quotes-2.html" were created.

[vagrant@localhost tutorial]$ ll
total 32
-rw-rw-r-- 1 vagrant vagrant 11053 Apr 16 04:27 quotes-1.html
-rw-rw-r-- 1 vagrant vagrant 13734 Apr 16 04:27 quotes-2.html
-rw-r--r-- 1 vagrant vagrant 260 Apr 16 04:15 scrapy.cfg
drwxr-xr-x 4 vagrant vagrant 129 Apr 16 04:15 tutorial

The heading says "output the extracted information from the command line", but looking at the parse method, all it actually does is the following:

Extract the page number from the URL of the crawled page
Substitute that number for the %s in quotes-%s.html
Finally, write the response body (response.body) to that file and save it
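
The first two steps are plain string handling, so they can be tried without Scrapy. This is a small sketch; `page_filename` is a helper name invented here, while the two expressions inside it are the ones from the parse method.

```python
def page_filename(url):
    """Build the output file name the same way the tutorial's parse() does."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in ("http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/"):
    print(url, "->", page_filename(url))
```

Note that this depends on the trailing slash: without it, index -2 would be "page" rather than the number.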

Writing start_requests the easy way

In the end this method only yields scrapy.Request objects, and the same thing can be achieved just by defining start_urls. Scrapy then creates the requests from that list and uses parse as the default callback.

    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

With this, there is no need to define the start_requests method at all.
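
Why the shorthand works can be pictured with a toy model. The sketch below is not Scrapy's code: `ToyRequest`, `ToyBaseSpider` and `ShortSpider` are hypothetical classes that only imitate the idea that a base class can supply a default start_requests() which walks start_urls.

```python
class ToyRequest:
    """Stand-in for scrapy.Request: just a URL and a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyBaseSpider:
    """Toy base class whose default start_requests() walks start_urls."""

    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            # parse is used as the callback when nothing else is specified.
            yield ToyRequest(url, callback=self.parse)

    def parse(self, response):
        raise NotImplementedError


class ShortSpider(ToyBaseSpider):
    # Only the URL list is defined; start_requests() is inherited.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        return response


spider = ShortSpider()
requests = list(spider.start_requests())
print([r.url for r in requests])
```

The subclass produces the same two requests as the hand-written start_requests method, which is why the method can be dropped.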

Finally, try extracting data

The tutorial says the best way to learn how Scrapy extracts data is to try selectors in the Scrapy shell.

I'll try it right away.

[vagrant@localhost tutorial]$ scrapy shell 'http://quotes.toscrape.com/page/1/'

...omitted...

2017-04-16 04:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbb13dd0080>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fbb129308d0>
[s]   spider     <DefaultSpider 'default' at 0x7fbb11f14828>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

First, extract elements with CSS and see what comes back

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

It looks like the title element was picked up.

response.css(xxx) returns a list-like object called SelectorList (https://docs.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.SelectorList), whose items wrap XML/HTML elements. That means you can drill down further from here to extract more data. As a test, let's extract the title's text.

>>> response.css('title::text').extract()
['Quotes to Scrape']
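
The difference between "a list of matched elements" and "the text inside them" can be reproduced with the standard library. This is only an analogy, assuming a small well-formed document: it uses xml.etree.ElementTree instead of Scrapy's selectors, and `PAGE`, `matches`, `as_markup` and `as_text` are names made up for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded page.
PAGE = ("<html><head><title>Quotes to Scrape</title></head>"
        "<body><p>hello</p></body></html>")
root = ET.fromstring(PAGE)

# Selecting always gives a list of matches, even when there is only one.
matches = root.findall(".//title")

# Serializing the elements keeps the surrounding tags ...
as_markup = [ET.tostring(el, encoding="unicode") for el in matches]
# ... while taking only the text drops them.
as_text = [el.text for el in matches]

print(as_markup)
print(as_text)
```

Real-world HTML is often not well-formed XML, which is one reason to use Scrapy's selectors instead of this approach.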

::text means that only the text inside the title tag is extracted.
If you leave it out, you get the whole element, title tags included.

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

You can see that the tags come along with it.

Getting just one of the elements

response.css() returns a SelectorList (https://docs.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.SelectorList), so extracting basically gives you a list. (That is why everything above was wrapped in [].)

If you want one specific element, either index into the list or use extract_first to get the first element.

Use extract_first

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

Index into the list

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

## There is only one title on this page, so asking for the second one raises an error
>>> response.css('title::text')[1].extract()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python3.5/site-packages/parsel/selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

Extracting with XPath

I wondered what XPath even was, but @merrill's article made it very easy to understand.

http://qiita.com/merrill/items/aa612e6e865c1701f43b

It lets you specify things like "the a tag in the fourth td inside tbody" in the HTML.

Applied to this example, it looks like this:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Try extracting more

Let's extract the quote text and the author from http://quotes.toscrape.com/page/1/, the page we are scraping.

(Screenshot: https://qiita-image-store.s3.amazonaws.com/0/23276/c047e05f-dd6c-b813-9c2a-78bbacc49db1.png)

First, put the first div into a variable named quote

>>> quote = response.css("div.quote")[0]

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

The quote text was extracted successfully.

Next, the author

>>> autor = quote.css("small.author::text").extract_first()
>>> autor
'Albert Einstein'

It's ridiculously easy.

Try getting the list of tags

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

They come back properly as a list.

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

Try this in a spider instead of the shell

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Yield one dict per quote block found on the page
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

I rewrote the spider like this and ran it.

[vagrant@localhost tutorial]$ scrapy crawl quotes
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'BOT_NAME': 'tutorial', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True}

...omitted...

{'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'tags': ['abilities', 'choices']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Marilyn Monroe', 'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'tags': ['be-yourself', 'inspirational']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“Try not to become a man of success. Rather become a man of value.”', 'tags': ['adulthood', 'success', 'value']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'tags': ['life', 'love']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

...omitted...

There is a lot of output, but the data seems to be extracted.

Write the output to a file and look at it

[vagrant@localhost tutorial]$ scrapy crawl quotes -o result.json

Let's look at the result.

[vagrant@localhost tutorial]$ cat result.json
[
{"tags": ["change", "deep-thoughts", "thinking", "world"], "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},
{"tags": ["abilities", "choices"], "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling"},
{"tags": ["inspirational", "life", "live", "miracle", "miracles"], "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein"},
{"tags": ["aliteracy", "books", "classic", "humor"], "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"tags": ["be-yourself", "inspirational"], "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe"},
{"tags": ["adulthood", "success", "value"], "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein"},
{"tags": ["life", "love"], "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide"},
{"tags": ["edison", "failure", "inspirational", "paraphrased"], "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison"},
{"tags": ["misattributed-eleanor-roosevelt"], "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt"},
{"tags": ["humor", "obvious", "simile"], "text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe"},
{"tags": ["courage", "friends"], "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling"},
{"tags": ["simplicity", "understand"], "text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein"},
{"tags": ["love"], "text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley"},
{"tags": ["fantasy"], "text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss"},
{"tags": ["life", "navigation"], "text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams"},
{"tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"], "text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel"},
{"tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"], "text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche"},
{"tags": ["books", "contentment", "friends", "friendship", "life"], "text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain"},
{"tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"], "text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders"}
]

That looks about right! So easy, lol

4. Change the spider to follow links (I couldn't make sense of the English)

So far I have listed every URL to visit directly in start_urls. In practice, though, you usually want to follow particular links on a page and fetch the data you want recursively.

In that case, the approach is to grab the link's URL and issue a new request that calls your own parse again.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Follow the "next" link, if there is one, and parse that page too
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Something like this. If there is a next_page, it goes around again.

urljoin turns the relative href into an absolute URL based on the current page, so it can be crawled.

Let's crawl some more for fun

Each author on http://quotes.toscrape.com has a link to a detail page, and the tutorial shows how to follow it to get more information.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Get the links to the author detail pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

        # Get the pagination link
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        # Extract the first match for the given query and strip() it (a trim-like thing)
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Written this way, the spider does the following:

1. Follows each author link and runs parse_author (extracting the name, birthdate and description)
2. If there is pagination, runs parse again on the next page
3. Repeats this until there are no more pages

Being able to write something like this in just a few dozen lines...

5. Use spider arguments

I couldn't see what I would use this for, so I skipped it.

Summary

Create a project with Scrapy
Write what you want to do in a spider
Crawling by following links is possible too
Extraction is super easy

Note: non-ASCII text comes out escaped and unreadable

When I output to JSON with -o, non-ASCII characters come out as \uXXXX escapes and can't be read. This can be solved by adding the line FEED_EXPORT_ENCODING = 'utf-8' to "[project name]/settings.py".

Bonus

I made something that scrapes shogi player data.

What I did

Start from the Shogi Federation's player list
Follow the link to each detail page
Extract "name, date of birth, master"

The actual code looks like this (it's simple, lol)

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "kisi"
    start_urls = [
        'https://www.shogi.or.jp/player/',
    ]

    def parse(self, response):
        # Get the links to the shogi players' detail pages
        for href in response.css("p.ttl a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_kisi)

    def parse_kisi(self, response):
        # Extract the first match for the given XPath query and strip() it
        def extract_with_xpath(query):
            return response.xpath(query).extract_first().strip()

        yield {
            'name': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/div/div/h1/span[1]/text()'),
            'birth': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[2]/td/text()'),
            'sisho': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[4]/td/text()'),
        }

Result

[vagrant@localhost tutorial]$ head kisi.json
[
{"name": "Akira Watanabe", "birth": "April 23, 1984 (age 32)", "sisho": "Kazuharu Toshiji 7-dan"},
{"name": "Masahiko Urano", "birth": "March 14, 1964 (age 53)", "sisho": "(the late) Nakai Ryukichi 8-dan"},
{"name": "Masaki Izumi", "birth": "January 11, 1961 (age 56)", "sisho": "Sekine Shigeru 9-dan"},
{"name": "Koji Tosa", "birth": "March 30, 1955 (age 62)", "sisho": "(the late) Shizuo Kiyono 8-dan"},
{"name": "Hiroshi Kamiya", "birth": "April 21, 1961 (age 55)", "sisho": "(the late) Hisao Hirotsu 9-dan"},
{"name": "Kensuke Kitahama", "birth": "December 28, 1975 (age 41)", "sisho": "Masayu Saeki 9-dan"},
{"name": "Chikara Akutsu", "birth": "June 24, 1982 (age 34)", "sisho": "Seiichiro Taki 8-dan"},
{"name": "Takayuki Yamazaki", "birth": "February 14, 1981 (age 36)", "sisho": "Nobuo Mori 7-dan"},
{"name": "Akihito Hirose", "birth": "January 18, 1987 (age 30)", "sisho": "Katsuura Shu 9-dan"},

You can see that everyone was fetched correctly. It really is easy.

What I want to do in the future

Start from a specific page
Specify search conditions
Extract search results based on rules

I will write an article about it if I can. (Well, I don't really understand yield, I can't debug, and I need to study Python more.)
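
On the yield point: a function containing yield is a generator, which hands back one value at a time and pauses until the caller asks for the next one. The sketch below shows this in plain Python together with the "follow the next link" idea from step 4. It is only an illustration under stated assumptions: `FAKE_SITE`, `START_URL` and `crawl` are made-up names, the quotes are placeholders, and the function recurses directly, whereas a real Scrapy spider yields Request objects and lets the engine schedule them.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for a website: each URL maps to the quotes on that
# page and the relative link to the next page (None on the last page).
FAKE_SITE = {
    "http://quotes.toscrape.com/page/1/": {
        "quotes": ["quote A", "quote B"],
        "next": "/page/2/",
    },
    "http://quotes.toscrape.com/page/2/": {
        "quotes": ["quote C"],
        "next": None,
    },
}
START_URL = "http://quotes.toscrape.com/page/1/"


def crawl(url):
    """Yield every quote on this page, then continue with the next page."""
    page = FAKE_SITE[url]
    for quote in page["quotes"]:
        # yield hands one value to the caller and pauses here until the
        # caller asks for the next one.
        yield quote
    if page["next"] is not None:
        # Same idea as response.urljoin(): relative link -> absolute URL.
        next_url = urljoin(url, page["next"])
        yield from crawl(next_url)


for quote in crawl(START_URL):
    print(quote)
```

Because values are produced on demand, the caller can stop early (for example after the first quote) and the second page is never touched.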