[PYTHON] Save the text of all Evernote notes to SQLite using Beautiful Soup and SQLAlchemy

I needed to extract the full text of all my Evernote notes, so I am publishing the method I used at the time. It seems this can be done with the Evernote API, but that looked like more work than the task was worth. Instead, I will show how to export all notes in HTML format and scrape them with Beautiful Soup.

Export all Evernote notes in HTML format

First, select all notes with Command + A, then export them. Select HTML as the output format. This time, save the export to your desktop as mynote.

The index.html in mynote is a table of contents for all the exported files. It contains a link to each note's HTML file, so we will use that.

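The procedure is: export all notes in HTML format, get the link to each note from index.html, extract the text of each note with Beautiful Soup, and save it to SQLite with SQLAlchemy.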

Scraping with Beautiful Soup

Scraping means extracting specific information from a website. The files exported earlier are not a website, but they are in HTML format, so they can be scraped the same way. There are several Python modules for scraping; this time I will use Beautiful Soup.

Install Beautiful Soup with pip.

$ pip install beautifulsoup4

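Beautiful Soup is basically used as follows.

# Python 3: urlopen lives in urllib.request (it was urllib2 in Python 2)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://~ ~ ~")
# Name the parser explicitly so Beautiful Soup does not have to guess
soup = BeautifulSoup(html, "html.parser")
scrape = soup.find_all("a")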

See the official documentation for details. http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Only soup.get_text(), soup.find_all("a"), and get("href") on each a tag are used this time.
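As a minimal sketch of those three calls, the snippet below parses a small inline HTML string instead of a real page. The HTML, the note titles, and the file names are made up for illustration and only mimic the kind of links the exported index.html contains.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for an exported index.html
html = """
<html><body>
<a href="note1.html">First note</a>
<a href="note2.html">Second note</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("a") returns every <a> tag in document order
links = soup.find_all("a")

# get_text() on a tag returns the text inside it (here, the note title)
titles = [a.get_text() for a in links]

# get("href") reads the href attribute of a tag
hrefs = [a.get("href") for a in links]

# get_text() on the whole soup returns all text in the document as one string
text = soup.get_text()

print(titles)
print(hrefs)
```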

Save to database using SQLAlchemy

SQLAlchemy is an object-relational mapper (ORM), a convenient library that lets you work with a database without writing SQL by hand. Install it with pip.

$ pip install sqlalchemy

Extract Evernote text

Now that everything is ready, let's scrape the notes.

First, create a function that takes the URL of a note and returns only its text.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_evernote(url):
    # url is the href taken from index.html
    note_url = "file:///(Notebook directory)" + url
    html = urlopen(note_url)
    soup = BeautifulSoup(html, "html.parser")
    # get_text() already returns the whole text as a single string
    return soup.get_text()

The first three lines of the function create a BeautifulSoup object. `soup.get_text()` then returns the full text of the page at that URL as a single string, so it can be returned as is without joining anything.
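The `file:///(Notebook directory)` placeholder has to be replaced with a real path. One way to build such a URL, sketched here with the standard library only, is `pathlib.Path.as_uri()`. The helper name `note_file_url`, the Desktop location, and the sample href are assumptions for illustration, and so is the idea that an href in index.html may already be percent-encoded.

```python
from pathlib import Path
from urllib.parse import unquote

def note_file_url(notebook_dir, href):
    """Build a file:// URL from the export folder and an href from index.html."""
    # unquote() undoes any percent-encoding already present in the href;
    # as_uri() then encodes the whole absolute path consistently.
    return (Path(notebook_dir) / unquote(href)).resolve().as_uri()

# Hypothetical export folder and note file name
url = note_file_url(Path.home() / "Desktop" / "mynote", "My%20first%20note.html")
print(url)
```

Because `as_uri()` percent-encodes spaces and non-ASCII characters, note titles containing them should still produce a valid URL.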

Save the extracted text in SQLite

Next, create a function to save the extracted text in SQLite.

import sqlalchemy
import sqlalchemy.orm
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_and_save2sql():
    # declarative_base() is available from sqlalchemy.orm in SQLAlchemy 1.4+
    Base = sqlalchemy.orm.declarative_base()

    class Evernote(Base):
        __tablename__ = 'mynote'
        id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
        title = sqlalchemy.Column(sqlalchemy.String)
        note = sqlalchemy.Column(sqlalchemy.String)

    # The database file is created in the current directory
    db_url = "sqlite+pysqlite:///evernote.sqlite3"
    engine = sqlalchemy.create_engine(db_url, echo=True)

    # Create the mynote table if it does not exist yet
    Base.metadata.create_all(engine)

    # Create a session
    Session = sqlalchemy.orm.sessionmaker(bind=engine)
    session = Session()

    # Get the url of all notes from index
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urlopen(index_url)
    index_soup = BeautifulSoup(index_html, "html.parser")
    all_url = index_soup.find_all("a")

    # One row per link: the link text is the title, the linked page is the body
    for note_url in all_url:
        title = note_url.get_text()
        note = scrape_evernote(note_url.get("href"))
        evernote = Evernote(title=title, note=note)
        session.add(evernote)

    session.commit()

First, create Base. Then create a model of the notebook.

class Evernote(Base):
    __tablename__ = 'mynote'
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    title = sqlalchemy.Column(sqlalchemy.String)
    note = sqlalchemy.Column(sqlalchemy.String)

This time, simply save the title and contents of the note.

Create a SQLite storage location and session.
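To see the model, engine, and session working together without touching any files, here is a small sketch that uses an in-memory SQLite database. It assumes SQLAlchemy 1.4 or later (for `sqlalchemy.orm.declarative_base`); the sample title and body are made up.

```python
import sqlalchemy
import sqlalchemy.orm

Base = sqlalchemy.orm.declarative_base()

class Evernote(Base):
    __tablename__ = "mynote"
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    title = sqlalchemy.Column(sqlalchemy.String)
    note = sqlalchemy.Column(sqlalchemy.String)

# "sqlite://" with no path means an in-memory database (nothing is written to disk)
engine = sqlalchemy.create_engine("sqlite://")
Base.metadata.create_all(engine)

Session = sqlalchemy.orm.sessionmaker(bind=engine)
session = Session()

# Add one made-up note and commit it
session.add(Evernote(title="Sample title", note="Sample body"))
session.commit()

# Read the rows back through the ORM
saved = [(n.id, n.title, n.note)
         for n in session.query(Evernote).order_by(Evernote.id)]
session.close()
print(saved)
```

Swapping the URL for `sqlite+pysqlite:///evernote.sqlite3` gives the file-backed database used in this article.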

After that, get the title and URL of each note from `index.html`. The links to each note in index.html are structured as

<a href="Note url">Note title</a>

so `index_soup.find_all("a")` retrieves all the a tags. The tags come back as a list, so loop over them and get the URL and title of the link destination from each a tag. Extract the text from that URL using the `scrape_evernote()` function created earlier. Finally, commit to save everything to SQLite.

This completes the extraction.
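To check what was saved, the table can be read back with the standard-library sqlite3 module. This is only a sketch: `preview_notes` is a hypothetical helper, and the in-memory database with two made-up rows stands in for the real evernote.sqlite3 file. It assumes the `mynote` table with the `id`, `title`, and `note` columns from the model.

```python
import sqlite3

def preview_notes(conn, limit=5):
    """Return (id, title, note length) for the first few saved notes."""
    cursor = conn.execute(
        "SELECT id, title, length(note) FROM mynote ORDER BY id LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()

# Stand-in for sqlite3.connect("evernote.sqlite3"): an in-memory database
# with the same table layout and two made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mynote (id INTEGER PRIMARY KEY, title VARCHAR, note VARCHAR)"
)
conn.executemany(
    "INSERT INTO mynote (title, note) VALUES (?, ?)",
    [("First note", "hello"), ("Second note", "hello again")],
)
rows = preview_notes(conn)
conn.close()
print(rows)
```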

If you want to output a text file instead of saving to SQLite, you can do it as follows.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_and_save2txt():
    # Get the url of all notes from index
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urlopen(index_url)
    index_soup = BeautifulSoup(index_html, "html.parser")
    all_url = index_soup.find_all("a")

    # The with block closes the file even if scraping a note fails
    with open('evernote_text.txt', 'w', encoding='utf-8') as f:
        for note_url in all_url:
            title = note_url.get_text()
            # Newlines keep each title and body from running together
            f.write(title + "\n")
            note = scrape_evernote(note_url.get("href"))
            f.write(note + "\n")

Of course, you can also output in CSV format.
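For CSV, a sketch along the same lines could use the standard-library csv module. `write_notes_csv` is a hypothetical helper that takes (title, note) pairs, and the sample notes are made up. In the real script the pairs would come from the loop over the a tags, and the file object from `open(..., "w", newline="", encoding="utf-8")`.

```python
import csv
import io

def write_notes_csv(notes, fileobj):
    """Write (title, note) pairs as CSV rows with a header line."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "note"])
    for title, note in notes:
        # csv.writer quotes fields that contain commas or line breaks
        writer.writerow([title, note])

# Demo with an in-memory buffer instead of a real file
buffer = io.StringIO(newline="")
write_notes_csv(
    [("First note", "line one\nline two"), ("Second, note", "text")],
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

Note bodies usually contain line breaks and commas, which is why a CSV writer is safer here than joining strings by hand.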

Summary

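As I wrote at the beginning, the general procedure is: export all notes in HTML format, get the link to each note from index.html, extract the text of each note with Beautiful Soup, and save it to SQLite with SQLAlchemy.

This time I extracted only the text, but the images in each note are also exported, into a folder with the same name as the note's title. If you make good use of this, you can extract all the images in Evernote as well.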
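As a rough sketch of that idea, the snippet below lists image files found in each subfolder of an export directory. `find_note_images`, the list of image suffixes, and the temporary folder layout are all assumptions for illustration; the actual folder names in your export may differ.

```python
from pathlib import Path
import tempfile

# Assumed set of image extensions; extend as needed
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}

def find_note_images(export_dir):
    """Map each subfolder name to the image file names it contains."""
    images = {}
    for folder in sorted(Path(export_dir).iterdir()):
        if not folder.is_dir():
            continue
        files = sorted(
            p.name for p in folder.iterdir()
            if p.suffix.lower() in IMAGE_SUFFIXES
        )
        if files:
            images[folder.name] = files
    return images

# Demo on a made-up export layout in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    note_dir = Path(tmp) / "Trip plan"
    note_dir.mkdir()
    (note_dir / "map.png").write_bytes(b"")
    (note_dir / "memo.txt").write_text("not an image")
    (Path(tmp) / "index.html").write_text("<html></html>")
    found = find_note_images(tmp)

print(found)
```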
