I needed to extract the full text of my Evernote notes, so here is the method I used at the time. It seems you can do this with the Evernote API, but there is not much to go on, so it is a hassle. Instead, I will show how to export all notes in HTML format and scrape them with Beautiful Soup.

Output all Evernote notes in html format

First, select all notes with Command + A, then export them, choosing HTML as the output format. This time, save the export to your desktop as mynote.

The index.html inside mynote is a table of contents for all of the exported files and contains a link to each note's HTML file, so we will use it.

The procedure is as follows.

Extract the URL of each exported note from index.html.
Extract the text from the page at each URL.
Save it in SQLite.

That is all.

Scraping with Beautiful Soup

To begin with, scraping is the act of extracting specific information from a website. The files you exported earlier are not a website, but they are in HTML format, so you can scrape them. There are several Python modules for scraping, but this time I will use one called BeautifulSoup.

Install Beautiful Soup with pip.

$ pip install beautifulsoup4

Beautiful Soup is basically used as follows.

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://~ ~ ~")
soup = BeautifulSoup(html)
scrape = soup.find_all("a")

See the official documentation for details. http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Only soup.get_text(), soup.find_all("a"), and the get("href") method of each tag are used this time.
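To make those three calls concrete, here is a small sketch of what they give you. It deliberately uses only the standard-library html.parser of Python 3 instead of Beautiful Soup, so treat it as a stand-in for illustration and not as how Beautiful Soup works internally. The class name LinkTextParser and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects link targets, link text, and all text in the document.

    Roughly what find_all("a"), get("href"), and get_text() give you.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # one [href, text] pair per <a> tag
        self.text_parts = []   # every piece of text, in document order
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append([dict(attrs).get("href"), ""])
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_link:
            self.links[-1][1] += data


# Hypothetical sample, shaped like a tiny table of contents
sample = '<p>Index</p><a href="a.html">Note A</a><a href="b.html">Note B</a>'
parser = LinkTextParser()
parser.feed(sample)
parser.close()

links = [tuple(pair) for pair in parser.links]
full_text = "".join(parser.text_parts)
```

Here links holds one (href, title) pair per a tag, and full_text is all of the text with the tags removed.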

Save to database using SQLAlchemy

SQLAlchemy is an object-relational mapper (ORM), a convenient tool that lets you work with a database without writing SQL by hand. Let's install it with pip.

$ pip install sqlalchemy

Extract Evernote text

Now that everything is ready, let's scrape the notes.

First, create a function that takes the URL of a note and returns only its text.

def scrape_evernote(url):
    note_url = "file:///(Notebook directory)" + url.encode('utf-8')
    html = urllib2.urlopen(note_url)
    soup = BeautifulSoup(html)
    all_items = soup.get_text()

    return "".join(all_items)

The first three lines create a BeautifulSoup object. all_items = soup.get_text() gets the full text at the URL. get_text() already returns that text as a single string, so the "".join(all_items) in the return statement just reassembles it character by character and returns the same string.
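The function above builds the note URL by string concatenation. As an alternative sketch (Python 3, standard library only, unlike the Python 2 code in this article), the link can be resolved against the location of index.html. The names note_file_url and export_dir are hypothetical, and I am assuming the href values in index.html are links relative to index.html, which I have not checked for every Evernote version.

```python
from pathlib import Path
from urllib.parse import urljoin


def note_file_url(export_dir, href):
    """Resolve a link taken from index.html to a full file:// URL."""
    # as_uri() needs an absolute path and percent-encodes it for us
    index_url = (Path(export_dir).absolute() / "index.html").as_uri()
    # urljoin replaces "index.html" with the (assumed relative) href
    return urljoin(index_url, href)
```

For example, note_file_url("/Users/me/Desktop/mynote", "First%20note.html") would point at the note file next to index.html, where the path is only a placeholder for your own export directory.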

Save the extracted text in SQLite

Next, create a function to save the extracted text in SQLite.

import sqlalchemy
import sqlalchemy.ext.declarative
import sqlalchemy.orm

def scrape_and_save2sql():
    Base = sqlalchemy.ext.declarative.declarative_base()

    class Evernote(Base):
        __tablename__ = 'mynote'
        id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
        title = sqlalchemy.Column(sqlalchemy.String)
        note = sqlalchemy.Column(sqlalchemy.String)

    db_url = "sqlite+pysqlite:///evernote.sqlite3"
    engine = sqlalchemy.create_engine(db_url, echo=True)

    Base.metadata.create_all(engine)

    #Create a session
    Session = sqlalchemy.orm.sessionmaker(bind=engine)
    session = Session()

    #Get the url of all notes from index
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urllib2.urlopen(index_url)
    index_soup = BeautifulSoup(index_html)
    all_url = index_soup.find_all("a")

    for note_url in all_url:
        title = note_url.get_text()
        note = scrape_evernote(note_url.get("href"))
        evernote = Evernote(title=title, note=note)
        session.add(evernote)

    session.commit()

First, create Base. Then define a model for the notes.

class Evernote(Base):
    __tablename__ = 'mynote'
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    title = sqlalchemy.Column(sqlalchemy.String)
    note = sqlalchemy.Column(sqlalchemy.String)

This time, simply save the title and contents of the note.

Next, create the SQLite database file and a session.

After that, get the title and URL of each note from index.html. Each link to a note in index.html has the form

<a href="Note url">Note title</a>

so index_soup.find_all("a") retrieves all of the a tags. The tags come back as a list, so loop over them and take the link URL and the title from each a tag. Extract the text at that URL with the scrape_evernote() function created earlier. Finally, commit to save everything to SQLite.

This completes the extraction.
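To check what was actually saved, a quick look with the standard-library sqlite3 module is enough. This is only a sketch for Python 3: the function name list_saved_notes is made up, and it assumes the mynote table and the evernote.sqlite3 file created by the code above, with the database file in the current directory.

```python
import sqlite3


def list_saved_notes(db_path="evernote.sqlite3"):
    """Return (id, title, length of the note text) for every saved row."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT id, title, length(note) FROM mynote ORDER BY id"
        )
        return cur.fetchall()
    finally:
        conn.close()
```

If the list is empty or the lengths are all zero, the links in index.html were probably not resolved to the right files.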

To write the output to a text file instead of SQLite:

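import io

def scrape_and_save2txt():
    # Get the url of all notes from index
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urllib2.urlopen(index_url)
    index_soup = BeautifulSoup(index_html)
    all_url = index_soup.find_all("a")

    # io.open encodes the unicode text as UTF-8, so non-ASCII titles
    # and notes do not raise an encoding error; the with block closes the file
    with io.open('evernote_text.txt', 'w', encoding='utf-8') as out:
        for note_url in all_url:
            title = note_url.get_text()
            out.write(title + u"\n")  # newline added to separate entries
            note = scrape_evernote(note_url.get("href"))
            out.write(note + u"\n")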

The function above does that. Of course, you can also output in CSV format.
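As a rough sketch of the CSV variant (Python 3, standard library only), the function below writes pairs of title and note text to a file. The name save_notes_csv, the file name, and the header row are my own choices for illustration; the notes argument is assumed to be a list of (title, note) pairs that you have already scraped.

```python
import csv


def save_notes_csv(notes, path="evernote_text.csv"):
    """Write (title, note) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings, so line
    # breaks inside a note stay inside one quoted field
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "note"])
        writer.writerows(notes)
```

The csv module quotes fields that contain commas or line breaks, which matters here because note text usually contains both.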

Summary

As I wrote at the beginning, the general procedure is:

Export the notes you want to extract text from in HTML format from Evernote.
Extract the URL of each exported note from index.html.
Extract the text from the page at each URL.
Save it in SQLite.

This time I only dealt with text, but images are exported too: each note's images are saved in a folder with the same name as the note's title. If you make good use of this, you can also extract all the images in Evernote.
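As a starting point for the images, the sketch below (Python 3, standard library only) simply lists image files anywhere under the export directory. The function name find_exported_images and the set of file extensions are assumptions on my part; adjust them to whatever your notes actually contain.

```python
from pathlib import Path

# Assumed set of extensions; extend it if your notes use other formats
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def find_exported_images(export_dir):
    """Return image files found anywhere under export_dir, sorted by path."""
    root = Path(export_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Since each image sits in a folder named after its note, p.parent.name gives the title of the note an image belongs to.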