Notes on things I learned while doing machine learning and data mining in Python 3 with VS Code.
I know they're hard to read.
Updated from time to time.
Python3 + venv + VSCode + macOS development environment construction --Qiita
Jupyter-notebook drawing library comparison-Qiita
Use ipywidgets and Bokeh for interactive visualization-Qiita
When I try to use ipywidgets with VS Code's Jupyter extension, it can't load the required scripts and doesn't work: support for ipython/jupyter widgets · Issue #21 · DonJayamanne/vscodeJupyter. Just use Jupyter in the browser for now.
https://github.com/bokeh/bokeh/blob/master/examples/howto/notebook_comms/Jupyter%20Interactors.ipynb
"python.linting.pylintArgs": [
"--extension-pkg-whitelist=numpy"
]
No-member error in Pylint-Qiita
ValueError: n_samples=1 should be >= n_clusters=3
appears when running k-means. The input data needs to be two-dimensional; the blog below builds it up with append, which is inefficient, so it is better to do something like sample_data.iloc[:, 0:1].
This extracts the first column, the same data as sample_data.iloc[:, 0], but slicing with 0:1 keeps it two-dimensional (a DataFrame rather than a Series), so the error above does not occur.
Day 6 until understanding machine learning / clustering-IT captain's blog
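A minimal sketch of the difference (sample_data here is a hypothetical one-column DataFrame):

```python
import pandas as pd

# Hypothetical one-column DataFrame standing in for the blog's data
sample_data = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

s = sample_data.iloc[:, 0]    # Series, one-dimensional
d = sample_data.iloc[:, 0:1]  # DataFrame, two-dimensional

print(s.shape)  # (3,)
print(d.shape)  # (3, 1)
```

Because the 0:1 slice keeps the column axis, scikit-learn accepts it as the (n_samples, n_features) shape that k-means expects.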
With df = pd.DataFrame(), calling df.append(df2) does not add anything to df; append returns a new DataFrame, so it should be df = df.append(df2).
python - Appending to an empty data frame in Pandas? - Stack Overflow
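Note that DataFrame.append was removed in pandas 2.0; the same "reassign the result" pattern applies with pd.concat, its current replacement:

```python
import pandas as pd

df = pd.DataFrame()
df2 = pd.DataFrame({"a": [1, 2]})

# concat also returns a new DataFrame, so the result must be reassigned
df = pd.concat([df, df2], ignore_index=True)
print(len(df))  # 2
```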
However, Type Hints are effectively just annotations: pass an object of the wrong type and the linter won't complain, and nothing is type-checked until you actually run the code.
Typed world starting with Python-Qiita
Python class member scope summary-Qiita
Python pandas data iteration and function application, pipe --StatsFragments
List index (enumerate)-Learning site from Python introductory to application
Pandas: Converting to numeric, creating NaNs when necessary
Easy Python package management with pip related tools-Qiita
Use append when you simply want to stack frames vertically, and join when you want to combine them horizontally.
Python pandas data concatenation / join processing as seen in the figure --StatsFragments
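A small sketch of the two directions, using hypothetical frames a and b:

```python
import pandas as pd

a = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
b = pd.DataFrame({"key": [1, 2], "y": ["c", "d"]})

vertical = pd.concat([a, a], ignore_index=True)  # stack rows vertically
horizontal = a.merge(b, on="key")                # join horizontally on a key

print(vertical.shape)    # (4, 2)
print(horizontal.shape)  # (2, 3)
```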
pd.set_option("display.max_rows", 10)
Prevent pandas from omitting display-problems and solution notes at work.
[[Python] Sorting a multidimensional list](http://qiita.com/fantm21/items/6df776d99356ef6d14d4)
Summary of Python sort (list, dictionary type, Series, DataFrame) --Qiita
code-python-isort - Visual Studio Marketplace
ipython-sql: a Jupyter extension that lets you run %sql select * from hoge and drop the result straight into a DataFrame, etc.
I made a tool to convert Jupyter py to ipynb with VS Code --Qiita
tttt = pd.DataFrame()
tttt.append(None)
tttt = df[["label"]]
tttt.append(None)
This is because IntelliSense doesn't know the argument's type; if you declare the type after df[["label"]] with assert isinstance or the like, append will appear in IntelliSense.
How to write Python to get IntelliSense to work --Ajobuji Hoshi Tsushin
Python pandas accessor / Grouper with a little more advanced grouping / aggregation --StatsFragments
Time-series data can also be grouped this way, e.g. every second or every day.
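For example, a minimal sketch of per-minute aggregation with pd.Grouper (the 1-second time series here is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data sampled once per second over 2 minutes
idx = pd.date_range("2021-01-01", periods=120, freq="S")
ts = pd.DataFrame({"value": np.arange(120)}, index=idx)

# Aggregate every minute with pd.Grouper
per_min = ts.groupby(pd.Grouper(freq="1min")).sum()
print(per_min.shape)  # (2, 1)
```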
.replace ("hoge", "toHoge")
,You can also use regular expressions like .replace (". *", "+1", regex = True)
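A quick sketch on a hypothetical Series, showing that plain replace matches whole values while regex=True rewrites matching strings:

```python
import pandas as pd

s = pd.Series(["hoge", "hogehoge"])

# Plain replace only swaps values exactly equal to "hoge"
print(s.replace("hoge", "toHoge").tolist())  # ['toHoge', 'hogehoge']

# With regex=True the pattern is applied to each string
print(s.replace("^hoge.*", "+1", regex=True).tolist())  # ['+1', '+1']
```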
from sklearn.metrics import confusion_matrix

test_label_lb = ["A", "B", "C", "A"]  # correct labels (example data)
p_label = ["A", "C", "C", "A"]        # estimated labels (example data)
labels = ["A", "B", "C"]
cmx_data = confusion_matrix(y_true=test_label_lb, y_pred=p_label, labels=labels)
df_cmx = pd.DataFrame(cmx_data, index=labels, columns=labels)
import folium
m = folium.Map(location=[33.763, -84.392], zoom_start=17)
folium.Marker(
location=[33.763006, -84.392912],
popup='World of Coca-Cola'
).add_to(m)
m
How to use map / filter in Python3 --- A story that seems to go somewhere
Mastering the Python pandas plot function-StatsFragments
An iterator is consumed once its contents are pulled out with list() etc.
num_map = map(lambda n: n + 1, np.random.random(1000))
print(list(num_map))     # values come out here
num_filter = filter(lambda n: n > 0.5, np.random.random(1000))
print(list(num_filter))  # values come out here
print(list(num_map))     # empty now: the iterator is exhausted
print(list(num_filter))  # empty now: the iterator is exhausted
max(dic, key=lambda i: dic[i])
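Iterating a dict yields its keys, so max with a key function like this returns the key holding the largest value:

```python
dic = {"a": 1, "b": 3, "c": 2}

# max iterates the keys; the key function looks up each value
print(max(dic, key=lambda i: dic[i]))  # b
```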
If you're on Python 3.4 or later, you should drop os.path and use pathlib.
from pathlib import Path
LOG_DIR = "/Users/your_name/log"
Path(LOG_DIR).joinpath("log.json")  # or Path(LOG_DIR) / "log.json"
# -> PosixPath('/Users/your_name/log/log.json')
Path(LOG_DIR).joinpath("log.json").exists()
# False
How to do multi-core parallel processing with python
It's easy because you can simply hand it a range.
Python visualization tools may be standardized on HoloViews / Basic HoloViews graphs in one line
Show progress bar in Python (tqdm)
If you pass it an iterable, it shows how many iterations per second you're getting, which is a handy progress estimate.
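A minimal example: wrapping any iterable in tqdm() prints the progress bar along with the it/s rate:

```python
from tqdm import tqdm

total = 0
for i in tqdm(range(100)):  # progress bar with iterations/second
    total += i
print(total)  # 4950
```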
bbox_inches = "tight" or something like that
If you make the font big or make a landscape or portrait graph, the label may stick out with savefig, so if you do .savefig ("test.png ", bbox_inches = "tight")
, it will come out beautifully.
Jupyter Notebook > %timeit range(100) > measuring execution time > %%timeit > measuring execution time of multiple statements / Measuring code execution time with IPython
In Jupyter you can get the execution time of func with %time func(), but the result is quite noisy. With %timeit func(), it is executed several times and measured.
VS Code's Jupyter extension doesn't recognize %%timeit, so evaluating multiple lines there seems impossible (well, you can just wrap them in a function).
Is there a NaN in a pandas DataFrame?
df.isnull().values.any()
is easy to remember and fast, but performance depends on the dtypes, so try it on your own data.
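For example, on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

print(df.isnull().values.any())       # True: the frame contains a NaN
print(df["b"].isnull().values.any())  # False: column b has no NaN
```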
Three tips for maintaining Python pandas performance
Slow auto complete speed for custom modules python #903 Slow autocompletion/formatting #581
If you add the following to VSCode settings.json, it will be preloaded.
"python.autoComplete.preloadModules": [
"pandas",
"numpy",
"matplotlib"
]
As a result, suggestions such as pandas.DataFrame() feel faster, but it doesn't seem to change anything when type inference is required.
assert isinstance makes those cases faster too, but you can't add it everywhere...
df = func_something()
df.sum()  # "sum" is slow to appear in IntelliSense here
assert isinstance(df, pd.DataFrame)
df.sum()  # here "sum" appears immediately
When zombie processes appear while using multiprocessing in IPython:
# p = Pool()
p.terminate()
Either kill the pool explicitly like this, or use a with block:
with Pool() as p:
    results = p.map(func, range(0, 100))
To pick out the elements of list_ab whose prefix matches something in list_prefix (admittedly not a great example...):
list_ab = ["aa_a", "aa_b", "ab_a", "ab_b", "ba_a", "ba_b"]
list_prefix = ["aa", "ab"]
print(list(
filter(lambda a: True in map(lambda b: a.startswith(b), list_prefix),
list_ab)
)) # ['aa_a', 'aa_b', 'ab_a', 'ab_b']
With this, pylint flags `a` with E0602 (but it is only pylint complaining; the code runs and the result is as expected).
from itertools import compress
print(list(
compress(list_ab,
[True in [a.startswith(b) for b in list_prefix] for a in list_ab]
)
)) # ['aa_a', 'aa_b', 'ab_a', 'ab_b']
Using compress with a list comprehension like this is the nicer way to write it.
In summary:
- stop using multiprocessing
- use GC well
- convert to numpy arrays
- make them 32-bit
- do destructive assignment in a for loop (Cython if it's slow)
- compress the data (practicality is debatable?)
- physically add more memory
The compression doesn't bite much on this data, so the effect is weak, but the file does get smaller. Since it compresses, writing is naturally slower than pickle.
With compress=0 there is no compression, so the size is about the same as pickle's output, but joblib is easier because dump and load need no with open(...).
import os
import pickle
import joblib
import numpy as np
import pandas as pd
dump_data = np.random.randn(10000000)
with open("dump_data.pkl", "wb") as f:
pickle.dump(dump_data, f)
print(os.path.getsize("dump_data.pkl") / 1024 / 1024, "MB")
# 76.29409885406494 MB
joblib.dump(dump_data, "dump_data", compress=3)
print(os.path.getsize("dump_data") / 1024 / 1024, "MB")
# 73.5648946762085 MB
# joblib.load("dump_data") #Read
[Explanation of all Seaborn methods (Part 1: Graph list)](http://own-search-and-study.xyz/2017/05/02/ Explanation of all seaborn methods (Part 1: Graph list) /)
Data visualization with Python - let's draw a cool heat map
Beautiful graph drawing with python - seaborn makes data analysis and visualization easier Part 1
This seems to happen often with matplotlib and seaborn: pip-compile fails with an error such as egg_info.
In that case, pip-compile --rebuild
should get it working.
Reference: https://github.com/jazzband/pip-tools/issues/586
Summary of how to import files in Python 3
Creating an __init__.py and importing through it is probably the best approach?
Very convenient
If HTML(html_code) and init_notebook_mode() are executed together in the same cell, nothing is displayed. So first execute only HTML(html_code), then execute init_notebook_mode(), and it works (once it has displayed, running them together in the same cell is fine).
Probably because the JS is loaded asynchronously?