Organize data divided by folder with Python

In 3 lines

Get a list of folder names

Standard version (works with the standard library only)

import os
dir = [d for d in os.listdir(".") if os.path.isdir(d)]
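The one-liner above works because it lists the current directory, so every name is also a valid relative path. For a folder other than the current one, `os.path.isdir(d)` would test the wrong location. Here is a minimal sketch of a variant that takes the folder as an argument; the helper name `list_dirs` and the throwaway folder layout are assumptions made for illustration.

```python
import os
import tempfile

def list_dirs(root="."):
    """Return the sorted names of the folders directly under root."""
    # entry.is_dir() tests the full path, so this also works when root is not
    # the current directory (os.path.isdir(name) alone would not).
    with os.scandir(root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# Throwaway layout for demonstration only.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "case01"))
    os.mkdir(os.path.join(root, "case02"))
    open(os.path.join(root, "memo.txt"), "w").close()  # a file, not a folder
    dirs = list_dirs(root)
    print(dirs)
```

Sorting is optional, but it keeps the folder order stable between runs.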

A neater way (shell-style wildcards can be used)

Windows


import glob
dir = glob.glob(os.path.join("*",""))

Mac


dir = glob.glob("*/")

Wildcard usage example

Example of searching for folders case01, case02, ...

dir = glob.glob(os.path.join("case*",""))
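Note that `glob` patterns are shell-style wildcards (`*`, `?`, `[0-9]`), not full regular expressions. The sketch below shows how a character class narrows the match to "case" followed by two digits, and how to strip the trailing separator that glob leaves on folder matches. The helper name `find_case_dirs` and the folder names are hypothetical.

```python
import glob
import os
import tempfile

def find_case_dirs(root):
    """Return sorted folder names under root that look like case + two digits."""
    # The trailing "" makes the pattern end with a path separator,
    # so only folders match, never plain files.
    pattern = os.path.join(root, "case[0-9][0-9]", "")
    matches = glob.glob(pattern)
    # normpath drops the trailing separator; basename keeps only the folder name.
    return sorted(os.path.basename(os.path.normpath(m)) for m in matches)

# Throwaway layout for demonstration only.
with tempfile.TemporaryDirectory() as root:
    for name in ["case01", "case02", "caseA", "other"]:
        os.mkdir(os.path.join(root, name))
    open(os.path.join(root, "case03"), "w").close()  # a file named like a case
    found = find_case_dirs(root)
    print(found)
```

Here `caseA` is skipped because `A` is not a digit, and `case03` is skipped because it is a file.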

To get only text files (.txt):

dir = glob.glob("*.txt")

Execute the processing program for each folder

import shutil
import subprocess

for f in dir:
    
    # copy files from local folder to target folder
    cp_files=["Addup_win.py","y.input"]
    for fi in cp_files:
        shutil.copy(fi,f)
        
    # remove files at target folder    
    rm_files=['y.out','out.tsv']
    for fi in rm_files:
        if os.path.exists(os.path.join(f,fi)):
            os.remove(os.path.join(f,fi))
            
    # run the script inside the target folder (Popen does not wait for it to finish)
    subprocess.Popen(["python","Addup_win.py"],cwd=f)
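`subprocess.Popen` returns immediately, so the scripts run in parallel and may still be running when you try to read their output. If running the folders one at a time is acceptable (an assumption), `subprocess.run` waits for each script and gives you its return code. A minimal sketch; the helper `run_in_folder` and the tiny `job.py` script are hypothetical stand-ins for the real processing program.

```python
import os
import subprocess
import sys
import tempfile

def run_in_folder(folder, script):
    """Run script inside folder with the current interpreter and wait for it."""
    # sys.executable avoids depending on which "python" comes first on PATH.
    result = subprocess.run([sys.executable, script], cwd=folder)
    return result.returncode

# Throwaway folder and script for demonstration only.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "job.py"), "w") as fh:
        fh.write("open('out.tsv', 'w').write('a\\t1\\n')\n")
    code = run_in_folder(folder, "job.py")
    output_exists = os.path.exists(os.path.join(folder, "out.tsv"))
    print(code, output_exists)
```

A non-zero return code is a quick way to spot the folders where the program failed.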

Process text data organized by folder with pandas

The data is tab-separated (.tsv). The first column from the left is assumed to be the index and the second column the data.

Wrap the data reading in try: because the processing program above may have failed in some folders. Print the name of any folder where reading fails. It is convenient to derive the column label from the folder name, since it is used later.

import pandas as pd

dfs=pd.DataFrame()

for f in dir:
    # case01\\ => case01
    index_name = os.path.split(f)[0]
    
    # Error handle
    try:
        # Data structure {col.0 : index, col.1 : Data}
        df = pd.read_csv(os.path.join(f,"out.tsv"),sep='\t',header=None,index_col=0)
        dfs[index_name]=df.iloc[:,0]
    except Exception:
        # report the folder whose output could not be read
        print("Error in {0}".format(index_name))
        
# make index
dfs.index = df.index
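As a variation, each column can be collected in a dict and combined with `pd.concat`, so the result does not depend on the last `df` that happened to be read, and the failed folders are kept in a list. This is a sketch under assumptions: the helper name `collect`, the row label `L`, and the values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

def collect(folders, filename="out.tsv"):
    """Read filename from each folder; return (table, names of failed folders)."""
    columns, failed = {}, []
    for f in folders:
        # "case01/" or "case01\\" => "case01"
        name = os.path.basename(os.path.normpath(f))
        try:
            df = pd.read_csv(os.path.join(f, filename), sep="\t",
                             header=None, index_col=0)
            columns[name] = df.iloc[:, 0]
        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError):
            failed.append(name)
    # The dict keys become the column labels; rows are aligned on the index.
    table = pd.concat(columns, axis=1) if columns else pd.DataFrame()
    return table, failed

# Throwaway layout: case01 has an output file, case02 does not.
with tempfile.TemporaryDirectory() as root:
    folders = [os.path.join(root, "case01", ""), os.path.join(root, "case02", "")]
    for f in folders:
        os.mkdir(f)
    with open(os.path.join(folders[0], "out.tsv"), "w") as fh:
        fh.write("WSA/L2\t0.25\nL\t1.5\n")
    table, failed = collect(folders)
    print(table)
    print(failed)
```

Catching specific exceptions instead of everything keeps typos in your own code from being silently reported as data errors.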

Let's check the data. (I am not sure why a "0" row appears, but it does not matter because it disappears later.)

dfs.head()

Processing with pandas

First, transpose the table. It is easier to handle with the folders as rows and the quantities as columns.

dfsT = dfs.T

Next, drop the rows that contain missing data (NaN).

dfsT = dfsT.dropna()
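Keep in mind that `dropna()` with no arguments drops a row if any column is NaN. If only one column matters, the `subset` argument limits the check to it. A small sketch with made-up values; the frame `dfsT_demo` and the column `other` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; rows are folders, columns are quantities.
dfsT_demo = pd.DataFrame(
    {"WSA/L2": [0.25, np.nan, 0.5], "other": [1.0, 2.0, np.nan]},
    index=["case01", "case02", "case03"],
)

# Drops every row that has a NaN in any column.
all_complete = dfsT_demo.dropna()
# Drops only the rows where WSA/L2 itself is missing.
has_wsa = dfsT_demo.dropna(subset=["WSA/L2"])

print(list(all_complete.index))
print(list(has_wsa.index))
```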

From here, process the data as needed.

For example, use boolean indexing to select rows by a condition. (Here, the rows whose WSA/L2 column is greater than 0.2 are selected.)

dfsT_select = dfsT[dfsT["WSA/L2"] > 0.2]

Visualization with matplotlib

import matplotlib.pyplot as plt

plt.bar(range(len(dfsT)), dfsT["WSA/L2"],
        tick_label=dfsT.index)
plt.show()

Adjusting the horizontal axis labels

fig, ax = plt.subplots()
ax.bar(range(len(dfsT)), dfsT["WSA/L2"],
       tick_label=dfsT.index)
# rotate the tick labels so long folder names do not overlap
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=10)
plt.show()

Export for Excel (output)

Many people ask for the data in Excel format, so here is how to export it.

dfs.to_excel("addup.xlsx")

If a text format is acceptable, you can write a tab-separated file instead:

dfs.to_csv("addup.tsv",sep='\t')
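To read such a file back later, pass the same separator and tell pandas that the first column is the index. A quick round-trip sketch with made-up values; the frame `demo` and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({"case01": [0.25, 1.5]}, index=["WSA/L2", "L"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "addup.tsv")
    demo.to_csv(path, sep="\t")
    # index_col=0 restores the row labels instead of treating them as data.
    restored = pd.read_csv(path, sep="\t", index_col=0)

print(restored)
```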
