Extract strings from files in Python

Introduction

What this article covers: sample code for the following tasks.

- Create a list of files under a specified directory
- Check whether the text in a file contains a particular string
- Extract the text enclosed between two specific strings in a file

Development environment

- Python 2.7 or later

Create a list of files under the specified directory

code

import os

def generate_file_list(dirpath_to_search):
    # Walk the directory tree and collect the path of every file found
    file_list = []
    for dirpath, dirnames, filenames in os.walk(dirpath_to_search):
        for filename in filenames:
            file_list.append(os.path.join(dirpath, filename))

    return file_list

how to use

The following example recursively collects the paths of all files under sample1, given the directory structure below.

Sample directory structure


sample1/
├── dir01
│   ├── dir11
│   │   └── file21.txt
│   └── file11.txt
├── file01.txt
└── file02.txt

how to use


file_list = generate_file_list('sample1')
for file in file_list:
    print(file)

# Output (the order depends on the file system)
# sample1/file01.txt
# sample1/file02.txt
# sample1/dir01/file11.txt
# sample1/dir01/dir11/file21.txt

API used

os.walk(top, topdown=True, onerror=None, followlinks=False)

Generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
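As a small illustration of the tuples os.walk yields, the sketch below lists only the files whose names end with a given suffix. The helper name iter_files_with_suffix and its suffix parameter are hypothetical and do not appear in the original sample code.

```python
import os


def iter_files_with_suffix(dirpath_to_search, suffix):
    """Yield paths of files under dirpath_to_search whose names end with suffix."""
    for dirpath, dirnames, filenames in os.walk(dirpath_to_search):
        # Sorting dirnames in place makes the traversal order deterministic;
        # os.walk honors in-place changes to dirnames when topdown=True.
        dirnames.sort()
        for filename in sorted(filenames):
            if filename.endswith(suffix):
                yield os.path.join(dirpath, filename)
```

For example, list(iter_files_with_suffix('sample1', '.txt')) would return the four .txt paths of the sample directory above, with the files of each directory listed before those of its subdirectories.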

Find out if the text in the file contains a particular string

code

def contain_text_in_file(filepath, text):
    with open(filepath) as f:
        return any(text in line for line in f)

how to use

The following example uses two files, contain.txt and not_contain.txt, shown below, and checks which of them contains the string "2020/02/02".

contain.txt


Update date: 2020/02/02
This article is about python file operations.

not_contain.txt


Update date: 2019/10/15
This article is about python file operations.

how to use


filepath1 = './contain.txt'
text = '2020/02/02'
result1 = contain_text_in_file(filepath1, text)
print(result1) # True

filepath2 = './not_contain.txt'
text = '2020/02/02'
result2 = contain_text_in_file(filepath2, text)
print(result2) # False
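The directory walk and the containment check can also be combined to find every file under a directory that contains a string. This is only a sketch: the function name find_files_containing is hypothetical, and it assumes the files are UTF-8 text (undecodable bytes are replaced instead of raising an error).

```python
import io
import os


def find_files_containing(dirpath_to_search, text):
    """Return paths of files under dirpath_to_search that contain text."""
    matched = []
    for dirpath, _dirnames, filenames in os.walk(dirpath_to_search):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            # Assumption: files are UTF-8 text. errors='replace' substitutes
            # undecodable bytes rather than raising UnicodeDecodeError.
            with io.open(filepath, encoding='utf-8', errors='replace') as f:
                # any() stops reading at the first line that contains text
                if any(text in line for line in f):
                    matched.append(filepath)
    return matched
```

With contain.txt and not_contain.txt in the current directory, find_files_containing('.', '2020/02/02') would include only the path of contain.txt.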

API used

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Opens file and returns the corresponding file object.

any(iterable)

Returns True if any element of iterable is true. Returns False if iterable is empty. Evaluation stops at the first true element.
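The behavior of any() can be written as a plain loop. The function below is named any_equivalent only to avoid shadowing the built-in; that name is hypothetical.

```python
def any_equivalent(iterable):
    """Pure-Python equivalent of the built-in any()."""
    for element in iterable:
        if element:
            # Stop at the first true element; the rest is not consumed
            return True
    return False
```

Because the loop returns at the first true element, contain_text_in_file stops reading a file as soon as a matching line is found.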

Extract the text in the range enclosed by a specific string from the text in the file

code

import re

def extract_text_in_file(filepath, pattern_prev, pattern_next):
    extracted_text_array = []
    # pattern_prev and pattern_next are used as regular expressions;
    # '(.*)' is greedy, so it captures up to the last pattern_next on a line.
    pattern = pattern_prev + '(.*)' + pattern_next
    with open(filepath) as f:
        # Search line by line, so a match never spans multiple lines
        for line in f:
            extracted_text_array.extend(re.findall(pattern, line))

    return extracted_text_array

how to use

The following example uses a file called file.txt, shown below, and extracts the date enclosed between "Update date:" and " by".

file.txt


Update date:2020/02/01 by taro
This article is about python file operations.

Update date:2020/02/02 by jiro
This article is about python file operations.

how to use


filepath = './file.txt'
pattern_prev = 'Update date:'
pattern_next = ' by'
extracted_text_array = extract_text_in_file(filepath, pattern_prev, pattern_next)

for extracted_text in extracted_text_array:
    print(extracted_text)

#output
# 2020/02/01
# 2020/02/02
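One point to keep in mind is that the '(.*)' group is greedy: if the closing string appears more than once on a line, the match extends to its last occurrence. The sketch below compares it with a non-greedy '(.*?)' group on a made-up input line (the line is an assumption for illustration, not taken from file.txt).

```python
import re

# Hypothetical line in which ' by' appears twice
line = 'Update date:2020/02/01 by taro, reviewed by jiro'

# Greedy: captures up to the last ' by' on the line
greedy = re.findall('Update date:' + '(.*)' + ' by', line)

# Non-greedy: captures up to the first ' by' after the opening string
non_greedy = re.findall('Update date:' + '(.*?)' + ' by', line)

print(greedy)      # ['2020/02/01 by taro, reviewed']
print(non_greedy)  # ['2020/02/01']
```

If the enclosing strings may contain regular expression metacharacters such as '.' or '(', wrapping them in re.escape() before building the pattern is one way to have them matched literally.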

API used

re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of pattern in string as a list of strings. The string is scanned from left to right, and matches are returned in the order they are found. If the pattern contains one group, the result is a list of the text captured by that group; if it contains more than one group, the result is a list of tuples. Empty matches are included in the result.
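The effect of groups on the return value of re.findall can be checked with a short experiment. The patterns below are illustrative assumptions applied to the first line of file.txt.

```python
import re

line = 'Update date:2020/02/01 by taro'

# No group: each item is the whole matched string
no_group = re.findall(r'\d{4}/\d{2}/\d{2}', line)

# One group: each item is the text captured by that group
one_group = re.findall(r'Update date:(.*) by', line)

# Two groups: each item is a tuple with one element per group
two_groups = re.findall(r'Update date:(.*) by (\w+)', line)

print(no_group)    # ['2020/02/01']
print(one_group)   # ['2020/02/01']
print(two_groups)  # [('2020/02/01', 'taro')]
```

This is why extract_text_in_file returns only the date part: its pattern has exactly one group, so the surrounding strings are matched but not returned.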

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Opens file and returns the corresponding file object.
