What tripped me up when dealing with huge files in a 32-bit Linux environment

I may get stuck on this again someday, so here is a memo.

What I got stuck on

I still develop in a 32-bit Linux environment, and one system there tripped me up. The system manages files with a C program. When I gave it a large file, several gigabytes in size, it didn't work at all.

Because the file was huge, each attempt to reproduce the problem took a long time. The cause, which I finally found after a frustrating debugging session, was very simple.

**The file is too big to fopen!**

On 32-bit Linux the limit seems to be about 2 GB. (Maybe that is only my environment.) Of course there has to be an upper limit somewhere, but I never expected to actually run into it... On top of that, this system includes a C application whose job is to split files, and it couldn't even open the file. Can't believe it...
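
The figure of about 2 GB is what a signed 32-bit file offset would give: on 32-bit Linux, glibc's off_t is 32 bits wide by default, so the largest offset it can hold is 2^31 - 1 bytes. That is presumably the limit I hit. A minimal sketch of the arithmetic (the helper name max_signed_offset is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte offset a signed offset type of the given width can hold.
 * Hypothetical helper, only here to illustrate the arithmetic. */
int64_t max_signed_offset(int bits)
{
    assert(bits >= 2 && bits <= 64);
    if (bits == 64)
        return INT64_MAX;              /* avoid shifting into the sign bit */
    return ((int64_t)1 << (bits - 1)) - 1;
}
```

For a width of 32 this gives 2147483647 bytes, one byte short of 2 GiB, so a file of several GB cannot be addressed with such an offset.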

Solution

2018/04/23 postscript: A comment from @angel_p_57 taught me that there is a compile option called **_FILE_OFFSET_BITS**.

Following the man page and some OSS code I had on hand, I added the following to CFLAGS, and now fopen works on the huge file as is!

-D_FILE_OFFSET_BITS=64

It's wonderful that a single option can change the behavior. Well done, glibc!
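
As a rough sketch of what the option changes: with _FILE_OFFSET_BITS set to 64, off_t becomes a 64-bit type, and fseeko/ftello can take and return offsets beyond 2 GB. The macro can be passed in CFLAGS as above or defined before the first #include, as here. The names offset_bits and read_at are my own for this illustration, not part of glibc.

```c
#define _FILE_OFFSET_BITS 64   /* same effect as -D_FILE_OFFSET_BITS=64; must precede all includes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Width of off_t in bits; 64 when large file support is active. */
int offset_bits(void)
{
    return (int)(sizeof(off_t) * 8);
}

/* Read up to len bytes starting at the given byte offset.
 * Returns the number of bytes read, or 0 if the file cannot be
 * opened or the seek fails. */
size_t read_at(const char *path, off_t offset, void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t got = 0;
    if (fseeko(fp, offset, SEEK_SET) == 0)   /* fseeko takes off_t, unlike fseek's long */
        got = fread(buf, 1, len, fp);
    fclose(fp);
    return got;
}
```

The point of using fseeko rather than fseek is that fseek takes a long, which stays 32 bits wide on a 32-bit system regardless of this option.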

Extra: My own solution, from when I underestimated glibc

After looking into various ways to split files, I arrived at the **split command**. It is the Linux command for splitting files, and it was the only thing that handled the huge file without trouble.

So how should I deal with the problem? The best approach would be to study how split is implemented, but I didn't have time, so: **let's make an fopen wrapper that uses the split command!**

Workaround

Call a wrapper open function instead of fopen. If the file is large, split it with /usr/bin/split and save the pieces in a tmp directory. Then, in the fread wrapper, when a read reaches the end of the current piece, open the next piece and continue reading from there.

//wrapper for fopen
void * large_freader_open(const char *path, unsigned long maxsize) {
//....
        //check the size; if it is too large, split the file with separate_file
        unsigned long fsize = get_size(path);
        if(fsize<=maxsize) {
                //same as fopen
                handle->fp=fopen(path, "r");
                handle->max_index=1;
        } else {
                //split the file and open the pieces in order
                separate_file(path, maxsize, handle);
                handle->fp = freader_fopen(handle);
        }
//..
        //return the internal struct instead of FILE *
        return handle;
//..
}

static void separate_file(const char *path, unsigned long maxsize,  struct large_freader_s *handle) {
//...
        //get the tmp directory and split the file into it
        char name[FNAME_MAX];
        get_current_dirname(handle, name, FNAME_MAX);
        snprintf(cmd, sizeof(cmd), "/usr/bin/split -d --suffix-length=6 -b %lu %s %s", maxsize, path, name);
//...
}

//wrapper for fread
size_t large_freader_read(void * prt, size_t size,void * stream) {

//...
        //do a normal fread first
        size_t ret = fread(prt, 1, size, handle->fp);
        if(ret == size) {
                //read succeeded, return normally
                return ret;
        }

        //fewer bytes than requested were read, so move to the next file
        freader_fclose(handle);
        //move to next
        handle->cur_index++;

        //if all files have been read, we are done
        if(IS_LAST_FILE(handle)) {
                //finished reading
                return ret;
        }

        //otherwise open the next file and read the rest
        handle->fp = freader_fopen(handle);
        if(handle->fp) {
                ret += fread(((char *)prt)+ret, 1, size-ret, handle->fp);
        }

        return ret;
}

//internal fopen handling
static FILE * freader_fopen(struct large_freader_s *handle) {
        //get the name of the current split file and fopen it; the file is deleted on close
        char dname[FNAME_MAX];
        get_current_fname(handle, dname, FNAME_MAX);
        return fopen(dname, "r");
}
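
The listing above leaves out how the split pieces are named. With `-d --suffix-length=6`, split appends a zero-padded six-digit decimal suffix to the given prefix (000000, 000001, ...), and `-b` gives every piece except possibly the last exactly maxsize bytes. A minimal sketch under those assumptions; chunk_count and chunk_name are hypothetical helpers, not the functions used in the library:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of pieces that splitting fsize bytes into maxsize-byte
 * pieces yields (ceiling division, written to avoid overflow). */
unsigned long long chunk_count(unsigned long long fsize, unsigned long long maxsize)
{
    assert(maxsize > 0);
    return fsize / maxsize + (fsize % maxsize != 0 ? 1 : 0);
}

/* Build the name of the index-th piece: prefix plus a zero-padded
 * six-digit decimal suffix. Returns 0 on success, -1 if the index
 * does not fit in six digits or the buffer is too small. */
int chunk_name(const char *prefix, unsigned long index, char *out, size_t outlen)
{
    if (index > 999999UL || strlen(prefix) + 6 + 1 > outlen)
        return -1;
    snprintf(out, outlen, "%s%06lu", prefix, index);
    return 0;
}
```

For example, a 5 GiB file split into 1 GiB pieces gives five names, from prefix000000 to prefix000004, which a reader can open in index order.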

I wrote it overnight at home and published it on GitHub. I also wanted to split files during write, so it is a read/write wrapper library. Now that I know about -D_FILE_OFFSET_BITS=64, only the write side is still useful...

https://github.com/developer-kikikaikai/read_write_wrapper

My home PC is 64-bit, so I haven't been able to test it thoroughly on 32-bit. I will fix any problems I notice when I use it in another environment.

Reflections

2018/04/23 postscript
・If something unexpected happens, first check whether a compile option addresses it!

Otherwise, you end up relying on a clumsy solution with problems like these:
・Since fopen cannot be used, having to do popen → ls just to get the size is very unpleasant.
・I'm worried about guaranteeing operation in a 32-bit environment.
・This is about the limit of a 32-bit system, so why push so hard? etc.
