Behavior and countermeasures when line-oriented processing in C might end up reading binary data

For example, fgetws(), the function intended for line-oriented processing, reads from a text stream until it hits a newline or fills the buffer, but it also passes an L'\0' (the string terminator) straight through without any error. So, unless either "the buffer was filled with (requested length - 1) characters, none of which is L'\0'" or "the string ends with L'\n' followed by L'\0'", you cannot tell how many characters the function really read.
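To see the problem concretely, here is a minimal sketch: if the current line happens to contain an embedded L'\0', fgetws() copies it into the buffer without complaint, but wcslen() on the result stops at that L'\0', so the caller has no idea how much was actually read.

#include <stdio.h>
#include <wchar.h>

int main(void) {
	wchar_t buf[16];
	// Suppose the current line is L"ab\0cd\n": fgetws() stores all six
	// characters, but wcslen() only sees the two before the embedded L'\0'.
	if (fgetws(buf, 16, stdin)) {
		wprintf(L"wcslen() reports %zu characters\n", wcslen(buf));
	}
	return 0;
}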

In other words, we want a way to keep reading the string as part of normal text processing while still recognizing, when an L'\0' shows up, that we are dealing with binary data.

Let's build a binary-tolerant fgetws()

So, as an improved take on fgetws(), let's consider how to implement an fgetws2() with the following specification: it returns the number of characters it read (and (size_t)-1 on failure) instead of a pointer to the buffer.

Because it returns the number of characters, the caller can tell that the string (byte sequence) continues even when an L'\0' sits in the middle. Horrible files are scattered all over the world, and there are people who casually feed them to programs just to trigger bugs. Let's protect ourselves from such threats.

Read character by character with fgetwc()

Well, the approach anyone would think of first. The safest of the safe.

#include <stdio.h>
#include <wchar.h>

size_t fgetws2(wchar_t *buf, size_t len, FILE *fp) {
	wint_t c = fgetwc(fp);
	if (c == WEOF) {
		return (size_t)-1;
	}
	if (len <= 1) {
		// No buffer to write into? Push the character back.
		ungetwc(c, fp);
		if (len) {
			*buf = L'\0';
		}
		return 0;
	}
	*buf = c;
	size_t rtn = 1;
	len--;
	// Stop if the character just stored was a newline (including the very
	// first one), on end of file, or when the buffer is full.
	while (c != L'\n' && rtn < len) {
		c = fgetwc(fp);
		if (c == WEOF) {
			break;
		}
		buf[rtn++] = c;
	}
	buf[rtn] = L'\0';
	return rtn;
}

Even though stdio buffers underneath, doing read operations one character at a time feels like a heavy process, and wanting to avoid it is simply a programmer's nature.

Pre-fill the buffer with something other than L'\0', then fgetws()

Since fgetws() appends an L'\0' after whatever it reads, we can pre-fill the buffer with something that isn't L'\0' and then search for the L'\0' from the back. By the way, naming these is getting troublesome, so this one is simply called fgetws3().

#include <string.h>	// for memset()

size_t fgetws3(wchar_t *buf, size_t len, FILE *fp) {
	if (len <= 1) {
		// Why call this with no buffer to write into?
		wint_t c = fgetwc(fp);
		if (c == WEOF) {
			return (size_t)-1;
		}
		ungetwc(c, fp);
		if (len) {
			*buf = L'\0';
		}
		return 0;
	}
	// Pre-fill with a non-zero value; after fgetws() the last L'\0' in the
	// buffer is the terminator it appended.
	memset(buf, 1, len * sizeof(*buf));
	if (!fgetws(buf, len, fp)) {
		return (size_t)-1;
	}
	// Scan backwards to that terminator; its index is the character count.
	while (buf[--len]) {}
	return len;
}

For input speed this should be much better than reading one character at a time. However, initializing the whole buffer with memset() and then sweeping it for the L'\0' becomes wasteful when the number of characters actually read is much smaller than the buffer length. The backward scan is essentially just a wcslen(), but still ...

fwscanf()

And then came an unexpected ambush: fwscanf().

fwscanf(), or rather scanf(), is the one from beginners' textbooks: "you can read numbers and strings from the keyboard(?) in one go with %d %s — amazing!". It's standard, yet beginners never quite get why an int variable needs an & while a char array does not, and once you are past that stage you conclude it's better to read the whole string in and parse it yourself, or that it's too hard to use without overflowing the buffer, so in the end it doesn't get used at all!? I don't normally use it either.

However, it has one conversion specifier that turns out to be surprisingly useful for reading files that may contain binary data: %[^...].

For the precise behaviour of %[^...] please read the specification, but roughly it can be used like %s except that it restricts which characters are accepted. You can also cap the field width, as in %20[^...], so a call like fwscanf(fp, L"%20l[^\n]%zn", buf, &len) means "read up to 20 wide characters that are not newlines, and store the number of characters consumed so far into len" — exactly the shape we want for handling possibly-binary input. Excellent!

It seems worth wrapping this in a function. Something like the following works.

size_t fgetws4(wchar_t *buf, size_t len, FILE *fp) {
	if (len <= 1) {
		// No buffer to write into? Push the character back.
		wint_t c = fgetwc(fp);
		if (c == WEOF) {
			return (size_t)-1;
		}
		ungetwc(c, fp);
		if (len) {
			*buf = L'\0';
		}
		return 0;
	}

	// Build the format string before reading the file.
	// fwscanf() should read up to len - 2 characters: one slot is reserved
	// for the character fetched with fgetwc() afterwards, one for the L'\0'.
	// For a 512-character buffer we want fwscanf() to read 510 characters,
	// i.e. the string L"%510l[^\n]%zn". The pile-up of % signs is confusing.
	wchar_t fmt[32];
	swprintf(fmt, 32, L"%%%zul[^\n]%%zn", len - 2);

	// readlen must be initialized to 0: if no conversion takes place,
	// i.e. the very first character is a newline and no string is read,
	// the argument is left untouched (and ret ends up as 0).
	size_t readlen = 0;
	int ret = fwscanf(fp, fmt, buf, &readlen);
	// The two lines above assume C99; with C11's fwscanf_s() it can
	// apparently be done even more comfortably.

	if (ret == EOF) {
		return (size_t)-1;
	}
	wint_t c = fgetwc(fp);
	if (!ret && c == WEOF) {
		return (size_t)-1;
	}
	if (c != WEOF) {
		buf[readlen++] = c;
	}
	buf[readlen] = L'\0';
	return readlen;
}

So now there are three alternatives to fgetws(). If an application built on them is released into the wild and handed untrustworthy data, it can at least tell that the data is binary. Good.
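As a usage sketch (buf, its length LEN, and fp are assumed to be set up already): because each function returns the number of characters actually stored, the caller can compare that count with wcslen() to spot an embedded L'\0'.

size_t n = fgetws2(buf, LEN, fp);	// LEN is a hypothetical buffer length
if (n != (size_t)-1 && n != wcslen(buf)) {
	// The counts disagree, so the "line" contains an embedded L'\0':
	// treat the input as binary data rather than text.
}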

Performance

Now that there are three implementations, let's look at how their performance differs. A text file is variable-length data, and in normal use the prepared buffer is rarely filled completely. So I wrote a test that measures the difference in the following four situations.

Here the long buffer is (1 << 17) * sizeof(wchar_t) bytes and the short buffer is (1 << 10) * sizeof(wchar_t) bytes. To feed short lines, I used the following command:

yes 01234567890123456789012345678901234567890123456789012345678901234567890123456789

To keep the buffer filled completely every time, I used the following command:

tr \\0 @ </dev/zero

Neither input actually contains L'\0', but that is not what is being measured here, so never mind. … Pipe these into a binary that just keeps calling fgetws2() and friends on standard input, and roughly measure the elapsed time with time. It was interesting how large the differences between the four situations turned out to be.
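A driver for this can be as simple as the following sketch (the buffer length and repeat count are adjusted per test case, and the fgetws* functions above are assumed to be compiled in):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

size_t fgetws2(wchar_t *buf, size_t len, FILE *fp);	// defined above

#define BUFLEN (1UL << 17)	// or (1UL << 10) for the short-buffer runs
#define REPEAT 10000000UL	// adjusted per test case

int main(void) {
	setlocale(LC_ALL, "");
	wchar_t *buf = malloc(BUFLEN * sizeof(*buf));
	if (!buf) {
		return 1;
	}
	for (unsigned long i = 0; i < REPEAT; i++) {
		fgetws2(buf, BUFLEN, stdin);	// swap in fgetws3()/fgetws4() to compare
	}
	free(buf);
	return 0;
}

Piping the yes or tr command above into this binary and prefixing the pipeline with time gives numbers like the ones below.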

The following tests were done on Debian Buster amd64.

A long buffer is prepared, but only short lines are written into it

fgetws*(buf, 1 << 17, stdin); against the output of yes ..., 10,000,000 times

        fgetws2    fgetws3      fgetws4
real    7.865s     (aborted)    7.644s
user    7.548s     (aborted)    7.364s
sys     1.185s     (aborted)    1.139s

I aborted fgetws3() because it took far too long in this test. Sweeping over 2^17 characters on every call really hurts. When I retried with a much smaller repetition count, it came out roughly 250 times slower than fgetws2(). Brutal.

A long buffer is prepared and always filled completely

fgetws*(buf, 1 << 17, stdin); against the output of tr ..., 10,000 times

        fgetws2    fgetws3    fgetws4
real    12.749s    3.133s     8.231s
user    14.276s    4.085s     9.621s
sys     1.506s     1.202s     1.531s

A short buffer is prepared, but only short lines are written into it

fgetws*(buf, 1 << 10, stdin); against the output of yes ..., 5,000,000 times

        fgetws2    fgetws3    fgetws4
real    3.986s     9.754s     3.819s
user    3.850s     9.545s     3.639s
sys     0.563s     0.750s     0.621s

A short buffer is prepared and always filled completely

fgetws*(buf, 1 << 10, stdin); against the output of tr ..., 500,000 times

        fgetws2    fgetws3    fgetws4
real    4.988s     1.275s     3.453s
user    5.592s     1.685s     4.022s
sys     0.590s     0.449s     0.575s

The measurements confirmed that fgetws3(), which already gave me a bad feeling while I was implementing it, is a bomb. In practice a buffer is usually allocated somewhat longer than the expected data as insurance, which makes it even worse. Also, when reading data continuously, looping on fgetwc() takes roughly 1.5 times as long as fwscanf().

Digression: getline()

I'm not sure it answers the demand for line-oriented processing of data that may contain binary, but glibc has a function called getline(): if a line is longer than the buffer prepared in advance, the library grows the buffer for you, and it reports both the length of the line and the new size of the buffer. It is now part of the POSIX standard as well.
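Using it looks roughly like this (a minimal sketch of the POSIX/glibc interface): the return value is the byte count of the line including the newline, so embedded '\0' bytes cannot hide any data.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void) {
	char *line = NULL;	// getline() allocates and grows this for us
	size_t cap = 0;		// current buffer size, updated by getline()
	ssize_t n;
	while ((n = getline(&line, &cap, stdin)) != -1) {
		printf("read %zd bytes (buffer is now %zu bytes)\n", n, cap);
	}
	free(line);
	return 0;
}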

However, is a function that grows memory on its own something you can use in software released to the world? If you accept arbitrary files with it, sooner or later someone will feed it /dev/zero, with or without malice, and then a DoS that exhausts memory succeeds and the system dies, right? I'm not comfortable with that kind of function ...

On top of that I want wide characters, and since there is no wide-character counterpart to getline() I would have to run mbsnrtowcs() myself; but the conversion stops every time it hits a '\0', and shuttling the converted characters into the buffer piece by piece is a hassle!!!!

Summary

C should provide a more decent API.
