For example, fgetws(), which is intended for line-oriented processing, reads from a text stream until it encounters a newline or fills the buffer, but it also passes an embedded L'\0' through without any error, even though L'\0' is what terminates the resulting string. So, except for the cases where "the buffer was filled with exactly (requested length - 1) characters, none of them L'\0'" or "the string ends in L'\n' followed by the terminating L'\0'", there is no way to know how many characters the function actually read.

So let us consider ways to keep treating the data as text even after hitting an L'\0', that is, to detect binary content while continuing line-oriented processing.
Starting from the wish for an improved fgetws(), consider how to implement a function fgetws2() with the following specification.

```c
size_t fgetws2(wchar_t *buf, size_t len, FILE *fp);
```

Reads wide characters from `fp` into `buf` until it has stored at most `len - 1` of them or has encountered a newline. The stored wide string is terminated with L'\0'. Returns the number of wide characters read, or (size_t)-1 at end of file or when an error occurs; as with fgetws(), the caller must check ferror() and feof() to tell an error from end of file.

Because the function returns the number of characters read, the caller can tell that the string (character sequence) continues even when an L'\0' sits in the middle. Horrible files are scattered all over the world, and some people feed them to software deliberately to trigger bugs. Protect yourself from such threats.
First, the approach anyone would think of. The safest of the safe.
```c
size_t fgetws2(wchar_t *buf, size_t len, FILE *fp) {
    wint_t c = fgetwc(fp);
    if (c == WEOF) {
        return (size_t)-1;
    }
    if (len <= 1) {
        // There is no buffer to write into?
        ungetwc(c, fp);
        if (len) {
            *buf = L'\0';
        }
        return 0;
    }
    *buf = c;
    size_t rtn = 1;
    if (c != L'\n') {  // a newline as the very first character already ends the line
        len--;
        while (rtn < len) {
            c = fgetwc(fp);
            if (c == WEOF) {
                break;
            }
            buf[rtn++] = c;
            if (c == L'\n') {
                break;
            }
        }
    }
    buf[rtn] = L'\0';
    return rtn;
}
```
Even with buffering, doing an I/O operation per character feels like a heavyweight process, and it is in a programmer's nature to avoid it.

Since fgetws() appends L'\0' after whatever it has read, another idea is to let fgetws() do the reading and then search for that L'\0' from the back of the buffer. Incidentally, the second half of that idea bothers me, so I will call this function fgetws3.
```c
size_t fgetws3(wchar_t *buf, size_t len, FILE *fp) {
    if (len <= 1) {
        // Why was I called when there is no buffer to fill?
        wint_t c = fgetwc(fp);
        if (c == WEOF) {
            return (size_t)-1;
        }
        ungetwc(c, fp);
        if (len) {
            *buf = L'\0';
        }
        return 0;
    }
    // Fill the buffer with a nonzero byte pattern so that every slot
    // past the terminator fgetws() writes is guaranteed nonzero.
    memset(buf, 1, len * sizeof(*buf));
    if (!fgetws(buf, len, fp)) {
        return (size_t)-1;
    }
    // Scan backward: the last zero in the buffer is the terminator
    // fgetws() appended, so its index is the number of characters read.
    while (buf[--len]) {}
    return len;
}
```
For input, this looks much better than reading one character at a time. However, initializing the whole buffer with memset() and then scanning for L'\0' becomes painful when the number of characters read is much smaller than the buffer length. (The backward scan has the same cost profile as wcslen(), but still...)
fwscanf()
An unexpected helper was lying in ambush: fwscanf().
fwscanf(), or scanf(), is the beginners'-textbook function that reads numbers and strings from the keyboard(?) in one go with "%d %s". Amazing! It is standard material, yet beginners never quite understand why the int variable needs an & while the char array does not, and once you know a little more, it becomes the function nobody uses: it is easier to read a whole line and parse it yourself, and it is hard to use without opening yourself up to buffer overflows. I don't normally use it either.

However, it has a conversion specifier that turns out to be surprisingly useful for reading binary-tainted files: `%[^...]`.

For the detailed behavior of `%[^...]`, please read the specification, but roughly it works like `%s` except that it reads only while the characters belong to (or, with ^, stay outside) the listed set. You can therefore bound the buffer with something like `%20[^...]`, and going further, a call such as

```c
fwscanf(fp, L"%20l[^\n]%zn", buf, &len);
```

gets you exactly what handling binary-tainted data calls for: "read up to 20 wide characters excluding newlines, and store the number of characters consumed so far". Excellent!
It seems worth wrapping in a function, which gives the following.
```c
size_t fgetws4(wchar_t *buf, size_t len, FILE *fp) {
    if (len <= 1) {
        // No buffer
        wint_t c = fgetwc(fp);
        if (c == WEOF) {
            return (size_t)-1;
        }
        ungetwc(c, fp);
        if (len) {
            *buf = L'\0';
        }
        return 0;
    }
    // Build the format string before reading from the file.
    // Let fwscanf() read up to len - 2 characters, then fetch one more
    // with fgetwc() and append L'\0'. For a 512-character buffer,
    // fwscanf() should read up to 510 characters, i.e. the format
    // string "%510l[^\n]%zn". The nested % signs get hairy.
    wchar_t fmt[32];
    swprintf(fmt, 32, L"%%%zul[^\n]%%zn", len - 2);
    // readlen must be zero-initialized: if no conversion takes place,
    // i.e. the very first character is a newline, the argument is left
    // unchanged because no string was read (and ret is 0 in that case).
    size_t readlen = 0;
    int ret = fwscanf(fp, fmt, buf, &readlen);
    // The two-step dance above is C99; with C11's fwscanf_s() you could
    // apparently have even more fun.
    if (ret == EOF) {
        return (size_t)-1;
    }
    wint_t c = fgetwc(fp);
    if (!ret && c == WEOF) {
        return (size_t)-1;
    }
    if (c != WEOF) {
        buf[readlen++] = c;
    }
    buf[readlen] = L'\0';
    return readlen;
}
```
I now have three alternatives to fgetws(). If an application built on one of them is thrown into the wild and fed untrustworthy data, it can at least easily tell that the data is binary. Good.

Having made three implementations, let us look at the performance differences. A text file is variable-length data, and the prepared buffer is not normally kept full, so I set up tests measuring the four situations below.

Here, the long buffer is `(1 << 17) * sizeof(wchar_t)` bytes and the short buffer is `(1 << 10) * sizeof(wchar_t)` bytes. To feed short lines I used the following command.
```sh
yes 01234567890123456789012345678901234567890123456789012345678901234567890123456789
```
I used the following command to keep the buffer full.
```sh
tr \\0 @ </dev/zero
```
(These tests never actually exercise L'\0' handling, but that is beside the point right now, so never mind.) I piped each command into a binary that simply calls fgetws2() and friends repeatedly on standard input, and measured the elapsed time with time. The four situations differed to an interesting degree.
The following tests were done on Debian Buster amd64.
Piping `yes ...` into `fgetws*(buf, 1 << 17, stdin)`, 10,000,000 iterations:

|      | fgetws2 | fgetws3 | fgetws4 |
|---|---|---|---|
| real | 7.865s | (aborted) | 7.644s |
| user | 7.548s | (aborted) | 7.364s |
| sys  | 1.185s | (aborted) | 1.139s |
I aborted fgetws3() in this test because it took far too long. Scanning 2^17 characters on every call really does hurt: with a much smaller iteration count, it came out roughly 250 times slower than fgetws2(). Brutal.
Piping `tr \\0 @ </dev/zero` into `fgetws*(buf, 1 << 17, stdin)`, 10,000 iterations:

|      | fgetws2 | fgetws3 | fgetws4 |
|---|---|---|---|
| real | 12.749s | 3.133s | 8.231s |
| user | 14.276s | 4.085s | 9.621s |
| sys  | 1.506s | 1.202s | 1.531s |
Piping `yes ...` into `fgetws*(buf, 1 << 10, stdin)`, 5,000,000 iterations:

|      | fgetws2 | fgetws3 | fgetws4 |
|---|---|---|---|
| real | 3.986s | 9.754s | 3.819s |
| user | 3.850s | 9.545s | 3.639s |
| sys  | 0.563s | 0.750s | 0.621s |
Piping `tr \\0 @ </dev/zero` into `fgetws*(buf, 1 << 10, stdin)`, 500,000 iterations:

|      | fgetws2 | fgetws3 | fgetws4 |
|---|---|---|---|
| real | 4.988s | 1.275s | 3.453s |
| user | 5.592s | 1.685s | 4.022s |
| sys  | 0.590s | 0.449s | 0.575s |
The measurements confirm what the implementation already hinted at: fgetws3() is a bomb. Buffers are usually sized with some insurance headroom beyond the expected data, which would make it even worse. The numbers also show that reading one character at a time with fgetwc() takes roughly 1.5 times as long as fwscanf(), which consumes the data in bulk.
I am not sure it answers the request for line-oriented editing of data containing binary, but glibc has a function called getline(): when a line is longer than the buffer prepared in advance, the library grows the buffer for you, and it reports both the line length and the grown capacity. It has since been adopted into the POSIX standard as well.

But a function that grows memory automatically: can you really ship that in software released to the world? If it accepts arbitrary files, sooner or later someone will feed it `/dev/zero`, with or without malice, and then a DoS that exhausts the memory area succeeds and the system dies, right? I am wary of functions like that...
And I need wide characters. Since getline() has no wide-character counterpart, I would have to run mbsnrtowcs() myself, but the conversion stops at every '\0', and shoveling the converted characters into the buffer chunk by chunk is a pain!!!!
C ought to provide a saner API.