Be careful: some libraries developed overseas handle encoding rather carelessly.
For work, I needed to use SFTP from Python, so I used paramiko to read a text file directly from the SFTP server and run statistical processing on its contents. The API documentation says that the file() function can be used in the same way as Python's built-in file. So I specified the path and so on as usual, and the following error appeared.
UnicodeDecodeError: 'utf-8' codec can't decode byte ~~ in position ~~: invalid start byte
The code looks like the following.
import paramiko

client = paramiko.SSHClient()
client.connect(Appropriate connection information)
sftp_connection = client.open_sftp()
# Opened in the default text mode, so paramiko decodes each line as UTF-8.
with sftp_connection.open(File Path) as f:
    for line in f:
        print(line)
When I ran this, a UnicodeDecodeError was raised at the for statement.
In a nutshell, the cause was that you cannot specify the encoding when reading the contents of a file. The text file is encoded in ANSI, it is being read as UTF-8, and so the decode fails. I looked at the source, but at the moment there is no way to specify the encoding when opening a file, even though this is possible with Python's standard file input.
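The failure can be reproduced without any SFTP server. The following is a minimal sketch, assuming that "ANSI" here means the Japanese code page cp932; the post does not state the file's actual code page, so both the encoding and the sample text are only illustrative. The helper try_decode is hypothetical and exists just to show the two outcomes side by side.

```python
# Bytes as they would sit in a legacy-encoded text file.
# cp932 is an assumed example; the real file's code page may differ.
raw = "売上データ".encode("cp932")


def try_decode(data, encoding):
    """Return the decoded text, or the name of the error that was raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Decoding with the wrong codec fails; decoding with the right one succeeds.
print(try_decode(raw, "utf-8"))
print(try_decode(raw, "cp932"))
```

The first byte of most cp932-encoded Japanese characters is not a valid UTF-8 start byte, which matches the "invalid start byte" message in the traceback.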
Rewriting the contents of the file in English or regenerating the file as UTF-8 did not seem feasible, so this time I decided to open it in binary mode and decode it separately as ANSI. Specifically, I replaced the with statement in the code above with something like the following. (It resembles standard file input to some extent because I also wanted to be able to debug locally, without SFTP. Debugging naturally takes less time when there is no network communication.)
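import codecs

def readlines():
    # "rb" makes paramiko return raw bytes instead of decoding them as UTF-8.
    with sftp_connection.open(File Path, "rb") as f:
        # Decode explicitly, then split on Windows line endings.
        # "ANSI" is only a valid codec name on Windows (alias of mbcs);
        # elsewhere, name the actual code page, e.g. "cp932".
        return codecs.decode(f.read(), "ANSI").split("\r\n")

for line in readlines():
    print(line)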
Passing "rb" as the second argument of open makes the file be read as binary. I then decode the bytes with the encoding I actually want and return the lines.
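The same read-then-decode idea can be tried locally, with no network involved. This is a minimal sketch under a few assumptions: io.BytesIO stands in for the object returned by sftp_connection.open(..., "rb"), cp932 stands in for the file's real encoding, and read_lines is a hypothetical helper name. It also swaps split("\r\n") for splitlines(), which is a deliberate change so that no empty string is left after the final line break.

```python
import io


def read_lines(binary_file, encoding):
    """Read every byte from a binary file object, decode, and split into lines."""
    text = binary_file.read().decode(encoding)
    # splitlines() handles both "\r\n" and "\n" and drops the line endings.
    return text.splitlines()


# Local stand-in for the remote file (assumed encoding: cp932).
fake_remote = io.BytesIO("一行目\r\n二行目\r\n".encode("cp932"))
lines = read_lines(fake_remote, "cp932")
print(lines)
```

Because the helper only needs a read() method that returns bytes, the same function should work whether it is handed a local file opened with "rb" or a remote one.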
Because f.read() loads the whole file into memory at once, a very large file could be a problem, but the files are small enough in practice, so this was acceptable for now.
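If file size ever did become a concern, one option would be to decode in chunks rather than all at once. The following is a sketch only, not something the post's code does: iter_decoded_lines and chunk_size are hypothetical names, io.BytesIO again stands in for the remote file, and cp932 is an assumed encoding. It relies on the file object supporting read(size), as Python binary file objects do, and uses an incremental decoder from the standard codecs module so that a multi-byte character split across two chunks is still decoded correctly.

```python
import codecs
import io


def iter_decoded_lines(binary_file, encoding, chunk_size=32768):
    """Yield decoded lines while reading the file in fixed-size chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = binary_file.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        # Everything before the last "\n" is complete; keep the rest for later.
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line.rstrip("\r")
    # Flush the decoder; this raises if the data ends mid-character.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending.rstrip("\r")


# A tiny chunk size forces characters and line endings to straddle chunks.
data = "一行目\r\n二行目\r\n".encode("cp932")
result = list(iter_decoded_lines(io.BytesIO(data), "cp932", chunk_size=3))
print(result)
```

Only one chunk plus any unfinished line is held in memory at a time, at the cost of slightly more code than a single read().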
I think the fundamental fix for this problem is to let paramiko's BufferedFile accept an encoding.
This is a good opportunity to read the source properly, and I would like to send a pull request to paramiko on GitHub.