Python subprocess locale settings

When I run the OpenNLP POSTagger with subprocess.call in Python, the result is wrong. When I run the same command in my terminal, the result is correct.
After some testing, I think OpenNLP fails to load the model file correctly. What is the problem? The model is trained on Chinese text, and I use Python 2.7.
OpenNLP runs without any warnings or errors, but it tags the input sentence completely wrong. It gives the correct tags in the terminal. I guess it's an encoding problem, but I'm not sure.
Here is the code. It's nothing special and contains only ASCII characters.
from subprocess import call
opennlp_path = './opennlp/bin/opennlp'
pos_model = 'train.pos.model'
pos_predict_cmd = [opennlp_path, 'POSTagger', pos_model]
subproc = call(pos_predict_cmd)
If I print this command and copy it to the terminal, the result is correct.
I now know it is a locale/encoding problem (I debugged the script with strace). However, setting the Python locale to en_US.utf-8 or zh_CN.utf-8 does not help. My shell locale setting is zh_CN.utf-8.

First, have a look at http://docs.python.org/library/subprocess.html#using-the-subprocess-module, read it once or twice, then try using call(pos_predict_cmd, shell=True) and see if that works.
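One more thing worth checking, offered as a sketch rather than a confirmed fix: changing the locale inside Python with locale.setlocale does not change the environment variables that a child process inherits, whereas the env argument of the subprocess functions does. The example below is Python 3; the helper name run_with_locale and the probe command are made up for illustration, and whether OpenNLP tags correctly once LANG and LC_ALL are set is an assumption the question does not confirm.

```python
import os
import subprocess
import sys


def run_with_locale(argv, locale_name="zh_CN.UTF-8"):
    """Run argv with LANG and LC_ALL set explicitly in the child's environment."""
    env = dict(os.environ)        # copy, so the parent environment is untouched
    env["LANG"] = locale_name
    env["LC_ALL"] = locale_name
    return subprocess.check_output(argv, env=env)


# Hypothetical probe: a child Python process that reports the LC_ALL it received.
probe = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
seen_by_child = run_with_locale(probe).decode("ascii").strip()
print(seen_by_child)
```

For the question above, the call would be run_with_locale(pos_predict_cmd), with the locale name matching the one the shell reports.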

Related

Encoding of characters when running powershell script as python subprocess

Currently, I'm scripting a small python application that executes a PowerShell script. My python script should process the returned string but unfortunately, I have some trouble with the encoding of special characters like 'ä', 'ö', 'Ü' and so on. How can I return a Unicode/UTF-8 string?
You can see a simple example below. The console output is b'\xc7\xcf\r\n'. I don't understand why it's not b'\xc3\xa4\r\n' because \xc3\xa4 should be the correct UTF8 Encoding for the character 'ä'.
try.py:
import subprocess
p = subprocess.check_output(["powershell.exe", r".\script.ps1"])
print(p)
script.ps1:
return 'ä'
I adapted my PowerShell script in several ways but did not get the desired result:
Added "[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8". Result: b'\xc3\x83\xc2\xa4\r\n'
Changed the return to [System.Text.Encoding]::UTF8.GetBytes("ä"). Result: b'195\r\n131\r\n194\r\n164\r\n'
How can I get 'ä' as the console output of the script above?

I used "pwsh" because I ran this on a Mac; you can use "powershell.exe" in your code.
Try this:
import subprocess
p = subprocess.check_output(["pwsh", r".\sc.ps1"])
print(p.decode('utf-8'))
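The two results quoted in the question are the same four bytes (0xc3 0x83 0xc2 0xa4, or 195 131 194 164 in decimal). A small Python 3 sketch shows that this is exactly what you get when the UTF-8 bytes of 'ä' are misread as cp1252 and then encoded as UTF-8 again. This suggests, though the question does not confirm it, that script.ps1 was saved as UTF-8 and read by PowerShell with an ANSI code page.

```python
# UTF-8 bytes of 'ä', as the asker expected to see them.
expected = "\u00e4".encode("utf-8")

# Misreading those two bytes as cp1252 yields two unrelated characters.
misread = expected.decode("cp1252")

# Encoding the misread text as UTF-8 gives the four bytes from the question.
double_encoded = misread.encode("utf-8")

print(repr(expected))
print(repr(double_encoded))
print(list(double_encoded))

# Reversing the two steps recovers the original character.
recovered = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(recovered == "\u00e4")
```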

TensorFlow Get started page - print first 5 rows

I'm using PyCharm, and when I try to execute the statement from here:
!head -n5 {train_dataset_fp}
The IDE complains that this is a SyntaxError: invalid syntax, and the program never executes. I thought the entire TensorFlow tutorial was in Python, but this code seems to be from a completely different language. Has anyone successfully worked through the TensorFlow: Get Started tutorial?

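This is not a Python command; it is a Unix one that launches the head program. The leading "!" is Jupyter/IPython notebook syntax for running a shell command, which is why a plain Python interpreter rejects it.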
You can use PyCharm to open a Terminal on your target machine, and type:
head -n5 {train_dataset_fp}
... replacing {train_dataset_fp} with the actual path to your dataset, which you obtained and printed in the previous step of the tutorial; cf. these lines:
train_dataset_fp = tf.keras.utils.get_file(fname=os.path.basename(train_dataset_url),
origin=train_dataset_url)
print("Local copy of the dataset file: {}".format(train_dataset_fp))
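If you would rather stay inside Python (for example when running the tutorial as a script in PyCharm), a minimal stand-in for head -n5 can be sketched as below. The function name head and the throwaway demo file are illustrative assumptions, not part of the tutorial.

```python
import os
import tempfile
from itertools import islice


def head(path, count=5):
    """Return the first `count` lines of a text file, without trailing newlines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]


# Demo on a throwaway file; in the tutorial you would pass train_dataset_fp.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as handle:
    handle.write("".join("row %d\n" % i for i in range(1, 8)))

first_five = head(demo_path)
for line in first_five:
    print(line)
os.remove(demo_path)
```

islice stops reading after the requested number of lines, so the whole file is never loaded.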

Since you're on Windows, you need Windows commands to do what head does. If you have PowerShell installed, you can use the gc command (an alias for Get-Content). If you don't, here's a workaround that prints the first 5 lines of file.txt, each prefixed with its line number:
findstr /n ".*" file.txt | findstr /b "[1-5]:"
This was inspired by this answer. It numbers all lines in the file and then picks the first five, which is inefficient for large files. Use the "!" prefix as needed.

Python subprocess.call with multiline string EOF

I've hit an issue that I don't understand how to overcome. I'm trying to create a subprocess in Python to run another Python script, which is not too difficult. The issue I can't get around is an EOF error when the Python file includes a very long string.
Here's an example of what my files look like.
Subprocess.py:
### call longstr.py from the primary pyfile
import subprocess
subprocess.call(['python longstr.py'], shell = True)
Longstr.py
### called from subprocess.py
### the actual string is a lot longer; this is an example to illustrate how the string is formatted
lngstr = """here is a really long
string (which has many n3w line$ and "characters")
that are causing the python file to state the file is ending early
"""
print lngstr
Error printed in the terminal:
SyntaxError: EOF while scanning triple-quoted string literal
As a workaround, I tried removing all line breaks and all spaces to see whether the multi-line string was the cause. That returned the same result.
My assumption is that while the subprocess is running, the shell does something with the file contents, and when it reaches a new line the shell itself chokes; that, rather than the file, is what terminates the process.
What is the correct workaround for having subprocess run a file like this?
Thank you for your help.

Answering my own question here: my problem was that I didn't call file.close() before trying to execute subprocess.call, so the script on disk was still incomplete when the child process read it.
If you encounter this problem and are working with recently written files, this could be your issue too. Thank you to everyone who read or responded to this thread.
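A minimal Python 3 sketch of the fix, assuming the script is generated right before it is run: writing inside a with block guarantees the file is flushed and closed before the child process reads it. The names SCRIPT and write_and_run are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Source of a small script containing a triple-quoted, multi-line string.
SCRIPT = 'lngstr = """here is a really long\nstring with "characters"\n"""\nprint(lngstr)\n'


def write_and_run(source):
    """Write source to a temporary .py file, close it, then run it."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(source)
        # The with block has closed the file, so its full contents are on disk.
        return subprocess.check_output([sys.executable, path])
    finally:
        os.remove(path)


output = write_and_run(SCRIPT)
print(output.decode("ascii"))
```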

Python pipe cp1252 string from PowerShell to a python (2.7) script

After a few days of poring over Stack Overflow and the Python 2.7 docs, I have come to no conclusion about this.
I'm running a Python script on a Windows server, and it must take a block of text as input. This block of text (unfortunately) has to be passed through a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding, and I really cannot change it due to administrative regulations and whatnot. When I pipe the text to my Python script and read it, it already contains ? where characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", in that Python gets the text perfectly and I can then decode it to unicode etc. But apparently they don't let me do that on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys
def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv)>1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0
if __name__=="__main__":
    sys.exit(main())
Tried it with both the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
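The 3f bytes are consistent with the text being converted to ASCII before Python ever sees it. As far as I know, $OutputEncoding defaults to ASCII in Windows PowerShell, which would fit what you observe, but treat that as an assumption about this server. The sketch below only shows why no decoding on the Python side can help once that has happened.

```python
original = u"Bl\u00e1"  # the text "Blá" from the example above

# What a cp1252 pipe would deliver: the accented letter survives as 0xe1.
cp1252_bytes = original.encode("cp1252")

# What an ASCII pipe delivers: the accented letter is replaced by "?" (0x3f).
ascii_bytes = original.encode("ascii", "replace")

print(repr(cp1252_bytes))
print(repr(ascii_bytes))

# Decoding the ASCII bytes as cp1252 cannot bring the letter back.
print(repr(ascii_bytes.decode("cp1252")))
```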

How about saving the data to a file and piping it to Python in a CMD session? Invoke both PowerShell and Python from CMD, like so:
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another idea: convert the data to Base64 in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess it isn't hard in Python either. Like so:
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
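On the Python side, decoding could look like the sketch below. It assumes the PowerShell side encoded UTF-8 bytes as in the example above; the function name decode_base64_text is made up for illustration. base64.b64decode is available in both Python 2.7 and Python 3.

```python
import base64


def decode_base64_text(encoded):
    """Decode Base64 text that carries UTF-8 bytes; surrounding whitespace is ignored."""
    return base64.b64decode(encoded.strip()).decode("utf-8")


# In the real script the argument would come from the pipe, e.g. sys.stdin.read().
decoded = decode_base64_text("w6nDqsOow6s=\r\n")
print(decoded == u"\u00e9\u00ea\u00e8\u00eb")
```

Because Base64 output is pure ASCII, it passes through the pipe unchanged whatever code page the server uses.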

What to do with "The input line is too long" error message?

I am trying to use os.system() to call another program that takes an input and an output file. The command I use is ~250 characters due to the long folder names.
When I try to call the command, I'm getting an error: The input line is too long.
I'm guessing there's a 255-character limit (it's built on the C system call, but I couldn't find the limitations of that either).
I tried changing the directory with os.chdir() to reduce the folder trail lengths, but when I try using os.system() with "..\folder\filename" it apparently can't handle relative path names. Is there any way to get around this limit or get it to recognize relative paths?

Although using subprocess.Popen() is a good idea, it does not solve the issue.
Your problem is not the 255-character limit. That limit applied in the DOS era; it was later increased to 2048 for Windows NT/2000 and again to 8192 for Windows XP and later.
The real solution is to work around a very old bug in the Windows APIs _popen() and _wpopen().
If you use quotes anywhere in the command line, you have to wrap the entire command in quotes, or you will get the "The input line is too long" error message.
All Microsoft operating systems starting with Windows XP have an 8192-character limit, which is enough for any reasonable command line, but the bug was never fixed.
To work around it, enclose your entire command in double quotes. If you want to know more, read the MSDN comment on _popen().
Be careful, because these work:
prog
"prog"
""prog" param"
""prog" "param""
But this will not work:
""prog param""
If you need a function that does add the quotes when they are needed you can take the one from http://github.com/ssbarnea/tendo/blob/master/tendo/tee.py
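A minimal sketch of the rule described above, assuming it holds as stated: if the command contains any double quote, wrap the whole command in one more pair. The function name wrap_for_os_system is hypothetical and this is not the tendo implementation.

```python
def wrap_for_os_system(command):
    """Wrap the whole command in double quotes if it contains any double quote."""
    if '"' in command:
        return '"' + command + '"'
    return command


print(wrap_for_os_system('"prog" "param"'))
print(wrap_for_os_system('prog param'))
```

Pass the command without the outer pair; the function adds it only when a quote is present.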

You should use the subprocess module instead. See this little doc for how to rewrite os.system calls to use subprocess.

You should use subprocess instead of os.system.
subprocess has the advantage of being able to change the directory for you:
import subprocess
my_cwd = r"..\folder"  # a raw string cannot end with a single backslash
my_process = subprocess.Popen(["command name", "option 1", "option 2"], cwd=my_cwd)
my_process.wait() # wait for process to end
if my_process.returncode != 0:
    print("Something went wrong!")
The subprocess module contains some helper functions as well if the above looks a bit verbose.
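For instance, subprocess.check_output accepts the same cwd argument, waits for the process, and raises CalledProcessError on a non-zero exit code. The sketch below is Python 3; the helper name run_in_directory and the probe command are illustrative assumptions, and the probe simply makes the child report the name of its working directory.

```python
import os
import subprocess
import sys
import tempfile


def run_in_directory(argv, directory):
    """Run argv with `directory` as its working directory and return its output."""
    return subprocess.check_output(argv, cwd=directory)


workdir = tempfile.mkdtemp()
# Hypothetical probe: a child Python process that prints its directory name.
probe = [sys.executable, "-c",
         "import os; print(os.path.basename(os.getcwd()))"]
reported = run_in_directory(probe, workdir).decode("ascii").strip()
print(reported == os.path.basename(workdir))
os.rmdir(workdir)
```

With cwd set, the command's own arguments can use short relative paths, which also keeps the command line well below any length limit.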

Assuming you're on Windows (judging by the backslashes), you could write a .bat file from Python and then call os.system() on that. It's a hack.

Make sure that any '\' in your strings is properly escaped.
Python uses '\' as the escape character, and "\f" is the escape sequence for a form feed. The string "..\folder\filename" therefore does not contain the backslashes or the letter f you typed: it evaluates to "..", a form feed, "older", another form feed, and "ilename".
You probably want to use
r"..\folder\filename"
or
"..\\folder\\filename"
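A quick way to see the difference is to print the repr of each form. The variable names are only for illustration.

```python
plain = "..\folder\filename"     # "\f" is read as a form feed (0x0c)
raw = r"..\folder\filename"      # raw string: backslashes kept as typed
doubled = "..\\folder\\filename" # escaped backslashes

print(repr(plain))     # '..\x0colder\x0cilename'
print(raw == doubled)  # True
print(raw == plain)    # False
```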

I got the same message, which was strange because the command was not that long (130 characters) and it used to work; it just stopped working one day.
I closed the command window, opened a new one, and it worked.
The command window had been open for a long time (maybe months; it's a remote virtual machine).
I guess it is some Windows bug with a buffer or something.
