I interact with a server that I use to tag sentences. This server is launched locally on port 2020.
For example, if I send Je mange des pâtes . on port 2020 through the client used below, the server answers Je_CL mange_V des_P pâtes_N ._. . The result is always exactly one line, as long as my input is not empty.
I currently have to tag 9,568 files through this server. The first 9,483 files are tagged as expected. After that, the input stream seems to be closed, full, or otherwise broken, because I get an IOError, specifically a broken pipe error, when I try to write to stdin.
When I skip the first 9,483 files, the remaining ones are tagged without any issue, including the one that caused the first error.
My server doesn't produce any error log indicating that something fishy happened... Am I handling something incorrectly? Is it normal for the pipe to fail after some time?
import codecs
from subprocess import Popen, PIPE

log = codecs.open('stanford-tagger.log', 'w', 'utf-8')
p1 = Popen(["java",
            "-cp", JAR,
            "edu.stanford.nlp.tagger.maxent.MaxentTaggerServer",
            "-client",
            "-port", "2020"],
           stdin=PIPE,
           stdout=PIPE,
           stderr=log)
fhi = codecs.open(SUMMARY, 'r', 'utf-8')  # a descriptor of the files to tag
for i, line in enumerate(fhi, 1):
    if i % 500 == 0:
        print "Tagged " + str(i) + " documents..."
    tokens = ...  # a list of words, can be quite long
    try:
        p1.stdin.write(' '.join(tokens).encode('utf-8') + '\n')
    except IOError:
        print 'bouh, I failed ;(('
    result = p1.stdout.readline()
    # Here I do something with result...
fhi.close()
In addition to my comments, I might suggest a few other changes...
for i, line in enumerate(fhi, 1):
    if i % 500 == 0:
        print "Tagged " + str(i) + " documents..."
    tokens = ...  # a list of words, can be quite long
    try:
        s = ' '.join(tokens).encode('utf-8') + '\n'
        assert s.find('\n') == len(s) - 1  # Make sure there's only one newline in s
        p1.stdin.write(s)
        p1.stdin.flush()  # Block until we're sure it's been sent
    except IOError:
        print 'bouh, I failed ;(('
    result = p1.stdout.readline()
    assert result  # Make sure we got something back
    assert result.find('\n') == len(result) - 1  # Make sure there's only one newline in result
    # Here I do something with result...
fhi.close()
...but given there's also a client/server we know nothing about, there are a lot of places it could be going wrong.
Does it work if you dump all the queries into a single file, and then run it from the commandline with something like...
java .... < input > output
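For reference, a minimal sketch of that batch run driven from Python, reusing JAR and the tagger class name from the question (the two file names here are made up):
from subprocess import Popen

with open('all_queries.txt', 'rb') as inp, open('all_results.txt', 'wb') as out:
    p = Popen(["java", "-cp", JAR,
               "edu.stanford.nlp.tagger.maxent.MaxentTaggerServer",
               "-client", "-port", "2020"],
              stdin=inp, stdout=out)  # no PIPE: the OS streams the files directly
    p.wait()
If that processes all 9,568 documents' worth of input cleanly, the server is fine and the problem is in the per-line pipe handling.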
Related
As part of a program that decodes a communication protocol (EDIFACT MSCONS) I have a class that gives me the next 'segment' of the message. The segments are delimited by an apostrophe "'". There may be newlines after the "'" or not.
Here's the code for that class:
import sys

class SegmentGenerator:
    def __init__(self, filename):
        try:
            fh = open(filename)
        except IOError:
            print("Error: file " + filename + " not found!")
            sys.exit(2)
        lines = []
        for line in fh:
            line = line.rstrip()
            lines.append(line)
        if len(lines) == 1:
            msg = lines[0]
        else:
            msg = ''
            for line in lines:
                msg = msg + line.rstrip()
        self.segments = msg.split("'")
        self.iterator = iter(self.segments)

    def next(self):
        try:
            return next(self.iterator)
        except StopIteration:
            return None
if __name__ == '__main__':  # testing only
    sg = SegmentGenerator('MSCONS_21X000000001333E_20X-SUD-STROUM-M_20180807_000026404801.txt')
    for i in range(210436):
        if i > 8940:
            break
        print(sg.next())
To give an idea of what the file looks like, here's an excerpt:
UNB+UNOC:3+21X000000001333E:020+20X-SUD-STROUM-M:020+180807:1400+000026404801++TL'UNH+000026404802+MSCONS:D:04B:UN:1.0'BGM+7+000026404802+9'DTM+137:201808071400:203'RFF+AGI:6HYR67925RZUD_000000257860_00_E27'NAD+MS+21X000000001333E::020'NAD+MR+20X-SUD-STROUM-M::020'UNS+D'NAD+DP'LOC+172+LU0000010496200000000000050287886::89'DTM+163:201701010000?+01:303'DTM+164:201702010000?+01:303'LIN+1'PIA+5+1-1?:1.29.0:SRW'QTY+220:9.600'DTM+163:201701010000?+01:303'DTM+164:201701010015?+01:303'QTY+220:10.400'DTM+163:201701010015?+01:303'DTM+164:201701010030?+01:303'QTY+220:10.400'DTM+163:201701010030?+01:303'DTM+164:201701010045?+01:303'QTY+220:10.400'DTM+163:201701010045?+01:303'DTM+164:201701010100?+01:303'QTY+220:10.400'DTM+163:201701010100?+01:303'DTM+164:201701010115?+01:303'QTY+220:10.400'DTM+163:201701010115?+01:303'DTM+164:201701010130?+01:303'QTY+220:10.400'DTM+163:201701010130?+01:303'DTM+164:201701010145?+01:303'QTY+220:10.400'DTM+163:201701010145?+01:303'DTM+164:201701010200?+01:303'QTY+220:11.200'DTM+163:201701010200?+01:303' ...
The file I have a problem with has 210000 of those segments. I tested the code and everything works fine. The list of segments is complete and I get one segment after the other correctly until the end of the list.
I use the segments as input to a state machine that gets new segments from an instance of SegmentGenerator.
Here's an excerpt:
def DTMstarttransition(self, segment):
    match = re.search(r'DTM\+(.*?):(.*?):(.*?)($|\+.*|:.*)', segment)
    if match:
        if match.group(1) == '164':
            self.currentendtime = self.dateConvert(match.group(2), match.group(3))
            return ('DTMend', self.sg.next())
    return ('Error', segment + "\nExpected DTM segment didn't match")
The method returns the name of the next state and the next segment sg.next(), sg being an instance of SegmentGenerator.
However, at the 8942nd segment the call to sg.next() doesn't give me the next segment but the second-to-last segment of the list!
I traced the function calls (with the autologging module):
TRACE:segmentgenerator.SegmentGenerator:next:CALL *() **{}
TRACE:segmentgenerator.SegmentGenerator:next:RETURN 'DTM+164:201702010000?+01:303'
TRACE:__main__.MSCONSparser:QTYtransition:RETURN ('DTMstart', 'DTM+164:201702010000?+01:303')
TRACE:__main__.MSCONSparser:DTMstarttransition:CALL *('DTM+164:201702010000?+01:303',) **{}
TRACE:__main__.MSCONSparser:dateConvert:CALL *('201702010000?+01', '303') **{}
TRACE:__main__.MSCONSparser:dateConvert:RETURN datetime.datetime(2017, 2, 1, 0, 0)
TRACE:segmentgenerator.SegmentGenerator:next:CALL *() **{}
TRACE:segmentgenerator.SegmentGenerator:next:RETURN 'UNT+17872+000026404802'
TRACE:__main__.MSCONSparser:DTMstarttransition:RETURN ('DTMend', 'UNT+17872+000026404802')
TRACE:__main__.MSCONSparser:DTMendtransition:CALL *('UNT+17872+000026404802',) **{}
UNT+... isn't the next segment; it should be a LIN segment.
But how is this possible? Why does SegmentGenerator work when I test it with the main function in its own module, but not after thousands of calls from the other module?
All the segments are there from beginning to end. I can verify this from the interpreter, since the list sg.segments stays available after the program stops: len(sg.segments) is 210435, but my program stops after 8942. So it is clearly a problem with the iterator.
The files (3 Python files and an example data file) can be found on GitHub in the branch 'next' if you'd like to test the whole thing.
I think it's possible there is a double apostrophe '' in your data file, near the 8942nd apostrophe.
In that case your code will still read the whole file and produce all 210435 segments, but split("'") turns the '' pair into an empty string segment.
If you then have a condition that tests the result of sg.next(), that empty string would be falsey on the 8942nd iteration, and I'm guessing this is what causes your program to abort.
eg:
while sg.next():
    # some processing here
If I'm completely wrong, then I'd be interested in seeing the behaviour of this, where the segment count and the number of iterations should be equal:
if __name__ == '__main__':
    fn = sys.argv[1]
    sg = SegmentGenerator(fn)
    print("Num segments:", len(sg.segments))
    i = 0
    value = 'x'
    while value:
        value = sg.next()
        i += 1
        print(i, value)
    print("Num iterations:", i)
It turned out that the segment 'DTM+164:201702010000?+01:303' existed a second time further down in the file, and that that occurrence is indeed followed by a UNT segment. So the problem was with the protocol states themselves, and the iterator was working correctly.
So sorry that I bothered you with my wrong assumption. Thanks for wanting to help!
I have recently been learning some Python and how to apply it to my work. I have written a couple of scripts successfully, but I am having an issue I just cannot figure out.
I am opening a file with ~4000 lines, two tab-separated columns per line. When reading the input file, I get an IndexError saying that the list index is out of range. However, while I get the error every time, it doesn't happen on the same line every time (it will throw the error on a different line each run!). So, for some reason, it works generally but then (seemingly) randomly fails.
As I literally only started learning Python last week, I am stumped. I have looked around for the same problem but haven't found anything similar. Furthermore, I don't know whether this problem is language-specific or IPython-specific. Any help would be greatly appreciated!
input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")

for each in input:
    splits = each.split("\t")
    changelist = list(splits[0])
    second = int(splits[1])

    print second

    if changelist[7] == ";":
        changelist.insert(6, "000")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[8] == ";":
        changelist.insert(6, "00")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[9] == ";":
        changelist.insert(6, "0")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    else:
        #output.write(str("".join(changelist)))
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)

output.close()
The error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
57 splits = each.split("\t")
58 changelist = list(splits[0])
---> 59 second = int(splits[1])
60
61 print second
IndexError: list index out of range
Input:
ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Desired output:
ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
The reason you're getting the IndexError is that your input file is apparently not entirely tab-delimited, so there is nothing at splits[1] when you attempt to access it.
Your code could also use some refactoring. First of all, you're repeating yourself with the if-checks; that's unnecessary. I threw the following together to demonstrate how you could refactor your code to be a little more pythonic and DRY. Note that it simply pads the cds number to 7 characters, which may not be exactly what you want, and I can't guarantee it'll work with your dataset, but I'm hoping it might help you understand how to do things differently.
to_sort = []

# We can open two files using the with statement. This will also handle
# closing the files for us when we exit the block.
with open("count.txt", "r") as inp, open("output.txt", "w") as out:
    for each in inp:
        # Split at ';'... so you won't have to worry about whether or not
        # the file is tab delimited.
        changed = each.split(";")
        # Get the value you want. This is called unpacking.
        # The value before '=' will always be 'ID', so we don't really care about it.
        # _ is generally used as a variable name when the value is discarded.
        _, value = changed[0].split("=")
        # 0-pad the desired value to 7 characters. Python string formatting
        # makes this very easy. This will replace the current value in the list.
        changed[0] = "ID={:0<7}".format(value)
        # Join the changed list with the original separator and
        # append it to the sort list.
        to_sort.append(";".join(changed))
    # Write the results to the file all at once. Your test data already
    # provided the newlines, so you can just write it out as it is.
    out.writelines(to_sort)

# Do what else you need to do. Maybe to_sort.sort()?
You'll notice that this reduces your code to around 8 lines while achieving the exact same thing, doesn't repeat itself, and is pretty easy to understand.
Please read PEP 8 and the Zen of Python, and go through the official tutorial.
This happens when there is a line in count.txt which doesn't contain the tab character, so when you split on the tab character there is no splits[1]. Hence the "list index out of range" error.
To find out which line is causing the error, just add a print(each) after the split on line 57. The line printed before the error message is your culprit. If your input file keeps changing, you will get the error at different locations. Change your script to handle such malformed lines.
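For instance, a hedged sketch of a defensive version of the start of the loop (reusing the names from the question's script) that reports malformed lines and skips them instead of crashing:
with open("count.txt", "r") as inp:
    for lineno, each in enumerate(inp, 1):
        splits = each.rstrip("\n").split("\t")
        if len(splits) < 2:
            # No tab on this line: report it and move on.
            print "Skipping malformed line %d: %r" % (lineno, each)
            continue
        changelist = list(splits[0])
        second = int(splits[1])
        # ... the rest of the processing as before ...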
Hi, I am processing a 600 MB file with the code below. What I am doing is searching for a keyword in the data between <dest> tags and, if it exists, adding a city tag to the <dest> tag. It worked fine for a small set of data, but when I ran the program on the large file it threw a MemoryError. I guess I am getting this error when I use the return statement in the if condition. Can anyone please let me know how to solve this?
import re

def casp(tx):
    def tbcnv(st):
        ct = ''
        prt = re.compile(r"(?i)(Slip Copy,.*?\))", re.DOTALL | re.M)
        val = re.search(prt, st)
        try:
            ct = val.group(1)
            if re.search(r"(?i)alaska", ct):
                jval = "Alaska"
                print jval
                if jval:
                    prt = re.compile(r"(?i)(.*?<dest.*?>)", re.DOTALL | re.M)
                    vl = re.sub(prt, "\\1\n" + "<city>" + jval + "</city>" + "\n", st)
                    return vl
                else:
                    return st
            else:
                return st
        except:
            print "Not available"
            return st

    pt = re.compile("(?i)(<dest.*?</dest>)", re.DOTALL | re.M)
    t = re.sub(pt, lambda m: tbcnv(m.group(1)), tx)
    return t

with open('input.txt', 'r') as content_file:
    content = content_file.read()

pt = re.compile(r"(?i)<Lrlevel level='3'>(.*?)</Lrlevel>", re.DOTALL | re.M)
content = re.sub(pt, lambda m: "<Lrlevel level='3'>" + casp(m.group(1)) + "</Lrlevel>", content)

with open('out.txt', 'w') as out_file:
    out_file.write(content)
If you remove the return statement just before the except, then the string built by re.sub() is much smaller.
I'm getting memory usage of about 3 times the file size, which means you'd get a MemoryError if you don't have (quite a bit more than) 2 GB of RAM. That is plausible here, or at least I can guess why: it's how re.sub() works.
It also means you're somehow using the wrong tools, as explained in the comments above. You should either use a full XML-processing tool like lxml, or, if you want to stick with regular expressions, find a way to never need the whole string in memory, or at least never call re.sub() on it (e.g. only the tx variable ever contains a big string, namely the input; you call pt.search(tx, startpos) in a loop, locating the places to change, and write out the pieces of tx one by one).
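A minimal sketch of that piecewise idea, assuming tbcnv() from the question is pulled out to top level (file names as in the question); the input is still read once as one string, but the result is streamed to the output file instead of being rebuilt in memory by re.sub():
import re

pt = re.compile(r"(?i)(<dest.*?</dest>)", re.DOTALL)

with open('input.txt') as inp, open('out.txt', 'w') as out:
    tx = inp.read()
    pos = 0
    for m in pt.finditer(tx):
        out.write(tx[pos:m.start()])   # copy the unchanged stretch
        out.write(tbcnv(m.group(1)))   # transform just this <dest> block
        pos = m.end()
    out.write(tx[pos:])                # trailing text after the last match
Note this only covers the <dest> pass; the outer <Lrlevel> wrapping from the question would need the same treatment.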
It has been a while since I have written functions with for loops and writing to files, so bear with my ignorance.
This function is given an IP address (read from a text file); it pings the IP, searches the ping output for the received-packet count, and then appends it to a .csv file.
My question is: Is there a better or an easier way to write this?
import os

def pingS(IPadd4):
    fTmp = "tmp"
    os.system("ping " + IPadd4 + " -n 500 > tmp")
    sName = siteNF  # sys.argv[1]
    scrap = open(fTmp, "r")
    nF = file(sName, "a")  # appends
    nF.write(IPadd4 + ",")
    for line in scrap:
        if line.startswith("    Packets"):
            arrT = line.split(" ")
            nF.write(arrT[10] + " \n")
    scrap.close()
    nF.close()
Note: If you need the full script I can supply that as well.
This, in my opinion at least, makes what is going on a bit more obvious. The len('Received = ') could obviously be replaced by a constant.
def pingS(IPadd4):
    fTmp = "tmp"
    os.system("ping " + IPadd4 + " -n 500 > tmp")
    sName = siteNF  # sys.argv[1]
    scrap = open(fTmp, "r")
    nF = file(sName, "a")  # appends
    ip_string = scrap.read()
    recvd = ip_string[ip_string.find('Received = ') + len('Received = ')]
    nF.write(IPadd4 + ',' + recvd + '\n')
You could also try looking at the Python csv module for writing to the csv. In this case it's pretty trivial though.
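For what it's worth, a small sketch of the csv-module variant, reusing the sName, IPadd4 and recvd names from the code above:
import csv

with open(sName, "ab") as f:  # binary append mode for csv.writer on Python 2
    writer = csv.writer(f)
    writer.writerow([IPadd4, recvd])  # quoting/escaping handled for you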
This may not be a direct answer, but you may get some performance increase from using StringIO. I have had some dramatic speedups in IO with this. I'm a bioinformatics guy, so I spend a lot of time shooting large text files out of my code.
http://www.skymind.com/~ocrow/python_string/
I use method 5. Didn't require many changes. There are some fancier methods in there, but they didn't appeal to me as much.
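In case it's useful, here's a sketch of that pattern (the producer function is made up purely for illustration):
from cStringIO import StringIO

def produce_chunks():
    # Hypothetical stand-in for whatever generates your output text.
    for i in xrange(100000):
        yield "line %d\n" % i

buf = StringIO()
for chunk in produce_chunks():
    buf.write(chunk)  # cheap in-memory appends instead of repeated string concatenation
with open('results.txt', 'w') as out:
    out.write(buf.getvalue())  # one big write at the end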
I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                    .+\s+ \| (?#track)
                    \s+.+\s+ \| (?#start)
                    \s+.+\s+ \| (?#length)
                    \s+.+\s+ \| (?#start sec)
                    \s+.+\s*$ (?#end sec)
                    """, line, re.X):
            lines.next()
            break
    while True:
        line = lines.next()
        m = re.match(r"""
                    ^\s*
                    (?P<num>\d+)
                    \s*\|\s*
                    (?P<start_time>[0-9:.]+)
                    \s*\|\s*
                    (?P<length_time>[0-9:.]+)
                    \s*\|\s*
                    (?P<start_sector>\d+)
                    \s*\|\s*
                    (?P<end_sector>\d+)
                    \s*$
                    """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)
    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII, although that may not be precise/accurate; Notepad++, for example, indicates it's ANSI).
However, the script doesn't work on certain log files (in those cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
File "simple.py", line 55, in <module>
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
File "simple.py", line 49, in calculate_mb_toc_numbers
leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to write a temporary file and then having the script run on that), but I would like to have Python "ignore" the encoding if possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this, but my eyes are still spinning around in my head, so while that may be my longer-term strategy, I'm wondering if there's an interim hack I could use.
codecs.open() will allow you to open a file using a specific encoding, and it will produce unicodes. You can try a few, going from most likely to least likely (or the tool could just always produce UTF-16LE but ha ha fat chance).
Also, "Unicode In Python, Completely Demystified".
works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
Python 2.x expects normal strings to be ASCII (or at least single-byte). Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.
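Putting the two suggestions together, a sketch of what the mainline of the script above might become (treating utf-16 as an assumption about the "breaking" logs; works.log would need 'ascii' instead):
from __future__ import unicode_literals
import codecs
import sys

f = codecs.open(sys.argv[1], 'r', 'utf-16')  # encoding assumed, see above
mb_toc_urlpart = "%20".join(
    unicode(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart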