Unexpected whitespace in Python-generated strings

I am using Python to generate an ASCII file composed of very long lines. This is one example line (let's say line 100 in the file, '[...]' are added by me to shorten the line):
{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}
If I open the ASCII file that I generated with ipython:
f = open('myfile','r')
print repr(f.readlines()[99])
I do obtain the expected line printed correctly ('[...]' are added by me to shorten the line):
'{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}\n'
On the contrary, if I open this file with the program that is supposed to read it, it generates an exception, complaining about an unexpected pair after 478 1.
So I tried to open the file with vim. Vim, too, shows no problem, but if I copy the line as printed by vim and paste it into another text editor (in my case TextMate), this is the line that I obtain ('[...]' are added by me to shorten the line):
{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,4 79 8,485 1,[...]}
This line indeed has a problem after the pair 478 1.
I tried to generate my lines in different ways (concatenating, with cStringIO, ...), but I always obtain this result. When using cStringIO, for example, the lines are generated as follows (even though I tried to change this as well, with no luck):
def _construct_arff(self,attributes,header,data_rows):
    """Create the string representation of a Weka ARFF file.
    *attributes* is a dictionary with attribute_name:attribute_type
    (e.g., 'num_of_days':'NUMERIC')
    *header* is a list of the attributes sorted
    (e.g., ['age','name','num_of_days'])
    *data_rows* is a list of lists with the values, sorted as in the header
    (e.g., [ [88,'John',465],[77,'Bob',223]])"""
    arff_str = cStringIO.StringIO()
    arff_str.write('#relation %s\n' % self.relation_name)
    for idx,att_name in enumerate(header):
        try:
            name = att_name.replace("\\","\\\\").replace("'","\\'")
            arff_str.write("#attribute '%s' %s\n" % (name,attributes[att_name]))
        except UnicodeEncodeError:
            arff_str.write('#attribute unicode_err_%s %s\n'
                           % (idx,attributes[att_name]))
    arff_str.write('#data\n')
    for data_row in data_rows:
        row = []
        for att_idx,att_name in enumerate(header):
            att_type = attributes[att_name]
            value = data_row[att_idx]
            # numeric attributes can be sparse: None and zeros are not written
            if ((not att_type == constants.ARRF_NUMERIC)
                    or not ((value == None) or value == 0)):
                row.append('%s %s' % (att_idx,value))
        arff_str.write('{' + (','.join(row)) + '}\n')
    return arff_str.getvalue()
UPDATE: As you can see from the code above, the function transforms a given set of data into a special arff file format. I noticed that one of the attributes I was creating contained numbers as strings (e.g., '1' instead of 1). By forcing these numbers into integers:
features[name] = int(value)
I recreated the arff file successfully. However, I don't see how this, which is a value, can have an impact on the formatting of *att_idx*, which is always an integer, as also pointed out by @JohnMachin and @gnibbler (thanks for your answers, btw). So, even if my code runs now, I still don't see why this happens. How can the value, if not properly transformed into int, influence the formatting of something else?
This file contains the wrongly formatted version.

The built-in function repr is your friend. It will show you unambiguously what you have in your file.
Do this:
f = open('myfile','r')
print repr(f.readlines()[99])
and edit your question to show the result.
Update: As to how it got there, it is impossible to tell, because it cannot have been generated by the code that you showed. The value 479 should be a value of att_idx, which comes from enumerate() and so must be an int. You are formatting this int with %s ... 479 can't become 4 79. Also, that should generate att_idx in order 0, 1, etc., but you are missing many values, and there is nothing conditional inside your loop.
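To illustrate the point, a minimal check (with a hypothetical header):

for att_idx, att_name in enumerate(['age', 'name', 'num_of_days']):
    print repr('%s %s' % (att_idx, 1))   # prints '0 1', '1 1', '2 1'

enumerate() always yields int indexes, so '%s' % att_idx cannot acquire an embedded space.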
Please show us the code that you actually ran.
Update:
And again, this code won't run:
for idx,att_name in enumerate(header):
    arff_str.write("#attribute '%s' %s\n" % (name,attributes[att_name]))
because name is not defined; you probably mean att_name.
Perhaps we can short-circuit all this stuffing about: post a copy of your output file (zipped if it's huge) on the web somewhere so that we can see for ourselves what might be disturbing its consumers. Please do edit your question to say which line(s) exhibit(s) the problem.
By the way, you say some of the data is string rather than integer, and the problem goes away if you coerce the data to int by doing features[name] = int(value) ... what is 'features'?? What is 'name'??
Are any of those strings unicode instead of str?
Update 2 (after bad file posted on net)
No info was supplied on which line(s) exhibit(s) the problem. As it turned out, no lines exhibited the described problem with attribute 479. I wrote this checking script:
import re, sys
# sample data line:
# {40 1,101 3,319 2,375 2,525 2,530 bug}
# Looks like all data lines end in ",530 bug}" or ",530 other}"
pattern1 = r"\{(?:\d+ \d+,)*\d+ \w+\}$"
matcher1 = re.compile(pattern1).match
pattern2 = r"\{(?:\d+ \d+,)*"
matcher2 = re.compile(pattern2).match
bad_atts = re.compile(r"\D\d+\s+\W").findall
got_data = False
for lino, line in enumerate(open(sys.argv[1], "r"), 1):
    if not got_data:
        got_data = line.startswith('#data')
        continue
    if not matcher1(line):
        print
        print lino, repr(line)
        m = matcher2(line)
        if m:
            print "OK up to offset", m.end()
        print bad_atts(line)
Sample output (wrapped at column 80):
581 '{2 1,7 1,9 1,12 1,13 1,14 1,15 1,16 1,17 1,18 1,21 1,22 1,24 1,25 1,26 1,27
1,29 1,32 1,33 1,36 1,39 1,40 1,44 1,48 1,49 1,50 1,54 1,57 1,58 1,60 1,67 1,68
1,69 1,71 1,74 1,75 1,76 1,77 1,80 1,88 1,93 1,101 ,103 6,104 2,109 20,110 3,11
2 2,114 1,119 17,120 4,124 39,128 5,137 1,138 1,139 1,162 1,168 1,172 18,175 1,1
76 6,179 1,180 1,181 2,185 2,187 9,188 8,190 1,193 1,195 2,196 4,197 1,199 3,201
3,202 4,203 5,206 1,207 2,208 1,210 2,211 1,212 5,213 1,215 2,216 3,218 2,220 2
,221 3,225 8,226 1,233 1,241 4,242 1,248 5,254 2,255 1,257 4,258 4,260 1,266 1,2
68 1,269 3,270 2,271 5,273 1,276 1,277 1,280 1,282 1,283 11,285 1,288 1,289 1,29
6 8,298 1,299 1,303 1,304 11,306 5,308 1,309 8,310 1,315 3,316 1,319 11,320 5,32
1 11,322 2,329 1,342 2,345 1,349 1,353 2,355 2,358 3,359 1,362 1,367 2,368 1,369
1,373 2,375 9,377 1,381 4,382 1,383 3,387 1,388 5,395 2,397 2,400 1,401 7,407 2
,412 1,416 1,419 2,421 2,422 1,425 2,427 1,431 1,433 7,434 1,435 1,436 2,440 1,4
49 1,454 2,455 1,460 3,461 1,463 1,467 1,470 1,471 2,472 7,477 2,478 11,479 31,4
82 6,485 7,487 1,490 2,492 16,494 2,495 1,497 1,499 1,501 1,502 1,503 1,504 11,5
06 3,510 2,515 1,516 2,517 3,518 1,522 4,523 2,524 1,525 4,527 2,528 7,529 3,530
bug}\n'
OK up to offset 203
[',101 ,']
709 '{101 ,124 2,184 1,188 1,333 1,492 3,500 4,530 bug}\n'
OK up to offset 1
['{101 ,']
So it looks like the attribute with att_idx == 101 can sometimes contain the empty string ''. You need to sort out how this attribute is to be treated. It would help your thinking if you unwound this Byzantine code:
if ((not att_type == constants.ARRF_NUMERIC)
        or not ((value == None) or value == 0)):
Aside: that "expletive deleted" code won't run; it should be ARFF, not ARRF
into:
if value or att_type != constants.ARFF_NUMERIC:
or maybe just if value: which will filter out all of None, 0, and "". Note that att_idx == 101 corresponds to the attribute "priority" which is given a STRING type in the ARFF file header:
[line 103] #attribute 'priority' STRING
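To see the mechanism, here is a minimal sketch (values hypothetical) of what the question's row.append() formatting produces when a STRING attribute holds the empty string:

att_idx, value = 101, ''              # value never coerced, left as ''
pair = '%s %s' % (att_idx, value)     # the question's formatting expression
print repr(pair)                      # '101 ' -- index, a space, then nothing

A consumer that expects "index value" pairs then sees a dangling index, which matches the bad attributes found by the checking script above.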
By the way, your statement about features[name] = int(value) "fixing" the problem is very suspicious; int("") raises an exception.
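For instance:

>>> int("")
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: ''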
It may help you to read the warning at the end of this wiki section about sparse ARFF files.

Related

String index out of range when reading data

For this problem, I have a separate txt file which contains the list of values below:
Years+1900 Populationx106
0 1650
10 1750
20 1860
30 2070
40 2300
50 2560
60 3040
70 3710
80 4450
90 5280
100 6080
110 6870
For the problem I'm working on, I'm supposed to obtain that file and path name and then use them to do calculations with some functions I created. I have finished the functions I need, but I'm having an issue running the code: I believe the function reads the "Years+1900 Populationx106" header first instead of the numbers below it.
Here's the code for my functions:
# Input: year
# Output: estimate of population for that year
def pop(year):
    return 1436.53*((1.01395)**year)

# Input: data
# Return: the average error as per equation 18.
def error(data):
    error=0
    for i in data:
        error +=(abs(i[1]-pop(i[0]))/i[1])
    return 100*error/12
Here is the code I created to retrieve the data from my separate txt file:
def get_data(path,name):
    with open("Assignment7/pop.txt", "r") as path:
        path = open("Assignment7/pop.txt", "r")
        name = path.read()
    return name
The error I'm receiving is for the line below. It is an IndexError saying the string index is out of range. I believe this is because it is reading the first part of the data in pop.txt. How can I skip the first line of pop.txt so that it only reads the numerical values?
error +=(abs(i[1]-pop(i[0]))/i[1])
I have tried changing the index values already, however it still says that my string index is out of range.
Let's assume you are correct and passing the first line of your text file to your function is breaking it.
You can "throw away" the first line of the text file by reading it as a single line (but doing nothing with it) and then reading the data you actually want like this..
def get_data(path,name):
    with open("Assignment7/pop.txt", "r") as path:
        header = path.readline()  # read the "header" line, but don't use it
        name = path.read()        # read the subsequent lines as the data you want
    return name
I suspect that you've simply read the entire file as one string, so each element, i, is a single character and has no dimensionality. You'll need to parse the file on the newline character to split it into lines (and likely split again to get the two separate columns).
Python String Split will be useful for that.
You’re correct that the first line will pose issues, but this can be removed by using a path.readline() call as Richard said.
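Putting both suggestions together, here is a minimal sketch (using the path hardcoded in the question) that skips the header and converts each row to numbers, so error() receives pairs it can do arithmetic on:

def get_data(path):
    data = []
    with open(path, "r") as f:
        f.readline()                 # discard the "Years+1900 ..." header
        for line in f:
            parts = line.split()     # e.g. ['0', '1650']
            if parts:
                data.append((float(parts[0]), float(parts[1])))
    return data

data = get_data("Assignment7/pop.txt")
print(error(data))                   # i[0] and i[1] are now numbers, not characters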

Index out of bounds for a 2D Array?

I have a 2D list and I am trying to retrieve a column with the index specified as a parameter (type: IntEnum).
I get the index out of bounds error when trying to retrieve any column other than the one at index 0.
Enum:
class Column(IntEnum):
    ROAD = 0
    SECTION = 1
    FROM = 2
    TO = 3
    TIMESTAMP = 4
    VFLOW = 5

class TrafficData:
    data = [[]]
Below are member methods of TrafficData
Reading from file and storing the matrix:
def __init__(self,file):
    self.data = [[word for word in line.split('\t')] for line in file.readlines()[1:]]
Retrieving the desired column:
def getColumn(self,columnName):
    return [line[columnName] for line in self.data]
Call:
column1 = traficdata.getColumn(columnName=Column.ROAD)
column2 = traficdata.getColumn(columnName=Column.FROM)  # error
column3 = traficdata.getColumn(columnName=Column.TO)    # error
I attached a picture with the data after __init__ processing.
I tested the code that you provided above, and didn't see any issues. That leads me to believe that there might be something wrong with the data that you have in the file. Could you paste the file data? (the tab delimited data)
UPDATE -
I found the issue - as suspected, it was a data issue (there is a minor code update involved too). Make the following changes -
1) When opening the file, use the appropriate encoding; I used utf-16.
2) The end of the data file that you shared contains the text "(72413 row(s) affected)" along with a couple of newline characters. So you have 2 options: either manually clean up the data file, or update the code to ignore the "(72413 row(s) affected)" and "\n" characters.
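A sketch of both changes together (the file name here is hypothetical; the trailer text is the one quoted above):

with open('traffic_data.txt', encoding='utf-16') as f:   # change 1: correct encoding
    lines = f.readlines()[1:]                            # skip the header row

data = [line.rstrip('\n').split('\t')                    # change 2: drop trailer junk
        for line in lines
        if line.strip() and 'row(s) affected' not in line]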
Hope that helps.

Python - Reading a CSV, won't print the contents of the last column

I'm pretty new to Python, and put together a script to parse a csv and ultimately output its data into a repeated html table.
I got most of it working, but there's one weird problem I haven't been able to fix. My script will find the index of the last column, but won't print out the data in that column. If I add another column to the end, even an empty one, it'll print out the data in the formerly-last column - so it's not a problem with the contents of that column.
Abridged (but still grumpy) version of the code:
import os
os.chdir('C:\\Python34\\andrea')
import csv

csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)
if 'phone' in tableHeader:
    phoneIndex = tableHeader.index('phone')
else:
    phoneIndex = -1
for row in exampleReader:
    row[-1] =''
    print(phoneIndex)
    print(row[phoneIndex])
csvOpen.close()
my.csv
stuff,phone
1,3235556177
1,3235556170
Output
1
1
Same script, small change to the CSV file:
my.csv
stuff,phone,more
1,3235556177,
1,3235556170,
Output
1
3235556177
1
3235556170
I'm using Python 3.4.3 via Idle 3.4.3
I've had the same problem with CSVs generated directly by mysql, ones that I've opened in Excel first then re-saved as CSVs, and ones I've edited in Notepad++ and re-saved as CSVs.
I tried adding several different modes to the open function (r, rU, b, etc.) and either it made no difference or gave me an error (for example, it didn't like 'b').
My workaround is just to add an extra column to the end, but since this is a frequently used script, it'd be much better if it just worked right.
Thank you in advance for your help.
row[-1] =''
The CSV reader returns to you a list representing the row from the file. On this line you set the last value in the list to an empty string. Then you print it afterwards. Delete this line if you don't want the last column to be set to an empty string.
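With that line removed, the loop simply becomes:

for row in exampleReader:
    print(phoneIndex)
    print(row[phoneIndex])   # the last column now survives intact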
If you know it is the last column, you can count the columns and use that value minus 1. Likewise, you can use your string-comparison method if you know the header will always be "phone"; in that case I recommend converting the value from the csv to lower case so that you don't have to worry about capitalization.
In my code below I created functions that show how to use either method.
import os
import csv

os.chdir('C:\\temp')
csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)
phoneColIndex = None  # init to a value that can imply state
lastColIndex = None   # init to a value that can imply state

def getPhoneIndex(header):
    for i, col in enumerate(header):  # use this syntax to get index of item
        if col.lower() == 'phone':
            return i
    return -1  # send back invalid index

def findLastColIndex(header):
    return len(header) - 1

# Methods to check for the phone col: 1. by string comparison
# and 2. by assuming it's the last col.
if len(tableHeader) > 1:  # if only one column or less, why go any further?
    phoneColIndex = getPhoneIndex(tableHeader)
    lastColIndex = findLastColIndex(tableHeader)
    for row in exampleReader:
        print(row[phoneColIndex])
        print('----------')
        print(row[lastColIndex])
        print('----------')
csvOpen.close()

Python: How to extract string from text file to use as data

This is my first time writing a Python script, and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information:
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these three atoms. So basically I want to sum up all of the x values in that txt file with each number multiplied by a given value depending on the type of atom.
I know I need to define the positions for each x-value, but I'm having trouble figuring out how to represent these x-values as numbers instead of text from a string. I have to keep in mind that I'll need to multiply these numbers by a factor depending on the type of atom, so I need a way to keep them associated with each atom type. Can anyone push me in the right direction?
mass_dictionary = {'C':12.0107,
                   'O':15.999
                   #Others...?
                   }

# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6,7,8]
type_idx = 9

# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt",'r')
lines = f_open.readlines()
f_open.close()

# Initialize an array to hold needed intermediate data.
output_coms = []
total_mass = 0.0

# Loop through the lines of the file.
for line in lines:
    # Split the line on white space.
    line_stuff = line.split()
    # If the line is empty or fails to start with 'ATOM', skip it.
    if (not line_stuff) or (not line_stuff[0]=='ATOM'):
        pass
    # Otherwise, append the mass-weighted coordinates to a list and increment total mass.
    else:
        output_coms.append([mass_dictionary[line_stuff[type_idx]]*float(line_stuff[i]) for i in coords_idxs])
        total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]

# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map(lambda x: (1.0/total_mass)*sum(x),
                                [[elem[i] for elem in output_coms] for i in [0,1,2]]))

# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.
Basically, using the open function in Python you can open any file. So you can do something as follows (the following snippet is not a solution to the whole problem, but an approach):
def read_file():
    f = open("filename", 'r')
    for line in f:
        line_list = line.split()
        ....
        ....
    f.close()
From this point on you have a nice setup for whatever you want to do with these values. Basically, the second line just opens the file for reading. The third line defines a for loop that reads the file one line at a time, with each line going into the line variable.
The last line in that snippet breaks the string, at every whitespace, into a list. So line_list[0] will be the value in your first column, and so forth. From this point, if you have any programming experience, you can just use if statements and such to get the logic that you want.
Also keep in mind that the values stored in that list will all be strings, so if you want to perform any arithmetic operations, such as adding, you have to be careful.
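For example, with one of the lines from the question's file:

line = "ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C"
line_list = line.split()
x = float(line_list[6])   # 26.395 as a number, not the string '26.395'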
* Edited for syntax correction
If you have pandas installed, checkout the read_fwf function that imports a fixed-width file and creates a DataFrame (2-d tabular data structure). It'll save you lines of code on import and also give you a lot of data munging functionality if you want to do any additional data manipulations.
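A minimal sketch of that idea (the column names here are made up for illustration):

import pandas as pd

# read_fwf infers the fixed-width columns; skip the "x y z ..." header line
df = pd.read_fwf('Test.txt', skiprows=1, header=None,
                 names=['record', 'serial', 'name', 'residue', 'chain',
                        'resno', 'x', 'y', 'z', 'element'])
print(df[['x', 'y', 'z']].mean())   # unweighted centroid, as a quick sanity check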

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of Excel files that are formatted strangely (and not consistently), and I need to extract certain fields from each entry. An example data set is shown in the attached image.
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years

#import the data csv
import sys
import re
import csv

def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x

def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district

def splitforstos(string):
    for itemind,item in enumerate(string): # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])

def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
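For instance (field names taken from the data list above; the class name is just an example):

from collections import namedtuple

LevyRecord = namedtuple('LevyRecord',
    ['filename', 'county', 'district', 'vote1', 'vote2', 'mills',
     'votetype', 'numyears', 'yearcom', 'yeardue', 'reason'])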
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
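A hypothetical matcher in that style, reusing the long-format logic from the question and the LevyRecord sketched above:

import re

def match_long_form(filename, county, x):
    # Return a populated record when the row fits the long format,
    # or None so the next pattern function gets a turn.
    if len(x) <= 8:
        return None
    spaceindex = [m.start() for m in re.finditer(' ', x[2])][-1]
    return LevyRecord(filename, county, x[0], x[1],
                      x[2][:spaceindex], x[2][spaceindex + 1:],
                      x[4], x[6], x[8], x[10], x[11])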
