Related
I am reading a cfg file, and receive a dictionary for each section. So, for example:
Config-File:
[General]
parameter1="Param1"
parameter2="Param2"
[FileList]
file001="file1.txt"
file002="file2.txt" ......
I have the FileList section stored in a dictionary called section. In this example, I can access "file1.txt" as test = section["file001"], so test == "file1.txt". To access every file of FileList one after the other, I could try the following:
for i in range(1, (number_of_files + 1)):
access_key = str("file_00" + str(i))
print(section[access_key])
This is my current solution, but I don't like it at all. First of all, it looks kind of messy in python, but I will also face problems when more than 9 files are listed in the config.
I could also do it like:
for i in range(1, (number_of_files + 1)):
if (i <= 9):
access_key = str("file_00" + str(i))
elif (i > 9 and i < 100):
access_key = str("file_0" + str(i))
print(section[access_key])
But I don't want to start with that because it becomes even worse. So my question is: What would be a proper and relatively clean way to go through all the file names in order? I definitely need the loop because I need to perform some actions with every file.
Use zero padding to generate the file number (for e.g. see this SO question answer: https://stackoverflow.com/a/339013/3775361). That way you don’t have to write the logic of moving through digit rollover yourself—you can use built-in Python functionality to do it for you. If you’re using Python 3 I’d also recommend you try out f-strings (one of the suggested solutions at the link above). They’re awesome!
If we can assume the file number has three digits, then you can do the followings to achieve zero padding. All of the below returns "015".
i = 15
str(i).zfill(3)
# or
"%03d" % i
# or
"{:0>3}".format(i)
# or
f"{i:0>3}"
Start by looking at the keys you actually have instead of guessing what they might be. You need to filter out the ones that match your pattern, and sort according to the numerical portion.
keys = [key for key in section.keys() if key.startswith('file') and key[4:].isdigit()]
You can add additional conditions, like len(key) > 4, or drop the conditions entirely. You might also consider learning regular expressions to make the checking more elegant.
To sort the names without having to account for padding, you can do something like
keys = sorted(keys, key=lambda s: int(s[4:]))
You can also try a library like natsort, which will handle the custom sort key much more generally.
Now you can iterate over the keys and do whatever you want:
for key in sorted((k for k in section if k.startswith('file') and k[4:].isdigit()), key=lambda s: int(s[4:])):
print(section[key])
Here is what a solution equipt with re and natsort might look like:
import re
from natsort import natsorted
pattern = re.compile(r'file\d+')
for key in natsorted(k for k in section if pattern.fullmatch(k)):
print(section[key])
I am a biologist that is just trying to use python to automate a ton of calculations, so I have very little experience.
I have a very large array that contains values that are formatted into two columns of observations. Sometimes the observations will be the same between the columns:
v1,v2
x,y
a,b
a,a
x,x
In order to save time and effort I wanted to make an if statement that just prints 0 if the two columns are the same and then moves on. If the values are the same there is no need to run those instances through the downstream analyses.
This is what I have so far just to test out the if statement. It has yet to recognize any instances where the columns are equivalen.
Script:
mylines=[]
with open('xxxx','r') as myfile:
for myline in myfile:
mylines.append(myline) ##reads the data into the two column format mentioned above
rang=len(open ('xxxxx,'r').readlines( )) ##returns the number or lines in the file
for x in range(1, rang):
li = mylines[x] ##selected row as defined by x and the number of lines in the file
spit = li.split(',',2) ##splits the selected values so they can be accessed seperately
print(spit[0]) ##first value
print(spit[1]) ##second value
if spit[0] == spit[1]:
print(0)
else:
print('Issue')
Output:
192Alhe52
192Alhe52
Issue ##should be 0
188Alhe48
192Alhe52
Issue
191Alhe51
192Alhe52
Issue
How do I get python to recgonize that certain observations are actually equal?
When you read the values and store them in the array, you can be storing '\n' as well, which is a break line character, so your array actually looks like this
print(mylist)
['x,y\n', 'a,b\n', 'a,a\n', 'x,x\n']
To work around this issue, you have to use strip(), which will remove this character and occasional blank spaces in the end of the string that would also affect the comparison
mylines.append(myline.strip())
You shouldn't use rang=len(open ('xxxxx,'r').readlines( )), because you are reading the file again
rang=len(mylines)
There is a more readable, pythonic way to replicate your for
for li in mylines[1:]:
spit = li.split(',')
if spit[0] == spit[1]:
print(0)
else:
print('Issue')
Or even
for spit.split(',') in mylines[1:]:
if spit[0] == spit[1]:
print(0)
else:
print('Issue')
will iterate on the array mylines, starting from the first element.
Also, if you're interested in python packages, you should have a look at pandas. Assuming you have a csv file:
import pandas as pd
df = pd.read_csv('xxxx')
for i, elements in df.iterrows():
if elements['v1'] == elements['v2']:
print('Equal')
else:
print('Different')
will do the trick. If you need to modify values and write another file
df.to_csv('nameYouWant')
For one, your issue with the equals test might be because iterating over lines like this also yields the newline character. There is a string function that can get rid of that, .strip(). Also, your argument to split is 2, which splits your row into three groups - but that probably doesn't show here. You can avoid having to parse it yourself when using the csv module, as your file presumably is that:
import csv
with open("yourfile.txt") as file:
reader = csv.reader(file)
next(reader) # skip header
for first, second in reader:
print(first)
print(second)
if first == second:
print(0)
else:
print("Issue")
I'm pretty new to Python, and put together a script to parse a csv and ultimately output its data into a repeated html table.
I got most of it working, but there's one weird problem I haven't been able to fix. My script will find the index of the last column, but won't print out the data in that column. If I add another column to the end, even an empty one, it'll print out the data in the formerly-last column - so it's not a problem with the contents of that column.
Abridged (but still grumpy) version of the code:
import os
os.chdir('C:\\Python34\\andrea')
import csv
csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)
if 'phone' in tableHeader:
phoneIndex = tableHeader.index('phone')
else:
phoneIndex = -1
for row in exampleReader:
row[-1] =''
print(phoneIndex)
print(row[phoneIndex])
csvOpen.close()
my.csv
stuff,phone
1,3235556177
1,3235556170
Output
1
1
Same script, small change to the CSV file:
my.csv
stuff,phone,more
1,3235556177,
1,3235556170,
Output
1
3235556177
1
3235556170
I'm using Python 3.4.3 via Idle 3.4.3
I've had the same problem with CSVs generated directly by mysql, ones that I've opened in Excel first then re-saved as CSVs, and ones I've edited in Notepad++ and re-saved as CSVs.
I tried adding several different modes to the open function (r, rU, b, etc.) and either it made no difference or gave me an error (for example, it didn't like 'b').
My workaround is just to add an extra column to the end, but since this is a frequently used script, it'd be much better if it just worked right.
Thank you in advance for your help.
row[-1] =''
The CSV reader returns to you a list representing the row from the file. On this line you set the last value in the list to an empty string. Then you print it afterwards. Delete this line if you don't want the last column to be set to an empty string.
If you know it is the last column, you can count them and then use that value minus 1. Likewise you can use your string comparison method if you know it will always be "phone". I recommend if you are using the string compare, convert the value from the csv to lower case so that you don't have to worry about capitalization.
In my code below I created functions that show how to use either method.
import os
import csv
os.chdir('C:\\temp')
csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)
phoneColIndex = None;#init to a value that can imply state
lastColIndex = None;#init to a value that can imply state
def getPhoneIndex(header):
for i, col in enumerate(header): #use this syntax to get index of item
if col.lower() == 'phone':
return i;
return -1; #send back invalid index
def findLastColIndex(header):
return len(tableHeader) - 1;
## methods to check for phone col. 1. by string comparison
#and 2. by assuming it's the last col.
if len(tableHeader) > 1:# if only one row or less, why go any further?
phoneColIndex = getPhoneIndex(tableHeader);
lastColIndex = findLastColIndex(tableHeader)
for row in exampleReader:
print(row[phoneColIndex])
print('----------')
print(row[lastColIndex])
print('----------')
csvOpen.close()
I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
toggle=False
for i,j in enumerate(x):
if j=="\"":
toggle=not toggle
if toggle==True:
if j==",":
x=x[:i]+" "+x[i+1:]
return x
def districtatize(x):
#list indexes of entries starting with "for" or "to" of length >5
indices=[1]
for i,j in enumerate(x):
if len(j)>2:
if j[:2]=="to":
indices.append(i)
if len(j)>3:
if j[:3]==" to" or j[:3]=="for":
indices.append(i)
if len(j)>5:
if j[:5]==" \"for" or j[:5]==" \'for":
indices.append(i)
if len(j)>4:
if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
indices.append(i)
if len(indices)==1:
return [x[0],x[1:len(x)-1]]
new=[x[0],x[1:indices[1]+1]]
z=1
while z<len(indices)-1:
new.append(x[indices[z]+1:indices[z+1]+1])
z+=1
return new
#should return a list of lists. First entry will be county
#each successive element in list will be list by district
def splitforstos(string):
for itemind,item in enumerate(string): # take all exception cases that didn't get processed
splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
if len(splitfor)>1:
print "\n\n\nfor detected\n\n"
string.remove(item)
string.insert(itemind,splitfor[0])
string.insert(itemind+1,splitfor[1])
elif len(splitto)>1:
print "\n\n\nto detected\n\n"
string.remove(item)
string.insert(itemind,splitto[0])
string.insert(itemind+1,splitto[1])
def analyze(x):
#input should be a string of content
#target values are nomills,levytype,term,yearcom,yeardue
clean=cleancommas(x)
countylist=clean.split(',')
emptystrip=filter(lambda a: a != '',countylist)
empt2strip=filter(lambda a: a != ' ', emptystrip)
singstrip=filter(lambda a: a != '\' \'',empt2strip)
quotestrip=filter(lambda a: a !='\" \"',singstrip)
splitforstos(quotestrip)
distd=districtatize(quotestrip)
print '\n\ndistrictized\n\n',distd
county = distd[0]
for x in distd[1:]:
if len(x)>8:
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
else:
print "x\n\n",x
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
special=x[5]
splitspec=special.split(' ')
try:
forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
numyears=splitspec[forind+1]
yearcom=splitspec[forind+6]
except:
forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
numyears=None
yearcom=splitspec[forind+2]
yeardue=str(x[6])[-4:]
reason=x[7]
data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
print "data other", data
openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
try:
data = contents[y:separators[x+1]]
except:
data = contents[y:]
analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
district= x[0],
vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
match= p(data)
if match:
return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
Related to a previous question, I'm trying to do replacements over a number of large CSV files.
The column order (and contents) change between files, but for each file there are about 10 columns that I want and can identify by the column header names. I also have 1-2 dictionaries for each column I want. So for the columns I want, I want to use only the correct dictionaries and want to implement them sequentially.
An example of how I've tried to solve this:
# -*- coding: utf-8 -*-
import re
# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]
# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}
# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1':[D1a,D1b],u'col2':[D2]}
# clunky replace function
def freplace(str, dict):
rc = re.compile('|'.join(re.escape(k) for k in dict))
def trans(m):
return dict[m.group(0)]
return rc.sub(trans, str)
# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
if Header[i] in refD:
C.append([refD[Header[i]],i])
# loop through lines and make replacements
for line in fileLines:
for i in range(len(line)):
for j in range(len(C)):
if C[j][1] == i:
for dict in C[j][0]:
line[i] = freplace(line[i], dict)
My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess was that my freplace function is largely what is slowing things down, because it has to compile for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but don't know how to do that and still preserve what the rest of my code is doing.
There's a ton of things that you can do to speed this up:
First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
You don't need re:
# -*- coding: utf-8 -*-
# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]
# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}
# dict to correspond header names with the correct dictionary
refD = {u'col1':[D1a,D1b],u'col2':[D2]}
# now let's have some fun...
for line in fileLines:
for i, (param, word) in enumerate(zip(Header, line)):
for minitranslator in refD[param]:
if word in minitranslator:
line[i] = minitranslator[word]
returns:
[[u'a', u'x'], [u'b', u'y']]
So if that's the case, and all 10 columns have the same names each time, but out of order, (I'm not sure if this is what you're doing up there, but here goes) keep one array for the heading names, and one for each column split into elements (should be 10 items each line), now just offset which regex by doing a case/select combo, compare the element number of your header array, then inside the case, reference the data array at the same offset, since the name is what will get to the right case you should be able to use the same 10 regex's repeatedly, and not have to recompile a new "command" each time.
I hope that makes sense. I'm sorry i don't know the syntax to help you out, but I hope my idea is what you're looking for
EDIT:
I.E.
initialize all regexes before starting your loops.
then after you read a line (and after the header line)
select array[n]
case "column1"
regex(data[0]);
case "column2"
regex(data[1]);
.
.
.
.
end select
This should call the right regex for the right columns