Collecting data first in python to conduct operations - python

I was given the following problem where I had to match the logdata and the expected_result. The code is as follows, edited with my solution and comments containing feedback I received:
import collections
log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""
expected_result = {
"111-222-333": "40.00%",
"454-333-222": "66.67%",
"111-333" : "0.00%"
}
def compute_success_ratio(logdata):
#! better option to use .splitlines()
#! or even better recognize the CSV structure and use csv.reader
entries = logdata.split('\n')
#! interesting choice to collect the data first
#! which could result in explosive growth of memory hunger, are there
#! alternatives to this structure?
complst = []
faillst = []
#! probably no need for attaching `lst` to the variable name, no?
for entry in entries:
#! variable naming could be clearer here
#! a good way might involve destructuring the entry like:
#! _, caller, callee, result
#! which also avoids using magic indices further down (-1, 1, 2)
ent = entry.split(',')
if ent[-1] == 'COMPLETED':
#! complst.extend(ent[1:3]) for even more brevity
complst.append(ent[1])
complst.append(ent[2])
elif ent[-1] == 'FAILED':
faillst.append(ent[1])
faillst.append(ent[2])
#! variable postfix `lst` could let us falsely assume that the result of set()
#! is a list.
numlst = set(complst + faillst)
#! good use of collections.Counter,
#! but: Counter() already is a dictionary, there is no need to convert it to one
comps = dict(collections.Counter(complst))
fails = dict(collections.Counter(faillst))
#! variable naming overlaps with global, and doesn't make sense in this context
expected_result = {}
for e in numlst:
#! good: dealt with possibility of a number not showing up in `comps` or `fails`
#! bad: using a try/except block to deal with this when a simpler .get("e", 0)
#! would've allowed dealing with this more elegantly
try:
#! variable naming not very expressive
rat = float(comps[e]) / float(comps[e] + fails[e]) * 100
perc = round(rat, 2)
#! here we are rounding twice, and then don't use the formatting string
#! to attach the % -- '{:.2f}%'.format(perc) would've been the right
#! way if one doesn't know percentage formatting (see below)
expected_result[e] = "{:.2f}".format(perc) + '%'
#! a generally better way would be to either
#! from __future__ import division
#! or to compute the ratio as
#! ratio = float(comps[e]) / (comps[e] + fails[e])
#! and then use percentage formatting for the ratio
#! "{:.2%}".format(ratio)
except KeyError:
expected_result[e] = '0.00%'
return expected_result
if __name__ == "__main__":
assert(compute_success_ratio(log_data) == expected_result)
#! overall
#! + correct
#! ~ implementation not optimal, relatively wasteful in terms of memory
#! - variable naming inconsistent, overly shortened, not expressive
#! - some redundant operations
#! + good use of standard library collections.Counter
#! ~ code could be a tad bit more idiomatic
I have understood some of the problems such as variable naming conventions and avoiding the try block section as much as possible.
However, I fail to understand how using csv.reader improves the code. Also, how am I supposed to understand the comment about collecting the data first? What could the alternatives be? Could anybody throw some light on these two issues?

When you do entries = logdata.split('\n') you will create a list with the split strings. Since log files can be quite large, this can consume a large amount of memory.
The way that csv.reader works is that it will open the file and only read one line at a time (approximately). This means that the data remains in the file and only one row is ever in memory.
Forgetting about the csv parsing for a minute, the issue is illustrated by the difference between these approaches:
In approach 1 we read the whole file into memory:
data = open('logfile').read().split('\n')
for line in data:
# do something with the line
In approach 2 we read one line at a time:
data = open('logfile')
for line in data:
# do something with the line
Approach 1 will consume more memory as the whole file needs to be read into memory. It also traverses the data twice - once when we read it and once to split into lines. The downside of approach 2 is that we can only do one loop through data.
For the particular case here, where we're not reading from a file but rather from a variable which is already in memory, the big difference will be that we will consume about twice as much memory by using the split approach.

split('\n') and splitlines will create a copy of your data where each line is is separate item in the list. Since you only need to pass over the data once instead of randomly accessing lines this is wasteful compared to CSV reader which could return you one line at the time. The other benefit of using the reader is that you wouldn't have to split the data to lines and lines to columns manually.
The comment about data collection refers the fact that you add all the completed and failed items to two lists. Let's say that item 111-333 completes five times and fails twice. Your data would look something like this:
complst = ['111-333', '111-333', '111-333', '111-333', '111-333']
faillst = ['111-333', '111-333']
You don't need those repeating items so you could have used Counter directly without collecting items to the lists and save a lot of memory.
Here's an alternative implementation that uses csv.reader and collects success & failure counts to a dict where item name is key and value is list [success count, failure count]:
from collections import defaultdict
import csv
from io import StringIO
log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""
RESULT_STRINGS = ['COMPLETED', 'FAILED']
counts = defaultdict(lambda: [0, 0])
for _, *params, result in csv.reader(StringIO(log_data)):
try:
index = RESULT_STRINGS.index(result)
for param in params:
counts[param][index] += 1
except ValueError:
pass # Skip line in case last column is not in RESULT_STRINGS
result = {k: '{0:.2f}%'.format(v[0] / sum(v) * 100) for k, v in counts.items()}
Note that above will work only on Python 3.

Alternatively, Pandas looks a good solution for this purpose if you are OK with using it.
import pandas as pd
log_data = pd.read_csv('data.csv',header=None)
log_data.columns = ['date', 'key1','key2','outcome']
meltedData = pd.melt(log_data, id_vars=['date','outcome'], value_vars=['key1','key2'],
value_name = 'key') # we transpose the keys here
meltedData['result'] = [int(x.lower() == 'completed') for x in meltedData['outcome']] # add summary variable
groupedData = meltedData.groupby(['key'])['result'].mean()
groupedDict = groupedData.to_dict()
print groupedDict
Result:
{'111-333': 0.0, '111-222-333': 0.40000000000000002, '454-333-222': 0.66666666666666663}

Related

How can I increase the amount of array iterated during the run-time of script?

My script cleans arrays from the unwanted string like "##$!" and other stuff.
The script works as intended but the speed of it is extremely slow when the excel row size is big.
I tried to use numpy if it could speed it up but I'm not too familiar with is so I might be using it incorrectly.
xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)
def replace(orignstr): # removes the unwanted string from numbers
for elem in badstr:
if elem in orignstr:
orignstr = orignstr.replace(elem, '')
return orignstr
for UncleanNum in tqdm(TeleNum):
newnum = replace(str(UncleanNum)) # calling replace function
df['telephone'] = df['telephone'].replace(UncleanNum, newnum) # store string back in data frame
I also tried removing the method to if that would help and just place it as one block of code but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
orignstr = str(UncleanNum)
for elem in badstr:
if elem in orignstr:
orignstr = orignstr.replace(elem, '')
print(orignstr)
df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script running an excel file of 200,000 is around 70it/s and take around an hour to finish. Which is not that good since this is just one function of many.
I'm not too advanced in python. I'm just learning as I script so if you have any pointer it would be appreciated.
Edit:
Most of the array elements Im dealing with are numbers but some have string in them. I trying to remove all string in the array element.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings you can directly use re.sub like this:
import re
string = "FD3459002912"
regex_result = re.sub("\D", "", string)
print(regex_result) # 3459002912

How to iterate over two files effectively (25000+ Lines)

So, I am trying to make a combined list inside of Python for matching data of about 25,000 lines.
The first list data came from file mac.uid and looks like this
Mac|ID
The second list data came serial.uid and looks like this:
Serial|Mac
Mac from list 1 must equal the Mac from list 2 before it's joined.
This is what I am currently doing, I believe there is too much repetition going on.
combined = [];
def combineData():
lines = open('mac.uid', 'r+')
for line in lines:
with open('serial.uid', 'r+') as serial:
for each in serial:
a, b = line.strip().split('|')
a = a.lower()
x, y = each.strip().split('|')
y = y.lower()
if a == y:
combined.append(a+""+b+""+x)
The final list is supposed to look like this:
Mac(List1), ID(List1), Serial(List2)
So that I can import it into an excel sheet.
Thanks for any and all help!
Instead of your nested loops (which cause quadratic complexity) you should use dictionaries which will give you roughly O(n log(n)) complexity. To do so, first read serial.uid once and store the mapping of MAC addresses to serial numbers in a dict.
serial = dict()
with open('serial.uid') as istr:
for line in istr:
(ser, mac) = split_fields(line)
serial[mac] = ser
Then you can close the file again and process mac.uid looking up the serial number for each MAC address in the dictionary you've just created.
combined = list()
with open('mac.uid') as istr:
for line in istr:
(mac, uid) = split_fields(line)
combined.append((mac, uid, serial[mac]))
Note that I've changed combined from a list of strings to a list of tuples. I've also factored the splitting logic out into a separate function. (You'll have to put its definition before its use, of course.)
def split_fields(line):
return line.strip().lower().split('|')
Finally, I recommend that you start using more descriptive names for your variables.
For files of 25k lines, you should have no issues storing everything in memory. If your data sets become too large for that, you might want to start looking into using a database. Note that the Python standard library ships with an SQLite module.

Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts of with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The sort of task I want to solve would be to be able to extract the most recent n lines of data for a given ticker symbol with respect to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(filename, 'r') as f:
for line in f:
match = checkMatch(symbol, l, line)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and match:
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and not match:
break
return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
dataset = getData(symbol)
if not changeRateTransform:
column = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
out = False
if line[:symbolLength+1] == symbol + ",":
out = True
return out
def formatLineData(lineData):
out = [lineData[0]]
out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
out += [float(d) for d in lineData[2:6]]
out += [int(float(d)) for d in lineData[6:9]]
out += [float(d) for d in lineData[9:13]]
out.append(int(float(lineData[13])))
return out
Does anyone have any insight on what parts of my code run slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(database, 'r') as f:
databaseReader = csv.reader(f, delimiter=",")
for row in databaseReader:
match = (row[0] == symbol)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(row))
elif not beforeMatch and match:
out.append(formatLineData(row))
elif not beforeMatch and not match:
break
return out
def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
if not changeRateTransform:
out = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because a 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap
def iter_symbol_lines(f, symbol):
# How to recognize the start of a line of interest
ident = b'"' + symbol.encode() + b'",'
# The memory-mapped file
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Skip the header
mm.readline()
# The inclusive lower bound of the byte range we're still interested in
lo = mm.tell()
# The exclusive upper bound of the byte range we're still interested in
hi = mm.size()
# As long as the range isn't empty
while lo < hi:
# Find the position of the beginning of a line near the middle of the range
mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
# Go to that position
mm.seek(mid)
# Is it a line that comes before lines we're interested in?
if mm.readline() < ident:
# If so, ignore everything up to right after this line
lo = mm.tell()
else:
# Otherwise, ignore everything from right before this line
hi = mid
# We found where the first line of interest would be expected; go there
mm.seek(lo)
while True:
line = mm.readline()
if not line.startswith(ident):
break
yield line.decode()
with open(filename) as f:
r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
for line in r:
print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution which I ran and tested on my own as well with a sample data set that I got on Quandl that appears to have all the same headers and similar data. (Assuming that I havent misunderstood the end result that your trying to achieve).
I have this command line tool that one of our engineers built for us for parsing massive csvs - since I deal with absurd amount of data on a day to day basis - it is open sourced and you can get it here: https://github.com/DataFoxCo/gocsv
I also already wrote the short bash script for it in case you don't want to pipeline the commands but it does also support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
both arguments for ticker and for dates are regexes
You can add as many variations as your want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates - but i only limited my sample run to any dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
Rows: 24
Columns: 14
Here is a sample where I limited to only those 2 tickers and then to december only:
Voila - in a matter of seconds you have a second file saved with out the data you care about.
The gocsv program has great documentation by the way - and a ton of other functions e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool)
in addition to using csv.reader I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter #for the keyfunc for groupby
def getData(wanted_symbol, filename):
with open(filename) as file:
reader = csv.reader(file)
#so each line in reader is basically line[:-1].split(",") from the plain file
for symb, lines in groupby(reader, itemgetter(0)):
#so here symb is the symbol at the start of each line of lines
#and lines is the lines that all have that symbol in common
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in lines:
#here we have each line as a list of fields
#for only the lines that have `wanted_symbol` as the first element
<DO STUFF HERE>
so in the space of <DO STUFF HERE> you could have the out.append(formatLineData(line)) to do what your current code does but the code for that function has a lot of unnecessary slicing and += operators which I think are pretty expensive for lists (might be wrong), another way you could apply the conversions is to have a list of all the conversions:
def conv_date(date_str):
return datetime.strptime(date_str, '%Y-%m-%d').date()
#the conversions applied to each element (taken from original formatLineData)
castings = [str, conv_date, #0, 1
float, float, float, float, #2:6
int, int, int, #6:9
float, float, float, float, #9:13
int] #13
then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
so you would replace <DO STUFF HERE> with out.append with that comprehension.
I'd also wonder if switching the order of groupby and reader would be better since you don't need to parse most of the file as csv, just the parts you are actually iterating over so you could use a keyfunc that seperates just the first field of the string
def getData(wanted_symbol, filename):
out = [] #why are you starting this with strings in it?
def checkMatch(line): #define the function to only take the line
#this would be the keyfunc for groupby in this example
return line.split(",",1)[0] #only split once, return the first element
with open(filename) as file:
for symb, lines in groupby(file,checkMatch):
#so here symb is the symbol at the start of each line of lines
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in csv.reader(lines):
out.append( [typ(val) for typ,val in zip(castings,line)] )
return out

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years
#import the data csv
import sys
import re
import csv
def cleancommas(x):
toggle=False
for i,j in enumerate(x):
if j=="\"":
toggle=not toggle
if toggle==True:
if j==",":
x=x[:i]+" "+x[i+1:]
return x
def districtatize(x):
#list indexes of entries starting with "for" or "to" of length >5
indices=[1]
for i,j in enumerate(x):
if len(j)>2:
if j[:2]=="to":
indices.append(i)
if len(j)>3:
if j[:3]==" to" or j[:3]=="for":
indices.append(i)
if len(j)>5:
if j[:5]==" \"for" or j[:5]==" \'for":
indices.append(i)
if len(j)>4:
if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
indices.append(i)
if len(indices)==1:
return [x[0],x[1:len(x)-1]]
new=[x[0],x[1:indices[1]+1]]
z=1
while z<len(indices)-1:
new.append(x[indices[z]+1:indices[z+1]+1])
z+=1
return new
#should return a list of lists. First entry will be county
#each successive element in list will be list by district
def splitforstos(string):
for itemind,item in enumerate(string): # take all exception cases that didn't get processed
splitfor=re.split('(?<=\d)\s\s(?=for)',item) # correctly and split them up so that the for begins
splitto=re.split('(?<=\d)\s\s(?=to)',item) # a cell
if len(splitfor)>1:
print "\n\n\nfor detected\n\n"
string.remove(item)
string.insert(itemind,splitfor[0])
string.insert(itemind+1,splitfor[1])
elif len(splitto)>1:
print "\n\n\nto detected\n\n"
string.remove(item)
string.insert(itemind,splitto[0])
string.insert(itemind+1,splitto[1])
def analyze(x):
#input should be a string of content
#target values are nomills,levytype,term,yearcom,yeardue
clean=cleancommas(x)
countylist=clean.split(',')
emptystrip=filter(lambda a: a != '',countylist)
empt2strip=filter(lambda a: a != ' ', emptystrip)
singstrip=filter(lambda a: a != '\' \'',empt2strip)
quotestrip=filter(lambda a: a !='\" \"',singstrip)
splitforstos(quotestrip)
distd=districtatize(quotestrip)
print '\n\ndistrictized\n\n',distd
county = distd[0]
for x in distd[1:]:
if len(x)>8:
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
else:
print "x\n\n",x
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
special=x[5]
splitspec=special.split(' ')
try:
forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
numyears=splitspec[forind+1]
yearcom=splitspec[forind+6]
except:
forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
numyears=None
yearcom=splitspec[forind+2]
yeardue=str(x[6])[-4:]
reason=x[7]
data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
print "data other", data
openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
openfile.writerow(data)
# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex
# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
try:
data = contents[y:separators[x+1]]
except:
data = contents[y:]
analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
district= x[0],
vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
match= p(data)
if match:
return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).

Find and replace in CSV files with Python

Related to a previous question, I'm trying to do replacements over a number of large CSV files.
The column order (and contents) change between files, but for each file there are about 10 columns that I want and can identify by the column header names. I also have 1-2 dictionaries for each column I want. So for the columns I want, I want to use only the correct dictionaries and want to implement them sequentially.
An example of how I've tried to solve this:
# -*- coding: utf-8 -*-
import re
# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]
# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}
# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1':[D1a,D1b],u'col2':[D2]}
# clunky replace function
def freplace(str, dict):
rc = re.compile('|'.join(re.escape(k) for k in dict))
def trans(m):
return dict[m.group(0)]
return rc.sub(trans, str)
# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
if Header[i] in refD:
C.append([refD[Header[i]],i])
# loop through lines and make replacements
for line in fileLines:
for i in range(len(line)):
for j in range(len(C)):
if C[j][1] == i:
for dict in C[j][0]:
line[i] = freplace(line[i], dict)
My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess was that my freplace function is largely what is slowing things down, because it has to compile for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but don't know how to do that and still preserve what the rest of my code is doing.
There's a ton of things that you can do to speed this up:
First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
You don't need re:
# -*- coding: utf-8 -*-
# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A',u'X']
Line2 = [u'B',u'Y']
fileLines = [Line1,Line2]
# dicts to translate lines
D1a = {u'A':u'a'}
D1b = {u'B':u'b'}
D2 = {u'X':u'x',u'Y':u'y'}
# dict to correspond header names with the correct dictionary
refD = {u'col1':[D1a,D1b],u'col2':[D2]}
# now let's have some fun...
for line in fileLines:
for i, (param, word) in enumerate(zip(Header, line)):
for minitranslator in refD[param]:
if word in minitranslator:
line[i] = minitranslator[word]
returns:
[[u'a', u'x'], [u'b', u'y']]
So if that's the case, and all 10 columns have the same names each time, but out of order, (I'm not sure if this is what you're doing up there, but here goes) keep one array for the heading names, and one for each column split into elements (should be 10 items each line), now just offset which regex by doing a case/select combo, compare the element number of your header array, then inside the case, reference the data array at the same offset, since the name is what will get to the right case you should be able to use the same 10 regex's repeatedly, and not have to recompile a new "command" each time.
I hope that makes sense. I'm sorry i don't know the syntax to help you out, but I hope my idea is what you're looking for
EDIT:
I.E.
initialize all regexes before starting your loops.
then after you read a line (and after the header line)
select array[n]
case "column1"
regex(data[0]);
case "column2"
regex(data[1]);
.
.
.
.
end select
This should call the right regex for the right columns

Categories