Compare two CSV files with an unknown number of columns and names - Python

Thanks in advance for any advice. First-time poster here, so I'll do my best to include all the required info. I'm also quite a beginner with Python: I've been doing some online tutorials and some copy/paste coding from Stack Overflow, so it's FrankenCoding and I'm probably approaching this wrong...
I need to compare two CSV files that will have a changing number of columns; there will only ever be two columns that match (for example, email_address in one file and EMAIL in the other). Both files will have headers, but the names of those headers may change. The files may be anywhere from a few thousand lines up to 2,000,000+, with potentially 100+ columns (but more likely a handful).
Output goes to a third 'results.csv' file, containing all the info. It may be a merge (all unique entries), a subtract (remove entries present in one or the other) or an intersect (all entries present in both).
I have searched here and found a lot of good information, but all the examples I saw had a fixed number of columns in the files. I've tried dict and DictReader, and I know the answer is in there somewhere, but right now I'm a bit confused. Since I haven't made any progress in several days and can only devote so much time to this, I'm hoping I can get a nudge in the right direction.
Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.
Extract of the CSV files below. I didn't add more columns than (I think) necessary; the dataset I have now will match on originalid/UID or emailaddress/email, but this may not always be the case.
Original.csv
"originalid","emailaddress",""
"12345678","Bob#mail.com",""
"23456789","NORMA#EMAIL.COM",""
"34567890","HENRY#some-mail.com",""
"45678901","Analisa#sports.com",""
"56789012","greta#mail.org",""
"67890123","STEVEN#EMAIL.ORG",""
Compare.CSV
"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob#mail.com",,,"true"
"NORMA#EMAIL.COM",,,"true"
"HENRY#some-mail.com",,,"true"
"Henrietta#AWESOME.CA",,,"true"
"NORMAN#sports.CA",,,"true"
"albertina#justemail.CA",,,"true"
Data in results.csv should be all columns from Original.csv + all columns from Compare.csv, but not the matching one (email):
"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob#mail.com","",,,"true"
"23456789","NORMA#EMAIL.COM","",,,"true"
"34567890","HENRY#some-mail.com","",,,"true"
Here are my results as they are now:
email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,,,true,"['12345678', 'Bob@mail.com', '']"
NORMA@EMAIL.COM,,,true,"['23456789', 'NORMA@EMAIL.COM', '']"
HENRY@some-mail.com,,,true,"['34567890', 'HENRY@some-mail.com', '']"
And here's where I'm at with the code. The print statement returns matching data from the files to the screen, but not to the file, so I'm missing something in there.
***** And I'm not getting the headers from the Original.csv file, though the data is coming in.
import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = len(headers)
    while i < max:
        if headers[i] == column_name:
            column_header = i
            # print(headers[i])
        i = i + 1
    return column_header
file_to_check = "Original.csv"
file_console = "Compare.csv"
column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')
with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []))
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print(row + [master_indices.get(row[0])])
                writer.writerow(row + [master_indices.get(row[0])])
Thanks for your time!
Pat

I like that you want to do this yourself, and recognize a need to "understand how the data is moving around." This is exactly how you should be thinking of the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow as it will make future reuse easier.
You're not trying to build a tool that combines two CSVs; you're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (2,000,000+ rows with potentially 100+ columns), recognize that it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend reading up on Big-O notation and asymptotic algorithm analysis, though you might be okay without it.
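As a rough illustration (a sketch of mine, not part of the approach below) of why keying the data on the email matters at this scale: membership tests on a dict are effectively O(1), while scanning a list is O(n) per lookup, so a list-based inner loop over both files would be O(n*m) overall.

emails = ['user{}@example.com'.format(i) for i in range(1000000)]  # made-up data
as_list = emails
as_dict = dict.fromkeys(emails)

print('user999999@example.com' in as_list)   # walks the list element by element
print('user999999@example.com' in as_dict)   # a single hash lookup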
First you decide what, from each CSV, is your key. You've already done this, 'email' for 'Compare.csv' and 'emailaddress' from 'Original.csv'.
Now, build yourself a function to produce dictionaries from the CSV based off the key.
def get_dict_from_csv(path_to_csv, key):
    with open(path_to_csv, 'r') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3
        key_index = headers.index(key)  # find index of key
        # dictionary comprehensions are your friend, just think about what you want the dict to look like
        d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
             for row in rest}
    headers.remove(key)
    d['HEADERS'] = headers  # add headers so you know what the information in the dict is
    return d
Now you can call this function on both of your CSVs.
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
Now you have two dicts keyed off the same information. Next, we need a function to combine them into one dict.
def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend([''] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, [''] * len(original_headers)).extend(v)
    return d
Now you have one dict which has all the information you want, all you need to do is write it back as a CSV.
def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email'] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k] + v
            else:
                row = v
            writer.writerow(row)
And that should be it. Calling all of this is just:
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)
And you can easily see how this can be extended to arbitrarily many dictionaries.
You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.
When I run all the above I get
email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,12345678,,,,true
NORMA@EMAIL.COM,23456789,,,,true
HENRY@some-mail.com,34567890,,,,true
Analisa@sports.com,45678901,,,,,
greta@mail.org,56789012,,,,,
STEVEN@EMAIL.ORG,67890123,,,,,
Henrietta@AWESOME.CA,,,,,true
NORMAN@sports.CA,,,,,true
albertina@justemail.CA,,,,,true

Right now it looks like you only use writerow once for the header:
writer.writerow(next(reader, []))
As francisco pointed out, uncommenting that last writer.writerow(...) line may fix your problem. You can do this by removing the "#" at the beginning of the line.
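For reference, here is a minimal sketch (not from either answer) of the shape of the intersect the question asks for. It assumes the email really is the second column of Original.csv and the first column of Compare.csv, as in the samples above.

import csv

with open('Original.csv', newline='') as orig:
    rows = list(csv.reader(orig))
orig_headers, orig_rows = rows[0], rows[1:]
by_email = {r[1]: r for r in orig_rows}  # key Original.csv rows by email address

with open('Compare.csv', newline='') as cmp, \
        open('results.csv', 'w', newline='') as out:
    reader = csv.reader(cmp)
    writer = csv.writer(out)
    cmp_headers = next(reader, [])
    writer.writerow(orig_headers + cmp_headers[1:])  # drop the duplicate email column
    for row in reader:
        match = by_email.get(row[0])
        if match is not None:
            writer.writerow(match + row[1:])

This writes one row per email present in both files, with all Original.csv columns followed by the remaining Compare.csv columns.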

Related

Python: How do I sum the data values?

Basically, my program currently reads the data file (electricity info), sums the values, and after summing changes all negative numbers to 0 while keeping the positive numbers as they are. The program does this perfectly. This is the code I currently have:
import csv
from datetime import timedelta
from collections import defaultdict

def convert(item):
    try:
        return float(item)
    except ValueError:
        return 0

sums = defaultdict(list)

def daily():
    lista = []
    with open('Data.csv', 'r') as inp:
        reader = csv.reader(inp, delimiter=';')
        headers = next(reader)
        for line in reader:
            mittaus = max(0, sum([convert(i) for i in line[1:-2]]))
            lista.append()
            # print(line[0], mittaus)  # only printing to check that it works ok

daily()
My question is: how can I save the data to lists so I can use them, adding up all the values per day, so that it looks something like this:
1.1.2016;358006
2.1.2016;39
3.1.2016;0 ...
8.1.2016;239143
After having these in a list (to be saved later to a new data file), it should calculate the cumulative values straight after, which should look like this:
1.1.2016;358006
2.1.2016;358045
3.1.2016;358045...
8.1.2016;597188
Having done that, it should be ready to write this data to a new CSV file.
Small peek at what's behind the data file: https://pastebin.com/9HxwcixZ [It's actually delimited with ';', not with ' ' as in the pastebin]
The data file: https://files.fm/u/yuf4bbuk
I have clarified the questions, so you might have seen me ask this before. This should be done without external libraries. I hope to find some help.
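No answer was included with this question, but as a hedged sketch of one possible approach: keep the per-day sums in a list and then derive the cumulative totals, using only the standard library. It reuses convert() from the question's code and assumes the same column layout (date in column 0, values to sum in line[1:-2]); file names daily.csv and cumulative.csv are placeholders.

import csv
from itertools import accumulate

def daily_and_cumulative(path='Data.csv'):
    daily = []
    with open(path, 'r') as inp:
        reader = csv.reader(inp, delimiter=';')
        next(reader)  # skip the header row
        for line in reader:
            total = max(0, sum(convert(i) for i in line[1:-2]))
            daily.append((line[0], total))

    cumulative = list(accumulate(total for _, total in daily))

    with open('daily.csv', 'w', newline='') as out:
        csv.writer(out, delimiter=';').writerows(daily)
    with open('cumulative.csv', 'w', newline='') as out:
        csv.writer(out, delimiter=';').writerows(
            (date, c) for (date, _), c in zip(daily, cumulative))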

Most Pythonic way to collate a group of CSVs based on a common key?

The Problem: I have a group of CSV files, imagine they're named a.csv, b.csv, etc. They all share a common structure of first name, last name, phone, email, and the common key is source - where I got the data from. I collated all of those CSV files into one massive CSV.
Now, it can be the case that 1 individual can be in all of the CSV files - i.e., imagine if the source was a website - the same individual's profile is listed on each website, so each source is different, but the rest of the data is the same. This would mean I end up with data like:
John,Doe,867-5309,johndoe@fake.com,Website A
John,Doe,867-5309,johndoe@fake.com,Website B
I have the CSVs collated into one massive CSV, but I'm having trouble figuring out how best to complete the latter portion - sorting by the common key. Ideally, I'd want the source to be a list instead of a string - a list of all the sources. So, instead of the code sample above, I want to make the data look like this:
John,Doe,867-5309,johndoe@fake.com,Website A,Website B
Attempted solutions: Not my first rodeo here, so I know I have to show my work. My initial idea was to iterate through all of the agents in the collated CSV file, save their emails to a list, then iterate again through all of the list of agents and the list of emails - if the agent's email and the iterated email are the same, then I add the source from the iterated email to the agent's source column, and continue on as necessary. Here's the code I used to achieve that:
import csv
from tqdm import tqdm

class Agent:
    def __init__(self, source, first_name, last_name, phone, email):
        self.sources = []
        self.source = source
        self.first_name = first_name
        self.last_name = last_name
        self.phone = phone
        self.email = email

    def writer(self):
        with open('final_agent_list.csv', 'a') as csvfile:
            csv_writer = csv.writer(csvfile)
            row = [self.first_name, self.last_name, self.phone,
                   self.email]
            row.extend(i for i in self.sources)
            csv_writer.writerow(row)

agents = []
with open('collated_files.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in tqdm(list(csv_reader)):
        a = Agent(*row)
        for agent in agents:
            if agent.email == a.email:
                a.sources.append(agent.source)
            else:
                a.sources.append(a.source)
        agents.append(a)

for agent in agents:
    agent.writer()
For a minimal, complete, verifiable example, use the following as collated_files.csv:
John,Doe,867-5309,johndoe@fake.com,Website A
John,Doe,867-5309,johndoe@fake.com,Website B
However, when I run that, I do get a list of sources as hoped, but they aren't collated. A good example of the output would be:
John,Doe,867-5309,johndoe@fake.com,[Website A]
John,Doe,867-5309,johndoe@fake.com,[Website B]
Clearly, it's not combining the two as I want, but I can't figure out what's making the code go wonky. Do any of you wonderful folks have any ideas? Thanks for reading!
The problem is that you break out of the loop as soon as you find one match, and so your agent will always end up with one value in sources. [You fixed this in a later edit].
Secondly, the inner else will also kick in even if a previous iteration found an email match, so you still end up appending duplicate agents. And the more agents there are, the more iterations the loop goes through, and the more duplicates you add.
I would suggest using a dict, as it will allow faster lookup of a matching email:
agents = {}  # create a dict keyed by email
with open('collated_files.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in tqdm(list(csv_reader)):
        a = Agent(*row)
        # if this is a new email, add it to the agents
        if a.email not in agents:
            agents[a.email] = a
        # in all cases add the source
        agents[a.email].sources.append(a.source)

for agent in agents.values():  # iterate over the Agent objects, not the dict keys
    agent.writer()
To be more pythonic (and to make that work), what I would do:
use collections.defaultdict to associate a list with the key, which is a tuple made of all values but the last one (the website)
read the input csv files in a loop and create the dictionary
write a new csv file and recreate the rows using concatenation of the key and the values (the gathered websites)
like this:
import collections
import csv

d = collections.defaultdict(list)
for input_file in ["in1.csv", "in2.csv"]:
    with open(input_file) as f:
        for row in csv.reader(f):
            d[tuple(row[:-1])].append(row[-1])

with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows((k + tuple(v)) for k, v in d.items())

Efficiently Find Partial String Match --> Values Starting From List of Values in 5 GB file with Python

I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODE corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004, and I want all businesses whose codes start with my list of grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which was probably a bad idea, but the only way I could get to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it outputted a 72MB file (200k+ rows) of grocery stores
codes = [4451,4452,447,772,45299,45291,45212]  # codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns=headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header=False)
            print code
            break  # break code for loop if match
grocery.to_csv("grocery.csv", sep='\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1024*1024,
                      dtype=str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
    print "Processed chunk and found {} matches".format(len(x))

output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index=False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no CSV parsing) and check them with the regexp...
import re

codes = [4451,4452,447,772,45299,45291,45212]
col_num = 4  # column number of SNACODE
expr = re.compile("[^,]*," * col_num +
                  "|".join(map(str, codes)) +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
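If quoting becomes an issue, a hedged alternative sketch (not the answerer's code) is to let the csv module do the parsing and stream rows, keeping those whose SNACODE starts with any of the target prefixes. It assumes Python 3 and the column/file names from the question.

import csv

codes = ('4451', '4452', '447', '772', '45299', '45291', '45212')

with open('infogroup_bus_2010.csv', newline='') as inp, \
        open('grocery.csv', 'w', newline='') as out:
    reader = csv.reader(inp)
    writer = csv.writer(out)
    header = next(reader)
    writer.writerow(header)
    sna_idx = header.index('SNACODE')
    for row in reader:
        if row[sna_idx].startswith(codes):  # str.startswith accepts a tuple of prefixes
            writer.writerow(row)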
You can probably make your pandas solution much faster:
codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})
with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.

Speed up simple Python function that uses list comprehension

I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn regression model.
It seems that the function used to do the extraction is extremely slow. I just learnt Python today; any suggestions on how the function can be sped up?
Can multithreading/multiple cores be used? My system has 4 cores.
def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles
print type(jobs)
<type 'list'>
print jobs[:1]
[{'category': 'Engineering Jobs', 'salaryRaw': '20000 - 30000/annum 20-30K', 'rawLocation': 'Dorking, Surrey, Surrey', 'description': 'Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K', 'title': 'Engineering Systems Analyst', 'sourceName': 'cv-library.co.uk', 'company': 'Gregory Martin International', 'contractTime': 'permanent', 'normalizedLocation': 'Dorking', 'contractType': '', 'id': '12612628', 'salaryNormalized': '25000'}]
def loadData(filePath):
    reader = csv.reader(open(filePath))
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append(dict(zip(categories, row)))
    return rows
def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append(jobs[i]['salaryNormalized'])
        descriptions.append(jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'])
        titles.append(jobs[i]['title'])
    return salaries, descriptions, titles
def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [np.log(float(salaries[i])) for i, v in enumerate(salaries)]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y
jobs = loadData( paths['train_data_path'] )
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
I see multiple problems with your code, directly impacting its performance.
You enumerate the jobs list multiple times. You could enumerate it only once and instead use the enumerated list (stored in a variable).
You don't use the value from the enumerated items at all. All you need is the index, and you could easily achieve this using the built-in range function.
Each of the lists is generated in an eager manner. What happens is the following: building the first list blocks the execution of the program and takes some time to finish. The same thing happens with the second and third lists, whose calculations are exactly the same.
What I would suggest you do is use a generator, so that you process the data in a lazy manner. It's more performance-efficient and allows you to extract the data on the go.
def splitData(jobs):
    for job in jobs:
        yield (job['salaryNormalized'],
               job['description'] + job['normalizedLocation'] + job['category'],
               job['title'])
One simple speedup is to cut down on your list traversals. You can build a generator or generator expression that returns tuples for a single dictionary, then zip the resulting iterable:
(salaries, descriptions, titles) = zip(*((j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title']) for j in jobs))
Unfortunately, that still creates three sizable in-memory lists - using a generator expression rather than a list comprehension should at least prevent it from creating a full list of three-element tuples prior to zipping.
Correct me if I'm wrong, but it seems that TfidfVectorizer accepts an iterator (e.g. a generator expression) as well. This helps prevent having multiple copies of this rather large data in memory, which is probably what makes it slow. Alternatively, it can certainly work with files directly: one could transform the CSV into separate files and then feed those files to TfidfVectorizer directly without keeping them in memory at all.
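As a minimal sketch (assuming scikit-learn and a row generator such as the data_from_file() defined below; 'train.csv' is a placeholder path): TfidfVectorizer.fit_transform accepts any iterable of strings, so a generator expression avoids materialising the whole description list in memory first.

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = (job['description'] + job['normalizedLocation'] + job['category']
                for job in data_from_file('train.csv'))
vect = TfidfVectorizer()
X_desc = vect.fit_transform(descriptions)  # consumes the generator lazily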
Edit 1
Now that you provided some more code, I can be a bit more specific.
First of all, please note that loadData is doing more than it needs to; it duplicates functionality present in csv.DictReader. If we use that, we skip the listing of category names. Another syntax for opening files is used, because in this way, they're closed automatically. Also, some names are changed to be both more accurate and Pythonic (underscore style).
def data_from_file(filename):
    rows = []
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)
    return rows
We can now change this so that we don't build the list of all rows in memory, but instead give back a row one at a time right after we read it from the file. If this looks like magic, just read a little about generators in Python.
def data_from_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row
Now let's have a look at splitData. We could write it more cleanly like this:
def split_data(jobs):
    salaries = []
    descriptions = []
    titles = []
    for job in jobs:
        salaries.append(job['salaryNormalized'])
        descriptions.append(job['description'] + job['normalizedLocation'] +
                            job['category'])
        titles.append(job['title'])
    return salaries, descriptions, titles
But again we don't want to build three huge lists in memory, and generally it's not practical for this function to give us three different things. So, to split it up:
def extract_salaries(jobs):
    for job in jobs:
        yield job['salaryNormalized']
And so on. This helps us set up a kind of processing pipeline; every time we request a value from extract_salaries(data_from_file(filename)), a single line of the CSV is read and the salary extracted. The next time, the second line is read, giving back the second salary. There's no need to write functions for this simple case. Instead, you can use generator expressions:
salaries = (job['salaryNormalized'] for job in data_from_file(filename))
descriptions = (job['description'] + job['normalizedLocation'] +
                job['category'] for job in data_from_file(filename))
titles = (job['title'] for job in data_from_file(filename))
You can now pass these generators to fit, where the most important modification is this:
y = [np.log(float(salary)) for salary in salaries]
You can't index into an iterator (something that gives you one value at a time) so you just assume you will get a salary from salaries as long as there are more, and do something with it.
In the end, you will read the whole csv file multiple times, but I don't expect that to be the bottleneck. Otherwise, some more restructuring is required.
Edit 2
Using DictReader seems a bit slow. Not sure why, but you may stick with your own implementation of that (modified to be a generator) or even better, go with namedtuples:
from collections import namedtuple

def data_from_file(filename):
    with open(filename) as f:
        reader = csv.reader(f)
        header = reader.next()
        Job = namedtuple('Job', header)
        for row in reader:
            yield Job(*row)
Then access the attributes with a dot (job.salaryNormalized). But anyway note that you can get the list of column names from the file; don't duplicate it in code.
You may of course decide to keep a single copy of the file in memory after all. In that case, do something like this:
data = list(data_from_file(filename))
salaries = (job['salaryNormalized'] for job in data)
The functions remain untouched. The call to list consumes the whole generator and stores all values in a list.
You don't need the indexes at all. Just iterate over jobs directly with in. This saves the creation of an extra list of tuples, and it removes a level of indirection:
salaries = [j['salaryNormalized'] for j in jobs]
descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
titles = [j['title'] for j in jobs]
This still iterates over the data three times.
Alternatively you could get everything in one list comprehension, grouping the relevant data from one job together in a tuple;
data = [(j['salaryNormalized'],
         j['description'] + j['normalizedLocation'] + j['category'],
         j['title']) for j in jobs]
Saving the best for last; why not fill the lists straight from the CSV file instead of making a dict first?
import csv

with open('data.csv', 'r') as df:
    reader = csv.reader(df)
    # I made up the row indices...
    data = [(row[1], row[3] + row[7] + row[6], row[2]) for row in reader]

Find and replace in CSV files with Python

Related to a previous question, I'm trying to do replacements over a number of large CSV files.
The column order (and contents) change between files, but for each file there are about 10 columns that I want and can identify by the column header names. I also have 1-2 dictionaries for each column I want. So for each column I want, I want to use only the correct dictionaries, and I want to apply them sequentially.
An example of how I've tried to solve this:
# -*- coding: utf-8 -*-
import re

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A', u'X']
Line2 = [u'B', u'Y']
fileLines = [Line1, Line2]

# dicts to translate lines
D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}

# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1': [D1a, D1b], u'col2': [D2]}

# clunky replace function
def freplace(str, dict):
    rc = re.compile('|'.join(re.escape(k) for k in dict))
    def trans(m):
        return dict[m.group(0)]
    return rc.sub(trans, str)

# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
    if Header[i] in refD:
        C.append([refD[Header[i]], i])

# loop through lines and make replacements
for line in fileLines:
    for i in range(len(line)):
        for j in range(len(C)):
            if C[j][1] == i:
                for dict in C[j][0]:
                    line[i] = freplace(line[i], dict)
My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess was that my freplace function is largely what is slowing things down, because it has to compile for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but don't know how to do that and still preserve what the rest of my code is doing.
There's a ton of things that you can do to speed this up:
First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
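To make the first two points concrete, here is a minimal sketch (assuming the refD mapping from the question, and merging each column's dictionaries; if the sequential order of D1a/D1b matters because one dict's output is another's input, keep a list of (pattern, dict) pairs per column and apply them in order instead):

import csv
import re

def build_column_patterns(refD):
    patterns = {}
    for col, dicts in refD.items():
        merged = {}
        for d in dicts:
            merged.update(d)  # one lookup table per column
        patterns[col] = (re.compile('|'.join(re.escape(k) for k in merged)), merged)
    return patterns

def translate_rows(path, refD):
    # compile once, then reuse the patterns for every row
    patterns = build_column_patterns(refD)
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            for col, (pattern, merged) in patterns.items():
                if col in row:
                    row[col] = pattern.sub(lambda m: merged[m.group(0)], row[col])
            yield row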
You don't need re:
# -*- coding: utf-8 -*-

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A', u'X']
Line2 = [u'B', u'Y']
fileLines = [Line1, Line2]

# dicts to translate lines
D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}

# dict to correspond header names with the correct dictionary
refD = {u'col1': [D1a, D1b], u'col2': [D2]}

# now let's have some fun...
for line in fileLines:
    for i, (param, word) in enumerate(zip(Header, line)):
        for minitranslator in refD[param]:
            if word in minitranslator:
                line[i] = minitranslator[word]
returns:
[[u'a', u'x'], [u'b', u'y']]
So if that's the case, and all 10 columns have the same names each time but out of order (I'm not sure if this is what you're doing up there, but here goes): keep one array for the heading names and one for each line split into elements (should be 10 items per line). Now just choose which regex to apply by doing a case/select combo: compare against the element of your header array, then inside the case reference the data array at the same offset. Since the column name is what gets you to the right case, you should be able to use the same 10 regexes repeatedly and not have to recompile a new "command" each time.
I hope that makes sense. I'm sorry I don't know the Python syntax to help you out, but I hope my idea is what you're looking for.
EDIT:
I.e., initialize all regexes before starting your loops. Then, after you read a line (and after the header line):

select array[n]
    case "column1"
        regex(data[0])
    case "column2"
        regex(data[1])
    ...
end select

This should call the right regex for the right columns.
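A rough Python rendering of the select/case idea above, assuming precompiled patterns keyed by column name (for example, from the sketch earlier in this thread) and header/data lists produced by csv.reader:

for i, name in enumerate(header):
    entry = patterns.get(name)
    if entry is not None:
        pattern, translations = entry
        data[i] = pattern.sub(lambda m: translations[m.group(0)], data[i])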
