and thanks in advance for any advice. First-time poster here, so I'll do my best to put in all required info. I am also quite beginner with Python, have been doing some online tutorials, and some copy/paste coding from StackOverflow, it's FrankenCoding... So I'm probably approaching this wrong...
I need to compare two CSV files, that will have a changing number of columns, there will only ever be 2 columns that match (for example, email_address in one file, and EMAIL in the other). Both files will have headers, however the names of these headers may change. The file sizes may be anywhere from a few thousand lines up to +2,000,000, with potentially 100+ columns (but more likely to have a handful).
Output is to a third 'results.csv' file, containing all the info. It may be a merge (all unique entries), a substract (remove entries present in one or the other) or an intersect (all entries present in both).
I have searched here, and found a lot of good information, but all of the ones I saw had a fixed number of columns in the files. I've tried dict and dictreader, and I know the answer is in there somewhere, but right now, I'm a bit confused. But since I haven't made any progress in several days, and I can only devote so much time on this, I'm hoping that I can get a nudge in the right direction.
Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.
An extract of the CSV files is below. I didn't add more columns than (I think) necessary. The dataset I have now will match on Originalid/UID or emailaddress/email, but this may not always be the case.
Original.csv
"originalid","emailaddress",""
"12345678","Bob#mail.com",""
"23456789","NORMA#EMAIL.COM",""
"34567890","HENRY#some-mail.com",""
"45678901","Analisa#sports.com",""
"56789012","greta#mail.org",""
"67890123","STEVEN#EMAIL.ORG",""
Compare.CSV
"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob#mail.com",,,"true"
"NORMA#EMAIL.COM",,,"true"
"HENRY#some-mail.com",,,"true"
"Henrietta#AWESOME.CA",,,"true"
"NORMAN#sports.CA",,,"true"
"albertina#justemail.CA",,,"true"
Data in results.csv should be all columns from Original.CSV + all columns in Compare.csv, but not the matching one (email) :
"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob#mail.com","",,,"true"
"23456789","NORMA#EMAIL.COM","",,,"true"
"34567890","HENRY#some-mail.com","",,,"true"
Here are my results as they are now:
email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob#mail.com,,,true,"['12345678', 'Bob#mail.com', '']"
NORMA#EMAIL.COM,,,true,"['23456789', 'NORMA#EMAIL.COM', '']"
HENRY#some-mail.com,,,true,"['34567890', 'HENRY#some-mail.com', '']"
And here's where I'm at with the code, the print statement returns matching data from the files to screen but not to file, so I'm missing something in there.
***** And I'm not getting the headers from the original.csv file, data is coming in.
import csv
def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = (len(headers))
    while i < max:
        if headers[i] == column_name:
            column_header = i
            # print(headers[i])
        i = i + 1
    return(column_header)
file_to_check = "Original.csv"
file_console = "Compare.csv"
column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')
with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))
with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []))
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print (row +[master_indices.get(row[0])])
                writer.writerow(row +[master_indices.get(row[0])])
Thanks for your time!
Pat
I like that you want to do this yourself, and recognize a need to "understand how the data is moving around." This is exactly how you should be thinking of the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow as it will make future reuse easier.
You're not trying to build a tool that combines two CSVs, you're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (+2,000,000 [rows] with potentially 100+ columns) recognize that it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend you read up on Big-O notation and asymptotic algorithm analysis. You might be okay without this.
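To make the runtime point concrete, here is a minimal sketch (the function names and sample rows are made up, not taken from your files). It contrasts a nested-loop match, whose work grows with len(a) * len(b), with a dict lookup, whose work grows roughly with len(a) + len(b). It assumes the key is unique within each file.

```python
def match_nested(rows_a, rows_b):
    """O(len(a) * len(b)): every row of a is compared with every row of b."""
    return [(a, b) for a in rows_a for b in rows_b if a[0] == b[0]]


def match_indexed(rows_a, rows_b):
    """O(len(a) + len(b)): index b once, then do one dict lookup per row of a."""
    index = {b[0]: b for b in rows_b}
    return [(a, index[a[0]]) for a in rows_a if a[0] in index]


# Hypothetical sample rows; the first field plays the role of the email key.
rows_a = [["bob", "1"], ["norma", "2"], ["greta", "3"]]
rows_b = [["norma", "x"], ["bob", "y"], ["henry", "z"]]

nested = match_nested(rows_a, rows_b)
indexed = match_indexed(rows_a, rows_b)
```

Both functions return the same pairs here; with two files of 2,000,000 rows each, only the indexed version is likely to finish in reasonable time.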
First you decide what, from each CSV, is your key. You've already done this, 'email' for 'Compare.csv' and 'emailaddress' from 'Original.csv'.
Now, build yourself a function to produce dictionaries from the CSV based off the key.
import csv
def get_dict_from_csv(path_to_csv, key):
    with open(path_to_csv, 'r', newline='') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3
    key_index = headers.index(key)  # find index of key
    # dictionary comprehensions are your friend, just think about what you want the dict to look like
    d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
         for row in rest}
    headers.remove(key)
    d['HEADERS'] = headers  # add headers so you know what the information in the dict is
    return d
Now you can call this function on both of your CSVs.
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
Now you have two dicts which are keyed off the same information. Now we need a function to combine these into one dict.
def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend(['', ] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, ['', ] * len(original_headers)).extend(v)
    return d
Now you have one dict which has all the information you want, all you need to do is write it back as a CSV.
def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email',] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k,] + v
            else:
                row = v
            writer.writerow(row)
And that should be it. To call all of this is just
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)
And you can easily see how this can be extended to arbitrarily many dictionaries.
You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.
When I run all the above with include_key=True, I get
email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob#mail.com,12345678,,,,true
NORMA#EMAIL.COM,23456789,,,,true
HENRY#some-mail.com,34567890,,,,true
Analisa#sports.com,45678901,,,,,
greta#mail.org,56789012,,,,,
STEVEN#EMAIL.ORG,67890123,,,,,
Henrietta#AWESOME.CA,,,,,true
NORMAN#sports.CA,,,,,true
albertina#justemail.CA,,,,,true
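The output above is the merge case. For the intersect and subtract cases from the question, one possible sketch is to filter on the keys before combining, since dict key views support set operators. The helper names intersect_keys, subtract_keys and restrict, and the tiny dicts, are made up for illustration; the 'HEADERS' entry follows the convention used above.

```python
def intersect_keys(d1, d2):
    """Keys present in both dicts, ignoring the special 'HEADERS' entry."""
    return (d1.keys() & d2.keys()) - {'HEADERS'}


def subtract_keys(d1, d2):
    """Keys present in d1 but not in d2, ignoring 'HEADERS'."""
    return (d1.keys() - d2.keys()) - {'HEADERS'}


def restrict(d, keys):
    """Copy of d limited to the given keys, keeping its 'HEADERS' entry."""
    out = {k: list(v) for k, v in d.items() if k in keys}
    out['HEADERS'] = list(d['HEADERS'])
    return out


# Hypothetical data shaped like the output of get_dict_from_csv().
original = {'HEADERS': ['originalid'], 'Bob#mail.com': ['12345678'],
            'greta#mail.org': ['56789012']}
compare = {'HEADERS': ['EMAIL_USERS'], 'Bob#mail.com': ['true'],
           'NORMAN#sports.CA': ['true']}

both = intersect_keys(original, compare)
only_original = subtract_keys(original, compare)
original_matched = restrict(original, both)
```

For an intersect, you would restrict both dicts to the shared keys before calling combine_dicts(); for a subtract, you would write out restrict(original, only_original) directly.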
Right now it looks like you only use writerow once for the header:
writer.writerow(next(reader, []))
As francisco pointed out, uncommenting that last line may fix your problem. You can do this by removing the "#" at the beginning of the line.
I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.
I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, keyed by the N values.
import csv
import re
path = 'vector_data.txt'
# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
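A possible alternative sketch of the same parsing, assuming every label has the form h<digits>N<digits> and the columns are separated by arbitrary whitespace. It uses one regular expression with two groups, and collections.defaultdict in place of the try/except. The inline sample text stands in for the file.

```python
import re
from collections import defaultdict

# Stand-in for the contents of the data file.
sample = """\
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
"""

# Group 1 is the time between 'h' and 'N', group 2 is the number after 'N'.
label = re.compile(r'h(\d+)N(\d+)$')

t, hm, hs = defaultdict(list), defaultdict(list), defaultdict(list)
for line in sample.splitlines():
    fields = line.split()          # split on any run of whitespace
    if not fields:
        continue                   # skip blank lines
    match = label.match(fields[0])
    if match is None:
        continue                   # skip lines that do not start with a label
    n = int(match.group(2))
    t[n].append(int(match.group(1)))
    hm[n].append(float(fields[1]))
    hs[n].append(float(fields[2]))
```

Using line.split() rather than a single-space delimiter means the sketch should also cope with columns padded by several spaces.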
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS":  #! indicates data to follow, after 4 lines of junk text
        for i in range (0,4):
            junk = file.readline()
        for i in range (0,int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])
def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So, this:
x[i]
Should never work. That's what the error is complaining about.
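For illustration, a minimal sketch of what two-level indexing could look like once each line has been split into fields (the lines list here is made-up sample data):

```python
# Made-up stand-in for the result of f.readlines().
lines = ["1,2,3\n", "4,5,6\n"]

# Split each line into a list of fields, so rows[x][i] is row x, column i.
rows = [line.strip().split(",") for line in lines]

x, i = 1, 2
value = rows[x][i]      # index the row first, then the column
```

Note that value is still a string at this point; it would need int() or float() before any arithmetic.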
UPDATE
Since you asked for a recommendation on how to simplify your code (in a below comment), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with comprehensions:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
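So then to average it all out you could just divide the sum by the number of non-empty lines (a file that ends with a newline leaves an empty string at the end of file_lines, which the sum skips and the count should skip too):
average = ith_col_sum / len([line for line in file_lines if line])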
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv
def csv_average(filename, column):
    """ Returns the average of the values in
    column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            # csv.reader yields strings, so convert before summing
            column_values.append(float(row[column]))
    return sum(column_values) / len(column_values)
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the split function (http://docs.python.org/2/library/stdtypes.html#str.split):
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average/len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built in csv parser that is safer than splitting by ,. Check it out here: http://docs.python.org/2/library/csv.html
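A small sketch of why the csv module is safer, written for Python 3 and using a made-up line whose first field contains a comma inside quotes:

```python
import csv
import io

# One made-up line whose first field contains a comma inside quotes.
text = '"Smith, John",42\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive = text.strip().split(",")

# csv.reader understands the quoting and returns two clean fields.
parsed = next(csv.reader(io.StringIO(text)))
```

Here naive has three pieces while parsed has the two fields the file actually contains.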
In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict
with open('numcsv.csv') as f:
    reader = csv.reader(f)
    numbers = defaultdict(list)  # defaultdict(list) so each column starts with an empty list we can append to
    for row in reader:
        for column, value in enumerate(row,start=1):
            numbers[column].append(float(value))  # convert the value to a float 1. as the number may be a float and 2. when we calc average we need to force float division
# simple comprehension to print the averages: %d = integer, %f = float. items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i,sum(column)/(len(column))) for i,column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
Here's two methods. The first one just gets the average for the line (what your code above looks like it's doing). The second gets the average for a column (which is what your question asked)
''' This just gets the avg for a line'''
def averager(filename):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in xrange(len(avg)):
        count += len(avg[i])
    return count/len(avg)
''' This gets the avg for all "columns"
char is what we split on , ; | (etc)
'''
def averager2(filename, char):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0  # count of items
    total = 0  # sum of all the lengths
    for i in xrange(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in xrange(len(cols)):
            total += len(cols[j].strip())  # Remove line endings
    return total/float(count)
I’m trying to write a python script that sends a query to TweetSentiments.com API.
The idea is that it will perform like this –
Reads CSV tweet file > construct query > Interrogates API > format JSON response > writes to CSV file.
So far I’ve come up with this –
import csv
import urllib
import os
count = 0
TweetList=[] ## Creates empty list to store tweets.
TweetWriter = csv.writer(open('test.csv', 'w'), dialect='excel', delimiter=' ',quotechar='|')
TweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for rows in TweetReader:
    TweetList.append(rows)
    #print TweetList [0]
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
    connect = httplib.HTTPConnection("http://data.tweetsentiments.com:8080/api/analyze.json?q=")
    connect.result = json.load(urllib.request("POST", "", data))
    TweetWriter.write(result)
But when its run I get “line 20, data = urllib.urlencode(TweetList[rows]) Type Error: list indices must be integers, not list”
I know my list “TweetList” is storing the tweets just as I’d like but I don’t think I’m using urllib.urlencode correct. The API requires that queries are sent like –
http://data.tweetsentiments.com:8080/api/analyze.json?q= (text to analyze)
So the idea was that urllib.urlencode would simply add the tweets to the end of the address to allow a query.
The last four lines of code have become a mess after looking at so many examples. Your help would be much appreciated.
I'm not 100% sure what it is you're trying to do since I don't know what's the format of the files you are reading, but this part looks suspicious:
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
Since TweetList is a list, the for loop assigns one value from the list to rows on each iteration, not an index. So this, for example:
list = [1, 2, 3, 4]
for num in list:
    print num
will print 1 2 3 4. But this:
list = [1, 2, 3, 4]
for num in list:
    print list[num]
will print 2 3 4 and then end with this error: IndexError: list index out of range.
Can you please elaborate a bit more about the format of the files you are reading?
Edit
If I understand you correctly, you need something like this:
tweets = []
tweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for row in tweetReader:
    tweets.append({ 'tweet': row[0], 'date': row[1] })
for row in tweets:
    data = urllib.urlencode(row)
    .....
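A minimal Python 3 sketch of building the query string, where the function lives in urllib.parse instead of urllib. The parameter name q comes from the URL in the question; the sample tweets are made up, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://data.tweetsentiments.com:8080/api/analyze.json"

# Made-up tweets standing in for the first column of the CSV file.
tweets = ["I love this!", "so tired & sad"]

# urlencode takes a mapping (or a sequence of pairs), not a bare string or list.
urls = [BASE_URL + "?" + urlencode({"q": tweet}) for tweet in tweets]
```

Spaces become "+" and characters such as "&" are percent-escaped, so each tweet arrives as a single q value.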
i have this code:
import csv
import collections
def do_work():
    (data,counter)=get_file('thefile.csv')
    b=samples_subset1(data, counter,'/pythonwork/samples_subset3.csv',500)
    return
def get_file(start_file):
    with open(start_file, 'rb') as f:
        data = list(csv.reader(f))
        counter = collections.defaultdict(int)
        for row in data:
            counter[row[10]] += 1
        return (data,counter)
def samples_subset1(data,counter,output_file,sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b_counter=0
        b=[]
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
                b_counter+=1
    return (b)
i recently started learning python, and would like to start off with good habits. therefore, i was wondering if you can help me get started to turn this code into classes. i dont know where to start.
Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it inline with PEP8, the Python style guide. Here's a quick rewrite:
import csv
import collections
def do_work():
    data, counter = get_file('thefile.csv')
    b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)
def get_file(start_file):
    with open(start_file, 'rb') as f:
        counter = collections.defaultdict(int)
        data = list(csv.reader(f))
        for row in data:
            counter[row[10]] += 1
    return (data, counter)
def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
    return b
Notes:
No one uses more than 4 spaces to indent ever. Use 2 - 4. And all your levels of indentation should match.
Use a single space after the commas between arguments to functions ("F(a, b, c)" not "F(a,b,c)").
Naked return statements at the end of a function are meaningless. Functions without return statements implicitly return None.
Single space around all operators (a = 1, not a=1).
Do not wrap single values in parentheses. It looks like a tuple, but it isn't.
b_counter wasn't used at all, so I removed it.
csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you're using data, which is a list.
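A rough sketch of the iterator-based version, written for Python 3 and using a small in-memory file so it runs on its own. The function names, column index, cutoff and sample rows are made up. It makes two passes over the file, one to count and one to filter, without ever holding all rows in a list.

```python
import csv
import io
from collections import Counter


def count_column(f, column):
    """First pass: count how often each value appears in the given column."""
    return Counter(row[column] for row in csv.reader(f))


def rows_at_least(f, column, counts, cutoff):
    """Second pass: lazily yield rows whose column value is common enough."""
    for row in csv.reader(f):
        if counts[row[column]] >= cutoff:
            yield row


# Made-up data; a real script would open the CSV file instead.
f = io.StringIO("a,x\nb,y\nc,x\nd,z\n")

counts = count_column(f, 1)
f.seek(0)                      # rewind before reading the file a second time
kept = list(rows_at_least(f, 1, counts, 2))
```

In a real script the rows from rows_at_least() could be passed straight to writer.writerows() instead of being collected with list(), which is only done here to show the result.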
Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[ 10 ] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?
What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.
Some tips on your code:
Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f): Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.
Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.
Rewrite
import csv
import collections
def do_work():
    with open( 'thefile.csv' ) as f:
        # Open the file and count the rows.
        data, counter = get_file(f)
        # Go back to the start of the file.
        f.seek(0)
        # Filter to only common rows.
        b = samples_subset1(data, counter,
                            '/pythonwork/samples_subset3.csv', 500)
    return b
def get_file(f):
    counter = collections.defaultdict(int)
    data = csv.reader(f)
    for row in data:
        counter[row[10]] += 1
    return data, counter
def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
    return b