I have a CSV sheet with data like this:
| not used | Day 1 | Day 2 |
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get each person's progress as a dictionary where the keys are names and the values are score amounts.
The thing is, sometimes that dictionary will include new people and omit already existing ones. If a new person appears, they should get 0 for every previous day, and if the dict doesn't include an already existing person, that person should get a 0 score for that day.
My idea of solving this is doing lines = file.readlines() on that CSV file, then making a new list of people's names with

for line in lines:
    names.append(line.split(",")[0])

then making a copy of lines (newLines = lines) and going through the dict's keys, seeing if that person is already in the CSV; if so, appending the value followed by a comma.
But I'm stuck at the part of adding a score of 0.
Any help or contributions would be appreciated
EXAMPLE: Before, I will have this:
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the "before" CSV, and he got added with a score of 0 for every other day. I figured out that splitting one line with .split(',') gives a list of N elements, where N - 2 will be the number of zero scores to add prior to that person's first day.
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want, except that not-a-number values are inserted for empty cells, so you need to fill the NaNs with zero and reconvert everything to integer.
The one potential problem is that the resulting CSV is sorted: the new rows are not simply appended to the bottom (see the reindex sketch after the function below).
import pandas as pd
import errno
import os
INDEX_COL = "-"
def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    try:
        merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    except:
        raise
    else:
        os.rename(filename + ".tmp", filename)
    return merged
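If keeping the original row order matters, one option (a small sketch, not part of the function above; it reuses the df, new_df and merged names and would go right after the join line) is to reindex so existing names keep their order and new names land at the bottom:

# keep the file's existing rows in their original order, append any new names at the end
order = list(df.index) + [name for name in new_df.index if name not in df.index]
merged = merged.reindex(order)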
#============================================================================
# TEST
#============================================================================
test_file = "this_is_a_test.csv"
before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""
after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""
test_dicts = [
    ["day4", {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780}],
]

open(test_file, "w").write(before)
for name, scores in test_dicts:
    add_days_score(test_file, name, scores)

print("want\n", after, "\n")
got = open(test_file).read()
print("got\n", got, "\n")
if got != after:
    print("FAILED")
Related
I have Python code that gives data as a list in the format [author, item, number].
I want to add this data to an Excel file that has an Author Names column and one column per item (Oranges, Apples, Peaches).
The python script will:
Check if the author given in the list is in the Author Names column, and add the name if it is not present.
Then the code will add the number in the column that matches the item given.
For example:
['author2', 'Oranges', 300]: 300 would be added to the Oranges column on the row for author2.
If the person adds a list again, like ['author2', 'Oranges', 500], and an entry already exists for that item, the numbers will be added together so the final result is 800.
How do I get started with this? I'm mostly confused about how to read columns/rows to find where to insert things.
Here's one example of how you might do it:
import csv
from collections import defaultdict
# Make a dictionary for the authors, that will automatically start all the
# values at 0 when you try to add a new author
authors = defaultdict(lambda: dict({'Oranges':0, 'Apples':0, 'Peaches':0}))
# Add some items
authors['author1']['Oranges'] += 300
authors['author2']['Peaches'] += 200
authors['author3']['Apples'] += 50
authors['author1']['Apples'] += 20
authors['author2']['Oranges'] += 250
authors['author3']['Apples'] += 100
# Write the output to a csv file, for opening in Excel
with open('authors_csv.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write Header
    writer.writerow(['Author Names', 'Oranges', 'Apples', 'Peaches'])
    for key, val in authors.items():
        writer.writerow(
            [key, val['Oranges'], val['Apples'], val['Peaches']])
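If the totals need to keep accumulating across runs, one approach (a minimal sketch, assuming the same file name and column layout as above; load_authors is a hypothetical helper, not part of the code above) is to read the previously written CSV back into the same defaultdict before applying new items:

import csv
from collections import defaultdict

def load_authors(path):
    # unseen authors start at zero for every item
    authors = defaultdict(lambda: {'Oranges': 0, 'Apples': 0, 'Peaches': 0})
    try:
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                name = row.pop('Author Names')
                for item, value in row.items():
                    authors[name][item] = int(value)
    except FileNotFoundError:
        pass  # first run: nothing saved yet
    return authors

After loading, the authors['author1']['Oranges'] += 300 style updates and the write-out above work unchanged.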
For more details on writing out to CSV's you can see the documentation here: https://docs.python.org/3/library/csv.html
Alternatively, just search using DuckDuckGo or your favorite search engine.
Most likely your spreadsheet is stored externally, and you want to read in some new data in the list format [author, item, number].
Python pandas is great for this. The code below reads in the data file; let's call it authorVolumes.xlsx. This assumes the spreadsheet is already in the folder we are working in and looks as it does in your first picture. Also, the items are limited to the ones already in the spreadsheet, since you did not mention adding new items in the question.
import pandas as pd
df = pd.read_excel('authorVolumes.xlsx', index_col='Author Names').fillna(0)
print(df)
Author Names Oranges Apples Peaches
author1 0 0 0
author2 0 0 0
author3 0 0 0
author4 0 0 0
author5 0 0 0
Now let's define a function to handle the updates.
def updateVolumes(author, item, amount):
    global df  # df is reassigned below, so it must be declared global
    try:
        df.loc[author, item] += amount
    except KeyError:
        # author (or item) not there yet: add a new row and fill the gaps with 0
        df = pd.concat([df, pd.DataFrame([amount], index=[author], columns=[item])]).fillna(0)
Time to handle the first update: ['author2', 'Oranges', 300]
author, item, amount = ['author2', 'Oranges', 300]
updateVolumes(author, item, amount)
Now to handle one where the author is not there:
author, item, amount = ['author10', 'Apples', 300]
updateVolumes(author, item, amount)
When we are done we can save our Excel file back out to the file system.
df.to_excel('authorVolumes.xlsx')
I have an original dataset with information stored as a list of dicts in a column (this is a MongoDB extract). This is the column:
[{u'domain_id': ObjectId('A'), u'p': 1},
{u'domain_id': ObjectId('B'), u'p': 2},
{u'domain_id': ObjectId('B'), u'p': 3},
...
{u'domain_id': ObjectId('CG'), u'p': 101}]
I'm only interested in the first 10 dicts ('p' values from 1 to 10). The output dataframe should look like this:
index | A | ... | B
------------------------
0 | 1 | ... | 2
1 | Nan | ... | Nan
2 | Nan | ... | 3
e.g.: For each line of my original DataFrame, I create a column for each domain_id and associate it with the corresponding 'p' value. I can have the same domain_id for several 'p' values; in this case I only keep the first one (the smallest 'p').
Here is my current code, which may be easier to understand:
first = True
for i in df.index[:]:  # for each line of the original Dataframe
    temp_list = df["positions"][i]  # this is the column with the list of dicts inside
    col_list = []
    data_list = []
    for j in range(10):  # get the first 10 values
        try:
            if temp_list[j]["domain_id"] not in col_list:  # check if domain_id already exists
                col_list.append(temp_list[j]["domain_id"])
                data_list.append(temp_list[j]["p"])
        except IndexError as e:
            print e
    # create a temporary DataFrame for this line of the original DataFrame
    df_temp = pd.DataFrame([np.transpose(data_list)], columns=col_list)
    if first:
        df_kw = df_temp
        first = False
    else:
        # concat all the temporary DataFrames: now I have my output Dataframe,
        # with the same number of lines as my original DataFrame
        df_kw = pd.concat([df_kw, df_temp], axis=0, ignore_index=True)
This is all working fine, but it is very very slow as I have 15k lines and end up with 10k columns.
I'm sure (or at least I very much hope) that there is a simpler and faster solution: any advice will be much appreciated.
I found a decent solution: the slow part is the concatenation, so it is much more efficient to first create the dataframe and then update the values.
Create the DataFrame:
col_list = []
for i in df.index[:]:
    temp_list = df["positions"][i]
    for j in range(10):
        try:
            col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print e

df_total = pd.DataFrame(index=df.index, columns=set(col_list))
Update the values:

for i in df.index[:]:
    temp_list = df["positions"][i]
    col_list = []
    for j in range(10):
        try:
            if temp_list[j]["domain_id"] not in col_list:  # avoid overwriting values
                df_total.loc[i, temp_list[j]["domain_id"]] = temp_list[j]["p"]
                col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print e
Creating a 15k x 6k DataFrame took about 6 seconds on my computer, and filling it took 27 seconds.
I killed the former solution after more than 1 hour running, so this is really faster.
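For reference, here is a further sketch (not part of the original answer; it assumes df["positions"] holds the list of dicts as above): build one plain dict per row and let the DataFrame constructor create and align all the columns in a single pass, which avoids the per-cell .loc writes entirely.

import pandas as pd

records = []
for positions in df["positions"]:
    row = {}
    for d in positions[:10]:                    # first 10 entries only
        row.setdefault(d["domain_id"], d["p"])  # keep the first 'p' per domain_id
    records.append(row)
df_total = pd.DataFrame(records, index=df.index)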
I have two csv files and I want to create a third csv from a merge of the two. Here's how my files look:
Num | status
1213 | closed
4223 | open
2311 | open
and another file has this:
Num | code
1002 | 9822
1213 | 1891
4223 | 0011
So, here is my little code. I was trying to loop through both files, but it does not print the output with the third column added and matched to the correct values.
def links():
    first = open('closed.csv')
    csv_file = csv.reader(first)
    second = open('links.csv')
    csv_file2 = csv.reader(second)
    for row in csv_file:
        for secrow in csv_file2:
            if row[0] == secrow[0]:
                print row[0] + "," + row[1] + "," + secrow[0]
        time.sleep(1)
so what I want is something like:
Num | status | code
1213 | closed | 1891
4223 | open | 0011
2311 | open | blank no match
If you decide to use pandas, you can do it in only five lines.
import pandas as pd
first = pd.read_csv('closed.csv')
second = pd.read_csv('links.csv')
merged = pd.merge(first, second, how='left', on='Num')
merged.to_csv('merged.csv', index=False)
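If you also want the literal "blank no match" text instead of empty cells in the output, one more line would do it (assuming the second file's column is actually named code):

merged['code'] = merged['code'].fillna('blank no match')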
This is definitely a job for pandas. You can easily read in both csv files as DataFrames and use either merge or concat. It'll be way faster and you can do it in just a few lines of code.
The problem is that you can iterate over a csv reader only once, so csv_file2 is exhausted after the first pass through the outer loop. To solve that, you should save the rows of csv_file2 in a list and iterate over that saved list.
It could look like this:
import time, csv

def links():
    first = open('closed.csv')
    csv_file = csv.reader(first, delimiter="|")
    second = open('links.csv')
    csv_file2 = csv.reader(second, delimiter="|")
    list = []
    for row in csv_file2:
        list.append(row)
    for row in csv_file:
        match = False
        for secrow in list:
            if row[0].replace(" ", "") == secrow[0].replace(" ", ""):
                print row[0] + "," + row[1] + "," + secrow[1]
                match = True
        if not match:
            print row[0] + "," + row[1] + ", blank no match"
        time.sleep(1)
Output:
Num , status, code
1213 , closed, 1891
4223 , open, 0011
2311 , open, blank no match
You could read the values of the second file into a dictionary and then add them to the first.
Code = {}
for row in csv_file2:
    Code[row[0]] = row[1]

for row in csv_file1:
    row.append(Code.get(row[0], "blank no match"))
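Putting that together, here is a minimal sketch (assuming both files are plain comma-separated with the Num value in the first column, and that the result should go to a third file):

import csv

with open('closed.csv') as f1, open('links.csv') as f2, \
     open('merged.csv', 'w', newline='') as out:
    # Num -> code lookup built from the second file
    code = {row[0]: row[1] for row in csv.reader(f2)}
    writer = csv.writer(out)
    for row in csv.reader(f1):
        # append the matching code, or the placeholder when there is no match
        row.append(code.get(row[0], "blank no match"))
        writer.writerow(row)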
This code will do it for you:
import csv

def links():
    # open both files
    with open('closed.csv') as closed, open('links.csv') as links:
        # using DictReader instead to be able more easily access information by num
        csv_closed = csv.DictReader(closed)
        csv_links = csv.DictReader(links)
        # create dictionaries out of the two CSV files using dictionary comprehensions
        num_dict = {row['num']: row['status'] for row in csv_closed}
        link_dict = {row['num']: row['code'] for row in csv_links}
        # print header, each column has width of 8 characters
        print("{0:8} | {1:8} | {2:8}".format("Num", "Status", "Code"))
        # print the information
        for num, status in num_dict.items():
            # note this call to link_dict.get() - we are getting values out of the link dictionary,
            # but specifying a default return value of an empty string if num is not found in it
            # to avoid an exception
            print("{0:8} | {1:8} | {2:8}".format(num, status, link_dict.get(num, '')))

links()
In it, I'm taking advantage of dictionaries, which let you access information by keys. I'm also using implicit loops (the dictionary comprehensions) which tend to be faster and require less code.
There are two quirks of this code that you should be aware of, that your example suggests are fine:
Order is not preserved (because we're using dictionaries)
Num entries that are in links.csv but not closed.csv are not included in the printout
Last note: I made some assumptions about how your input files are formatted since you called them "CSV" files. This is what my input files looked like for this code:
closed.csv
num,status
1213,closed
4223,open
2311,open
links.csv
num,code
1002,9822
1213,1891
4223,0011
Given those input files, the result looks like this:
Num | Status | Code
1213 | closed | 1891
2311 | open |
4223 | open | 0011
I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma as a thousands separator:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is now numerical:
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, taking the first observed (you can also take the last):
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
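Note that on current pandas versions the take_last argument has been replaced by keep, and the .ix indexer by .loc, so the equivalent calls would look something like this:

# keep='first' keeps the first row for each duplicated timestamp; keep='last' keeps the last one
d = d.drop_duplicates('Contest_Date_EST', keep='first')
# label access without the removed .ix indexer
type(d.loc[0, 'Prize_Pool'])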
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job. Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print "\nKey: " + key
            i = 1
            for value in rows[key]:
                print "\nValue {index} : {value}".format(index=i, value=value)
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'rU') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Format each row into the final string
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
        return groupedRows

    def normalizeRow(self, row):
        row[1] = float(row[1].replace(',', ''))  # "Prize_Pool"
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
Output:
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
The pandas script below keeps modifying my data exported to CSV when it shouldn't be.
If you compare the original file to the modified testing2.csv, you'll see that numbers like 0.357 from the first line turn into 0.35700000000000004, yet on line 2 the number 0.1128 doesn't change at all...
It should NOT be modifying these numbers, they should all remain as they are.
testing.py
import re
import pandas

# each block in the text file will be one element of this list
matchers = [[]]
i = 0
with open('testing.txt') as infile:
    for line in infile:
        line = line.strip()
        # Blocks are separated by blank lines
        if len(line) == 0:
            i += 1
            matchers.append([])
            # assume there are always two blank lines between items
            # and just skip to the next line
            infile.next()
            continue
        matchers[i].append(line)

# This regular expression matches the variable number of students in each block
studentlike = re.compile('(\d+) (.+) (\d+/\d+)')
# These are the names of the fields we expect at the end of each block
datanames = ['Data', 'misc2', 'bla3']

# We will build a table containing a list of elements for each student
table = []
for matcher in matchers:
    # We use an iterator over the block lines to make indexing simpler
    it = iter(matcher)
    # The first two elements are match values
    m1, m2 = it.next(), it.next()
    # then there are a number of students
    students = []
    for possiblestudent in it:
        m = studentlike.match(possiblestudent)
        if m:
            students.append(list(m.groups()))
        else:
            break
    # After the students come the data elements, which we read into a dictionary
    # We also add in the last possible student line as that didn't match the student re
    dataitems = dict(item.split() for item in [possiblestudent] + list(it))
    # Finally we construct the table
    for student in students:
        # We use the dictionary .get() method to return blanks for the missing fields
        table.append([m1, m2] + student + [dataitems.get(d, '') for d in datanames])

textcols = ['MATCH2', 'MATCH1', 'TITLE01', 'MATCH3', 'TITLE02', 'Data', 'misc2', 'bla3']

csvdata = pandas.read_csv('testing.csv')
textdata = pandas.DataFrame(table, columns=textcols)

# Add any new columns
newCols = textdata.columns - csvdata.columns
for c in newCols:
    csvdata[c] = None

mergecols = ['MATCH2', 'MATCH1', 'MATCH3']
csvdata.set_index(mergecols, inplace=True, drop=False)
textdata.set_index(mergecols, inplace=True, drop=False)

csvdata.update(textdata)
csvdata.to_csv('testing2.csv', index=False)
testing.csv
http://pastebin.com/raw.php?i=HxVE0nA0 (Uploaded because of file size)
testing.txt
MData (N/A)
DMATCH1
3 Tommy 144512/23332
1 Jim 90000/222311
1 Elz M 90000/222311
1 Ben 90000/222311
Data $50.90
misc2 $10.40
bla3 $20.20
MData (B/B)
DMATCH2
4 James Smith 2333/114441
4 Mike 90000/222311
4 Jessica Long 2333/114441
Data $50.90
bla3 $5.44
Anyone have any ideas how to fix this?
(The above example recreates the issue 100% perfectly. Took me forever to find out what was causing this problem.)
This looks like a precision issue.
Try changing your to_csv lines to include the argument float_format='%.4f', which will round things to four decimal places.
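For example, applied to the script above (same output file, just with the extra argument):

csvdata.to_csv('testing2.csv', index=False, float_format='%.4f')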
Pandas supports two basic numeric types, int64 and float64. float64 will not represent decimal values exactly, because it is a floating-point type. Your options are:
1. Specify float_format as suggested by @TomAugspurger (this can be done column-wise or for the whole dataframe)
2. Convert your column dtype to object
Option 2 can be done like this:
df['col_name'] = df['col_name'].astype(object)
Try this :)
csvdata = pandas.read_csv('testing.csv', dtype={'TITLE5' : 'object', 'TITLE5.1' : 'object', 'TITLE5.2' : 'object', 'TITLE5.3' : 'object'})