Get column total from list - python

I'm trying to get the sum total of a particular column from a list in a CSV file. I'm able to select the column and remove the header but I can't add up all of the values.
import csv

projectFile = open('data.csv')
projectReader = csv.reader(projectFile)
projectData = list(projectReader)
sum = 0
for amount in projectData[1:]:
    amount = amount[1]
    print(amount)
I've tried sum(amount), which didn't work, and then tried adding a global variable, sum = 0, and adding the float of the list to it, e.g. total = int(sum + float(amount)), and got errors. I can't use Pandas or mapping for this.
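For what it's worth, sum(amount) fails here because amount is a single string rather than a list of numbers, and once sum = 0 has run, the name sum no longer refers to the built-in function at all. A minimal sketch of the running-total approach the question is aiming at, assuming the amounts sit in the second column:
import csv

total = 0.0
with open('data.csv') as projectFile:
    projectReader = csv.reader(projectFile)
    next(projectReader)  # skip the header row
    for row in projectReader:
        total += float(row[1])  # second column holds the amount
print(total)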
EDIT: CSV example -

Here's an example of calculating the sum of the 3rd column from a 3x3 matrix (stored as a list of lists). Note that a column index of 2 corresponds to the 3rd column:
col = 2
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
total = sum(row[col] for row in my_matrix)
print(total)
The output is:
18
(calculated as 3+6+9)
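As an aside (not part of the original answer): if you ever need the totals of every column at once, transposing with zip() gives them in one pass:
my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# zip(*matrix) transposes rows into columns
col_totals = [sum(col) for col in zip(*my_matrix)]
print(col_totals)  # [12, 15, 18]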
For a string matrix (based on a comment by @mpstring), just add float() to convert each string to a float:
col = 2
mymat = [['1','2','3'],['4','5','6'],['7','8','9']]
total = sum(float(row[col]) for row in mymat)
print(total)
Given the example data.csv (based on the updated question by @mpstring):
import csv

with open('data.csv') as projectFile:
    projectReader = csv.reader(projectFile)
    next(projectReader)  # skip the header row
    projectData = list(projectReader)
total = sum(float(row[1]) for row in projectData)
print(total)
Output is
216.61

Related

Calculating averaged data in and writing to csv from a pandas dataframe

I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
The data looks like this:
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair, I want to iterate over two series of text values and do three things in the z direction:
1. Calculate the average of one numeric value for all the values with a third specific text value.
2. Sum another numeric value for all the values with the same text value.
3. Write a resultant table of 'x, y, average, sum' to a csv.
My code does part three (albeit very slowly) but doesn't calculate 1 or 2 or at least I don't appear to get the average and sum calculations in my output.
What have I done wrong and how can I speed it up?
for text1 in text_list1:
    for text2 in text_list2:
        # Get the data into a smaller dataframe
        df = data.loc[(data["textfield1"] == text1) & (data["textfield2"] == text2)]

        # Get the minimum and maximum x and y
        minXw = df['x'].min()
        maxXw = df['x'].max()
        minYw = df['y'].min()
        maxYw = df['y'].max()

        # dictionary for quicker printing
        dict_out = {}
        rows_list = []

        # Make output filename
        filenameOut = text1 + "_" + text2 + "_Values.csv"

        # Start looping through x values
        for x in np.arange(minXw, maxXw, x_inc):
            xcount += 1
            # Start looping through y values
            for y in np.arange(minYw, maxYw, y_inc):
                ycount += 1
                # calculate average and sum
                ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
                # Make dictionary of output values
                dict_out = dict([('text1', text1),
                                 ('text2', text2),
                                 ('text3', df['text3']),
                                 ('x', x - x_inc),
                                 ('y', y - y_inc),
                                 ('ave', ave_val),
                                 ('sum', sum_val)])
                rows_list_c.append(dict_out)

        # Write csv
        columns = ['text1', 'text2', 'text3', 'x', 'y', 'ave', 'sum']
        with open(filenameOut, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=columns)
            writer.writeheader()
            for data in dict_out:
                writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0
It's not really clear what you're trying to do, but here is a starting point.
If you only need to process rows with a specific text3 value, start by filtering out the other rows:
df = df[df.text3=="my_value"]
If at this point you no longer need text3, you can also drop it:
df = df.drop(columns="text3")
Then you process several sub dataframes, and write each of them to their own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1 + "_" + text2 + "_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function. (Note that you don't really need to make it a separate function; you could just as well put the function body inside the for loop.)
At this point, if I understand correctly, you want to compute the sum and the average of all rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of each group.
def process(sub_df):
    # drop the text1 and text2 columns since they are in the filename anyway
    out = sub_df.drop(columns=["text1", "text2"])
    # Compute mean and sum
    return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's pretty much it.
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1+"_"+text2+"_Values.csv")
To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])

for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": np.mean, "float2": sum}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
.groupby(["text1", "text2"])
for (text1, text2), group in groups:
group.groupby(["text3", "x_bin", "y_bin"]) \
.agg({"x": "first", "y": "first", "float1": np.mean, "float2": sum}) \
.to_csv(f"{text1}_{text2}_Values.csv")

efficient solution to create multiple columns with a formula - pandas/python

I'm trying to create multiple columns (a couple of hundred) using values within the same df. Is there a more efficient way for me to create multiple columns in batches? Below is an example where I have to manually attach the new column names jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1, etc. to the formula.
i0, i1, i2, etc. are the original column names, and rn is the value within the column.
i0='jwrl2_rank'
i1='jwrl2_rank.1'
i2='jwrl2_rank.2'
i3='jwrl2_rank.3'
i4='jwrl2_rank.4'
i5='jwrl2_rank.5'
i6='jwrl2_rank.6'
i7='jwrl2_rank.7'
rn=1
df['jwrl2_rank.r1']=((df.loc[(df[i0]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i0]==rn),i0].count()))-1
df['jwrl2_rank.1r1']=((df.loc[(df[i1]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i1]==rn),i1].count()))-1
df['jwrl2_rank.2r1']=((df.loc[(df[i2]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i2]==rn),i2].count()))-1
df['jwrl2_rank.3r1']=((df.loc[(df[i3]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i3]==rn),i3].count()))-1
df['jwrl2_rank.4r1']=((df.loc[(df[i4]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i4]==rn),i4].count()))-1
df['jwrl2_rank.5r1']=((df.loc[(df[i5]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i5]==rn),i5].count()))-1
df['jwrl2_rank.6r1']=((df.loc[(df[i6]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i6]==rn),i6].count()))-1
df['jwrl2_rank.7r1']=((df.loc[(df[i7]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i7]==rn),i7].count()))-1
Many thanks. Regards.
Using a for loop should work.
Incrementing string value
By using string interpolation you could solve your problem. See here for a quick introduction. I am using f-strings in the example below.
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(1, MAX_NUMBER + 1):
    new_name = f"{base_name}.{i}"
    print(new_name)
>>>
jwrl2_rank.1
jwrl2_rank.2
jwrl2_rank.3
Example of for loop
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(MAX_NUMBER + 1):
    current_iN = f"{base_name}.{i}"
    new_col_name = f"{base_name}.{i}r1"
    if i == 0:  # compensate for the missing zero in the column name
        current_iN = base_name
        new_col_name = f"{base_name}.r1"
    df[new_col_name] = ((df.loc[(df[current_iN]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[current_iN]==rn),current_iN].count()))-1
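With rn and df as defined in the question, this produces jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1 and jwrl2_rank.3r1; raising MAX_NUMBER to 7 covers all eight original columns.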

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list sorted in ascending order, while I simply want the original order preserved, only with duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads. I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best data structures for housing data so that it can be recalled and used later (plus it is quite useful to name the columns without affecting the data contained in the dataframe).
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j+1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can try DataFrame.drop_duplicates().
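A minimal sketch of the order-preserving route, assuming C holds a single row of prices: drop_duplicates() keeps the first occurrence in place, unlike np.unique(), which sorts:
import pandas as pd

# hypothetical values illustrating the difference
C_series = pd.Series([52.0, 50.0, 52.0, 46.0, 50.0])
C1 = C_series.drop_duplicates().reset_index(drop=True)
print(C1.tolist())  # [52.0, 50.0, 46.0] - original order, duplicates removed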

compare sum of column values in python

I have a csv file loaded into a python object. 15 of the columns contain binary values. I have several thousand rows.
I want to count the sum of the binary values in each of the columns and sort the result in ascending order.
I only made it to:
sum1=sum(products['1'])
sum2=sum(products['2'])
sum3=sum(products['3'])
....
...
sum15=sum(products['15'])
and process the result manually. Is there a programmatic way to achieve this?
How about this:
sorted_sum = sorted(sum(products[str(i)]) for i in range(1, 16))
sorted_sum is the sorted list of column sums. The str(i) matches the question's string column names '1' through '15'; if your columns are actually indexed from 0 to 14 instead, adjust the range accordingly.
You will find the solution here:
import csv

with open("file.csv") as fin:
    reader = csv.reader(fin)
    next(reader)  # skip the header line
    totals = [0] * 15
    # read the file once, accumulating every column as we go
    for row in reader:
        for i in range(15):
            totals[i] += int(row[i])
print(sorted(totals))

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc. - the same code appears multiple times with different values (0, 1, or ? for not known).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary? I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns the pattern to me? I expect a percentage pattern, e.g. 90% of 1s in column 1, 80% of 1s in column 2.
Alright, I am going to assume you want this in a dictionary for storage purposes, but you don't want that with this kind of data. Use a pandas DataFrame instead.
This is how you get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
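One hedged note: since the question mentions ? for unknown values, you may want pandas to treat those as missing when reading, so the binary columns stay numeric (this assumes the unknowns are literally encoded as '?'):
# treat '?' as missing rather than as a string value
df = pd.read_csv(my_file, na_values='?')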
Now, you don't need a package to get the pattern you are looking for; just write a simple algorithm!
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # get the dtype of a data column so we only grab the numeric columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and the number of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary mapping column names to their fraction of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now if you want to get the percentage of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now if you want it to do every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    # the stored value is a fraction, so convert it to a percentage
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
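As an aside (beyond the original answer): because the columns are binary 0/1, the mean of each numeric column is already the fraction of ones, so pandas can build the same mapping directly:
# mean of a 0/1 column equals the fraction of ones in it
percentages = df.select_dtypes(include='number').mean().to_dict()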
let me know if you need anything else!
