How to create a summary table for every column? - python

I have Pandas DataFrames (one per CSV file) with around 100 columns each, and I have to create a summary table covering all of those columns. In the summary DataFrame I want to have a name (one from each of these data frames; this part I'm doing okay) plus the mean and std of every column.
So my final table should have shape n x m, where n is the number of files and m is the number of columns × 2 (mean and std). Something like this:
name mean_col1 std_col1 mean_col2 std_col2
ABC 22.815293 0.103567 90.277533 0.333333
DCE 22.193991 0.12389 87.17391 0.123457
I tried the following but I'm not getting what I wanted:
list_with_names = []
list_for_mean_and_std = []

for file in glob.glob("/data/path/*.csv"):
    df = pd.read_csv(file)
    output = {'name': df['name'][0]}
    list_with_names.append(output)
    numerical_cols = df.select_dtypes('float64')
    for column in numerical_cols:
        mean_col = numerical_cols[column].mean()
        std_col = numerical_cols[column].std()
        output_2 = {'mean': mean_col,
                    'std': std_col}
        list_for_mean_and_std.append(output_2)

summary = pd.DataFrame(list_with_names, list_for_mean_and_std)
And I'm getting an error: Shape of passed values is (183, 1), indices imply (7874, 1). I'm clearly combining the names with the mean/std values the wrong way, but I have no idea how to fix it.
I would be glad for any advice on how to change it.

Of course, Pandas has a method for that - describe():
df.describe()
That gives more statistics than you requested. If you are interested only in mean and std, you can select just those rows (describe() returns the statistics as the index):
df.describe().loc[['mean', 'std']]
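To get the n x m layout from the question (one row per file, with mean_&lt;col&gt; and std_&lt;col&gt; columns), one possible approach is to aggregate each file and flatten the result into a single row. A minimal sketch, assuming the same /data/path/*.csv files and 'name' column as in the question:

import glob
import pandas as pd

rows = []
for file in glob.glob("/data/path/*.csv"):
    df = pd.read_csv(file)
    # mean/std for every float column of this file
    stats = df.select_dtypes('float64').agg(['mean', 'std'])
    row = {'name': df['name'][0]}
    # flatten into mean_<col> / std_<col> entries
    for col in stats.columns:
        row[f'mean_{col}'] = stats.loc['mean', col]
        row[f'std_{col}'] = stats.loc['std', col]
    rows.append(row)

summary = pd.DataFrame(rows)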

Related

Creating a Dictionary and then a dataframe from different Types

I'm having a problem manipulating different types. Here is what I'm doing:
row_list = []
for x in a:
    df = selectinfo(x)
    analysis = df[['Sales']].copy()
    decompose_result_mult = seasonal_decompose(analysis, model="additive")

    date = decompose_result_mult.trend.to_frame()
    date.reset_index(inplace=True)
    date = date.iloc[:, 0]

    trend = decompose_result_mult.trend.to_frame()
    trend.reset_index(inplace=True)
    trend = trend.iloc[:, 1]

    seasonal = decompose_result_mult.seasonal.to_frame()
    seasonal.reset_index(inplace=True)
    seasonal = seasonal.iloc[:, 1]

    residual = decompose_result_mult.resid.to_frame()
    residual.reset_index(inplace=True)
    residual = residual.iloc[:, 1]

    observed = decompose_result_mult.observed.to_frame()
    observed.reset_index(inplace=True)
    observed = observed.iloc[:, 1]

    dict = {'Date': date, 'region': x, 'Trend': trend, 'Seasonal': seasonal,
            'Residual': residual, 'Sales': observed}
    row_list.append(dict)

pandasTable = pd.DataFrame(row_list)
The problem is that when I run my code I get the following result:
It is a dataframe that has Series inside it... Any help? I would like to have one value per row and not one list/Series per row.
Thank you!
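One way to end up with one scalar value per row would be to build a plain DataFrame per region from the decomposition components and concatenate them, instead of storing whole Series in a dict. A rough sketch, assuming a and selectinfo() are defined as in the question:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

frames = []
for x in a:  # 'a' and selectinfo() come from the question
    df = selectinfo(x)
    result = seasonal_decompose(df['Sales'], model="additive")
    frames.append(pd.DataFrame({
        'Date': result.trend.index,
        'region': x,
        'Trend': result.trend.values,
        'Seasonal': result.seasonal.values,
        'Residual': result.resid.values,
        'Sales': result.observed.values,
    }))

pandasTable = pd.concat(frames, ignore_index=True)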

How to get a percentage of a pandas dataframe

I have a df of 300000 rows and 25 columns.
Here's a link to 21 rows of the dataset.
I have added a unique index to all the rows, using uuid.uuid4().
Now I only want a random portion of the dataset (say 25%). Here is what I am trying to do to get it, but it's not working:
def gen_uuid(self, df, percentage=1.0, uuid_list=[]):
    for i in range(df.shape[0]):
        uuid_list.append(str(uuid.uuid4()))
    uuid_pd = pd.Series(uuid_list)
    df_uuid = df.copy()
    df_uuid['id'] = uuid_pd
    df_uuid = df_uuid.set_index('id')
    if percentage == 1.0:
        return df_uuid
    else:
        uuid_list_sample = random.sample(uuid_list, int(len(uuid_list) * percentage))
        return df_uuid[df_uuid.index.any() in uuid_list_sample]
But this gives an error saying KeyError: False.
The uuid_list_sample that I generate is the correct length
So I have 2 questions:
How do I get the above code to work as intended? That is, return a random portion of the pandas df based on the index.
How do I in general get a percentage of the whole pandas data frame? I was looking at pandas.DataFrame.quantile, but I'm not sure if that does what I'm looking for.
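For the second question, pandas can take a random fraction of rows directly with DataFrame.sample; a minimal sketch (the 25% figure is from the question, random_state is only there for reproducibility):

import pandas as pd

# Random 25% of the rows
sample_df = df_uuid.sample(frac=0.25, random_state=42)

# To filter by a pre-built list of index values instead, use isin() on the
# index; the expression df_uuid.index.any() in uuid_list_sample collapses to
# a single False, which is what produces the KeyError: False.
sample_df = df_uuid[df_uuid.index.isin(uuid_list_sample)]

DataFrame.quantile computes value quantiles of the numeric columns, which is a different operation from taking a random sample of rows.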

Calculating averaged data and writing it to csv from a pandas dataframe

I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
Data looks like this
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair I want to iterate over two series of text values and do three things in the z direction:
1. Calculate the average of one numeric value for all the values with a third specific text value
2. Sum another numeric value for all the values with the same text value
3. Write a resultant table of 'x, y, average, sum' to a csv.
My code does part 3 (albeit very slowly) but doesn't do 1 or 2, or at least I don't appear to get the average and sum calculations in my output.
What have I done wrong and how can I speed it up?
for text1 in text_list1:
    for text2 in text_list2:
        # Get the data into smaller dataframe
        df = data.loc[(data["textfield1"] == text1) & (data["textfield2"] == text2)]

        # Get the minimum and maximum x and y
        minXw = df['x'].min()
        maxXw = df['x'].max()
        minYw = df['y'].min()
        maxYw = df['y'].max()

        # dictionary for quicker printing
        dict_out = {}
        rows_list = []

        # Make output filename
        filenameOut = text1 + "_" + text2 + "_Values.csv"

        # Start looping through x values
        for x in np.arange(minXw, maxXw, x_inc):
            xcount += 1
            # Start looping through y values
            for y in np.arange(minYw, maxYw, y_inc):
                ycount += 1
                # calculate average and sum
                ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
                # Make Dictionary of output values
                dict_out = dict([('text1', text1),
                                 ('text2', text2),
                                 ('text3', df['text3']),
                                 ('x', x - x_inc),
                                 ('y', y - y_inc),
                                 ('ave', ave_val),
                                 ('sum', sum_val)])
                rows_list_c.append(dict_out)

        # Write csv
        columns = ['text1', 'text2', 'text3', 'x', 'y', 'ave', 'sum']
        with open(filenameOut, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=columns)
            writer.writeheader()
            for data in dict_out:
                writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0
Not really clear what you're trying to do, but here is a starting point.
If you only need to process rows with a specific text3 value, start by filtering out the other rows:
df = df[df.text3 == "my_value"]
If at this point you do not need text3 anymore, you can also drop it:
df = df.drop(columns="text3")
Then you process several sub-dataframes and write each of them to their own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1 + "_" + text2 + "_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function. (Note that you don't really need to make it a separate function; you could just as well put the function body in the for loop.)
At this point, if I understand correctly, you want to compute the sum and the average of all rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of each group.
def process(sub_df):
    # drop the text1 and text2 columns since they are in the filename anyway
    out = sub_df.drop(columns=["text1", "text2"])
    # Compute mean and sum
    return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's pretty much it.
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3 == "my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1", "text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1 + "_" + text2 + "_Values.csv")
To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])

for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": np.mean, "float2": sum}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
.groupby(["text1", "text2"])
for (text1, text2), group in groups:
group.groupby(["text3", "x_bin", "y_bin"]) \
.agg({"x": "first", "y": "first", "float1": np.mean, "float2": sum}) \
.to_csv(f"{text1}_{text2}_Values.csv")

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns the values in ascending order, while I simply want the original order preserved with only the duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), unfortunately to no avail, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand these are pretty much the best data structures for housing data so that it can be recalled and used later, plus it is quite useful to be able to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j+1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your data frame contains any duplicates.
If it does, you can use DataFrame.drop_duplicates(), which keeps the first occurrence of each value, so the original order is preserved.
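A minimal sketch of applying that to the C matrix from the question (assuming C is the single-row DataFrame built above, so its values are flattened into a Series first):

import pandas as pd

# Flatten the single-row C matrix into a Series of daily prices, then
# drop duplicates; drop_duplicates keeps the first occurrence, so the
# original order is preserved.
c_series = pd.Series(C.values.ravel())
C1 = c_series.drop_duplicates().reset_index(drop=True).to_frame()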

Python (Pandas) - Working with numeric data but adding non numeric data back

I have a CSV file that looks like this:
Build,Avg,Min,Max
BuildA,56.190,39.123,60.1039
BuildX,57.11,40.102,60.200
BuildZER,55.1134,35.129404123,60.20121
I want to get the average, min, and max of each column and have each of these stats be a new row. I exclude the non-numeric column (the build column) and then run the statistics. I accomplish this by doing:
df = pd.read_csv('fakedata.csv')
columns = []
builds = []
for column in df.columns:
    if df[column].dtype == 'float64':
        columns.append(column)
    else:
        builds.append(column)
save = df[builds]
df = df[columns]
print(df)
df.loc['Min'] = df.min()
df.loc['Average'] = df.mean()
df.loc['Max'] = df.max()
If I were then to write this data to a CSV it would look like:
,Avg,Min,Max
0,56.19,39.123,60.1039
1,57.11,40.102,60.2
2,55.1134,35.129404123,60.20121
Min,55.1134,35.129404123,60.1039
Average,55.8817,37.3709520615,60.1522525
Max,57.11,40.102,60.20121
This is close to what I want, but I want the Build column to be column one again, with the build names on top of the Min, Average, Max rows. Basically this:
Builds,Avg,Min,Max
BuildA,56.19,39.123,60.1039
BuildX,57.11,40.102,60.2
BuildZER,55.1134,35.129404123,60.20121
Min,55.1134,35.129404123,60.1039
Average,55.8817,37.3709520615,60.1522525
Max,57.11,40.102,60.20121
I have attempted to accomplish this by doing:
df.insert(0, 'builds', save)
with open('fakedata.csv', 'w') as f:
    df.to_csv(f)
But this gives me this CSV:
,builds,Avg,Min,Max
0,Build1,56.19,39.123,60.1039
1,Build2,57.11,40.102,60.2
2,Build3,55.1134,35.129404123,60.20121
Min,,55.1134,35.129404123,60.1039
Average,,55.8817,37.3709520615,60.1522525
Max,,57.11,40.102,60.20121
How can I fix this?
IIUC:
df_out = pd.concat([df.set_index('Build'),
                    df.set_index('Build').agg(['max', 'min', 'mean'])]) \
           .rename(index={'max': 'Max', 'min': 'Min', 'mean': 'Average'}) \
           .reset_index()
Output:
index Avg Min Max
0 BuildA 56.1900 39.123000 60.10390
1 BuildX 57.1100 40.102000 60.20000
2 BuildZER 55.1134 35.129404 60.20121
3 Max 57.1100 40.102000 60.20121
4 Min 55.1134 35.129404 60.10390
5 Average 56.1378 38.118135 60.16837
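If the first column should be labelled Build again before writing out, a small follow-up step (assuming the df_out from above; the filename is the one from the question):

df_out = df_out.rename(columns={'index': 'Build'})
df_out.to_csv('fakedata.csv', index=False)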
