Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe by the first two columns (col1 and col2) and then average the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby; what you passed was interpreted as the axis param, which is why it raised the error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
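To get a flat frame with an explicit name for the mean column, as in the desired output, the grouping can be combined with named aggregation (a sketch, assuming pandas >= 0.25; the column is called avg_value here since a hyphenated name is awkward as a keyword):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [2, 2, 3],
                   'value': [3, 1, 1]})

# as_index=False keeps col1/col2 as regular columns instead of a MultiIndex
out = df.groupby(['col1', 'col2'], as_index=False).agg(avg_value=('value', 'mean'))
print(out)
#    col1  col2  avg_value
# 0     1     2        2.0
# 1     2     3        1.0
```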
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or, slightly more verbose, to get the label 'avg' on the aggregated column (the older nested-dict spec {'value': {'avg': np.mean}} was deprecated and later removed from pandas, so named aggregation is used instead):
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))
I have a dataframe that looks like this:
id   human_id
1    ('apples', '2022-12-04', 'a5ted')
2    ('bananas', '2012-2-14')
3    ('2012-2-14', 'reda21', 'ss')
..   ..
I would like a "pythonic" way to get this output:
id   human_id                            col1       col2        col3
1    ('apples', '2022-12-04', 'a5ted')   apples     2022-12-04  a5ted
2    ('bananas', '2012-2-14')            bananas    2012-2-14   np.NaN
3    ('2012-2-14', 'reda21', 'ss')       2012-2-14  reda21      ss
import pandas as pd
df['a'], df['b'], df['c'] = df.human_id.str
The code I have tried gives me this error:
ValueError: not enough values to unpack (expected 2, got 1)
How can I split the tuple values into separate columns?
Thank you.
You can do it this way. It will just put None in places where it couldn't find the values. You can then append the df1 to df.
d = {'id': [1,2,3],
'human_id': ["('apples', '2022-12-04', 'a5ted')",
"('bananas', '2012-2-14')",
"('2012-2-14', 'reda21', 'ss')"
]}
df = pd.DataFrame(data=d)
list_human_id = list(df['human_id'])
newList = []
for val in list_human_id:
    newList.append(eval(val))  # parse each tuple string into an actual tuple
df1 = pd.DataFrame(newList, columns=['col1', 'col2', 'col3'])
print(df1)
Output
col1 col2 col3
0 apples 2022-12-04 a5ted
1 bananas 2012-2-14 None
2 2012-2-14 reda21 ss
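A safer variant of the same idea uses ast.literal_eval, which only parses Python literals and so avoids the risks of eval on untrusted strings (a sketch on the sample data above):

```python
import ast
import pandas as pd

d = {'id': [1, 2, 3],
     'human_id': ["('apples', '2022-12-04', 'a5ted')",
                  "('bananas', '2012-2-14')",
                  "('2012-2-14', 'reda21', 'ss')"]}
df = pd.DataFrame(data=d)

# literal_eval accepts only Python literals, so it is safe on untrusted input
parsed = [ast.literal_eval(s) for s in df['human_id']]
# shorter tuples are padded with None, which pandas displays as None/NaN
df1 = pd.DataFrame(parsed, columns=['col1', 'col2', 'col3'])
print(df1)
```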
You can do
out = df.join(pd.DataFrame(df.human_id.tolist(),index=df.index,columns=['a','b','c']))
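For example, assuming human_id holds actual tuples rather than strings:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'human_id': [('apples', '2022-12-04', 'a5ted'),
                                ('bananas', '2012-2-14'),
                                ('2012-2-14', 'reda21', 'ss')]})

# shorter tuples are padded with None, which pandas shows as NaN
out = df.join(pd.DataFrame(df.human_id.tolist(), index=df.index, columns=['a', 'b', 'c']))
print(out)
```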
The columns can also be created dynamically from the tuple length, working on the same DataFrame:
import pandas as pd
id = [1, 2, 3]
human_id = [('apples', '2022-12-04', 'a5ted')
,('bananas', '2012-2-14')
, ('2012-2-14', 'reda21', 'ss')]
df = pd.DataFrame({'id': id, 'human_id': human_id})
print("*"*20,'Dataframe',"*"*20)
print(df.to_string())
print()
print("*"*20,'Split Data',"*"*20)
row = 0
for x in df['human_id']:
    col = 1
    for xx in x:
        name_column = 'col' + str(col)
        df.loc[df.index[row], name_column] = str(xx)
        col += 1
    row += 1
print(df.to_string())
I need to convert 3 columns into 2 rows using python.
col1 col2 col3
A 2 3
B 4 5
col1 col2
A 2
A 3
B 4
B 5
*my code
hdr = ['col1', 'col2']
final_output = []
for row in rows:
    output = {'col1': row.get('col1'), 'col2': row.get('col2')}
    output1 = {'col1': row.get('col1'), 'col2': row.get('col3')}
    final_output.append(output)
    final_output.append(output1)
with open('tgt_file.csv', 'w') as tgt_file:
    csv_writer = csv.DictWriter(tgt_file, fieldnames=hdr, delimiter=',')
    csv_writer.writeheader()
    csv_writer.writerows(final_output)
import pandas as pd
### this is the sample data
df = pd.DataFrame(data= [['A',2, 3],['B',4, 5]],
columns =['col1', 'col2', 'col3'])
### this is the solution
ef = []  # create an empty list
for i, row in df.iterrows():
    ef.append([row.iloc[0], row.iloc[1]])  # append first column first
    ef.append([row.iloc[0], row.iloc[2]])  # append 2nd column second
df = pd.DataFrame(data=ef, columns=['col1', 'col2'])  # recreate the dataframe
remark: there are more advanced solutions possible, but I think this is readable
You can try using pd.melt. Since value_name cannot reuse a name that already exists in the DataFrame, the melted value column is renamed afterwards:
df = pd.melt(df, id_vars=['col1']).drop(['variable'], axis=1).rename(columns={'value': 'col2'})
And then you can sort the dataframe on "col1".
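A worked sketch of the melt approach on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame(data=[['A', 2, 3], ['B', 4, 5]],
                  columns=['col1', 'col2', 'col3'])

# stack col2 and col3 into one column, drop the 'variable' helper,
# rename the melted values back to 'col2', then sort on col1
out = (pd.melt(df, id_vars=['col1'])
         .drop(['variable'], axis=1)
         .rename(columns={'value': 'col2'})
         .sort_values('col1', kind='stable', ignore_index=True))
print(out)
#   col1  col2
# 0    A     2
# 1    A     3
# 2    B     4
# 3    B     5
```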
I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
Pandas reindex could help:
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols
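As a worked sketch (the question's table was shown as an image, so the data below is hypothetical; the column names follow the question):

```python
import pandas as pd

# hypothetical data standing in for the question's screenshot
df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 11], 'col3': [20, 21],
                   'col4': [30, 31], 'col5': [40, 41]})

cols = df.columns
# reposition the columns, then restore the original labels
df = df.reindex(columns=['col1', 'col4', 'col5', 'col2', 'col3'])
df.columns = cols
print(df)
```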
I have a df where I have several columns, that, based on the value (1-6) in these columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column by column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this, (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both column blocks with DataFrame.loc, build the boolean mask from the first block, rename its columns to match df2, and finally use DataFrame.add on just those columns:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
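Alternatively, the rename step can be skipped by adding the mask's underlying NumPy array, since the two column blocks already line up by position (a sketch on the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
                   'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})

left = df.loc[:, 'col1':'col3']
mask = (left != 1) & (left < 6)
# .to_numpy() drops the col1..col3 labels so the addition aligns by position
df.loc[:, 'colA':'colC'] += mask.to_numpy().astype(int)
print(df)
```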
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
This question already has answers here:
How to add a new column to an existing DataFrame?
(32 answers)
Closed 4 years ago.
I have dataframe in Pandas for example:
Col1 Col2
A 1
B 2
C 3
Now if I would like to add one more column named Col3 and the value is based on Col2. In formula, if Col2 > 1, then Col3 is 0, otherwise would be 1. So, in the example above. The output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?
You just do the opposite comparison: if Col2 <= 1. This returns a boolean Series with False for values greater than 1 and True for the rest. Converting it to int64 dtype turns True into 1 and False into 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any number to Col3 depending on the value of Col2 you should do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
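For the original 0/1 rule, numpy.where expresses the same conditional assignment in a single step (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C'], 'Col2': [1, 2, 3]})

# 0 where Col2 > 1, otherwise 1 -- matching the question's rule
df['Col3'] = np.where(df['Col2'] > 1, 0, 1)
print(df)
#   Col1  Col2  Col3
# 0    A     1     1
# 1    B     2     0
# 2    C     3     0
```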
The easiest way that I found for adding a column to a DataFrame was to use the "add" function. Here's a snippet of code, also with the output to a CSV file. Note that including the "columns" argument allows you to set the name of the column (which happens to be the same as the name of the np.array that I used as the source of the data).
# now to create a PANDAS data frame
df = pd.DataFrame(data = FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data = FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = scaled, columns=['scaled'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add( df2, fill_value=0 )
#print(df)
df.to_csv('FF_data_frame.csv')
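A minimal sketch of that "add" trick with small stand-in arrays (the FF_* names above are the answerer's own data): because the two frames share an index but have disjoint columns, fill_value=0 leaves every value unchanged while the columns are merged:

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the answer's measurement arrays
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

df = pd.DataFrame(data=a, columns=['a'])
df2 = pd.DataFrame(data=b, columns=['b'])
# disjoint columns: each cell becomes value + 0, so 'add' just merges the columns
df = df.add(df2, fill_value=0)
print(df)
#      a     b
# 0  1.0  10.0
# 1  2.0  20.0
# 2  3.0  30.0
```

In current pandas, `pd.concat([df, df2], axis=1)` expresses the same column merge more directly.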