Pandas aggregate statistics as new columns - python

I have a dataframe df with 3 columns: A is an object id, B is a flag, and C is a value measured on object A with flag B.
I want to compute the average value of C grouped by [A, B] and store the results as three new columns:
C0: mean of C (or NaN) where B = 0
C1: mean of C (or NaN) where B = 1
C2: mean of C (or NaN) where B = 2
Below there's an example of how I am trying to transform the dataframe df into res.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 0, 0, 0, 1, 2, 2, 3, 3, 3],
    "B": [0, 1, 2, 0, 1, 2, 0, 2, 0, 1, 1],
    "C": [.654, .123, 1.45, 6.1, 0.322, 1.77, 9.234, 2.54, 1, 6.77, 6.438]})

grouped = df.groupby(["A", "B"]).agg("mean")
# how to transform grouped into res?
res = pd.DataFrame({
    "A": [0, 1, 2, 3],
    "C0": [3.377, np.nan, 9.234, 1],
    "C1": [0.2225, np.nan, np.nan, 6.604],
    "C2": [1.45, 1.77, 2.54, np.nan]})

Use unstack with add_prefix:
res = df.groupby(["A","B"])['C'].mean().unstack().add_prefix('C').reset_index()
Or use pivot_table with its default mean aggregation function:
res = df.pivot_table(index="A",columns="B",values='C').add_prefix('C').reset_index()
print (res)
B A C0 C1 C2
0 0 3.377 0.2225 1.45
1 1 NaN NaN 1.77
2 2 9.234 NaN 2.54
3 3 1.000 6.6040 NaN
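Either way the printed frame keeps B as the columns-axis name (the stray B in the header above). If that is unwanted, a small tweak of my own (not part of the answer above) is to clear the columns name before printing:
res = df.groupby(["A", "B"])['C'].mean().unstack().add_prefix('C').reset_index()
res.columns.name = None  # drop the leftover "B" label over the columns
print(res)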


Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

I'm having trouble adding a new column to a pandas dataframe when the new column has more values than the dataframe's index.
The data may look like this:
import pandas as pd
df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
So, as you can see, the length of this df's index is 3.
Next I want to add a new column; the code may look like one of the two ways below:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
Here I just get the values [1, 2, 3] in my dataframe df (the new column's values beyond the existing index are silently dropped).
What I want is for the dataframe to be extended so that all four values are kept, with NaN filling the existing columns for the extra row.
Is there a better way?
Looking forward to your answer, thanks!
Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex df by the new Series' index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
The same idea with assign:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col=s)
A third option is pd.concat along axis=1:
df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
new_col = pd.Series([1,2,3,4])
df = pd.concat([df, new_col], axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
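A small tweak of my own (not part of the answer above): give the Series a name first, so the concatenated column is labelled new_col instead of 0.
df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})
new_col = pd.Series([1, 2, 3, 4], name='new_col')
df = pd.concat([df, new_col], axis=1)
print(df)
#    bar  zoo  new_col
# 0    A  1.0        1
# 1    B  2.0        2
# 2    C  3.0        3
# 3  NaN  NaN        4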

Pandas: Convert nan in a row to an empty array

My dataframes are like below
df1
id c1
1 abc
2 def
3 ghi
df2
id set1
1 [123,456]
2 [789]
When I join df1 and df2 with final_df = df1.merge(df2, how='left'), it gives me:
final_df
id c1 set1
1 abc [123,456]
2 def [789]
3 ghi NaN
I'm using the code below to replace NaN with an empty list []:
for row in final_df.loc[final_df.set1.isnull(), 'set1'].index:
    final_df.at[row, 'set1'] = []
The issue is that when df2 is an empty dataframe, this raises:
ValueError: setting an array element with a sequence.
PS: I'm using pandas version 0.23.4.
Pandas is not designed to be used with series of lists. You lose all vectorised functionality and any manipulations on such series involve inefficient, Python-level loops.
One work-around is to define a series of empty lists:
res = df1.merge(df2, how='left')
empty = pd.Series([[] for _ in range(len(res.index))], index=res.index)
res['set1'] = res['set1'].fillna(empty)
print(res)
id c1 set1
0 1 abc [123, 456]
1 2 def [789]
2 3 ghi []
A better idea at this point, if viable, is to split your lists into separate series:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()))
print(res)
id c1 0 1
0 1 abc 123.0 456.0
1 2 def 789.0 NaN
2 3 ghi NaN NaN
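An alternative sketch of my own (not from the answers here): rebuild the column with an elementwise apply and an isinstance check. Because it never assigns a list into an existing float slot, it also sidesteps the ValueError when df2 has no rows.
res = df1.merge(df2, how='left')
# anything that is not already a list (i.e. the NaN fill values) becomes []
res['set1'] = res['set1'].apply(lambda x: x if isinstance(x, list) else [])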
This is not ideal but will get the work done:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1,'abc'],[2,'def'],[3,'ghi']], columns=['id', 'c1'])
df2 = pd.DataFrame([[1,[123,456]],[2,[789]]], columns=['id', 'set1'])
df = pd.merge(df1, df2, how='left', on='id')
df['set1'].fillna(0, inplace=True)
df['set1'] = df['set1'].apply(lambda x: [] if x == 0 else x)
print(df)

Which data structure in Python to use to replace Excel 2-dim array of strings/amounts?

I am using xlwings to replace my VB code with Python but since I am not an experienced programmer I was wondering - which data structure to use?
Data is in .xls in 2 columns and has the following form; In VB I lift this into a basic two dimensional array arrCampaignsAmounts(i, j):
Col 1: 'market_channel_campaign_product'; Col 2: '2334.43 $'
Then I concatenate words from 4 columns on another sheet into a similar 'string', into another 2-dim array arrStrings(i, j):
'Austria_Facebook_Winter_Active vacation'; 'rowNumber'
Finally, I search for the strings from the first array within the strings from the second array; when found, I write the amount into the row number stored in arrStrings(i, 2).
Would I use 4 lists for this task?
Two dictionaries?
Something else?
Definitely use pandas DataFrames. Here are references and some very simple DataFrame examples.
# reference: http://pandas.pydata.org/pandas-docs/stable/10min.html
# reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
import numpy as np
import pandas as pd

def df_dupes(df_in):
    '''
    Returns [object, count] pairs for each unique item in the dataframe.
    '''
    if isinstance(df_in, list) or isinstance(df_in, tuple):
        import pandas as pd
        df_in = pd.DataFrame(df_in)
    return df_in.groupby(df_in.columns.tolist(), as_index=False).size()
def df_filter_example(df):
    '''
    In [96]: df
    Out[96]:
       A  B  C  D
    0  1  4  9  1
    1  4  5  0  2
    2  5  5  1  0
    3  1  3  9  6
    '''
    import pandas as pd
    df = pd.DataFrame([[1,4,9,1],[4,5,0,2],[5,5,1,0],[1,3,9,6]], columns=['A','B','C','D'])
    return df[(df.A == 1) & (df.D == 6)]
def df_compare(df1, df2, compare_col_list, join_type):
    '''
    df_compare compares 2 dataframes.
    Returns left, right, inner or outer join.
    df1 is the first/left dataframe
    df2 is the second/right dataframe
    compare_col_list is a list of column names that must match between df1 and df2
    join_type = 'inner', 'left', 'right' or 'outer'
    '''
    import pandas as pd
    return pd.merge(df1, df2, how=join_type, on=compare_col_list)
def df_compare_examples():
    import numpy as np
    import pandas as pd
    df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       1   2   3
    1       4   5   6
    2       7   8   9 '''
    df2 = pd.DataFrame([[4,5,6],[7,8,9],[10,11,12]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0       4   5   6
    1       7   8   9
    2      10  11  12 '''
    # One can see that df1 contains 1 row ([1,2,3]) not in df2 and
    # df2 contains 1 row ([10,11,12]) not in df1.
    # Assume c1 is not relevant to the comparison. So, we merge on cols 2 and 3.
    df_merge = pd.merge(df1, df2, how='outer', on=['c2','c3'])
    print(df_merge)
    '''   c1_x  c2  c3  c1_y
    0        1   2   3   NaN
    1        4   5   6     4
    2        7   8   9     7
    3      NaN  11  12    10 '''
    ''' One can see that columns c2 and c3 are returned. We also received
    columns c1_x and c1_y, where c1_x is the value of column c1
    in the first dataframe and c1_y is the value of c1 in the second
    dataframe. As such,
    any row that contains c1_y = NaN is a row from df1 not in df2 &
    any row that contains c1_x = NaN is a row from df2 not in df1. '''
    df1_unique = pd.merge(df1, df2, how='left', on=['c2','c3'])
    df1_unique = df1_unique[df1_unique['c1_y'].isnull()]
    print(df1_unique)
    df2_unique = pd.merge(df1, df2, how='right', on=['c2','c3'])
    print(df2_unique)
    df_common = pd.merge(df1, df2, how='inner', on=['c2','c3'])
    print(df_common)
def delete_column_example():
    print('create df')
    import pandas as pd
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
    print('drop (delete/remove) column')
    col_name = 'b'
    df.drop(col_name, axis=1, inplace=True)  # or df = df.drop(col_name, axis=1)

def delete_rows_example():
    print('\n\ncreate df')
    import pandas as pd
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    print(df)
    print('\n\nappend rows')
    df = df.append(pd.DataFrame([[11,22,33]], columns=['col_1','col_2','col_3']))
    print(df)
    print('\n\ndelete rows where (based on) column value')
    df = df[df.col_1 == 4]
    print(df)
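Applied to the question itself, here is a rough sketch of my own of how the two VB arrays could map onto two DataFrames and a merge. The file name, sheet names and column names below are assumptions for illustration, not taken from the question.
import pandas as pd

# sheet 1: 'market_channel_campaign_product' strings in one column, amounts in the other
amounts = pd.read_excel('campaigns.xls', sheet_name='Amounts',
                        header=None, names=['key', 'amount'])

# sheet 2: the four columns that together form the same kind of key
parts = pd.read_excel('campaigns.xls', sheet_name='Strings',
                      header=None, names=['market', 'channel', 'campaign', 'product'])
parts['key'] = (parts['market'] + '_' + parts['channel'] + '_'
                + parts['campaign'] + '_' + parts['product'])

# look up the amount for each concatenated key; the row position of `parts`
# plays the role of rowNumber from the VB code
result = parts.merge(amounts, on='key', how='left')
print(result[['key', 'amount']])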

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing nan. I'd like to drop those columns with certain number of nan. For example, in the following code, I'd like to drop any column with 2 or more nan. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna; you just need to pass the length of your df minus the number of NaN values you want to allow as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that does not have at least len(dff) - 2 non-NA values. Note that this keeps a column with exactly 2 NaN; if such a column should also be dropped, use thresh=len(dff) - 1.
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print s
A 1
B 1
C 3
dtype: int64
for col in dff:
    if s[col] >= 2:
        del dff[col]
Or
for c in dff:
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:, list((100 * (data.isnull().sum() / len(data.index)) > 70))].columns, axis=1)
Another approach for dropping columns having a certain number of NA values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having a certain percentage of NA values:
df = df.drop(columns=[x for x in df if round((df[x].isna().sum() / len(df) * 100), 2) > 20])
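For completeness, the thresh route from the first answer covers the percentage case too; a small sketch of my own (the 70% cutoff is just an example):
import math
# keep only columns with at least 30% non-NA values, i.e. drop those with more than 70% NaN
data = data.dropna(thresh=math.ceil(0.3 * len(data)), axis=1)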

Random sampling and Pandas dataframes

I have the following dataframe, cr_df, which shows the rate at which ID1 converts to ID2
ID1 ID2 Conversion Rate
0 1 A 0.046562
1 1 B 0.315975
2 1 C 0.577998
3 1 D 0.059465
4 2 A 0.6
5 2 B 0.4
Then I have another dataframe, raw_df, in the format of ID1 such as:
ID1 Value
0 1 100
1 2 200
My goal is to output a dataframe final_df, in the ID2 format that looks something like:
ID2 Value
0 C 100
1 A 200
Where the mapping from ID1 consists of selecting a random value between 0 and 1 and picking the ID2 based on the conversion rates.
How can I achieve this in pandas? (Do I need to use .apply?)
Given this setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ID1': [1]*4 + [2]*2, 'ID2': list('ABCDAB'),
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1, 2], 'Value': [100, 200]})
you could define a function random_id2:
def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)
and use groupby/apply:
id2 = df.groupby(['ID1']).apply(random_id2)
to obtain the Series
ID1
1 C
2 A
dtype: object
You could then build final_df by mapping raw_df['ID1'] values to id2 values:
final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ID1': [1]*4 + [2]*2, 'ID2': list('ABCDAB'),
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1, 2], 'Value': [100, 200]})
def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)
id2 = df.groupby(['ID1']).apply(random_id2)
final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})
print(final_df)
yields
ID2 Value
0 C 100
1 A 200
You can do a combination of the following, as sketched after this list:
To make a weighted random choice of the rows, use the answer in this question; specifically, make a weighted selection of range(len(df)) with the weights given by df['Conversion Rate'].
To select the rows with the given indices, see here.
To join the resulting dataframe with the second one, use merge.
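A minimal sketch of that combination, reusing the setup from the first answer above; the intermediate variable names are my own:
import numpy as np
import pandas as pd

cr_df = pd.DataFrame({
    'ID1': [1]*4 + [2]*2, 'ID2': list('ABCDAB'),
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1, 2], 'Value': [100, 200]})

# for each ID1, draw one row index of cr_df, weighted by 'Conversion Rate'
picked = cr_df.groupby('ID1').apply(
    lambda g: np.random.choice(g.index, p=g['Conversion Rate'].values))

# select those rows, then merge onto raw_df via ID1
chosen = cr_df.loc[picked.values, ['ID1', 'ID2']]
final_df = raw_df.merge(chosen, on='ID1')[['ID2', 'Value']]
print(final_df)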
