Find mean of the grouped rows of pandas dataframe - python

I am at a very basic level of Python, and I am stuck with a problem; can someone help me out?
I have a large pandas DataFrame, and I want to find rows and take their mean when the first column of each row has a similar value (e.g. some integer, separated by '_' from another integer).
I tried to use .split to match the first number; it works for a single row, but when I iterate over the rows it throws an error.
My DataFrame looks like:
d = {'ID' : pd.Series(['1_1', '2_1', '1_2', '2_2'], index=['0', '1', '2', '3']),
     'one' : pd.Series([2.5, 2, 3.5, 2.5], index=['0', '1', '2', '3']),
     'two' : pd.Series([1, 2, 3, 4], index=['0', '1', '2', '3'])}
df2 = pd.DataFrame(d)
Requirement:
The mean of the rows that share the same ID before the '_', e.g. the mean of 1_1 and 1_2, and the mean of 2_1 and 2_2.
Output:
  ID   one  two
0  1  3.00    2
1  2  2.25    3
Here is my code.
Working version: ((df2.ix[0,0]).split('_'))[0]
Error version:
for i in df2.iterrows():
    df2[df2.columns[((df2.ix[0,0]).split('_'))[0] == ((df2.ix[0,0]).split('_'))[0]]]
Thanks in advance.

You could create a new column with only the first number of your ID column using the [str methods](http://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings), and then use the `groupby` method:
df['groupedID'] = df.ID.str.split('_').str.get(0)
In [347]: df
Out[347]:
     ID  one  two groupedID
0  10_1  2.5    1        10
1   2_1  2.0    2         2
2  10_2  3.5    3        10
3   2_2  2.5    4         2
df1 = df.groupby('groupedID').mean()
In [349]: df1
Out[349]:
            one  two
groupedID
10         3.00    2
2          2.25    3
If you need to change the name of the index back to 'ID':
df1.index.name = 'ID'
In [351]: df1
Out[351]:
     one  two
ID
10  3.00    2
2   2.25    3
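As a side note, here is a minimal sketch of the same idea without keeping the helper column, assuming the df2 from the question:
import pandas as pd

d = {'ID' : pd.Series(['1_1', '2_1', '1_2', '2_2'], index=['0', '1', '2', '3']),
     'one' : pd.Series([2.5, 2, 3.5, 2.5], index=['0', '1', '2', '3']),
     'two' : pd.Series([1, 2, 3, 4], index=['0', '1', '2', '3'])}
df2 = pd.DataFrame(d)

# group directly by the first part of 'ID'; no extra column is stored on df2
key = df2['ID'].str.split('_').str.get(0)
result = df2.groupby(key)[['one', 'two']].mean()
print(result)
#      one  two
# ID
# 1   3.00  2.0
# 2   2.25  3.0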

Related

pandas groupby and return a series of one column

I have a DataFrame like the one below:
df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2017-04-03 12:35:00','2017-04-03 12:50:00','2018-04-05 12:59:00','2018-05-04 13:14:00','2017-05-05 13:37:00','2018-07-06 13:39:00','2018-07-08 11:30:00','2017-04-08 16:00:00','2019-04-09 22:00:00','2019-04-11 04:00:00','2018-04-13 04:30:00','2017-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6],
'Prod_id':['A','B','C','A','E','Q','G','F','G','H','J','A']})
df['time_1'] = pd.to_datetime(df['time_1'])
I would like to do the following:
a) group by subject_id and time_1 using `freq='3M'`
b) return only the aggregated values of the Prod_id column (and drop the index)
So, I tried the below:
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique()
The above works, but it returned the groupby columns in the output as well.
So, I tried the below using `as_index=False`:
df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3M')], as_index=False)['Prod_id'].nunique()
But it still didn't give the expected output.
I expect my output to be as shown below:
uniq_prod_cnt
2
1
1
3
2
1
2
You are in one of those cases in which you need to get rid of the index afterwards.
To get exactly the output shown:
(df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3M')])
   .agg(uniq_prod_cnt=('Prod_id', 'nunique'))
   .reset_index(drop=True)
)
output:
uniq_prod_cnt
0 2
1 1
2 1
3 3
4 2
5 1
6 2
If you want to get an array without the index, use the `values` attribute:
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique().values
output:
array([2, 1, 1, 3, 2, 1, 2], dtype=int64)
If you want a Series with a range index, use `reset_index(drop=True)`:
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique().reset_index(drop=True)
output:
0 2
1 1
2 1
3 3
4 2
5 1
6 2
Name: Prod_id, dtype: int64
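If you also want the Series to carry the uniq_prod_cnt name shown in the expected output, a small sketch (assuming the df from the question) is to rename the result before dropping the index:
out = (df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3M')])['Prod_id']
         .nunique()
         .rename('uniq_prod_cnt')   # give the resulting Series the desired name
         .reset_index(drop=True))   # drop the group keys, keep a plain RangeIndex
print(out)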

Check if many columns of a data frame are exactly the same

I am developing a clinical bioinformatics application, and the input this application receives is a DataFrame that looks like this:
df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
'quarter': [1, 1, 2, 2, 1, 1, 2, 2,2,2,2,2],
'employee': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
'foo': [1, 1, 2, 2, 1, 1, 9, 2,2,4,2,2],
'columnX': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST']})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
This is a minimal reproducible example, but the real one has an uncertain number of columns in which the first, the third, the 5th, the 7th, etc. "should" be exactly the same.
And this is what I want to check. I want to ensure that these columns have their values in the same order.
I know how to check whether 2 columns are exactly the same, but I don't know how to extend that check across the whole data frame.
EDIT:
The column names change; the ones in my example are just examples.
Refer here: How to check if 3 columns are same and add a new column with the value if the values are same?
Here is code that checks whether several columns are the same and returns the indices of the rows where they match:
arr = df[['quarter','foo_test','foo']].values #You can add as many columns as you wish
np.where((arr == arr[:, [0]]).all(axis=1))
You will need to tweak it for your usage.
Edit:
columns_to_check = [x for x in range(0, len(df.columns), 2)]  # 0-based indices of the 1st, 3rd, 5th, ... columns
arr = df.iloc[:, columns_to_check].values
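To make that concrete, a minimal sketch (using the df from the question and assuming you just want a single True/False answer for whether those columns are identical in every row):
import numpy as np

# assumes df is built as in the question
columns_to_check = [x for x in range(0, len(df.columns), 2)]   # store, employee, columnX
arr = df.iloc[:, columns_to_check].values

# rows where every selected column equals the first selected column
matching_rows = np.where((arr == arr[:, [0]]).all(axis=1))[0]

# True only if every row matches, i.e. the selected columns are identical
print(len(matching_rows) == len(df))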
If you want an efficient method, you can hash the Series using pandas.util.hash_pandas_object, making the operation O(n):
pd.util.hash_pandas_object(df.T, index=False)
We clearly see that store/employee/columnX have the same hash:
store 18266754969677227875
quarter 11367719614658692759
employee 18266754969677227875
foo 92544834319824418
columnX 18266754969677227875
dtype: uint64
You can further use groupby to identify the identical values:
df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
output:
{ 92544834319824418: ['foo'],
11367719614658692759: ['quarter'],
18266754969677227875: ['store', 'employee', 'columnX']}
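Building on that, a small sketch (assuming you only care whether the columns at positions 0, 2, 4, ... all hash identically):
import pandas as pd

# assumes df is built as in the question
hashes = pd.util.hash_pandas_object(df.T, index=False)

# hashes is indexed by column name, in column order; take every second column
odd_position_hashes = hashes.iloc[::2]

# True only if all of those columns have identical contents (same values, same order)
print(odd_position_hashes.nunique() == 1)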

How can I rename NaN columns in python pandas?

Good day everyone! I had trouble turning a nested dictionary into separate columns. However, I fixed it using the concat and json_normalize functions. But for some reason, the code I used removed all the column names and returned NaN as the values for the columns...
Does someone know how to fix this?
Code I used:
import pandas as pd
c = ['photo.photo_replace', 'photo.photo_remove', 'photo.photo_add', 'photo.photo_effect', 'photo.photo_brightness',
'photo.background_color', 'photo.photo_resize', 'photo.photo_rotate', 'photo.photo_mirror', 'photo.photo_layer_rearrange',
'photo.photo_move', 'text.text_remove', 'text.text_add', 'text.text_edit', 'text.font_select', 'text.text_color', 'text.text_style',
'text.background_color', 'text.text_align', 'text.text_resize', 'text.text_rotate', 'text.text_move', 'text.text_layer_rearrange']
df_edit = pd.concat([json_normalize(x)[c] for x in df['editables']], ignore_index=True)
df.columns = df.columns.str.split('.').str[1]
Current problem: (screenshot omitted)
Result I want: (screenshot omitted)
df= pd.DataFrame({
'A':[1,2,3],
'B':[3,3,3]
})
print(df)
A B
0 1 3
1 2 3
2 3 3
c=['new_name1','new_name2']
df.columns=c
print(df)
new_name1 new_name2
0 1 3
1 2 3
2 3 3
Remember, the length of the list of column names (c) must be equal to the number of columns.
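If you only want to rename some columns, or don't want to list every column, a small sketch using rename with a mapping (which does not require matching the full column count) could be:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [3, 3, 3]
})

# rename only the columns listed in the mapping; any others keep their names
df = df.rename(columns={'A': 'new_name1', 'B': 'new_name2'})
print(df)
#    new_name1  new_name2
# 0          1          3
# 1          2          3
# 2          3          3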

Combining two columns with related data into a single column (python, pandas)

I am looking for the correct logic to combine two columns with related data from an .xlsx file using pandas in Python. It is similar to the post Merge 2 columns in pandas into one column that have data in python, except that I also want to transform the data as I combine the columns, so it's not really a true merge of the two columns. I want to be able to say: "if column wbc_na has the value 'checked' in row x, place 'Not available' in row x under column wbc". Once combined, I want to drop the column "wbc_na", since "wbc" now contains all the information I need. For example:
input:
ID,wbc, wbc_na
1,9.0,-
2,NaN,checked
3,10.2,-
4,8.8,-
5,0,checked
output:
ID,wbc
1,9.0
2,Not available
3,10.2
4,8.8
5,Not available
Thanks for your suggestions.
You can use loc to find where column 'wbc_na' is 'checked' and, for those rows, assign the 'wbc' value:
In [18]:
df.loc[df['wbc_na'] == 'checked', 'wbc'] = 'Not available'
df
Out[18]:
ID wbc wbc_na
0 1 9 -
1 2 Not available checked
2 3 10.2 -
3 4 8.8 -
4 5 Not available checked
[5 rows x 3 columns]
In [19]:
# now drop the extra column
df.drop(labels='wbc_na', axis=1, inplace=True)
df
Out[19]:
ID wbc
0 1 9
1 2 Not available
2 3 10.2
3 4 8.8
4 5 Not available
[5 rows x 2 columns]
You could also use a list comprehension to reassign the values in column wbc:
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': [1,2,3,4,5], 'wbc': [9, np.nan, 10, 8, 0], 'wbc_nan': ['-', 'checked', '-', '-', 'checked']})
data['wbc'] = [(item if data['wbc_nan'][x] != 'checked' else 'Not available') for x, item in enumerate(data['wbc'])]
data = data.drop('wbc_nan', axis=1)
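As a side note, a minimal vectorized sketch of the same idea using Series.mask, assuming data is rebuilt as in the example above:
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': [1,2,3,4,5], 'wbc': [9, np.nan, 10, 8, 0], 'wbc_nan': ['-', 'checked', '-', '-', 'checked']})

# wherever the condition is True, replace the 'wbc' value with 'Not available'
data['wbc'] = data['wbc'].mask(data['wbc_nan'] == 'checked', 'Not available')
data = data.drop(columns='wbc_nan')
print(data)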

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform a DataFrame such that some of the rows are replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregating with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas now.
You could use groupby:
def f(group):
    row = group.iloc[0]  # originally group.irow(0); irow has been removed from pandas
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like
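For example, one small sketch to get a clean 0..n-1 index is to reset it afterwards:
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)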
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column has the counts in it, and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A','B','C','A'],
                   'class 2': [ 1, 2, 3, 1],
                   'count':   [ 3, 3, 3, 1]})
print(df, "\n")
def f(group, *args):
    row = group.iloc[0]  # originally group.irow(0), which has been removed from pandas
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict: Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)
def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand
df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it uses all the columns to form the groups, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure to generate a new index in which each row index is repeated multiple times according to its given count, and then use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
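As a side note, a common idiom for this nowadays is to repeat the index itself; a small sketch, assuming the df with 'class' and 'count' from the question:
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

# repeat each row label according to its count, then select those rows with loc
out = (df.loc[df.index.repeat(df['count'])]
         .drop(columns='count')
         .reset_index(drop=True))
print(out)
#   class
# 0     A
# 1     C
# 2     C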
