I am using groupby in pandas to compute aggregate statistics on data where the columns of a data frame are organized with a hierarchical index.
In the end I want to get the computed statistics back into table form, where the groups are converted back into columns holding the group values, e.g. like:
index = pd.MultiIndex.from_tuples([('A', 'a'), ('B', 'b')])
df = pd.DataFrame(np.random.randn(8,2), columns=index)
which results in a data frame like this:
A B
a b
0 0.511157 0.334748
1 0.031113 -0.477456
2 0.288080 -0.258238
3 0.138467 -0.955547
4 -0.087873 0.017494
5 -0.667393 1.190039
6 -0.068245 -1.282864
7 -0.996982 0.589667
Now I compute the statistics using groupby and reset the index to recreate a flat data frame:
df.groupby([('A','a')]).mean().reset_index()
(A, a) B
b
0 -0.996982 0.589667
1 -0.667393 1.190039
2 -0.087873 0.017494
3 -0.068245 -1.282864
4 0.031113 -0.477456
5 0.138467 -0.955547
6 0.288080 -0.258238
7 0.511157 0.334748
How can I make ('A', 'a') part of the MultiIndex again, ideally in an automatic fashion? Or, stated otherwise: is there a way to preserve the hierarchical column structure during the groupby operation?
For me, adding the parameter as_index=False to groupby works:
print(df.groupby([('A','a')], as_index=False).mean())
A B
a b
0 -0.765088 -0.556601
1 -0.628040 2.074559
2 -0.516396 -2.028387
3 -0.152027 0.389853
4 0.450218 1.474989
5 0.718040 -0.882018
6 1.932556 -0.977316
7 2.028468 -0.875167
The simplest thing to do is to reassign the original columns back:
In [182]:
df1 = df.groupby([('A','a')]).mean().reset_index()
df1.columns = df.columns
df1
Out[182]:
A B
a b
0 -0.857465 -0.761948
1 -0.263677 0.538251
2 0.067710 -1.038906
3 0.345584 -0.425514
4 0.478200 0.119345
5 0.639305 0.047526
6 1.528260 1.956677
7 3.114834 -0.532462
Related
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) and a dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
Output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can just use .corr, which returns all correlations between all columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
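For instance, a minimal sketch using the same kind of random setup as the corrwith example above (column names are just illustrative):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# compute the full correlation matrix, then keep only the column of interest
df.corr()['A']
# returns a Series with the correlation of every column (including 'A' itself) against 'A'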
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
I have two data frames; their sample output is shown in the screenshots.
My code for reading them and formatting the date column is below.
First df:
csv_data_df = pd.read_csv(os.path.join(path_to_data+'\\Data\\',appendedfile))
csv_data_df['Date_Formatted'] = pd.to_datetime(csv_data_df['DATE']).dt.strftime('%Y-%m-%d')
csv_data_df.head(3)
Second df:
new_Data_df = pd.read_csv(io.StringIO(response.decode('utf-8')))
new_Data_df['Date_Formatted'] = pd.to_datetime(new_Data_df['DATE']).dt.strftime('%Y-%m-%d')
new_Data_df.head(3)
I want to construct a third dataframe containing only the rows from the second dataframe whose dates do not match any date in the first one. Is there a method to do that? The formatted date column can be seen in the screenshots.
You could set the index of both dataframes to your desired join column, then
use df1.combine_first(df2). For your specific example, that could look like the line below.
csv_data_df.set_index('Date_Formatted').combine_first(new_Data_df.set_index('Date_Formatted')).reset_index()
Ex:
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'), index=list(range(1, 6)))
df2 = pd.DataFrame(np.random.randn(8, 3), columns=list('abc'))
df
Out[10]:
a b c
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
df2
Out[11]:
a b c
0 1.732251 -1.977803 0.720292
1 0.048229 1.125277 1.016083
2 -1.684013 2.136061 0.553824
3 -0.022957 1.237249 0.236923
4 -0.998079 1.714126 1.291391
5 0.955464 -0.049673 1.629146
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
df.combine_first(df2)
Out[13]:
a b c
0 1.732251 -1.977803 0.720292
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
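If the goal is strictly the rows of the second frame whose dates do not appear in the first (as the question asks), a boolean filter with isin may be closer to what you want; a rough sketch, assuming both frames carry the Date_Formatted column from above:
# keep only rows of new_Data_df whose formatted date is absent from csv_data_df
mask = ~new_Data_df['Date_Formatted'].isin(csv_data_df['Date_Formatted'])
third_df = new_Data_df[mask].reset_index(drop=True)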
I have a large data set in a pandas dataframe I would like to subset in order to further manipulate.
For example, I have a df that looks like this:
Sample Group AMP ADP ATP
1 A 0.3840396 0.55635504 0.5844648
2 A 0.3971521 0.57851902 -0.24603208
3 A 0.4578926 0.68118957 0.19129746
4 B 0.400222 0.58370811 0.01782915
5 B 0.4110945 0.60208593 -0.6285537
6 B 0.3307011 -0.82615087 -0.25354715
7 C 0.3485679 -0.79597002 -0.17294609
8 C 0.3408411 -0.8090222 0.76138965
9 C 0.3856457 -0.73333568 0.27364299
Let's say I want to make a new dataframe df2 from df that contains only the samples in group B and only the corresponding values for ATP. I should be able to do this with indexing alone(?)
I would like to do something like this:
df2 = df[(df['Group']=='B') & (df['ATP'])]
I know df['ATP'] obviously is not the correct way to do this. Printing df2 yields this:
Sample Group AMP ADP ATP
4 B 0.400222 0.583708 0.017829
5 B 0.411094 0.602086 -0.628554
6 B 0.330701 -0.826151 -0.253547
Ideally, df2 would look like this:
Sample Group ATP
4 B 0.017829
5 B -0.628554
6 B -0.253547
Is there a way to do this without resorting to some convoluted looping or simply manually deleting the unwanted columns and their values?
Thanks!!!
df2 = df.loc[df['Group'] == 'B', ['Sample', 'Group', 'ATP']]
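For reference, a minimal sketch on a small made-up frame, showing how the single .loc call combines the boolean row mask with the explicit column list:
import pandas as pd
df = pd.DataFrame({'Sample': [1, 4, 7],
                   'Group': ['A', 'B', 'C'],
                   'AMP': [0.38, 0.40, 0.35],
                   'ATP': [0.58, 0.02, -0.17]})
# rows where Group == 'B', restricted to the three columns of interest
df2 = df.loc[df['Group'] == 'B', ['Sample', 'Group', 'ATP']]
print(df2)
#    Sample Group   ATP
# 1       4     B  0.02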
I am attempting to join several dataframes together. The list of names of these dataframes is stored in another dataframe called companies, which is displayed below.
>>> companies
  Symbols
0 TUES
1 DRAM
2 NTRS
3 PCBK
4 CRIS
5 PERY
6 IRDM
7 GNCMA
8 IBOC
My aim would be to do something like this: joined = TUES.join(DRAM), then joined = joined.join(NTRS), and so on down the list. How might I reference the elements of the Symbols column of the companies dataframe in order to achieve this?
Many thanks in advance!
You can define an empty DataFrame and append all other dataframes to it. See the example below:
import pandas

combined_df = pandas.DataFrame()
for df in other_dataframes:
    combined_df = combined_df.append(df)
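Note that DataFrame.append was later deprecated and removed in pandas 2.0; on current versions the same row-wise accumulation is usually written by collecting the frames in a list and concatenating once, roughly like this (assuming other_dataframes is an iterable of DataFrames, as above):
import pandas as pd
# concatenate all frames in one call instead of appending repeatedly
combined_df = pd.concat(list(other_dataframes), ignore_index=True)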
Use pd.concat; it is designed for concatenating lists of dfs.
So for your example, just turn the values into a list and then concat:
joined = pd.concat(list(companies['Symbols']), axis=1)
Example:
In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df1 = pd.DataFrame({'c':np.random.randn(5), 'd':np.random.randn(5)})
df2 = pd.DataFrame({'e':np.random.randn(5), 'f':np.random.randn(5)})
df_list=[df,df2,df1]
df_list
Out[4]:
[ a b
0 0.143116 1.205407
1 -0.430869 1.429313
2 0.059810 0.430131
3 2.554849 -1.450640
4 -1.127638 0.715323
[5 rows x 2 columns], e f
0 0.658967 1.150672
1 0.813355 -0.252577
2 0.885928 0.970844
3 0.519375 -1.929081
4 -0.217152 0.907535
[5 rows x 2 columns], c d
0 -1.375885 1.422697
1 -0.870040 0.135527
2 -0.696600 1.954966
3 0.494035 -0.727816
4 -0.367156 -0.216115
[5 rows x 2 columns]]
In [8]:
# now concatenate the list of dfs, by column
pd.concat(df_list,axis=1)
Out[8]:
a b e f c d
0 0.143116 1.205407 0.658967 1.150672 -1.375885 1.422697
1 -0.430869 1.429313 0.813355 -0.252577 -0.870040 0.135527
2 0.059810 0.430131 0.885928 0.970844 -0.696600 1.954966
3 2.554849 -1.450640 0.519375 -1.929081 0.494035 -0.727816
4 -1.127638 0.715323 -0.217152 0.907535 -0.367156 -0.216115
[5 rows x 6 columns]
I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas right now.
You could use groupby:
def f(group):
    # take the first row of each group and repeat its 'class' value 'count' times
    row = group.iloc[0]
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like.
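For instance (a minimal sketch), resetting to a clean RangeIndex:
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)
#   class
# 0     A
# 1     C
# 2     C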
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count':   [3, 3, 3, 1]})
print(df, "\n")

def f(group, *args):
    # repeat every value of the group's first row as many times as its count
    row = group.iloc[0]
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand

df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regard to speed, my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it's using all the columns to form the groups, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat routine to generate a new index in which each row index is repeated according to its given count, and then use iloc to select rows of the original table with this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.