I have a list of about 20 dataframes, all with the same structure (same rows and columns).
I want to create a new df, where each cell is equal to the average of the corresponding (same row/column) cells of the listed dfs.
So, for example, if we have just 2 dfs (A and B), I need the following:
A=
A B C D
0 7 6 8 7
1 7 0 7 6
2 9 2 7 0
B=
A B C D
0 6 9 2 7
1 4 4 5 7
2 6 8 5 4
Average=
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
I tried this code, but it's pretty slow (the real dfs are quite large) and messes up the order of columns:
dfs = [A,B]
Average = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
Is there a better alternative? Thanks
Use -
(A+B) / 2
Output
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
For scaling up to more dfs, put all of them in a list and just use sum() on the list. Edit: based on @younggoti's recommendation:
list_of_df = [A,B]
sum(list_of_df)/len(list_of_df)
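For completeness, a minimal sketch of both routes (the 20 random frames and the name dfs are just illustrative): sum() on the list relies on aligned row/column labels and propagates NaNs, while a concat/groupby mean averages per cell label and skips NaNs.
import numpy as np
import pandas as pd

# build 20 dataframes with identical shape and labels (illustrative data)
rng = np.random.default_rng(0)
dfs = [pd.DataFrame(rng.integers(0, 10, size=(3, 4)), columns=list("ABCD"))
       for _ in range(20)]

# element-wise average; fast, but any NaN propagates into the result
average = sum(dfs) / len(dfs)

# NaN-tolerant alternative: stack the frames and average per row label
average_nan_ok = pd.concat(dfs).groupby(level=0).mean()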
If this was my dataframe
      a  b     c
0  12.0  5   0.1
1   9.0  7   8.0
2   1.1  2  12.9
I can use the following code to get the max value in each row: (12, 9, 12.9)
df = df.max(axis=1)
But I don't know how you would get the max values comparing only columns a & b (12, 9, 2).
Assuming one wants to consider only the columns a and b, and store the maximum value in a new column called max, one can do the following
df['max'] = df[['a', 'b']].max(axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
One can also do that with a custom lambda function, as follows
df['max'] = df[['a', 'b']].apply(lambda x: max(x), axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
As per OP's request, if one wants to create a new column, max_of_all, to store the maximum value across all of the dataframe's columns, one can use the following
df['max_of_all'] = df.max(axis=1)
[Out]:
a b c max max_of_all
0 12.0 5 0.1 12.0 12.0
1 9.0 7 8.0 9.0 9.0
2 1.1 2 12.9 2.0 12.9
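If performance matters on large frames, a vectorised NumPy variant avoids the row-wise lambda entirely; a small sketch, assuming both columns are numeric:
import numpy as np

# element-wise maximum of the two columns in one vectorised pass
df['max'] = np.maximum(df['a'], df['b'])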
I have this DataFrame and want only the records whose "Total" column is not NaN, dropping records where columns A~E have more than two NaNs:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, nor any error raised.
Please help! Thanks
Use boolean indexing: count the number of missing values per row across all columns except Total, and require non-missing values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
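An equivalent dropna-only spelling is also possible; a sketch, assuming the middle columns are literally named A through E. With thresh=3, a row needs at least three non-missing values among A~E, i.e. at most two NaNs:
# require Total to be present, then at most two NaNs across A~E
df = (df.dropna(subset=['Total'])
        .dropna(subset=list('ABCDE'), thresh=3))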
I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest are NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby(col).transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, it also doesn't give me the desired result.
Anybody have an idea?
It depends on the data. If there is always at most one non-missing value per group, you can sort and then forward-fill with GroupBy.ffill; this works well even if some groups have only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group; fails if a group is all NaN
#df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or, if there are multiple values per group and NaNs need to be replaced by the group mean, improve the performance of your solution by using GroupBy.transform to compute only the mean and passing it to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see docs):
df["value"] = df["value"].ffill()
Trying to group by in pandas, then sort values, and have a result column show what you need to add to get to the next row in the group; if you are at the end of the group, replace the value with the number 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: you have to add 4 to 2 to get 6, so within each group the values are sorted. But if there is no next value in the group, NaN is produced, and I want to replace it with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up, but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label').apply(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
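The same idea spelled out with a group-wise shift, in case diff(-1) looks opaque; a sketch equivalent to the one-liner above:
# next Val within each label group, in the original row order
next_val = df.groupby('label')['Val'].shift(-1)
# distance to that next value; 3 where the group has no next row
df['Results'] = (next_val - df['Val']).abs().fillna(3)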
I often get tables containing similar information from different sources for "QC". Sometimes I want to put these two tables side by side and output them to Excel to show others, so we can resolve discrepancies. To do so, I want a 'lazy' merge with pandas dataframes.
Say I have two tables:
df a:              df b:
   n  I  II           n  III  IV
0  a  1   2        0  a    1   2
1  a  3   4        1  a    0   0
2  b  5   6        2  b    5   6
3  c  9   9        3  b    7   8
I want to have results like:
a merge b:
   n  I  II  III  IV
0  a  1   2    1   2
1  a  3   4
2  b  5   6    5   6
3  b           7   8
4  c  9   9
Of course, this is what I got with merge():
a.merge(b, how='outer', on="n")
n I II III IV
0 a 1 2 1.0 2.0
1 a 1 2 0.0 0.0
2 a 3 4 1.0 2.0
3 a 3 4 0.0 0.0
4 b 5 6 5.0 6.0
5 b 5 6 7.0 8.0
6 c 9 9 NaN NaN
I feel there must be an easy way to do that, but all my solutions were convoluted.
Is there a parameter in merge or concat for something like "no_copy"?
It doesn't look like you can do it with the given columns alone; you need to introduce a cumulative-count column and add it to the merge keys. Consider this solution:
>>> import pandas
>>> dfa = pandas.DataFrame( {'n':['a','a','b','c'] , 'I' : [1,3,5,9] , 'II':[2,4,6,9]}, columns=['n','I','II'])
>>> dfb = pandas.DataFrame( {'n':['a','b','b'] , 'III' : [1,5,7] , 'IV':[2,6,8] }, columns=['n','III','IV'])
>>>
>>> dfa['nCC'] = dfa.groupby( 'n' ).cumcount()
>>> dfb['nCC'] = dfb.groupby( 'n' ).cumcount()
>>> dm = dfa.merge(dfb, how='outer', on=['n','nCC'] )
>>>
>>>
>>> dfa
n I II nCC
0 a 1 2 0
1 a 3 4 1
2 b 5 6 0
3 c 9 9 0
>>> dfb
n III IV nCC
0 a 1 2 0
1 b 5 6 0
2 b 7 8 1
>>> dm
n I II nCC III IV
0 a 1.0 2.0 0 1.0 2.0
1 a 3.0 4.0 1 NaN NaN
2 b 5.0 6.0 0 5.0 6.0
3 c 9.0 9.0 0 NaN NaN
4 b NaN NaN 1 7.0 8.0
>>>
It has the gaps, and avoids the duplication, where you want, although the index isn't quite identical to your output. Because NaNs are involved, the various columns get coerced to float64.
Adding the cumulative count essentially forces instances to match one-to-one across both sides: the first occurrence of a given key matches the first occurrence on the other side, and likewise for every later occurrence of every key.
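To get closer to the desired presentation, one can sort on the key plus the counter and then drop the helper column; a small follow-up sketch in the same session:
>>> dm.sort_values(['n','nCC']).drop(columns='nCC').reset_index(drop=True)
   n    I   II  III   IV
0  a  1.0  2.0  1.0  2.0
1  a  3.0  4.0  NaN  NaN
2  b  5.0  6.0  5.0  6.0
3  b  NaN  NaN  7.0  8.0
4  c  9.0  9.0  NaN  NaN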