Pandas average columns with same value in other columns [duplicate] - python

This question already has answers here:
Pandas: using groupby to get mean for each data category
(2 answers)
Pandas groupby mean - into a dataframe?
(4 answers)
Closed 5 years ago.
I have a df that looks like this:
from pandas import DataFrame
headings = ['foo','bar','qui','gon','jin']
table = [[1,1,3,4,5],
         [1,1,4,5,6],
         [2,2,3,4,5],
         [2,2,4,5,6],
         ]
df = DataFrame(columns=headings,data=table)
foo bar qui gon jin
0 1 1 3 4 5
1 1 1 4 5 6
2 2 2 3 4 5
3 2 2 4 5 6
What I want to do is average the values of all columns across the rows that share the same value in a certain column, e.g. I want to average all the rows with the same 'bar' value and then create a dataframe with the result. I tried the following:
newDf = DataFrame([])
for i in df['bar'].loc[1:2]:
    newDf = newDf.append(df[df['foo'] == i].mean(axis=0),ignore_index=True)
And it outputs what I want:
bar foo gon jin qui
0 1.00E+00 1.00E+00 4.50E+00 5.50E+00 3.50E+00
1 2.00E+00 2.00E+00 4.50E+00 5.50E+00 3.50E+00
But when I try that with another column with value, it does not output what I want:
for i in df['qui'].loc[1:2]:
    newDf = newDf.append(df[df['foo'] == i].mean(axis=0),ignore_index=True)
Produces
bar foo gon jin qui
0 NAN NAN NAN NAN NAN
1 NAN NAN NAN NAN NAN
Can you give me a hand?
Side question: how do I prevent the columns of the new dataframe from being ordered alphabetically? Is it possible to maintain the order of the original dataframe?
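One likely reason for the NaN rows: the second loop iterates over the 'qui' values (4 and 3) but still filters on df['foo'], which only contains 1 and 2, so the filter matches no rows and the mean of an empty selection is NaN. As the linked duplicates suggest, groupby avoids the loop altogether. The following is a minimal sketch, assuming 'bar' is the grouping column; the variable name means is made up for illustration.

```python
import pandas as pd

headings = ["foo", "bar", "qui", "gon", "jin"]
table = [
    [1, 1, 3, 4, 5],
    [1, 1, 4, 5, 6],
    [2, 2, 3, 4, 5],
    [2, 2, 4, 5, 6],
]
df = pd.DataFrame(table, columns=headings)

# Average every other column within each distinct value of 'bar'.
# as_index=False keeps 'bar' as a regular column instead of the index.
means = df.groupby("bar", as_index=False).mean()

# Side question: restore the column order of the original frame.
means = means[df.columns]
print(means)
```

Grouping by a different column only requires changing the name passed to groupby.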

Related

I have a for loop that generates different pandas dataframes whose values I want to save sequentially [duplicate]

This question already has answers here:
Merge multiple dataframes based on a common column [duplicate]
(4 answers)
Merge multiple DataFrames Pandas
(5 answers)
pandas three-way joining multiple dataframes on columns
(12 answers)
Closed 10 months ago.
I made a for loop that creates a different pandas dataframe on each iteration. Something like this ->
First iteration:
index  Letter  Value
0      A       1
1      B       2
2      C       3
Second iteration:
index  Letter  Value
0      C       5
1      D       3
2      E       1
3      F       2
Third iteration:
index  Letter  Value
0      A       2
1      F       1
I want to save each dataframe to a new one that looks like this:
index  Letter  Value  Value  Value
0      A       1             2
1      B       2
2      C       3      5
3      D              3
4      E              1
5      F              2      1
Also, new letters can appear on each iteration, so for example if 'G' appears for the first time on iteration 'n', a new row would need to be created in the desired consolidated dataframe.
You can make Letter the index for each dataframe, and then use pd.concat with axis=1:
dataframes = [df1, df2, df3]
new_df = pd.concat([d.set_index('Letter') for d in dataframes], axis=1)
Output:
>>> new_df
Value Value Value
Letter
A 1.0 NaN 2.0
B 2.0 NaN NaN
C 3.0 5.0 NaN
D NaN 3.0 NaN
E NaN 1.0 NaN
F NaN 2.0 1.0
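For completeness, here is a self-contained sketch of the same idea with the three example frames from the question built by hand. The Value_1, Value_2, ... column names are an assumption added for illustration so the columns can be told apart; they are not part of the original answer.

```python
import pandas as pd

# The three frames produced by the loop in the question.
df1 = pd.DataFrame({"Letter": ["A", "B", "C"], "Value": [1, 2, 3]})
df2 = pd.DataFrame({"Letter": ["C", "D", "E", "F"], "Value": [5, 3, 1, 2]})
df3 = pd.DataFrame({"Letter": ["A", "F"], "Value": [2, 1]})

frames = []
for i, d in enumerate([df1, df2, df3], start=1):
    # Index by Letter and give each iteration its own (hypothetical) column name.
    frames.append(d.set_index("Letter")["Value"].rename(f"Value_{i}"))

# Align on Letter; letters missing from an iteration become NaN.
new_df = pd.concat(frames, axis=1).sort_index()
print(new_df)
```

A letter that first appears in a later iteration simply adds a new row, because concat takes the union of the indexes.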

Comparing two columns and if condition is met add '1' to a new column [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I cannot figure out how to compare two columns and, if one column is greater than or equal to the other, put '1' in a new column. If the condition is not met I would like Python to do nothing.
The data set for testing is here:
data = [[12,10],[15,10],[8,5],[4,5],[15,'NA'],[5,'NA'],[10,10], [9,10]]
df = pd.DataFrame(data, columns = ['Score', 'Benchmark'])
Score Benchmark
0 12 10
1 15 10
2 8 5
3 4 5
4 15 NA
5 5 NA
6 10 10
7 9 10
The desired output is:
desired_output_data = [[12,10, 1],[15,10,1],[8,5,1],[4,5],[15,'NA'],[5,'NA'],[10,10,1], [9,10]]
desired_output_df = pd.DataFrame(desired_output_data, columns = ['Score', 'Benchmark', 'MetBench'])
Score Benchmark MetBench
0 12 10 1.0
1 15 10 1.0
2 8 5 1.0
3 4 5 NaN
4 15 NA NaN
5 5 NA NaN
6 10 10 1.0
7 9 10 NaN
I tried doing something like this:
if df['Score'] >= df['Benchmark']:
    df['MetBench'] = 1
I am new to programming in general so any guidance would be greatly appreciated.
Thank you!
You can use ge and map:
df.Score.ge(df.Benchmark).map({True: 1, False:np.nan})
or rely on the implicit mapping from False to np.nan, since pandas uses the dict.get method to apply the mapping, and None is the default value (thanks @piRSquared):
df.Score.ge(df.Benchmark).map({True: 1})
Or simply series.where
df.Score.ge(df.Benchmark).where(lambda s: s)
Both outputs
0 1.0
1 1.0
2 1.0
3 NaN
4 NaN
5 NaN
6 1.0
7 NaN
dtype: float64
Make sure to do
df['Benchmark'] = pd.to_numeric(df['Benchmark'], errors='coerce')
first, since you have 'NA' as a string, but you need the numeric value np.nan to be able to compare it with other numbers
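Putting the pieces together, a minimal end-to-end sketch with the test data from the question could look like this. It uses the ge and map variant and stores the result in the MetBench column named in the desired output. The comparison yields one boolean per row, which is why a plain if statement on the whole column does not work here.

```python
import numpy as np
import pandas as pd

data = [[12, 10], [15, 10], [8, 5], [4, 5], [15, "NA"], [5, "NA"], [10, 10], [9, 10]]
df = pd.DataFrame(data, columns=["Score", "Benchmark"])

# Turn the 'NA' strings into real missing values so the column is numeric.
df["Benchmark"] = pd.to_numeric(df["Benchmark"], errors="coerce")

# Row-wise comparison; comparisons against NaN evaluate to False.
met = df["Score"].ge(df["Benchmark"])

# 1 where the benchmark is met, NaN everywhere else.
df["MetBench"] = met.map({True: 1, False: np.nan})
print(df)
```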

In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column? [duplicate]

This question already has answers here:
Pandas groupby multiple fields then diff
(2 answers)
Closed 4 years ago.
This is part of a larger project, but I've broken my problem down into steps, so here's the first step. Take a Pandas dataframe, like this:
index | user time
---------------------
0 F 0
1 T 0
2 T 0
3 T 1
4 B 1
5 K 2
6 J 2
7 T 3
8 J 4
9 B 4
For each unique user, can I extract the difference between the values in column "time," but with some conditions?
So, for example, there are two instances of user J, and the "time" difference between these two instances is 2. Can I extract the difference, 2, between these two rows? Then if that user appears again, extract the difference between that row and the previous appearance of that user in the dataframe?
I believe you need DataFrameGroupBy.diff:
df['new'] = df.groupby('user')['time'].diff()
print (df)
user time new
0 F 0 NaN
1 T 0 NaN
2 T 0 0.0
3 T 1 1.0
4 B 1 NaN
5 K 2 NaN
6 J 2 NaN
7 T 3 2.0
8 J 4 2.0
9 B 4 3.0
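As a runnable sketch of the above, the frame from the question can be rebuilt by hand. The manual variant with shift is added only to illustrate what diff computes per group; the column name new_manual is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["F", "T", "T", "T", "B", "K", "J", "T", "J", "B"],
    "time": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4],
})

# Difference to the previous row of the same user (NaN on first appearance).
df["new"] = df.groupby("user")["time"].diff()

# Equivalent by hand: subtract the previous time seen for the same user.
df["new_manual"] = df["time"] - df.groupby("user")["time"].shift()
print(df)
```

Because the grouping is by user, the rows being compared do not have to be adjacent in the dataframe.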
I think np.where and pandas shift can do this.
This subtracts two consecutive time values, but only if the users on both rows are the same (so it only handles adjacent rows):
df1 = np.where(df['user'] == df['user'].shift(-1), df['time'] - df['time'].shift(-1), np.nan)

Populating the column value with previous when NaN [duplicate]

This question already has answers here:
How to replace NaNs by preceding or next values in pandas DataFrame?
(10 answers)
Closed 5 years ago.
I have a pd.Series that looks like this:
>>> series
0 This is a foo bar something...
1 NaN
2 NaN
3 foo bar indeed something...
4 NaN
5 NaN
6 foo your bar self...
7 NaN
8 NaN
How do I populate the NaN column values with the previous non NaN value in the series?
I have tried this:
new_column = []
for row in list(series):
    if type(row) == str:
        new_column.append(row)
    else:
        new_column.append(new_column[-1])
series = pd.Series(new_column)
But is there another way to do the same in pandas?
From the docs:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
...
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use NEXT valid observation to fill gap.
So:
series.fillna(method='ffill')
Some explanation:
ffill / pad: Forward fill uses the value from the previous row that isn't NA to populate the NA value. pad is just a verbose alias for ffill.
bfill / backfill: Back fill uses the value from the next row that isn't NA to populate the NA value. backfill is just a verbose alias for bfill.
In code:
>>> import pandas as pd
>>> import numpy as np
>>> np.nan
nan
>>> series = pd.Series([np.nan, 'abc', np.nan, np.nan, 'def', np.nan, np.nan])
>>> series
0 NaN
1 abc
2 NaN
3 NaN
4 def
5 NaN
6 NaN
dtype: object
>>> series.fillna(method='ffill')
0 NaN
1 abc
2 abc
3 abc
4 def
5 def
6 def
dtype: object
>>> series.fillna(method='bfill')
0 abc
1 abc
2 def
3 def
4 def
5 NaN
6 NaN
dtype: object
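In more recent pandas versions the method argument of fillna is deprecated, and the dedicated Series.ffill and Series.bfill methods do the same job. A short sketch, assuming a reasonably current pandas install; the limit argument is shown only as an optional extra and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([np.nan, "abc", np.nan, np.nan, "def", np.nan, np.nan])

# Propagate the last valid value forward; the leading NaN has nothing to copy.
forward = series.ffill()

# Pull the next valid value backward; the trailing NaNs stay NaN.
backward = series.bfill()

# Optionally fill at most one consecutive NaN per gap.
limited = series.ffill(limit=1)

print(forward, backward, limited, sep="\n\n")
```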

Pandas pivot_table do not comply with values order

I have an issue with pandas pivot_table.
Sometimes, the order of the columns specified in the "values" list is not respected in the output:
In [11]: p = pivot_table(df, values=["x","y"], cols=["month"],
rows="name", aggfunc=np.sum)
I get the wrong order (y,x) instead of (x,y):
Out[12]:
y x
month 1 2 3 1 2 3
name
a 1 NaN 7 2 NaN 8
b 3 NaN 9 4 NaN 10
c NaN 5 NaN NaN 6 NaN
Is there something I am doing wrong?
According to the pandas documentation, values should take the name of a single column, not an iterable.
values : column to aggregate, optional
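Whatever order pivot_table returns, one workaround is to reorder the top column level explicitly after pivoting. The sketch below is an assumption-laden illustration rather than the original poster's setup: the input frame is reconstructed from the output shown in the question, and it uses the index and columns keywords of current pandas instead of the older rows and cols.

```python
import pandas as pd

# Data reconstructed from the pivoted output in the question.
df = pd.DataFrame({
    "name": ["a", "a", "b", "b", "c"],
    "month": [1, 3, 1, 3, 2],
    "x": [2, 8, 4, 10, 6],
    "y": [1, 7, 3, 9, 5],
})

values = ["x", "y"]
p = pd.pivot_table(df, values=values, index="name", columns="month", aggfunc="sum")

# Select the top-level column groups in the requested order.
ordered = p[values]
print(ordered)
```

Selecting with a list of top-level labels returns the column groups in the order of that list, so the result no longer depends on how pivot_table sorted them.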
