Drop rows that contain the same value in pandas DataFrame [duplicate] - python

I'm currently working on a data frame like the one below:
artist             week1  week2  week3  week4
Drake                  2      2      3      1
Muse                  NA     NA     NA     NA
Bruno Mars             3      3      4      2
Imagine Dragons       NA     NA     NA     NA
Justin Timberlake      2      2     NA      1
What I want to do is to drop the rows that only contain "NA" values. The result should be something like this:
artist             week1  week2  week3  week4
Drake                  2      2      3      1
Bruno Mars             3      3      4      2
Justin Timberlake      2      2     NA      1
I've tried using the pandas dropna() function, but by default it drops every row with at least one "NA" value. In that case, the row for Justin Timberlake would be dropped too, and that's not what I need.
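For reference, a minimal reconstruction of the frame (a sketch; column names and dtypes are assumed from the question, with NA cells as np.nan):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'artist': ['Drake', 'Muse', 'Bruno Mars', 'Imagine Dragons', 'Justin Timberlake'],
    'week1': [2, np.nan, 3, np.nan, 2],
    'week2': [2, np.nan, 3, np.nan, 2],
    'week3': [3, np.nan, 4, np.nan, np.nan],
    'week4': [1, np.nan, 2, np.nan, 1],
})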

Use df.dropna() with how='all', meaning: if all values are NA, drop that row or column. Then restrict the check to the week columns with subset:
df = df.dropna(how='all', subset=['week1', 'week2', 'week3', 'week4'])
print(df)
Or keep only the rows with at least 2 non-NA values:
df = df.dropna(thresh=2)
print(df)
artist week1 week2 week3 week4
0 Drake 2.0 2.0 3.0 1.0
2 Bruno Mars 3.0 3.0 4.0 2.0
4 Justin Timberlake 2.0 2.0 NaN 1.0
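Note that thresh counts every non-NA cell in the row, including 'artist', so an all-NA week row still has one non-NA value and thresh=2 drops it. A quick way to see the counts (assuming the reconstruction above, run on the original frame before dropping):
# Number of non-NA cells per row; rows below thresh get dropped.
print(df.notna().sum(axis=1))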

Related

How to assign a column the value that is above it only if a condition is met?

So I have a dataframe where I have some empty values in a column. I need those empty values to be assigned to the next real value above them, whether it is 1 row above or 4 rows above. But, the caveat is that I only needs those empty values to be filled in if a certain condition is met.
Dataframe currently looks like:
Column A  Column B
1         100
1         NaN
1         NaN
2         150
2         NaN
2         NaN
3         NaN
3         NaN
4         60
5         70
5         NaN
I need it to look like:
Column A  Column B
1         100
1         100
1         100
2         150
2         150
2         150
3         NaN
3         NaN
4         60
5         70
5         70
So the first value for each grouping in column A needs to be carried down for that grouping in column B: all rows with a 1 in column A should share the same column B value, all rows with a 2 should share theirs, and so on. The value to use is always the first one; in other words, the first row where a new value appears in column A holds the correct Column B value to carry down.
I really have no idea how to approach this. I was thinking about using groupby but that didn't make much sense.
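For testing, a minimal reconstruction of the frame (a sketch; column names taken from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Column A': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5],
    'Column B': [100, np.nan, np.nan, 150, np.nan, np.nan,
                 np.nan, np.nan, 60, 70, np.nan],
})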
I think groupby is the way to go:
g = df.groupby('Column A')
df['Column B'] = g['Column B'].ffill()
Output:
Column A Column B
0 1 100.00
1 1 100.00
2 1 100.00
3 2 150.00
4 2 150.00
5 2 150.00
6 3 NaN
7 3 NaN
8 4 60.00
9 5 70.00
10 5 70.00
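Since the question states that the value to carry is always the first one in each group, a groupby transform is an equivalent sketch (assuming the frame above; 'first' skips NaN, so all-NaN groups like 3 stay NaN):
df['Column B'] = df.groupby('Column A')['Column B'].transform('first')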

Rolling average matching cases across multiple columns

I'm sorry if this has been asked but I can't find another question like this.
I have a data frame in Pandas like this:
Home Away Home_Score Away_Score
MIL NYC 1 2
ATL NYC 1 3
NYC PHX 2 1
HOU NYC 1 6
I want to calculate the moving average for each team, but the catch is that I want to do it for all of their games, both home and away combined.
So for a moving average window of size 3 for 'NYC' the answer should be (2+3+2)/3 for row 1 and then (3+2+6)/3 for row 2, etc.
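To make this reproducible, a minimal sketch of the frame (names from the question):
import pandas as pd

df = pd.DataFrame({
    'Home': ['MIL', 'ATL', 'NYC', 'HOU'],
    'Away': ['NYC', 'NYC', 'PHX', 'NYC'],
    'Home_Score': [1, 1, 2, 1],
    'Away_Score': [2, 3, 1, 6],
})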
You can exploit stack to convert the two columns into one and groupby:
# Stack the two score columns into one long Series, group by the matching
# stacked team names, take the rolling mean per team, then restore the shape.
(df[['Home_Score','Away_Score']]
 .stack()
 .groupby(df[['Home','Away']].stack().values)
 .rolling(3).mean()
 .reset_index(level=0, drop=True)
 .unstack()
 .add_prefix('Avg_')
)
Output:
Avg_Away_Score Avg_Home_Score
0 NaN NaN
1 NaN NaN
2 NaN 2.333333
3 3.666667 NaN
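An equivalent, arguably more readable route is to build an explicit long table first. A sketch under the same assumptions ('team', 'score', and 'avg' are names introduced here):
home = df[['Home', 'Home_Score']].rename(columns={'Home': 'team', 'Home_Score': 'score'})
away = df[['Away', 'Away_Score']].rename(columns={'Away': 'team', 'Away_Score': 'score'})
# Interleave so each game's home row precedes its away row, mirroring the stack order above.
long = pd.concat([home, away]).sort_index(kind='stable')
long['avg'] = long.groupby('team')['score'].transform(lambda s: s.rolling(3).mean())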

How to drop 1st level index and then merge the remaining index values with custom logic for a pd DataFrame?

Say I have a MultiIndex DataFrame like so:
price volume
year product city
2010 A LA 10 7
B SF 7 9
C NY 7 6
LA 18 21
SF 4 8
2011 A LA 13 5
B SF 2 4
C NY 9 3
SF 2 0
I want to do a somewhat complex merge where the first level of the DataFrame index (year) is dropped and the duplicates in the now first level index (product) in the DataFrame get merged according to some custom logic. In this case I would like to be able to set the price column to use the value from the 2010 outer index and the volume column to use the values from the 2011 outer index, but I would like a general solution that can be applied to more columns should they exist.
Final DataFrame would look like this, where the price values are those from the 2010 index and the volume values are those from the 2011 index, where missing values are filled with NaNs.
price volume
product city
A LA 10 5
B SF 7 4
C NY 7 3
LA 18 NaN
SF 4 0
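A minimal reconstruction of the frame (a sketch; the display suggests the LA/SF rows under each year belong to product C, which is assumed here):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2010, 'A', 'LA'), (2010, 'B', 'SF'), (2010, 'C', 'NY'),
     (2010, 'C', 'LA'), (2010, 'C', 'SF'),
     (2011, 'A', 'LA'), (2011, 'B', 'SF'), (2011, 'C', 'NY'), (2011, 'C', 'SF')],
    names=['year', 'product', 'city'])
df = pd.DataFrame({'price': [10, 7, 7, 18, 4, 13, 2, 9, 2],
                   'volume': [7, 9, 6, 21, 8, 5, 4, 3, 0]}, index=idx)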
You can select the first level with DataFrame.xs and then concat:
df = pd.concat([df.xs(2010)['price'], df.xs(2011)['volume']], axis=1)
It is also possible to use loc:
df = pd.concat([df.loc[2010, 'price'], df.loc[2011, 'volume']], axis=1)
print (df)
price volume
product city
A LA 10 5.0
B SF 7 4.0
C LA 18 NaN
NY 7 3.0
SF 4 0.0
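For the requested general solution (arbitrary columns, each taken from a chosen year), one hedged sketch is a column-to-year mapping; 'spec' is a name introduced here, not from the question:
# Map each column to the year whose values it should keep.
spec = {'price': 2010, 'volume': 2011}
df_out = pd.concat({col: df.xs(year)[col] for col, year in spec.items()}, axis=1)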

pandas aggregate dataframe returns only one column

Hi there.
I have a pandas DataFrame (df) like this:
foo id1 bar id2
0 8.0 1 NULL 1
1 5.0 1 NULL 1
2 3.0 1 NULL 1
3 4.0 1 1 2
4 7.0 1 3 2
5 9.0 1 4 3
6 5.0 1 2 3
7 7.0 1 3 1
...
I want to group by id1 and id2 and try to get the mean of foo and bar.
My code:
res = df.groupby(["id1","id2"])["foo","bar"].mean()
What I get is almost what I expect:
foo
id1 id2
1 1 5.750000
2 7.000000
2 1 3.500000
2 1.500000
3 1 6.000000
2 5.333333
The values in column "foo" are exactly the average values (means) that I am looking for but where is my column "bar"?
So if it were SQL, I would be looking for the result of:
"select avg(foo), avg(bar) from dataframe group by id1, id2;"
(Sorry for this, but I am more of an SQL person and new to pandas, though I need it now.)
What I alternatively tried:
groupedFrame = res.groupby(["id1","id2"])
aggrFrame = groupedFrame.aggregate(numpy.mean)
Which gives me exactly the same result, still missing column "bar".
Sites I read:
http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html
and documentation for group-by but I cannot post the link here.
What am I doing wrong? Thanks in advance.
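For reproduction, a sketch of the visible rows only (the question's "..." means more data exists; bar holding the literal string "NULL" is an assumption consistent with the answer below, and it is the crux of the problem):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'foo': [8.0, 5.0, 3.0, 4.0, 7.0, 9.0, 5.0, 7.0],
    'id1': [1, 1, 1, 1, 1, 1, 1, 1],
    'bar': ['NULL', 'NULL', 'NULL', '1', '3', '4', '2', '3'],
    'id2': [1, 1, 1, 2, 2, 3, 3, 1],
})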
The problem is that your column bar is not numeric, so the aggregation function omits it.
You can check the dtype of the omitted column; it is not numeric:
print (df['bar'].dtype)
object
See automatic exclusion of nuisance columns in the pandas documentation.
The solution is to convert the string values to numeric before aggregating and, where that is not possible, get NaN instead, using to_numeric with errors='coerce':
df['bar'] = pd.to_numeric(df['bar'], errors='coerce')
res = df.groupby(["id1","id2"])[["foo","bar"]].mean()
print (res)
foo bar
id1 id2
1 1 5.75 3.0
2 5.50 2.0
3 7.00 3.0
But if you have mixed data (numeric together with strings), it is possible to use replace:
df['bar'] = df['bar'].replace("NULL", np.nan)
As stated earlier, you should replace your NULL values (with NaN, so they are excluded from the mean) before aggregating:
df.replace("NULL", np.nan).groupby(["id1","id2"])[["foo","bar"]].mean()
Output:
          foo  bar
id1 id2
1   1    5.75  3.0
    2    5.50  2.0
    3    7.00  3.0

Example to clarify fill_value when using add() on dataframes (pandas)

I'm learning pandas these days. I have a rudimentary question regarding the fill_value parameter when using add() on dataframes.
Imagine I have the following data:
dframe1:
A B
NYC 0 1
LA 2 3
dframe2:
A D C
NYC 0 1 2
SF 3 4 5
LA 6 7 8
Doing dframe1.add(dframe2,fill_value=0) yields:
A B C D
LA 8.0 3.0 8.0 7.0
NYC 0.0 1.0 2.0 1.0
SF 3.0 NaN 5.0 4.0
Why do I get NaN for column B, index SF?
I was expecting fill_value to ensure that no NaN occurs in the result by, in this case, pretending that columns D and C and index SF exist in dframe1 with value 0.
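A minimal reconstruction of the two frames (a sketch; values from the question):
import pandas as pd

dframe1 = pd.DataFrame({'A': [0, 2], 'B': [1, 3]}, index=['NYC', 'LA'])
dframe2 = pd.DataFrame({'A': [0, 3, 6], 'D': [1, 4, 7], 'C': [2, 5, 8]},
                       index=['NYC', 'SF', 'LA'])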
According to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html this is exactly the documented case:
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.
dframe1 has no row SF and dframe2 has no column B, so the location (SF, B) is missing from both inputs and fill_value never applies.
You probably already know fillna in pandas, which can fill the leftover NaN afterwards:
df.fillna('', inplace=True)
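So one hedged way to get the behavior the question expected is to fill after the aligned addition (assuming 0 is the desired default; 'result' is a name introduced here):
# (SF, B) is missing in both inputs, so add() leaves NaN there; fill it afterwards.
result = dframe1.add(dframe2, fill_value=0).fillna(0)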
