Keep all cells above given value in pandas DataFrame - python

I would like to discard all cells that contain a value below a given threshold. Not just whole rows or whole columns, but every individual cell.
I tried the code below, where every cell should be at least 3, but it doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.

If you want "all values in each cell should be at least 3"
df [df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3"
df = df [df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 NaN NaN 5.0
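Both variants can also be written without modifying df in place. A minimal sketch, assuming the question's original data: DataFrame.clip(lower=3) raises every smaller value to 3, and DataFrame.where(df >= 3) blanks out the cells that fail the condition. The names raised and kept are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 6, 2], 'B': [9, 9, 1, 2], 'C': [1, 1, 3, 5]})

# Raise every value below 3 up to 3; df itself is left untouched
raised = df.clip(lower=3)

# Keep cells that are at least 3; all other cells become NaN
kept = df.where(df >= 3)

print(raised)
print(kept)
```

Because NaN is a float, the columns of kept are converted to float, which is why the output shows 9.0 rather than 9.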

You can check whether each value is >= 3 and then drop all rows that contain a NaN value.
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3 ].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
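For comparison, a row mask can select the same rows without the detour through NaN. This sketch assumes the DEMO data above. The expression (df >= 3).any(axis=1), as tried in the question, is True for every row here because each row has at least one value of 3 or more, whereas .all(axis=1) keeps only rows in which every cell passes. The names any_rows and all_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 6, 3], 'B': [9, 9, 1, 3], 'C': [1, 1, 3, 5]})

# any(axis=1): a row passes if at least one cell is >= 3
any_rows = df[(df >= 3).any(axis=1)]

# all(axis=1): a row passes only if every cell is >= 3
all_rows = df[(df >= 3).all(axis=1)].reset_index(drop=True)

print(len(any_rows))  # 4, so nothing was filtered out
print(all_rows)
```

Since no NaN is introduced, the integer dtype of the columns is preserved.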

Related

I can't find the min value (which is > 0) in each row in selected columns df[df[col]>0]

This is my data, and I want to find the min value of the selected columns (a, b, c, d) in each row, then calculate the difference between that and dd. I need to ignore 0 in rows; for example, in the first row I need to find 8.
need to ignore 0 in rows
Then just replace it with NaN. Consider the following simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,0],"B":[3,5,7],"C":[7,0,7]})
# Treat zeros as missing so they are skipped, then take the row-wise minimum
df["minvalue"] = df.replace(0, np.nan).min(axis=1)
print(df)
gives output
A B C minvalue
0 1 3 7 1.0
1 2 5 0 2.0
2 0 7 7 7.0
You can use DataFrame.apply with axis=1: take the columns ['a','b','c','d'] of each row as a Series, replace 0 with +inf, and find the min. Finally, subtract column 'dd' from that min.
import numpy as np
import pandas as pd
# Row-wise: smallest non-zero value among a, b, c, d, minus dd
df['min_dd'] = df.apply(lambda row: min(pd.Series(row[['a','b','c','d']]).replace(0,np.inf)) - row['dd'], axis=1)
print(df)
a b c d dd min_dd
0 0 15 0 8 6 2.0 # min_without_zero : 8 , dd : 6 -> 8-6=2
1 2 0 5 3 2 0.0 # min_without_zero : 2 , dd : 2 -> 2-2=0
2 5 3 3 0 2 1.0 # 3 - 2
3 0 2 3 4 2 0.0 # 2 - 2
You can try
cols = ['a','b','c','d']
df['res'] = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']
print(df)
a b c d dd res
0 0 15 0 8 6 2.0
1 2 0 5 3 2 0.0
2 5 3 3 0 2 1.0
3 2 3 4 4 2 0.0
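The question's data was not included as text, so the frame below is reconstructed from the printed output of the last answer and should be treated as an assumption. Under that assumption, this sketch runs the mask-then-min approach end to end; nonzero is an illustrative name.

```python
import pandas as pd

# Reconstructed from the printed output above (an assumption, not the asker's file)
df = pd.DataFrame({
    'a':  [0, 2, 5, 2],
    'b':  [15, 0, 3, 3],
    'c':  [0, 5, 3, 4],
    'd':  [8, 3, 0, 4],
    'dd': [6, 2, 2, 2],
})
cols = ['a', 'b', 'c', 'd']

# Zeros become NaN, which min() skips by default
nonzero = df[cols].where(df[cols].ne(0))
df['res'] = nonzero.min(axis=1) - df['dd']
print(df)
```

A row that contains only zeros in the selected columns would give NaN here rather than raising an error.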

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want to keep only the records whose "Total" column is not NaN, and to drop records where A~E has more than two NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e., something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like nothing has been modified, and no error is raised either.
Please help! Thanks.
Use boolean indexing: count the missing values per row across all columns except Total, and combine that with a check that Total is not missing:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
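The same filter can also be expressed with dropna alone. A sketch, assuming the question's table is loaded as below: subset=['Total'] drops rows whose Total is missing, and thresh=3 on columns A to E requires at least three non-missing values, which is the same as allowing at most two NaN among those five columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 3, 2],
    'B': [1, 4, 6, 2],
    'C': [3, 3, np.nan, 5],
    'D': [5, 5, np.nan, 9],
    'E': [5, 5, np.nan, np.nan],
    'Total': [8, np.nan, 6, 8],
})

result = (
    df.dropna(subset=['Total'])                # Total must be present
      .dropna(subset=list('ABCDE'), thresh=3)  # at most two NaN in A..E
)
print(result)
```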

Pandas Dataframe Question: Subtract next row and add specific value if NaN

I'm trying to group by a column in pandas, sort the values, and have a result column show what you need to add to get to the next row in the group. If a row is the last one in its group, the value should be replaced with the number 3. Does anyone have an idea how to do this?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below; for example, you have to add 4 to 2 to get 6, so the groups are sorted. If there is no next value in the group, the result is NaN, which should be replaced with the value 3. This is what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this and was thinking of shifting the values up, but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label').apply(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
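The answer above relies on each group's rows already appearing in ascending order of Val. If that cannot be assumed, one option is to sort first and let index alignment put the results back in the original row order. A sketch using the question's data, with ordered and next_val as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2, 6, 6, 4, 16, 8]})

# Sort within each label so "next row" means "next larger value"
ordered = df.sort_values(['label', 'Val'])

# Value of the following row in the same group (NaN for the last row)
next_val = ordered.groupby('label')['Val'].shift(-1)

# Assignment aligns on the index, so the original row order is kept
df['Results'] = (next_val - ordered['Val']).fillna(3)
print(df)
```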

Pandas - combine columns and put one after another?

I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
There are many "a" and "b" headers in the dataframe; the maximum is a50 and b50. So I am looking for a way to combine them all into just "a" and "b".
I think it's possible to do with concat, but I have no idea how to combine it all, putting all the values under each other. I'll be grateful for any ideas.
You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
a b
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 4 6
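To make the arguments of the one-liner easier to follow, here is a self-contained sketch with keyword arguments. The level name num and the variable long_df are illustrative choices, and the explicit sort_index is an addition that pins down the row order instead of relying on the order wide_to_long happens to return.

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2, 3], 'a2': [2, 3, 4],
                   'b1': [3, 4, 5], 'b2': [4, 5, 6]})

long_df = pd.wide_to_long(
    df.reset_index(),      # the 'index' column serves as the row identifier
    stubnames=['a', 'b'],  # column name prefixes to collect
    i='index',             # identifier column
    j='num',               # new index level holding the numeric suffix
)

# Order by suffix first, then by original row, and drop the helper index
result = long_df.sort_index(level=['num', 'index']).reset_index(drop=True)[['a', 'b']]
print(result)
```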
First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
# Split names such as 'a1' or 'a50' into the letter and the full number
cols_numbers = pd.DataFrame(stacked
                            .level_1
                            .str.split(r'(\d+)')
                            .apply(lambda l: l[:2])
                            .tolist(),
                            columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
col num val index
0 a 1 1 0
1 a 2 2 0
2 b 1 3 0
3 b 2 4 0
4 a 1 2 1
5 a 2 3 1
6 b 1 4 1
7 b 2 5 1
8 a 1 3 2
9 a 2 4 2
10 b 1 5 2
11 b 2 6 2
Finally, we group by index and num to get two columns a and b, and we fill the first row of the b column with the second value, to get what was expected:
result = (x
          .set_index("col", append=True)
          .groupby(["index", "num"])
          .val
          .apply(lambda g:
                 g
                 .unstack()
                 .bfill()
                 .head(1))
          .reset_index(-1, drop=True))
print(result)
col          a    b
index num
0     1    1.0  3.0
      2    2.0  4.0
1     1    2.0  4.0
      2    3.0  5.0
2     1    3.0  5.0
      2    4.0  6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)
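Since the question mentions concat, here is a sketch of that route. It assumes every column is named with a single-letter prefix followed by a number, as in a1 through b50, and that each a column has a matching b column. The names suffixes and parts are illustrative.

```python
import pandas as pd
from io import StringIO

s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s))

# Numeric suffixes present in the header, in ascending order: ['1', '2']
suffixes = sorted({col[1:] for col in df.columns}, key=int)

# One two-column block per suffix, renamed to plain 'a' and 'b'
parts = [
    df[[f'a{n}', f'b{n}']].rename(columns={f'a{n}': 'a', f'b{n}': 'b'})
    for n in suffixes
]

# Stack the blocks under each other with a fresh 0..n-1 index
result = pd.concat(parts, ignore_index=True)
print(result)
```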

append two data frames with unequal columns

I am trying to append two dataframes in pandas that have different numbers of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 Null(or)0
5 Null(or)0
I am using
df1.append(df2) when the columns are the same, but I have no idea how to deal with an unequal number of columns.
How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, on pandas versions before 2.0, df1.append(df2) still works (DataFrame.append was deprecated in 1.4 and removed in 2.0):
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.
Use concat to join the two dataframes and pass the additional argument ignore_index=True to reset the index; otherwise you might end up with the index 0 1 2 0 1 (see the pandas.concat documentation for more):
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN
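The question also allows 0 instead of Null for the missing B values. A sketch of that variant, assuming the same df1 and df2 as above: fill the gaps after concatenating and, if integers are wanted, cast the column back.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5]})

df = pd.concat([df1, df2], ignore_index=True)

# B is float after the concat because of the NaN; fill and cast back to int
df['B'] = df['B'].fillna(0).astype(int)
print(df)
```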
