Pandas - drop all rows with 0 in at least two columns - python

I have some DataFrame:
import pandas as pd

df = pd.DataFrame({'name': ['apple1', 'apple2', 'apple3', 'apple4', 'orange1', 'orange2', 'orange3', 'orange4'],
                   'A': [0, 0, 0, 0, 0, 0, 0, 0],
                   'B': [0.10, -0.15, 0.25, -0.55, 0.50, -0.51, 0.70, 0],
                   'C': [0, 0, 0.25, -0.55, 0.50, -0.51, 0.70, 0.90],
                   'D': [0.10, -0.15, 0.25, 0, 0.50, -0.51, 0.70, 0.90]})
df
name A B C D
0 apple1 0 0.10 0.00 0.10
1 apple2 0 -0.15 0.00 -0.15
2 apple3 0 0.25 0.25 0.25
3 apple4 0 -0.55 -0.55 0.00
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
7 orange4 0 0.00 0.90 0.90
I'd like to drop all rows that have two or more zeros in columns A,B,C,D.
This DataFrame has other columns that have zeros; I only want to check for zeros in columns A,B,C,D.

You can use .eq to mark where the DataFrame equals 0, sum along axis=1 to count the zeros per row, and build a boolean mask by checking whether that count is greater than or equal to 2 (ge):
df[~df[['A','B','C','D']].eq(0).sum(axis=1).ge(2)]
name A B C D
2 apple3 0 0.25 0.25 0.25
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
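Equivalently, you can spell the mask out in named steps, which some readers find easier to follow; a minimal sketch, assuming the same df as above:
cols = ['A', 'B', 'C', 'D']                  # only these columns are checked for zeros
zero_count = df[cols].eq(0).sum(axis=1)      # number of zeros per row
kept = df[zero_count.lt(2)]                  # keep rows with fewer than two zeros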


Find number of datapoints in each range

I have a data frame that looks like this:
data = [['A', 0.20], ['B',0.25], ['C',0.11], ['D',0.30], ['E',0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of datapoints in the ranges 0-0.30 (both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. I can use pd.cut, but its bins don't overlap, whereas all of my ranges share the same fixed lower limit of 0.
Can someone help?
One option, using NumPy broadcasting:
import numpy as np
import pandas as pd

step = 0.01
up = np.arange(0, 0.3 + step, step)  # upper limits 0.00, 0.01, ..., 0.30
# compare every value against every upper limit, then count matches per limit
out = pd.Series((df['col2'].to_numpy()[:, None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
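To see what the broadcasting does, you can inspect the intermediate boolean matrix; a small sketch, assuming the df and up from above:
# one row per data point, one column per upper limit
mat = df['col2'].to_numpy()[:, None] <= up
mat.shape  # (len(df), len(up)); summing over axis=0 counts values <= each limit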
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3 + step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2))  # bin each value, label by upper edge
   .value_counts(sort=False)                     # per-bin counts, in bin order
   .cumsum()                                     # running total = counts for 0-limit
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
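A third option, not in the original answers: since every range starts at 0, sorting col2 once and using np.searchsorted yields the same cumulative counts; a minimal sketch, assuming the same df and up as above:
import numpy as np
import pandas as pd

vals = np.sort(df['col2'].to_numpy())
# side='right' counts the values <= each upper limit
out = pd.Series(np.searchsorted(vals, up, side='right'), index=up.round(2))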

How can I insert values from a list into a pandas DataFrame column?

I have this DataFrame:
index  x     y
0      0     3
1      0.07  4
2      0.1   6
3      0.13  5
I want to insert new x values into the x column:
new_x = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]
so that the DataFrame becomes:
index  x     y
0      0     3
1      0.03  NaN
2      0.07  4
3      0.1   6
4      0.13  5
5      0.17  NaN
6      0.2   NaN
So basically, for every new_x value that doesn't exist in column x, the y value should be NaN.
Is it possible to do this in pandas? Thank you.
You can use NumPy's searchsorted.
First create a new_y array of NaNs with the same length as new_x, then use searchsorted to find the positions in new_y where the old y values belong.
new_y = np.full(len(new_x), np.nan, np.float64)  # start with all-NaN y values
new_y[np.searchsorted(new_x, df.x)] = df.y       # place old y at matching x positions
pd.DataFrame({'x': new_x, 'y': new_y})
x y
0 0.00 3.0
1 0.03 NaN
2 0.07 4.0
3 0.10 6.0
4 0.13 5.0
5 0.17 NaN
6 0.20 NaN
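Note that this relies on new_x being sorted and on every old x value appearing in new_x exactly; with floats, a tolerance-based match is safer. A hypothetical variant using nearest-neighbour matching:
import numpy as np

# for each old x, find the index of the closest value in new_x
idx = np.abs(np.asarray(new_x)[:, None] - df['x'].to_numpy()).argmin(axis=0)
new_y = np.full(len(new_x), np.nan)
new_y[idx] = df['y'].to_numpy()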
This is a straightforward application of pandas' merge function, specifically a left join.
import pandas as pd
x1 = [0, 0.07, 0.1, 0.13]
y1 = [3, 4, 6, 5]
df1 = pd.DataFrame({"x": x1, "y": y1})
print(df1)
x2 = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]
df2 = pd.DataFrame({"x": x2})
print(df2)
df3 = df2.merge(df1, how="left", on="x")
print(df3)
x y
0 0.00 3
1 0.07 4
2 0.10 6
3 0.13 5
x
0 0.00
1 0.03
2 0.07
3 0.10
4 0.13
5 0.17
6 0.20
x y
0 0.00 3.0
1 0.03 NaN
2 0.07 4.0
3 0.10 6.0
4 0.13 5.0
5 0.17 NaN
6 0.20 NaN
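One caveat, not mentioned in the answer: merging on a float column requires exact equality, so keys produced by arithmetic may silently fail to match. Rounding both sides first avoids that (assuming two decimals are enough here):
df3 = (df2.assign(x=df2['x'].round(2))
          .merge(df1.assign(x=df1['x'].round(2)), how='left', on='x'))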
You could try the join method. Here is some sample code you can refer to:
d1 = {'x': [0, 0.07, 0.1, 0.13], 'y': [3,4,6,5]}
d2 = {'x': [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df2.set_index('x').join(df1.set_index('x'), on='x', how='left').reset_index()
Try this: compare the values already present in the column with the new ones, build a DataFrame from the remaining values, and concatenate it to the old DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [0, 0.07, 0.1, 0.13], 'y': [3, 4, 6, 5]})
new_list = [0, 0.03, 0.07, 0.1, 0.13, 0.17, 0.2]

def diff(new_list, col_list):
    # values present in one list but missing from the other
    if len(new_list) > len(col_list):
        return list(set(new_list) - set(col_list))
    return list(set(col_list) - set(new_list))

new_df = pd.DataFrame({'x': diff(new_list, df['x'].to_list()), 'y': np.nan})
fin_df = pd.concat([df, new_df]).reset_index(drop=True)
fin_df
x y
0 0.00 3.0
1 0.07 4.0
2 0.10 6.0
3 0.13 5.0
4 0.03 NaN
5 0.20 NaN
6 0.17 NaN
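Because the set difference loses ordering, the appended rows end up after the originals; if the ascending x order of the expected output matters, sort at the end:
fin_df = fin_df.sort_values('x', ignore_index=True)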

Pandas: sum DataFrame column for max value only

I have the following DataFrame:
df = pd.DataFrame({'a': [0.28, 0, 0.25, 0.85, 0.1],
                   'b': [0.5, 0.5, 0, 0.75, 0.1],
                   'c': [0.33, 0.7, 0.25, 0.2, 0.5],
                   'd': [0, 0.25, 0.2, 0.66, 0.1]})
Output:
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
For each column, I want to sum the values that are the max of their row.
For example, column b holds the row max only in row 0, so its sum is just that one value, 0.50 -- but column c holds the row max in rows 1, 2, and 4, so all three of those values should be summed.
Expected output:
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
count 1.10 0.50 1.45 0.00
Use where to keep only the values that are row maxima, then sum each column and append the result as a count row:
df.append(
    df.where(                # only keep values that are max for their row
        df.eq(               # compare the row max to all values in the row,
                             # in case there is more than one max
            df.max(axis=1),  # the row-wise max values
            axis=0
        )
    ).sum().rename('count')
)
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
count 1.10 0.50 1.45 0.00
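Note that DataFrame.append was removed in pandas 2.0; an equivalent with pd.concat, using the same logic as above:
import pandas as pd

row_max = df.eq(df.max(axis=1), axis=0)           # True where a value is its row's max
count = df.where(row_max).sum().rename('count')   # sum only the row-max values per column
out = pd.concat([df, count.to_frame().T])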
The fastest way to do this would be to use the .max() method, passing the axis argument:
df.max(axis=1)
If you're after another column:
df['column_name'] = df.max(axis=1)
I didn't read the question that well!

Parse a string of floats in Python

l = [['10, 0.01, 0.0428, 120; 30, 0.1, 2, 33; 50, 0.023, 0.31, 0.65'],
     ['10, 0.7, 0.5428, 2.31'],
     ['50, 0.3, 0.35, 0.1'],
     ['-10, 0.2, 0.048, 124; -30, 0.11, 24, 3; -50, 0.02, 0.1, 0.60; 0, 0, 0, 0; 10, 0.1, 2, 33; 20, 0.023, 0.31, 0.66']]
df = pd.DataFrame(l)
I have a DataFrame df where each cell holds semicolon-separated groups of comma-separated numbers. I am trying to take the 3rd value of each group and put it in the column named after the group's 1st value. The expected output is shown below. Any clue on how to tackle this in a simple way?
Use nested loops:
d = {}
for r, i in enumerate(l):
    for j in i[0].split(';'):          # each semicolon-separated group
        k = j.split(',')
        c, v = int(k[0]), float(k[2])  # 1st value -> column name, 3rd -> cell value
        d[(r, c)] = v
df = pd.Series(d).unstack(fill_value=0)
Output:
>>> df
-50 -30 -10 0 10 20 30 50
0 0.0 0.0 0.000 0.0 0.0428 0.00 2.0 0.31
1 0.0 0.0 0.000 0.0 0.5428 0.00 0.0 0.00
2 0.0 0.0 0.000 0.0 0.0000 0.00 0.0 0.35
3 0.1 24.0 0.048 0.0 2.0000 0.31 0.0 0.00
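The same parse can be written as a flat list comprehension plus pivot, if you prefer building a long table first; a sketch assuming the same l as above:
import pandas as pd

rows = [(r, int(g.split(',')[0]), float(g.split(',')[2]))
        for r, item in enumerate(l)
        for g in item[0].split(';')]
out = (pd.DataFrame(rows, columns=['row', 'col', 'val'])
         .pivot(index='row', columns='col', values='val')
         .fillna(0))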

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
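If the column should stay float (1.0/0.0, as in the expected result) rather than int, cast the mask accordingly:
df['unclassified'] = m.astype(float)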
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
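The three masks can also be collapsed into a single expression:
df.loc[(df[['mental_illness', 'feeling', 'flavor']] < 0.5).all(axis=1), 'unclassified'] = 1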
Here is my solution:
data.unclassified = data[['mental_illness', 'feeling', 'flavor']].apply(
    lambda x: x.le(0.5)).apply(lambda x: 1 if sum(x) == 3 else 0, axis=1)
Output:
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
