I have the following dataframe
import pandas as pd

x = pd.DataFrame(
    {
        'FirstGroupCriterium': [1, 1, 2, 2, 3],
        'SortingCriteria': [1, 1, 1, 2, 1],
        'Value': [10, 20, 30, 40, 50]
    }
)
x.sort_values('SortingCriteria').groupby('FirstGroupCriterium').agg(last_value=('Value', 'last'))
The latter outputs:
                     last_value
FirstGroupCriterium
1                            20
2                            40
3                            50
What I would like to have is to sum the values belonging to the last SortingCriteria in each group. So in this case:
                     last_value
FirstGroupCriterium
1                    10+20 = 30
2                            40
3                            50
My initial idea was to call a custom aggregator function that groups the data yet again, but that fails.
def last_value(group):
    return group.groupby('SortingCriteria')['Value'].sum().tail(1)
Do you have any idea how to get this to work? Thank you!
Sort by both columns first, then filter the last rows per FirstGroupCriterium using GroupBy.transform, and aggregate the sum:
df = x.sort_values(['FirstGroupCriterium','SortingCriteria'])
df1 = df[df['SortingCriteria'].eq(df.groupby('FirstGroupCriterium')['SortingCriteria'].transform('last'))]
print (df1)
FirstGroupCriterium SortingCriteria Value
0 1 1 10
1 1 1 20
3 2 2 40
4 3 1 50
df2 = df1.groupby(['FirstGroupCriterium'],as_index=False)['Value'].sum()
print (df2)
FirstGroupCriterium Value
0 1 30
1 2 40
2 3 50
Another idea is to aggregate the sum by both columns and then remove duplicates, keeping the last row, with DataFrame.drop_duplicates:
df2 = (df.groupby(['FirstGroupCriterium','SortingCriteria'],as_index=False)['Value'].sum()
         .drop_duplicates(['FirstGroupCriterium'], keep='last'))
print (df2)
FirstGroupCriterium SortingCriteria Value
0 1 1 30
2 2 2 40
3 3 1 50
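The custom-aggregator idea from the question can also be made to work: agg passes the function a single column (here the Value Series, which has no SortingCriteria column to group by), while GroupBy.apply passes each whole sub-DataFrame. A minimal sketch on the same x as above:
def last_value(group):
    # sum Value per SortingCriteria inside the group, then keep the sum for the last (highest) SortingCriteria
    return group.groupby('SortingCriteria')['Value'].sum().iloc[-1]

print(x.groupby('FirstGroupCriterium').apply(last_value))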
I have a dataframe
import pandas as pd

df1 = pd.DataFrame([["A",1,98,56,61,1,4,6], ["B",1,79,54,36,2,5,7], ["C",1,97,32,83,3,6,8],
                    ["B",1,96,31,90,4,7,9], ["C",1,45,32,12,5,8,10], ["A",1,67,33,55,6,9,11]],
                   columns=["id","date","c1","c2","c3","x","y","z"])
I have another dataframe where conditions for selected columns are present
df2 = pd.DataFrame([["c2",40], ["c1",80], ["c3",90]], columns=["col","condition"])
Perform operations on df1 based on the conditions present in df2. For example, if the condition for c1 in df2 is 80, change the values in column c1 of df1 to -1 where the value is less than 80, and to 1 where it is 80 or higher. Perform similar operations for the other columns present in df2.
Expected Output:
df_out = pd.DataFrame([["A",1,1,1,-1,1,4,6], ["B",1,-1,1,-1,2,5,7], ["C",1,1,-1,-1,3,6,8],
                       ["B",1,1,-1,1,4,7,9], ["C",1,-1,-1,-1,5,8,10], ["A",1,-1,-1,-1,6,9,11]],
                      columns=["id","date","c1","c2","c3","x","y","z"])
How to do it?
Convert df2 to a Series first, then create a mask by comparing with the column names, compare for greater or equal by DataFrame.ge with the Series, and pass the result to numpy.where:
import numpy as np

s = df2.set_index('col')['condition']
m = df1.columns.isin(s.index)
df1.loc[:, m] = np.where(df1.loc[:, m].ge(s), 1, -1)
print (df1)
id date c1 c2 c3 x y z
0 A 1 1 1 -1 1 4 6
1 B 1 -1 1 -1 2 5 7
2 C 1 1 -1 -1 3 6 8
3 B 1 1 -1 1 4 7 9
4 C 1 -1 -1 -1 5 8 10
5 A 1 -1 -1 -1 6 9 11
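If you prefer to see the same logic spelled out, a loop-based sketch over the condition rows (same df1/df2 as above) would be:
import numpy as np

# one condition column at a time; skip condition names that match no column
for col, threshold in df2.itertuples(index=False):
    if col in df1.columns:
        df1[col] = np.where(df1[col].ge(threshold), 1, -1)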
I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30th-32nd columns (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of the dataframe.
print(Max_kVA_df.iloc[:, 30:33].max())
Hi, you can refer to this example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})
print(df)
# Mention the range of columns you want; in your case change 0:3 to 30:33 (33 is excluded)
ser = df.iloc[:, 0:3].max()
print(ser.max())
Output
8
Select values by positions and use np.max:
Sample, for the maximum of the first 5 rows:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc this way:
print(df.iloc[:, [30, 31, 32]].max().max())
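The question also asks for the index of the maximum; a small sketch on top of the same positional slice (using the question's Max_kVA_df) could be:
sub = Max_kVA_df.iloc[:, 30:33]   # the 3 columns at positions 30, 31 and 32
print(sub.max().max())            # overall maximum of those columns
print(sub.stack().idxmax())       # (row index, column name) where that maximum occurs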
I have a dataframe with about 20k rows, with headings X, Y, Z, I, R, G, B (yes, it's a point cloud).
I would like to create numerous sub-dataframes by grouping the data into rows of 100 after sorting by column X.
Subsequently I would like to sort each sub-dataframe by the Y column and break it down further into rows of 50.
The end result should be a group of sub-dataframes of 50 rows each, and I would like to pick out all the rows with the highest Z value in each sub-dataframe and write them to a CSV file.
I have reached the following point with my code, but I am not sure how to continue further.
import pandas as pd
headings = ['x', 'y', 'z']
data = pd.read_table('file.csv', sep=',', skiprows=[0], names=headings)
points = data.sort_values(by=['x'])
Considering a dummy dataframe of 1000 rows,
df.head() # first 5 rows
X Y Z I R G B
0 6 6 0 3 7 0 2
1 0 8 3 6 5 9 7
2 8 9 7 3 0 4 5
3 9 6 8 5 1 0 0
4 9 0 3 0 9 2 9
First, extract the highest value of Z from the dataframe,
import numpy as np  # np.split is used below

z_max = df['Z'].max()
df = df.sort_values('X')

# list of dataframes
dfs_X = np.split(df, len(df) / 100)

results = pd.DataFrame()
for idx, df_x in enumerate(dfs_X):
    dfs_X[idx] = df_x.sort_values('Y')
    dfs_Y = np.split(dfs_X[idx], len(dfs_X[idx]) / 50)
    for idy, df_y in enumerate(dfs_Y):
        rows = df_y[df_y['Z'] == z_max]
        results = pd.concat([results, rows])  # DataFrame.append was removed in pandas 2.x
results.head()
results will contain the rows from all sub-dataframes which have the highest value of Z.
Output: First 5 rows
X Y Z I R G B
541 0 0 9 0 3 6 2
610 0 2 9 3 0 7 6
133 0 4 9 3 3 9 9
731 0 5 9 5 1 0 2
629 0 5 9 0 9 7 7
Now, write this dataframe to CSV using results.to_csv().
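If you instead want the highest-Z rows per 50-row chunk (which is how the question phrases it) rather than the rows matching the global maximum, a sketch with the same splitting logic could look like this; the output filename is only a placeholder:
import numpy as np
import pandas as pd

df = df.sort_values('X')
chunks = []
for df_x in np.split(df, len(df) // 100):            # 100-row groups after sorting by X
    df_x = df_x.sort_values('Y')
    for df_y in np.split(df_x, len(df_x) // 50):     # 50-row groups after sorting by Y
        chunks.append(df_y[df_y['Z'] == df_y['Z'].max()])  # all rows tying this chunk's max Z

results = pd.concat(chunks)
results.to_csv('highest_z_per_chunk.csv', index=False)     # hypothetical filename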
I am trying to sort a pandas df into individual columns based on when values in columns change. For the df below I can sort the df into separate columns when a value changes in Col B. But I'm trying to add Col C so it's when values change in both Col B and Col C.
import pandas as pd
df = pd.DataFrame({
    'A': [10,20,30,40,40,30,20,10,5,10,15,20,20,15,10,5],
    'B': ['X','X','X','X','Y','Y','Y','Y','X','X','X','X','Y','Y','Y','Y'],
    'C': ['W','W','Z','Z','Z','Z','W','W','W','W','Z','Z','Z','Z','W','W'],
})
d = df['B'].ne(df['B'].shift()).cumsum()
df['C'] = d.groupby(df['B']).transform(lambda x: pd.factorize(x)[0]).add(1).astype(str)
df['D'] = df.groupby(['B','C']).cumcount()
df = df.set_index(['D','C','B'])['A'].unstack([2,1])
df.columns = df.columns.map(''.join)
Output:
X1 Y1 X2 Y2
D
0 10 40 5 20
1 20 30 10 15
2 30 20 15 10
3 40 10 20 5
As you can see, this creates a new column every time there's a new value in Col B. But I'm trying to incorporate Col C as well. So it should be every time there's a change in both Col B and Col C.
Intended output:
XW1 XZ1 YZ1 YW1 XW2 XZ2 YZ2 YW2
0 10 30 40 20 5 15 20 10
1 20 40 30 10 10 20 15 5
Just based on your output, create the helper columns one by one:
df['key'] = df.B + df.C                            # create the key
df['key2'] = (df.key != df.key.shift()).cumsum()   # give every consecutive run of the same key one group number
df.key2 = df.groupby('key').key2.apply(lambda x: x.astype('category').cat.codes + 1)  # renumber each key's runs to 1, 2, ...
df['key3'] = df.groupby(['key', 'key2']).cumcount()   # row position within each run, used as the pivot index
df['key'] = df.key + df.key2.astype(str)               # final column labels, e.g. XW1, XW2
df.pivot(index='key3', columns='key', values='A')  # yields
Out[126]:
key XW1 XW2 XZ1 XZ2 YW1 YW2 YZ1 YZ2
key3
0 10 5 30 15 20 10 40 20
1 20 10 40 20 10 5 30 15
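If you prefer to stay close to the unstack approach from the question, a sketch that builds the run key from both B and C (restated from scratch so it is self-contained) could be:
import pandas as pd

df = pd.DataFrame({
    'A': [10,20,30,40,40,30,20,10,5,10,15,20,20,15,10,5],
    'B': ['X','X','X','X','Y','Y','Y','Y','X','X','X','X','Y','Y','Y','Y'],
    'C': ['W','W','Z','Z','Z','Z','W','W','W','W','Z','Z','Z','Z','W','W'],
})

key = df['B'] + df['C']                           # combined label, e.g. 'XW'
runs = key.ne(key.shift()).cumsum()               # new run id whenever the label changes
occ = runs.groupby(key).transform(lambda x: pd.factorize(x)[0]).add(1).astype(str)  # 1st/2nd run per label
pos = df.groupby([key, occ]).cumcount()           # row position inside each run
out = df.set_index([pos, key + occ])['A'].unstack()   # columns like XW1, XZ1, ..., YW2
print(out)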