I am trying to find the index that satisfies certain conditions in a pandas DataFrame.
For example, we have the following dataframe, and for every row i I want to find
argmin(j) such that df['A'].iloc[j] >= df['A'].iloc[i] + 3
so the result will be given by
I got it working with a for loop, but I believe there is a more efficient way to achieve this.
Thank you for your reply!
My code is:
for i in range(len(df)):
    df['B'].iloc[i] = df[df['A'] >= df['A'].iloc[i] + 3].index[0]
but the for loop is too slow for a large data set.
Try the following method:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,3,5,8,10,12]})
# b[j, i] = df['A'].iloc[j] - (df['A'].iloc[i] + 3); keep only the non-negative entries,
# then take, for each column i, the index of the smallest remaining value (NaN if none qualifies).
b = pd.DataFrame(df.values - (df['A'].values + 3), index=df.index)
df['B'] = b.where(b >= 0).idxmin()
df
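If 'A' is sorted in increasing order (as it is in the example), a possibly faster alternative is np.searchsorted, which avoids building the full difference matrix. A minimal sketch, assuming the column is sorted ascending:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 8, 10, 12]})

a = df['A'].to_numpy()
# For every row i, find the first position j with a[j] >= a[i] + 3 (requires 'A' sorted ascending).
pos = np.searchsorted(a, a + 3, side='left')

# A position equal to len(df) means no row satisfies the condition; mark those as NaN.
labels = df.index.to_numpy()
df['B'] = np.where(pos < len(df), labels[np.minimum(pos, len(df) - 1)], np.nan)
df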
I have the following code:
import pandas as pd
df = pd.util.testing.makeDataFrame()
max_index = df.A.idxmax()
What I am trying to do is get the index value right above and below max_index in the dataframe. Could you please advise how this could be accomplished?
If you're unsure whether the index has duplicates, a safe way is:
import pandas as pd
df = pd.util.testing.makeDataFrame()
max_index = df.A.idxmax()
# shift(-1) moves the max value onto the previous row and shift() onto the next one,
# so idxmax of each shifted series returns the neighbouring label.
before = df['A'].shift(-1).idxmax()
after = df['A'].shift().idxmax()
If the indices are unique:
i = df.index.get_loc(max_index)
before, after = df.index[i-1], df.index[i+1]
Or, maybe slightly more efficient, and it also handles duplicated indices:
i = df.reset_index()['A'].idxmax()  # positional location of the max
before, max_index, after = df.index[i-1:i+2]
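A quick illustration of the positional variant on a small made-up frame (the values here are arbitrary, and the max is assumed not to sit in the first or last row):
import pandas as pd

df = pd.DataFrame({'A': [3, 7, 15, 6, 9]}, index=list('vwxyz'))

max_index = df['A'].idxmax()                      # 'x'
i = df.index.get_loc(max_index)                   # positional location of the max
before, after = df.index[i - 1], df.index[i + 1]  # 'w', 'y'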
import polars as pl
import pandas as pd
A = ['a','a','a','a','a','a','a','b','b','b','b','b','b','b']
B = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
df = pl.DataFrame({'cola': A,
                   'colb': B})
df_pd = df.to_pandas()
index = df_pd.groupby('cola')['colb'].idxmax()
df_pd.loc[index,'top'] = 1
In pandas I can get the 'top' column using idxmax(), as above.
However, in Polars I use arg_max():
index = df[pl.col('colb').arg_max().over('cola').flatten()]
but that does not seem to give what I want.
Is there any way to generate a 'top' column in Polars?
Thanks a lot!
In Polars, window functions (.over()) do an aggregation plus a self-join (see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.over.html?highlight=over#polars.Expr.over), which means you cannot return a unique value per row, which is what you are after.
A way to compute the top column is to use apply:
df.groupby("cola").apply(
    lambda x: x.with_columns([
        pl.col("colb"),
        (pl.col("colb") == pl.col("colb").max()).alias("top"),
    ])
)
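As a side note, in more recent Polars releases a window aggregation is broadcast back to every row of its group, so an expression-only version may also work; this is a sketch under that assumption, not tested against older versions:
import polars as pl

df = pl.DataFrame({'cola': ['a', 'a', 'b', 'b'], 'colb': [1, 5, 2, 9]})

# Compare each value with its group maximum; True marks the top row of each group
# (cast to an integer if a 1/0 column is preferred).
out = df.with_columns(
    (pl.col('colb') == pl.col('colb').max().over('cola')).alias('top')
)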
I am trying to replicate the following operation on a Dask dataframe, where I filter the dataframe on a column value and scale another column for those rows.
The pandas equivalent is:
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
When I try to do this on a Dask dataframe, it doesn't support item assignment:
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df2['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping there is a simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with Pandas too; or perhaps where (a where-based sketch follows after the code below). However, to keep things as similar to your original as possible, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are Pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
    df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
    return df

new_df = df.map_partitions(make_col)
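For the where route mentioned above, a minimal sketch that avoids .loc assignment entirely and works the same way in pandas and Dask, assuming tracked only takes the values 0 and 1:
# Pick 0.7 * gross_revenue where tracked == 1, otherwise fall back to 0.3 * gross_revenue.
df['adjusted_revenue'] = (0.7 * df['gross_revenue']).where(
    df['tracked'] == 1,
    0.3 * df['gross_revenue'],
)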
I cannot figure out how to use the index results from np.where in a for loop. I want to use this for loop to ONLY change the values of a column given the np.where index results.
This is a hypothetical example for a situation where I want to find the indexed location of certain problems or anomalies in my dataset, grab their locations with np.where, and then run a loop on the dataframe to recode them as NaN, while leaving every other index untouched.
Here is my simple code attempt so far:
import pandas as pd
import numpy as np
# import iris
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/iris.csv')
# conditional np.where -- hypothetical problem data
find_error = np.where((df['petal_length'] == 1.6) &
                      (df['petal_width'] == 0.2))
# loop over column to change error into NA
for i in enumerate(find_error):
    df = df['species'].replace({'setosa': np.nan})
    # df[i] is a problem but I cannot figure out how to get around this or an alternative
You can directly assign to the column:
m = (df['petal_length'] == 1.6) & (df['petal_width'] == 0.2)
df.loc[m, 'species'] = np.nan
Or, fixing your code:
df['species'] = np.where(m, np.nan, df['species'])
Or, using Series.mask:
df['species'] = df['species'].mask(m)
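Putting the mask together with the iris example from the question, a short self-contained version (assuming the 'species' column has no NaNs to begin with):
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/iris.csv')

# Boolean mask of the problem rows, then assign NaN only to those rows' 'species'.
m = (df['petal_length'] == 1.6) & (df['petal_width'] == 0.2)
df.loc[m, 'species'] = np.nan

# Only the rows matched by the mask were recoded; the two counts should agree.
print(df['species'].isna().sum(), m.sum())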
I have a dataframe like this
Now I want to normalize the strings in the 'comments' column to the word 'election'. I tried using fuzzywuzzy but wasn't able to apply it to the pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column, like this
Assume that I have around 100k rows, and the possible variants of the word 'election' can be many.
Kindly guide me on this part.
Building on the answer you gave, you can use the pandas apply, stack and groupby functions to accelerate your code. You have input such as:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'Merchant details': ['Alpha co', 'Bravo co'],
                   'Comments': ['electionsss are around',
                                'vote in eelecttions']})
For the 'Comments' column, you can create a temporary MultiIndex DataFrame containing one word per row by splitting and using the stack function:
df_temp = pd.DataFrame(
    {'split_comments': df['Comments'].str.split(' ', expand=True).stack()})
Then you create the column with the corrected word (following your idea), using apply and a fuzz.ratio comparison:
df_temp['corrected_comments'] = df_temp['split_comments'].apply(
    lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)
Finally, you write the corrected data back into the Comments column of df using the groupby and join functions:
df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
    lambda wd: ' '.join(wd['corrected_comments']))
Don't operate on the dataframe; the overhead will kill you. Turn the column into a list, iterate over that, and finally assign the list back to the column.
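A minimal sketch of that list-based approach, assuming the example df above with a 'Comments' column and a hypothetical targets list of words to normalize to:
from fuzzywuzzy import fuzz

targets = ['election']              # hypothetical list of words to normalize to

comments = df['Comments'].tolist()  # pull the column out once
for i, text in enumerate(comments):
    words = text.split()
    for k, w in enumerate(words):
        # Replace a word with the target when the fuzzy match is strong enough.
        for t in targets:
            if fuzz.ratio(w, t) > 75:
                words[k] = t
    comments[i] = ' '.join(words)

df['Comments'] = comments           # assign the list back in one go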
OK, I tried this myself and came up with this code:
for i in range(len(df)):
    a = df.comments[i].split()
    # word holds the target words to normalize to, e.g. ['election']
    for j in word:
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.comments[i] = ' '.join(a)
But this approach seems slow for a large dataframe.
Can someone provide a better, more Pythonic way of implementing this?