I cannot figure out how to use the index results from np.where in a for loop. I want to use this for loop to ONLY change the values of a column given the np.where index results.
This is a hypothetical example of a situation where I want to find certain problems or anomalies in my dataset, grab their locations with np.where, and then loop over the dataframe to recode them as NaN while leaving every other row untouched.
Here is my simple code attempt so far:
import pandas as pd
import numpy as np
# import iris
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/iris.csv')
# conditional np.where -- hypothetical problem data
find_error = np.where((df['petal_length'] == 1.6) &
                      (df['petal_width'] == 0.2))
# loop over column to change error into NA
for i in enumerate(find_error):
    df = df['species'].replace({'setosa': np.nan})
# df[i] is a problem but I cannot figure out how to get around this or an alternative
You can directly assign to the column:
m = (df['petal_length'] == 1.6) & (df['petal_width'] == 0.2)
df.loc[m, 'species'] = np.nan
Or, fixing your code with np.where:
df['species'] = np.where(m, np.nan, df['species'])
Or, using Series.mask:
df['species'] = df['species'].mask(m)
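If you really want to start from the positional indices that np.where returns, one option (a minimal sketch, assuming the find_error tuple from your question) is to translate the positions back into row labels and assign through .loc:
# np.where returns a tuple of positional index arrays
pos = find_error[0]
# map positions to row labels, then set only those rows to NaN
df.loc[df.index[pos], 'species'] = np.nan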
I was replacing values in columns and noticed that if I use mask on the whole dataframe, it produces the expected result, but if I use it on selected columns via .loc, it doesn't change any values.
Can you explain why, and tell me whether this is the expected behaviour?
You can reproduce this with a dataframe dt containing zeros in its columns:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# will replace all zeros to nan, OK.
But:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't change anything. I expect the B and C columns to have their values replaced
I guess it's because the DataFrame.loc property is just giving you a copy of a slice of your dataframe, so you are masking a copy and it doesn't affect the original data.
You can try this instead:
dt[columns] = dt[columns].mask(dt[columns] == 0)
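Equivalently, assuming the same dt and columns, you can assign the result of a non-inplace mask back through .loc so the values land in the original frame:
dt.loc[:, columns] = dt.loc[:, columns].mask(dt[columns] == 0)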
The .loc selection here returns a copy of the dataframe. On this copy you are applying the mask function, which performs the operation in place on the copy's data. You can't do this in a one-liner, because the copy then becomes inaccessible. To get access to that data you have to split the code into two lines, keeping a reference to the copy:
tmp = dt.loc[:, columns]
tmp.mask(tmp[columns] == 0, np.nan, inplace=True)
and then you can go and update the dataframe:
dt[columns] = tmp
If, on the other hand, you do not use the in-place update of the mask function, you can do everything in one line of code:
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
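To see the copy behaviour concretely, a small check (a sketch reusing the dt and columns defined above) confirms that the original frame is untouched by the in-place mask on the .loc selection:
before = dt.copy()
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
print(dt.equals(before))  # True: only the intermediate copy was modified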
Extra:
If you want to better understand the use of the inplace parameter in pandas, I recommend you read these posts:
Understanding inplace=True in pandas
In pandas, is inplace = True considered harmful, or not?
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a list comprehension, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P
columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222),(866, 888),(152, 158)]
INTERVAL= P.Interval(*[P.closed(x, y) for x, y in RANGE])
def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df
z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows from different intervals (so a variable number of rows) to the existing dataframe.
Here I have found different ways to add several rows, but none of them is easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to make it work for a bunch of rows at once.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
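If the goal is to append a variable number of rows per interval set, another option (a sketch reusing the column names and portion-style intervals from the question) is to build a small dataframe for each batch and concatenate it onto the existing one:
def append_intervals(df, junction, chr, type):
    new_rows = pd.DataFrame({
        'chr': chr,
        'Start': [x.lower for x in junction],
        'End': [x.upper for x in junction],
        'type': type,
    })
    return pd.concat([df, new_rows], ignore_index=True)

z = append_intervals(x, INTERVAL, 1, 'DUP')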
If you want to use append and add several rows at once, you can create a second DataFrame and simply append it to the first one. It looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, which specializes in classical and full interval arithmetic, is used here. To define an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter lets the indexing of the first table continue.
If you want to add one row at a time, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can do it without creating a second table when you only want to add a few rows.
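Note that DataFrame.append was deprecated and removed in pandas 2.0, so on recent versions the same result is obtained with pd.concat (using the df and df2 built above):
df = pd.concat([df, df2], ignore_index=True)
print(df.head(10))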
I am trying to find the indices that satisfy certain conditions in a pandas DataFrame.
For example, given a dataframe with a column A, I want, for every row i, to find the smallest index j such that df['A'].iloc[j] >= df['A'].iloc[i] + 3, and store that j in a new column B.
I got this working with a for loop, but I believe there is a more efficient way to achieve it.
Thank you for your reply!
My code is:
for i in range(len(df)):
    df['B'].iloc[i] = df[df['A'] >= df['A'].iloc[i] + 3].index[0]
but the for loop is too slow for a large data set.
Try the following method :)
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,3,5,8,10,12]})
b = pd.DataFrame(df.values - (df['A'].values + 3), index=df.index)  # b.iloc[j, i] = A[j] - (A[i] + 3)
df['B'] = b.where(b >= 0).idxmin()  # per column i: index j of the smallest non-negative difference (the first qualifying j since A is sorted)
df
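If column A is sorted in ascending order, as in the sample data, a vectorized alternative (a sketch, not part of the original answer) is np.searchsorted, which returns for each target value the first position whose entry is greater than or equal to it:
# first position j with df['A'].iloc[j] >= df['A'].iloc[i] + 3;
# a result equal to len(df) means no such j exists for that row
df['B'] = np.searchsorted(df['A'].values, df['A'].values + 3)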
I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions and millions of rows that I am investigating using sample SQL extracts in pandas. Along with the SQL extracts read into pandas DataFrames I also have some rules tables, each with a few columns (these are also loaded into dataframes).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row of my rules table, I would like to generate a 1, else a 0. In the end I would like to add a column to my SQL extract, called RULE_RESULT, containing these 1's and 0's.
I have a system that works using df.merge, but it produces many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df, rules):
    # create a column that mimics the index to keep track of later duplication
    df['IND'] = df.index
    # merge the rules with the data on C values
    df = df.merge(rules, left_on='C_C', right_on='C_C', how='left')
    # create a rule_result column with a default value of zero
    df['RULE_RESULT'] = 0
    # create a mask identifying those test_df rows that fit with a
    # rules_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    # use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index.tolist(), 'RULE_RESULT'] = 1
    # drop the redundant rules_df columns
    df = df.drop(['B_B', 'C_C', 'A_MIN', 'A_MAX'], axis=1)
    # drop duplicate rows
    df = df.drop_duplicates(keep='first')
    # drop rows where the original index is duplicated and the rule result
    # is zero
    df = df[((df['IND'].duplicated(keep=False)) & (df['RULE_RESULT'] == 0)) == False]
    # reset the df index with the original index
    df.index = df['IND'].values
    # drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape', df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
I hope I have made clear what I would like to do, and thank you for your help.
PS: my rep is non-existent, so I was not allowed to post more than two supporting images.
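For what it's worth, one merge-free way to sketch the same matching logic (an illustrative sketch only, reusing the column names from the example above, not a tested answer) is to loop over the handful of rule rows and OR the per-rule masks together, so no duplicate rows are ever created:
def rules_to_mask(df, rules):
    # start with nothing matched
    matched = pd.Series(False, index=df.index)
    for _, rule in rules.iterrows():
        m = pd.Series(True, index=df.index)
        # a null rule value means "no constraint on this column"
        if pd.notnull(rule['A_MIN']):
            m &= df['A'] > rule['A_MIN']
        if pd.notnull(rule['A_MAX']):
            m &= df['A'] < rule['A_MAX']
        if pd.notnull(rule['B_B']):
            m &= df['B'] == rule['B_B']
        if pd.notnull(rule['C_C']):
            m &= df['C'] == rule['C_C']
        matched |= m
    return matched.astype(int)

test_df['RULE_RESULT'] = rules_to_mask(test_df, rules_df)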
The original dataframe:
from pandas import Series, DataFrame
import pandas as pd
%pylab inline
df = pd.read_csv('NYC_Restaurants.csv', dtype=str)  # dtype=unicode in the original Python 2 code
I used a mask to isolate the desired rows (those whose DBA value occurs only once in the column):
mask = df['DBA'].value_counts()[df['DBA'].value_counts() == 1]
which produces the expected result
However, using df[mask] produces a strange dataframe with the first column repeated many times, instead of giving back the original dataframe with only the selected rows.
Instead of using value_counts(), I used groupby, which provided exactly what I was looking for:
mask = df.groupby("DBA").filter(lambda x: len(x) == 1)
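The underlying problem is that value_counts() returns a Series indexed by the DBA values themselves (with counts as values), not a boolean mask aligned to df's row index, so df[mask] is not interpreted as a row filter. If you prefer to stay with value_counts(), a sketch along the same lines:
counts = df['DBA'].value_counts()
# keep only the rows whose DBA value appears exactly once
singletons = df[df['DBA'].isin(counts[counts == 1].index)]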