I'm trying to match values in a matrix in Python using pandas DataFrames. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import pandas as pd
d = {'stores':['','','','',''],'col1': ['x','price','','',1],'col2':['y','quantity','',1,''], 'col3':['z','',1,'',''] }
df = pd.DataFrame(data=d)
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 NaN NaN NaN 1
3 NaN NaN 1 NaN
4 NaN 1 NaN NaN
I'm trying to get the following:
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 z NaN NaN 1
3 y NaN 1 NaN
4 x 1 NaN NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far, but it's terrible (and obviously not working). I am sure there is a much simpler way of doing this, but I just can't get my head around it.
stores = ['x','y','z']
for i in stores:
    for v in df.iloc[0,:]:
        if i==v :
            df['stores'] = i
It yields the following:
stores col1 col2 col3
0 z x y z
1 z price quantity NaN
2 z NaN NaN 1
3 z NaN 1 NaN
4 z 1 NaN NaN
Thank you in advance.
You can complete this task with a loop. It iterates over every column except the first ('stores', where the result is written), finds the index values where the column equals 1, and writes that column's first-row value to 'stores' for those rows.
Be careful if a row has 1s in more than one column: in that case 'stores' ends up with the value from the last column that had a 1.
for col in df.columns[1:]:
    index_values = df[col][df[col]==1].index.tolist()
    df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However this adds an "x" in rows where none of the columns has a 1. If that is a problem you could do something like this:
import numpy as np
df["NA"] = 1 # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"] = df["stores"].replace(1, np.nan) # replace the 1s with NaNs
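As a self-contained check of the idxmax idea, here is a minimal sketch that rebuilds the example frame with real NaN values (an assumption; the original dict uses empty strings) and blanks out rows without a 1 using where instead of a helper column. The names value_cols, is_one, and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# Example frame from the question, with NaN in place of the empty strings (assumption).
df = pd.DataFrame({
    "stores": [np.nan] * 5,
    "col1": ["x", "price", np.nan, np.nan, 1],
    "col2": ["y", "quantity", np.nan, 1, np.nan],
    "col3": ["z", np.nan, 1, np.nan, np.nan],
})

value_cols = ["col1", "col2", "col3"]
is_one = df[value_cols].eq(1)        # True where a cell equals 1
labels = df.loc[0, value_cols]       # maps col1 -> x, col2 -> y, col3 -> z

# idxmax gives the first column holding True; map it to the first-row label,
# then keep the label only for rows that actually contain a 1.
first_hit = is_one.idxmax(axis=1).map(labels)
df["stores"] = first_hit.where(is_one.any(axis=1))

print(df)
```

Like the loop version, this only records one column per row; here it is the first column containing a 1 rather than the last.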
One common thing people seem to want to do in pandas is to replace None values with the next or previous non-None value. This is easily done with .fillna. However, I want to do something similar but different.
I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries the first n columns hold the entries and the remaining columns are Nones).
What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.
I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
Thanks!
With help from NumPy, this is quite easy. Counting the number of None values in each row gives, for each row, the column holding the last non-None value. NumPy indexing can then set that value to None:
import numpy as np
import pandas as pd
data = np.random.random((6,10))
df = pd.DataFrame(data)
df.iloc[0, 7:] = None
df.iloc[1, 6:] = None
df.iloc[2, 5:] = None
df.iloc[3, 8:] = None
df.iloc[4, 5:] = None
df.iloc[5, 4:] = None
Original dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 0.521422 NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 0.010205 NaN
2 0.061320 0.565979 0.344755 NaN NaN NaN
3 0.521936 0.057917 0.359699 0.484009 NaN NaN
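isnull = df.isnull()
col = (df.shape[1] - isnull.sum(axis=1) - 1).to_numpy() # position of the last non-null entry in each row
arr = df.to_numpy(copy=True) # work on an explicit copy; writing through df.values is not guaranteed to reach df
arr[np.arange(len(df)), col] = np.nan
df = pd.DataFrame(arr, index=df.index, columns=df.columns)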
Updated dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 NaN NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 NaN NaN
2 0.061320 0.565979 NaN NaN NaN NaN
3 0.521936 0.057917 0.359699 NaN NaN NaN
You can find the index of the element to replace in each row with np.argmax():
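arr = df.to_numpy(dtype=float, copy=True) # explicit copy; df.to_numpy() may not share memory with df
indices = np.isnan(arr).argmax(axis=1) - 1 # gives -1 (the last column) for rows without any NaN
arr[np.arange(len(df)), indices] = np.nan
df = pd.DataFrame(arr, index=df.index, columns=df.columns)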
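For comparison, here is a sketch that stays within pandas. It assumes the rows are left-adjusted as described in the question: a cell is the last entry of its row exactly when it is non-null and its right-hand neighbour is null (or it sits in the last column). The small frame and the names is_last and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small left-adjusted frame (illustrative data, not from the question).
df = pd.DataFrame([
    [1.0, 2.0, 3.0, np.nan],
    [4.0, np.nan, np.nan, np.nan],
    [5.0, 6.0, 7.0, 8.0],
])

# shift(-1, axis=1) moves every column one step left, so each cell sees its
# right-hand neighbour; the last column sees NaN and therefore always qualifies.
is_last = df.notna() & df.shift(-1, axis=1).isna()

# mask() replaces the flagged cells with NaN and returns a new frame.
result = df.mask(is_last)

print(result)
```

Because mask returns a new DataFrame, this avoids writing into the array behind df, at the cost of one shifted copy of the data.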
How can I set all my values in df1 as missing if their position equivalent is a missing value in df2?
Data df1:
Index Data
1 3
2 8
3 9
Data df2:
Index Data
1 nan
2 2
3 nan
desired output:
Index Data
1 nan
2 8
3 nan
So I would like to keep the data of df1, but only for the positions for which df2 also has data entries. For all nans in df2 I would like to replace the value of df1 with nan as well.
I tried the following, but this replaced all data points with nan.
df1 = df1.where(df2== np.nan, np.nan)
Thank you very much for your help.
Use mask, which does exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
In your case, where keeps the values for which the condition is True and replaces all the others. Because equality is not the correct way to check for NaN (np.nan == np.nan yields False), df2 == np.nan is False everywhere, so every value was replaced with NaN.
Replace df2 == np.nan with df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
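The two answers can be checked side by side with a short, self-contained sketch. The frames are rebuilt from the question's data, and the variable names (always_false, masked, kept) are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Data": [3, 8, 9]}, index=[1, 2, 3])
df2 = pd.DataFrame({"Data": [np.nan, 2, np.nan]}, index=[1, 2, 3])

# Comparing against np.nan with == is False everywhere, even where df2 is NaN.
always_false = df2 == np.nan

# mask() blanks the cells where the condition is True ...
masked = df1.mask(df2.isna())
# ... while where() keeps the cells where the condition is True.
kept = df1.where(df2.notna())

print(masked)
```

Both calls produce the same frame; which one to use is a matter of whether the condition reads more naturally as "hide these" or "keep these".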
I have the following dataframe as "w":
A B
0 Alex Benedict
1 John NaN
I want to find the maximum from these 2 columns and store it in "A" column
I used the following method:
w["A"] = w[['A','B']].max(axis=1)
A B
0 NaN Benedict
1 NaN NaN
I don't want this output of "NaN" in the "A" column. How should I get rid of this?
You can take the maximum per row after removing the missing values:
w['A'] = w[['A','B']].apply(lambda x: x.dropna().max(), axis=1)
print (w)
A B
0 Benedict Benedict
1 John NaN
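For numeric columns, NumPy's nanmax() function does the job when applied per row (axis=1). It cannot compare the strings in this question against NaN, so the dropna() approach applies there.
w['A'] = w[['A', 'B']].apply(np.nanmax, axis=1)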
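A minimal sketch contrasting the two situations: string columns, where the missing values are dropped per row before taking the max, and numeric columns, where NaN-aware reductions work directly. The nums frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# String columns from the question: drop NaN per row, then take the max.
w = pd.DataFrame({"A": ["Alex", "John"], "B": ["Benedict", np.nan]})
w["A"] = w[["A", "B"]].apply(lambda row: row.dropna().max(), axis=1)

# Numeric columns (illustrative data): max(axis=1) skips NaN by default,
# and np.nanmax over the underlying array gives the same values.
nums = pd.DataFrame({"A": [1.0, 5.0], "B": [3.0, np.nan]})
row_max = nums.max(axis=1)
row_nanmax = np.nanmax(nums.to_numpy(), axis=1)

print(w)
print(row_max)
```

For strings, "max" means the lexicographically largest value, which is why "Benedict" wins over "Alex" in the first row.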
I am combining different pandas DataFrames, and while sorting the index of the final DataFrame I found something that does not make any sense to me. It gives no error, but no assignment actually happens. A simplified example follows.
Case 1:
import pandas as pd
ind_1 = ['a','a','b','c','c']
df_1 = pd.DataFrame(index=ind_1,columns=['col1','col2'])
df_1.col1.loc['a'].iloc[0] = 1
df_1.col1.loc['b'] = 2
df_1.col1.loc['c'].iloc[0] = 3
print('Original df_1')
print(df_1)
# Original df_1
# col1 col2
# a 1 NaN
# a NaN NaN
# b 2 NaN
# c 3 NaN
# c NaN NaN
You can see that this assignment works fine. But let's create the DataFrame with the index sorted differently.
ind_1_sorted = sorted(ind_1,reverse=True)
df_1_sorted = pd.DataFrame(index=ind_1_sorted,columns=['col1','col2'])
df_1_sorted.col1.loc['a'].iloc[0] = 1
df_1_sorted.col1.loc['b'] = 2
df_1_sorted.col1.loc['c'].iloc[0] = 3
print('Sorted df_1')
print(df_1_sorted)
# Sorted df_1
# col1 col2
# c NaN NaN
# c NaN NaN
# b 2 NaN
# a NaN NaN
# a NaN NaN
You can see now that the assignment only works for the non-repeated index. I thought that the problem had to be related to the sorting, but let's look at the next case.
Case 2:
ind_2 = ['c','c','b','a','a']
df_2 = pd.DataFrame(index=ind_2,columns=['col1','col2'])
df_2.col1.loc['a'].iloc[0] = 1
df_2.col1.loc['b'] = 2
df_2.col1.loc['c'].iloc[0] = 3
print('Original df_2')
print(df_2)
# Original df_2
# col1 col2
# c NaN NaN
# c NaN NaN
# b 2 NaN
# a NaN NaN
# a NaN NaN
Now we get no assignment even without sorting. Let's see what happens if I sort the index:
ind_2_sorted = sorted(ind_2,reverse=False)
df_2_sorted = pd.DataFrame(index=ind_2_sorted,columns=['col1','col2'])
df_2_sorted.col1.loc['a'].iloc[0] = 1
df_2_sorted.col1.loc['b'] = 2
df_2_sorted.col1.loc['c'].iloc[0] = 3
print('Sorted df_2')
print(df_2_sorted)
# Sorted df_2
# col1 col2
# a 1 NaN
# a NaN NaN
# b 2 NaN
# c 3 NaN
# c NaN NaN
And now the assignment works after the sorting! The only difference I see is that the assignment works when the index is sorted in a "standard way" (alphabetically in this case). Does this make any sense?
If the solution is to first use an index sorted alphabetically and then sort it in the order I want, how could I do this sorting with repeated indexes as in these examples?
Thanks!
As user Quickbeam2k1 mentioned, the issue is due to chained assignment.
Index objects have a method called get_loc that converts labels to positions; however, its return type is polymorphic, which is why I prefer not to use it.
Using np.nonzero and filtering on the DataFrame's index and columns, we can convert the labels to positional references and modify the DataFrame using iloc instead of loc.
That is, your first code sample can be rewritten as:
# original
df_1.col1.loc['a'].iloc[0] = 1
df_1.col1.loc['b'] = 2
df_1.col1.loc['c'].iloc[0] = 3
# works for all indices
col1_mask = df_1.columns == 'col1'
a_mask, = np.nonzero(df_1.index == 'a')
b_mask, = np.nonzero(df_1.index == 'b')
c_mask, = np.nonzero(df_1.index == 'c')
df_1.iloc[a_mask[0], col1_mask] = 1
df_1.iloc[b_mask, col1_mask] = 2
df_1.iloc[c_mask[0], col1_mask] = 3
Similarly for the other examples
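The same idea can be wrapped in a small helper so that it reads the same for every label. This is a sketch only: the function name set_first_match is hypothetical, and it is applied here to the unsorted index from Case 2, where the chained version left the values as NaN.

```python
import numpy as np
import pandas as pd

def set_first_match(df, label, column, value):
    """Hypothetical helper: write value into the first row whose index equals label."""
    row_positions = np.flatnonzero(df.index == label)        # all positions of the label
    col_position = np.flatnonzero(df.columns == column)[0]   # position of the column
    df.iloc[row_positions[0], col_position] = value          # one indexing call, no chaining

# Unsorted, repeated index as in Case 2 of the question.
df_2 = pd.DataFrame(index=['c', 'c', 'b', 'a', 'a'], columns=['col1', 'col2'])

set_first_match(df_2, 'a', 'col1', 1)
df_2.loc['b', 'col1'] = 2          # a single .loc call is not chained, so it is safe
set_first_match(df_2, 'c', 'col1', 3)

print(df_2)
```

Every write goes through exactly one indexing operation on the DataFrame itself, so the result does not depend on whether an intermediate selection happened to be a view or a copy.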
In my code the df.fillna() method is not working, whereas the df.dropna() method is. I don't want to drop the column, though. What can I do to make the fillna() method work?
def preprocess_df(df):
    for col in df.columns: # go through all of the columns
        if col != "target": # normalize all ... except for the target itself!
            df[col] = df[col].pct_change() # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            # df.dropna(inplace=True) # remove the nas created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values) # standardize to zero mean and unit variance
It should work, unless the call is not inside the loop as mentioned.
You should consider filling the values before you enter the loop, or while constructing the DataFrame.
The example below clearly shows it working:
>>> df
col1
0 one
1 NaN
2 two
3 NaN
Works as expected:
>>> df['col1'].fillna( method ='ffill') # This is showing column specific to `col1`
0 one
1 one
2 two
3 two
Name: col1, dtype: object
Secondly, if you wish to change only a few selected columns, use the method below.
Suppose you have 3 columns and want to forward-fill only 2 of them.
>>> df
col1 col2 col3
0 one test new
1 NaN NaN NaN
2 two rest NaN
3 NaN NaN NaN
Define the columns to be changed:
cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method ='ffill')
>>> df
col1 col2 col3
0 one test new
1 one test NaN
2 two rest NaN
3 two rest NaN
If you want it to apply across the entire DataFrame, use it as follows:
>>> df
col1 col2
0 one test
1 NaN NaN
2 two rest
3 NaN NaN
>>> df.fillna(method ='ffill') # inplace=True if you considering as you wish for permanent change.
col1 col2
0 one test
1 one test
2 two rest
3 two rest
The first value was a NaN, so I had to use the bfill method instead. Thanks everyone.
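The behaviour described in that last comment can be reproduced with a short sketch: pct_change() always leaves NaN in the first row, and a forward fill has nothing before it to copy from. The price values are made up, and the .ffill() and .bfill() methods are used here in place of fillna(method=...), which newer pandas releases deprecate.

```python
import pandas as pd

# Illustrative prices; pct_change() has no previous value for the first row.
prices = pd.Series([100.0, 110.0, 121.0])
changes = prices.pct_change()        # first entry is NaN

forward = changes.ffill()            # leading NaN stays: nothing to fill from
backward = changes.bfill()           # leading NaN takes the next valid value

print(forward)
print(backward)
```

If a column can have gaps both at the start and in the middle, chaining the two (changes.ffill().bfill()) covers both cases.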