I was reading Andy's answer to the question Outputting difference in two Pandas dataframes side by side - highlighting the difference.
I have two questions regarding the code. Unfortunately, I don't yet have 50 rep to comment on the answer, so I hope I can get some help here.
What does In [24]: changed = ne_stacked[ne_stacked] do?
I'm not sure what df1 = df[df] does, and I can't seem to get an answer from the pandas docs. Could someone explain this to me, please?
Is np.where(df1 != df2) the same as pd.df.where(df1 != df2)? If not, what is the difference?
Question 1
ne_stacked is a pd.Series that consists of True and False values that indicate where df1 and df2 are not equal.
ne_stacked[boolean_array] is a way to filter the series ne_stacked by eliminating the rows of ne_stacked where boolean_array is False and keeping the rows of ne_stacked where boolean_array is True.
It so happens that ne_stacked is also a boolean array and so can be used to filter itself. Why would we want to do this? So we can see what the values of the index are after we've filtered.
So ne_stacked[ne_stacked] is a subset of ne_stacked with only True values.
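A minimal, self-contained sketch of that self-filtering on toy frames (not the original data), assuming ne_stacked is built by stacking the element-wise comparison as in the linked answer:

import pandas as pd

# toy frames standing in for the original df1/df2
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 9], 'b': [3, 4]})

# a stacked boolean Series marking where the frames differ
ne_stacked = (df1 != df2).stack()
print(ne_stacked)
# 0  a    False
#    b    False
# 1  a     True
#    b    False
# dtype: bool

# filtering the Series by itself keeps only the True entries and their index labels
changed = ne_stacked[ne_stacked]
print(changed)
# 1  a    True
# dtype: bool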
Question 2
np.where
np.where does two things. If you only pass a conditional, as in np.where(df1 != df2), you get a tuple of two arrays: the first holds the row positions and the second holds the column positions of every place the condition is True. I usually use it like this
i, j = np.where(df1 != df2)
Now I can get at all elements of df1 or df2 in which there are differences like
df.values[i, j]
Or I can assign to those cells
df.values[i, j] = -99
Or lots of other useful things.
You can also use np.where as an if, then, else for arrays
np.where(df1 != df2, -99, 99)
This produces an array the same shape as df1 or df2, with -99 everywhere df1 != df2 and 99 everywhere else.
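A small self-contained sketch of both uses, with assumed toy frames rather than the original data:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 9], 'b': [3, 4]})

# first use: row positions i and column positions j where the frames differ
i, j = np.where(df1 != df2)
print(i, j)               # [1] [0]
print(df1.values[i, j])   # [2]
print(df2.values[i, j])   # [9]

# second use: an element-wise if/then/else over the whole array
print(np.where(df1 != df2, -99, 99))
# [[ 99  99]
#  [-99  99]]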
df.where
On the other hand, df.where evaluates its first argument of boolean values and returns an object of equal size to df in which the cells that evaluated to True keep their values, while the rest become either np.nan or the value passed as the second argument of df.where
df1.where(df1 != df2)
Or
df1.where(df1 != df2, -99)
Are they the same?
Clearly they are not the "same". But you can use them similarly
np.where(df1 != df2, df1, -99)
Should be the same as
df1.where(df1 != df2, -99).values
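A quick check of that equivalence with the same toy frames as above (np.where returns a plain NumPy array, while .values extracts one from the DataFrame):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 9], 'b': [3, 4]})

a = np.where(df1 != df2, df1, -99)        # plain NumPy array
b = df1.where(df1 != df2, -99).values     # same values via the DataFrame method
print(a)
# [[-99 -99]
#  [  2 -99]]
print(np.array_equal(a, b))               # True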
Related
I have two dataframes with same shape:
>>> df1.shape
(400, 1200)
>>> df2.shape
(400, 1200)
I would like to compare cell-by-cell and if a value is missing in one of the dataframes make the equivalent value in the other dataframe NaN as well.
Here's a (pretty inefficient) piece of code that works:
for i in df1.columns:  # iterate over columns
    for j in range(len(df1)):  # iterate over rows
        if pd.isna(df1[i][j]) or pd.isna(df2[i][j]):
            df1.loc[j, i] = np.nan
            df2.loc[j, i] = np.nan
How would be a better way to do this? I'm very sure there is.
This is a simple problem to solve with pandas. You can use this code:
df1[df2.isna()] = df2[df1.isna()] = np.nan
It first creates a mask of df2, i.e. a copy of the dataframe containing only True and False values. Each NaN in df2 becomes a True in the mask, and every other value becomes a False.
With pandas, you can use such masks to do bulk operations. You can pass that mask to the [] of df1 and then assign a value to it; wherever the mask is True, the corresponding cell in df1 is assigned that value.
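For illustration, a minimal sketch of the same idea on made-up small frames (the real ones are 400 x 1200); computing the combined mask once makes the intent explicit, and the one-liner above reaches the same end state:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})
df2 = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [np.nan, 5.0, 6.0]})

# a cell should end up NaN in both frames if it is NaN in either one
mask = df1.isna() | df2.isna()
df1[mask] = np.nan
df2[mask] = np.nan

print(df1)
#      x    y
# 0  1.0  NaN
# 1  NaN  5.0
# 2  3.0  6.0
print(df2)
#      x    y
# 0  1.0  NaN
# 1  NaN  5.0
# 2  3.0  6.0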
I have the following sample dataframe with columns A and B:
df:
A B
123 555
456 123
789 666
I want to know which method can be used to print out 123 (that is, a method to print out the values of A which also exist in column B). I tried the following:
for i, row in df.iterrows():
    if row.A in row.B:
        print(row.A, row.B)
but I got this error: argument of type 'float' is not iterable.
If you are trying to print any row where row.A exists in column B, then your code should be:
for i, row in df.iterrows():
    # note: "x in series" checks the index, so compare against the values
    if row.A in df.B.values:
        print(row.A, row.B)
col_B = df['B'].unique()
val_B_in_A = [ i for i in df['A'].unique() if i in col_B ]
print(val_B_in_A)
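For reference, a self-contained run of that approach on the sample frame from the question:

import pandas as pd

df = pd.DataFrame({'A': [123, 456, 789], 'B': [555, 123, 666]})

col_B = df['B'].unique()
val_B_in_A = [i for i in df['A'].unique() if i in col_B]
print(val_B_in_A)  # [123]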
Be careful with "dot" notation in dataframes, since columns can contain spaces and it starts to be a pain dealing with those. With that said,
Depending on how many rows you are iterating over, and the proportion of rows that contain unique values, it may be computationally less expensive to iterate over the unique values in 'A', and check if each one is in 'B':
import pandas as pd
tmp = []
for value in df['A'].unique():
    tmp.append(df.loc[df['B'] == value])
df_results = pd.concat(tmp)
print(df_results)
You could also use the built-in method .isin(), in fact, much of the power of pandas is in its array-wise operators, which are significantly quicker than most approaches involving loops:
df.loc[df['B'].isin(df['A'].unique())]
And to only show one column with the ".loc" accessor, just add
df.loc[df['B'].isin(df['A'].unique()), 'A']
And to just return the values in an optimized array
df.loc[df['B'].isin(df['A'].unique()), 'A'].values
If you are concerned with an exact match, try
df['match'] = pd.Series([(df['col1']==item).sum() for item in df['col1']])
I know this has been asked before, but I cannot find an answer that works for me. I have a dataframe df that contains a column age, but the values are not all integers; some are strings like 35-59. I want to drop those entries. I have tried these two solutions as suggested by kite, but they both give me AttributeError: 'Series' object has no attribute 'isnumeric':
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally, is there a simple way to edit the value of an entry if it matches a certain condition? For example, instead of deleting rows that have age as a range of values, I could replace the age with a random value within that range.
Try with:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
If you check the documentation, isnumeric is a method of Series.str, not of Series. That's why you get that error.
Also, you need the == False because the column has mixed types: .str.isnumeric() returns NaN for the non-string values, so the result is not a pure boolean Series, and the explicit comparison picks out only the rows that are actually False.
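A small sketch with assumed toy data showing why the explicit == False matters when the column mixes strings and non-strings:

import pandas as pd

# assumed toy data: numeric strings, a range string, and a plain int
df = pd.DataFrame({'age': ['25', '35-59', 40]})

mask = df.age.str.isnumeric()
print(mask)
# 0     True
# 1    False
# 2      NaN   <- non-string values come back as NaN, not False
# Name: age, dtype: object

# comparing against False keeps the NaN rows, so only the '35-59' row is dropped
df.drop(df[mask == False].index, inplace=True)
print(df)
#   age
# 0  25
# 2  40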
I'm posting this in case it also helps with your last question. You can use pandas.DataFrame.at together with pandas.DataFrame.itertuples to iterate over the rows of the dataframe and replace values:
for row in df.itertuples():
    # iterate over every row and change the value of that column
    if row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
    # row.age is a scalar here, so use the plain string method rather than .str
    if not str(row.age).isnumeric() or row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
I am starting to learn Pandas. I have seen a lot of questions here in SO where people ask how to delete a row if a column matches certain value.
In my case it is the opposite. Imagine a dataframe where, if any column contains the value salty in any of its rows, that whole column should be deleted.
I have tried several variations along these lines:
if df.loc[df['A'] == 'salty']:
    df.drop(df.columns[0], axis=1, inplace=True)
But I am quite lost finding documentation on how to delete columns based on a row value in that column. That code is a mix of finding a specific column and always deleting the first column (my idea was to search for the value in a column's rows, looping over ALL columns).
Perform a comparison across your values, then use DataFrame.any to get a mask to index:
df.loc[:, ~(df == 'Salty').any()]
If you insist on using drop, this is how you need to do it: pass the matching column labels:
df.drop(columns=df.columns[(df == 'Salty').any()])
df = pd.DataFrame({
'A': ['Mountain', 'Salty'], 'B': ['Lake', 'Hotty'], 'C': ['River', 'Coldy']})
df
A B C
0 Mountain Lake River
1 Salty Hotty Coldy
(df == 'Salty').any()
A True
B False
C False
dtype: bool
df.loc[:, ~(df == 'Salty').any()]
B C
0 Lake River
1 Hotty Coldy
df.columns[(df == 'Salty').any()]
# Index(['A'], dtype='object')
df.drop(columns=df.columns[(df == 'Salty').any()])
B C
0 Lake River
1 Hotty Coldy
The following locates the indices where your desired column matches a specific value and then drops those rows. I think this is probably the more straightforward way of accomplishing this:
df.drop(df.loc[df['Your column name here'] == 'Match value'].index, inplace=True)
Here's one possibility:
df = df.drop([col for col in df.columns if df[col].eq('Salty').any()], axis=1)
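Applied to the same two-row sample frame used above, this also leaves only columns B and C:

import pandas as pd

df = pd.DataFrame({
    'A': ['Mountain', 'Salty'], 'B': ['Lake', 'Hotty'], 'C': ['River', 'Coldy']})

print(df.drop([col for col in df.columns if df[col].eq('Salty').any()], axis=1))
#        B      C
# 0   Lake  River
# 1  Hotty  Coldy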
I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:
df = df[df['A'] >= 0]
or
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
What is the recommended solution? Why?
The recommended solution is the most efficient, which in this case is the first one.
df = df[df['A'] >= 0]
On the second solution
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the slicing process. But let's break it into pieces to understand why.
When you write
df['A'] >= 0
you are creating a mask, a Boolean Series with an entry for each index of df, whose value is either True or False according to a condition (in this case, whether the value of column 'A' at a given index is greater than or equal to 0).
When you write
df[df['A'] >= 0]
you are accessing the rows for which your mask (df['A'] >= 0) is True. This is a slicing method supported by pandas that lets you select certain rows by passing a Boolean Series, returning only the entries of the original DataFrame for which the Series was True.
Finally, when you write this
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the process because
df[df['A'] < 0]
is already slicing your DataFrame (in this case for the rows you want to drop). You are then taking those indices, going back to the original DataFrame and explicitly dropping them. There is no need for this; you already sliced the DataFrame in the first step.
df = df[df['A'] >= 0]
is indeed the faster solution. Just be aware that the result may behave like a view of the original data frame rather than an independent copy. This can lead you into trouble, for example when you want to change its values, as pandas will give you the SettingWithCopyWarning.
The simple fix of course is what Wen-Ben recommended:
df = df[df['A'] >= 0].copy()
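A small sketch (toy data, assumed) of that pattern: filter, take an explicit copy, then modify freely:

import pandas as pd

df = pd.DataFrame({'A': [-1, 2, -3, 4], 'B': [10, 20, 30, 40]})

# filter and take an explicit copy so later assignments modify an
# independent frame instead of (possibly) a view of the original
filtered = df[df['A'] >= 0].copy()
filtered['B'] = filtered['B'] * 2   # no SettingWithCopyWarning here

print(filtered)
#    A   B
# 1  2  40
# 3  4  80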
Your question is like this: "I have two identical cakes, but one has icing. Which has more calories?"
The second solution is doing the same thing but twice. A filtering step is enough, there's no need to filter and then redundantly proceed to call a function that does the exact same thing the filtering op from the previous step did.
To clarify: regardless of the operation, you are still doing the same thing: generating a boolean mask, and then subsequently indexing.
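A quick sanity check on toy data that the two routes end in the same place:

import pandas as pd

df = pd.DataFrame({'A': [-5, 3, -1, 7]})

kept = df[df['A'] >= 0]                    # filter once
dropped = df.drop(df[df['A'] < 0].index)   # filter, then drop the complement

print(kept.equals(dropped))  # True: same result, the second route just does the work twice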