I'm looking for a way (ideally a built-in pandas function) to scan a column of a DataFrame, comparing its own values at different indices.
Here's an example using a for loop. I have a DataFrame with a single column col_1, and I want to create a column col_2 with True/False values like this:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the values at the next N indices.
I'd like to do the same operation without a for loop. I've already tried shift and df.loc, but the computation time was similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last. That's done by changing the keep argument of duplicated() to 'last' instead of the default 'first'.
As suggested by @QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False
How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
a b
0 1 5
1 2 5
2 3 7
3 4 7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
a b
0 1 0
1 2 0
2 3 7
3 4 7
This works as expected on this small sample, but gives me the following warning on my real life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
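For completeness, running this on the question's data shows it modifies df in place without the warning:

import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})

col = df.columns[1]                         # resolve position 1 to its label 'b'
df.loc[df.loc[:, col] == vals[0], col] = 0
print(df['b'].tolist())                     # [0, 0, 7, 7]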
One way is to look up the column's label via df.columns and use loc (label-based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer-position indexing). np.where returns a tuple of arrays holding the positions where the condition is True, hence the [0] to extract the row positions:
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
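A quick check of the np.where route on the same data; both indexers are purely positional, so no label lookup is needed:

import numpy as np
import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})

rows = np.where(df.iloc[:, 1] == vals[0])[0]  # integer positions of the matching rows
df.iloc[rows, 1] = 0
print(df['b'].tolist())                       # [0, 0, 7, 7]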
Note that a chained combination of loc and iloc, such as
df.loc[df.iloc[:, 1] == vals[0]].iloc[:, 1] = 0
does not work reliably: the second indexer operates on a copy, so it raises the same SettingWithCopyWarning and may leave df unchanged. Prefer one of the single-indexer forms above.
For example, I have a DataFrame A like the one below:
a b c
x 0 2 1
y 1 3 2
z 0 2 4
I want to get the number of 0s in column 'a', which should return 2 (the values at rows x and z).
Is there a simple way, or a built-in function, to do this?
I've Googled it, but I only found articles like this one:
count the frequency that a value occurs in a dataframe column
That builds a whole new DataFrame of counts, which is more complicated than what I need.
Use sum with a boolean mask. True values are treated as 1, so the output is the count of 0 values:
out = A.a.eq(0).sum()
print(out)
2
Try value_counts from pandas (see the pandas documentation):
df.a.value_counts()[0]
Note the lookup key is the integer 0, matching the column's dtype. If the searched value may vary, use df[column_name].value_counts()[searched_value].
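One caveat worth noting: value_counts raises a KeyError when the searched value never occurs, while the boolean-mask approach simply returns 0. A small sketch on the question's DataFrame, using Series.get to sidestep that:

import pandas as pd

A = pd.DataFrame({'a': [0, 1, 0], 'b': [2, 3, 2], 'c': [1, 2, 4]},
                 index=['x', 'y', 'z'])

print(A['a'].eq(0).sum())               # 2
print(A['a'].value_counts().get(7, 0))  # 0 -- .get avoids the KeyError for absent values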
I'd like to return the rows that satisfy a certain condition. I can do this for a single row, but I need it for multiple rows combined. For example, 'light green' qualifies because X, Y and Z are all positive and 'total' > 10, where 'Red' does not. When I combine a neighbouring row or rows, the combination does qualify => 'dark green'. Can I do this going over all the rows without returning duplicate rows?
import numpy as np
import pandas as pd

N = 1000
np.random.seed(0)
df = pd.DataFrame({
    'X': np.random.uniform(-3, 10, N),
    'Y': np.random.uniform(-3, 10, N),
    'Z': np.random.uniform(-3, 10, N),
})
df['total'] = df.X + df.Y + df.Z
df.head(10)
EDIT: the desired output is rows where X, Y and Z are each > 0 and 'total' > 10.
Here's a try. You might want to use rolling or expanding (for speed and elegance) instead of explicitly looping with range (see the rolling sketch after the output below), but I did it this way to be able to print out the rows used to calculate each boolean.
df = df[['X', 'Y', 'Z']]  # remove the "total" column to keep the syntax a little cleaner
df = df.head(4)           # keep the example more manageable

for i in range(len(df)):
    for k in range(i + 1, len(df) + 1):
        df_sum = df[i:k].sum()
        print("rows", i, "to", k, (df_sum > 0).all() & (df_sum.sum() > 10))
rows 0 to 1 True
rows 0 to 2 True
rows 0 to 3 True
rows 0 to 4 True
rows 1 to 2 False
rows 1 to 3 True
rows 1 to 4 True
rows 2 to 3 True
rows 2 to 4 True
rows 3 to 4 True
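For the fixed-window case, a rough vectorized sketch using rolling; the window length w is an assumption here (the loop above also considers longer spans):

import numpy as np
import pandas as pd

np.random.seed(0)
N = 1000
df = pd.DataFrame({c: np.random.uniform(-3, 10, N) for c in ['X', 'Y', 'Z']})

w = 2  # number of consecutive rows to combine; a fixed window is an assumption
window_sums = df.rolling(w).sum()  # per-column sums over each window of w rows
qualifies = window_sums.gt(0).all(axis=1) & window_sums.sum(axis=1).gt(10)
print(qualifies.head())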
I am not too sure I understood your question correctly, but if you are looking to apply multiple conditions to a DataFrame, you can consider this approach:
new_df = df[(df["X"] > 0) & (df["Y"] < 0)]
The & operator is AND, while | is OR. Remember to wrap each condition in parentheses.
Lastly, if you want to remove duplicate rows, you can use drop_duplicates:
new_df.drop_duplicates()
You can find more information about this function here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Hope my answer is useful to you.
A very simple question, everyone, but it's nearly impossible to find answers to basic questions in the official documentation.
I have a dataframe object in Pandas that has rows and columns.
One of the columns, named "CBSM", contains 'Y'/'N' flag values. I need to delete all rows from the dataframe where the value of the CBSM column is "Y".
I see that there is a method called dataframe.drop()
Label, Axis, and Level are 3 parameters that the drop() method takes in. I have no clue what values to provide these parameters to accomplish my need of deleting the rows in the fashion I described above. I have a feeling the drop() method is not the right way to do what I want.
Please advise, thanks.
This method is called boolean indexing.
You can try loc with str.contains:
df.loc[~df['CBSM'].str.contains('Y')]
Sample:
print(df)
A CBSM L
0 1 Y 4
1 1 N 6
2 2 N 3
print(df['CBSM'].str.contains('Y'))
0 True
1 False
2 False
Name: CBSM, dtype: bool
# inverted boolean Series
print(~df['CBSM'].str.contains('Y'))
0 False
1 True
2 True
Name: CBSM, dtype: bool
print(df.loc[~df['CBSM'].str.contains('Y')])
A CBSM L
1 1 N 6
2 2 N 3
Or:
print(df.loc[~(df['CBSM'] == 'Y')])
A CBSM L
1 1 N 6
2 2 N 3
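Since the question specifically asks about drop(): for completeness, you can also pass the matching row labels to drop(), though the boolean indexing above is usually simpler. A minimal sketch on the same sample:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'CBSM': ['Y', 'N', 'N'], 'L': [4, 6, 3]})

# drop() takes row labels, so collect the index labels of the 'Y' rows first
df = df.drop(df[df['CBSM'] == 'Y'].index)
print(df)
#    A CBSM  L
# 1  1    N  6
# 2  2    N  3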
I have a pandas DataFrame. Some entries are equal to -1. How do I find the number of times -1 appears in each column of the DataFrame? Based on that count, I plan to drop the column.
Since you say you want the result for each column separately, you can use a condition like df[column] == -1, and then take .sum() on the result to get the count of -1 values in that column. Example -
(df[column] == -1).sum()
Demo -
In [22]: df
Out[22]:
A B C
0 -1 2 -1
1 3 4 5
2 3 1 4
3 -1 2 1
In [23]: for col in df.columns:
....: print(col, (df[col] == -1).sum())
....:
A 2
B 0
C 1
This works because when taking sum(), True is equivalent to 1 and False to 0. The condition df[column] == -1 returns a Series of True/False values: True where the condition is met and False where it is not.
I think you could have tried a few things before asking here, but I might as well post the answer anyway:
(df == -1).sum()
Ironically, you can't use the count() method of a DataFrame here, because it counts all values except None or NaN and there's no way to change that criterion. It's easier to just use sum than to figure out a way to convert the -1s to NaN.
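Since the original goal was to drop columns based on that count, here is a minimal sketch of that last step, reusing the demo data above and assuming (my assumption; adjust the threshold as needed) that any column containing a -1 gets dropped:

import pandas as pd

df = pd.DataFrame({'A': [-1, 3, 3, -1],
                   'B': [2, 4, 1, 2],
                   'C': [-1, 5, 4, 1]})

counts = (df == -1).sum()                       # per-column count of -1 entries
df = df.drop(columns=counts[counts > 0].index)  # drop every column that contains a -1
print(df.columns.tolist())                      # ['B']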