I want to delete rows in a pandas dataframe where the second column = 0.
So this ...
Code Int
0 A 0
1 A 1
2 B 1
Would turn into this ...
Code Int
0 A 1
1 B 1
Any help greatly appreciated!
Find the row you want to delete, and use drop.
delete_row = df[df["Int"]==0].index
df = df.drop(delete_row)
print(df)
Code Int
1 A 1
2 B 1
Furthermore, you can use iloc to find the rows if you only know the position of the column:
delete_row = df[df.iloc[:,1]==0].index
df = df.drop(delete_row)
You could use loc and drop in one line of code.
df = df.drop(df["Int"].loc[df["Int"]==0].index)
You could use this as well!
df = df[df.Int != 0]
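For completeness, here is a minimal runnable sketch (the DataFrame is rebuilt from the question's example) showing that boolean filtering plus reset_index reproduces the exact output asked for:
import pandas as pd

df = pd.DataFrame({"Code": ["A", "A", "B"], "Int": [0, 1, 1]})

# Keep rows where Int is not 0, then renumber the index
df = df[df["Int"] != 0].reset_index(drop=True)
print(df)
#   Code  Int
# 0    A    1
# 1    B    1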
I need to add the number of unique values in column C (right table) to the corresponding row in the left table, based on the values in the common column A (as shown in the picture).
Thank you in advance!
Group by column A in the second dataset and count the unique values in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
A B C-count
0 2 22 3
1 3 23 2
2 5 21 1
3 1 24 1
4 6 21 1
Use DataFrameGroupBy.nunique with Series.map for a new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
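A minimal self-contained sketch of this one-liner (the df1/df2 data below is invented for illustration):
import pandas as pd

df1 = pd.DataFrame({'A': [2, 3, 5], 'B': [22, 23, 21]})
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5], 'C': [7, 8, 9, 7, 8, 7]})

# Map each A in df1 to the number of unique C values it has in df2
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
print(df1)
#    A   B  C-count
# 0  2  22        3
# 1  3  23        2
# 2  5  21        1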
This may not be the most efficient way of doing it, so be careful if your DataFrames are large.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm assuming that right_table is a pandas.DataFrame.
Now, do the following to build the new column (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the one desired (as far as I understand what you need).
I want to replace the values of specific columns. I can change the values one by one, but I have hundreds of columns and need to change those starting with a specific string. For example, I want to replace the values when the column name starts with "Q14":
df.filter(regex = 'Q14').replace(1, 'Selected').replace(0, 'Not selected')
The above code works, but how can I apply it to my dataframe? Since this is a function call, I can't use inplace.
Consider the df below:
In [439]: df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
In [440]: df
Out[440]:
Q14_A Q14_B Q12_A
0 1 0 1
1 0 1 0
2 0 1 0
3 2 2 0
Filter the columns that start with Q14 and save them in a variable:
In [443]: cols = df.filter(regex='^Q14').columns
Now, change the above selected columns with your replace commands:
In [446]: df[cols] = df[cols].replace(1, 'Selected').replace(0, 'Not selected')
Output:
In [447]: df
Out[447]:
Q14_A Q14_B Q12_A
0 Selected Not selected 1
1 Not selected Selected 0
2 Not selected Selected 0
3 2 2 0
You can also iterate over all columns and apply a transformation to those whose name matches:
for column in df.columns:
    if column.startswith("Q14"):  # match only the Q14 columns
        # map 1/0 to labels and leave any other value unchanged
        df[column] = df[column].apply(lambda x: "Selected" if x == 1 else ("Not selected" if x == 0 else x))
Using pandas.Series.replace with a dict:
df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
cols = df.filter(regex='^Q14').columns
replace_map = {
    1: "Selected",
    0: "Not selected",
}
df[cols] = df[cols].replace(replace_map)
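Running this on the df above should leave the Q12 column and the value 2 untouched:
print(df)
#           Q14_A         Q14_B  Q12_A
# 0      Selected  Not selected      1
# 1  Not selected      Selected      0
# 2  Not selected      Selected      0
# 3             2             2      0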
I have a dataframe that I want to sort on one of my columns (a date).
However, I have a loop running over the index (while i < df.shape[0]), and I need the loop to traverse the dataframe once it is sorted by date.
Is the index updated to match the sorting, or should I use df.reset_index()?
Maybe I'm not understanding the question, but a simple check shows that sort_values keeps the original index labels attached to their rows rather than resetting them:
df = pd.DataFrame({'x':['a','c','b'], 'y':[1,3,2]})
df = df.sort_values(by = 'x')
Yields:
x y
0 a 1
2 b 2
1 c 3
And a subsequent:
df = df.reset_index(drop = True)
Yields:
x y
0 a 1
1 b 2
2 c 3
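On pandas 1.0 or later, sort_values can do the reset in one step via the ignore_index parameter:
df = df.sort_values(by='x', ignore_index=True)  # same as sorting and then reset_index(drop=True)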
If I have a dataframe and want to drop any rows where the value in one column is not an integer, how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of these, I was hoping someone else might be.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
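As a sketch of one more alternative (not from the answers above), pd.to_numeric with errors='coerce' avoids regex entirely: non-numeric values become NaN, and a modulo check keeps only whole numbers:
import pandas as pd
import numpy as np

df = pd.DataFrame({'entrytype': [0, 1, np.nan, 'asdas', 2]})

numeric = pd.to_numeric(df['entrytype'], errors='coerce')
df = df[numeric.notna() & (numeric % 1 == 0)]  # keep rows whose value parses as a whole number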
We have multiple ways to do this, but I found this method easy and efficient.
Quick Examples
# Using drop() to delete rows based on column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Keep only the rows where Fee >= 24000 (removes the rest)
df2 = df[df.Fee >= 24000]
# If you have a space in the column name,
# specify the column name within single quotes
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Select rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
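A quick runnable check of the first pattern (the Fee/Discount values here are invented for illustration):
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 25000, 22000], 'Discount': [1000, 2300, 2300]})

df.drop(df[df['Fee'] >= 24000].index, inplace=True)
print(df)
#      Fee  Discount
# 0  20000      1000
# 2  22000      2300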
Is there a simple way to reference the previous row when iterating through a dataframe?
In the following dataframe I would like column B to change to 1 when A > 1 and remain at 1 until A < -1, when it changes to -1.
In [11]: df
Out[11]:
A B
2000-01-01 -0.182994 0
2000-01-02 1.290203 0
2000-01-03 0.245229 0
2000-01-08 -1.230742 0
2000-01-09 0.534939 0
2000-01-10 1.324027 0
This is what I've tried to do, but clearly you can't just subtract 1 from the index:
for idx, row in df.iterrows():
    if df["A"][idx] < -1:
        df["B"][idx] = -1
    elif df["A"][idx] > 1:
        df["B"][idx] = 1
    else:
        df["B"][idx] = df["B"][idx-1]
I also tried using get_loc but got completely lost, I'm sure I'm missing a very simple solution!
Is this what you are trying to do?
In [38]: df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
In [39]: df['B'] = np.nan
In [40]: df.loc[df.A<-1,'B'] = -1
In [41]: df.loc[df.A>1,'B'] = 1
In [42]: df.ffill()
Out[42]:
A B
0 -1.186808 -1
1 -0.095587 -1
2 -1.921372 -1
3 -0.772836 -1
4 0.016883 -1
5 0.350778 -1
6 0.165055 -1
7 1.101561 1
8 -0.346786 1
9 -0.186263 1
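Note that ffill returns a copy, so in practice you would assign it back, and presumably give rows before the first signal a default:
df['B'] = df['B'].ffill().fillna(0)  # 0 as the default before A first crosses a threshold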
Similar question here: Reference values in the previous row with map or apply.
My impression is that pandas should handle iterations and we shouldn't have to do it on our own... Therefore, I chose to use the DataFrame 'apply' method.
Here is the same answer I posted on the other question linked above.
You can use the dataframe 'apply' function and leverage the unused 'kwargs' parameter to store the previous row.
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
This example uses a decorator to store the previous row in a dictionary and then pass it to the function when Pandas calls it on the next row.
Disclaimer 1: The 'prev_row' variable starts off empty for the first row, so when using it in the apply function I had to supply a default value to avoid a 'KeyError'.
Disclaimer 2: I am fairly certain this will be slower than a plain apply operation, but I did not run any tests to figure out how much.
Try this: if the first value is neither >= 1 nor < -1, it is set to 0 (or whatever you like).
df["B"] = None
df["B"] = np.where(df['A'] >= 1, 1,df['B'])
df["B"] = np.where(df['A'] < -1, -1,df['B'])
df = df.ffill().fillna(0)
This solves the problem as stated, but the usual way to reference the previous row in pandas is .shift() (or positional access via .iloc).
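For reference, a minimal sketch of .shift(), which aligns each row with the previous row's value:
df['prev_B'] = df['B'].shift(1)  # row i sees row i-1's B; the first row gets NaN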