Creating a pandas column conditional on another column's values - python

I'm trying to create a class column in a pandas dataframe conditional on another column's values. The value at position i will be 1 if the other column's value at position i+1 is greater than its value at position i, and 0 otherwise.
For example:
column1  column2
      5        1
      6        0
      3        0
      2        1
      4        0
How do I create column2 by iterating through column1?

You can use the diff method on the first column with periods=-1, then check whether the result is less than zero to create the second column.
import pandas as pd

df = pd.DataFrame({'c1': [5, 6, 3, 2, 4]})
# diff(-1) computes c1[i] - c1[i+1]; a negative difference means the next value is greater
df['c2'] = (df.c1.diff(-1) < 0).astype(int)
df
# returns:
   c1  c2
0   5   1
1   6   0
2   3   0
3   2   1
4   4   0

You can also use shift. Performance is almost the same as diff, but diff seems to be a little faster.
df = pd.DataFrame({'column1': [5, 6, 3, 2, 4]})
df['column2'] = (df['column1'] < df['column1'].shift(-1)).astype(int)
print(df)
   column1  column2
0        5        1
1        6        0
2        3        0
3        2        1
4        4        0
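If you want to check the timing claim yourself, a minimal sketch using timeit (the frame size and repeat count here are arbitrary choices, not from the original answers):

import timeit

import numpy as np
import pandas as pd

# a larger frame so the timing difference is measurable
df = pd.DataFrame({'column1': np.random.randint(0, 100, 100_000)})

t_diff = timeit.timeit(lambda: (df['column1'].diff(-1) < 0).astype(int), number=100)
t_shift = timeit.timeit(lambda: (df['column1'] < df['column1'].shift(-1)).astype(int), number=100)
print(f'diff:  {t_diff:.3f}s for 100 runs')
print(f'shift: {t_shift:.3f}s for 100 runs')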

Related

How to merge two dataframes, updating the older one with the new one?

I am sorry for being a noob, but I couldn't find a solution to my problem after hours of searching.
import pandas as pd

df1 = pd.read_excel('df1.xlsx')
df1.set_index('time')
print(df1)

df2 = pd.read_excel('df2.xlsx')
df2.set_index('time')
print(df2)

new_df = pd.merge(df1, df2, how='outer')
print(new_df)
df1
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1

df2
   time  bought
0     3       0
1     4       0
2     5       0
3     6       0
4     7       0

new_df
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     5       0
6     6       0
7     7       0
What I want is:
- updating df1 (the existing data) with df2 (the new data feed); when it comes to the bought value, df1's data should come first
- new_df should contain all unique time values from df1 and df2, without duplicates

I tried every method I found, but none produced my desired outcome, or they created unnecessary duplicates as above (two rows with a time value of 5). The merge method created _x/_y suffixes or duplicates, and join() didn't work either.
What I desire should look like:
new_df
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     6       0
6     7       0
Thank you in advance
If you perform the merge as you have done, all you need to do is remove the duplicate rows, keeping only the more recent data. drop_duplicates() takes the kwarg subset, which takes a list of columns, and keep, which sets which row to keep if there are duplicates. In this case we only need to check for duplicates in the time column, and we keep the first row. (As an aside, df1.set_index('time') returns a new frame rather than modifying df1 in place, so those calls in your code have no effect and the merge happens on the columns anyway.)
import pandas as pd

df1 = pd.read_excel('df1.xlsx')
print(df1)

df2 = pd.read_excel('df2.xlsx')
print(df2)

new_df = pd.merge(df1, df2, how='outer')
# keep the first occurrence of each time value, i.e. the row coming from df1
new_df = new_df.drop_duplicates(subset=['time'], keep='first').reset_index(drop=True)
print(new_df)
Output:
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     6       0
6     7       0
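For what it's worth, the same result can be had without merge by concatenating and then dropping duplicates. A sketch using the sample data from the question, since the .xlsx files aren't available here:

import pandas as pd

df1 = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'bought': [0, 0, 0, 0, 1]})
df2 = pd.DataFrame({'time': [3, 4, 5, 6, 7], 'bought': [0, 0, 0, 0, 0]})

# stack the old data on top of the new feed; keep='first' means the
# df1 row wins whenever a time value appears in both frames
new_df = (pd.concat([df1, df2])
            .drop_duplicates(subset=['time'], keep='first')
            .reset_index(drop=True))
print(new_df)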

How to substitute with 0 the first of 2 consecutive values in a pandas column, with groupby

I have the following pandas dataframe
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'col_a': [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]})
I would like to create a new column (col_a_new) which will be the same as col_a, but with the first of every two consecutive 1s in col_a replaced by 0, by id.
The resulting dataframe looks like this:
foo = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'col_a': [0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
                    'col_a_new': [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]})
Any ideas?
Other approach: just group by id and define the new values using the appropriate conditions.
(foo.groupby("id").col_a
    .transform(lambda series: [0 if i < len(series) - 1
                               and series.tolist()[i + 1] == 1
                               else x
                               for i, x in enumerate(series.tolist())]))
import numpy as np

# group by id and by non-consecutive clusters of 0/1 in col_a
group = foo.groupby(["id", foo["col_a"].ne(foo["col_a"].shift()).cumsum()])

# get the cumcount and size of each group
foo_cumcount = group.cumcount()
foo_count = group.col_a.transform(len)

# zero out the first 1 of each group of two or more 1s, otherwise keep the original value
foo["col_a_new"] = np.where(foo_cumcount.eq(0)
                            & foo_count.gt(1)
                            & foo.col_a.eq(1),
                            0, foo.col_a)
# result
   id  col_a  col_a_new
0   1      0          0
1   1      1          0
2   1      1          1
3   1      0          0
4   1      1          1
5   2      1          0
6   2      1          1
7   2      0          0
8   2      1          0
9   2      1          1
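A vectorized alternative with shift is also possible. One caveat: this sketch zeroes every 1 whose successor (within the same id) is also 1, which matches the example data where runs of 1s are at most two long; for longer runs it would behave differently from the cumcount approach above.

import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'col_a': [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]})

# within each id, flag the 1s that are immediately followed by another 1
next_is_one = foo.groupby('id')['col_a'].shift(-1).eq(1)
first_of_pair = foo['col_a'].eq(1) & next_is_one

# keep the original value everywhere else, zero out the flagged positions
foo['col_a_new'] = foo['col_a'].where(~first_of_pair, 0)
print(foo)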

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
I am trying to replace the non-zero values with the character at index 5 of the column names (which is the numeric part of the column names), so the output should be:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace it with the column names, the code works, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1s and 0s, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask * df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
mask * df.columns.str[5].astype(int)
Output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
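As for why the original replace attempt fails: when the value argument is a Series, pandas matches its index against the DataFrame's column labels, which is why the version indexed by df.columns works. Indexing the replacement characters by the columns should therefore fix it (a sketch based on that observation):

import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# the Series index must be the column labels, not the replacement characters
print(df.replace(1, pd.Series([c[5] for c in df.columns], index=df.columns)))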

How to assign a value to a column for a subset of dataframe based on a condition in Pandas?

I have a data frame:
df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])
df:
   index  A  class  label
0      0  4      0      0
1      1  5      1      0
2      2  6      0      0
3      3  7      1      0
I want to change label to 1 if the mean of column A over the rows with class 0 is bigger than the mean of all values in column A.
How can I do this in a few lines of code?
I tried this, but it didn't work:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
    df[df['class']]['lable'] = 1
Use the following: pandas.DataFrame.groupby on 'class', take the groupby.mean of 'A' for each group, check whether it is greater than df['A'].mean(), then pandas.Series.map that boolean series onto df['class'], convert it astype(int), and assign it to df['label']:
>>> df['label'] = df['class'].map(
...     df.groupby('class')['A'].mean() > df['A'].mean()
... ).astype(int)
>>> df
   index  A  class  label
0      0  4      0      0
1      1  5      1      1
2      2  6      0      0
3      3  7      1      1
Since you are checking only for class == 0, you need to add another boolean mask on df['class']:
>>> df['label'] = (df['class'].map(
...     df.groupby('class')['A'].mean() > df['A'].mean()
... ) & (~df['class'].astype(bool))
... ).astype(int)
>>> df
   index  A  class  label
0      0  4      0      0
1      1  5      1      0  # class is 1, and only class 0 is being checked
2      2  6      0      0  # (4+6)/2 = 5 is not greater than (4+5+6+7)/4 = 5.5
3      3  7      1      0
So even if your code had worked, you would not have noticed, because with this data the condition is not fulfilled: the class-0 mean of A is 5, which is below the overall mean of 5.5.
If I understand correctly, if the condition you mentioned is fulfilled, then the labels of all rows change to 1, right? In that case, what you did is correct, but you missed something; the code should look like this:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
    df['label'] = 1
This should work.
What you did does not work because df[df['class']] does not select rows of the DataFrame (it indexes it with the values of the 'class' column), so the 'label' column you want to modify is never reached; note also that it is misspelled as 'lable' in your code.
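For completeness, a sketch of the same fix written with .loc, which avoids the chained-indexing pitfall entirely (still assuming, as above, that every row's label should flip when the condition holds):

# assign through one .loc call so the write hits the real DataFrame,
# not a temporary copy produced by chained indexing
if df.loc[df['class'] == 0, 'A'].mean() > df['A'].mean():
    df.loc[:, 'label'] = 1
    # or, to flip only the class-0 rows instead:
    # df.loc[df['class'] == 0, 'label'] = 1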

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe that I'm trying to drop rows based on a criteria across select columns. If the values in these select columns are zero, the rows should be dropped. Here is an example.
import pandas as pd

t = pd.DataFrame({'a': [1, 0, 0, 2], 'b': [1, 2, 0, 0], 'c': [1, 2, 3, 4]})
   a  b  c
0  1  1  1
1  0  2  2
2  0  0  3
3  2  0  4
I would like to try something like:
cols_of_interest = ['a', 'b']  # drop rows if zero in all these columns
t = t[t[cols_of_interest] != 0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest] == 0].index)
And all rows are dropped.
What I would like to end up with is:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
where the third row (index 2) was dropped because it takes the value 0 in BOTH columns of interest, not just one.
Your problem here is that you assigned the result of your boolean condition, t = t[t[cols_of_interest] != 0], which overwrites your original df and fills the positions where the condition is not met with NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that at least one non-NaN value must be present in the row; we can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a', 'b']
t.loc[t[t[cols_of_interest] != 0].dropna(thresh=1).index]

Out[124]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition, and use the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]

Out[125]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
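An equivalent spelling of the same mask, for readers who find it closer to the stated requirement ("drop rows that are zero in all these columns"):

t[~(t[cols_of_interest] == 0).all(axis=1)]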
