Pandas compute features diff within group - python

I have a dataframe with n rows for each group ID, where only one label is 1 and all the others are 0s.
Example:
ID, Feature_1, Feature_2, Feature_3, label
1, 10, 3, 4, 1
1, 9, 1, 2, 0
...
2, 100, 30, 40, 1
2, 90, 10, 20, 0
I want to group by ID and, within each group, replace the features of every label=0 row with the difference Feature_k_i - Feature_k_j, where i is the row with label=1 in the group and j ranges over the other rows in the group (those with label=0).
Expected output
ID, Feature_1, Feature_2, Feature_3, label
1, 10, 3, 4, 1
1, 10-9, 3-1, 4-2, 0
...
2, 100, 30, 40, 1
2, 100-90, 30-10, 40-20, 0
How can I achieve this in Pandas?

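You can sort your dataframe with sort_values on 'ID' and 'label', in ascending and descending order respectively, so that the label=1 row comes first in each group.
Then you can calculate a grouped difference using diff on your columns. Within each group, diff subtracts the previous row from the current one, so the first row of each group is left as NaN and, with two rows per group, the label=0 row holds (label=0 row - label=1 row). Note that this is the opposite sign of the difference asked for in the question, and that with more than two rows per group diff compares consecutive rows rather than each row with the label=1 row.
The last thing to do is to fill the resulting NaN values (the first row of each group) from the sorted dataframe: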
df_sorted = df.sort_values(by=['ID','label'],ascending=[True,False])
df_sorted.groupby('ID').diff().assign(label=np.nan).fillna(df_sorted).astype(int)
prints:
Feature_1 Feature_2 Feature_3 label
0 10 3 4 1
1 -1 -2 -2 0
2 100 30 40 1
3 -10 -20 -20 0
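The following is a minimal sketch of one way to get the sign and the reference row requested in the question (label=1 row minus each label=0 row), including groups with more than one label=0 row. It is not the answerer's method: the sample data, the extra third row for ID 1, and the names ref, aligned and is_other are assumptions made for illustration, and it assumes exactly one label=1 row per ID.

```python
import pandas as pd

# Hypothetical sample data; ID 1 has two label=0 rows to show the general case.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Feature_1": [10, 9, 7, 100, 90],
    "Feature_2": [3, 1, 2, 30, 10],
    "Feature_3": [4, 2, 1, 40, 20],
    "label": [1, 0, 0, 1, 0],
})
features = ["Feature_1", "Feature_2", "Feature_3"]

# One reference row per ID: the row whose label is 1 (assumed unique per ID).
ref = df.loc[df["label"].eq(1), ["ID"] + features].set_index("ID")

# Repeat each reference row so it lines up positionally with every row of df.
aligned = ref.reindex(df["ID"])[features].to_numpy()

# Reference minus current row, written back only into the label=0 rows.
out = df.copy()
is_other = out["label"].eq(0).to_numpy()
diffs = aligned - out[features].to_numpy()
out.loc[is_other, features] = diffs[is_other]

print(out)
```

Because the reference row is looked up by ID rather than by position, this sketch does not depend on the rows being sorted.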

Related

Compare two dataframe and conditionally capture random data in Python

The core of my question is comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).
Let's create two dummy dataframes.
data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4,4],
'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200,400],
'count': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50,55, 60, 70,100,120,180,200,300,400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains all checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places whose ids are (10,20,30,40,50).
I want to randomly add places that user 1 did not visit and set the 'count' column to 0 for them.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Readers may ask how many random rows should be added.
For each user, adding 3 non-visited places is enough for this example.
This might not be the best solution, but it works.
You have to iterate over the users and, for each one, pick checkinids that are not assigned to them:
# get all users
users = df1['user'].unique()
new_rows = []
for user in users:
    checkins = df1.loc[df1['user'] == user]
    # keep the checkinids from df2 that this user has not visited, then sample 3 of them
    df = checkins.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only'].sample(n=3)
    df['user'] = user
    df['count'] = 0
    new_rows.append(df.drop(columns='_merge'))
# DataFrame.append was removed in pandas 2.0, so concatenate instead
df1 = pd.concat([df1] + new_rows, ignore_index=True)
# sort the data frame by user
df1 = df1.sort_values(by=['user'])
# re-arrange columns
df1 = df1[['user', 'checkinid', 'count']]
# print the result
print(df1)
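As an alternative to merging inside the loop, the unvisited ids can be computed as a set difference per user. This is only a sketch under assumptions: the smaller sample frames, the helper name sample_unvisited, and the fixed random seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, smaller versions of df1 and df2 from the question.
df1 = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "checkinid": [10, 20, 30, 35, 45],
    "count": [1, 1, 1, 1, 1],
})
df2 = pd.DataFrame({"checkinid": [10, 20, 30, 35, 40, 45, 50, 55]})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def sample_unvisited(visited, all_ids, n, rng):
    """Pick up to n ids from all_ids that do not appear in visited."""
    candidates = np.setdiff1d(all_ids, visited)
    return rng.choice(candidates, size=min(n, len(candidates)), replace=False)


all_ids = df2["checkinid"].to_numpy()
negatives = pd.DataFrame(
    [
        (user, checkinid, 0)
        for user, visits in df1.groupby("user")
        for checkinid in sample_unvisited(visits["checkinid"].to_numpy(), all_ids, 3, rng)
    ],
    columns=["user", "checkinid", "count"],
)

# Visited rows (count=1) first, then the sampled negatives (count=0), per user.
result = pd.concat([df1, negatives], ignore_index=True).sort_values(
    ["user", "count"], ascending=[True, False]
)
print(result)
```

Collecting the new rows first and concatenating once avoids growing df1 inside the loop.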

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
Times_df = pd.DataFrame({'time':np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time':np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to add a group index to every time in Times_df that is smaller than or equal to each consecutive end time in End_df but greater than the previous one.
For now I can only do it with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
    while i <= row:
        lis.append((End_df['end time']==row).index)
        i += 1
Then I add the list lis as a new column to Times_df
Times_df['group']=lis
Another solution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
    test = count.loc[count.index<=index['end time']][:]
    test['group']=group
    test_df = pd.concat([test_df,test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
time group
0 2 0
1 3 0
2 7 1
3 11 2
4 15 2
5 16 2
6 18 2
7 22 3
8 25 3
9 28 3
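The bins above are typed in by hand. A minimal sketch of deriving them from End_df instead, assuming the end times are sorted and unique and that the group label is the End_df index; the sample values here are made up to mirror the printed output:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the printed output above.
Times_df = pd.DataFrame({"time": [2, 3, 7, 11, 15, 16, 18, 22, 25, 28]})
End_df = pd.DataFrame({"end time": [3, 10, 20, 53]})
End_df.index.name = "group"

# Bin edges: everything up to the first end time belongs to group 0, and each
# later group covers (previous end time, own end time].
bins = [-np.inf, *End_df["end time"]]

Times_df["group"] = pd.cut(
    Times_df["time"],
    bins=bins,
    labels=End_df.index.tolist(),
    right=True,  # intervals are closed on the right: time <= end time
)
print(Times_df)
```

Times larger than the last end time fall outside every bin and get NaN as their group in this sketch.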

How to conditionally fill new column using for loop in python? [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I want to add a new column and fill values based on condition.
df:
indicator, value, a, b
1, 20, 5, 3
0, 30, 6, 8
0, 70, 2, 2
1, 10, 3, 7
I want to add a new column (value_new) based on Indicator. If indicator == 1, value_new = a*b otherwise value_new = value.
df:
indicator, value, a, b, value_new
1, 20, 5, 3, 15
0, 30, 6, 8, 30
0, 70, 2, 2, 70
1, 10, 3, 7, 21
I have tried following:
value_new = []
for i in range(1, len(df)):
    if indicator[i] == 1:
        value_new.append(df['a'][i]*df['b'][i])
    else:
        value_new.append(df['value'][i])
df['value_new'] = value_new
Error: 'Length of values does not match length of index'
And I have also tried:
for i in range(1, len(df)):
    if indicator[i] == 1:
        df['value_new'][i] = df['a'][i]*df['b'][i]
    else:
        df['value_new'][i] = df['value'][i]
KeyError: 'value_new'
You can use np.where:
df['value_new'] = np.where(df['indicator'], df['a']*df['b'], df['value'])
print(df)
Prints:
indicator value a b value_new
0 1 20 5 3 15
1 0 30 6 8 30
2 0 70 2 2 70
3 1 10 3 7 21
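For reference, a self-contained version of the np.where approach, together with a pandas-only equivalent using Series.mask. The frame is rebuilt from the question's table; the name alt is just for illustration. As a side note on the loop attempts, range(1, len(df)) starts at 1 and so skips the first row, which would explain a list one element shorter than the index.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "indicator": [1, 0, 0, 1],
    "value": [20, 30, 70, 10],
    "a": [5, 6, 2, 3],
    "b": [3, 8, 2, 7],
})

# Vectorised choice: a*b where indicator is 1, otherwise the original value.
df["value_new"] = np.where(df["indicator"].eq(1), df["a"] * df["b"], df["value"])

# Pandas-only equivalent: start from value and overwrite where indicator is 1.
alt = df["value"].mask(df["indicator"].eq(1), df["a"] * df["b"])

print(df)
```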

Groupwise sorting in pandas

I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy terms it would look like this:
import numpy as np
def groupwise_sort(group_idx, a, reverse=False):
    sortidx = np.lexsort((-a if reverse else a, group_idx))
    # Reverse the sorting back into the original group order, but preserve the groupwise sorting
    revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
    return a[sortidx][revidx]
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>> array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>> array([3, 7, 1, 5, 4, 9, 2, 5, 1])
How can I do the same with pandas? I saw df.groupby() and df.sort_values(), but I couldn't find a straightforward way to achieve the same sorting, ideally a fast one.
Let us first set the stage:
import pandas as pd
import numpy as np
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
df = pd.DataFrame({'group': group_idx, 'values': a})
df
# group values
#0 3 3
#1 2 2
#2 3 1
#3 2 7
#4 2 4
#5 1 5
#6 2 5
#7 1 9
#8 1 1
To get a dataframe sorted by group and values (within groups):
df.sort_values(["group", "values"])
# group values
#8 1 1
#5 1 5
#7 1 9
#1 2 2
#4 2 4
#6 2 5
#3 2 7
#2 3 1
#0 3 3
To sort the values in descending order, use ascending = False. To apply different orders to different columns, you can supply a list:
df.sort_values(["group", "values"], ascending = [True, False])
# group values
#7 1 9
#5 1 5
#8 1 1
#3 2 7
#6 2 5
#4 2 4
#1 2 2
#0 3 3
#2 3 1
Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.
To only sort values for contiguous rows belonging to the same group, create a new group indicator:
(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)
df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
# group values new_grp
#0 3 3 1
#1 2 2 2
#2 3 1 3
#3 2 7 4
#4 2 4 4
#5 1 5 5
#6 2 5 6
#7 1 9 7
#8 1 1 7
We can then easily sort with new_grp instead of group, leaving the original order of groups untouched.
Ordering within groups but keeping the group-specific row positions:
To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:
# First, create an indicator for the original row-number:
df["ind"] = range(len(df))
# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])
# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values
# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex
# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")
# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
This is easier to test and profile when written up in a function, so let's do that:
def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):
    frame["ind"] = range(len(frame))
    frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
    ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
    frame_sorted["ind"] = ind_sorted
    frame_sorted = frame_sorted.sort_values(["ind"])
    return(frame_sorted.drop(columns = "ind"))
np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])
Note that the latter results match your desired outcome.
I am sure this can be written more succinctly. For instance, if the index of your dataframe is already ordered, you can use it instead of the indicator ind I create (i.e., following @DJK's comment, we can use sort_index instead of sort_values and avoid assigning an additional column). In any case, the above shows one possible solution and how to approach it. An alternative would be to use your numpy function and wrap the output in a pd.DataFrame.
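One possible shorter variant, sketched here under the assumption that only the values column needs to be reordered: sort once by group and value, then write the sorted values back into the row slots that each group occupies. The function name groupwise_sort_pd and the variable slots are made up for this example.

```python
import numpy as np
import pandas as pd


def groupwise_sort_pd(df, groupcol="group", valcol="values", ascending=True):
    """Sort valcol within each group while every group keeps its row positions."""
    # Values ordered by group first, then by value inside each group.
    sorted_vals = df.sort_values(
        [groupcol, valcol], ascending=[True, ascending]
    )[valcol].to_numpy()
    # Row positions ordered by group (stable): the slots each group occupies.
    slots = np.argsort(df[groupcol].to_numpy(), kind="stable")
    out = np.empty_like(sorted_vals)
    out[slots] = sorted_vals
    return pd.Series(out, index=df.index, name=valcol)


group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
df = pd.DataFrame({"group": group_idx, "values": a})

print(groupwise_sort_pd(df).to_numpy())
# [1 2 3 4 5 1 7 5 9]
print(groupwise_sort_pd(df, ascending=False).to_numpy())
# [3 7 1 5 4 9 2 5 1]
```

Unlike sort_my_frame, this sketch does not add a helper column to the input frame.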
Pandas is built on top of numpy. Assuming a dataframe like so:
df
Out[21]:
group values
0 3 3
1 2 2
2 3 1
3 2 7
4 2 4
5 1 5
6 2 5
7 1 9
8 1 1
Call your function.
groupwise_sort(df.group.values, df['values'].values)
Out[22]: array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(df.group.values, df['values'].values, reverse=True)
Out[23]: array([3, 7, 1, 5, 4, 9, 2, 5, 1])

python pandas isin method?

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to add each key to a list called 'stopword' if its value is more than 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k,v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append((k))
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
The boolean condition selects freq values greater than 5 where the word is not in the other df; isin builds the membership mask and ~ inverts it.
You can then easily get a list:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
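A self-contained sketch of the same filter, assuming the other dataframe is called df1 and holds the two words shown in the question. Here the dictionary is loaded into a Series (named freq for illustration) rather than a two-column frame, so the words become the index.

```python
import pandas as pd

wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100,
            'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1,
            'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}

# The other dataframe from the question (called df1 here, as in the answer).
df1 = pd.DataFrame({'word': ['paradies', 'tucuman'], 'freq': [1, 1]}, index=[1, 5])

# Words become the index, frequencies the values.
freq = pd.Series(wordfreq, name='freq')

# Keep words with frequency above 5 that do not appear in df1['word'].
mask = (freq > 5) & ~freq.index.isin(df1['word'])
stopword = freq[mask].index.tolist()

print(stopword)
```

With only a handful of words, a list comprehension over wordfreq.items() with a set of excluded words would also do; isin mainly helps when the data already lives in pandas.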
