Compare two dataframes and conditionally capture random data in Python

The core of my question is comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).
Let's create two dummy dataframes:
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the full set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places, whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Those reading the question may ask how many random rows should be added.
For this example, adding 3 non-visited places per user is enough.

This might not be the best solution, but it works:
you have to take each user and then pick the checkinids that are not assigned to them.
# get all users
users = df1['user'].unique()
for user in users:
    checkins = df1.loc[df1['user'] == user]
    # outer merge with indicator flags the places this user never visited
    sampled = (checkins.merge(df2, how='outer', indicator=True)
                       .loc[lambda x: x['_merge'] == 'right_only']
                       .sample(n=3))
    sampled['user'] = user
    sampled['count'] = 0
    sampled = sampled.drop(columns='_merge')
    df1 = pd.concat([df1, sampled], ignore_index=True)
# sort dataframe based on user
df1 = df1.sort_values(by=['user'])
# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]
# print df
print(df1)
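If the per-user loop becomes a bottleneck on larger data, here is a loop-free sketch of the same idea: build every (user, place) candidate pair with a cross join, keep the unvisited ones, and sample per group. It assumes the df1/df2 defined above and pandas >= 1.2 for how='cross' (GroupBy.sample needs >= 1.1):

# Sketch: cross-join users with all places, keep unvisited pairs, sample 3 per user
candidates = (df1[['user']].drop_duplicates()
                  .merge(df2, how='cross')                 # every (user, place) pair
                  .merge(df1[['user', 'checkinid']],
                         how='left', indicator=True)
                  .query("_merge == 'left_only'")          # pairs the user never visited
                  .drop(columns='_merge'))
negatives = candidates.groupby('user').sample(n=3)
negatives['count'] = 0
result = (pd.concat([df1, negatives], ignore_index=True)
            .sort_values('user')[['user', 'checkinid', 'count']])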

Related

Pandas compute features diff within group

I have a dataframe with n rows for each group ID, where only one label is 1 and all the others are 0s.
Example:
ID, Feature_1, Feature_2, Feature_3, label
1, 10, 3, 4, 1
1, 9, 1, 2, 0
...
2, 100, 30, 40, 1
2, 90, 10, 20, 0
I want to group by ID and, within each group, transform the features of each label=0 row into the diff (Feature_1_i - Feature_1_j), where i is the row with label=1 within the group and j..n are the other rows in the group with label=0.
Expected output
ID, Feature_1, Feature_2, Feature_3, label
1, 10, 3, 4, 1
1, 10-9, 3-1, 4-2, 0
...
2, 100, 30, 40, 1
2, 100-90, 30-10, 40-20, 0
How can I achieve this in Pandas?
You can sort your dataframe using sort_values on 'ID' and 'label', in ascending and descending order respectively.
Then you can calculate a grouped difference using diff, which computes, within each group, each row's difference from the previous row (here label=0 minus label=1), leaving the first row of each group as NaN. Negating the result gives the desired label=1 minus label=0 direction.
The last thing to do is to fill the resulting NaN (the first row of each group, plus the label column) from the sorted frame (with numpy imported as np):
df_sorted = df.sort_values(by=['ID', 'label'], ascending=[True, False])
(-df_sorted.groupby('ID').diff()).assign(label=np.nan).fillna(df_sorted).astype(int)
prints:
   Feature_1  Feature_2  Feature_3  label
0         10          3          4      1
1          1          2          2      0
2        100         30         40      1
3         10         20         20      0
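An alternative sketch keeps the original row order: broadcast each group's label=1 row to the whole group with transform('first') and subtract it from the label=0 rows. It assumes exactly one label=1 row per ID:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Feature_1': [10, 9, 100, 90],
                   'Feature_2': [3, 1, 30, 10],
                   'Feature_3': [4, 2, 40, 20],
                   'label': [1, 0, 1, 0]})

feats = ['Feature_1', 'Feature_2', 'Feature_3']
# the label=1 row of each ID, repeated on every row of that ID
ref = (df.sort_values('label', ascending=False)
         .groupby('ID')[feats].transform('first')
         .sort_index())
mask = df['label'].eq(0)
df.loc[mask, feats] = ref[mask] - df.loc[mask, feats]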

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
import random
import numpy as np
import pandas as pd

Times_df = pd.DataFrame({'time': np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time': np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to assign a group index to every time in Times_df that is smaller than or equal to a given end time in End_df but greater than the previous one.
For now I can only do it with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
    while i <= row:
        lis.append((End_df['end time'] == row).index)
        i += 1
Then I add the list lis as a new column to Times_df:
Times_df['group'] = lis
Another solution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
    test = Times_df.loc[Times_df.index <= index['end time']][:]
    test['group'] = group
    test_df = pd.concat([test_df, test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
time group
0 2 0
1 3 0
2 7 1
3 11 2
4 15 2
5 16 2
6 18 2
7 22 3
8 25 3
9 28 3
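Because the end times are sorted, the same grouping also falls out of np.searchsorted without hardcoding the bins; a minimal sketch, assuming End_df['end time'] is in ascending order:

import numpy as np

# group = index of the first end time that is >= the given time
ends = End_df['end time'].to_numpy()
Times_df['group'] = np.searchsorted(ends, Times_df['time'].to_numpy(), side='left')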

In pandas, how should one add age-range columns?

Let's say I've got a simple DataFrame that details when people have been playing music through their lives, like this:
import pandas as pd

df = pd.DataFrame(
    [[15, 8, 7],
     [20, 10, 10],
     [35, 15, 20],
     [50, 12, 38]],
    columns=['current age', 'age started playing music', 'years playing music'])
How should one add additional columns that break down the number of years playing music they've had in each decade of their lives? For example, if the columns added were 0-10, 10-20, 20-30 etc., then the first person would have had 2 years of playing music in their first decade, 5 in their second, 0 in their third etc.
You can also try this using pd.cut and value_counts (with numpy imported as np):
df.join(df.apply(lambda x: pd.cut(np.arange(x['age started playing music'],
                                            x['current age']),
                                  bins=[0, 9, 19, 29, 39, 49],
                                  labels=['0-10', '10-20', '20-30',
                                          '30-40', '40+']).value_counts(),
                 axis=1))
Output:
   current age  age started playing music  years playing music  0-10  10-20  20-30  30-40  40+
0           15                          8                    7     2      5      0      0    0
1           20                         10                   10     0     10      0      0    0
2           35                         15                   20     0      5     10      5    0
3           50                         12                   38     0      8     10     10   10
I suggest creating a function that returns a list with the number of years played per decade, and then applying it to your dataframe:
import numpy as np

# Create a list with the number of years played in each decade
def get_years_playing_music_decade(current_age, age_start):
    if age_start > current_age:  # should not be possible
        return None
    # convert each age to a list of booleans:
    # was he playing in his i-th year of life?
    # Example: age_start = 3 gives [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, ...]
    age_start_lst = [0] * age_start + (100 - age_start) * [1]
    # was he within his lived years in his i-th year of life?
    current_age_lst = [1] * current_age + (100 - current_age) * [0]
    # combination of living and playing
    playing_music_lst = [1 if x == y else 0 for x, y in zip(age_start_lst, current_age_lst)]
    # group by decades of 10 years
    playing_music_lst_10y = [sum(playing_music_lst[(10 * i):((10 * i) + 10)]) for i in range(0, 10)]
    return playing_music_lst_10y

get_years_playing_music_decade(current_age=33, age_start=12)
# [0, 8, 10, 3, 0, 0, 0, 0, 0, 0]

# create columns 0-10 .. 90-100
colnames = list()
for i in range(10):
    colnames += [str(10 * i) + '-' + str(10 * (i + 1))]

# apply the defined function to the dataframe
df[colnames] = pd.DataFrame(df.apply(lambda x: get_years_playing_music_decade(
    int(x['current age']), int(x['age started playing music'])), axis=1).values.tolist())
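For comparison, a loop-free sketch computes the same decade counts as the overlap of the interval [start, current) with each decade, using clipping; decade_cols mirrors the colnames built above:

import numpy as np
import pandas as pd

edges = np.arange(0, 110, 10)                  # decade edges 0, 10, ..., 100
start = df['age started playing music'].to_numpy()[:, None]
current = df['current age'].to_numpy()[:, None]
# years played inside each decade = overlap of [start, current) with [edge, next edge)
years = np.clip(np.minimum(current, edges[1:]) - np.maximum(start, edges[:-1]), 0, None)
decade_cols = [f'{a}-{b}' for a, b in zip(edges[:-1], edges[1:])]
df = pd.concat([df, pd.DataFrame(years, index=df.index, columns=decade_cols)], axis=1)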

Merge two dataframes based on interval overlap

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add the Info column to dataframe A based on whether the interval (Start-End) of A overlaps with an interval of B. In case an A interval overlaps with more than one B interval, the info corresponding to the shorter interval should be added.
I have been looking around for ways to handle this and have found similar questions, but most of their answers use iterrows(), which is not viable in my case since I am dealing with huge dataframes.
I would like something like:
A.merge(B, on="overlapping_interval", how="left")
and then drop duplicates, keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I found this question really interesting, as it suggests the possibility of solving this issue using the pandas Interval object, but after many attempts I have not managed to solve it.
Any ideas?
I would suggest writing a function and then applying it to the rows.
First, compute the delta (End - Start) in B for sorting purposes:
B['delta'] = B.End - B.Start
Then a function to get the information:
def get_info(x):
    # fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # starts lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # start included, ends higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    # filter with the conditions and sort by delta
    _B = B[c0 | c1 | c2].sort_values('delta', ascending=True)
    return None if len(_B) == 0 else _B.iloc[0].Info  # None if no info matches
Then you can apply this function to A:
A['info'] = A.apply(get_info, axis='columns')
print(A)
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note:
Instead of using pd.Interval, write your own conditions. The cx conditions are your interval definitions; change them to get the exact expected behaviour.
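As a side note, the three conditions can be collapsed into the standard overlap test for closed intervals (two intervals overlap iff each starts before the other ends); a minimal sketch reusing B and its delta column from above:

def get_info_simple(x):
    # [a, b] and [c, d] overlap iff a <= d and b >= c
    overlap = (x.Start <= B.End) & (x.End >= B.Start)
    matches = B[overlap].sort_values('delta')
    return matches['Info'].iloc[0] if len(matches) else None

A['info'] = A.apply(get_info_simple, axis='columns')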

python pandas isin method?

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys into a list called 'stopword' if the value is more than 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using (list_c holds the words from df):
stopword = []
list_c = df['word'].tolist()  # the words already present in df
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Anybody knows how can I do the same thing with isin() method or more efficiently at least?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition keeps freq values greater than 5 and words not present in the other df, using isin with the mask inverted by ~.
You can then easily get a list:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
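A more direct sketch builds a Series straight from the dict and filters in one step (assuming the same df1 as above; the listed order follows the dict's insertion order):

s = pd.Series(wordfreq, name='freq')
stopword = s[(s > 5) & ~s.index.isin(df1['word'])].index.tolist()
# ['techsmart', 'jobvark', 'midgley', 'walter']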
