I have the two dataframes shown below.
I would like to create an additional column (Col_to_create) in the second table, holding the value of b from the first table that corresponds to the value in Refer_to_A (which refers to column a).
Table 2 has more than 800,000 rows, so I am looking for a fast way to do this.
First table:
a b
1 100
2 400
3 500
Second table (Col_to_create shows the desired result):
id Refer_to_A Col_to_create
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
You can use the map method:
df2['Col_to_create'] = df2['Refer_to_A'].map(df1.set_index('a')['b'])
Output:
Refer_to_A Col_to_create
id
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
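For reference, a minimal runnable sketch of this approach, assuming the two frames are named df1 and df2 (the names are not given in the question):
import pandas as pd
# Recreate the two tables from the question
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [100, 400, 500]})
df2 = pd.DataFrame({'Refer_to_A': [3, 1, 3, 2, 1]})
# Build a lookup Series (a -> b) once, then map it onto the large column
df2['Col_to_create'] = df2['Refer_to_A'].map(df1.set_index('a')['b'])
print(df2)
map with a Series lookup is vectorized, so it scales well to 800,000 rows.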
One possible way is to apply a function to create the new column.
If your dataset is:
dataframe_a = pd.DataFrame({'a': [1,2,3], 'b': [100,400,500]})
dataframe_b = pd.DataFrame({'Refer_to_A': [3,1,3,2,1]})
You can try something like:
dataframe_b['Col_to_create'] = dataframe_b['Refer_to_A'].apply(lambda col: dataframe_a['b'][col-1])
output:
Refer_to_A Col_to_create
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
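Note that dataframe_a['b'][col-1] is a positional lookup, so this only works because the values in column a happen to be 1, 2, 3 in order. A more general alternative (and usually faster on 800,000 rows) is a left merge; a sketch reusing the names from this answer:
dataframe_b = dataframe_b.merge(
    dataframe_a.rename(columns={'a': 'Refer_to_A', 'b': 'Col_to_create'}),
    on='Refer_to_A', how='left')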
Related question:
I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
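For intuition, transform('mean') computes the per-group means and broadcasts them back onto the original rows. A rough two-step equivalent, sketched with the same df, would be to aggregate and then merge the result back:
means = df.groupby(['a', 'b'], as_index=False)['result'].mean().rename(columns={'result': 'avg_result'})
df = df.merge(means, on=['a', 'b'], how='left')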
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution below. In PySpark it is better to go with a window function, as in the code snippet below:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code uses the same example as the previously provided solution (https://stackoverflow.com/a/33445035/6504287).
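A fuller, self-contained sketch of the PySpark version, assuming a local SparkSession and recreating the same sample data (ext_data_df is the name used in the snippet above):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
# Same sample data as in the pandas answer above
ext_data_df = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])

windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()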
I have the following 2 data sets:
Individual data
household_id member_id channel event_begin event_end
0 1 1 100 83098 83300
1 1 2 100 83150 83600
2 1 1 200 83700 83865
3 1 2 200 83931 83963
4 1 3 200 84367 84532
5 1 4 450 84598 84721
6 2 1 300 83841 83906
7 2 2 300 78219 78500
Household data
household_id channel Begin End
0 1 100 83098 83600
1 1 200 84367 84532
2 2 300 83841 83906
3 2 300 78219 78452
I want to add a column ['FS_NFS'] to the Individual data by matching rows between the two tables on the key pair (household_id, channel).
I want to put 'FS' in column 'FS_NFS' in the Individual data if the following condition is satisfied:
(indv['event_begin']>=HH['Begin']) & (indv['event_end']<=HH['End']) &
(indv['household_id']==HH['household_id']) & (indv['channel']==HH['channel'])
otherwise I want 'NFS' in column 'FS_NFS' in the Individual data.
Expected output:
household_id member_id channel event_begin event_end FS_NFS
0 1 1 100 83098 83300 FS
1 1 2 100 83150 83600 FS
2 1 1 200 83700 83865 NFS
3 1 2 200 83931 83963 NFS
4 1 3 200 84367 84532 FS
5 1 4 450 84598 84721 NFS (Channel not present in both)
6 2 1 300 83841 83906 FS
7 2 2 300 78219 78500 NFS
The simplest way to solve your problem is to pd.merge the individual data with the household data on the double key, household_id and channel:
data = pd.merge(ind, household, how = 'left', on = ['household_id', 'channel'], left_index = False)
Then you can create the FS_NFS column based on whether the household's Begin/End values are non-missing after the merge and the event falls inside that window.
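A hedged sketch of that idea, using the ind and household names from the merge above and NumPy's where (Begin/End are the household columns):
import numpy as np
import pandas as pd

data = pd.merge(ind, household, how='left', on=['household_id', 'channel'])
# 'FS' when a matching household row exists and the event falls inside its window;
# comparisons against NaN (no match) are False, so unmatched rows become 'NFS'
in_window = (data['event_begin'] >= data['Begin']) & (data['event_end'] <= data['End'])
data['FS_NFS'] = np.where(in_window, 'FS', 'NFS')
# Caveat: if one (household_id, channel) pair has several household rows, the merge
# duplicates individual rows; collapse them afterwards, keeping 'FS' if any row matched,
# e.g. data.sort_values('FS_NFS').drop_duplicates(list(ind.columns)).sort_index()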
HH.columns = ['household_id', 'channel', 'HH_event_Begin', 'HH_event_End']
merged = pd.merge(indv, HH, on=['household_id', 'channel'], how='left')
merged['event_begin'] = merged['event_begin'].astype(float)
merged['event_end'] = merged['event_end'].astype(float)
merged['FSNFSNew'] = ''
i = 0
length = len(merged) - 1
while i <= length:
    # assign with .loc to avoid chained-assignment warnings
    if (merged['event_begin'][i] >= merged['HH_event_Begin'][i]) & (merged['event_end'][i] <= merged['HH_event_End'][i]):
        merged.loc[i, 'FSNFSNew'] = 'FS'
    else:
        merged.loc[i, 'FSNFSNew'] = 'NFS'
    i = i + 1
I have what I think is a simple question, but I am not sure how to implement it.
I have the following dataframe:
ID Value
1 100
2 250
3 300
4 400
5 600
7 800
I would like to look at two IDs, 3 and 5, and then drop the one with the lower value. I am assuming I would use something like the following (non-working) code, but again, I am not sure how to implement it, nor am I sure how to use the inequality to compare the values while pointing my function at a very specific pair of IDs.
def ChooseGreater(x):
    if df['id'] == 3 > df['id'] == 5:
        return del df['id'] == 5
    else:
        return del df['id'] == 3
Thank you!
I think you can do:
df.drop(df.loc[df.ID.isin([3,5]),'Value'].idxmin(), inplace=True)
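A quick usage sketch with the sample data from the question (the frame name df is an assumption):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 7],
                   'Value': [100, 250, 300, 400, 600, 800]})
# idxmin finds the row label of the smaller Value among IDs 3 and 5; drop removes it
df.drop(df.loc[df.ID.isin([3, 5]), 'Value'].idxmin(), inplace=True)
print(df)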
Using Python's min
df.drop(min(df.query('ID in [3, 5]').index, key=df.Value.get))
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
groupby and tail
df.sort_values('Value').groupby(df.ID.replace({3: 5})).tail(1)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
You can calculate idxmin and then use np.in1d with pd.DataFrame.loc:
idx = df.loc[df['ID'].isin([3,5]), 'Value'].idxmin()
res = df.loc[~np.in1d(df.index, idx)]
print(res)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
This is a groupby-based method: build a grouping key that puts IDs 3 and 5 into one group (every other row stays in its own group), keep the row with the largest Value in each group via idxmax, then restore the original order.
df.loc[df.Value.groupby((~df.ID.isin([3,5])).sort_values().cumsum()).idxmax()].sort_index()
Out[167]:
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
My dataframe looks like this:
id month spent limit
1 1 2.6 10
1 2 4 10
1 3 6 10
2 1 3 100
2 2 89 100
2 3 101 100
3 1 239 500
3 2 432 500
3 3 100 500
I want to group by id and then get the ids for which the spent column is less than or equal to the limit column in every row of the group.
For my example above, I should get ids 1 and 3 as the result, because id 2 spends 101 in the 3rd month and hence exceeds its limit of 100.
How can I do this in pandas efficiently?
Thanks in advance!
You can create a mask by finding the ids where spent is greater than limit, then filter those ids out:
mask = df.loc[df['spent'] > df['limit'], 'id'].values.tolist()
df.id[~df['id'].isin(mask)].unique()
gives you
array([1, 3])
This should give you something like what you want
df.groupby('id').apply(lambda g: (g.spent <= g.limit).all()).to_frame('not_exceeded').query('not_exceeded == True')
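If you only need the list of ids, the same idea can be expressed a bit more directly; a sketch assuming the frame is named df as above:
ok_ids = df.groupby('id').apply(lambda g: (g.spent <= g.limit).all())
print(ok_ids[ok_ids].index.tolist())  # [1, 3]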
Reverse logic! Check for unique ids where spent is greater than limit. Then filter out those.
df[~df.id.isin(df.set_index('id').query('limit < spent').index.unique())]
id month spent limit
0 1 1 2.6 10
1 1 2 4.0 10
2 1 3 6.0 10
6 3 1 239.0 500
7 3 2 432.0 500
8 3 3 100.0 500
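For larger frames, a vectorized variant that avoids apply is also possible; a sketch, again assuming the frame is named df:
# Per-row check broadcast over each id group with transform('all')
within_limit = df['spent'].le(df['limit']).groupby(df['id']).transform('all')
print(df.loc[within_limit, 'id'].unique())  # [1 3]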