I have what I think is a simple question, but I am not sure how to implement it.
I have the following dataframe:
ID Value
1 100
2 250
3 300
4 400
5 600
7 800
I would like to look at two IDs, 3 and 5, and drop the one with the lower value. I am assuming I would use something like the following code, but again, I am not sure how to implement it, nor am I sure how to use the inequality to compare the values while directing my function at a very specific pair of IDs.
def ChooseGreater(x):
    if df['id'] == 3 > df['id'] == 5
        return del df['id'] == 5
    else:
        return del df['id'] == 3
Thank you!
I think you can do:
df.drop(df.loc[df.ID.isin([3,5]),'Value'].idxmin(), inplace=True)
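For reference, here is a minimal, self-contained sketch reproducing the example dataframe and applying that one-liner (column names ID and Value as in the question):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 7],
                   'Value': [100, 250, 300, 400, 600, 800]})

# idxmin returns the row index of the smaller Value among IDs 3 and 5; drop removes that row
df.drop(df.loc[df.ID.isin([3, 5]), 'Value'].idxmin(), inplace=True)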
Using Python's min
df.drop(min(df.query('ID in [3, 5]').index, key=df.Value.get))
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
groupby and tail
df.sort_values('Value').groupby(df.ID.replace({3: 5})).tail(1)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
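The replace({3: 5}) step simply relabels ID 3 as 5 so that the two rows of interest fall into one group; after sorting by Value, tail(1) keeps the larger row of that shared group (and every other row, since the remaining groups each have a single member). You can inspect the relabelled key on its own:
df.ID.replace({3: 5}).tolist()   # [1, 2, 5, 4, 5, 7]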
You can calculate idxmin and then use np.in1d with pd.DataFrame.loc:
import numpy as np

idx = df.loc[df['ID'].isin([3,5]), 'Value'].idxmin()
res = df.loc[~np.in1d(df.index, idx)]
print(res)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
This is a method using groupby:
df.loc[df.Value.groupby((~df.ID.isin([3,5])).sort_values().cumsum()).idxmax()].sort_index()
Out[167]:
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
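For readability, the grouping key built inside that one-liner can be inspected separately; rows with ID 3 or 5 share group 0, while every other row gets its own group, so idxmax only has a real choice to make within that shared group:
key = (~df.ID.isin([3, 5])).sort_values().cumsum()
key.sort_index().tolist()   # [1, 2, 0, 3, 0, 4]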
I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a b
1 10 100
20 250
2 10 400
20 550
Name: result, dtype: int64
but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
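The reason transform is needed, rather than the plain groupby mean, is that transform returns a result aligned to the original index (one value per row), while the aggregation returns one value per (a, b) group. A small sketch of the difference:
grouped = df.groupby(['a', 'b'])['result']
grouped.mean().shape              # (4,) -> one value per (a, b) pair
grouped.transform('mean').shape   # (6,) -> one value per original row, ready to assign back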
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution below.
In PySpark it is better to use a Window function, as in the following snippet:
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code uses the same example as the previously provided solution (https://stackoverflow.com/a/33445035/6504287).
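For a self-contained run, here is a minimal sketch assuming a local SparkSession and the same example data (ext_data_df above refers to the answerer's own DataFrame):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, 10, 100), (1, 20, 200), (1, 20, 300),
     (2, 10, 400), (2, 20, 500), (2, 20, 600)],
    ['a', 'b', 'result'])

windowSpecAgg = Window.partitionBy('a', 'b')
sdf.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()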
I have two dataframes as described below.
I would like to add a column (Col_to_create) to the second table, filled with the value of b from the first table whose a matches Refer_to_A.
Table 2 has more than 800,000 rows, so I am looking for a fast way to do this.
First table:
a b
1 100
2 400
3 500
Second table:
id Refer_to_A Col_to_create
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
You can use the method map:
df2['Col_to_create'] = df2['Refer_to_A'].map(df1.set_index('a')['b'])
Output:
Refer_to_A Col_to_create
id
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
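A self-contained version of the same idea, as a minimal sketch assuming the two tables are loaded as df1 and df2 with the column names shown above:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [100, 400, 500]})
df2 = pd.DataFrame({'Refer_to_A': [3, 1, 3, 2, 1]})

# map looks up each Refer_to_A value in a Series indexed by df1['a']
df2['Col_to_create'] = df2['Refer_to_A'].map(df1.set_index('a')['b'])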
One possible way is to apply a function that builds the new column:
If your dataset is:
dataframe_a = pd.DataFrame({'a': [1,2,3], 'b': [100,400,500]})
dataframe_b = pd.DataFrame({'Refer_to_A': [3,1,3,2,1]})
You can try something like:
dataframe_b['Col_to_create'] = dataframe_b['Refer_to_A'].apply(lambda col: dataframe_a['b'][col-1])
output:
Refer_to_A Col_to_create
0 3 500
1 1 100
2 3 500
3 2 400
4 1 100
I have a dataframe:
id val size
1 100
2 500
3 300
I have a nested list L = [[300,20],[100,45],[500,12]].
I want to fill the size column with the second element of each sublist, matched on the val column.
i.e. my final dataframe should look like:
id val size
1 100 45
2 500 12
3 300 20
Another way using merge
In [1417]: df.merge(pd.DataFrame(L, columns=['val', 'size']), on='val')
Out[1417]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
First initialise a mapping:
In [132]: mapping = dict([[300,20],[100,45],[500,12]]); mapping
Out[132]: {100: 45, 300: 20, 500: 12}
Now, you can use either df.replace or df.map.
Option 1
Using df.replace:
In [137]: df['size'] = df.val.replace(mapping)
In [138]: df
Out[138]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
Option 2
Using df.map:
In [140]: df['size'] = df.val.map(mapping)
In [141]: df
Out[141]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
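One practical difference between the two options, sketched here with a value that is not in the example data (999 is purely illustrative): map returns NaN for values missing from the mapping, while replace leaves them unchanged.
s = pd.Series([100, 999])
s.map(mapping)       # 45.0, NaN
s.replace(mapping)   # 45, 999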
My dataframe looks like this:
id month spent limit
1 1 2.6 10
1 2 4 10
1 3 6 10
2 1 3 100
2 2 89 100
2 3 101 100
3 1 239 500
3 2 432 500
3 3 100 500
I want to group by id and then get the ids for which the spent column is less than or equal to the limit column for every row in the group.
For my above example, I should get ids 1 and 3 as my result because id 2 spends 101 in 3rd month and hence exceeds the limit of 100.
How can I do this in pandas efficiently?
Thanks in advance!
You can create a mask by finding the ids where spent is greater than limit, then filter out those ids:
mask = df.loc[df['spent'] > df['limit'], 'id'].values.tolist()
df.id[~df['id'].isin(mask)].unique()
gives you
array([1, 3])
This should give you something like what you want:
df.groupby('id').apply(lambda g: (g.spent <= g.limit).all()).to_frame('not_exceeded').query('not_exceeded == True')
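If you only need the list of ids rather than a frame, the same boolean Series can be filtered directly (a minimal sketch of the same idea):
keep = df.groupby('id').apply(lambda g: (g.spent <= g.limit).all())
keep[keep].index.tolist()   # [1, 3]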
Reverse logic! Check for unique ids where spent is greater than limit. Then filter out those.
df[~df.id.isin(df.set_index('id').query('limit < spent').index.unique())]
id month spent limit
0 1 1 2.6 10
1 1 2 4.0 10
2 1 3 6.0 10
6 3 1 239.0 500
7 3 2 432.0 500
8 3 3 100.0 500
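An equivalent way to keep the qualifying rows, not used in the answers above but standard pandas, is groupby().filter():
df.groupby('id').filter(lambda g: (g.spent <= g.limit).all())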