I have a pandas DataFrame that looks like the following:
fastmoving[['dist','unique','id']]
Out[683]:
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09
What I want to achieve is to:
Find the top n longest-distance entries (column 'dist').
Find which ids make up more than some percentage m of those top n entries (column 'id').
So far I was able to write the code for the maximum entries.
#Get the first id with the largest dist:
fastmoving.loc[fastmoving['dist'].idxmax(),'id']
#Get all id's with the largest dist:
fastmoving.loc[fastmoving['dist']==fastmoving['dist'].max(),'id']
What I'm missing is how to make my code work for more than one value: instead of just the maximum value, it should work for a range of maximum values (the top n values).
And then get all the ids that make up more than some percentage m of those top n values.
Can you please help me on how I can achieve that in pandas?
Thanks a lot
Alex
You can use nlargest for the top n and quantile for the top m%, like this:
import pandas as pd
from io import StringIO
fastmoving = pd.read_csv(StringIO("""
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09"""), sep=r"\s+")
n = 3
m = 50
top_n_dist = fastmoving.nlargest(n, ["dist"])
top_m_percent_id_in_top_n_dist = top_n_dist[top_n_dist['id'] > top_n_dist['id'].quantile(m/100)]
print(top_m_percent_id_in_top_n_dist)
IIUC, you can leverage nlargest. The following example would take the top 3 values of dist, and from that, extract the top 2 values of id:
fastmoving.nlargest(3, ["dist", "id"]).nlargest(2, "id")
dist unique id
1 0.434386 4.0 8.288070e+09
1 0.406677 4.0 4.997434e+09
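For the percentage-based selection described in the question, a minimal sketch (a hypothetical helper of my own, reusing the fastmoving frame built in the first answer rather than code from either answer):

def ids_over_m_percent(df, n, m):
    # take the top n rows by 'dist', then keep the ids whose share of those rows exceeds m percent
    top_n = df.nlargest(n, "dist")
    shares = top_n["id"].value_counts(normalize=True) * 100
    return shares[shares > m]

print(ids_over_m_percent(fastmoving, n=3, m=25))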
Related
I'm trying to figure out whether the value in my dataframe is increasing in the tens/hundreds place. For example, I created a dataframe with a few values, duplicated the column and shifted it, and now I am able to compare them. But how do I determine whether the tens place is increasing, as opposed to the value increasing only a little, for example by 0.02 points?
import pandas as pd
import numpy as np
data = {'value':['9','10','19','22','31']}
df = pd.DataFrame(data)
df['value_copy'] = df['value'].shift(1)
df['Increase'] = np.where(df['value']<df['value_copy'],1,0)
output should be in this case:
[nan,1,0,1,1]
IIUC, divide by 10, get the floor, then compare the successive values (diff(1)) to see if the difference is exactly 1:
np.floor(df['value'].astype(float).div(10)).diff(1).eq(1).astype(int)
If you want a jump to at least the next tens (or more) use ge (≥):
np.floor(df['value'].astype(float).div(10)).diff(1).ge(1).astype(int)
output:
0 0
1 1
2 0
3 1
4 1
Name: value, dtype: int64
NB. if you insist on the NaN:
s = np.floor(df['value'].astype(float).div(10)).diff(1)
s.eq(1).astype(int).mask(s.isna())
output:
0 NaN
1 1.0
2 0.0
3 1.0
4 1.0
Name: value, dtype: float64
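Putting the pieces together, a minimal end-to-end sketch (assuming the sample frame from the question and the eq(1) variant with the NaN kept):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['9', '10', '19', '22', '31']})
tens = np.floor(df['value'].astype(float).div(10))   # tens digit of each value
diffs = tens.diff(1)                                  # change in the tens digit vs the previous row
df['Increase'] = diffs.eq(1).astype(int).mask(diffs.isna())
print(df['Increase'].tolist())                        # [nan, 1.0, 0.0, 1.0, 1.0]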
I have a dataframe with 2 columns.
df=pd.DataFrame({'values':arrays,'ii':lin_index})
I want to group the values by lin_index (the 'ii' column) and get the mean and the most common value per group.
I tried this:
bii=df.groupby('ii').median()
bii2=df.groupby('ii').agg(lambda x:x.value_counts().index[0])
bii3=df.groupby('ii')['values'].agg(pd.Series.mode)
I wonder if bii2 and bii3 return the same values.
Then I want to write the mean and the most common value back to the original array:
bs=np.zeros((np.unique(array).shape[0],1))
bs[bii.index.values]=bii.values
Does this look good?
df looks like
values ii
0 1.0 10446786
1 1.0 11316289
2 1.0 16416704
3 1.0 12151686
4 1.0 30312736
... ...
93071038 3.0 28539525
93071039 3.0 19667948
93071040 3.0 22240849
93071041 3.0 22212513
93071042 3.0 41641943
[93071043 rows x 2 columns]
Something like this, maybe:
# get the mean
df.groupby(['ii']).mean()
# get the most frequent
df.groupby(['ii']).agg(pd.Series.mode)
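If you want both statistics in one pass, here is a minimal sketch using named aggregation (with a tiny illustrative frame, not the 93-million-row one from the question):

import pandas as pd

df = pd.DataFrame({'values': [1.0, 1.0, 3.0, 3.0, 3.0, 3.0],
                   'ii':     [10,  10,  10,  20,  20,  20]})
out = df.groupby('ii')['values'].agg(
    mean='mean',
    most_common=lambda s: s.value_counts().index[0],  # same idea as bii2 above
)
print(out)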
Your question seems similar to GroupBy pandas DataFrame and select most common value.
This link might also be useful: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats
I have data like below:
id movie details value
5 cane1 good 6
5 wind2 ok 30.3
5 wind1 ok 18
5 cane1 good 2
5 cane22 ok 4
5 cane34 good 7
5 wind2 ok 2
I want the output with below criteria:
If movie name starts with 'cane' - sum the value
If movie name starts with 'wind' - count the occurrence.
So - the final output will be:
id movie value
5 cane1 8
5 cane22 4
5 cane34 7
5 wind1 1
5 wind2 2
I tried to use:
movie_df.groupby(['id']).apply(aggr)
def aggr(x):
    if x['movie'].str.startswith('cane'):
        y = x.groupby(['value']).sum()
    else:
        y = x.groupby(['movie']).count()
    return y
But it's not working. Can anyone please help?
You should aim for vectorised operations where possible.
You can calculate 2 results and then concatenate them.
mask = df['movie'].str.startswith('cane')
df1 = df[mask].groupby('movie')['value'].sum()
df2 = df[~mask].groupby('movie').size()
res = pd.concat([df1, df2], ignore_index=False)\
        .rename('value').reset_index()
print(res)
movie value
0 cane1 8.0
1 cane22 4.0
2 cane34 7.0
3 wind1 1.0
4 wind2 2.0
There might be multiple ways of doing this. One way would be to filter by the start of the movie name first, then aggregate and concatenate afterwards.
cane = movie_df[movie_df['movie'].str.startswith('cane')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]
cane_sum = cane.groupby(['id', 'movie']).agg({'value': 'sum'}).reset_index()
wind_count = wind.groupby(['id', 'movie']).agg({'value': 'count'}).reset_index()
pd.concat([cane_sum, wind_count])
First of all, you need to perform a string operation. I guess in your case you don't want the digits in the movie names. Use the solution discussed at pandas applying regex to replace values.
Then call groupby() on the new series.
FYI: some movie names consist of digits only; in that case you need to use the update function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
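A rough sketch of that idea (my reading of the answer above, not code from it; it assumes the digits should be stripped and the remaining prefix used as the group key):

movie_df['prefix'] = movie_df['movie'].str.replace(r'\d+', '', regex=True)
print(movie_df.groupby('prefix')['value'].agg(['sum', 'count']))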
I would start by creating a column which defines the required groups. For the example at hand this can be done with
df['group'] = df['movie'].str[:4]
The next step would be to group by this column
df.groupby('group').apply(agg_fun)
using the following aggregation function
def agg_fun(grp):
    if grp.name == "cane":
        value = grp.value.sum()
    else:
        value = grp.value.count()
    return value
The output of this code is
group
cane 19.0
wind 3.0
I've been wondering how to solve the following problem. Say I have a dataframe df, which looks like this:
Name quantity price
A 1 10.0
A 3 26.0
B 1 15.0
B 3 30.0
...
Now, say I wanted to extrapolate the price by quantity and, for each Name, create a row for quantity = 1, 2, 3 based on some function of the available quantities and their respective prices. I.e., say I have a function extrapolate(qts, prices, n) that computes a price for quantity = n from the known qts and prices; then the result would look like:
Name quantity price
A 1 10.0
A 2 extrapolate([1, 3], [10.0, 26.0], 2)
A 3 26.0
B 1 15.0
B 2 extrapolate([1, 3], [15.0, 30.0], 2)
B 3 30.0
...
I would appreciate some insight on how to achieve this, or a pointer to where I can learn more about how groupby can be used for this case.
Thank you in advance
What you want is called missing data imputation. There are many approaches to it.
You may want to check the package fancyimpute. It offers imputation using MICE, which seems to do what you want.
Other than that, if your case is as simple in structure as the example, you can always groupby('Name').mean() and you will get the middle value for each subgroup.
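A minimal sketch of that last idea, assuming the frame from the question (the full groupby/concat version is spelled out in the next answer):

mid = df.groupby('Name', as_index=False)['price'].mean()   # price halfway between the two known quantities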
The following should do what you described:
def get_extrapolate_val(group, qts, prices, n):
    # do your actual calculations here; for now it returns just a dummy value
    some_value = (group[qts] * group[prices]).sum() / n
    return some_value
# some definitions
n = 2
quan_col = 'quantity'
price_col = 'price'
First we group by Name and then apply the function get_extrapolate_val to each group, passing the column names and n as additional arguments. As this returns a Series, we need an additional reset_index and rename, which makes the concatenation easier.
new_stuff = df.groupby('Name').apply(get_extrapolate_val, quan_col, price_col, n).reset_index().rename(columns={0: price_col})
Add n as an additional column:
new_stuff[quan_col] = n
We concatenate the two dataframes and are done:
final_df = pd.concat([df, new_stuff]).sort_values(['Name', quan_col]).reset_index(drop=True)
Name price quantity
0 A 10.0 1
1 A 44.0 2
2 A 26.0 3
3 B 15.0 1
4 B 52.5 2
5 B 30.0 3
The values I now added in are of course meaningless but are just there to illustrate the method.
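If a simple linear fit is an acceptable extrapolate, one possible concrete replacement for the dummy body (np.polyfit is just an illustrative choice, not something the question prescribes):

import numpy as np

def get_extrapolate_val(group, qts, prices, n):
    # fit price = slope * quantity + intercept through the known points, then evaluate at quantity n
    slope, intercept = np.polyfit(group[qts], group[prices], deg=1)
    return slope * n + intercept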
OLD version
Assuming that there is always only 1 and 3 in your quantity column, the following should work:
new_stuff = df.groupby('Name', as_index=False)['price'].mean()
This gives
Name price
0 A 18.0
1 B 22.5
That, as written, assumes that it is always only 1 and 3, so we can simply calculate the mean.
Then we add the 2
new_stuff['quantity'] = 2
and concatenate the two dataframes with an additional sorting
pd.concat([df, new_stuff]).sort_values(['Name', 'quantity']).reset_index(drop=True)
which gives the desired outcome
Name price quantity
0 A 10.0 1
1 A 18.0 2
2 A 26.0 3
3 B 15.0 1
4 B 22.5 2
5 B 30.0 3
There are probably far more elegant ways to do this though...
In R, the way to break ties randomly when using the rank function is simple:
rank(my_vec, ties.method = "random")
However, although both scipy (scipy.stats.rankdata) and pandas (pandas.Series.rank) have ranking functions, neither of them offers a method that breaks ties randomly.
Is there a simple way to do this with a Python framework, given that the list order has to remain the same?
Pandas' rank allows for these methods:
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
To "simply" accomplish your goal we can use 'first' after having randomized the Series.
Assume my series is named my_vec
my_vec.sample(frac=1).rank(method='first')
You can then put it back in its original order with:
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
Example Runs
my_vec = pd.Series([1, 2, 3, 1, 2, 3])
Trial 1
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 2.0 <- I expect this and
1 4.0
2 6.0
3 1.0 <- this to be first ranked
4 3.0
5 5.0
dtype: float64
Trial 2
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 1.0 <- Still first ranked
1 3.0
2 6.0
3 2.0 <- but order has switched
4 4.0
5 5.0
dtype: float64
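If you need the tie-breaking to be reproducible, you can pass a seed to sample (random_state is my addition, not part of the answer above):

my_vec.sample(frac=1, random_state=0).rank(method='first').reindex_like(my_vec)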