I have a dataset with a very long tail and wish to keep only the top 90% of the data:
city score
bangkok 60
kl 20
sydney 10
melbourne 5
dhaka 5
should be:
city score
bangkok 60
kl 20
sydney 10
First, sort by the column you want to filter on, descending, so the highest scores come first:
df.sort_values('score', ascending=False, inplace=True)
Then calculate the cumulative sum and divide it by the total to build your filtering condition (you can replace 0.9 with your custom limit). Use <= so the row that lands exactly on the limit, like sydney here, is kept:
df = df[df['score'].cumsum() / df['score'].sum() <= 0.9]
Now df looks like
city score
bangkok 60
kl 20
sydney 10
I believe you need to compute each score's share of the total (divide by the sum), then filter by boolean indexing, and only then sort_values, for better performance on the already-filtered rows:
a = 0.9
df = df[df['score'].div(df['score'].sum()) >= 1 - a].sort_values('score', ascending=False)
Or:
df = df[df['score'].div(df['score'].sum()) >= 0.1].sort_values('score', ascending=False)
print (df)
city score
0 bangkok 60
1 kl 20
2 sydney 10
Detail:
print (df['score'].div(df['score'].sum()))
0 0.60
1 0.20
2 0.10
3 0.05
4 0.05
Name: score, dtype: float64
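For a quick end-to-end check, here is a minimal runnable sketch of the cumulative-sum approach from the first answer, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({'city': ['bangkok', 'kl', 'sydney', 'melbourne', 'dhaka'],
                   'score': [60, 20, 10, 5, 5]})

# sort descending so the cumulative sum accumulates the largest scores first
df = df.sort_values('score', ascending=False)

# keep rows while the running share of the total is within the 90% limit
df = df[df['score'].cumsum() / df['score'].sum() <= 0.9]
print (df)
      city  score
0  bangkok     60
1       kl     20
2   sydney     10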
Related
I have a year-wise dataframe where each year has three parameters: year, type and value. I'm trying to calculate the percentage of taken vs empty. For example, year 2014 has a total of 50 empty and 50 taken, so it is 50% empty and 50% taken, as shown in final_df.
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
1 2013 ... ...
2 2012 ... ...
Should I shift cells up and do the percentage calculation, or is there another method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year', columns='type', values='value')
          .sort_values(by='year', ascending=False)
          .reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = res['Empty'] + res['Taken']
for col in ['Empty', 'Taken']:
    res[col + '_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
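The loop can also be replaced with a single vectorized div call over both columns; a sketch, assuming res from above:
perc = res[['Empty', 'Taken']].div(res[['Empty', 'Taken']].sum(axis=1), axis=0)
res = res.join(perc.add_suffix('_perc'))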
As @sophods pointed out, you can use pivot_table to rearrange your dataframe. To add to that answer: I think you're after the percentage, so I suggest you keep the 'Total' record and then apply your calculation:
#pivot your data
res = (df.pivot_table(index='year',columns='type',values='value')).reset_index()
#calculate percentages of empty and taken
res['Empty'] = res['Empty']/res['Total']
res['Taken'] = res['Taken']/res['Total']
#final dataframe
res = res[['year', 'Empty', 'Taken']]
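which should give (note that pivot_table sorts year ascending by default, so add the sort_values step from above if you want 2014 first):
year Empty Taken
0 2012 0.90 0.10
1 2013 0.05 0.95
2 2014 0.50 0.50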
You can filter out the records having Empty or Taken in type, then groupby year and apply func. Inside func, set type as the index, look up the required values and calculate the percentages. The x passed to func is a dataframe with the type and value columns, holding one group's data.
def func(x):
    x = x.set_index('type')
    total = x['value'].sum()
    return [(x.loc['Empty', 'value'] / total) * 100,
            (x.loc['Taken', 'value'] / total) * 100]

temp = (df[df['type'].isin({'Empty', 'Taken'})]
        .groupby('year')[['type', 'value']]
        .apply(func))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0
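A variant sketch of the same idea: if func returns a Series indexed by type, groupby.apply assembles the final DataFrame directly and the values.tolist() conversion is not needed (assuming the same df):
def func(x):
    v = x.set_index('type')['value']
    return v.div(v.sum()).mul(100)

res = (df[df['type'].isin({'Empty', 'Taken'})]
       .groupby('year')[['type', 'value']]
       .apply(func))
print (res)
type Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0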
I'm working with a DataFrame with categorical values; my input DataFrame is below:
df
Age Gender Smoke
18 Female Yes
24 Female No
18 Female Yes
34 Male Yes
34 Male No
I want to group my DataFrame by the columns "Age" and "Gender", where an "Occurrence" column counts the frequency of each combination; then I want to create two more columns, "Smoke Yes" with the share of smokers in each group and "Smoke No" with the share of non-smokers:
Age Gender Occurrence Smoke Yes Smoke No
18 Woman 2 0.50 0.50
24 Woman 1 0 1
34 Man 2 0.5 0.5
In order to do that, I used the following code
#Group and sort
df1=df.groupby(['Age', 'Gender']).size().reset_index(name='Frequency').sort_values('Frequency', ascending=False)
#Delete index
df1.reset_index(drop=True,inplace=True)
However, the df['Smoke'] column disappears, so I can't continue my calculation. Does anyone have an idea what I can do to obtain something like the output DataFrame?
You can use groupby and value_counts with normalize=True to return the percentage share, then unstack. Using a dictionary, you can also replace the Gender column values to match the desired output.
d = {"Female":"Woman","Male":"Man"}
u = (df.groupby(['Age', 'Gender'])['Smoke'].value_counts(normalize=True)
       .unstack().fillna(0))
s = df.groupby("Age")['Gender'].value_counts()
u.columns = u.columns.name+"_"+u.columns
out = (u.rename_axis(None, axis=1)
        .assign(Occurance=s)
        .reset_index()
        .replace({"Gender": d}))
print(out)
Age Gender Smoke_No Smoke_Yes Occurance
0 18 Woman 0.0 1.0 2
1 24 Woman 1.0 0.0 1
2 34 Man 0.5 0.5 2
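An alternative sketch using pd.crosstab, which can normalize per row directly; it assumes the same df and reuses the mapping d from above:
share = pd.crosstab([df['Age'], df['Gender']], df['Smoke'], normalize='index')
share.columns = ['Smoke_' + c for c in share.columns]  # Smoke_No, Smoke_Yes
share['Occurance'] = df.groupby(['Age', 'Gender']).size()
out = share.reset_index().replace({'Gender': d})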
I have a dataframe df1 like:
cycleName quarter product qty price sell/buy
0 2020 q3 wood 10 100 sell
1 2020 q3 leather 5 200 buy
2 2020 q3 wood 2 200 buy
3 2020 q4 wood 12 40 sell
4 2020 q4 leather 12 40 sell
5 2021 q1 wood 12 80 sell
6 2021 q2 leather 12 90 sell
And another dataframe df2, as below, which has the unique products of df1:
product currentValue
0 wood 20
1 leather 50
I want to create a new column in df2, called income, based on calculations over the df1 data. For example, if the product is wood, income2020 is built from the rows where cycleName is 2020: if sell/buy is sell, add qty * price, else subtract qty * price.
product currentValue income2020
0 wood 20 10 * 100 - 2 * 200 + 12 * 40 (=1080)
1 leather 50 -5 * 200 + 12 * 40 (= -520)
I have a problem statement in Python which I am trying to solve using pandas dataframes, which I am very new to. I am not able to understand how to create that column in df2 based on different conditions on df1.
You can map sell to 1 and buy to -1 using pd.Series.map, then multiply the columns qty, price and sell/buy together using df.prod. To keep only the 2020 cycleName values, use df.query, then group by product and take the sum using GroupBy.sum:
df_2020 = df.query('cycleName == 2020').copy()  # equivalent: df[df['cycleName'] == 2020].copy()
df_2020['sell/buy'] = df_2020['sell/buy'].map({'sell': 1, 'buy': -1})
df_2020[['qty', 'price', 'sell/buy']].prod(axis=1).groupby(df_2020['product']).sum()
product
leather -520
wood 1080
dtype: int64
Note:
Use .copy, else you would get a SettingWithCopyWarning.
To maintain the original product order, use sort=False in df.groupby:
(df_2020[['qty', 'price', 'sell/buy']]
    .prod(axis=1)
    .groupby(df_2020['product'], sort=False)
    .sum())
product
wood 1080
leather -520
dtype: int64
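To actually land the sums in df2 as an income2020 column, as the question asks, here is one sketch (assuming df1 and df2 named as in the question) that maps the grouped result onto df2['product']:
income = (df1.query('cycleName == 2020')
             .assign(signed=lambda d: d['qty'] * d['price'] *
                     d['sell/buy'].map({'sell': 1, 'buy': -1}))
             .groupby('product')['signed'].sum())
df2['income2020'] = df2['product'].map(income)
print (df2)
product currentValue income2020
0 wood 20 1080
1 leather 50 -520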
I have the following dataframe (dummy data):
score GDP
country
Bangladesh 6 12
Bolivia 4 10
Nigeria 3 9
Pakistan 2 3
Ghana 1 3
India 1 3
Algeria 1 3
And I want to split it into two groups based on GDP, the condition being GDP less than 9, and sum the score of each group:
sum_score
country
rich 13
poor 5
You can use np.where to make your rich and poor categories, then groupby that category and get the sum:
import numpy as np

df['country_cat'] = np.where(df.GDP < 9, 'poor', 'rich')
df.groupby('country_cat')['score'].sum()
country_cat
poor 5
rich 13
You can also do the same in one step, by not creating the extra column for the category (but IMO the code becomes less readable):
df.groupby(np.where(df.GDP < 9, 'poor', 'rich'))['score'].sum()
You can aggregate by the boolean mask and then only rename the index (True corresponds to GDP < 9, i.e. poor):
a = df.groupby(df.GDP < 9)['score'].sum().rename({True:'poor', False:'rich'})
print (a)
GDP
rich 13
poor 5
Name: score, dtype: int64
Last, for a one-column DataFrame, add Series.to_frame:
df = a.to_frame('sum_score')
print (df)
sum_score
GDP
rich 13
poor 5
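If you later need more than two buckets, pd.cut generalizes the same pattern; a sketch, assuming numpy is imported as np:
cats = pd.cut(df['GDP'], bins=[-np.inf, 9, np.inf], right=False, labels=['poor', 'rich'])
df.groupby(cats)['score'].sum()
GDP
poor 5
rich 13
Name: score, dtype: int64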
I need to calculate the compound interest rate. Let's say I have a DataFrame like this:
days
1 10
2 15
3 20
What I want to get is this (supposing the interest rate is 1% every day):
days interest rate
1 10 10,46%
2 15 16,10%
3 20 22,02%
My code is as follows:
def inclusao_juros(x):
    dias = df_arrumada_4['Prazo Medio']
    return ((1.0009723) ^ dias) - 1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)
What should I do? Thanks!
I think you need numpy.power (note that ^ is bitwise XOR in Python, not exponentiation; use ** or np.power):
import numpy as np

df['new'] = np.power(1.01, df['days']) - 1
print (df)
days new
1 10 0.104622
2 15 0.160969
3 20 0.220190
IIUC
pd.Series([1.01]*len(df)).pow(df.reset_index().days,0).sub(1)
Out[695]:
0 0.104622
1 0.160969
2 0.220190
dtype: float64
Jez's suggestion, keeping the original index: pd.Series([1.01]*len(df), index=df.index).pow(df.days, 0).sub(1)
Or using your apply:
df.days.apply(lambda x: 1.01**x -1)
Out[697]:
1 0.104622
2 0.160969
3 0.220190
Name: days, dtype: float64
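All of these vectorize the same formula, (1 + r)**days - 1. One more equivalent sketch using Series.rpow, which needs neither numpy nor apply:
df['new'] = df['days'].rpow(1.01).sub(1)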