identify records that make up 90% of total - python

I have a report that identifies key drivers of an overall number/trend. I would like to automate the functionality to list/identify the underlying records based on a percentage of that number. For example, if the net change for sales of widgets in the South region is -5,000.00, but there are positives and negatives, I would like to identify the underlying drivers that make up at least ~90% (-4,500.00) of that -5,000.00 total, from largest to smallest.
data
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
for South, the total sales is -3200. I would like to identify/list the drivers that make up at least 90% of this move (in descending order). 90% of -3200 is -2880, and the directional moves/sales for South offices 3 & 4 sum to -3000, so the output for this request would be:
region OfficeLocation sales
South 3 -1000
South 4 -2000
for North, the total sales is +1800. I would like to identify/list the drivers that make up at least 90% of this move (in descending order). 90% of 1800 is 1620, and the directional moves/sales for North offices 33 & 44 sum to 1800, so the output for this request would be:
region OfficeLocation sales
North 33 1000
North 44 800
Dataset above has both positive and negative trends for south/north. Any help you can provide would be greatly appreciated!

As mentioned in the comment, it isn't clear what to do in the 'North' case as the sum is positive there, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
If, in the positive case, you want to find as few elements as possible that together make up 90% of the sum of the sales, the above solution can be adapted as follows:
def is_driver(group):
    s = group.sum()
    if s > 0:
        group *= -1
        s *= -1
    a = group.sort_values().cumsum() > 0.9*s
    return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
Note that in the case of a tie, only one element is picked out.
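To sanity-check the approach above, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies `is_driver` (note the `group_keys=False`, which newer pandas versions need so the boolean mask keeps the original index when passed back to `df[...]`):

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'region': ['South'] * 8 + ['North'] * 6,
    'OfficeLocation': [1, 2, 3, 4, 5, 6, 7, 8, 11, 22, 33, 44, 55, 66],
    'sales': [-500, 300, -1000, -2000, 300, -700, -400, 800,
              300, -400, 1000, 800, 900, -800],
})

def is_driver(group):
    # Flip the sign when the group total is positive so the cutoff
    # logic always works against a negative running total.
    s = group.sum()
    if s > 0:
        group = group * -1
        s = s * -1
    a = group.sort_values().cumsum() > 0.9 * s
    return group <= group.loc[a.idxmin()]

# group_keys=False keeps the original row index, so the result is a
# boolean mask aligned with df.
drivers = df[df.groupby('region', group_keys=False)['sales'].apply(is_driver)]
print(drivers)
```

As in the answer's output, this selects South offices 3 and 4, and North offices 33 and 55 (the largest directional moves after the sign flip).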

Related

Create multiple new pandas column based on other columns in a loop

Assuming I have the following toy dataframe, df:
Country Population Region HDI
China 100 Asia High
Canada 15 NAmerica V.High
Mexico 25 NAmerica Medium
Ethiopia 30 Africa Low
I would like to create new columns based on the population, region, and HDI of Ethiopia in a loop. I tried the following method, but it is time-consuming when a lot of columns are involved.
df['Population_2'] = df['Population'][df['Country'] == "Ethiopia"]
df['Region_2'] = df['Region'][df['Country'] == "Ethiopia"]
df['Population_2'].fillna(method='ffill')
My final DataFrame df should look like:
Country Population Region HDI Population_2 Region_2 HDI_2
China 100 Asia High 30 Africa Low
Canada 15 NAmerica V.High 30 Africa Low
Mexico 25 NAmerica Medium 30 Africa Low
Ethiopia 30 Africa Low 30 Africa Low
How about this?
for col in ['Population', 'Region', 'HDI']:
    df[col + '_2'] = df.loc[df.Country=='Ethiopia', col].iat[0]
I don't quite understand the broader point of what you're trying to do, and if Ethiopia could have multiple values the solution might be different. But this works for the problem as you presented it.
You can use:
# select Ethiopia row and add suffix "_2" to the columns (except Country)
s = (df.drop(columns='Country')
.loc[df['Country'].eq('Ethiopia')].add_suffix('_2').squeeze()
)
# broadcast as new columns
df[s.index] = s
output:
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
You can use assign, also assuming that you have only one row corresponding to Ethiopia:
d = dict(zip(df.columns.drop('Country').map('{}_2'.format),
df.set_index('Country').loc['Ethiopia']))
df = df.assign(**d)
print(df):
Country Population Region HDI Population_2 Region_2 HDI_2
0 China 100 Asia High 30 Africa Low
1 Canada 15 NAmerica V.High 30 Africa Low
2 Mexico 25 NAmerica Medium 30 Africa Low
3 Ethiopia 30 Africa Low 30 Africa Low
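Putting the `assign` approach together as a runnable sketch (rebuilding the toy dataframe from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'Canada', 'Mexico', 'Ethiopia'],
    'Population': [100, 15, 25, 30],
    'Region': ['Asia', 'NAmerica', 'NAmerica', 'Africa'],
    'HDI': ['High', 'V.High', 'Medium', 'Low'],
})

# Map every non-Country column name to its "_2" counterpart and take
# Ethiopia's row as the scalar values to broadcast down each new column.
d = dict(zip(df.columns.drop('Country').map('{}_2'.format),
             df.set_index('Country').loc['Ethiopia']))
df = df.assign(**d)
print(df)
```

Because each value in `d` is a scalar, `assign` broadcasts it to every row, which is exactly the "fill the whole column with Ethiopia's value" behaviour the question asks for.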

How do I print a simple Python statement based on Pandas dataframe?

Date       Train Number Station 1 Station 2 Equipment Available?
2022-06-16 1111         North     Central   Y
2022-06-20 1111         North     Central   Y
2022-06-01 2222         North     South     Y
2022-06-02 2222         North     South     Y
2022-06-03 2222         North     South     Y
2022-06-04 2222         North     South     Y
2022-06-05 2222         North     South     Y
2022-06-06 2222         North     South     Y
2022-06-07 2222         North     South     Y
2022-06-08 2222         North     South     Y
I have a Pandas dataframe that looks like the one above that is sorted by Train Number and then Date. I would like to print a simple Python statement that says:
"For Train Number 1111 North to Central, we have equipment available on June 16th and June 20th.
For Train Number 2222 North to South, we have equipment available from June 1st to June 8th."
How am I able to do this?
I've made a little function which you can call on whatever df you want.
I find this solution more readable and flexible for further requests.
def equip_avail(df):
    for i in df['Train Number'].unique():
        date_start = df.Date.loc[(df['Train Number']==i)].min()
        date_end = df.Date.loc[(df['Train Number']==i)].max()
        from_start = df.Station1.loc[(df['Train Number']==i)].values[0]
        to_end = df.Station2.loc[(df['Train Number']==i)].values[0]
        print(f'For Train Number {i} {from_start} to {to_end}, we have equipment available from {date_start} to {date_end}.')
Then you call it like this:
equip_avail(df)
Result:
For Train Number 1111 North to Central, we have equipment available from 2022-06-16 to 2022-06-20.
For Train Number 2222 North to South, we have equipment available from 2022-06-01 to 2022-06-08.
You could get the min and max Date values for each train with a groupby, dedupe the DataFrame to get the other columns (as they are repeated), and then print the results with some datetime formatting:
df.loc[:, 'Date'] = pd.to_datetime(df['Date'])
g = df.groupby(['Train Number']).agg(date_min=pd.NamedAgg(column='Date', aggfunc='min'), date_max=pd.NamedAgg(column='Date', aggfunc='max'))
df_deduped = df.loc[:, 'Train Number':].drop_duplicates().set_index('Train Number')
g = g.join(df_deduped, how='inner')
for index, values in g.reset_index().iterrows():
    print(f'For Train Number {values["Train Number"]}, {values["Station 1"]} to {values["Station 2"]}, we have equipment available from {values["date_min"].strftime("%b %d")} to {values["date_max"].strftime("%b %d")}')
The output is -
For Train Number 1111, North to Central, we have equipment available from Jun 16 to Jun 20
For Train Number 2222, North to South, we have equipment available from Jun 01 to Jun 08
Here is one way to do it: group by Train Number, Station 1, and Station 2, taking both the min and max of the dates, then print the messages from the resulting dataframe.
df2 = df.groupby(['TrainNumber', 'Station1', 'Station2'])['Date'].aggregate(['min', 'max']).reset_index()
for idx, row in df2.iterrows():
    print("For Train Number {0} {1} to {2}, we have equipment available on {3} and {4}".format(
        row[0], row[1], row[2], row[3], row[4]))
For Train Number 1111 North to Central, we have equipment available on 2022-06-16 and 2022-06-20
For Train Number 2222 North to South, we have equipment available on 2022-06-01 and 2022-06-08
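The ideas in the answers above can be combined into one compact sketch using named aggregation and f-string date formatting (column names taken from the question; only a few sample rows are rebuilt here):

```python
import pandas as pd

# Abbreviated sample data from the question.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-06-16', '2022-06-20',
                            '2022-06-01', '2022-06-08']),
    'Train Number': [1111, 1111, 2222, 2222],
    'Station 1': ['North', 'North', 'North', 'North'],
    'Station 2': ['Central', 'Central', 'South', 'South'],
})

# One row per train with its first and last available date.
summary = (df.groupby(['Train Number', 'Station 1', 'Station 2'], as_index=False)
             .agg(start=('Date', 'min'), end=('Date', 'max')))

lines = [
    f"For Train Number {r['Train Number']} {r['Station 1']} to {r['Station 2']}, "
    f"we have equipment available from {r['start']:%B %d} to {r['end']:%B %d}."
    for _, r in summary.iterrows()
]
print('\n'.join(lines))
```

`%B %d` inside the f-string delegates to `strftime`, giving "June 16"-style dates without a separate formatting pass.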

How to separate a combined column, but with incongruent data

I'm preparing for a new job where I'll be receiving data submissions of varying quality; oftentimes dates/chars/etc. are combined together nonsensically and must be separated before analysis. I'm thinking ahead about how this might be solved.
Using a fictitious example below, I combined region, rep, and product together.
file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 EastShirlenePencil
1 3 South Anderson Folder 17 69 SouthAndersonFolder
2 3 West Shelli Folder 17 185 WestShelliFolder
3 3 South Damion Binder 30 159 SouthDamionBinder
4 3 West Shirlene Stapler 25 41 WestShirleneStapler
Assuming no other data, the question is, how can the 'combine' column be split up?
Many thanks in advance!
If you want space between the strings, you can do:
df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 East Shirlene Pencil
1 3 South Anderson Folder 17 69 South Anderson Folder
2 3 West Shelli Folder 17 185 West Shelli Folder
3 3 South Damion Binder 30 159 South Damion Binder
4 3 West Shirlene Stapler 25 41 West Shirlene Stapler
Or: if you want to split the already combined string:
import re
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine separated
0 3 East Shirlene Pencil 5 71 EastShirlenePencil [East, Shirlene, Pencil]
1 3 South Anderson Folder 17 69 SouthAndersonFolder [South, Anderson, Folder]
2 3 West Shelli Folder 17 185 WestShelliFolder [West, Shelli, Folder]
3 3 South Damion Binder 30 159 SouthDamionBinder [South, Damion, Binder]
4 3 West Shirlene Stapler 25 41 WestShirleneStapler [West, Shirlene, Stapler]
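If you want separate columns rather than a list column, the same capital-letter assumption can be expressed with `str.extract` and named groups (the group names `Region`/`SalesRep`/`Product` here are illustrative; this only works while every token starts with exactly one capital letter):

```python
import pandas as pd

df = pd.DataFrame({'combine': ['EastShirlenePencil', 'SouthAndersonFolder']})

# Each named group becomes its own column in the result.
parts = df['combine'].str.extract(
    r'(?P<Region>[A-Z][a-z]*)(?P<SalesRep>[A-Z][a-z]*)(?P<Product>[A-Z][a-z]*)')
df = df.join(parts)
print(df)
```

This is a sketch under the question's "no other data" assumption; real submissions with all-caps acronyms or multi-word names would break the pattern and need a lookup table instead.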

Group by and find top n value_counts pandas

I have a dataframe of taxi data with two columns that looks like this:
Neighborhood Borough Time
Midtown Manhattan X
Melrose Bronx Y
Grant City Staten Island Z
Midtown Manhattan A
Lincoln Square Manhattan B
Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the most number of pickups. I tried this:
df['Neighborhood'].groupby(df['Borough']).value_counts()
Which gives me something like this:
borough
Bronx High Bridge 3424
Mott Haven 2515
Concourse Village 1443
Port Morris 1153
Melrose 492
North Riverdale 463
Eastchester 434
Concourse 395
Fordham 252
Wakefield 214
Kingsbridge 212
Mount Hope 200
Parkchester 191
......
Staten Island Castleton Corners 4
Dongan Hills 4
Eltingville 4
Graniteville 4
Great Kills 4
Castleton 3
Woodrow 1
How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren't helpful to my case.
I think you can use nlargest - you can change 1 to 5:
s = df['Neighborhood'].groupby(df['Borough']).value_counts()
print s
Borough
Bronx Melrose 7
Manhattan Midtown 12
Lincoln Square 2
Staten Island Grant City 11
dtype: int64
print s.groupby(level=[0,1]).nlargest(1)
Bronx Bronx Melrose 7
Manhattan Manhattan Midtown 12
Staten Island Staten Island Grant City 11
dtype: int64
(The group levels are specified explicitly because additional index columns were getting created otherwise.)
You can do this in a single line by slightly extending your original groupby with 'nlargest':
>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5)
Borough Neighborhood Neighborhood
Bronx Melrose Melrose 1
Manhattan Midtown Midtown 1
Manhatten Lincoln Square Lincoln Square 1
Midtown Midtown 1
Staten Island Grant City Grant City 1
dtype: int64
Solution to get the top n from every group:
df.groupby(['Borough']).Neighborhood.value_counts().groupby(level=0, group_keys=False).head(5)
.value_counts().nlargest(5) in other answers only gives the overall top 5, which doesn't make sense to me either.
group_keys=False avoids a duplicated index.
Because value_counts() has already sorted, head(5) is all that's needed.
df['Neighborhood'].groupby(df['Borough']).value_counts().head(5)
head() gets the top 5 rows in a data frame.
Try this one (just change the number in head() to your choice):
# top 3 : total counts of 'Neighborhood' in each Borough
Z = df.groupby('Borough')['Neighborhood'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
You can also try the code below to get only the top 10 values of the value counts.
'country_code' and 'raised_amount_usd' are column names.
groupby_country_code=master_frame.groupby('country_code')
arr=groupby_country_code['raised_amount_usd'].sum().sort_index()[0:10]
print(arr)
[0:10] slices indices 0 through 9 of the array; you can choose your own slicing option.
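To make the top-n-per-group pattern concrete, here is a minimal sketch with a tiny dataframe (using head(1) so the expected result is easy to see):

```python
import pandas as pd

df = pd.DataFrame({
    'Borough': ['Manhattan'] * 3 + ['Bronx'] * 2,
    'Neighborhood': ['Midtown', 'Midtown', 'Lincoln Square',
                     'Melrose', 'Melrose'],
})

# value_counts() already sorts within each Borough, so head(n) on the
# second groupby returns the n most frequent neighborhoods per borough.
top1 = (df.groupby('Borough')['Neighborhood']
          .value_counts()
          .groupby(level=0, group_keys=False)
          .head(1))
print(top1)
```

The result is a Series with a (Borough, Neighborhood) MultiIndex, one row per borough: Melrose for the Bronx and Midtown for Manhattan, each with count 2.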

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this, note that Rank has to be generated first:
In [140]:
df['Rank'] = (-1*df).groupby('topic').score.transform(np.argsort)
df['New_str'] = df.word + df.score.apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
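On current pandas, the same reshape can be written with `sort_values` plus `groupby.cumcount` for the rank, avoiding the `np.argsort` transform. A sketch with an abbreviated version of the data above:

```python
import pandas as pd

topic_data = pd.DataFrame({
    'topic': [0, 0, 1, 1],
    'word': ['Automobile', 'Vehicle', 'Sport', 'Association_football'],
    'score': [0.063986, 0.017457, 0.032938, 0.025324],
})

# Build the "word (score)" label, order by descending score within each
# topic, number the rows per topic, then pivot Rank x topic.
topic_data['label'] = (topic_data['word']
                       + topic_data['score'].map(' ({:.2f})'.format))
topic_data = topic_data.sort_values(['topic', 'score'],
                                    ascending=[True, False])
topic_data['Rank'] = topic_data.groupby('topic').cumcount()
table = topic_data.pivot(index='Rank', columns='topic', values='label')
print(table)
```

`cumcount` makes the rank explicit (0, 1, ... per topic) instead of relying on `argsort` happening to equal the rank when each topic is already sorted.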
