Changing value to be the maximum value per group - python

I have this kind of structure:
country  product  installs  purchases
US       T        100       100
US       A        5         5
AU       T        500       500
AU       A        20        20
I am trying to get:
country  product  installs  purchases
US       T        100       100
US       A        100       5
AU       T        500       500
AU       A        500       20
Each value in the installs column needs to become the installs value of the row where the product column is T, within the same country.
I tried:
exp.groupby(['country','product'])['date_install_'] = max(exp.groupby(['country','product'])['date_install_'])
This does not work, and I am kind of lost. How can I achieve the result?

Find the rows where the product is T, group by country, and get the maximum of the installs. Use this as a map to replace the values in installs:
df['installs'] = df['country'].map(df[df['product'] == 'T'].groupby('country')['installs'].max())
Result:
  country product  installs  purchases
0      US       T       100        100
1      US       A       100          5
2      AU       T       500        500
3      AU       A       500         20
For clarity, this is what is being passed to map:
>>> df[df['product'] == 'T'].groupby('country')['installs'].max()
country
AU    500
US    100
Name: installs, dtype: int64
So you can use it like a dict with the index (country) as a key and the installs as a value.
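If you prefer to avoid the separate lookup step, a roughly equivalent sketch masks the non-T rows and broadcasts each country's T value with groupby plus transform (this assumes every country has a T row; otherwise the int cast would fail on NaN):
t_vals = df['installs'].where(df['product'] == 'T')  # NaN everywhere except the T rows
df['installs'] = t_vals.groupby(df['country']).transform('max').astype(int)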

If the T value is always the max value per country, you can use an auxiliary df that holds the max value of installs per country, then merge it with the original df so the max value replaces the installs value:
aux = df.groupby('country').installs.max().reset_index()
df.drop('installs', axis=1).merge(aux, how='left', on='country')
You reset the index in the first line so that country becomes a regular column you can merge on.
You drop installs before the merge because the aux df already holds the installs values you want, under the same column name.
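Put together as a runnable sketch on the sample data (note the merged installs column ends up last, so reorder if you need the original column order):
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'US', 'AU', 'AU'],
    'product': ['T', 'A', 'T', 'A'],
    'installs': [100, 5, 500, 20],
    'purchases': [100, 5, 500, 20],
})

# max installs per country (the T rows, given the assumption above)
aux = df.groupby('country').installs.max().reset_index()
# replace installs with the per-country max via a left merge
out = df.drop('installs', axis=1).merge(aux, how='left', on='country')
print(out)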

Related

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products:
  Product  2022-04-01  2022-05-01  2022-06-01  2022-07-01  2022-08-01  2022-09-01  AvgMonthlySales  Current Inventory
1 BE37908        1500        1400        1200        1134        1110        1004       150.208333               1500
2 BE37907        2000        1800        1800        1540        1300        1038       189.562500               2000
3 DE37907        5467        5355        5138        4926        4735        4734       114.729167               5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while AvgMonthlySales is the mean of actual, past sales for that specific product. Current Inventory just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((lead time in weeks / 4) + 1) * AvgMonthlySales:
  Product  AvgMonthlySales  Lead time in weeks  Security Stock
1 BE37908       250.208333                  16      1251.04166
2 BE37907       189.562500                  24       1326.9375
3 DE37907       114.729167                  10      401.552084
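For example, for BE37908: ((16 / 4) + 1) * 250.208333 = 5 * 250.208333 ≈ 1251.04, which matches the Security Stock column.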
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
  Product Last Date Above Security Stock
1 BE37908                     2022-05-01
2 BE37907                     2022-07-01
3 DE37907                            NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# create a list of the date columns
cols = [col for col in df.columns if col.startswith('20')]
# select cols, apply df.gt row-wise against the security stock, sum, and subtract 1
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# map the positions back to dates from cols
# if the value == len(cols)-1, *all* values will have been greater, so: np.nan
idx = [cols[i] if i != len(cols) - 1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
  Product Last Date Above Security Stock
1 BE37908                     2022-05-01
2 BE37907                     2022-07-01
3 DE37907                            NaN
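For reference, a minimal reproducible setup for the snippet above (a sketch assuming df and df2 share the same row order/index, as in the question):
import pandas as pd

df = pd.DataFrame({
    'Product': ['BE37908', 'BE37907', 'DE37907'],
    '2022-04-01': [1500, 2000, 5467],
    '2022-05-01': [1400, 1800, 5355],
    '2022-06-01': [1200, 1800, 5138],
    '2022-07-01': [1134, 1540, 4926],
    '2022-08-01': [1110, 1300, 4735],
    '2022-09-01': [1004, 1038, 4734],
    'AvgMonthlySales': [150.208333, 189.562500, 114.729167],
    'Current Inventory': [1500, 2000, 5467],
})
df2 = pd.DataFrame({'Security Stock': [1251.04166, 1326.9375, 401.552084]})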

Finding earliest date after groupby a specific column

I have a dataframe that look like below.
id  name   tag  location  date
1   John   34   FL        01/12/1990
1   Peter  32   NC        01/12/1990
1   Dave   66   SC        11/25/1990
1   Mary   12   CA        03/09/1990
1   Sue    29   NY        07/10/1990
1   Eve    89   MA        06/12/1990
:   :      :    :         :
n   John   34   FL        01/12/2000
n   Peter  32   NC        01/12/2000
n   Dave   66   SC        11/25/1999
n   Mary   12   CA        03/09/1999
n   Sue    29   NY        07/10/1998
n   Eve    89   MA        06/12/1997
I need to find the location information based on the id column, with one condition: only the earliest date counts. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. I then apply this to every id group to get the top 3 locations. I have written the code to do this for me.
# Get the earliest date per id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
# Count the occurrences of each location
df_ear['location'].value_counts()
The code works perfectly fine, but it cannot return more than one location per group (because of the first line) when several rows share the same earliest date; for example, the id=1 group only returns FL instead of FL and NC. How can I fix my code to include all rows that tie for the earliest date?
Thanks!
Use GroupBy.transform to get the minimal date per group as a Series aligned with the original rows, so you can compare it against the date column in boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
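df_ear now keeps every row that ties for the earliest date within its id group (both FL and NC for id=1), so the count from the question works unchanged; taking the top 3 is then:
# count locations across the earliest-date rows and keep the top 3
df_ear['location'].value_counts().head(3)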

dataframe count frequency of a string in a column

I have a csv that contains 1000 rows, which I read in Python to build a new dataframe with 3 columns: noOfPeople, Description, and Location.
My final df will be like this one:
id  companyName  noOfPeople  Description  Location
1   comp1        75          tech         USA
2   comp2        22          fashion      USA
3   comp3        70          tech         USA
I want to write code that stops once I have 200 rows where noOfPeople is greater than or equal to 70, and returns all the remaining rows empty. So the code counts the rows where noOfPeople >= 70, and once there are 200 such rows, it stops.
Can someone help?
df[df['noOfPeople'] >= 70].iloc[:200]
Use head or iloc to select the first 200 values and then get the max:
print(df1['noOfPeople'].iloc[:200].max())
And add whatever filter you need on top.
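If the intent is to truly stop at the 200th qualifying row of the original frame rather than filter first, one sketch uses a running count of matches (df here stands for the full 1000-row frame):
qualifies = df['noOfPeople'] >= 70
# running count of qualifying rows, scanning top to bottom
seen = qualifies.cumsum()
# keep qualifying rows up to and including the 200th match
first_200 = df[qualifies & (seen <= 200)]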

Python Dataframe: Dropping duplicates based on certain conditions

Dataframe with duplicate Shop IDs where some Shop IDs occurred twice and some occurred thrice:
I only want to keep unique Shop IDs, based on the shortest Shop Distance assigned to its Area.
    Area  Shop Name  Shop Distance  Shop ID
0   AAA   Ly         86             5d87790c46a77300
1   AAA   Hi         230            5ce5522012138400
2   BBB   Hi         780            5ce5522012138400
3   CCC   Ly         450            5d87790c46a77300
...
91  MMM   Ju         43             4f76d0c0e4b01af7
92  MMM   Hi         1150           5ce5522012138400
...
Using pandas drop_duplicates drops the duplicate rows, but its condition is based on the first/last occurrence of each Shop ID, which does not allow me to pick by distance:
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep= 'first')
I also tried to group by Shop ID and then sort, but the sort returns a Duplicates error:
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
So far I have tried up to this stage:
# filter all the duplicated rows into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep=False)]
# count how many times each Shop ID occurs
mask = df_toclean['Shop ID'].value_counts()
# Shop IDs that occurred 2 times
shop_2 = mask[mask == 2].index
# Shop IDs that occurred 3 times
shop_3 = mask[mask == 3].index
# shops within a radius of 750
dist_1 = df_toclean['Shop Distance'] <= 750
# all the Shop IDs that appeared twice and are within the radius
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
I think I'm doing it the long way and still haven't figured out how to drop the duplicates. Does anyone know how to solve this in a shorter way? I'm new to Python, thanks for helping out!
Try first sorting the dataframe by distance, then dropping the duplicate shops.
df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
Or just as one chained expression (per the comment from @chmielcode).
df = (
    shops_df
    .sort_values('Shop Distance')
    .drop_duplicates(subset='Shop ID', keep='first')
    .reset_index(drop=True)  # Optional.
)
You can use idxmin:
df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
  Area  Shop Name  Shop Distance  Shop ID
0  AAA  Ly         86             5d87790c46a77300
2  BBB  Hi         780            5ce5522012138400
3  CCC  Ly         450            5d87790c46a77300
4  MMM  Ju         43             4f76d0c0e4b01af7
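Note that grouping by Area can still keep the same Shop ID twice (5d87790c46a77300 appears for both AAA and CCC above). If you want one row per Shop ID instead, the same idxmin pattern applies:
df.loc[df.groupby('Shop ID')['Shop Distance'].idxmin()]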

Pandas - Access a value in Column B based on a value in Column A

So I have a small set of data, TFR.csv taking the form:
Year  State1  State2  State3
1993       3       4       5
1994       6       2     1.4
...
I am supposed to determine when State 2's value is at its lowest (1994), and extract whatever value State 3 has in that year (1.4).
To do this, I've written a short filter:
State2Min = min(TFR['State2'])  # Determine the minimum value for State2
filt = (TFR['State2'] == State2Min)  # Filter to the row where that minimum occurs
TFR[filt]['State3']  # Apply the filter and return the State3 value for that row
It returns the right value I'm looking for, but with the index label at the front:
2    1.4
Name: NT, dtype: float64
I need to print this value of 1.4, so I'm trying to find a way to extract it out of this output.
Thanks for helping out.
Use pandas.DataFrame.set_index and idxmin:
df = df.set_index('Year')
df.loc[df['State2'].idxmin(), 'State3']
Output:
1.4
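Alternatively, keeping your original filter approach, you can pull the scalar out of the one-row Series with .iloc[0] (a sketch assuming the minimum is unique):
State2Min = TFR['State2'].min()
print(TFR.loc[TFR['State2'] == State2Min, 'State3'].iloc[0])  # 1.4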
