When using the np.select function, I'd like to refer to the values of the column I am assigning to, and set each value based on a condition.
Problem:
In the code below, my conditions refer to the 'Profit' column, the very column np.select is supposed to assign, but the result does not obey these conditions.
For a cell in the 'Profit' column to get the value '1 Month', the cell directly above it has to be 'Yes'.
Example code:
conditions = [
    (df['UserID'].shift() == df['UserID']) & (df['totalSold'] >= df['totalBought']),
    (df['UserID'].shift() == df['UserID']) & (df['totalSold'] >= df['totalBought']) & (df['Profit'].shift() == 'Yes'),
    (df['UserID'].shift() == df['UserID']) & (df['totalSold'] >= df['totalBought']) & (df['Profit'].shift() == '1 Month')]
values = ['Yes', '1 Month', '2 Month']
df['Profit'] = np.select(conditions, values, default = "No")
Input Dataframe:
id   month  totalBought  totalSold
aaa  Jan    200          300
aaa  Feb    250          300
aaa  March  100          350
bbb  Jan    100          150
Expected Output dataframe:
id   month  totalBought  totalSold  Profit
aaa  Jan    200          300        Yes
aaa  Feb    250          300        1 Month
aaa  March  100          350        2 Month
bbb  Jan    100          150        Yes
I believe you're looking for something a little more dynamic, like this:
ge_mask = df['totalSold'].diff().fillna(-1).ge(0)
df['Profit'] = np.select([ge_mask, df['totalSold'].ge(df['totalBought'])], [ge_mask.cumsum().astype(str) + ' Month', 'Yes'])
Output:
>>> df
id month totalBought totalSold Profit
0 aaa Jan 200 300 Yes
1 aaa Feb 250 300 1 Month
2 aaa March 100 350 2 Month
3 bbb Jan 100 150 Yes
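Note that the diff and cumsum above run over the whole frame rather than per id; it works for this sample because the bbb block starts with a drop in totalSold. If the comparison and the month counter should restart for every id, a grouped variant might look like this (a sketch, assuming the key column is named id as in the sample data):
import numpy as np

# True where totalSold did not drop compared with the previous row of the same id
ge_mask = df.groupby('id')['totalSold'].diff().fillna(-1).ge(0)

# cumulative count of such rows, restarted for every id
months = ge_mask.astype(int).groupby(df['id']).cumsum()

df['Profit'] = np.select(
    [ge_mask, df['totalSold'].ge(df['totalBought'])],
    [months.astype(str) + ' Month', 'Yes'],
    default='No',
)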
I have a data frame df
df =
Code Bikes Year
12 356 2020
4 378 2020
2 389 2020
35 378 2021
40 370 2021
32 350 2021
I would like to group the data frame by Year using df.groupby('Year'), find values in the df['Code'] column that are within 3 of each other, and out of those retain only the row with the maximum value in the df['Bikes'] column.
For instance, in the 2020 group, the values 4 and 2 are within 3 of each other since 4-2=2 ≤ 3, and since 389 (df['Bikes']), which corresponds to df['Code'] = 2, is the higher of the two, retain that row and drop the row where df['Code'] = 4.
The expected output for the given example:
Code Bikes Year
12 356 2020
2 389 2020
35 378 2021
40 370 2021
# sort so that close Code values within a Year sit next to each other
new_df = df.sort_values(['Year', 'Code'])
# gap to the previous Code within the same Year (NaN for the first row of each Year)
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
# start a new group whenever the gap exceeds 3 or a new Year begins
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
# within each group keep only the row with the largest Bikes value
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)
You can first sort the values by both columns with DataFrame.sort_values, then create groups by comparing the differences against the threshold 3 and taking a cumulative sum, and last use DataFrameGroupBy.idxmax to get the indices of the maximal Bikes per Year and helper Series:
df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()
df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print(df2)
Code Bikes Year
0 12 356 2020
2 2 389 2020
3 35 378 2021
4 40 370 2021
I'm trying to filter for the customers whose first bill was in 2019, and then calculate their number of bills in 2020 and 2021:
import pandas as pd

DF = pd.DataFrame(
    {"ID": ["1", "1", "2", "2", "2", "2", "3", "3"],
     "Year": ["2017", "2019", "2019", "2019", "2020", "2020", "2019", "2021"],
     "Price": ["10", "0", "200", "100", "4000", "3440", "3445", "2303"]}
)
The result I'm trying to get looks like this:
ID 2019 price_2019 2020 price_2020 2021 price_2021
2 2 300 2 7440 0 0
3 1 3445 0 0 1 2303
I can't find a function to do this calculation. Any idea how to make it work?
DF = DF.astype(int)  # the sample columns are strings; convert them to integers first
Filter for rows where the first Year is 2019:
year_2019 = DF.groupby('ID', sort = False).Year.transform('min') == 2019
filtered = DF.loc[year_2019]
Aggregate and rename:
filtered = (filtered
            .pivot_table(index='ID',
                         values='Price',
                         columns='Year',
                         aggfunc=['sum', 'size'],
                         fill_value=0)
            .rename(columns={'sum': 'price'}, level=0)
            )
# reorganize the column names to match expected output
filtered.columns = [f"{left}_{right}" if left == 'price' else right
                    for left, right in filtered]
filtered
price_2019 price_2020 price_2021 2019 2020 2021
ID
2 300 7440 0 2 2 0
3 3445 0 2303 1 0 1
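If the interleaved layout from the question (2019, price_2019, 2020, price_2020, ...) is preferred, the columns can be reordered afterwards. A small sketch that hardcodes the three years from the question (note that after the astype(int) step the count columns are the integer years, while the price columns are strings):
# interleave the count and price columns to match the layout in the question
order = [col for year in (2019, 2020, 2021) for col in (year, f'price_{year}')]
filtered = filtered[order]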
I have the following dataset
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year
I want to compare the customers in 2017 and 2018 and see if the store has lost customers.
I did two subsets corresponding to 2017 and 2018 :
Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]
I then tried this to compare the two:
Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn
And I get the following output:
True 2206
False 324
Name: Customer ID, dtype: int64
The problem is some customers may appear several times in the dataset since they made several orders.
I would like to get only unique customers (Customer ID is the only unique attribute) and then compare the two dataframes to see how many customers the store lost between 2017 and 2018.
To go further in the analysis, you can use pd.crosstab:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
At this point your dataframe looks like:
>>> out
OrderYear 2015 2016 2017 2018
Customer ID
AA-10315 4 1 4 2
AA-10375 2 4 4 5
AA-10480 1 0 10 1
AA-10645 6 3 8 1
AB-10015 4 0 2 0 # <- lost customer
... ... ... ... ...
XP-21865 10 3 9 6
YC-21895 3 1 3 1
YS-21880 0 5 0 7
ZC-21910 5 9 9 8
ZD-21925 3 0 5 1
Values are the number of orders per customer and year.
Now it's easy to get "lost customers":
>>> sum((out[2017] != 0) & (out[2018] == 0))
83
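If the actual IDs are needed rather than just the count, the same mask can be used to index the crosstab (a small sketch):
# customers who ordered in 2017 but not in 2018
lost_ids = out.index[(out[2017] != 0) & (out[2018] == 0)]
print(list(lost_ids))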
If only one comparison is required, I would use Python sets:
c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')
output:
lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138
Building on the crosstab suggestion from #Corralien:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int)              # 1 if the customer ordered that year, else 0
    .diff(axis=1)                   # change in that status versus the previous year
    .replace({0: 'remained', 1: 'new', -1: 'lost'})
    .apply(pd.Series.value_counts)  # count each label per year
)
output:
OrderYear 2015 2016 2017 2018
lost NaN 163 123 83
new NaN 141 191 138
remained NaN 489 479 572
You could just use normal sets to get unique customer ids for each year and then subtract them appropriately:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work you would need to drop the duplicates from the DataFrames, to make sure each customer appears only a single time:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64
I have a DataFrame with 4 fields: Location, Year, Week and Sales. I would like to know the difference in Sales between two years, preserving the granularity of the dataset. That is, for each Location, Year and Week, I would like to know the difference to the same week of another Year.
The following will generate a DataFrame with a similar structure:
import numpy as np
import pandas as pd

raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': np.random.randint(100, size=90)}
df = pd.DataFrame(raw_data)
Location Year Week Sales
A 2018 1 67
A 2018 2 93
A 2018 … 67
A 2019 1 49
A 2019 2 38
A 2019 … 40
B 2018 1 18
… … … …
Could you please show me what would be the best approach?
Thank you very much
You can do it using groupby and shift:
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
Spot checking it:
df[(df["Location"] == "A") & (df["Week"] == 1)]
Out[37]:
Location Year Week Sales Next_Years_Sales YoY_Sales_Difference
0 A 2018 1 99 10.0 -89.0
10 A 2019 1 10 3.0 -7.0
20 A 2020 1 3 NaN NaN
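Note that shift(-1) pairs each row with the following year's sales, and it assumes the rows are already ordered by Year within each Location/Week group. If you would rather express the difference relative to the previous year, a shift(1) variant of the same idea (a sketch under the same column names) would be:
# sort so that years are in order within each Location/Week group
df = df.sort_values(["Location", "Week", "Year"])
df["Prev_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(1)
df["YoY_Sales_Difference"] = df["Sales"] - df["Prev_Years_Sales"]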
I have a dictionary named c whose values are dataframes; each dataframe has 3 columns: 'year', 'month' & 'Tmed'. I want to calculate the monthly mean values of Tmed for each year, so I used
for i in range(22) : c[i].groupby(['year','month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018, for example, there should be only one row, but as you can see the dataframe has more than one.
I tried the code on a single dataframe and it gave the desired result:
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need to use a for loop because I have many dataframes. I can't figure out the issue; any help would be appreciated.
I don't see a reason why your code should fail. I tried the following and got the required results:
import numpy as np
import pandas as pd
def getRandomDataframe():
    rand_year = pd.DataFrame(np.random.randint(2010, 2011, size=(50, 1)), columns=list('y'))
    rand_month = pd.DataFrame(np.random.randint(1, 13, size=(50, 1)), columns=list('m'))
    rand_value = pd.DataFrame(np.random.randint(0, 100, size=(50, 1)), columns=list('v'))
    df = pd.DataFrame(columns=['year', 'month', 'value'])
    df['year'] = rand_year
    df['month'] = rand_month
    df['value'] = rand_value
    return df

def createDataFrameDictionary():
    _dict = {}
    length = 3
    for i in range(length):
        _dict[i] = getRandomDataframe()
    return _dict

c = createDataFrameDictionary()

for i in range(3):
    c[i] = c[i].groupby(['year', 'month'])['value'].mean().reset_index()

# Check results
print(c[0])
Please check whether the same year/month combination repeats across different dataframes, which could be the reason for the duplicates.
In your scenario, it may be a good idea to collect the groupby.mean results for each dataframe in another dataframe and do a groupby mean again on the new dataframe.
Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
    main_df = pd.concat([main_df, c[i].groupby(['year', 'month']).mean().reset_index()])
print(main_df.groupby(['year', 'month']).mean())
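If the goal is a single overall monthly mean across all the dataframes, it can be slightly more accurate to concatenate the raw rows first and aggregate once, since the code above averages per-dataframe means rather than the underlying observations. A sketch, assuming every dataframe in c has the 'year', 'month' and 'Tmed' columns from the question:
# pool all raw observations, then compute one monthly mean per year/month
all_data = pd.concat([c[i] for i in range(22)])
monthly_mean = all_data.groupby(['year', 'month'])['Tmed'].mean().reset_index()
print(monthly_mean)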