Data Comparison in Python Pandas Series

There are two Series that I wish to compare based on a third.
data_SKU1:
SKU Weight1
1234 20
1235 30
111 40
101 23
data_SKU2:
SKU Weight2
1234 22
1235 35
111 47
101 87
flag_Data:
SKU
1234 True
1235 False
111 True
101 False
Name: Date, dtype: bool
Basically, based on the values in the flag_Data Series, I need to divide the value of Weight1 by Weight2 or vice versa.
For instance:
j = flag_Data(data_SKU1, data_SKU2)  # this generates the third series
if j[1234]:
    generated_serie = data_SKU1['Weight1'][1234] / data_SKU2['Weight2'][1234]
else:
    generated_serie = data_SKU2['Weight2'][1234] / data_SKU1['Weight1'][1234]
But it should be done for all SKUs in the series, not only SKU 1234. Could you help me figure out how?

Setup
merge
df = df1.merge(df2)
SKU Weight1 Weight2 FLAG
0 1234 20 22 True
1 1235 30 35 False
2 111 40 47 True
3 101 23 87 False
Option 1
np.where
df['division'] = np.where(df['FLAG'], df['Weight1']/df['Weight2'], df['Weight2']/df['Weight1'])
Option 2
loc with fillna
df.loc[df['FLAG'], 'division'] = df.Weight1 / df.Weight2
df['division'] = df.division.fillna(df.Weight2/df.Weight1)
Option 3
mask with fillna
df['division'] = (df.Weight1 / df.Weight2.mask(~df.FLAG)).fillna(df.Weight2/df.Weight1)
All result in:
SKU Weight1 Weight2 FLAG division
0 1234 20 22 True 0.909091
1 1235 30 35 False 1.166667
2 111 40 47 True 0.851064
3 101 23 87 False 3.782609
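The answer does not show how df1 and df2 were built, so here is a minimal runnable sketch of the merge plus np.where route. The three input frames and the merge on "SKU" are assumptions made for illustration; only the column names and values come from the question.

```python
import numpy as np
import pandas as pd

# Assumed inputs: each piece of data as a small frame keyed by SKU.
data_SKU1 = pd.DataFrame({"SKU": [1234, 1235, 111, 101], "Weight1": [20, 30, 40, 23]})
data_SKU2 = pd.DataFrame({"SKU": [1234, 1235, 111, 101], "Weight2": [22, 35, 47, 87]})
flags = pd.DataFrame({"SKU": [1234, 1235, 111, 101], "FLAG": [True, False, True, False]})

# Line the three inputs up row by row on the shared SKU column.
df = data_SKU1.merge(data_SKU2, on="SKU").merge(flags, on="SKU")

# Where FLAG is True take Weight1/Weight2, otherwise Weight2/Weight1.
df["division"] = np.where(
    df["FLAG"],
    df["Weight1"] / df["Weight2"],
    df["Weight2"] / df["Weight1"],
)
print(df)
```

Merging first means the rows are matched by SKU rather than by position, which matters if the inputs are not in the same order.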

You can use np.where for this:
result = np.where(flag_Data,
                  data_SKU1['Weight1']/data_SKU2['Weight2'],
                  data_SKU2['Weight2']/data_SKU1['Weight1'])
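np.where returns a plain NumPy array, so the SKU labels are lost. If the inputs really are Series indexed by SKU, a possible variant is Series.where, which keeps the index. The names w1, w2 and flag below are illustrative stand-ins for the question's objects.

```python
import pandas as pd

sku = [1234, 1235, 111, 101]
w1 = pd.Series([20, 30, 40, 23], index=sku, name="Weight1")
w2 = pd.Series([22, 35, 47, 87], index=sku, name="Weight2")
flag = pd.Series([True, False, True, False], index=sku)

# Keep w1/w2 where the flag is True, fall back to w2/w1 elsewhere.
ratio = (w1 / w2).where(flag, w2 / w1)
print(ratio)
```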

Related

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

I have a pandas data frame that looks like this:
id age weight group
1 12 45 [10-20]
1 18 110 [10-20]
1 25 25 [20-30]
1 29 85 [20-30]
1 32 49 [30-40]
1 31 70 [30-40]
1 37 39 [30-40]
I am looking for a data frame that would look like this: (sd=standard deviation)
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
[10-20]
[20-30]
[30-40]
Here the second and third columns are the mean and SD for that group. The fourth and fifth columns are the mean and SD for the rest of the groups combined.

Here's a way to do it:
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group==group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
An alternative way to get the same result is:
res = (pd.DataFrame(
        df.group.drop_duplicates().to_frame()
          .apply(lambda x: [
              df.loc[df.group==x.group,'weight'].mean(),
              df.loc[df.group==x.group,'weight'].std(),
              df.loc[df.group!=x.group,'weight'].mean(),
              df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
          .to_numpy(),
        index=list(df.group.drop_duplicates()),
        columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}))
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"
To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.
wgtCols = ['weight','weight2']
res = (pd.concat([pd.DataFrame(
        df.group.drop_duplicates().to_frame()
          .apply(lambda x: [
              df.loc[df.group==x.group,wgtCol].mean(),
              df.loc[df.group==x.group,wgtCol].std(),
              df.loc[df.group!=x.group,wgtCol].mean(),
              df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
          .to_numpy(),
        index=list(df.group.drop_duplicates()),
        columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
        for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}))
Input:
id age weight weight2 group
0 1 12 45 55 [10-20]
1 1 18 110 120 [10-20]
2 1 25 25 35 [20-30]
3 1 29 85 95 [20-30]
4 1 32 49 59 [30-40]
5 1 31 70 80 [30-40]
6 1 37 39 49 [30-40]
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596
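With many weight columns, re-filtering the frame for every group can get slow. A possible alternative, not taken from the answers above, derives the "rest" statistics from per-group counts, sums and sums of squares. The helper name group_vs_rest is made up for this sketch, and the sum-of-squares formula can lose precision when values are very large relative to their spread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weight": [45, 110, 25, 85, 49, 70, 39],
    "weight2": [55, 120, 35, 95, 59, 80, 49],
    "group": ["[10-20]", "[10-20]", "[20-30]", "[20-30]",
              "[30-40]", "[30-40]", "[30-40]"],
})

def group_vs_rest(df, col, by="group"):
    """Mean/SD of `col` per group and for all rows outside that group."""
    g = df.groupby(by)[col]
    out = pd.DataFrame({f"group_mean_{col}": g.mean(), f"group_sd_{col}": g.std()})
    # Totals minus the group's own contribution give the "rest" aggregates.
    n_rest = df[col].count() - g.count()
    sum_rest = df[col].sum() - g.sum()
    ssq_rest = (df[col] ** 2).sum() - (df[col] ** 2).groupby(df[by]).sum()
    mean_rest = sum_rest / n_rest
    out[f"rest_mean_{col}"] = mean_rest
    # Sample SD (ddof=1), matching the pandas default for .std().
    out[f"rest_sd_{col}"] = np.sqrt((ssq_rest - n_rest * mean_rest ** 2) / (n_rest - 1))
    return out

res = pd.concat([group_vs_rest(df, c) for c in ["weight", "weight2"]], axis=1).reset_index()
print(res)
```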

Pandas: query + mul + groupby + cumsum

My dataframe looks like this:
CUST_NO ORDER_AMOUNT PAYT_CODE IS_PAYMENT_SUCCESSFUL
001 50 OR 1
001 20 IC 0
001 10 IC 1
002 55 IC 1
002 300 MR 1
002 215 MR 0
I want to know the total amount a customer has successfully paid all-time, specifically from the payment codes 'OR', 'IC'. The dataframe is sorted and indexed by order date.
The expected output is shown in the CUMSUM_OR_IC_SUCCESSFUL column:
CUST_NO ORDER_AMOUNT PAYT_CODE IS_PAYMENT_SUCCESSFUL CUMSUM_OR_IC_SUCCESSFUL
001 50 OR 1 0
001 20 IC 0 50
001 10 IC 1 50
002 55 IC 1 0
002 300 MR 1 55
002 215 MR 0 55
I already have some code that should work, but it just keeps running until the kernel crashes.
df["CUMSUM_OR_IC_SUCCESSFUL"] = (df.query("PAYT_CODE == ('OR', 'IC')")["IS_PAYMENT_SUCCESSFUL"].mul(df["ORDER_AMOUNT"])
    .groupby(df["CUST_NO"])
    .transform(lambda x: x.cumsum().shift().fillna(0))
)
Any help is appreciated!

Answer
agg = df.groupby("CUST_NO").apply(lambda x:(x["ORDER_AMOUNT"] * x["PAYT_CODE"].isin(["IC", "OR"]) * x["IS_PAYMENT_SUCCESSFUL"]).cumsum())
df["CUMSUM_OR_IC_SUCCESSFUL"] = agg.to_numpy()
Output
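The result differs slightly from your expected output; I suspect your expected table contains a small mistake.
If you want CUMSUM_OR_IC_SUCCESSFUL shifted by one position, shift inside the lambda (append .shift(fill_value=0) after .cumsum()) so the shift stays within each customer; agg.shift().to_numpy() would carry the last value of one customer into the first row of the next.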
CUST_NO ORDER_AMOUNT ... IS_PAYMENT_SUCCESSFUL CUMSUM_OR_IC_SUCCESSFUL
0 1 50 ... 1 50
1 1 20 ... 0 50
2 1 10 ... 1 60
3 2 55 ... 1 55
4 2 300 ... 1 55
5 2 215 ... 0 55
Explanation
apply will run for each group

After some experimenting, this one worked:
df["CUMSUM_OR_IC_SUCCESSFUL"] = df["ORDER_AMOUNT"].mul(df["IS_PAYMENT_SUCCESSFUL"]).mul(df["PAYT_CODE"].isin(['OR', 'IC'])).groupby(df["CUST_NO"]).transform(lambda x: x.cumsum().shift().fillna(0))
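For reference, here is a self-contained sketch on the question's sample data that avoids the per-group lambda. It relies on the fact that "running total before this row" equals the group cumsum minus the row's own contribution. The intermediate names counts and paid are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "CUST_NO": ["001", "001", "001", "002", "002", "002"],
    "ORDER_AMOUNT": [50, 20, 10, 55, 300, 215],
    "PAYT_CODE": ["OR", "IC", "IC", "IC", "MR", "MR"],
    "IS_PAYMENT_SUCCESSFUL": [1, 0, 1, 1, 1, 0],
})

# Rows that count: payment code OR/IC and the payment succeeded.
counts = df["PAYT_CODE"].isin(["OR", "IC"]) & df["IS_PAYMENT_SUCCESSFUL"].eq(1)
paid = df["ORDER_AMOUNT"].where(counts, 0)

# Inclusive running total per customer, minus the current row,
# gives the amount paid before each order.
df["CUMSUM_OR_IC_SUCCESSFUL"] = paid.groupby(df["CUST_NO"]).cumsum() - paid
print(df)
```

On this sample the new column matches the expected output in the question.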

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the corresponding value for that Department.
So if Department is 52 and Item is empty, it should be filled with 2515;
if Department is 7 and Item is empty, fill it with 45.
The result should look like this:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods, but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi, I have used the code below and it worked:
b = [52]
df.Item=np.where(df.Department.isin(b),df.Item.fillna(2515),df.Item)
a = [7]
df.Item=np.where(df.Department.isin(a),df.Item.fillna(45),df.Item)
Hope it helps someone who faces the same issue.

The following solution first creates a map from each department to its maximum corresponding item (assuming there is one), and then assigns that item wherever the department has a blank item. Note that in your data frame, the empty items are empty strings ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
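A shorter route to the same result is sketched below, using a per-department lookup with map instead of a row-wise apply. Two assumptions differ from the answer above: the frame is built directly with a clean 0..n-1 index (with pd.concat that would mean passing ignore_index=True), and the lookup takes the first non-blank Item per department rather than the maximum.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["52", "5", "56", "70", "7", "52", "56", "765", "7", "40",
                   "7", "70", "5", "8", "52"],
    "Item": ["2515", "254", "818", "", "", "2515", "818", "524", "", "",
             "45", "", "818", "", ""],
})

# Treat empty strings as missing so the usual NaN tools apply.
item = df["Item"].mask(df["Item"].eq(""))

# First non-missing Item seen for each department (NaN if there is none).
lookup = item.groupby(df["Department"]).first()

# Fill blanks from the lookup; departments without any known Item stay blank.
df["Item"] = item.fillna(df["Department"].map(lookup)).fillna("")
print(df)
```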

Preserving NaN values when using groupby and lambda function on dataframe

Following on from this question, I have a dataset as such:
ChildID MotherID preDiabetes
0 20 455 No
1 20 455 Not documented
2 13 102 NaN
3 13 102 Yes
4 702 946 No
5 82 571 No
6 82 571 Yes
7 82 571 Not documented
8 60 530 NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values to impute them later on, rather than just assign them 'No'.
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 NaN
I tried doing this using the following line of code:
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in ' requires string as left operand, not float
I'd appreciate if you guys can point out what it is I am doing wrong. Thank you.

You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
df.fillna('no_value', inplace=True)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.nan if (x == 'no_value').all() else 'No'))
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object

You can do this using a custom function:
def func(s):
    if s.eq('Yes').any():
        return 'Yes'
    elif s.isna().all():
        return np.nan
    else:
        return 'No'
df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func}))
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
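A self-contained sketch of the custom-function approach on the question's data follows. The function name summarize, sort=False (to keep the original row order) and the final reset_index() are choices made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ChildID": [20, 20, 13, 13, 702, 82, 82, 82, 60],
    "MotherID": [455, 455, 102, 102, 946, 571, 571, 571, 530],
    "preDiabetes": ["No", "Not documented", np.nan, "Yes", "No",
                    "No", "Yes", "Not documented", np.nan],
})

def summarize(s):
    """Collapse one mother's records into a single preDiabetes value."""
    if s.eq("Yes").any():      # any Yes wins
        return "Yes"
    if s.isna().all():         # nothing recorded at all: keep it missing
        return np.nan
    return "No"                # everything else counts as No

out = (df.groupby(["ChildID", "MotherID"], sort=False)["preDiabetes"]
         .agg(summarize)
         .reset_index())
print(out)
```

The key difference from the failing attempt in the question is that missing values are detected with isna() rather than with an `in` test against np.nan.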

Try:
df['preDiabetes']=df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: 'NaN'}).reset_index()
The first line converts preDiabetes to numbers: Yes becomes 1, No becomes 0, and everything else (NaN, Not documented) becomes -1.
The second line takes the maximum per group: if at least one value is Yes, the group yields Yes; if there is a No but no Yes, it yields No; if no value is Yes or No, it yields NaN. Note that the final map writes the string 'NaN'; map -1 to np.nan instead if a real missing value is needed.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No

Numpy: Use vectorization for loop while referring to previous row value?

I have the following dataframe for which I want to create a column named 'Value' using numpy for fast looping and at the same time refer to the previous row value in the same column.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
        "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
        "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
        "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
    }
)
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
The formula for Value column in pseudocode is:
if ['Is First?'] = 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]
The way I would create the Value column right now is a for loop that uses shift to refer to the previous row (which I am somehow not able to make work). But since I will be applying this to a giant dataset, I want to use NumPy vectorization instead.
for i in range(len(df)):
    if df.loc[i, "Is First?"] == "Yes":
        df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
    else:
        df.loc[i, "Value"] = df.loc[i, "Value"].shift(-1) + df.loc[i, "Outbound"]

One way:
You may use np.subtract.accumulate with transform
s = df['Is First?'].eq('Yes').cumsum()
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
.groupby(s)
.transform(np.subtract.accumulate))
Out[1749]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
Another way:
Assign value for Yes. Create groupid s to use for groupby. Groupby and shift Outbound to calculate cumsum, and subtract it from 'Yes' value of each group. Finally, use it to fillna.
df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)
Out[1671]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125.0
1 A 220 20 No 105.0
2 A 200 24 No 81.0
3 A 402 52 No 29.0
4 B 313 40 Yes 353.0
5 B 434 12 No 341.0
6 B 321 43 No 298.0
7 C 343 23 Yes 366.0
8 C 120 16 No 350.0

This is not a trivial task; the difficulty lies in the consecutive No rows. They need to be grouped together with the Yes row that precedes them, which the code below does:
col_sum = df.Inbound+df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
consec_no = mask_yes.cumsum()
result = col_sum.groupby(consec_no).transform('first')-df['Outbound'].where(mask_no,0).groupby(consec_no).cumsum()
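The grouping idea can be checked against a plain-loop reference, as in this sketch. The intermediate names first, block, start and spent are illustrative, and the loop exists only to verify the vectorized result on the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
    "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
    "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
})

first = df["Is First?"].eq("Yes")
block = first.cumsum()  # one id per run starting at a Yes row

# Starting value of each block, repeated down the block.
start = (df["Inbound"] + df["Outbound"]).where(first).ffill()
# Outbound accumulated over the No rows of each block.
spent = df["Outbound"].where(~first, 0).groupby(block).cumsum()
df["Value"] = (start - spent).astype(int)

# Reference: the recurrence from the question written as a simple loop.
expected = []
for is_first, inbound, outbound in zip(df["Is First?"], df["Inbound"], df["Outbound"]):
    expected.append(inbound + outbound if is_first == "Yes" else expected[-1] - outbound)

print(df["Value"].tolist() == expected)
```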

Use:
first = df['Is First?'].eq('Yes')
df.loc[first,'Value']=df['Inbound']+df['Outbound']
df.loc[~first,'Value']=df['Value'].ffill()-df['Outbound'].where(~first,0).groupby(first.cumsum()).cumsum()

Annotated numpy code:
## 1. line up values to sum
ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]
## 2. calculate block sums and subtract each from the
## first element of the **next** block
ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()
Result:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
