Pandas Reindex Multiindex Dataframe Replicating Index - python

Thank you for taking a look! I am having issues with a 4-level MultiIndex and am attempting to make sure every possible value of the 4th index level is represented.
Here is my dataframe:
import numpy as np
import pandas as pd

np.random.seed(5)
size = 25
data = {'Customer': np.random.choice(['Bob'], size),
        'Grouping': np.random.choice(['Corn', 'Wheat', 'Soy'], size),
        'Date': np.random.choice(pd.date_range('1/1/2018', '12/12/2022', freq='D'), size),
        'Data': np.random.randint(20, 100, size=size)}
df = pd.DataFrame(data)
# create the Sub-Group column
df['Sub-Group'] = np.nan
df.loc[df['Grouping'] == 'Corn', 'Sub-Group'] = np.random.choice(['White', 'Dry'], size=len(df[df['Grouping'] == 'Corn']))
df.loc[df['Grouping'] == 'Wheat', 'Sub-Group'] = np.random.choice(['SRW', 'HRW', 'SWW'], size=len(df[df['Grouping'] == 'Wheat']))
df.loc[df['Grouping'] == 'Soy', 'Sub-Group'] = np.random.choice(['Beans', 'Meal'], size=len(df[df['Grouping'] == 'Soy']))
df['Year'] = df.Date.dt.year
With that, I'm looking to create a groupby like the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
This works as expected. I want to reindex this dataframe so that every single month (index level 3) is represented and filled with 0s. The reason I want this is that later on I'll be taking a cumulative sum within a groupby.
I have tried the following reindex and nothing happens - many months are still missing.
rere = pd.date_range('2018-01-01','2018-12-31', freq='M').month
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(rere, level=3, fill_value=0))
)
I've also tried the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
.fillna(0)
.pipe(lambda x: x.reindex(pd.MultiIndex.from_product(x.index.levels)))
)
The issue with the last one is that the index is much too long - it takes the Cartesian product of Grouping and Sub-Group, when really there are no combinations of 'Wheat' as a Grouping with 'Dry' as a Sub-Group.
I'm looking for a flexible way to reindex this dataframe to make sure a specific index level (level 3, the month, in this case) has every option.
Thanks so much for any help!

Try this:
def reindex_sub(g: pd.DataFrame):
    # drop the three outer group-key levels, leaving only the month level
    g = g.droplevel([0, 1, 2])
    # reindex to all 12 months; pass fill_value=0 if the new rows should hold 0 instead of NaN
    result = g.reindex(range(1, 13))
    return result
tmp = (df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
.agg(Units = ('Data','sum'))
.unstack()
)
grouped = tmp.groupby(level=[0,1,2], group_keys=True)
out = grouped.apply(reindex_sub)
print(out)
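A quick sanity check on the result (a sketch; it assumes the out variable from the snippet above): every (Customer, Grouping, Sub-Group) group should now carry all 12 months in the month level of the index, with the newly added months showing up as NaN unless you fill them.
# level 3 of out's index is the month level added back by reindex_sub
assert out.index.get_level_values(3).nunique() == 12
out = out.fillna(0)  # zero-fill the newly created months, as requested in the question
print(out.head(24))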

Related

How to delete row in pandas dataframe based on condition if string is found in cell value of type list?

I've been struggling with the following issue. It sounds very easy, but I can't seem to figure it out, and I'm sure it's something very obvious that I'm missing.
I simply have a pandas dataframe (built in the code at the end of this post) and I want to drop the rows that contain the value "123.jpg" in the jpgs cell value (a list). So I would expect the final dataframe to keep only the rows at index 1 and 3.
However, I've tried a lot of methods and none of them work.
For example:
df = df["123.jpg" not in df.jpgs]
or
df = df[df.jpgs.tolist().count("123.jpg") == 0]
both give the error KeyError: True.
df = df[df['jpgs'].str.contains('123.jpg') == False]
Returns an empty dataframe.
df = df[df.jpgs.count("123.jpg") == 0]
And
df = df.drop(df["123.jpg" in df.jpgs].index)
Gives KeyError: False.
This is my entire code, if needed, and I would really appreciate it if someone could help me figure out what I'm doing wrong :( Thanks!!
import pandas as pd
df = pd.DataFrame(columns=["person_id", "jpgs"])
id = 1
pair1 = ["123.jpg", "124.jpg"]
pair2 = ["125.jpg", "300.jpg"]
pair3 = ["500.jpg", "123.jpg"]
pair4 = ["111.jpg", "122.jpg"]
row1 = {'person_id': id, 'jpgs': pair1}
row2 = {'person_id': id, 'jpgs': pair2}
row3 = {'person_id': id, 'jpgs': pair3}
row4 = {'person_id': id, 'jpgs': pair4}
# DataFrame.append was removed in pandas 2.x, so build the frame with pd.concat instead
df = pd.concat([df, pd.DataFrame([row1, row2, row3, row4])], ignore_index=True)
print(df)
#df = df["123.jpg" not in df.jpgs]
#df = df[df['jpgs'].str.contains('123.jpg') == False]
#df = df[df.jpgs.tolist().count("123.jpg") == 0]
df = df.drop(df["123.jpg" in df.jpgs].index)
print("\n Final df")
print(df)
Since you filter on a list column, apply lambda would probably be the easiest:
df.loc[df.jpgs.apply(lambda x: "123.jpg" not in x)]
Quick comments on your attempts:
In df = df.drop(df["123.jpg" in df.jpgs].index) you are checking whether the exact value "123.jpg" is contained in the column ("123.jpg" in df.jpgs) rather than in any of the lists, which is not what you want.
df = df[df['jpgs'].str.contains('123.jpg') == False] goes in the right direction, but you are missing the regex=False keyword, as shown in Ibrahim's answer.
df[df.jpgs.count("123.jpg") == 0] is also not applicable here, since count returns the total number of non-NaN values in the Series; if you want a per-element count you have to look inside each list, as sketched below.
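If a per-row count is what you were after, count inside each list instead, as sketched here on the df from the question:
# keep the rows whose list contains "123.jpg" zero times
df[df.jpgs.apply(lambda x: x.count("123.jpg")) == 0]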
For the str.contains option, this is how it is done:
df[df.jpgs.str.contains("123.jpg", regex=False)]
You can try this:
mask = df.jpgs.apply(lambda x: '123.jpg' not in x)
df = df[mask]

Using a tuple to map values between dataframes

I need to map one value between two dataframes, getting the 'FD' value from the row where Round = 1 and ID is 262:
df1 = pd.DataFrame({'Round':1,'ID':262,'FD':30,
'Round':2,'ID':262,'FD':20}, index=[0])
df2 = pd.DataFrame({'Round':1, 'Opponent':262,
'Round':2, 'Opponent':262},index=[0])
I have tried to map with:
df2['P_GS_by_FD'] = df2['Opponent'].map(df1.set_index('ID')['FD'])
Expected output for df2:
Round  Opponent  P_GS_by_FD
1      262       30
I would use drop_duplicates.
This would select the 'Round 1' rows:
df1.drop_duplicates('Id', keep='first')
df2['P_GS_by_FD'] = df2['Opponent'].map(df1.drop_duplicates('Id', keep='first').set_index('Id')['FD'])
(I think your example df1 and df2 each produce only one row instead of two.)
Then we need to create the Round in df2 as well:
df2['Round'] = df2.groupby('Opponent').cumcount() + 1
yourdf = df2.merge(df1.rename(columns={'Id' : 'Opponent'}), on = ['Opponent','Round'], how = 'left')
Based on your update:
yourdf = df2.merge(df1.rename(columns={'ID' : 'Opponent'}), on = ['Opponent','Round'], how = 'left')
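Putting it together, a minimal sketch of the merge route (it assumes df1 and df2 hold one row per Round, unlike the single-row frames shown in the question, and uses the updated 'ID' column name):
import pandas as pd

df1 = pd.DataFrame({'Round': [1, 2], 'ID': [262, 262], 'FD': [30, 20]})
df2 = pd.DataFrame({'Round': [1, 2], 'Opponent': [262, 262]})

# align on both Opponent/ID and Round so each round picks up its own FD
yourdf = df2.merge(df1.rename(columns={'ID': 'Opponent'}), on=['Opponent', 'Round'], how='left')
print(yourdf)  # Round 1 gets FD = 30, Round 2 gets FD = 20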

Pandas: get the cumulative sum of a column only if the timestamp is greater than that of another column

For each customer, I would like to get the cumulative sum of a column (Dollar Value) only when Timestamp 1 is less than Timestamp 2. I could do a cartesian join of the values based on Customer or iterate through the dataframe, but I wanted to see if there is an easier way to do this, maybe with a groupby and apply.
DataFrame:
df = pd.DataFrame({'Customer': ['Alice','Brian','Alice','Alice','Alice','Brian', 'Brian'], 'Timestamp': [1,2,3,4,5,3,6], 'Timestamp 2': [2,5,4,6,7,5,7], 'Dollar Value':[0,1,3,5,3,2,3]})
Sort Values:
df = df.sort_values(['Customer','Timestamp'])
Expected Results:
df['Desired_result'] = [0,0,0,3,0,0,3]
This could work:
Get the rows where the condition matches and then do a cumsum:
cond = df["Timestamp"]>df["Timestamp 2"]
df["Dollar Value"].where(cond, 0).groupby([cond, df["Customer"]]).cumsum()
Edit
Based on your comment, maybe this is what you want:
df = pd.DataFrame({'Customer': ['Alice','Brian','Alice','Alice','Alice','Brian', 'Brian'], 'Timestamp': [1,2,3,4,5,3,6], 'Timestamp 2': [2,5,4,6,7,5,7], 'Dollar Value':[0,1,3,5,3,2,3]})
import numpy as np

def sum_dollar_value(group):
    group = group.copy()
    last_row = group.iloc[-1, :]
    # rows whose Timestamp 2 falls before the last row's Timestamp contribute to the sum
    cond = group["Timestamp 2"] < last_row["Timestamp"]
    group.loc[last_row.name, "result"] = np.sum(group["Dollar Value"].where(cond, 0))
    return group

df.groupby("Customer").apply(sum_dollar_value).reset_index(level=0, drop=True)
I would suggest setting a condition, then grouping by customer:
# set condition
cond = df["Timestamp"]<df["Timestamp 2"]
df[cond].groupby('Customer')['Dollar Value'].sum()
Note: I borrowed the condition syntax from the previous answer by Dev Khadka.

Diff between two dataframes in pandas

I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer') but I was not sure what column to pass in as the 'key' as there really isn't one and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist in both df1 and df2.
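As a quick worked example of the indicator trick (a sketch with two made-up frames):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [1, 4], 'b': ['x', 'w']})

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
print(diff_df.loc[diff_df['Exist'] != 'both'])
# rows (2, 'y') and (3, 'z') appear only in df1; (4, 'w') appears only in df2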
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
df1.index.symmetric_difference(df2.index)
]
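Note that this compares index labels, so both frames need a meaningful, comparable index. A minimal sketch, assuming both frames are indexed by the same kind of key:
import pandas as pd

df1 = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'val': [1, 2, 9]}, index=['a', 'b', 'd'])

# labels present in exactly one of the two indexes
diff = pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
print(diff)  # rows 'c' (from df1) and 'd' (from df2)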
You can use this function; the output is an ordered dict of 6 dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default = ('df1','df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see those rows; the function stacks them one on top of the other and labels each row with its dataframe name so you know which dataframe it belongs to.
'drop' can take a list of columns to be excluded when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
Fruits Quantity
1 coconut 3
In [11]: dict1['df2_only']
Out[11]:
Fruits Quantity
3 durian 4
In [12]: dict1['Diff']
Out[12]:
Fruits Quantity df1 or df2
0 banana 2 df1
1 banana 3 df2
In [13]: dict1['Merge']
Out[13]:
Fruits Quantity
0 apple 1
Here is the code:
import pandas as pd
from collections import OrderedDict as od
def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]

    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same
    #           (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatypes are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes, 'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')

    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()

    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. '''Remember to dedupe, provide labels for common_no_match'''
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]

        if type(uid) == list:
            # (DataFrame.append was removed in pandas 2.x; pd.concat([df1, df_merge]) is the modern equivalent)
            df1_only = df1.append(df_merge).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = df2.append(df_merge).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]

            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = df_lc.append(df_rc).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]
            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]

            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)

    return dict_result
Set df2.columns = df1.columns.
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index), and the two results are your distinct rows.
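A minimal sketch of those three steps, assuming the two frames already share a column layout:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [1, 4], 'b': ['x', 'w']})

df2.columns = df1.columns
idx1 = df1.set_index(df1.columns.tolist()).index
idx2 = df2.set_index(df2.columns.tolist()).index

print(idx1.difference(idx2).to_frame(index=False))  # rows only in df1
print(idx2.difference(idx1).to_frame(index=False))  # rows only in df2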
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result.
Similarly, you can get the inner join result. Then take the difference between the two, which is what you want.

Pandas Columns Operations with List

I have a pandas dataframe with two columns, the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new df columns with the number of dates in verification_date that have a difference of either over or under 360 days.
Here is my code:
import pandas as pd

df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
# pd.TimeGrouper has since been removed; pd.Grouper(freq='2D') is the modern equivalent
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
what you want to do instead is set the value for each line calculation accordingly, you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i in column over_360 or under_360.
You can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
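A side note for current pandas versions: set_value and .ix have both since been removed, and label-based assignment with .loc (or .at for scalars) is the modern replacement, roughly:
# modern equivalent of the set_value / .ix lines above
df.loc[i, 'over_360'] = len(over_360)
df.loc[i, 'under_360'] = len(under_360)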
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
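For what it's worth, on the grouped example frame from the question this should give over_360 = [2, 0] and under_360 = [0, 1], which matches the expectation described above; a quick way to eyeball it:
print(df[['action_date', 'over_360', 'under_360']])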
