Boolean Masking on a Pandas Dataframe where columns may not exist - python

I have a dataframe called compare that looks like this:
Resident   1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc
John            0        -500        0       50           50       0           0     3         3    0      0        0  1803      1803     0        0     30         30      0          0
Cheldone     -500           0        0       50           50       0           0  1.25      1.25    0      0        0  1565      1565     0        0     30         30      0          0
Dieu         -300        -300        0        0            0       0           0     3         3    0      0        0  1372      1372     0        0     18         18      0          0
Here is the dataframe in a form that can be copied and pasted:
,Resident,1xdisc,1xdisc_doc,conpark,parking,parking_doc,conmil,conmil_doc,pest,pest_doc,pet,pet1x,pet_doc,rent,rent_doc,stlc,storage,trash,trash_doc,water,water_doc
0,Acacia,0,0,0,0,0,0,-500,3.0,3.0,0,0,70,2067,2067,0,0,15,15,0,0
1,ashley,0,0,0,0,0,0,0,3.0,3.0,0,0,0,2067,2067,0,0,15,15,0,0
2,Sheila,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1574,1574,0,0,0,0,0,0
3,Brionne,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1787,1787,0,0,0,0,0,0
4,Danielle,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1422,0,0,0,0,0,0,0
5,Nmesomachi,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1675,1675,0,0,0,0,0,0
6,Doaa,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1306,1306,0,0,0,0,0,0
7,Reynaldo,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1685,1685,0,0,0,0,0,0
8,Shajuan,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1768,0,0,0,0,0,0,0
9,Dalia,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1596,1596,0,0,0,0,0,0
I want to create another dataframe using boolean masking that only contains rows where there are mismatches in various pairs of columns. For example, where parking doesn't match parking_doc or conmil doesn't match conmil_doc.
Here is the code I am using currently:
nonmatch = compare[((compare['1xdisc']!=compare['1xdisc_doc']) &
                    (compare['conpark']!=compare['1xdisc']))
                   | (compare['rent']!=compare['rent_doc'])
                   | (compare['parking']!=compare['parking_doc'])
                   | (compare['trash']!=compare['trash_doc'])
                   | (compare['pest']!=compare['pest_doc'])
                   | (compare['stlc']!=compare['stlc_doc'])
                   | (compare['pet']!=compare['pet_doc'])
                   | (compare['conmil']!=compare['conmil_doc'])
                   ]
The problem I'm having is that some columns may not always exist, for example stlc_doc or pet_doc. How do I select rows with mismatches, but only check for mismatches for particular columns if the columns exist?

If some column names don't always exist, you could add the missing columns, but I don't think that is a good idea, since you would have to replicate the corresponding columns and that will eventually increase the size of the dataframe.
So another approach might be to filter the column names themselves and take only the column pairs that exist:
Given DataFrame:
>>> df.head(3)
Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc
0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0
1 ashley 0 0 0 0 0 0 0 3.0 3.0 0 0 0 2067 2067 0 0 15 15 0 0
2 Sheila 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1574 1574 0 0 0 0 0 0
Extract the column pairs:
>>> maskingCols = [(col[:-4], col) for col in df if col[:-4] in df and col.endswith('_doc')]
>>> maskingCols
[('1xdisc', '1xdisc_doc'), ('parking', 'parking_doc'), ('conmil', 'conmil_doc'), ('pest', 'pest_doc'), ('pet', 'pet_doc'), ('rent', 'rent_doc'), ('trash', 'trash_doc')]
Now that you have the column pairs, you can create the expression required to mask the dataframe.
>>> "|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)
"(df['1xdisc'] != df['1xdisc_doc'])|(df['parking'] != df['parking_doc'])|(df['conmil'] != df['conmil_doc'])|(df['pest'] != df['pest_doc'])|(df['pet'] != df['pet_doc'])|(df['rent'] != df['rent_doc'])|(df['trash'] != df['trash_doc'])"
You can simply pass this expression string to the eval function to evaluate it.
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols))
You can combine other criteria with this mask:
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 False
dtype: bool
You can use it to get your desired dataframe:
>>> df[eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))]
OUTPUT:
Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc
0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0
4 Danielle 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1422 0 0 0 0 0 0 0
8 Shajuan 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1768 0 0 0 0 0 0 0
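If you prefer to avoid building a string and calling eval, a minimal sketch of the same idea (assuming the maskingCols pairs built above) combines the comparisons directly with functools.reduce:
from functools import reduce
import operator
import pandas as pd

# OR the pairwise mismatch masks together; the all-False start value also
# covers the case where no matching column pairs exist at all
masks = [df[col1] != df[col2] for col1, col2 in maskingCols]
combined = reduce(operator.or_, masks, pd.Series(False, index=df.index))

# extra criteria can still be ORed/ANDed on afterwards, e.g. the conpark check
combined |= (df['1xdisc'] != df['1xdisc_doc']) & (df['conpark'] != df['1xdisc'])
nonmatch = df[combined]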

I'm not a pandas expert, so there might be a simpler, library way to do this, but here's a relatively Pythonic, adaptable implementation:
mask = False
for col_name in df.columns:
    # inefficient but readable, could get this down
    # to O(n) with a better data structure
    if col_name + '_doc' in df.columns:
        # OR the pairwise mismatches so a row is kept if any pair differs
        mask = mask | (df[col_name] != df[col_name + '_doc'])
non_match = df[mask]

Related

Realise accumulated DataFrame from a column of Boolean values

Given the following pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
 0      True        1        2        0    red
 0     False        3        2        0    red
 0      True        4        4        1   blue
 1     False        2        0        0    red
 1      True        1        2        1  green
 2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one taking into account only the rows whose Holidays value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
Alternative if you want to also get the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
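For what it's worth, the reason this also keeps ID 2 is that the where step turns the rows where Holidays is False into all-NaN rows, so every ID survives the groupby and groups without any True row simply sum to 0:
df2.where(df2['Holidays'])   # non-holiday rows become NaN but keep their ID in the index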
Variant:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
    .convert_dtypes()
    .add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0

Joining dataframe on itself with different filtering

I have a df that follows this structure:
store day type sales orders
amazon 2021-10-10 web 10 1
amazon 2021-10-10 retail 500 50
facebook 2021-10-10 retail 300 50
facebook 2021-09-05 retail 10 50
apple 2021-09-01 web 5 1
uber 2021-08-01 web 50 1
uber 2021-08-01 retail 60 1
...
I am trying to build a df_res that has sales & orders by store, day & type, aggregated weekly, so that the output would look like this:
day type sales_amazon orders_amazon sales_facebook orders_facebook sales_apple orders_apple sales_uber orders_uber
2021-08-01 web 0 0 0 0 0 0 50 1
2021-08-01 retail 0 0 0 0 0 0 60 1
2021-10-10 web 10 1 0 0 0 0 0 0
2021-10-10 retail 500 50 300 50 0 0 0 0
...
I tried:
# main df to be joined on
df_res = df[df.store.isin(['amazon'])].groupby(
    ['store', 'type', pd.Grouper(key='day', freq='W-MON', label='right')]
)[['store', 'day', 'orders', 'sales', 'type']].sum().reset_index()

# merging on main df each store df
for branch in ['facebook', 'apple', 'apple', 'uber']:
    df_res = df_res.merge(
        df[df.store.isin([branch])].groupby(
            ['store', 'type', pd.Grouper(key='day', freq='W-MON', label='right')]
        )[['store', 'day', 'orders', 'sales', 'type']].sum().reset_index(),
        on=['day', 'type'], suffixes=[f'_{branch}', f'_{branch}'], how='outer')
But this does not produce the structure I desire. I tried using join, but that throws a different-length error because there are occasions when there is no sale for a particular date & type combination for a given store.
You can pivot and rework the MultiIndex:
df2 = (df.pivot_table(index=['day', 'type'], columns='store',
                      values=['sales', 'orders'], fill_value=0)
         .sort_index(axis=1, level=1)
      )
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
output:
day type orders_amazon sales_amazon orders_apple sales_apple orders_facebook sales_facebook orders_uber sales_uber
0 2021-08-01 retail 0 0 0 0 0 0 1 60
1 2021-08-01 web 0 0 0 0 0 0 1 50
2 2021-09-01 web 0 0 1 5 0 0 0 0
3 2021-09-05 retail 0 0 0 0 50 10 0 0
4 2021-10-10 retail 50 500 0 0 50 300 0 0
5 2021-10-10 web 1 10 0 0 0 0 0 0
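The original attempt also grouped weekly with pd.Grouper; a hedged sketch that folds the weekly bucketing into the pivot (assuming day is, or is first converted to, a datetime column) could look like:
df['day'] = pd.to_datetime(df['day'])
df2 = (df.pivot_table(index=[pd.Grouper(key='day', freq='W-MON', label='right'), 'type'],
                      columns='store', values=['sales', 'orders'],
                      aggfunc='sum', fill_value=0)
         .sort_index(axis=1, level=1)
      )
df2.columns = df2.columns.map('_'.join)
df2.reset_index()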

Pandas split column based on category of the elements

I have a pandas.DataFrame, and in it I have a column. The column contains a mix of integers, strings, times...
I want to create columns (containing 0 or 1) that tell whether the value in that column is a string or not, a time or not... in an efficient way.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()
df['A_string'] = df.A.str.isalpha().astype(int)
df['A_number'] = df.A.str.isdigit().astype(int)
#naive assumption
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A','A_string','A_number','A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas .to_numeric, to_datetime to test for dates & numbers. Then we can use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'],errors='coerce').isna(),'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'],errors='coerce').isna(),'A_Number'] = 1
df.loc[(pd.to_numeric(df['A'],errors='coerce').isna())
       & pd.to_datetime(df['A'],errors='coerce').isna()
       ,'A_String'] = 1
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
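If you want integer flags like the desired output rather than floats, one small follow-up to the snippet above is to cast the indicator columns after the fillna:
df[['A_Date', 'A_Number', 'A_String']] = df[['A_Date', 'A_Number', 'A_String']].astype(int)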

Df Headers: Insert a full year of header rows at end of month and fill non populated months with zero

Afternoon All,
Test Data as at 30 Mar 2019:
Test_Data = [
('Index', ['Year_Month','Done_RFQ','Not_Done_RFQ','Total_RFQ']),
('0', ['2019-01',10,20,30]),
('1', ['2019-02', 10, 20, 30]),
('2', ['2019-03', 20, 40, 60]),
]
df = pd.DataFrame(dict(Test_Data))
print(df)
Index 0 1 2
0 Year_Month 2019-01 2019-02 2019-03
1 Done_RFQ 10 10 20
2 Not_Done_RFQ 20 20 40
3 Total_RFQ 30 30 60
Desired output as at 31 Mar 2019
Desired output as at 30 Apr 2019
As each month progresses, the unformatted df will have an additional column of data.
I'd like to:
a. Replace the headers in the existing df; note there will only be four columns in March, then 5 in April... 13 in December:
df.columns = ['Report_Mongo','Month_1','Month_2','Month_3','Month_4','Month_5','Month_6','Month_7','Month_8','Month_9','Month_10','Month_11','Month_12']
b. As we progress through the year, zero values would be replaced with data. The challenge is to determine how many months have passed and only update the non-populated columns with data.
You can assign the column names by slicing to the length of the original columns and then use DataFrame.reindex:
c = ['Report_Mongo','Month_1','Month_2','Month_3','Month_4','Month_5','Month_6',
'Month_7','Month_8','Month_9','Month_10','Month_11','Month_12']
df.columns = c[:len(df.columns)]
df = df.reindex(c, axis=1, fill_value=0)
print (df)
Report_Mongo Month_1 Month_2 Month_3 Month_4 Month_5 Month_6 \
0 Year_Month 2019-01 2019-02 2019-03 0 0 0
1 Done_RFQ 10 10 20 0 0 0
2 Not_Done_RFQ 20 20 40 0 0 0
3 Total_RFQ 30 30 60 0 0 0
Month_7 Month_8 Month_9 Month_10 Month_11 Month_12
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
An alternative is to create the header from month periods; the advantage is that all rows then contain only numeric data:
#set columns by first row
df.columns = df.iloc[0]
#remove first row and create index by first column
df = df.iloc[1:].set_index('Year_Month')
#convert columns to month periods
df.columns = pd.to_datetime(df.columns).to_period('m')
#reindex to full year
df = df.reindex(pd.period_range(start='2019-01',end='2019-12',freq='m'),axis=1,fill_value=0)
print (df)
2019-01 2019-02 2019-03 2019-04 2019-05 2019-06 2019-07 \
Year_Month
Done_RFQ 10 10 20 0 0 0 0
Not_Done_RFQ 20 20 40 0 0 0 0
Total_RFQ 30 30 60 0 0 0 0
2019-08 2019-09 2019-10 2019-11 2019-12
Year_Month
Done_RFQ 0 0 0 0 0
Not_Done_RFQ 0 0 0 0 0
Total_RFQ 0 0 0 0 0
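For part (b), a minimal sketch building on the period-indexed layout above: month columns that are still entirely zero have not been populated yet, so only those need updating as new data arrives.
# columns that are still all zero are the not-yet-populated months
unpopulated = [col for col in df.columns if (df[col] == 0).all()]
print(unpopulated)   # with the March data above: the periods 2019-04 through 2019-12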

CountTokenizing a field, turning into columns

I'm working with data that look something like this:
ID PATH GROUP
11937 MM-YT-UJ-OO GT
11938 YT-RY-LM TQ
11939 XX-XX-OT DX
I'd like to tokenize the PATH column into n-grams and then one-hot encode those into their own columns so I'd end up with something like:
ID GROUP MM YT UJ OO RY LM XX OT MM-YT YT-UJ ...
11937 GT 1 1 1 1 0 0 0 0 1 1
I could also use counted tokens rather than one-hot, so 11939 would have a 2 in the XX column instead of a 1, but I can work with either.
I can tokenize the column quite easily with scikit-learn's CountVectorizer, but then I have to cbind the ID and GROUP fields. Is there a standard way to do this or a best practice that anyone has discovered?
A solution:
df.set_index(['ID', 'GROUP'], inplace=True)
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())\
    .groupby(level=[0,1]).sum().reset_index()
Isolate the ID and GROUP columns as the index. Then split the string into cell items:
df.PATH.str.split('-', expand=True)
Out[37]:
0 1 2 3
ID GROUP
11937 GT MM YT UJ OO
11938 TQ YT RY LM None
11939 DX XX XX OT None
Get them into a single column of data
df.PATH.str.split('-', expand=True).stack()
Out[38]:
ID GROUP
11937 GT 0 MM
1 YT
2 UJ
3 OO
11938 TQ 0 YT
1 RY
2 LM
11939 DX 0 XX
1 XX
2 OT
get_dummies brings the counts in as columns spread across the rows:
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())
Out[39]:
LM MM OO OT RY UJ XX YT
ID GROUP
11937 GT 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1
2 0 0 0 0 0 1 0 0
3 0 0 1 0 0 0 0 0
11938 TQ 0 0 0 0 0 0 0 0 1
1 0 0 0 0 1 0 0 0
2 1 0 0 0 0 0 0 0
11939 DX 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 1 0
2 0 0 0 1 0 0 0 0
Group the data by ID and GROUP (levels 0 and 1 of the index) to sum the rows together and get one line per tuple. Finally, reset the index to get ID and GROUP back as regular columns.
Maybe you can try something like this.
# Test data
df = pd.DataFrame({'GROUP': ['GT', 'TQ', 'DX'],
                   'ID': [11937, 11938, 11939],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})
# Expanding data and creating one column per token
tmp = pd.concat([df.loc[:, ['GROUP', 'ID']],
                 df['PATH'].str.split('-', expand=True)], axis=1)
# Converting wide to long format
tmp = pd.melt(tmp, id_vars=['ID', 'GROUP'])
# Now grouping and counting
tmp.groupby(['ID', 'GROUP', 'value']).count().unstack().fillna(0)
# variable
# value LM MM OO OT RY UJ XX YT
# ID GROUP
# 11937 GT 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0
# 11938 TQ 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
# 11939 DX 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0
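Since the question mentions scikit-learn's CountVectorizer and n-grams such as MM-YT, here is a hedged sketch of my own (assuming hyphen-separated tokens and unigrams plus bigrams only) that also keeps the ID and GROUP fields:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'ID': [11937, 11938, 11939],
                   'GROUP': ['GT', 'TQ', 'DX'],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})

# treat each PATH as a space-separated document so CountVectorizer can build n-grams
docs = df['PATH'].str.replace('-', ' ')
vec = CountVectorizer(ngram_range=(1, 2), lowercase=False, token_pattern=r'\S+')
counts = vec.fit_transform(docs)

# get_feature_names_out needs scikit-learn >= 1.0; rename bigrams back to "MM-YT" style
columns = [name.replace(' ', '-') for name in vec.get_feature_names_out()]
tokens = pd.DataFrame(counts.toarray(), columns=columns, index=df.index)

# "cbind" the ID and GROUP fields back on; these are counts (not one-hot), so XX gets 2 for 11939
result = pd.concat([df[['ID', 'GROUP']], tokens], axis=1)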
