Category Condition on multiple columns - python

I have a data set as below:
ID A1 A2
0 A123 1234
1 1234 5568
2 5568 NaN
3 Zabc NaN
4 3456 3456
5 3456 3456
6 NaN NaN
7 NaN NaN
The intention is to go through columns A1 and A2, identify the rows where both columns are blank (as in rows 6 and 7), create a new column, and categorise those rows as "Both A1 and A2 are blank".
I used the below code:
df['Z_Tax No Not Mapped'] = np.NaN
df['Z_Tax No Not Mapped'] = np.where((df['A1'] == np.NaN) & (df['A2'] == np.NaN), 1, 0)
However, the output marks every row as 0 in the new column 'Z_Tax No Not Mapped', even though the data has rows where both columns are blank. I'm not sure where I'm making a mistake in filtering such cases.
Note: Columns A1 and A2 are sometimes alphanumeric or just numeric.
The idea is to place a category in a separate column, either "IDs are not updated" or "IDs are updated", so that a simple filter on "IDs are not updated" identifies the cases that are blank in both columns.

Use DataFrame.isna with DataFrame.all to test whether all of the selected columns are missing values (comparing with == np.NaN always returns False, because NaN never compares equal to anything, which is why the original code marks every row as 0):
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
'Both A1 and A2 are blank',
'')

Alternatively, assign only to the matching rows with loc (restricting the check to A1 and A2, since ID is always populated):
df.loc[df[['A1','A2']].isna().all(axis=1), "Z_Tax No Not Mapped"] = "Both A1 and A2 are blank"
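For completeness, a minimal self-contained sketch (the sample frame is rebuilt from the data shown above) that produces the "IDs are updated" / "IDs are not updated" labels mentioned in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(8),
    'A1': ['A123', '1234', '5568', 'Zabc', '3456', '3456', np.nan, np.nan],
    'A2': ['1234', '5568', np.nan, np.nan, '3456', '3456', np.nan, np.nan],
})

# NaN == NaN is False, so equality checks never find missing values; use isna() instead
both_blank = df[['A1', 'A2']].isna().all(axis=1)
df['Z_Tax No Not Mapped'] = np.where(both_blank, 'IDs are not updated', 'IDs are updated')
print(df)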


How to transpose a column in a pandas dataframe with its values drawn from a different column? [duplicate]

(This question was closed as a duplicate of: How can I pivot a dataframe?)
I have a pandas dataframe with a categorical column "Code" that has more than 100 unique values. There are multiple rows for the same "Name", and I would like to capture all of the information pertaining to each unique "Name" in one row. Therefore, I'd like to transpose the column "Code", using the values from "Counter".
How do I transpose "Code" in such a way that the following table:
Name   Code  Counter
Alice  a1    4
Alice  a2    3
Bob    b1    9
Bob    c2    1
Bob    a2    4
becomes this:
Name   a1  a2  b1  c2
Alice   4   3   0   0
Bob     0   4   9   1
I can't comment yet, but the other answer (from Yuca) should work for you: assign the pivot result to a variable and it is already your dataframe. If you want to be explicit, you can also wrap it in the DataFrame constructor (note the spelling pd.DataFrame; pd.Dataframe raises an AttributeError):
import pandas as pd
Pivoted = df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
dataframe = pd.DataFrame(data=Pivoted)
Try:
df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
Output:
Code    a1   a2   b1   c2
Name
Alice  4.0  3.0  0.0  0.0
Bob    0.0  4.0  9.0  1.0
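If you prefer the integer counts shown in the expected output rather than the floats produced by fillna(0), a pivot_table with fill_value is one option; a sketch, assuming each Name/Code pair occurs once (aggfunc='sum' then simply keeps the single Counter value, and it also tolerates accidental duplicates, which plain pivot would reject):
df.pivot_table(index='Name', columns='Code', values='Counter',
               aggfunc='sum', fill_value=0)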

Appending column from one dataframe to another dataframe with multiple matches in loop

My question is an extension of this question:
Check if value in a dataframe is between two values in another dataframe
df1
df1_Col df1_start
0 A1 1200
1 B2 4000
2 B2 2500
df2
df2_Col df2_start df2_end data
0 A1 1000 2000 DATA_A1
1 A1 900 1500 DATA_A1_A1
2 A1 2000 3000 DATA_A1_A1_A1
2 B1 2000 3000 DATA_B1
3 B2 2000 3000 DATA_B2
output:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
I am matching df1_Col against df2_Col and checking that df1_start falls within the range df2_start to df2_end, then adding the values of the data column to df1. If there are multiple matches, the data values can be combined with any delimiter, e.g. ';'.
The code is as follows:
for v, ch in zip(df1.df1_start, df1.df1_Col):
    df3 = df2[(df2['df2_start'] < v) & (df2['df2_end'] > v) & (df2['df2_Col'] == ch)]
    data = df3['data']
    df1['data'] = data
Loops are used because file is huge.
EDIT:
Looking forward to your assistance.
IIUC:
try via merge()+groupby()+agg():
Left-merge df2 onto df1, then check whether 'df1_start' falls between 'df2_start' and 'df2_end', setting 'data' to None where it does not. Then group on ['df1_Col', 'df1_start'] and join the values of 'data' separated by ';', dropping the None values:
out=df1.merge(df2,left_on='df1_Col',right_on='df2_Col',how='left',sort=True)
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
out=out.groupby(['df1_Col','df1_start'],as_index=False,sort=False)['data'].agg(lambda x:';'.join(x.dropna()))
output of out:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
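A minimal end-to-end sketch of the merge approach, with the sample frames rebuilt from the question (note that between() is inclusive on both ends by default, slightly different from the strict < / > comparisons used in the loop above):
import pandas as pd

df1 = pd.DataFrame({'df1_Col': ['A1', 'B2', 'B2'],
                    'df1_start': [1200, 4000, 2500]})
df2 = pd.DataFrame({'df2_Col': ['A1', 'A1', 'A1', 'B1', 'B2'],
                    'df2_start': [1000, 900, 2000, 2000, 2000],
                    'df2_end': [2000, 1500, 3000, 3000, 3000],
                    'data': ['DATA_A1', 'DATA_A1_A1', 'DATA_A1_A1_A1', 'DATA_B1', 'DATA_B2']})

# pair every df1 row with all df2 rows sharing the same key
out = df1.merge(df2, left_on='df1_Col', right_on='df2_Col', how='left', sort=True)
# discard matches whose start value lies outside the df2 range
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
# collapse back to one row per df1 record, joining the surviving data values
out = (out.groupby(['df1_Col', 'df1_start'], as_index=False, sort=False)['data']
          .agg(lambda x: ';'.join(x.dropna())))
print(out)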

Changing the DataFrame in a complex way

I have a dataframe as follows:
id|s1|s2|s3|s4|s5
0|a|b|NaN|NaN|NaN
0|NaN|NaN|NaN|c|NaN
0|a1|NaN|NaN|c2|NaN
1|b|c|NaN|NaN|NaN
1|NaN|NaN|a1|NaN|NaN
1|a1|b|NaN|c1|NaN
... (about 1000 rows)
I want this to be restructured like this:
id|s1|s2|s3|s4|s5
0|a|b|NaN|c|NaN
0|a1|b|NaN|c2|NaN
1|b|c|a1|c1|NaN
1|a1|b|a1|c1|NaN
I have tried:
df.unstack(), df.melt() and df.pivot()
None of them gave me the expected result. Basically, I want to reduce the NaNs as much as possible. Could anyone suggest a way? I want only one entry per cell, not a group of entries in a single cell.
I don't want NaN values; I want the rows filled as shown in the expected output above. A NaN should remain only when no row with the same id has a value for that column.
Group on id, ffill + bfill within each group, then drop_duplicates:
df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
id s1 s2 s3 s4 s5
0 0 a b NaN c NaN
2 0 a1 b NaN c2 NaN
3 1 b c a1 c1 NaN
5 1 a1 b a1 c1 NaN
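A sketch of the same per-group fill idea without groupby.apply (recent pandas versions warn about operating on the grouping column inside apply); the value columns are filled forward and backward within each id group, then duplicates are dropped:
import pandas as pd

value_cols = df.columns.difference(['id'])

filled = df.copy()
# forward-fill, then backward-fill, restarting at every id group
filled[value_cols] = filled.groupby('id')[value_cols].ffill()
filled[value_cols] = filled.groupby('id')[value_cols].bfill()

out = filled.drop_duplicates()
print(out)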

How to compare two dataframes and filter rows and columns where a difference is found

I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which has the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there is at least one differing value.
If I simply do
df1=[df_diff.values]
I get all the rows where there was at least one True in df_diff, but many of the columns still contain only False.
As a second step, I would then like to replace all the values that were equal (element-wise, where df_diff == False) with NaN.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both the rows and the columns that contain differences:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
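As a side note, pandas 1.1+ ships DataFrame.compare, which gives a similar "only the differences" view out of the box (the layout differs: each differing column is split into 'self'/'other' sub-columns); a quick sketch on the same example:
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame(data=[[1, 99, 3], [4, 5, 99], [7, 8, 9]])

# rows and columns with no differences are dropped automatically
print(df1.compare(df2))
#      1           2
#   self other  self other
# 0  2.0  99.0   NaN   NaN
# 1  NaN   NaN   6.0  99.0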

df.unique() on whole DataFrame based on a column

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering on the whole DataFrame such that it keeps the structure but that the duplicates (based on "Id") are removed?
It seems you need DataFrame.drop_duplicates with the subset parameter, which specifies which columns are tested for duplicates:
#keep first duplicate value
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep last duplicate value
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#remove all duplicate values
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
It's also possible to call duplicated() to flag the duplicates and then keep the rows with the negated flags.
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to drop duplicates conditionally, e.g. only duplicates of a specific value. For example, the following code drops duplicate 'a1' entries from column Id (duplicates of other values are kept).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
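A small demo of that conditional variant on the sample frame above (an extra 'b1' row is added here purely to show that duplicates of values other than 'a1' survive):
import pandas as pd

df = pd.DataFrame({'Id':   ['a1', 'a2', 'b1', 'b3', 'a1', 'b1'],
                   'Type': ['A',  'A',  'B',  'B',  'A',  'B']})

# drop a row only if it is a repeated 'a1'; the repeated 'b1' is kept
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
print(new_df)
#    Id Type
# 0  a1    A
# 1  a2    A
# 2  b1    B
# 3  b3    B
# 5  b1    B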
