I have a data set as below:
ID A1 A2
0 A123 1234
1 1234 5568
2 5568 NaN
3 Zabc NaN
4 3456 3456
5 3456 3456
6 NaN NaN
7 NaN NaN
The intention is to go through each column (A1 and A2), identify where both columns are blank (as in rows 6 and 7), create a new column, and categorise those rows as "Both A1 and A2 are blank".
I used the below code:
df['Z_Tax No Not Mapped'] = np.NaN
df['Z_Tax No Not Mapped'] = np.where((df['A1'] == np.NaN) & (df['A2'] == np.NaN), 1, 0)
However, the output marks all rows as 0 in the new column 'Z_Tax No Not Mapped', even though the data has instances where both columns are blank. I'm not sure where I'm making a mistake in filtering such cases.
Note: Columns A1 and A2 are sometimes alphanumeric or just numeric.
The idea is to place a category in a separate column, such as "IDs are not updated" or "IDs are updated", so that a simple filter on "IDs are not updated" identifies the cases that are blank in both columns.
The comparison df['A1'] == np.NaN is always False, because NaN does not compare equal to anything, including itself. Use DataFrame.isna with DataFrame.all to test whether both values per row are missing:
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
                                     'Both A1 and A2 are blank',
                                     '')
Alternatively, with DataFrame.loc (testing only the A1 and A2 columns):
df.loc[df[['A1','A2']].isna().all(axis=1), "Z_Tax No Not Mapped"] = "Both A1 and A2 are blank"
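For reference, a minimal, self-contained sketch of this approach on the sample data, using the "IDs are updated" / "IDs are not updated" labels mentioned in the question:

import numpy as np
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'ID': range(8),
    'A1': ['A123', '1234', '5568', 'Zabc', '3456', '3456', np.nan, np.nan],
    'A2': ['1234', '5568', np.nan, np.nan, '3456', '3456', np.nan, np.nan],
})

# Flag rows where both A1 and A2 are missing.
both_blank = df[['A1', 'A2']].isna().all(axis=1)
df['Z_Tax No Not Mapped'] = np.where(both_blank, 'IDs are not updated', 'IDs are updated')

print(df)
# Rows 6 and 7 are labelled 'IDs are not updated'; all other rows 'IDs are updated'.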
I have a pandas dataframe with a categorical column "Code" that has more than 100 unique values. I have multiple rows for the same "Name" and would like to capture all of the information pertaining to a unique "Name" in one row. Therefore, I'd like to transpose the column "Code" with the values from "Counter".
How do I transpose "Code" in such a way that the following table:
Name   Col1  Col2  Col3  Code  Counter
Alice  ...   ...   ...   a1    4
Alice  ...   ...   ...   a2    3
Bob    ...   ...   ...   b1    9
Bob    ...   ...   ...   c2    1
Bob    ...   ...   ...   a2    4
becomes this:
Name   Col1  Col2  Col3  a1  a2  b1  c2
Alice  ...   ...   ...   4   3   0   0
Bob    ...   ...   ...   0   4   9   1
I can't comment yet, but the answer above (from Yuca) should work for you: assign the pivot result to a variable and it will be your dataframe. To be sure, you can also wrap it in pd.DataFrame explicitly:
import pandas as pd
Pivoted = df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
dataframe = pd.DataFrame(data=Pivoted)
try
df.pivot(index='Name', columns='Code', values='Counter').fillna(0)
output
Code a1 a2 b1 c2
Name
Alice 4.0 3.0 0.0 0.0
Bob 0.0 4.0 9.0 1.0
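One hedged side note: df.pivot raises a ValueError if the same (Name, Code) pair occurs more than once; pivot_table with an aggregation function is a common workaround and also lets you fill the gaps with integer zeros. A sketch on the sample data:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'Code': ['a1', 'a2', 'b1', 'c2', 'a2'],
    'Counter': [4, 3, 9, 1, 4],
})

# pivot_table tolerates duplicate (Name, Code) pairs by aggregating them.
out = df.pivot_table(index='Name', columns='Code', values='Counter',
                     aggfunc='sum', fill_value=0)
print(out)
# Code   a1  a2  b1  c2
# Name
# Alice   4   3   0   0
# Bob     0   4   9   1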
My question is an extension of this question:
Check if value in a dataframe is between two values in another dataframe
df1
df1_Col df1_start
0 A1 1200
1 B2 4000
2 B2 2500
df2
df2_Col df2_start df2_end data
0 A1 1000 2000 DATA_A1
1 A1 900 1500 DATA_A1_A1
**2 A1 2000 3000 DATA_A1_A1_A1**
2 B1 2000 3000 DATA_B1
3 B2 2000 3000 DATA_B2
output:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
I am matching df1_Col against df2_Col and checking that df1_start falls within the range of df2_start and df2_end, then adding the values of the data column to df1. If there are multiple matches, the data values can be combined with any delimiter, such as ';'.
The code is as follows:
for v, ch in zip(df1.df1_start, df1.df1_Col):
    df3 = df2[(df2['df2_start'] < v) & (df2['df2_end'] > v) & (df2['df2_Col'] == ch)]
    data = df3['data']
    df1['data'] = data
Loops are used because the file is huge.
EDIT:
Looking forward to your assistance.
IIUC:
Try merge() + groupby() + agg():
Left merge df2 onto df1, then check whether 'df1_start' falls between 'df2_start' and 'df2_end'; where it does not, set the 'data' value to None. Then group on ['df1_Col','df1_start'] and join the 'data' values with ';', dropping the Nones:
out=df1.merge(df2,left_on='df1_Col',right_on='df2_Col',how='left',sort=True)
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
out=out.groupby(['df1_Col','df1_start'],as_index=False,sort=False)['data'].agg(lambda x:';'.join(x.dropna()))
output of out:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
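For completeness, a self-contained sketch reproducing the sample frames from the question and running the same merge/groupby approach:

import pandas as pd

df1 = pd.DataFrame({'df1_Col': ['A1', 'B2', 'B2'],
                    'df1_start': [1200, 4000, 2500]})
df2 = pd.DataFrame({'df2_Col': ['A1', 'A1', 'A1', 'B1', 'B2'],
                    'df2_start': [1000, 900, 2000, 2000, 2000],
                    'df2_end': [2000, 1500, 3000, 3000, 3000],
                    'data': ['DATA_A1', 'DATA_A1_A1', 'DATA_A1_A1_A1',
                             'DATA_B1', 'DATA_B2']})

# Pair every df1 row with every df2 row of the same code, blank out the
# out-of-range matches, then join the remaining 'data' values with ';'.
out = df1.merge(df2, left_on='df1_Col', right_on='df2_Col', how='left')
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
out = (out.groupby(['df1_Col', 'df1_start'], as_index=False, sort=False)['data']
          .agg(lambda x: ';'.join(x.dropna())))
print(out)
#   df1_Col  df1_start                data
# 0      A1       1200  DATA_A1;DATA_A1_A1
# 1      B2       4000
# 2      B2       2500             DATA_B2

Note that between is inclusive at both ends, whereas the loop in the question uses strict < and >; adjust with the inclusive parameter if the boundaries matter.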
I have a dataframe as follows:
id|s1|s2|s3|s4|s5
0|a|b|NaN|NaN|NaN
0|NaN|NaN|NaN|c|NaN
0|a1|NaN|NaN|c2|NaN
1|b|c|NaN|NaN|NaN
1|NaN|NaN|a1|NaN|NaN
1|a1|b|NaN|c1|NaN
... (1000 rows)
I want this to be restructured like this:
id|s1|s2|s3|s4|s5
0|a|b|NaN|c|NaN
0|a1|b|NaN|c2|NaN
1|b|c|a1|c1|NaN
1|a1|b|a1|c1|NaN
I have tried:
df.unstack(),df.melt() and df.pivot()
None of them gave me the expected result. Basically, I want to reduce the NaNs as much as possible. Could anyone suggest a way? I want only one entry per cell, not a group of entries in a single cell.
I don't want NaN values; I want the rows as shown in the expected output above. I want NaN only where no value exists in any of the rows with the same id.
Group on id, ffill + bfill within each group, then drop_duplicates:
df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
id s1 s2 s3 s4 s5
0 0 a b NaN c NaN
2 0 a1 b NaN c2 NaN
3 1 b c a1 c1 NaN
5 1 a1 b a1 c1 NaN
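On newer pandas versions, groupby().apply over the whole frame can warn about operating on the grouping column. A hedged, apply-free sketch of the same idea using GroupBy.ffill / GroupBy.bfill, assuming the value columns are s1..s5 as in the sample:

import pandas as pd

value_cols = ['s1', 's2', 's3', 's4', 's5']  # assumed from the sample frame
filled = df.copy()
# Fill forward and then backward within each id group.
filled[value_cols] = df.groupby('id')[value_cols].ffill()
filled[value_cols] = filled.groupby('id')[value_cols].bfill()
out = filled.drop_duplicates()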
I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which has the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1=[df_diff.values]
I get all the rows where there was at least one True in df_diff, but lots of columns originally had False only.
As a second step, I would then like to replace all the values (element-wise in the dataframe) which were equal (where df_diff == False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row (or per column):
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both the rows and the columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with DataFrame.where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
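On pandas 1.1 and newer, DataFrame.compare gives a closely related view directly: it keeps only the rows and columns with at least one difference and shows the values from both frames side by side.

import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame(data=[[1, 99, 3], [4, 5, 99], [7, 8, 9]])

# Rows/columns that are identical in both frames are dropped automatically.
print(df1.compare(df2))
#      1            2
#   self other  self other
# 0  2.0  99.0   NaN   NaN
# 1  NaN   NaN   6.0  99.0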
I have a DataFrame df filled with rows and columns where there are duplicate Id values:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering on the whole DataFrame such that it keeps the structure but that the duplicates (based on "Id") are removed?
It seems you need DataFrame.drop_duplicates with the parameter subset, which specifies the columns to check for duplicates:
#keep first duplicate value
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep last duplicate value
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#remove all duplicate values
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
It's also possible to call duplicated() to flag the duplicates and filter with the negation of the flags.
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to drop duplicates conditionally, e.g. only duplicates of a specific value. For example, the following code drops duplicate 'a1' rows from column Id (other duplicates are not dropped).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
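As a quick, self-contained check of that conditional drop on the sample frame above:

import pandas as pd

df = pd.DataFrame({'Id': ['a1', 'a2', 'b1', 'b3', 'a1'],
                   'Type': ['A', 'A', 'B', 'B', 'A']})

# Keep a row if it is not a duplicate, or if its Id is anything other than 'a1':
# only the repeated 'a1' row (index 4) is dropped.
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
print(new_df)
#    Id Type
# 0  a1    A
# 1  a2    A
# 2  b1    B
# 3  b3    B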