INPUT>df1
ColumnA  ColumnB
A1       NaN
A1A2     NaN
A3       NaN
What I am trying to do is change ColumnB's value conditionally: iterate over checks against ColumnA and append remarks to ColumnB, keeping ColumnB's previous value each time a new string is appended.
For the sample dataframe, what I want to do is:
Check whether ColumnA contains "A1"; if so, append the string "A1" to ColumnB (without clearing its previous value).
Check whether ColumnA contains "A2"; if so, append the string "A2" to ColumnB (without clearing its previous value).
OUTPUT>df1
ColumnA  ColumnB
A1       A1
A1A2     A1_A2
A3       NaN
I have tried the following code, but it does not work well.
Could anyone give me some advice? Thanks.
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB']+"_A1",df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB']+"_A2",df1['ColumnB'])
One way using pandas.Series.str.findall with join:
key = ["A1", "A2"]
df["ColumnB"] = df["ColumnA"].str.findall("|".join(key)).str.join("_")
print(df)
Output:
ColumnA ColumnB
0 A1 A1
1 A1A2 A1_A2
2 A3
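Note that rows with no match end up with an empty string rather than NaN (row 2 above). If you want NaN there, to match the desired output exactly, one option is to replace empty strings afterwards:
import numpy as np
df["ColumnB"] = df["ColumnB"].replace("", np.nan)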
You cannot add or append strings to np.nan. That means you would always need to check whether each position in your ColumnB is still np.nan or already a string in order to set its new value properly. If all you want to do is work with text, you could initialize your ColumnB with empty strings and append selected string pieces from ColumnA like this:
import pandas as pd

I = pd.DataFrame({'ColA': ['A1', 'A1A2', 'A2', 'A3']})
I['ColB'] = ''  # initialize with empty strings instead of NaN
I.loc[I.ColA.str.contains('A1'), 'ColB'] += 'A1'  # append 'A1' where ColA matches
print(I)
I.loc[I.ColA.str.contains('A2'), 'ColB'] += 'A2'  # append 'A2' where ColA matches
print(I)
The output is:
   ColA  ColB
0    A1    A1
1  A1A2    A1
2    A2
3    A3

   ColA  ColB
0    A1    A1
1  A1A2  A1A2
2    A2    A2
3    A3
Note: this is a deliberately verbose version, for the sake of the example.
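For more keys, the same idea can be written as a short loop; a minimal sketch using the frame I from above:
for key in ['A1', 'A2']:
    # append each matching key to ColB in turn
    I.loc[I.ColA.str.contains(key), 'ColB'] += key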
I have a data set as below:
ID    A1    A2
0   A123  1234
1   1234  5568
2   5568   NaN
3   Zabc   NaN
4   3456  3456
5   3456  3456
6    NaN   NaN
7    NaN   NaN
The intention is to go through each column (A1 and A2), identify rows where both columns are blank (as in rows 6 and 7), create a new column, and categorise those rows as "Both A1 and A2 are blank".
I used the below code:
df['Z_Tax No Not Mapped'] = np.NaN
df['Z_Tax No Not Mapped'] = np.where((df['A1'] == np.NaN) & (df['A2'] == np.NaN), 1, 0)
However, the output marks all rows as 0 in the new column 'Z_Tax No Not Mapped', even though the data has instances where both columns are blank. I'm not sure where I'm making a mistake in filtering such cases.
Note: Columns A1 and A2 are sometimes alphanumeric or just numeric.
The idea is to place a category in a separate column, "IDs are not updated" or "IDs are updated", so that a simple filter on "IDs are not updated" identifies the cases that are blank in both columns.
The comparison df['A1'] == np.NaN always evaluates to False, because NaN does not compare equal to anything, including itself. Instead, use DataFrame.isna with DataFrame.all to test whether all of the selected columns are missing:
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
                                     'Both A1 and A2 are blank',
                                     '')
Alternatively, with DataFrame.loc (restricting the isna test to the two columns of interest, since other columns such as ID may be populated):
df.loc[df[['A1','A2']].isna().all(axis=1), "Z_Tax No Not Mapped"] = "Both A1 and A2 are blank"
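If you prefer the two categories described in the question, the same mask works with np.where; a minimal sketch (the labels simply follow the question's wording):
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
                                     'IDs are not updated',
                                     'IDs are updated')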
When reading an Excel spreadsheet into a Pandas DataFrame, Pandas appears to be handling merged cells in an odd fashion. For the most part, it interprets the merged cells as desired, apart from the first merged cell for each column, which is producing NaN values where it shouldn't.
dataframes = pd.read_excel(
"../data/data.xlsx",
sheet_name=[0,1,2], # read the first three sheets as separate DataFrames
header=[0,1], # rows [1,2] in Excel
index_col=[0,1,2], # cols [A,B,C] in Excel
)
I load three sheets, but behaviour is identical for each so from now on I will only discuss one of them.
> dataframes[0]
Header 1             Value 1
H2                   Overall
H3                   Overall
A1   B1   0               10
NaN  NaN  1               11
NaN  B2   0               12
NaN  B2   1               13
A2   B1   0               11
A2   B1   1               12
A2   B2   0               13
A2   B2   1               14
As you can see, A1 loads with NaNs yet A2 (and all beyond it, in the real data) loads fine. Both A1 and A2 are each actually a single merged cell spanning 4 rows in the Excel spreadsheet itself.
What could be causing this issue? Normally this would be a simple fix via fillna(method="ffill"), but MultiIndex does not support that, and I have so far not found another workaround.
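One possible workaround, sketched here under the assumption that only the index levels (not the data cells) are affected: pull the index out into a DataFrame, forward-fill it there, and rebuild the MultiIndex.
import pandas as pd

df = dataframes[0]
idx = df.index.to_frame()                          # index levels as ordinary columns
df.index = pd.MultiIndex.from_frame(idx.ffill())   # forward-fill the merged-cell gaps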
I have two dataframes. Each has a two-level multi-index. The first level is the same in each, but the second level is different. I would like to concatenate the dataframes and end up with a dataframe with a three-level multi-index, where records from the first dataframe would have 'NaN' in the third index level, and records from the second dataframe would have 'NaN' in the second index level. Instead, I get a dataframe with a two-level index, where the values in the second level of each dataframe are put in the same index level, which takes the name of the second level in the first dataframe (see code below).
Is there a nice way to do this? I could make the second level of each index into a column, concatenate, then put them back into the index, but this seems like a roundabout way of doing it to me.
df1 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-2':['a2','b2','c2','d2'], 'values':[1,2,3,4]})
df2 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-3':['a3','b3','c3','d3'], 'values':[5,6,7,8]})
df1.set_index(['index-1','index-2'], inplace=True)
df2.set_index(['index-1','index-3'], inplace=True)
pd.concat([df1, df2])
Thanks!
It'll be easier to reset the index on the two input dataframes, concat them and then set the index again:
pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
.set_index(['index-1', 'index-2', 'index-3'])
Result:
                         values
index-1 index-2 index-3
a1      a2      NaN           1
b1      b2      NaN           2
c1      c2      NaN           3
d1      d2      NaN           4
a1      NaN     a3            5
b1      NaN     b3            6
c1      NaN     c3            7
d1      NaN     d3            8
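Note that this is essentially the "roundabout" column approach mentioned in the question, just written as a single chained expression; as far as I know, pd.concat has no option of its own for spreading differently named index levels into separate levels.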
I have some data frames where I want to add new columns, and in the new column each element should be a list. For example, take a dataframe of two rows:
df
index colA colB
0 a a1
1 b b1
Now I can add new column as
df['colC']=5
index colA colB colC
0 a a1 5
1 b b1 5
Now I want to add a third column with each element being the same list:
index colA colB colC
0 a a1 ['m','n','p']
1 b b1 ['m','n','p']
but
df['colC'] = ['m','n','p']
gives the error
ValueError: Length of values does not match length of index
which is obvious.
I know that in this example I can do
df['colC'] = [['m','n','p'], ['m','n','p']]
but I want to set each element to the same list of strings when I do not know the number of rows.
Can anyone suggest an easy way to achieve this?
Adding an object (a list) to a cell is tricky:
df['colC']=[['m','n','p']]*len(df)
Or, giving each row its own independent list:
df['colC'] = [list('mnp') for _ in range(len(df))]
Now df contains:
index colA colB colC
0 0 a a1 [m, n, p]
1 1 b b1 [m, n, p]
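One caveat with the first version: [['m','n','p']]*len(df) stores the same list object in every row, so mutating it through one cell changes all rows, while the list comprehension creates an independent list per row. A quick sketch to illustrate:
df['colC'] = [['m', 'n', 'p']] * len(df)
df.loc[0, 'colC'].append('q')  # mutate the list held in row 0
print(df['colC'])              # every row now shows [m, n, p, q]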
I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering on the whole DataFrame such that it keeps the structure but that the duplicates (based on "Id") are removed?
It seems you need DataFrame.drop_duplicates with the parameter subset, which specifies the columns to check for duplicates:
#keep first duplicate value
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep last duplicate value
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#remove all duplicate values
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
It's also possible to call duplicated() to flag the duplicates and keep only the rows where the flag is False (note the negation).
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to conditionally drop duplicates, e.g. drop duplicates only of a specific value. For example, the following code drops duplicate 'a1's from column Id (other duplicates are not dropped).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
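With the sample frame above, this keeps rows 0 through 3 and drops only the second 'a1' (row 4):
      Id Type
Index
0     a1    A
1     a2    A
2     b1    B
3     b3    B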