Modify data in pandas - python

I have a DataFrame with several columns; for simplification, this is a reduced version:
ID geo value
a1 FR 3
a1 ES 7
a1 DE 6
a2 FR 3
a2 ES 5
a2 DE 10
I want to modify some of the values based on some conditions; my file is huge.
Ideally I would do:
df[(df.ID=='a1') & (df.geo=='DE')]['value']=9999
But this doesn't work, I guess because I'm obtaining a copy of my original DataFrame instead of a view of it.
Is there a simple way to update values based on complex conditions?

Try this:
condition = (df.ID=='a1') & (df.geo=='DE')
df.loc[condition, 'value'] = 9999
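A minimal end-to-end version built from the sample data in the question (.loc avoids the chained-assignment copy the question runs into; the older .ix accessor has been removed from recent pandas):
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'ID':    ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'geo':   ['FR', 'ES', 'DE', 'FR', 'ES', 'DE'],
    'value': [3, 7, 6, 3, 5, 10],
})

# Boolean mask for the rows to change, then assign through .loc
condition = (df.ID == 'a1') & (df.geo == 'DE')
df.loc[condition, 'value'] = 9999
print(df)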

Related

Category Condition on multiple columns

I have a data set as below:
ID A1 A2
0 A123 1234
1 1234 5568
2 5568 NaN
3 Zabc NaN
4 3456 3456
5 3456 3456
6 NaN NaN
7 NaN NaN
The intention is to go through columns A1 and A2, identify where both are blank (as in rows 6 and 7), create a new column, and categorise those rows as "Both A1 and A2 are blank".
I used the below code:
df['Z_Tax No Not Mapped'] = np.NaN
df['Z_Tax No Not Mapped'] = np.where((df['A1'] == np.NaN) & (df['A2'] == np.NaN), 1, 0)
However, the output marks every row as 0 in the new column 'Z_Tax No Not Mapped', even though the data has instances where both columns are blank. I'm not sure where I'm making a mistake in filtering such cases.
Note: Columns A1 and A2 are sometimes alphanumeric or just numeric.
The idea is to place a category in a separate column, such as "IDs are not updated" or "IDs are updated", so that a simple filter on "IDs are not updated" identifies the cases that are blank in both columns.
Use DataFrame.isna with DataFrame.all to test whether all of the selected columns are missing. The comparison df['A1'] == np.NaN is always False, because NaN never compares equal to anything, including itself:
df['Z_Tax No Not Mapped'] = np.where(df[['A1','A2']].isna().all(axis=1),
                                     'Both A1 and A2 are blank',
                                     '')
Or assign only to the matching rows with loc:
df.loc[df[['A1','A2']].isna().all(axis=1), "Z_Tax No Not Mapped"] = "Both A1 and A2 are blank"
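A small self-contained sketch with the rows from the question, assuming the blanks are real NaN values (not the text 'NaN'):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(8),
    'A1': ['A123', '1234', '5568', 'Zabc', '3456', '3456', np.nan, np.nan],
    'A2': ['1234', '5568', np.nan, np.nan, '3456', '3456', np.nan, np.nan],
})

# True only where both A1 and A2 are missing (rows 6 and 7 here)
both_blank = df[['A1', 'A2']].isna().all(axis=1)
df['Z_Tax No Not Mapped'] = np.where(both_blank,
                                     'Both A1 and A2 are blank',
                                     '')
print(df)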

Pandas: Some MultiIndex values appearing as NaN when reading Excel sheets

When reading an Excel spreadsheet into a Pandas DataFrame, Pandas appears to be handling merged cells in an odd fashion. For the most part, it interprets the merged cells as desired, apart from the first merged cell for each column, which is producing NaN values where it shouldn't.
dataframes = pd.read_excel(
    "../data/data.xlsx",
    sheet_name=[0, 1, 2],  # read the first three sheets as separate DataFrames
    header=[0, 1],         # rows [1, 2] in Excel
    index_col=[0, 1, 2],   # cols [A, B, C] in Excel
)
I load three sheets, but behaviour is identical for each so from now on I will only discuss one of them.
> dataframes[0]
Header 1   H2        H3   Value 1
Overall    Overall
A1         B1        0         10
NaN        NaN       1         11
NaN        B2        0         12
NaN        B2        1         13
A2         B1        0         11
A2         B1        1         12
A2         B2        0         13
A2         B2        1         14
As you can see, A1 loads with NaNs yet A2 (and all beyond it, in the real data) loads fine. Both A1 and A2 are actually single merged cells, each spanning 4 rows in the Excel spreadsheet itself.
What could be causing this issue? It would normally be a simple fix via a fillna(method="ffill") but MultiIndex does not support that. I have so far not found another workaround.
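One possible workaround, sketched here rather than verified against the actual spreadsheet: ffill is not available on the MultiIndex itself, but the index can be round-tripped through a DataFrame, which does support it:
import pandas as pd

df = dataframes[0]  # one of the sheets read above

# Convert the MultiIndex to a frame, forward-fill the gaps left by the
# merged cells, then rebuild the index from the filled frame.
idx = df.index.to_frame().ffill()
df.index = pd.MultiIndex.from_frame(idx)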

Appending column from one dataframe to another dataframe with multiple matches in loop

My question is an extension of this question:
Check if value in a dataframe is between two values in another dataframe
df1
df1_Col df1_start
0 A1 1200
1 B2 4000
2 B2 2500
df2
df2_Col df2_start df2_end data
0 A1 1000 2000 DATA_A1
1 A1 900 1500 DATA_A1_A1
**2 A1 2000 3000 DATA_A1_A1_A1**
2 B1 2000 3000 DATA_B1
3 B2 2000 3000 DATA_B2
output:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
I am comparing df1_Col against df2_Col and checking that df1_start falls within the range df2_start to df2_end; for matches, the value of the data column is added to df1. If there are multiple matches, the data values can be combined with any delimiter, such as ';'.
The code is as follows:
for v, ch in zip(df1.df1_start, df1.df1_Col):
    df3 = df2[(df2['df2_start'] < v) & (df2['df2_end'] > v) & (df2['df2_Col'] == ch)]
    data = df3['data']
    df1['data'] = data
A loop is used because the file is huge.
EDIT:
Looking forward to your assistance.
IIUC:
try via merge()+groupby()+agg():
Left-merge df2 onto df1, then check whether 'df1_start' falls between 'df2_start' and 'df2_end', setting the 'data' column to None where it does not. Then group on ['df1_Col','df1_start'] and join the values of 'data' separated by ';', dropping the Nones:
out = df1.merge(df2, left_on='df1_Col', right_on='df2_Col', how='left', sort=True)
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
out = (out.groupby(['df1_Col', 'df1_start'], as_index=False, sort=False)['data']
          .agg(lambda x: ';'.join(x.dropna())))
output of out:
df1_Col df1_start data
0 A1 1200 DATA_A1;DATA_A1_A1
1 B2 4000
2 B2 2500 DATA_B2
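For reference, the sample frames from the question typed out (including the highlighted A1 2000-3000 row), so the three lines above run as-is:
import pandas as pd

df1 = pd.DataFrame({'df1_Col':   ['A1', 'B2', 'B2'],
                    'df1_start': [1200, 4000, 2500]})

df2 = pd.DataFrame({'df2_Col':   ['A1', 'A1', 'A1', 'B1', 'B2'],
                    'df2_start': [1000,  900, 2000, 2000, 2000],
                    'df2_end':   [2000, 1500, 3000, 3000, 3000],
                    'data':      ['DATA_A1', 'DATA_A1_A1', 'DATA_A1_A1_A1',
                                  'DATA_B1', 'DATA_B2']})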

split rows in pandas dataframe based on spaces in cell entries

I have created the following dataframe in python using pandas
import numpy as np
import pandas as pd
We create a list
A=["THIS IS A NEW WORLD WE NEED A NEW PARADIGM: FOR THE NATION FOR THE PEOPLE",
"THIS IS A NEW WORLD ORDER;. WE NEED A NEW PARADIGM-: FOR THE NATION FOR THE PEOPLE%",
"THIS IS A NEW WORLD? WE NEED A NEW PARADIGM FOR THE NATION FOR THE PEOPLE PRESENT."]
Next we create a dataframe
df1=pd.DataFrame()
df1["A"]=A
df1["B"]=["A1", "A2", "A3"]
The dataframe appears as follows
A B
0 THIS IS A NEW WORLD WE NEED A NEW PARADIGM: FOR THE NATION FOR THE PEOPLE A1
1 THIS IS A NEW WORLD ORDER;. WE NEED A NEW PARADIGM-: FOR THE NATION FOR THE PEOPLE% A2
2 THIS IS A NEW WORLD? WE NEED A NEW PARADIGM FOR THE NATION FOR THE PEOPLE PRESENT. A3
In the above dataframe, column A contains strings separated by spaces.
How do I transform the dataframe to yield the following dataframe?
A B
0 THIS IS A NEW WORLD A1
1 WE NEED A NEW PARADIGM: A1
2 FOR THE NATION FOR THE PEOPLE A1
3 THIS IS A NEW WORLD ORDER;. A2
4 WE NEED A NEW PARADIGM-: A2
5 FOR THE NATION FOR THE PEOPLE% A2
6 THIS IS A NEW WORLD? A3
7 WE NEED A NEW PARADIGM A3
8 FOR THE NATION FOR THE PEOPLE PRESENT. A3
I request someone to take a look
If you need to split on 2 or more spaces, pass the regex \s{2,} to Series.str.split and then use DataFrame.explode:
df1['A'] = df1['A'].str.split(r'\s{2,}')
df = df1.explode('A')
print (df)
A B
0 THIS IS A NEW WORLD A1
0 WE NEED A NEW PARADIGM: FOR THE NATION FOR THE... A1
1 THIS IS A NEW WORLD ORDER;. A2
1 WE NEED A NEW PARADIGM-: FOR THE NATION FOR TH... A2
2 THIS IS A NEW WORLD? A3
2 WE NEED A NEW PARADIGM A3
2 FOR THE NATION FOR THE PEOPLE PRESENT. A3
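If the 0..8 row numbering from the desired output matters, the repeated index that explode leaves behind can be dropped; ignore_index is available in newer pandas, and reset_index(drop=True) works on any version:
# Renumber rows 0..n instead of repeating the original index
df = df1.explode('A', ignore_index=True)          # newer pandas
# df = df1.explode('A').reset_index(drop=True)    # equivalent on older versions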

How can I concatenate two dataframes with different multi-indexes without merging indexes?

I have two dataframes. Each has a two-level multi-index. The first level is the same in each, but the second level is different. I would like to concatenate the dataframes and end up with a dataframe with a three-level multi-index, where records from the first dataframe would have 'NaN' in the third index level, and records from the second dataframe would have 'NaN' in the second index level. Instead, I get a dataframe with a two-level index, where the values in the second level of each dataframe are put in the same index level, which takes the name of the second level in the first dataframe (see code below).
Is there a nice way to do this? I could make the second level of each index into a column, concatenate, then put them back into the index, but this seems like a roundabout way of doing it to me.
df1 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-2':['a2','b2','c2','d2'], 'values':[1,2,3,4]})
df2 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-3':['a3','b3','c3','d3'], 'values':[5,6,7,8]})
df1.set_index(['index-1','index-2'], inplace=True)
df2.set_index(['index-1','index-3'], inplace=True)
pd.concat([df1, df2])
Thanks!
It'll be easier to reset the index on the two input dataframes, concat them and then set the index again:
pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
    .set_index(['index-1', 'index-2', 'index-3'])
Result:
values
index-1 index-2 index-3
a1 a2 NaN 1
b1 b2 NaN 2
c1 c2 NaN 3
d1 d2 NaN 4
a1 NaN a3 5
b1 NaN b3 6
c1 NaN c3 7
d1 NaN d3 8
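As a quick sanity check on the result (here assigned to out for illustration), the first index level still selects rows that came from either original frame:
out = pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
        .set_index(['index-1', 'index-2', 'index-3'])

# Rows for 'a1' from both frames; the NaN level shows which frame each row came from
print(out.loc['a1'])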
