Pandas: Some MultiIndex values appearing as NaN when reading Excel sheets

When reading an Excel spreadsheet into a Pandas DataFrame, Pandas appears to be handling merged cells in an odd fashion. For the most part, it interprets the merged cells as desired, apart from the first merged cell for each column, which is producing NaN values where it shouldn't.
dataframes = pd.read_excel(
    "../data/data.xlsx",
    sheet_name=[0, 1, 2],  # read the first three sheets as separate DataFrames
    header=[0, 1],         # rows [1, 2] in Excel
    index_col=[0, 1, 2],   # cols [A, B, C] in Excel
)
I load three sheets, but behaviour is identical for each so from now on I will only discuss one of them.
> dataframes[0]
Header 1    H2         H3        Value 1
Overall     Overall
A1          B1         0         10
NaN         NaN        1         11
NaN         B2         0         12
NaN         B2         1         13
--------    -------    -------   -------
A2          B1         0         11
A2          B1         1         12
A2          B2         0         13
A2          B2         1         14
As you can see, A1 loads with NaNs, yet A2 (and everything beyond it, in the real data) loads fine. Both A1 and A2 are each a single merged cell spanning 4 rows in the Excel spreadsheet itself.
What could be causing this issue? It would normally be a simple fix via fillna(method="ffill"), but a MultiIndex does not support that, and I have so far not found another workaround.
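A possible workaround (a sketch, assuming dataframes[0] is the frame shown above, not tested against the actual file): fillna cannot be applied to a MultiIndex directly, but the index levels can be pulled out into a regular DataFrame, forward-filled, and rebuilt:
import pandas as pd

df = dataframes[0]
idx = df.index.to_frame()                          # index levels as ordinary columns
df.index = pd.MultiIndex.from_frame(idx.ffill())   # forward-fill the NaNs and rebuild the index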

Appending column from one dataframe to another dataframe with multiple matches in loop

My question is an extension of this question:
Check if value in a dataframe is between two values in another dataframe
df1
   df1_Col  df1_start
0  A1       1200
1  B2       4000
2  B2       2500
df2
   df2_Col  df2_start  df2_end  data
0  A1       1000       2000     DATA_A1
1  A1       900        1500     DATA_A1_A1
**2  A1     2000       3000     DATA_A1_A1_A1**
2  B1       2000       3000     DATA_B1
3  B2       2000       3000     DATA_B2
output:
   df1_Col  df1_start  data
0  A1       1200       DATA_A1;DATA_A1_A1
1  B2       4000
2  B2       2500       DATA_B2
I am comparing the value of df1_Col to match df2_Col, and checking that df1_start falls within the range defined by df2_start and df2_end; when it does, I add the value of the data column to df1. If there are multiple matches, the data values can be combined with any delimiter, such as ';'.
The code is as follows:
for v, ch in zip(df1.df1_start, df1.df1_Col):
    df3 = df2[(df2['df2_start'] < v) & (df2['df2_end'] > v) & (df2['df2_Col'] == ch)]
    data = df3['data']
    df1['data'] = data
Loops are used because the file is huge.
EDIT:
Looking forward to your assistance.
IIUC:
try via merge()+groupby()+agg():
Left-merge df2 onto df1, then check whether 'df1_start' falls between 'df2_start' and 'df2_end', creating the column 'data' and setting its value to None where it does not. Then group on ['df1_Col','df1_start'] and join the values of 'data' separated by ';', dropping the None values:
# left merge so every df1 row is paired with every df2 row sharing the same column value
out = df1.merge(df2, left_on='df1_Col', right_on='df2_Col', how='left', sort=True)
# blank out 'data' wherever df1_start does not fall inside [df2_start, df2_end]
out.loc[~out['df1_start'].between(out['df2_start'], out['df2_end']), 'data'] = None
# collapse back to one row per (df1_Col, df1_start), joining the surviving matches with ';'
out = out.groupby(['df1_Col', 'df1_start'], as_index=False, sort=False)['data'].agg(lambda x: ';'.join(x.dropna()))
output of out:
   df1_Col  df1_start  data
0  A1       1200       DATA_A1;DATA_A1_A1
1  B2       4000
2  B2       2500       DATA_B2

Pandas: conditionally concatenate original columns with a string

INPUT > df1
  ColumnA  ColumnB
  A1       NaN
  A1A2     NaN
  A3       NaN
What I am trying to do is change ColumnB's value conditionally: iterate over ColumnA, check its contents, and append remarks to ColumnB. The previous value of ColumnB should be kept when a new string is appended.
For the sample dataframe, what I want to do is:
Check whether ColumnA contains A1. If so, append the string "A1" to ColumnB (without clearing its previous value).
Check whether ColumnA contains A2. If so, append the string "A2" to ColumnB (without clearing its previous value).
OUTPUT > df1
  ColumnA  ColumnB
  A1       A1
  A1A2     A1_A2
  A3       NaN
I have tried the following code, but it is not working well.
Could anyone give me some advice? Thanks.
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB']+"_A1",df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB']+"_A2",df1['ColumnB'])
One way using pandas.Series.str.findall with join:
key = ["A1", "A2"]
df["ColumnB"] = df["ColumnA"].str.findall("|".join(key)).str.join("_")
print(df)
Output:
  ColumnA  ColumnB
0 A1       A1
1 A1A2     A1_A2
2 A3
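One caveat (not part of the original answer): str.findall leaves an empty string for rows with no match (the A3 row above), whereas the desired output shows NaN there. If that matters, the empty strings can be swapped back:
import numpy as np

df["ColumnB"] = df["ColumnB"].replace("", np.nan)  # restore NaN where nothing matched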
You cannot add or append strings to np.nan. That means you would always need to check whether a position in ColumnB is still np.nan or already a string before setting its new value. If all you want to do is work with text, you could initialize ColumnB with empty strings and append the selected string pieces from ColumnA:
import pandas as pd
import numpy as np
I = pd.DataFrame({'ColA': ['A1', 'A1A2', 'A2', 'A3']})
I['ColB'] = ''
I.loc[I.ColA.str.contains('A1'), 'ColB'] += 'A1'
print(I)
I.loc[I.ColA.str.contains('A2'), 'ColB'] += 'A2'
print(I)
The output is:
  ColA  ColB
0 A1    A1
1 A1A2  A1
2 A2
3 A3

  ColA  ColB
0 A1    A1
1 A1A2  A1A2
2 A2    A2
3 A3
Note: this is a very verbose version as an example.

How can I concatenate two dataframes with different multi-indexes without merging indexes?

I have two dataframes. Each has a two-level multi-index. The first level is the same in each, but the second level is different. I would like to concatenate the dataframes and end up with a dataframe with a three-level multi-index, where records from the first dataframe would have 'NaN' in the third index level, and records from the second dataframe would have 'NaN' in the second index level. Instead, I get a dataframe with a two-level index, where the values in the second level of each dataframe are put in the same index level, which takes the name of the second level in the first dataframe (see code below).
Is there a nice way to do this? I could make the second level of each index into a column, concatenate, then put them back into the index, but this seems like a roundabout way of doing it to me.
df1 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-2':['a2','b2','c2','d2'], 'values':[1,2,3,4]})
df2 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-3':['a3','b3','c3','d3'], 'values':[5,6,7,8]})
df1.set_index(['index-1','index-2'], inplace=True)
df2.set_index(['index-1','index-3'], inplace=True)
pd.concat([df1, df2])
Thanks!
It'll be easier to reset the index on the two input dataframes, concat them and then set the index again:
pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
    .set_index(['index-1', 'index-2', 'index-3'])
Result:
                           values
index-1  index-2  index-3
a1       a2       NaN           1
b1       b2       NaN           2
c1       c2       NaN           3
d1       d2       NaN           4
a1       NaN      a3            5
b1       NaN      b3            6
c1       NaN      c3            7
d1       NaN      d3            8

How to read csv file with missing values and 'delim_whitespace=True'?

I would like to know if it is possible to simply drop any line that causes an error instead of raising an exception.
My issue is connected to processing a text file such as this one:
111 aaa 222 bbb
1 a 2 b
11 22
Because of the varying number of whitespace characters used as separators, I am passing the 'delim_whitespace=True' option to the read_csv function. However, I am also explicitly specifying data types via the 'dtype' parameter.
It is natural that pandas shifts the value 22 to the second column for the third row (and I don't believe there is a way to convince it that the value actually belongs to the third one). However, since the second column is expected to be a string, this raises an exception.
I understand that this could probably be solved using the 'converters' parameter, but I am worried about performance since the data file is quite large (millions of rows).
So is it possible to drop lines with a lower number of columns (there is 'error_bad_lines' for a higher number), or to drop any line which causes an exception during type conversion? Or do you have any other ideas?
Use pandas.read_fwf to read the file. This will fill the empty fields with NaN values.
=^..^=
import pandas as pd
data = pd.read_fwf('data.txt', header=None)
data.columns = ["c1", "c2", "c3", "c4"]
load:
c1 c2 c3 c4
0 111 aaa 222 bbb
1 1 a 2 b
2 11 NaN 22 NaN
Next simply drop rows with NaN values:
out_data = data.dropna()
Output:
c1 c2 c3 c4
0 111 aaa 222 bbb
1 1 a 2 b
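If the explicit dtypes from the original read_csv call still matter, they could be applied once the incomplete rows are gone (a sketch, using the made-up column names c1..c4 from the answer above):
# after dropping the NaN rows, the intended types can be enforced explicitly
out_data = data.dropna().astype({"c1": int, "c2": str, "c3": int, "c4": str})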

Modify data in pandas

I have a DataFrame with several columns; for simplification, this is a reduced version:
ID geo value
a1 FR 3
a1 ES 7
a1 DE 6
a2 FR 3
a2 ES 5
a2 DE 10
I want to modify some of the values, based on some conditions; my file is huge.
Ideally I would do:
df[(df.ID=='1') & (df.geo=='DE')]['value']=9999
But this doesn't work, I guess because I am obtaining a copy of my original dataframe instead of the dataframe itself.
Is there any simple way to update values based on complex conditions?
Try this:
condition = (df.ID=='a1') & (df.geo=='DE')
df.ix[condition, 'value'] = 9999
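A note for newer pandas versions: .ix has since been deprecated and removed, and .loc performs the same boolean-mask assignment, so on current pandas the equivalent would be:
condition = (df.ID == 'a1') & (df.geo == 'DE')
df.loc[condition, 'value'] = 9999  # .loc replaces .ix on pandas 1.0+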
