Pandas: Reindexing dataframe won't keep initial values - python

I have a dataframe consisting of 5 decreasing series (290 rows each) whose values are comprised between 0 and 1.
The data looks like that:
A B C D E
0.60 0.998494 1.0 1.0 1.0 1.0
0.65 0.997792 1.0 1.0 1.0 1.0
0.70 0.996860 1.0 1.0 1.0 1.0
0.75 0.995359 1.0 1.0 1.0 1.0
0.80 0.992870 1.0 1.0 1.0 1.0
I want to reindex the dataframe so that I have 0.01 increments between each row. I've tried pd.DataFrame.reindex but to no avail: that returns a dataframe where most of the values are np.NaN
import pandas as pd
df = pd.read_csv('http://pastebin.com/raw/yeHdk2Gq', index_col=0)
print df.reindex(np.arange(0.6, 3.5, 0.025)).head()
Which returns only two valid rows, and converts the 288 others to NaN:
A B C D E
0.600 0.998494 1.0 1.0 1.0 1.0
0.625 NaN NaN NaN NaN NaN
0.650 0.997792 1.0 1.0 1.0 1.0
0.675 NaN NaN NaN NaN NaN
0.700 NaN NaN NaN NaN NaN ##This row existed before reindexing
Pandas can't match the new index with the intial values, although there doesn't seem to be rounding issues (the initial index has no more than 2 decimals).
This seems somehow related to my data as the following works as intended:
df = pd.DataFrame(np.random.randn(10,3), columns=['A', 'B', 'C'])\
.reindex(np.arange(1, 10, 0.5))
print df.head()
Which gives:
A B C
1.0 0.206539 0.346656 2.578709
1.5 NaN NaN NaN
2.0 1.164226 2.693394 1.183696
2.5 NaN NaN NaN
3.0 -0.532072 -1.044149 0.818853
Thanks for your help!

This is because the precision of numpy.
In [31]: np.arange(0.6, 3.5, 0.025).tolist()[0:10]
Out[31]:
[0.6, 0.625, 0.65, 0.675, 0.7000000000000001, 0.7250000000000001,
0.7500000000000001, 0.7750000000000001, 0.8000000000000002, 0.8250000000000002]

As pointed out by #Danche and #EdChum, that was actually a NumPy rounding issue. The following works:
df = pd.read_csv('http://pastebin.com/raw/yeHdk2Gq', index_col=0)\
.reindex([round(i, 5) for i in np.arange(0.6, 3.5, 0.01)])\
.interpolate(kind='cubic', axis=0)
Returns as intended:
A B C D E
0.60 0.998494 1.0 1.0 1.0 1.0
0.61 0.998354 1.0 1.0 1.0 1.0
0.62 0.998214 1.0 1.0 1.0 1.0
0.63 0.998073 1.0 1.0 1.0 1.0
0.64 0.997933 1.0 1.0 1.0 1.0
Thanks

Related

Multiply columns values by a scalar based on conditions DataFrame

I want to multiply column values by a specific scalar based on the name of the column:
if column name = "Math", then all the values in 'Math" column should be multiply by 5;
if column name = "Physique", values in that column should be multiply by 4;
if column name = "Bio", values in that column should be multiplied by 3;
all the remaining columns should be multiplied by 2
What I have:
This is what I should have :
listm = ['Math', 'Physique', 'Bio']
def note_coef(row):
for m in listm:
if 'Math' in listm:
result = df['Math']*5
return result
df2=df.apply(note_coef)
df2
Note I stopped with only 1 if to test my code but the outcome is not what I expected. I am quite new in programming and here as well.
I think the most elegant solution is to define a dictionary (or a pandas.Series) with the multiplying factor for each column of your DataFrame (factors). Then you can multiply all the columns with the corresponding factor simply using df *= factors.
The multiplication is done via column axis alignment, i.e. by aligning the df.columns with the dictionary keys.
For instance, given the following DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ones(shape=(4, 5)), columns=['Math', 'Physique', 'Bio', 'Algo', 'Archi'])
>>> df
Math Physique Bio Algo Archi
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
You can do:
factors = {'Math': 5, 'Physique': 4, 'Bio': 3}
default_factor = 2
factors.update({col: default_factor for col in df.columns if col not in factors})
df *= factors
print(df)
Output:
Math Physique Bio Algo Archi
0 5.0 4.0 3.0 2.0 2.0
1 5.0 4.0 3.0 2.0 2.0
2 5.0 4.0 3.0 2.0 2.0
3 5.0 4.0 3.0 2.0 2.0
Fake data
n=5
d = {'a':np.ones(n),
'b':np.ones(n),
'c':np.ones(n),
'd':np.ones(n)}
df = pd.DataFrame(d)
print(df)
Select the columns and multiply by a tuple.
df[['a','c']] = df[['a','c']] * (2,4)
print(df)
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
a b c d
0 2.0 1.0 4.0 1.0
1 2.0 1.0 4.0 1.0
2 2.0 1.0 4.0 1.0
3 2.0 1.0 4.0 1.0
4 2.0 1.0 4.0 1.0
You can use df['col_name'].multiply(value) to apply on a whole column. The remaining columns can be modified in a loop of all columns except listm.
listm = ['Math', 'Physique', 'Bio']
for i, head in enumerate(listm):
df[head] = df[head].multiply(5-i)
heads = df.head()
for head in heads:
if not head in listm:
df[head] = df[head].multiply(2)
here is another way to do it using array multiplication
The data was not provided as a text, so created the test data in a patter of the screen shot
mul = [5,4,3,2,2,2,2,1] # multipliers
df1=df.iloc[:,1:].mul(mul)
df1.total = df1.iloc[:,:7].sum(axis=1)
df.update(df1, join='left', overwrite=True)
df
source Math Physics Bio Algo Archi Sport eng total
0 A 50.0 60.0 60.0 50.0 60.0 70.0 80.0 430.0
1 B 55.0 64.0 63.0 52.0 62.0 72.0 82.0 450.0
2 C 5.5 8.4 9.3 NaN NaN NaN NaN 23.2
3 D NaN NaN NaN 22.0 42.0 62.0 82.0 208.0
4 E 6.0 8.8 9.6 NaN NaN NaN NaN 24.4
5 F NaN NaN NaN 24.0 44.0 64.0 84.0 216.0
TEST DATA
data_out = [
['A', 10,15,20,25,30,35,40],
['B', 11,16,21,26,31,36,41],
['C', 1.1,2.1,3.1],
['D', np.NaN,np.NaN,np.NaN,11,21,31,41],
['E', 1.2,2.2,3.2],
['F', np.NaN,np.NaN,np.NaN,12,22,32,42],
]
df=pd.DataFrame(data_out, columns=[ 'source', 'Math', 'Physics', 'Bio', 'Algo', 'Archi', 'Sport', 'eng'])
df['total'] = df.iloc[:,1:].sum(axis=1)
source Math Physics Bio Algo Archi Sport eng total
0 A 10.0 15.0 20.0 25.0 30.0 35.0 40.0 175.0
1 B 11.0 16.0 21.0 26.0 31.0 36.0 41.0 182.0
2 C 1.1 2.1 3.1 NaN NaN NaN NaN 6.3
3 D NaN NaN NaN 11.0 21.0 31.0 41.0 104.0
4 E 1.2 2.2 3.2 NaN NaN NaN NaN 6.6
5 F NaN NaN NaN 12.0 22.0 32.0 42.0 108.0

Pandas dynamically replace nan values

I have a DataFrame that looks like this:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the nan values. I have tried doing (df.ffill()+df.bfill())/2 but that does not yield the desired output, as it casts the fill value to the whole column at once, rather then dynamically. I have tried with interpolate, but it doesn't work well for non linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every nan value to be the mean of the previous and next non nan value. In case there are more than 1 nan value in sequence, I want to replace one at a time and then compute the mean e.g., in case there is 1, np.nan, np.nan, 4, I first want the mean of 1 and 4 (2.5) for the first nan value - obtaining 1,2.5,np.nan,4 - and then the second nan will be the mean of 2.5 and 4, getting to 1,2.5,3.25,4
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by the #ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s:pd.Series):
for i in range(len(s)):
if pd.isna(s[i]):
last_valid_number = (s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0)
next_valid_numer = (s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0)
s[i] = (last_valid_number+next_valid_numer)/2
custom_fillna(df['a'])
df
Times comparison:
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import pandas as pd
def fill_dynamically(s: pd.Series):
for i in range(len(s)):
s[i] = (
(0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
(0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you would have other columns and don't want to apply that on the whole dataframe, you can of course use it on a single column, like that:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
fill_dynamically(df['a'])
In this case, df looks like that:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0

How to get "true" decimal place precision with pandas round?

What am I missing? I tried appending .round(3) to the end of of the api call but it doesnt work, and it also doesnt work in separate calls. The data types for all columns is numpy.float32.
>>> summary_data = api._get_data(units=list(units.id),
downsample=downsample,
table='summary_tb',
db=db).astype(np.float32)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.37945 70.399887 522.302124
1 20.0 1.0 1.0 1.0 3153.0 0.38449 70.575668 522.428162
2 30.0 1.0 1.0 1.0 3229.0 0.39079 70.575668 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.39438 70.575668 522.651184
4 50.0 1.0 1.0 1.0 3393.0 0.39690 70.663559 522.530090
>>> summary_data = summary_data.round(3)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
>>> print(type(summary_data))
pandas.core.frame.DataFrame
>>> print([type(summary_data[col][0]) for col in summary_data.columns])
[numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32]
It does in fact look like some form of rounding is taking place, but something weird is happening. Thanks in advance.
EDIT
The point of this is to use 32 bit floating numbers, not 64 bit. I have since used pd.set_option('precision', 3), but according the the documentation this only affects the display, but not the underlying value. As mentioned in a comment below, I am trying to minimize the number of atomic operations. Calculations on 70.575996 vs 70.57600 are more expensive, and this is the issue I am trying to tackle. Thanks in advance.
Hmm, this might be a floating-point issue. You could change the dtype to float instead of np.float32:
>>> summary_data.astype(float).round(3)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400 522.302
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.576 522.428
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.576 522.645
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.576 522.651
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664 522.530
If you change it back to np.float32 afterwards, it re-exhibits the issue:
>>> summary_data.astype(float).round(3).astype(np.float32)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029

Best way to reassemble a pandas data frame

Need to reassemble a data frame that is the result of a group by operation. It is assumed to be ordered.
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 NaN NaN 2 NaN
2 1.0 1.0 1 NaN
3 NaN NaN 2 NaN
4 NaN NaN 3 NaN
5 2.0 3.0 1 NaN
6 NaN NaN 2 2.0
And looking for something like this
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
Wondering if there is an elegant way to resolve it.
import pandas as pd
import numpy as np
def refill_frame(df, cols):
while df[cols].isnull().values.any():
for col in cols:
if col in list(df):
#print (col)
df[col]= np.where(df[col].isnull(), df[col].shift(1), df[col])
return df
df = pd.DataFrame({'Major': [0, None, 1, None, None,2, None],
'Minor': [0, None, 1, None, None,3, None],
'RelType': [1, 2, 1, 2,3, 1,2],
'SomeNulls': [1, None,None, None,None,None,2]
})
print (df)
cols2fill =['Major', 'Minor']
df = refill_frame(df, cols2fill)
print (df)
If I understand the question correctly, You could do a transform on the specific columns:
df.loc[:, ['Major', 'Minor']] = df.loc[:, ['Major', 'Minor']].transform('ffill')
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
You could also use the fill_direction function from pyjanitor:
# pip install pyjanitor
import janitor
df.fill_direction({"Major":"down", "Minor":"down"})
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
An if condition is True is in operation here: Where elements of pd.notnull(sample_df) are True use the corresponding elements from sample_df else use the elements from sample_df.mean(axis=1) and perform this logic along axis='rows'.

Categories