mean chart:
interval gross(mean)
(1920, 1925] NaN
(1925, 1930] 3.443000e+06
(1930, 1935] 4.746000e+05
(1935, 1940] 2.011249e+06
I have a huge dataframe (df) which has some NaN values in the gross column.
Now I want to fill those NaN values from the mean chart according to the respective interval.
df:
name gross interval
k 1000 (1935, 1940]
l NaN (1950, 1955]
...
Here the interval column is categorical.
You can add a column to the dataframe with the corresponding mean value from your mean chart (do a left join using pd.merge on the interval column). Once you have this column, say means, you can use:
df['gross'] = df['gross'].fillna(df['means'])
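A minimal sketch of that approach, assuming the mean chart is itself a dataframe (called means_df here, which is an assumption) with interval and gross(mean) columns:
# left-join the per-interval means onto df (note: merge resets the index,
# so assign the filled values positionally with .values)
merged = df.merge(means_df.rename(columns={'gross(mean)': 'means'}),
                  on='interval', how='left')
df['gross'] = merged['gross'].fillna(merged['means']).values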
You can create a new Series with map and then replace the NaNs with combine_first.
The main advantage is that no helper column is needed, so there is nothing to remove later.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'gross(mean)':[np.nan, 3.443000e+06, 4.746000e+05, 2.011249e+06, 10, 20, 30],
                    'interval':[1922, 1927, 1932, 1938, 1932, 1938, 1953]})
df1['interval'] = pd.cut(df1['interval'], bins=[1920,1925,1930,1935,1940,1945,1950,1955])
print (df1)
gross(mean) interval
0 NaN (1920, 1925]
1 3443000.0 (1925, 1930]
2 474600.0 (1930, 1935]
3 2011249.0 (1935, 1940]
4 10.0 (1930, 1935]
5 20.0 (1935, 1940]
6 30.0 (1950, 1955]
df = pd.DataFrame({'name':['k','l'],
'gross':[1000, np.nan],
'interval':[1938, 1952]}, columns=['name','gross','interval'])
df['interval'] = pd.cut(df['interval'], bins=[1925,1930,1935,1940,1945,1950,1955])
print (df)
name gross interval
0 k 1000.0 (1935, 1940]
1 l NaN (1950, 1955]
mapped = df['interval'].map(df1.set_index('interval')['gross(mean)'].to_dict())
print (mapped)
0 20.0
1 30.0
Name: interval, dtype: float64
df['gross'] = df['gross'].combine_first(mapped)
print (df)
name gross interval
0 k 1000.0 (1935, 1940]
1 l 30.0 (1950, 1955]
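Equivalently, since mapped shares df's index, fillna gives the same result as combine_first here:
df['gross'] = df['gross'].fillna(mapped)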
I have a DataFrame with 2 columns total_open_amount and invoice_currency.
invoice_currency has
USD 45011
CAD 3828
Name: invoice_currency, dtype: int64
And I want to convert all the CAD amounts in the total_open_amount column to USD, according to invoice_currency, at an exchange rate of 1 CAD = 0.7 USD, and store them in a separate column.
My code:
df_data['converted_usd'] = df_data['total_open_amount'].where(df_data['invoice_currency']=='CAD')
df_data['converted_usd']= df_data['converted_usd'].apply(lambda x: x*0.7)
df_data['converted_usd']
output:
0 NaN
1 NaN
2 NaN
3 2309.79
4 NaN
...
49995 NaN
49996 NaN
49997 NaN
49998 NaN
49999 NaN
Name: converted_usd, Length: 48839, dtype: float64
I was able to fill the new column with CAD values converted but how do I fill the rest of the USD values now?
We can use Series.mask or Series.where. Series.mask replaces the rows where the condition holds (here, the rows where 'invoice_currency' is CAD); by default they would be set to NaN, but the other parameter tells it to fill them with the df_data['total_open_amount'] series multiplied by 0.7 instead.
With Series.where it is the reverse: rows that do not meet the condition are set to NaN. So we first multiply the series by 0.7 and keep only the rows where the condition is met, that is, the rows with CAD currency, and we use the other parameter to leave the rest of the rows at their initial value.
Note that Series.mask and Series.where are the opposite of each other.
df_data['converted_usd'] = df_data['total_open_amount']\
.mask(df_data['invoice_currency'] == 'CAD',
other=df_data['total_open_amount'].mul(0.7))
Or:
df_data['converted_usd'] = df_data['total_open_amount'].mul(0.7)\
.where(df_data['invoice_currency'] == 'CAD',
df_data['total_open_amount'])
numpy version
df_data['converted_usd'] = \
np.where(df_data['invoice_currency'] == 'CAD',
df_data['total_open_amount'].mul(0.7),
df_data['total_open_amount'])
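As a quick sanity check, here is a minimal sketch on a made-up three-row frame (the amounts are invented; 3299.7 * 0.7 reproduces the 2309.79 seen in the question's output):
import numpy as np
import pandas as pd

df_data = pd.DataFrame({'invoice_currency': ['USD', 'CAD', 'USD'],
                        'total_open_amount': [100.0, 3299.7, 250.0]})
df_data['converted_usd'] = np.where(df_data['invoice_currency'] == 'CAD',
                                    df_data['total_open_amount'].mul(0.7),
                                    df_data['total_open_amount'])
print(df_data)
#   invoice_currency  total_open_amount  converted_usd
# 0              USD              100.0         100.00
# 1              CAD             3299.7        2309.79
# 2              USD              250.0         250.00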
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 # .drop(columns = 'gestationalAgeInWeeks') # don't need this
 .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc'] # select just this column
 .max()
 .unstack()
 .add_prefix('abdomCirc_') # prefix after unstack, so only the tm columns are renamed
 .reset_index() # lift MotherID and PregnancyID back into columns
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm  MotherID  PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0          0            0          NaN        200.0          NaN
1          1            1          NaN        315.0        350.0
2          2            2        180.0          NaN          NaN
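With the ids back as ordinary columns, the merge mentioned in the question is then straightforward. A sketch, assuming the chained result above was assigned to a variable summary, and other_df is a hypothetical second dataframe keyed by the same ids:
# join on the preserved id columns
merged = summary.merge(other_df, on=['MotherID', 'PregnancyID'], how='left')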
I have a pandas DataFrame:
df['total_price'].describe()
returns
count 24895.000000
mean 216.377369
std 161.246931
min 0.000000
25% 109.900000
50% 174.000000
75% 273.000000
max 1355.900000
Name: total_price, dtype: float64
When I apply preprocessing.StandardScaler() to it:
import pandas as pd
from sklearn import preprocessing

x = df[['total_price']]
standard_scaler = preprocessing.StandardScaler()
x_scaled = standard_scaler.fit_transform(x)
df['new_col'] = pd.DataFrame(x_scaled)
My new column with the standardized values contains some NaNs:
df[['total_price', 'new_col']].head()
total_price new_col
0 241.95 0.158596
1 241.95 0.158596
2 241.95 0.158596
3 81.95 -0.833691
4 81.95 -0.833691
df[['total_price', 'new_col']].tail()
total_price new_col
28167 264.0 NaN
28168 264.0 NaN
28176 94.0 NaN
28177 166.0 NaN
28178 166.0 NaN
What's going wrong here?
The indices in your dataframe have gaps:
28167
28168
28176
28177
28178
When you call pd.DataFrame(x_scaled) you create a new contiguous index, so when assigning this as a column of the original dataframe, many rows will not have a match. You can resolve this by resetting the index in the original dataframe (df.reset_index()) or by updating x in place (x.update(x_scaled)).
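For example, a minimal sketch of the fixes (options 2 and 3 are positional alternatives I am adding here, not part of the suggestion above):
# option 1: make the index contiguous again before assigning
df = df.reset_index(drop=True)
df['new_col'] = pd.DataFrame(x_scaled)

# option 2: bypass index alignment by assigning the flattened array
df['new_col'] = x_scaled.ravel()

# option 3: rebuild the Series with df's own index
df['new_col'] = pd.Series(x_scaled.ravel(), index=df.index)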
I am trying to apply a map across several columns in pandas to reflect when data is invalid. When data is invalid in my df['Count'] column, I want to set my df['Value'], df['Lower Confidence Interval'], df['Upper Confidence Interval'] and df['Denominator'] columns to -1.
This is a sample of the dataframe:
Count Value Lower Confidence Interval Upper Confidence Interval Denominator
121743 54.15758428 53.95153779 54.36348867 224794
280 91.80327869 88.18009411 94.38654088 305
430 56.95364238 53.39535553 60.44152684 755
970 70.54545455 68.0815009 72.89492873 1375
nan
70 28.57142857 23.27957213 34.52488678 245
125 62.5 55.6143037 68.91456314 200
Currently, I am trying:
set_minus_1s = {np.nan: -1, '*': -1, -1: -1}
then:
df[['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']] = df['Count'].map(set_minus_1s)
and getting the error:
ValueError: Must have equal len keys and value when setting with an iterable
Is there any way of chaining the column references to make one call to the map rather than have separate lines for each column to call the set_minus_1s dictionary as a map?
I think you can use where or mask to replace all rows where the mapped value is not null after applying map:
val = df['Count'].map(set_minus_1s)
print (val)
0 NaN
1 NaN
2 NaN
3 NaN
4 -1.0
5 NaN
6 NaN
Name: Count, dtype: float64
cols =['Value','Count','Lower Confidence Interval','Upper Confidence Interval','Denominator']
df[cols] = df[cols].where(val.isnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
cols = ['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']
df[cols] = df[cols].mask(val.notnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
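For completeness, since every key in set_minus_1s maps to -1 anyway, you can get the same result by reusing val with a boolean .loc assignment (a one-line alternative, not from the answer above):
# wherever the mapped value is not null, the row is invalid, so write -1 across all columns
df.loc[val.notnull(), cols] = -1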
I have two dataframes: one has a multi-level column index, and the other has only single-level columns (the first level of the first dataframe; in other words, the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little. For example, during the five days (the window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened on 2016/2/24, and the index of 2016/2/24 within the five-day range is 1. So, in result1, the value for stock sh600870 on 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as an example: the stock sh600870 is in the sector '家用电器视听器材白色家电'. So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
sector, stock = col
mask = pd.notnull(index[col])
idx = index.loc[mask, col].astype(int)
result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
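In later versions still (pandas 0.23 and up), rolling.apply warns unless you pass the raw argument; since np.argmax works on plain numpy arrays, the first line would become (a sketch, otherwise unchanged):
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax, raw=True)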