How to merge time series data frames in Pandas with asymmetric data?

I have two Pandas DataFrames indexed by a timeline. We'll call the first df_A, in which the 'epoch' corresponds to the index.
df_A:
timeline epoch price z-value
0 1476336104 0 434.313 1
1 1476336120 1 434.312 false
2 1476336134 2 434.312 false
3 1476336149 3 435.900 false
4 1476336165 4 435.900 1
5 1476336178 5 435.500 1
The second, df_B, may have one, none, or multiple entries per index value of df_A, as you can see from the 'epoch' column.
df_B:
timeline epoch send-value tx-in
0 1476336123 1 10000 False
1 1476336169 4 299950000 False
2 1476336187 5 22879033493 False
3 1476336194 5 130000000 False
4 1476336212 7 10000000000 False
How can I merge these on the index of df_A, and add the extra values contained in df_B as columns? I'd also like to add a suffix to differentiate the additional columns. The two example datasets should create a new DataFrame, df_AB, that looks like this:
timeline epoch price z-value send-value tx-in send-value_1 tx-in_1
0 1476336104 0 434.313 1 NaN NaN NaN NaN
1 1476336120 1 434.312 false 10000 False NaN NaN
2 1476336134 2 434.312 false NaN NaN NaN NaN
3 1476336149 3 435.900 false NaN NaN NaN NaN
4 1476336165 4 435.900 1 299950000 False NaN NaN
5 1476336178 5 435.500 1 22879033493 False 130000000 False
It looks like there are a few different methods where I might be able to reindex and then merge on 'timeline', or use something like merge_asof, but I can't seem to get any of them to produce the result I am looking for.
How can I do this?
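One possible approach (a sketch, only checked against the sample data shown): number df_B's repeated epochs with groupby('epoch').cumcount(), unstack those repeats into suffixed columns, then left-join the result onto df_A on 'epoch':

import pandas as pd

df_A = pd.DataFrame({
    'timeline': [1476336104, 1476336120, 1476336134,
                 1476336149, 1476336165, 1476336178],
    'epoch': [0, 1, 2, 3, 4, 5],
    'price': [434.313, 434.312, 434.312, 435.900, 435.900, 435.500],
    'z-value': [1, 'false', 'false', 'false', 1, 1],
})
df_B = pd.DataFrame({
    'timeline': [1476336123, 1476336169, 1476336187, 1476336194, 1476336212],
    'epoch': [1, 4, 5, 5, 7],
    'send-value': [10000, 299950000, 22879033493, 130000000, 10000000000],
    'tx-in': [False, False, False, False, False],
})

# Number repeated epochs in df_B: 0 for the first entry, 1 for the second, ...
dup = df_B.groupby('epoch').cumcount()

# Give each repeat its own set of columns, indexed by epoch
wide = df_B.drop(columns='timeline').set_index(['epoch', dup]).unstack()

# Order the column groups by repeat number, then flatten with _1, _2, ... suffixes
wide = wide.sort_index(axis=1, level=1)
wide.columns = [col if i == 0 else f'{col}_{i}' for col, i in wide.columns]

# Left-join on epoch: epochs with no df_B entry get NaN, and df_B's
# epoch 7 (absent from df_A) is dropped
df_AB = df_A.join(wide, on='epoch')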

Related

Pandas: Use a selected number of previous rows in an apply function

Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row that uses the previous two rows, via an "apply" statement. For example, let's say I want to multiply the current row by the previous 2 rows if they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
Use the assign function:
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))
I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last 2, and joining back to the original dataframe:
df['value']=df.index.map(df.assign(result=df['value'].cumprod(0)).groupby(df.index//4).result.tail(2).to_dict())
If you have just four rows, then this should do. Let's try combining .cumprod() and .tail():
df['result']=df['value'].cumprod(0).tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0

How to apply np.power to a dataframe

I have two DataFrames. One contains rates; the other contains empty values that need to be calculated from the df1 rate table. I want to calculate each column from 0 to 2 in df2 using the equation
(1 + rate)^(-Age)
so the two tables and my desired results look like this:
df1
0 1 2
rate 0.54 0.45 0.25
df2
Age 0 1 2
0 1 NaN NaN NaN
1 2 NaN NaN NaN
...
29 30 NaN NaN NaN
results
Age 0 1 2
0 1 (1+0.54)^(-1) (1+0.45)^(-1) (1+0.25)^(-1)
1 2 (1+0.54)^(-2) (1+0.45)^(-2) (1+0.25)^(-2)
...
29 30 (1+0.54)^(-30) (1+0.45)^(-30) (1+0.25)^(-30)
I tried the following code:
y=np.power(1+rate.to_numpy(),-(df.Age))
but I got this error message:
"operands could not be broadcast together with shapes (1,6) (30,)"
How can I fix the code?
Based on your sample, you want to broadcast Age:
np.power(1 + df1.loc['rate'].to_numpy(), -df2['Age'].to_numpy()[:,None])
Output (rows for Age 1, 2 and 30 shown):
array([[6.49350649e-01, 6.89655172e-01, 8.00000000e-01],
[4.21656266e-01, 4.75624257e-01, 6.40000000e-01],
[2.36798188e-06, 1.44198231e-05, 1.23794004e-03]])
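If you want the result written back into df2 rather than returned as a bare array, a minimal follow-up sketch (assuming df2's value columns are labeled 0, 1 and 2 as in the sample) is:

import numpy as np

# Broadcast (30, 1) ages against (3,) rates to get a (30, 3) block,
# then assign it to the three value columns of df2
df2[[0, 1, 2]] = np.power(1 + df1.loc['rate'].to_numpy(),
                          -df2['Age'].to_numpy()[:, None])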

Empty DataFrame doesn't admit it's empty

I must not understand something about emptiness when it comes to pandas DataFrames. I have a DF with empty rows, but when I isolate one of these rows it's not empty.
Here I've made a dataframe:
>>> df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
>>> df
1 2 3
0 1.0 2.0 3.0
1 1.0 NaN 3.0
2 NaN NaN NaN
3 3.0 2.0 1.0
4 4.0 5.0 6.0
5 NaN NaN NaN
6 NaN NaN NaN
Then I know row '2' is full of nothing so I check for that...
>>> df[2:3].empty
False
Odd. So I split it out into its own dataframe:
>>> df1 = df[2:3]
>>> df1
1 2 3
2 NaN NaN NaN
>>> df1.empty
False
How do I check for emptiness (all the elements in a row being None or NaN)?
http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.empty.html
You're misunderstanding what empty is for. It checks whether a series/dataframe has any rows at all (i.e. whether its length is greater than 0); it says nothing about the values in those rows. For example,
df.iloc[1:0]
Empty DataFrame
Columns: [1, 2, 3]
Index: []
df.iloc[1:0].empty
True
If you want to check that a row has all NaNs, use isnull + all:
df.isnull().all(1)
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
For your example, this should do:
df[2:3].isnull().all(1).item()
True
Note that you can't use item if your slice is more than one row in size.
I guess you are looking for something like this:
In [296]: df[5:]
Out[296]:
1 2 3
5 NaN NaN NaN
6 NaN NaN NaN
In [297]: df[5:].isnull().all(1).all()
Out[297]: True
or even better (as proposed by @IanS):
In [300]: df[5:].isnull().all().all()
Out[300]: True
You can drop all null values from your selection and check if the result is empty:
>>> df[5:].dropna(how='all').empty
True
If you do not want to count NaN values as real numbers, this is equivalent to
df.dropna().iloc[5:]
Here you are selecting rows that no longer exist after the NaNs are dropped:
df.dropna().iloc[5:].empty
Out[921]: True
If you have a dataframe and want to drop all rows that are NaN in every column, you can do this:
df.dropna(how='all')
Note that your dataframe also has NaN in just one of the columns in some cases. If you need to drop the entire row in that case:
df.dropna(how='any')
After you do this (whichever is your preference) you can check the length of the dataframe (the number of rows it contains) using:
len(df)
I guess you have to use isnull() instead of empty.
import pandas
df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
df[2:3].isnull()
1 2 3
2 True True True

pandas: create column on a subset of a dataframe, set null on other rows?

I've got a pandas dataframe and I want to calculate percentiles based on the value of the calc_value column, unless calc_value is null, in which case percentile should also be null.
I'm using scipy's rankdata to calculate the percentiles, because it handles repeated values better than pandas's qcut.
However, rankdata has one flaw, which is that it will happily include null values, and there doesn't seem to be an option to exclude them.
import pandas as pd
from scipy.stats import rankdata

df = pd.DataFrame({'calc_value': [0, 0.081928, 0.94444, None, None]})
df['rank_val'] = rankdata(df.calc_value.values, method='min')
df.rank_val = df.rank_val - 1
df['percentile'] = (df.rank_val / float(len(df)-1)) * 100
This produces obviously wrong results:
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN 3 75
4 NaN 4 100
I can calculate the percentiles for all non-null values by slicing the dataframe, and doing the same calculations on the slice:
df_without_nan = df[df.calc_value.notnull()]
But what I don't know is how to push these values back into the main dataframe as df['percentile'], setting percentile and rank_val to be null on any rows where calc_value is also null.
Can anyone advise? I'm looking for the following results:
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN NaN NaN
4 NaN NaN NaN
Use pd.merge:
from scipy import stats

df_nonan = df[df['calc_value'].notnull()]
df_nonan['rank_val'] = stats.rankdata(df_nonan.calc_value.values, method='min')
df_nonan['rank_val'] = df_nonan['rank_val'] - 1
df_nonan['percentile'] = (df_nonan.rank_val / float(len(df)-1)) * 100
df_merge = pd.merge(df, df_nonan, left_index=True, right_index=True, how='left')
(This will give a SettingWithCopyWarning; if that's a problem, you can call reset_index on both dataframes and merge on the generated index column instead: pd.merge(df, df_nonan, on='index', how='left'), then drop the index column after the merge.) The merged dataframe at this point is
calc_value_x calc_value_y rank_val percentile
0 0.000000 0.000000 0 0
1 0.081928 0.081928 1 25
2 0.944440 0.944440 2 50
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Then do a bit of cleanup on the redundant columns:
del df_merge['calc_value_x']
df_merge = df_merge.rename(columns = {'calc_value_y' : 'calc_value'})
to wind up with
calc_value rank_val percentile
0 0.000000 0 0
1 0.081928 1 25
2 0.944440 2 50
3 NaN NaN NaN
4 NaN NaN NaN
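An alternative sketch that avoids the merge entirely: compute the ranks on the non-null slice and assign them back through .loc with a notnull mask, so rows where calc_value is null stay NaN automatically:

import pandas as pd
from scipy.stats import rankdata

df = pd.DataFrame({'calc_value': [0, 0.081928, 0.94444, None, None]})

# Rows where calc_value is present
mask = df['calc_value'].notnull()

# Rank only the non-null values; rows outside the mask are left as NaN
df.loc[mask, 'rank_val'] = rankdata(df.loc[mask, 'calc_value'], method='min') - 1
df.loc[mask, 'percentile'] = (df.loc[mask, 'rank_val'] / float(len(df) - 1)) * 100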

Pandas divide fill with multiple values

I have two pandas series objects with slightly different indexes. I want to divide one series by another. The default method gives me NAs when one of the two series is missing an indexed element. There is an option to fill missing values, but it can only be set to one value. I want to fill a value based on which series is missing the value.
For example
series1
0 10
1 20
2 30
3 40
series2
1 2
2 3
3 4
4 5
expected result: series1.divide(series2)
0 inf
1 10
2 10
3 10
4 0
actual result: series1.divide(series2)
0 NaN
1 10
2 10
3 10
4 NaN
Is there an easy way to do this?
You could use reindex to expand series1.index to include series2.index, filling the missing values with 0. Then you can use the div method, which by default leaves NaN wherever either operand is missing:
import pandas as pd

series1 = pd.Series([10,20,30,40], index=[0,1,2,3])
series2 = pd.Series([2,3,4,5], index=[1,2,3,4])
series1 = series1.reindex(series1.index.union(series2.index), fill_value=0)
print(series1.div(series2))
# 0 nan
# 1 10.00000
# 2 10.00000
# 3 10.00000
# 4 0.00000
# dtype: float64
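To also get inf at index 0, as in the expected result, a shorter sketch is to let div itself substitute 0 for entries missing from either series via its fill_value parameter:

import pandas as pd

series1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
series2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])

# fill_value=0 treats a missing entry in either series as 0, so
# index 0 becomes 10/0 = inf and index 4 becomes 0/5 = 0.0
print(series1.div(series2, fill_value=0))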
