Python pandas: how to fill values between existing ones in a DataFrame column?

I have a pandas DataFrame with 3 columns. The first column contains string values in ascending order at a fixed step (e.g. '20173070000', '20173070020', '20173070040', etc.). The second and third columns contain corresponding numeric values. I would like to resample the first column to a step of one ('20173070000', '20173070001', '20173070002', ...), simultaneously filling the second and third columns with NaN, and then I would like to interpolate those NaN values.
I've looked into resampling data, but that appears to work only for datetime values. I have also looked into DataFrame.interpolate, but that interpolates across missing values. As stated above, my dataset does not contain missing data. I simply want to increase the frequency of my entries, filling in between the existing values.
To give some reference, my current DataFrame looks like this:
              0     1     2
0   20173070000  14.0  13.9
1   20173070020  14.1  14.1
2   20173070040  13.8  13.6
3   20173070060  13.7  13.7
4   20173070080  13.8  13.5
5   20173070100  13.9  14.0
I would like to generate a DataFrame that looks like:
              0     1     2
0   20173070000  14.0  13.9
1   20173070001   NaN   NaN
2   20173070002   NaN   NaN
3   20173070003   NaN   NaN
4   20173070004   NaN   NaN
5   20173070005   NaN   NaN
...
20  20173070020  14.1  14.1
21  20173070021   NaN   NaN
...
I have no problem sorting out the interpolation afterwards, but I have not worked out how to upsample yet.

You can just use the reindex function. By default, it places NaN in locations that have no value in the new index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040,
                         20173070060, 20173070080, 20173070100],
                   'B': [14, 14.1, 13.8, 13.7, 13.8, 13.9],
                   'C': [13.9, 14.1, 13.6, 13.7, 13.5, 14.0]})
df.set_index('A').reindex(np.arange(df.A.min(), df.A.max() + 1)).reset_index()

I believe interpolate() is the way to go for you. After upsampling as you described, and assuming the column containing the values you want to interpolate is called 'val1', you can do:
df.loc[:, 'val1'] = df.loc[:, 'val1'].interpolate()
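Putting the two answers together, a minimal end-to-end sketch (assuming the key column is named 'A' as in the answer above and holds integers; if yours holds strings, convert with astype(int) first):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040],
                   'B': [14.0, 14.1, 13.8],
                   'C': [13.9, 14.1, 13.6]})

# Upsample: reindex to a step of 1, which inserts NaN rows in between.
full_range = np.arange(df['A'].min(), df['A'].max() + 1)
out = df.set_index('A').reindex(full_range)
out.index.name = 'A'  # make sure the key keeps its name through the reindex

# Interpolate the NaN gaps in B and C linearly, then restore 'A' as a column.
out = out.interpolate().reset_index()
print(out.head(10))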

Related

Rolling Correlation of a Multi-Column Pandas DataFrame

I am trying to calculate and then visualize the rolling correlation between multiple columns over a 180-day window (3 days in this example).
My data is formatted like this (in the original file there are 12 columns plus the timestamp, and thousands of rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Timestamp": ['1993-11-01', '1993-11-02', '1993-11-03',
                                 '1993-11-04', '1993-11-15'],
                   "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
                   "Belgium": [7.05, 7.05, 7.20, 7.50, 7.60],
                   "France": [7.69, 7.61, 7.67, 7.91, 8.61]},
                  index=[1, 2, 3, 4, 5])
    Timestamp  Austria  Belgium  France
1  1993-11-01     6.18     7.05    7.69
2  1993-11-02     6.18     7.05    7.61
3  1993-11-03     6.17     7.20    7.67
4  1993-11-04     6.17     7.50    7.91
5  1993-11-15     6.40     7.60    8.61
I can't just use the following formula, because the Timestamp column triggers a conversion error:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
When I drop the Timestamp column, I get a result of 1.0 for every cell, which is also not right; additionally, I lose the Timestamp, which I will need for the visualization graph in the end.
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
   Austria  Belgium  France
1      NaN      NaN     NaN
2      NaN      NaN     1.0
3      1.0      1.0     1.0
4     -inf      1.0     1.0
5      1.0      1.0     1.0
Does anyone have experience computing a rolling correlation across multiple columns while keeping a date index?
Building on Shreyans Jain's answer, I propose the following. It should work with an arbitrary number of columns:
import itertools as it

# omit the timestamp column
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]

res = pd.DataFrame()
for pair in col_pairs:
    # build the result column name from the first three letters of each name
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    res[corr_name] = (df[list(pair)]
                      .rolling(min_periods=1, window=3)
                      .corr()
                      .iloc[0::2, -1]
                      .reset_index(drop=True))
print(res)
   Aus_Bel_corr  Aus_Fra_corr  Bel_Fra_corr
0           NaN           NaN           NaN
1           NaN           NaN           NaN
2     -1.000000     -0.277350      0.277350
3     -0.755929     -0.654654      0.989743
4      0.693375      0.969346      0.849167
The NaN values at the beginning result from the windowing. (As background: rolling().corr() on a two-column frame returns a 2×2 correlation block per timestamp; .iloc[0::2, -1] picks the off-diagonal entry, i.e. the correlation of the pair, out of each block.)
Update: I uploaded a notebook with detailed explanations for what happens inside the loop.
https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
You can probably calculate the pair-wise correlations like this, instead of going for all three at once.
Once you have each correlation, you can add it directly as a column, preserving the timestamp.
df['Aus_Bel_corr'] = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Bel_Fra_corr'] = df[['Belgium', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Aus_Fra_corr'] = df[['Austria', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
I guess there is another way:
df['Aus_Bel_corr'] = df['Austria']\
    .rolling(min_periods=1, window=3)\
    .corr(df['Belgium'])
To me, this is a little simpler than the previous answers.
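To also keep the timestamps for the final graph, one option (a sketch, assuming the Timestamp column parses cleanly with pd.to_datetime) is to move it into the index before computing the correlations, so every result stays aligned with its date:

# Parse the timestamps and move them into the index; the rolling results
# then carry the dates along, ready for plotting.
dfi = df.copy()
dfi['Timestamp'] = pd.to_datetime(dfi['Timestamp'])
dfi = dfi.set_index('Timestamp')

aus_bel = dfi['Austria'].rolling(min_periods=1, window=3).corr(dfi['Belgium'])
print(aus_bel)  # a Series indexed by date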

Pandas: combine two Series into one

I need to combine the Series rateScore and rate into one.
This is the DataFrame I currently have:
     rateScore  rate
10         NaN   4.5
11         2.5   NaN
12         4.5   NaN
13         NaN   5.0
..
235        NaN   4.7
236        3.8   NaN
This needs to be something like this:
     rateScore
10         4.5
11         2.5
12         4.5
13         5.0
..
235        4.7
236        3.8
The rate column needs to be dropped after merging the Series, and the index of each row needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and then dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from the second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us use add:
df['rateScore'] = df['rateScore'].add(df['rate'], fill_value=0)
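For completeness, a minimal sketch of the combine_first route on the sample data, including the final drop of the rate column (values taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'rateScore': [np.nan, 2.5, 4.5, np.nan],
                   'rate':      [4.5, np.nan, np.nan, 5.0]},
                  index=[10, 11, 12, 13])

# Fill the gaps in rateScore from rate (the index is preserved), then drop rate.
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
df = df.drop(columns='rate')
print(df)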

Python: multiplying DataFrames of different sizes

I have two dataframes:
df1 is a pivot table that has totals for both columns and rows, both under the default name "All".
df2 is a DataFrame I created manually by specifying values, using the same index and column names as the pivot table above. This table does not have totals.
I need to multiply the first DataFrame by the values in the second. I expect the totals to come back as NaN, since totals don't exist in the second table.
When I perform the multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy DataFrames, it works as expected:
import pandas as pd
import numpy as np

table1 = np.array([[10, 20, 30, 60],
                   [50, 60, 70, 180],
                   [90, 10, 10, 110],
                   [150, 90, 110, 350]])
df1 = pd.DataFrame(data=table1, index=['One', 'Two', 'Three', 'All'],
                   columns=['A', 'B', 'C', 'All'])
print(df1)

table2 = np.array([[1.0, 2.0, 3.0],
                   [5.0, 6.0, 7.0],
                   [2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data=table2, index=['One', 'Two', 'Three'],
                   columns=['A', 'B', 'C'])
print(df2)

df3 = df1 * df2
print(df3)
This gives me the following output:
        A   B    C  All
One    10  20   30   60
Two    50  60   70  180
Three  90  10   10  110
All   150  90  110  350

          A     B     C
One    1.00  2.00  3.00
Two    5.00  6.00  7.00
Three  2.00  1.00  5.00

            A  All       B       C
All       NaN  NaN     NaN     NaN
One     10.00  NaN   40.00   90.00
Three  180.00  NaN   10.00   50.00
Two    250.00  NaN  360.00  490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the "All" column and row.
And I think the only difference between my dummy DataFrames and the real ones is that the real df1 was created with pd.pivot_table:
df1_real = pd.pivot_table(PY, values=['Annual Pay'], index=['PAR Rating'],
                          columns=['CR Range'], aggfunc=[np.sum], margins=True)
I do need to keep the totals, as I use them in other calculations.
I'm sure there is a workaround, but I really want to understand why the same code works on some DataFrames of different sizes but not on others. Or maybe the issue is something completely different.
Thank you for reading. I realize it's a very long post.
IIUC,
My Preferred Approach
You can use the mul method, which lets you pass the fill_value argument. In this case, you'll want a value of 1 (the multiplicative identity) to preserve the value from the DataFrame in which the value is not missing:
df1.mul(df2, fill_value=1)
            A    All      B      C
All    150.0  350.0   90.0  110.0
One     10.0   60.0   40.0   90.0
Three  180.0  110.0   10.0   50.0
Two    250.0  180.0  360.0  490.0
Alternate Approach
You can also embrace the NaN values and use a follow-up combine_first to fill the missing bits back in from df1:
(df1 * df2).combine_first(df1)
            A    All      B      C
All    150.0  350.0   90.0  110.0
One     10.0   60.0   40.0   90.0
Three  180.0  110.0   10.0   50.0
Two    250.0  180.0  360.0  490.0
I really like piRSquared's approach, and here is mine :-)
df1.loc[df2.index, df2.columns] *= df2
df1
Out[293]:
           A      B      C  All
One     10.0   40.0   90.0   60
Two    250.0  360.0  490.0  180
Three  180.0   10.0   50.0  110
All    150.0   90.0  110.0  350
@Wen, @piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution, but this worked for me.
Since I was able to multiply two dummy DataFrames of different sizes, I reasoned the issue wasn't the size but the fact that one of the DataFrames was created as a pivot table. Somehow the headers of this pivot table were not recognized, though visually they were there. So I decided to convert the pivot table to a regular DataFrame. Steps I took:
Converted the pivot table to records and then back to a DataFrame, using the solution from this thread: pandas pivot table to data frame.
Cleaned up the column headers, using the solution from the same thread.
Set my first column as the index, following the suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a DataFrame that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two DataFrames with no issues. I used the approach suggested by @Wen because I like that it preserves the structure.
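For reference, the likely root cause: pivot_table called with lists for values and aggfunc builds MultiIndex columns such as ('sum', 'Annual Pay', <CR Range value>), and multiplying a frame with MultiIndex columns by one with flat columns raises exactly the "cannot join with no level specified and no overlapping names" error. A sketch of a shorter fix (assuming the pivot call shown above) is to flatten the columns in place instead of round-tripping through records:

# Drop the aggfunc ('sum') and values ('Annual Pay') levels, keeping only
# the 'CR Range' labels, so the columns align with df2's flat columns.
df1_real.columns = df1_real.columns.droplevel([0, 1])
df3 = df1_real.mul(df2, fill_value=1)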

Resample a pandas DataFrame and merge strings in a column

I want to resample a pandas DataFrame and apply different functions to different columns. The problem is that I cannot properly process a column containing strings; I would like to apply a function that joins the strings with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np

idx = pd.date_range('2017-01-31', '2017-02-03')
data = [[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]]
dates = pd.DatetimeIndex(['2017-01-31', '2017-02-03', '2017-02-03'])
d = pd.DataFrame(data, index=dates, columns=list('ABC'))
            A   B      C
2017-01-31  1  10     ok
2017-02-03  2  20  merge
2017-02-03  3  30     us
Resampling the numeric columns A and B with sum and mean aggregators works. Column C, however, only sort of works with sum (and it ends up in the second position, which suggests something is off).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
              A        C     B
2017-01-31  1.0       ok  10.0
2017-02-01  NaN        0   NaN
2017-02-02  NaN        0   NaN
2017-02-03  5.0  mergeus  25.0
I would like to get this:
...
2017-02-03  5.0  merge - us  25.0
I tried using a lambda in different ways, but without success (not shown).
If I may ask a second, related question: I can do some post-processing for this, but how do I fill the missing cells in the different columns with zeros or ""?
Your agg function for column 'C' should be a join:
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
              A     B           C
2017-01-31  1.0  10.0          ok
2017-02-01  NaN   NaN
2017-02-02  NaN   NaN
2017-02-03  5.0  25.0  merge - us
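As for the follow-up question, a sketch of the post-processing: fillna accepts a per-column mapping, so the numeric columns can get 0 while the string column gets an empty string (C already comes back as '' for the empty days, so its entry is just a safeguard):

res = d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
res = res.fillna({'A': 0, 'B': 0, 'C': ''})
print(res)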

Excluding the NaN values while doing a sum operation across rows

I have two DataFrames, as given below:
df1 =
   2492  3853  2486  3712  2288
0   4.0   NaN   3.5   NaN   NaN
1   3.0   NaN   2.0   4.5   3.5
2   3.0   3.5   4.5   NaN   3.5
3   3.0   NaN   3.5   4.5   NaN
df2 =
2492    0.476683
3853    0.464110
2486    0.438992
3712    0.400275
2288    0.379856
Now I would like to get, for each row of df1, the sum of the df2 values for the columns where df1 is not NaN.
Expected output:
0    0.915675  [0.476683 + 0.438992]
1    1.695806  [0.476683 + 0.438992 + 0.400275 + 0.379856]
2    1.759641  [0.476683 + 0.464110 + 0.438992 + 0.379856]
3    1.315950  [0.476683 + 0.438992 + 0.400275]
Please let me know your thoughts on how to achieve this (without replacing the NaN values with 0).
df2.sum(1).sum()
should be enough and will skip NaNs.
The first sum is a DataFrame method that returns a Series containing the sum of every row; the second sums the values of that Series. NaNs are ignored by default.
Edit: simply using df2.sum() should be enough.
You can do:
>>> ((df1.fillna(0) > 0) * 1).mul(df2.iloc[:, 1].values).sum(axis=1)
0 0.915675
1 1.695806
2 1.759641
3 1.315950
dtype: float64
Note that the NaN values are not replaced "in place": you still have NaN in your original df1 after this operation.
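A slightly more direct variant (a sketch, assuming the weights are held in a Series whose index matches df1's column labels): notna() gives the mask of present values, and mul aligns the Series with the columns, so no fillna is needed at all:

import pandas as pd

# Hypothetical weights Series, indexed like df1's columns:
weights = pd.Series([0.476683, 0.464110, 0.438992, 0.400275, 0.379856],
                    index=[2492, 3853, 2486, 3712, 2288])

# True * w == w and False * w == 0, so the row sums skip the NaN positions.
result = df1.notna().mul(weights, axis=1).sum(axis=1)
print(result)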
