How to calculate the mean of a particular subset and replace the value - Python

(screenshot of the csv table omitted)
So I have a csv file that has different columns: nodeVolt, Temperature1, temperature2, temperature3, pressure and luminosity. In the temperature columns there are various cells where the value is wrong (i.e. 220). I want to replace each such value with the mean of the previous 10 cells in the same column. I want this to run dynamically: find all the cells with value 220 in a particular column and replace them with the mean of the previous 10 values in that column.
I was able to find the cells containing 220 in that particular column, but I am unable to take the mean and replace the value.
import pandas as pd
import numpy as np

data = pd.read_csv(r"108e.csv")
data = data.drop(['timeStamp', 'nodeRSSI', 'packetID', 'solarPanelVolt', 'solarPanelBattVolt',
                  'solarPanelCurr', 'temperature2', 'nodeVolt', 'nodeAddress'], axis=1)
df = pd.DataFrame(data)
df1 = df.loc[lambda df: df['temperature3'] == 220]
print(df1)
for i in df1:
    df1["temperature3"][i] == df["temperature3"][i-11:i-1, 'temperature3'].mean()

Here you go:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "something": 3.37,
        "temperature3": [
            31.94,
            31.93,
            31.85,
            31.91,
            31.92,
            31.89,
            31.9,
            31.94,
            32.06,
            32.16,
            32.3,
            220,
            32.1,
            32.5,
            32.2,
            32.3,
        ],
    }
)
# replace all 220 values by NaN
df["temperature3"] = df["temperature3"].replace({220: np.nan})
# fill all NaNs with a shifted rolling average of the last 10 rows
df["temperature3"] = df["temperature3"].fillna(
    df["temperature3"].rolling(10, min_periods=1).mean().shift(1)
)
Result:
something temperature3
0 3.37 31.940
1 3.37 31.930
2 3.37 31.850
3 3.37 31.910
4 3.37 31.920
5 3.37 31.890
6 3.37 31.900
7 3.37 31.940
8 3.37 32.060
9 3.37 32.160
10 3.37 32.300
11 3.37 31.986
12 3.37 32.100
13 3.37 32.500
14 3.37 32.200
15 3.37 32.300
(Please provide sample data as code next time, not as an image.)
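If the same 220 rule applies to the other temperature columns mentioned in the question, the replacement can be looped over them. A minimal sketch, assuming the column names Temperature1, temperature2 and temperature3 from the question's description (adjust to the actual CSV headers):
import numpy as np

# Sketch: assumes 'data' is the raw frame from pd.read_csv, before any columns are dropped
for col in ["Temperature1", "temperature2", "temperature3"]:
    s = data[col].replace({220: np.nan})
    data[col] = s.fillna(s.rolling(10, min_periods=1).mean().shift(1))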

Related

Pandas pivots index as columns using groupby-apply

I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
    return x + 2

def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[out_name]
    return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs and I call my pd_groupby_apply function again, I do receive my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out2.head().round(2))
group index
A 1 2.47
2 2.24
3 2.75
4 2.01
5 1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?
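A hedged sketch of one possible workaround, assuming the long (non-pivoted) shape is always wanted: return a one-column DataFrame instead of a Series from the inner function, since groupby-apply concatenates DataFrames vertically rather than aligning same-indexed Series side by side (the helper name below is made up for illustration):
def pd_groupby_apply_long(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        # one-column DataFrame, not a Series, so the result is never unstacked
        return df[[out_name]]
    return df.groupby(group_name).apply(apply_func)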

Rolling Correlation of Multi-Column Panda

I am trying to calculate and then visualize the rolling correlation between multiple columns over a 180-day window (3 days in this example).
My data is formatted like this (the original file has 12 columns plus the timestamp and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
Timestamp Austria Belgium France
1 1993-11-01 6.18 7.05 7.69
2 1993-11-02 6.18 7.05 7.61
3 1993-11-03 6.17 7.20 7.67
4 1993-11-04 6.17 7.50 7.91
5 1993-11-15 6.40 7.60 8.61
I can't just use the following, because I get a formatting error due to the Timestamp column:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
When I drop the Timestamp column I get a result of 1.0 for every cell, which is also not right, and additionally I lose the Timestamp, which I will need for the visualization graph in the end.
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
Austria Belgium France
1 NaN NaN NaN
2 NaN NaN 1.0
3 1.0 1.0 1.0
4 -inf 1.0 1.0
5 1.0 1.0 1.0
Does anyone have experience with doing a rolling correlation over multiple columns with a date index?
Building on the answer of Shreyans Jain, I propose the following. It should work with an arbitrary number of columns:
import itertools as it

# omit timestamp-col
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
res = pd.DataFrame()
for pair in col_pairs:
    # select the first three letters of each name of the pair
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    res[corr_name] = df[list(pair)].\
        rolling(min_periods=1, window=3).\
        corr().iloc[0::2, -1].reset_index(drop=True)
print(res)
Aus_Bel_corr Aus_Fra_corr Bel_Fra_corr
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.000000 -0.277350 0.277350
3 -0.755929 -0.654654 0.989743
4 0.693375 0.969346 0.849167
The NaN values at the beginning result from the windowing.
Update: I uploaded a notebook with detailed explanations for what happens inside the loop.
https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
You can probably calculate the pair-wise correlations like this, instead of going for all 3 at once.
Once you have the correlations, you can directly add them as columns as well, preserving the timestamp. (Note: the sample data has a France column, not Finland, so the pairs below use France.)
df['Aus_Bel_corr'] = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Bel_Fra_corr'] = df[['Belgium', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Aus_Fra_corr'] = df[['Austria', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
I guess there is another way:
df['Aus_Bel_corr'] = df['Austria']\
    .rolling(min_periods=1, window=3)\
    .corr(df['Belgium'])
For me, this is a little simpler than the previous answer.
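To keep the timestamps for the final visualization, one further option (a sketch, not from the answers above; assumes matplotlib is available for plotting) is to move Timestamp into the index before computing:
# Sketch: with Timestamp as the index, the rolling correlations keep the dates
df2 = df.set_index("Timestamp")
res = pd.DataFrame(index=df2.index)
res["Aus_Bel_corr"] = df2["Austria"].rolling(window=3, min_periods=1).corr(df2["Belgium"])
res.plot()  # the timestamps end up on the x-axis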

Matplotlib/Seaborn plot a boxplot with on the x-axis different range of values categories

I want to make a boxplot where the x variable is split into different ranges on the x-axis, e.g. 0-5, 5-10, 10+. Is there a way to do this efficiently in Matplotlib/Seaborn without having to create uneven new columns based on subsetting? So for the example dataset below I want a boxplot with 3 boxes given the rot_bonds variable: 0-5 (1a4j, 1a6u, 1ahc), 5-10 (1brq, 1bya), 10+ (1bbs).
structure rot_bonds no_atoms logP
0 1a4j 3 37 2.46
1 1a6u 4 17 1.58
2 1ahc 0 10 -0.06
3 1bbs 20 51 4.81
4 1brq 5 21 5.51
5 1bya 10 45 -9.75
Thanks in advance.
With seaborn you can use the slicing into ranges (via pd.cut) as the x-axis, and for example 'no_atoms' as the y-values for the boxplot:
from matplotlib import pyplot as plt
from io import StringIO
import pandas as pd
import seaborn as sns
s = ''' structure rot_bonds no_atoms logP
0 1a4j 3 37 2.46
1 1a6u 4 17 1.58
2 1ahc 0 10 -0.06
3 1bbs 20 51 4.81
4 1brq 5 21 5.51
5 1bya 10 45 -9.75'''
df = pd.read_csv(StringIO(s), delim_whitespace=True)
# include_lowest=True keeps rot_bonds == 0 in the first bin
sns.boxplot(x=pd.cut(df['rot_bonds'], [0, 5, 10, 1000], include_lowest=True), y='no_atoms', data=df)
plt.show()
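Optionally, pd.cut accepts a labels argument, so the boxes can carry the range names from the question instead of interval notation; a small sketch:
bins = pd.cut(df['rot_bonds'], [0, 5, 10, 1000],
              labels=['0-5', '5-10', '10+'], include_lowest=True)
sns.boxplot(x=bins, y='no_atoms', data=df)
plt.show()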

Python Pandas Simple Moving Average (deprecated pd.rolling_mean) [duplicate]

I would like to add a moving average calculation to my exchange time series.
Original data from Quandl
Exchange = Quandl.get("BUNDESBANK/BBEX3_D_SEK_USD_CA_AC_000",
                      authtoken="xxxxxxx")
# Value
# Date
# 1989-01-02 6.10500
# 1989-01-03 6.07500
# 1989-01-04 6.10750
# 1989-01-05 6.15250
# 1989-01-09 6.25500
# 1989-01-10 6.24250
# 1989-01-11 6.26250
# 1989-01-12 6.23250
# 1989-01-13 6.27750
# 1989-01-16 6.31250
# Calculating Moving Average
MovingAverage = pd.rolling_mean(Exchange,5)
# Value
# Date
# 1989-01-02 NaN
# 1989-01-03 NaN
# 1989-01-04 NaN
# 1989-01-05 NaN
# 1989-01-09 6.13900
# 1989-01-10 6.16650
# 1989-01-11 6.20400
# 1989-01-12 6.22900
# 1989-01-13 6.25400
# 1989-01-16 6.26550
I would like to add the calculated Moving Average as a new column to the right after Value using the same index (Date). Preferably I would also like to rename the calculated moving average to MA.
The rolling mean returns a Series; you only have to add it as a new column of your DataFrame (MA) as described below.
For information, the rolling_mean function has been deprecated in newer pandas versions. I have used the new method in my example; see below a quote from the pandas documentation.
Warning: Prior to version 0.18.0, pd.rolling_*, pd.expanding_*, and pd.ewm* were module level functions and are now deprecated. These are replaced by using the Rolling, Expanding and EWM objects and a corresponding method call.
df['MA'] = df['Value'].rolling(window=5).mean()
print(df)
# Value MA
# Date
# 1989-01-02 6.11 NaN
# 1989-01-03 6.08 NaN
# 1989-01-04 6.11 NaN
# 1989-01-05 6.15 NaN
# 1989-01-09 6.25 6.14
# 1989-01-10 6.24 6.17
# 1989-01-11 6.26 6.20
# 1989-01-12 6.23 6.23
# 1989-01-13 6.28 6.25
# 1989-01-16 6.31 6.27
A moving average can also be calculated and visualized directly in a line chart by using the following code:
Example using stock price data:
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime

plt.style.use('ggplot')

# Input variables
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2018, 3, 29)
stock = 'WFC'

# Extracting data
df = web.DataReader(stock, 'morningstar', start, end)
df = df['Close']
print(df)

plt.plot(df['WFC'], label='Close')
plt.plot(df['WFC'].rolling(9).mean(), label='MA 9 days')
plt.plot(df['WFC'].rolling(21).mean(), label='MA 21 days')
plt.legend(loc='best')
plt.title('Wells Fargo\nClose and Moving Averages')
plt.show()
Tutorial on how to do this: https://youtu.be/XWAPpyF62Vg
In case you are calculating more than one moving average:
for i in range(2, 10):
    df['MA{}'.format(i)] = df['Value'].rolling(window=i).mean()
Then you can do an aggregate average of all the MA columns:
df[[f for f in list(df) if "MA" in f]].mean(axis=1)
To get a cumulative moving average in pandas we can use cumsum and then divide by a running count.
Here is the working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': range(5),
                   'value': range(100, 600, 100)})
# some other similar statistics
df['cum_sum'] = df['value'].cumsum()
df['count'] = range(1,len(df['value'])+1)
df['mov_avg'] = df['cum_sum'] / df['count']
# other statistics
df['rolling_mean2'] = df['value'].rolling(window=2).mean()
print(df)
output
id value cum_sum count mov_avg rolling_mean2
0 0 100 100 1 100.0 NaN
1 1 200 300 2 150.0 150.0
2 2 300 600 3 200.0 250.0
3 3 400 1000 4 250.0 350.0
4 4 500 1500 5 300.0 450.0
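For reference, the same cumulative average is available as a built-in expanding window; a one-line sketch:
df['mov_avg2'] = df['value'].expanding().mean()  # identical to cum_sum / count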

Convert dataframe to numpy matrix where indexes are stored in the dataframe

I have a dataframe that looks like so
time usd hour day
0 2015-08-30 07:56:28 1.17 7 0
1 2015-08-30 08:56:28 1.27 8 0
2 2015-08-30 09:56:28 1.28 9 0
3 2015-08-30 10:56:28 1.29 10 0
4 2015-08-30 11:56:28 1.29 11 0
14591 2017-04-30 23:53:46 9.28 23 609
Given this, how would I go about building a numpy 2D matrix with hour on one axis, day on the other axis, and usd as the value stored in the matrix?
Consider the dataframe df
df = pd.DataFrame(dict(
    time=pd.date_range('2015-08-30', periods=14000, freq='H'),
    usd=(np.random.randn(14000) / 100 + 1.0005).cumprod()
))
Then we can set the index to the date and hour of the df.time column and unstack. We take the values of this result in order to access the underlying numpy array.
a = df.set_index([df.time.dt.date, df.time.dt.hour]).usd.unstack().values
I would do a pivot_table and leave the data as a pandas DataFrame, but the conversion to a numpy array is trivial if you don't want labels.
import pandas as pd
data = <data>
data.pivot_table(values = 'usd', index = 'hour', columns = 'day').values
Edit: Thank you #pyRSquared for the "Value"able tip. (changed np.array(data) to df...values)
You can use the pivot functionality of pandas, as described here. You will get NaN values for usd when there is no value for the day or hour.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'usd': [1.17, 1.27, 1.28, 1.29, 1.29, 9.28], 'hour': [7, 8, 9, 10, 11, 23], 'day': [0, 0, 0, 0, 0, 609]})
In [3]: df
Out[3]:
day hour usd
0 0 7 1.17
1 0 8 1.27
2 0 9 1.28
3 0 10 1.29
4 0 11 1.29
5 609 23 9.28
In [4]: df.pivot(index='hour', columns='day', values='usd')
Out[4]:
day 0 609
hour
7 1.17 NaN
8 1.27 NaN
9 1.28 NaN
10 1.29 NaN
11 1.29 NaN
23 NaN 9.28
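As a side note, newer pandas versions recommend .to_numpy() over .values for the final conversion to an array; a sketch:
mat = df.pivot(index='hour', columns='day', values='usd').to_numpy()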
