Convert dataframe to numpy matrix where indexes are stored in the dataframe - python

I have a dataframe that looks like so
                      time   usd  hour  day
0      2015-08-30 07:56:28  1.17     7    0
1      2015-08-30 08:56:28  1.27     8    0
2      2015-08-30 09:56:28  1.28     9    0
3      2015-08-30 10:56:28  1.29    10    0
4      2015-08-30 11:56:28  1.29    11    0
...
14591  2017-04-30 23:53:46  9.28    23  609
Given this, how would I go about building a numpy 2-D matrix with hour on one axis, day on the other axis, and usd as the value stored in the matrix?

Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    time=pd.date_range('2015-08-30', periods=14000, freq='H'),
    usd=(np.random.randn(14000) / 100 + 1.0005).cumprod()
))
Then we can set the index to the date and hour of the df.time column and unstack. We take the values of this result in order to access the underlying numpy array.
a = df.set_index([df.time.dt.date, df.time.dt.hour]).usd.unstack().values
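As a quick sanity check (my addition, based on the synthetic frame above), the result has one row per calendar date and one column per hour of day, NaN-padded wherever a date is incomplete:
print(a.shape)             # (584, 24): 14000 hourly points span 584 calendar dates
print(np.isnan(a).sum())   # 16: the final day only covers hours 0-7, the rest are NaN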

I would do a pivot_table and leave the data as a pandas DataFrame, but the conversion to a numpy array is trivial if you don't want labels.
import pandas as pd
data = <data>
data.pivot_table(values='usd', index='hour', columns='day').values
Edit: Thank you @pyRSquared for the "Value"able tip. (changed np.array(data) to df...values)
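One practical difference worth noting (my note, not from the original answer): pivot_table aggregates duplicate hour/day pairs (mean by default), whereas DataFrame.pivot raises on duplicates, so pivot_table is the safer choice when an (hour, day) combination can occur more than once:
# aggfunc='mean' is pivot_table's default, shown explicitly here
data.pivot_table(values='usd', index='hour', columns='day', aggfunc='mean').values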

You can use the pivot functionality of pandas, as described here. You will get NaN values for usd where there is no value for a given day/hour combination.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'usd': [1.17, 1.27, 1.28, 1.29, 1.29, 9.28], 'hour': [7, 8, 9, 10, 11, 23], 'day': [0, 0, 0, 0, 0, 609]})
In [3]: df
Out[3]:
day hour usd
0 0 7 1.17
1 0 8 1.27
2 0 9 1.28
3 0 10 1.29
4 0 11 1.29
5 609 23 9.28
In [4]: df.pivot(index='hour', columns='day', values='usd')
Out[4]:
day 0 609
hour
7 1.17 NaN
8 1.27 NaN
9 1.28 NaN
10 1.29 NaN
11 1.29 NaN
23 NaN 9.28
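If you want the plain array from this pivot as well (my addition), to_numpy() drops the labels, and fillna lets you pick a sentinel for the missing day/hour combinations first:
arr = df.pivot(index='hour', columns='day', values='usd').to_numpy()              # NaN where absent
arr0 = df.pivot(index='hour', columns='day', values='usd').fillna(0.0).to_numpy() # 0.0 where absent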

Related

Pandas pivots index as columns using groupby-apply

I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
    return x + 2

def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[out_name]
    return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs, and I call my pd_groupby_apply function again, I do receive my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out2.head().round(2))
group index
A 1 2.47
2 2.24
3 2.75
4 2.01
5 1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?
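A likely explanation, plus a workaround sketch (both my addition, not from the original thread): when every group's apply_func returns a Series with an identical index, pandas combines the per-group Series side by side into a wide, pivoted DataFrame; when the group indices differ, as in df_notfull, it falls back to concatenating them into a long MultiIndexed Series. Returning a one-column DataFrame from apply_func forces the long concatenation in both cases:
def pd_groupby_apply_long(df, out_name, in_name, group_name, index_name, function):
    def apply_func(g):
        if index_name is not None:
            g = g.set_index(index_name).sort_index()
        g[out_name] = function(g[in_name].values)
        return g[[out_name]]  # a DataFrame, not a Series: always concatenated long-wise
    return df.groupby(group_name).apply(apply_func)[out_name]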

Pandas combine two different length of time series dataframe

I am trying to combine two pandas dataframes with different timeframes. The first dataframe has a 1-hour timeseries, and the second dataframe has a 1-minute timeseries.
1 hour dataframe
get_time value
0 1599739200 123.10
1 1599742800 136.24
2 1599750000 224.14
1 minute dataframe
get_time value
0 1599739200 2.11
1 1599739260 3.11
2 1599739320 3.12
3 1599742800 4.23
4 1599742860 2.22
5 1599742920 1.11
6 1599746400 7.24
7 1599746460 22.10
8 1599746520 2.13
9 1599750000 5.14
10 1599750060 12.10
11 1599750120 21.30
I want to combine those two dataframes so that the value from the 1-hour dataframe is mapped onto the 1-minute dataframe. If there is no 1-hour value, the mapped value should be NaN.
Desired Result:
get_time value 1h mapped value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
Basically I want to combine those dataframes with this logic:
if (1m_get_time >= 1h_get_time) and (1m_get_time < 1h_get_time + 60 minutes):
    1h mapped value = 1h value
else:
    1h mapped value = nan
Currently I use a recursive method, but it takes a long time on big data. Here is example code for the dataframes:
dfhigh_ = pd.DataFrame({
    'get_time': [1599739200, 1599742800, 1599750000],
    'value': [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
    'get_time': [1599739200, 1599739260, 1599739320, 1599742800, 1599742860, 1599742920,
                 1599746400, 1599746460, 1599746520, 1599750000, 1599750060, 1599750120],
    'value': [2.11, 3.11, 3.12, 4.23, 2.22, 1.11, 7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})
Floor the get_time from dflow_ down to the nearest hour, then use Series.map to map the values from dfhigh_ onto dflow_ based on this floored timestamp:
hr = dflow_['get_time'] // 3600 * 3600
dflow_['mapped_value'] = hr.map(dfhigh_.set_index('get_time')['value'])
get_time value mapped_value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
This should work (for edge cases as well):
import pandas as pd
from datetime import datetime
import numpy as np

dfhigh_ = dfhigh_.rename(columns={'value': '1h mapped value'})
df_new = pd.merge(dflow_, dfhigh_, how='outer', on=['get_time'])
df_new.get_time = [datetime.fromtimestamp(x) for x in df_new['get_time']]
for idx, row in df_new.iterrows():
    if not np.isnan(row['1h mapped value']):
        current_hour, current_1h_mapped_value = row['get_time'].hour, row['1h mapped value']
        # propagate the hourly value to all minute rows sharing the same hour of day
        for sub_idx, sub_row in df_new.loc[(df_new.get_time.dt.hour == current_hour) & np.isnan(df_new['1h mapped value'])].iterrows():
            df_new.loc[sub_idx, '1h mapped value'] = current_1h_mapped_value
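A vectorized alternative (my sketch, not from either answer, starting again from the dfhigh_/dflow_ definitions above) is pd.merge_asof, which matches each 1-minute row to the most recent hourly row at most 59 min 59 s in the past, exactly the ">= 1h and < 1h + 60 minutes" rule:
import pandas as pd

out = pd.merge_asof(
    dflow_.sort_values('get_time'),
    dfhigh_.sort_values('get_time').rename(columns={'value': '1h mapped value'}),
    on='get_time',
    direction='backward',  # take the latest hourly row at or before each minute row
    tolerance=3599,        # but no more than 59 min 59 s back, otherwise NaN
)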

How to calculate mean in a particular subset and replace the value

[CSV table shown as an image in the original post]
So I have a csv file that has different columns like nodeVolt, Temperature1, temperature2, temperature3, pressure and luminosity. In the temperature columns there are various cells where the value is wrong (i.e. 220). I want to replace each such value with the mean of the previous 10 cells in the same column. I want this to run dynamically: find all the cells with value 220 in a particular column and replace them with the mean of the previous 10 values in that column.
I was able to find the cells containing 220 in that particular column, but was unable to take the mean and replace the value.
import pandas as pd
import numpy as np

data = pd.read_csv(r"108e.csv")
data = data.drop(['timeStamp', 'nodeRSSI', 'packetID', 'solarPanelVolt', 'solarPanelBattVolt',
                  'solarPanelCurr', 'temperature2', 'nodeVolt', 'nodeAddress'], axis=1)
df = pd.DataFrame(data)
df1 = df.loc[lambda df: df['temperature3'] == 220]
print(df1)
for i in df1:
    # this does not work: '==' compares instead of assigning, and the row slice is invalid
    df1["temperature3"][i] == df["temperature3"][i-11:i-1, 'temperature3'].mean()
Here you go:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "something": 3.37,
        "temperature3": [
            31.94, 31.93, 31.85, 31.91, 31.92, 31.89, 31.9, 31.94,
            32.06, 32.16, 32.3, 220, 32.1, 32.5, 32.2, 32.3,
        ],
    }
)
# replace all 220 values by NaN
df["temperature3"] = df["temperature3"].replace({220: np.nan})
# fill all NaNs with a shifted rolling average of the last 10 rows
df["temperature3"] = df["temperature3"].fillna(
    df["temperature3"].rolling(10, min_periods=1).mean().shift(1)
)
Result:
something temperature3
0 3.37 31.940
1 3.37 31.930
2 3.37 31.850
3 3.37 31.910
4 3.37 31.920
5 3.37 31.890
6 3.37 31.900
7 3.37 31.940
8 3.37 32.060
9 3.37 32.160
10 3.37 32.300
11 3.37 31.986
12 3.37 32.100
13 3.37 32.500
14 3.37 32.200
15 3.37 32.300
(please provide sample data as code next time, not as an image)
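Since the question mentions several temperature columns with the same 220 sentinel, the same recipe can be looped over them (a sketch; the column list here is assumed from the question's description, adjust it to your file):
for col in ["Temperature1", "temperature3"]:   # hypothetical column list
    cleaned = df[col].replace({220: np.nan})
    df[col] = cleaned.fillna(cleaned.rolling(10, min_periods=1).mean().shift(1))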

How to fix index getting lost when using append within a nested for loop

I want to filter out rows from a dataframe which are below a threshold (5th percentile) stored in another dataframe.
I have tried a nested for loop and appending the output, but the index is lost and the runtime is really long, over two minutes.
I have a dataframe called fiveperc which is in the format (366,1):
tmin
1 11.32
2 11.0
3 11.41
4 11.885
5 12.155
....
366 13.08
and another dataframe called df2 in the format of (18910,1)
date tmin
1966-01-01 13.9
1966-01-02 17.1
1966-01-03 17.1
1966-01-04 16.2
.....
2018-12-31 17
Using:
anomaly = []
for yearday, perc in fiveperc.iterrows():
    for date, temp in df2.iterrows():
        if yearday == date.dayofyear:
            anomaly.append(temp - perc)
anomaly = pd.DataFrame(anomaly)
The block of code above produces an output dataframe of shape (18910, 1):
index tmin
0 2.58
1 3.27
2 4.27
3 2.08
4 -3.52
....
18909 5.579
The problem here is that the datetime index from df2 is lost, resulting in a different arrangement! On top of that, the nested for loop takes over two minutes to run.
Extra code, for once I get the code above working:
anomaly[anomaly>0]=np.nan
anomaly[anomaly<0]= 1
anomaly.replace(0, np.nan, inplace=True)
Frequency = pd.DataFrame(final.groupby(lambda x: x.dayofyear)['anomaly'].agg(sum))
Is there a much better way to do this?
You can look up the day of the year on a column with the dt accessor:
In [11]: df
Out[11]:
date tmin
0 1966-01-01 13.9
1 1966-01-02 17.1
2 1966-01-03 17.1
3 1966-01-04 16.2
In [12]: df1
Out[12]:
tmin
1 11.320
2 11.000
3 11.410
4 11.885
5 12.155
In [13]: df1.loc[df.date.dt.dayofyear, "tmin"]
Out[13]:
1 11.320
2 11.000
3 11.410
4 11.885
Name: tmin, dtype: float64
In [14]: df["tmin"] - df1.loc[df.date.dt.dayofyear, "tmin"].values
Out[14]:
0 2.580
1 6.100
2 5.690
3 4.315
Name: tmin, dtype: float64
You can also do this with a groupby transform, but my suspicion is this will be slightly slower:
In [21]: df.groupby(df.date.dt.dayofyear)["tmin"].transform(lambda x: x - df1.loc[x.name, "tmin"])
Out[21]:
0 2.580
1 6.100
2 5.690
3 4.315
Name: tmin, dtype: float64
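To address the lost-index complaint directly (my sketch building on the answer above, assuming the question's layout: fiveperc indexed by day of year 1-366 and df2 carrying a DatetimeIndex), assign the vectorized difference back so the datetime index survives, then count the below-threshold days per day of year:
anomaly = df2['tmin'] - fiveperc.loc[df2.index.dayofyear, 'tmin'].values
frequency = (anomaly < 0).groupby(df2.index.dayofyear).sum()  # days below the 5th percentile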

Pandas DF Multiple Conditionals using np.where

I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like the df below, with four columns. I would like to add the two columns described below, with the following conditions; the desired output is the df df_so_v2.
Days since activity
* Find the most recent prior row with the same ID, then subtract the dates column
* If there is no prior row, return NA
Chg. Avg. Value
* Condition 1: if Count = 0, NA
* Condition 2: if Count != 0, find the most recent prior row with BOTH the same ID and Count != 0, then take the difference in the Avg. Value column.
However, I am building off simple np.where queries like the below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)
df_dict_v2 = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                         '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                         '2017-08-01', '2017-08-01', '2017-08-01'],
              'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
              'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
              'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
              'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
              'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA']}
df_so_v2 = pd.DataFrame(df_dict_v2)
Here is the answer to this part of the question; I need more clarification on the conditions of part 2.
1) Days since activity *Find most recent prior row with same ID, then subtract dates column *If no most recent value, return NA
First you need to convert strings to datetime, then sort the dates in ascending order. Finally use .transform to find the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)
df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so = df_so.sort_index()  # restore the original row order
Edited based on your comment:
Find the most recent previous day that does not have a count of Zero and calculate the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df = pd.DataFrame(df_dict)
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID', 'DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)                        # first row of each ID
mask2 = df.groupby('ID').Count.shift(1) == 0          # previous row of the same ID had Count == 0
df.loc[mask, 'Days_since_activity'] = np.nan
df.loc[mask2, 'Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff(2)
conditions = [df['Count'] == 0]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df
New unsorted Output for easy comparison:
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 should be NaT because the most recent previous row has a Count of zero and there is nothing else to compare it to.
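The diff(2) trick only reaches back one extra row, so it would still miss the last active row if an ID had two or more consecutive zero-Count rows. A more general sketch (my addition, the same idea carried further): blank out inactive rows, forward-fill the last active Avg. Value within each ID, and difference against that:
df = df.sort_values(['ID', 'DateOf'])
active = df['Avg. Value'].where(df['Count'] != 0)  # NaN on rows with no activity
prev_active = active.groupby(df['ID']).transform(lambda s: s.shift(1).ffill())
df['Chg. Avg Value'] = (df['Avg. Value'] - prev_active).where(df['Count'] != 0)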
