Pandas pct change from initial value - python

I want to find the pct_change of Dew_P Temp (C) from the initial value of -3.9. I want the pct_change in a new column.
Source here:
weather = pd.read_csv('https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/weather_2012.csv')
weather[weather.columns[:4]].head()
Date/Time Temp (C) Dew_P Temp (C) Rel Hum (%)
0 2012-01-01 -1.8 -3.9 86
1 2012-01-01 -1.8 -3.7 87
2 2012-01-01 -1.8 -3.4 89
3 2012-01-01 -1.5 -3.2 88
4 2012-01-01 -1.5 -3.3 88
I have tried variations of this for loop (even going as far as adding an index shown here) but to no avail:
for index, dew_point in weather['Dew_P Temp (C)'].iteritems():
    new = weather['Dew_P Temp (C)'][index]
    old = weather['Dew_P Temp (C)'][0]
    pct_diff = (new - old) / old * 100
    weather['pct_diff'] = pct_diff
I think the problem is the weather['pct_diff'] assignment: it doesn't use new, it takes the last value of the data frame and subtracts it from old.
So it's always (2.1-3.9)/3.9*100, and thus my percent change is always -46%.
The end result I want is this:
Date/Time Temp (C) Dew_P Temp (C) Rel Hum (%) pct_diff
0 2012-01-01 -1.8 -3.9 86 0.00%
1 2012-01-01 -1.8 -3.7 87 5.12%
2 2012-01-01 -1.8 -3.4 89 12.82%
Any ideas? Thanks!

You can use iat to access the scalar value (e.g. iat[0] accesses the first value in the series).
df = weather
df['pct_diff'] = df['Dew_P Temp (C)'] / df['Dew_P Temp (C)'].iat[0] - 1
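If you also want the new column rendered as positive percentage strings like the expected output above (where 5.12% looks truncated rather than rounded), a small formatting sketch on top of the same expression:

pct = (df['Dew_P Temp (C)'] / df['Dew_P Temp (C)'].iat[0] - 1).abs() * 100
df['pct_diff'] = pct.map('{:.2f}%'.format)  # e.g. "0.00%", "5.13%"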

IIUC you can do it this way:
In [88]: ((weather['Dew Point Temp (C)'] - weather.loc[0, 'Dew Point Temp (C)']).abs() / weather.loc[0, 'Dew Point Temp (C)']).abs() * 100
Out[88]:
0 0.000000
1 5.128205
2 12.820513
3 17.948718
4 15.384615
5 15.384615
6 20.512821
7 7.692308
8 7.692308
9 20.512821

I find this more graceful
weather['Dew_P Temp (C)'].pct_change().fillna(0).add(1).cumprod().sub(1)
0 0.000000
1 -0.051282
2 -0.128205
3 -0.179487
4 -0.153846
Name: Dew_P Temp (C), dtype: float64
To get your expected output with absolute values
weather['pct_diff'] = weather['Dew_P Temp (C)'].pct_change().fillna(0).add(1).cumprod().sub(1).abs()
weather
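For what it's worth, the chained form is algebraically the same as dividing by the first value (cumulating 1 + pct_change multiplicatively gives value / initial), so it should agree with the iat answer as long as the series never hits exactly 0 along the way, since pct_change divides by the previous value. A quick check, assuming no zeros in the column:

chained = weather['Dew_P Temp (C)'].pct_change().fillna(0).add(1).cumprod().sub(1)
direct = weather['Dew_P Temp (C)'] / weather['Dew_P Temp (C)'].iat[0] - 1
assert (chained - direct).abs().max() < 1e-9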

Related

I want to read each cell of a pandas df one after another and do some calculations on them

I want to read each cell of a pandas df one after another and do some calculations on them, but I have a problem using dictionaries or lists. For example, I want to check for the i-th row whether the outdoor temperature is more than X and the humidity is more/less than Y, and then do a special calculation for that row.
here is the body of loaded df:
data=pd.read_csv('/content/drive/My Drive/Thesis/DS1.xlsx - Sheet1.csv')
data=data.drop(columns=["Date","time","real feel","Humidity","indoor temp"])
print(data)
and here is the data:
outdoor temp Unnamed: 6 Humidity Estimation: (poly3)
0 26 NaN 64.1560
1 25 NaN 68.6875
2 25 NaN 68.6875
3 24 NaN 72.4640
4 24 NaN 72.4640
.. ... ... ...
715 35 NaN 22.5625
716 33 NaN 28.1795
717 32 NaN 32.3680
718 31 NaN 37.2085
719 30 NaN 42.5000
[720 rows x 3 columns]
Create a function and then use .apply() to run it on each row. You can edit temp and humid to your desired values. If you want to reference a specific row, use data.iloc[row_index]. I am not sure what calculation you want to do, so I just added one to the value.
def calculation(row, temp, humid):
    # column names must match your dataframe exactly; the remaining
    # humidity column in the sample data is "Humidity Estimation: (poly3)"
    if row["outdoor temp"] > temp:
        row["outdoor temp"] += 1
    if row["Humidity Estimation: (poly3)"] > humid:
        row["Humidity Estimation: (poly3)"] += 1
    return row  # apply() needs the modified row back to build the result

temp, humid = 30, 50  # example thresholds; edit these to your desired values
data = data.apply(lambda row: calculation(row, temp, humid), axis=1)
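As a side note, when the per-row calculation is this simple, a vectorized form is usually faster than .apply(). A minimal sketch under the same assumptions (hypothetical thresholds, column names taken from the sample data):

temp, humid = 30, 50  # example thresholds
data.loc[data["outdoor temp"] > temp, "outdoor temp"] += 1
data.loc[data["Humidity Estimation: (poly3)"] > humid, "Humidity Estimation: (poly3)"] += 1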

Python: merging pandas data frames results in only NaN in column

I am currently working on a Big Data project requiring the merger of a number of files into a single table that can be analyzed through SAS. The majority of the work is complete, with the final fact tables still needing to be added to the final output.
I have hit a snag when attempting to combine a fact table with the final output. The CSV file, which has been loaded into its own dataframe, contains the following columns:
table name: POP
year | Borough | Population
Within the dataset they are to be joined onto, these fields also exist along with around 26 others. When I first attempted the merge with the following line:
#Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
the following error was returned
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I understood this to simply be a data type mismatch, so I added the following line ahead of the merge command:
POP['year'] = POP['year'].astype(object)
Doing so allows for "successful" execution of the program; however, the output file's population column is filled with NaN when it should have the appropriate population for each row where the combination of "year" and "Borough" matches one found in the POP table.
Any help would be greatly appreciated. I provide below a fuller excerpt of the code for those who find that easier to parse:
import pandas as pd
#
# Add Population Data
#
#rename columns for easier joining
POP.rename(columns={"Area name":"Borough"}, inplace=True)
POP.rename(columns={"Persons":"Population"}, inplace=True)
POP.rename(columns={"Year":"year"}, inplace=True)
#convert type of output column to allow join
POP['year'] = POP['year'].astype(object)
#add to output file
Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
Additionally, find below some information about the data types and shapes of the tables involved, in case it is of use:
Output table info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34241 entries, 0 to 38179
Data columns (total 2 columns):
year       34241 non-null object
Borough    34241 non-null object
dtypes: object(2)
memory usage: 535.0+ KB
None
table shape: (34241, 36)
----------
POP table info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 3 columns):
year          357 non-null object
Borough       357 non-null object
Population    357 non-null object
dtypes: object(3)
memory usage: 4.2+ KB
None
table shape: (357, 3)
Finally, my apologies if any of this is asked or presented incorrectly; I'm relatively new to Python and it's my first time using Stack as a contributor.
EDITS:
(1)
As requested, here are samples of the data.
this is the Output Dataframe
Borough Date totalIncidents Calculated Mean Closure \
0 Camden 2013-11-06 2.0 613.5
1 Kensington and Chelsea 2013-11-06 NaN NaN
2 Westminster 2013-11-06 1.0 113.0
PM2.5 (ug/m3) PM10 (ug/m3) (R) SO2 (ug/m3) (R) CO (mg m-3) (R) \
0 9.55 16.200 5.3 NaN
1 10.65 21.125 1.7 0.2
2 19.90 30.600 NaN 0.7
NO (ug/m3) (R) NO2 (ug/m3) (R) O3 (ug/m3) (R) Bus Stops \
0 135.866670 82.033333 24.4 447.0
1 80.360000 65.680000 29.3 270.0
2 171.033333 109.000000 21.3 489.0
Cycle Parking Points \
0 67.0
1 27.0
2 45.0
Average Public Transport Access Index 2015 (AvPTAI2015) \
0 24.316782
1 23.262691
2 39.750796
Public Transport Accessibility Levels (PTAL) Catagorisation \
0 5
1 5
2 6a
Underground Stations in Borough \
0 16.0
1 12.0
2 31.0
PM2.5 Daily Air Quality Index(DAQI) classification \
0 Low
1 Low
2 Low
PM2.5 Above WHO 24hr mean Guidline PM10 Above WHO 24hr mean Guidline \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
O3 Above WHO 8hr mean Guidline* NO2 Above WHO 1hr mean Guidline* \
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
SO2 Above WHO 24hr mean Guidline SO2 Above EU 24hr mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
NO2 Above EU 24hr mean Allowence CO Above EU 8hr mean Allowence \
0 1.0 0.0
1 1.0 0.0
2 1.0 0.0
O3 Above EU 8hr mean Allowence year NO2 Year Mean (ug/m3) \
0 0.0 2013 50.003618
1 0.0 2013 50.003618
2 0.0 2013 50.003618
PM2.5 Year Mean (ug/m3) PM10 Year Mean (ug/m3) \
0 15.339228 24.530299
1 15.339228 24.530299
2 15.339228 24.530299
NO2 Above WHO Annual mean Guidline NO2 Above EU Annual mean Allowence \
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
PM2.5 Above EU Annual mean Allowence PM10 Above EU Annual mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
Number of Bicycle Hires (All Boroughs)
0 18,431
1 18,431
2 18,431
here is the Population dataframe
year Borough Population
0 2010 Barking and Dagenham 182,838
1 2011 Barking and Dagenham 187,029
2 2012 Barking and Dagenham 190,560
(2)
So this seems to have been a data type issue, but I'm still not entirely sure why, as I had attempted recasting the datatypes. However, the solution that finally got me going was to save the output dataframe as a CSV and reload it into the programme; from there the merge started working again.
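For reference, an all-NaN column after an outer merge usually means the key values never actually match (dtype differences, or stray whitespace/case differences in the Borough strings). A minimal, hedged sketch of normalizing both sides before the merge, rather than casting to object (the exact cleaning needed depends on the raw data):

for frame in (Output, POP):
    frame['year'] = frame['year'].astype(int)        # same numeric dtype on both sides
    frame['Borough'] = frame['Borough'].str.strip()  # drop stray whitespace

Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')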

How to make different dataframes of different lengths become equal in length (downsampling and upsampling)

I have many dataframes (timeseries) that are of different lengths, ranging between 28 and 179. I need to make them all of length 104 (upsampling those below 104 and downsampling those above 104).
For upsampling, the linear method can be sufficient to my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all to the size of the smallest dataframe (i.e. 28) using below lines of code:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
resampled=df.resample('120D').mean()
However, this does not give me good results when I feed them into the model I need them for, as it shrinks the longer files so much that it distorts the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0]>100: resampled=df.resample('D').mean()
elif df.shape[0]<100: resampled=df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else: break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (the average size). This means any df of length > 104 needs to be downsampled and any df of length < 104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean() .
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the intermediate values are not kept. With your example df1, you can see it by using asfreq on one column:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
.resample('33D').asfreq().isna().sum(0))
#99 rows are nan on the 100 length resampled dataframe
So when you use interpolate instead of asfreq, it actually interpolates from just the first value, meaning that the first value is "repeated" over all the rows.
To get the result you want, apply mean before interpolating, even in the up-sampling case, such as:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
.resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get the values you want.
To conclude, I think that in both the up-sampling and the down-sampling case you can use the same command:
resampled = (df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'))
.resample('33D').mean().interpolate())
This works because interpolate does not affect the result in the down-sampling case.
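Building on that, here is a hedged sketch of a helper (the name resample_to_length and the trimming step are my own, not from the answer) that targets an exact row count such as the 104 in the question, by deriving the resample rule from the target length; the 1991-2000 date index is only scaffolding, as above:

import pandas as pd

def resample_to_length(df, target_len=104):
    # Attach an evenly spaced synthetic datetime index (same trick as above).
    idx = pd.date_range(start='1/1/1991', end='1/1/2000', periods=len(df))
    df = df.set_index(idx)
    # Choose a bin width (in days) so the window splits into at least target_len bins.
    total_days = (idx[-1] - idx[0]).days
    rule = f'{max(total_days // target_len, 1)}D'
    out = df.resample(rule).mean().interpolate()
    # Trim the few extra trailing bins down to exactly target_len rows.
    return out.iloc[:target_len].reset_index(drop=True)

With the toy example, resample_to_length(df1, 6) and resample_to_length(df2, 6) then both come out with six rows, interpolated or averaged as appropriate.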
Here is my version using skimage.transform.resize() function:
df1 = pd.DataFrame({
    'a': [3, 5, 9, 11],
    'b': [-1, -3, -5, -7],
    'c': [0, 2, 0, -2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize
def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():  # .iteritems() was removed in pandas 2.0
        temp = value.to_numpy() / value.abs().max()                           # normalize
        resampled = resize(temp, (num, 1), mode='edge') * value.abs().max()   # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resampling rate is 20
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0
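As a usage note for the question's target of 104 rows (list_of_dfs below is just a placeholder for however the dataframes are collected), every frame can be pushed to the same length in one pass, since resize interpolates regardless of whether the original is shorter or longer than the target:

resampled_dfs = [df_resample(df, 104) for df in list_of_dfs]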

Filtering a dataframe by a list

I have the following dataframe
Dataframe:
Date Name Value Rank Mean
01/02/2019 A 10 100 8.2
02/03/2019 A 9 120 7.9
01/03/2019 B 3 40 6.4
03/02/2019 B 1 39 5.9
...
And following list:
date=['01/02/2019','03/02/2019'...]
I would like to filter the df by the list, but as a date range: for each value in the list I would like to bring back the rows whose Date falls between that date and that date minus 30 days.
I am using numpy broadcasting here. Note that this method is O(n*m), which means that if both the df and the date list are huge, it may exceed the memory limit:
s=pd.to_datetime(date).values
df.Date=pd.to_datetime(df.Date)
s1=df.Date.values
t=(s-s1[:,None]).astype('timedelta64[D]').astype(int)
df[np.any((t>=0)&(t<=30),1)]
Out[120]:
Date Name Value Rank Mean
0 2019-01-02 A 10 100 8.2
1 2019-02-03 A 9 120 7.9
3 2019-03-02 B 1 39 5.9
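If memory is a concern with the n*m broadcast, a hedged alternative is to build one boolean mask per date and OR them together, so only an n-length mask is held at any time:

dates = pd.to_datetime(date)
df['Date'] = pd.to_datetime(df['Date'])

mask = pd.Series(False, index=df.index)
for d in dates:
    mask |= df['Date'].between(d - pd.Timedelta(days=30), d)

df[mask]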
If your dates are plain strings and you only need exact matches (not the 30-day window), just do:
df[df.Date.isin(date)]

count values in one pandas series based on date column and other column

I have several columns of data in a pandas dataframe. The data looks like:
cus_id timestamp values second_val
0 10173 2010-06-12 39.0 1
1 95062 2010-09-11 35.0 2
2 171081 2010-07-05 39.0 1
3 122867 2010-08-18 39.0 1
4 107186 2010-11-23 0.0 3
5 171085 2010-09-02 0.0 2
6 169767 2010-07-03 28.0 2
7 80170 2010-03-23 39.0 2
8 154178 2010-10-02 37.0 2
9 3494 2010-11-01 0.0 1
.
.
.
.
5054054 1716139 2012-01-12 0.0 2
5054055 1716347 2012-01-18 28.0 1
5054056 1807501 2012-01-21 0.0 1
There are 0 values that appear in the values column on different days. I tried to sum the second_val values for each month where the values column equals zero at that time, so that I could plot them properly, and I did it using:
Jan10 = df.second_val[df['timestamp'].str.contains('2010-01')][df['values']==0].sum()
Feb10 = df.second_val[df['timestamp'].str.contains('2010-02')][df['values']==0].sum()
Mar10 = df.second_val[df['timestamp'].str.contains('2010-03')][df['values']==0].sum()
.
.
.
.
Jan12 = df.second_val[df['timestamp'].str.contains('2012-01')][df['values']==0].sum()
Feb12 = df.second_val[df['timestamp'].str.contains('2012-02')][df['values']==0].sum()
Months = ['2010-01', '2010-02', '2010-03', '2010-04' . . . . ., '2012-01', '2012-02']
Months_Orders = [Jan10, Feb10, Mar10, Apr10, . . . . .. ., Jan12, Feb12]
plt.figure(figsize=(15,8))
plt.scatter(x = Months, y = Months_Orders)
For example, if 0 appears on 10 days in Jan 2010 and the sum of the second_val data for those rows is 20, then it should give me 20 for the month of January, e.g.:
cus_id timestamp values second_val
0 10173 2010-01-10 0.0 1
.
.
13 95062 2010-01-11 0.0 2
34 171081 2010-01-23 0.0 1
Is there any way to do this better, by writing a function or using a built-in pandas method? I tried the solution from my previous question, but it was different and didn't work properly for me, so I used this hard-coded approach, which seems inefficient. Thanks!
IIUC
df.timestamp = pd.to_datetime(df.timestamp)
df = df[df['values'] == 0]  # filter before the groupby
df.groupby(df.timestamp.dt.strftime('%Y-%m')).second_val.sum()  # group by the '%Y-%m' month string after the filter
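To feed this into the scatter plot from the question, a short sketch (monthly is just a name for the grouped Series; timestamp is assumed to have been converted with pd.to_datetime as above):

import matplotlib.pyplot as plt

zero = df[df['values'] == 0]
monthly = zero.groupby(zero.timestamp.dt.strftime('%Y-%m')).second_val.sum()

plt.figure(figsize=(15, 8))
plt.scatter(x=monthly.index, y=monthly.values)
plt.show()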
