Is there a faster way to read the following pandas dataframe? - python

I have a huge .csv file (2.3 GB) which I have to read into a pandas dataframe. It looks like this:
start_date,wind_90.0_0.0,wind_90.0_5.0,wind_87.5_2.5
1948-01-01,15030.64,15040.64,16526.35
1948-01-02,15050.14,15049.28,16526.28
1948-01-03,15076.71,15075.0,16525.28
I want to process above data into below structure:
start_date lat lon wind
0 1948-01-01 90.0 0.0 15030.64
1 1948-01-01 90.0 5.0 15040.64
2 1948-01-01 87.5 2.5 16526.35
3 1948-01-02 90.0 0.0 15050.14
4 1948-01-02 90.0 5.0 15049.28
5 1948-01-02 87.5 2.5 16526.28
6 1948-01-03 90.0 0.0 15076.71
7 1948-01-03 90.0 5.0 15075.0
8 1948-01-03 87.5 2.5 16525.28
The code I have so far does what I want, but it is too slow and uses a lot of memory:
import pandas as pd

def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    # unpivot from wide to long
    df = pd.melt(df, id_vars=df.columns[0])
    # recover lat/lon from the column names, e.g. 'wind_90.0_5.0'
    df['lat'] = df['variable'].str.split('_').str[-2]
    df['lon'] = df['variable'].str.split('_').str[-1]
    df = df.drop('variable', axis=1)
    df.columns = ['start_date', featureName, 'lat', 'lon']
    df = df.groupby(['start_date', 'lat', 'lon']).first()
    df = df.reset_index()
    df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d', errors='coerce')
    return df

This should speed up your code:
We can use melt to unpivot your data from wide to long. Then we use str.split on the column names (which are now values in the variable column) with expand=True to get a new column for each split part. Finally we join these newly created columns back to our original dataframe:
melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1:'lat', 2:'lon'})
final = melt.drop(columns='variable').join(newcols)
Output
start_date value lat lon
0 1948-01-01 15030.64 90.0 0.0
1 1948-01-01 15040.64 90.0 5.0
2 1948-01-01 16526.35 87.5 2.5
3 1948-01-02 15050.14 90.0 0.0
4 1948-01-02 15049.28 90.0 5.0
5 1948-01-02 16526.28 87.5 2.5
6 1948-01-03 15076.71 90.0 0.0
7 1948-01-03 15075.00 90.0 5.0
8 1948-01-03 16525.28 87.5 2.5
Timeit test on 800k rows:
3.55 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
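If you want to keep your original function signature, the same approach can be wrapped up as below (a minimal sketch, assuming the lat/lon values should end up as floats and start_date as datetime, as in your original function):

import pandas as pd

def load_data_as_pandas(fileName, featureName):
    df = pd.read_csv(fileName)
    melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
    newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1: 'lat', 2: 'lon'})
    out = melt.drop(columns='variable').join(newcols)
    out = out.rename(columns={'value': featureName})
    # str.split produces strings, so convert lat/lon to numeric
    out[['lat', 'lon']] = out[['lat', 'lon']].astype(float)
    out['start_date'] = pd.to_datetime(out['start_date'], format='%Y-%m-%d', errors='coerce')
    return out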

Related

Merge several Dataframes with outside temperature and power generation

I have several dataframes of heating devices which contain data over 1 year. One time step is 15 min, and each df has two columns: outside_temp and heat_production. Each df looks like this:
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
3 11.0 0
4 11.0 300
5 10.9 49
...
35037 -5.1 450
35038 -5.1 450
35039 -5.1 450
35040 -5.2 600
I now want to know, for all heating devices (and therefore for all dataframes), how much heat_production is needed at which outside_temp. I was thinking about groupby or something else, but I don't know the best way to handle this amount of data. When directly merging the dfs, the problem is that the same outside temperature appears several times while the heat production of course differs. To solve this, I could imagine taking the average heat_production for each device at a given outside_temp. Of course it can also happen that a device never measured a specific temperature (e.g. the device is located in a warmer or colder area), so NaN values are possible.
In the end I want to fit some kind of polynomial/sigmoid function to see how much heat_production is necessary at a given outside temperature.
To get there, I want to have a dataframe like this:
outside_temp heat_production_average_device_1 heat_production_average_device_2 ...etc
-20.0 790 NaN
-19.9 789 NaN
-19.8 788 790
-19.7 NaN 780
-19.6 770 NaN
...
19.6 34 0
19.7 32 0
19.8 30 0
19.9 32 0
20.0 0 0
Any idea what's the best way to do this?
Given:
>>> df1
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
>>> df2
outside_temp heat_production
3 11.0 0
4 11.0 300
5 10.9 49
Doing:
def my_func(i, df):
    renamer = {'heat_production': f'heat_production_average_device_{i}'}
    return (df.groupby('outside_temp')
              .mean()
              .rename(columns=renamer))

dfs = [df1, df2]
dfs = [my_func(i+1, df) for i, df in enumerate(dfs)]
df = pd.concat(dfs, axis=1)
print(df)
Output:
heat_production_average_device_1 heat_production_average_device_2
outside_temp
11.0 245.0 150.0
11.1 175.0 NaN
10.9 NaN 49.0
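For the polynomial fit you mention at the end, one option is numpy.polyfit on the averaged values of each device column, applied to the full-year data (a rough sketch, assuming a 3rd-degree polynomial purely as an example and dropping NaN rows first):

import numpy as np

col = 'heat_production_average_device_1'
valid = df[col].dropna()                                    # index holds outside_temp
coeffs = np.polyfit(valid.index.to_numpy(), valid.to_numpy(), deg=3)
poly = np.poly1d(coeffs)
print(poly(-5.0))                                           # estimated heat production at -5.0 degrees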

PatsyError: numbers besides '0' and '1' are only allowed with ** doesn't resolve when using Q

I'm trying to run an ANOVA test on a dataframe that looks like this:
code 2020-11-01 2020-11-02 2020-11-03 2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column based on the code. For that I have used statsmodels and a for loop:
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error for the 4th line:
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried to put the Q outside the format braces but got the same error.
My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can pivot it to long format and skip the iteration over columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code":[1,1,2,2,3,3],
"2020-11-01":[22.5,23.1,43.1,41.6,97.3,12.1],
"2020-11-02":[73.1,75.4,72.1,85.1,43.2,44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)
Gives this result:
df sum_sq mean_sq F PR(>F)
variable
2020-11-01 code 1.0 1017.6100 1017.610000 1.115768 0.350405
Residual 4.0 3648.1050 912.026250 NaN NaN
2020-11-02 code 1.0 927.2025 927.202500 6.194022 0.067573
Residual 4.0 598.7725 149.693125 NaN NaN
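As a side note, the original PatsyError comes from column names like 2020-11-01 not being valid Python identifiers; it can also be avoided without reshaping by quoting the name inside the formula string with Q (a sketch of the intended usage):

model = ols('Q("{}") ~ code'.format(variable), data=df).fit()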

Fill NaN values from previous column with data

I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.
Use the justify function to improve performance, selecting all columns except the first with DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
import numpy as np

# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Assuming you need to move all NaN values to the first columns, I would define a function that places the NaN values first and leaves the rest as it is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row

df_fix = df.loc[:, df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
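With either answer the non-null values end up pushed to the right, so the stated goal (the mean of the last five populated cells, or of however many exist) reduces to a row-wise mean over the last five columns, since mean skips NaN by default (a sketch on the justified frame):

df['mean_last5'] = df.iloc[:, -5:].mean(axis=1)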

Difference of datetimes in hours, excluding the weekend

I currently have a dataframe where a UniqueID has multiple dates in another column. I want to extract the hours between each date, but ignore the weekend if the next date is after the weekend. For example, if one date is Friday at 12 pm and the following date is Tuesday at 12 pm, then the difference in hours between these two dates would be 48 hours.
Here is my dataset with the expected output:
df = pd.DataFrame({"UniqueID": ["A","A","A","B","B","B","C","C"],"Date":
["2018-12-07 10:30:00","2018-12-10 14:30:00","2018-12-11 17:30:00",
"2018-12-14 09:00:00","2018-12-18 09:00:00",
"2018-12-21 11:00:00","2019-01-01 15:00:00","2019-01-07 15:00:00"],
"ExpectedOutput": ["28.0","27.0","Nan","48.0","74.0","NaN","96.0","NaN"]})
df["Date"] = df["Date"].astype(np.datetime64)
This is what I have so far, but it includes the weekends:
df["date_diff"] = df.groupby(["UniqueID"])["Date"].apply(lambda x: x.diff()
/ np.timedelta64(1 ,'h')).shift(-1)
Thanks!
The idea is to floor the datetimes to drop the time part, then get the number of business days between the start day + one day and the shifted day into the hours3 column with numpy.busday_count, and create the hours1 and hours2 columns for the hours on the start and end days when those days are not weekends. Finally, sum all the hours columns together:
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(['UniqueID','Date'])
df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)
df["hours1"] = df["Date"].dt.floor('d')
df["hours2"] = df["shifted"].dt.floor('d')
mask = df['shifted'].notnull()
f = lambda x: np.busday_count(x['hours1'] + pd.Timedelta(1, unit='d'), x['hours2'])
df.loc[mask, 'hours3'] = df[mask].apply(f, axis=1) * 24
mask1 = df['hours1'].dt.dayofweek < 5
hours1 = df['hours1'] + pd.Timedelta(1, unit='d') - df['Date']
df['hours1'] = np.where(mask1, hours1, np.nan) / np.timedelta64(1 ,'h')
mask1 = df['hours2'].dt.dayofweek < 5
df['hours2'] = np.where(mask1, df['shifted']-df['hours2'], np.nan) / np.timedelta64(1 ,'h')
df['date_diff'] = df['hours1'].fillna(0) + df['hours2'] + df['hours3']
print (df)
UniqueID Date ExpectedOutput shifted hours1 \
0 A 2018-12-07 10:30:00 28.0 2018-12-10 14:30:00 13.5
1 A 2018-12-10 14:30:00 27.0 2018-12-11 17:30:00 9.5
2 A 2018-12-11 17:30:00 Nan NaT 6.5
3 B 2018-12-14 09:00:00 48.0 2018-12-18 09:00:00 15.0
4 B 2018-12-18 09:00:00 74.0 2018-12-21 11:00:00 15.0
5 B 2018-12-21 11:00:00 NaN NaT 13.0
6 C 2019-01-01 15:00:00 96.0 2019-01-07 15:00:00 9.0
7 C 2019-01-07 15:00:00 NaN NaT 9.0
hours2 hours3 date_diff
0 14.5 0.0 28.0
1 17.5 0.0 27.0
2 NaN NaN NaN
3 9.0 24.0 48.0
4 11.0 48.0 74.0
5 NaN NaN NaN
6 15.0 72.0 96.0
7 NaN NaN NaN
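As a quick sanity check on row 3: Friday 2018-12-14 09:00 to Tuesday 2018-12-18 09:00 gives hours1 = 15 (the remainder of Friday), hours3 = 24 (Monday is the only full business day in between), and hours2 = 9 (midnight to 09:00 on Tuesday), which sums to the expected 48 hours.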
The first solution was removed for two reasons: it was not accurate and it was slow:
np.random.seed(2019)
dates = pd.date_range('2015-01-01','2018-01-01', freq='H')
df = pd.DataFrame({"UniqueID": np.random.choice(list('ABCDEFGHIJ'), size=100),
"Date": np.random.choice(dates, size=100)})
print (df)
def old(df):
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(['UniqueID','Date'])
    df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)

    def f(x):
        a = pd.date_range(x['Date'], x['shifted'], freq='T')
        return ((a.dayofweek < 5).sum() / 60).round()

    mask = df['shifted'].notnull()
    df.loc[mask, 'date_diff'] = df[mask].apply(f, axis=1)
    return df
def new(df):
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(['UniqueID','Date'])
    df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)
    df["hours1"] = df["Date"].dt.floor('d')
    df["hours2"] = df["shifted"].dt.floor('d')

    mask = df['shifted'].notnull()
    f = lambda x: np.busday_count(x['hours1'] + pd.Timedelta(1, unit='d'), x['hours2'])
    df.loc[mask, 'hours3'] = df[mask].apply(f, axis=1) * 24

    mask1 = df['hours1'].dt.dayofweek < 5
    hours1 = df['hours1'] + pd.Timedelta(1, unit='d') - df['Date']
    df['hours1'] = np.where(mask1, hours1, np.nan) / np.timedelta64(1, 'h')

    mask1 = df['hours2'].dt.dayofweek < 5
    df['hours2'] = np.where(mask1, df['shifted'] - df['hours2'], np.nan) / np.timedelta64(1, 'h')

    df['date_diff'] = df['hours1'].fillna(0) + df['hours2'] + df['hours3']
    return df
print (new(df))
print (old(df))
In [44]: %timeit (new(df))
22.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [45]: %timeit (old(df))
1.01 s ± 8.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to assign values to a dataframe's column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a column to df containing each row's rank, based on its TIMESTAMP and the corresponding day's rank from ranked (so within a day all the 5-minute timesteps will have the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
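An alternative that avoids the double set_index is to map the normalized timestamps against a DATE-indexed RANK Series (a sketch, equivalent to a left join on the day):

df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])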
