With the following data, I would like to show the mean and other averages:
time = ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'
'2020-01-03T00:00:00.000000000' '2020-01-04T00:00:00.000000000'
'2020-01-05T00:00:00.000000000' '2020-01-06T00:00:00.000000000'
'2020-01-07T00:00:00.000000000' '2020-01-08T00:00:00.000000000'
'2020-01-09T00:00:00.000000000' '2020-01-10T00:00:00.000000000'
'2020-01-11T00:00:00.000000000' '2020-01-12T00:00:00.000000000'
'2020-01-13T00:00:00.000000000' '2020-01-14T00:00:00.000000000'
'2020-01-15T00:00:00.000000000' '2020-01-16T00:00:00.000000000'
'2020-01-17T00:00:00.000000000' '2020-01-18T00:00:00.000000000'
'2020-01-19T00:00:00.000000000' '2020-01-20T00:00:00.000000000'
'2020-01-21T00:00:00.000000000' '2020-01-22T00:00:00.000000000'
'2020-01-23T00:00:00.000000000' '2020-01-24T00:00:00.000000000'
'2020-01-25T00:00:00.000000000' '2020-01-26T00:00:00.000000000'
'2020-01-27T00:00:00.000000000' '2020-01-28T00:00:00.000000000'
'2020-01-29T00:00:00.000000000' '2020-01-30T00:00:00.000000000'
'2020-01-31T00:00:00.000000000']
print(np.mean(time)) raises an error: TypeError: cannot perform reduce with flexible type
I think I may need to use pandas / DataFrame / slicing, but I am unsure how to do this.
First, you need to add commas between the entries of your list. Then a possible option is to use pandas:
import pandas as pd
import numpy as np
You can convert your string list to a pandas datetime index:
time_pd = pd.to_datetime(time)
Then turn this into an integer array and perform whatever calculations you want. For example, calculating the mean:
time_np = time_pd.astype(np.int64)
average_time_np = np.average(time_np)
average_time_pd = pd.to_datetime(average_time_np)
print(average_time_pd)
Which prints: 2020-01-16 00:00:00
There are certainly ways to cast the time strings directly to numpy datetimes without using pandas, but this is the solution I could figure out without much more research.
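As an aside, newer pandas versions can take the mean of a DatetimeIndex directly, so the round-trip through integers is optional. A minimal sketch, assuming pandas 0.25 or later (where DatetimeIndex.mean is available):
import pandas as pd
# 'time' is the list of date strings from the question
time_pd = pd.to_datetime(time)
print(time_pd.mean())  # 2020-01-16 00:00:00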
Here is one approach based on converting back and forth between Unix time (seconds since the epoch):
dt = np.array(time, dtype='datetime64')
delta_sec = np.timedelta64(1, 's')
epoch = '1970-01-01T00:00:00'
epoch_sec = (dt - np.datetime64(epoch)) / delta_sec
epoch_sec_mean = np.mean(epoch_sec)
dt_mean = np.datetime64(epoch) + np.timedelta64(int(epoch_sec_mean), 's')
print(dt_mean)
Output
2020-01-16T00:00:00
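A similar numpy-only sketch that works in nanoseconds instead of seconds, viewing the datetime64 array as raw int64 values (a variant of the same idea, not from the original answer):
dt_ns = np.array(time, dtype='datetime64[ns]')
mean_ns = int(dt_ns.view('int64').mean())  # mean nanoseconds since the epoch
print(np.datetime64(mean_ns, 'ns'))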
I have this code where I wish to change the date format, but I only manage to change one line and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv("data_q_3.csv")
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fix the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple, as all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde (the end index is exclusive).
Below is the finished code.
import pandas as pd
# read the data
df = pd.read_csv("data_q_3.csv")
# top 10 countries/regions by confirmed cases
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the currently stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing the
    # string from index 0 to 10 and from index 11 to 16,
    # putting a dash in the middle
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time
# print the result
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
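For completeness, the loop can usually be replaced by a vectorized one-liner. A sketch, assuming the 'DateTime' column holds ISO-formatted strings that pd.to_datetime can parse:
# parse the whole column at once, then format it back to strings
result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime('%Y-%m-%d-%H:%M')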
I tried comparing the performance of Pandas and a traditional loop. I realized that, with the same input and output, Pandas performed the calculation dramatically faster than the traditional loop.
My code:
#df_1h has been imported before
import time
n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
close = df_1h['close']
start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))
start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))
display(df_1h.tail())
Output:
"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
open_time open high low ... ignore rsi_14 sma_14_pandas sma_14_loop
16593 1.562980e+12 11707.39 11739.90 11606.04 ... 0.0 51.813151 11646.625714 11646.625714
16594 1.562983e+12 11664.32 11712.61 11625.00 ... 0.0 49.952679 11646.834286 11646.834286
16595 1.562987e+12 11632.64 11686.47 11510.00 ... 0.0 47.583619 11643.321429 11643.321429
16596 1.562990e+12 11582.06 11624.04 11500.00 ... 0.0 48.725262 11644.912857 11644.912857
16597 1.562994e+12 11604.96 11660.00 11588.16 ... 0.0 50.797087 11656.723571 11656.723571
5 rows × 15 columns
Pandas is almost 2,500 times faster!!!
My Questions:
Is my code wrong?
If my code is correct, why is Pandas so fast?
How to define custom functions that run so fast for Pandas?
As to your three questions:
Your code is correct in the sense that it produces the correct result. However, explicitly iterating over the rows of a dataframe is, as a rule, not a good idea in terms of performance. Most often the same result can be achieved far more efficiently with pandas methods (as you demonstrated yourself).
Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array, and only then store the result back in the dataframe or series.
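A minimal sketch of that round-trip pattern (hypothetical column names, not from the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
arr = df['x'].to_numpy()     # work on the raw numpy array
df['x_doubled'] = arr * 2.0  # store the result back in the dataframe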
Why is the loop algorithm so slow?
In your loop algorithm, the mean is calculated over 16,500 times, each time adding up 14 elements. Pandas' rolling method uses a more sophisticated approach (maintaining a running window sum rather than re-summing every window), heavily reducing the number of arithmetic operations.
You can achieve similar (and in fact about 3 times better) performance than pandas if you do the calculations in numpy. This is illustrated in the following example:
import pandas as pd
import numpy as np
import time
data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))
start = time.time()
df_1h['SMA_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop: {}'.format(time.time() - start))
def np_sma(a, n=14):
    ret = np.cumsum(a)
    ret[n:] = ret[n:] - ret[:-n]
    return np.append([np.nan]*(n-1), ret[n-1:] / n)
start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np: {}'.format(time.time() - start))
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
Output:
df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop: 7.605962753295898
np: 0.0010571479797363281
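To see why the cumsum trick in np_sma is so cheap, here is the same idea worked out on a tiny array (n=2 for readability):
import numpy as np
a = np.array([1., 2., 3., 4., 5.])
n = 2
ret = np.cumsum(a)            # [ 1.  3.  6. 10. 15.]
ret[n:] = ret[n:] - ret[:-n]  # windowed sums: [1. 3. 5. 7. 9.]
print(ret[n-1:] / n)          # moving averages: [1.5 2.5 3.5 4.5]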
I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors and idiomatic differences:
1. "NotImplementedError: Only integer valued repeats supported." I have tried converting the index into an int column/array as well, but I still run into the issue.
2. dask does not support the mutating operator "+=".
3. dask has no .to_timedelta().
4. dask has no .cumcount() (but I think .cumsum() is interchangeable?! See the sketch after this list.)
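On point 4: in plain pandas, .cumcount() can indeed be emulated with a .cumsum() over a helper column of ones, modulo an off-by-one. A small sketch on a hypothetical frame:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 2, 2]})
df['helper'] = 1
lhs = df.groupby('id')['helper'].cumsum() - 1  # cumsum starts at 1
rhs = df.groupby('id').cumcount()              # cumcount starts at 0
assert (lhs == rhs).all()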
If there are any dask experts out there who might be able to let me know whether there are fundamental impediments that preclude me from trying this, or who have any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced the da.repeat with np.repeat, along with explicitly casting dd_test.index and dd_test['units'] to numpy arrays, and finally adding dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()
I have a pandas time series with a date index:
import pandas as pd
import numpy as np
pandas_ts = pd.Series(np.random.randn(100), index=pd.date_range(start='2000-01-01', periods=100))
I need to convert it to an R ts (like the sunspots dataset) to call some R function (stl) with my time series, which works only with time series. But I found that the pandas.rpy and rpy2 APIs only support DataFrame. Is there another way to do this?
If there is no such way, I can convert the TS to a DataFrame in Python, then convert it to an R DF, and convert that to a TS in R, but I have some trouble at the last step because I'm new to R.
Any ideas or help with the conversion in R? =)
I am not proficient with pandas, but you can save your pandas time series to a CSV file and read it from R.
Python:
## write data
with open(PATH_CSV_FILE,"w") as file:
pandas_ts.to_csv(file)
## read data
with open(PATH_CSV_FILE,"r") as file:
pandas_ts.from_csv(file)
R:
library(xts)
## to read data
ts.xts <- read.zoo(PATH_CSV_FILE, sep = ",", index.column = 1)
## to save data
write.zoo(ts.xts, PATH_CSV_FILE, sep = ",")
The easiest might just be to use the R function ts() in a call corresponding to your pandas.date_range() call.
from rpy2.robjects.packages import importr
stats = importr('stats')
from rpy2.robjects.vectors import IntVector
# The time series created in the question is:
# pd.date_range(start='2000-01-01', periods=100)
stats.ts(IntVector(range(100)), start=IntVector((2000, 1, 1)))
Inspired by the answers already given here, I created a small function for converting an existing pandas time series to an R time series. It might be useful to more of you. Feel free to further improve and edit my contribution.
def pd_ts2r_ts(pd_ts):
    '''Pandas timeseries (pd_ts) to R timeseries (r_ts) conversion
    '''
    from rpy2.robjects.vectors import IntVector, FloatVector
    from rpy2.robjects import packages as rpackages
    rstats = rpackages.importr('stats')
    r_start = IntVector((pd_ts.index[0].year, pd_ts.index[0].month, pd_ts.index[0].day))
    r_end = IntVector((pd_ts.index[-1].year, pd_ts.index[-1].month, pd_ts.index[-1].day))
    freq_pandas2r_ts = {
        # A dictionary for converting pandas.Series frequencies into R ts frequencies
        'D': 365,  # is this correct, how about leap-years?
        'M': 12,
        'Y': 1,
    }
    r_freq = freq_pandas2r_ts[pd_ts.index.freqstr]
    result = rstats.ts(FloatVector(pd_ts.values), start=r_start, end=r_end, frequency=r_freq)
    return result
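A quick usage example with the series from the earlier question (a sketch; it assumes rpy2 is installed and that the index has a recognized frequency):
import numpy as np
import pandas as pd
pandas_ts = pd.Series(np.random.randn(100), index=pd.date_range(start='2000-01-01', periods=100))
r_ts = pd_ts2r_ts(pandas_ts)
print(r_ts)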
I have an Excel file (to be converted to a CSV file).
The data has 8 columns. The first two are day of the year and time, respectively, while the two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of the day, subtract them, and save the value for that day.
Two problems I ran into: how do I parse 24 lines at a time (there are no missing data lines!), and how do I find the maximum or minimum in each batch?
I have 6312 lines = 24 hr * 263 days.
So, to iterate through the lines:
import numpy as np
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
dt_per_all_days=[]
for i in range(0, 6312, 1):
    # I get stuck here: how do I limit the iteration to 24 lines at a time?
    # If I can do that I think I can get the rest done.
    min_d = []
    max_d = []
    min_d.append(up_air_min[i])
    max_d.append(up_air_max[i])
    max_per_day = max(max_d)
    min_per_day = min(min_d)
    dt_d = max_per_day - min_per_day
    dt_per_all_days.append(dt_d)
    del(min_d)
    del(max_d)
    # ... then move to the next batch of 24 lines ...
Use the Numpy, Luke, and avoid for-loops.
Once you have the up_air_min and up_air_max numpy arrays, you can easily do what you want by using numpy element-wise functions.
First, create 2D arrays with 263 rows (one per day) and 24 columns, like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use the np.min and np.max functions along axis 1 (good array tip sheet):
min_temperature = np.min(min_matrix, axis=1)
max_temperature = np.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is an array with the needed values. Let's save it to foo.csv:
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
And the final code looks like this:
import numpy as np
# This I got from your question.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Reshape the arrays into matrices with 263 rows (days) and 24 values in each row.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = np.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv.
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that yields 263 lists, each holding 24 values, which is a bit easier to process.
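A minimal sketch of that generator idea (a hypothetical row layout: an iterable of sequences whose first entry is the day-of-year value):
def group_by_day(rows, day_col=0):
    # Gather rows and yield one completed batch per day,
    # keyed on a change in the day column.
    batch = []
    current_day = None
    for row in rows:
        if current_day is not None and row[day_col] != current_day:
            yield batch
            batch = []
        current_day = row[day_col]
        batch.append(row)
    if batch:
        yield batch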
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from io import StringIO
from random import random as r
import numpy as np

s = StringIO()
for x in range(0, 10000):
    s.write('%f,%f,%f\n' % (r(), r()*10, r()*100))
s.seek(0)
data = np.genfromtxt(s, dtype=None, names=['pitch', 'yaw', 'thrust'], delimiter=',')
for x in range(0, len(data), 24):
    print('Acting on hours %d through %d' % (x, x+24))
    one_day = data[x:x+24]
    minimum_yaw = min(one_day['yaw'])
    max_yaw = max(one_day['yaw'])
    print('min', minimum_yaw, 'max', max_yaw, 'one_day', one_day['yaw'])