Filling a DataFrame from a list - Python

I tried to create a data frame from lists:
lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ['00_04', '04_08', '08_12', '12_16', '16_20', '20_24']
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]
df = pd.DataFrame(lista_values,index=lista_index,columns=labels)
This raises the following error:
ValueError: Shape of passed values is (1, 6), indices imply (6, 1)
How can I create a data frame from these lists? I don't understand why the shapes don't match when the values and the column labels have the same length.

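Is this what you want? A flat list of six values is read as six rows of a single column, so I passed labels as the index and lista_index as the columns, then flipped the DataFrame with transpose (.T).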
import pandas as pd
import datetime
lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ['00_04', '04_08', '08_12', '12_16', '16_20', '20_24']
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]
df = pd.DataFrame(data=lista_values, index=labels, columns = lista_index).T
df
Returns:
            00_04  04_08  08_12  12_16  16_20  20_24
2017-11-11   2983   2983   5652  12375  13055  26180
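An alternative that avoids the transpose is to wrap the values in an outer list, so pandas reads them as one row of six columns. This is a minimal sketch using the same lists as the question; it is one possible fix, not the only one.

```python
import datetime

import pandas as pd

lista_values = [2983, 2983, 5652, 12375, 13055, 26180]
labels = ['00_04', '04_08', '08_12', '12_16', '16_20', '20_24']
lista_index = [datetime.datetime(2017, 11, 11, 0, 0)]

# [lista_values] is a list containing one row, so the data has shape (1, 6),
# which matches one index entry and six column labels.
df = pd.DataFrame([lista_values], index=lista_index, columns=labels)
print(df)
```

The original call failed because a flat list is treated as a single column with six rows, which cannot be paired with one index entry and six column labels.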


Iterating over NumPy array for timestep

I have 5 columns in an array that are DateTime type values. There are 100k+ values in each column.
The values are the DateTime for a timeline going from 2015-01-15 00:30 to 2020-12-31 23:00 in 30-minute increments.
Basically, what I want to do is loop through the values in the array and check whether the current value is exactly one 30-minute timestep after the previous value. In an array with columns, the previous value is the one directly above the value being investigated.
There are probably a few ways to do this, but I've included a pseudocode sample of how I'm thinking about it:
for row in _the_whole_array:
    for cell in row:
        if cell == to the 30 minute timestep of the cell above it
            continue iterating
        else:
            store that value
return the smallest timestep found, and the biggest timestep found
I have looked at for loops as well as nditer, but I'm getting errors when iterating over date-times, and I'm also wondering how to access the cell above the current cell.
Any help is hugely appreciated.
I wasn't sure of your axis, even though your pseudocode uses rows, but the overall idea is the same: if your dtype is datetime, you can do basic arithmetic on it.
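import numpy as np
x, y = _the_whole_array.shape
# differences between neighbouring cells along each row
for row in range(x):
    diff = _the_whole_array[row][1:] - _the_whole_array[row][:-1]
    print(np.amin(diff), np.amax(diff))
# differences between neighbouring cells down each column
for column in range(y):
    diff = _the_whole_array[:, column][1:] - _the_whole_array[:, column][:-1]
    print(np.amin(diff), np.amax(diff))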
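The same slicing arithmetic can be written without an explicit loop using np.diff. The sketch below assumes the array has dtype datetime64 and that each column is a timeline running down the rows; the sample data and the names step, timeline and irregular are made up for illustration.

```python
import numpy as np

step = np.timedelta64(30, 'm')

# Hypothetical sample: two identical columns of 30-minute timestamps.
column = np.arange(np.datetime64('2015-01-15T00:30'),
                   np.datetime64('2015-01-15T03:30'), step)
timeline = np.stack([column, column], axis=1)

# Introduce a gap: shift the tail of the second column by one extra step.
timeline[3:, 1] = timeline[3:, 1] + step

# Difference between each cell and the cell above it (timedelta64 values).
diffs = np.diff(timeline, axis=0)

# (row, column) positions in diffs where the step is not exactly 30 minutes.
irregular = np.argwhere(diffs != step)

smallest, largest = diffs.min(), diffs.max()
print(smallest, largest)
print(irregular)
```

An entry at row i of diffs compares rows i and i + 1 of the original array, so the offending cell in the original array is at row i + 1.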
As mentioned by @Heidiki, consider using pandas because it makes large data much easier to process.
To address your question, you can keep a temporary variable holding the previous row while looping. At every iteration, you calculate the differences and update the overall min and max, exactly as in your pseudocode.
Here is an example with test data, a 3x5 matrix. Please check whether the output matches your requirement.
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
# test data
arr = np.array([
    [datetime(2019, 1, 2, 13, 12), datetime(2019, 1, 2, 15, 19),
     datetime(2019, 1, 2, 15, 59), datetime(2019, 1, 2, 17, 23),
     datetime(2019, 1, 2, 15, 18)],
    [datetime(2019, 1, 2, 13, 34), datetime(2019, 1, 2, 15, 57),
     datetime(2019, 1, 2, 18, 53), datetime(2019, 1, 2, 17, 34),
     datetime(2019, 1, 2, 15, 29)],
    [datetime(2019, 1, 2, 13, 49), datetime(2019, 1, 2, 16, 35),
     datetime(2019, 1, 2, 21, 18), datetime(2019, 1, 2, 17, 59),
     datetime(2019, 1, 2, 15, 46)]
])
def timedelta_to_minutes(dt: timedelta) -> int:
    return (dt.days * 24 * 60) + (dt.seconds // 60)
def min_max_timestep(data: np.ndarray) -> tuple:
    prev_row = min_step = max_step = None
    df = pd.DataFrame(data)
    for idx, row in df.iterrows():
        # the first row has nothing above it to compare against
        if not idx:
            prev_row = row
            continue
        diff = row - prev_row
        min_diff, max_diff = min(diff), max(diff)
        if min_step is None or min_diff < min_step:
            min_step = min_diff
        if max_step is None or max_diff > max_step:
            max_step = max_diff
        prev_row = row
    # convert timedelta to minutes
    return timedelta_to_minutes(min_step), timedelta_to_minutes(max_step)
result = min_max_timestep(arr)
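If the data is already in a DataFrame, the previous-row bookkeeping can be left to DataFrame.diff. This is a small sketch under that assumption; it reuses only the first two columns of the test data above, and the names steps, min_step and max_step are illustrative.

```python
import pandas as pd

# First two columns of the test matrix above, as datetime64 columns.
df = pd.DataFrame({
    0: pd.to_datetime(['2019-01-02 13:12', '2019-01-02 13:34', '2019-01-02 13:49']),
    1: pd.to_datetime(['2019-01-02 15:19', '2019-01-02 15:57', '2019-01-02 16:35']),
})

# diff() subtracts the row above from each row; the first row has no
# predecessor and becomes NaT, so it is dropped.
steps = df.diff().iloc[1:]

# Reduce per column first, then across columns.
min_step = steps.min().min()
max_step = steps.max().max()
print(min_step, max_step)
```

Both results are pandas Timedelta objects; dividing one by pd.Timedelta(minutes=1) gives the value in minutes.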
Try converting your NumPy array to a pandas DataFrame, which lets you iterate through its rows. Here is a code snippet to help you out.
import pandas as pd
import numpy as np
# Creating a DataFrame from the NumPy array
df = pd.DataFrame(your_array)
# iterating through the DataFrame
for index, row in df.iterrows():
    # print current row
    print(row)
    # print previous row (iloc selects by position; df[index - 1] would select a column)
    print(df.iloc[index - 1]) if index != 0 else print('no previous row')

How to align data of two dataframes in a single dataframe based on common dates [duplicate]

I have two CSV files from which I have created a single dataframe. The problem is that it adds all the data of 'Tatasteel' first and then the data of 'Tatamotors'.
What I want is to order them by date: for example, first the data of Tatasteel at 9:15, then the data of Tatamotors at 9:15, and so on.
Here is the code:
import pandas as pd
import datetime as dt
from datetime import timedelta
import os
import glob
exclude_days = [dt.datetime(2019, 1, 5), dt.datetime(2019, 1, 6), dt.datetime(2019, 1, 12), dt.datetime(2019, 1, 13), dt.datetime(2019, 1, 19), dt.datetime(2019, 1, 20),
                dt.datetime(2019, 1, 26), dt.datetime(2019, 1, 27)]
backtest_start = dt.datetime(2019, 1, 1)
backtest_end = dt.datetime(2019, 1, 2)
path = os.getcwd()
path = os.path.join(path, "2019/*")
dat = []
data3 = []
stock_list = glob.glob(path)
for stock in stock_list:
    rdata = pd.read_csv(stock, parse_dates=['Date'])
    dat.append(rdata)
data = pd.concat(dat, ignore_index=True)
curr_day = backtest_start
while curr_day < backtest_end:
    if curr_day not in exclude_days:
        day_start = dt.datetime(curr_day.year, curr_day.month, curr_day.day, 9, 15)
        day_end = dt.datetime(curr_day.year, curr_day.month, curr_day.day, 15, 15)
        data2 = data[data['Date'].between(day_start, day_end)]
        data2 = data2.reset_index(drop=True)
        data2 = pd.DataFrame(data2)
        print(data2)
    curr_day += timedelta(days=1)
Use sort_values:
df = df.sort_values('Date')
Or if you want the Date column as index:
df = df.set_index('Date').sort_index()
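To keep the stocks in a fixed order within each timestamp, a stable sort helps. The sketch below is a guess at the data layout: the Symbol and Close columns and their values are hypothetical stand-ins for the CSV contents, and only the Date column is taken from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
steel = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 09:15', '2019-01-01 09:16']),
    'Symbol': 'Tatasteel',
    'Close': [1.0, 2.0],
})
motors = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 09:15', '2019-01-01 09:16']),
    'Symbol': 'Tatamotors',
    'Close': [3.0, 4.0],
})

data = pd.concat([steel, motors], ignore_index=True)

# mergesort is stable: rows with equal dates keep their concat order,
# so Tatasteel stays ahead of Tatamotors at every timestamp.
data = data.sort_values('Date', kind='mergesort').reset_index(drop=True)
print(data)
```

With the default sort algorithm the relative order of rows sharing a timestamp is not guaranteed, which matters here because every timestamp appears once per stock.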

Using pandas string selection after sort_index

I can't figure out why un-commenting ts = ts.sort_index() in the code below throws a KeyError:
import datetime
import pandas as pd
df = pd.DataFrame({
    'x': [2, 1, 3],
    'd': [
        datetime.datetime(2018, 5, 21),
        datetime.datetime(2018, 5, 20),
        datetime.datetime(2018, 5, 22)
    ]
})
ts = df.set_index('d')
#ts = ts.sort_index()
ts['2018-05-21']
My assumption is that sort_index in some ways generates a new index and therefore breaks the string selection but I can't find any evidence of it.
To provide some context, I want to sort this time series in order to select a time range (e.g., ts['2018-05-21':]). If I don't sort it, it works for the example above but not for the time range.
I recommend using .loc, which selects rows by label explicitly:
#ts = df.set_index('d')
#ts = ts.sort_index()
ts.loc['2018-05-21':,:]
Out[102]:
            x
d
2018-05-21  2
2018-05-22  3
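Put together as a runnable script, the approach looks roughly like the sketch below. It uses the question's data with the sort un-commented; the variable name from_21 is only for illustration.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'x': [2, 1, 3],
    'd': [
        datetime.datetime(2018, 5, 21),
        datetime.datetime(2018, 5, 20),
        datetime.datetime(2018, 5, 22),
    ],
})

# Sort the DatetimeIndex so that label slices are well defined.
ts = df.set_index('d').sort_index()

# .loc always interprets the key as row labels, so the date string
# is matched against the index rather than the column names.
from_21 = ts.loc['2018-05-21':]
print(from_21)
```

Plain square brackets on a DataFrame are primarily a column lookup, which is why routing date strings through .loc is the less ambiguous choice.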

Matplotlib histogram from x,y values with datetime months as bins

I have an array of date time objects x and an array of y values corresponding to those datetimes. I'm trying to create a histogram which groups all those y values into the same bin by month. Basically adding all y values which are in the same month and creating a histogram which shows the total values for each month.
This is a simplified version of what my data looks like:
x = np.array([datetime.datetime(2014, 2, 1, 0, 0), datetime.datetime(2014, 2, 13, 0, 0),
              datetime.datetime(2014, 3, 4, 0, 0), datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4, 3, 2, 6])
The end result should be a histogram showing month 2 in 2014 with y value 7 and month 3 in 2014 with y value 8.
The first thing I tried was creating a pandas dataframe out of my two array like so:
frame = pd.DataFrame({'x':x,'y':y})
This worked fine with x mapping to all datetime objects and y to all corresponding values. However after creating this dataframe I'm kind of lost on how to add all the y values by month and create bins out of these months using plt.hist()
First of all, thanks for a well-posed question with an example of your data.
This seems to be what you want:
import pandas as pd
import numpy as np
import datetime
%matplotlib inline
x = np.array([datetime.datetime(2014, 2, 1, 0, 0),
              datetime.datetime(2014, 2, 13, 0, 0),
              datetime.datetime(2014, 3, 4, 0, 0),
              datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4, 3, 2, 6])
frame = pd.DataFrame({'x': x, 'y': y})
(frame.set_index('x').                     # use date-time as index
 assign(month=lambda x: x.index.month).    # add new column with month
 groupby('month').                         # group by that column
 sum().                                    # find a sum of the only column 'y'
 plot.bar())                               # make a barplot
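One caveat with grouping on index.month is that the same month from different years lands in one bucket. A minimal sketch, assuming the data may span several years, groups on a year-month period instead; the 2015 row is made up to show the difference and is not part of the question's data.

```python
import datetime

import pandas as pd

x = [datetime.datetime(2014, 2, 1), datetime.datetime(2014, 2, 13),
     datetime.datetime(2014, 3, 4), datetime.datetime(2014, 3, 6),
     datetime.datetime(2015, 2, 1)]   # hypothetical extra row in a later year
y = [4, 3, 2, 6, 10]

frame = pd.DataFrame({'x': pd.to_datetime(x), 'y': y})

# to_period('M') keeps the year, so 2014-02 and 2015-02 stay separate.
monthly = frame.groupby(frame['x'].dt.to_period('M'))['y'].sum()
print(monthly)
```

Calling monthly.plot.bar() on the result would draw one bar per year-month, in the same way as the chain above.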
Do This First
df = pd.DataFrame(dict(y=y), pd.DatetimeIndex(x, name='x'))
df
            y
x
2014-02-01  4
2014-02-13  3
2014-03-04  2
2014-03-06  6
Option 1
df.resample('M').sum().hist()
Option 2 (pd.TimeGrouper was removed in pandas 1.0; pd.Grouper(freq='M') is its replacement)
df.groupby(pd.Grouper(freq='M')).sum().hist()
Or Do This First
df = pd.DataFrame(dict(x=pd.to_datetime(x), y=y))
df
           x  y
0 2014-02-01  4
1 2014-02-13  3
2 2014-03-04  2
3 2014-03-06  6
Option 3
df.resample('M', on='x').sum().hist()

Pandas Dataframe plot rows as x values and column header as y values

I have a Pandas DataFrame with displacements for different times (rows) and specific vertical locations (columns names). The goal is to plot the displacements (x axis) for the vertical location (y axis) for a given time (series).
According to the next example (time = 0, 1, 2, 3, 4 and vertical locations = 0.5, 1.5, 2.5, 3.5), how can the displacements be plotted for the times 0 and 3?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(88)
df = pd.DataFrame({
'time': np.arange(0, 5, 1),
'0.5': np.random.uniform(-1, 1, size = 5),
'1.5': np.random.uniform(-2, 2, size = 5),
'2.5': np.random.uniform(-3, 3, size = 5),
'3.5': np.random.uniform(-4, 4, size = 5),
})
df = df.set_index('time')
You can filter your dataframe to contain only the desired rows, either by using the positional index
filtered = df.iloc[[0,3],:]
or by using the actual index labels of the dataframe with a boolean mask,
filtered = df.loc[(df.index == 3) | (df.index == 0),:]
You can then plot a scatter plot like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(88)
df = pd.DataFrame({
    'time': np.arange(0, 5, 1),
    '0.5': np.random.uniform(-1, 1, size = 5),
    '1.5': np.random.uniform(-2, 2, size = 5),
    '2.5': np.random.uniform(-3, 3, size = 5),
    '3.5': np.random.uniform(-4, 4, size = 5),
})
df = df.set_index('time')
filtered_df = df.iloc[[0,3],:]
#filtered_df = df.loc[(df.index == 3) | (df.index == 0),:]
# vertical locations as floats, taken from the column names
loc = list(map(float, df.columns))
fig, ax = plt.subplots()
# iterrows yields (index, row) pairs: row[1] holds the displacements, row[1].name the time
for row in filtered_df.iterrows():
    ax.scatter(row[1], loc, label=row[1].name)
plt.legend()
plt.show()
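The data preparation can also be done by selecting the times by label and transposing, so that each requested time becomes one column indexed by vertical location. This is only a sketch of that reshaping step, built on the question's DataFrame; the name profiles is illustrative and the plotting call is left as a comment.

```python
import numpy as np
import pandas as pd

np.random.seed(88)
df = pd.DataFrame({
    'time': np.arange(0, 5, 1),
    '0.5': np.random.uniform(-1, 1, size=5),
    '1.5': np.random.uniform(-2, 2, size=5),
    '2.5': np.random.uniform(-3, 3, size=5),
    '3.5': np.random.uniform(-4, 4, size=5),
}).set_index('time')

# Select times 0 and 3 by label, then transpose: rows become vertical
# locations and each column is the displacement profile at one time.
profiles = df.loc[[0, 3]].T
profiles.index = profiles.index.astype(float)
print(profiles)

# Each profile could then be drawn with, for example:
#   ax.plot(profiles[t], profiles.index, label=t)
```

Keeping the vertical locations as a numeric index means they can be passed directly as the y values when plotting.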
