I have the following tuple that I'm trying to create a data frame out of:
testing =
([datetime.datetime(2020, 2, 5, 0, 0),
datetime.datetime(2020, 2, 5, 2, 40),
datetime.datetime(2020, 2, 5, 5, 20),
datetime.datetime(2020, 2, 5, 8, 0),
datetime.datetime(2020, 2, 5, 10, 40),
datetime.datetime(2020, 2, 5, 13, 20),
datetime.datetime(2020, 2, 5, 16, 0),
datetime.datetime(2020, 2, 5, 18, 40),
datetime.datetime(2020, 2, 5, 21, 20),
datetime.datetime(2020, 2, 6, 0, 0)],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
I use this snippet to create a data frame:
df_testing = pd.DataFrame(testing)
df_testing.head()
However, this causes the kernel to die every time. If I only look at one item (e.g. I do df_testing = pd.DataFrame(testing[0])), the code runs fine.
I'm not super familiar with using tuples, so is there some property that prevents them from being turned into a data frame?
NOTE:
There is a lot of code that generates this "testing" variable; it's just a portion of the overall data I would like to eventually convert. I filled in some dummy data for the example here. I would prefer not to modify the data type of this variable if at all possible.
Also I'm running Python 3.7 in case that matters.
EDIT:
Here is a screenshot of me trying to run the test code I put in.
I just ran your exact code (note that you used different variable names: test vs. testing).
After changing the variable names it worked just fine:
I guess the problem is with your JupyterLab installation.
I would use:
new_df = pd.Series(dict(zip(*test))).to_frame('name_column')
print(new_df)
or
new_df = pd.DataFrame({'name_column':dict(zip(*test))})
print(new_df)
Output
name_column
2020-02-05 00:00:00 1
2020-02-05 02:40:00 2
2020-02-05 05:20:00 3
2020-02-05 08:00:00 4
2020-02-05 10:40:00 5
2020-02-05 13:20:00 6
2020-02-05 16:00:00 7
2020-02-05 18:40:00 8
2020-02-05 21:20:00 9
2020-02-06 00:00:00 10
You could use DataFrame.reset_index if you want to convert the index into a column.
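For instance, a minimal sketch of that reset_index step on a shortened version of the data (the 'Date' column name here is an assumption):

```python
import datetime

import pandas as pd

# Dummy two-row version of the data from the question
test = ([datetime.datetime(2020, 2, 5, 0, 0),
         datetime.datetime(2020, 2, 5, 2, 40)],
        [1, 2])

new_df = pd.Series(dict(zip(*test))).to_frame('name_column')
# reset_index moves the datetime index into a regular column;
# 'Date' is an assumed name for it
new_df = new_df.reset_index().rename(columns={'index': 'Date'})
print(new_df)
```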
Another option is DataFrame.transpose:
new_df = pd.DataFrame(test, index=['Date', 'values']).T
print(new_df)
Date values
0 2020-02-05 00:00:00 1
1 2020-02-05 02:40:00 2
2 2020-02-05 05:20:00 3
3 2020-02-05 08:00:00 4
4 2020-02-05 10:40:00 5
5 2020-02-05 13:20:00 6
6 2020-02-05 16:00:00 7
7 2020-02-05 18:40:00 8
8 2020-02-05 21:20:00 9
9 2020-02-06 00:00:00 10
Related
I have a DataFrame that is indexed by Date and has a couple of columns like this:
XLY UA
Date
2017-04-01 0.023991 0.060656
2017-05-01 0.010993 -0.081401
2017-06-01 -0.015596 0.130679
2017-07-01 0.019302 -0.101686
2017-08-01 -0.018608 -0.166207
2017-09-01 0.004684 -0.005298
2017-10-01 0.021203 -0.232357
2017-11-01 0.050658 0.034692
2017-12-01 0.021107 0.116513
2018-01-01 0.092411 -0.035285
2018-02-01 -0.034691 0.171206
...
2022-03-01 0.079468 0.039667
I have a python dictionary of weights
weights = {2022: 6, 2021: 5, 2020: 4, 2019: 3, 2018: 2, 2017: 1}
Is there a way to apply these weights to each row of the DataFrame so that, for example, the row 2022-03-01 becomes 0.079468 * 6 and 0.039667 * 6, and so on for all the rows in the year 2022; when it gets to 2021, it would apply * 5, etc.?
I know I can loop and do this. I am looking for a functional concise version.
Use mul on axis=0:
weights = {2022: 6, 2021: 5, 2020: 4, 2019: 3, 2018: 2, 2017: 1}
cols = ['XLY', 'UA']
df[cols] = df[cols].mul(df.index.year.map(weights), axis=0)
print(df)
# Output
XLY UA
Date
2017-04-01 0.023991 0.060656
2017-05-01 0.010993 -0.081401
2017-06-01 -0.015596 0.130679
2017-07-01 0.019302 -0.101686
2017-08-01 -0.018608 -0.166207
2017-09-01 0.004684 -0.005298
2017-10-01 0.021203 -0.232357
2017-11-01 0.050658 0.034692
2017-12-01 0.021107 0.116513
2018-01-01 0.184822 -0.070570
2018-02-01 -0.069382 0.342412
2022-03-01 0.476808 0.238002
I would do something like this:
col_weights = np.array([weights[dt.year] for dt in df.index.get_level_values(0)])
df.loc[:, "XLY"] = df["XLY"] * col_weights
df.loc[:, "UA"] = df["UA"] * col_weights
The first line builds a weights array by mapping each index year through the weights dict.
The next lines apply the weight to each column.
So, I have 2 data frames where the first one has the following structure:
'ds', '1_sensor_id', '1_val_1', '1_val_2'
0 2019-09-13 12:40:00 33469 30 43
1 2019-09-13 12:45:00 33469 43 43
The second one has the following structure:
'ds', '2_sensor_id', '2_val_1', '2_val_2'
0 2019-09-13 12:42:00 20006 6 50
1 2019-09-13 12:47:00 20006 5 80
So what I want to do is merge the two DataFrames through interpolation. Ultimately, the merged frame should have values defined at the timestamps (ds) from frame 1, with the 2_val_1 and 2_val_2 columns interpolated, so that the merged frame has a row for each value in the ds column of frame 1. What would be the best way to do this in pandas? I tried the merge_asof function, but it does nearest-neighbour matching and I did not get all the timestamps back.
You can concatenate one frame onto the other and use interpolate(). Example:
import datetime
import pandas as pd
df1 = pd.DataFrame(columns=['ds', '1_sensor_id', '1_val_1', '1_val_2'],
data=[[datetime.datetime(2019, 9, 13, 12, 40, 00), 33469, 30, 43],
[datetime.datetime(2019, 9, 13, 12, 45, 00), 33469, 33, 43]])
df2 = pd.DataFrame(columns=['ds', '2_sensor_id', '2_val_1', '2_val_2'],
data=[[datetime.datetime(2019, 9, 13, 12, 42, 00), 20006, 6, 50],
[datetime.datetime(2019, 9, 13, 12, 47, 00), 20006, 5, 80]])
df = pd.concat([df1, df2], sort=False)  # DataFrame.append was removed in pandas 2.0
df.set_index('ds', inplace=True)
df.interpolate(method = 'time', limit_direction='backward', inplace=True)
print(df)
1_sensor_id 1_val_1 ... 2_val_1 2_val_2
ds ...
2019-09-13 12:40:00 33469.0 30.0 ... 6.0 50.0
2019-09-13 12:45:00 33469.0 33.0 ... 5.4 68.0
2019-09-13 12:42:00 NaN NaN ... 6.0 50.0
2019-09-13 12:47:00 NaN NaN ... 5.0 80.0
This seems like a basic question. I want to use the datetime index of a pandas DataFrame as the x values of a machine learning algorithm for univariate time series comparisons.
I tried to isolate the index and then convert it to a number, but I get an error.
df=data["Close"]
idx=df.index
df.index.get_loc(idx)
Date
2014-03-31 0.9260
2014-04-01 0.9269
2014-04-02 0.9239
2014-04-03 0.9247
2014-04-04 0.9233
This is what I get when I add your code:
2019-04-24 00:00:00 0.7097
2019-04-25 00:00:00 0.7015
2019-04-26 00:00:00 0.7018
2019-04-29 00:00:00 0.7044
x (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Name: Close, Length: 1325, dtype: object
I need a column of 1 to the number of values in my dataframe.
First select the column Close with double brackets to get a one-column DataFrame, which makes it possible to add a new column:
df = data[["Close"]]
df["x"] = np.arange(1, len(df) + 1)
print (df)
Close x
Date
2014-03-31 0.9260 1
2014-04-01 0.9269 2
2014-04-02 0.9239 3
2014-04-03 0.9247 4
2014-04-04 0.9233 5
You can add a column with the values range(1, len(df) + 1) like so:
df = pd.DataFrame({"y": [5, 4, 3, 2, 1]}, index=pd.date_range(start="2019-08-01", periods=5))
In [3]: df
Out[3]:
y
2019-08-01 5
2019-08-02 4
2019-08-03 3
2019-08-04 2
2019-08-05 1
df["x"] = range(1, len(df) + 1)
In [7]: df
Out[7]:
y x
2019-08-01 5 1
2019-08-02 4 2
2019-08-03 3 3
2019-08-04 2 4
2019-08-05 1 5
I want to multi-index an array of data.
Initially, I was indexing my data with datetime, but for some later applications I had to add another numeric index (one that goes from 0 to len(array)-1).
I have written those little lines:
O = [0.701733664614, 0.699495411782, 0.572129320819, 0.613315597684, 0.58079660603, 0.596638918579, 0.48453382119]
Ab = [datetime.datetime(2018, 12, 11, 14, 0), datetime.datetime(2018, 12, 21, 10, 0), datetime.datetime(2018, 12, 21, 14, 0), datetime.datetime(2019, 1, 1, 10, 0), datetime.datetime(2019, 1, 1, 14, 0), datetime.datetime(2019, 1, 11, 10, 0), datetime.datetime(2019, 1, 11, 14, 0)]
tst = pd.Series(O,index=Ab)
ld = len(tst)
index = pd.MultiIndex.from_product([(x for x in range(0,ld)),Ab], names=['id','dtime'])
print (index)
data = pd.Series(O,index=index)
But when printing index, I get some bizarre "codes":
The levels & names are perfect, but the codes go from 0 to 763... 764 times (instead of once)!
I tried to add the set_codes call:
index.set_codes([x for x in range(0,ld)], level=0)
print (index)
In vain; I get the following error:
ValueError: Unequal code lengths: [764, 583696]
the initial pandas series:
print (tst)
2005-01-01 14:00:00 0.544177
2005-01-01 14:00:00 0.544177
2005-01-21 14:00:00 0.602239
...
2019-05-21 10:00:00 0.446813
2019-05-21 14:00:00 0.466573
Length: 764, dtype: float64
the new expected one
id dtime
0 2005-01-01 14:00:00 0.544177
1 2005-01-01 14:00:00 0.544177
2 2005-01-21 14:00:00 0.602239
...
762 2019-05-21 10:00:00 0.446813
763 2019-05-21 14:00:00 0.466573
Thanks in advance
You can create new index by MultiIndex.from_arrays and reassign to Series:
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index], names=['id','dtime'])
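A minimal sketch of that reassignment on a small dummy Series, to show that from_arrays pairs each position with its timestamp element-wise instead of taking the cartesian product (which is what from_product did in the question):

```python
import datetime

import numpy as np
import pandas as pd

# Small dummy Series shaped like the one in the question
Ab = [datetime.datetime(2018, 12, 11, 14, 0),
      datetime.datetime(2018, 12, 21, 10, 0),
      datetime.datetime(2018, 12, 21, 14, 0)]
s = pd.Series([0.701, 0.699, 0.572], index=Ab)

# from_arrays zips the two level arrays element-wise, so the result
# has exactly len(s) entries; from_product would give len(s) ** 2
s.index = pd.MultiIndex.from_arrays([np.arange(len(s)), s.index],
                                    names=['id', 'dtime'])
print(s)
```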
below is a simplified version of my setup:
import pandas as pd
import datetime as dt
df_data = pd.DataFrame({'DateTime' : [dt.datetime(2017, 9, 1, 0, 0, 0),dt.datetime(2017, 9, 1, 1, 0, 0),dt.datetime(2017, 9, 1, 2, 0, 0),dt.datetime(2017, 9, 1, 3, 0, 0)], 'Data' : [1,2,3,5]})
df_timeRanges = pd.DataFrame({'startTime':[dt.datetime(2017, 8, 30, 0, 0, 0), dt.datetime(2017, 9, 1, 1, 30, 0)], 'endTime':[dt.datetime(2017, 9, 1, 0, 30, 0), dt.datetime(2017, 9, 1, 2, 30, 0)]})
print(df_data)
print(df_timeRanges)
This gives:
Data DateTime
0 1 2017-09-01 00:00:00
1 2 2017-09-01 01:00:00
2 3 2017-09-01 02:00:00
3 5 2017-09-01 03:00:00
endTime startTime
0 2017-09-01 00:30:00 2017-08-30 00:00:00
1 2017-09-01 02:30:00 2017-09-01 01:30:00
I would like to filter df_data with df_timeRanges, with the remaining rows in a single dataframe, kind of like:
df_data_filt = df_data[(df_data['DateTime'] >= df_timeRanges['startTime']) & (df_data['DateTime'] <= df_timeRanges['endTime'])]
I did not expect the above line to work, and it returned this error:
ValueError: Can only compare identically-labeled Series objects
Would anyone be able to provide some tips on this? The df_data and df_timeRanges in my real task are much bigger.
Thanks in advance
IIUIC, Use
In [794]: mask = np.logical_or.reduce([
(df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
for i, x in df_timeRanges.iterrows()])
In [795]: df_data[mask]
Out[795]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
Or, also
In [807]: func = lambda x: (df_data.DateTime >= x.startTime) & (df_data.DateTime <= x.endTime)
In [808]: df_data[df_timeRanges.apply(func, axis=1).any()]
Out[808]:
Data DateTime
0 1 2017-09-01 00:00:00
2 3 2017-09-01 02:00:00
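Another sketch of the same filter, swapping in pd.IntervalIndex so each row of df_timeRanges becomes a closed interval and membership is tested directly:

```python
import datetime as dt

import pandas as pd

# Same data as in the question
df_data = pd.DataFrame({'DateTime': [dt.datetime(2017, 9, 1, h, 0, 0) for h in range(4)],
                        'Data': [1, 2, 3, 5]})
df_timeRanges = pd.DataFrame(
    {'startTime': [dt.datetime(2017, 8, 30, 0, 0, 0), dt.datetime(2017, 9, 1, 1, 30, 0)],
     'endTime': [dt.datetime(2017, 9, 1, 0, 30, 0), dt.datetime(2017, 9, 1, 2, 30, 0)]})

# One closed interval per row of df_timeRanges
intervals = pd.IntervalIndex.from_arrays(df_timeRanges['startTime'],
                                         df_timeRanges['endTime'],
                                         closed='both')
# Keep a row if its timestamp falls inside any of the intervals
mask = df_data['DateTime'].apply(lambda t: intervals.contains(t).any())
df_data_filt = df_data[mask]
print(df_data_filt)
```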