Pandas gives integer from datetime index with item() - python

I have 2 dataframes, taken from a larger frame (with df.head(x)), both with the same index:
print df
val
DT
2017-03-06 00:00:00 1.06207
2017-03-06 00:02:00 1.06180
2017-03-06 00:04:00 1.06167
2017-03-06 00:06:00 1.06141
2017-03-06 00:08:00 1.06122
... ...
2017-03-10 21:50:00 1.06719
2017-03-10 21:52:00 1.06719
2017-03-10 21:54:00 1.06697
2017-03-10 21:56:00 1.06713
2017-03-10 21:58:00 1.06740
a and b are then taken from df:
print a.index
print b.index
DatetimeIndex(['2017-03-06 00:32:00'], dtype='datetime64[ns]', name=u'DT', freq=None)
DatetimeIndex(['2017-03-06 00:18:00'], dtype='datetime64[ns]', name=u'DT', freq=None)
But when I use a.index.item(), I get an integer like 1488759480000000000. That means when
I go to take a slice from df based on a and b, I get an empty dataframe:
>>> df[a.index.item() : b.index.item()]
Empty DataFrame
and further to that, when I try to convert them both:
df[a.index.to_pydatetime() : b.index.to_pydatetime()]
TypeError: Cannot convert input [[datetime.datetime(2017, 3, 6, 0, 18)]] of type <type 'numpy.ndarray'> to Timestamp
This is infuriating, surely there should be continuity of objects when using item(). Can anyone give me some pointers?

You can use loc with first value of a and b:
df.loc[a.index[0] : b.index[0]]
Your solution works if you convert to Timestamp:
print (df.loc[pd.Timestamp(a.index.item()): pd.Timestamp(b.index.item())])
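For context, a minimal sketch of what was happening: older pandas versions unboxed the index value via item() to its raw numpy representation, an integer count of nanoseconds since the Unix epoch, which is why the slice failed silently.
import pandas as pd

idx = pd.DatetimeIndex(['2017-03-06 00:32:00'], name='DT')

# The raw numpy representation: nanoseconds since the Unix epoch.
print(idx.asi8[0])                # 1488760320000000000

# pd.Timestamp accepts raw nanoseconds, so the round-trip works:
print(pd.Timestamp(idx.asi8[0]))  # 2017-03-06 00:32:00

# Plain positional indexing sidesteps the issue entirely:
print(idx[0])                     # 2017-03-06 00:32:00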

Related

Converting columns with hours to datetime type pandas

I am trying to convert my "time" column, with values in the form "hh:mm:ss", in my pandas frame from object to datetime64, as I want to filter by hours.
I tried new['Time'] = pd.to_datetime(new['Time'], format='%H:%M:%S').dt.time, which has no effect at all (it is still an object).
I also tried new['Time'] = pd.to_datetime(new['Time'], infer_datetime_format=True)
which gets the error message: TypeError: <class 'datetime.time'> is not convertible to datetime
I want to be able to sort my data frame by hours.
How do I convert the object to the hour?
Can I then filter by hour (for example everything after 8am), or do I have to enter the exact value with minutes and seconds to filter for it?
Thank you
If you want your df['Time'] to be of type datetime64 just use
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
print(df['Time'])
This will result in the following column
0 1900-01-01 00:00:00
1 1900-01-01 00:01:00
2 1900-01-01 00:02:00
3 1900-01-01 00:03:00
4 1900-01-01 00:04:00
...
1435 1900-01-01 23:55:00
1436 1900-01-01 23:56:00
1437 1900-01-01 23:57:00
1438 1900-01-01 23:58:00
1439 1900-01-01 23:59:00
Name: Time, Length: 1440, dtype: datetime64[ns]
If you just want to extract the hour from the timestamp, extend pd.to_datetime(...) with .dt.hour
If you want to group your values on an hourly basis you can also use (after converting the df['Time'] to datetime):
new_df = df.groupby(pd.Grouper(key='Time', freq='H'))['Value'].agg(list)
This will return all values grouped by hour.
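To make the hour-based filtering from the question concrete, here is a minimal sketch (the frame and its contents are assumed for illustration):
import pandas as pd

# Hypothetical frame with "HH:MM:SS" strings stored as object dtype
new = pd.DataFrame({'Time': ['07:15:00', '08:30:00', '21:05:00']})

# Parse once and keep the hour as integers for filtering or sorting
hour = pd.to_datetime(new['Time'], format='%H:%M:%S').dt.hour
print(new[hour >= 8])  # everything from 8am onwards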
IIUC, you already have a time structure from datetime module:
Suppose this dataframe:
import pandas as pd
from datetime import time

df = pd.DataFrame({'Time': [time(10, 39, 23), time(8, 47, 59), time(9, 21, 12)]})
print(df)
# Output:
Time
0 10:39:23
1 08:47:59
2 09:21:12
Few operations:
# Check if you have really `time` instance
>>> df['Time'].iloc[0]
datetime.time(10, 39, 23)
# Sort values by time
>>> df.sort_values('Time')
Time
1 08:47:59
2 09:21:12
0 10:39:23
# Extract rows between 08:00 and 09:00
>>> df[df['Time'].between(time(8), time(9))]
Time
1 08:47:59

Pandas read_excel function ignoring dtype

I'm trying to read an excel file with pd.read_excel().
The excel file has 2 columns, Date and Time, and I want to read both columns as str, not the excel dtype.
Example of the excel file
I've tried to specify the dtype or the converters arguments to no avail.
df = pd.read_excel('xls_test.xlsx',
                   dtype={'Date': str, 'Time': str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
As you can see the Date column is not treated as str...
Same thing when using converters
df = pd.read_excel('xls_test.xlsx',
                   converters={'Date': str, 'Time': str})
df.dtypes
Date object
Time object
dtype: object
df.head()
Date Time
0 2020-03-08 00:00:00 10:00:00
1 2020-03-09 00:00:00 11:00:00
2 2020-03-10 00:00:00 12:00:00
3 2020-03-11 00:00:00 13:00:00
4 2020-03-12 00:00:00 14:00:00
I have also tried to use other engines, but the result is always the same.
The dtype argument seems to work as expected when reading a csv, though.
What am I doing wrong here?
Edit:
I forgot to mention, I'm using the latest version of pandas, 1.2.2, but had the same problem before updating from 1.1.2.
Here is a simple solution: even if you pass "str" as a dtype, the column will still come back as object. Use the code below to read the columns with the dedicated string dtype instead.
df = pd.read_excel("xls_test.xlsx", dtype={'Date': 'string', 'Time': 'string'})
To understand more about the pandas string dtype, see:
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
Let me know if you have any issues with that!
The problem you're having is that cells in Excel have datatypes. Here the data type is a date or a time, and it's formatted for display only. Loading it "directly" means loading a datetime type*.
This means that, whatever you do with the dtype= argument, the data will be loaded as a date and then converted to string, giving you the result you see:
>>> pd.read_excel('test.xlsx').head()
Date Time Datetime
0 2020-03-08 10:00:00 2020-03-08 10:00:00
1 2020-03-09 11:00:00 2020-03-09 11:00:00
2 2020-03-10 12:00:00 2020-03-10 12:00:00
3 2020-03-11 13:00:00 2020-03-11 13:00:00
4 2020-03-12 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date datetime64[ns]
Time object
Datetime datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
Date Time Datetime
0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00
1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00
2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00
3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00
4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date string
Time string
Datetime string
dtype: object
Only in csv files is datetime data stored as text in the file. There, loading it "directly" as a string makes sense. With an excel file, you may as well load it as a date and format it with .dt.strftime()
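For instance, if a fixed text format is all you need, a minimal sketch along those lines (file and column names taken from the question):
import pandas as pd

df = pd.read_excel('test.xlsx')                  # Date loads as datetime64[ns]
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')  # re-apply formatting as str
print(df['Date'].head())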
That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:
load data
re-apply formatting
There is some translation to be done between the formatting types, and you can't use pandas directly; however, you can use openpyxl, the engine that pandas uses as a backend:
import datetime
import re

import openpyxl

# Excel date tokens -> strftime directives
date_corresp = {
    'dd': '%d',
    'mm': '%m',
    'yy': '%y',
    'yyyy': '%Y',
}
# Excel time tokens -> strftime directives
time_corresp = {
    'hh': '%H',
    'mm': '%M',
    'ss': '%S',
}

def datecell_as_formatted(cell):
    if isinstance(cell.value, datetime.time):
        dfmt, tfmt = '', cell.number_format
    elif isinstance(cell.value, (datetime.date, datetime.datetime)):
        # Excel separates the date and time parts with an escaped space
        dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
    else:
        raise ValueError('Not a datetime cell')
    # Replace each Excel token with its strftime equivalent
    for fmt in re.split(r'\W', dfmt):
        if fmt:
            dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)
    for fmt in re.split(r'\W', tfmt):
        if fmt:
            tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)
    return cell.value.strftime(dfmt + tfmt)
Which you can then use as follows:
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'
(You can also complete the _corresp dictionaries with more date/time formatting items if they are incomplete)
* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as a number or on this excelcampus page.
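A quick sketch of that footnote: because Excel (incorrectly) treats 1900 as a leap year, serial-to-date conversion is conventionally done against an epoch of 1899-12-30 (the serial value below is an assumption for illustration):
import pandas as pd

serial = 43898.0  # Excel's internal float for 2020-03-08
print(pd.Timestamp('1899-12-30') + pd.Timedelta(days=serial))
# 2020-03-08 00:00:00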
The issue, just like the other comments say, is most likely a bug.
Although not ideal, you could always do something like this:
import pandas as pd

# df = pd.read_excel('test.xlsx', dtype={'Date': str, 'Time': str})
# this line can then be simplified to:
df = pd.read_excel('test.xlsx')
df['Date'] = df['Date'].apply(lambda x: '"' + str(x) + '"')
df['Time'] = df['Time'].apply(lambda x: '"' + str(x) + '"')
print (df)
print(df['Date'].dtype)
print(df['Time'].dtype)
Date Time
0 "2020-03-08 00:00:00" "10:00:00"
1 "2020-03-09 00:00:00" "11:00:00"
2 "2020-03-10 00:00:00" "12:00:00"
3 "2020-03-11 00:00:00" "13:00:00"
4 "2020-03-12 00:00:00" "14:00:00"
5 "2020-03-13 00:00:00" "15:00:00"
6 "2020-03-14 00:00:00" "16:00:00"
7 "2020-03-15 00:00:00" "17:00:00"
8 "2020-03-16 00:00:00" "18:00:00"
9 "2020-03-17 00:00:00" "19:00:00"
10 "2020-03-18 00:00:00" "20:00:00"
11 "2020-03-19 00:00:00" "21:00:00"
object
object
Since version 1.0.0, there are two ways to store text data in pandas: object or StringDtype (source).
And since version 1.1.0, StringDtype now works in all situations where astype(str) or dtype=str work (source).
All dtypes can now be converted to StringDtype
You just need to specify dtype="string" when loading your data with pandas:
>>> df = pd.read_excel('xls_test.xlsx', dtype="string")
>>> df.dtypes
Date string
Time string
dtype: object
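If the frame is already loaded, the same conversion can be applied after the fact with astype (a small follow-up sketch):
import pandas as pd

df = pd.read_excel('xls_test.xlsx')
df = df.astype({'Date': 'string', 'Time': 'string'})
print(df.dtypes)  # Date and Time are now string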

Pandas: Using iterrows() and pd.Series to Append Values to Series

My input data looks like this:
cat start target
0 1 2016-09-01 00:00:00 4.370279
1 1 2016-09-01 00:00:00 1.367778
2 1 2016-09-01 00:00:00 0.385834
I want to build out a series using "start" for the Start Date and "target" for the series values. The iterrows() is pulling the correct values for "imp", but when appending to the time_series, only the first value is carried through to all series points. What's the reason for "data=imp" pulling the 0th row every time?
t0 = model_input_test['start'][0]     # t0 = 2016-09-01 00:00:00
num_ts = len(model_input_test.index)  # num_ts = 1348
time_series = []
for i, row in model_input_test.iterrows():
    imp = row.loc['target']
    print(imp)
    index = pd.DatetimeIndex(start=t0, freq='H', periods=num_ts)
    time_series.append(pd.Series(data=imp, index=index))
Series "time_series" should look like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
But ends up looking like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 4.370279
2016-09-01 02:00:00 4.370279
I'm using Jupyter conda_python3 on Sagemaker.
When using dataframes, there are usually better ways to go about tasks than iterating through the dataframe. For example, in your case, you can create your series like this:
time_series = df.set_index(pd.date_range(pd.to_datetime(df.start).iloc[0],
                                         periods=len(df), freq='H'))['target']
>>> time_series
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
Freq: H, Name: target, dtype: float64
>>> type(time_series)
<class 'pandas.core.series.Series'>
Essentially, this says: "set the index to be a date range incremented hourly from your first date, then take the target column"
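As for the "why" in the question: passing a scalar as data= to pd.Series broadcasts that scalar across the entire index, so each loop iteration built a full-length series of a single repeated value. A minimal demonstration:
import pandas as pd

idx = pd.date_range('2016-09-01', periods=3, freq='H')

# A scalar is repeated for every index entry:
print(pd.Series(data=4.370279, index=idx))
# 2016-09-01 00:00:00    4.370279
# 2016-09-01 01:00:00    4.370279
# 2016-09-01 02:00:00    4.370279
# Freq: H, dtype: float64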
Given a dataframe df with columns start and target, you can simply use set_index:
time_series = df.set_index('start')['target']

How can I print certain rows from a CSV in pandas

My problem is that I have a big dataframe with over 40000 rows, and now I want to select the rows from 2013-01-01 00:00:00 until 2013-12-31 00:00:00
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
That's my code now, but I cannot choose an interval to print out... any ideas?
One way is to set your index as datetime and then use pd.DataFrame.loc with string indexers:
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
                   'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').loc['2010-01-01':'2013-01-01']
print(res)
Var
Date
2013-01-01 1
2011-10-01 3
Make a datetime object and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01
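Note that applying the same mask to the whole frame returns the full rows rather than just the date column (a small standalone sketch):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2013-01-01', '2014-03-01',
                                           '2011-10-01', '2013-05-01'])})
mask = (df['date'] >= '2013-01-01') & (df['date'] <= '2013-12-31')
print(df.loc[mask])  # full rows, not just the date column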

Selecting single row as dataframe with DatetimeIndex

I have a time series in a dataframe with DatetimeIndex like that:
import pandas as pd

dates = ["2015-10-01 00:00:00",
         "2015-10-01 01:00:00",
         "2015-10-01 02:00:00",
         "2015-10-01 03:00:00",
         "2015-10-01 04:00:00"]
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0, 5)
Out[]:
values
2015-10-01 00:00:00 0
2015-10-01 01:00:00 1
2015-10-01 02:00:00 2
2015-10-01 03:00:00 3
2015-10-01 04:00:00 4
I would like, as simply and cleanly as possible, to select a single row like the one below, with the date as the key, e.g. "2015-10-01 02:00:00":
Out[]:
values
2015-10-01 02:00:00 2
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
df.loc["2015-10-01 02:00:00",:]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
print(type(df.loc["2015-10-01 02:00:00"]))
print(type(df.loc["2015-10-01 02:00:00",:]))
print(df.loc["2015-10-01 02:00:00"].shape)
print(df.loc["2015-10-01 02:00:00",:].shape)
Out[]:
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
(1,)
(1,)
I could wrap any of those in a DataFrame like this:
slize = pd.DataFrame(df.loc["2015-10-01 02:00:00",:])
Out[]:
2015-10-01 02:00:00
values 2
Of course I could do this to reach my result:
slize.T
Out[]:
values
2015-10-01 02:00:00 2
But at this point I could just as well have a column Series, and it is hard to test whether a Series represents a row or a column before adding the .T automatically.
Did I miss a way of selecting what I want?
I recommend generating your index using pd.date_range for convenience, and then using .loc with a Timestamp or datetime object.
from datetime import datetime
import pandas as pd
start = datetime(2015, 10, 1, 0, 0, 0)
end = datetime(2015, 10, 1, 4, 0, 0)
dates = pd.date_range(start, end, freq='H')
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Then you can use .loc with a Timestamp or datetime object.
In [2]: df.loc[[start]]
Out[2]:
values
2015-10-01 0
Further details
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
The KeyError occurs because plain df[...] indexing with that string looks for a column named "2015-10-01 02:00:00".
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
Your second option cannot work with string indexing inside a list; you should use exact indexing instead, as mentioned.
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
If you use .loc on a single row, the result is coerced to a Series, as you noticed. Hence you need to cast it back to a DataFrame and then transpose the result.
You can convert the string to datetime and use exact indexing:
print (df.loc[[pd.to_datetime("2015-10-01 02:00:00")]])
values
2015-10-01 02:00:00 2
Or convert Series to DataFrame and transpose:
print (df.loc["2015-10-01 02:00:00"].to_frame().T)
values
2015-10-01 02:00:00 2
df[df[time_series_row] == "data_to_match"]
Sorry for the formatting. On my phone, will update when I’m back at a computer.
Edit:
I would generally write it like this:
bitmask = df[time_series_row] == "data_to_match"
row = df[bitmask]
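Adapted to the question's frame, where the timestamps live in the index rather than in a column, the same boolean-mask idea would look like this (a sketch, not the answerer's exact code):
import pandas as pd

dates = pd.date_range("2015-10-01", periods=5, freq="H")
df = pd.DataFrame({"values": range(5)}, index=dates)

bitmask = df.index == pd.Timestamp("2015-10-01 02:00:00")
print(df[bitmask])  # a one-row DataFrame, not a Series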
