processing timestamps in different formats in Pandas - python

Consider this simple example
import pandas as pd
mydf = pd.DataFrame({'timestamp': ['Tue, 27 Jul 2021 06:43:18 +0000',
'Sun, 20 Jun 2021 17:00:17 GMT',
'Wed, 28 Jul 2021 08:44:00 -0400']})
mydf
Out[50]:
timestamp
0 Tue, 27 Jul 2021 06:43:18 +0000
1 Sun, 20 Jun 2021 17:00:17 GMT
2 Wed, 28 Jul 2021 08:44:00 -0400
I am trying to convert all the timestamps to GMT and get rid of the timezone offset.
Unfortunately, the usual solution does not work
pd.to_datetime(mydf['timestamp']).dt.tz_localize(None)
AttributeError: Can only use .dt accessor with datetimelike values
What is the issue?
Thanks!

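The problem is that pd.to_datetime cannot turn mydf['timestamp'] into a datetime64 column here: the strings carry different UTC offsets, so the result is a plain object column, and accessing .dt on it raises the error.
You can pass utc=True to make the conversion timezone-aware. Every value is then converted to UTC, and tz_localize(None) drops the timezone afterwards.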
pd.to_datetime(mydf['timestamp'], utc=True).dt.tz_localize(None)
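As a hedged alternative sketch (not part of the answer above): the sample strings look like RFC 2822 dates, so, assuming that holds for the whole column, the standard-library function email.utils.parsedate_to_datetime can parse each value before pandas normalizes everything to UTC. The column name utc_naive is made up for illustration.

```python
from email.utils import parsedate_to_datetime

import pandas as pd

mydf = pd.DataFrame({'timestamp': ['Tue, 27 Jul 2021 06:43:18 +0000',
                                   'Sun, 20 Jun 2021 17:00:17 GMT',
                                   'Wed, 28 Jul 2021 08:44:00 -0400']})

# Parse each string into a timezone-aware datetime (assumes RFC 2822 style input).
parsed = mydf['timestamp'].map(parsedate_to_datetime)

# Convert everything to UTC, then drop the timezone information.
# 'utc_naive' is a hypothetical column name used only in this sketch.
mydf['utc_naive'] = pd.to_datetime(parsed, utc=True).dt.tz_localize(None)
print(mydf)
```

This avoids depending on how a given pandas version infers the format of mixed-offset strings.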

Related

Convert pandas index to date format for sorting

I have a dataframe (df) with a type string index:
0
Jan 2021 0.852144
Jan 2022 0.442388
Feb 2021 0.406946
Feb 2022 0.296960
: :
Nov 2021 0.829171
Nov 2022 0.601725
Dec 2021 0.214810
Dec 2022 0.673403
How do I convert the index to type datetime so I can sort the df to look like:
0
Jan 2021 0.686585
Feb 2021 0.214810
Mar 2021 0.852144
Apr 2021 0.920720
: :
Sep 2022 0.770413
Oct 2022 0.751213
Nov 2022 0.601725
Dec 2022 0.924836
To keep the original index format and only sort the rows, use DataFrame.sort_index with the key parameter:
df = df.sort_index(key=lambda x: pd.to_datetime(x))
If you need a DatetimeIndex, convert the index first and then sort; the displayed index then looks different:
df.index = pd.to_datetime(df.index)
df = df.sort_index()
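A minimal sketch of both variants on a small made-up frame. The values and the explicit format '%b %Y' are assumptions for illustration; the format string simply matches index labels such as 'Jan 2021'.

```python
import pandas as pd

# Small made-up frame with a string index in "Mon YYYY" form.
df = pd.DataFrame({0: [0.85, 0.44, 0.41, 0.30]},
                  index=['Jan 2022', 'Feb 2021', 'Jan 2021', 'Dec 2021'])

# Variant 1: keep the string labels, sort by their datetime value.
sorted_df = df.sort_index(key=lambda idx: pd.to_datetime(idx, format='%b %Y'))

# Variant 2: replace the index with a DatetimeIndex, then sort.
converted = df.copy()
converted.index = pd.to_datetime(converted.index, format='%b %Y')
converted = converted.sort_index()

print(sorted_df)
print(converted)
```

The key parameter of sort_index is only available in reasonably recent pandas versions (1.1 or later).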

How can I order the table by month and then show it in plot chart? Python

I want to order the table by year and then by month.
sort_values doesn't work for me.
After that, I need to show it in a line chart with the months over time.
How can I do it?
df10=df.groupby(['year','month'],as_index=False).Sales.sum()
df10
year month Sales
0 2018 Apr 452546547.720000
1 2018 Aug 452830473.750001
2 2018 Dec 525888501.900000
3 2018 Feb 417589010.130000
4 2018 Jan 506665837.860000
5 2018 Jul 527113871.520000
6 2018 Jun 489527703.960000
7 2018 Mar 471807206.670001
8 2018 May 517740285.600000
9 2018 Nov 417862539.330000
10 2018 Oct 441153829.710001
11 2018 Sep 450298873.800000
12 2019 Apr 440397073.890000
13 2019 Feb 408684717.060001
14 2019 Jan 511212275.310001
15 2019 Mar 455560627.320000
16 2019 May 571120956.510000
sns.lineplot(x='month',y='Sales',data=df10)
To sort by month, you need the month as a number or as a string that sorts correctly. Either way, the code below derives the month number from the abbreviated name; after that you can plot the df however you like.
from time import strptime
import matplotlib.pyplot as plt
import seaborn as sns
# Map 'Jan' -> 1, 'Feb' -> 2, ... and sort chronologically
df10['month_num'] = [strptime(x, '%b').tm_mon for x in df10['month']]
df10 = df10.sort_values(['year', 'month_num'])
# Build a real date column for the x-axis
df10['y-m'] = df10['year'].astype(str) + '-' + df10['month']
df10['y-m'] = pd.to_datetime(df10['y-m'], format='%Y-%b')
sns.lineplot(y='Sales', x='y-m', data=df10)
plt.xticks(rotation=45)
plt.show()
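When sorting by dates, you first need to turn year and month into real dates. A key function alone cannot do this: sort_values requires a by argument and calls key on one column at a time, and datetime.date also needs a day and a numeric month.
Build a helper date column and sort by it instead:
df10 = df10.assign(date=pd.to_datetime(df10['year'].astype(str) + '-' + df10['month'], format='%Y-%b')).sort_values('date')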
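Another option, not taken from the answers above, is to make the month column an ordered categorical so that sort_values follows calendar order directly. This is a minimal sketch on made-up numbers; the MONTHS list is hard-coded and assumes English three-letter abbreviations.

```python
import pandas as pd

# English month abbreviations in calendar order (hard-coded assumption).
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Made-up data shaped like the grouped frame in the question.
df10 = pd.DataFrame({'year': [2018, 2018, 2019, 2018],
                     'month': ['Apr', 'Jan', 'Feb', 'Dec'],
                     'Sales': [4.0, 1.0, 5.0, 3.0]})

# An ordered categorical sorts by category position, not alphabetically.
df10['month'] = pd.Categorical(df10['month'], categories=MONTHS, ordered=True)
df10 = df10.sort_values(['year', 'month']).reset_index(drop=True)
print(df10)
```

Plotting libraries generally respect the category order as well, but a real date column is usually the better x-axis when the data spans several years.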

Changing datetime format which includes weekday?

Currently I have datetime column in this format
Datetime
Thu Jun 18 23:04:19 +0000 2020
Thu Jun 18 23:04:18 +0000 2020
Thu Jun 18 23:04:14 +0000 2020
Thu Jun 18 23:04:13 +0000 2020
I want to change it to:
Datetime
2020-06-18 23:04:19
2020-06-18 23:04:18
2020-06-18 23:04:14
2020-06-18 23:04:13
Assuming you have already loaded your pandas dataframe, you can convert the Datetime column to the desired format with the function below. The function name is arbitrary.
import datetime
def modify_datetime(dtime):
    # Parse the original string, then write it back in the new format
    my_time = datetime.datetime.strptime(dtime, '%a %b %d %H:%M:%S %z %Y')
    return my_time.strftime('%Y-%m-%d %H:%M:%S')
The first argument to strptime is the date string and the second is the format.
Directive  Description
%a         Weekday abbreviated
%b         Month abbreviated name
%d         Day of the month
%H         Hour (24-hour format)
%M         Minute with zero padding
%S         Second with zero padding
%z         UTC offset
%Y         Full year
Once you have converted the date string to a datetime object, you can convert it back to a string in the desired format with strftime. The format codes are described in the Python documentation on strftime() and strptime() behavior.
Finally, just modify the Datetime column
df['Datetime'] = df['Datetime'].apply(modify_datetime)
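A quick self-contained check of that function on the sample values, as a minimal sketch. It assumes an English locale, since %a and %b are locale-dependent in strptime.

```python
import datetime

import pandas as pd


def modify_datetime(dtime):
    # Parse e.g. 'Thu Jun 18 23:04:19 +0000 2020' and reformat it.
    my_time = datetime.datetime.strptime(dtime, '%a %b %d %H:%M:%S %z %Y')
    return my_time.strftime('%Y-%m-%d %H:%M:%S')


df = pd.DataFrame({'Datetime': ['Thu Jun 18 23:04:19 +0000 2020',
                                'Thu Jun 18 23:04:18 +0000 2020']})
df['Datetime'] = df['Datetime'].apply(modify_datetime)
print(df)
```

Note that the result is a column of strings, not datetimes, and that the UTC offset is dropped without any conversion.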
You can use pandas.to_datetime and pandas.Series.dt.strftime appropriately:
>>> import pandas as pd
>>> from datetime import datetime
>>> datetime_strs = ["Thu Jun 18 23:04:19 +0000 2020", "Thu Jun 18 23:04:18 +0000 2020", "Thu Jun 18 23:04:14 +0000 2020", "Thu Jun 18 23:04:13 +0000 2020"]
>>> d = {'Datetimes': datetime_strs}
>>> df = pd.DataFrame(data=d)
>>> df
Datetimes
0 Thu Jun 18 23:04:19 +0000 2020
1 Thu Jun 18 23:04:18 +0000 2020
2 Thu Jun 18 23:04:14 +0000 2020
3 Thu Jun 18 23:04:13 +0000 2020
>>> df['Datetimes'] = pd.to_datetime(df['Datetimes'], format='%a %b %d %H:%M:%S %z %Y')
>>> df
Datetimes
0 2020-06-18 23:04:19+00:00
1 2020-06-18 23:04:18+00:00
2 2020-06-18 23:04:14+00:00
3 2020-06-18 23:04:13+00:00
>>> df['Datetimes'] = df['Datetimes'].dt.strftime('%Y-%m-%d %H:%M:%S')
>>> df
Datetimes
0 2020-06-18 23:04:19
1 2020-06-18 23:04:18
2 2020-06-18 23:04:14
3 2020-06-18 23:04:13

TypeError when selecting rows by DatetimeIndex?

I would like to select rows in a dataframe by retaining those sharing the same timestamps as that of another dataframe
In short, I want to select lines that should be the same based on timestamp values. In a 2nd step I intend to check that values in columns are indeed the same.
As an example, here are some lines of code to reproduce the type of input dataframes I am working with.
import pandas as pd
ts_list_5m_1 = [
'Sun Dec 22 2019 07:40:00 GMT-0100',
'Sun Dec 22 2019 07:45:00 GMT-0100',
'Sun Dec 22 2019 07:50:00 GMT-0100',
'Sun Dec 22 2019 07:55:00 GMT-0100']
ts_list_5m_2 = [
'Sun Dec 22 2019 07:50:00 GMT-0100',
'Sun Dec 22 2019 07:55:00 GMT-0100',
'Sun Dec 22 2019 08:00:00 GMT-0100',
'Sun Dec 22 2019 08:05:00 GMT-0100']
op_list_5m_1 = [7134.0, 7134.34, 7135.03, 7131.74]
op_list_5m_2 = [7135.03, 7131.74, 7234.50, 7334.88]
GC_5m_1 = pd.DataFrame(list(zip(ts_list_5m_1, op_list_5m_1)), columns =['Timestamp', 'Open'])
GC_5m_2 = pd.DataFrame(list(zip(ts_list_5m_2, op_list_5m_2)), columns =['Timestamp', 'Open'])
GC_5m_1['date'] = pd.to_datetime(GC_5m_1['Timestamp'], utc=True)
GC_5m_2['date'] = pd.to_datetime(GC_5m_2['Timestamp'], utc=True)
GC_5m_1.set_index('Timestamp', inplace = True, verify_integrity = True)
GC_5m_2.set_index('Timestamp', inplace = True, verify_integrity = True)
To get the list of shared timestamps, I use following line of code:
shared_TS = GC_5m_1.index.intersection(GC_5m_2.index)
Then I would like to get the data from both dataframes for these index values. I am using this line:
GC_5m_1.loc[shared_TS]
But this throws the following error:
raise TypeError("unhashable type: %r" % type(self).__name__)
TypeError: unhashable type: 'DatetimeIndex'
Please, do you have any idea how I can get this sub-dataframe properly?
I intend then to use it with the merge function (+indicator) to check that values in 'Open' column are indeed the same.
I thank you in advance for your help.
Have a good day,
Bests,
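A minimal sketch of the intended workflow (intersect the indexes, select the shared rows, then compare them with merge and indicator). To sidestep string parsing, it builds the timestamps directly with pd.date_range in UTC, which is an assumption and not the asker's exact data; the names sub_1, sub_2, and check are hypothetical. In recent pandas versions, .loc accepts a DatetimeIndex as a list of labels.

```python
import pandas as pd

# Hypothetical stand-ins for the two 5-minute frames, indexed by timestamp.
idx1 = pd.date_range('2019-12-22 07:40', periods=4, freq='5min', tz='UTC')
idx2 = pd.date_range('2019-12-22 07:50', periods=4, freq='5min', tz='UTC')
GC_5m_1 = pd.DataFrame({'Open': [7134.0, 7134.34, 7135.03, 7131.74]}, index=idx1)
GC_5m_2 = pd.DataFrame({'Open': [7135.03, 7131.74, 7234.50, 7334.88]}, index=idx2)

# Timestamps present in both frames.
shared_TS = GC_5m_1.index.intersection(GC_5m_2.index)

# Rows of each frame restricted to the shared timestamps.
sub_1 = GC_5m_1.loc[shared_TS]
sub_2 = GC_5m_2.loc[shared_TS]

# Outer merge on all common columns; '_merge' is 'both' where rows agree.
check = sub_1.reset_index().merge(sub_2.reset_index(), how='outer', indicator=True)
print(check)
```

Rows whose _merge value is left_only or right_only would indicate a timestamp whose Open value differs between the two frames.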

Convert a large list of timestamps in an excel file from one format to another with python [closed]

I have an excel file named "hello123.xlsx". There is a column of timestamps that has a lot of rows (more than 80,000 rows for now). The excel file basically looks like this:
Tue Mar 13 14:51:04 +0000 2018
Tue Mar 13 14:51:10 +0000 2018
Tue Mar 13 14:51:11 +0000 2018
Tue Mar 13 14:51:12 +0000 2018
Tue Mar 13 14:51:13 +0000 2018
Tue Mar 13 14:51:13 +0000 2018
Tue Mar 13 14:51:15 +0000 2018
Tue Mar 13 14:51:35 +0000 2018
Tue Mar 13 14:51:43 +0000 2018
Tue Mar 13 14:52:12 +0000 2018
And so on...
As can be seen above, the timestamps I have are in this format:
%a %b %d %H:%M:%S +0000 %Y.
But I need to convert it to the standard format like %m/%d/%Y %H:%M:%S.
So basically I have to read a column of numerous timestamps in an excel file and then to convert them to a new format in a new column.
I'm very new to Python. I searched for some methods online but could not get them to work. I hope someone can help me figure it out; I would really appreciate that. :)
You can accomplish this quite easily using the pandas library, in particular pandas.Series.dt.strftime.
You will also need to install openpyxl, which pandas uses to read and write .xlsx files (older pandas versions used xlrd for reading):
import pandas as pd
df = pd.read_excel('in.xlsx', header=None)
df[0] = pd.to_datetime(df[0]).dt.strftime("%m/%d/%Y %H:%M:%S")
df.to_excel('out.xlsx', index=False, header=False)
Sample run:
in.xlsx
Tue Mar 13 14:51:04 +0000 2018
Tue Mar 13 14:51:04 +0000 2018
Tue Mar 13 14:51:04 +0000 2018
Tue Mar 13 14:51:04 +0000 2018
out.xlsx
03/13/2018 14:51:04
03/13/2018 14:51:04
03/13/2018 14:51:04
03/13/2018 14:51:04
Use strptime() to convert the string into a datetime object and then use strftime() to format that object into the string you desire. See https://docs.python.org/3.6/library/datetime.html#strftime-and-strptime-behavior
Openpyxl can be used to read and write *.xlsx files.
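The strptime/strftime step can be sketched on its own, without any Excel I/O. The helper name convert and the constant names are hypothetical, and the sketch assumes an English locale for the weekday and month abbreviations.

```python
from datetime import datetime

# Input and output layouts taken from the question.
IN_FMT = '%a %b %d %H:%M:%S %z %Y'
OUT_FMT = '%m/%d/%Y %H:%M:%S'


def convert(ts):
    # Parse the original string, then render it in the target layout.
    return datetime.strptime(ts, IN_FMT).strftime(OUT_FMT)


rows = ['Tue Mar 13 14:51:04 +0000 2018',
        'Tue Mar 13 14:52:12 +0000 2018']
converted = [convert(ts) for ts in rows]
print(converted)
```

In a real script, the rows list would be filled from the worksheet cells and the converted values written back to a new column.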
Or you can transform the timestamp directly in Excel with a formula like this:
=DATE(RIGHT(A1,4),IF(MID(A1,5,3)="Jan",1,IF(MID(A1,5,3)="Feb",2,IF(MID(A1,5,3)="Mar",3,IF(MID(A1,5,3)="Apr",4,IF(MID(A1,5,3)="May",5,IF(MID(A1,5,3)="Jun",6,IF(MID(A1,5,3)="Jul",7,IF(MID(A1,5,3)="Aug",8,IF(MID(A1,5,3)="Sep",9,IF(MID(A1,5,3)="Oct",10,IF(MID(A1,5,3)="Nov",11,IF(MID(A1,5,3)="Dec",12,0)))))))))))),MID(A1,9,2)) + (MID(A1,12,8))
Here A1 holds the timestamp, e.g. Tue Mar 13 14:51:04 +0000 2018
