pandas: How to convert a string to datetime? - python

I have a CSV file in which I am trying to convert the date column to dd/mm/yyyy format using the code below, but I keep getting an "unconverted data remains" error.
A few sample records from the CSV file:
hotel,author,review,date,Overall,Value,Rooms,Location,Cleanliness,Check in/front desk,Service,Business,r9
in,everywhereman2 ,Old seattle getaway This was Old World Excellence at it's best.THIS is the place to stay at when visiting the historical area of Seattle. Your right on the water front near the ferry's and great sea food restrauntsand still with'in walking distance for great blues and jazz music. The staff for this hotel are excellentthey make you feel right at home. The breakfast was great.We did'nt have to travel far to have a good cup of JOE and a light meal to start our adventurous day off into one of the most beautifull city's in america. This hotel is in an area that makes it easy to get to any place you want to go and still find your way back I highly recomend this hotel for your next visit to seattle. ,Jan 6 2009 ,5,5,5,5,5,5,5,5,
in,RW53 ,Location! Location? view from room of nearby freeway ,Dec 26 2008 ,3,4,3,2,4,3,-1,-1,
Could you please help me find my mistake?
I am reading the file into a DataFrame named rss and then:
import time
for end_date in rss['date']:
    end_date = end_date.split(" ")
    end_date[-1] = end_date[-1][:4]
    end_date = " ".join(end_date)
    conv = time.strptime( end_date,"%b %d %Y" )
    time.strftime( "%d/%m/%Y", conv )
rss['date']
Thank you in advance.

I just tried your data and the following worked for me without having to do post-processing:
In [17]:
import pandas as pd
df = pd.read_csv(r'c:\data\out.csv', parse_dates=['date'])
df.dtypes
Out[17]:
hotel                          object
author                         object
review                         object
date                   datetime64[ns]
Overall                         int64
Value                           int64
Rooms                           int64
Location                        int64
Cleanliness                     int64
Check in/front desk             int64
Service                         int64
Business                        int64
r9                             object
dtype: object
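If the goal is a dd/mm/yyyy string rather than a datetime column, one option is to parse the column with an explicit format and then format it back to text. The sketch below is only illustrative: it assumes the dates look like the two sample records (including their trailing space) and uses a small hand-built frame named rss in place of the real CSV. The column name date_ddmmyyyy is made up for the example.

```python
import pandas as pd

# Stand-in for the real CSV: the two dates from the sample records,
# including their trailing space.
rss = pd.DataFrame({"date": ["Jan 6 2009 ", "Dec 26 2008 "]})

# Strip stray whitespace, then parse with an explicit format.
parsed = pd.to_datetime(rss["date"].str.strip(), format="%b %d %Y")

# Format as dd/mm/yyyy; the result is a column of strings, not datetimes.
rss["date_ddmmyyyy"] = parsed.dt.strftime("%d/%m/%Y")
print(rss)
```

Keeping the parsed datetime column around is usually more useful for sorting and filtering; the string version is best produced only for display or export.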

Related

How to retrieve a certain day of the month for each row based on a dataframe value?

I am trying to replace some hardcoded SQL queries related to timezone changes with a more dynamic/data-driven Python script. I have a dataset that looks like this spreadsheet below. WEEK_START/DAY/MONTH is the week, day, and month when daylight savings time begins (for example Canberra starts the first Sunday of April while Vienna is the last Sunday of March). The end variables are in the same format and display when it ends.
Dataset
Here is the issue. I have seen solutions for specific use cases such as this one, which finds the last Sunday of the month:
import calendar
from datetime import date
today = date.today()
current_year = today.year
current_month = today.month
current_day = today.day
month = calendar.monthcalendar(current_year, current_month)
day_of_month = max(month[-1][calendar.SUNDAY], month[-2][calendar.SUNDAY])
print(day_of_month)
31
This tells me that the last Sunday of this month is the 31st. I can adjust the attributes for one given month or scenario, but how would I make a column that retrieves the date for each and every row (city)? That is, for several cities that change times on different dates? I thought that if I could set the attributes of day_of_month in an apply function it would work, but when I try something like weekday='SUNDAY' it returns an error because, of course, the string 'SUNDAY' is not the same as SUNDAY, the attribute of calendar.
My SQL queries are grouped by cities that change on the same day, but ideally anyone would be able to edit the CSV that loads the above dataset as needed, and then the script would run once each day to see whether today is between the start and end of daylight saving time. We might have new cities to add in the future. I'm confident about that part but quite lost on how to retrieve the dates for a given year.
My alternate, less resilient option is to look at the distinct list of potential dates (last Sunday of March, first Sunday of April, etc.), write code to retrieve each one upfront (as in the snippet above), and assign the dates that way. I say this is less resilient because if a city is added that does not fit into an existing group for time changes, the code would need to be altered as well.
So stackoverflow, is there a way to do this in a data driven way in pandas through an apply or something similar? Thanks in advance.
Basically, I think you have most of what you need. Just map the WEEK_START / WEEK_END column values {-1, 1} to the last or first matching weekday of the month, put it all in a function, and apply it to each row. For example:
import calendar
import pandas as pd
def get_date(year: int, month: int, dayname: str, first=-1) -> pd.Timestamp:
    """
    Get the first or last weekday "dayname" in the given month and year.
    Returns the last one by default.
    """
    weeks = calendar.monthcalendar(year, month)
    weekday = getattr(calendar, dayname.upper())  # e.g. 'Sunday' -> calendar.SUNDAY
    if first == 1:
        # the first week holds 0 when that weekday belongs to the previous month;
        # in that case the first occurrence is in the second week
        day = weeks[0][weekday] or weeks[1][weekday]
    else:
        day = max(weeks[-1][weekday], weeks[-2][weekday])
    return pd.Timestamp(year, month, day)
year = 2021  # we need a year...
df['date_start'] = df.apply(lambda row: get_date(year,
                                                 row['MONTH_START'],
                                                 row['DAY_START'],
                                                 row['WEEK_START']),  # selects first or last
                            axis=1)  # to each row
df['date_end'] = df.apply(lambda row: get_date(year,
                                               row['MONTH_END'],
                                               row['DAY_END'],
                                               row['WEEK_END']),
                          axis=1)
giving you for the sample data
df[['CITY', 'date_start', 'date_end']]
               CITY date_start   date_end
0          Canberra 2021-04-04 2021-10-03
1         Melbourne 2021-04-04 2021-10-03
2            Sydney 2021-04-04 2021-10-03
3         Kitzbuhel 2021-03-28 2021-10-31
4            Vienna 2021-03-28 2021-10-31
5           Antwerp 2021-03-28 2021-10-31
6          Brussels 2021-03-28 2021-10-31
7  Louvain-la-Neuve 2021-03-28 2021-10-31
Once you start working with time zones and DST transitions, the question "Is there a way to infer in Python if a date is the actual day in which the DST (Daylight Saving Time) change is made?" might also be interesting to you.
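The daily check mentioned in the question (is today between the start and end dates?) could then be a simple comparison on the two computed columns. This is a minimal sketch under stated assumptions: the frame below is a hypothetical two-row stand-in that reuses the date_start and date_end values from the output above, the helper name in_window is made up, and the window is treated as including the start date and excluding the end date.

```python
import pandas as pd

# Hypothetical stand-in for the frame with the computed columns.
df = pd.DataFrame({
    "CITY": ["Canberra", "Vienna"],
    "date_start": pd.to_datetime(["2021-04-04", "2021-03-28"]),
    "date_end": pd.to_datetime(["2021-10-03", "2021-10-31"]),
})

def in_window(frame, day):
    """Return a boolean Series: True where date_start <= day < date_end."""
    day = pd.Timestamp(day).normalize()  # drop any time-of-day component
    return (frame["date_start"] <= day) & (day < frame["date_end"])

# In a daily job this would be pd.Timestamp.today(); a fixed date keeps the example stable.
df["in_window"] = in_window(df, "2021-10-15")
print(df[["CITY", "in_window"]])
```

Because the comparison only reads the two date columns, adding a new city to the CSV needs no code change.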

filter pytrends data and also get data and country columns

I'm pulling data from Google Trends and running into some issues.
Code:
import pandas as pd
from pytrends.request import TrensReq
pytrends=TrendReq()
kw_list= ['Solar power','Starlink']
df1=pytrends.build(kw_list,timeframe='today 100-w',geo='US','UK')
df1=pytrends.interest_by_region(),pytrends.interest_over_time()
df1.to_excel(r'e:\google trends\putout.xlsx')
I want data for two regions, US and UK, but it is not working.
I also want data for the past 100 weeks from today's date. I searched for the syntax for looking back a number of weeks but found nothing helpful.
Also, if I use "pytrends.interest_by_region(),pytrends.interest_over_time()", I get data like:
      solar power  Starlink
date
But the country column is not included. I have used pytrends.interest_by_region(), but it does not show up in my dataframe.
Expected output:
                    solar power  Starlink
country date
US      2021-05-01            5         4
UK      2021-05-01            4         5
...and so on. Let me know how to get both country and date in the dataset, and finally export it to a CSV or Excel file.
Try this code; it gives the result in the required format:
import pandas as pd
from pytrends.request import TrendReq
kw_list = ['Solar power', 'Starlink']
l = []
for i in ['US', 'GB']:
    # one request per country, since geo takes a single country code
    pytrends = TrendReq()
    pytrends.build_payload(kw_list, timeframe='today 3-m', geo=i)
    df = pytrends.interest_over_time()
    df['country'] = i  # tag each row with its country
    l.append(df.reset_index())
df1 = pd.concat(l)
df1.to_excel(r'e:\google trends\putout.xlsx')
I made the following changes to your code:
For timeframe, pytrends provides only a few options, and "today 100-w" is not one of them.
Current Time Minus Time Pattern:
By Month: 'today #-m' where # is the number of months from that date to pull data for; only works for 1, 2, 3 months
Daily: 'now #-d' where # is the number of days from that date to pull data for; only works for 1, 7 days
Hourly: 'now #-H' where # is the number of hours from that date to pull data for; only works for 1, 4 hours
For specific dates, I suggest using 'YYYY-MM-DD YYYY-MM-DD', for example '2016-12-14 2017-01-25'.
You cannot pass a list to the geo parameter; it should be a country-code string. For the United Kingdom the code is 'GB'. (Refer to this link: Country json. It gives the JSON of all countries supported by Google Trends and their respective codes.)
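Since "today 100-w" is not accepted, the 100-week window can be expressed with the explicit 'YYYY-MM-DD YYYY-MM-DD' form instead. The helper below is a small sketch using only the standard library. The function name timeframe_for_past_weeks is made up for illustration, and whether Google Trends returns weekly or monthly points for a window of this length is not something the sketch controls.

```python
from datetime import date, timedelta

def timeframe_for_past_weeks(weeks, today=None):
    """Build a 'YYYY-MM-DD YYYY-MM-DD' string covering the last `weeks` weeks."""
    end = today or date.today()
    start = end - timedelta(weeks=weeks)
    return f"{start:%Y-%m-%d} {end:%Y-%m-%d}"

# A fixed "today" keeps the example reproducible.
print(timeframe_for_past_weeks(100, today=date(2021, 5, 1)))
```

The returned string can be passed as the timeframe argument of build_payload in place of 'today 3-m'.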

Including more than one variable in ax.set_title?

I have this title, where the variable option indicates the regional level.
option=3
year=2017
ax.set_title('Household Gas Consumption, Water Heating,NUTS %.i'%option,fontdict={'fontsize':'10','fontweight':'3'})
How could I also add the variable year so the title looks like this?
Household Gas Consumption, Water Heating, NUTS 3, Year: 2017
I tried different ways but I couldn't make it work. Thanks in advance.
Pass both variables as a tuple after the % operator:
'Household Gas Consumption, Water Heating, NUTS %i, Year: %i' % (option, year)
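An f-string is an alternative way to interpolate several variables and may read more easily. The snippet below only builds the title string, so it runs without matplotlib; the variable name title is introduced for the example.

```python
option = 3
year = 2017

# f-strings (Python 3.6+) interpolate any number of variables inline.
title = f"Household Gas Consumption, Water Heating, NUTS {option}, Year: {year}"
print(title)
```

The resulting string can be passed as the first argument of ax.set_title, together with the same fontdict as in the question.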

how to set time variable in python, seasonal

I have been learning Python for a few days now and I am having a hard time setting a variable as a time variable. I would be grateful if anyone could help me.
The variable is of the type: pandas.core.series.Series
And looks like the following:
2018S1;
2017S2;
2017S1
The idea is for Python to recognize this as time data so that I can plot it and use it in regressions. I have searched the forum and the internet but did not find any similar problem.
Kind regards
It looks like your data consists of years and seasons. For plotting purposes you could use a date (using typical year, month, day) in the middle of the season.
There is a post where someone determined seasons from a date, which might give you some ideas: "Determine season given timestamp in Python using datetime".
For pandas periods, have a look at the Period section of the pandas time series documentation.
If the last number means month, use pd.Period(pp,freq='M')
If the last number means quarter, use pd.Period(pp,freq='Q')
The following workaround generates a pandas Series which you can use for regressions and more:
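import numpy as np
import pandas as pd
A = np.array(['2018S1', '2017S2', '2017S1'])
periods = []
for a in A:
    yr = a[0:4]  # year part, e.g. '2018'
    ss = a[-1]   # trailing number, e.g. '1'
    # quarter reading: build '2018Q1' so that 'S2' maps to Q2;
    # for the month reading use pp = yr + '-' + ss with freq='M' instead
    pp = yr + 'Q' + ss
    periods.append(pd.Period(pp, freq='Q'))
ts = pd.Series(np.random.randn(3), periods)
ts
In the case of quarters we get:
2018Q1    0.531245
2017Q2   -0.126469
2017Q1    0.250046
Freq: Q-DEC, dtype: float64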
In the case of month we get:
2018-01 0.098571
2017-02 1.407439
2017-01 -0.406087
Freq: M, dtype: float64

Bi-monthly salary between interval of two dates

I'm trying to program a salary calculator that tells you what your salary is during sick leave. In Costa Rica, where I live, salaries are paid bi-monthly (on the 15th and 30th of each month), and for each sick day you get paid 80% of your salary. So, the program asks what your monthly salary is and then asks for the start date and end date of your sick leave. Finally, it's meant to print out what you got paid on each payday during your sick leave. This is what I have so far:
from datetime import datetime, timedelta
salario = float(input("What is your monthly salary? "))
fecha1 = datetime.strptime(input('Start date of sick leave m/d/y: '), '%m/%d/%Y')
fecha2 = datetime.strptime(input('End date of sick leave m/d/y: '), '%m/%d/%Y')
diasinc = (fecha2 - fecha1).days
print("Number of days in sick leave: ")
print(diasinc)
def daterange(fecha1, fecha2):
    for n in range((fecha2 - fecha1).days):
        yield fecha1 + timedelta(n)
for single_date in daterange(fecha1, fecha2):
    print(single_date.strftime("%Y-%m-%d"))  # This prints out each individual day between those dates.
I know for the salary I just multiply it by .8 to get 80% but how do I get the program to print it out for each pay day?
Thank you in advance.
Here's an old answer to a similar question from about eight years ago: "python count days ignoring weekends".
Read up on the Python datetime module and adjust Dave Webb's generator expression to count each time the date falls on the 15th or the 30th. Here's another example for counting the number of occurrences of Friday the 13th in any month.
There are fancier ways to shortcut this calculation using modulo arithmetic. But they won't matter unless you're processing millions of these at a time on lower powered hardware and for date ranges spanning months at a time. There may even be a module somewhere that does this sort of thing, more efficiently, for you. But it might be hard for you to validate (test for correctness) as well as being hard to find.
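The day-by-day counting described above could look roughly like the sketch below. It is not a statement of how Costa Rican payroll actually works and rests on several labeled assumptions: the daily rate is the monthly salary divided by 30, the second payday falls on the last day of the month when the month has fewer than 30 days, and both the start and end dates count as sick days. The function names are made up for the example.

```python
import calendar
from datetime import date, timedelta

def is_payday(day):
    # Assumption: paydays are the 15th and the 30th, or the last day
    # of the month when the month is shorter than 30 days.
    last = calendar.monthrange(day.year, day.month)[1]
    return day.day in (15, min(30, last))

def sick_pay_by_payday(monthly_salary, start, end, rate=0.8):
    """Return ({payday: sick pay accrued since the previous payday}, unpaid days)."""
    daily = monthly_salary / 30  # assumption: 30-day month for the daily rate
    payouts = {}
    accrued_days = 0
    day = start
    while day <= end:  # inclusive of the end date
        accrued_days += 1
        if is_payday(day):
            payouts[day] = round(accrued_days * daily * rate, 2)
            accrued_days = 0
        day += timedelta(days=1)
    # accrued_days now holds sick days after the last payday in the range
    return payouts, accrued_days

payouts, pending = sick_pay_by_payday(3000, date(2021, 3, 10), date(2021, 4, 2))
for payday, amount in payouts.items():
    print(payday.isoformat(), amount)
print("sick days not yet paid:", pending)
```

Sick days that fall after the last payday in the range are returned separately so they can be carried over to the next pay period.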
Note that one approach which might be better in the long run would be to use Python's sqlite3 module, which is included in the standard library of your Python distribution. Use it to generate a reference table of all dates over a broad range (from the founding of your organization until a century from now). You can add a column to that table to mark all paydays and use SQL to query the table and select the dates WHERE payday==True AND date BETWEEN .... etc.
There's an example of how to do this in "SQLite: Get all dates between dates".
That approach invests some minor coding effort and some storage space into a reference table which can be used efficiently for the foreseeable future.
