I am trying to scrape the weather forecast from "https://weather.gc.ca/city/pages/ab-52_metric_e.html". With the code below I am able to get the table containing the data, but I'm stuck. During the day, the second row contains Today's forecast and the third row contains Tonight's forecast. At the end of the day, the second row becomes Tonight's forecast and Today's forecast is dropped. What I want to do is parse the table to get the forecast for Today, Tonight, and each continuing day even if Today's forecast is missing; something like this:
Today: A mix of sun and cloud. 60 percent chance of showers this afternoon with risk of a thunderstorm. Widespread smoke. High 26. UV index 6 or high.
Tonight: Partly cloudy. Becoming clear this evening. Increasing cloudiness before morning. Widespread smoke. Low 13.
Friday: Mainly cloudy. Widespread smoke. Wind becoming southwest 30 km/h gusting to 50 in the afternoon. High 24.
#using Beautiful Soup 3, Python 2.6
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://weather.gc.ca/city/pages/ab-52_metric_e.html")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup(pageHtml)
data = soup.find("div", {"id": "mainContent"})
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})
You could do something like iterating over each row in the table and getting its text. An example would be:
forecast = data.find('table', {'class': "table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"}).find_all("tr")
for tr in forecast[1:]:
    print " ".join(tr.text.split())
With this approach you get the contents of each line (excluding the first one, which is a header).
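The snippet above prints each row as one undifferentiated line. To get the "Today: ..." / "Tonight: ..." output from the question even when the Today row is missing, you can read the period label from the first cell of each row instead of assuming fixed row positions. Here is a minimal sketch using BeautifulSoup 4 and requests on Python 3 (the question uses BeautifulSoup 3 on Python 2.6, which is long unsupported; the class name comes from the question, whether the label sits in a th or td cell is an assumption, and the page layout may have changed):
import requests
from bs4 import BeautifulSoup

url = "https://weather.gc.ca/city/pages/ab-52_metric_e.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find("table", {"class": "table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})

for tr in table.find_all("tr")[1:]:          # skip the header row
    cells = tr.find_all(["th", "td"])        # period label may be a th or a td
    if len(cells) >= 2:
        period = cells[0].get_text(strip=True)        # "Today", "Tonight", "Friday", ...
        text = " ".join(cells[1].get_text().split())  # collapse whitespace
        print("%s: %s" % (period, text))
Because the period name is taken from the row itself, a day where the "Today" row has been dropped simply starts at "Tonight" with no special-casing.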
I would like to scrape news from Yahoo Finance for a currency pair. How does bs4's find() or find_all() work for this example, with this link? I'm trying to extract the data ... but no data is scraped. Why? What's wrong? I'm using this, but nothing is printed (except the tickers):
html = BeautifulSoup(source_s, "html.parser")
news_table_s = html.find_all("div",{"class":"Py(14px) Pos(r)"})
news_tables_s[ticker_s] = news_table_s
print("news_tables", news_tables_s)
I would like to extract the headlines from a Yahoo Finance web page.
You have to iterate your ResultSet to get anything out.
for e in html.find_all("div", {"class": "Py(14px) Pos(r)"}):
    print(e.h3.text)
Recommendation: do not use dynamic classes to select elements; prefer more static ids or the HTML structure, here selected via a CSS selector:
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Example
from bs4 import BeautifulSoup
import requests
url='https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX'
html = BeautifulSoup(requests.get(url).text, "html.parser")
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Output
EUR/USD steadies, but bears sharpen claws as dollar feasts on Fed bets
EUR/USD Weekly Forecast – Euro Gives Up Early Gains for the Week
EUR/USD Forecast – Euro Plunges Yet Again on Friday
EUR/USD Forecast – Euro Gives Up Early Gains
EUR/USD Forecast – Euro Continues to Test the Same Indicator
Dollar gains as inflation remains sticky; sterling retreats
Siemens Issues Blockchain Based Euro-Denominated Bond on Polygon Blockchain
EUR/USD Forecast – Euro Rallies
FOREX-Dollar slips with inflation in focus; euro, sterling up on jobs data
FOREX-Jobs figures send euro, sterling higher; dollar slips before CPI
So for argument's sake, here is an example of auto_arima for daily data:
auto_arima(df['orders'],seasonal=True,m=7)
Now in that example, after running a seasonal decomposition that has shown weekly seasonality, I "think" you select 7 for m? Is this correct, since the seasonality is shown to be weekly?
My first question is as follows: if seasonality is monthly, do you use 12? If it is annual, do you use 1? And is there ever a reason to select 365 for daily?
Secondly, if the data you are given is already weekly, e.g.:
date weekly tot
2021/01/01 - 10,000
2021/01/07 - 15,000
2021/01/14 - 9,000
and so on......
And after you do the seasonal decomposition, would m=1 be used for weekly, m=4 for monthly, and m=52 for annual seasonality?
Finally, if it's monthly, like so:
date monthly tot
2020/01/01- 10,000
2020/02/01- 15,000
2020/03/01- 9,000
and so on......
And after you do the seasonal decomposition, would m=1 be used for monthly and m=12 for annual seasonality?
Any help would be greatly appreciated, I just want to be able to confidently select the right criteria.
A season is a recurring pattern in your data, and m is the length of that season. m in that case is not a code or anything, but simply the length:
Imagine the weather: if you had the weekly average temperature, it will rise in the summer and fall in the winter. Since the length of one "season" is a year, or 52 weeks, you set m to 52.
If you had a repeating pattern every quarter, then m would be 13, since a quarter is equal to 13 weeks. It always depends on your data and your use case.
To your questions:
If seasonality is Monthly do you use 12?
If the pattern you are looking for repeats every 12 months, yes; if it repeats every 3 months, it would be 3, and so on.
If it is Annually do you use 1?
A seasonality of 1 does not really make sense, since it would mean that you have a repeating pattern in every single data point.
And is there ever a reason to select 365 for daily?
If your data is daily and the pattern repeats every 365 days (meaning every year) then yes (you need to remember that every fourth year has 366 days though).
I hope you get the concept behind seasonality and m so you can answer the rest.
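If it helps, here is a minimal pmdarima sketch on a synthetic daily series whose pattern repeats every 7 observations (the series is made up purely for illustration):
import numpy as np
import pandas as pd
from pmdarima import auto_arima

# Synthetic daily series with a weekly pattern: the season is 7 observations long
idx = pd.date_range("2021-01-01", periods=140, freq="D")
y = pd.Series(10 + 3 * np.sin(2 * np.pi * np.arange(140) / 7)
              + np.random.randn(140), index=idx)

# Season length is 7 observations, so m=7; for monthly data with an
# annual pattern you would pass m=12 instead.
model = auto_arima(y, seasonal=True, m=7, suppress_warnings=True)
print(model.summary())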
I'm pulling data from Google Trends and am encountering some issues.
Code:
import pandas as pd
from pytrends.request import TrendReq
pytrends=TrendReq()
kw_list= ['Solar power','Starlink']
df1=pytrends.build(kw_list,timeframe='today 100-w',geo='US','UK')
df1=pytrends.interest_by_region(),pytrends.interest_over_time()
df1.to_excel(r'e:\google trends\putout.xlsx')
I want data for 2 regions, US and UK, but it is not working. Also, I want data for the past 100 weeks from today's date. I checked on Google to see what the syntax is for looking at past weeks, but no help.
Also, if I use "pytrends.interest_by_region(),pytrends.interest_over_time()", I get data like:
            solar power  Starlink
date
But the country column is not included. I have used pytrends.interest_by_region(), but it is not coming in my dataframe.
Expected output:
                    solar power  Starlink
country date
US      2021-05-01            5         4
UK      2021-05-01            4         5
...and so on. Let me know how to get both country and date in the dataset, and finally export it to a CSV or Excel file.
Check this code; it will give the result in the required format:
import pandas as pd
from pytrends.request import TrendReq

kw_list = ['Solar power', 'Starlink']
l = []
for i in ['US', 'GB']:
    pytrends = TrendReq()
    pytrends.build_payload(kw_list, timeframe='today 3-m', geo=i)
    df = pytrends.interest_over_time()
    df['country'] = i
    l.append(df.reset_index())
df1 = pd.concat(l)
df1.to_excel(r'e:\google trends\putout.xlsx')
Following are the changes I made to your current code:
For the timeframe, pytrends provides only a few options, and "today 100-w" is not one of them:
Current Time Minus Time Pattern:
By Month: 'today #-m', where # is the number of months from that date to pull data for; only works for 1, 2, 3 months
Daily: 'now #-d', where # is the number of days from that date to pull data for; only works for 1, 7 days
Hourly: 'now #-H', where # is the number of hours from that date to pull data for; only works for 1, 4 hours
For specific dates, I suggest using 'YYYY-MM-DD YYYY-MM-DD', for example '2016-12-14 2017-01-25'.
You cannot provide a list to the geo parameter; it should be a string with a single country code. For the United Kingdom the code is 'GB' (refer to this link: Country json, which gives the JSON of all Google Trends supported countries and their respective codes).
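Since 'today 100-w' is not supported, a workaround for the past-100-weeks requirement is to compute an explicit 'YYYY-MM-DD YYYY-MM-DD' range yourself (this relies on the specific-dates format mentioned above; a sketch):
import datetime
from pytrends.request import TrendReq

# Build an explicit date range covering the past 100 weeks
end = datetime.date.today()
start = end - datetime.timedelta(weeks=100)
timeframe = '{:%Y-%m-%d} {:%Y-%m-%d}'.format(start, end)

pytrends = TrendReq()
pytrends.build_payload(['Solar power', 'Starlink'], timeframe=timeframe, geo='US')
print(pytrends.interest_over_time().head())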
I'm trying to program a salary calculator that tells you what your salary is during sick leave. In Costa Rica, where I live, salaries are paid bi-monthly (the 15th and 30th of each month), and for each sick day you get paid 80% of your salary. So, the program asks you what your monthly salary is and then asks you for the start date and finish date of your sick leave. Finally, it's meant to print out what you got paid on each payday during your sick leave. This is what I have so far:
from datetime import datetime, timedelta

salario = float(input("What is your monthly salary? "))
fecha1 = datetime.strptime(input('Start date of sick leave m/d/y: '), '%m/%d/%Y')
fecha2 = datetime.strptime(input('End date of sick leave m/d/y: '), '%m/%d/%Y')
diasinc = (fecha2 - fecha1).days
print("Number of days in sick leave: ")
print(diasinc)

def daterange(fecha1, fecha2):
    for n in range(int((fecha2 - fecha1).days)):
        yield fecha1 + timedelta(n)

for single_date in daterange(fecha1, fecha2):
    print(single_date.strftime("%Y-%m-%d"))  # This prints out each individual day between those dates.
I know that for the salary I just multiply it by .8 to get 80%, but how do I get the program to print it out for each payday?
Thank you in advance.
Here's an old answer to a similar question from about eight years ago: python count days ignoring weekends ...
... read up on the Python: datetime module and adjust Dave Webb's generator expression to count each time the date is on the 15th or the 30th. Here's another example for counting the number of occurrences of Friday on the 13th of any month.
There are fancier ways to shortcut this calculation using modulo arithmetic. But they won't matter unless you're processing millions of these at a time on lower powered hardware and for date ranges spanning months at a time. There may even be a module somewhere that does this sort of thing, more efficiently, for you. But it might be hard for you to validate (test for correctness) as well as being hard to find.
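For illustration, here is a minimal sketch of that generator approach applied to the original question. The 1/30 daily rate, the 80% factor, and paying out on the 15th and 30th are my assumptions read from the post, not official rules, and February's missing 30th is ignored:
from datetime import datetime, timedelta

def daterange(start, end):
    for n in range((end - start).days):
        yield start + timedelta(n)

def sick_pay(monthly_salary, start, end):
    daily = monthly_salary / 30 * 0.8      # 80% of an assumed 1/30 daily rate
    accumulated = 0.0
    for day in daterange(start, end):
        accumulated += daily
        if day.day in (15, 30):            # payday: print and reset
            print("{:%Y-%m-%d}: paid {:.2f}".format(day, accumulated))
            accumulated = 0.0

# Hypothetical example: a 900000 monthly salary, leave from 6/10 to 7/20
sick_pay(900000.0,
         datetime.strptime("6/10/2021", "%m/%d/%Y"),
         datetime.strptime("7/20/2021", "%m/%d/%Y"))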
Note that one approach which might be better in the long run would be to use Python: SQLite3 which should be included with the standard libraries of your Python distribution. Use that to generate a reference table of all dates over a broad range (from the founding of your organization until a century from now). You can add a column to that table to note all paydays and use SQL to query that table and select the dates WHERE payday==True AND date BETWEEN .... etc.
There's an example of how to SQLite: Get all dates between dates.
That approach invests some minor coding effort and some storage space into a reference table which can be used efficiently for the foreseeable future.
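And a minimal in-memory sketch of the SQLite idea, using a recursive CTE to generate the dates on the fly rather than a stored reference table (the leave dates are hypothetical):
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute("""
    WITH RECURSIVE dates(d) AS (
        SELECT '2021-06-10'
        UNION ALL
        SELECT date(d, '+1 day') FROM dates WHERE d < '2021-07-20'
    )
    SELECT d FROM dates
    WHERE CAST(strftime('%d', d) AS INTEGER) IN (15, 30)
""").fetchall()
print([r[0] for r in rows])   # the paydays inside the leave period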
I am trying to determine if it's day or night based on a list of timestamps. Will it be correct if I just check whether the hour is between 7:00 AM and 6:00 PM to classify it as "day", otherwise "night", like I have done in the code below? I am not sure of this because sometimes it's day even after 6 PM, so what's the accurate way to differentiate between day and night using Python?
Sample data (timezone = UTC/Zulu time):
timestamps = ['2015-03-25 21:15:00', '2015-06-27 18:24:00', '2015-06-27 18:22:00', '2015-06-27 18:21:00', '2015-07-07 07:53:00']
Code:
import datetime

for timestamp in timestamps:
    time = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    hr, mi = (time.hour, time.minute)
    if hr >= 7 and hr < 18:
        print("daylight")
    else:
        print("evening or night")
Sample output:
evening or night
evening or night
evening or night
evening or night
daylight
You could use pyephem for this task. It's a "Python package for performing high-precision astronomy computations."
You could set the desired location and get the sun's altitude. There are multiple definitions of night, depending on whether it's for civil (-6°), nautical (-12°), or astronomical (-18°) purposes. Just pick a threshold: if the sun is below it, it's nighttime!
#encoding: utf8
import ephem
import math
import datetime
sun = ephem.Sun()
observer = ephem.Observer()
# ↓ Define your coordinates here ↓
observer.lat, observer.lon, observer.elevation = '48.730302', '9.149483', 400
# ↓ Set the time (UTC) here ↓
observer.date = datetime.datetime.utcnow()
sun.compute(observer)
current_sun_alt = sun.alt
print(current_sun_alt*180/math.pi)
# -16.8798870431°
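Continuing the snippet above, classifying day versus night is then a single comparison against your chosen threshold (civil twilight, -6°, in this sketch):
# Civil-twilight threshold: more than 6 degrees below the horizon counts as night
threshold = math.radians(-6)
print("daylight" if sun.alt > threshold else "evening or night")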
As a workaround, there is a free API that provides adhan (prayer) times for Muslims, and it includes exact sunrise and sunset times. However, you still need location coordinates to obtain the data. It is free at the moment.
Unfortunately, Python's timestamp alone cannot determine whether it's day or night, partly because that depends on where you are located and how exactly you define day and night. I'm afraid you will have to get auxiliary data for that.
You need to know both latitude and longitude. In fact, if a place is in a deep valley, the sunrise there will be later and the sunset earlier. You can pay for this service if you need to obtain it many times per day, or simply scrape pages like the one at https://www.timeanddate.com/worldclock/uk/london.