I'm trying to scrape a webpage with hourly energy prices, which I want to use for home automation: if the hourly price <= the baseload price, certain devices should turn on via MQTT.
I managed to get the baseload price and the hourly prices from their column. The output from the column seems to be not one list but 24 separate lists. Is that correct? How can I fix this so that each hourly price can be compared with the baseload price?
import datetime
import pytz
import requests
from bs4 import BeautifulSoup as bs
today_utc = pytz.utc.localize(datetime.datetime.utcnow())
today = today_utc.astimezone(pytz.timezone("Europe/Amsterdam"))
text_today = today.strftime("%y-%m-%d")
print(today)
print(text_today)
yesterday = datetime.datetime.now(tz=pytz.timezone("Europe/Amsterdam")) - datetime.timedelta(1)
text_yesterday = yesterday.strftime("%y-%m-%d")
print(yesterday)
print(text_yesterday)
url_part1 = 'https://www.epexspot.com/en/market-data?market_area=NL&trading_date='
url_part2 = '&delivery_date='
url_part3 = '&underlying_year=&modality=Auction&sub_modality=DayAhead&technology=&product=60&data_mode=table&period=&production_period='
url_text = url_part1+text_yesterday+url_part2+text_today+url_part3
print(url_text)
html_text = requests.get(url_text).text
#print(html_text)
soup = bs(html_text,'lxml')
#print(soup.prettify())
baseload = soup.find_all('div', class_='flex day-1')
for baseload_price in baseload:
    baseload_price = baseload_price.find('span').text.replace(' ', '')
    print(baseload_price)
table = soup.find_all('tr', {'class': "child"})
#print(table)
for columns in table:
    column3 = columns.find_all('td')[3:]
    #print(columns)
    column3_text = [td.text.strip() for td in column3]
    print(column3_text)
In the for columns in table loop, you are creating a new list column3_text on every iteration, which is why you see 24 separate lists. If you intend column3_text to be a single list of the next 24 hourly prices, you can replace that loop with this:
column3_text = [column.find_all("td")[3].text.strip() for column in table]
Additionally, if you are going to compare the baseload price with the hourly prices, you'll want to convert the strings to floats or Decimals. :)
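For instance, with made-up price strings standing in for the scraped values, the conversion and comparison could look like this (a sketch, not part of the original scraper):

```python
# hypothetical scraped strings, for illustration only
prices = ['124.47', '119.09', '145.42']
baseload_price = float('144.32')

hourly = [float(p) for p in prices]  # convert strings to floats
cheap = [p for p in hourly if p <= baseload_price]
print(cheap)  # [124.47, 119.09]
```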
You simply need to use join:
column3_text = "".join([td.text.strip() for td in column3])
If you want to compare the values, use pandas. Here's how:
import datetime
import urllib.parse
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0",
}
today = datetime.datetime.today().strftime("%Y-%m-%d")
yesterday = (
    datetime.datetime.today() - datetime.timedelta(days=1)
).strftime("%Y-%m-%d")
url = "https://www.epexspot.com/en/market-data?"
data = {
    "market_area": "NL",
    "trading_date": yesterday,
    "delivery_date": today,
    "underlying_year": "",
    "modality": "Auction",
    "sub_modality": "DayAhead",
    "technology": "",
    "product": "60",
    "data_mode": "table",
    "period": "",
    "production_period": "",
}
query_url = f"{url}{urllib.parse.urlencode(data)}"
with requests.Session() as s:
    s.headers.update(headers)
    response = s.get(query_url).text
baseload = (
    BeautifulSoup(response, "html.parser")
    .select_one(".day-1 > span:nth-child(1)")
    .text
)
print(f"Baseload: {baseload}")
df = pd.concat(pd.read_html(response, flavor="lxml"), ignore_index=True)
df.columns = range(df.shape[1])
df = df.drop(df.columns[[4, 5, 6, 7]], axis=1)
df['is_higher'] = df[3] >= float(baseload)
df['price_diff'] = df[3] - float(baseload)
df = df.set_axis(
    [
        "buy_volume",
        "sell_volume",
        "volume",
        "price",
        "is_higher",
        "price_diff",
    ],
    axis=1,
    copy=False,
)
df.insert(
    0,
    "hours",
    [
        f"0{value}:00 - {value + 1}:00" if value < 10
        else f"{value}:00 - {value + 1}:00"
        for value in range(0, 24)
    ],
)
print(df)
Output:
Baseload: 144.32
hours buy_volume sell_volume ... price is_higher price_diff
0 00:00 - 1:00 2052.2 3608.7 ... 124.47 False -19.85
1 01:00 - 2:00 2467.8 3408.9 ... 119.09 False -25.23
2 02:00 - 3:00 2536.8 3220.5 ... 116.32 False -28.00
3 03:00 - 4:00 2552.0 3206.5 ... 114.60 False -29.72
4 04:00 - 5:00 2524.4 3010.0 ... 115.07 False -29.25
5 05:00 - 6:00 2542.4 3342.7 ... 123.54 False -20.78
6 06:00 - 7:00 2891.2 3872.2 ... 145.42 True 1.10
7 07:00 - 8:00 3413.2 3811.0 ... 166.40 True 22.08
8 08:00 - 9:00 3399.4 3566.0 ... 168.00 True 23.68
9 09:00 - 10:00 2919.3 3159.4 ... 153.30 True 8.98
10 10:00 - 11:00 2680.2 3611.5 ... 143.35 False -0.97
11 11:00 - 12:00 2646.8 3722.3 ... 141.95 False -2.37
12 12:00 - 13:00 2606.4 3723.3 ... 141.96 False -2.36
13 13:00 - 14:00 2559.7 3232.3 ... 145.96 True 1.64
14 14:00 - 15:00 2544.9 3261.2 ... 155.00 True 10.68
15 15:00 - 16:00 2661.7 3428.0 ... 169.15 True 24.83
16 16:00 - 17:00 3072.2 3529.4 ... 173.36 True 29.04
17 17:00 - 18:00 3593.7 3091.4 ... 192.00 True 47.68
18 18:00 - 19:00 3169.0 3255.4 ... 182.86 True 38.54
19 19:00 - 20:00 2710.1 3630.3 ... 167.96 True 23.64
20 20:00 - 21:00 2896.3 3728.8 ... 147.17 True 2.85
21 21:00 - 22:00 3160.3 3639.2 ... 136.78 False -7.54
22 22:00 - 23:00 3506.2 3196.3 ... 119.90 False -24.42
23 23:00 - 24:00 3343.8 3414.1 ... 100.00 False -44.32
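To tie this back to the original goal of switching devices when the hourly price is at or below baseload, a boolean filter on the price column does the job; a minimal sketch using a hypothetical subset of the table above:

```python
import pandas as pd

# hypothetical subset of the scraped table, for illustration only
df = pd.DataFrame({
    "hours": ["00:00 - 1:00", "06:00 - 7:00", "10:00 - 11:00"],
    "price": [124.47, 145.42, 143.35],
})
baseload = 144.32

# hours where the price is at or below the baseload price
cheap_hours = df[df["price"] <= baseload]
print(cheap_hours["hours"].tolist())  # ['00:00 - 1:00', '10:00 - 11:00']
```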
I have a script that gets data from a dataframe, uses those data to make a request to a website, finds the exact href with the fuzzywuzzy module, and then runs a function to scrape odds. I would like to speed this script up with the multiprocessing module. Is that possible?
Date HomeTeam AwayTeam
0 Monday 6 December 2021 20:00 Everton Arsenal
1 Monday 6 December 2021 17:30 Empoli Udinese
2 Monday 6 December 2021 19:45 Cagliari Torino
3 Monday 6 December 2021 20:00 Getafe Athletic Bilbao
4 Monday 6 December 2021 15:00 Real Zaragoza Eibar
5 Monday 6 December 2021 17:15 Cartagena Tenerife
6 Monday 6 December 2021 20:00 Girona Leganes
7 Monday 6 December 2021 19:45 Niort Toulouse
8 Monday 6 December 2021 19:00 Jong Ajax FC Emmen
9 Monday 6 December 2021 19:00 Jong AZ Excelsior
Script
df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam
matches_odds = list()
for i, (a, b, c) in enumerate(zip(dates, hometeams, awayteams)):
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            table = get_odds(href_id)
            matches_odds.append(table)
    print(i, ' of ', len(dates))
PS: The monthToNum function just replaces the month name with its number.
First, turn your loop body into a function that takes i, a, b and c as input. Then create a multiprocessing.Pool and submit that function with the proper arguments to the pool. Note that worker processes do not share memory with the parent, so the function should return its results and let pool.map collect them, rather than appending to a global list.
import multiprocessing

df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam

def fetch(data):
    i, (a, b, c) = data
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    tables = []
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            tables.append(get_odds(href_id))
    print(i, ' of ', len(dates))
    return tables

if __name__ == '__main__':
    num_processes = 20
    with multiprocessing.Pool(num_processes) as pool:
        results = pool.map(fetch, enumerate(zip(dates, hometeams, awayteams)))
    # flatten the per-match lists returned by the workers
    matches_odds = [table for tables in results for table in tables]
Besides, multiprocessing is not the only way to improve the speed: asynchronous programming can be used as well, and is probably a better fit for this I/O-bound scenario, although multiprocessing does the job too. The Python multiprocessing documentation covers the details of the pool pattern.
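As a rough sketch of the asynchronous alternative (the fetch coroutine here is only a stand-in; a real version would await an HTTP request with a library such as aiohttp):

```python
import asyncio

async def fetch(i, home, away):
    # placeholder for an async HTTP request plus parsing;
    # with aiohttp you would await session.get(url) here instead
    await asyncio.sleep(0)  # yield control, as a real request would
    return f"{home} - {away}"

async def main():
    matches = [("Everton", "Arsenal"), ("Empoli", "Udinese")]
    # schedule all requests concurrently instead of one by one
    return await asyncio.gather(
        *(fetch(i, h, a) for i, (h, a) in enumerate(matches))
    )

results = asyncio.run(main())
print(results)  # ['Everton - Arsenal', 'Empoli - Udinese']
```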
I have a file which looks like this:
London XXX Europe 2020 9 7 0 0 0 2 2020 9 7 0 11 35 2 57
Tanger XXX Africa 2020 9 7 0 29 54 2 2020 9 7 23 57 16 2 29
Doha XXX Asia 2020 9 7 0 57 23 2 2020 9 7 23 58 48 2 11
I'm trying to combine indexes 3, 4, 5, 6, 7, 8 into a datetime object with Year, Month, Day, Hour, Minute, Second, and to do the same for the end time. However, the zeros in my file seem to produce some weird output.
This is my code:
import csv
import datetime

path = r'c:\data\EK\Desktop\Python Microsoft Visual Studio\Extra\test_datetime.txt'
with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter='\t')
    for row in reader:
        start_time = row[3] + row[4] + row[5] + row[6] + row[7] + row[8]
        end_time = row[10] + row[11] + row[12] + row[13] + row[14] + row[15]
        start_time = datetime.datetime.strptime(start_time, "%Y%m%d%H%M%S")
        end_time = datetime.datetime.strptime(end_time, "%Y%m%d%H%M%S")
        print(start_time)
        print(end_time)
This is my current output:
2020-09-07 00:00:00
2020-09-07 01:13:05
2020-09-07 02:09:54
2020-09-07 23:57:16
2020-09-07 05:07:23
2020-09-07 23:58:48
This is my expected output:
2020-09-07 00:00:00
2020-09-07 00:11:35
2020-09-07 00:29:54
2020-09-07 23:57:16
2020-09-07 00:57:23
2020-09-07 23:58:48
The problem is that when you concatenate the fields like row[3] + row[4] + row[5] + row[6] + row[7] + row[8] all the single-digit fields have no leading zeroes, so they aren't parsed properly with strptime().
You could use a string formatting function to add leading zeroes, but there's no reason to use strptime() in the first place. Just call datetime.datetime() to create an object directly from the values.
start_time = datetime.datetime(*map(int, row[3:9]))
end_time = datetime.datetime(*map(int, row[10:16]))
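For example, using the fields from the second row above (the row list here is reconstructed for illustration):

```python
import datetime

# fields as csv.reader would deliver them: no leading zeros
row = ['Tanger', 'XXX', 'Africa', '2020', '9', '7', '0', '29', '54']

# ''.join(row[3:9]) gives '20209702954', which "%Y%m%d%H%M%S" misparses;
# the constructor takes plain integers directly, so padding never matters:
start_time = datetime.datetime(*map(int, row[3:9]))
print(start_time)  # 2020-09-07 00:29:54
```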
For example, if we type 0010 it should display 12:10 AM, and so on for every number.
military_times = ['0010', '0125', '1159', '1200', '1201',
                  '1259', '1344', '1959', '2359']

for item in military_times:
    # time is between 00:00 and 00:59
    if int(item[0:2]) == 0:
        t = '12:' + item[2:4] + ' AM'
    # time is between 1:00 am and 9:59 am
    elif int(item[0:2]) <= 9:
        t = item[1:2] + ':' + item[2:4] + ' AM'
    # time is between 10:00 am and 11:59 am
    elif int(item[0:2]) <= 11:
        t = item[0:2] + ':' + item[2:4] + ' AM'
    # time is between 12:00 pm and 12:59 pm
    elif int(item[0:2]) == 12:
        t = item[0:2] + ':' + item[2:4] + ' PM'
    # time is between 1:00 pm and 11:59 pm
    else:
        t = str(int(item[0:2]) - 12) + ':' + item[2:4] + ' PM'
    print(t)
'''
# output
12:10 AM
1:25 AM
11:59 AM
12:00 PM
12:01 PM
12:59 PM
1:44 PM
7:59 PM
11:59 PM
'''
def to_12_hour(ts):
    hr = int(ts[:2])
    mins = ts[2:]
    suffix = 'AM' if hr < 12 else 'PM'
    hr = hr % 12 or 12  # 0 -> 12 AM, 13 -> 1 PM, 12 stays 12
    return "{0}:{1} {2}".format(hr, mins, suffix)
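An alternative sketch that leans on the standard library's own parsing, which handles midnight and noon for free (the lstrip drops the leading zero that %I pads in):

```python
from datetime import datetime

def military_to_12_hour(ts):
    # parse "HHMM", then format in 12-hour notation with AM/PM
    return datetime.strptime(ts, '%H%M').strftime('%I:%M %p').lstrip('0')

print(military_to_12_hour('0010'))  # 12:10 AM
print(military_to_12_hour('1344'))  # 1:44 PM
```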
I have a csv file as shown below:
19/04/2015 00:00 180 187 85 162 608 61
19/04/2015 01:00 202 20 26 70 171 61
19/04/2015 02:00 20 40 40 11 40 810
19/04/2015 03:00 20 80 81 24 0 86
19/04/2015 04:00 25 30 70 91 07 50
19/04/2015 05:00 80 611 691 70 790 37
19/04/2015 06:00 199 69 706 70 790 171
19/04/2015 07:00 80 81 90 192 57 254
19/04/2015 08:00 40 152 454 259 52 151
Each row is stored as a single cell in the file.
I'm trying to make it look like this:
19/04/2015 00:00 180
19/04/2015 00:10 187
19/04/2015 00:20 85
19/04/2015 00:30 162
19/04/2015 00:40 608
19/04/2015 00:50 61
19/04/2015 01:00 202
etc..
Explanation:
The first listing shows a date dd/MM/yyyy HH:mm followed by 6 values, one value per 10 minutes.
In the second presentation, I want each value paired with its own exact timestamp, down to the minute.
Here is what I've tried so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys, getopt
import tarfile
import re
import pandas as pd
import tempfile
import shutil
import collections
import urllib
import numpy
import logging
import csv
csvFile = "testfile.csv"
data = []
minutes = ['00:00','10:00','20:00','30:00','40:00','50:00']
with open(csvFile, 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        row[0] = re.sub("\s+", ";", row[0].strip())
        rowlist = row[0].split(';')
        while len(rowlist) < 8:
            rowlist.append(0)
        for i in range(len(rowlist)):
            for m in minutes:
                data.append(rowlist[0] + rowlist[1] + m)
                data.append(rowlist[i])
df = pd.DataFrame(data)
df.to_csv('example.csv')
But this code didn't give me the desired result.
Any suggestions?
Okay, I'm going to be explaining a lot in this one.
I highly recommend using datetime objects whenever you deal with dates, because that's exactly what they are for. Convert the strings into datetime objects and you can do lots and lots of manipulation.
Here is complete, working code. I'll explain all of the concepts in depth!
Input:
19/04/2015 00:00, 180 , 187 , 85 , 162 , 608 , 61
19/04/2015 01:00, 202 , 20 , 26 , 70 , 171 , 61
19/04/2015 02:00, 20 , 40 , 40 , 11 , 40 , 810
The code:
import csv
from datetime import datetime,timedelta
list_of_list = []
with open("old_file.csv", "r+") as my_csv:
    for line in my_csv:
        line = line.strip().replace(" ", '').split(',')
        list_of_list.append(line)

for item in list_of_list:
    dt = datetime.strptime(item[0], '%d/%m/%Y%H:%M')
    item[0] = dt

fin_list = []
for item in list_of_list:
    temp_list = [item[0] + timedelta(minutes=10 * i) for i, x in enumerate(item)]
    my_list = [list(a) for a in zip(temp_list, item[1:])]
    fin_list.extend(my_list)

for item in fin_list:
    item[0] = datetime.strftime(item[0], "%d/%m/%Y %H:%M")
print(fin_list)

with open("new_file.csv", "w+") as my_csv:
    csvWriter = csv.writer(my_csv, delimiter=' ', quotechar=" ")
    csvWriter.writerows(fin_list)
output:
19/04/2015 00:00 180
19/04/2015 00:10 187
19/04/2015 00:20 85
19/04/2015 00:30 162
19/04/2015 00:40 608
19/04/2015 00:50 61
19/04/2015 01:00 202
19/04/2015 01:10 20
19/04/2015 01:20 26
19/04/2015 01:30 70
19/04/2015 01:40 171
19/04/2015 01:50 61
19/04/2015 02:00 20
19/04/2015 02:10 40
19/04/2015 02:20 40
19/04/2015 02:30 11
19/04/2015 02:40 40
19/04/2015 02:50 810
1) See, I'm taking each row and turning it into a list, stripping and replacing all the whitespace, \n and \r:
line = line.strip().replace(" ",'').split(',')
list_of_list.append(line)
output after this:
['19/04/201500:00', '180', '187', '85', '162', '608']
2) What does dt = datetime.strptime(item[0],'%d/%m/%Y%H:%M') do? strptime from datetime takes a string and converts it into a datetime object, which you can manipulate easily.
Example:
>>> datetime.strptime('19/04/201500:00','%d/%m/%Y%H:%M')
>>> datetime.datetime(2015, 4, 19, 0, 0)
>>> datetime.strptime('19/04/2015 00:00','%d/%m/%Y %H:%M') #notice how this is different from above!
>>> datetime.datetime(2015, 4, 19, 0, 0)
>>> datetime.strptime('Apr 19 2015 12:00','%b %d %Y %H:%M')
>>> datetime.datetime(2015, 4, 19, 12, 0)
Can you see how it transformed? Once you change it into a datetime object you can easily add minutes, days, hours, months, anything you want!
But to add them you need a timedelta object. Think of it like this: just as you add an integer to an integer, you add a timedelta to a datetime.
[item[0]+timedelta(minutes=10*(i)) for i,x in enumerate(item)]
You might be thinking: hey, what the hell is this? enumerate of an iterable (list, string, tuple, etc.) gives two things, i and element, where i runs 0, 1, 2, 3, ... up to the last index of the iterable (here, a list). So the first i, x is 0, item[0], the next i, x is 1, item[1], and so on.
So the list comprehension just adds 0, 10, 20, 30, 40, ... minutes to the successive datetime objects.
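A stripped-down illustration of that comprehension, using one hypothetical row:

```python
from datetime import datetime, timedelta

base = datetime(2015, 4, 19, 0, 0)
row = [base, '180', '187', '85', '162', '608', '61']

# one timestamp per enumerated element: 0, 10, 20, ... minutes later
stamps = [base + timedelta(minutes=10 * i) for i, x in enumerate(row)]
print(stamps[1])  # 2015-04-19 00:10:00
```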
Each item would be the below,
[datetime.datetime(2015, 4, 19, 0, 0), '180']
And finally after extend you get this:
[[datetime.datetime(2015, 4, 19, 0, 0), '180'],
[datetime.datetime(2015, 4, 19, 0, 10), '187'],
[datetime.datetime(2015, 4, 19, 0, 20), '85'],
[datetime.datetime(2015, 4, 19, 0, 30), '162'],
[datetime.datetime(2015, 4, 19, 0, 40), '608'],
[datetime.datetime(2015, 4, 19, 0, 50), '61']]
How beautiful is that?
Now again convert the datetime objects to string using this,
item[0] = datetime.strftime(item[0],"%d/%m/%Y %H:%M")
So strftime converts it back into the desired format! And lastly, write the rows to the new csv file using csv.writer.
Note: This would print dates along with quotes by default!. Which you didn't want in your output so use quotechar = " " to remove them.
This should work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
in_name = 'test.csv'
out_name = 'sample.csv'
with open(in_name, 'r') as infile, open(out_name, 'w') as out_file:
    for line in infile:
        parts = line.split()
        date, time, data = parts[0], parts[1], parts[2:]
        hours, _ = time.split(':')
        for minutes, value in zip(range(0, 60, 10), data):
            out_file.write('{date} {hours}:{minutes:02d} {value:>5}\n'.format(
                date=date, hours=hours, minutes=minutes, value=value
            ))
You also had a lot of unused imports, which were unnecessary and just clutter the code.