My current project is scraping weather data from websites for a calculation. Part of this calculation involves different logic depending on whether the current time is before or after noon.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
# Arkansas State Plant Board Weather Web data
url1 = "http://170.94.200.136/weather/Inversion.aspx"
response1 = requests.get(url1)
soup1 = BeautifulSoup(response1.content, "html.parser")
table1 = soup1.find("table", id="MainContent_GridView1")
data1 = pd.read_html(str(table1),header=0)[0]
data1.columns = ['Station', 'Low Temp (F)', 'Time of Low', 'Current Temp (F)', 'Current Time', 'Wind Speed (MPH)', 'Wind Dir', 'High Temp (F)', 'Time Of High']
print(url1)
print(data1[0:4])
array1 = np.array(data1[0:4])
This is my code to bring in the data I need. However, I don't know how to take the current time, which I receive as a Unicode string, and check whether it is before or after noon. Can anyone help me with this?
Edit: some data from the current request
Station Low Temp (F) Time of Low Current Temp (F) Current Time \
0 Arkansas 69.0 5:19 AM 88.7 2:09 PM
1 Ashley 70.4 4:39 AM 91.2 2:14 PM
2 Bradley 69.4 4:09 AM 90.6 2:14 PM
3 Chicot -40.2 2:14 PM -40.2 2:14 PM
Wind Speed (MPH) Wind Dir High Temp (F) Time Of High
0 4.3 213 88.9 2:04 PM
1 4.1 172 91.2 2:14 PM
2 6.0 203 90.6 2:09 PM
3 2.2 201 -40.1 12:24 AM
Just check whether the meridian is AM or PM.
time = "2:09 PM"
meridian = time.split(' ')[-1] # get just the meridian
before_noon = meridian == 'AM'
after_noon = meridian == 'PM'
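The same check can be vectorized over the scraped column with pandas string methods; a minimal sketch, assuming a frame shaped like data1 with a 'Current Time' column of strings:

```python
import pandas as pd

# Hypothetical frame mirroring the scraped 'Current Time' column
df = pd.DataFrame({'Current Time': ['5:19 AM', '2:09 PM', '11:59 AM']})

# Split each string and grab the last token (the meridian) per row
meridian = df['Current Time'].str.split().str[-1]
df['before_noon'] = meridian == 'AM'
print(df['before_noon'].tolist())  # [True, False, True]
```

Note that "12:00 PM" (noon exactly) counts as after noon under this check, which is usually what you want.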
You can do it like this:
t = pd.to_datetime(data1['Current Time'].iloc[0])
noon = pd.to_datetime("12:00 PM")
if t < noon:
    print("yes")
else:
    print("no")
>>> no
t
>>> Timestamp('2016-07-11 14:04:00')
noon
>>> Timestamp('2016-07-11 12:00:00')
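The comparison can also be vectorized over the whole column; a sketch assuming the strings look like '2:09 PM'. Passing an explicit format pins both sides to the same default date, so only the clock time matters:

```python
import pandas as pd

times = pd.Series(['5:19 AM', '2:09 PM', '12:00 PM'])

# With an explicit format, both sides get the same default date,
# so the comparison reduces to the time of day
parsed = pd.to_datetime(times, format='%I:%M %p')
noon = pd.to_datetime('12:00 PM', format='%I:%M %p')
print((parsed < noon).tolist())  # [True, False, False]
```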
I am working with a bicycle dataset. I want to replace the text values in the 'weather' column with the numbers 1 to 4. The column has dtype object. I tried all of the following approaches, but none of them works.
There is another column called 'season'. If I apply the same code to 'season', it works fine. Please help.
Sample data:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 Clear + Few clouds 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 Clear + Few clouds 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 Light Snow, Light Rain 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 Mist + Cloudy 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 Clear + Few clouds 25.42 31.060 43 23.9994
I tried the following. None of it worked on 'weather', but the same code works fine on the 'season' column.
test["weather"] = np.where(test["weather"] == "Clear + Few clouds", 1,
    np.where(test["weather"] == "Mist + Cloudy", 2,
        np.where(test["weather"] == "Light Snow, Light Rain", 3,
            np.where(test["weather"] == "Heavy Rain + Thunderstorm", 4, 0))))
PE_weather = [
(train['weather'] == ' Clear + Few clouds '),
(train['weather'] =='Mist + Cloudy') ,
(train['weather'] >= 'Light Snow, Light Rain'),
(train['weather'] >= 'Heavy Rain + Thunderstorm')]
PE_weather_value = ['1', '2', '3','4']
train['Weather'] = np.select(PE_weather, PE_weather_value)
test.loc[test.weather =='Clear + Few clouds', 'weather']='1'
I suggest you make a dictionary to look up the corresponding values and then apply a lookup to the weather column.
weather_lookup = {
'Clear + Few clouds': 1,
'Mist + Cloudy': 2,
'Light Snow, Light Rain': 3,
'Heavy Rain + Thunderstorm': 4
}
def lookup(w):
    return weather_lookup.get(w, 0)
test['weather'] = test['weather'].apply(lookup)
Output:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 1 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 1 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 3 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 2 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 1 25.42 31.060 43 23.9994
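Series.map can do the same lookup without a helper function; a minimal sketch, where fillna(0) mirrors the lookup helper's default of 0:

```python
import pandas as pd

weather_lookup = {
    'Clear + Few clouds': 1,
    'Mist + Cloudy': 2,
    'Light Snow, Light Rain': 3,
    'Heavy Rain + Thunderstorm': 4,
}

# Hypothetical frame standing in for the real test data
test = pd.DataFrame({'weather': ['Clear + Few clouds', 'Mist + Cloudy', 'Unknown']})

# map() yields NaN for keys missing from the dict; fill with the default 0
test['weather'] = test['weather'].map(weather_lookup).fillna(0).astype(int)
print(test['weather'].tolist())  # [1, 2, 0]
```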
I have a script that gets data from a dataframe, uses that data to request a website, finds the exact href with the fuzzywuzzy module, and then runs a function to scrape odds. I would like to speed this script up with the multiprocessing module; is that possible?
Date HomeTeam AwayTeam
0 Monday 6 December 2021 20:00 Everton Arsenal
1 Monday 6 December 2021 17:30 Empoli Udinese
2 Monday 6 December 2021 19:45 Cagliari Torino
3 Monday 6 December 2021 20:00 Getafe Athletic Bilbao
4 Monday 6 December 2021 15:00 Real Zaragoza Eibar
5 Monday 6 December 2021 17:15 Cartagena Tenerife
6 Monday 6 December 2021 20:00 Girona Leganes
7 Monday 6 December 2021 19:45 Niort Toulouse
8 Monday 6 December 2021 19:00 Jong Ajax FC Emmen
9 Monday 6 December 2021 19:00 Jong AZ Excelsior
Script
df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam
matches_odds = list()
for i, (a, b, c) in enumerate(zip(dates, hometeams, awayteams)):
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            table = get_odds(href_id)
            matches_odds.append(table)
    print(i, ' of ', len(dates))
PS: The monthToNum function just converts the month name to its number.
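(The original monthToNum isn't shown; one plausible implementation, built from the standard library's calendar month names, could look like this:)

```python
import calendar

# Hypothetical monthToNum: the question only says it maps a month name
# to its number, so this is one possible implementation.
# calendar.month_name[0] is '', so skip the empty entry.
MONTHS = {name: num for num, name in enumerate(calendar.month_name) if name}

def monthToNum(name):
    return MONTHS[name]

print(monthToNum('December'))  # 12
```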
First, turn the loop body into a function taking i, a, b and c. Then create a multiprocessing.Pool and map that function over the (i, a, b, c) tuples. Note that worker processes don't share the parent's memory, so the function should return its results rather than append to a global list.
import multiprocessing

df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam

def fetch(data):
    i, (a, b, c) = data
    # Worker processes can't append to a list in the parent process,
    # so collect results locally and return them
    matches_odds = []
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            matches_odds.append(get_odds(href_id))
    print(i, ' of ', len(dates))
    return matches_odds

if __name__ == '__main__':
    num_processes = 20
    with multiprocessing.Pool(num_processes) as pool:
        results = pool.map(fetch, enumerate(zip(dates, hometeams, awayteams)))
    # Flatten the per-worker result lists into one list of odds tables
    matches_odds = [table for tables in results for table in tables]
Besides, multiprocessing is not the only way to improve the speed. Asynchronous programming can be used as well and is probably a better fit for this I/O-bound scenario, although multiprocessing does the job too; just wanted to mention that.
If you read the Python multiprocessing documentation carefully, the pattern above will be obvious.
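Since the work here is I/O-bound (the script spends most of its time waiting on HTTP responses), threads are a lighter-weight alternative to processes. A minimal sketch with concurrent.futures, using a dummy fetch in place of the real scraper:

```python
from concurrent.futures import ThreadPoolExecutor

# Dummy stand-in for the real fetch(); threads suit I/O-bound work
# because they wait on the network rather than the CPU
def fetch(item):
    i, (a, b, c) = item
    return f'{b} - {c}'  # pretend this is the scraped odds table

# Hypothetical rows mirroring enumerate(zip(dates, hometeams, awayteams))
rows = list(enumerate(zip(['d1', 'd2'], ['Everton', 'Empoli'], ['Arsenal', 'Udinese'])))

with ThreadPoolExecutor(max_workers=20) as pool:
    matches_odds = list(pool.map(fetch, rows))

print(matches_odds)  # ['Everton - Arsenal', 'Empoli - Udinese']
```

Unlike multiprocessing, threads share memory, so no `if __name__ == '__main__'` guard or result flattening is required.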
I'm trying to retrieve financial information from reuters.com, in particular the long-term growth rates of companies. The element I want to scrape doesn't appear on all pages; in my example it is missing for the ticker 'AMCR'. All scraped info should be appended to a list.
I've already figured out how to handle the case where the element doesn't exist, but instead of "NaN" being appended at the position where the value should be, it is appended as the last element rather than in its proper place.
import requests
from bs4 import BeautifulSoup
LTGRMean = []
tickers = ['MMM','AES','LLY','LOW','PWR','TSCO','YUM','ICE','FB','AAPL','AMCR','FLS','GOOGL','FB','MSFT']
Ticker LTGRMean
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.07
9 AAPL 12.00
10 AMCR 19.04
11 FLS 16.14
12 GOOGL 19.07
13 FB 14.80
14 MSFT NaN
My placeholder text "not existing" isn't appearing.
Instead, for AMCR, where Reuters doesn't provide any information, the growth rate of FLS (19.04) is filled in. As a result, everything is shifted up one index from where NaN should appear next to AMCR.
The DataFrame stack() function pivots the columns into rows at level 1, which produces the ticker/value pairs shown below.
import requests
from bs4 import BeautifulSoup
import pandas as pd
LTGRMean = []
tickers = ['MMM', 'AES', 'LLY', 'LOW', 'PWR', 'TSCO', 'YUM', 'ICE', 'FB', 'AAPL', 'AMCR', 'FLS', 'GOOGL', 'FB', 'MSFT']
for i in tickers:
    Test = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + i)
    ReutSoup = BeautifulSoup(Test.content, 'html.parser')
    td = ReutSoup.find('td', string="LT Growth Rate (%)")
    my_dict = {}
    # validate that the td object is not None
    if td is not None:
        result = td.findNext('td').findNext('td').text
    else:
        result = "NaN"
    my_dict[i] = result
    LTGRMean.append(my_dict)
df = pd.DataFrame(LTGRMean)
print(df.stack())
Output:
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.90
9 AAPL 12.00
10 AMCR NaN
11 FLS 19.04
12 GOOGL 16.14
13 FB 19.90
14 MSFT 14.80
dtype: object
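An alternative worth noting: collecting plain (ticker, value) tuples instead of one-key dicts yields a two-column frame directly, with no stack() step; a minimal sketch with a few hypothetical values:

```python
import pandas as pd

# (ticker, value) pairs as they might come out of the scraping loop
rows = [('MMM', '3.70'), ('AMCR', 'NaN'), ('MSFT', '14.80')]

# Two named columns, one row per ticker, no reshaping required
df = pd.DataFrame(rows, columns=['Ticker', 'LTGRMean'])
print(df)
```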
I am learning to scrape data from websites with Python, extracting weather information about San Francisco from this page. I get stuck when combining the data into a pandas DataFrame. Is it possible to create a DataFrame where the rows have different lengths?
I have already tried two approaches based on answers here, but they are not exactly what I am looking for. Both answers shift the values of the temps column up. Here is a screenshot of what I am trying to explain.
1st way: https://stackoverflow.com/a/40442094/10179259
2nd way: https://stackoverflow.com/a/19736406/10179259
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
periods=[pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
short_descs=[sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
temps=[t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d['alt'] for d in seven_day.select('.tombstone-container img')]
#print(len(periods), len(short_descs), len(temps), len(descs))
weather = pd.DataFrame({
"period": periods, #length is 9
"short_desc": short_descs, #length is 9
"temp": temps, #problem here length is 8
#"desc":descs #length is 9
})
print(weather)
I expect the first row of the temp column to be NaN. Thank you.
You can loop over each forecast_items value and use iter with next to select the first match; if no match exists, NaN is assigned to the dictionary instead:
import numpy as np

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
out = []
for x in forecast_items:
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period': periods, 'short_desc': short_descs, 'temp': temps, 'descs': descs})
weather = pd.DataFrame(out)
print (weather)
descs period \
0 NOW until4:00pm Sat
1 Today: Showers, with thunderstorms also possib... Today
2 Tonight: Showers likely and possibly a thunder... Tonight
3 Sunday: A chance of showers before 11am, then ... Sunday
4 Sunday Night: Rain before 11pm, then a chance ... SundayNight
5 Monday: A 40 percent chance of showers. Cloud... Monday
6 Monday Night: A 30 percent chance of showers. ... MondayNight
7 Tuesday: A 50 percent chance of rain. Cloudy,... Tuesday
8 Tuesday Night: Rain. Cloudy, with a low aroun... TuesdayNight
short_desc temp
0 Wind Advisory NaN
1 Showers andBreezy High: 56 °F
2 ShowersLikely Low: 49 °F
3 Heavy Rainand Windy High: 56 °F
4 Heavy Rainand Breezythen ChanceShowers Low: 52 °F
5 ChanceShowers High: 58 °F
6 ChanceShowers Low: 53 °F
7 Chance Rain High: 59 °F
8 Rain Low: 53 °F
I have a little problem parsing a date in Python.
This is the date I have to parse:
Sun Sep 15, 2013 12:10pm EDT
And that is the code I'm using to parse it:
datetime.strptime(date, "%a %b %d, %Y %I:%M%p %Z")
Everything is fine except the time-zone parsing, which always raises a ValueError. I've also tried pytz, but without any success.
So how can I parse this kind of date in Python?
Using dateutil:
import dateutil.parser
import pytz
tz_str = '''-12 Y
-11 X NUT SST
-10 W CKT HAST HST TAHT TKT
-9 V AKST GAMT GIT HADT HNY
-8 U AKDT CIST HAY HNP PST PT
-7 T HAP HNR MST PDT
-6 S CST EAST GALT HAR HNC MDT
-5 R CDT COT EASST ECT EST ET HAC HNE PET
-4 Q AST BOT CLT COST EDT FKT GYT HAE HNA PYT
-3 P ADT ART BRT CLST FKST GFT HAA PMST PYST SRT UYT WGT
-2 O BRST FNT PMDT UYST WGST
-1 N AZOT CVT EGT
0 Z EGST GMT UTC WET WT
1 A CET DFT WAT WEDT WEST
2 B CAT CEDT CEST EET SAST WAST
3 C EAT EEDT EEST IDT MSK
4 D AMT AZT GET GST KUYT MSD MUT RET SAMT SCT
5 E AMST AQTT AZST HMT MAWT MVT PKT TFT TJT TMT UZT YEKT
6 F ALMT BIOT BTT IOT KGT NOVT OMST YEKST
7 G CXT DAVT HOVT ICT KRAT NOVST OMSST THA WIB
8 H ACT AWST BDT BNT CAST HKT IRKT KRAST MYT PHT SGT ULAT WITA WST
9 I AWDT IRKST JST KST PWT TLT WDT WIT YAKT
10 K AEST ChST PGT VLAT YAKST YAPT
11 L AEDT LHDT MAGT NCT PONT SBT VLAST VUT
12 M ANAST ANAT FJT GILT MAGST MHT NZST PETST PETT TVT WFT
13 FJST NZDT
11.5 NFT
10.5 ACDT LHST
9.5 ACST
6.5 CCT MMT
5.75 NPT
5.5 SLT
4.5 AFT IRDT
3.5 IRST
-2.5 HAT NDT
-3.5 HNT NST NT
-4.5 HLV VET
-9.5 MART MIT'''
tzd = {}
for tz_descr in map(str.split, tz_str.split('\n')):
    tz_offset = int(float(tz_descr[0]) * 3600)
    for tz_code in tz_descr[1:]:
        tzd[tz_code] = tz_offset
date = 'Sun Sep 15, 2013 12:10pm EDT'
dateutil.parser.parse(date, tzinfos=tzd) # => datetime.datetime(2013, 9, 15, 12, 10, tzinfo=tzoffset(u'EDT', -14400))
tzd generation code comes from this answer.
UPDATE
NOTE: The list of time zone abbreviations is not accurate, as Matt Johnson commented. See his answer.
You can't. Not reliably anyway. Time zone abbreviations are ambiguous and contradictory. There are no standards.
For example "CST" has 5 distinctly different meanings.
(UTC-06:00) Central Standard Time (America)
(UTC-05:00) Cuba Standard Time
(UTC+08:00) China Standard Time
(UTC+09:30) Central Standard Time (Australia)
(UTC+10:30) Central Summer Time (Australia)
See this list for additional examples.
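With Python 3.9+'s zoneinfo module, full IANA zone names resolve unambiguously; a small sketch showing two zones that both report the abbreviation "CST" yet sit 14 hours apart:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

dt = datetime(2023, 1, 15, 12, 0)  # mid-January, i.e. standard (winter) time
chicago = dt.replace(tzinfo=ZoneInfo('America/Chicago'))
shanghai = dt.replace(tzinfo=ZoneInfo('Asia/Shanghai'))

# Both zones abbreviate to "CST", but their UTC offsets differ by 14 hours
print(chicago.tzname(), chicago.utcoffset())    # "CST" at UTC-06:00
print(shanghai.tzname(), shanghai.utcoffset())  # "CST" at UTC+08:00
```

This is why parsing should rely on full zone identifiers (or explicit offsets) rather than abbreviations.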