Airline Price Scraping with Python

I've been trying to create Python code to scrape airline prices from JFK to LAX.
The URL for the prices I want to scrape is here: https://www.google.com/flights/#search;f=JFK;t=LAX;d=2014-05-28;r=2014-06-01;tt=o
Ideally I'd like to get a list of the airline, the time of departure, and the price.
I know that
'<div class="GHOFUQ5BGJC">$210</div>'
corresponds to the price and
'<div class="GHOFUQ5BMFC">Sun Country</div>'
corresponds to the airline.
So far, this is what I have
import re
import urllib
html = "https://www.google.com/flights/#search;f=JFK;t=LAX;d=2014-05-28;r=2014-06-01;tt=o"
htmlfile = urllib.urlopen(html)
htmltext = htmlfile.read()
re1 = '<div class="GHOFUQ5BGJC">(.+?)</div>'
pattern1 = re.compile(re1)
price = re.findall(pattern1, htmltext)
re2 ='<div class="GHOFUQ5BMFC">(.+?)</div>'
pattern2 = re.compile(re2)
airline = re.findall(pattern2, htmltext)
print price
print airline
Is there a way to access the price and airline tags through Beautiful Soup, or am I on the right track with the regex?
When run, the code just gives me two empty lists.
What am I doing wrong?
Thanks
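A note on why the lists come back empty: everything after the # in that URL is a fragment, which is never sent to the server, and Google Flights renders its results with JavaScript, so the HTML that urllib downloads contains neither of those divs. No regex or parser will find them there. If you did have static HTML containing those divs (saved from the browser, say), a minimal Beautiful Soup sketch would look like this, reusing htmltext from the code above:

from bs4 import BeautifulSoup

# htmltext is assumed to hold HTML that actually contains the rendered divs
soup = BeautifulSoup(htmltext, 'html.parser')
prices = [div.get_text() for div in soup.find_all('div', class_='GHOFUQ5BGJC')]
airlines = [div.get_text() for div in soup.find_all('div', class_='GHOFUQ5BMFC')]
print(prices)
print(airlines)

For the live page you would need a JavaScript-capable tool such as Selenium, or an API that returns the data directly.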

Related

How to scrape values from a specific paragraph based on a pattern

On the page there is a paragraph like this:
The latest financial statements filed by x in the business register correspond to the year 2020 and show a turnover range of 'Between 6,000,000 and 30,000,000 Euros'.
On the page itself the paragraph is in Italian: L'ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese corrisponde all'anno 2020 e riporta un range di fatturato di 'Tra 6.000.000 e 30.000.000 Euro'.
I need to scrape the value inside the quotes, in this case 'Between 6,000,000 and 30,000,000 Euros', and put it in a column called "range".
I tried this code with no success:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
turnover = soup.find("span", {"id": "turnover"}).text
year = soup.find("span", {"id": "year"}).text
data = {'turnover': turnover, 'year': year}
df = pd.DataFrame(data, index=[0])
print(df)
But I get: AttributeError: 'NoneType' object has no attribute 'text'
First, scrape the whole text with BeautifulSoup, and assign it to a variable such as:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
Then, execute the following code:
import re
pattern = "'.+'"
result = re.search(pattern, text)
result = result[0].replace("'", "")
print(result)
The output will be:
Between 6,000,000 and 30,000,000 Euros
An alternative: split the text by the single quote character (') and take the item at position 1 of the resulting list.
Code:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
# Get the text at position 1 of the list:
desired_text = text.split("'")[1]
# Print the result:
print(desired_text)
Result:
Between 6,000,000 and 30,000,000 Euros
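Tying it back to the original question, here is a minimal end-to-end sketch: fetch the page, flatten it to text, pull out the quoted range, and store it in a column called "range". It anchors on the page's own wording shown above ("range di fatturato di '...'"), and it assumes the paragraph is present in the fetched HTML; if the AttributeError came from the site blocking the request or rendering via JavaScript, this will not fix that:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Flatten the page to plain text, then anchor on the page's own wording
# to grab just the quoted turnover range.
text = soup.get_text(" ", strip=True)
match = re.search(r"fatturato di\s*'([^']+)'", text)

# Store the captured group (without quotes) in a column called "range".
df = pd.DataFrame({'range': [match.group(1) if match else None]})
print(df)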

How to web-scrape data that may move indexes in the future

I am trying to web-scrape NFL standings data and am interested in the categories "PCT" and "Net Pts" from the table at this URL: https://www.nfl.com/standings/league/2021/REG
I have set up BeautifulSoup and printed all the 'td' elements on this page. The problem is that when you do so, the teams come out ordered from worst record to best. Obviously this will cause problems in the future if I have a specific index that I have identified as the Lions' PCT, for example, because when their record changes that data will have a different index. In fact, the order of the teams on the website will change every week as more games are played.
Is there any way to say something like: if the name of the team is X, do something (like use the table data 4 indexes lower)? I haven't seen this problem dealt with in any YouTube tutorial or book, so I am wondering what the thought process is. I need a way to identify each team and their PCT and Net Pts instantly, as this info will be passed into another function.
Here is what I have so far for example:
When you do something like this...
import requests
from bs4 import BeautifulSoup
url = 'https://www.nfl.com/standings/league/2021/REG'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.find_all('td')[0:10]
print(data)
#I am using just the first 10 indexes to keep it short here
...you get the table data info for the Detroit Lions as they are the worst team in the league at the time of posting this question. I have identified that their "PCT" data point would be
win_pct = soup.find_all('td')[4]
print(float(win_pct.text.strip()))
However, if another team becomes the worst team in the league this index would belong to them and not the Lions. How would I work around this? Thanks
You can use a dictionary to store data about the clubs and then use the club name as a key to get the data (independent of club position). For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.nfl.com/standings/league/2021/REG"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
data = {}
for row in soup.select("tr:has(td)"):
    cells = [td.get_text(strip=True) for td in row.select("td")[1:]]
    club_name = row.select_one(".d3-o-club-fullname").get_text(strip=True)
    data[club_name] = cells
# print PCT/Net Pts of Detroit Lions:
print(data["Detroit Lions"][3], data["Detroit Lions"][6])
Prints:
0.000 -63
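If you want numbers rather than strings, a small helper (hypothetical, but using the same indexes as the print above: 3 for PCT, 6 for Net Pts) keeps the lookup in one place:

def club_stats(data, club):
    # PCT is at index 3 and Net Pts at index 6 of each club's cell list
    return float(data[club][3]), int(data[club][6])

print(club_stats(data, "Detroit Lions"))  # (0.0, -63)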

How to extract values or data from a list of stored links using Selenium in Python?

I am trying to scrape prices from a real estate website, namely this one. I made a list of scraped links and wrote a script to get prices from all those links. I tried googling and asking around but could not find a decent answer. I just want to get the price values from the list of links and store them so they can be converted into a CSV file later on, with house name, location, and price as headers along with the respective data. The output I am getting is shown in a screenshot (not reproduced here); the last list, with a lot of prices, is what I want. My code is as follows:
from selenium import webdriver
import pandas as pd

PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe" # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH) # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv') # reading the csv file, which contains 8 links
de = data_extract['Links'].tolist() # converting the csv column to a list so that it can be iterated
data = [] # empty list to store the extracted prices
for url in de: # de has all the links I want to visit and scrape prices from
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices: # after finding the xpath, get the prices
        data.append(price.text)
    print(data) # printing to the console just to check what kind of data I obtained
Any help will be appreciated. The output I am expecting is something like [[price of house inside link 0], [price of house inside link 1], and so on]. The links in homie.csv are as follows:
Links
https://www.nepalhomes.com/detail/bungalow-house-for-sale-at-mandikhatar
https://www.nepalhomes.com/detail/brand-new-house-for-sale-in-baluwakhani
https://www.nepalhomes.com/detail/bungalow-house-for-sale-in-bhangal-budhanilkantha
https://www.nepalhomes.com/detail/commercial-house-for-sale-in-mandikhatar
https://www.nepalhomes.com/detail/attractive-house-on-sale-in-budhanilkantha
https://www.nepalhomes.com/detail/house-on-sale-at-bafal
https://www.nepalhomes.com/detail/house-on-sale-in-madigaun-sunakothi
https://www.nepalhomes.com/detail/house-on-sale-in-chhaling-bhaktapur
There is no need to use Selenium to get the data you need. That page loads its data from an API endpoint.
The API endpoint:
https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a
You can make a request directly to that API endpoint using the requests module and get your data.
This code will print all the prices:
import requests
url = 'https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a'
r = requests.get(url)
info = r.json()
for i in info['data']:
    print([i['basic']['title'], i['price']['value']])
['House on sale at Kapan near Karuna Hospital ', 15500000]
['House on sale at Banasthali', 70000000]
['Bungalow house for sale at Mandikhatar', 38000000]
['Brand new house for sale in Baluwakhani', 38000000]
['Bungalow house for sale in Bhangal, Budhanilkantha', 29000000]
['Commercial house for sale in Mandikhatar', 27500000]
['Attractive house on sale in Budhanilkantha', 55000000]
['House on sale at Bafal', 45000000]
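Since the goal was a CSV, here is a minimal follow-on sketch using only the two fields read above (title and price); other columns such as location would need their field names looked up in the JSON response, so they are left out here:

import csv
import requests

url = 'https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a'
info = requests.get(url).json()

with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'price'])  # header row
    for i in info['data']:
        writer.writerow([i['basic']['title'], i['price']['value']])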
I see several problems here:
I couldn't see any elements matching the text-3xl font-bold leading-none text-black class names on the https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a web page.
Even if there were such elements, for multiple class names you should use a CSS selector or XPath, so instead of
find_elements_by_class_name('text-3xl font-bold leading-none text-black')
it should be
find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
The find_elements method returns a list of web elements, so to get the text out of them you have to iterate over the list and read the text from each element, like the following:
prices = driver.find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
for price in prices:
    data.append(price.text)
Update: with this locator it works correctly for me:
prices = driver.find_elements_by_xpath("//p[@class='text-xl leading-none text-black']/p[1]")
for price in prices:
    data.append(price.text)
Tried with the below XPath, and it retrieved the price:
price_list, nameprice_list = [], []
houses = driver.find_elements_by_xpath("//div[contains(@class,'table-list')]/a")
for house in houses:
    name = house.find_element_by_tag_name("h2").text
    address = house.find_element_by_xpath(".//p[contains(@class,'opacity-75')]").text
    price = house.find_element_by_xpath(".//p[contains(@class,'text-xl')]/p").text.replace('Rs. ', '')
    price_list.append(price)
    nameprice_list.append((name, price))
    print("{}: {},{}".format(name, address, price))
And output:
House on sale at Kapan near Karuna Hospital: Kapan, Budhanilkantha Municipality,1,55,00,000
House on sale at Banasthali: Banasthali, Kathmandu Metropolitan City,7,00,00,000
...
[('House on sale at Kapan near Karuna Hospital', '1,55,00,000'), ('House on sale at Banasthali', '7,00,00,000'), ('Bungalow house for sale at Mandikhatar', '3,80,00,000'), ('Brand new house for sale in Baluwakhani', '3,80,00,000'), ('Bungalow house for sale in Bhangal, Budhanilkantha', '2,90,00,000'), ('Commercial house for sale in Mandikhatar', '2,75,00,000'), ('Attractive house on sale in Budhanilkantha', '5,50,00,000'), ('House on sale at Bafal', '4,50,00,000')]
['1,55,00,000', '7,00,00,000', '3,80,00,000', '3,80,00,000', '2,90,00,000', '2,75,00,000', '5,50,00,000', '4,50,00,000']
At first glance only 8 prices are visible, and if you just want to scrape them using Selenium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
wait = WebDriverWait(driver, 20)
for price in driver.find_elements(By.XPATH, "//p[contains(@class,'leading')]/p[1]"):
    print(price.text.split('.')[1])
This will print all the prices, without the 'Rs.' prefix.
This print statement should be outside the for loops to avoid staircase printing of output.
from selenium import webdriver
import pandas as pd

PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe" # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH) # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv')
de = data_extract['Links'].tolist()
data = []
for url in de:
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices: # after finding the xpath, get the prices
        data.append(price.text)
print(data)

Using Beautiful Soup to pull dates from a table

I'm looking to do something with bills that have been delivered to the governor: collecting the dates when they were delivered and the date of the last legislative action before they were sent.
I'm doing this for a whole series of similar URLs. The problem is, my code (below) works for some URLs and not others. I'm writing this to a pandas DataFrame and then to a CSV file. When the code fails, it writes the else block when either the if or the elif should have been triggered.
Here's a fail URL: https://www.nysenate.gov/legislation/bills/2011/s663
And a succeed URL: https://www.nysenate.gov/legislation/bills/2011/s333
Take the first URL for example. Underneath the "view actions" dropdown, it says it was delivered to the governor on Jul 29, 2011. Prior to that, it was returned to assembly on Jun 20, 2011.
Using "delivered to governor" location as td in the table, I'd like to collect both dates using Bs4.
Here's what I have in my code:
check_list = [item.text.strip() for item in tablebody.select("td")]
dtg = "delivered to governor"
dtg_regex = re.compile(
    '/.*(\S\S\S\S\S\S\S\S\S\s\S\S\s\S\S\S\S\S\S\S\S).*'
)
if dtg in check_list:
    i = check_list.index(dtg)
    transfer_list.append(check_list[i+1]) ## last legislative action date (not counting dtg)
    transfer_list.append(check_list[i-1]) ## dtg date
elif any(dtg_regex.match(dtg_check_list) for dtg_check_list in check_list):
    transfer_list.append(check_list[4])
    transfer_list.append(check_list[2])
else:
    transfer_list.append("no floor vote")
    transfer_list.append("not delivered to governor")
You could use :has and :contains to target the right first row, and find_next to move to the next row. You can use :last-of-type to get the last action in the first row, and select_one to get the first in the second row. You can use the class of each "column" to move between the first and second columns.
Your mileage may vary with other pages.
import requests
from bs4 import BeautifulSoup as bs
links = ['https://www.nysenate.gov/legislation/bills/2011/s663', 'https://www.nysenate.gov/legislation/bills/2011/s333']
transfer_list = []
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        target = soup.select_one('.cbill--actions-table--row:has(td:contains("delivered"))')
        if target:
            print(target.select_one('.c-bill--actions-table-col1').text)
            # transfer_list.append(target.select_one('.c-bill--actions-table-col1').text)
            print(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            # transfer_list.append(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            print(target.find_next('tr').select_one('.c-bill--actions-table-col1').text)
            # append again
            print(target.find_next('tr').select_one('.c-bill--actions-table-col2 span').text)
            # append again
        else:
            transfer_list.append("no floor vote")
            transfer_list.append("not delivered to governor")
Make full use of XPath:
Get date of "delivered to governor"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/bUQ6Gva8
S333 - http://xpather.com/oTNfuH75
Get date of "returned to assembly/senate"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr/td//text()[contains(lower-case(.), 'returned to')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/Rnufm2TB
S333 - http://xpather.com/4x9UHn4L
Get date of action which precedes "delivered to governor" row regardless of what the action is
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr[1]/td/text()
S663A - http://xpather.com/AUpWCFIz
S333 - http://xpather.com/u8LOCb0x
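One caveat: lower-case() is XPath 2.0, so it works in tools like xpather but not in lxml, which implements XPath 1.0. A rough lxml-compatible sketch (untested against the live pages) uses translate() for the case-folding instead:

import requests
from lxml import html

tree = html.fromstring(requests.get('https://www.nysenate.gov/legislation/bills/2011/s663').content)
# translate() lower-cases the letters in "DELIVERED TO GOVERNOR",
# standing in for the XPath 2.0 lower-case() used above
delivered_date = tree.xpath(
    "//text()[contains(translate(., 'DELIVRTOGN', 'delivrtogn'), 'delivered to governor')]"
    "/ancestor::tr/td[1]/text()"
)
print(delivered_date)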

Scraping headlines from Yahoo Finance using Python

I am using Beautiful Soup to extract headlines from this page: http://in.finance.yahoo.com/q?s=AAPL. But I need headlines for the past 3 months, i.e. from 10 Dec 2013 to 10 March 2014, and I can only extract the headlines that are on this specific page. How do I extract the required headlines for any specific company?
Code:
import urllib2
from bs4 import BeautifulSoup

url = 'http://in.finance.yahoo.com/q?s=AAPL'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data)
divs = soup.find('div', attrs={'id': 'yfi_headlines'})
div = divs.find('div', attrs={'class': 'bd'})
ul = div.find('ul')
lis = ul.findAll('li')
hls = []
for li in lis:
    headlines = li.find('a').contents[0]
    print headlines
I think your problem is more about where you get your data from. If you need data from the last three months, you should query http://in.finance.yahoo.com/q/hp?s=AAPL instead, where all the data you are looking for is presented in a table.
On http://in.finance.yahoo.com/q?s=AAPL, click on 'more headlines from AAPL'. From there you'll get a link that has a datetime field in it: http://in.finance.yahoo.com/q/h?s=AAPL&t=2014-02-08T15:06:40+05:30. Modify that and you should be good.
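Building on that hint, a rough sketch of the loop one might write (the t= parameter format is copied from the example URL above; how the page actually pages through history is an assumption, and the endpoint is long since defunct):

from datetime import datetime, timedelta

# Hypothetical: walk the date range by substituting the t= datetime field
base = 'http://in.finance.yahoo.com/q/h?s=AAPL&t={}'
day = datetime(2013, 12, 10)
while day <= datetime(2014, 3, 10):
    url = base.format(day.strftime('%Y-%m-%dT%H:%M:%S') + '+05:30')
    # fetch `url` with urllib2 and parse the headline list with
    # BeautifulSoup, as in the question's code
    day += timedelta(days=1)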
