Why doesn't my web-scraping code work? - python

I want to scrape the airplane arrivals from a website with Python 2.7 and export them to Excel, but something is wrong with my code:
import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
filename=r'output.csv'
resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
url = "https://www.flightradar24.com/data/airports/bud/arrivals"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = soup.find('div', { "class" : "row cnt-schedule-table"})
print data
I need the contents of the div with class row cnt-schedule-table. What am I doing wrong?

I believe the problem is that you are trying to get data from a JavaScript-loaded data set. Instead of scraping the page directly, you'll need to mimic the requests the page itself makes to fetch that data.
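As a sketch of that approach: open your browser's developer tools on the arrivals page (Network tab, filter by XHR), find the JSON request that fills the table, and replicate it with requests. The endpoint, parameters, and key path below are assumptions about what the page calls, so verify them against your own network capture:
import requests
api_url = "https://api.flightradar24.com/common/v1/airport.json"  # assumed endpoint; confirm in the Network tab
params = {
    "code": "bud",  # airport code taken from the page URL
    "plugin[]": "schedule",  # assumed plugin carrying the arrivals table
    "plugin-setting[schedule][mode]": "arrivals",
}
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject the default requests user agent
data = requests.get(api_url, params=params, headers=headers).json()
# Assumed key path into the response; adjust to what the captured JSON actually contains.
arrivals = data["result"]["response"]["airport"]["pluginData"]["schedule"]["arrivals"]["data"]
for flight in arrivals:
    print(flight["flight"]["identification"]["number"]["default"])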

Related

BeautifulSoup gives None When using find_all()

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import urllib
import csv,datetime,re
url_ = "https://www.wunderground.com/history/daily/ca/toronto/CYTZ/date/2016-6-25"
mypage = requests.get(url_).text
soup = BeautifulSoup(mypage,'html.parser')
soup.find_all('tr')
I was trying to fetch the weather data from wunderground. BeautifulSoup has fetched the source code, but I don't know why soup.find_all('tr') keeps giving me an empty list []. Does anyone know why?
Thank you!
The table data (most probably) gets populated by JavaScript. Take a look at this question.
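If you would rather not hunt down the underlying request, a common workaround is to render the page with Selenium first, so the JavaScript-built rows exist before BeautifulSoup parses them. A minimal sketch, assuming chromedriver is installed and on your PATH:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.wunderground.com/history/daily/ca/toronto/CYTZ/date/2016-6-25")
# Wait until at least one table row exists before grabbing the rendered source.
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "tr")))
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(len(soup.find_all("tr")))  # non-zero once the table has actually rendered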

How to extract content from <script> using Beautiful Soup

I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):
<script>
...
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...
I can identify the script I need with the following code:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[0]
However, I'm at a loss for how to extract the values I want. (I'm very new to Python.)
This thread recommended the following solution for a similar problem (edited to reflect the HTML I'm working with).
data = json.loads(all_scripts[0].get_text()[27:])
However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).
What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.
You can parse the content of <script> with the json module and then get your values. For example:
import re
import json
import requests
url = 'https://www.gofundme.com/f/eric-stevens-care-trust'
txt = requests.get(url).text
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
# print( json.dumps(data, indent=4) ) # <-- uncomment this to see all data
print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code =', data['feed']['campaign']['location']['postal_code'])
Prints:
Campaign Hearts = 4817
Postal Code = 90012
The more libraries you use, the more inefficient the code becomes! Here is a simpler solution:
#This imports the website content.
import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
a = requests.get(url)  # a plain GET returns the page HTML
a = a.content
print(a)
#These will show your data.
campaign_hearts = str(a,'utf-8').split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)
postal_code = str(a,'utf-8').split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
Your json.loads was failing because of the final semicolon. It will work if you use a regex to extract only the object string (excluding the final semicolon).
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
This should be fine for now; I might try to write a pure lxml version, or at least improve the search for the element.
This solution uses a regex to get only the JSON data, without the window.initialState = prefix and the trailing semicolon.
import json
import re
import requests
from bs4 import BeautifulSoup
url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"
req = requests.get(url_1)
soup = BeautifulSoup(req.content, 'lxml')
script_tag = soup.find('script')
raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)
json_content = json.loads(raw_json)
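From there, the same key path as in the earlier answer should apply, assuming the structure shown above:
print(json_content['feed']['campaign']['campaign_hearts'])  # 4817
print(json_content['feed']['campaign']['location']['postal_code'])  # 90012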

Unable to crawl CSS to HTML with Beautifulsoup

Hi, I'm trying to crawl the correct CSS to go with the HTML table created from BeautifulSoup. The table is done but the CSS is not. Can anyone take a look at my code and perhaps suggest a better way to crawl the stylesheet?
I can see two issues:
1. I'm not locating the correct stylesheet on the page matching the table.
2. My way of embedding the CSS into the HTML file is awkward, if not outright broken.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import tabulate
import urllib.request
import io
from bs4 import Comment
url = "https://www.etax.nat.gov.tw/etw-main/web/ETW183W2_10805/"
url_css = "https://www.etax.nat.gov.tw/etwmain/resources/web/css/main.fia.css"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), features="html.parser",from_encoding='utf-16')
soup_table = soup.findAll('table')[0]
soup_css = BeautifulSoup(urllib.request.urlopen(url_css).read(), features="html.parser",from_encoding='utf-16')
with io.open("soup_table.html", "w", encoding='utf-16') as f:
    f.write(str(soup_table))
    f.write("<script>")
    f.write(str(soup_css))
    f.write("</script>")
There is no error message, just that the table doesn't look right without proper styling.
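One evident fix, sketched against the code above: browsers only apply CSS that sits inside <style> tags; writing it inside <script> makes it get parsed as (broken) JavaScript. Reusing soup_table and soup_css from the question:
with io.open("soup_table.html", "w", encoding='utf-16') as f:
    f.write("<style>\n")
    f.write(str(soup_css))  # embed the fetched stylesheet so the browser applies it
    f.write("\n</style>\n")
    f.write(str(soup_table))
Whether url_css is really the stylesheet the table uses still needs to be checked against the page's <link rel="stylesheet"> tags.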

How to scrape specific IDs from a Webpage

I need to do some real estate market research, and for this I need the prices and other values from new houses.
So my idea was to go on the website where I get the information.
Go to the main search site and scrape all the RealEstateIDs that would navigate me directly to the single pages for each house, where I can then extract the info I need.
My problem is: how do I get all the real estate IDs from the main page and store them in a list, so I can use them in the next step to build the URLs that lead to the actual pages?
I tried it with BeautifulSoup but failed, because I don't understand how to search for a specific word and extract what comes after it.
The html code looks like this:
""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"#xsi.type":"search:ApartmentBuy","#id":"110356727","title":"
Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it each time (here: 110356727) and store them in a list, so that I can use them later.
Edit:
import time
import urllib.request
from urllib.request import urlopen
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
from requests import get
url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
def expose_IDs():
    resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('resultListModel')
    tickers = []
    for row in table.findAll('realestateID')[1:]:
        ticker = row.findAll(',')[0].text
        tickers.append(ticker)
    with open("exposeID.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers
expose_IDs()
Something like this? There are 68 keys in the dictionary that are IDs. I use a regex to grab the same script you are after, trim off an unwanted trailing character, then load the result with json.loads and access the JSON object:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
#resultListModel:
results = json.loads(script)
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)
Since the website updated, the JSON structure changed and the IDs are extracted as follows:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)
ids = [item['#id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]
print(ids)

python beautiful soup import urls

I am trying to import a list of URLs and grab pn2 and main1. I can run it without importing the file, so I know the scraping works, but I have no idea what to do with the import. Here is my most recent attempt, and below it a small portion of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl="csvfilelist"
soup = BeautifulSoup(theurl,"html.parser")
for row in csvfilelist:
    for pn in soup.findAll('td',{"class":"productText"}):
        pn2.append(pn.text)
    for main in soup.find_all('div',{"class":"breadcrumb"}):
        main1 = main.text
print (main1)
print ('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and using BeautifulSoup to parse it.
That is not the way to do it: BeautifulSoup parses HTML, not CSV.
Looking at your code, it would be correct if you were passing HTML to BS4.
from bs4 import BeautifulSoup
import requests

links = []
file = open('links.txt', 'w')  # open the output file for writing
html = requests.get('http://www.example.com')
soup = BeautifulSoup(html.text, 'html.parser')  # parse the response body, not the Response object
for x in soup.find_all('a', {"class": "abc"}):  # select anchors by class
    links.append(x)
    file.write(str(x) + '\n')
file.close()
Above is a very basic implementation of how to get a target element from the HTML and write it to a file or append it to a list. Use requests rather than urllib; it is a better, more modern library.
If you want to read your data in from a CSV file, the best option is the csv module's reader.
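A minimal sketch of that, looping over the URLs in ecco1.csv (assuming one URL per row, as in the list above):
import csv
import requests
from bs4 import BeautifulSoup

pn2 = []
with open('ecco1.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        url = row[0]  # one URL per row
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for pn in soup.find_all('td', {"class": "productText"}):
            pn2.append(pn.text)
        for main in soup.find_all('div', {"class": "breadcrumb"}):
            print(main.text)
print('\n'.join(pn2))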
Hope that helps.
