Hi, I'm trying to fetch the correct CSS to go with an HTML table extracted with BeautifulSoup. The table part works, but the CSS does not. Can anyone take a look at my code and perhaps suggest a better way to fetch the stylesheet?
I can see two issues:
1. I'm not locating the stylesheet on the page that actually matches the table.
2. The way I add the CSS to the HTML file is awkward, if not outright wrong.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import tabulate
import urllib.request
import io
from bs4 import Comment
url = "https://www.etax.nat.gov.tw/etw-main/web/ETW183W2_10805/"
url_css = "https://www.etax.nat.gov.tw/etwmain/resources/web/css/main.fia.css"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), features="html.parser",from_encoding='utf-16')
soup_table = soup.findAll('table')[0]
soup_css = BeautifulSoup(urllib.request.urlopen(url_css).read(), features="html.parser",from_encoding='utf-16')
with io.open("soup_table.html", "w", encoding='utf-16') as f:
    f.write(str(soup_table))
    f.write("<script>")
    f.write(str(soup_css))
    f.write("</script>")
There is no error message; the table just doesn't look right without proper styling.
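One possible fix for the second issue, as a minimal sketch: CSS belongs in a `<style>` element rather than `<script>`, and a stylesheet is plain text, so it does not need to go through BeautifulSoup at all. The helper name `build_styled_page` and the sample strings are assumptions made for illustration; with the code above, `table_html` would be `str(soup_table)` and `css_text` the decoded body of `url_css`. Whether that stylesheet actually contains the rules for this table is not something the sketch can confirm.

```python
def build_styled_page(table_html, css_text, charset="utf-8"):
    """Return a standalone HTML document with the CSS embedded in <style>."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{charset}">\n'
        f"<style>\n{css_text}\n</style>\n"
        "</head>\n<body>\n"
        f"{table_html}\n"
        "</body>\n</html>\n"
    )


# Illustrative inputs only; real values would come from the scraped page.
sample_table = "<table><tr><td>1</td></tr></table>"
sample_css = "table { border-collapse: collapse; }"
page = build_styled_page(sample_table, sample_css)
print(page)
```

The resulting string can then be written to `soup_table.html` with a single `f.write(page)`, using the same encoding that is declared in the `<meta charset>` tag.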
Related
I want to get real estate data from https://www.realtor.com/
I use this code:
from bs4 import BeautifulSoup as bs
import requests
main_url='https://www.realtor.com/realestateandhomes-search/New-York_NY'
page=requests.get(main_url).content
bs(page,'html.parser')
It does not output the full HTML of the page, so I can't find the tags I am interested in.
Is there another way to get the full HTML?
import requests
from bs4 import BeautifulSoup as bs
main_url='https://www.realtor.com/realestateandhomes-search/New-York_NY'
page=requests.get(main_url)
results = bs(page.content,'html.parser')
print(results)
This should work
I have a problem scraping a website in Python. Specifically, I cannot scrape a live-scores website with the BeautifulSoup library: the HTML elements never get inserted into my list.
import urllib3
from bs4 import BeautifulSoup
import requests
import pymysql
import timeit
data_list=[]
url_p=requests.get('my url website')
soup = BeautifulSoup(url_p.text,'html.parser')
vathmoi_table=soup.find("td",class_="label")
for table in soup.findAll("table"):
    print(table)
print(vathmoi_table)
for team_name in soup.findAll("td"):
    data_list_r=[]
    simvolo = team_name.find("img")
    name=team_name.find("td",class_="label")
    vathmologia=team_name.find("td",class_="points")
    if name is not None:
        data_list_r.append(simvolo.get_text().strip())
        data_list_r.append(name.get_text().strip())
        data_list_r.append(vathmologia.get_text().strip())
        data_list.append(data_list_r)
    for tr_parse in team_name.findAll("tr"):
        team=tr_parse.find("td",class_="team")
        if team is not None:
            print(team.get_text())
print(data_list)
I want to scrape the airplane arrivals from a website with Python 2.7 and export them to Excel, but something is wrong with my code:
import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
filename=r'output.csv'
resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
url = "https://www.flightradar24.com/data/airports/bud/arrivals"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = soup.find('div', { "class" : "row cnt-schedule-table"})
print data
I need the contents of the div with class row cnt-schedule-table. What am I doing wrong?
I believe the problem is that you are trying to read data that the page loads with JavaScript, so it is not in the HTML you download. Instead of scraping the page directly, you'll need to mimic the requests the page itself makes to fetch that data.
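A rough sketch of that idea: find the request in the browser's network tab that returns the data (often JSON) and call it directly. Everything specific here is an assumption: the `User-Agent` header may or may not be required, and the `arrivals`, `flight`, and `origin` keys are an invented payload shape, not flightradar24's actual response format. It is written for Python 3, unlike the Python 2.7 code in the question.

```python
import json
import urllib.request


def fetch_text(url):
    """Download a URL as text, sending a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_arrivals(payload_text):
    """Turn the (hypothetical) JSON payload into a list of (flight, origin) rows."""
    payload = json.loads(payload_text)
    return [(item["flight"], item["origin"]) for item in payload.get("arrivals", [])]


# Invented sample payload, standing in for what fetch_text() might return.
sample = '{"arrivals": [{"flight": "AB123", "origin": "Vienna"}]}'
rows = parse_arrivals(sample)
print(rows)
```

In real use, `parse_arrivals(fetch_text(data_url))` would replace the sample, where `data_url` is whatever endpoint the page is seen to call, and the keys would be adjusted to match the response it actually returns.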
I am trying to import a list of URLs and grab pn2 and main1 from each one. The scraping works when I run it without importing the file, so I know that part is fine, but I have no idea how to handle the import. Here is my most recent attempt, followed by a small sample of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl="csvfilelist"
soup = BeautifulSoup(theurl,"html.parser")
for row in csvfilelist:
    for pn in soup.findAll('td',{"class":"productText"}):
        pn2.append(pn.text)
    for main in soup.find_all('div',{"class":"breadcrumb"}):
        main1 = main.text
print (main1)
print ('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I can see, you are opening a CSV file and passing its contents to BeautifulSoup.
That will not work.
BeautifulSoup parses HTML, not CSV.
The rest of your code looks correct, provided you pass HTML to bs4.
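from bs4 import BeautifulSoup
import requests
links = []
# Fetch the page and pass its HTML text (not the Response object) to bs4.
html = requests.get('http://www.example.com')
soup = BeautifulSoup(html.text, 'html.parser')
# Open the output file for writing; 'with' closes it automatically.
with open('links.txt', 'w') as file:
    for x in soup.find_all('a', {"class": "abc"}):
        links.append(x)
        file.write(str(x) + '\n')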
Above is a very basic example of how to find a target element in the HTML and either write it to a file or append it to a list. I would use Requests rather than urllib; in my view it is the more modern and convenient library.
If you want to read your input from a CSV file, the best option I know of is Python's csv module (csv.reader).
Hope that helps.
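As a hedged sketch of the CSV suggestion: read the URLs with `csv.reader`, then fetch and parse each page in turn. It assumes one URL in the first column of each row and no header row; `read_urls` and `scrape_page` are hypothetical helper names, and the `productText` and `breadcrumb` selectors are taken from the question. `requests` and `bs4` are imported inside `scrape_page` only so that the CSV part runs without them.

```python
import csv
import io


def read_urls(csv_file):
    """Return the first column of every non-empty row as a list of URLs."""
    return [row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()]


def scrape_page(url):
    """Fetch one page and return (breadcrumb text or None, list of product texts)."""
    import requests  # third-party
    from bs4 import BeautifulSoup  # third-party

    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    breadcrumb = soup.find("div", {"class": "breadcrumb"})
    products = [td.get_text(strip=True)
                for td in soup.find_all("td", {"class": "productText"})]
    return (breadcrumb.get_text(strip=True) if breadcrumb else None, products)


# In-memory stand-in for open("ecco1.csv", newline="").
sample_csv = io.StringIO(
    "http://www.eccolink.com/products/productresults.aspx?catId=2458\n"
    "http://www.eccolink.com/products/productresults.aspx?catId=2464\n"
)
urls = read_urls(sample_csv)
print(len(urls))
```

With the real file, looping over `read_urls(open("ecco1.csv", newline=""))` and calling `scrape_page(url)` on each entry would give one `main1` value and one `pn2` list per URL.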
I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between 'h1' tags. Can't figure out why the code I've written isn't working!
from urllib2 import urlopen
from bs4 import BeautifulSoup
import re
TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage)
ProductNames = soup.find_all('h1')
print ProductNames
I got this working using requests (http://docs.python-requests.org/en/latest/)
from bs4 import BeautifulSoup
import requests
# TopShop_URL is the URL string read with raw_input() in the question.
content = requests.get(TopShop_URL).content
soup = BeautifulSoup(content, "html.parser")
product_names = soup.findAll("h1")
print product_names
Your code is correct, but the problem is that the div which contains the product name is generated dynamically via JavaScript.
To parse this element successfully, you should consider using Selenium or a similar tool, which lets you read the page after the DOM has fully loaded.
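A minimal sketch of the Selenium approach, assuming Selenium 4 and a locally installed Chrome; `rendered_h1_texts` and `main` are hypothetical helper names. The driver is passed in as an argument so the lookup function has no import-time dependency, and an implicit wait gives the JavaScript up to `timeout` seconds to create the `h1`.

```python
def rendered_h1_texts(driver, url, timeout=10):
    """Load url in a browser driver and return the text of every <h1>."""
    # Let element lookups retry for up to `timeout` seconds while scripts run.
    driver.implicitly_wait(timeout)
    driver.get(url)
    # "tag name" is the locator strategy behind selenium's By.TAG_NAME.
    return [element.text.strip()
            for element in driver.find_elements("tag name", "h1")]


def main(url):
    """Open Chrome, print the product names, and always close the browser."""
    from selenium import webdriver  # third-party; imported only when needed

    driver = webdriver.Chrome()
    try:
        print(rendered_h1_texts(driver, url))
    finally:
        driver.quit()
```

Calling `main(TopShop_URL)` with a product page URL would print the headings as the browser renders them; an explicit `WebDriverWait` on a more specific element is an alternative if the implicit wait proves unreliable.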