I tried scraping this webpage for Passengers & Cargo data, but I can't convert the results into usable data; the web encoding seems to be the problem.
The code I used is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.faa.gov/data_research/passengers_cargo/unruly_passengers/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("tbody")
for link in links:
    print(link.text)
Output1
This prints the data in Year and Total format. But when I append it to a list, the encoding ruins the data, as you can see in Output1.
names = []
for link in links:
    names.append(link.text)
names = map(lambda x: x.strip().encode('ascii'), names)
print(names)
Output2
The desired output should be Year and Total columns so I can perform analyses on them.
You can use find_all with tr and td like this:
import requests
from bs4 import BeautifulSoup

url = "https://www.faa.gov/data_research/passengers_cargo/unruly_passengers/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("tr")
data = []
for link in links:
    tds = link.find_all('td')
    if tds:
        data.append({'year': tds[0].text, 'total': tds[1].text})
print(data)
It works.
Hope it helps.
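Since the question already imports pandas, here is a minimal follow-up sketch that turns that data list into a DataFrame for analysis (assuming the scraped year and total strings are plain digits):
import pandas as pd

df = pd.DataFrame(data)
# cast the scraped strings to integers; this will raise if cells contain stray characters
df[['year', 'total']] = df[['year', 'total']].astype(int)
print(df.describe())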
I'm trying to grab all prices from a website using XPath. All the prices have the same XPath, and only [0], i.e. the first item, works. Let me show you:
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
print(dom.xpath('/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li[1]/article/div[1]/div[2]/div')[0].text)
This successfully prints the first price!
I tried changing "[0].text" to [1] to print the second item, but it returned "list index out of range".
Then I tried to think of a for loop that would print all the items, so I could compute an average.
Any help would be greatly appreciated!
I apologize; here is the full code, edited in:
from bs4 import BeautifulSoup
from lxml import etree
import requests
URL = "https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709"
# HEADERS = you'll need to add your own headers here; the site won't let me post mine.
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
print(dom.xpath('/html/body/div[10]/div[4]/section/div/div/div[2]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/ul/li[3]/strong')[0].text)
You could just use CSS selectors which, in this instance, are a lot more readable. I would also remove some of the offers info to leave just the actual price.
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get("https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709", headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.text, features="lxml")
prices = {}

for i in soup.select('.item-container'):
    # drop the offers fragment so only the actual price remains
    if a := i.select_one('.price-current-num'):
        a.decompose()
    prices[i.select_one('.item-title').text] = i.select_one('.price-current').get_text(strip=True)[:-1]

pprint(prices)
And to get the prices as a list of floats:
import requests, re
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get("https://www.newegg.com/p/pl?d=GPU&N=601357247%20100007709", headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.text, features="lxml")
prices = []

for i in soup.select('.item-container'):
    if a := i.select_one('.price-current-num'):
        a.decompose()
    # strip the currency symbol and thousands separators before converting
    prices.append(float(re.sub(r'\$|,', '', i.select_one('.price-current').get_text(strip=True)[:-1])))

pprint(prices)
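Since the stated goal was an average, one more line once prices is populated:
# mean of the scraped prices; guard against an empty result
if prices:
    print(f"Average price: {sum(prices) / len(prices):.2f}")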
I am trying to get a single value from a website that updates daily.
I am trying to get the latest price for Vijayawada on this page.
Below is my code, but I am getting empty output where I expect 440.
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
price = soup.select("Vijayawada")  # this looks for <Vijayawada> tags, so it matches nothing
I am looking to get the value 440 (today's value). Any suggestions?
One approach could be to select the <tr> by its text and iterate over its stripped strings to pick the last one that isdigit():
[x for x in soup.select_one('tr:-soup-contains("Vijayawada")').stripped_strings if x.isdigit()][-1]
Example
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
price = [x for x in soup.select_one('tr:-soup-contains("Vijayawada")').stripped_strings if x.isdigit()][-1]
print(price)
Another approach is to pick the <td> element by day of the month:
import requests
from bs4 import BeautifulSoup
import datetime

# today's day of month, used as the column index into the row
d = datetime.datetime.now().strftime("%d")
res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.select('tr:-soup-contains("Vijayawada") td')[int(d)].text)
Output
440
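Both snippets assume the Vijayawada row exists; select_one returns None if it does not, and the chained call then raises AttributeError. A defensive variant of the first approach, as a sketch:
row = soup.select_one('tr:-soup-contains("Vijayawada")')
if row is not None:
    # take the last purely numeric string in the row
    digits = [x for x in row.stripped_strings if x.isdigit()]
    print(digits[-1] if digits else "No numeric value in row")
else:
    print("Vijayawada row not found - the page layout may have changed")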
I'm trying to scrape the data into a dictionary from this site:
from bs4 import BeautifulSoup
import requests
from pprint import pprint

page = requests.get('https://webscraper.io/')
soup = BeautifulSoup(page.text, "lxml")
info = []
for x in range(1, 7):
    items = soup.findAll("div", {"class": f"info{x}"})
    info.append(items)
however, the HTML tags are not being removed.
You need to use .text. Then, to get it into the shape you want, you need a bit of string manipulation.
from bs4 import BeautifulSoup
import requests
from pprint import pprint

url = 'https://webscraper.io/'
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
info = []
for x in range(1, 7):
    item = soup.find("div", {"class": f"info{x}"}).text.strip().replace('\n', ': ')
    info.append(item)
info = '\n'.join(info)
print(info)
Something like this might work? (Replace the webscraper.io URL with your actual request URL; you'd also still need to clean up the \n characters from the output.)
from bs4 import BeautifulSoup
import requests
from pprint import pprint

page = requests.get('https://webscraper.io/')
soup = BeautifulSoup(page.text, "lxml")
info = []
for x in range(1, 7):
    items = soup.findAll("div", {"class": f"info{x}"})
    info += [item.text for item in items]
That is, take item.text for each match and concatenate the resulting list onto info.
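Since the question asked for a dictionary, a small variant of the same loop (reusing soup from above) that keys the cleaned text by its info{x} class name:
info = {}
for x in range(1, 7):
    div = soup.find("div", {"class": f"info{x}"})
    if div:  # skip any info class missing from the page
        info[f"info{x}"] = div.text.strip()
pprint(info)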
How can I scrape the Yahoo earnings calendar to pull out the dates?
This is for Python 3.
from bs4 import BeautifulSoup as soup
import urllib.request

url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html, 'lxml')
table = page_soup.find('p')
print(table)
The output is None.
Beautiful Soup has several find functions that you can use to inspect the DOM; please refer to the documentation.
from bs4 import BeautifulSoup as soup
import urllib.request

url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html, 'lxml')
table = page_soup.find_all('td')
dates = []
for something in table:
    try:
        if something['aria-label'] == "Earnings Date":
            dates.append(something.text)
    except KeyError:
        # cells without an aria-label attribute raise KeyError; skip them
        pass
print(dates)
This might be off-topic, but since you want to get a table from a webpage, you might consider pandas, which does it in two lines:
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm')[0]
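read_html returns every table on the page as a DataFrame; pulling just the dates is then one more line. The column label below is an assumption taken from the other answer's aria-label, so check earnings.columns first:
# 'Earnings Date' is an assumed label - inspect earnings.columns to confirm
dates = earnings['Earnings Date'].tolist()
print(dates)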
Here are two succinct ways:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm&guccounter=1')
soup = bs(r.content, 'lxml')

# using an attribute = value selector
dates = [td.text for td in soup.select('[aria-label="Earnings Date"]')]

# using nth-of-type to get the column
dates = [td.text for td in soup.select('#cal-res-table td:nth-of-type(3)')]
I was building a web scraper using Python.
The purpose of my scraper is to fetch all the links to websites from this webpage: http://www.ebizmba.com/articles/torrent-websites
I want output like:
www.thepiratebay.se
www.kat.ph
I am new to Python and scraping, and I am doing this just for practice. Please help me get the right output.
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})
for item in data:
    print(item.contents[1].find_all("a"))
My output: http://i.stack.imgur.com/Xi37B.png
If you are web scraping for practice, have a look at regular expressions.
This would get just the headline links. The Needle1 string is the match pattern, and the parentheses (http:.*?) contain the match group:
import urllib.request
import re

myURL = "http://www.ebizmba.com/articles/torrent-websites"
# read the page and decode the bytes so the str pattern can match
html = urllib.request.urlopen(myURL).read().decode('utf-8')
Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, html):
    print(match.group(1))
Use .get('href') like this:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")
soup = BeautifulSoup(r.text, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})
for i in data:
    for j in i.contents[1].find_all("a"):
        print(j.get('href'))
Full output:
http://www.thepiratebay.se
http://siteanalytics.compete.com/thepiratebay.se
http://quantcast.com/thepiratebay.se
http://www.alexa.com/siteinfo/thepiratebay.se/
http://www.kickass.to
http://siteanalytics.compete.com/kickass.to
http://quantcast.com/kickass.to
http://www.alexa.com/siteinfo/kickass.to/
http://www.torrentz.eu
http://siteanalytics.compete.com/torrentz.eu
http://quantcast.com/torrentz.eu
http://www.alexa.com/siteinfo/torrentz.eu/
http://www.extratorrent.cc
http://siteanalytics.compete.com/extratorrent.cc
http://quantcast.com/extratorrent.cc
http://www.alexa.com/siteinfo/extratorrent.cc/
http://www.yify-torrents.com
http://siteanalytics.compete.com/yify-torrents.com
http://quantcast.com/yify-torrents.com
http://www.alexa.com/siteinfo/yify-torrents.com
http://www.bitsnoop.com
http://siteanalytics.compete.com/bitsnoop.com
http://quantcast.com/bitsnoop.com
http://www.alexa.com/siteinfo/bitsnoop.com/
http://www.isohunt.to
http://siteanalytics.compete.com/isohunt.to
http://quantcast.com/isohunt.to
http://www.alexa.com/siteinfo/isohunt.to/
http://www.sumotorrent.sx
http://siteanalytics.compete.com/sumotorrent.sx
http://quantcast.com/sumotorrent.sx
http://www.alexa.com/siteinfo/sumotorrent.sx/
http://www.torrentdownloads.me
http://siteanalytics.compete.com/torrentdownloads.me
http://quantcast.com/torrentdownloads.me
http://www.alexa.com/siteinfo/torrentdownloads.me/
http://www.eztv.it
http://siteanalytics.compete.com/eztv.it
http://quantcast.com/eztv.it
http://www.alexa.com/siteinfo/eztv.it/
http://www.rarbg.com
http://siteanalytics.compete.com/rarbg.com
http://quantcast.com/rarbg.com
http://www.alexa.com/siteinfo/rarbg.com/
http://www.1337x.org
http://siteanalytics.compete.com/1337x.org
http://quantcast.com/1337x.org
http://www.alexa.com/siteinfo/1337x.org/
http://www.torrenthound.com
http://siteanalytics.compete.com/torrenthound.com
http://quantcast.com/torrenthound.com
http://www.alexa.com/siteinfo/torrenthound.com/
https://demonoid.org/
http://siteanalytics.compete.com/demonoid.pw
http://quantcast.com/demonoid.pw
http://www.alexa.com/siteinfo/demonoid.pw/
http://www.fenopy.se
http://siteanalytics.compete.com/fenopy.se
http://quantcast.com/fenopy.se
http://www.alexa.com/siteinfo/fenopy.se/
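Since you only wanted the main site links, and each one in the output above is followed by its siteanalytics/quantcast/alexa stats URLs, here is a sketch of a filter on the same loop (assuming those three stats domains are the only extras):
# keep only the main site links, skipping the per-site stats URLs
skip = ('siteanalytics.compete.com', 'quantcast.com', 'alexa.com')
for i in data:
    for j in i.contents[1].find_all("a"):
        href = j.get('href')
        if href and not any(s in href for s in skip):
            print(href)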