Screen Scraping a Twitter Page with Unicode Equal Comparison Failure in Python

I'm using the following code to obtain a list of a user's followers on twitter:
import urllib
from BeautifulSoup import BeautifulSoup

# code only looks at one page of followers instead of continuing to all of a user's followers
# decided to only use a small sample
site = "http://mobile.twitter.com/NYTimesKrugman/following"
friends = set()
response = urllib.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
names = soup.findAll('a', {'href': True})
for name in names:
    a = name.renderContents()
    b = a.lower()
    if ("http://mobile.twitter.com/" + b) == name['href']:
        c = str(b)
        friends.add(c)
for friend in friends:
    print friend
print ("Done!")
However, I get the following results:
NYTimeskrugman
nytimesphoto
rasermus
Warning (from warnings module):
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion14.py", line 42
if ("http://mobile.twitter.com/" + b) == name['href']:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
amnesty_norge
zynne_
fredssenteret
oljestudentene
solistkoret
....(and so it continues)
It would appear that I was able to get most of the names of the accounts being followed, but I received a somewhat random warning partway through. It didn't stop the code from finishing, however. I was hoping that someone could enlighten me as to what happened?

Don't know if my answer will be useful several years later, but I rewrote your code using requests instead of urllib.
I think it's better to make a separate selection on the "username" class so that only follower names are considered.
Here's the stuff:
import requests
from bs4 import BeautifulSoup

site = "http://mobile.twitter.com/paulkrugman/followers"
friends = set()
response = requests.get(site)
soup = BeautifulSoup(response.text, "html.parser")
names = soup.findAll('a', {'href': True})
for name in names:
    pseudo = name.find("span", {"class": "username"})
    if pseudo:
        pseudo = pseudo.get_text()
        friends.add(pseudo)
for friend in friends:
    print(friend)
print("Done!")

# paulkrugman appears in every set, so don't forget to delete it!
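As for the original warning itself: in Python 2, name['href'] comes back from BeautifulSoup as a unicode string, while renderContents() returns UTF-8 bytes. As soon as a screen name contains a non-ASCII character, the == comparison cannot coerce the byte string to unicode, so Python emits the UnicodeWarning and treats the two values as unequal. A minimal, hedged sketch of one fix is to decode the bytes before comparing; this slots into the original for loop:

for name in names:
    # decode the tag contents so both sides of the comparison are unicode
    handle = name.renderContents().decode('utf-8').lower()
    if (u"http://mobile.twitter.com/" + handle) == name['href']:
        friends.add(handle)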

Related

Accessing duplicate feed tags using feedparser

I'm trying to parse this feed: https://feeds.podcastmirror.com/dudesanddadspodcast
The channel section has two entries for podcast:person
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" >Joel DeMott</podcast:person>
When parsed, feedparser only brings in one name
>>> import feedparser
>>> d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
>>> d.feed['podcast_person']
{'role': 'host', 'img': 'https://dudesanddadspodcast.com/files/2019/03/joel.jpg', 'href': 'https://www.podchaser.com/creators/joel-demott-107aRuVQLH'}
What would I change so it would instead show a list for podcast_person so I could loop through each one?
Idea #1:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
soup.find_all("podcast:person")
Output:
[<podcast:person href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" role="host">Andy Lehman</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" role="host">Joel DeMott</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/cory-martin-107aRwmCuu" img="" role="guest">Cory Martin</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/julie-lehman-107aRuVQPL" img="" role="guest">Julie Lehman</podcast:person>]
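If the goal is simply to loop through each person, a short, hedged follow-up to Idea #1 (same request and parser as above) pulls the role, href and display name out of each tag:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
# find_all returns a list-like ResultSet of Tag objects, one per <podcast:person>
for person in soup.find_all("podcast:person"):
    print(person.get("role"), person.get("href"), person.get_text())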
Idea #2:
import feedparser
d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
hosts = d.entries[1]['authors'][1]['name'].split(", ")
print("The hosts of this Podcast are {} and {}.".format(hosts[0], hosts[1]))
Output:
The hosts of this Podcast are Joel DeMott and Andy Lehman.
You can iterate over the feed['items'] and get all the records.
import feedparser

feed = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
if feed:
    for item in feed['items']:
        print(f'{item["title"]} - {item["author"]}')
Instead of feedparser I would prefer BeautifulSoup.
You can copy the below code to test the end results.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
feeds = soup.find_all("podcast:person")
print(type(feeds))  # <class 'bs4.element.ResultSet'> (behaves like a list)
# You can loop over the `feeds` variable.
Since I'm familiar with lxml, and considering the fact that someone has already posted a solution using feedparser, I wanted to test how lxml could be used to parse an RSS feed. In my opinion, the daunting part is handling the RSS namespaces, but once that is resolved the task becomes quite easy:
import urllib.request
from lxml import etree

feed = etree.parse(urllib.request.urlopen('https://feeds.podcastmirror.com/dudesanddadspodcast')).getroot()
namespaces = {
    'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'
}
for episode in feed.iter('item'):
    # print(etree.tostring(episode))
    authors = episode.xpath('itunes:author/text()', namespaces=namespaces)
    print(authors)
    # title = episode.xpath('itunes:title/text()', namespaces=namespaces)
    # episode_metadata = '{} - {}'.format(title[0] if title else 'Missing title', authors[0] if authors else 'Missing authors')
    # print(episode_metadata)
The code above runs close to 3x faster than a similar solution with feedparser, reflecting the performance gains from using lxml as the parsing library.
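The same namespace technique can be pointed at the <podcast:person> tags from the question. This is a hedged sketch: the namespace URI below is the standard podcastindex.org one, and should be swapped for whatever URI the feed actually declares for its podcast: prefix.

import urllib.request
from lxml import etree

feed = etree.parse(urllib.request.urlopen('https://feeds.podcastmirror.com/dudesanddadspodcast')).getroot()
# assumed namespace URI for the podcast: prefix; check the feed's own declaration
ns = {'podcast': 'https://podcastindex.org/namespace/1.0'}
for person in feed.findall('channel/podcast:person', namespaces=ns):
    print(person.get('role'), person.get('href'), person.text)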

Python script: "list index out of range" error

I've written a simple python script for web scraping:
import requests
from bs4 import BeautifulSoup

for i in range(1,3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg="+str(i)
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    list = soup.find_all("li",{"class":"column"})
    for li in list:
        name = li.div.a.h3.text.strip()
        print(name)
        link = li.div.a.get("href")
        oldprice = li.find("div",{"class":"proDetail"}).find_all("a")[0].text.strip().strip('TL')
        newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL')
        print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")
It gives me a list index out of range error in the line newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL')
Why am I getting this error? How can I fix it?
As mentioned above, your code is not getting back as many elements as you expect.
In the line newprice = li.find("div",{"class":"proDetail"}).find_all("a")[1].text.strip().strip('TL'), the find_all("a") call sometimes returns a list with only one <a> tag, so index [1] does not exist.
Additionally, you should check which web page this is happening on. By that I mean:
for i in range(1,3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg="+str(i)
It could be that the code fails when i=1, when i=2, or both, so you should examine each web page as well.
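A hedged sketch of a defensive variant of your loop: grab the <a> list once, and only index into it when it actually holds both prices, so a listing with unexpected markup is skipped instead of raising an IndexError.

import requests
from bs4 import BeautifulSoup

for i in range(1, 3):
    url = "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?m=Samsung&pg=" + str(i)
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find_all("li", {"class": "column"}):
        name = li.div.a.h3.text.strip()
        link = li.div.a.get("href")
        anchors = li.find("div", {"class": "proDetail"}).find_all("a")
        if len(anchors) < 2:
            # this listing does not carry both an old and a new price
            print(f"page {i}: skipping {name!r}, unexpected price markup")
            continue
        oldprice = anchors[0].text.strip().strip('TL')
        newprice = anchors[1].text.strip().strip('TL')
        print(f"name: {name} link: {link} old price: {oldprice} new price: {newprice}")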

Python: Web Scraping with input from user

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests

url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div", class_="enws-mainpage-widget-content", id="enws-mainpage-newtexts-content").find_all('a')
ebooks = []
i = 0
for ebook in List:
    x = ebook.get('title')
    for ch in x:
        if ch == ":":
            x = ""
    if x != "":
        ebooks.append(x)
        i = i + 1

print("Please select a book: ")
inputnumber = 0
while inputnumber < len(ebooks):
    print(inputnumber + 1, " - ", ebooks[inputnumber])
    inputnumber = inputnumber + 1
input = int(input())
selectedbook = Soup.find("href", title=ebooks[input-1])
print(selectedbook)
I want to get the href of whichever book was selected by the user, but as output I get: None.
Can someone please tell me where I am going wrong?
I changed the last two lines of your code to these:
selectedbook = Soup.find("a", title=ebooks[input-1])
print(selectedbook['title'])
print("https://en.wikisource.org/" + selectedbook['href'])
This just works!
NB: The find() method searches for the first tag with the specified name and returns an object of type bs4.element.Tag.
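One small, hedged aside: concatenating the base URL by hand can leave a doubled slash, since the href already starts with /wiki/. urllib.parse.urljoin builds the absolute link more robustly:

from urllib.parse import urljoin

# selectedbook is the Tag returned by Soup.find("a", title=ebooks[input-1]) above
full_url = urljoin("https://en.wikisource.org/", selectedbook['href'])
print(full_url)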

How to use Python to parse HTML that is in txt format?

I am trying to parse a txt file, linked below as an example.
The txt file, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is just using a simple Regular Expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
The part you are looking for is inside a huge <SEC-HEADER> tag.
You can get the whole section with soup.find('sec-header'),
but you will need to parse the section manually. Something like the following works, though it's a bit of a dirty job:
(view it on Repl.it: https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text

company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1:
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break
print(company_name)
There may be a library that can parse this data more cleanly than this code does.

Having problems following links with webcrawler

I am trying to create a webcrawler that parses all the html on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure: this question pertains to an assignment for a Coursera course.)
Here's my code
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)
count = 0
counts = 0
tags = list()
soup = None
while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue
print url
Based on DJanssens' response, I found the solution;
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0
while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break
print url
I believe this is what you are looking for:
import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # if a link does not exist at that position, show an error and stop
    if len(tags) < position:
        print "A link does not exist at that position."
        break
    # the link at that position exists, so overwrite url; the next search will use it
    url = tags[position - 1].get('href')
    print url
The code now loops as many times as specified in the input; on each pass it takes the href at the given position and overwrites url with it, so every iteration follows the chain one link further.
I'd advise you to use descriptive names for your variables, which makes the code a lot easier to understand. In addition, you can cast them and read them in a single line, which makes the beginning easier to follow.
Here are my two cents:
import urllib
#import ssl
from bs4 import BeautifulSoup

#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())
for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())