Recording web scraping data - python

Hi everyone. Although I got the data I was looking for in text format, it simply doesn't work when I try to record it as a list or convert it into a dataframe. What I got was a huge list containing only one value, repeated: the last text line of the data, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)
lista=[]
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)

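Your loop variables are mixed up: the last loop iterates with i but appends t.text. Since t still holds the last element from the first loop, every item appended to the list is that same last value.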
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
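Once the cell text is in a flat list, it still has to be grouped into rows, and the numbers converted, before it is useful as a dataframe. Below is a minimal sketch, assuming the table has a fixed number of columns and that numeric cells use the Brazilian format shown in the question ('9.054.333,18'). The helper names, the sample values and the column count are hypothetical and not taken from the B3 page.

```python
def parse_brl_number(text):
    """Convert a string like '9.054.333,18' to a float; return None if it is not numeric."""
    # Drop the thousands separators, then turn the decimal comma into a dot.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def chunk_rows(cells, n_columns):
    """Split a flat list of cell values into rows of n_columns items each."""
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


# Hypothetical flat list, shaped like [t.text for t in soup.find_all('td')]
lista = ["ABCD3", "1.250", "9.054.333,18", "EFGH4", "300", "12.500,00"]

rows = chunk_rows(lista, n_columns=3)
# Keep the first column as text and convert the remaining columns to numbers.
table = [[row[0]] + [parse_brl_number(value) for value in row[1:]] for row in rows]
print(table)
```

If pandas is available, a list of rows like table can be passed to pandas.DataFrame(table, columns=[...]) with whatever column names fit the real page.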

Related

How to get the content of a tag with Beautiful Soup?

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
import bs4
import requests
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
That gets me the LaTeX equation, but there are more parts of the question before it, in double quotes. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well, BS4 seems to be a bit buggy here, and it took me a while to get this. I don't think it is viable with BS4 alone, given the weird spacing in the markup, so a regex would be your best option. I checked this on the first two questions and it worked fine for both; let me know if it works for you. Some AMC geometry problems include figures as images, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
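As an alternative to the regex, the text and the LaTeX can be collected in document order by walking the direct children of the <p> element: plain strings are kept as they are, and each <img> is replaced by its alt attribute. This is only a sketch. The sample HTML below is made up for illustration (the real page markup may differ), and paragraph_with_latex is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, loosely modelled on a wiki problem paragraph.
SAMPLE_HTML = (
    r'<div class="mw-parser-output"><p>What is the value of '
    r'<img alt="$\frac{2a^{-1}}{a}$" src="x.png"/> when '
    r'<img alt="$a=2$" src="y.png"/>?</p></div>'
)


def paragraph_with_latex(p):
    """Join the text nodes and the <img alt="..."> values of one <p>, in order."""
    parts = []
    for child in p.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))              # plain text between tags
        elif isinstance(child, Tag) and child.name == "img":
            parts.append(child.get("alt", ""))    # LaTeX source stored in alt
        elif isinstance(child, Tag):
            parts.append(child.get_text())        # any other inline tag
    return "".join(parts).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
question = paragraph_with_latex(soup.select_one(".mw-parser-output p"))
print(question)
```

Because the children are visited in order, this keeps working when a paragraph contains several images, which pairing paragraphs and images with zip() does not handle.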

Beautifulsoup multiple div content to dictionary

I'm trying to get the contents of two divs into a dictionary in Python. The main problem is that I can fetch the content of the first div and of the second, but not in the right key:value form. I'm only able to get the keys back. I know that I need to iterate through the content, but I can't seem to get my for loop correct.
Following 1 and 2 did not get me the result I'm looking for.
This is what I've tried so far:
from bs4 import BeautifulSoup
import requests
url='https://www.samenvoordeklant.nl/arbeidsmarktregios'
base=requests.get(url, timeout=15)
html=BeautifulSoup(base.text, 'lxml')
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
for regio in regios:
    print({regio.get_text(strip=True)})
The result:
{'Achterhoek'}
{'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost Gelre, Oude IJsselstreek, Winterswijk'}
{'Amersfoort'}
{'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, Woudenberg'}
etc.
The result I'm after is:
{'Achterhoek': ['Aalten', 'Berkelland', 'Bronckhorst', 'Doetinchem', 'Montferland', 'Oost Gelre', 'Oude IJsselstreek', 'Winterswijk']}
{'Amersfoort': ['Amersfoort', 'Baarn', 'Bunschoten', 'Leusden', 'Nijkerk', 'Soest', 'Woudenberg']}
etc. This allows me to move it afterwards into a pandas dataframe more easily.
An easy way is with dict and zip of the two lists. Note I have used faster css selectors and avoided using full multi-value of class.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.samenvoordeklant.nl/arbeidsmarktregios')
soup = bs(r.content, 'lxml')
result = dict(zip([i.text for i in soup.select('h2 a')], [i.text for i in soup.select('.field--type-string-long')]))
print(result)
# result = {k:v.split(', ') for k, v in result.items()} ##add this line at end if want list as value rather than string
If you want a list as the value you can simply add a last line of:
result = {k:v.split(', ') for k, v in result.items()}
This is one approach using zip.
Ex:
from pprint import pprint
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
result = {key.get_text(strip=True): value.get_text(strip=True) for key, value in zip(regios[0::2], regios[1::2])}
pprint(result)
Output:
{'Achterhoek': 'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost '
'Gelre, Oude IJsselstreek, Winterswijk',
'Amersfoort': 'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, '
'Woudenberg',......
If you need the value as a list of items
Use:
result = {key.get_text(strip=True): [i.strip() for i in value.get_text(strip=True).split(",")] for key, value in zip(regios[0::2], regios[1::2])}
Output:
{'Achterhoek': ['Aalten',
'Berkelland',
'Bronckhorst',
'Doetinchem',
'Montferland',
'Oost Gelre',
'Oude IJsselstreek',
'Winterswijk'],
'Amersfoort': ['Amersfoort',
'Baarn',
'Bunschoten',....
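Since the stated goal is to move the result into a pandas dataframe, it can help to flatten the dictionary into one (region, municipality) pair per row first. Below is a minimal sketch, assuming the dictionary has the region name as key and a comma-separated string as value, as in the first output above. to_long_rows is a hypothetical helper name, and the sample dictionary is a shortened copy of that output.

```python
def to_long_rows(result):
    """Turn {region: 'a, b, c'} into [(region, 'a'), (region, 'b'), (region, 'c')]."""
    rows = []
    for region, municipalities in result.items():
        for name in municipalities.split(","):
            name = name.strip()
            if name:  # skip empty pieces, e.g. after a trailing comma
                rows.append((region, name))
    return rows


# Shortened sample in the shape produced by the dict/zip approaches above.
result = {
    "Achterhoek": "Aalten, Berkelland, Bronckhorst",
    "Amersfoort": "Amersfoort, Baarn",
}

rows = to_long_rows(result)
print(rows)
```

A list of tuples like this can be handed to pandas.DataFrame(rows, columns=["regio", "gemeente"]), where the column names are just placeholders.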

How to store different sets of text in one single variable using Beautifulsoup

I am building a simple program in Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I can iterate through the different URLs (each URL is a song by this artist) and print the output that I want, I am struggling to store all the different songs in one single variable.
I've tried different approaches, trying to store it in a list, dictionary, dictionary inside a list, etc. but it didn't work out. I've also read Beautifulsoup documentation and several forums without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class' : 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    d = {}
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class':'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics','')
        print("\n",title.upper(),"\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you should store data in one variable.
This can be JSON or similar by using a python dictionary.
a = dict()
# Create an instance of a dict. Same as a = {}.
a[1] = 'one'
# This is how a basic dictionary works: there is a key and a value.
a[2] = {'two':'the number 2'}
# Here the key is a plain value, but the value is another dictionary.
print(a)
# This is how we access the dict inside the dict.
print(a[2]['two'])
# The first key, [2], gives us {'two':'the number 2'}; ['two'] then accesses the value inside it.
You'll be able to apply this knowledge to your algorithm.
Use the album as the first key, e.g. all['Stay strong'] = {'some-song':'text_heavy'}
I also recommend writing a function, since you're re-using code,
for instance for the request and the parsing with bs4:
def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
A good practice in software development is DRY (Don't Repeat Yourself), since readability counts, as opposed to WET (Write Everything Twice, or Waste Everyone's Time).
Just something to keep in mind.
I made it!!
I wasn't able to store the output in a variable, but I was able to write a txt file storing all the content which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup
with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class' : 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class':'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics','')
            f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
# No explicit f.close() is needed: the with block closes the file on exit.
I would still love to hear if there are other better approaches.
Thanks!
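For the original goal of keeping everything in one variable, one option is to fill a dictionary inside the loop instead of printing or writing, and to join it into a single string at the end if needed. The sketch below shows only the storage pattern. collect_songs and fake_fetch_song are hypothetical names, and the fake fetcher stands in for the requests.get and find_all calls above so that the example runs offline.

```python
def collect_songs(urls, fetch_song):
    """Return {TITLE: lyrics} for every URL, using fetch_song(url) -> (title, verses)."""
    songs = {}
    for url in urls:
        title, verses = fetch_song(url)
        # One entry per song: the verses are joined into a single block of text.
        songs[title.strip().upper()] = "\n".join(verses)
    return songs


def fake_fetch_song(url):
    """Stand-in for the real download and parsing step (hypothetical sample data)."""
    pages = {
        "song-1": ("First Song Lyrics", ["verse one", "verse two"]),
        "song-2": ("Second Song Lyrics", ["only verse"]),
    }
    title, verses = pages[url]
    return title.replace("Lyrics", ""), verses


songs = collect_songs(["song-1", "song-2"], fake_fetch_song)

# If a single string is wanted, join the dictionary at the very end.
all_lyrics = "\n\n".join(f"{title}\n{text}" for title, text in songs.items())
print(all_lyrics)
```

In a real run, fetch_song would wrap the request and the two find_all calls for one song page and return the title together with a list of verse texts.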

Looping through a list of urls for web scraping with BeautifulSoup

I want to extract some information off websites with URLs of the form:
http://www.pedigreequery.com/american+pharoah
where "american+pharoah" is the extension for one of many horse names.
I have a list of the horse names I'm searching for, I just need to figure out how to plug the names in after "http://www.pedigreequery.com/"
This is what I currently have:
import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)
import requests
from bs4 import BeautifulSoup
for i in rows: # Number of pages plus one
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)
When I print out the url, it doesn't have the horse's name at the end, just the base URL in quotes. The letters/print statement at the end is just there to check whether it's actually reaching the website.
This is how I've seen it done for looping URLs that change by numbers at the end- I haven't found advice on URLs that change by characters.
Thanks!
You are missing the placeholder in your format string, so change it to:
url = "http://www.pedigreequery.com/{}".format(i)
                                    ^^
                                    # add placeholder
Also, rows=list(allhorses) gives you a list of lists at best, so you would be passing a list rather than a string (the horse name). If you have one horse per line, just open the file normally and iterate over the file object, stripping the newline.
Presuming one horse name per line, the whole working code would be:
import requests
from bs4 import BeautifulSoup
with open("HORSES.csv") as f:
    for horse in map(str.strip, f):  # one horse name per line, newline stripped
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        letters = soup.find_all("a", class_="horseName")
        print(letters)
If you have multiple horses per line you can use the csv lib but you will need an inner loop:
import csv
with open("HORSES.csv") as f:
    for row in csv.reader(f):
        # each row is a list of horse names
        for horse in row:
            url = "http://www.pedigreequery.com/{}".format(horse)
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")
            letters = soup.find_all("a", class_="horseName")
            print(letters)
Lastly, if the names are not stored in the right format, you have a few options, the simplest of which is to split the name and create the query manually:
url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
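The standard library can also do the encoding. Below is a minimal sketch using urllib.parse.quote_plus, which turns spaces into '+' and percent-encodes other special characters. horse_url is a hypothetical helper name, and whether the site accepts percent-encoded characters such as apostrophes is an assumption that would need checking.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.pedigreequery.com/"


def horse_url(name):
    """Build the query URL for one horse name, encoding spaces as '+'."""
    # Collapse repeated or surrounding whitespace before encoding.
    normalized = " ".join(name.split())
    return BASE_URL + quote_plus(normalized)


for horse in ["american pharoah", "  american   pharoah "]:
    print(horse_url(horse))
```

Both inputs give the same URL, so stray spaces in the CSV file do not produce different requests.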

Python 2.7 : Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = 'http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0'
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse the times/announcements into a list. I've tried multiple methods, but it just keeps getting messier.
What do I do?
The idea is to find all td elements with events class, then read div elements inside:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print data
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
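In the output above, the announcement name and the time are fused into one string and the time contains a non-breaking space (\xa0). If they are needed separately, a regex can split them. This is a sketch written for Python 3, not the Python 2.7 of the question, and it assumes every entry ends with a time such as "8:30 AM" followed by "ET", as in the three sample strings. split_event is a hypothetical helper name. A name that itself ends in a digit would be ambiguous under this pattern.

```python
import re

# Name (as short as possible), then a time like '8:30 AM', then 'ET' at the end.
# In Python 3, \s also matches the non-breaking space \xa0.
EVENT_RE = re.compile(r"^(?P<name>.*?)(?P<time>\d{1,2}:\d{2}\s*[AP]M)\s*ET$")


def split_event(text):
    """Split 'Name8:30 AM\\xa0ET' into ('Name', '8:30 AM'); time is None if absent."""
    match = EVENT_RE.match(text.strip())
    if match is None:
        return text.strip(), None
    time = " ".join(match.group("time").split())  # normalize inner whitespace
    return match.group("name").strip(), time


samples = [
    "Gallup US Consumer Spending Measure8:30 AM\xa0ET",
    "4-Week Bill Announcement11:00 AM\xa0ET",
    "3-Month Bill Auction11:30 AM\xa0ET",
]
events = [split_event(s) for s in samples]
print(events)
```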
