I'm trying to get the contents of two divs into a dictionary in Python. The main problem is that I'm able to fetch the content of the first div and the second, but not in the right key:value manner; I only get the keys back. I know that I need to iterate through the content, but I can't seem to get my for loop correct.
Following other answers (1 and 2) hasn't gotten me the result I'm looking for.
This is what I've tried so far:
from bs4 import BeautifulSoup
import requests
url='https://www.samenvoordeklant.nl/arbeidsmarktregios'
base=requests.get(url, timeout=15)
html=BeautifulSoup(base.text, 'lxml')
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
for regio in regios:
    print({regio.get_text(strip=True)})
The result:
{'Achterhoek'}
{'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost Gelre, Oude IJsselstreek, Winterswijk'}
{'Amersfoort'}
{'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, Woudenberg'}
etc.
The result I'm after is:
{'Achterhoek': ['Aalten', 'Berkelland', 'Bronckhorst', 'Doetinchem', 'Montferland', 'Oost Gelre', 'Oude IJsselstreek', 'Winterswijk']}
{'Amersfoort': ['Amersfoort', 'Baarn', 'Bunschoten', 'Leusden', 'Nijkerk', 'Soest', 'Woudenberg']}
etc. This allows me to move it afterwards into a pandas dataframe more easily.
An easy way is dict() with zip() of the two lists. Note that I've used faster CSS selectors and avoided matching on the full multi-valued class strings.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.samenvoordeklant.nl/arbeidsmarktregios')
soup = bs(r.content, 'lxml')
result = dict(zip([i.text for i in soup.select('h2 a')], [i.text for i in soup.select('.field--type-string-long')]))
print(result)
# result = {k: v.split(', ') for k, v in result.items()}  # add this line at the end if you want a list as the value rather than a string
If you want a list as the value, you can simply add a final line:
result = {k:v.split(', ') for k, v in result.items()}
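Since the question mentions moving the result into a pandas DataFrame afterwards, here is a minimal sketch of that last step (the column names are my own placeholders):

import pandas as pd

# two columns built from the scraped dict; column names are illustrative
df = pd.DataFrame(list(result.items()), columns=['regio', 'gemeenten'])
print(df.head())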
This is one approach using zip.
Ex:
from pprint import pprint

regios = html.find_all('div', attrs={'class': ['field field--name-node-title field--type-ds field--label-hidden field__item animated', 'field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
result = {key.get_text(strip=True): value.get_text(strip=True) for key, value in zip(regios[0::2], regios[1::2])}
pprint(result)
Output:
{'Achterhoek': 'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost '
'Gelre, Oude IJsselstreek, Winterswijk',
'Amersfoort': 'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, '
'Woudenberg',......
If you need the value as a list of items, use:
result = {key.get_text(strip=True): [i.strip() for i in value.get_text(strip=True).split(",")] for key, value in zip(regios[0::2], regios[1::2])}
Output:
{'Achterhoek': ['Aalten',
'Berkelland',
'Bronckhorst',
'Doetinchem',
'Montferland',
'Oost Gelre',
'Oude IJsselstreek',
'Winterswijk'],
'Amersfoort': ['Amersfoort',
'Baarn',
'Bunschoten',....
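And if the eventual pandas step is the goal here too, the list-valued dict maps over naturally to a long-format frame (a sketch; the column names are mine):

import pandas as pd

# one row per (regio, gemeente) pair, built from the list-valued dict above
df = pd.DataFrame([(k, g) for k, gemeenten in result.items() for g in gemeenten],
                  columns=['regio', 'gemeente'])
print(df.head())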
I have this code right now:
from bs4 import BeautifulSoup
import requests
get = requests.get("https://solmfers-minting-site.netlify.app/")
soup = BeautifulSoup(get.text, 'html.parser')
for i in soup.find_all('script'):
    print(i.get('src'))
And I need to somehow turn the output into a list and remove the None values from it since it outputs it like this:
jquery.js
nicepage.js
None
None
/static/js/2.c20455e8.chunk.js
/static/js/main.87864e1d.chunk.js
Just append your extracted values to a list.
result = []
for i in soup.find_all('script'):
    elem = i.get('src')
    if elem is not None:
        result.append(elem)
Or using a list comprehension:
result = [x['src'] for x in soup.find_all('script') if x.get('src') is not None]
You're near to your goal, but select your elements more specifically and append the src to a list while iterating over your ResultSet:
data = []
for i in soup.find_all('script', src=True):
    data.append(i.get('src'))
Alternative with css selectors:
for i in soup.select('script[src]'):
    data.append(i.get('src'))
And, as already mentioned, with a list comprehension:
[i.get('src') for i in soup.select('script[src]')]
Output
['jquery.js', 'nicepage.js', '/static/js/2.c20455e8.chunk.js', '/static/js/main.87864e1d.chunk.js']
Hi everyone. Although I got the data I was looking for in text format, when I try to record it as a list or convert it into a dataframe, it simply doesn't work. What I got was a huge list with only one item, which is the last text line of the data, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)

lista = []
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)
Your iterators are wrong -- you're using i in the last loop while appending t.text.
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
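If the goal is a dataframe rather than a flat list, a minimal sketch of that next step follows (the column name is an illustrative placeholder, since I don't know the table's real headers):

import pandas as pd

# one row per <td> cell; 'valor' is just a placeholder column name
df = pd.DataFrame(lista, columns=['valor'])
print(df.head())

Depending on the page's markup, pandas.read_html can often parse the page's tables into DataFrames directly as well.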
I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content from a script tag on that site. When I run the script, I get the names of the neighborhoods the right way. However, the problem is that I've used the line if not item.startswith("NY:"): continue to get rid of unwanted results from that page. I don't wish to use this hardcoded portion NY: to do the trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is I wish to get everything started with NY:New_York:. What I meant by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without hardcoding any search string within the script?
I'm not certain what your complete data set looks like, but based on your sample, you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve, then you could just use a variable for the state.
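A quick sketch of that variable approach (assuming only that the state prefix is known up front):

# keep the state prefix in one variable instead of hardcoding it inline
state = 'NY'
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap']
         if x.startswith(state + ':')]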
Another solution, using BeautifulSoup, which doesn't involve regex or hardcoding "NY:New_York", is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')
target = soup.find_all('script')[14]
content = target.text.replace('<!--', '').replace('-->', '')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.
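Since every key in those loops is compared against a known name, the nesting could arguably be collapsed into direct lookups; a sketch, assuming the JSON shape above holds:

# direct path to the same data, assuming the structure shown above
more_filters = js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']
for entry in more_filters:
    print(entry['title'])
    print(entry['sectionFilters'])
    print('---------------')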
I am building a simple program using Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I am able to correctly iterate through the different URLs (each URL is a song from this artist) and get the output that I want printed, I am struggling to store all the different songs in one single variable.
I've tried different approaches, trying to store it in a list, a dictionary, a dictionary inside a list, etc., but it didn't work out. I've also read the BeautifulSoup documentation and several forums without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class' : 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    d = {}
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class': 'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics', '')
        print("\n", title.upper(), "\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you can store data in one variable.
This can be JSON or similar, using a Python dictionary.
a = dict()
# We create an instance of a dict. Same as a = {}.
a[1] = 'one'
# This is how a basic dictionary works. There is a key and a value.
a[2] = {'two': 'the number 2'}
# Now our key is normal; however, our value is another dictionary.
print(a)
# This is how we access the dict inside the dict:
print(a[2]['two'])
# the key [2] gives us {'two': 'the number 2'}; [2]['two'] accesses the value inside it
You'll be able to apply this knowledge to your algorithm.
Use the album as the first key: all['Stay strong'] = {'some-song': 'text_heavy'}
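Applied to the scraper in the question, a minimal sketch of that idea (flattened to song titles as keys, since the album name isn't available in the loop; it reuses albums, requests, and BeautifulSoup from the question's code, and grabs just the first h1 for the title):

# one variable holding everything: {song_title: lyrics_text}
all_songs = {}
for a in albums.find_all('a', href=True, alt=True):
    page = BeautifulSoup(requests.get(a['href']).content, "html.parser")
    song_title = page.find('h1').text.replace('Lyrics', '')
    verses = page.find_all('p', {'class': 'verse'})
    all_songs[song_title] = '\n'.join(p.text for p in verses)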
I also recommend making a function since you're re-using code.
For instance, the request and then the parsing with bs4:
def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
A good practice in software development is the so-called DRY principle (Don't Repeat Yourself), since readability counts, as opposed to WET (Waste Everyone's Time, Write Everything Twice).
Just something to keep in mind.
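For example, with that helper, the two fetch-and-parse steps in the question's code could collapse to something like this sketch (it assumes requests and BeautifulSoup are imported as in the question):

# the album list page and each song page go through the same helper
soup = parser("http://www.metrolyrics.com/notorious-big-albums-list.html")
albums = soup.find("div", {'class': 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    song_page = parser(a['href'])
    # ...extract the title and verses from song_page as before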
I made it!!
I wasn't able to store the output in a variable, but I was able to write a txt file storing all the content, which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup
with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class': 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class': 'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics', '')
            f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
# no explicit f.close() needed: the with block closes the file automatically
I would still love to hear if there are other better approaches.
Thanks!
I want to use BeautifulSoup to retrieve a specific URL at a specific position, repeatedly. You may imagine that there are 4 different URL lists, each containing 100 different URL links.
I need to get and print the 3rd URL on every list, where that URL (e.g. the 3rd URL on the first list) leads to the 2nd list (and then I need to get and print the 3rd URL there, and so on until the 4th retrieval).
Yet my loop only achieves the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop to continue the process.
Here is my code:
import urllib.request
import json
import ssl
from bs4 import BeautifulSoup
num = int(input('enter count times: '))
position = int(input('enter position: '))
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print(url)
count = 0
order = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order += 1
        if order == position:
            x = i.get('href')
            print(x)
            count += 1
            url = x
print('done')
This is a good problem for recursion. Try calling a recursive function to do this:
import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text, 'html.parser')
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)
and then call it with the desired parameters, in your case:
retrieve_urls_recur(url, 2, 0, 4)
2 is the index of the URL in the list of links, 0 is the starting counter, and 4 is how deep you want to recurse.
PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
Just get the link from find_all() by index:
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position].get('href')
    count += 1
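If you also want each hop printed, as in the question, a print can slot in right after the reassignment; a small self-contained sketch (note that position here is a 0-based index, so 2 means the 3rd link):

import ssl
import urllib.request
from bs4 import BeautifulSoup

url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
position = 2   # 0-based index of the link to follow (the 3rd link)
num = 4        # how many hops to make
count = 0

while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position].get('href')
    print(url)  # show the link selected at this hop
    count += 1
print('done')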