How to make a list out of a BS4 output - python

I have this code right now:
from bs4 import BeautifulSoup
import requests
get = requests.get("https://solmfers-minting-site.netlify.app/")
soup = BeautifulSoup(get.text, 'html.parser')
for i in soup.find_all('script'):
    print(i.get('src'))
And I need to somehow turn the output into a list and remove the None values from it since it outputs it like this:
jquery.js
nicepage.js
None
None
/static/js/2.c20455e8.chunk.js
/static/js/main.87864e1d.chunk.js

Just append your extracted values to a list.
result = []
for i in soup.find_all('script'):
    elem = i.get('src')
    if elem is not None:
        result.append(elem)
Or using a list comprehension:
result = [x['src'] for x in soup.find_all('script') if x.get('src') is not None]

You're close to your goal, but select your elements more specifically and append the src to a list while iterating over your ResultSet:
data = []
for i in soup.find_all('script', src=True):
    data.append(i.get('src'))
Alternative with CSS selectors:
for i in soup.select('script[src]'):
    data.append(i.get('src'))
And, as already mentioned, with a list comprehension:
[i.get('src') for i in soup.select('script[src]')]
Output
['jquery.js', 'nicepage.js', '/static/js/2.c20455e8.chunk.js', '/static/js/main.87864e1d.chunk.js']
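For anyone who wants to try both approaches without hitting the site, here is a minimal sketch on an inline HTML string. The markup and the file names in it are made-up stand-ins for the real page, so treat it as an illustration rather than the actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two inline scripts have no src.
html = """
<html><head>
<script src="jquery.js"></script>
<script src="nicepage.js"></script>
<script>console.log("inline");</script>
<script>var x = 1;</script>
<script src="/static/js/main.chunk.js"></script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter afterwards: .get('src') returns None for tags without the attribute.
filtered = [s.get("src") for s in soup.find_all("script") if s.get("src") is not None]

# Filter at selection time: only <script> tags that carry a src attribute match.
selected = [s["src"] for s in soup.select("script[src]")]

print(filtered)
print(selected)
```

Both lists should come out the same; the selector version simply never sees the inline scripts.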

Related

Trying to isolate URL suffixes from list of href tags

I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
<a href="/scp-2756">SCP-2756</a>,
<a href="/scp-002">SCP-002</a>,
<a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re
def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    #Searches HTML for text containing "SCP-" and href tags containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-",))
    param = "SCP-" + str(SkateP(x)) #SkateP takes int and inserts an appropriate number of 0's.
    for i in ref: #Below function is for sorting out references to the article being searched
        if str(param) in i:
            ref.remove(i)
    if ref != []:
        print(ref)
The main idea I've tried to use is finding every item that contains items in quotations, but obviously that just returned the same list. What I want to be able to do is select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, change the initial code to only extract the href content in quotations to the list.
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
If I understand correctly, you want to extract the href attribute - for that, you can use i.get('href') (or probably even just i['href']).
With .select and list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
If you want the parent url attached:
root_url = 'https://PARENT-URL.com' ## replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
    (a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
If str(param) was 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
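Note that going through set() drops duplicates but does not keep the links in page order. If order matters, something along these lines might work. This is only a sketch: the HTML, the example.com root URL, and the param value are placeholders, and it swaps plain string concatenation for urllib.parse.urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page fragment; the real wiki markup will differ.
html = """
<div>
<a href="/scp-1512">SCP-1512</a>
<a href="/scp-2756">SCP-2756</a>
<a href="/scp-002">SCP-002</a>
<a href="/scp-002">SCP-002</a>
<a href="/about">About</a>
</div>
"""
root_url = "https://example.com"  # placeholder, replace with the actual parent url
param = "SCP-002"                 # assumed name of the article being searched

crawl = BeautifulSoup(html, "html.parser")

seen = set()
scp_links = []
for a in crawl.select('a[href*="scp-"]'):
    href, text = a["href"], a.get_text(strip=True)
    # Skip non-article links, the article itself, and hrefs already collected.
    if "SCP-" not in text or param in text or href in seen:
        continue
    seen.add(href)
    scp_links.append(urljoin(root_url, href))

print(scp_links)
```

The seen set does the de-duplication while the list keeps the order in which the links appear on the page.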

Recording web scraping data

Hi everyone. Although I got the data I was looking for in text format, it simply doesn't work when I try to store it in a list or convert it into a dataframe. What I got was a huge list that contains only one value, which is the last text line of the data, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)
lista=[]
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)
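Your iterators are wrong: you're using i in the last loop while appending t.text. At that point t still refers to the last element of the earlier loop, so every iteration appends the same text.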
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
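If the goal is a dataframe, it may help to keep the row structure instead of one flat list. Below is a rough sketch on a tiny inline table; the tickers and most numbers are invented placeholders (only '9.054.333,18' comes from the question), and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the table; only the last value is taken from the question.
html = """
<table>
<tr><td>ABCD3</td><td>1.000</td><td>12.345,67</td></tr>
<tr><td>EFGH4</td><td>2.500</td><td>9.054.333,18</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Flat list: one entry per <td>, using the loop variable itself.
lista = [td.get_text(strip=True) for td in soup.find_all("td")]

# Nested list: one inner list per <tr>, which keeps the columns aligned.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

print(lista)
print(rows)
```

A list of rows like this can be passed straight to pandas.DataFrame(rows) if pandas is available.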

Beautifulsoup multiple div content to dictionary

I am trying to get the contents of two divs into a dictionary in Python. The main problem is that I'm able to fetch the content of the first div and of the second, but not in the right key:value manner. I'm only able to get the keys back. I know that I need to iterate through the content, but I can't seem to get my for loop right.
The approaches in 1 and 2 did not get me the result I'm looking for.
This is what I've tried so far:
from bs4 import BeautifulSoup
import requests
url='https://www.samenvoordeklant.nl/arbeidsmarktregios'
base=requests.get(url, timeout=15)
html=BeautifulSoup(base.text, 'lxml')
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
for regio in regios:
    print({regio.get_text(strip=True)})
The result:
{'Achterhoek'}
{'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost Gelre, Oude IJsselstreek, Winterswijk'}
{'Amersfoort'}
{'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, Woudenberg'}
etc.
The result I'm after is:
{'Achterhoek':'Aalten', 'Berkelland', 'Bronckhorst', 'Doetinchem', 'Montferland', 'Oost Gelre', 'Oude IJsselstreek', 'Winterswijk'}
{'Amersfoort':'Amersfoort', 'Baarn', 'Bunschoten', 'Leusden', 'Nijkerk', 'Soest', 'Woudenberg'}
etc. This allows me to move it afterwards into a pandas dataframe more easily.
An easy way is to use dict and zip on the two lists. Note that I have used faster CSS selectors and avoided matching on the full multi-valued class attribute.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.samenvoordeklant.nl/arbeidsmarktregios')
soup = bs(r.content, 'lxml')
result = dict(zip([i.text for i in soup.select('h2 a')], [i.text for i in soup.select('.field--type-string-long')]))
print(result)
# result = {k:v.split(', ') for k, v in result.items()} ##add this line at end if want list as value rather than string
If you want a list as the value you can simply add a last line of:
result = {k:v.split(', ') for k, v in result.items()}
This is one approach using zip.
Ex:
from pprint import pprint
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
result = {key.get_text(strip=True): value.get_text(strip=True) for key, value in zip(regios[0::2], regios[1::2])}
pprint(result)
Output:
{'Achterhoek': 'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost '
'Gelre, Oude IJsselstreek, Winterswijk',
'Amersfoort': 'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, '
'Woudenberg',......
If you need the value as a list of items
Use:
result = {key.get_text(strip=True): [i.strip() for i in value.get_text(strip=True).split(",")] for key, value in zip(regios[0::2], regios[1::2])}
Output:
{'Achterhoek': ['Aalten',
'Berkelland',
'Bronckhorst',
'Doetinchem',
'Montferland',
'Oost Gelre',
'Oude IJsselstreek',
'Winterswijk'],
'Amersfoort': ['Amersfoort',
'Baarn',
'Bunschoten',....
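To see the zip idea in isolation, here is a small offline sketch. The HTML is a simplified guess at the page structure (class names shortened from the ones in the question), so the selectors are assumptions that would need checking against the live page.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: a title div followed by a municipalities div.
html = """
<div class="field field--name-node-title"><h2><a href="#">Achterhoek</a></h2></div>
<div class="field field--name-field-gemeenten">Aalten, Berkelland, Bronckhorst</div>
<div class="field field--name-node-title"><h2><a href="#">Amersfoort</a></h2></div>
<div class="field field--name-field-gemeenten">Amersfoort, Baarn</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select each kind of div separately by a single class, then pair them up.
titles = soup.select(".field--name-node-title")
towns = soup.select(".field--name-field-gemeenten")

result = {
    title.get_text(strip=True): [name.strip() for name in town.get_text().split(",")]
    for title, town in zip(titles, towns)
}

print(result)
```

zip stops at the shorter of the two lists, so if the counts ever differ, the extra entries are silently dropped; comparing len(titles) and len(towns) first is a cheap sanity check.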

Returns a list with only 20 entries. Does not go beyond that

#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]
#extracting all the answers and putting into a list
finalans=[]
l=0
for i in chunk:
    stri=chunk[l]
    finalans.append(stri.text)
    l+=1
    continue
final_string = '\n'.join(finalans)
#final output
print(final_string)
I am not able to get more than 20 entries into this list. What is wrong with this code? (I am a beginner and I have used some references to write this program)
Edit: I have added the URL I want to scrape.
You try to break ans into smaller chunks, but notice that each iteration of this loop discards the previous content of chunk, so you lose all but the last chunk of data.
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100] # overwrites previous chunk
This is why you only get 20 items in the list: it's only the final chunk. Since you want final_string to hold all of the text nodes, there is no need to chunk, so I just removed it.
Next, and this is just tightening up the code, you don't need to both iterate the values of the list and track an index just to get the same value you are indexing. Working on ans because we are no longer chunking,
finalans=[]
l=0
for i in ans:
    stri=ans[l]
    finalans.append(stri.text)
    l+=1
    continue
becomes
finalans=[]
for item in ans:
    finalans.append(item.text)
or more succinctly
finalans = [item.text for item in ans]
So the program is
#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#extracting all the answers and putting into a list
finalans = [item.text for item in ans]
final_string = '\n'.join(finalans)
#final output
print(final_string)
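The overwriting effect is easy to reproduce without any scraping. In this sketch a plain list of 250 integers stands in for ans (an assumption purely for illustration), and the second half shows one way to keep every chunk if chunking were actually wanted.

```python
# Stand-in for the list of matched tags.
ans = list(range(250))

# Overwriting: each pass rebinds chunk, so only the final slice survives the loop.
for i in range(0, len(ans), 100):
    chunk = ans[i:i + 100]

print(len(chunk))  # size of the last slice only

# Accumulating: collect every slice in a list of chunks instead.
chunks = [ans[i:i + 100] for i in range(0, len(ans), 100)]

# Flattening the chunks again gives back the original list.
flattened = [item for part in chunks for item in part]

print(len(chunks), len(flattened))
```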

Parsing different bs4.element.Tag with BeautifulSoup

I want to parse the table in this url and export it as a csv:
http://www.bde.es/webbde/es/estadis/fi/ifs_es.html
If I do this:
sauce = urlopen(url_bank).read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
and then this:
resto = soup.find_all('td')
lista_text = []
for elements in resto:
    lista_text = lista_text + [elements.string]
I get all the elements parsed correctly except the last column, 'Códigos Isin', because there is a line break (<br/>) in the HTML code of those cells. I do not know how to handle it. I have tried this, but it still does not work:
lista_text = lista_text + [str(elements.string).replace('<br/>','')]
After that I convert the list to a np.array and then to a dataframe to export it as .csv. That part is already done; I only have to fix this issue.
Thanks in advance!
It's just that you need to be careful about what .string does: if a tag has multiple children, it returns None, as is the case with <br>:
"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None"
Use .get_text() instead:
for elements in resto:
    lista_text = lista_text + [elements.get_text(strip=True)]
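The difference is easy to check on a tiny snippet. In the sketch below the bank name and the ISIN-like codes are invented placeholders; it also shows that get_text(strip=True) glues the two codes together unless a separator is passed, which may matter for the CSV export.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell holds two codes separated by <br/>.
html = "<table><tr><td>Banco A</td><td>ES0000000001<br/>ES0000000002</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# One child only: .string returns that text.
print(cells[0].string)

# Text, <br/>, text: three children, so .string is None.
print(cells[1].string)

# get_text() walks all descendants; the separator keeps the codes apart.
joined = cells[1].get_text(separator=" ", strip=True)

# stripped_strings yields each text fragment on its own.
codes = list(cells[1].stripped_strings)

print(joined)
print(codes)
```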
