Adding text to results in an array using BeautifulSoup in Python

Anyone have any ideas how to prepend each item in an array with text before it's passed into the next loop?
Basically, I have found the links I'm after, but they don't contain the main site's URL, just the relative paths.
links = []
for link in soup.find_all("a", {"class": "product-info__caption"}):
    links.append(link.attrs['href'])
    # this returns the URLs okay as /products/item,
    # whereas I need https://www.example.com/products/item to pass into the next loop

for x in links:
    result = requests.get(x)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    Name = soup.find('h1', class_='product_name')
    # ... and so on

You can prepend 'https://www.example.com' in your first loop, for example:
links = []
for link in soup.find_all("a", {"class": "product-info__caption"}):
    links.append('https://www.example.com' + link.attrs['href'])

for x in links:
    # your next stuff here
    # ...
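If some of the scraped hrefs could ever be absolute URLs already, urllib.parse.urljoin is a slightly safer way to build the full address; a minimal sketch along the same lines as the answer above:
from urllib.parse import urljoin

base = 'https://www.example.com'
links = [urljoin(base, link.attrs['href'])
         for link in soup.find_all("a", {"class": "product-info__caption"})]
# urljoin resolves relative paths against base and leaves absolute URLs untouched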

Building on top of @Andrej Kesely's answer, I think you should use a list comprehension:
links = [
    "https://www.example.com" + link.attrs['href']
    for link in soup.find_all("a", {"class": "product-info__caption"})
]
List comprehensions are generally faster than the conventional for loop, largely because they avoid looking up and calling the list's append method on every iteration.
Note: every list comprehension can be turned into an equivalent for loop.
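For instance (a generic illustration; tags stands in for any iterable of elements):
titles = [t.get_text() for t in tags]

# ...unrolls to:
titles = []
for t in tags:
    titles.append(t.get_text())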
Further reading
Real Python has a detailed article about list comprehensions, and the official Python tutorial documents them as well.


List comprehension is returning "NoneType" TypeError for unknown reason

I'm trying to grab a specific string of a link address from a list of links that I retrieved from a webpage.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Grab table links using url
url = "https://www.epa.gov/automotive-trends/download-automotive-trends-report#Full Report"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

auto_rep = [x for x in links if 'report-tables.xlsx' in x][0]
The append loop works as intended, resulting in a list of links. However, the auto_rep assignment throws an error:
Traceback (most recent call last):
File "<ipython-input-3-77ab86ded43b>", line 19, in <module>
auto_rep = [x for x in links if 'report-tables.xlsx' in x][0]
File "<ipython-input-3-77ab86ded43b>", line 19, in <listcomp>
auto_rep = [x for x in links if 'report-tables.xlsx' in x][0]
TypeError: argument of type 'NoneType' is not iterable
I've used this exact format of list comprehension to do the same thing in other contexts, so I'm not sure what the issue is here.
Some of the links in your links list are None, because link.get('href') returns None for anchors that have no href attribute. During the list comprehension you check 'report-tables.xlsx' in x, and since x can be None, the in check throws the error.
The solution is to only add a link to the links list if it is not None; alternatively, you can filter inside the comprehension: [x for x in links if x is not None and 'report-tables.xlsx' in x]
Make sure that links doesn't have None values added to it. An easy way to do this in Python >= 3.8 is to use an assignment expression:
links = []
for link in soup.findAll('a'):
    if hrefs := link.get('href'):
        links.append(hrefs)
For earlier Python versions you can do:
links = []
for link in soup.findAll('a'):
    hrefs = link.get('href')
    if hrefs:
        links.append(hrefs)
auto_rep = [x for x in links if 'report-tables.xlsx' in str(x)][0]
Converting every value to a string also works: the in check cannot be applied to a None value, but str(None) is just the string 'None', so the test no longer raises.
Some of the links that it fetches do not have hrefs, so check that the href exists before appending it to links.
links = []
for link in soup.findAll('a'):
    if link.get('href'):
        links.append(link.get('href'))
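If you only want anchors that actually have an href, BeautifulSoup can also filter in one step with the href=True attribute filter; a small sketch:
links = [link.get('href') for link in soup.findAll('a', href=True)]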

How to make my result into a list of strings?

I want to make my result (which consists of the top Twitter trends) into a list. Later I will use these list items as queries in Google News. Can anyone tell me how to make my result a list and, secondly, how to use the list items as separate queries in Google News? (I just need to know how to do this; I already have the code.)
Here is my code:
url = "https://trends24.in/pakistan"
req = requests.get(url)
re = req.content
soup = BeautifulSoup(re, "html.parser")
top_trends = soup.findAll("li", class_ = "")
top_trends1 = soup.find("a", {"target" : "tw"})
for result in top_trends[0:10]:
    print(result.text)
the output is:
#JusticeForUsamaNadeemSatti25K
#IslamabadPolice10K
#promotemedicalstudents51K
#ArrestSheikhRasheed
#MWLHighlights202014K
Sahiwal
Deport Infidel Yasser Al-Habib
BOSS LADY RUBINA929K
Sheikh Nimr
G-10 Srinagar Highway
Thank you in advance.
To make a new list, do
newlist = []
for result in top_trends[0:10]:
    newlist.append(result.text)
or via a list comprehension:
newlist = [result.text for result in top_trends[0:10]]
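For the second part of the question, one way to use each trend as a Google News query is to URL-encode it into a search URL. A minimal sketch; the search endpoint below is an assumption, so adapt it to whatever news source or API you actually use:
from urllib.parse import quote_plus

for trend in newlist:
    # hypothetical search URL; quote_plus escapes spaces and '#' in the trend text
    query_url = "https://news.google.com/search?q=" + quote_plus(trend)
    print(query_url)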

Scraping lists of items from Wikipedia

I would need to get all the information from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from the symbol " to the letter Z.
Then:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")

url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []
for url in url_list:
    lists_A.append(url)
print(lists_A)
However, this code collects more information than I need.
In particular, the last item I should collect is La Zanzara. Also, the items should not contain any word in brackets, i.e. they should not contain (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some of the unwanted URLs (though not all of them): basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
import re

links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are ordered at this point, and since you know that "La Zanzara" should be the last expected URL, you can get the position of that particular string in your new list and slice up to that index + 1:
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
As for removing '(periodico)' and other data cleaning, you need to inspect your data and figure out what it is that you want to remove.
Write a simple function like this, maybe:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        if s in string:
            return string.replace(s, '')
    return string
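Applied to the sliced list, and also stripping the /wiki/ prefix and the URL underscores, something like this sketch would recover readable titles (unquote handles any percent-encoded characters):
from urllib.parse import unquote

titles = [unquote(clean(link)).replace('/wiki/', '').replace('_', ' ')
          for link in links]
print(titles[:5])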

Python 2 Beautiful Soup, get text from all tags

Trying to get the text from all tags that have the class task-topic-deprecated; however, I only seem to be able to get one.
This is not a duplicate of "BeautifulSoup get_text from find_all": that question uses multiple class names, so the working syntax is slightly different (class_ as opposed to attrs={'class': ...}).
Source page:
https://developer.apple.com/documentation/cfnetwork?language=objc
The output would be any string that is struck out on the page above:
CFFTPCreateParsedResourceListing
kCFFTPResourceGroup
...etc
find_next() doesn't seem to move to the next item the way I expect it to, and prints out the text I already have.
page = requests.get("https://developer.apple.com/documentation/cfnetwork?language=objc")
soup = BeautifulSoup(page.content, 'html.parser')
aRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).get_text()
print aRow
bRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).find_next().get_text()
print bRow
cRow = soup.find('a', attrs={'class':'task-topic-deprecated has-adjacent-element symbol-name'}).find_next().find_next().get_text()
print cRow
CFFTPCreateParsedResourceListing
CFFTPCreateParsedResourceListing
CFFTPCreateParsedResourceListing
Also tried putting it in a loop based on various things I have found on Stack Overflow, but it still only grabs one item, as per above.
Also tried with XPath, but this doesn't grab anything and prints out an empty list:
from lxml import html

tree = html.fromstring(page.content)
allItems = tree.xpath('//a[@class="task-topic-deprecated has-adjacent-element symbol-name"]/text()')
print allItems
I think you are doing it wrong: instead of find, use the find_all method. find always returns only the first match, no matter how many times you call it, so each of your soup.find(...) calls started from the top of the document again and landed on the same element; find_next() from there likely just stepped into a nested element carrying the same text. find_all returns every match:
for i in soup.find_all('a', class_='task-topic-deprecated has-adjacent-element symbol-name'):
    print i.get_text()
Maybe this could help.
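As a side note, matching a multi-class string with find_all only works when the class attribute's value is exactly that string, in that order. A CSS selector is order-independent; a small sketch of the same loop with select:
for i in soup.select('a.task-topic-deprecated.has-adjacent-element.symbol-name'):
    print i.get_text()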

Parsing different bs4.element.Tag with BeautifulSoup

I want to parse the table in this url and export it as a csv:
http://www.bde.es/webbde/es/estadis/fi/ifs_es.html
If I do this:
from urllib.request import urlopen
import bs4 as bs

url_bank = "http://www.bde.es/webbde/es/estadis/fi/ifs_es.html"
sauce = urlopen(url_bank).read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
and then this:
resto = soup.find_all('td')
lista_text = []
for elements in resto:
    lista_text = lista_text + [elements.string]
I get all the elements parsed correctly except the last column, 'Códigos Isin', and this is because there is a line break (a <br/> tag) in the HTML. I don't know what to do with it; I have tried this, but it still does not work:
lista_text = lista_text + [str(elements.string).replace('<br/>','')]
After that I take the list to a np.array and then to a DataFrame to export it as .csv. That part is already done; I only have to fix that issue.
Thanks in advance!
Thanks in advance!
It's just that you need to be careful about what .string does: if there are multiple child elements, it returns None, as is the case with <br>:
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None
Use .get_text() instead:
for elements in resto:
    lista_text = lista_text + [elements.get_text(strip=True)]
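If the <br/> separates several ISIN codes inside one cell and you want to keep them apart rather than run together, get_text also accepts a separator; a small sketch (the '; ' separator is an arbitrary choice):
for elements in resto:
    lista_text = lista_text + [elements.get_text(separator='; ', strip=True)]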
