I would need to get all the information from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from symbol " to letter Z.
Then:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")
url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)
lists_A = []
for url in url_list:
    lists_A.append(url)
print(lists_A)
However this code collects more information than what I would need.
In particular, the last item I should collect is La Zanzara. Ideally, no item should keep the word in brackets, i.e. (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some (though not all) of the unwanted URLs: basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are in order at this point. Since you know that "La Zanzara" should be the last expected URL, you can get the position of that string in your new list and slice up to that index + 1:
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
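An alternative that avoids slicing by position is to restrict the search to the block that holds the category members. The sketch below assumes the page follows the usual MediaWiki layout, where that list sits inside an element with id "mw-pages". The HTML is a made-up stand-in for the real page, so the snippet runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the category page
html = """
<div id="mw-navigation"><a href="/wiki/Pagina_principale" title="Pagina principale">Home</a></div>
<div id="mw-pages">
  <ul>
    <li><a href="/wiki/Abitare" title="Abitare">Abitare</a></li>
    <li><a href="/wiki/La_zanzara_(periodico)" title="La zanzara (periodico)">La zanzara (periodico)</a></li>
  </ul>
</div>
<div id="catlinks"><a href="/wiki/Categoria:Periodici" title="Categoria:Periodici">Periodici</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Only anchors inside list items of the #mw-pages block are kept
links = [a["href"] for a in soup.select("#mw-pages li a[href]")]
print(links)
```

If the real page is laid out differently, the selector would need adjusting after inspecting the HTML.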
As for removing '(periodico)' and other data cleaning, you need to inspect your data and figure out what you want to remove.
Maybe write a simple function like this:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        # str.replace returns a new string and is a no-op when s is absent
        string = string.replace(s, '')
    return string
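If listing every suffix by hand becomes tedious, a regular expression can strip any single trailing parenthetical. This is only a sketch: title_from_href is a hypothetical helper name, and the sample hrefs are illustrative rather than taken from the live page.

```python
import re
from urllib.parse import unquote

def title_from_href(href):
    """Turn '/wiki/Jack_(periodico)' into 'Jack'."""
    # Keep what follows '/wiki/', decode percent-escapes, restore spaces
    name = unquote(href.split('/wiki/', 1)[-1]).replace('_', ' ')
    # Drop one trailing parenthetical such as ' (periodico)' or ' (quotidiano)'
    return re.sub(r'\s*\([^()]*\)$', '', name)

sample = [
    '/wiki/Abitare',
    '/wiki/Jack_(periodico)',
    '/wiki/Alto_Adige_(quotidiano)',
    '/wiki/La_zanzara_(periodico)',
]
titles = [title_from_href(h) for h in sample]
print(titles)  # ['Abitare', 'Jack', 'Alto Adige', 'La zanzara']
```

Note that this would also remove a parenthetical that is genuinely part of a title, so it is worth checking the results by eye.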
I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
<a href="/scp-2756">SCP-2756</a>,
<a href="/scp-002">SCP-002</a>,
<a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re

def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    # Searches HTML for text containing "SCP-" and href tags containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-"))
    param = "SCP-" + str(SkateP(x))  # SkateP takes int and inserts an appropriate number of 0's.
    # Sort out references to the article being searched; build a new list
    # instead of removing items from ref while iterating over it
    ref = [i for i in ref if param not in i]
    if ref:
        print(ref)
The main idea I've tried to use is finding every item that contains items in quotations, but obviously that just returned the same list. What I want to be able to do is select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, change the initial code to only extract the href content in quotations to the list.
If I understand correctly, you want to extract the href attribute - for that, you can use i.get('href') (or probably even just i['href']).
With .select and list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
If you want the parent url attached:
root_url = 'https://PARENT-URL.com' ## replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
if str(param) was 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
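Because set() does not keep insertion order, the links above can come back in any order. A minimal sketch of an order-preserving variant follows, assuming dict.fromkeys is acceptable for de-duplication; the (href, text) pairs are hypothetical stand-ins for what crawl.select(...) would yield, and root_url is the same placeholder as above.

```python
root_url = 'https://PARENT-URL.com'  # placeholder; replace with the actual parent url

# Hypothetical (href, text) pairs, including one duplicate and one non-article link
pairs = [
    ('/scp-1512', 'SCP-1512'),
    ('/scp-2756', 'SCP-2756'),
    ('/scp-002', 'SCP-002'),
    ('/scp-1512', 'SCP-1512'),
    ('/scp-004', 'SCP-004'),
    ('/scp-series', 'SCP Series'),
]
param = 'SCP-002'  # the article being searched

# dict.fromkeys drops duplicates while keeping first-seen order
unique_pairs = dict.fromkeys(pairs)
scp_links = [
    root_url + href
    for href, text in unique_pairs
    if 'SCP-' in text and param not in text
]
print(scp_links)
```

With these sample pairs, the result keeps the page order: scp-1512, scp-2756, scp-004.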
I want to turn my result (the top Twitter trends) into a list. Later I will use the list items as queries in Google News. Can anyone tell me how to make my result a list and, secondly, how to use the list items as separate queries in Google News? (I just need to know how to do this; I already have code.)
Here is my code:
import requests
from bs4 import BeautifulSoup

url = "https://trends24.in/pakistan"
req = requests.get(url)
html = req.content  # not named `re`, which would shadow the regex module
soup = BeautifulSoup(html, "html.parser")
top_trends = soup.find_all("li", class_="")
top_trends1 = soup.find("a", {"target": "tw"})
for result in top_trends[0:10]:
    print(result.text)
the output is:
#JusticeForUsamaNadeemSatti25K
#IslamabadPolice10K
#promotemedicalstudents51K
#ArrestSheikhRasheed
#MWLHighlights202014K
Sahiwal
Deport Infidel Yasser Al-Habib
BOSS LADY RUBINA929K
Sheikh Nimr
G-10 Srinagar Highway
Thank you in advance.
To make a new list, do
newlist = []
for result in top_trends[0:10]:
    newlist.append(result.text)
or via list comprehension
newlist = [result.text for result in top_trends[0:10]]
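For the second part of the question, each list item can be URL-encoded and dropped into a search URL. The sketch below assumes a Google News search URL of the form shown in SEARCH_URL; that pattern is an assumption, not something confirmed in this thread, and the sample trends are copied from the output above.

```python
from urllib.parse import quote_plus

# A few trends from the output above, standing in for newlist
newlist = ['#ArrestSheikhRasheed', 'Sahiwal', 'Sheikh Nimr']

# Assumed URL pattern; adjust to whatever your existing Google News code expects
SEARCH_URL = 'https://news.google.com/search?q={}'

# quote_plus escapes '#' as %23 and turns spaces into '+'
query_urls = [SEARCH_URL.format(quote_plus(trend)) for trend in newlist]
for query_url in query_urls:
    print(query_url)
```

Each URL can then be passed to the existing request code, one query per trend.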
I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that have a comma, i.e. 100,000 or 12,000,000, but not 450, for example. The goal is to find the location of the comma-separated numbers within a string; then I need to extract the entire sentence they are in.
I moved the entire scrape to a string list and within that list I want to extract all numbers that have a comma.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)
strings = []
for i in range(len(text)):
    text_s = str(text[i])  # convert each text node to a plain str
    strings.append(text_s)
I thought about the following re code, but I am not sure it will extract all instances, i.e. within the list there may be multiple instances of numbers separated by commas.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re
url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")
for el in soup.find_all(True):  # loop over every element in the page
    if re.search(r"(?=\d+,\d+).*", el.text):
        print(el.text)
        # print("END OF ELEMENT\n")  # debug only
If you simply want to check if a number has a comma or not, and you want to extract it if it does, then you could try the following.
new = []
for i in text:
    if ',' in i:
        new.append(i)
This will append all the elements in the 'text' collection that contain a comma, even if the exact same element is repeated multiple times.
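Note that a plain ',' test also matches ordinary punctuation. A stricter sketch, assuming the numbers use three-digit comma grouping and that sentences end with '.', '!' or '?' followed by whitespace, could look like this. The sample text is made up for illustration.

```python
import re

# One to three digits followed by one or more ',ddd' groups, e.g. 100,000 or 12,000,000
NUMBER = re.compile(r'\b\d{1,3}(?:,\d{3})+\b')

# Made-up sample text standing in for one scraped string
text = ("Revenue was 12,000,000 in 2020. Only 450 units shipped. "
        "The bonus pool held 100,000 shares.")

# Naive sentence split on terminal punctuation followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)

# Keep each sentence that has at least one comma-grouped number, with all its matches
hits = [(s, NUMBER.findall(s)) for s in sentences if NUMBER.search(s)]
for sentence, numbers in hits:
    print(numbers, '->', sentence)
```

findall returns every match in a sentence, so several comma-separated numbers in the same sentence are all captured.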
I want to parse the table in this url and export it as a csv:
http://www.bde.es/webbde/es/estadis/fi/ifs_es.html
if i do this:
sauce = urlopen(url_bank).read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
and then this:
resto = soup.find_all('td')
lista_text = []
for elements in resto:
    lista_text = lista_text + [elements.string]
I get all the elements parsed correctly except the last column, 'Códigos Isin',
and this is because there is a line break, '<br/>', in the HTML code. I do not know
what to do with it. I have tried this, but it still does not work:
lista_text = lista_text + [str(elements.string).replace('<br/>','')]
After that I convert the list to an np.array and then to a DataFrame to export it as .csv. That part is already done; I only have to fix this issue.
Thanks in advance!
It's just that you need to be careful about what .string does - if there are multiple children elements, it would return None - as in the case with <br>:
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None
Use .get_text() instead:
for elements in resto:
    lista_text = lista_text + [elements.get_text(strip=True)]
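To see the difference on a small case, here is a sketch with a made-up two-cell row in which the second cell contains a <br/>, as the 'Códigos Isin' column does. The cell contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up row: the second cell holds two codes separated by a line break
html = "<table><tr><td>Banco A</td><td>ES0001<br/>ES0002</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is None for the second cell because it has more than one child
strings = [td.string for td in cells]

# get_text joins all text pieces; the first argument is the separator between them
texts = [td.get_text(" ", strip=True) for td in cells]

print(strings)  # ['Banco A', None]
print(texts)    # ['Banco A', 'ES0001 ES0002']
```

Passing a separator keeps the two codes apart; with get_text(strip=True) alone they would be joined as 'ES0001ES0002'.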
I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that teams ranks from all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong because I'm getting 0
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)
total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + int(rank)
print(total_rank)
Check out that link to double check I have the correct class specified. I have a feeling the problem might be in the first for loop where I select an li tag then select all li tags within that first tag, I dunno.
I don't use Python so I'm unfamiliar with any debugging tools. So if anyone wants to forward me to one of those that would be great!
First, the team stats and player stats sections are each contained in a <div class="column large-2">. The team stats are in the first occurrence. Then you can find all of the tags with an href attribute within it. I've combined both steps in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
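These hrefs are relative, so they need to be joined to the site root before they can be passed to requests.get. A minimal sketch using urllib.parse.urljoin, with two of the links from the sample output above:

```python
from urllib.parse import urljoin

base_url = 'https://www.teamrankings.com/ncb/stats/'

# Two of the relative links from the sample output
links = [
    '/ncaa-basketball/stat/points-per-game',
    '/ncaa-basketball/stat/average-scoring-margin',
]

# A leading '/' makes urljoin resolve against the site root, not the /ncb/stats/ path
full_links = [urljoin(base_url, link) for link in links]
print(full_links)
```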
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up empty. The loop over stat_links therefore never runs, and total_rank is never incremented. I suggest you experiment with the way you find the list elements.
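On the class question: in a CSS selector, several classes on the same element are chained with dots, while a space means "descendant element". The sketch below uses a made-up table to show the difference. The class names are taken from the question, but the markup itself is hypothetical and the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for one stat page
html = """
<table class="tr-table datatable scrollable">
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>3</td><td>Oklahoma</td></tr>
  <tr><td>7</td><td>Kansas</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Spaces look for <datatable> and <scrollable> elements nested inside .tr-table: no match
wrong = soup.select(".tr-table datatable scrollable tr")

# Dots require all the classes on the same <table> element
rows = soup.select("table.tr-table.datatable.scrollable tr")

total_rank = 0
for row in rows:
    cells = row.find_all("td")
    # The header row has <th> cells only, so skip rows without two <td> cells
    if len(cells) >= 2 and cells[1].get_text(strip=True) == "Oklahoma":
        total_rank += int(cells[0].get_text(strip=True))

print(len(wrong), len(rows), total_rank)  # 0 3 3
```

Checking len(cells) first also avoids an IndexError on the header row.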