I want to turn my result (which consists of the top Twitter trends) into a list. Later I will use the list items as queries in Google News. Can anyone tell me, first, how to make my result into a list, and second, how to use the list items as separate queries in Google News? (I just need to know how to do this; I already have code.)
Here is my code:
import requests
from bs4 import BeautifulSoup

url = "https://trends24.in/pakistan"
req = requests.get(url)
content = req.content
soup = BeautifulSoup(content, "html.parser")
top_trends = soup.findAll("li", class_="")
top_trends1 = soup.find("a", {"target": "tw"})
for result in top_trends[0:10]:
    print(result.text)
The output is:
#JusticeForUsamaNadeemSatti25K
#IslamabadPolice10K
#promotemedicalstudents51K
#ArrestSheikhRasheed
#MWLHighlights202014K
Sahiwal
Deport Infidel Yasser Al-Habib
BOSS LADY RUBINA929K
Sheikh Nimr
G-10 Srinagar Highway
Thank you in advance.
To make a new list, do:
newlist = []
for result in top_trends[0:10]:
    newlist.append(result.text)
or via a list comprehension:
newlist = [result.text for result in top_trends[0:10]]
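To then use each list item as a separate Google News query, one approach (a sketch; the exact search URL format is an assumption, so adjust it to whatever endpoint your existing code targets) is to URL-encode each trend and build a query URL:
from urllib.parse import quote_plus

# Assumption: the Google News search page accepts a "q" parameter.
for trend in newlist:
    query_url = "https://news.google.com/search?q=" + quote_plus(trend)
    print(query_url)  # or fetch it, e.g. requests.get(query_url)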
Does anyone have any ideas how to prepend each item in an array with text before it's passed into the next loop? Basically, I have found the links that I'm after, but they do not contain the main site's URL, just the child elements.
links = []
for link in soup.find_all("a", {"class": "product-info__caption"}):
    links.append(link.attrs['href'])
# this returns the URLs okay as /products/item,
# whereas I need https://www.example.com/products/item to pass into the next loop

for x in links:
    result = requests.get(x)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    Name = soup.find('h1', class_='product_name')
    # ... and so on
You can prepend 'https://www.example.com' in your first loop, for example:
links = []
for link in soup.find_all("a", {"class": "product-info__caption"}):
    links.append('https://www.example.com' + link.attrs['href'])

for x in links:
    # your next stuff here
    ...
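If the hrefs are sometimes absolute and sometimes relative, the standard library's urllib.parse.urljoin handles both cases; a small sketch:
from urllib.parse import urljoin

base = 'https://www.example.com'
links = [urljoin(base, link.attrs['href'])
         for link in soup.find_all("a", {"class": "product-info__caption"})]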
Building on top of @Andrej Kesely's answer, I think you should use a list comprehension:
links = [
    "https://www.example.com" + link.attrs['href']
    for link in soup.find_all("a", {"class": "product-info__caption"})
]
List comprehensions are generally faster than the equivalent for loop for building a list. This StackOverflow answer explains why list comprehensions are faster.
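You can check the speed claim yourself with the standard library's timeit module; a minimal sketch (exact numbers will vary by machine):
import timeit

def with_loop():
    out = []
    for i in range(1000):
        out.append(i * 2)
    return out

def with_comprehension():
    return [i * 2 for i in range(1000)]

# The comprehension usually wins by a modest margin.
print(timeit.timeit(with_loop, number=10000))
print(timeit.timeit(with_comprehension, number=10000))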
Note
Every list comprehension can be turned into a for loop.
Further reading
Real Python has an amazing article about them here.
Official Python documentation about list comprehensions can be found here.
I am having some problems trying to manipulate some strings here. I am scraping some data from a website and I am facing two challenges:
1. I am scraping unnecessary data, as the website I target has redundant class naming. My goal is to isolate this data and delete it so I can keep only the data I am interested in.
2. With the data kept, I need to split the string in order to store some information in specific variables.
So initially I was planning to use a simple split() function, store each new string in a list, and then play with it to keep the parts that I want. Unfortunately, every time I do this, I end up with three separate lists that I cannot manipulate/split.
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")

for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split())
    title_list = title.split(" | ")
    print(title_list)
Here is the "raw data" retrieved:
Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020
And here is what I would like to achieve:
Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020
Could you please let me know how to proceed here?
How about this? It's not so pretty, but it will work as long as there is always a "VS." and a "|" separating the names, and the date is always four digits for the year.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")

text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n", "")
# collapse runs of spaces into single spaces
while text.find("  ") > -1:
    text = text.replace("  ", " ")
text = text.strip()

# split by two parameters
split = [st.split("|") for st in text.split("VS.")]
# flatten the nested lists
flat_list = [item for sublist in split for item in sublist]
# extract the date from the end of the last item
flat_list.append(flat_list[-1][-4:])
# remove the date from the 3rd item
flat_list[2] = flat_list[2][:-4]
# strip any leading or trailing whitespace
final_list = [x.strip() for x in flat_list]
print(final_list)
Output:
['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
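Since the question asks for separate variables, you can unpack the result directly (this assumes final_list always has exactly four items, which holds as long as the "VS." and "|" separators are present):
player_1, player_2, tournament, year = final_list
print(player_1)    # Salvatore Caruso
print(player_2)    # Brandon Nakashima
print(tournament)  # Indian Wells
print(year)        # 2020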
I would need to get all the entries from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from the symbol " to the letter Z. That is:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")

url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []
for url in url_list:
    lists_A.append(url)
print(lists_A)
However, this code collects more information than I need.
In particular, the last item that I should collect is La Zanzara. Also, the items should not have any word in brackets, i.e. they should not contain (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some of the unwanted URLs (not all of them, though): basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
import re

links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are in order at this point, and since you know that "La Zanzara" should be the last expected URL, you can get the position of that particular string in your new list and slice up to that index + 1:
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
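Or, combining the two steps so you don't hard-code the index:
links = links[:links.index('/wiki/La_zanzara_(periodico)') + 1]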
As for removing (periodico) and other data cleaning, you need to inspect your data and figure out what it is that you want to remove.
Write a simple function like this, maybe:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        if s in string:
            return string.replace(s, '')
    return string
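A more general variant is a sketch with re.sub, under the assumption (mine, not the question's) that every trailing parenthesised qualifier should go and that you want plain titles rather than /wiki/ paths:
import re

def clean_generic(href):
    title = href.replace('/wiki/', '')
    title = re.sub(r'_\([^)]*\)$', '', title)  # drop a trailing _(...) qualifier
    return title.replace('_', ' ')

print(clean_generic('/wiki/Jack_(periodico)'))  # Jack
cleaned = [clean_generic(link) for link in links]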
I am building a simple program using Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I am able to correctly iterate through the different URLs (each URL is a song by this artist) and the output that I want is being printed, I am struggling to store all the different songs in one single variable.
I've tried different approaches, storing it in a list, a dictionary, a dictionary inside a list, etc., but it didn't work out. I've also read the BeautifulSoup documentation and several forums without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class': 'grid_8'})

for a in albums.find_all('a', href=True, alt=True):
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class': 'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics', '')
    print("\n", title.upper(), "\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you can store the data in one variable. This can be JSON or similar, using a Python dictionary.
a = dict()
# We create an instance of a dict. Same as a = {}.
a[1] = 'one'
# This is how a basic dictionary works. There is a key and a value.
a[2] = {'two': 'the number 2'}
# Now our key is normal; however, our value is another dictionary.
print(a)
# This is how we access the dict inside the dict:
print(a[2]['two'])
# the first key [2] gives us {'two': 'the number 2'}; [2]['two'] accesses the value inside it.
You'll be able to apply this knowledge to your algorithm.
Use the album as the first key, e.g. all_songs['Stay strong'] = {'some-song': 'text_heavy'}. (Prefer a name like all_songs over all, since all is a Python built-in.)
I also recommend making a function, since you're re-using code; for instance, the request and then the parsing with bs4:
def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
A good practice for software development is the so-called DRY principle (Don't Repeat Yourself), as opposed to WET (Waste Everyone's Time / Write Everything Twice); readability counts.
Just something to keep in mind.
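Putting the two ideas together, here is a sketch of how the scraping loop might store everything in one dictionary (the selectors are the ones from the question; the dict layout is just one reasonable choice):
import requests
from bs4 import BeautifulSoup

def parser(url):
    return BeautifulSoup(requests.get(url).text, 'html.parser')

lyrics = {}  # one variable holding every song
soup = parser("http://www.metrolyrics.com/notorious-big-albums-list.html")
albums = soup.find("div", {'class': 'grid_8'})

for a in albums.find_all('a', href=True, alt=True):
    page = parser(a['href'])
    title = page.find('h1').text.replace('Lyrics', '')  # assumes each song page has an h1
    verses = [p.text for p in page.find_all('p', {'class': 'verse'})]
    lyrics[title] = "\n".join(verses)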
I made it!
I wasn't able to store the output in a variable, but I was able to write a txt file storing all the content, which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup

with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class': 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class': 'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics', '')
        f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
# the with block closes the file automatically, so no f.close() is needed
I would still love to hear if there are other better approaches.
Thanks!
I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all the Team Stats links) from this page, finds the rank of a specified team (an input), and then returns the sum of that team's ranks across all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + int(rank)

print(total_rank)
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I'm not sure.
I don't use Python, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class 'column large-2'. The team stats are in the first occurrence. Then you can find all of the tags with an href within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of the navigation links), so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
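Note that these hrefs are relative, so before requesting them you'd prepend the site root (a sketch; the domain is taken from the question's URL):
full_links = ["https://www.teamrankings.com" + href for href in links]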
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list (the li elements selected by ".expand-section li" contain no nested li tags), so stat_links ends up being an empty list. Therefore the iteration over stat_links never gets carried out, and total_rank never gets incremented. I suggest you fiddle with the way you find the list elements.
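For example, a minimal correction to the first loop might select the anchors directly (a sketch; the ".expand-section li a" selector and the absolute-URL prefix are assumptions you should verify against the page):
stat_links = []
for a in soup.select(".expand-section li a"):
    href = a.get('href')
    if href and href != '#':
        stat_links.append("https://www.teamrankings.com" + href)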