I wanted to scrape something as my first program, just to learn the basics really, but I'm having trouble showing more than one result.
The premise is to go to a forum (http://blackhatworld.com), scrape all the thread titles, and compare each one with a string: if a title contains the word "free" it gets printed, otherwise it doesn't.
Here's the current code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')

n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see it's checking for the word "free" and it works, but it only shows the first result while there are several more on the page.
By default, string comparison is case-sensitive (FREE != free). To address that, first put test2 in lowercase:

test2 = list(test)[0].lower()

To solve your problem and simplify your code, try this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')

count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
Bonus: print the value of href:

for title in threadtitles:
    print(title["href"])
Related
I am trying to scrape some pages and count the occurrences of a word on each page. I have to go through different sets of links to reach the final set of pages, and I used for loops to collect and iterate through the links.
Because the website is slow, I put the final iteration inside a while True loop. But each time I run the code, it loops through the final set of links differently. For example, it goes through 20 links and then repeats those same 20 links while ignoring another 20. The number varies every run, sometimes within a single iteration, repeating and omitting a random number of links.
The website is really slow, so unless I use a while True loop the program stops in the middle. Could someone please look through the code and point out what I am doing wrong?
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import pandas as pd
import io
import requests
import time
import csv

d = open('Wyd 20-21.csv', 'w')
writer = csv.writer(d, lineterminator='\n')

URL = "http://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/MAT_DTL_1603_MATD_eng2021.html"
soup = bs(requests.get(URL).content, "html.parser")
base_url = "http://mnregaweb4.nic.in/netnrega/"

linksblocks = []
for tag in soup.select("td:nth-of-type(2) a"):
    linksblocks.append(tag["href"])
print(linksblocks)
Allblocks = [base_url + e[6:] for e in linksblocks]
print(Allblocks)  # This is the first set of links. I have to iterate through each one of them to get to the second set of links.

links = []
for each in Allblocks:
    soup = bs(requests.get(each).content, "html.parser")
    for tag in soup.select("td:nth-of-type(2) a"):
        links.append(tag["href"])
AllGPs = [base_url + e[6:] for e in links]
print(AllGPs)  # This is the second set of links. I have to iterate through each one of them to get to the final set of links.

gp = 0
for each in AllGPs:
    res = requests.get(each)
    soup = bs(res.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
    inte = urls[1:-1]
    each_bill = [base_url + e[6:] for e in inte]  # This is the final set of links. I have to iterate through each one of them to reach the final pages and look for occurrences of the word on each page.
    q = len(each_bill)
    print("no of bills is: ", q)
    gp += 1
    x = 0
    while True:
        try:
            for each in each_bill:
                r = requests.get(each)
                y = r.text.count('Display Board')
                print(y)
                soup = bs(r.text, 'html.parser')
                table_soup = soup.findAll('table')
                trow = []
                for tr in table_soup[3]:
                    trow.append(tr)
                text = trow[1].text
                b = text[13:]
                print(b)
                writer.writerow((y, b))
                x += 1
                print("Now Wyd ", x, "th bill in", gp, " th GP")
                if x == q:
                    break
            if x == q:
                break
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(5)
            continue
d.close()
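For what it's worth, the repeats and omissions are consistent with how the while True loop interacts with x: when a request fails partway through each_bill, the except block sleeps and the for loop restarts from the first bill, but x keeps its old value, so earlier bills are written again and the x == q break fires before the tail of the list is reached. A minimal sketch of an alternative that retries each request individually instead of restarting the whole batch (the helper name and parameters are mine):

import time
import requests

def get_with_retries(url, retries=5, delay=5):
    # Retry a single request on network errors instead of
    # restarting the whole batch of bills.
    for _ in range(retries):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(delay)
    raise RuntimeError("giving up on " + url)

# With this helper the inner loop needs no while True at all:
# for each in each_bill:
#     r = get_with_retries(each)
#     ... process r exactly as before ...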
So I tried getting all the headlines of the New York Times homepage and wanted to see how many times a certain word has been mentioned. In this particular case, I wanted to see how many headlines mention either the coronavirus or Trump. This is my code, but it won't work: 'number' remains whatever integer I give it before the while loop.
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.findAll("h2", class_="esl82me0")
for story_heading in a:
    print(story_heading.contents[0])

lijst = ["trump", "Trump", "Corona", "COVID", "virus", "Virus", "Coronavirus", "COVID-19"]
number = 0
run = 0
while run < len(a) + 1:
    run += 1
    if any(lijst in s for s in a):
        number += 1

print("\nTrump or the Corona virus have been mentioned", number, "times.")
So I basically want the variable 'number' to increase by 1 whenever a headline (an entry in the list a) contains the word Trump or Coronavirus, or both.
Does anyone know how to do this?
In general, I recommend putting more thought into naming variables. I like how you tried to print the story headings. The line if any(lijst in s for s in a) does not do what you think it should: you need to instead iterate over each word in a single h2. The any function is just shorthand for the following:
def any(iterable):
    for element in iterable:
        if element:
            return True
    return False
In other words, you're trying to see if an entire list is in an h2 element, which will never be true. Here is an example fix.
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
h2s = soup.findAll("h2", class_="esl82me0")
for story_heading in h2s:
    print(story_heading.contents[0])

keywords = ["trump", "Trump", "Corona", "COVID", "virus", "Virus", "Coronavirus", "COVID-19"]
number = 0
for h2 in h2s:
    headline = h2.text
    words_in_headline = headline.split(" ")
    for word in words_in_headline:
        if word in keywords:
            number += 1
print("\nTrump or the Corona virus have been mentioned", number, "times.")
Output
Trump or the Corona virus have been mentioned 7 times.
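A side note beyond the original answer: splitting on spaces misses keywords attached to punctuation (for example "COVID-19," with a trailing comma), and it counts every keyword occurrence rather than every headline. A sketch of a substring test that uses any() the intended way, reusing h2s from the code above:

# Hypothetical variant: lowercase both sides and test substrings,
# counting each matching headline at most once.
keywords = ["trump", "corona", "covid", "virus"]
number = sum(
    1
    for h2 in h2s
    if any(keyword in h2.text.lower() for keyword in keywords)
)
print("\nTrump or the Corona virus have been mentioned", number, "times.")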
I am building a simple program with Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I am able to iterate correctly through the different URLs (each URL is a song by this artist) and print the output that I want, I am struggling to store all the different songs in one single variable.
I've tried different approaches (a list, a dictionary, a dictionary inside a list, etc.), but none of them worked out. I've also read the BeautifulSoup documentation and several forums, without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class': 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    d = {}
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class': 'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics', '')
    print("\n", title.upper(), "\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you could store the data in one variable: a Python dictionary, which also maps naturally to JSON or similar formats.
a = dict()  # create an instance of a dict; same as a = {}
a[1] = 'one'  # a basic dictionary entry: there is a key and a value
a[2] = {'two': 'the number 2'}  # the key is plain, but the value is another dictionary
print(a)
# This is how we access the dict inside the dict:
# the first key, a[2], gives us {'two': 'the number 2'}; ['two'] then accesses the value inside it.
print(a[2]['two'])
You'll be able to apply this knowledge to your algorithm: use the album as the first key, e.g. all['Stay strong'] = {'some-song': 'text_heavy'}.
I also recommend writing a function for the code you're reusing, for instance the request followed by the bs4 parsing:
def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
A good practice in software development is DRY (Don't Repeat Yourself), as opposed to WET (Waste Everyone's Time, Write Everything Twice); readability counts.
Just something to keep in mind.
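Putting both suggestions together, a minimal sketch of the full loop (the name all_songs is mine, and the selectors are taken from the question, assuming they still match the page):

import requests
from bs4 import BeautifulSoup

def parser(url):
    # one request-and-parse step, reused for every page
    return BeautifulSoup(requests.get(url).content, 'html.parser')

all_songs = {}  # maps song title -> lyrics text
soup = parser("http://www.metrolyrics.com/notorious-big-albums-list.html")
albums = soup.find("div", {'class': 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    page = parser(a['href'])
    title = page.find('h1').text.replace('Lyrics', '').strip()
    verses = page.find_all('p', {'class': 'verse'})
    all_songs[title] = "\n".join(v.text for v in verses)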
I made it!
I wasn't able to store the output in a variable, but I was able to write a .txt file storing all the content, which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup

with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class': 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class': 'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics', '')
        f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
I would still love to hear if there are other better approaches.
Thanks!
On some Wikipedia pages, after the title of the article (which appears in bold), there is some text inside parentheses explaining the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the <p>, there is an open parenthesis. To find the corresponding close parenthesis, you have to iterate through the text nodes one by one, which is simple. What I'm trying to do is find the very next href link after it and store it.
The issue is that (AFAIK) there isn't a way to uniquely identify the text node containing the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside the initial parentheses?
EDIT
In the case of the link provided here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside the parentheses.
Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
linguistics
If you want to extract the href, you can use this:

parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before. I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
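For what it's worth, a rough bs4 / Python 3 equivalent of the same idea, untested against the live page (the depth-counting logic is restructured slightly, but it follows the answer above):

import requests
from bs4 import BeautifulSoup, NavigableString

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia')
soup = BeautifulSoup(rs.text, 'html.parser')
first_p = soup.body.find('p')

depth = 0        # parenthesis nesting depth seen so far
started = False  # becomes True once the first '(' is seen
for node in first_p.descendants:
    if not isinstance(node, NavigableString):
        continue
    depth += node.count('(')
    if node.count('('):
        started = True
    depth -= node.count(')')
    if started and depth <= 0:
        # the first <a> after the node that closes the parenthetical
        print(node.find_next('a')['href'])
        break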
I'm trying to do a massive data accumulation on college basketball teams. This link, https://www.teamrankings.com/ncb/stats/, has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from that page, finds the rank of a specified team (an input), then returns the sum of that team's ranks across all links.
I was lucky enough to find this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, I dunno.
I don't use Python much, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in div elements with class="column large-2"; the team stats are in the first occurrence. You can then find all of the href attributes within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up as an empty list; therefore the iteration over stat_links never runs and total_rank never gets incremented. I suggest you fiddle with the way you find the list elements.
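For instance, a hedged sketch of one possible fix (untested against the live page): select the anchors directly, since searching an li element for li children finds nothing.

# The original selects <li> elements and then searches each one for <li>
# children, which comes back empty. Selecting the <a> tags directly
# inside the list items avoids that.
stat_links = [a["href"] for a in soup.select(".expand-section li a")]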