Google Search Results / Beginner Python - python

Just some questions regarding Python 3.
def AllMusic():
    myList1 = ["bob"]
    myList2 = ["dylan"]
    x = myList1[0]
    z = myList2[0]
    y = "-->Then 10 Numbers?"
    print("AllMusic")
    print("http://www.allmusic.com/artist/" + x + "-" + z + "-mn" + y)
This is my code so far.
I want to write a program that prints out the variable y. When you go to AllMusic.com, each artist has a unique 10-digit number.
For example: www.allmusic.com/artist/the-beatles-mn0000754032, www.allmusic.com/artist/arcade-fire-mn0000185591.
x is the first word of the artist's name and z is the second word. Everything works, but I can't figure out a way to find that 10-digit number and return it for each artist I input into my Python program.
I figured out that when you go to Google and type, for example, "Arcade Fire AllMusic", the first result shows the URL of the site just under the heading: www.allmusic.com/artist/arcade-fire-mn0000185591
How can I get that 10-digit code, 0000185591, into my Python program and print it out?

I wouldn't use Google at all - you can use the search on the site. There are many useful tools to help you do web scraping in python: I'd recommend installing BeautifulSoup. Here's a small script you can experiment with:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

def get_artist_link(artist):
    base = 'http://www.allmusic.com/search/all/'
    # encode spaces and other special characters for the URL
    query = urllib.parse.quote(artist)
    url = base + query
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page.read(), 'html.parser')
    artists = soup.find_all("li", class_="artist")
    for artist in artists:
        print(artist.a.attrs['href'])

if __name__ == '__main__':
    get_artist_link('the beatles')
    get_artist_link('arcade fire')
For me this prints out:
/artist/the-beatles-mn0000754032
/artist/arcade-fire-mn0000185591
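Once you have such an href, the 10-digit id the question asks about can be pulled off the end, for example with a regular expression (a sketch; the "mn" prefix and 10-digit length are taken from the example URLs above):

```python
import re

href = "/artist/arcade-fire-mn0000185591"
# Capture the ten digits that follow "mn" at the end of the path
match = re.search(r"mn(\d{10})$", href)
if match:
    print(match.group(1))  # 0000185591
```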

Can't isolate desired results out of crude ones

I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content of a script tag on that site. When I run the script I get the neighborhood names correctly. However, the problem is that I've used the line if not item.startswith("NY:"):continue to get rid of unwanted results from that page, and I don't want to rely on the hardcoded prefix NY: to do this trick.
website link
I've tried with:
import re
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'

resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>', resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is I wish to get everything that starts with NY:New_York:. What I mean by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without using any hardcoded search string within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve then you could just use a variable for the state.
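A sketch of that state-variable idea, using a sample of the items listed above (the variable name and sample data are illustrative):

```python
state = "NY:"  # could come from user input or a config value instead of being hardcoded

items = ["rating", "NY:New_York:Brooklyn:Mill_Basin", "food_court"]
# Keep only entries that begin with the chosen state prefix
neighborhoods = [item for item in items if item.startswith(state)]
print(neighborhoods)  # ['NY:New_York:Brooklyn:Mill_Basin']
```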
Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')
target = soup.find_all('script')[14]
content = target.text.replace('<!--', '').replace('-->', '')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.
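Since every loop above only checks for a single known key, the same data can be reached by direct indexing, assuming Yelp's structure stays as described. Demonstrated here with a minimal mock of that structure rather than a live request:

```python
# Minimal mock of the nested structure the loops above walk through
js_data = {
    'searchPageProps': {
        'filterPanelProps': {
            'filterSets': [
                {},  # index 0 is unused here; the loops take index 1
                {'moreFilters': [
                    {'title': 'Manhattan',
                     'sectionFilters': ['NY:New_York:Manhattan:Chelsea']},
                ]},
            ]
        }
    }
}

# Direct indexing replaces the four nested loops
more_filters = js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']
for borough in more_filters:
    print(borough['title'])
    print(borough['sectionFilters'])
```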

Scraping returning only one value

I wanted to scrape something as my first program, just to learn the basics really but I'm having trouble showing more than one result.
The premise: go to a forum (http://blackhatworld.com), scrape all thread titles, and compare each with a string. If it contains the word "free" it will print; otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see it's checking for the word "free" and it works, but it only shows the first result while there are several more on the page.
By default, string comparison is case sensitive (FREE != free). To solve your problem, first put test2 in lowercase:
test2 = list(test)[0].lower()
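A quick illustration of the case-sensitivity issue (the title string is made up):

```python
title = "FREE Method Inside"
print("free" in title)          # False: 'free' does not match 'FREE'
print("free" in title.lower())  # True after lowercasing
```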
To solve your problem and simplify your code try this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
Bonus: Print value of href:
for title in threadtitles:
    print(title["href"])
See also this.

Why doesn't this function return the same output in both situations(webscraping project)?

import requests
import re
from bs4 import BeautifulSoup

# The website I'd like to get; converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")

# Modifies the given string to look visually good. Like this:
# ['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    # All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    # Replace every blacklisted character with white space
    for char in blacklist:
        string = string.replace(char, ' ')
    # Replace more than 2 consecutive white spaces
    string = re.sub("\s\s+", " ", string)

# Gets the date of the festival I'm interested in
def get_date_info():
    # Makes a list for the data
    raw_info = []
    # Adds every "div" with a certain class to the list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)
    # Converts list into string, because remove_char() only accepts strings
    raw_info = str(raw_info)
    # Modifies the string as explained above
    final_date = remove_char(raw_info)
    # Prints the date in this format: 21 Jul 2018 (example)
    print(final_date)

get_date_info()
Hi there! So I'm currently working on a little webscraping project. I thought I had a good idea and I wanted to get more experienced with Python. What it basically does is it gets festival information like date, time and price and puts it in a little text file. I'm using BeautifulSoup to navigate and edit the web page. Link is down there!
But now I'm kinda running into a problem. I can't figure out what's wrong; maybe I'm totally overlooking it. When I run this program it should give me 21 Jul 2018, but instead it returns None. For some reason every character in the string gets removed.
I tried running remove_char() on its own, with the same list(converted it to string first) as input. This worked perfectly. It returned "21 Jul 2018" like it was supposed to do. So I'm quite sure the error is not in this function.
So somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?
Hope someone can help me out!
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Web page:
https://festivalfans.nl/event/dominator-festival
You forgot to return the value in the remove_char() function.
That's it!
Neither of your functions has a return statement, so both return None by default. remove_char() should end with return string, for example.
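A minimal illustration of that default return value (the function names and strings here are made up):

```python
def without_return():
    result = "21 Jul 2018"  # computed, but never returned

def with_return():
    result = "21 Jul 2018"
    return result

print(without_return())  # None
print(with_return())     # 21 Jul 2018
```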
Here is an easier approach; no need for re:
import requests
from bs4 import BeautifulSoup

base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "html.parser")

def get_date_info():
    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class": "event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)

get_date_info()

How to store different sets of text in one single variable using Beautifulsoup

I am building a simple program using Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I am able to correctly iterate through the different URLs (each URL is a song by this artist) and the output I want is printed, I am struggling to store all the different songs in one single variable.
I've tried different approaches, trying to store it in a list, a dictionary, a dictionary inside a list, etc., but it didn't work out. I've also read the BeautifulSoup documentation and several forums without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class': 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    d = {}
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class': 'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics', '')
        print("\n", title.upper(), "\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you should store data in one variable.
This can be JSON or similar by using a python dictionary.
a = dict()  # create an instance of a dict; same as a = {}
a[1] = 'one'  # a basic dictionary entry: a key and a value
a[2] = {'two': 'the number 2'}  # here the key is plain, but the value is another dictionary
print(a)
# This is how we access the dict inside the dict:
print(a[2]['two'])
# The first key [2] gives us {'two': 'the number 2'}; [2]['two'] accesses the value inside it
You'll be able to apply this knowledge to your algorithm.
Use the album as the first key: all_songs['Stay strong'] = {'some-song': 'text_heavy'}
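A sketch of that idea applied to albums and songs (the album and song names here are placeholders):

```python
lyrics = {}

# For each scraped song you would do something like:
album, song_title, text = "Ready to Die", "Juicy", "It was all a dream..."
# setdefault creates the inner dict for a new album, then reuses it
lyrics.setdefault(album, {})[song_title] = text

print(lyrics["Ready to Die"]["Juicy"])  # It was all a dream...
```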
I also recommend making a function since you're re-using code.
for instance, the request and then parsing using bs4
def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
A good practice in software development is DRY (Don't Repeat Yourself), since readability counts, as opposed to WET (Write Everything Twice, or Waste Everyone's Time).
Just something to keep in mind.
I made it!!
I wasn't able to store the output in a variable, but I was able to write a txt file storing all the content which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup

with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class': 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class': 'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics', '')
            f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
# no f.close() needed: the with block closes the file automatically
I would still love to hear if there are other better approaches.
Thanks!
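For the record, one way to keep everything in a single variable would be to collect (title, lyrics) pairs in a dict as you scrape and join them at the end; the titles and text below are placeholders:

```python
songs = {}

# Inside the scraping loop you would do: songs[title] = song_text
songs["JUICY"] = "placeholder lyrics for Juicy"
songs["HYPNOTIZE"] = "placeholder lyrics for Hypnotize"

# One single string holding every song, separated by blank lines
all_lyrics = "\n\n".join(title + "\n" + text for title, text in songs.items())
print(all_lyrics)
```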

using find with multiple strings to search within larger string

Using Python 2.7, I am trying to scrape the title from a page, but cut it off before the closing title tag if I find one of these characters: .-_<| (I'm just trying to get the name of the company/website). I have some code working, but I'm sure there must be a simpler way. I'm open to suggestions for libraries (Beautiful Soup, Scrapy, etc.), but I would be happiest to do it without them, as I'm slowly learning my way around Python. You can see my code searches individually for each of the characters rather than all at once. I was hoping there was a find(x or x) function, but I could not find one. Later I will also be doing the same thing but looking for any digit in the 0-9 range.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def findTitle(webaddress):
    url = (webaddress)
    ourUrl = opener.open(url).read()
    ourUrlLower = ourUrl.lower()
    x = 0
    positionStart = ourUrlLower.find("<title>", x)
    if positionStart == -1:
        return "Insert Title Here"
    endTitleSignals = ['.', ',', '-', '_', '#', '+', ':', '|', '<']
    positionEnd = positionStart + 50
    for e in endTitleSignals:
        positionHolder = ourUrlLower.find(e, positionStart + 1)
        if positionHolder < positionEnd and positionHolder != -1:
            positionEnd = positionHolder
    return ourUrl[positionStart + 7:positionEnd]

print findTitle('http://www.com')
The regular expression library (re) could help, but if you'd like to learn more about general python instead of specialized libraries, you could do it with sets, which are something you'll want to know about.
string = "garbage1and2recycling"
charlist = ['1', '2']
charset = set(charlist)  # the built-in set type; the old sets module is deprecated

index = 0
for index in range(len(string)):
    if string[index] in charset:
        break

print(index)  # 7
Note that you could do the above using just charlist instead of charset, but that would take longer to run.
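For comparison, here is the re approach mentioned above; a character class finds the first occurrence of any of the listed characters in one call (same sample string as before):

```python
import re

string = "garbage1and2recycling"
# [12] matches either '1' or '2'; search returns the first occurrence
match = re.search(r"[12]", string)
print(match.start())  # 7
```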