I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content out of a script tag on that site. When I run the script, I get the names of the neighborhoods correctly. However, the problem is that I've used this line - if not item.startswith("NY:"): continue - to get rid of unwanted results from that page. I do not wish to use the hardcoded portion NY: to do this trick.
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is that I wish to get everything starting with NY:New_York:. What I meant by unwanted results are entries like rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without hardcoding any part of the search string within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve, you could just use a variable for the state instead - see the sketch below.
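For instance - a sketch, assuming the filter keys always follow the State:City_With_Underscores: pattern your sample shows - the prefix can be derived from the find_loc query parameter instead of being typed in:

from urllib.parse import urlparse, parse_qs

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
find_loc = parse_qs(urlparse(link).query)['find_loc'][0]      # 'New York, NY'
city, state = [part.strip() for part in find_loc.split(',')]
prefix = f'{state}:{city.replace(" ", "_")}:'                 # 'NY:New_York:'
neighborhoods = [x for x in items if x.startswith(prefix)]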
Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')  # parse the page so the script tags can be searched
target = soup.find_all('script')[14]
content = target.text.replace('<!--', '').replace('-->', '')
js_data = json.loads(content)
And now the fun of extracting the NYC info from the JSON begins...
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
 'NY:New_York:Manhattan:Battery_Park',
 'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
 ...]
etc.
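For reference, if the JSON shape is stable (an assumption - Yelp may reshuffle it), the same data can be reached directly instead of looping over every key:

more_filters = js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']
for section in more_filters:
    print(section['title'])            # borough name
    print(section['sectionFilters'])   # list of neighborhood keys
    print('---------------')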
Related
Trying to build my first web scraper to print out how the stock market is doing from Yahoo Finance. I have found out how to isolate the information I want, but it comes back super sloppy. How can I manipulate this data to present it in an easier way?
import requests
from bs4 import BeautifulSoup
#Import your website here
html_text = requests.get('https://finance.yahoo.com/').text
soup = BeautifulSoup(html_text, 'lxml')
#Find the part of the webpage where your information is in
sp_market = soup.find('h3', class_ = 'Maw(160px)').text
print(sp_market)
The return here is: S&P 5004,587.18+65.64(+1.45%)
I want to grab these elements such as the labels and percentages and isolate them so I can print them in a way I want. Anyone know how? Thanks so much!
edit:
((S&P 5004,587.18+65.64(+1.45%)))
For simple splitting you could use the built-in .split(separator) method (e.g. first split by 'x', then by 'y', then by 'z', with x, y, z being separators). Since this is not efficient, and if you have somewhat more complex patterns that look the same way for different elements (here: stocks), take a look at Python's re module.
string = "Stock +45%"
pattern = '[a-z]+[0-9][0-9]'
Then consider using a function like findall or search, as sketched below.
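A minimal sketch of that combination, using the sample string from the question (the pattern is just one way to write it):

import re

s = 'S&P 500\n4,587.18+65.64(+1.45%)'          # sample value from the question
label, numbers = s.split('\n')                 # first split on the newline
parts = re.findall(r'[+-]?[\d,.]+', numbers)   # then pull out the numeric pieces
print(label, parts)                            # S&P 500 ['4,587.18', '+65.64', '+1.45']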
I assume that the format is always S&P 500\n[number][+/-][number]([+/-][number]%).
If that is the case, we could do the following.
import re

# [your existing code]
# e.g.
# sp_market = 'S&P 500\n4,587.18+65.64(+1.45%)'
label, line2 = sp_market.split('\n')
pm = re.findall(r"[+-]", line2)
total, change, percent, _ = re.split(r"[\+\-\(\)%]+", line2)
total = float(''.join(total.split(',')))
change = float(change)
if pm[0] == '-':
    change = -change
percent = float(percent)
if pm[1] == '-':
    percent = -percent
print(label, total, change, percent)
# S&P 500 4587.18 65.64 1.45
Not sure, since the question does not provide an expected result, but you can "isolate" the information with stripped_strings.
This will give you a list of "isolated" values you can process:
list(soup.find('h3', class_ = 'Maw(160px)').stripped_strings)
#Output
['S&P 500', '4,587.18', '+65.64', '(+1.45%)']
For example, stripping the characters "()%":
[x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]
#Output
['S&P 500', '4,587.18', '+65.64', '+1.45']
The simplest way to print the data in a less sloppy way is to join() the values with whitespace:
' '.join([x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])
#Output
S&P 500 4,587.18 +65.64 +1.45
You can also create a dict() and print the key / value pairs:
for k, v in dict(zip(['Symbol','Last Price','Change','% Change'], [x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])).items():
    print(f'{k}: {v}')
#Output
Symbol: S&P 500
Last Price: 4,587.18
Change: +65.64
% Change: +1.45
When working with a new XML structure, it is always helpful to see the big picture first.
When loading it with BeautifulSoup:
import requests, bs4
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml')
print(x)
is there a built-in way to display its tree structure with different depths?
Example for https://www.w3schools.com/xml/cd_catalog.xml, with maxdepth=0, it would be:
CATALOG
with maxdepth=1, it would be:
CATALOG
  CD
  CD
  CD
  ...
and with maxdepth=2, it would be:
CATALOG
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  ...
Here's a quick way to do it: use the prettify() function to structure the document, then get the indentation and opening tag names via regex (catching uppercase words inside opening tags in this case). If the indentation from prettify() meets the depth specification, print the tag with the specified indentation size.
import requests, bs4
import re

maxdepth = 1
indent_size = 2
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml').prettify()
for line in x.split("\n"):
    match = re.match(r"(\s*)<([A-Z]+)>", line)
    if match and len(match.group(1)) <= maxdepth:
        print(indent_size * match.group(1) + match.group(2))
I have used xmltodict 0.12.0 (installed via anaconda), which did the job for XML parsing, though not for depth-wise viewing. It works much like any other dictionary. From here, a recursion with depth counting should be the way to go - a sketch follows the snippet below.
import requests, xmltodict, json

s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)
for key in x:
    print(json.dumps(x[key], indent=4, default=str))
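A sketch of that recursion, continuing from the snippet above (it relies on the xmltodict convention that repeated tags such as CD are collapsed into one key holding a list):

def print_tree(node, maxdepth, depth=0):
    if not isinstance(node, dict):
        return
    for tag, child in node.items():
        # repeated tags arrive as one key with a list value
        children = child if isinstance(child, list) else [child]
        for item in children:
            print('  ' * depth + tag)
            if depth < maxdepth:
                print_tree(item, maxdepth, depth + 1)

print_tree(x, maxdepth=2)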
Here is one solution without BeautifulSoup.
import requests

s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
tab_size = 2
target_depth = 2
for element in s.split('\n'):
    depth = (len(element) - len(element.lstrip())) / tab_size
    if depth <= target_depth:
        print(' ' * int(depth) + element)
I'm trying to get the contents of two divs into a dictionary in Python. The main problem is that I'm able to fetch the content of the first div and the second, but not in the right key:value manner - I only get the keys back. I know that I need to iterate through the content, but I can't seem to get my for loop right.
Following 1 and 2 did not get the job done for what I'm looking for.
This is what I've tried so far:
from bs4 import BeautifulSoup
import requests
url='https://www.samenvoordeklant.nl/arbeidsmarktregios'
base=requests.get(url, timeout=15)
html=BeautifulSoup(base.text, 'lxml')
regios=html.find_all('div',attrs={'class':['field field--name-node-title field--type-ds field--label-hidden field__item animated','field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
for regio in regios:
    print({regio.get_text(strip=True)})
The result:
{'Achterhoek'}
{'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost Gelre, Oude IJsselstreek, Winterswijk'}
{'Amersfoort'}
{'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, Woudenberg'}
etc.
The result I'm after is:
{'Achterhoek':'Aalten', 'Berkelland', 'Bronckhorst', 'Doetinchem', 'Montferland', 'Oost Gelre', 'Oude IJsselstreek', 'Winterswijk'}
{'Amersfoort':'Amersfoort', 'Baarn', 'Bunschoten', 'Leusden', 'Nijkerk', 'Soest', 'Woudenberg'}
etc. This allows me to move it afterwards into a pandas dataframe more easily.
An easy way is with dict and zip of the two lists. Note that I have used faster css selectors and avoided the full multi-value class strings.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.samenvoordeklant.nl/arbeidsmarktregios')
soup = bs(r.content, 'lxml')
result = dict(zip([i.text for i in soup.select('h2 a')], [i.text for i in soup.select('.field--type-string-long')]))
print(result)
# result = {k:v.split(', ') for k, v in result.items()} ##add this line at end if want list as value rather than string
If you want a list as the value you can simply add a last line of:
result = {k:v.split(', ') for k, v in result.items()}
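And since the stated goal was to move the result into a pandas DataFrame afterwards, a minimal sketch of that step (assuming pandas is installed; the column names are illustrative):

import pandas as pd

rows = [(regio, gemeente) for regio, v in result.items() for gemeente in v.split(', ')]
df = pd.DataFrame(rows, columns=['regio', 'gemeente'])
print(df.head())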
This is one approach using zip.
Ex:
from pprint import pprint

regios = html.find_all('div', attrs={'class': ['field field--name-node-title field--type-ds field--label-hidden field__item animated', 'field field--name-field-gemeenten field--type-string-long field--label-hidden field__item animated']})
result = {key.get_text(strip=True): value.get_text(strip=True) for key, value in zip(regios[0::2], regios[1::2])}
pprint(result)
Output:
{'Achterhoek': 'Aalten, Berkelland, Bronckhorst, Doetinchem, Montferland, Oost '
               'Gelre, Oude IJsselstreek, Winterswijk',
 'Amersfoort': 'Amersfoort, Baarn, Bunschoten, Leusden, Nijkerk, Soest, '
               'Woudenberg', ......
If you need the value as a list of items
Use:
result = {key.get_text(strip=True): [i.strip() for i in value.get_text(strip=True).split(",")] for key, value in zip(regios[0::2], regios[1::2])}
Output:
{'Achterhoek': ['Aalten',
                'Berkelland',
                'Bronckhorst',
                'Doetinchem',
                'Montferland',
                'Oost Gelre',
                'Oude IJsselstreek',
                'Winterswijk'],
 'Amersfoort': ['Amersfoort',
                'Baarn',
                'Bunschoten',....
I am building a simple program using Python 3 on macOS to scrape all the lyrics of an artist into one single variable. Although I am able to iterate correctly through the different URLs (each URL is a song from this artist) and the output I want is printed, I am struggling to store all the different songs in one single variable.
I've tried different approaches, trying to store it in a list, a dictionary, a dictionary inside a list, etc., but it didn't work out. I've also read the BeautifulSoup documentation and several forums without success.
I am sure this should be something very simple. This is the code that I am running:
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
c = r.content
soup = BeautifulSoup(c, "html.parser")
albums = soup.find("div", {'class' : 'grid_8'})
for a in albums.find_all('a', href=True, alt=True):
    r = requests.get(a['href'])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    song = soup.find_all('p', {'class':'verse'})
    title = soup.find_all('h1')
    for item in title:
        title = item.text.replace('Lyrics','')
        print("\n", title.upper(), "\n")
    for item in song:
        song = item.text
        print(song)
When running this code, you get the exact output that I would like to have stored in a single variable.
I've been struggling with this for days so I would really appreciate some help.
Thanks
Here's an example of how you could store the data in one variable, using a Python dictionary (which also maps naturally to JSON).
a = dict()
# We create an instance of a dict. Same as a = {}.
a[1] = 'one'
# This is how a basic dictionary works. There is a key and a value.
a[2] = {'two': 'the number 2'}
# Now our key is normal; however, our value is another dictionary.
print(a)
# This is how we access the dict inside the dict:
print(a[2]['two'])
# the first key [2] gives us {'two': 'the number 2'}; we access the value inside it with [2]['two']
You'll be able to apply this knowledge to your algorithm: use the album as the first key, e.g. all['Stay strong'] = {'some-song': 'text_heavy'}.
I also recommend making a function, since you're re-using code - for instance, for the request followed by the bs4 parsing:
import requests
from bs4 import BeautifulSoup

def parser(url):
    make_req = requests.get(url).text  # or .content
    return BeautifulSoup(make_req, 'html.parser')
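With that helper, each request/parse pair in your loop collapses to a single call, e.g.:

soup = parser("http://www.metrolyrics.com/notorious-big-albums-list.html")
albums = soup.find("div", {'class': 'grid_8'})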
A good practice in software development is the so-called DRY principle (Don't Repeat Yourself), since readability counts - as opposed to WET (Waste Everyone's Time, Write Everything Twice).
Just something to keep in mind.
I made it!!
I wasn't able to store the output in a variable, but I was able to write a txt file storing all the content, which is even better. This is the code I used:
import requests
import re
from bs4 import BeautifulSoup
with open('nBIGsongs.txt', 'a') as f:
    r = requests.get("http://www.metrolyrics.com/notorious-big-albums-list.html")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    albums = soup.find("div", {'class' : 'grid_8'})
    for a in albums.find_all('a', href=True, alt=True):
        r = requests.get(a['href'])
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        song = soup.find_all('p', {'class':'verse'})
        title = soup.find_all('h1')
        for item in title:
            title = item.text.replace('Lyrics','')
            f.write("\n" + title.upper() + "\n")
        for item in song:
            f.write(item.text)
# no f.close() needed - the with statement closes the file automatically
I would still love to hear if there are other better approaches.
Thanks!
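For what it's worth, here is a sketch of one such approach - the same loop, collecting everything into a single dict of {title: lyrics} instead of a file. It assumes the page structure from the code above (the albums div, the h1 title and the verse paragraphs):

all_songs = {}
for a in albums.find_all('a', href=True, alt=True):
    page = BeautifulSoup(requests.get(a['href']).content, "html.parser")
    title = page.find('h1').text.replace('Lyrics', '').strip()   # assumes one h1 per song page
    verses = page.find_all('p', {'class': 'verse'})
    all_songs[title] = '\n'.join(v.text for v in verses)
# all_songs now holds every song's lyrics keyed by its title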
I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that team's ranks across all links.
I gratefully found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time
url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + int(rank)

print(total_rank)
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, I dunno.
I don't use Python much, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class 'column large-2'; the team stats are in the first occurrence. Then you can find all of the tags with an href inside it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
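Note the hrefs are site-relative, so before requesting them you would join each with the domain, e.g.:

from urllib.parse import urljoin

full_links = [urljoin('https://www.teamrankings.com', path) for path in links]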
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty array; therefore the iteration over stat_links never gets carried out and total_rank never gets incremented. I suggest you fiddle with the way you find the list elements - for example, as in the sketch below.
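For example - untested against the live page, so treat it as a sketch - selecting the anchors inside those list items in one pass avoids looking for li tags nested inside li tags:

stat_links = [a['href'] for a in soup.select(".expand-section li a")]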