How can I edit web scraped text data using python?

Trying to build my first web scraper to print out how the stock market is doing on Yahoo Finance. I have figured out how to isolate the information I want, but the result comes back super sloppy. How can I manipulate this data to present it in a cleaner way?
import requests
from bs4 import BeautifulSoup
#Import your website here
html_text = requests.get('https://finance.yahoo.com/').text
soup = BeautifulSoup(html_text, 'lxml')
#Find the part of the webpage where your information is in
sp_market = soup.find('h3', class_ = 'Maw(160px)').text
print(sp_market)
The return here is : S&P 5004,587.18+65.64(+1.45%)
I want to grab these elements such as the labels and percentages and isolate them so I can print them in a way I want. Anyone know how? Thanks so much!
edit:
((S&P 5004,587.18+65.64(+1.45%)))

For simple splitting you could use the built-in .split(separator) method (e.g. first split by 'x', then by 'y', then by 'z', with x, y and z being separators). Since this is not efficient, and if you have somewhat more complex patterns that look the same for different elements (here: stocks), take a look at Python's re (regular expression) module.
string = "Stock +45%"
pattern = r'[+-]\d+%'
Then consider using a function like re.findall or re.search with that pattern.
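A minimal sketch of the regex route described above, assuming the scraped text looks like the sample in the question (the sample string and the variable names are illustrative, not taken from the live page):

```python
import re

# Sample taken from the question; the real page text may differ.
text = "4,587.18+65.64(+1.45%)"

# A signed number: a mandatory + or -, digits, and an optional decimal part.
signed = re.findall(r"[+-]\d+(?:\.\d+)?", text)
print(signed)
```

re.findall returns every non-overlapping match as a list of strings, so the change and the percentage come back in the order they appear.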

I assume that the format is always S&P 500\n[number][+/-][number]([+/-][number]%).
If that is the case, we could do the following.
import re
# [your existing code]
# e.g.
# sp_market = 'S&P 500\n4,587.18+65.64(+1.45%)'
label,line2 = sp_market.split('\n')
pm = re.findall(r"[+-]",line2)
total,change,percent,_ = re.split(r"[\+\-\(\)%]+",line2)
total = float(''.join(total.split(',')))
change = float(change)
if pm[0]=='-':
    change=-change
percent = float(percent)
if pm[1]=='-':
    percent=-percent
print(label, total,change,percent)
# S&P 500 4587.18 65.64 1.45

Not sure, because the question does not provide an expected result, but you can "isolate" the information with stripped_strings.
This will give you a list of "isolated" values you can process:
list(soup.find('h3', class_ = 'Maw(160px)').stripped_strings)
#Output
['S&P 500', '4,587.18', '+65.64', '(+1.45%)']
For example, stripping the characters "()%" from each value:
[x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]
#Output
['S&P 500', '4,587.18', '+65.64', '+1.45']
The simplest way to print the data less sloppily is to join() the values with whitespace:
' '.join([x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])
#Output
S&P 500 4,587.18 +65.64 +1.45
You can also create a dict() and print the key / value pairs:
for k, v in dict(zip(['Symbol','Last Price','Change','% Change'], [x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])).items():
    print(f'{k}: {v}')
#Output
Symbol: S&P 500
Last Price: 4,587.18
Change: +65.64
% Change: +1.45
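If the values are needed as numbers rather than strings, the list from stripped_strings can be converted in one more step. This is only a sketch: parse_quote is a hypothetical helper, and it assumes the list always has exactly the four parts shown in the output above.

```python
def parse_quote(parts):
    """Convert ['S&P 500', '4,587.18', '+65.64', '(+1.45%)'] into typed values."""
    label, price, change, percent = parts  # assumes exactly four parts
    return {
        "label": label,
        "price": float(price.replace(",", "")),   # drop the thousands separator
        "change": float(change),                  # float() accepts a leading sign
        "percent": float(percent.strip("()%")),
    }


quote = parse_quote(["S&P 500", "4,587.18", "+65.64", "(+1.45%)"])
print(f"{quote['label']}: {quote['price']:,.2f} ({quote['percent']:+.2f}%)")
```

Keeping the numbers as floats lets the format string decide how they are displayed, instead of carrying the scraped formatting around.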

Related

Retrieve string values with list

I am having some problems trying to manipulate some strings here. I am scraping some data from a website and I am facing 2 challenges:
I am scraping unnecessary data as the website I target has redundant class naming. My goal is to isolate this data and delete it so I can keep only the data I am interested in.
With the data kept, I need to split the string in order to store some information into specific variables.
So initially I was planning to use a simple split() function and store each new string into list and then play with it to keep the parts that I want. Unfortunately, every time I do this, I end up with 3 separate lists that I cannot manipulate/split.
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )
for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split())
    title_list = []
    title_list = title.split(" | ")
    print(title_list)
Here is the "raw data" retrieved:
Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020
And here is what I would like to achieve:
Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020
Could you please let me know how to proceed here?

How about this ?
It's not so pretty, but it will work as long as there is always a "VS." and a "|" separating the names, and the year is always 4 digits.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )
text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n","")
#collapse runs of spaces into a single space
while text.find("  ")> -1:
    text = text.replace("  "," ")
text = text.strip()
#split by two parameters
split = [st.split("|") for st in text.split("VS.")]
#flatten the nested lists
flat_list = [item for sublist in split for item in sublist]
#extract the date from the end of the last item
flat_list.append(flat_list[-1][-4:])
#remove date from the 3rd item
flat_list[2] = flat_list[2][:-4]
#strip any leading or trailing white space
final_list = [x.strip() for x in flat_list]
print(final_list)
output
['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
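As an alternative sketch, a single regular expression with named groups can do the split and skip the unwanted headings at the same time. It assumes the same layout as above ("A VS. B | Tournament YYYY"); the group names are made up for illustration.

```python
import re

# Assumed layout: "<player> VS. <player> | <tournament> <4-digit year>"
pattern = re.compile(
    r"(?P<player1>.+?)\s+VS\.\s+(?P<player2>.+?)\s*\|\s*"
    r"(?P<tournament>.+?)\s+(?P<year>\d{4})$"
)

titles = [
    "Player Results",
    "Tournament Results",
    "Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020",
]

parsed = []
for title in titles:
    match = pattern.match(title)
    if match is None:
        continue  # headings that do not fit the layout are skipped
    parsed.append(match.group("player1", "player2", "tournament", "year"))

print(parsed)
```

Because the first two headings do not match the pattern, there is no need to rely on the index [2] of the find_all result.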

Splitting a list with regex

I am having some trouble trying to split each element within a nested list. I used this method for my first split. I want to do another split to the now nested list. I thought I could simply use the same line of code with a few modifications goal2 = [[j.split("") for j in goal]], but I continue to get a common error: 'list' object has no attribute 'split'. I know that you cannot split a list, but I do not understand why my modification is any different than the linked method. This is my first project with web scraping and I am looking for just the phone numbers of the website. I'd like some help to fix my issue and not a new code so that I can continue to learn and improve my own methods.
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.pickyourownchristmastree.org/ORxmasnw.php').text
soup = BeautifulSoup(source, 'lxml')
info = soup.findAll(text=re.compile(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})"))[:1]
goal = [i.split(".") for i in info]
goal2 = [[j.split("") for j in goal]]
for x in goal:
    del x[2:]
for y in goal:
    del y[:1]
print('info:', info)
print('goal:', goal)
Output without goal2 variable:
info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]
Desired Output with "goal2" variable:
info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]
goal2: ['503-325-9720']
I will obviously have more numbers, but I didn't want to clog up the space. So it would look something more like this:
goal2: ['503-325-9720', '###-###-####', '###-###-####', '###-###-####']
But I want to make sure that each number can be exported into a new row within a csv file. So when I create a csv file with a header "Phone", each number above will be in a separate row and not clustered together. I am thinking that I might need to change my code to a for loop???

The cleaner approach here would be to just do another regex search on your info, e.g.:
pat = re.compile(r'\d{3}\-\d{3}\-\d{4}')
goal = [pat.search(i).group() for i in info if pat.search(i)]
Outputs:
goal: ['503-325-9720']
Or if there are more than one number per line:
# use a capturing group instead
pat = re.compile(r'(\d{3}\-\d{3}\-\d{4})')
goal = [pat.findall(i) for i in info]
Outputs:
goal = [['503-325-9720', '123-456-7890']]
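For the CSV part of the question, a sketch along these lines would put each number in its own row under a "Phone" header. The in-memory buffer is only a stand-in so the example runs without touching the disk; a real script would presumably use open("phones.csv", "w", newline="") instead, and that file name is an assumption.

```python
import csv
import io
import re

info = ["89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: "]
pat = re.compile(r"\d{3}-\d{3}-\d{4}")

# Flatten: one entry per phone number, however many appear on each line.
phones = [number for line in info for number in pat.findall(line)]

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Phone"])                 # header row
writer.writerows([p] for p in phones)      # each number becomes a one-column row
print(buffer.getvalue())
```

Flattening the findall results first avoids the nested list that caused the "'list' object has no attribute 'split'" error: split() and similar string methods only work on the strings inside the inner lists, not on the lists themselves.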

How to match and extract using regex - Python

I'm taking a look how to use regex and trying to figure out how to extract the Latitude and Longitude, no matter if the number is positive or negative, right after the "?ll=" as shown below:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
I have used the following code in python to get only the first digits marked above:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    print(lnk)
    m = re.match('-?\d+(?!.*ll=)(?!&q=loc)*', lnk)
    print(m)
    #lat, *long = m.split(',')
    #print(lat)
    #print(long)
The result I got isn't what I was expecting:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
None
I'm getting "None" rather than the value "-6.148222,106.8462". I also tried to split those numbers into two variables called lat and long, but since I always got "None" python stops processing with "exit code 1" until I comment lines.
Cheers,

You should use re.search() instead of re.match(), because re.match() only matches at the beginning of the string, while re.search() scans the whole string for the first match.
This should solve the problem:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    m = re.search(r"(-?\d*\.\d*),(-?\d*\.\d*)", lnk)
    print(m.group())
    print("lat = "+m.group(1))
    print("lng = "+m.group(2))
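The difference between the two functions can be seen on the URL from the question. This sketch anchors the pattern on "ll=" (an assumption about which pair of coordinates is wanted) and guards against a missing match, which is what caused the original script to stop.

```python
import re

url = "https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&"
pattern = r"ll=(-?\d+\.\d+),(-?\d+\.\d+)"

# re.match only tries at position 0, and the URL does not start with "ll=".
at_start = re.match(pattern, url)
print(at_start)

# re.search scans forward until the pattern fits somewhere in the string.
found = re.search(pattern, url)
if found:  # both functions return None when nothing matches
    lat, lng = float(found.group(1)), float(found.group(2))
    print(lat, lng)
```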

I'd use a proper URL parser. Using regex here is asking for problems if the URL embedded in the page you are crawling changes in a way that breaks the regex you use.
from urllib.parse import urlparse, parse_qs
url = 'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&'
scheme, netloc, path, params, query, fragment = urlparse(url)
# or just
# query = urlparse(url).query
parsed_query_string = parse_qs(query)
print(parsed_query_string)
lat, long = parsed_query_string['ll'][0].split(',')
print(lat)
print(long)
outputs
{'ll': ['-6.148222,106.8462'], 'q': ['loc:-6.148222,106.8462']}
-6.148222
106.8462

Use separate regexes for the latitude and the longitude:
import re
str1="https://maps.google.com/maps?ll=6.148222,-106.8462&q=loc:-6.148222,106.8462&"
lat=re.search(r"-?\d+\.\d+",str1).group()
lon=re.search(r",-?\d+\.\d+",str1).group()
print(lat)
print(lon[1:])
output
6.148222
-106.8462

Can't isolate desired results out of crude ones

I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content from a script tag on that site. When I run the script, I get the neighborhood names correctly. However, the problem is that I've used the line if not item.startswith("NY:"):continue to get rid of unwanted results from that page. I do not wish to use the hardcoded portion NY: to do this trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"):continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is I wish to get everything that starts with NY:New_York:. What I mean by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without using any hardcoded portion of search within the script?

I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue
# or perhaps:
if item.count(':') < 3:
    continue
# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve then you could just use a variable for the state.
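Building on the colon-count idea, a sketch like the following filters by shape and groups the neighborhoods by borough without mentioning "NY" anywhere. The four-part layout (state:city:borough:neighborhood) is an assumption based on the sample output, and the items list is a shortened copy of that sample.

```python
from collections import defaultdict

items = [
    "rating",
    "NY:New_York:Brooklyn:Mill_Basin",
    "NY:New_York:Bronx:Edenwald",
    "BusinessParking.validated",
    "food_court",
    "NY:New_York:Queens:Little_Neck",
]

by_borough = defaultdict(list)
for item in items:
    parts = item.split(":")
    if len(parts) != 4:  # assumed shape: state:city:borough:neighborhood
        continue
    _state, _city, borough, neighborhood = parts
    by_borough[borough].append(neighborhood.replace("_", " "))

for borough, names in by_borough.items():
    print(borough, names)
```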

Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
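# soup was never defined in the original snippet; the parser choice here is an assumption
soup = bs(resp.text, 'html.parser')
target = soup.find_all('script')[14]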
content = target.text.replace('<!--','').replace('-->','')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data :
    if string in x:
        substring = x.replace(string,'')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)

You can do this in several ways. Here is one approach that doesn't require post-processing; you get the names the way you wanted them:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output looks like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams

You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data= []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data :
    substring = x.replace(string,'')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        # strip each part so "Zunino, Mike" does not leave stray spaces
        substring = ' '.join(part.strip() for part in substring)
        result.append(substring)
for x in result:
    print(x)
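The same clean-up can be wrapped in a small function, which keeps the loop short. This is a sketch: clean_name is a hypothetical helper, and the sample bat_data values are modeled on the list shown above rather than scraped from the site.

```python
def clean_name(raw, suffix="0.01"):
    """Turn 'Last,First0.01' (or 'Last, First0.01') into 'First Last'."""
    name = raw.replace(suffix, "")
    last, _, first = name.partition(",")  # split on the first comma only
    return f"{first.strip()} {last.strip()}".strip()


bat_data = ["Abreu,Jose0.01", "Acuna,Ronald0.01", "Zunino, Mike0.01"]
result = [clean_name(x) for x in bat_data]

for name in result:
    print(name)  # printing items one by one avoids the brackets and quotes
```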

1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets and quotes appear because you are printing a list object. Try this...
print(result[0])
This will tell the interpreter to print the item of result at index 0.
3) Reverse order of names
Try
name = " ".join(result[0].split(", ")[::-1])
