How do I remove extra text to the right of a string? - python

I am trying to get the name of a car model as it appears on the website, but for some reason (after trying all of the following) it doesn't seem to work.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
clean_model_name = clean_model_name.replace("\r", "")
clean_model_name = clean_model_name.replace("\n", "")
clean_model_name = clean_model_name.strip()
clean_model_name = clean_model_name.rstrip()
print(clean_model_name)
I have also created a variable containing the whole sentence I want to remove (which works) and passed it to the strip function, but the MY14 part of it changes based on the year of the car, and creating a variable for each year doesn't seem very efficient.
Some indexes return clean results; however, others return the following (scroll across):
2014 Holden Cruze SRi Z Series JH Series II Auto MY14 Manufacturer Marketing Year (MY) The manufacturer's marketing year of this model.
I don't need any of the details after the car model. After researching, strip() should remove whitespace from either side (but in this case it doesn't) and rstrip() should remove everything to the right (but in this case it doesn't).
I have successfully created a for loop which loops through each of the cars on this page, but some rows in the DataFrame are extended due to the additional unwanted text.

strip() only removes the whitespace characters at the front and rear of the string you are working with. You can try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
clean_model_name = clean_model_name.strip().split()[:5]
clean_model_name = ' '.join(clean_model_name)
print(clean_model_name)
I noticed that most of the model names have 5 key parts (the year, brand and model), so I used [:5] to take the first five words of the model name; if you only want the first three parts, just change the value to 3. split() breaks the model name on the spaces. Hope this helps.
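If the number of words varies between listings, another option is to cut the string at the tooltip text instead of counting words. A minimal sketch, assuming the unwanted tail always begins with "Manufacturer Marketing Year" (extend the pattern with MY\d+ if the MY code should go too):

import re

raw = ("2014 Holden Cruze SRi Z Series JH Series II Auto MY14 "
       "Manufacturer Marketing Year (MY) The manufacturer's marketing year of this model.")

# Delete everything from the start of the tooltip text to the end of the string.
clean = re.sub(r"\s*Manufacturer Marketing Year.*$", "", raw)
print(clean)  # 2014 Holden Cruze SRi Z Series JH Series II Auto MY14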

Related

Retrieve string values with list

I am having some problems trying to manipulate some strings here. I am scraping some data from a website and I am facing two challenges:
1. I am scraping unnecessary data because the website I target has redundant class naming. My goal is to isolate this data and delete it so I can keep only the data I am interested in.
2. With the data kept, I need to split the string in order to store some information into specific variables.
So initially I was planning to use a simple split() and store each new string in a list, then play with it to keep the parts that I want. Unfortunately, every time I do this, I end up with 3 separate lists that I cannot manipulate/split.
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")

for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split())
    title_list = []
    title_list = title.split(" | ")
    print(title_list)
Here is the "raw data" retrieved:
Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020
And here is what I'd like to achieve:
Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020
Could you please let me know how to proceed here?
How about this?
It's not so pretty, but it will work as long as there is always a "VS." and a "|" separating the names, and the year is always four digits.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")

text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n", "")

# collapse runs of spaces down to single spaces
while text.find("  ") > -1:
    text = text.replace("  ", " ")
text = text.strip()

# split by two parameters
split = [st.split("|") for st in text.split("VS.")]
# flatten the nested lists
flat_list = [item for sublist in split for item in sublist]
# extract the date from the end of the last item
flat_list.append(flat_list[-1][-4:])
# remove date from the 3rd item
flat_list[2] = flat_list[2][:-4]
# strip any leading or trailing white space
final_list = [x.strip() for x in flat_list]
print(final_list)
output
['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
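Alternatively, a single regular expression can capture all four fields at once. A sketch under the same assumptions (a "VS." between the names, a "|" before the tournament, and a trailing four-digit year):

import re

raw = "Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020"

# Groups: player 1, player 2, tournament, year.
m = re.match(r"(.+?)\s+VS\.\s+(.+?)\s*\|\s*(.+?)\s+(\d{4})$", raw)
if m:
    player_1, player_2, tournament, year = m.groups()
    print([player_1, player_2, tournament, year])
    # ['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']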

Find Location of All Numbers with a Comma

I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that have a comma, i.e. 100,000 or 12,000,000 but not 450, for example. The goal is to find the location of the comma-separated numbers within a string; then I need to extract the entire sentence they are in.
I moved the entire scrape to a string list and within that list I want to extract all numbers that have a comma.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content)
text = soup.find_all(text=True)

strings = []
for i in range(len(text)):
    text_s = str(text[i])
    strings.append(text_s)
I thought about the following re code, but I am not sure if it will extract all instances, i.e. within the list there may be multiple instances of numbers separated by commas.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")

for el in soup.find_all(True):  # loop over all elements in the page
    if re.search(r"(?=\d+,\d+).*", el.text):
        print(el.text)
        # print("END OF ELEMENT\n")  # debug only
If you simply want to check if a number has a comma or not, and you want to extract it if it does, then you could try the following.
new = []
for i in text:
    if ',' in i:
        new.append(i)
This will append all the elements in the 'text' collection that contain a comma, even if the exact same element is repeated multiple times.
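Since the goal is the whole sentence around each number, one more step is needed after filtering. A sketch, assuming sentences end in '.', '!' or '?' and that a comma number means digit groups joined by commas:

import re

text = ("The company repurchased 1,250,000 shares last year. "
        "It employs 450 people. Revenue grew to $12,000,000.")

# Rough sentence split on end punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep only sentences containing digits grouped with commas (e.g. 100,000).
matches = [s for s in sentences if re.search(r"\d{1,3}(?:,\d{3})+", s)]
print(matches)
# ['The company repurchased 1,250,000 shares last year.', 'Revenue grew to $12,000,000.']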

Why doesn't this function return the same output in both situations (web scraping project)?

import requests
import re
from bs4 import BeautifulSoup

# The website I'd like to get; converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")

# Modifies the given string to look visually good. Like this:
# ['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    # All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    # Replace every blacklisted character with white space
    for char in blacklist:
        string = string.replace(char, ' ')
    # Replace more than 2 consecutive white spaces
    string = re.sub("\s\s+", " ", string)

# Gets the date of the festival I'm interested in
def get_date_info():
    # Makes a list for the data
    raw_info = []
    # Adds every "div" with a certain name to the list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)
    # Converts list into string, because remove_char() only accepts strings
    raw_info = str(raw_info)
    # Modifies the string as explained above
    final_date = remove_char(raw_info)
    # Prints the date in this format: 21 Jul 2018 (example)
    print(final_date)

get_date_info()
Hi there! So I'm currently working on a little web scraping project. I thought I had a good idea and I wanted to get more experienced with Python. What it basically does is get festival information like date, time and price and put it in a little text file. I'm using BeautifulSoup to navigate and edit the web page. Links are down there!
But now I'm kind of running into a problem and I can't figure out what's wrong. Maybe I'm totally overlooking it. When I run this program it should give me this: 21 Jul 2018. But instead it returns None. For some reason every character in the string gets removed.
I tried running remove_char() on its own, with the same list (converted to a string first) as input. This worked perfectly; it returned "21 Jul 2018" like it was supposed to. So I'm quite sure the error is not in this function.
So somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?
Hope someone can help me out!
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Web page:
https://festivalfans.nl/event/dominator-festival
You forgot to return the value in the remove_char() function.
That's it!
Neither of your functions has a return statement, so both return None by default. remove_char() should end with return string, for example.
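For clarity, here is a corrected remove_char(): the body is unchanged from the question, with only the missing return added:

import re

def remove_char(string):
    # All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    for char in blacklist:
        string = string.replace(char, ' ')
    string = re.sub(r"\s\s+", " ", string)
    return string  # without this line the function returns None

print(remove_char("['21 / JulZaterdag2018']").strip())  # 21 Jul 2018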
Here is a simpler approach that doesn't need re:
import requests
from bs4 import BeautifulSoup

base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "html.parser")

def get_date_info():
    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class": "event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)

get_date_info()

Looping through a list of urls for web scraping with BeautifulSoup

I want to extract some information off websites with URLs of the form:
http://www.pedigreequery.com/american+pharoah
where "american+pharoah" is the extension for one of many horse names.
I have a list of the horse names I'm searching for; I just need to figure out how to plug the names in after "http://www.pedigreequery.com/".
This is what I currently have:
import csv
import requests
from bs4 import BeautifulSoup

allhorses = csv.reader(open('HORSES.csv'))
rows = list(allhorses)

for i in rows:  # Number of pages plus one
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)
When I print out the url it doesn't have the horse's name at the end, just the URL in quotes. The letters/print statements at the end are just to check whether it's actually going to the website.
This is how I've seen it done for looping URLs that change by numbers at the end- I haven't found advice on URLs that change by characters.
Thanks!
You are missing the placeholder in your format string, so change it to:
url = "http://www.pedigreequery.com/{}".format(i)  # note the added {} placeholder
Also, rows = list(allhorses) gives you a list of lists at best, so you would be passing a list, not a string/horse name. Just open the file normally if you have one horse per line, and iterate over the file object, stripping the newlines.
Presuming one horse name per line, the whole working code would be:
import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for horse in map(str.strip, f):  # Number of pages plus one
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        letters = soup.find_all("a", class_="horseName")
        print(letters)
If you have multiple horses per line you can use the csv lib but you will need an inner loop:
import csv
import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for row in csv.reader(f):
        # Number of pages plus one
        for horse in row:
            url = "http://www.pedigreequery.com/{}".format(horse)
            r = requests.get(url)
            soup = BeautifulSoup(r.content)
            letters = soup.find_all("a", class_="horseName")
            print(letters)
Lastly, if you don't have the names stored correctly you have a few options, the simplest of which is to split and create the query manually:
url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))

Python script extract data from HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that teams ranks from all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong because I'm getting 0
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank
Check out that link to double-check I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I dunno.
I don't use Python much, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class 'column large-2'; the team stats are in the first occurrence. Then you can find all of the href tags within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of the output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
 ...]
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty list; therefore the iteration over stat_links never gets carried out and total_rank never gets incremented. I suggest you fiddle with the way you find the list elements.
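For instance, selecting the anchor tags directly avoids looking for li tags nested inside li rows. A sketch, untested against the live page, assuming the links still sit under .expand-section:

import requests
from bs4 import BeautifulSoup

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

# Select the <a> tags directly instead of <li> tags nested inside <li> rows.
stat_links = []
for a in soup.select(".expand-section li a"):
    href = a.get('href')
    if href and href != '#':
        stat_links.append(href)

print(stat_links)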
