I am trying to learn how to scrape data from a webpage in Python and am running into trouble with how to structure my nested loops. I received some assistance with the scraping itself in this question (How to pull links from within an 'a' tag). I am trying to have that code iterate through different weeks (and eventually years) of webpages. What I have currently is below, but it is not iterating through the two weeks I would like it to and saving the results.
import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

weeks = ['1','2']
data = pd.DataFrame(columns=['Teams','Link'])
scripts_head = soup.find('head').find_all('script')
all_links = {}
for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/'+i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                        all_links[name] = gamecast
            #Save data to dataframe
            data2 = pd.DataFrame(list(all_links.items()), columns=['Teams','Link'])
        #Append new data to existing data
        data = data.append(data2, ignore_index=True)
#Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')
Edit: To add some clarification, it is creating duplicates of the data from one week and repeatedly appending them to the end. I also edited the code to include the proper libraries; it should be possible to copy, paste, and run it in Python.
The problem is in your loop logic:
    if 'window.espn.scoreboardData' in script.text:
        ...
        data2 = pd.DataFrame(list(all_links.items()), columns=['Teams','Link'])
    #Append new data to existing data
    data = data.append(data2, ignore_index=True)
Your indentation on the last line is wrong. As given, you append data2 regardless of whether you have new scoreboard data. When you don't, you skip the if body and simply append the previous data2 value.
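Following that diagnosis, here is a minimal runnable sketch of the corrected structure. The script contents and links are hypothetical stand-ins for the real requests/BeautifulSoup parsing, and the sketch uses pd.concat since DataFrame.append was removed in pandas 2.0:

```python
import pandas as pd

# hypothetical stand-ins for the parsed <script> tag contents
scripts = ['var other = 1;', 'window.espn.scoreboardData = {"events": []};']
all_links = {'Team A at Team B': 'https://example.org/gamecast'}  # hypothetical
data = pd.DataFrame(columns=['Teams', 'Link'])

for script in scripts:
    if 'window.espn.scoreboardData' in script:
        # ...parse events into all_links here...
        data2 = pd.DataFrame(list(all_links.items()), columns=['Teams', 'Link'])
        # the append now sits INSIDE the if, so it runs only when
        # scoreboard data was actually found in this script tag
        data = pd.concat([data, data2], ignore_index=True)

print(len(data))  # 1
```

With the append inside the if, a script tag without scoreboard data no longer re-appends the previous data2 value.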
So the workaround I came up with is below. I am still getting duplicate game IDs in my final dataset, but at least I am looping through the entire desired set and getting all of them; then at the end I dedupe.
import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years = ['2015','2016','2017','2018']
weeks = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14']
data = pd.DataFrame(columns=['Teams','Link'])
all_links = {}
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/' + year + '/seasontype/2/week/' + i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                            all_links[name] = gamecast
                #Save data to dataframe
                data2 = pd.DataFrame(list(all_links.items()), columns=['Teams','Link'])
                #Append new data to existing data
                data = data.append(data2, ignore_index=True)
#Save dataframe with all links to csv for future use
data_test = data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
I am having a formatting issue when scraping multiple forums and would hugely appreciate your help on this topic.
I am scraping over multiple forums and I would like to format the output as follows: all posts of one forum should be combined into one row. For example, if forum 1 has 10 posts, all 10 posts should be appended and presented in line 1 of the df; if forum 2 has 20 posts, all 20 posts should be appended as well and presented in line 2 of the df; and so forth for the remaining forums.
Please see my code below; the part that is not working yet can be found towards the end. Thank you for your help!
# import
import requests
from bs4 import BeautifulSoup
import pandas as pd

# first, create an empty data frame where the final results will be stored
df = pd.DataFrame()

# second, create a function to get all the user comments
def get_comments(lst_name):
    # replace all emojis with text:
    for img in bs.select("img.smilie"):
        img.replace_with(img["alt"])
    bs.smooth()
    # remove all blockquotes
    for bquote in bs.select("blockquote"):
        bquote.replace_with(" ")
    bs.smooth()
    # find all user comments and save them to a list
    comment = bs.find_all(class_ = [("bbWrapper", "content")])
    # iterate over the list comment to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip = True))
    # return the list
    return lst_name

# third, read URLs from a csv file
url_list = pd.read_csv('URL_List.csv', header=None)
### the links are https://vegan-forum.de/viewtopic.php?f=54&t=8325, https://forum.muscle-corps.de/threads/empfehlungen-f%C3%BCr-gute-produkte.5115/, https://forum.muscle-corps.de/threads/empfehlungen-f%C3%BCr-gute-produkte.5115/page-2

# fourth, loop over the list of URLs
urls = url_list[0]
for url in urls:
    link = url
    # create the list for the output of the function
    user_comments = []
    # get the content of the forum
    page = requests.get(link)
    html = page.content
    bs = BeautifulSoup(html, 'html.parser')
    # call the function to get the information
    get_comments(user_comments)
    # create a pandas dataframe for the user comments
    comments_dict = {
        'user_comments': user_comments
    }
    df_comments_info = pd.DataFrame(data=comments_dict)
    ### *THIS PART IS NOT WORKING* ###
    # join all comments into one cell
    #df_comments_info = pd.DataFrame({'user_comments': [','.join(df['user_comments'].str.strip('"').tolist())]})
    # append the temporary dataframe to the dataframe which has been created earlier outside the for loop
    df = df.append(df_comments_info)

# lastly, save the dataframe to a csv file
df.to_csv('test_March27.csv', header=False, index=False)
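A sketch of the "one row per forum" part, with hypothetical stand-in comment lists in place of the scraped pages: join the current forum's user_comments list (not a column of df, which is what the commented-out line attempts) into one string before building a one-row frame:

```python
import pandas as pd

df = pd.DataFrame()
# hypothetical stand-ins for the comment lists scraped per forum
forums = [['post 1', 'post 2', 'post 3'], ['another post']]
for user_comments in forums:
    # join this forum's posts into a single cell -> one row per forum
    df_comments_info = pd.DataFrame({'user_comments': [' '.join(user_comments)]})
    df = pd.concat([df, df_comments_info], ignore_index=True)

print(len(df))  # 2: one row per forum
```

pd.concat is used here because DataFrame.append was removed in pandas 2.0.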
I'm trying to scrape 100 reviews/ratings on a Yelp restaurant for an assignment using BeautifulSoup. I'm specifically looking for:
Review Comment
Review ID
Review Rating
I'm pretty new to Python, and I feel like I've missed something extremely obvious.
Here's what I've got so far:
from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.yelp.com/biz/ichiran-times-square-new-york-4?osq=Ichiban+Ramen'
ourUrl = urllib.request.urlopen(url)
soup = BeautifulSoup(ourUrl, 'html.parser')
type(soup)
print(soup.prettify())

for i in soup.find_all('div', {'class':" arrange-unit__373c0__3XPkE arrange-unit-fill__373c0__38Zde border-color--default__373c0__r305k"}):
    ID.append(i.find("div").get("aria-label"))

soup.find('p', {'class':"comment__373c0__Nsutg css-n6i4z7"})
i = soup.find('p', {'class':"comment__373c0__Nsutg css-n6i4z7"})
i.text

review = []
rating = []
ID = []
for x in range(0,10):
    url = "https://www.yelp.com/biz/ichiran-times-square-new-york-4?osq=Ichiban+Ramen=" + str(10*x)
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl, 'html.parser')
    #for i in soup,
    for i in soup.find_all('div', {'class':" i-stars__373c0___sZu0 i-stars--regular-5__373c0__20dKs border-color--default__373c0__1yxBb overflow--hidden__373c0__1TJqF"}):
        per_rating = i.text
        rating.append(per_rating)
    for i in soup.find_all('span', {'class':" arrange-unit__373c0__3XPkE arrange-unit-fill__373c0__38Zde border-color--default__373c0__r305k"}):
        ID.append(i.find("div").get("aria-label"))
    for i in soup.find_all('p', {'class':"comment__373c0__Nsutg css-n6i4z7"}):
        per_review = i.text
        review.append(per_review)
len(review)
Here's my attempt at exporting to csv where I get review text ONLY and nothing else:
with open('Review.csv','a',encoding = 'utf-8') as f:
    for each in review:
        f.write(each + '\n')
Edit - Updated
The issue actually looks to be due to not targeting the correct tags in the HTML.
# Import regex package
import re

# Narrow down the section that you are searching in to avoid erroneous elements
child = soup.find('div', {'class': 'css-79elbk border-color--default__373c0__1ei3H'})
for x in child.find_all('span', {'class':"fs-block css-m6anxm"}):
    # Ignore the titular "Username"
    if x.text != 'Username':
        ID.append(x.text)
for x in child.find_all('div', {'class':re.compile(r'i-stars.+')}):
    rating.append(x.get('aria-label'))
for x in child.find_all('p', {'class':'comment__373c0__Nsutg css-n6i4z7'}):
    comment = x.find('span', {'class':'raw__373c0__tQAx6'})
    review.append(comment.text)
The ID needed to target the specific element ('class': "fs-block css-m6anxm"), and the rating class differed depending on how many stars a review achieved, so I implemented a regex to identify anything beginning with i-stars.
Original Answer
I believe your issue is that you are only looping through review when you also need to loop through ID and rating:
# Create new_line to work around f-strings issue with '\'
new_line = '\n'
with open('Review.csv','a',encoding = 'utf-8') as f:
    for i in range(len(review)):
        f.write(f'{review[i]},{ID[i]},{rating[i]}{new_line}')
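As a side note, an alternative sketch that avoids both the f-string backslash issue and the manual comma-joining is csv.writer with zip (the sample lists here are hypothetical stand-ins for the scraped data):

```python
import csv

# hypothetical sample data standing in for the scraped lists
review = ['Great ramen', 'Too salty']
ID = ['user1', 'user2']
rating = ['5 star rating', '3 star rating']

with open('Review.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # zip walks the three lists in lockstep, writing one row per review
    writer.writerows(zip(review, ID, rating))
```

csv.writer also quotes any review text containing commas, which a plain f-string join would not.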
You could also take a look at the Pandas package in order to achieve this.
You can create a dataframe and then export that as a number of different file types, including CSV, for example:
# Import Pandas package
import pandas as pd
# Store list values, along with column headings, in a dictionary
d = {'review_comment': review, 'review_id': ID, 'review_rating': rating}
# Create dataframe from the dictionary
df = pd.DataFrame(data=d)
# Export the dataframe as a CSV
df.to_csv('desired/save/location.csv', index=False)
I'm learning web scraping on Python and I decided to test my skills in the HackerRank Leaderboard page, so I wrote the code below expecting no errors before adding the country restriction to the tester function for then exporting my csv file successfully.
But then the Python console replied:
AttributeError: 'NoneType' object has no attribute 'find_all'
The error above corresponds to line 29 of my code (for i in table.find_all({'class':'ellipsis'}):), so I decided to come here to ask for assistance. I'm afraid there could be more syntax or logic errors, so it's better to get rid of my doubts by getting feedback from experts.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint
pd.set_option('display.max_columns', None)
#Declaring a variable for looping over all the pages
pages = np.arange(1, 93, 1)
a = pd.DataFrame()
#loop cycle
for url in pages:
    #get html for each new page
    url = 'https://www.hackerrank.com/leaderboard?page=' + str(url)
    page = requests.get(url)
    sleep(randint(3,10))
    soup = BeautifulSoup(page.text, 'lxml')
    #get the table
    table = soup.find('header', {'class':'table-header flex'})
    headers = []
    #get the headers of the table and delete the "white space"
    for i in table.find_all({'class':'ellipsis'}):
        title = i.text.strip()
        headers.append(title)
    #set the headers to columns in a new dataframe
    df = pd.DataFrame(columns=headers)
    rows = soup.find('div', {'class':'table-body'})
    #get the rows of the table but omit the first row (which are headers)
    for row in rows.find_all('table-row-wrapper')[1:]:
        data = row.find_all('table-row-column ellipsis')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    #set the data of the Txn Count column to float
    Txn = df['SCORE'].values
    #combine all the data rows in one single dataframe
    a = a.append(pd.DataFrame(df))

def tester(mejora):
    mejora = mejora[(mejora['SCORE'] > 2250.0)]
    return mejora.to_csv('new_test_Score_Count.csv')

tester(a)
Do you guys have any ideas or suggestions that could fix the problem?
The error states that your table element is None. I'm guessing here, but you can't get the table from the page with bs4 because it is loaded afterwards with JavaScript. I would recommend using Selenium for this instead.
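Whichever route you take, it is worth guarding against find() returning None before chaining find_all(). A small sketch with a stand-in HTML string (roughly what a JavaScript-rendered page looks like to requests):

```python
from bs4 import BeautifulSoup

# stand-in for a JS-rendered page: the table is not in the initial HTML
html = '<html><body><div id="app">loading...</div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('header', {'class': 'table-header flex'})
# find() returns None when nothing matches, so guard before find_all()
if table is None:
    headers = []
else:
    headers = [i.text.strip() for i in table.find_all(class_='ellipsis')]

print(headers)  # []
```

With Selenium you would feed driver.page_source into BeautifulSoup instead; the same guard still applies to pages that fail to render.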
I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop with several different URL:s. If it is possible, I would like to extract these tables into different excel documents, or different sheets within a document.
I have been trying to put the code through a loop and appending the df
from bs4 import BeautifulSoup
import requests
import pandas as pd
xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')
#this is what I can't figure out
for i in range(0,10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html,'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass

#Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer,'Sheet1')
writer.save()
I get a ValueError (no tables found) for the first nine tables, and only the last table is printed when I print mylist. However, when I print them without the for loop, one link at a time, it works.
I can't append the value of df[i] because it says 'index out of range'.
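A sketch of the collecting and exporting side, assuming the scraping works for each link on its own: initialise the results list before the loop (list1 is never defined in the code above), remember that pd.read_html returns a list of DataFrames, and write one sheet per table at the end. The sample frames are hypothetical stand-ins for the scraped tables, and writing .xlsx needs an Excel engine such as openpyxl installed:

```python
import pandas as pd

list1 = []  # initialise the results list BEFORE the loop
# hypothetical stand-ins for what pd.read_html(str(table)) would yield
scraped = [pd.DataFrame({'Fees Earned': [100]}), pd.DataFrame({'Fees Earned': [200]})]
for df in scraped:
    list1.append(df)  # in the real loop, append df[0] since read_html returns a list

# write one sheet per extracted table
with pd.ExcelWriter('mytables.xlsx') as writer:
    for idx, table in enumerate(list1):
        table.to_excel(writer, sheet_name=f'Sheet{idx + 1}')
```

The context manager closes and saves the workbook, so no explicit writer.save() call is needed.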
I am currently learning webscraping and Python. I want to write a code that downloads a list of .xls data files based on a list of links that I have created. Each of these links downloads a data file that corresponds to FDI flows of a country.
My problem is that with the way the code is currently written, the last url in my list replaces all the previous files. The files are named correctly but they all contain the data for the last country in the list. As an example, I am only taking the last three countries in the data.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
page = requests.get("https://unctad.org/en/Pages/DIAE/FDI%20Statistics/FDI-Statistics-Bilateral.aspx")
soup = BeautifulSoup(page.text, 'html.parser')
countries_list = soup.select('[id=FDIcountriesxls] option[value]')
links = [link.get('value') for link in countries_list[203:-1]] #sample of countries
countries = [country.text for country in countries_list[203:-1]] #sample of countries
links_complete = ["https://unctad.org" + link for link in links]
for link in links_complete:
    for country in countries:
        r = requests.get(link)
        with open(country + '.xls', 'wb') as file:
            file.write(r.content)
What this gets me is three files, all named after the three countries but containing the data for the last (Zambia).
Can anyone help with this?
Thanks.
That's because you don't need a double loop. In the countries loop, you rewrite your file each time (mode 'wb'), so only the values of the last country are left.
To solve your problem, you can loop over countries_list directly:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
page = requests.get("https://unctad.org/en/Pages/DIAE/FDI%20Statistics/FDI-Statistics-Bilateral.aspx")
soup = BeautifulSoup(page.text, 'html.parser')
countries_list = soup.select('[id=FDIcountriesxls] option[value]')
for opt in countries_list:
    value = opt.get('value')
    if value:
        link = "https://unctad.org" + value
        country = opt.get_text()
        r = requests.get(link)
        with open(country + '.xls', 'wb') as file:
            file.write(r.content)
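An equivalent fix keeps your two existing lists but pairs them one-to-one with zip instead of nesting the loops. In this sketch the network call is swapped for stand-in bytes (and the lists are hypothetical samples) so the pairing logic is visible and runnable offline:

```python
import os
import tempfile

countries = ['CountryA', 'CountryB']  # hypothetical sample
links_complete = ['https://unctad.org/a.xls', 'https://unctad.org/b.xls']

outdir = tempfile.mkdtemp()
for link, country in zip(links_complete, countries):
    # r = requests.get(link)   # the real network call in the actual script
    content = link.encode()    # stand-in for r.content in this sketch
    with open(os.path.join(outdir, country + '.xls'), 'wb') as file:
        # each country's file is written exactly once, with its own link's data
        file.write(content)
```

zip stops at the shorter list, so each file is written once and never overwritten by a later country.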