Scraping table data from multiple URLs, but the first link is repeating - python

I'm looking to iterate through the URL with "count" as a variable ranging from 1 to 65.
Right now I'm close, but I'm really struggling to figure out the last piece. I'm receiving the same table (from count = 1) 65 times, instead of receiving the 65 different tables.
import requests
import pandas as pd

url = 'https://basketball.realgm.com/international/stats/2023/Averages/Qualified/All/player/All/desc/{count}'

res = []
for count in range(1, 65):
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    res.append(df)
    print(res)

df.to_csv('my data.csv')
Any thoughts?

A few errors:
Your URL is never formatted with the loop variable: without an f-string (or .format()), the string keeps the literal text .../{count}, so every request fetches the same page.
If you want pages 1 to 65, use range(1, 66), since range excludes the end value.
Unless you want to export only the last dataframe, you need to concatenate all of them before writing the CSV.
# No count here, we will add it later
url = 'https://basketball.realgm.com/international/stats/2023/Averages/Qualified/All/player/All/desc'

res = []
for count in range(1, 66):
    # pd.read_html accepts a URL too, so no need to make a separate request
    df_list = pd.read_html(f"{url}/{count}")
    res.append(df_list[-1])

pd.concat(res).to_csv('my data.csv')
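If you would rather keep the templated URL from the question, here is a minimal sketch of the same fix using str.format; the one-second pause is just an assumption about being polite to the server, not part of the original answer:

import time
import pandas as pd

url_template = 'https://basketball.realgm.com/international/stats/2023/Averages/Qualified/All/player/All/desc/{count}'

res = []
for count in range(1, 66):
    # substitute the page number into the template before requesting it
    res.append(pd.read_html(url_template.format(count=count))[-1])
    time.sleep(1)  # assumption: small pause between requests

pd.concat(res, ignore_index=True).to_csv('my data.csv', index=False)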

Related

webscraping over multiple pages in Python and creating a df showing one row per forum

I am having a formatting issue when scraping multiple forums and would hugely appreciate your help on this topic.
I am scraping multiple forums and would like to format the output as follows: all posts from one forum should be combined into one row. For example, if forum 1 has 10 posts, all 10 posts should be appended and presented in row 1 of the df; if forum 2 has 20 posts, all 20 posts should likewise be appended and presented in row 2 of the df; and so forth for the remaining forums.
Please see my code below; the part that is not working yet can be found towards the end. Thank you for your help!
# import
import requests
from bs4 import BeautifulSoup
import pandas as pd

# first, create an empty data frame where the final results will be stored
df = pd.DataFrame()

# second, create a function to get all the user comments
def get_comments(lst_name):
    # replace all emojis with text:
    for img in bs.select("img.smilie"):
        img.replace_with(img["alt"])
    bs.smooth()
    # remove all blockquotes
    for bquote in bs.select("blockquote"):
        bquote.replace_with(" ")
    bs.smooth()
    # find all user comments and save them to a list
    comment = bs.find_all(class_=[("bbWrapper", "content")])
    # iterate over the list comment to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name

# third, read URLs from a csv file
url_list = pd.read_csv('URL_List.csv', header=None)
### the links are https://vegan-forum.de/viewtopic.php?f=54&t=8325, https://forum.muscle-corps.de/threads/empfehlungen-f%C3%BCr-gute-produkte.5115/, https://forum.muscle-corps.de/threads/empfehlungen-f%C3%BCr-gute-produkte.5115/page-2

# fourth, loop over the list of URLs
urls = url_list[0]
for url in urls:
    link = url
    # create the list for the output of the function
    user_comments = []
    # get the content of the forum
    page = requests.get(link)
    html = page.content
    bs = BeautifulSoup(html, 'html.parser')
    # call the function to get the information
    get_comments(user_comments)
    # create a pandas dataframe for the user comments
    comments_dict = {
        'user_comments': user_comments
    }
    df_comments_info = pd.DataFrame(data=comments_dict)
    ### *THIS PART IS NOT WORKING* ###
    # join all comments into one cell
    #df_comments_info = pd.DataFrame({'user_comments': [','.join(df['user_comments'].str.strip('"').tolist())]})
    # append the temporary dataframe to the dataframe which has been created earlier outside the for loop
    df = df.append(df_comments_info)

# lastly, save the dataframe to a csv file
df.to_csv('test_March27.csv', header=False, index=False)
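A minimal sketch of one way to get the one-row-per-forum output, reusing the imports, the urls series, and the get_comments function from above; the separator and column names are arbitrary choices, not taken from the original post:

rows = []
for url in urls:
    page = requests.get(url)
    bs = BeautifulSoup(page.content, 'html.parser')  # get_comments reads this module-level bs
    user_comments = get_comments([])
    # join every comment from this page into one string -> one row per forum URL
    rows.append({'forum_url': url, 'user_comments': ' | '.join(user_comments)})

df = pd.DataFrame(rows)
df.to_csv('test_March27.csv', header=False, index=False)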

How to get data from a link inside a webpage in Python?

I need to collect data from the website - https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a dataframe using pandas. For this I use the following code and get the data quite easily -
import pandas as pd
import requests
url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]
But if you notice, there is another hyperlink, labelled "Details", on the extreme right-hand side of every row of the table on the webpage. I would also like to add the data from inside that hyperlink to every row of our dataframe. How do we do that?
As suggested by Shi XiuFeng, BeautifulSoup is better suited for your problem but if you still want to proceed with your current code, you would have to use regex to extract the URLs and add them as a column like this:
import re
import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url)
link_content = str(link.content)

# grab the table body, then pull the href out of every "Details" link
res = re.findall(r'(<tbody.*?>.*?</tbody>)', link_content)[0]
res = re.findall(r'(<a href=\"(.*?)\">Details\<\/a\>)', res)
res = [i[1] for i in res]

# parse the table itself and attach the extracted links as a new column
link_text = link.text
df = pd.read_html(link_text)
df = df[-1]
df['links'] = res
print(df)
Hope that solves your problem.
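For completeness, a minimal sketch of the BeautifulSoup route mentioned above; the assumption that every table row carries exactly one link whose text is "Details" is untested here:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")

# collect the href of every anchor whose text is exactly "Details"
details_links = [a.get("href") for a in soup.find_all("a", string="Details")]

df = pd.read_html(resp.text)[-1]
df["links"] = details_links  # assumes one "Details" link per table row
print(df)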

Splitting up text into separate rows - BeautifulSoup

I'm trying to loop over a few hundred pages of a site, grab Buddhist quotes, and save them into a dataframe. I've mostly got the code working, but am struggling to parse some of the text appropriately. On each page I'm scraping there are 5 quotes and, from what I can tell in the HTML output, no obvious identifier for each. I've attempted to loop over what I scrape from each page, but it either overwrites all previous quotes (i.e. quotes 1-4) or groups them all together into a single cell.
See set-up and code below:
# For data handling:
import pandas as pd

# Set Pandas output options
pd.set_option('display.max_colwidth', None)

# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur

# Make empty dataframe
emptydata = pd.DataFrame({"quote": [], "quote_date": [], "page_no": []})

# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    new_result = pd.DataFrame({
        "quote": [soup.find_all("div", class_="post-content")],
        "quote_date": [soup.find_all("div", class_="post-date")],
        "page_no": [str(i)]
    })
    emptydata = emptydata.append(new_result)
emptydata
As you can see from the image attached, this is bundling all 5 quotes into a single cell and creating one row of data per page. Any thoughts on how I can split these up so that I have one row per quote and date? I tried looping over soup.find_all("div", class_="post-content"), but I figure I must have been constructing the dataframe incorrectly, as that overwrote all but the last quote on each page.
(screenshot: what my dataframe currently looks like)
Thanks in advance! Chris
How to fix?
You should add an additional for loop over each quote container to reach your goal:
for post in soup.find_all("div", class_="quote-inner"):
    new_result = pd.DataFrame({
        "quote": [post.find("div", class_="post-content").get_text(strip=True)],
        "quote_date": [post.find_all("div", class_="post-date")[1].get_text()],
        "page_no": [str(i)]
    })
Example
# For data handling:
import pandas as pd

# Set Pandas output options
pd.set_option('display.max_colwidth', None)

# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur

# Make empty dataframe
emptydata = pd.DataFrame({"quote": [], "quote_date": [], "page_no": []})

# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    for post in soup.find_all("div", class_="quote-inner"):
        new_result = pd.DataFrame({
            "quote": [post.find("div", class_="post-content").get_text(strip=True)],
            "quote_date": [post.find_all("div", class_="post-date")[1].get_text()],
            "page_no": [str(i)]
        })
        emptydata = emptydata.append(new_result)
emptydata
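As a side note, DataFrame.append has been deprecated and removed in newer pandas versions; here is a sketch of the same logic that collects plain dicts and builds the dataframe once at the end, reusing the imports and selectors above:

rows = []
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    soup = BShtml(ur.urlopen(url).read(), "html.parser")
    for post in soup.find_all("div", class_="quote-inner"):
        rows.append({
            "quote": post.find("div", class_="post-content").get_text(strip=True),
            "quote_date": post.find_all("div", class_="post-date")[1].get_text(),
            "page_no": str(i),
        })

# one DataFrame built at the end instead of appending row by row
quotes = pd.DataFrame(rows)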

Extracting Tables From Different Sites With BeautifulSoup IN A LOOP

I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop over several different URLs. If possible, I would like to extract these tables into different Excel documents, or into different sheets within one document.
I have been trying to run the code in a loop and append the df:
from bs4 import BeautifulSoup
import requests
import pandas as pd

xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')

# this is what I can't figure out
for i in range(0, 10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html, 'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass

# Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer, 'Sheet1')
writer.save()
I get a ValueError ("no tables found") for the first nine tables, and only the last table is printed when I print my list. However, when I run them without the for loop, one link at a time, it works.
I can't append the value of df[i] because it says 'index out of range'.
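A minimal sketch of one way to write each extracted table to its own sheet in a single workbook; the 'Link' column and the 'Fees Earned' selector come from the question, while the sheet naming and error handling are just assumptions:

from bs4 import BeautifulSoup
import requests
import pandas as pd

links = pd.ExcelFile(r'path/to/file.xlsx').parse('Sheet1')['Link']

with pd.ExcelWriter('mytables.xlsx') as writer:
    for i, url in enumerate(links[:10]):
        try:
            html = requests.get(url).content
            soup = BeautifulSoup(html, 'lxml')
            table = soup.select_one('table:contains("Fees Earned")')
            df = pd.read_html(str(table))[0]
            df.to_excel(writer, sheet_name=f'table_{i}', index=False)  # one sheet per URL
        except ValueError:
            # pd.read_html raises ValueError when no table is found on the page
            print(f'No table found for {url}')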

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape the IB website. I have created the URLs to iterate over, and I am able to extract the required information, but it seems the dataframe keeps being overwritten instead of appended to.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests

base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"

n = 1
url_list = []
while n <= 2:
    url = (base_url + "&page=%d" % n)
    url_list.append(url)
    n = n + 1

def parse_websites(url_list):
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')  # Parse the HTML as a string
        table = soup.find('div', {'class': 'table-responsive no-margin'})  # Grab the first table
        df = pd.DataFrame(columns=range(0, 4), index=[0])  # I know the size
        for row_marker, row in enumerate(table.find_all('tr')):
            column_marker = 0
            columns = row.find_all('td')
            try:
                df.loc[row_marker] = [column.get_text() for column in columns]
            except ValueError:
                # It's a safe way when [column.get_text() for column in columns] is an empty list.
                continue
        print(df)
        df.to_csv('path_to_file\\test1.csv')

parse_websites(url_list)
Can you please take a look at my code and advise what I am doing wrong?
One solution, if you want to append the dataframes to the file, is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
Otherwise, you should create the dataframe outside the loop, as mentioned in the comments.
If you define a data structure inside a loop, each iteration of the loop redefines it, so the previous work is overwritten. The dataframe should be defined outside the loop if you do not want it to be overwritten.
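For the second option, a minimal sketch with the collection defined outside the per-URL loop and written once at the end; it reuses pd, requests, BeautifulSoup, and url_list from the question, and the row parsing is simplified, so treat it as an illustration rather than a drop-in replacement:

def parse_websites(url_list):
    frames = []  # lives outside the per-URL loop, so nothing gets overwritten
    for url in url_list:
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        table = soup.find('div', {'class': 'table-responsive no-margin'})
        rows = [
            [td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in table.find_all('tr')
        ]
        # drop header/empty rows that have no <td> cells
        frames.append(pd.DataFrame([r for r in rows if r]))
    # write everything in one go instead of once per URL
    pd.concat(frames, ignore_index=True).to_csv('path_to_file\\test1.csv', index=False)

parse_websites(url_list)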
