I am scraping links from multiple pages under multiple searches and want to output scraped results into multiple .csv files. The table shows the .csv file which lists both my source urls and desired output file names:
url
outputfile
https://www.marketresearch.com/search/results.asp?categoryid=230&qtype=2&publisher=IDCs&datepub=0&submit2=Search
outputPS1xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=90&qtype=2&publisher=IDC&datepub=0&submit2=Search
outputPS2xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=233&qtype=2&publisher=IDC&datepub=0&submit2=Search
outputPS3xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=169&qtype=2&publisher=IDC&datepub=0&submit2=Search
outputPS4xIDC.csv
Now, with the code below, I managed to read the urls in sequence and the rest of the code also works well (when I specify the output filename directly). However, it only outputs the last of the 4 pages in the list, so it overwrites the result each time. What I actually want for it is to output the results from the first url to the first outputfile, second to second, etc.
(Of course my actual list of source URLs is much longer than these 4).
Please help, especially with the last line, as clearly just writing [outputs] there doesn't work.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
with open('inputs.csv', newline='') as csvfile:
reader = csv.DictReader(csvfile)
urls = [row["url"] for row in reader]
outputs = [row["outputfile"] for row in reader]
data = []
for url in urls:
def scrape_it(url):
page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
soup = BeautifulSoup(page.text, 'html.parser')
nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
for report in reports:
data.append({
'title': report.find('a', class_='linkTitle').text,
'price': report.find('div', class_='resultPrice').text,
'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
'detail_link': report.a['href']
})
if 'next' not in stri:
print("All pages completed")
else:
scrape_it(nexturl)
scrape_it(url)
myOutput = pd.DataFrame(data)
myOutput.to_csv([outputs], header=False) #works (but only for the last url) if instead of [outputs] I have f'filename.csv'
I don't have Pandas, and I don't really want to run your input, but a couple of things jump out a me when I look at your code:
It looks like you are not looping over url and output together. It looks like you loop over all the URLs, and then after all those loops you write once.
Likewise, data is just having the HTML table data appended and appended, it's never reset for each individual URL.
Without being able to run this, I recommend something like this. The scraping is fully encapsulated and separate from the loop, and as such you can now more clearly see the flow of inputs and outputs:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
def scrape_it(url, data):
page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
soup = BeautifulSoup(page.text, 'html.parser')
nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
for report in reports:
data.append({
'title': report.find('a', class_='linkTitle').text,
'price': report.find('div', class_='resultPrice').text,
'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
'detail_link': report.a['href']
})
if 'next' in stri:
data = scrape_it(nexturl, data)
return data
with open('inputs.csv', newline='') as csvfile:
reader = csv.DictReader(csvfile)
urls = [row["url"] for row in reader]
outputs = [row["outputfile"] for row in reader]
for (url, output) in zip(urls, outputs): # work on url and output together
data = scrape_it(url, [])
myOutput = pd.DataFrame(data)
myOutput.to_csv(output, header=False)
Related
I'm extracting data from DESWATER website, these data are then saved in CSV. to make a small example of the issue I have these 2 authors, one having a full text file the other doesn't. Hence, it will save the file to the wrong author.
So the CSV output looks like this:
Authors | File
First Author | Second File
Second Author | Third File
But I want the output like this:
Authors | File
First Author | 'No File'
Second Author | Second File
Third Author | Third File
Here is a small test code:
from bs4 import BeautifulSoup
import requests
import time
import csv
list_of_authors = []
list_of_full_file = []
r = requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
#'Author'
s = soup.find('td', class_='testo_normale')
authors = s.find_all('i')
for author in authors:
list_of_authors.append(author.text.strip())
time.sleep(1)
#'FULL TEXT'
# find all the anchor tags with "href"
n=1
for link in soup.find_all('a', class_='testo_normale_rosso'):
if "fulltext.php?abst=" in link.get('href'):
# TO ADD
baseurl = 'https://www.deswater.com/'
Full_links=baseurl+link.attrs['href'].replace('\n','')
list_of_full_file.append(f'file {n}')
n+=1
time.sleep(1)
def Save_csv():
row_head =['Author', 'File Name']
Data = []
for author, file in zip(list_of_authors, list_of_full_file):
Data.append(author)
Data.append(file)
rows = [Data[i:i + 2] for i in range(0, len(Data), 2)]
with open('data.csv', 'w', encoding='utf_8_sig', newline="") as csvfile:
csvwriter = csv.writer(csvfile)
csvwriter.writerow(row_head)
csvwriter.writerows(rows)
Save_csv()
This code will ultimately extract data from 279 pages, so I need the code to automatically detect that there is no Full Text for this author, so I can append it as 'No File'
See the reference of the correct matching in the website here.
The first author doesn't have a full text file.
Any Ideas?
Try to change your strategy selecting the elements and avoid multiple lists if yo could not ensure same length.
Use css selectors here to select all <hr> that are the base for all other selections with find_previous():
for e in soup.select('.testo_normale hr'):
data.append({
'author': e.find_previous('i').text,
'file': 'https://www.deswater.com/'+e.find_previous('a').get('href') if 'fulltext' in e.find_previous('a').get('href') else 'no url'
})
Example
from bs4 import BeautifulSoup
import requests
import csv
soup = BeautifulSoup(requests.get('https://www.deswater.com/vol.php?vol=1&oth=1|1-3|January|2009').content)
with open('data.csv', 'w', encoding='utf-8', newline='') as f:
data = []
for e in soup.select('.testo_normale hr'):
data.append({
'author': e.find_previous('i').text,
'file': 'https://www.deswater.com/'+e.find_previous('a').get('href') if 'fulltext' in e.find_previous('a').get('href') else 'no url'
})
dict_writer = csv.DictWriter(f, data[0].keys())
dict_writer.writeheader()
dict_writer.writerows(data)
Output
author,file
Miriam Balaban,no url
W. Richard Bowen,https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzEucGRm&desc=k#1#kfontk#13#kfacek#7#kk#30#kGenevak#6#kk#13#kArialk#6#kk#13#kHelveticak#6#kk#13#ksank#35#kserifk#30#kk#13#ksizek#7#kk#30#k2k#30#kk#2#kk#1#kik#2#kW.k#13#kRichardk#13#kBowenk#1#kk#4#kik#2#kk#1#kbrk#2#kWaterk#13#kengineeringk#13#kfork#13#kthek#13#kpromotionk#13#kofk#13#kpeacek#1#kbrk#2#k1k#15#k2009k#16#k1k#35#k6k#1#kbrk#4#kk#2#kk#1#kak#13#khrefk#7#kDWTk#12#kabstractsk#4#kvolk#12#k1k#4#k1k#12#k2009k#12#k1.pdfk#13#kclassk#7#kk#5#kk#30#ktestok#12#knormalek#12#krossok#5#kk#30#kk#13#ktargetk#7#kk#5#kk#30#kk#12#kblankk#5#kk#30#kk#2#kAbstractk#1#kk#4#kak#2#kk#1#kbrk#2#k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NC9URFdUX0FfMTA1MTI4NjRfTy5wZGY=&type=1
Steven J. Duranceau,https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzcucGRm&desc=k#1#kfontk#13#kfacek#7#kk#30#kGenevak#6#kk#13#kArialk#6#kk#13#kHelveticak#6#kk#13#ksank#35#kserifk#30#kk#13#ksizek#7#kk#30#k2k#30#kk#2#kk#1#kik#2#kStevenk#13#kJ.k#13#kDuranceauk#1#kk#4#kik#2#kk#1#kbrk#2#kModelingk#13#kthek#13#kpermeatek#13#ktransientk#13#kresponsek#13#ktok#13#kperturbationsk#13#kfromk#13#ksteadyk#13#kstatek#13#kink#13#kak#13#knanofiltrationk#13#kprocessk#1#kbrk#2#k1k#15#k2009k#16#k7k#35#k16k#1#kbrk#4#kk#2#kk#1#kak#13#khrefk#7#kDWTk#12#kabstractsk#4#kvolk#12#k1k#4#k1k#12#k2009k#12#k7.pdfk#13#kclassk#7#kk#5#kk#30#ktestok#12#knormalek#12#krossok#5#kk#30#kk#13#ktargetk#7#kk#5#kk#30#kk#12#kblankk#5#kk#30#kk#2#kAbstractk#1#kk#4#kak#2#kk#1#kbrk#2#k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2NS9URFdUX0FfMTA1MTI4NjVfTy5wZGY=&type=1
"Dmitry Lisitsin, David Hasson, Raphael Semiat",https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzE3LnBkZg==&desc=k#1#kfontk#13#kfacek#7#kk#30#kGenevak#6#kk#13#kArialk#6#kk#13#kHelveticak#6#kk#13#ksank#35#kserifk#30#kk#13#ksizek#7#kk#30#k2k#30#kk#2#kk#1#kik#2#kDmitryk#13#kLisitsink#6#kk#13#kDavidk#13#kHassonk#6#kk#13#kRaphaelk#13#kSemiatk#1#kk#4#kik#2#kk#1#kbrk#2#kModelingk#13#kthek#13#keffectk#13#kofk#13#kantik#35#kscalantk#13#konk#13#kCaCO3k#13#kprecipitationk#13#kink#13#kcontinuousk#13#kflowk#1#kbrk#2#k1k#15#k2009k#16#k17k#35#k24k#1#kbrk#4#kk#2#kk#1#kak#13#khrefk#7#kDWTk#12#kabstractsk#4#kvolk#12#k1k#4#k1k#12#k2009k#12#k17.pdfk#13#kclassk#7#kk#5#kk#30#ktestok#12#knormalek#12#krossok#5#kk#30#kk#13#ktargetk#7#kk#5#kk#30#kk#12#kblankk#5#kk#30#kk#2#kAbstractk#1#kk#4#kak#2#kk#1#kbrk#2#k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2Ni9URFdUX0FfMTA1MTI4NjZfTy5wZGY=&type=1
"M.A. Darwish, Fatima M. Al-Awadhi, A. Akbar, A. Darwish",https://www.deswater.com/fulltext.php?abst=XFxEV1RfYWJzdHJhY3RzXFx2b2xfMVxcMV8yMDA5XzI1LnBkZg==&desc=k#1#kfontk#13#kfacek#7#kk#30#kGenevak#6#kk#13#kArialk#6#kk#13#kHelveticak#6#kk#13#ksank#35#kserifk#30#kk#13#ksizek#7#kk#30#k2k#30#kk#2#kk#1#kik#2#kM.A.k#13#kDarwishk#6#kk#13#kFatimak#13#kM.k#13#kAlk#35#kAwadhik#6#kk#13#kA.k#13#kAkbark#6#kk#13#kA.k#13#kDarwishk#1#kk#4#kik#2#kk#1#kbrk#2#kAlternativek#13#kprimaryk#13#kenergyk#13#kfork#13#kpowerk#13#kdesaltingk#13#kplantsk#13#kink#13#kKuwaitk#32#kk#13#kthek#13#knucleark#13#koptionk#13#kIk#1#kbrk#2#k1k#15#k2009k#16#k25k#35#k41k#1#kbrk#4#kk#2#kk#1#kak#13#khrefk#7#kDWTk#12#kabstractsk#4#kvolk#12#k1k#4#k1k#12#k2009k#12#k25.pdfk#13#kclassk#7#kk#5#kk#30#ktestok#12#knormalek#12#krossok#5#kk#30#kk#13#ktargetk#7#kk#5#kk#30#kk#12#kblankk#5#kk#30#kk#2#kAbstractk#1#kk#4#kak#2#kk#1#kbrk#2#k&id23=RFdUX2FydGljbGVzL1REV1RfSV8wMV8wMS0wM190ZmphL1REV1RfQV8xMDUxMjg2Ny9URFdUX0FfMTA1MTI4NjdfTy5wZGY=&type=1
...
Everything works as expected when I'm using a single URL for the URL variable to scrape, but not getting any results when attempting to read links from a csv. Any help is appreciated.
Info about the CSV:
One column with a header called "Links"
300 rows of links with no space, commoa, ; or other charters before/after the links
One link in each row
import requests # required to make request
from bs4 import BeautifulSoup # required to parse html
import pandas as pd
import csv
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
#print(res.url)
url = res
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
email_elm0 = soup.find_all(class_= "app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_= "app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_= "app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_= "app-support-list__item")[3].text.strip()
final_email_elm = (email_elm0,email_elm1,email_elm2,email_elm3)
print(final_email_elm)
df = pd.DataFrame(final_email_elm)
#getting an output in csv format for the dataframe we created
#df.to_csv('draft_part2_scrape.csv')
The problem lies in this part of the code:
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
...
After the loop is executed, res will have the last link. So, this program will only scrape the last link.
To solve this problem, store all the links in a list and iterate that list to scrape each of the link. You can store the scraped result in a seperate dataframe and concatenate them at the end to store in a single file:
import requests # required to make request
from bs4 import BeautifulSoup # required to parse html
import pandas as pd
import csv
links = []
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
links.append(link['Links'])
dfs = []
for url in links:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
email_elm0 = soup.find_all(class_="app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_="app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_="app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_="app-support-list__item")[3].text.strip()
final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
print(final_email_elm)
dfs.append(pd.DataFrame(final_email_elm))
#getting an output in csv format for the dataframe we created
df = pd.concat(dfs)
df.to_csv('draft_part2_scrape.csv')
How would I proceed in this web scraping project using bs4 and requests? I am trying to extract user info from a forum site (myfitnesspal exactly: https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1), specifically the username, message, and date posted, and load them into columns on a csv. I have this code so far but am unsure about how to proceed:
from bs4 import BeautifulSoup
import csv
import requests
# get page source and create a BS object
print('Reading page...')
page= requests.get('https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1')
src = page.content
soup = BeautifulSoup(src, 'html.parser')
#container = soup.select('#vanilla_discussion_index > div.container')
container = soup.select('#vanilla_discussion_index > div.container > div.row > div.content.column > div.CommentsWrap > div.DataBox.DataBox-Comments > ul')
postdata = soup.select('div.Message')
user = []
date = []
text = []
for post in postdata:
text.append(BeautifulSoup(str(post), 'html.parser').get_text().encode('utf-8').strip())
print(text) # this stores the text of each comment/post in a list,
# so next I'd want to store this in a csv with columns
# user, date posted, post with this under the post column
# and do the same for user and date
This script will get all messages from the page and saves them in data.csv:
import csv
import requests
from bs4 import BeautifulSoup
url = 'https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for u, d, m in zip(soup.select('.Username'), soup.select('.DateCreated'), soup.select('.Message')):
all_data.append([u.text, d.get_text(strip=True),m.get_text(strip=True, separator='\n')])
with open('data.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for row in all_data:
writer.writerow(row)
Screenshot from LibreOffice:
One rule of thumb I like to follow with web scraping is being specific as possible without picking up unnecessary information. So for example, if I want to select a username I inspect the element containing the information I need:
<a class="Username" href="...">Username</a>
Since I am trying to collect usernames it makes the most sense to select by the class "Username":
soup.select("a.Username")
This gives me a list of all the usernames that are found on the page, this is great, however, if we want to select the data in "packages" (by post in your example we need to collect each post individually.
To accomplish this you could do something like the following:
comments = soup.select("div.comment")
This will make it easier to then do the following:
with open('file.csv', 'w', newline='') as file:
writer = csv.writer(file)
writer.writerow(['user', 'date', 'text']
for comment in comments:
username = comment.select_one("div.Username")
date = comment.select_one("span.BodyDate")
message = comment.select_one("div.Message")
writer.writerow([username, date, message])
Doing it this way also makes sure your data stays in order even if an element is missing.
Here you go:
from bs4 import BeautifulSoup
import csv
import requests
page= requests.get('https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1')
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.select('#vanilla_discussion_index > div.container > div.row > div.content.column > div.CommentsWrap > div.DataBox.DataBox-Comments > ul > li')
with open('data.csv', 'w') as f:
writer = csv.DictWriter(f, fieldnames=['user', 'date', 'text'])
writer.writeheader()
for comment in container:
writer.writerow({
'user': comment.find('a', {'class': 'Username'}).get_text(),
'date': comment.find('span', {'class': 'BodyDate DateCreated'}).get_text().strip(),
'text': comment.find('div', {'class': 'Message'}).get_text().strip()
})
import csv
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.cbssports.com/nba/stats/playersort/nba/year-2019-season-preseason-category-scoringpergame")
soup = BeautifulSoup(page.content, 'html.parser')
for record in soup.find_all('tr'):
try:
print(record.contents[0].text)
print(record.contents[6].text)
print(record.contents[7].text)
print(record.contents[8].text)
print(record.contents[9].text)
print(record.contents[10].text)
print(record.contents[12].text)
print(record.contents[13].text)
print(record.contents[14].text)
print(record.contents[15].text)
except:
pass
print('\n')
def scrape_data(url):
response = requests.get("https://www.cbssports.com/nba/stats/playersort/nba/year-2019-season-preseason-category-scoringpergame", timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all('table')[1]
rows = table.select('tbody > tr')
header = [th.text.rstrip() for th in rows[1].find_all('th')]
with open('statsoutput.csv', 'w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow(header)
for row in rows[1:]:
data = [th.text.rstrip() for th in row.find_all('td')]
writer.writerow(data)
if __name__=="__main__":
url = "https://www.cbssports.com/nba/stats/playersort/nba/year-2019-season-preseason-category-scoringpergame"
scrape_data(url)
iv'e been trying to export the stats from this web page to a csv file.
when i'm running my code, the first part works fine and retrieves the data i want.
but the function can't export it in to a csv file and iv'e been getting this error:
table = soup.find_all('table')[1]
IndexError: list index out of range
and i'm not really sure why.
You're getting this error because this site has just one <table /> html element. So soupe.find_all() is returning a list with length 1. You might solve this error doing soupe.find_all('table')[0] or, in a clean way, soup.table.
I also checked and tested your code and recommend this:
table = soup.table
rows = table.find_all('tr')
Everything will works fine after these changes. You can check this code runing here. Hope it helps.
This script is generating a csv with the data from only one of the urls fed into it. There are meant to be 98 sets of results, however the for loop isn't getting past the first url.
I've been working on this for 12hrs+ today, what am I missing in order get the correct results?
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()
def get_page_data(urls):
for url in urls:
r = requests.get(url.strip())
soup = BeautifulSoup(r.text, 'html.parser')
yield soup # N.B. use yield instead of return
print r.text
with open("gyms4.csv") as url_file:
for page in get_page_data(url_file):
name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
th = pages.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
if match:
web_address = link.text
gyms = [name,address,phoneNum,email,web_address]
gyms.append(gyms)
#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
writer = csv.writer(file)
for row in gyms:
writer.writerow([row])
You have 3 for-loops in your code and do not specifiy which one causes problem. I assume it is the one in get_page_date() function.
You leave the looop exactly in the first run with the return assignemt. That is why you never get to the second url.
There are at least two possible solutions:
Append every parsed line of url to a list and return that list.
Move you processing code in the loops and append the parsed data to gyms in the loop.
As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:
def get_page_data(urls):
for url in urls:
r = requests.get(url.strip())
soup = BeautifulSoup(r.text, 'html.parser')
yield soup # N.B. use yield instead of return
with open("gyms4.csv") as url_file:
for page in get_page_data(url_file):
name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
# etc. etc.
You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data into a list and write it in one for with csv.writer.writerows().
Also you should pass the URL list to get_page_data() rather than accessing it from a global variable.