Here is my code. Basically I want to output the variable "final" to Excel, printed in one column, but my current writing code only puts the results in one row.
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv
r = requests.get("https://www.autocodes.com/obd-code-list/")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"id":"scroller"})
for item in g_data:
    regex = '.html">(.+?)</a>'
    pattern = re.compile(regex)
    htmlfile = urllib.urlopen("https://www.autocodes.com/obd-code-list/")
    htmltext = htmlfile.read()
    final = re.findall(pattern, htmltext)
    with open('index4.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['company'])  # row?
        writer.writerows([final])
Is there a possible fix for this? Thanks; I'm new to Python and still studying it with little programming knowledge.
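For reference, the usual fix is to write each element of final as its own single-item row, which shows up as one column in Excel. A minimal sketch of just the writing step, assuming final holds the scraped strings from the code above:
# one value per row -> a single column when opened in Excel
with open('index4.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['company'])
    writer.writerows([value] for value in final)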
I have been able to scrape the website successfully, but I am having trouble saving the results as a csv and need help seeing where I messed up. Here is my code, along with the error message I get:
import bs4 as BeautifulSoup
import CSV
import re
import urllib.request
from IPython.display import HTML
# Program that scrapes the website for
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
with open("Giles_C996.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="/n")
    writer.writerow(Links)
Close()
Error message:
Traceback (most recent call last):
  File "C:\Users\epiph\Giles_C996 Project 2.txt", line 2, in <module>
    import CSV
ModuleNotFoundError: No module named 'CSV'
You've incorrectly imported the csv and bs4 modules, and Close() is not defined. You can also convert the list to a set to get rid of duplicates.
import csv
import urllib.request
from bs4 import BeautifulSoup
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")
links = set([a['href'] for a in soup.find_all('a', href=True)])
with open("Giles_C996.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerows([link] for link in links)
Output is:
https://www.census.gov/programs-surveys/cps.html
/newsroom/press-releases/2020/65-older-population-grows/65-older-population-grows-spanish.html
https://www.census.gov/businessandeconomy
https://www.census.gov/data
/programs-surveys/popest/library.html
etc.
You had some erroneous imports and referenced an undefined variable.
I'm not very familiar with IPython, so I can't comment much on your use of it, and I always have trouble with urllib, so I just used requests.
I've included some commented-out code for an alternative layout of the csv file, a function that can help determine whether a link is valid, and a list comprehension in case you prefer that approach.
The script also opens the csv file for you.
import csv, re, urllib.request, os
import requests
from bs4 import BeautifulSoup
# from IPython.display import HTML

def exists(link) -> bool:
    """
    Check if the request response is 200
    """
    try:
        return 200 == requests.get(link).status_code
    except requests.exceptions.MissingSchema:
        return False
    except requests.exceptions.InvalidSchema:
        return False

def scrapeLinks(url):
    checked = set()
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    for a in soup.find_all('a', href=True):
        link = a['href']
        if link not in checked and exists(link):
            yield link
            checked.add(link)

# Program that scrapes the website for
url = 'https://www.census.gov/programs-surveys/popest.html'
# r = urllib.request.urlopen(url).read()
r = requests.get(url).text
soup = BeautifulSoup(r, "html.parser")
# links = [
#     a['href'] for a in soup.find_all('a', href=True)
#     if exists(a['href'])
# ]
file_name = "Giles_C996.csv"
with open(file_name, "w") as csv_file:
    # writer = csv.writer(csv_file, delimiter="/n")
    writer = csv.writer(csv_file)
    # writer.writerow(set(links))  # conversion to remove duplicates
    writer.writerow(scrapeLinks(url))
    # writer.writerows(enumerate(scrapeLinks(url), 1))  # if you want a 2d-indexed collection
os.startfile(file_name)
# Close()
I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):
<script>
...
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...
I can identify the script I need with the following code:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[0]
However, I'm at a loss for how to extract the values I want. (I'm very new to Python.)
This thread recommended the following solution for a similar problem (edited to reflect the html I'm working with).
data = json.loads(all_scripts[0].get_text()[27:])
However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).
What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.
You can parse the content of the <script> tag with the json module and then get your values. For example:
import re
import json
import requests
url = 'https://www.gofundme.com/f/eric-stevens-care-trust'
txt = requests.get(url).text
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
# print( json.dumps(data, indent=4) ) # <-- uncomment this to see all data
print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code =', data['feed']['campaign']['location']['postal_code'])
Prints:
Campaign Hearts = 4817
Postal Code = 90012
The more libraries you use, the more unwieldy the code becomes. Here is a simpler solution that only needs requests and basic string splitting:
#This imports the website content.
import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
a = requests.post(url)
a = a.content
print(a)
#These will show your data.
campaign_hearts = str(a,'utf-8').split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)
postal_code = str(a,'utf-8').split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
Your json.loads was failing because of the trailing semicolon. It will work if you use a regex to extract only the JSON object string, without that semicolon.
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
This should be fine for now; I might try to write a pure lxml version later, or at least improve the search for the element.
This solution uses regex to get only the JSON data, without the window.initialState = and semicolon.
import json
import re
import requests
from bs4 import BeautifulSoup
url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"
req = requests.get(url_1)
soup = BeautifulSoup(req.content, 'lxml')
script_tag = soup.find('script')
raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)
json_content = json.loads(raw_json)
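From here the two values can be read the same way as in the earlier answer; a short usage sketch, assuming the JSON keeps the feed/campaign structure shown there:
print('Campaign Hearts =', json_content['feed']['campaign']['campaign_hearts'])
print('Postal Code =', json_content['feed']['campaign']['location']['postal_code'])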
I'm using Python 3 to write a web scraper that pulls URL links and writes them to a csv file. The code does this successfully; however, there are many duplicates. How can I create the csv file with only a single (unique) instance of each URL?
Thanks for the help!
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
r = requests.get('url')
soup = BeautifulSoup(r.text, 'html.parser')
data = []
for link in soup.find_all('a', href=True):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('base-url', link.get('href')))
        data.append(urljoin('base-url', link.get('href')))
with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
Using set() somewhere along the line is the way to go. In the code below I've added it as data = set(data) on its own line to best illustrate the usage; converting data to a set drops the ~250-url list to around ~130:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r.text, 'html.parser')
data = []
for link in set(soup.find_all('a', href=True)):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('https://www.census.gov', link.get('href')))
        data.append(urljoin('https://www.census.gov', link.get('href')))
data = set(data)
with open('CensusLinks.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
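If the original order of the links matters, a common alternative to the set() conversion is dict.fromkeys(), which removes duplicates while preserving insertion order (Python 3.7+). A minimal sketch, replacing the data = set(data) line:
# deduplicate while keeping the order the links were found in
data = list(dict.fromkeys(data))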
Below is my code. It works fine for a single given url, but I would like to read the urls from a CSV file instead. Thanks in advance.
P.S. I'm quite new to Python.
The code below works fine for a single given url:
import requests
import pandas
from bs4 import BeautifulSoup
baseurl="https//www.xxxxxxxxx.com"
r=requests.get(baseurl)
c=r.content
soup=BeautifulSoup(c, "html.parser")
all=soup.find_all("div", {"class":"biz-us"})
for br in soup.find_all("br"):
br.replace_with("\n")
This is what I have tried for reading the urls from a CSV file:
import csv
import requests
import pandas
from bs4 import BeautifulSoup
with open("input.csv", "rb") as f:
reader = csv.reader(f)
for row in reader:
url = row[0]
r=requests.get(url)
c=r.content
soup=BeautifulSoup(c, "html.parser")
all=soup.find_all("div", {"class":"biz-country-us"})
for br in soup.find_all("br"):
br.replace_with("\n")
It looks like you need to set up your loop properly and first collect the urls in a list. Try this out:
import csv
import requests
import pandas
from bs4 import BeautifulSoup
df1 = pandas.read_csv("input.csv", skiprows=0)  # assuming headers are in the first row
urls = df1['url_column_name'].tolist()  # get the urls in a list
for i in range(len(urls)):
    r = requests.get(urls[i])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all("div", {"class": "biz-country-us"})
    for br in soup.find_all("br"):
        br.replace_with("\n")
Suppose you have a csv file named linklists.csv with a header Links. You can then work through all the links under that header using the method shown below:
import csv
import requests
with open("linklists.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
print(res.url)
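For reference, this assumes linklists.csv looks something like the following (illustrative contents only):
Links
https://www.example.com/page-one
https://www.example.com/page-two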
Does anyone know how to get eBay feedback from the site using python3, beautifulsoup, re...?
I have this code, but it is not easy to find the feedback.
import urllib.request
import re
from bs4 import BeautifulSoup
fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')
for line in fhand:
    print(line.strip())
    f = open('feedbacks1.txt', 'a')
    f.write(str(line) + '\n')
    f.close()
file = open('feedbacks1.txt', 'r')
cleaned = open('cleaned.txt', 'w')
soup = BeautifulSoup(file)
page = soup.getText()
letters_only = re.sub("[^a-zA-Z]", " ", page)
cleaned.write(str(letters_only))
If you just care about the feedback text, this might be what you are looking for:
import urllib.request
import re
from bs4 import BeautifulSoup
fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')
soup = BeautifulSoup(fhand.read(), 'html.parser')
table = soup.find(attrs={'class': 'FbOuterYukon'})
for tr in table.findAll('tr'):
    if not tr.get('class'):
        print(list(tr.children)[1].getText())
I first find the table that contains the feedback, then the rows that hold the feedback entries (the ones with no class), and then take the relevant cell and parse its text. This can also be adapted for similar needs.
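If you also want to save the text rather than print it (as the cleaned.txt step in the question suggests), the same loop can write one feedback entry per line; a minimal sketch reusing the table found above:
# write one feedback entry per line instead of printing it
with open('cleaned.txt', 'w') as cleaned:
    for tr in table.findAll('tr'):
        if not tr.get('class'):
            cleaned.write(list(tr.children)[1].getText().strip() + '\n')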