Does someone know how to get eBay feedback from the site using python3, beautifulsoup, re...
I have this code, but it is not easy to find the feedback.
import urllib.request
import re
from bs4 import BeautifulSoup
fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')
for line in fhand:
    print(line.strip())
    f = open('feedbacks1.txt', 'a')
    f.write(str(line) + '\n')
    f.close()
file = open('feedbacks1.txt', 'r')
cleaned = open('cleaned.txt', 'w')
soup = BeautifulSoup(file)
page = soup.getText()
letters_only = re.sub("[^a-zA-Z]", " ", page )
cleaned.write(str(letters_only))
If you just care about the feedback text, this might be what you are looking for:
import urllib.request
import re
from bs4 import BeautifulSoup
fhand = urllib.request.urlopen('http://feedback.ebay.com/ws/eBayISAPI.dll?ViewFeedback2&userid=nana90store&iid=-1&de=off&items=25&searchInterval=30&which=positive&interval=30&_trkparms=positive_30')
soup = BeautifulSoup(fhand.read(), 'html.parser')
table = soup.find(attrs = {'class' : 'FbOuterYukon'})
for tr in table.findAll('tr'):
    if not tr.get('class'):
        print(list(tr.children)[1].getText())
I first find the table containing the feedback, then the rows that hold the feedback (those with no class), and then extract and print the relevant text. This can also be adapted for similar needs.
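The table-then-rows idea can be exercised without hitting eBay at all, against a small self-contained HTML snippet. The markup below is invented for illustration (only the `FbOuterYukon` class name is taken from the answer; the real eBay page layout may differ or have changed):

```python
from bs4 import BeautifulSoup

# made-up HTML mimicking the structure the answer relies on:
# header rows carry a class, feedback rows do not
html = """
<table class="FbOuterYukon">
<tr class="header"><td>Feedback</td><td>From</td></tr>
<tr><td>icon</td><td>Great seller, fast shipping!</td></tr>
<tr><td>icon</td><td>Item exactly as described.</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find(attrs={'class': 'FbOuterYukon'})

feedbacks = []
for tr in table.findAll('tr'):
    if not tr.get('class'):          # skip rows that have a class (headers)
        # the second cell of each feedback row holds the text
        feedbacks.append(list(tr.children)[1].getText())

print(feedbacks)
```

Note that `list(tr.children)[1]` is positional and therefore brittle: it only works while the feedback text really is the second child of the row.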
I was wondering if there's any way to get the text from a certain URL using Python.
For example, from this one https://www.ixbt.com/news/2022/04/20/160-radeon-rx-6400.html
Thank you in advance.
You can do web scraping in Python using BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.ixbt.com/news/2022/04/20/160-radeon-rx-6400.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
text = soup.get_text()
After that you could save the extracted text into a text file:
text_file = open("webscrap.txt", "w", encoding="utf-8")
text_file.write(text)
text_file.close()
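The same extract-then-save flow can be tried without a network call by parsing an HTML string directly. The snippet below is illustrative only (the HTML is made up, not the ixbt page); it also shows `get_text`'s `separator` and `strip` options, which keep the extracted text readable:

```python
from bs4 import BeautifulSoup

# stand-in HTML; the real page would come from urlopen(url).read()
html = "<html><body><h1>Radeon RX 6400</h1><p>A short news item.</p></body></html>"
soup = BeautifulSoup(html, features="html.parser")

# strip=True trims each text fragment; separator joins them with newlines
text = soup.get_text(separator="\n", strip=True)
print(text)

# a context manager closes the file even if the write fails
with open("webscrap.txt", "w", encoding="utf-8") as text_file:
    text_file.write(text)
```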
I am trying to scrape a website, which so far I am able to do, but I want to output the result to a text file and then delete some strings from it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
delete = ['https://', 'http://', 'b\'http://', 'b\'https://']
url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser' )
print(bs.encode('utf_8'))
The result is a lot of links, I can show a sample.
"b'https://certain-wrench.000webhostapp.com/auth/signin/details.html\nhttps://sweer-adherence.000webhostapp.com/auth/signin/details.html\n"
UPDATED
import requests
from bs4 import BeautifulSoup
url = "https://openphish.com/feed.txt"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
with open('url.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())

delete = ["</p>", "</body>", "</html>", "<body>", "<p>", "<html>", "www.",
          "https://", "http://", " ", " ", " "]

with open(r'C:\Users\v-morisv\Desktop\scripts\url.txt', 'r') as file:
    with open(r'C:\Users\v-morisv\Desktop\scripts\url1.txt', 'w') as file1:
        for line in file:
            for word in delete:
                line = line.replace(word, "")
            print(line, end='')
            file1.write(line)
The code above works, but I have a problem: I am not getting only the domain, I am getting everything after the forward slash too, so it looks like this:
bofawebplus.webcindario.com/index4.html and I want to remove the "/" and everything after it.
This seems like a proper situation for using a regular expression.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen('https://openphish.com/feed.txt')
bs = BeautifulSoup(url.read(), 'html.parser' )
import re
domain_list = re.findall(re.compile('http[s]?://([^/]*)/'), bs.text)
print('\n'.join(domain_list))
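The capture group `([^/]*)` grabs everything between the scheme and the next slash. The same pattern can be checked against a few sample feed lines (the URLs below are taken from the examples in this thread, not a live feed):

```python
import re

# sample of what the plain-text feed looks like
feed = (
    "https://certain-wrench.000webhostapp.com/auth/signin/details.html\n"
    "https://sweer-adherence.000webhostapp.com/auth/signin/details.html\n"
    "http://bofawebplus.webcindario.com/index4.html\n"
)

# http[s]? matches either scheme; ([^/]*) captures up to the next slash
domain_list = re.findall(re.compile('http[s]?://([^/]*)/'), feed)
print('\n'.join(domain_list))
```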
There's no reason to use BeautifulSoup here, it is used for parsing HTML, but the URL being opened is plain text.
Here's a solution that should do what you need. It uses Python's urlparse as an easier and more reliable way of extracting the domain name.
It also uses a Python set to remove duplicate entries, since there were quite a few.
from urllib.request import urlopen
from urllib.parse import urlparse
feed_list = urlopen('https://openphish.com/feed.txt')
domains = set()
for line in feed_list:
    url = urlparse(line)
    domain = url.netloc.decode('utf-8')  # decode from utf-8 bytes to string
    domains.add(domain)  # keep all the domains in a set to remove duplicates

for domain in domains:
    print(domain)
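Because the feed is read as raw bytes, `urlparse` returns byte components, hence the `.decode('utf-8')`. A stand-alone version of the same idea, with the feed lines mocked up as the bytes `urlopen` would yield:

```python
from urllib.parse import urlparse

# simulated feed lines, as bytes, including a duplicate domain
feed_lines = [
    b'https://certain-wrench.000webhostapp.com/auth/signin/details.html\n',
    b'https://certain-wrench.000webhostapp.com/other/page.html\n',
    b'http://bofawebplus.webcindario.com/index4.html\n',
]

domains = set()
for line in feed_lines:
    url = urlparse(line)                     # works on bytes input too
    domains.add(url.netloc.decode('utf-8'))  # netloc is the domain part

for domain in sorted(domains):
    print(domain)
```

The set collapses the two `certain-wrench` entries into one, so only two domains are printed.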
I am trying to scrape reviews of IMDb movies using Python 3.6. However, when I print my 'review', only one review pops up and I am not sure why the rest do not. This does not happen with my 'review_title'. Any advice or help is greatly appreciated, as I've been searching forums and googling to no avail.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url,"html.parser")
print(soup.prettify())
review_title = soup.find("div",attrs={"class":"lister"}).findAll("div",{"class":"title"})
review = soup.find("div",attrs={"class":"text"})
review = soup.find("div",attrs={"class":"text"}).findAll("div",{"class":"text"})
rating = soup.find("span",attrs={"class":"rating-other-user-rating"}).findAll("span")
Without creating any loop, how can you reach all the content of that page? The way you have written your script, it is doing exactly what it is supposed to do (parsing a single review's content). Try the approach below instead; it will fetch all the visible data.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url,"html.parser")
for item in soup.find_all(class_="review-container"):
    review_title = item.find(class_="title").text
    review = item.find(class_="text").text
    try:
        rating = item.find(class_="point-scale").previous_sibling.text
    except AttributeError:  # not every review carries a rating
        rating = ""
    print("Title: {}\nReview: {}\nRating: {}\n".format(review_title, review, rating))
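The container-first pattern generalizes: find every repeating wrapper element, then pull the named pieces out of each one. A self-contained illustration with mocked-up markup (the class names mirror the ones used above, but the HTML itself is invented):

```python
from bs4 import BeautifulSoup

# invented review markup: one wrapper div per review
html = """
<div class="review-container"><div class="title">Great film</div><div class="text">Loved it.</div></div>
<div class="review-container"><div class="title">Not for me</div><div class="text">Too long.</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

reviews = []
for item in soup.find_all(class_="review-container"):
    # searching within item scopes the lookup to this one review
    title = item.find(class_="title").text
    body = item.find(class_="text").text
    reviews.append((title, body))

print(reviews)
```

The key point is that `item.find(...)` searches only inside that container, which is why each iteration yields a different review instead of the first one on the page every time.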
Here is my code. Basically, I wanted to output the variable "final" to Excel, printed in one column. My current code only writes the results to one row in Excel.
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv
r = requests.get("https://www.autocodes.com/obd-code-list/")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"id":"scroller"})
for item in g_data:
    regex = '.html">(.+?)</a>'
    pattern = re.compile(regex)
    htmlfile = urllib.urlopen("https://www.autocodes.com/obd-code-list/")
    htmltext = htmlfile.read()
    final = re.findall(pattern, htmltext)

with open('index4.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['company'])  # row?
    writer.writerows([final])
Is there any possible fix for this? Thanks; I am new to Python and still studying it with little programming knowledge.
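On the column-versus-row point: `writer.writerows([final])` treats `final` as a single row, so every value lands side by side. Wrapping each item in its own one-element list writes one value per row, which is what a single Excel column looks like in CSV. A minimal Python 3 sketch (the two code strings are made up for illustration; `final` would come from your `re.findall` call):

```python
import csv

# hypothetical scraped values standing in for the real findall() result
final = ['P0100 - Mass Air Flow Circuit', 'P0101 - MAF Range/Performance']

with open('index4.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['company'])                   # header row
    writer.writerows([[item] for item in final])   # one item per row -> one column
```

(`newline=''` is the Python 3 way to stop the csv module from inserting blank lines on Windows; your snippet is Python 2, where you would open the file in `'wb'` mode instead.)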
I have started to learn how to scrape information from websites using urllib and beautifulsoup. I want to grab all the text from this page (in the code) and put it into a text file.
import urllib
from bs4 import BeautifulSoup as Soup
base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"
url = (base_url)
soup = Soup(urllib.urlopen(url))
print(soup.get_text())
When I run this it grabs the text, although it outputs it with spaces between all the letters and still shows me HTML; I'm unsure why.
i n ' > Y u p . B u t d o n t f e e
Like that, any ideas?
Also, what would I do to put this info into a text file?
(Using beautifulsoup4 and running ubuntu 12.04 and python 2.7)
Thank you :)
I had some trouble with the encoding, so I changed your code slightly, then added the piece to print the results to a file:
import urllib
from bs4 import BeautifulSoup as Soup
base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"
url = (base_url)
content = urllib.urlopen(url)
soup = Soup(content)
# print soup.original_encoding
theegg_text = soup.get_text().encode("windows-1252")
f = open("somefile.txt", "w")
f.write(theegg_text)
f.close()
You could try using html2text:
import html2text as htmlconverter
print htmlconverter.html2text('<HTML><BODY>HI</BODY></HTML>')