Separate Python web-scraped data into different columns (Excel) - python

Dear Stack Overflow community,
Recently I started playing around with Python. I learned a lot from watching YouTube videos and browsing this platform, but I can't solve my problem. I hope you guys can help me out.
I tried to scrape information from websites using Python (Anaconda) and put this information in a CSV file. I tried to separate the columns by adding "," in my script, but when I open my CSV file all the data is put together in one column (A). Instead, I want the data separated into different columns (A and B, and C, D, E, F etc. when I want to add more info).
What do I have to add to this code:
filename = "brands.csv"
f = open(filename, "w")
headers = "brand, shipping\n"
f.write(headers)
for container in containers:
brand_container = container.findAll("h2",{"class":"product-name"})
brand = brand_container[0].a.text
shipping_container = container.findAll("p",{"class":"availability in-stock"})
shipping = shipping_container[0].text.strip()
print("brand: " + brand)
print("shipping: " + shipping)
f.write(brand + "," + shipping + "," + "\n")
f.close()
Thank you for helping out!
Kind regards,
Complete script after Game0ver's suggestion:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.scraped-website.com'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("li",{"class":"item last"})
container = containers[0]
import csv
filename = "brands.csv"
with open(filename, 'w') as csvfile:
    fieldnames = ['brand', 'shipping']
    # define your delimiter
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()

    for container in containers:
        brand_container = container.findAll("h2", {"class": "product-name"})
        brand = brand_container[0].a.text
        shipping_container = container.findAll("p", {"class": "availability in-stock"})
        shipping = shipping_container[0].text.strip()
        print("brand: " + brand)
        print("shipping: " + shipping)
As I mentioned, this code didn't work. I must have done something wrong?

You'd better use Python's csv module to do that:
import csv
filename = "brands.csv"
with open(filename, 'w') as csvfile:
    fieldnames = ['brand', 'shipping']
    # define your delimiter
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()
    # write rows...
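To actually get data rows into the file, the loop also needs a writerow call for each product. Below is a minimal sketch of what that could look like, reusing the selectors and variable names from the question (they are assumptions about the page, not verified here):

import csv

filename = "brands.csv"
# newline='' keeps the csv module from inserting blank rows on Windows
with open(filename, 'w', newline='') as csvfile:
    fieldnames = ['brand', 'shipping']
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()

    for container in containers:
        brand = container.findAll("h2", {"class": "product-name"})[0].a.text
        shipping = container.findAll("p", {"class": "availability in-stock"})[0].text.strip()
        # this is the step the adapted script above was missing
        writer.writerow({'brand': brand, 'shipping': shipping})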

Try enclosing your values in double quotes, like
f.write('"'+brand + '","' + shipping + '"\n')
There are better ways to handle this generic task and this functionality, though.
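One such way, sketched below, is to let the csv module add the quotes for you with quoting=csv.QUOTE_ALL (the sample row is just a placeholder standing in for scraped values):

import csv

with open("brands.csv", 'w', newline='') as f:
    # QUOTE_ALL wraps every field in double quotes, so commas inside values stay in one column
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["brand", "shipping"])       # header row, quoted automatically
    writer.writerow(["Some brand", "In stock"])  # placeholder row; real values come from the scrape loop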

You can choose either of the ways I've shown below. As the URL available within your script is unreachable, I've provided a working one.
import csv
import requests
from bs4 import BeautifulSoup
url = "https://yts.am/browse-movies"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
with open("movieinfo.csv", 'w', newline="") as f:
writer = csv.DictWriter(f, ['name', 'year'])
writer.writeheader()
for row in soup.select(".browse-movie-bottom"):
d = {}
d['name'] = row.select_one(".browse-movie-title").text
d['year'] = row.select_one(".browse-movie-year").text
writer.writerow(d)
Or you can try the following:
soup = BeautifulSoup(response.content, 'lxml')
with open("movieinfo.csv", 'w', newline="") as f:
writer = csv.writer(f)
writer.writerow(['name','year'])
for row in soup.select(".browse-movie-bottom"):
name = row.select_one(".browse-movie-title").text
year = row.select_one(".browse-movie-year").text
writer.writerow([name,year])

Related

How to STOP getting an empty CSV file with Scraping

When I run the code and get my CSV file, it's actually empty.
import requests
from bs4 import BeautifulSoup
from csv import writer
url = 'https://www.fotocasa.es/es/alquiler/todas-las-casas/girona-provincia/todas-las-zonas/l'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('section', class_='re-CardPackAdvance')
with open('casas.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Titulo', 'Precio', 'Metros', 'Telefono']
    thewriter.writerow(header)

for list in lists:
    titulo = list.find('a', class_='re-CardPackAdvance-info-container').text.replace('\n', '')
    precio = list.find('span', class_='re-CardPrice').text.replace('\n', '')
    metros = list.find('span', class_='re-CardFeaturesWithIcons-feature-icon--surface').text.replace('\n', '')
    telefono = list.find('a', class_='re-CardContact-phone').text.replace('\n', '')
    info = [titulo, precio, metros, telefono]
    thewriter.writerow(info)
I expected to have all the info scraped from this website, but it seems like I did something wrong at some point.
You are not parsing the resulting soup appropriately. There is no section element with the re-CardPackAdvance class. I adapted the code accordingly (find all article elements with a class that starts with re-CardPack). Please also note that you need to shift the for-loop in by one indentation level. However, due to the structure of the page, only the first two entries are loaded directly when fetching the page; all other entries are fetched after the page has loaded in the browser (via JavaScript). I think you might consider using the API of the page instead.
import requests
from bs4 import BeautifulSoup
from csv import writer
import re
url = 'https://www.fotocasa.es/es/alquiler/todas-las-casas/girona-provincia/todas-las-zonas/l'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all("article", class_=re.compile("^re-CardPack"))
print(len(lists))
with open('casas.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Titulo', 'Precio', 'Metros', 'Telefono']
    thewriter.writerow(header)

    for list in lists:
        titulo = list.find('a').get('title')
        precio = list.find('span', class_='re-CardPrice').text.replace('\n', '')
        metros = list.find('span', class_='re-CardFeaturesWithIcons-feature-icon--surface').text.replace('\n', '')
        telefono = list.find('a', class_='re-CardContact-phone').text.replace('\n', '')
        info = [titulo, precio, metros, telefono]
        thewriter.writerow(info)

I'm having trouble scraping multiple URLs

I'm having trouble scraping multiple URLs. Essentially I'm able to run this for only one genre, but the second I include other links it stops working.
The goal is to get the data and place it into a csv file with the movie title, url, and genre. Any help would be appreciated!
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
containers = page_soup.findAll("li",{"class":"nm-content-horizontal-row-item"})
# name the output file to write to local disk
out_filename = "netflixaction2.csv"
# header of csv file to be written
headers = "Movie_Name, Movie_ID \n"
# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)
for container in containers:
    title_container = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    title_container = title_container[0].text
    movieid = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    movieid = movieid[0].attrs['href']
    print("Movie Name: " + title_container, "\n")
    print("Movie ID: ", movieid, "\n")
    f.write(title_container + ", " + movieid + "\n")

f.close()  # Close the file
The reason you are getting the error is that you are trying to do a GET request on a list.
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']
uClient = uReq(my_url)
What I suggest doing here is to loop through each link, e.g.:
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']
for link in my_url:
    uClient = uReq(link)
    page_html = uClient.read()
    ....
And to mention: if you just apply the existing code inside the loop, it will overwrite your f.write output on every iteration. What you need to do is something like:
New edit:
import csv
import requests
from bs4 import BeautifulSoup as soup
# All given URLs
my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# Create and open CSV file
with open("netflixaction2.csv", 'w', encoding='utf-8') as csv_file:
    # Headers for CSV
    headers_for_csv = ['Movie Name', 'Movie Link']
    # Set up the csv DictWriter
    csv_writer = csv.DictWriter(csv_file, delimiter=',', lineterminator='\n', fieldnames=headers_for_csv)
    csv_writer.writeheader()

    # We need to loop through each URL from the list
    for link in my_url:
        # Do a simple GET request with the URL
        response = requests.get(link)
        page_soup = soup(response.text, "html.parser")
        # Find all nm-content-horizontal-row-item
        containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})
        # Loop through each found "li"
        for container in containers:
            movie_name = container.text.strip()
            movie_link = container.find("a")['href']
            print(f"Movie Name: {movie_name} | Movie link: {movie_link}")
            # Write to CSV
            csv_writer.writerow({
                'Movie Name': movie_name,
                'Movie Link': movie_link,
            })

# Close the file (redundant here: the with block already closes it)
csv_file.close()
That should be your solution :) Feel free to comment if I'm missing something!

Getting a list output to write correctly in rows in a csv file

I am trying to write this output to a CSV file, but it is simply not working. I have tried many write-to-CSV tutorials but none of them work. If you could please direct me to a tutorial explaining why this isn't working, I would like to learn the issue and solve it.
import bs4
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
import csv
myurl = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38'
uclient = ureq(myurl)
page_html = uclient.read()
uclient.close()
page_soup = soup(page_html, 'html.parser')
items = page_soup.find_all('div', {'class':'item-container'})
#filename = 'zeus.csv'
#f = open(filename, 'w')
#header = 'Item Details\n'
#f.write(header)
#contain = items[0]
#container = items[0]
for container in items:
    details = container.a.img['title']
    with open('zeus.csv', 'w') as f:
        f.write(details + "\n")
        #print(details)
You can run:
with open('zeus.csv', 'w') as f:
    for container in items:
        details = container.a.img['title']
        f.write("{} \n ".format(details))
The problem with the code was that with open('zeus.csv', 'w') as f: was inside the loop, so each iteration overwrote the output of the previous ones.
You can try something like this for writing a list to a .csv file:
import csv
# open file
with open(..., 'w', newline='') as your_file:
    writer = csv.writer(your_file, quoting=csv.QUOTE_ALL)
    # write your list values
    writer.writerow(your_list)
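For example, applied to the question's data it might look like the sketch below; zeus.csv and the items list come from the question, and the list comprehension is just one possible way to build the list:

import csv

# items is assumed to be the result of the Newegg find_all() above
your_list = [container.a.img['title'] for container in items]

with open('zeus.csv', 'w', newline='') as your_file:
    writer = csv.writer(your_file, quoting=csv.QUOTE_ALL)
    for details in your_list:
        writer.writerow([details])  # one row per product title, in a single column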

Crawling output into multiple csv files

I would like to know how to export my crawling results into multiple CSV files, one for each different city that I have crawled. Somehow I'm running into walls and can't find a proper way to sort it out.
This is my code:
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
output_file= open("TA.csv", "w", newline='')
RegionIDArray = [187147,187323,186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]))
                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Location']
                if g_data:
                    writer.writerow([str(item), str(dict[reg])])
My goal would be to get three separate CSV files for Paris, Berlin and London instead of getting all the results in one big CSV file.
Could you guys help me out? Thanks for your feedback :)
I made some minor modifications to your code. To create one file per locale, I moved the output_file creation inside the loop.
Note that, as I don't have time right now, the very last part is a hack to ignore Unicode errors -- it just skips trying to output a line with a non-ASCII character. That isn't good. Maybe someone can fix that part?
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
RegionIDArray = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
    output_file = open("TA" + str(reg) + ".csv", "w")
    for page in range(1, 700, 30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                # print("POI: " + str(item) + " | " + "Location: " + str(RegionIDArray[reg]))
                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Location']
                if g_data:
                    try:
                        writer.writerow([str(item), str(RegionIDArray[reg])])
                    except:
                        pass
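If someone does want to fix that part, one option (an untested sketch, not verified against the site) is to open each per-region file with an explicit UTF-8 encoding, so non-ASCII place names can be written and the try/except hack can be dropped:

import csv

RegionIDArray = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}

for reg in RegionIDArray:
    # encoding='utf-8' lets non-ASCII POI names through without UnicodeEncodeError;
    # newline='' is what the csv module recommends for the output file
    with open("TA" + str(reg) + ".csv", "w", encoding="utf-8", newline="") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(['POI', 'Location'])  # write the header row once per file
        # ... the scraping loop from the answer goes here, ending simply with:
        # writer.writerow([item, RegionIDArray[reg]])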

writing beautiful soup output to CSV

I want to write prices and corresponding addresses to a CSV file for Excel. I have this code so far, which gives the output shown in the screenshot.
What I want is a column for the price first and a column for the address second.
from bs4 import BeautifulSoup
import requests
import csv

number = "1"
url = "http://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&v=list&134=1&nofilters=1&originalsidebar=1&key=1654466070&page=" + number + "&sort_order=prop_default&rptpath=350-5748-3399-"
r = requests.get(url)
soup = BeautifulSoup(r.content)

output_file = open("output.csv", "w")

price = soup.find_all("div", {"class": "property-card-price-container"})
address = soup.find_all("div", {"class": "property-card-subtitle"})

n = 1
while n != 150:
    b = (price[n].text)
    b = str(b)
    n = n + 1
    output_file.write(b)

output_file.close()
Maybe something like this?
from bs4 import BeautifulSoup
import requests
import csv
....
r = requests.get(url)
soup = BeautifulSoup(r.content)
price = soup.find_all("div",{"class":"property-card-price-container"})
address = soup.find_all("div",{"class":"property-card-subtitle"})
dataset = [(x.text, y.text) for x,y in zip(price, address)]
with open("output.csv", "w", newline='') as csvfile:
writer = csv.writer(csvfile)
for data in dataset[:150]: #truncate to 150 rows
writer.writerow(data)
There are a few problems with your code. Getting the prices and addresses into separate lists risks them getting mixed up if the site switches the order of the items. When scraping entries like this, it is important to first find the larger enclosing container and then narrow down from there.
Unfortunately the URL you provided is no longer valid. As such I just browsed to another set of listings for this example:
from bs4 import BeautifulSoup
import requests
import csv
url = 'http://www.trademe.co.nz/property/residential-property-for-sale'
url += '/waikato/view-list'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')
with open('output.csv', 'w', newline='') as csvfile:
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for listing in soup.find_all('div', {'class': 'property-list-view-card'}):
        price = listing.find_all('div', {'class': 'property-card-price-container'})
        address = listing.find_all('div', {'class': 'property-card-subtitle'})
        propertyWriter.writerow([price[0].text.strip(), address[0].text.strip()])
