I'm new here and a Python newbie, currently learning the basics, mostly scraping, and I've run into a problem I hope you can help me solve.
I'm trying to scrape a few details from a website and write them into a CSV file, but I'm only able to write the last results into my CSV; apparently my script just overwrites the data.
Also, if you find any mistakes in my code or any room for improvement (which I'm sure there is), I'd be glad if you would point them out as well.
Finally, any recommendations for videos/tutorials that could help me improve my Python and scraping skills would be appreciated.
import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('contacts.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["department", "name", "position", "phone"])

for department in soup.find_all("div", class_="view-content"):
    department_name = department.h3
    print(department_name.text)

for contacts in soup.find_all("div", class_="col-md-7 col-xs-10"):
    contact_name = contacts.strong
    print(contact_name.text)

for position in soup.find_all("div", class_="field-content"):
    print(position.text)

for phone in soup.find_all("div", class_="modal-content"):
    first_phone = phone.h3
    first_phones = first_phone
    print(first_phones)

csv_writer.writerow([department_name, contact_name, position, first_phones])

csv_file.close()
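For reference, the reason you only get the last results: csv_writer.writerow([...]) runs once, after all four loops have finished, so it only sees whatever the loop variables held on their final iteration. The usual pattern is to write one row per iteration of a single loop. A minimal sketch, using the same "well profile" blocks the answers below settle on:

# write one row per profile block, inside the loop
for info in soup.find_all("div", class_="well profile"):
    name = info.find("div", class_="col-md-7 col-xs-10").strong.text
    position = info.find("div", class_="field-content").text
    csv_writer.writerow([name, position])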
Thanks Thomas,
I actually tweaked my code a bit while thinking about how to make it simpler (four for loops are too many, no?), and with the following code I solved my problem (I dropped 'department' and 'phones' because of some other issues):
import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

f = open("contactslot.csv", "w+")
csv_writer = csv.writer(f)
csv_writer.writerow(["Name", "Position"])

information = soup.find_all("div", class_="well profile")
for info in information:
    contact_name = info.find_all("div", class_="col-md-7 col-xs-10")
    names = contact_name[0].strong
    name = names.text
    print(name)

    position_name = info.find_all("div", class_="field-content")
    position = position_name[0].text
    print(position)
    print("")

    csv_writer.writerow([name, position])

f.close()
Hi Babr, welcome to Python. Your answer is good, and here is one more little thing you could do better:
use find instead of find_all if you just want one element.
import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

f = open("/Users/mingjunliu/Downloads/contacts.csv", "w+")
csv_writer = csv.writer(f)
csv_writer.writerow(["Name", "Position"])

for info in soup.find_all("div", class_="well profile"):
    contact_name = info.find("div", class_="col-md-7 col-xs-10")
    names = contact_name.strong
    name = names.text
    print(name)

    position_name = info.find("div", class_="field-content")
    position = position_name.text
    print(position)
    print("")

    csv_writer.writerow([name, position])

f.close()
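The difference, for reference: find returns the first matching Tag (or None when nothing matches), while find_all returns a list of every match, which is why the earlier version needed the [0] indexing:

# find: first match or None; find_all: list of every match
first = soup.find("div", class_="well profile")      # a single Tag (or None)
every = soup.find_all("div", class_="well profile")  # a list of Tags
# when at least one match exists, first is the same element as every[0]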
And the reason you had to drop phone and department is the website's bad structure. It's not your fault.
Related
I want to get info from a website by web scraping with Python (I'm learning it now), but it prints the classes (which I got the info from) into the CSV first and only then prints the information I actually want. I have watched the YouTube video many times and wrote the same code, but the problem doesn't happen in the video. Can anyone help me?
This is an image link for the CSV to show you how it looks when I click on RUN.
Code:
import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest

Job_titles = []
Company_names = []
Location_names = []
Job_skills = []
Links = []

result = requests.get("https://wuzzuf.net/search/jobs/?q=python&a=hpb")
src = result.content
soup = BeautifulSoup(src, "lxml")

Job_titles = soup.find_all('h2', {"class":"css-m604qf"})
Company_names = soup.find_all('a', {"class":"css-17s97q8"})
Location_names = soup.find_all('span', {"class":"css-5wys0k"})
Job_skills = soup.find_all("div", {'class':"css-y4udm8"})

for i in range(len(Company_names)):
    Job_titles.append(Job_titles[i].text)
    Company_names.append(Company_names[i].text)
    Location_names.append(Location_names[i].text)
    Job_skills.append(Job_skills[i].text)

file_list = [Job_titles, Company_names, Location_names, Job_skills]
exported = zip_longest(*file_list)

with open("C:/Users/Saleh saleh/Documents/jobtest.csv", "w") as myfile:
    wr = csv.writer(myfile)
    wr.writerow(["Job titles", "Company names", "Location", "Skills", "Links"])
    wr.writerows(exported)
To get the information from the site, you can use the following example:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://wuzzuf.net/search/jobs/?q=python&a=hpb"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("data.csv", "w") as f_in:
    writer = csv.writer(f_in)
    writer.writerow(
        ["Job titles", "Company names", "Location", "Skills", "Links"]
    )

    for title in soup.select("h2 > a"):
        company_name = title.find_next("a")
        location = company_name.find_next("span")
        info = location.find_next("div", {"class": None})

        writer.writerow(
            [
                title.text,
                company_name.text,
                location.text,
                ",".join(
                    a.text.replace("·", "").strip() for a in info.select("a")
                ),
                title["href"],
            ]
        )
Creates data.csv (screenshot from LibreOffice):
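As for why the classes printed first: the original loop appends Job_titles[i].text onto the same Job_titles list that already holds the Tag objects from find_all, so each list ends up containing the tags followed by the text, and zip_longest writes them out in that order. Keeping the tags and the extracted text in separate lists avoids that, e.g. (hypothetical variable names):

# keep the Tag objects and the extracted strings apart
job_title_tags = soup.find_all('h2', {"class": "css-m604qf"})
job_titles = [tag.text for tag in job_title_tags]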
I am new to Python. Does anyone know what the [1:] does in {sum(int(td.text) for td in soup.select('td:last-child')[1:])}, or what [0] or [1] do? I have seen them in many scraping examples with for loops. Also, as I was practicing I built the code below, and I'm not able to scrape all the data into the CSV file. Thanks in advance, and sorry for asking two questions at once.
import requests
from bs4 import BeautifulSoup
import csv

url = "https://iplt20.com/stats/2020/most-runs"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')

lst = []
table = soup.find('div', attrs={'class': 'js-table'})

#for row in table.findAll('div', attrs={'class': 'top-players__player-name'}):
#    score = {}
#    score['Player'] = row.a.text.strip()
#    lst.append(score)

for row in table.findAll(class_='top-players__m top-players__padded '):
    score = {}
    score['Matches'] = int(row.td.text)
    lst.append(score)

filename = 'iplStat.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Player', 'Matches'])
    w.writeheader()
    for score in lst:
        w.writerow(score)

print(lst)
None of this is even needed. Just use pandas:
import requests
import pandas as pd

url = "https://iplt20.com/stats/2020/most-runs"
r = requests.get(url)
df = pd.read_html(r.content)[0]
df.to_csv("iplStats.csv", index=False)
Screenshot of csv file:
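This also answers the indexing question: pd.read_html returns a list of every table it finds on the page, so [0] picks the first one. Likewise, soup.select(...) and find_all(...) return lists, and [1:] is ordinary Python slicing that skips the first element (typically a header cell). A toy illustration:

cells = ['Matches', '10', '20', '30']   # imagine the text of each td in a column
cells[0]                                # 'Matches' -- the first element (the header)
cells[1]                                # '10'      -- the second element
cells[1:]                               # ['10', '20', '30'] -- everything after the first
sum(int(c) for c in cells[1:])          # 60 -- sums the data while skipping the header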
I am new to scraping/BS4 and am having a problem getting this CSV file to list all of the members. My problem is that the CSV repeats one member's information over multiple lines. If anyone has any ideas on how to fix this, it would be greatly appreciated.
import requests
import csv
from bs4 import BeautifulSoup

r = requests.get('https://vermontmaple.org/basic-member-list')
soup = BeautifulSoup(r.text, 'html.parser')

with open('list.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'address', 'phone'])
    for company in soup.findAll('div', class_='directory_item selected'):
        maple_name = soup.find('div', class_='name').get_text(strip=True)
        maple_address = soup.find('div', class_='address').get_text(strip=True)
        maple_phone = soup.find('div', class_='phone').get_text(strip=True)
        writer.writerow([maple_name, maple_address, maple_phone])

f.close()
Change soup.find to company.find inside the for loop. soup.find always searches the whole document from the top, so it returns the same first member every time; company.find searches only within the current directory_item block:

for company in soup.findAll('div', class_='directory_item selected'):
    maple_name = company.find('div', class_='name').get_text(strip=True)
    maple_address = company.find('div', class_='address').get_text(strip=True)
    maple_phone = company.find('div', class_='phone').get_text(strip=True)
    writer.writerow([maple_name, maple_address, maple_phone])

There is also no need for the f.close(); the with block closes the file for you.
I am designing a scraping project for my research but I am stuck on writing the scraped data to CSV. Please help me with that.
I have successfully scraped the data, but I want to store it in a CSV; my code is below.
I need to write code that pulls all of the HTML from a website and then saves it to a CSV file.
I believe I somehow need to turn the links into a list and then write the list, but I'm unsure how to do that.
This is what I have so far:
import requests
import time
from bs4 import BeautifulSoup
import csv

# Collect and parse first page
page = requests.get('https://www.myamcat.com/jobs')
soup = BeautifulSoup(page.content, 'lxml')
print("Wait, scraper is working on it")
time.sleep(10)

if page.status_code != 200:
    print("Error in scraping, check the url")
else:
    print("Successfully scraped the data")
    time.sleep(10)
    print("Loading data in csv")

    file = csv.writer(open('dataminer.csv', 'w'))
    file.writerow(['ProfileName', 'CompanyName', 'Salary', 'Job', 'Location'])

    for pname in soup.find_all(class_="profile-name"):
        #print(pname.text)
        profname = pname.text
        file.writerow([profname, ])

    for cname in soup.find_all(class_="company_name"):
        print(cname.text)

    for salary in soup.find_all(class_="salary"):
        print(salary.text)

    for lpa in soup.find_all(class_="jobText"):
        print(lpa.text)

    for loc in soup.find_all(class_="location"):
        print(loc.text)
Make a dict, save the data into it, then save it to CSV. Check the code below!
import requests
from bs4 import BeautifulSoup
import csv

# Collect and parse first page
page = requests.get('https://www.myamcat.com/jobs')
soup = BeautifulSoup(page.content, 'lxml')
data = []
print("Wait, scraper is working on it")

if page.status_code != 200:
    print("Error in scraping, check the url")
else:
    print("Successfully scraped the data")
    for x in soup.find_all('div', attrs={'class': 'job-page'}):
        # keep the values as plain strings; encoding them to bytes would
        # make csv write reprs like b'...' into the file
        data.append({
            'pname': x.find(class_="profile-name").text,
            'cname': x.find(class_="company_name").text,
            'salary': x.find(class_="salary").text,
            'lpa': x.find(class_="jobText").text,
            'loc': x.find(class_="location").text})

print("Loading data in csv")
with open('dataminer.csv', 'w', newline='', encoding='utf-8') as f:
    fields = ['salary', 'loc', 'cname', 'pname', 'lpa']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)
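One thing to keep in mind: DictWriter orders the columns by fieldnames, not by the insertion order of the dicts, so the CSV comes out as salary, loc, cname, pname, lpa here. Reorder the fields list if you want different columns first.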
Apart from what you got in the other answer, you can also scrape and write the content at the same time. I used .select() instead of .find_all() to achieve the same thing.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.myamcat.com/jobs"

page = requests.get(URL)
soup = BeautifulSoup(page.text, 'lxml')

with open('myamcat_doc.csv', 'w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(['pname', 'cname', 'salary', 'loc'])

    for item in soup.select(".job-listing .content"):
        pname = item.select_one(".profile-name h3").get_text(strip=True)
        cname = item.select_one(".company_name").get_text(strip=True)
        salary = item.select_one(".salary .jobText").get_text(strip=True)
        loc = item.select_one(".location .jobText").get_text(strip=True)
        writer.writerow([pname, cname, salary, loc])
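The advantage of .select() is that it takes any CSS selector, so a descendant combinator like ".job-listing .content" replaces what would otherwise be nested find/find_all calls. A rough equivalence:

# these fetch the same elements two different ways
soup.find_all("div", class_="content")   # filter by tag name and class
soup.select("div.content")               # the same filter as a CSS selector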
I am new to Python and attempting to use Python 3 to scrape Billboard data. I am using the Python billboard API (billboard.py) and want to get the Hot 100 tracks in CSV format with Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position, and Weeks as headers. I have been looking at this for hours with no luck, so any help would be greatly appreciated!
import requests, csv
from bs4 import BeautifulSoup

url = 'http://www.billboard.com/charts/hot-100'

with open('Billboard_Hot100.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Billboard Number', 'Artist Name', 'Song Title', 'Last Week Number', 'peak_position', 'weeks_on_chart'])

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    for container in soup.find_all("article", class_="chart-row"):
        billboard_number = container.find(class_="chart-row__current-week").text
        artist_name_a_tag = container.find(class_="chart-row__artist").text.strip()
        song_title = container.find(class_="chart-row__song").text
        last_week_number_tag = container.find(class_="chart-row__value")
        last_week_number = last_week_number_tag.text
        peak_position_tag = last_week_number_tag.find_parent().find_next_sibling().find(class_="chart-row__value")
        peak_position = peak_position_tag.text
        weeks_on_chart_tag = peak_position_tag.find_parent().find_next_sibling().find(class_="chart-row__value").text

        print(billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag)
        writer.writerow([billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag])
The csv file returned has the headers but the columns contain no information. Am I missing something?
Can you work with this?

from lxml import html
import requests

page = requests.get('https://www.billboard.com/charts/hot-100')
tree = html.fromstring(page.content)

# xpath() returns a list of strings, so strip the newlines from each one
line = tree.xpath('//span[@class="chart-list-item__title-text"]/text()')
line = [t.replace('\n', '') for t in line]
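To get that list into a CSV with a header, as the question asked, a minimal sketch (it assumes the line list from above; the other columns the question wanted would each need their own XPath query):

import csv

with open('Billboard_Hot100.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Song Title'])               # header row
    writer.writerows([title] for title in line)   # one row per scraped title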
Here's a link that I bookmarked a long time ago. It has helped me a lot over the course of time.
https://docs.python-guide.org/scenarios/scrape/