How can I parse URLs from a CSV with BeautifulSoup in Python?

Below is my code. It works fine for a single given URL, but I would like to parse URLs from a CSV instead. Thanks in advance.
P.S. I'm quite new to Python.
This code works fine for a single given URL:
import requests
import pandas
from bs4 import BeautifulSoup

baseurl = "https://www.xxxxxxxxx.com"
r = requests.get(baseurl)
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": "biz-us"})
for br in soup.find_all("br"):
    br.replace_with("\n")
This is the code I tried for reading the URLs from a CSV:
import csv
import requests
import pandas
from bs4 import BeautifulSoup

with open("input.csv", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        url = row[0]
        r = requests.get(url)
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        all = soup.find_all("div", {"class": "biz-country-us"})
        for br in soup.find_all("br"):
            br.replace_with("\n")

Looks like you'll need to use your loop properly and also get the URLs into a list. Try this out:
import csv
import requests
import pandas
from bs4 import BeautifulSoup

df1 = pandas.read_csv("input.csv", skiprows=0)  # assuming headers are in the first row
urls = df1['url_column_name'].tolist()          # get the urls as a list

for i in range(len(urls)):
    r = requests.get(urls[i])
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all("div", {"class": "biz-country-us"})
    for br in soup.find_all("br"):
        br.replace_with("\n")
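If you prefer, you can also iterate over the list directly and collect what you scrape as you go. Here is a minimal sketch along the same lines, assuming the same hypothetical url_column_name column and that you want the text of the matched divs:

import pandas
import requests
from bs4 import BeautifulSoup

df1 = pandas.read_csv("input.csv")          # headers assumed in the first row
urls = df1['url_column_name'].tolist()      # hypothetical column name, as above

results = []
for url in urls:                            # iterate the list directly, no index needed
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for br in soup.find_all("br"):
        br.replace_with("\n")               # turn <br> tags into newlines first
    for div in soup.find_all("div", {"class": "biz-country-us"}):
        results.append(div.get_text())      # keep the text of every matched div

print(results)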

Suppose you have a CSV file named linklists.csv with a header Links. You can then work through all the links under that header using the method shown below:
import csv
import requests

with open("linklists.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
        print(res.url)
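If you also want to run your BeautifulSoup parsing on each of those links, you can drop it straight into the same loop. A rough sketch, assuming the Links header and the biz-country-us class from your own code:

import csv
import requests
from bs4 import BeautifulSoup

with open("linklists.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
        soup = BeautifulSoup(res.content, "html.parser")
        for br in soup.find_all("br"):
            br.replace_with("\n")           # convert <br> to newlines, as in your code
        for div in soup.find_all("div", {"class": "biz-country-us"}):
            print(div.get_text())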

Related

How to scrape specific IDs from a Webpage

I need to do some real estate market research, and for this I need the prices and other values for new houses.
My idea was to go to the website where I get the information.
I go to the main search page and scrape all the RealEstateIDs that would navigate me directly to the single pages for each house, where I can then extract the info I need.
My problem is: how do I get all the real estate IDs from the main page and store them in a list, so I can use them in the next step to build the URLs that lead to the actual pages?
I tried it with BeautifulSoup but failed, because I don't understand how to search for a specific word and extract what comes after it.
The HTML looks like this:
""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"#xsi.type":"search:ApartmentBuy","#id":"110356727","title":"
Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it every time (here: 110356727) and store it in a list so that I can use the IDs later.
Edit:
import time
import urllib.request
from urllib.request import urlopen
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
from requests import get

url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
response = get(url)

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

def expose_IDs():
    resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('resultListModel')
    tickers = []
    for row in table.findAll('realestateID')[1:]:
        ticker = row.findAll(',')[0].text
        tickers.append(ticker)
    with open("exposeID.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

expose_IDs()
Something like this? There are 68 keys in the dictionary that are IDs. I use a regex to grab the same script you are after, trim off an unwanted trailing character, then load it with json.loads and read the keys of the resulting JSON object:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
#resultListModel:
results = json.loads(script)
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)
Since the website has been updated:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)
ids = [item['#id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]
print(ids)
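If you still want to write the IDs to a pickle file the way your original expose_IDs() did, you can bolt that onto the end. A small sketch reusing the ids list from the snippet above (the helper name save_ids is just for illustration):

import pickle

def save_ids(ids, filename="exposeID.pickle"):
    # store the scraped ids so they can be reloaded later without re-scraping
    with open(filename, "wb") as f:
        pickle.dump(ids, f)
    return ids

save_ids(ids)  # `ids` comes from the snippet above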

What is the best way to remove duplicate URL links from a web scraper writing to a CSV file?

I'm using Python 3 to write a web scraper that pulls URL links and writes them to a CSV file. The code does this successfully; however, there are many duplicates. How can I create the CSV file with only a single (unique) instance of each URL?
Thanks for the help!
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('url')
soup = BeautifulSoup(r.text, 'html.parser')
data = []

for link in soup.find_all('a', href=True):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('base-url', link.get('href')))
        data.append(urljoin('base-url', link.get('href')))

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
Using set() somewhere along the line is the way to go. In the code below I've added it as data = set(data) on its own line to best illustrate the usage; replacing data with set(data) drops your ~250-URL list to around ~130:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r.text, 'html.parser')
data = []

for link in set(soup.find_all('a', href=True)):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('https://www.census.gov', link.get('href')))
        data.append(urljoin('https://www.census.gov', link.get('href')))

data = set(data)

with open('CensusLinks.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
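One thing to be aware of: set() does not keep the order in which the links were found. If you want unique links in their original order, dict.fromkeys() does the trick, since dicts preserve insertion order in Python 3.7+. A small self-contained sketch:

data = ['a', 'b', 'a', 'c', 'b']             # example list with duplicates
unique_in_order = list(dict.fromkeys(data))  # de-duplicate, keeping first-seen order
print(unique_in_order)                       # ['a', 'b', 'c']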

Trying to print all TR elements and all TD elements from a web page

I am playing around with the script below, trying to get it to write all TR elements and all TD elements from a web page into a CSV file. For some unknown reason I'm getting no data at all in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "https://my_url"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')

file = open("C:/my_path/test.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)
I am using Python 3.6.
Your url is not a real website, so it won't find anything; you just need to fix the URL and try again.
I have also fixed the code so that you can finish it. As written it just writes the raw cell text to the file, so you'll still need to add delimiters and row breaks to get a proper CSV.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "https://www.w3schools.com/html/html_tables.asp"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')

file = open("C:/Test/test2.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        info = col.text
        print(info)
        file.write(info)
file.close()
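To get one CSV row per table row, a rough sketch of how you might finish it with csv.writer (same example URL and output path as above):

from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.w3schools.com/html/html_tables.asp"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

with open("C:/Test/test2.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    for row in soup.find_all('tr'):
        cells = [col.text.strip() for col in row.find_all('td')]
        if cells:                      # skip rows that only contain <th> headers
            writer.writerow(cells)     # one table row per CSV row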

Why doesn't my web-scraping code work?

I want to scrape the airplane arrivals from a website with Python 2.7 and export them to Excel, but something is wrong with my code:
import urllib2
import unicodecsv as csv
import os
import sys
import io
import time
import datetime
import pandas as pd
from bs4 import BeautifulSoup
filename=r'output.csv'
resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
url = "https://www.flightradar24.com/data/airports/bud/arrivals"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = soup.find('div', { "class" : "row cnt-schedule-table"})
print data
I need the contents of the div with class row cnt-schedule-table. What am I doing wrong?
I believe the problem is that you are trying to get data from a JavaScript-loaded data set. Instead of reading the HTML page directly, you'll need to mimic the requests the page itself makes to populate that data.
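In practice that means opening the browser's developer tools, watching the Network tab while the arrivals table loads, and then requesting that JSON endpoint yourself instead of the HTML page. A rough sketch of the pattern; the endpoint URL and parameters below are placeholders, not the site's real API, so you would need to substitute whatever your Network tab shows:

import requests

# Placeholder endpoint copied from the browser's Network tab -- not a real URL.
api_url = "https://example.com/airport-data.json"
params = {"code": "bud", "page": 1}             # hypothetical query parameters

headers = {"User-Agent": "Mozilla/5.0"}         # some endpoints reject the default requests UA
resp = requests.get(api_url, params=params, headers=headers)
resp.raise_for_status()

data = resp.json()                              # the arrivals come back as JSON, not HTML
print(data)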

python beautiful soup import urls

I am trying to import a list of URLs and grab pn2 and main1. I can run it without importing the file, so I know it works, but I have no idea what to do with the import. Here is my most recent attempt, and below it a small portion of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup

csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()

theurl = "csvfilelist"
soup = BeautifulSoup(theurl, "html.parser")

for row in csvfilelist:
    for pn in soup.findAll('td', {"class": "productText"}):
        pn2.append(pn.text)
    for main in soup.find_all('div', {"class": "breadcrumb"}):
        main1 = main.text
    print(main1)
    print('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and using BeautifulSoup to parse it.
That is not the way to do it.
BeautifulSoup parses HTML, not CSV.
Looking at your code, it would be correct if you were passing HTML to bs4.
from bs4 import BeautifulSoup
import requests

links = []
file = open('links.txt', 'w')                        # open for writing, not just reading
html = requests.get('http://www.example.com').text   # pass the page text to bs4
soup = BeautifulSoup(html, 'html.parser')

for x in soup.find_all('a', {"class": "abc"}):       # attribute filters go in a dict
    links.append(x)
    file.write(str(x))
file.close()
The above is a very basic implementation of how you could grab a target element from the HTML and write it to a file or append it to a list. Use requests rather than urllib; it is a more modern and nicer library.
If you want to read your input from a CSV, the best option is to use the csv module's reader.
Hope that helps.
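For example, a minimal sketch of that idea: read each URL from the CSV with csv.reader, fetch it with requests, and then run your two BeautifulSoup loops on each page. This assumes ecco1.csv has one URL per line in the first column:

import csv
import requests
from bs4 import BeautifulSoup

pn2 = []
with open("ecco1.csv", newline='') as f:
    for row in csv.reader(f):
        url = row[0]                                   # one URL per line, first column
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        for pn in soup.find_all('td', {"class": "productText"}):
            pn2.append(pn.text)
        for main in soup.find_all('div', {"class": "breadcrumb"}):
            print(main.text)

print('\n'.join(pn2))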
