Extracting blog data in Python

We have to extract a specified number of blogs (n) by reading them from a text file containing a list of blogs.
Then I extract the blog data and append it to a file.
This is just a part of the main assignment of applying nlp to the data.
So far I've done this:
import urllib2
from bs4 import BeautifulSoup

def create_data(n):
    blogs = open("blog.txt", "r")  # opening the file containing list of blogs
    f = file("data.txt", "wt")     # Create a file data.txt
    with open("blog.txt") as blogs:
        head = [blogs.next() for x in xrange(n)]
        page = urllib2.urlopen(head['href'])
        soup = BeautifulSoup(page)
        link = soup.find('link', type='application/rss+xml')
        print link['href']
        rss = urllib2.urlopen(link['href']).read()
        souprss = BeautifulSoup(rss)
        description_tag = souprss.find('description')
        f = open("data.txt", "a")  # data file created for applying nlp
        f.write(description_tag)
This code doesn't work. It worked when I gave the link directly, like:
page = urllib2.urlopen("http://www.frugalrules.com")
I call this function from a different script, where the user supplies the input n.
What am I doing wrong?
Traceback:
Traceback (most recent call last):
File "C:/beautifulsoup4-4.3.2/main.py", line 4, in <module>
create_data(2)#calls create_data(n) function from create_data
File "C:/beautifulsoup4-4.3.2\create_data.py", line 14, in create_data
page=urllib2.urlopen(head)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 395, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

head is a list:
head = [blogs.next() for x in xrange(n)]
A list is indexed by integers (or slices). You cannot use head['href'] when head is a list:
page = urllib2.urlopen(head['href'])
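A quick illustration of the difference (the URLs are just placeholders):

```
head = ["http://example.com/a\n", "http://example.com/b\n"]   # what the list comprehension produces
first = head[0]      # integer indexing works on a list
# head['href']       # TypeError: list indices must be integers, not str
```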
It's hard to say how to fix this without knowing what the contents of blog.txt look like. If each line of blog.txt contains a URL, then
you could use:
with open("blog.txt") as blogs:
for url in list(blogs)[:n]:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
...
with open('data.txt', 'a') as f:
f.write(...)
Note that file is a deprecated form of open (file was removed entirely in Python 3). Instead of using f = file("data.txt", "wt"), use the more modern with-statement syntax (as shown above).
For example,
import urllib2
import bs4 as bs

def create_data(n):
    with open("data.txt", "wt") as f:
        pass
    with open("blog.txt") as blogs:
        for url in list(blogs)[:n]:
            page = urllib2.urlopen(url)
            soup = bs.BeautifulSoup(page.read())
            link = soup.find('link', type='application/rss+xml')
            print(link['href'])
            rss = urllib2.urlopen(link['href']).read()
            souprss = bs.BeautifulSoup(rss)
            description_tag = souprss.find('description')
            with open('data.txt', 'a') as f:
                f.write('{}\n'.format(description_tag))

create_data(2)
I'm assuming that you are opening, writing to and closing data.txt with each pass through the loop because you want to save partial results -- maybe in case the program is forced to terminate prematurely.
Otherwise, it would be easier to just open the file once at the very beginning:
with open("blog.txt") as blogs, open("data.txt", "wt") as f:

Related

Download and save many PDF files with Python

I am trying to download many PDF files from a website and save them.
import requests

url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
r = requests.get(url, stream= TRUE)

for id in range(1,125):
    with open(id+'.pdf',"wb") as pdf:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pdf.write(chunk)
The first URL of the PDFs is https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/1.pdf
and the last URL is https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/125.pdf
I want to download all of these files.
When I execute this code I get this error:
Traceback (most recent call last):
File "c:\Users\king-\OneDrive\Bureau\pdfs\pdfs.py", line 6, in <module>
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
TypeError: can only concatenate str (not "builtin_function_or_method") to str
In the second line
url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/"+id+".pdf"
you concatenate a str object with something named id. id is a built-in function (type id() in a Python console). In line 4
for id in range(1,125):
you overwrite id with something else (a number), which is possible, but not recommended.
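A quick illustration of the shadowing (safe to try in a throwaway console):

```
print(id)    # <built-in function id>
id = 5
print(id)    # 5 -- the name id no longer refers to the built-in
```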
Apart from that, you only make a single request, not one for every file. Try this:
import requests

url = "https://jawdah.qcc.abudhabi.ae/en/Registration/QCCServices/Services/Registration/Trade%20Licenses/{}.pdf"

for num in range(1, 126):
    r = requests.get(url.format(num), stream=True)
    with open('{}.pdf'.format(num), "wb") as pdf:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pdf.write(chunk)
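If some of the numbered PDFs turn out not to exist, the loop above would still write whatever the server returns (for example an error page) to disk. A hedged variant that skips non-200 responses, using the same URL template:

```
for num in range(1, 126):
    r = requests.get(url.format(num), stream=True)
    if r.status_code != 200:     # skip IDs that do not resolve to a PDF
        continue
    with open('{}.pdf'.format(num), "wb") as pdf:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pdf.write(chunk)
```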

How to fix error " object of type 'NoneType' has no len() " when creating a web scraper?

I'm trying to create a web scraper to download certain images from a webpage using Python and BeautifulSoup. I'm a beginner and have built this just through finding code online and trying to adapt it. My problem is that when I run the code, it produces this error:
line 24, in <module>
if len(nametemp) == 0:
TypeError: object of type 'NoneType' has no len()
This is what my code looks like:
i = 1

def makesoup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = makesoup("https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf")

for img in soup.findAll('img'):
    temp = img.get('src')
    if temp[:1] == "/":
        image = "https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf" + temp
    else:
        image = temp
    nametemp = img.get('alt', [])
    if len(nametemp) == 0:
        filename = str(i)
        i = i + 1
    else:
        filename = nametemp
This works now! Thanks for the replies!
Now when I run the code, only some of the images from the webpage appear in my folder. And it returns this:
Traceback (most recent call last):
File "scrape_stiga.py", line 31, in <module>
imagefile.write(urllib.request.urlopen(image).read())
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/Users/opt/anaconda3/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'assets/img/logo-white.png'
Replace nametemp = img.get('alt') with nametemp = img.get('alt', '').
Some <img> elements could be missing the alt attribute. In that case, img.get('alt') will return None, and the len function doesn't work on None.
By using img.get('alt', ''), you return an empty string when the image lacks the alt attribute. len('') will return 0 and your code will not break.
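In other words (img being the same tag object as in the question's loop):

```
img.get('alt')             # None when the attribute is missing
img.get('alt', '')         # '' (empty string) when the attribute is missing
len(img.get('alt', ''))    # 0, so the len(nametemp) == 0 branch is taken
```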
Looks like nametemp is being assigned None (that's the default behaviour of get).
In order to ensure nametemp is iterable, try changing your assignment line:
nametemp = img.get('alt',[])
This will ensure that if "alt" isn't found you get back a list, and thus you can call "len".
To control which directory your file is stored in, simply change your filename to contain the whole path, e.g. "C:/Desktop/mySpecialFile.jpeg".
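For example (a sketch; the folder path is just an assumption):

```
import os

save_dir = "C:/Desktop"                                # assumed target folder
filename = os.path.join(save_dir, nametemp + ".jpeg")  # e.g. C:/Desktop/<alt text>.jpeg
```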
You are taking the length of nametemp when the error is raised. It says you can't take the length of a NoneType object. This tells you that nametemp at that point must be None.
Why is it None? Let's go back to:
nametemp = img.get('alt')
OK. img is the current <img> tag, since you're iterating over image tags. At some point you iterate over an image tag which does not have an alt attribute. Therefore, img.get('alt') returns None, and None is assigned to nametemp.
Check the HTML you are parsing and confirm that all image tags have an alt attribute. If you only want to iterate over image tags with an alt attribute, you can use a CSS selector to find only those tags, or you could add a try/except to your loop and simply continue if you come across an image tag you don't like.
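For instance, the selector-based filter could look like this (a sketch; the rest of the loop body is assumed to stay the same as in the question):

```
# only <img> tags that actually carry an alt attribute
for img in soup.select('img[alt]'):
    nametemp = img['alt']
    ...
```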
EDIT - You said you want to scrape product images, but it isn't really clear what page you are trying to scrape these images from exactly. You did update your post with a URL - thank you - but what exactly are you trying to achieve? Do you want to scrape the page that contains all (or some) of the products within a certain category, and simply scrape the thumbnails? Or do you want to visit each product page individually and download the higher resolution image?
Here's something I put together: It just looks at the first page of all products within a certain category, and then scrapes and downloads the thumbnails (low resolution) images into a downloaded_images folder. If the folder doesn't exist, it will create it automatically. This does require the third party module requests, which you can install using pip install requests - though you should be able to do something similar using urllib.request if you don't want to install requests:
def download_image(image_url):
    import requests
    from pathlib import Path

    dir_path = Path("downloaded_images/")
    dir_path.mkdir(parents=True, exist_ok=True)

    image_name = image_url[image_url.rfind("/")+1:]
    image_path = str(dir_path) + "/" + image_name

    with requests.get(image_url, stream=True) as response:
        response.raise_for_status()
        with open(image_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
                file.flush()
    print(f"Finished downloading \"{image_url}\" to \"{image_path}\".\n")


def main():
    import requests
    from bs4 import BeautifulSoup

    root_url = "https://www.stiga.pl/"
    url = f"{root_url}sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa"

    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")

    for product in soup.findAll("div", {"class": "products__item"}):
        image_url = root_url + product.find("img")["data-src"]
        download_image(image_url)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
To recap, you are using BeautifulSoup to find the URLs to the images, and then you use a simple requests.get to download the image.
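As a side note on the ValueError quoted in the question: 'assets/img/logo-white.png' is a relative URL, and urllib.request.urlopen only accepts absolute ones. One way to handle that (not part of the answer above, just a sketch) is to resolve relative src values with urllib.parse.urljoin:

```
from urllib.parse import urljoin

page_url = "https://www.stiga.pl/sklep/koszenie-trawnika/agregaty-tnace/agregaty-tnace-park-villa/agregat-park-100-combi-3-el-qf"
image = urljoin(page_url, temp)   # turns "assets/img/logo-white.png" into an absolute URL
```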

Reading url from file Python

I cannot read the URLs in the txt file.
I want to read and open the URL addresses in the txt file one by one, and get the page title with a regex from the source of each URL.
Error messages:
Traceback (most recent call last):
File "Mypy.py", line 14, in <module>
UrlsOpen = urllib2.urlopen(listSplit)
File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 420, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Mypy.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import urllib2
import threading

UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.split('\r\n')

UrlsOpen = urllib2.urlopen(listSplit)
ReadSource = UrlsOpen.read().decode('utf-8')

regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
links = re.findall(comp, ReadSource)

for i in links:
    SaveDataFiles = open("SaveDataMyFile.txt", "w")
    SaveDataFiles.write(i)
    SaveDataFiles.close()
When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen needs a string or Request object. It's a simple fix: iterate over listSplit instead of passing the entire list to urlopen.
Also re.findall() will return a list for each ReadSource searched. You can handle this a couple of ways:
I chose to handle it by just making a list of lists
websites = [ [link, link], [link], [link, link, link] ]
and iterating over both lists. This makes it so you can do something specific for each list of URLs from each website (put them in different files, etc.).
You could also flatten the website list to just contain the links instead of another list that then contains the links:
links = [link, link, link, link]
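For reference, flattening could be done with a nested comprehension (not used in the code below, just a sketch):

```
links = [link for website in websites for link in website]
```

The full nested-list version of the answer: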
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
from pprint import pprint

UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.splitlines()
pprint(listSplit)

regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)

websites = []
for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')
    websites.append(re.findall(comp, ReadSource))

with open("SaveDataMyFile.txt", "w") as SaveDataFiles:
    for website in websites:
        for link in website:
            pprint(link)
            SaveDataFiles.write(link.encode('utf-8'))

Print JSON data from csv list of multiple urls

Very new to Python and I haven't found a specific answer on SO, but apologies in advance if this appears very naive or is answered elsewhere already.
I am trying to print 'IncorporationDate' JSON data from multiple urls of public data set. I have the urls saved as a csv file, snippet below. I am only getting as far as printing ALL the JSON data from one url, and I am uncertain how to run that over all of the csv urls, and write to csv just the IncorporationDate values.
Any basic guidance or edits are really welcomed!
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url = ("http://data.companieshouse.gov.uk/doc/company/01046514.json")
print(get_jsonparsed_data(url))

import csv
with open('test.csv') as f:
    lis = [line.split() for line in f]
    for i, x in enumerate(lis):
        print()

import StringIO
s = StringIO.StringIO()
with open('example.csv', 'w') as f:
    for line in s:
        f.write(line)
Snippet of csv:
http://business.data.gov.uk/id/company/01046514.json
http://business.data.gov.uk/id/company/01751318.json
http://business.data.gov.uk/id/company/03164710.json
http://business.data.gov.uk/id/company/04403406.json
http://business.data.gov.uk/id/company/04405987.json
Welcome to the Python world.
For making HTTP requests, we commonly use requests because of its dead-simple API.
The code snippet below does what I believe you want:
It grabs the data from each of the urls you posted
It creates a new CSV file with each of the IncorporationDate keys.
```
import csv
import requests

COMPANY_URLS = [
    'http://business.data.gov.uk/id/company/01046514.json',
    'http://business.data.gov.uk/id/company/01751318.json',
    'http://business.data.gov.uk/id/company/03164710.json',
    'http://business.data.gov.uk/id/company/04403406.json',
    'http://business.data.gov.uk/id/company/04405987.json',
]

def get_company_data():
    for url in COMPANY_URLS:
        res = requests.get(url)
        if res.status_code == 200:
            yield res.json()

if __name__ == '__main__':
    for data in get_company_data():
        try:
            incorporation_date = data['primaryTopic']['IncorporationDate']
        except KeyError:
            continue
        else:
            with open('out.csv', 'a') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([incorporation_date])
```
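One design note: because out.csv is opened in append mode inside the loop, re-running the script keeps adding rows to the existing file. If that's not wanted, the file could be opened once in 'w' mode before the loop, e.g. (a sketch, Python 3):

```
with open('out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for data in get_company_data():
        date = data.get('primaryTopic', {}).get('IncorporationDate')
        if date:
            writer.writerow([date])
```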
First step, you have to read all the URLs in your CSV
import csv

with open('test.csv') as f:
    csvReader = csv.reader(f)
    # next(csvReader)  # uncomment if you have a header in the .CSV file
    all_urls = [row[0] for row in csvReader if row]
Second step, fetch the data from the URL
from urllib.request import urlopen
import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)
url_data = get_jsonparsed_data("give_your_url_here")
Third step:
Go through all URLs that you got from CSV file
Get JSON data
Fetch the field you need, in your case "IncorporationDate"
Write it into an output CSV file; I'm naming it IncorporationDates.csv
Code below:
with open('IncorporationDates.csv', 'w') as abc:
    for each_url in all_urls:
        url_data = get_jsonparsed_data(each_url)
        abc.write(url_data['primaryTopic']['IncorporationDate'] + '\n')

Parsing XML File with Python, while extracting Attributes and Children

I'm trying to read an XML file in Python whose general format is as follows:
<item id="1149" num="1" type="topic">
  <title>Afghanistan</title>
  <additionalInfo>Afghanistan</additionalInfo>
</item>
(This snippet repeats many times.)
I'm trying to get the id value and the title value to be printed into a file.
I'm having trouble getting the XML file into Python. Currently, I'm doing this to fetch it:
import xml.etree.ElementTree as ET
from urllib2 import urlopen
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
f = open('out.xml', 'w')
f.write(response)
However, whenever I run this code, I get the error:
Traceback (most recent call last):
File "python", line 9, in <module>
TypeError: expected a character buffer object
which makes me think that I'm not using something that can handle XML.
Is there any way that I can save the XML file to a file, then extract the title of each section, as well as the id attribute associated with that title?
Thanks for the help.
You can read the content of the response with this code:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),urllib2.HTTPCookieProcessor())
response= opener.open("http://api.npr.org/list?id=3002").read()
opener.close()
and then write it to a file:
f = open('out.xml', 'w')
f.write(response)
f.close()
What you want is response.read() not response. The response variable is an instance not the xml string. By doing response.read() it will read the xml from the response instance.
You can then write it directly to a file like so:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
f = open('out.xml', 'w')
f.write(response.read())
Alternatively you could also parse it directly into the ElementTree like so:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
tree = ET.fromstring(response.read())
To extract all of the id/title pairs you could do the following as well:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
tree = ET.fromstring(response.read())
for item in tree.findall("item"):
    print item.get("id")
    print item.find("title").text
From there you can decide where to store/output the values.
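For instance, writing the pairs to a file could look like this (a sketch in the same Python 2 style as the answer; the output filename is just an assumption):

```
with open('titles.txt', 'w') as out:
    for item in tree.findall("item"):
        title = item.find("title").text or u''
        out.write("%s\t%s\n" % (item.get("id"), title.encode('utf-8')))
```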
