As an absolute newbie to Python, I stumbled over a few difficulties using the newspaper library. My goal is to use newspaper on a regular basis to download all new articles from a German news website called "tagesschau" and all articles from CNN, to build a data stack I can analyze in a few years.
If I understood it correctly, I could use the following commands to download and scrape all articles into Python.
import newspaper
from newspaper import news_pool
tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')
papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2) # (2 sources * 2 threads) = 4 threads total
news_pool.join()
If that's the right way to download all articles, how can I extract and save them outside of Python? Or save those articles within Python so that I can reuse them after restarting Python?
Thanks for your help.
The following code will save the downloaded articles in HTML format. In the folder you'll find tagesschau_paper0.html, tagesschau_paper1.html, tagesschau_paper2.html, ...
import newspaper
from newspaper import news_pool
tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')
papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2)
news_pool.join()
for i in range(tagesschau_paper.size()):
    with open("tagesschau_paper{}.html".format(i), "w") as file:
        file.write(tagesschau_paper.articles[i].html)
Note: news_pool doesn't fetch anything from CNN, so I skipped writing code for it. If you check cnn_paper.size(), it returns 0. You have to import and use Source instead.
The code above can be followed as an example for saving articles in other formats too, e.g. txt, and also for saving only the parts you need from the articles, e.g. authors, body, publish_date.
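For example, here is a minimal sketch of saving just the parsed text, authors, and publish date to .txt files (the output folder name and file naming are only an example, and each article has to be parsed before those fields are available):

import os

os.makedirs("articles", exist_ok=True)

for i, article in enumerate(tagesschau_paper.articles):
    article.parse()  # fills .text, .authors and .publish_date from the downloaded HTML
    out_path = os.path.join("articles", "tagesschau_paper{}.txt".format(i))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("Authors: {}\n".format(", ".join(article.authors)))
        f.write("Published: {}\n\n".format(article.publish_date))
        f.write(article.text)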
You can use pickle to save objects outside of Python and reopen them later:
import pickle

file_Name = "testfile"
# open the file for writing
fileObject = open(file_Name, 'wb')
# this writes the object news_pool to the
# file named 'testfile'
pickle.dump(news_pool,fileObject)
# here we close the fileObject
fileObject.close()
# we open the file for reading (binary mode, since pickle data is binary)
fileObject = open(file_Name, 'rb')
# load the object from the file into var news_pool_reopen
news_pool_reopen = pickle.load(fileObject)
I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf (and many more) into text without saving them to my PC, because thousands of such announcements come up daily. Are there any Python code solutions for this?
Thanks
There are different methods to do this, but the simplest is to download the PDF locally and then use one of the following Python modules to extract the text:
pdfplumber
tesseract
pdftotext
...
Here is a simple code example for that (using pdfplumber)
from urllib.request import urlopen
import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

# download the PDF and save it locally
response = urlopen(url)
file = open("img.pdf", 'wb')
file.write(response.read())
file.close()

try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (these are annexes we don't want), or the PDF is damaged.
    # Inside a download loop you would `continue` here instead of exiting.
    print('Error. Are you sure this is a PDF?')
    raise SystemExit

# pdfplumber text extraction
page = pdf.pages[0]
text = page.extract_text()
EDIT: My bad, I just realised you asked "without saving it to my PC".
That being said, I also scrape a lot of PDFs (1000s as well), but I save them all as "img.pdf" so they just keep replacing each other and I end up with only one PDF file on disk. I do not provide any solution for extracting the text without saving the file. Sorry for that :'(
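As a rough sketch of that overwrite pattern (the URL list below is only a placeholder for wherever your announcement links come from):

from urllib.request import urlopen
import pdfplumber

# placeholder list; in practice these would be the announcement URLs you collect
pdf_urls = [
    'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf',
]

for url in pdf_urls:
    # every download overwrites the same local file, so only one PDF ever sits on disk
    with open("img.pdf", "wb") as f:
        f.write(urlopen(url).read())

    try:
        pdf = pdfplumber.open("img.pdf")
    except Exception:
        # not a PDF (annex) or a damaged file: skip it and move on
        print('Error. Are you sure this is a PDF? ({})'.format(url))
        continue

    text = pdf.pages[0].extract_text()
    pdf.close()
    print(text[:200] if text else '(no text extracted)')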
I need to extract the first table (account number, branch name, etc.) and the last table (date, description, and amount).
pdf file: https://drive.google.com/file/d/1b537hdTUMQwWSOJHRan6ckHBUDhRBbvX/view?usp=sharing
I'm getting blank output using the PyPDF2 library.
Camelot gives OSError: Ghostscript is not installed.
import PyPDF2
file_path = open(r"E:\user\programs\28_oct_bank_statement\demo.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file_path)
pageObj = pdf.getPage(0)
print(pageObj.extractText())
import camelot
data = camelot.read_pdf(r"demo.pdf", pages='all')
print(data)
Camelot has dependencies that need to be installed in order for it to work, such as Ghostscript. You'll first need to check whether it is installed correctly. For mac/ubuntu:
from ctypes.util import find_library
find_library("gs")
"libgs.so.9"
for windows:
import ctypes
from ctypes.util import find_library
find_library("".join(("gsdll", str(ctypes.sizeof(ctypes.c_voidp) * 8), ".dll")))
<name-of-ghostscript-library-on-windows>
Otherwise, download Ghostscript for Windows from the following page: https://ghostscript.com/. I highly suggest reading through the Camelot documentation again if you run into more issues.
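Once Ghostscript is found, the Camelot call from the question should work; here is a minimal sketch of pulling the tables out (demo.pdf is the file name used in the question, and the printed <TableList n=...> is only illustrative):

import camelot

# read every table Camelot can detect in the statement
tables = camelot.read_pdf("demo.pdf", pages="all")
print(tables)        # e.g. <TableList n=2>

# first detected table (account number, branch name, ...) as a pandas DataFrame
print(tables[0].df)

# write every detected table to its own CSV file; the date/description/amount
# table will be among them
tables.export("demo_tables.csv", f="csv")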
I usually use Apache Tika to do this, as shown here.
You can simply install it and then use a short Python script:
from tika import parser
parsed_pdf = parser.from_file("sample.pdf")
text = parsed_pdf['content']
metadata = parsed_pdf['metadata']
print(text)
Note that you do need Java installed on the machine for it to run. However, it will return the text, and once you have the text you can look for a pattern within it to extract the exact data required.
The nice part about this is that it will also return the metadata of the pdf.
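As a sketch of that pattern-matching step (the regular expressions below are only illustrative; the real ones depend on how the fields appear in your statement's text):

import re
from tika import parser

parsed_pdf = parser.from_file("sample.pdf")
text = parsed_pdf['content'] or ""

# illustrative patterns; adjust them to the actual wording/layout of your statement
account_match = re.search(r"Account\s*(?:No\.?|Number)\s*[:\-]?\s*(\d+)", text, re.IGNORECASE)
branch_match = re.search(r"Branch\s*(?:Name)?\s*[:\-]?\s*(.+)", text, re.IGNORECASE)

if account_match:
    print("Account number:", account_match.group(1))
if branch_match:
    print("Branch:", branch_match.group(1).strip())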
I want to create a .csv file to speed up the loading of the encodings in my face recognition program, which uses face_recognition in Python.
When my algorithm detects a new face, it generates an encoding using face_recognition and then:
with open('data.csv', 'a') as file:
    writer = csv.writer(file)
    writer.writerow([ID, new_face_reco])
I do that to send the encoding to the .csv file. (ID is a random name I give to the face and new_face_reco is the encoding of the new face.)
But I want to reopen it when I relaunch the program, so I have this at the beginning:
known_face_encodings_temp = []
known_face_names_temp = []

with open('data.csv', 'rb') as file:
    data = [row for row in csv.reader(file, delimiter=',')]

known_face_names_temp.append(np.array(data[0][0]))
essai = np.array(data[0][1].replace('\n', ''))
known_face_encodings_temp.append(essai.tolist())

known_face_encodings = known_face_encodings_temp
known_face_name = known_face_names_temp
I have a lot of issues (which is why there are a lot of lines in this part), because my encoding changes between writing it to the .csv and reloading it. Here is what I get:
Initial data:
array([-8.31770748e-02, ... , -3.41368467e-03])
When I reload my csv (without changing anything):
'[-1.40143648e-01 ... -8.10057670e-02\n 3.77673171e-02 1.40102580e-02 8.14460665e-02
7.52283633e-02]'
What I get when I try to change things:
'[-1.40143648e-01 ... 7.52283633e-02]'
I need the loaded data to be the same as the initial data. What can I do?
Instead of using CSV files, try using numpy (.npy) files; they're much easier to save and load. I have used them myself in one of my projects that utilizes the face_recognition module and would be happy to help you out.
To save an encoding, you can:
np.save(path_to_save, encoding)  # path_to_save is a file path ending in .npy
To load an encoding, you can:
encodingVariable = np.load(path_to_load)
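A minimal sketch of that idea, assuming you keep one .npy file per face named after its ID (the folder layout and names here are just an example):

import os
import numpy as np

ENCODING_DIR = "encodings"  # one .npy file per known face, named <ID>.npy
os.makedirs(ENCODING_DIR, exist_ok=True)

def save_encoding(face_id, encoding):
    # persist a single face encoding under its ID
    np.save(os.path.join(ENCODING_DIR, "{}.npy".format(face_id)), encoding)

def load_encodings():
    # rebuild the known-face lists from the saved .npy files
    names, encodings = [], []
    for filename in os.listdir(ENCODING_DIR):
        if filename.endswith(".npy"):
            names.append(os.path.splitext(filename)[0])
            encodings.append(np.load(os.path.join(ENCODING_DIR, filename)))
    return names, encodings

# after a restart
known_face_names, known_face_encodings = load_encodings()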
I'm doing some research on Cambridge Analytica and wanted to have as much news articles as I can from some news outlets.
I was able to scrape them and now have a bunch of JSON files in a folder.
Some of them have only this [] written in them while others have the data I need.
Using pandas I used the following and got every webTitle in the file.
df = pd.read_json(json_file)
df['webTitle']
The thing is that whenever there's an empty file, it won't even let me assign df['webTitle'] to a variable.
Is there a way for me to check if it is empty and, if it is, just go to the next file?
I want to make this into a spreadsheet with a few of the keys and columns and the values as rows for each news article.
My files are organized by day and I've used TheGuardian API to get the data.
I did not write much yet, but just in case, here's the code as it is:
import pandas as pd
import os
def makePathToFile(path):
    pathToJson = []
    for root, sub, filenames in os.walk(path):
        for name in filenames:
            # join with root so files inside day subfolders are found too
            pathToJson.append(os.path.join(root, name))
    return pathToJson

def readJsonAndWriteCSV(pathToJson):
    for json_file in pathToJson:
        df = pd.read_json(json_file)
Thanks!
You can set up a Google Alert for the news keywords you want, then scrape the results in Python using https://pypi.org/project/galerts/
I am practicing a GitHub machine learning contest using Python. I started from someone else's submission, but got stuck at the first step: using pandas to read the CSV file:
import pandas as pd
import numpy as np
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
print(set(training_data["Well Name"]))
training_data.head()
This gave me the following error message:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 104, saw 3
I could not understand why the .csv file describes itself as an html DOCTYPE. Please help.
Representative segments of the csv data content are attached. Thanks.
It turns out I downloaded the csv file following the convention of regular web operation: right click and save as. The right way is to open the item on GitHub and then open it from GitHub Desktop. I got the tables now. But the way to work with html files from Python is definitely something I should learn more about. Thanks.
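For reference, a minimal sketch of another way to do it: point pandas at the raw file URL instead of the rendered GitHub page (the URL below is a placeholder pattern, not the actual repository path):

import pandas as pd

# raw.githubusercontent.com serves the file itself, not the HTML page around it;
# replace <user>, <repo> and <branch> with the contest repository's details
raw_url = "https://raw.githubusercontent.com/<user>/<repo>/<branch>/facies_vectors.csv"

training_data = pd.read_csv(raw_url)
print(set(training_data["Well Name"]))
training_data.head()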