Reading pdf contents using Python [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I am trying to read the PDF file below, and I need to save each article in a separate file.
https://dl.dropboxusercontent.com/u/23092311/sample.pdf
An article can span one or more pages. I have used PDFMiner to convert the entire PDF to a text file, but I don't know how to split it into individual articles.
I am new to Python. Could you suggest a method or sample code for extracting each article separately?

I'll be honest: I've never used PDFMiner before. But if you already have the PDF as a text file, couldn't you read the text file into a string and then use the split function to divide the string into articles based on the "The New York Times" heading? That assumes PDFMiner is capable of reading that fancy font, and I don't know whether it is.
Looking at the file you provided, you could do something like the following:
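# Read the PDFMiner text output into one string
with open('test.txt') as reading:
    full_paper = reading.read()
# Split on the copyright line that separates the articles
split_paper = full_paper.split('Copyright 2014 The New York Times Company. All Rights Reserved.')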
split_paper would then be a list containing your articles at indexes 1 through 6 (index 0 would contain the initial heading). You'd have to do some further string cleanup to get the exact articles, but that should at least get you started.
Make sense?
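To go from the split list to one file per article, something like the following might work. It is only a sketch: the function name save_articles, the article_N.txt naming scheme, and the output directory are assumptions made for illustration, and the delimiter is the copyright line used above.

```python
from pathlib import Path

# Delimiter taken from the answer above; adjust it if your text differs.
DELIMITER = "Copyright 2014 The New York Times Company. All Rights Reserved."


def save_articles(text, out_dir, delimiter=DELIMITER):
    """Split text on the delimiter and write each non-empty piece after the
    first one (the initial heading) to its own file.

    Returns the list of paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    pieces = text.split(delimiter)[1:]  # index 0 is the initial heading
    for number, piece in enumerate(pieces, start=1):
        article = piece.strip()
        if not article:
            continue  # skip empty pieces, e.g. after a trailing delimiter
        target = out_path / f"article_{number}.txt"  # hypothetical naming scheme
        target.write_text(article, encoding="utf-8")
        written.append(target)
    return written
```

You would call it with the string read from the text file, for example save_articles(full_paper, "articles"). The extra cleanup mentioned above would still be needed inside each piece.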

Related

Scraping and Storing in CSV file(by managing text obtained) [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
The figure shows the output after requesting the URL and removing all the div element tags. I now need to store the Area, Bedroom, Location, Price, and Floor data in a CSV file. How can I do this? I only know Python's basic functions and methods. How can I index into such output?
Output by some manipulation done in URL request which is to be stored in CSV file
@shivani Karna, there are many options here. Here are two approaches I would consider:
Open a file with a context manager and write each found element on a new line:
https://www.w3schools.com/python/python_file_write.asp
Parse the elements into a dictionary, write the contents to a pandas DataFrame, and then save it to a CSV file for a readable format:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
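A minimal sketch of the dictionary route, using the standard library's csv.DictWriter as a stand-in for pandas so that it runs without extra installs. The column names come from the question; the function name write_listings and the shape of the scraped rows (one dictionary per listing) are assumptions.

```python
import csv

# Column names taken from the question.
FIELDS = ["Area", "Bedroom", "Location", "Price", "Floor"]


def write_listings(rows, csv_path, fields=FIELDS):
    """Write a list of dictionaries to a CSV file, one row per listing.

    Keys missing from a dictionary are written as empty cells.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

If pandas is installed, pandas.DataFrame(rows).to_csv(csv_path, index=False) should give a similar file from the same list of dictionaries.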

Merging several word docx into one another using read() and write() fails [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I want to use the read and write methods to merge two Word documents into another one, but only the content of the f1 document is written successfully; writing the f2 document does not work. I tried the following:
# coding=utf-8
# Raw strings keep the backslashes in the Windows paths literal
f = open(r'C:\Users\Desktop\word.doc', 'ab')
f1 = open(r'C:\Users\Desktop\word1.doc', 'rb')
f2 = open(r'C:\Users\Desktop\word2.doc', 'rb')
data1 = f1.read()
data2 = f2.read()
f.write(data1)
f.write(data2)
f1.close()
f2.close()
f.close()
Microsoft Word document format is much more than pure text. Simply concatenating two documents, which is effectively what you are doing, will not work at all.
The proper way to concatenate two documents in DOCX format is to open them with an appropriate module, e.g. python-docx (or docx), that understands the internal structure of the document (a zip-compressed folder with numerous XML files; you can check this yourself by changing the extension to .zip and decompressing the contents).
The recipe for concatenating two Word documents should prove helpful.
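To see the zip structure without renaming anything, a small sketch with the standard library's zipfile module may help. The function name list_docx_parts is made up for illustration, and this applies only to .docx files; the older binary .doc format used in the question is not a zip archive.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of the files stored inside a DOCX (zip) archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

For a typical DOCX the returned list includes entries such as word/document.xml, which is why writing the bytes of one file after another does not produce a valid merged document.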

Extraction of tables from PDF [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have a PDF file containing text, images, and tables. I want to extract just the tables from that PDF file using either Python or R.
If you are considering using R, I would recommend the tabulizer package.
It is available here and is very easy to use.
To install it, use the following commands:
install.packages("devtools")
devtools::install_github("ropensci/tabulizer")
And using one of their examples:
library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
# Where f is your selected PDF file.
out1 <- extract_tables(f)
# Or even better, say what page the tables are in.
out2 <- extract_tables(f, pages = 1, guess = FALSE, method = "data.frame")
You'll probably find PyPI useful: you can search it for specific things like 'PDF' and it will give you a list of modules relating to PDFs (here). You'll probably want PDF 1.0, judging from its weight on PyPI. This should help you get started!

How to change data in notepad using python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
OK, so I want to make a simple game where saving is essential. I wish to store all the data in Notepad. However, how do I pull Notepad data into Python? I have tried saving the Notepad file in the same folder as my .py file, but I'm not sure what to do in the program. I do know how to program in Python, but how do I pull data out of Notepad into Python?
I also want to know how to change the contents of the Notepad file. Let's say it has a couple of variables with certain numbers; I want to change those in the game.
Thank you for any answers. :)
You stored your data in a text file, not in Notepad. Notepad is an application for reading and editing the data inside a text file.
Suppose you stored the data in a text file named file.txt using Notepad, and now you want to read it from your Python code. You can read it directly as follows:
Python code:
# Read every line of file.txt into a list
with open("file.txt", "r") as file_pointer:
    array = file_pointer.readlines()
# Each entry still ends with its newline character
print(array[0])
print(array[1])
print(array[2])
print(array[3])
file.txt
25
554
51
4147
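The question also asks how to change the values. One possible sketch, assuming the file holds one integer per line as in the file.txt example above; the function names read_numbers and write_numbers are made up for illustration.

```python
def read_numbers(path):
    """Return the integers stored one per line in a text file."""
    with open(path, "r", encoding="utf-8") as handle:
        # Skip blank lines; int() ignores surrounding whitespace.
        return [int(line) for line in handle if line.strip()]


def write_numbers(path, numbers):
    """Overwrite the text file with one integer per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for number in numbers:
            handle.write(f"{number}\n")
```

A game could call read_numbers("file.txt") when it starts, change entries in the returned list while it runs, and call write_numbers("file.txt", values) to save.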

extracting data from several xml-files with python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I just started learning Python for my new job, so everything is quite difficult for me, even if the task sounds pretty straightforward.
I would like to extract several nodes from multiple XML files, ideally putting the information into an Excel file in the end. Every row should contain the information from one XML file, and the columns should represent the specific nodes I am looking for, like "Zip-code" and "town". Not all XML files contain all nodes, so it would be perfect if, when the node "Zip-code" doesn't exist, the cell is simply left blank.
Could someone please give me a few hints on how to start with this, or alternatively suggest a dedicated program that is easy to learn and use? My company and I only need to do this once, for about 2000 files.
Thank you very much =)
For opening the files and getting their contents, you can use Python's built-in file functions: Documentation.
For XML parsing, I always use Beautiful Soup. It's an HTML/XML parser with good documentation that mostly "just works".
For creating the Excel file, you can use XlsxWriter.
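As a rough sketch of the whole pipeline, the following uses only the standard library: xml.etree.ElementTree stands in for Beautiful Soup and a CSV file (which Excel can open) stands in for XlsxWriter. The function names are hypothetical, and the tag names are assumed to appear without XML namespaces.

```python
import csv
import xml.etree.ElementTree as ET


def extract_row(xml_path, tags):
    """Return a dict mapping each tag to the text of its first matching
    element in the file, or an empty string if the tag is missing."""
    root = ET.parse(xml_path).getroot()
    row = {}
    for tag in tags:
        node = next(root.iter(tag), None)  # first element with this tag, if any
        row[tag] = (node.text or "").strip() if node is not None else ""
    return row


def xml_files_to_csv(xml_paths, tags, csv_path):
    """Write one CSV row per XML file, with one column per requested tag."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=tags)
        writer.writeheader()
        for xml_path in xml_paths:
            writer.writerow(extract_row(xml_path, tags))
```

For about 2000 files, the list of paths could come from glob.glob("*.xml"), and the tags would be the node names from the question, such as ["Zip-code", "town"].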
