I'm converting some PDF documents to text using pdfminer (via pdf2txt.py). I'm not converting directly from PDF to txt, because I want to flag formatting such as italics or bold. Therefore I first convert the PDF to XML.
I'm converting pdf to xml using:
pdf2txt.py -t xml -o out_file.xml in_file.pdf
My problem is that I found an odd error in the XML file when converting this PDF. If you convert it to XML, check the following:
1) On page 21 of the PDF, the second column starts with "Recentemente...".
2) The first paragraph of the first column (of the same page) ends with "...lhes falta".
The resulting XML file contains item 1 (and the full column) just after item 2. You can check this at line 128370 of the XML file. Then at line 131782 the correct order resumes, i.e., the paragraph that starts with "O terceiro..." follows.
So, my question is whether there is a way to avoid this error.
I am reading sections/paragraphs of an input docx file and then copying their content into another docx file at a particular section. The content contains images, tables and bullet points in between the text. However, I'm getting only the text, not the images, tables and bullet points in between it.
The tika module is able to read the whole content, but the whole docx comes back as a single string, so I'm unable to iterate over the sections, and I'm also unable to edit (copy-paste the content into) the output docx file.
I tried python-docx, but it reads only the text and won't identify the images and tables inside a paragraph. python-docx identifies all the images and tables present in the whole document, not the ones belonging to a particular paragraph.
I also tried unzipping the docx to XML, but the XML keeps the images in a separate folder, and the code does not identify the bullets.
def tika_extract_data(input_file, output_file):
    from tika import unpack

    # unpack.from_file returns a dict; the extracted text lives under 'content'
    parsed = unpack.from_file(input_file)
    content = parsed.get('content') or ''
    with open(output_file, 'w') as f:
        for line in content.split("\n"):
            print(line)
            f.write(line + "\n")
I expected the output file to have all the sections replaced with the copied input section content (images, tables, smart art and formatting), but the output file has only the text data.
I am currently extracting Tweets from Twitter using Twitter IDs. The tool I am using comes with the dataset (Twitter IDs) that I have downloaded online and will be using for my masters dissertation. The tool takes the Twitter IDs and extracts the information from the Tweets, storing each Tweet as a JSON string in a .TXT file.
Below is a link to my OneDrive, where I have 2 files:
https://1drv.ms/f/s!At39YLF-U90fhJwCdEuzAc2CGLC_fg
1) Extracted Tweet information, each as a JSON string in a .txt file
2) Extracted Tweet information, each as a JSON string, in what I believe is a .json file. I say 'believe' because the tool I am using automatically creates a file whose name ends in '.json' but saves it in .TXT format. I have simply renamed the file by removing '.txt' from the end
Below is code I have written (it is simple but the more I look for alternative code online, the more confused I become):
import pandas as pd
dftest = pd.read_json('test.json', lines=True)
The following error appears when I run the code:
ValueError: Unexpected character found when decoding array value (2)
I have run the first few Tweet objects through a free online JSON parser and it breaks out the features of the Tweet exactly how I wish (to my knowledge this confirms the Tweet objects are in JSON format). This can be seen in the screenshot below:
I would be grateful if people could:
1) Confirm the extracted Tweets are in fact in a JSON string format
2) Confirm whether, if the file is automatically saved as 'text.json.txt' and I remove '.txt' from the filename, it becomes a .json file
3) Suggest how to get my very short Python script to work. The ultimate aim is to collect the features I want from each Tweet (e.g. "created_at", "text", "hashtags", "location" etc.) in a DataFrame, so I can then save it to a .csv file.
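If pandas keeps raising that ValueError, a fallback is to parse the file line by line with the standard json module, skipping any line that is not valid JSON, and keep only the fields of interest. This is a sketch; the field names are just the ones mentioned above, and tweets_to_csv is a hypothetical helper name:

```python
import csv
import json


def tweets_to_csv(json_lines_path, csv_path, fields):
    """Parse a file with one JSON object per line and save the chosen
    fields to a CSV, skipping lines that are not valid JSON."""
    rows = []
    with open(json_lines_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip malformed lines instead of aborting
            rows.append({field: tweet.get(field) for field in fields})
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that pd.read_json(..., lines=True) expects exactly one JSON object per line; if the tool wrote anything else (a trailing log line, or pretty-printed JSON spread over several lines), that would explain the ValueError, and the loop above will simply skip the offending lines.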
I have some PDF files with two columns per page, and I want to extract their text programmatically. The content of the PDF files is Chinese. I tried the pdfminer3k library for Python 3 and Ghostscript, but neither gave very good results.
In the end I used the open-source GitHub project named textract (deanmalmgren/textract).
But textract cannot detect that every page contains two columns. I use the following code:
import textract

text = textract.process("/home/name/Downloads/textract-master/test.pdf")
print(text)
The PDF file link is https://pan.baidu.com/s/1nvLQnLf
The output shows that the extraction program treats the two columns as one column. How can I correctly extract text from these double-column PDF files?
This is the output produced by the extraction program.
I have a question regarding splitting PDF files. Basically, I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF file becomes a file of its own. I would appreciate your help with this, preferably in Python, but if that is not possible any language will do.
You can use pdftotext for this, wrapped in a Python subprocess. Alternatively you could use another library that already does it implicitly, like textract. Here is a quick example. Note: I have used 4 spaces as the delimiter to split the text into a paragraph list; you might want to use a different technique.
import re
import textract

# read the content of the pdf as text (textract returns bytes)
text = textract.process('file_name.pdf').decode('utf-8')

# use four or more whitespace characters as the paragraph delimiter
# to convert the text into a list of paragraphs
print(re.split(r'\s{4,}', text))
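To go from the paragraph list to one file per paragraph (the original ask), a small follow-up sketch; the four-spaces delimiter is the same assumption as above, and the paragraph_NNN.txt naming is arbitrary:

```python
import re


def split_paragraphs_to_files(text, prefix='paragraph'):
    """Write each paragraph of `text` to its own numbered .txt file,
    using runs of 4+ whitespace characters as the paragraph delimiter."""
    paragraphs = [p.strip() for p in re.split(r'\s{4,}', text) if p.strip()]
    for i, para in enumerate(paragraphs, 1):
        with open('%s_%03d.txt' % (prefix, i), 'w') as f:
            f.write(para)
    return paragraphs
```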
I am converting a generated XML file to a CSV file using xmlutils. However, the nodes that I tagged in the XML file sometimes have an extra child node, which messes up the formatting of the converted CSV file.
For instance,
<issue>
<name>project1</name>
<key>733</key>
</issue>
<issue>
<name>project2</name>
<key>123</key>
<debt>233</debt>
</issue>
I tagged "issue" and the XML file was converted to CSV. However, when I opened the CSV file, the formatting was wrong. Since there was an extra "debt" node in the second issue element, the columns for the second row were shifted.
For instance,
name key
project1 733
project2 123 233
How can I tell xmlutils to generate a new "debt" column?
Also, if xmlutils cannot do the job, can you recommend a tool that is better suited?
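If xmlutils can't be told to emit the extra column, a small stand-alone script with the standard library handles it: collect the union of all child tags first, then let csv.DictWriter fill the blanks. A sketch, assuming the issue elements are wrapped in a single root element (the <issues> wrapper below is an assumption; the snippet above has no root):

```python
import csv
import xml.etree.ElementTree as ET


def issues_to_csv(xml_string, csv_path):
    """Convert <issue> elements to CSV, creating a column for every
    child tag that appears anywhere (missing values are left blank)."""
    root = ET.fromstring(xml_string)
    rows = [{child.tag: child.text for child in issue}
            for issue in root.iter('issue')]
    # union of all child tags, in first-seen order
    fields = []
    for row in rows:
        for tag in row:
            if tag not in fields:
                fields.append(tag)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


xml_data = """<issues>
  <issue><name>project1</name><key>733</key></issue>
  <issue><name>project2</name><key>123</key><debt>233</debt></issue>
</issues>"""
issues_to_csv(xml_data, 'issues.csv')
```

With the sample data above this produces a CSV whose header is name,key,debt, with the debt cell of the first row left empty instead of shifting the columns.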