I'm a linguistics student, and I'm downloading tweets in Italian for my thesis. I've been reading previous answers to similar problems, but none of them worked for me: after downloading the tweets, they are perfectly readable if I print them in the PyCharm terminal, but when I open the CSV file, no matter the program (LibreOffice on Ubuntu 18.04, Excel 2010, a plain text editor), characters like "é è à" and so on are displayed as garbled Unicode sequences.
I've tried every tutorial here and elsewhere, but I'm not having any success. Any idea what I could do?
Thanks a lot
Two options you can try.
Use Sublime Text (free trial): open your CSV file, then use Save with Encoding... and choose "UTF-8".
Import (rather than open) the file with Excel: open a blank sheet, then Import and choose CSV File. In the import wizard, choose "UTF-8" as the source.
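Beyond those two options, if you'd rather fix the file once in Python instead of re-importing it each time, you can rewrite it with a UTF-8 byte-order mark, which Excel and LibreOffice use to auto-detect the encoding. A minimal sketch, assuming the file really is UTF-8 (which the readable PyCharm output suggests); the file names are hypothetical:
# Re-save an existing UTF-8 CSV with a BOM ("utf-8-sig") so spreadsheet
# apps detect the encoding when the file is opened directly.
with open("tweets.csv", "r", encoding="utf-8") as src:
    data = src.read()
with open("tweets_bom.csv", "w", encoding="utf-8-sig") as dst:
    dst.write(data)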
I'm having an issue getting an online PDF file into Python. Below is the code I wrote:
import PyPDF2
import pandas as pd
from PyPDF2 import PdfReader
reader = PdfReader(r"http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
and this gives me an error:
OSError: [Errno 22] Invalid argument: 'http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf'
If we fix this, how do we separate the extracted data into separate columns using pandas?
There are three tables in this PDF file; I need the first one. I have tried so many tutorials but none of them helped me. Can anyone help me in this regard, please?
Thanks,
Snyder
Part one of your question is how to access the PDF content for extraction.
In order to view, modify, or extract the contents, the bitstream needs to be saved as an editable local file. That's why a binary DTP / printout file needs downloading to view: every character on your browser screen was downloaded as text, then converted from local file byte storage into graphics.
The simplest method is
curl -O http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf
which saves a working copy locally as 20221004MERGED.pdf
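If you'd rather stay in Python, PdfReader also accepts a file-like object, so you can fetch the PDF into memory with the third-party requests library instead of saving it first. A minimal sketch under that assumption (the URL is the one from the question):
import requests
from io import BytesIO
from PyPDF2 import PdfReader

url = "http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error

reader = PdfReader(BytesIO(response.content))  # read from memory, no temp file
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"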
The next issue is that multi-language files are a devil to extract, and that one has many errors that need editing before extraction.
In Acrobat and other viewers there are failures where eastern characters are mixed in with western ones due to the language mixing, so the file needs a corrective edit. Also, the underlying text for extraction, as seen by PDF readers, consists of western characters that are translated inside the PDF by glyph mapping; the extractable, searchable text is just garbled plain text. This is what Adobe sees for search in that first line: k`l²zìq&` m[&Sw`n (note the W, third character from the right).
Seriously, there are just so many subset-related problems to address that it is easiest to open the PDF in an editor and reset the fonts to what they should be in each language.
The fonts you need in Word, Office, etc. include Kandy, which I used to correct that word, among others.
I just want to scrape Chinese-language data. Everything was going well until I hit an encoding issue: when I run the program it prints properly on the terminal, but when I save to CSV I get some weird symbols. Is there any way to get rid of them?
Here is the terminal result:
{'Name': ' 『受注生産』KREX コラボフーディー'}
In the CSV:
『å—注生産ã€KREX コラボフーディー
We get that type of weird symbol.
The issue is not with the CSV but with Microsoft Excel itself. I have faced a similar issue: if you open the file in a text editor, you will notice the characters are correctly encoded, but opening the CSV directly in Excel will not work.
To overcome the issue, open a new spreadsheet, go to the Data tab and click on From Text.
Then select UTF-8 as the File Origin.
And then select the Comma delimiter option.
Once done, you will see the correct data in Excel with the proper encoding.
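Alternatively, you can sidestep the import dialog entirely by writing the CSV with a UTF-8 byte-order mark, which Excel uses to detect the encoding when the file is opened directly. A minimal sketch using Python's utf-8-sig codec; the row is the example record from the question, and the output file name is hypothetical:
import csv

rows = [{'Name': ' 『受注生産』KREX コラボフーディー'}]

# "utf-8-sig" writes a BOM first, so Excel opens the file as UTF-8
with open("products.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["Name"])
    writer.writeheader()
    writer.writerows(rows)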
I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊ݢቹៜϐѦჾѱ॥ᓀϩӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought maybe it was a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Acrobat Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.
I recently acquired a ton of data stored in Visual FoxPro 9.0 databases. The text I need is in Cyrillic (Russian), but of the 1000 .dbf files (complete with .fpt and .cdx files), only 4 or 5 return readable text. The rest (usually in the form of memos) return something like this:
??9Y?u?
yL??x??itZ?????zv?|7?g?̚?繠X6?~u?ꢴe}
?aL1? Ş6U?|wL(Wz???8???7?#R?
.FAc?TY?H???#f U???K???F&?w3A??hEڅԦX?MiOK?,?AZ&GtT??u??r:?q???%,NCGo0??H?5d??]?????O{??
z|??\??pq?ݑ?,??om???K*???lb?5?D?J+z!??
?G>j=???N ?H?jѺAs`c?HK\i
??9a*q??
For the life of me, I can't figure out how this is encoded. I have tried all kinds of online decoders, opened the .dbfs in many database programs, and used Python to open and manipulate them. All of them return the same sort of mess as above, but never readable Russian.
Note: I know that these databases are not corrupt, because they came accompanied by enterprise software that can open, query and read them successfully. However, that software will not export the data, so I am left working directly with the .dbfs.
Happy to share an example .dbf if it would help get to the bottom of this.
Since it is a FoxPro database, I would expect the Russian there to be encoded in some pre-Unicode encoding for Russian, as for most Eastern European languages in ancient times.
For example: Windows-1251 or ISO 8859-5.
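A quick way to test candidates is to grab the raw bytes of one memo value and decode them with each codec in turn. A minimal sketch; the byte string below is a hypothetical stand-in (it happens to be "Привет" in cp1251), and cp866 and koi8_r are two further codecs that were common for Russian:
raw = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])  # hypothetical memo bytes

for name in ("cp1251", "iso8859_5", "cp866", "koi8_r"):
    # errors="replace" keeps the loop alive on bytes a codec can't map
    print(name, "->", raw.decode(name, errors="replace"))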
'?' characters don't convey much. Try looking at the contents of the memo fields as hex, and see whether what you're seeing looks anything like text in any encoding. (Apologies if you've tried this using Python already.) Of course, if it is actually encrypted, you may be out of luck unless you can find out the key and method.
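For example, a small sketch of that hex inspection in Python (the memo file name is hypothetical, and FoxPro memo files begin with some block-size bookkeeping, so don't judge by the very first bytes alone):
with open("example.fpt", "rb") as f:
    raw = f.read(512)

print(raw.hex(" "))  # raw bytes as space-separated hex (Python 3.8+)
print(raw.decode("cp1251", errors="replace"))  # quick readability check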
There are two possibilities:
the encoding has not been correctly stored in the dbf file
the dbf file has been encrypted
If it's been encrypted I can't help you. If it's a matter of finding the correct encoding, my dbf package may be of use. Feel free to send me a sample dbf file if you get stuck.
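For illustration, here is a sketch using the separate third-party dbfread package (not the dbf package mentioned above), whose DBF class takes an explicit encoding argument that overrides the codepage recorded in the file header; the file name is hypothetical:
from dbfread import DBF

# Try the usual Russian codepages until the memo text reads as Russian.
for encoding in ("cp1251", "cp866", "koi8_r"):
    table = DBF("example.dbf", encoding=encoding)
    first = next(iter(table))  # records behave like dicts
    print(encoding, "->", first)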
I want to get the content (text only) of a .ppt file. How can I do that?
(By analogy: to get the content of a .txt file, I just need to open and read it. What do I need to do to get the information out of .ppt files?)
By the way, I know there is win32com on Windows systems, but I am working on Linux now; is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc.), and the Apache Tika project (which appears to be the 5,000 lb gorilla in this solution space).
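If you want to drive Tika from Python, the tika-python bindings make it a couple of lines. A minimal sketch (it needs Java installed and starts a local Tika server on first use; the file name is hypothetical):
from tika import parser

parsed = parser.from_file("slides.ppt")  # also handles .pptx, .doc, .pdf, ...
print(parsed["content"])   # the extracted plain text
print(parsed["metadata"])  # document metadata, if needed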