Encoding Issue while scr - python

I want to scrape Chinese-language data. Everything works except for an encoding issue: when I run the program the text prints correctly in the terminal, but when I save it to a CSV file I get weird symbols. Is there any way to get rid of them?
Here is the terminal result:
{'Name': ' 『受注生産』KREX コラボフーディー'}
In the CSV:
『å—注生産ã€KREX コラボフーディー
That is the kind of weird symbol we get.

The issue is not with the CSV but with Microsoft Excel itself. I have faced a similar issue: if you open the file in a text editor, you will see that the characters are correctly encoded, but opening the CSV directly in Excel does not work.
To work around this, open a new spreadsheet, go to the Data tab and click From Text.
Then select UTF-8 as the File Origin.
Then select the Comma delimiter option.
Once done, you will see the correct data in Excel with the proper encoding.
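If you control the script that writes the file, another option is to write a UTF-8 byte order mark at the start of the CSV, which Excel commonly uses to recognise the file as UTF-8 when it is opened directly. This is only a minimal sketch: the function name write_rows_for_excel and the column name are illustrative assumptions, not taken from the original scraper.

```python
import csv


def write_rows_for_excel(path, rows, fieldnames):
    """Write a list of dicts to a CSV file that starts with a UTF-8 BOM.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # "utf-8-sig" prepends the byte order mark EF BB BF to the file.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_rows_for_excel("products.csv", [{"Name": "『受注生産』KREX コラボフーディー"}], ["Name"]) writes a one-row file. When reading such a file back in Python, open it with encoding="utf-8-sig" so the mark is stripped again.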

Related

Reading an online based pdf files in python and separating data in to columns -OSError

I'm having trouble loading an online PDF file into Python. Below is the code I wrote:
import PyPDF2
import pandas as pd
from PyPDF2 import PdfReader
reader = PdfReader(r"http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
And this gives me an error:
OSError: [Errno 22] Invalid argument: 'http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf'
If we fix this, how do we separate the extracted data into separate columns using pandas?
There are three tables in this PDF file. I need the first one. I have tried many tutorials, but none of them helped. Can anyone help me with this, please?
Thanks,
Snyder
Part one of your question is how to access the PDF content for extraction.
In order to view, modify or extract the contents, the byte stream needs to be saved as an editable file. That is why a binary DTP/printout file needs to be downloaded before it can be viewed. Every character on your browser screen was downloaded as text, then converted from local file byte storage into graphics.
The simplest method is
curl -O http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf
which saves a working copy locally as 20221004MERGED.pdf
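The same download step can be done from Python before handing the file to the PDF reader. The following is a rough sketch rather than a tested solution for this particular PDF: the helper names download and extract_text are made up for illustration, and extract_text assumes PyPDF2 is installed, as in the question.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Save the bytes found at `url` to the local path `dest`.

    Hypothetical helper; the name is illustrative only.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return Path(dest)


def extract_text(pdf_path):
    """Return the text of every page of a local PDF, one page per line block."""
    # Imported here so that download() works even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may return an empty value for pages without a text layer.
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

Usage would be along the lines of local = download("http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf", "20221004MERGED.pdf") followed by text = extract_text(local). The point is that PdfReader is given a local file rather than a URL, which is what caused the OSError.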
The next issue is that multi-language files are a devil to extract, and this one has many errors that need editing before extraction.
Here we see in Acrobat or other viewers (on the left) that there are failures where eastern characters are mixed in with western ones due to the language mixing, so they need a corrective edit as shown on the right. Also, the underlying text for extraction, as seen by PDF readers, is western characters that get translated inside the PDF by glyph mapping, but as extractable, searchable text they are just garbled plain text. This is what Adobe sees for search on that first line: k`l²zìq&` m[&Sw`n, so you can see the W as the 3rd character from the right.
Seriously, there are so many subset-related problems to address that it is easiest to open the PDF in any editor and reset the fonts to what they should be in each language.
The fonts you need in Word, Office etc. are Kandy, which I used to correct that word, plus these others :-

Avoid pop-ups in Excel while running code in Python

I want to be able to open an Excel document and start manipulating the data without seeing any pop-ups.
I think the pop-ups are what stops my Excel file from opening successfully. Here are the pop-ups I am seeing in Excel; I would like to answer them automatically instead of manually. I found some answers online, but none for my case.
The file format and extension of "xxx" don't match. The file could be
corrupted or unsafe. Unless you trust its source, don't open it. Do
you want to open it anyway?
option1: Yes , option2: No , option3: Help
or
Open XML Please select how you would like to open this file:
As an XML table
As a read-only workbook
use the XML Source task pane
or
XML Import Error
ok
help
After I select Yes, then As an XML table, then OK, everything works perfectly. If anyone could help me out, I would much appreciate it.

Unreadable characters from Python to csv file

I'm a linguistics student and I'm downloading tweets in Italian for my thesis. I've been reading previous answers to similar problems, but none of them worked for me. After downloading the tweets, they are perfectly readable in the PyCharm terminal, but when I open the CSV file, no matter the program (LibreOffice on Ubuntu 18.04, Excel 2010, Txt), characters like "é è à" and so on are displayed as a Unicode string.
I tried every tutorial here and elsewhere without success. Any idea what I could do?
Thanks a lot
Two options you can try.
Use Sublime Text (free trial): Open your CSV file, then Save with encoding... and choose "UTF-8"
Import (rather than open) with Excel: Open blank sheet. Then Import, choose CSV File. In the following Assistant choose "UTF-8" as Source.
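The "Save with encoding" step can also be scripted. The sketch below is an illustration only: resave_as_utf8 is a made-up helper name, and source_encoding is something you have to know or guess for your own file, since it cannot be read from the thread.

```python
def resave_as_utf8(src, dst, source_encoding, target_encoding="utf-8"):
    """Read `src` using `source_encoding` and write the same text to `dst`.

    Hypothetical helper; pass target_encoding="utf-8-sig" to add a
    byte order mark for programs that rely on it.
    """
    # newline="" on both sides keeps the original line endings untouched.
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=target_encoding, newline="") as f:
        f.write(text)
    return dst
```

If the wrong source_encoding is given, the call either raises UnicodeDecodeError or produces garbled text, so it is worth checking a few accented characters in the output.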

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. The "Â" is what you see when a UTF-8 non-breaking space (bytes 0xC2 0xA0) is decoded as Latin-1: the first byte shows up as "Â". Assuming you are running Windows, you are using one of the Latin character sets. UTF-8 is the default encoding on OS X and Linux-based OSs. Most text editors use the OS locale as their default, and thus encode files created with them as Latin-1. A lot of programmers on OS X have problems with non-breaking spaces because they are very easy to type by mistake (Option+Spacebar) and impossible to see.
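The effect is easy to reproduce in Python 3. This is a small illustration, with normalize_spaces as a made-up helper name, showing how the stray character appears and one way to replace non-breaking spaces once the file has been decoded with the right encoding.

```python
NBSP = "\u00a0"  # non-breaking space

# In UTF-8 the non-breaking space is two bytes: 0xC2 0xA0.
utf8_bytes = NBSP.encode("utf-8")

# Decoding those two bytes as Latin-1 yields two characters:
# U+00C2 (a-circumflex) followed by U+00A0 (the non-breaking space itself).
misread = utf8_bytes.decode("latin-1")


def normalize_spaces(text):
    """Replace non-breaking spaces with ordinary spaces (hypothetical helper)."""
    return text.replace(NBSP, " ")
```

Opening the file with open(path, encoding="utf-8", newline="") before passing it to csv.reader avoids the misreading in the first place, assuming the file really is UTF-8.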
In Python >= 3.1, the csv reader supports dialects for solving these kinds of problems. If you know which program was used to create the CSV file, you can specify a dialect manually, such as 'excel'. You can also use csv.Sniffer to deduce it automatically by peeking into the file.
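As a rough sketch of the sniffer approach: csv.Sniffer().sniff() inspects a sample of the text and returns a dialect that can be passed to csv.reader. The helper name read_with_sniffed_dialect and the candidate delimiter list are assumptions for illustration. Note that the sniffer guesses delimiters and quoting only; it does not detect the character encoding.

```python
import csv
import io


def read_with_sniffed_dialect(text, delimiters=",;\t"):
    """Parse CSV text using a dialect guessed from its first 1024 characters.

    Hypothetical helper; `delimiters` restricts which characters the
    sniffer may pick as the field separator.
    """
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=delimiters)
    # newline="" stops StringIO from translating line endings.
    return list(csv.reader(io.StringIO(text, newline=""), dialect))
```

Sniffing is heuristic, so on short or irregular samples it can raise csv.Error or pick the wrong delimiter; specifying the dialect by hand is safer when the source program is known.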
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.

txt file appears blank with .write() python

I am on a Windows machine and am trying to write a couple thousand lines to a text file using IPython. To test this, I am just trying to get some text to appear in the file.
My code is as follows:
path="\Users\\*****\Desktop"
with open(path+'newheaders.txt','wb') as f:
    f.write('new text')
This question (.write not working in Python) is answered and seems like it should have solved my issue but when I open the text file it is still blank.
I tested the file using the code below and the text appears to be there.
with open(path+'newheaders.txt','r') as f:
    print f.read()
Any ideas?
This 'should' work as written. A few things to try (I would put this in a comment but I lack sufficient reputation):
Delete the file and make sure the program is creating the file
Try writing as 'wt' rather than binary to see if we can narrow down the problem that way.
Remove all the business with the path and just try to write the file in the current directory.
What text editor are you using? Is it possible it's not refreshing the blank file?
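On the point about the path: path+'newheaders.txt' joins the two strings with no separator, so the file name becomes "Desktopnewheaders.txt" in the parent folder rather than "newheaders.txt" inside Desktop. Whether that explains the blank file here is a guess, not a confirmed diagnosis, but it is worth ruling out. A minimal Python 3 sketch, with write_and_verify as a made-up helper name:

```python
import os


def write_and_verify(directory, filename, text):
    """Write `text` to directory/filename and read it back.

    Hypothetical helper: returns the full path used and the content read,
    so you can see exactly which file was written.
    """
    # os.path.join inserts the path separator that plain "+" leaves out.
    full_path = os.path.join(directory, filename)
    with open(full_path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(full_path, "r", encoding="utf-8") as f:
        return full_path, f.read()
```

Printing the returned full path shows where the file actually landed, which makes it easy to confirm that the editor is opening the same file the script wrote.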
