Wrong encoding on CSV file in Python

I am not sure if I am asking this question correctly, but here's my issue:
I have a .csv file (InjectionWells.csv) that I need to split into columns based on commas. When I try, it just doesn't work, and the only cause I can think of is an encoding problem, but I don't know how to fix it. Can someone shed some light?
Here are a few lines of the actual file:
API#,Operator,Operator ID,WellType,WellName,WellNumber,OrderNumbers,Approval Date,County,Sec,Twp,Rng,QQQQ,LAT,LONG,PSI,BBLS,ZONE,,,
3500300026,PHOENIX PETROCORP INC,19499,2R,SE EUREKA UNIT-TUCKER #1,21,133856,9/6/1977,ALFALFA,13,28N,10W,C-SE SE,36.9003240,-98.2182600,"2,500",300,CHEROKEE,,,
3500300163,CHAMPLIN EXPLORATION INC,4030,2R,CHRISTENSEN,1,470258,11/27/2002,ALFALFA,21,28N,09W,C-NW NW,36.8966360,-98.1777200,"2,400","1,000",RED FORK,,,
3500320786,LINN OPERATING INC,22182,2R,NE CHEROKEE UNIT,85,329426,8/19/1988,ALFALFA,24,27N,11W,SE NE,36.8061130,-98.3258400,"1,050","1,000",RED FORK,,,
3500321074,SANDRIDGE EXPLORATION & PRODUCTION LLC,22281,2R,VELMA,2-19,281652,7/11/1985,ALFALFA,19,28N,10W,SW NE NE SW,36.8885890,-98.3185300,"3,152","1,000",RED FORK,,,
I have tried both of the following and neither of them works:
1.
import pandas as pd
df=pd.read_csv('InjectionWells.csv', sep=',')
print(df)
2.
import pandas as pd
test_data2=pd.read_csv('InjectionWells.csv', sep=',', encoding='utf-8')
test_data2.head()

Your CSV file contains some non-ASCII characters, and the bytes encoding them are not valid UTF-8, so decoding with the default utf-8 codec fails. You need to pass a different encoding.
I tried this and it works:
import pandas as pd
test_data2=pd.read_csv('InjectionWells.csv', sep=',', encoding='ISO-8859-1')
print(test_data2)
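If you are not sure which encoding a file actually uses, you can estimate it before reading. This is a minimal sketch assuming the third-party chardet package is installed (pip install chardet); the detection is only a best guess, so treat it as a starting point:
import chardet
import pandas as pd

# Read a sample of raw bytes and let chardet guess the encoding.
with open('InjectionWells.csv', 'rb') as f:
    raw = f.read(100000)  # a sample is usually enough for detection

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

# Use the guessed encoding when reading the CSV.
df = pd.read_csv('InjectionWells.csv', encoding=guess['encoding'])
print(df.head())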

Related

How to open .ndjson file in Python?

I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I have a file with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that my file contains some emojis, and I do not know how to handle their encoding.
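One likely cause of these errors is that the online tool split the file at arbitrary byte offsets, cutting a JSON record in half. Since ndjson stores one JSON object per line, splitting on newlines keeps every record intact. A minimal sketch of that approach (the file names and chunk size are placeholders):
# Split a large ndjson file into chunks of whole lines,
# so that no JSON record is cut in the middle.
lines_per_chunk = 1000000
chunk_index = 0
out = None

with open('dump.ndjson', 'r', encoding='utf-8') as src:
    for i, line in enumerate(src):
        if i % lines_per_chunk == 0:
            if out is not None:
                out.close()
            out = open('dump.part%03d.ndjson' % chunk_index, 'w', encoding='utf-8')
            chunk_index += 1
        out.write(line)

if out is not None:
    out.close()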
ndjson is now supported out of the box with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)  # lines=True requires orient='records'
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like this to read it.
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is the JSON Lines format: each line is a separate JSON document. It is well suited to datasets lacking a rigid structure ('non-SQL') where the file size is large enough to warrant splitting across multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(l) for l in f.readlines()]
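For a 20 GB source file it may also be worth reading the ndjson in chunks rather than splitting it beforehand. A minimal sketch using pandas' chunksize argument (the path and chunk size are placeholders):
import pandas as pd

# With lines=True and chunksize set, read_json returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
reader = pd.read_json('dump.ndjson', lines=True, chunksize=100000)

for chunk in reader:
    # Process each chunk separately, e.g. filter it and append the result to a file.
    print(chunk.shape)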

Reading XLSB (binary) file with Pandas read_excel using pyxlsb reads empty rows for some xlsb file

I'm trying to read binary Excel files using the read_excel method in pandas with the pyxlsb engine, as below:
import pandas as pd
df = pd.read_excel('test.xlsb', engine='pyxlsb')
If the xlsb file is like this file (right now I'm sharing it via WeTransfer, but if there is a better way to share files on Stack Overflow, let me know), the returned dataframe is filled with NaNs. I suspected it might be because the file was saved with the active cell pointing at empty cells beyond the original data. So I tried this:
import pandas as pd
with open('test.xlsb', 'rb') as data:
    data.seek(0, 0)
    df = pd.read_excel(data, engine='pyxlsb')
but it still doesn't seem to work. I also tried reading the data from byte 0 (the beginning), writing it into a new file, 'test_1.xlsb', and then reading that with pandas, but that doesn't work either.
with open('test.xlsb', 'rb') as data:
    data.seek(0, 0)
    with open('test_1.xlsb', 'wb') as outfile:
        outfile.write(data.read())

df = pd.read_excel('test_1.xlsb', engine='pyxlsb')
If anyone has a suggestion as to what might be going on and how to resolve it, I'd greatly appreciate the help.
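One way to narrow this down is to read the sheet with pyxlsb directly and check whether the cells actually contain values; that tells you whether the problem lies in the file itself or in how pandas builds the dataframe. A minimal sketch, assuming the sheet index used here matches the actual workbook:
from pyxlsb import open_workbook

# Print the first few non-empty rows of the first sheet.
with open_workbook('test.xlsb') as wb:
    with wb.get_sheet(1) as sheet:  # sheets are 1-indexed; a sheet name also works
        for i, row in enumerate(sheet.rows()):
            values = [cell.v for cell in row]
            if any(v is not None for v in values):
                print(values)
            if i > 20:  # only inspect the top of the sheet
                break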

Encoding issues when reading from CSV via pandas.read_csv

I've exported a CSV file from MetaTrader 5 via an MQL5 expert advisor:
EURUSD,2020.02.19 05:04:00,1.07991,1.07991
EURUSD,2020.02.19 05:05:00,1.07991,1.07989
EURUSD,2020.02.19 05:06:00,1.07989,1.07988
EURUSD,2020.02.19 05:07:00,1.07988,1.07989
EURUSD,2020.02.19 05:08:00,1.07989,1.0799
...
Now I need to read this CSV-file with Pandas. When I use the following code...
import pandas as pd
df_rates = pd.read_csv('D:/Rates.csv', header=None, encoding='cp1252')
df_rates.columns = ['Currency','Time','Open','Close']
print(df_rates)
I see 'NaN' instead of all of my data. I have tried different encoding settings, but this doesn't help. My operating system uses Cyrillic locale settings. Any suggestions?
Exported CSV-file is here.
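CSV files written by an MQL5 expert advisor with the FILE_UNICODE flag are UTF-16 encoded, and reading such a file as cp1252 typically yields garbage or NaN columns. Whether that applies here is an assumption, but a UTF-16 read is worth trying; a minimal sketch:
import pandas as pd

# Quick check: a file starting with the bytes b'\xff\xfe' has a UTF-16 LE byte-order mark.
with open('D:/Rates.csv', 'rb') as f:
    print(f.read(2) == b'\xff\xfe')

# If so, read it with the utf-16 codec instead of cp1252.
df_rates = pd.read_csv('D:/Rates.csv', header=None, encoding='utf-16')
df_rates.columns = ['Currency', 'Time', 'Open', 'Close']
print(df_rates)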

Pandas: No columns to parse from file

Good afternoon
I have looked through several of the solutions linked to this problem and nothing has been able to help me. I do not understand whether it is an error with the actual csv file or an error within the code itself. Below is my code:
import pandas as pd
from itertools import islice
import csv
from cStringIO import StringIO
sio = StringIO()
def forex_file():
    with open("USD-ZAR.csv", "r+") as exchange_file:
        for row in islice(csv.reader(exchange_file), 3, 256, None):
            sio.write(row)
    sio.seek(0)

df1 = pd.read_csv(sio, sep=",", encoding="utf-8", delim_whitespace=True)
I purposely placed the "delim_whitespace=True" part in the code, as this has been the common suggestion in many other posts, but it has done nothing in this case since my csv file is split by normal commas rather than whitespace or tabs.
Any help will really be appreciated!
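The "No columns to parse from file" error usually means pandas was handed an empty stream: in the code above, forex_file() is never called, and csv.reader yields lists, which cannot be written to a StringIO directly anyway. If the goal is simply to skip the first three rows and read a bounded slice of the file, read_csv can do that on its own. A minimal sketch, assuming the row window from the original islice call is what is wanted and that the file has no header:
import pandas as pd

# Skip the first 3 rows and read the next 253 rows (rows 3..255 of the file),
# matching the islice(..., 3, 256) window in the original code.
df1 = pd.read_csv("USD-ZAR.csv", sep=",", encoding="utf-8",
                  skiprows=3, nrows=253, header=None)
print(df1.head())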

How to read a data file including "pandas.core.frame, numpy.core.multiarray"

I came across a .df file encoded in a binary format. When I open it with Vim, I can still see strings like "pandas.core.frame" and "numpy.core.multiarray", so I guess it is related to Python. However, I know little about the Python language; although I have tried the pandas and numpy modules, I failed to read the file. Could you give any suggestions on this issue? Thank you in advance. Here is the Dropbox link to the DF file: https://www.dropbox.com/s/b22lez3xysvzj7q/flux.df
Looks like a DataFrame stored with pickle; use read_pickle() to read it:
import pandas as pd
df = pd.read_pickle('flux.df')
