Error Tokenizing Data - python

I have a csv file from a collaborator. He told me I could read it into Python using
import csv

t = []
f = open("measles.csv", "rb")
d = csv.reader(f, quotechar='"', delimiter="\t", lineterminator='\r\n')
for row in d:
    t.append(row)
I tried to make a dataframe out of the data using pd.DataFrame(t[1:], columns=t[0]), which was successful. However, when I write the resulting dataframe to csv and then try to read it back in with pd.read_csv, I get the following error:
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I suspect it has to do with the way the data was originally given to me. I've tried error_bad_lines=False, but that doesn't seem to work. Any advice?

With pd.read_csv, try the engine='python' parameter.
ex.
df = pd.read_csv(file_name, engine='python')
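If switching the engine alone doesn't help, here is a minimal sketch of the whole round trip, assuming the original file really is tab-delimited (as the csv.reader call suggests): let pandas parse the file directly, then rewrite it in standard comma-separated form.
import pandas as pd

# Parse the original tab-delimited file with the more forgiving python engine.
df = pd.read_csv("measles.csv", sep="\t", quotechar='"', engine="python")

# Rewrite it as plain comma-separated csv; reading this back should tokenize cleanly.
df.to_csv("measles_clean.csv", index=False)
df2 = pd.read_csv("measles_clean.csv")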

Related

Python + Pandas JSON URL call via API: Cannot save results as CSV, index problem with scalar values

Trying to write a simple script to save data as csv. My Python version is 3.8.3 and I am using Windows 10.
I am trying to use the tool pandas: https://pandas.pydata.org
I am trying to get results from the URL https://barcode.monster/api/3061990141101. I installed pandas to convert the JSON file to csv, but there is an "index" problem and none of the answers I found worked:
ValueError: If using all scalar values, you must pass an index
I looked all over Google and forums, and tried adding "index_col=0" and also "index=False".
Below is my Python script:
import json
import urllib.request
import pandas as pd

with urllib.request.urlopen("https://barcode.monster/api/3061990141101") as url:
    data = json.loads(url.read().decode())
print(data)

with open('data.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input, index_col=0)
df.to_csv('test.csv', encoding='utf-8', index=False)
I'm sure this is obvious for any Python dev, I'm just learning from scratch. Many thanks.
There are many ways of solving this issue. If you're going to open a text file, you'll need to convert the string to JSON; try using json.load(f). With the parsed data you can then call the DataFrame constructor, but you will need to either pass an explicit index or wrap the JSON data in an enclosing object.
For example:
with open('data.json', "r") as f_input:
    text = f_input.read()
jsonData = json.loads(text)
df = pd.DataFrame(jsonData, index=[0])
df.to_csv('test.csv', encoding='utf-8', index=False)
Or:
with open('data.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame({'data': data})
df.to_csv('test.csv', encoding='utf-8', index=False)
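As an alternative sketch that skips the intermediate data.json file entirely (assuming pandas 1.0+ for pd.json_normalize), you can flatten the fetched object straight into a one-row DataFrame, which sidesteps the index problem:
import json
import urllib.request
import pandas as pd

# Fetch and parse the JSON response.
with urllib.request.urlopen("https://barcode.monster/api/3061990141101") as url:
    data = json.loads(url.read().decode())

# json_normalize turns a single JSON object into a one-row DataFrame,
# so no explicit index is needed.
df = pd.json_normalize(data)
df.to_csv('test.csv', encoding='utf-8', index=False)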

How to open .ndjson file in Python?

I have a .ndjson file of 20 GB that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I have files with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as json or as a csv file, to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem may be that I have some emojis in my file, and I do not know how to handle their encoding.
ndjson is now supported out of the box with the argument lines=True:
import pandas as pd

df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly.
According to this issue you can do something like this to read it:
import ujson as json
import pandas as pd

records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is a JSON Lines format: each line is a complete JSON document. It is well suited to datasets lacking rigid structure ('non-SQL') where the file size is large enough to warrant splitting across multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
data = [json.loads(l) for l in f.readlines()]
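For the original 20 GB file, a chunked read may remove the need for the external splitting tool altogether. A minimal sketch, where chunksize is an assumed value and process is a hypothetical placeholder for your per-chunk logic:
import pandas as pd

# chunksize together with lines=True returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
reader = pd.read_json('dump.ndjson', lines=True, chunksize=100_000)
for chunk in reader:
    process(chunk)  # hypothetical per-chunk handler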

Convert pkl file to json file

I'm new to Stack Overflow.
I'm trying to convert a pkl file into a json file using Python. Below is my sample code:
import pickle
import pandas as pd
# Load pickle file
input_file = open('file.pkl', 'rb')
new_dict = pickle.load(input_file)
input_file()
# Create a Pandas DataFrame
data_frame = pd.DataFrame(new_dict)
# Copy DataFrame index as a column
data_frame['index'] = data_frame.index
# Move the new index column to the front of the DataFrame
index = data_frame['index']
data_frame.drop(labels=['index'], axis=1, inplace=True)
data_frame.insert(0, 'index', index)
# Convert to json values
json_data_frame = data_frame.to_json(orient='values', date_format='iso', date_unit='s')
with open('data.json', 'w') as js_file:
    js_file.write(json_data_frame)
When I run this code I get the error TypeError: '_io.TextIOWrapper' object is not callable. Following some similar issues (This one and This one), which suggested using a write method with input_file() at line 7, I instead get the error io.UnsupportedOperation: write, which is presumably a writing method, but I'm reading the file, and for reading I'm unable to find an equivalent method.
I also tried to read pickle file in following way
with open('file.pkl', 'rb') as input_file:
    new_dict = pickle.load(input_file)
and I'm getting this error:
DataFrame constructor not properly called!
Can anyone kindly suggest how I can solve this problem?
Suggestions for other tools that can perform this task would also be appreciated. Thanks.
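Edit: a minimal sketch of the file handling that avoids the first TypeError, plus a type check to help narrow down the second error (this assumes nothing about what the pickle actually contains):
import pickle
import pandas as pd

# The with-block closes the file automatically, so the stray input_file()
# call (the source of the TypeError) is not needed.
with open('file.pkl', 'rb') as input_file:
    new_dict = pickle.load(input_file)

# "DataFrame constructor not properly called!" is raised when the loaded
# object is a scalar such as a plain string; checking the type shows what
# pd.DataFrame is actually being handed.
print(type(new_dict))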

"No columns to parse from file" when reading in dictionary

I'm trying to take a dictionary object in python, write it out to a csv file, and then read it back in from that csv file.
But it's not working. When I try to read it back in, it gives me the following error:
EmptyDataError: No columns to parse from file
I don't understand this for two reasons. Firstly, since I used pandas' very own to_csv method, it should be giving me a correctly formatted csv. Secondly, when I print out the header values of the dataframe I'm trying to save (by doing print(df.columns.values)), it says I do in fact have headers ("one" and "two"). So if the object I was writing out had column names, I don't know why they wouldn't be found when I read it back.
import pandas as pd
testing = {"one":1,"two":2 }
df = pd.DataFrame(testing, index=[0])
file = open('testing.csv','w')
df.to_csv(file)
new_df = pd.read_csv("testing.csv")
What am I doing wrong?
Thanks in advance for the help!
pandas.DataFrame.to_csv can take a path directly, so there is no need to open a text handle yourself; in the original code the handle is never flushed or closed before read_csv runs, so the file on disk is still empty. Just remove the open() call and pass the path, and pass index=False to skip the index column.
import pandas as pd
testing = {"one":1,"two":2 }
df = pd.DataFrame(testing, index=[0])
df.to_csv('testing.csv', index = False)
new_df = pd.read_csv("testing.csv")
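If you do want to manage the handle yourself, a sketch using a with-block also works, since it guarantees the buffer is flushed and the file is closed before read_csv opens it (newline='' follows the pandas recommendation for file handles):
import pandas as pd

testing = {"one": 1, "two": 2}
df = pd.DataFrame(testing, index=[0])

# Closing the handle (automatically, at the end of the with-block)
# flushes the csv to disk before it is read back.
with open('testing.csv', 'w', newline='') as file:
    df.to_csv(file, index=False)

new_df = pd.read_csv('testing.csv')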

Error When Reading CSV With C Engine

I have a large data file that I'm trying to read into a Pandas Dataframe.
If I try to read it using the following code:
df = pd.read_csv(file_name,
                 sep='|',
                 compression='gzip',
                 skiprows=54,
                 comment='#',
                 names=column_names,
                 header=None,
                 usecols=column_numbers,
                 engine='python',
                 nrows=15347,
                 na_values=["None", " "])
It works perfectly, but not quickly. If I try to use the C engine to speed the import up though, I get an error message:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 0 fields in line 55, saw 205
It looks like something is going wrong when I change the engine, and the parser isn't figuring out how many/which columns it should be using. What I can't figure out is why: none of the input arguments are only supported by the Python engine.
The problem only occurred after I upgraded from version 14.1 to 16.0.
I can't attach a copy of the data, because it contains confidential information.
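Edit: one way to narrow this down, assuming nothing about the data itself, is to read a small slice with both engines and compare what each one infers (file_name, column_names, and column_numbers are the same variables as above):
import pandas as pd

# Shared arguments; nrows keeps the test slice small.
kwargs = dict(sep='|', compression='gzip', skiprows=54, comment='#',
              names=column_names, header=None, usecols=column_numbers,
              nrows=100, na_values=["None", " "])

py_df = pd.read_csv(file_name, engine='python', **kwargs)
print(py_df.shape)

try:
    c_df = pd.read_csv(file_name, engine='c', **kwargs)
    print(c_df.shape)
except Exception as exc:  # the C parser error surfaces here
    print(exc)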
