pandas how to read this row? - python

Data sample: the program fails on the second row, which has 7 "," characters while a normal row has only 6.
7558,1488,1738539,,,,1
7559,1489,1702292,,"(segment \"Pesnya, ili Kak velikij Luarsab khor organizovyval\")",8,1
7560,1489,2146930,1975,,21,1
It is from the cast_info table of the IMDB dataset. ([IMDB][2] is used in a database task called cardinality estimation.) Its separator is ",". But when a separator appears inside a quoted string, pandas does not recognize it as part of the string.
The error log:
File "\pytorch\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
return parser.read(nrows)
File "\pytorch\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
index, columns, col_dict = self._engine.read(nrows)
File "\pytorch\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 223, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 7559, saw 8
How can I solve it?
[2]: https://www.imdb.com/interfaces/

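Try passing escapechar; I think this should work. The quotes inside the string are escaped with a backslash, and pandas does not treat the backslash as an escape character unless told to, so the quoted field ends early and the comma after "Pesnya" is read as a separator.
import pandas as pd
df = pd.read_csv(data_path, sep=",", escapechar="\\")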
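The backslash-escaped quotes in the sample rows can be handled with the escapechar parameter of read_csv. The following is a minimal sketch that parses the three rows from the question out of an in-memory string; header=None is an assumption, since the sample shows no header row, and the real file may differ.

```python
import io

import pandas as pd

# The three sample rows from the question. In the file, each inner quote is
# preceded by one backslash; the doubled backslashes here are Python escaping.
sample = (
    '7558,1488,1738539,,,,1\n'
    '7559,1489,1702292,,"(segment \\"Pesnya, ili Kak velikij Luarsab khor organizovyval\\")",8,1\n'
    '7560,1489,2146930,1975,,21,1\n'
)

# header=None is an assumption: the sample shows no header row.
# escapechar tells the parser that a backslash escapes the next character.
df = pd.read_csv(io.StringIO(sample), header=None, escapechar="\\")

print(df.shape)       # number of rows and columns that were parsed
print(df.iloc[1, 4])  # the text field of the second row, as one value
```

With escapechar set, the comma inside the quoted text stays part of the field instead of starting a new one.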

Related

Error merging multiple CSV files - Python

I'm trying to merge several CSV files into one.
After looking at several methods, I found this one:
files = glob.glob("D:\\green_lake\\Projects\\covid_19\\tabelas_relacao\\acre\\*.csv")
files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
When I run it, this error is returned:
>>> files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 243, saw 4
I'm just starting to study Python, so I apologize if it's a silly mistake ;)
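When many files are read in one list comprehension, the traceback does not say which file failed. One way to narrow it down is to read the files one at a time and record the ones that raise ParserError. This is only a sketch: the helper name read_csvs_reporting_failures and the two demo files are made up for illustration, and the cause in the asker's data (for example a different delimiter or extra lines above the header) cannot be determined from the question.

```python
import glob
import os
import tempfile

import pandas as pd


def read_csvs_reporting_failures(pattern, **read_csv_kwargs):
    """Read every file matching pattern; return (merged frame, {path: error})."""
    frames, failures = [], {}
    for path in sorted(glob.glob(pattern)):
        try:
            frames.append(pd.read_csv(path, **read_csv_kwargs))
        except pd.errors.ParserError as exc:
            # Remember which file could not be tokenized, and why.
            failures[path] = str(exc)
    merged = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return merged, failures


# Demo with two made-up files: one well-formed, one with extra lines on top.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.csv"), "w") as f:
        f.write("a,b\n1,2\n3,4\n")
    with open(os.path.join(tmp, "bad.csv"), "w") as f:
        f.write("Report title\nnote\n1,2,3,4\n")
    merged, failures = read_csvs_reporting_failures(os.path.join(tmp, "*.csv"))

print(len(merged))  # rows from the files that parsed
print(failures)     # path and error message for each file that did not
```

Opening a reported file in a text editor around the line number given in the message usually shows what differs from the other files.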

Python Read In Google Spreadsheet Using Pandas

I have a file in Google Sheets that I want to read into a pandas DataFrame, but it gives me an error that I don't understand.
This is the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv")
print(df)
And this is the error :
File "c:\Users\blaghzao\Documents\Stage PFA(Laghzaoui Brahim)\google_sheet.py", line 3, in <module>
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv")
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1255, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\blaghzao\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 89 fields in line 3, saw 243
I found the answer: the problem was just the access permissions of the file.
Remove gid from the URL in the code:
import pandas as pd
sheet_id = "1HUbEhsYnLxJP1IisFcSKtHTYlFj_hHe5v21qL9CVyak"
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv")
print(df)
As far as I know, this error is raised when the file is read with a comma delimiter and a row contains more commas than expected.
Can you try the read_csv() call below to skip such rows?
df = pd.read_csv(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?gid=556844753&format=csv", on_bad_lines='skip')
This skips the bad lines, so you can narrow down the problem by checking which lines were skipped. I believe your CSV export does not match the format that pandas read_csv() expects.
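To see what on_bad_lines does, here is a small sketch on a made-up in-memory sample (the column names and values are not from the asker's sheet). It shows two variants: 'skip', available from pandas 1.3, and a callable that records each malformed row, which as far as I know needs pandas 1.4 or later and engine="python".

```python
import io

import pandas as pd

# Hypothetical sample: the third line has one field too many.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# Variant 1: drop malformed rows silently (pandas 1.3+).
skipped = pd.read_csv(io.StringIO(sample), on_bad_lines="skip")

# Variant 2: collect malformed rows for inspection
# (pandas 1.4+, only supported by the python engine).
bad_rows = []


def remember(bad_line):
    """Record the offending row and return None so pandas drops it."""
    bad_rows.append(bad_line)
    return None


kept = pd.read_csv(io.StringIO(sample), on_bad_lines=remember, engine="python")

print(skipped)   # the rows that had the expected number of fields
print(bad_rows)  # the rows that were dropped, as lists of strings
```

The second variant is the more useful one for diagnosis, because the dropped rows are available afterwards instead of being discarded.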

Python script works but throws error - pandas.errors tokenizing data , Expected 9 fields saw 10

I am new to Python. I am trying to read a JSON response from requests, filter it using pandas, and save it to a CSV file. The script works and gives me all the data, but it throws this error after execution.
I can't figure out why it throws this error. How can I get past it?
Error -
script.py line 42, in <module>
df = pd.read_csv("Data_script4.csv")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 458, in _read
data = parser.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1196, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 2155, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 53, saw 10
This is my script -
if __name__ == '__main__':
    parser = argparse.ArgumentParser("gets data")
    parser.add_argument("-o", dest="org", help="org name")
    parser.add_argument("-p", dest="pat", help="pat value")
    args = parser.parse_args()
    org = args.org
    token = args.pat
    url = "https://dev.azure.com/{org_name}/_apis/git/repositories?api-version=6.0".format(org_name=org)
    data = getproject(url, token)
    data_file = open("Data_script4.csv", "w", newline='')
    val = data['value']
    csv_writer = csv.writer(data_file)
    for i in val:
        if count == 0:
            header = i.keys()
            csv_writer.writerow(header)
            count += 1
        csv_writer.writerow(i.values())
    pro_name = []
    time = []
    df = pd.read_csv("Data_script4.csv")
    for i in df["project"]:
        res = ast.literal_eval(i)
        pro_name.append(res['name'])
        time.append(res['lastUpdateTime'])
    del df["project"]
    df["project name"] = pro_name
    df["lastUpdateTime"] = time
    df = df[["id", "name", "url", "project name", "lastUpdateTime", "defaultBranch", "size", "remoteUrl", "sshUrl", "webUrl"]]
    df.head()
    df.to_csv("Data_Filtered.csv", index=False)
    print("\nFile Created Successfully...")
    data_file.close()
    os.remove('Data_script4.csv')
How can I resolve this issue ?
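Your question has been answered before.
Here's the takeaway:
You need to substitute:
df = pd.read_csv("Data_script4.csv")
with this:
df = pd.read_csv('Data_script4.csv', error_bad_lines=False)
Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on those versions, pass on_bad_lines='skip' instead. Either way, rows with an unexpected number of fields are dropped, not repaired.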
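Skipping bad lines hides the rows that do not fit. An alternative worth considering is to build the DataFrame directly from the JSON records with pd.json_normalize, which avoids the intermediate CSV and the ast.literal_eval step. The sketch below uses made-up records in place of data['value'], since the real API response is not shown in the question. It also assumes that the extra field comes from records that do not all share the same keys, which the question does not confirm.

```python
import pandas as pd

# Made-up stand-in for data["value"]; the second record lacks "defaultBranch"
# to imitate records whose key sets differ (an assumption about the real data).
value = [
    {
        "id": "1",
        "name": "repo-a",
        "project": {"name": "proj", "lastUpdateTime": "2021-01-01T00:00:00Z"},
        "defaultBranch": "refs/heads/main",
    },
    {
        "id": "2",
        "name": "repo-b",
        "project": {"name": "proj", "lastUpdateTime": "2021-01-01T00:00:00Z"},
    },
]

# Nested dicts become dotted column names such as "project.name".
df = pd.json_normalize(value)

# Rename the two nested fields the original script extracts by hand.
df = df.rename(
    columns={
        "project.name": "project name",
        "project.lastUpdateTime": "lastUpdateTime",
    }
)

print(df)
```

Records that lack a key get a missing value in that column, so the rows stay aligned with the header.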

Trying to split csv file and getting Error tokenizing data

I am trying to split a CSV file into multiple CSVs while keeping the header in each one.
The code I am trying is:
import pandas as pd
chunk_size = 500000
batch_no = 1
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1
And the error I get is this one:
Traceback (most recent call last):
File "splitcsv.py", line 5, in <module>
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, encoding='utf-8'):
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1128, in __next__
return self.get_chunk()
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1188, in get_chunk
return self.read(nrows=size)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 274, saw 2
You can try to skip the lines that produce errors by adding the error_bad_lines=False argument to pd.read_csv. In pandas 1.3 and later, use on_bad_lines='skip' instead; error_bad_lines was removed in pandas 2.0. Your code would then look like this:
import pandas as pd
chunk_size = 500000
batch_no = 1
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, error_bad_lines=False):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1
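If the goal is only to split the file and keep the header, the rows do not need to be parsed at all, so no row is dropped for having an unexpected number of fields. Below is a sketch using only the standard library. The function name split_csv_by_lines and the demo file are hypothetical, and it assumes that no quoted field contains a line break, because it splits on physical lines.

```python
import itertools
import tempfile
from pathlib import Path


def split_csv_by_lines(src, out_dir, lines_per_file, prefix="part_"):
    """Split src into numbered files, repeating the header line in each one."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, "r", encoding="utf-8", newline="") as f:
        header = f.readline()
        for batch_no in itertools.count(1):
            # Take the next block of lines without parsing them.
            lines = list(itertools.islice(f, lines_per_file))
            if not lines:
                break
            target = out_dir / f"{prefix}{batch_no}.csv"
            with open(target, "w", encoding="utf-8", newline="") as out:
                out.write(header)
                out.writelines(lines)
            written.append(target)
    return written


# Demo on a small made-up file: five data rows, two rows per output file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "thefinal.csv"
    src.write_text("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n", encoding="utf-8")
    paths = split_csv_by_lines(src, Path(tmp) / "out", lines_per_file=2)
    parts = [p.read_text(encoding="utf-8") for p in paths]

print(len(parts))
print(parts[-1])
```

Malformed rows are carried over unchanged, so they still need attention when the parts are read later.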

I am having trouble running my csv file through my code and I don't understand the error message

I am trying to run a csv file through my code and work with the data. I am receiving an error message that I don't exactly understand.
Here is the csv file
There is a lot more code but I will only include code that is relevant to the problem. Comment below if you need more info.
import pandas as pd
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
df_playoffs.fillna('None', inplace=True)
Here is the error message:
Traceback (most recent call last):
File "/Users/hannahbeegle/Desktop/Baseball.py", line 130, in <module>
df_playoffs = pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', encoding='latin-1', index_col = 'team')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
It looks as though your CSV may be tab-delimited. In the line that reads the CSV, try something like:
pd.read_csv('/Users/hannahbeegle/Desktop/playoff_teams.csv.numbers', sep="\t", encoding='latin-1', index_col = 'team')
[edited section after comments]
If the data is "ragged", you could try breaking it up into a dictionary and then using that to build the DataFrame. Here's an example I tried with a mocked-up sample dataset:
record_dict = {}
with open("variable_columns.csv", mode="r") as file:
    for line in file:
        split_line = line.split()
        if split_line:  # skip blank lines
            record_dict[split_line[0]] = split_line[1:]
df_playoffs = pd.DataFrame.from_dict(record_dict, orient='index')
df_playoffs.sample(5)
You might need to look at the line.split() call and pass "\t" as the split argument (i.e. line.split("\t")), but you can experiment with this.
Also, notice that pandas has forced the data to be rectangular, so some of the columns will contain None for the "short" rows.
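The dictionary approach can be tried without a file on disk. The sketch below runs it on a made-up tab-separated sample (the team names and years are placeholders, not the asker's data) and uses csv.reader for the splitting, so the delimiter is explicit.

```python
import csv
import io

import pandas as pd

# Made-up ragged sample: each row is a key followed by a varying number of values.
sample = "teamA\t1996\t1998\t1999\nteamB\t2004\nteamC\t2016\t2017\n"

record_dict = {}
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    if row:  # skip blank lines
        record_dict[row[0]] = row[1:]

# orient="index" makes each key a row label; short rows are padded.
df = pd.DataFrame.from_dict(record_dict, orient="index")

print(df)
```

The row for teamB has one value, so its remaining columns are filled with missing values to keep the frame rectangular.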
