I have a text file that is tab delimited for the first 80 rows, and these are the only rows I need in the file. I would normally open the file like this:
df=pd.read_csv(r'file.txt', sep='\t')
but this returns the error:
CParserError: Error tokenizing data. C error: Expected 7 fields in line 84, saw 81
because somewhere along the way it is no longer tab-delimited, I'm pretty sure. If I manually delete everything in the file except the first 80 rows, I can set the tab delimiter and it reads fine, but I need to do this for lots of files. I know I can select only the first 80 rows using this:
df=df.iloc[:80,:]
but then my dataframe has \t separating every column instead of properly separated columns like I want. Is there a way to select only the first 80 rows while opening the file, so that I can set sep='\t' without the error?
You can specify that only the first 80 rows be read using the nrows parameter:
df=pd.read_csv(r'file.txt', sep='\t', nrows=80)
You can set the error_bad_lines parameter to False, which will drop blank or malformed lines. nrows is not suitable in my view, because you would have to manually add the row count for each file.
df=pd.read_csv(r'file.txt', sep='\t', error_bad_lines=False)
You can also go through these attributes
warn_bad_lines
skip_blank_lines
Follow this link to read more
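As a side note, error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; the replacement is on_bad_lines='skip'. A minimal sketch on an in-memory file with one malformed row (the data here is made up for illustration):

```python
import io
import pandas as pd

# Three well-formed tab-delimited lines plus one with too many fields.
data = "a\tb\tc\n1\t2\t3\n4\t5\t6\n7\t8\t9\t10\t11\n"

# On pandas >= 1.3, on_bad_lines='skip' replaces error_bad_lines=False
# and silently drops rows with too many fields.
df = pd.read_csv(io.StringIO(data), sep='\t', on_bad_lines='skip')
```

After this, df has only the two well-formed data rows; the line with five fields is gone.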
I have a dataframe with 3 columns, and one of the columns contains data separated by semicolons (;). I am trying to export the dataframe to a csv, but my output data keeps getting split into the following format when opened in Excel:
import pandas as pd
my_dict = {'name': ["a", "b"],
           'age': [20, 27],
           'tag': ["Login Location;Visit Location;Appointment Location",
                   "Login Location;Visit Location;Appointment Location"]}
df=pd.DataFrame(my_dict)
df.to_csv('output.csv',index=False)
print('done')
I would like to have the output in excel to be:
where the data in the tag column is intact. I've tried adding sep=',' or delimiter=',' but it still gives me the same output.
Thank you in advance,
John
Thank you @Alex and @joao for your input; this pointed me in the right direction. I was able to get the output I needed by forcing Excel to use ',' as the separator. By default, Excel was using Tab as the delimiter, which is why it was showing me an incorrect format. Here's the link on forcing Excel to use comma as the list separator: https://superuser.com/questions/606272/how-to-get-excel-to-interpret-the-comma-as-a-default-delimiter-in-csv-files
Excel does some stuff based on the fact that your file has a .csv suffix, probably using ; as the default delimiter, as suggested in the comments.
One workaround is to use the .txt suffix instead:
df.to_csv('output.txt',index=False)
then open the file in Excel, and in the Text Import Wizard specify "Delimited" and comma as separator.
Do not pick the file from the list of previously opened files, if it's there; that won't work. You really need to do File/Open and then browse the directory to find your .txt file.
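Another option, assuming the copy of Excel in question is set to ';' as its list separator (common in some locales): write the file with sep=';' as well. pandas then quotes the tag column automatically, because it now contains the delimiter, so Excel keeps it in one cell. A sketch using the data from the question:

```python
import io
import pandas as pd

my_dict = {'name': ["a", "b"],
           'age': [20, 27],
           'tag': ["Login Location;Visit Location;Appointment Location",
                   "Login Location;Visit Location;Appointment Location"]}
df = pd.DataFrame(my_dict)

# With sep=';' the tag values contain the delimiter, so pandas wraps
# them in quotes; a ';'-separated Excel then keeps each value in one cell.
out = io.StringIO()
df.to_csv(out, sep=';', index=False)
```

Whether this helps depends entirely on the locale settings of the Excel installation that will open the file.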
I have a .csv file with shape (45211 rows, 1 column),
but I need to create a new .csv file with shape (45211 rows, 17 columns).
These are the column names
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
I've added a screenshot of the .csv file that I already have.
In pandas, the read_csv method has an option for setting the separator, which is ',' by default. To override it, you can do:
pandas.read_csv(<PATH_TO_CSV_FILE>, sep=';', header=0)
This will return a new dataframe with the correct format. The header=0 might not be needed, but it forces the returned dataframe to read the first line of the CSV file as column headers.
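A quick self-contained sketch of that call, using the header line from the question plus one made-up data row, just to confirm that sep=';' produces the 17 columns:

```python
import io
import pandas as pd

# Header from the question plus one invented row of sample values.
data = ('age;"job";"marital";"education";"default";"balance";"housing";'
        '"loan";"contact";"day";"month";"duration";"campaign";"pdays";'
        '"previous";"poutcome";"y"\n'
        '58;"management";"married";"tertiary";"no";2143;"yes";"no";'
        '"unknown";5;"may";261;1;-1;0;"unknown";"no"\n')

# header=0 reads the first line as column names; the quotes around the
# names are stripped automatically.
df = pd.read_csv(io.StringIO(data), sep=';', header=0)
```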
Open the CSV in Excel
Select all the data
Choose the Data tab atop the ribbon.
Select Text to Columns.
Ensure Delimited is selected and click Next.
Clear each box in the Delimiters section and instead choose Semicolon.
Click Finish.
I'm trying to save a file with text information into csv format.
However, after I use to_csv in pandas (without specifying anything) to save the file, and then use pd.read_csv to reopen it, I get this error message:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
But if I read the csv file in pandas like pd.read_csv('file.csv', lineterminator='\n'), it opens the file properly.
However, I need to use another piece of software to process these text files, and that software opens the file the same way Excel opens a csv file; I cannot specify lineterminator='\n' there like I did in Python. If I open the csv file using that software, some of the text in the column goes to other rows.
The text between index 378 and 379 is supposed to be together in row 378. However, it goes to other rows and into the index column.
id text
378 1 Good morning. This row's text goes to the following rows
Dot dot NaN NaN
HELLO NaN NaN
Apple NaN NaN
379 2 This row is correct
Does anyone know how to solve this problem when I use pandas.to_csv to save the dataframe? What should I specify if I want to open the file properly in software like Excel?
Try this (note that it skips the malformed lines rather than repairing them):
df = pd.read_csv('file.csv', error_bad_lines=False)
Thanks for the replies. I have found the problem: it's the '\r' characters inside the text. I removed all '\r' from the text and now it works. Thanks!
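For anyone who lands here with the same symptom, a minimal sketch of that fix, stripping the stray '\r' characters from the text column before saving (the column names here mirror the question):

```python
import io
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'text': ["Good morning.\rDot dot\rHELLO",
                            "This row is correct"]})

# Carriage returns inside a cell make Excel-like parsers start a new
# record mid-field; replace them before writing the file.
df['text'] = df['text'].str.replace('\r', ' ', regex=False)

out = io.StringIO()
df.to_csv(out, index=False)
```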
I am trying to read a csv file, which I downloaded from the City of Chicago, into a dataframe. However, for many rows all the data gets read into the first column, as in row 2. If I delete those rows, only 5% of the data is left. Does anyone have an idea what to do?
Also, when opening the csv file as txt, the rows that do not read correctly have a leading ". I don't know if that could be causing the issue.
crime_df = pd.read_csv('ChicagoCrime.csv')
crime_df.head(10)
Edit: I downloaded the file as a CSV for Excel and didn't get an issue. I would recommend this unless you really don't want to download it again.
My first suggestion is to specify more arguments in the read_csv function to help it out. You might try specifying a delimiter, specifying the engine to use, setting parse_dates to True, etc. Check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for all available arguments.
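Regarding the leading ": that usually means the field is quoted because it contains embedded newlines, which is legal CSV and which read_csv already handles by default, as long as the quotes are balanced. A small sketch (with made-up crime-data-like columns) showing a quoted field with an embedded newline staying in one row:

```python
import io
import pandas as pd

# The second field of row 1 is quoted and spans two physical lines;
# a conforming CSV parser keeps it in a single logical row.
data = 'ID,Description,Year\n1,"Theft\nreported later",2019\n2,Battery,2019\n'

df = pd.read_csv(io.StringIO(data))
```

If rows still collapse into one column, the quoting in the downloaded file is probably unbalanced, in which case re-downloading (as in the edit above) is the simplest fix.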
I have a series of large CSV files contain sensor data samples.
The format of each file is
Row 0: column names
Row 1: column units of measurement
Row 2 to EOF: timestamp of when the sample was taken, followed by a voltage measurement from each sensor
I cannot modify the original files.
So I would like to use numpy.genfromtxt(filename, names=True, delimiter=',', dtype=None)
So far, to avoid corrupting the output, I have skipped the header lines and manually added the column names later.
This is not ideal, as each file potentially has a different order of sensors, and the information is there for the taking.
Any help/direction would be greatly appreciated.
I can see several options:
Open the file and read just the header line, parsing the names yourself; then run genfromtxt with the custom names and skip_header
Trick genfromtxt into treating the 2nd line as a comment line
Open the file yourself and pass the lines through a filter to genfromtxt. The filter function would remove the second line (genfromtxt works with a list of lines, or anything that feeds it lines).
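A sketch of the third option, the filter approach, with made-up sensor names and units; genfromtxt accepts any iterable that yields lines, so a small generator can drop the units row:

```python
import io
import numpy as np

# Sample file layout from the question: names, units, then data rows.
raw = ("time,volt1,volt2\n"
       "s,V,V\n"
       "0.0,1.1,2.2\n"
       "0.1,1.3,2.4\n")

def drop_units(lines):
    """Yield every line except the second one (the units row)."""
    for i, line in enumerate(lines):
        if i != 1:
            yield line

# names=True still picks up the column names from the first line;
# the units row never reaches the parser.
data = np.genfromtxt(drop_units(io.StringIO(raw)),
                     names=True, delimiter=',', dtype=None, encoding=None)
```

In real use, replace io.StringIO(raw) with open(filename); the generator keeps whatever column order each file happens to have.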