I am trying to read a CSV file, downloaded from the City of Chicago data portal, into a DataFrame. However, for many rows all of the data gets read into the first column, like row 2 below. If I delete those rows, only 5% of the data is left. Does anyone have an idea what to do?
Also, when opening the CSV file as text, the rows that do not read correctly have a leading ". I don't know if that could cause the issue.
import pandas as pd

crime_df = pd.read_csv('ChicagoCrime.csv')
crime_df.head(10)
Edit: I downloaded the file again as a "CSV for Excel" and didn't run into the issue. I would recommend this unless you really don't want to download it again.
My first suggestion is to pass more arguments to the read_csv function to help it parse the file. You might specify a delimiter, choose the parser engine, set parse_dates to True, etc. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for all available arguments.
I have a dataframe with 3 columns, and one of the columns contains data separated by semicolons (;). I am trying to export the dataframe to a CSV, but the output keeps getting split into the following format when I open it in Excel:
import pandas as pd

my_dict = {'name': ["a", "b"],
           'age': [20, 27],
           'tag': ["Login Location;Visit Location;Appointment Location",
                   "Login Location;Visit Location;Appointment Location"]}
df = pd.DataFrame(my_dict)
df.to_csv('output.csv', index=False)
print('done')
I would like to have the output in excel to be:
where the data in the tag column is intact. I've tried adding sep=',' and delimiter=',', but I still get the same output.
Thank you in advance,
John
Thank you @Alex and @joao for your input, it pointed me in the right direction. I was able to get the output I needed by forcing Excel to use , as the separator. By default, Excel was using tab as the delimiter, which is why it was showing me an incorrect format. Here's the link on forcing Excel to use the comma as the list separator: https://superuser.com/questions/606272/how-to-get-excel-to-interpret-the-comma-as-a-default-delimiter-in-csv-files
Excel does some things based on the fact that your file has a .csv suffix, probably using ; as the default delimiter, as suggested in the comments.
One workaround is to use the .txt suffix instead:
df.to_csv('output.txt',index=False)
then open the file in Excel, and in the Text Import Wizard specify "Delimited" and comma as separator.
Do not pick the file from the list of previously opened files, if it's there; that won't work. You really need to do File/Open and then browse the directory to find your .txt file.
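For completeness, a round-trip sketch of this workaround (the file name is just an example); reading the .txt back with pandas confirms the commas still delimit exactly three columns and the semicolons stay inside the tag field:

```python
import pandas as pd

my_dict = {'name': ['a', 'b'],
           'age': [20, 27],
           'tag': ['Login Location;Visit Location;Appointment Location'] * 2}
# same data as in the question, written with the .txt suffix instead of .csv
pd.DataFrame(my_dict).to_csv('output.txt', index=False)

# read it back: the semicolons are still inside a single column
check = pd.read_csv('output.txt')
print(check['tag'].iloc[0])  # Login Location;Visit Location;Appointment Location
```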
I have successfully exported a 9-column DataFrame to a CSV file using the pandas .to_csv method, and I can likewise read the created file back with no errors whatsoever, using the following code:
import pandas as pd

dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
                     sep=';', decimal=',', index_col=0, parse_dates=True,
                     encoding='utf-8', engine='python')
However, upon modifying the same CSV file manually using Notepad (which also extends to simply opening the file and saving it without making any actual alterations), pandas won't read it anymore, giving the following error message:
ParserError: Expected 2 fields in line 2, saw 9
In the case of the modified CSV, if the index_col=0 parameter is removed from the code, pandas is able to read the DataFrame again, however the first 8 columns become the index as a tuple and only the last column is brought as a field.
Could anyone point out why I am unable to read the DataFrame after modifying the file? Also, why does removing index_col enable reading it again, with nearly all the columns as the index?
Have you tried opening and saving the file with some other text editor? Notepad really isn't that great; it may be adding special characters when saving the file, or the file may already contain characters that Notepad doesn't show you, so pandas can't parse it correctly.
Try Notepad++ or a more advanced editor such as Atom, VS Code or PyCharm.
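One concrete thing plain editors can do on save is prepend a byte-order mark (BOM); whether that is what happened here is only a guess, but it is cheap to check for. A minimal sketch:

```python
import csv

# simulate an editor that saves with a UTF-8 BOM ('utf-8-sig' prepends one)
with open('bom_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('a;b;c\n1;2;3\n')

# read as plain utf-8: the invisible BOM sticks to the first header field
with open('bom_demo.csv', encoding='utf-8') as f:
    header_raw = next(csv.reader(f, delimiter=';'))
print(header_raw[0] == 'a')  # False: it is actually '\ufeffa'

# read as utf-8-sig: the BOM is stripped and the header is clean
with open('bom_demo.csv', encoding='utf-8-sig') as f:
    header_clean = next(csv.reader(f, delimiter=';'))
print(header_clean)  # ['a', 'b', 'c']
```

The pandas equivalent would be passing encoding='utf-8-sig' to read_csv.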
I am trying to code a function where I grab data from my database, which already works correctly.
This is my code for the headers prior to adding the actual records:
import csv

with open('csv_template.csv', 'a') as template_file:
    # declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # appends the column names of the excel table prior to adding the actual physical data
    template_writer.writerow(['Arrangement_ID', 'Quantity', 'Cost'])
# the with statement closes the file automatically, so no explicit close() is needed
This is my code for the records; it is contained in a while loop, which is the main reason the two pieces are kept separate.
with open('csv_template.csv', 'a') as template_file:
    # declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # appends the currently fetched values of the sql statement within the while loop
    template_writer.writerow([transactionWordData[0], transactionWordData[1], transactionWordData[2]])
# again, the with statement closes the file automatically
Once I have the data ready for Excel, I open the file in Excel and would like it to be in a format I can print immediately. However, when I print, the column width of the Excel cells is too small and the text gets cut off.
I have tried altering the default column width within Excel, hoping it would keep that format permanently, but every time I re-open the CSV file in Excel it resets back to the default column width.
Here is my code for opening the CSV file in Excel using python; the commented-out line is the code I actually want to use once I can format the spreadsheet ready for printing.
import os

# finds the os path of the csv file depending where it is in the file directories
file_path = os.path.abspath("csv_template.csv")
# opens the csv file in excel ready to print
os.startfile(file_path)
#os.startfile(file_path, 'print')
If anyone has any solutions to this or ideas please let me know.
Unfortunately I don't think this is possible with the CSV file format, since it is just plain-text comma-separated values and doesn't store any formatting.
I have tried altering the default column width within excel but every time that I re-open the csv file in excel it seems to reset back to the default column width.
If you save the file to an Excel format (e.g. .xlsx) once you have edited it, that should solve the problem.
Alternatively, instead of using the csv library you could use xlsxwriter instead which does allow you to set the width of the columns in your code.
See https://xlsxwriter.readthedocs.io and https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-set-column.
Hope this helps!
A CSV file is nothing more than a text file whose lines follow a given pattern: a fixed number of fields (your data) delimited by commas. In contrast, an .xlsx file is a binary file that contains formatting specifications. Therefore you may want to write to an Excel file instead, for example using the pandas library.
Since the values are strings, you can pad them with spaces to make the columns wider; do it like this:
template_writer.writerow(['Arrangement_ID ','Quantity ','Cost '])
I am trying to add a header to my existing CSV file, which already has content in it. I am just wondering if there is any piece of code that could insert a header row (such as ['name','age','salary','country']) at the top without affecting the contents.
Also, this code is connected to an API, so I will run it multiple times. Is it possible to detect whether a header already exists, to avoid ending up with multiple header lines?
Thank you, and I hope you all have a good day!
Your question has 2 parts:
1) To add a header to your csv (when it does not exist)
In order to insert the header row, read the csv with the command below:
df = pd.read_csv('filename.csv', header=None, names=['name','age','salary','country'])
To create the csv with the header row without affecting the contents, use:
df.to_csv('new_file_with_header.csv', header=True, index=False)
2) The second part is a little tricky. To infer whether your file has a header, you will have to write a little code; here is the algorithm.
Read the csv explicitly without a header:
df = pd.read_csv('filename.csv', header=None, names=['name','age','salary','country'])
Check the first row, first column of your csv: if it contains the value 'name', write the csv from the 2nd row onward, else write it as is:
temp_var = df['name'].iloc[0]
if temp_var == 'name':
    df.iloc[1:].to_csv('new_file.csv', index=False)
else:
    df.to_csv('new_file.csv', index=False)
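For the header-detection part, the standard library also ships a ready-made heuristic, csv.Sniffer.has_header; a sketch with made-up rows:

```python
import csv

with_header = 'name,age,salary,country\njohn,30,50000,US\nmary,25,40000,UK\n'
without_header = 'john,30,50000,US\nmary,25,40000,UK\n'

# has_header guesses by comparing the first row's types and lengths
# against the data rows underneath it
sniffer = csv.Sniffer()
print(sniffer.has_header(with_header))     # True
print(sniffer.has_header(without_header))  # False
```

It is only a heuristic, so for critical data an explicit check of the first cell is still the safer option.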
Hope this helps!!
Thanks,
Rohan Hodarkar
So I've got about 5008 rows in a CSV file, 5009 in total with the header. I'm creating and writing this file all within the same script. But when I read it at the end, with either pandas' pd.read_csv or python3's csv module, and print the length, it outputs 4967. I checked the file for any weird characters that may be confusing python, but don't see any. All the data is delimited by commas.
I also opened it in sublime and it shows 5009 rows not 4967.
I could try other pandas methods like merge or concat, but if python won't read the CSV correctly, that's no use.
This is one method i tried.
df1 = pd.read_csv('out.csv', quoting=csv.QUOTE_NONE, error_bad_lines=False)
df2 = pd.read_excel(xlsfile)
print(len(df1))  # 4967
print(len(df2))  # 5008
df2['Location'] = df1['Location']
df2['Sublocation'] = df1['Sublocation']
df2['Zone'] = df1['Zone']
df2['Subnet Type'] = df1['Subnet Type']
df2['Description'] = df1['Description']
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df2.to_csv(newfile, index=False)
print('Done.')
target.close()
Another way I tried is
dfcsv = pd.read_csv('out.csv')
wb = xlrd.open_workbook(xlsfile)
ws = wb.sheet_by_index(0)
xlsdata = []
for rx in range(ws.nrows):
    xlsdata.append(ws.row_values(rx))
print(len(dfcsv))    # 4967
print(len(xlsdata))  # 5009
df1 = pd.DataFrame(data=dfcsv)
df2 = pd.DataFrame(data=xlsdata)
df3 = pd.concat([df2, df1], axis=1)
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df3.to_csv(newfile, index=False)
print('Done.')
target.close()
But no matter what way I try it, the CSV file is the actual issue: python is writing it correctly but not reading it correctly.
Edit: The weirdest part is that I'm getting absolutely no encoding errors or any other errors when running the code...
Edit2: Tried testing it with the nrows param in the first code example; it works up to 4000 rows, but as soon as I specify 5000 rows, it reads only 4967.
Edit3: Manually saved a csv file with my data instead of using the one written by the program, and it read 5008 rows. Why is python not writing the csv file correctly?
I ran into this issue also. I realized that some of my lines had open-ended quotes, which was for some reason interfering with the reader.
So for example, some rows were written as:
GO:0000026 molecular_function "alpha-1
GO:0000027 biological_process ribosomal large subunit assembly
GO:0000033 molecular_function "alpha-1
and this led to rows being read incorrectly. (Unfortunately I don't know enough about how csv.reader works to tell you why; hopefully someone can clarify the quote behavior!)
I just removed the quotes and it worked out.
Edited: this option works too, if you want to keep the quote characters in the data: disable quote handling when reading with
quoting=csv.QUOTE_NONE
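A sketch of both behaviors with made-up rows modelled on the example above (tab-delimited, two lines carrying an unmatched quote): with default quoting the three lines collapse into one record, while csv.QUOTE_NONE keeps all three rows and leaves the " in the data:

```python
import csv
import io

raw = ('GO:0000026\tmolecular_function\t"alpha-1\n'
       'GO:0000027\tbiological_process\tribosomal large subunit assembly\n'
       'GO:0000033\tmolecular_function\t"alpha-1\n')

# default quoting: everything between the two stray quotes becomes one field
default_rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))
print(len(default_rows))  # 1

# QUOTE_NONE: quotes are ordinary characters, so all three rows survive
no_quote_rows = list(csv.reader(io.StringIO(raw), delimiter='\t',
                                quoting=csv.QUOTE_NONE))
print(len(no_quote_rows))  # 3
```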
My best guess without seeing the file is that you have some lines with too many or not enough commas, maybe due to values like foo,bar.
Please try setting error_bad_lines=True (the default; see the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to see if it catches the lines with errors in them; my guess is that there will be 41 such lines.
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will dropped from the DataFrame that is returned. (Only valid with C parser)
The csv.QUOTE_NONE option tells the writer not to quote fields, escaping the delimiter with escapechar + delimiter instead, but you didn't paste your writing code; on read, it tells the parser to do no special processing of quote characters. See https://docs.python.org/3/library/csv.html#csv.Dialect