I have a csv file that is encoded with commas as separators, but every row has a quote character at the start and at the end.
In practice the data look like this
"0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
If I read the data using pd.read_csv() it just reads everything into a single column. What is the best workaround? Is there a simple way to pre-emptively strip the quote characters from the whole csv file?
If your file looks like
my_file=""""0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
"""
One way to remove the quotes prior to using Pandas would be
for line in my_file.split('\n'):
    print(line.replace('"', ''))
To write that to file, use
with open('output.csv', 'w') as file_handle:
    for line in my_file.split('\n'):
        file_handle.write(line.replace('"', '') + '\n')
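If you'd rather not write an intermediate file, you can do the stripping in memory and hand the cleaned text straight to read_csv via io.StringIO (a sketch; the column names x and y are made up):

```python
import io
import pandas as pd

my_file = '''"0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"'''

# strip the quote characters from every line, then parse the cleaned text
cleaned = '\n'.join(line.replace('"', '') for line in my_file.split('\n'))
df = pd.read_csv(io.StringIO(cleaned), header=None, names=['x', 'y'])
print(df.shape)  # (3, 2)
```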
Related
I have a comma delimited file which also contains commas in the actual field values, something like this:
foo,bar,"foo, bar"
This file is very large so I am wondering if there is a way in python to either put double quotes around every field:
eg: "foo","bar","foo, bar"
or just change the delimiter overall?
eg: foo|bar|foo, bar
End goal:
The goal is to ultimately load this file into SQL Server. Given the size of the file, bulk insert is the only feasible approach for loading, but I cannot specify a text qualifier/field quote due to the version of SSMS I have.
This leads me to believe the only remaining approach is to do some preprocessing on the source file.
Changing the delimiter just requires parsing and re-encoding the data.
import csv

with open("data.csv", newline="") as input, open("new_data.csv", "w", newline="") as output:
    r = csv.reader(input, delimiter=",", quotechar='"')
    w = csv.writer(output, delimiter="|")
    w.writerows(r)
Given that your input file is a fairly standard version of CSV, you don't even need to specify the delimiter and quote arguments to reader; the defaults will suffice.
r = csv.reader(input)
These are not inconsistent quotes. If a value in a CSV file contains a comma or a newline, quotes are added around it. It shouldn't be a problem, because all standard CSV readers can read it properly.
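This is exactly what Python's own csv module does on output: with the default quoting=csv.QUOTE_MINIMAL, only fields that contain the delimiter (or a quote or newline) get wrapped in quotes.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default quoting=csv.QUOTE_MINIMAL
writer.writerow(['foo', 'bar', 'foo, bar'])

# only the field containing a comma is quoted
print(buf.getvalue())  # foo,bar,"foo, bar"
```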
I have a csv file like
id,body,category,subcategory,number,smstype,smsflag
50043,"Dear Customer,Thank you for registering",,,DM-YEBA,inbox,0
30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0
When I'm using pd.read_csv(), the whole first observation is included in the id column and is not separated among the other columns, due to the double quotes used for the message body, while the second observation is properly separated among the columns.
What should I do so that the first observation is separated among all columns, like in this image?
See what pd.read_csv is actually doing: it's taking the whole observation into the id column.
When I open the csv file in Notepad, I see extra quotation marks around the whole row, which is eventually causing the fiasco, and the quotation marks already in the file are escaped with another '"', as shown below.
id,body,category,subcategory,number,smstype,smsflag
"50043,""Dear Customer,Thank you for registering"",,,DM-YEBA,inbox,0"
30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0
The main problem lies in the way Microsoft Excel actually saves the csv file. When the same csv file is opened in Notepad, extra quotation marks appear in the lines which contain quotes.
1) It adds quote at the starting and ending of the line.
2) It escapes the existing quotes with one more quote.
Hence, when we import our csv file in pandas it takes the whole line as one string and thus it ends up all in the first column.
To tackle this-
I imported the csv file, corrected it by applying regex substitutions, and saved it as a text file. Then I imported this text file as a pandas dataframe. Problem solved.
import re

with open('csvdata.csv', 'r') as csv_file, open('new_csv.txt', 'w') as corrected_csv:
    for line in csv_file:
        # remove the quote at the start and end of the line
        line = re.sub(r'^"|"$', '', line)
        # substitute each escaped quote ("") with a single quote
        line = line.replace('""', '"')
        corrected_csv.write(line)
From your example, your issue seems to be that the opening quote of Dear customer ... is not the same as the closing quote (different characters). The issue seems to be in your data, not in pandas.read_csv
If you have always the same quote character, you probably are looking for the quotechar='"' argument of read_csv. More information can be found here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
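For illustration, a minimal sketch of quotechar in action (the data is a shortened, hypothetical version of the file above):

```python
import io
import pandas as pd

csv_text = ('id,body,number\n'
            '50043,"Dear Customer,Thank you for registering",DM-YEBA\n')

# quotechar tells the parser which character wraps fields that contain the delimiter
df = pd.read_csv(io.StringIO(csv_text), quotechar='"')
print(df['body'][0])  # Dear Customer,Thank you for registering
```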
You can use regex to remove double quotes:
import re

df['body'] = df['body'].apply(lambda s: re.sub(r'"', '', s))
It seems that (by default) a double-quote character is only recognized as starting a quoted entry if it follows immediately after the delimiter character (i.e. no spaces between the comma and the double-quote). You can solve this issue by using the input argument skipinitialspace=True (i.e. ignore spaces after the delimiter). The following code:
import io
import pandas as pd
# Create virtual CSV file
csv_file = io.StringIO(
    'id, body, category, subcategory, number, smstype, smsflag\n'
    '50043, "Dear Customer,Thank you for registering",, , DM - YEBA, inbox, 0\n'
    '30082, Congrats! Your account has been activated.,, , DM - SBAW, inbox, 0\n'
)
# Read out CSV file
df = pd.read_csv(csv_file, skipinitialspace=True)
gives the following result:
In [1]: df
Out[1]:
id body ... smstype smsflag
0 50043 Dear Customer,Thank you for registering ... inbox 0
1 30082 Congrats! Your account has been activated. ... inbox 0
[2 rows x 7 columns]
My CSV file has 3 columns: Name,Age and Sex and sample data is:
AlexÇ39ÇM
#Ç#SheebaÇ35ÇF
#Ç#RiyaÇ10ÇF
The column delimiter is 'Ç' and the record delimiter is '#Ç#'. Note that the first record doesn't have the record delimiter (#Ç#), but all other records do. Could you please tell me how to read this file and store it in a dataframe?
Both csv and pandas module support reading csv-files directly. However, since you need to modify the file contents line by line before further processing, I suggest reading the file line by line, modify each line as desired and store the processed data in a list for further handling.
The necessary steps include:
open file
read file line by line
remove the newline char (which is part of the line when using readlines())
replace record delimiter (since a record is equivalent to a line)
split lines at column delimiter
Since .split() returns a list of string elements we get an overall list of lists, where each 'sub-list' contains/represents the data of a line/record. Data formatted like this can be read by pandas.DataFrame.from_records() which comes in quite handy at this point:
import pandas as pd

with open('myData.csv') as file:
    # `.strip()` removes the newline character from each line
    # `.replace('#;#', '')` removes '#;#' from each line
    # `.split(';')` splits at the given string and returns a list of string elements
    lines = [line.strip().replace('#;#', '').split(';') for line in file.readlines()]

df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])
print(df)
Remarks:
I changed Ç to ; which worked better for me due to encoding issues. However, the basic idea of the algorithm is still the same.
Reading data manually like this can become quite resource-intensive which might be a problem when handling larger files. There might be more elegant ways, which I am not aware of. When getting problems with performance, try to read the file in chunks or have a look for more effective implementations.
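Alternatively, if the encoding cooperates, the whole thing can stay inside pandas: strip the record delimiter from the raw text and let read_csv split on 'Ç' (a sketch with the sample data inlined; engine='python' avoids C-engine issues with the non-ASCII separator):

```python
import io
import pandas as pd

raw = 'AlexÇ39ÇM\n#Ç#SheebaÇ35ÇF\n#Ç#RiyaÇ10ÇF\n'

# drop the record delimiter, then let read_csv split on the column delimiter
cleaned = raw.replace('#Ç#', '')
df = pd.read_csv(io.StringIO(cleaned), sep='Ç', engine='python',
                 header=None, names=['Name', 'Age', 'Sex'])
print(df)
```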
I have a dataframe and I want to save it to a csv file. This operation is quite simple because I just need to use the following command:
df.to_csv(namefile, sep=',', index=False)
The output is a csv file where each line contains the content of a row of the dataframe. The output is this:
A,B,C,D
1,2,3,4
5,6,7,8
9,1,2,3
However, what I would like to do is to have a blank line every other row so that the output looks like this:
A,B,C,D

1,2,3,4

5,6,7,8

9,1,2,3
Basically I need to add the CR and LF symbol between every other line.
Can you suggest me a smart and elegant way to achieve my goal?
Use the parameter line_terminator='\n\n' (spelled lineterminator in pandas 1.5+):
line_terminator : string, default '\n'
The newline character or character sequence to use in the output file
Demo:
df.to_csv(namefile, line_terminator='\n\n')
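A quick sketch of the effect, using the newer lineterminator spelling: every row terminator becomes a double newline, which produces a blank line between rows.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 9], 'B': [2, 6, 1]})

buf = io.StringIO()
# each row (header included) is terminated with '\n\n', leaving a blank line after it
df.to_csv(buf, index=False, lineterminator='\n\n')
print(buf.getvalue())
```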
I have a file that looks like this:
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...
Where \n represents a newline.
When I read this line-by-line, it's read as:
1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...
This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?
I think after you read the line, you need to count the number of commas
aStr.count(',')
While the number of commas is too small (there can be more than one \n in the input), then read the next line and concatenate the strings
while aStr.count(',') < Num:
    another = file.readline()
    aStr = aStr + another
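Putting that together, a sketch that keeps concatenating physical lines until the logical record has the expected number of commas (assuming field values never contain commas themselves, and using an in-memory file for illustration):

```python
import io

# simulated file; the \n inside the second and third records are real newlines here
raw = io.StringIO('1111,AAAA,aaaa\n2222,BB\nBB,bbbb\n3333,CCC\nC,cccc\n')

NUM_COMMAS = 2  # three fields per record, hence two commas
records = []
line = raw.readline()
while line:
    # too few commas: the record continues on the next physical line
    while line.count(',') < NUM_COMMAS:
        line = line.rstrip('\n') + raw.readline()
    records.append(line.rstrip('\n'))
    line = raw.readline()

print(records)  # ['1111,AAAA,aaaa', '2222,BBBB,bbbb', '3333,CCCC,cccc']
```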
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
According to your file, \n here is not actually a newline character; it is plain text.
For actually stripping newline characters you could use strip() or variations like rstrip() or lstrip().
If you work with large files you don't need to load full content in memory. You could iterate line by line until some counter or anything else.
I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around the fields.
That is, I supposed that your text file actually looks like this:
1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc
If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:
import csv

with open('foo.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)
Which produces this output:
['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']
Notice how the five physical lines (delimited by newline characters) of the CSV file become three rows (some with embedded newline characters) in the parsed data.
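If the end goal from the question is to drop the embedded newlines entirely, the parsed rows can be post-processed (a sketch using an in-memory file):

```python
import csv
import io

data = '1111,AAAA,aaaa\n2222,"BB\nBB",bbbb\n3333,"CCC\nC",cccc\n'

reader = csv.reader(io.StringIO(data))
# remove the newline characters embedded inside individual fields
rows = [[field.replace('\n', '') for field in row] for row in reader]

print(rows)
```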