Reading a csv file into a pandas dataframe with quotation marks in some entries - python

I have a csv file like
id,body,category,subcategory,number,smstype,smsflag
50043,"Dear Customer,Thank you for registering",,,DM-YEBA,inbox,0
30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0
When I'm using pd.read_csv(), the whole first observation is included in the id column and is not separated among the other columns, due to the double quotes used for the message body, while in the second observation the line is properly separated among the columns.
What should I do so that the first observation is separated among all columns, like the second one? As it stands, pd.read_csv is taking the whole observation into the id column.
When I open the csv file in Notepad, it shows extra quotation marks around the whole row, which is eventually causing the fiasco, and the quotation marks already in the file are escaped with another '"', as shown below:
id,body,category,subcategory,number,smstype,smsflag
"50043,""Dear Customer,Thank you for registering"",,,DM-YEBA,inbox,0"
30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0

The main problem lies in the way Microsoft Excel actually saves the csv file. When the same csv file is opened in Notepad, it shows extra quotation marks in the lines which have quotes.
1) It adds a quote at the start and the end of the line.
2) It escapes the existing quotes with one more quote.
Hence, when we import our csv file into pandas, it takes the whole line as one string, and it all ends up in the first column.
To tackle this, I imported the csv file, corrected it by applying regex substitutions, and saved the result as a text file. Then I imported this text file as a pandas dataframe. Problem solved.
import re

with open('csvdata.csv', 'r') as csv_file, open('new_csv.txt', 'w') as corrected_csv:
    for line in csv_file:
        # remove the quote at the start and the end of the line
        line = re.sub(r'^"|"$', '', line)
        # replace each escaped quote ("") with a single quote (")
        line = re.sub(r'""', '"', line)
        corrected_csv.write(line)
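Once the corrected text file is written, it parses cleanly; a minimal check, reusing the file name from the snippet above:
import pandas as pd

# the cleaned file now splits into all seven columns
df = pd.read_csv('new_csv.txt')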

From your example, your issue seems to be that the opening quote of Dear Customer ... is not the same character as the closing quote. The issue seems to be in your data, not in pandas.read_csv.
If you always have the same quote character, you are probably looking for the quotechar='"' argument of read_csv. More information can be found here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
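For example, a minimal sketch of that argument (quotechar='"' is also the pandas default, so it mainly matters when your data uses a different character):
import pandas as pd

# quotechar names the character that wraps quoted fields
df = pd.read_csv('csvdata.csv', quotechar='"')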

You can use regex to remove the double quotes:
import re

for i in range(len(df['body'])):
    # strip the double quotes from each message body
    df.loc[i, 'body'] = re.sub(r'"', '', df.loc[i, 'body'])
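The same effect without an explicit loop, using the vectorized string method:
# equivalent: strip the quotes from the whole column at once
df['body'] = df['body'].str.replace('"', '', regex=False)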

It seems that (by default) a double-quote character is only recognized as the start of a quoted entry if it follows immediately after the delimiter character (i.e. no spaces between the comma and the double-quote). You can solve this issue by using the input argument skipinitialspace=True (i.e. ignore spaces after the delimiter). The following code:
import io
import pandas as pd
# Create virtual CSV file
csv_file = io.StringIO(
'id, body, category, subcategory, number, smstype, smsflag\n'
'50043, "Dear Customer,Thank you for registering",, , DM - YEBA, inbox, 0\n'
'30082, Congrats! Your account has been activated.,, , DM - SBAW, inbox, 0\n'
)
# Read out CSV file
df = pd.read_csv(csv_file, skipinitialspace=True)
gives the following result:
In [1]: df
Out[1]:
id body ... smstype smsflag
0 50043 Dear Customer,Thank you for registering ... inbox 0
1 30082 Congrats! Your account has been activated. ... inbox 0
[2 rows x 7 columns]

Related

Inconsistent quotes on .csv file

I have a comma delimited file which also contains commas in the actual field values, something like this:
foo,bar,"foo, bar"
This file is very large, so I am wondering if there is a way in python to either put double quotes around every field:
eg: "foo","bar","foo, bar"
or just change the delimiter overall?
eg: foo|bar|foo, bar
End goal:
The goal is to ultimately load this file into SQL Server. Given the size of the file, bulk insert is the only feasible approach for loading, but I cannot specify a text qualifier/field quote due to the version of SSMS I have.
This leads me to believe the only remaining approach is to do some preprocessing on the source file.
Changing the delimiter just requires parsing and re-encoding the data.
with open("data.csv") as input, open("new_data.csv", "w") as output:
r = csv.reader(input, delimiter=",", quotechar='"')
w = csv.writer(output, delimiter="|")
w.writerows(r)
Given that your input file is a fairly standard version of CSV, you don't even need to specify the delimiter and quote arguments to reader; the defaults will suffice.
r = csv.reader(input)
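If you prefer to do the whole conversion in pandas instead, a rough equivalent (assuming the file has no header row):
import pandas as pd

# read with the default comma/quote handling, write back pipe-delimited
df = pd.read_csv('data.csv', header=None)
df.to_csv('new_data.csv', sep='|', index=False, header=False)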
These are not inconsistent quotes. If a value in a CSV file has a comma or a newline in it, quotes are added around it. It shouldn't be a problem, because all standard CSV readers can read it properly.
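For instance, a quick check that the quoting round-trips (the column names are made up for the sketch):
import io
import pandas as pd

sample = 'a,b,c\nfoo,bar,"foo, bar"\n'
df = pd.read_csv(io.StringIO(sample))
print(df.loc[0, 'c'])  # foo, bar -- the quoted comma survives parsing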

pandas read csv with quotes at beginning and ending of every row

I have a csv file that is encoded with commas as the separator, but every row has a quote character at the start and at the end.
In practice the data look like this
"0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
If I read the data using pd.read_csv(), it just reads everything into a single column. What is the best workaround? Is there a simple way to pre-emptively strip the quote characters from the whole csv file?
If your file looks like
my_file=""""0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
"""
One way to remove the quotes prior to using Pandas would be
for line in my_file.split('\n'):
    print(line.replace('"', ''))
To write that to a file, use:
with open('output.csv', 'w') as file_handle:
    for line in my_file.split('\n'):
        file_handle.write(line.replace('"', '') + '\n')
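From there the cleaned text reads straight into pandas; a minimal sketch (the column names x and y are assumptions):
import io
import pandas as pd

cleaned = my_file.replace('"', '')
df = pd.read_csv(io.StringIO(cleaned), header=None, names=['x', 'y'])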

Read a csv into pandas that has commas *within* the first/index cells of the rows, without changing the values

Ok, I get this error...:
"pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 12, saw 7"
...when trying to import a csv into a python script with pandas.read_csv():
path,Drawing_but_no_F5,Paralell_F5,Fixed,Needs_Attention,Errors
R:\13xx Original Ranch Buildings\1301 Stonehouse\1301-015\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-026A Carriage House, Redo North Side Landscape\F - Bid Document and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-028\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-029\F - Bid Documents and Contract Award,Yes,No,No,No,No
Obviously, in the above entries, it is the third line that throws the error. Caveats: I have to use that column as a path to process files there, so changing the entry is not allowed. The CSV is created elsewhere; I get it as-is.
I do want to preserve the column header.
This filepath column is used later as an index, so I would like to preserve that.
Many, many similar issues, but solutions seem very specific and I cannot get them to cooperate for my use case:
Pandas, read CSV ignoring extra commas
Solutions seem to change entry values or rely on the cells being in the last column
Commas within CSV Data
Solution involves sql tools methinks. I don't want to read the csv into sql tables...
The csv file is already delimited by commas, so I don't think changing the sep value will work (I cannot get it to work -- yet).
Problems reading CSV file with commas and characters in pandas
Solution throws error: "for line in reader:_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)"
Not too optimistic since op had the cell value in quotes whereas I do not.
Here is a solution which is a minor modification of the accepted answer by @DSM in the last thread you linked to (Problems reading CSV file with commas and characters in pandas).
import csv

with open('original.csv', 'r') as infile, open('fixed.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # everything left of the last five fields is a single path value
        newline = [','.join(line[:-5])] + line[-5:]
        writer.writerow(newline)
After running the above preprocessing code, you should be able to read fixed.csv using pd.read_csv().
This solution depends on knowing how many of the rightmost columns are always formatted correctly. In your example data, the rightmost five columns are always good, so we treat everything to the left of these columns as a single field, which csv.writer() wraps in double quotes.
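After the preprocessing step, you can read it back with the path column as the index; a minimal sketch (index_col='path' matches the header in the question):
import pandas as pd

df = pd.read_csv('fixed.csv', index_col='path')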

Read_csv in Pandas where file structure is inconsistent

Having trouble reading a csv file into a pandas dataframe where the line endings are not standard.
Here is my code:
df_feb = pd.read_csv(data_location, sep=",", nrows=500, header=None, skipinitialspace=True, encoding='utf-8')
Here is the output (personal info scratched out):
[screenshot: parsed dataframe output]
This is what the input data looks like:
[screenshot: raw input file]
The above output splits what should be a single line into 4 lines. A new line should start for every phone number (the phone number is the scratched-out bit).
I am aiming to have each line look like this:
[screenshot: goal output]
Thank you in advance for your help!
If the file format follows some rule (rather than a unique format for each record), then I suggest you write your own conversion tool.
Here is what I suggest the tool should do:
Read the file as plain text.
Combine every 4 lines into 1 record/class object (as I see in the picture, each record seems to span 4 lines).
Parse the lines (split by comma, tab, whatever you have) to get the attributes.
Write the attributes to another file, separated by tab (or comma) => your csv; see the sketch below.
Now you can load your csv into Pandas.
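A minimal sketch of such a tool, assuming each record spans exactly 4 physical lines; the file names are placeholders:
import pandas as pd

with open('raw_export.txt') as src, open('fixed.csv', 'w') as dst:
    buffer = []
    for line in src:
        buffer.append(line.strip())
        if len(buffer) == 4:  # one logical record collected
            dst.write(','.join(buffer) + '\n')
            buffer = []

df = pd.read_csv('fixed.csv', header=None)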

How to remove newline within a column in delimited file?

I have a file that looks like this:
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...
Where \n represents a newline.
When I read this line-by-line, it's read as:
1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...
This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?
I think after you read the line, you need to count the number of commas
aStr.count(',')
While the number of commas is too small (there can be more than one \n in the input), read the next line and concatenate the strings:
while aStr.count(',') < Num:
    another = file.readline()
    aStr = aStr + another
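Putting that together as a runnable sketch, assuming each logical record has exactly 2 commas (3 fields); the file name is hypothetical:
NUM_COMMAS = 2
records = []
with open('data.txt') as fp:
    for line in fp:
        record = line.rstrip('\n')
        # pull in more physical lines until the record has all its fields
        while record.count(',') < NUM_COMMAS:
            nxt = fp.readline()
            if not nxt:  # end-of-file guard against malformed data
                break
            record += nxt.rstrip('\n')
        records.append(record.split(','))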
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
According to your file, \n here is not actually a newline character; it is plain text.
For actually stripping newline characters you could use strip() or variations like rstrip() or lstrip().
If you work with large files, you don't need to load the full content into memory. You can iterate line by line until some counter or anything else.
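For example:
line = '  2222,BB,bbbb\n'
line.strip()   # '2222,BB,bbbb'   -- whitespace removed from both ends
line.rstrip()  # '  2222,BB,bbbb' -- right end only
line.lstrip()  # '2222,BB,bbbb\n' -- left end only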
I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around the fields.
That is, I suppose that your text file actually looks like this:
1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc
If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:
import csv

with open('foo.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)
Which produces this output:
['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']
Notice the five lines (delimited by newline characters) of the CSV file become 3 rows (some with embedded newline characters) in the CSV data structure.
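pandas behaves the same way; a quick sketch against the same foo.csv:
import pandas as pd

# quoted fields with embedded newlines parse as single values by default
df = pd.read_csv('foo.csv', header=None)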
