I have a file that looks like this:
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...
Where \n represents a newline.
When I read this line-by-line, it's read as:
1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...
This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?
I think after you read the line, you need to count the number of commas
aStr.count(',')
While the number of commas is too small (there can be more than one \n in the input), then read the next line and concatenate the strings
while aStr.count(',') < Num:
    another = file.readline()
    aStr = aStr + another
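A complete sketch of that idea (the file name and the three-field layout are assumptions, and this approach also assumes the fields themselves never contain commas):

```python
# Build a sample file matching the question: records whose middle
# field contains an embedded newline.
with open('data.txt', 'w') as f:
    f.write('1111,AAAA,aaaa\n2222,BB\nBB,bbbb\n3333,CCC\nC,cccc\n')

def read_records(path, num_delimiters=2):
    """Yield logical records, gluing physical lines together until the
    record contains num_delimiters commas (2 commas = 3 fields).
    Assumes no field ever contains a comma itself."""
    with open(path) as f:
        record = ''
        for line in f:   # iterates lazily, so large files are fine
            record += line
            if record.count(',') >= num_delimiters:
                yield record.rstrip('\n')
                record = ''

records = [r.split(',') for r in read_records('data.txt')]
print(records)
```

The embedded newlines are preserved inside the middle field (e.g. 'BB\nBB'), which matches how a CSV parser would see the data.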
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
According to your file \n here is not actually a newline character, it is plain text.
For actually stripping newline characters you could use strip() or other variations like rstrip() or lstrip().
If you work with large files you don't need to load full content in memory. You could iterate line by line until some counter or anything else.
I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around the fields.
That is, I supposed that your text file actually looks like this:
1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc
If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:
import csv

with open('foo.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)
Which produces this output:
['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']
Notice the five physical lines (delimited by newline characters) of the CSV file become three rows (some with embedded newline characters) in the parsed data structure.
Related
I have a csv file that is encoded with commas as separators, but every row has a quote character at the start and at the end.
In practice the data look like this
"0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
If I read the data using pd.read_csv() it just reads everything under a single column. What is the best workaround? Is there a simple way to pre-emptively strip the quotes character from the whole csv file?
If your file looks like
my_file = '''"0.00000E+000,6.25000E-001"
"1.00000E+000,1.11926E+000"
"2.00000E+000,9.01726E-001"
"3.00000E+000,7.71311E-001"
"4.00000E+000,6.82476E-001"
'''
One way to remove the quotes prior to using Pandas would be
for line in my_file.split('\n'):
    print(line.replace('"', ''))
To write that to file, use
with open('output.csv', 'w') as file_handle:
    for line in my_file.split('\n'):
        file_handle.write(line.replace('"', '') + '\n')
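If you would rather not write an intermediate file, one option is to strip the quotes in memory and hand the result to pandas through io.StringIO. A minimal sketch, with the sample rows inlined:

```python
import io
import pandas as pd

# Sample input matching the question: each whole row wrapped in quotes.
raw = '"0.00000E+000,6.25000E-001"\n"1.00000E+000,1.11926E+000"\n'

# Strip the wrapping quotes, then let read_csv parse the commas normally.
cleaned = raw.replace('"', '')
df = pd.read_csv(io.StringIO(cleaned), header=None)
print(df)
```

With the quotes removed, read_csv sees two numeric columns instead of one quoted string column.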
Using the Python open built-in function in this way:
with open('myfile.csv', mode='r') as rows:
    for r in rows:
        print(r.__repr__())
I obtain this output
'col1,col2,col3\n'
'fst,snd,trd\n'
'1,2,3\n'
I don't want the \n character. Do you know some efficient way to remove that char (in place of the obvious r.replace('\n',''))?
If you are trying to read and parse csv file, Python's csv module might serve better:
import csv

with open('myfile.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        print(', '.join(row))
Although you cannot change the line terminator for reader here, it ends a row with either '\r' or '\n', which works for your case.
https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator
Again, for most of the cases, I don't think you need to parse csv file manually. There are a few issues/reasons that makes csv module easier for you: field containing separator, field containing newline character, field containing quote character, etc.
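A small round-trip sketch of exactly those edge cases, using only the csv module and an in-memory buffer:

```python
import csv
import io

# One field with the separator, one with a newline, one with quotes.
tricky = ['plain', 'has, a comma', 'has\na newline', 'has "quotes"']

buf = io.StringIO()
csv.writer(buf).writerow(tricky)   # writer quotes/escapes as needed
buf.seek(0)
row = next(csv.reader(buf))        # reader undoes it exactly
print(row == tricky)
```

A hand-rolled line.split(',') would mangle every one of those fields; the module round-trips them unchanged.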
You can use str.strip(), which (with no arguments) removes any whitespace from the start and end of a string:
for r in rows:
    print(r.strip())
If you want to remove only newlines, you can pass that character as an argument to strip:
for r in rows:
    print(r.strip('\n'))
For a clean solution, you could use a generator to wrap open, like this:
def open_no_newlines(*args, **kwargs):
    with open(*args, **kwargs) as f:
        for line in f:
            yield line.strip('\n')
You can then use open_no_newlines like this:
for line in open_no_newlines('myfile.csv', mode='r'):
    print(line)
I wrote a Python script in Windows 8.1 using the Sublime Text editor, and I just tried to run it from the terminal in OS X Yosemite, but I get an error.
My error occurs when parsing the first line of a .CSV file. Here is the relevant slice of the code:
lines is an array where each element is the line in the file it is read from as a string
we split the string by the desired delimiter
we skip the first line because that is the header information (else condition)
For the last index in the for loop i = numlines -1 = the number of lines in the file - 2
We only add one to the value of i because the last line is blank in the file
for i in range(numlines):
    if i == numlines-1:
        dataF = lines[i+1].split(',')
    else:
        dataF = lines[i+1].split(',')
    dataF1 = list(dataF[3])
    del(dataF1[len(dataF1)-1])
    del(dataF1[len(dataF1)-1])
    del(dataF1[0])
    f[i] = ''.join(dataF1)
return f
All the lines in the csv file looks like this (with the exception of the header line):
"08/06/2015","19:00:00","1","410"
So it saves the single line into an array where each element corresponds to one of the 4 values separated by commas in a line of the CSV file. Then we take element 3 of the array (dataF[3]), "410", and create a list that should look like
['"','4','1','0','"','\n']
(and it does when run from windows)
but it instead looks like
['"','4','1','0','"','\r','\n']
and so when I concatenate this string based off the above code I get 410" instead of 410.
My question is: where did the '\r' come from? It is non-existent in the original files when run on a Windows machine. At first I thought it was the text format, so I saved the CSV file as UTF-8; that didn't work. I tried changing the tab size from 4 to 8 spaces; that didn't work. Running out of ideas now. Any help would be greatly appreciated.
Thanks
The "\r\n" is the Windows line separator; "\n" is used on Unix and "\r" on classic Mac OS. Different platforms have different line separators.
A simple fix: if you read a line from a file yourself, then line.rstrip() will remove the whitespace from the line end.
A proper fix: use Python's standard CSV reader. It will properly handle line endings, quoted strings, embedded newlines, etc.
Also, when working with long lists, it helps to stop thinking about them as index-addressed 'arrays' and use the 'stream' or 'sequential reading' metaphor.
So the typical way of handling a CSV file is something like:
import csv

with open('myfile.csv') as f:
    reader = csv.reader(f)
    # We assume that the file has 3 columns; adjust to taste
    for (first_field, second_field, third_field) in reader:
        # do something with field values of the current line here
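Applied to the file format in the question, the csv module removes the quotes and the line endings for you, so the del gymnastics are no longer needed. A sketch (the file name is assumed, and a sample row is written first so it runs standalone):

```python
import csv

# Sample matching the question's format, with Windows line endings.
with open('sample.csv', 'w', newline='') as f:
    f.write('"08/06/2015","19:00:00","1","410"\r\n')

values = []
with open('sample.csv', newline='') as f:  # newline='' lets csv handle \r\n
    for row in csv.reader(f):
        if row:                            # skip any blank lines
            values.append(row[3])          # fourth field, quotes stripped
print(values)
```

The same file produces '410' on both Windows and OS X, regardless of whether the lines end in '\n' or '\r\n'.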
I wrote an HTML parser in Python that extracts data into a csv file that looks like this:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used the delimiter ":::::", thinking that it wouldn't appear in the data
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines; however, apparently a colon (:) threw this off when I imported the csv in Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv

DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]

with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)

with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')

with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
CSV files usually use double quotes " to wrap long fields that might contain a field separator like a comma. If the field contains a double quote, it is escaped by doubling it: "".
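A minimal sketch showing how Python's csv module quotes such fields and doubles any embedded quotes:

```python
import csv
import io

# One field with an embedded quote, one with an embedded comma.
buf = io.StringIO()
csv.writer(buf).writerow(['she said "hi"', 'a, b'])
print(buf.getvalue())
```

The reader reverses the transformation, so a round trip recovers the original field values exactly.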
I have a text file which has been made from a excel file, in the excel file cell A2 has the name 'Supplier A'. When I import the text file and i use the following code:
filea = open("jag.txt").readlines()
lines = [x.split() for x in filea]
print lines[0][1]
It returns just 'Supplier' and not 'Supplier A'; the 'A' is located in lines[0][2]. How do I import it and have it recognise the complete word? Because if I copy the text field back into Excel it does copy properly, so the txt file definitely recognises them as being together.
Excel regularly uses the tab character as the separator sign when saving in 'txt' format.
So you should try something like this:
lines = []
with open('jag.txt') as f:
    lines = [line.split('\t') for line in f.read().splitlines()]
print(lines)
and should get something like this
[ ['A1', 'A2', ...], ['B1', 'B2'], ... ]
Why not just "f.readlines()"? Because then your last cell will also contain the newline character ('\n').
Why use the with statement? with will close the file at the end, and that is a good choice in any case.
An alternative way to parse your text file could be the python (included) csv module. Using the csv.reader can be a very convenient way to parse character separated files/structures:
import csv

with open('jag.txt') as f:
    lines = [line for line in csv.reader(f, delimiter='\t')]
-Colin-
It does so because str.split() without arguments splits on every run of whitespace: spaces, tabs and line breaks. You can use str.split('\t') as an alternative, but in fact you really want to use the csv module for tasks like this.
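The difference between the two forms of split is easy to see on a tab-separated line (a small sketch):

```python
line = 'Supplier A\t100\n'
print(line.split())      # splits on ALL whitespace, breaking 'Supplier A' apart
print(line.split('\t'))  # splits on tabs only, keeping the cell intact
```

Note that split('\t') also leaves the trailing '\n' on the last cell, which is one more reason to prefer csv.reader or splitlines().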
What character (space, tab, comma, etc.) separates the values on each line? Your current code will split the text at whitespace, because split() is used without a split character.