I'm parsing a very big CSV file (big = tens of gigabytes) in Python, and I only need the value of the first column of every line. I wrote this code and am wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv', 'r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the csv module (and I have used it occasionally), but I do not need to load every line of this file into memory - I need only the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv

with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
And the csv module will correctly handle quoted columns, embedded quotes, etc.
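For example, a field that contains the delimiter inside quotes is parsed transparently (a quick sketch with made-up data, Python 3):

import csv, io

# hypothetical one-line example: the first field contains the delimiter
line = '"Doe, John",42\n'
print(next(csv.reader(io.StringIO(line))))
# ['Doe, John', '42']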
If the first field can't contain an escaped delimiter (as in your case, where the first field is an integer) and no field contains embedded newlines (i.e., each row corresponds to exactly one physical line in the file), then the csv module is overkill and you could use your code from the question, or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them, you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep:  # the line has ',' in it
            process_id(int(first))  # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
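A quick demonstration of the difference:

>>> "no delimiter here".split(",", 1)[0]
'no delimiter here'
>>> "no delimiter here".partition(",")
('no delimiter here', '', '')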
'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
Personally, I would do it with generators:
from itertools import imap
import csv

def int_of_0(x):
    return int(x[0])

def obtain(filepath, treat):
    with open(filepath, 'rb') as f:
        for i in imap(treat, csv.reader(f)):
            yield i

for x in obtain('essai.txt', int_of_0):
    pass  # instructions
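On Python 3, itertools.imap is gone, but the built-in map is already lazy, so an equivalent sketch (same hypothetical 'essai.txt') would be:

import csv

def obtain(filepath, treat):
    # map() is lazy on Python 3, so rows are processed one at a time
    with open(filepath, newline='') as f:
        yield from map(treat, csv.reader(f))

for x in obtain('essai.txt', lambda row: int(row[0])):
    pass  # instructions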
Related
I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying it is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
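To illustrate, here is a sketch of how the csv module copes with each style (the quoting rules are built in; backslash escaping must be configured explicitly; the data is made up):

import csv, io

# the same two-field record, escaped two different ways
ms_style = 'name,"value, with, commas"\n'      # Microsoft-style quoting
unix_style = 'name,value\\, with\\, commas\n'  # UNIX-style backslash escaping

print(next(csv.reader(io.StringIO(ms_style))))
# ['name', 'value, with, commas']
print(next(csv.reader(io.StringIO(unix_style), escapechar='\\', quoting=csv.QUOTE_NONE)))
# ['name', 'value, with, commas']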
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
    # => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}

# write records using other dialect
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is to count the total number of fields in the two record sets (expecting that tabs and pipes are not so common in the data itself). Another (if your data is strongly structured and you expect the same number of fields in each line) is to measure the standard deviation of the number of fields per line and take the record set with the smallest standard deviation; a sketch of this variant appears after the example below.
In the following example you find the simpler statistic (total number of fields):
import csv

piperows = []
tabrows = []

# parsing with the | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing with the TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:  # note: the original looped over readerpipe here by mistake
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
totfieldspipe = sum(len(row) for row in piperows)
totfieldstab = sum(len(row) for row in tabrows)

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows, now just write them in any format you like
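And a sketch of the standard-deviation variant mentioned above (statistics is in the standard library from Python 3.4; "file" is the same placeholder filename as in the code above):

import csv
import statistics

def field_count_stdev(path, delimiter):
    # lower standard deviation = more regular rows for this delimiter
    with open(path, newline='') as f:
        counts = [len(row) for row in csv.reader(f, delimiter=delimiter)]
    return statistics.pstdev(counts)

# pick the candidate delimiter that yields the most regular table
best = min(["|", "\t"], key=lambda d: field_count_stdev("file", d))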
Like this:
from __future__ import with_statement
import csv
import re

with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:  # the original iterated over `input` (a string) by mistake
            # rstrip the newline so it doesn't end up in the last field
            writer.writerow(re.split('[\t|]', line.rstrip('\n')))
I would suggest taking some of the example code from the existing answers, or perhaps better using Python's csv module, and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe-separated; else assume a tab-separated file (a minimal sketch of this heuristic follows below).
Alternatively, fix the file to contain a key field in the first line which is easily identified, or maybe the first line contains column headers which can be detected.
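A minimal sketch of that first-line heuristic (hypothetical filenames; it assumes the first line is representative of the whole file):

import csv

with open("input.txt", newline="") as f:
    first = f.readline()
    delimiter = "|" if "|" in first else "\t"  # decide based on the first line only
    f.seek(0)  # rewind so the reader sees the whole file
    with open("output.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f, delimiter=delimiter):
            writer.writerow(row)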
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
I want to find the delimiter in a text file.
The text looks like this:
ID; Name
1; John Mak
2; David H
4; Herry
The file contains tabs along with the delimiter.
I tried the following, referring to an existing example:
import csv

with open(filename, 'r') as f1:
    dialect = csv.Sniffer().sniff(f1.read(1024), "\t")
    print 'Delimiter:', dialect.delimiter
The result shows: Delimiter:
Expected result: Delimiter: ;
sniff can only conclude that a single character is the delimiter. Since your CSV file uses two characters as the delimiter, sniff will simply pick one of them. But since you also pass the optional second argument to sniff, it will only pick what's contained in that value as a possible delimiter, which in your case is '\t' (which is not visible in your print output).
From sniff's documentation:
If the optional delimiters parameter is given, it is interpreted as a
string containing possible valid delimiter characters.
Sniffing is not guaranteed to work.
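For instance, if you drop the second argument, the sniffer is free to pick ';' on its own; a sketch (sniff raises csv.Error when it cannot decide, so it is worth catching, and repr makes an invisible delimiter visible):

import csv

with open(filename) as f1:
    try:
        dialect = csv.Sniffer().sniff(f1.read(1024))
        print('Delimiter:', repr(dialect.delimiter))
    except csv.Error:
        print('Could not determine the delimiter')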
Here is one approach that will work with any kind of delimiter.
You start with what you assume is the most common delimiter, ';'; if that fails, you try the others until you manage to parse the row.
import csv

with open('sample.csv') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        try:
            a, b = row
        except ValueError:
            try:
                a, b = row[0].split(None, 1)
            except ValueError:
                a, b = row[0].split('\t', 1)
        print('{} - {}'.format(a.strip(), b.strip()))
You can play around with this at the repl.it link; edit the sample.csv file if you want to try out different delimiters.
You can combine sniffing with this to catch any odd delimiters that are not known to you.
I have the following .txt file (modified bash emboss-dreg report; the original report is in seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "Sequence" only, to compare them with some variables and delete whole lines if the comparison does not give the desired result (using Levenshtein distance for the comparison).
But I can't even get started .... :(
I am searching for something like the Linux cut -f option, to directly get to the right "field" in the line to do my comparison.
I came across re.split:
import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "split my lines into elements". I feel like I'm totally going the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with the file as plain .txt. Maybe there is another approach better suited to dealing with it?
Python is the main language I am learning; I am not so firm in Bash, but Bash answers for dealing with the issue would be OK for me, too.
I am thankful for any hint/link/help :)
The format itself seems to be using blank lines as record separators, while your r'\t' splits on tab characters that the data doesn't actually contain: based on what you've pasted, the data is not tab-delimited but padded with a variable number of spaces to align the table.
To address both issues, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the trailing/leading whitespace, check if there is any data left, and if there is, further split it on whitespace to get to your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason, you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file, something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line by line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or if it is an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into working memory and then just overwrite it, but that works only for files that fit in your working memory, while the above approach can work with files as large as your free storage space.
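A sketch of that in-memory variant, reusing compare_func and the header map from above (it keeps blank lines as-is, which differs slightly from the whitespace handling in the temporary-file version):

with open(SOURCE_FILE) as f:
    lines = f.readlines()  # load the whole file into working memory

header = {v: i for i, v in enumerate(lines[0].split())}  # header map
kept = [lines[0]]  # always keep the header line
for line in lines[1:]:
    elements = line.split()
    # keep empty lines, and data lines whose Sequence passes the comparison
    if not elements or compare_func(elements[header["Sequence"]]):
        kept.append(line)

with open(SOURCE_FILE, "w") as f:
    f.writelines(kept)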
In Python I can easily read a file line by line into a set, just by using:
file = open("filename.txt", 'r')
content = set(file)
Each of the elements in the set consists of the actual line plus the trailing line break.
Now I have a string with multiple lines, which I want to compare to the content by using the normal set operations.
Is there any way of transforming a string into a set just the same way, such, that it also contains the line-breaks?
Edit:
The question "In Python, how do I split a string and keep the separators?" deals with a similar problem, but the answer doesn't make it easy to adopt to other use-cases.
import re
content = re.split("(\n)", string)
doesn't have the expected effect.
The str.splitlines() method does exactly what you want if you pass True as the optional keepends parameter. It keeps the newlines on the end of each line, and doesn't add one to the last line if there was no newline at the end of the string.
text = "foo\nbar\nbaz"
lines = text.splitlines(True)
print(lines) # prints ['foo\n', 'bar\n', 'baz']
Here's a simple generator expression that does the job:
content = set(e + "\n" for e in s.split("\n"))
This solution adds an additional newline at the very end, though.
You can also do it the other way round: remove line endings when reading the file's lines, assuming you open the file with 'U' for universal line endings:
file = open("filename.txt", 'rU')
content = set(line.rstrip('\n') for line in file)
Could this be what you mean?
>>> from io import StringIO
>>> someLines=StringIO('''\
... line1
... line2
... line3
... ''')
>>> content=set(someLines)
>>> content
{'line1\n', 'line2\n', 'line3\n'}
I'm looking for a way, using Python, to copy the first column from a CSV into an empty file. I'm trying to learn Python, so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in Python, but the output file is empty:
import csv

f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
names = ""
for each_line in reader:
    names = each_line[0]
First, you want to open your files. A good practice is to use the with statement (which, technically speaking, introduces a context manager) so that when your code exits from the with block, all the files are automatically closed:
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
Next you want to loop over the lines of the input file (note the indentation: we are inside the with block). Splitting into lines is automatic when you iterate over a text file…
    for line in inpfile:
Each line is a string, but you think of it as two fields separated by white space. This situation is so common that strings have a method to deal with it (note again the increasing indent: we are in the for loop block):
        fields = line.split()
By default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc. That said, fields is a list of strings; for your first record it is equal to ['A', '32'], and you want to output just the first field in this list. For this purpose a file object has the .write() method, which writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code:
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
auto-splitting each line into a list of strings
taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion…
The situation is different when your data file is complex: it has headers, quoted strings possibly containing quoted delimiters, etc. In those cases the use of csv is recommended, as it takes care of all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
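For the record, here is a sketch of the same first-column task done with csv (same hypothetical file names as above, assuming space-delimited input as in test.csv):

import csv

with open('test.csv', newline='') as inpfile, open('out.csv', 'w', newline='') as outfile:
    reader = csv.reader(inpfile, delimiter=' ')
    writer = csv.writer(outfile)
    for row in reader:
        writer.writerow([row[0]])  # note the list: writerow expects a sequence of fields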
This answer reads the CSV file with pandas, taking a column to be demarcated by a space character. You have to add header=None, otherwise the first row will be taken to be the header / the names of the columns.
ss is a slice: the 0th column, taking all rows as denoted by :.
The last line writes the slice to a new file.
import pandas as pd

df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]  # positional slice; .ix from the original has been removed from pandas
ss.to_csv('new_path.csv', sep=' ', index=False)
import csv

reader = csv.reader(open("test.csv", "rb"), delimiter='\t')
writer = csv.writer(open("output.csv", "wb"))
for e in reader:
    writer.writerow([e[0]])  # wrap in a list, or each character becomes its own column
The best you can do is create an empty list, append the column to it, and then write that new list into another csv. For example:
import csv

def writetocsv(l):
    b = list(l)  # copy the incoming values into a list
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])

adcb_list = []
f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
for each_line in reader:
    adcb_list.append(each_line[0])  # keep only the first column

writetocsv(adcb_list)
hope this works for you :-)