I'm new to Python and I have the following csv file (let's call it out.csv):
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27.363000+00:00,0.9987,1.0113
2017-01-15,13:03:46.660000+00:00,0.9987,1.0113
2017-01-15,21:25:07.320000+00:00,0.9987,1.0113
2017-01-15,21:26:46.164000+00:00,0.9987,1.0113
2017-01-16,12:40:11.593000+00:00,,1.0154
2017-01-16,12:40:11.593000+00:00,1.0004,
2017-01-16,12:43:34.696000+00:00,,1.0095
and I want to truncate the second column so the csv looks like:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
This is what I have so far:
with open('out.csv','r+b') as nL, open('outy_3.csv','w+b') as nL3:
    new_csv = []
    reader = csv.reader(nL)
    for row in reader:
        time = row[1].split('.')
        new_row = []
        new_row.append(row[0])
        new_row.append(time[0])
        new_row.append(row[2])
        new_row.append(row[3])
        print new_row
        nL3.writelines(new_row)
I can't seem to get a new line in after writing each line to the new csv file.
This definitely doesn't look or feel pythonic.
Thanks
The missing newlines issue is because the file.writelines() method doesn't automatically add line separators to the elements of the argument it's passed, which it expects to be a sequence of strings. If those elements represent separate lines, it's your responsibility to ensure each one ends in a newline.
However, your code tries to use it to output only a single line. To fix that you should use file.write() instead, because it expects its argument to be a single string; if you want that string to be a separate line in the file, it must end with a newline or have one added manually.
Below is code that does what you want. It works by changing one of the elements of the list of strings that the csv.reader returns in place, writing the modified list to the output file as a single string by join()ing the elements back together, and then manually adding a newline to the end of the result (stored in new_row).
import csv

with open('out.csv','rb') as nL, open('outy_3.csv','wt') as nL3:
    for row in csv.reader(nL):
        time_col = row[1]
        try:
            period_location = time_col.index('.')
            row[1] = time_col[:period_location]  # only keep characters in front of period
        except ValueError:  # no period character found
            pass  # leave row unchanged
        new_row = ','.join(row)
        print(new_row)
        nL3.write(new_row + '\n')
Printed (and file) output:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
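As an aside, the csv module can manage the line endings for you: csv.writer appends the row terminator itself, so there is no newline bookkeeping at all. A minimal sketch of that variant (assuming Python 3, where the docs recommend opening csv files with newline=''):

import csv

with open('out.csv', newline='') as nL, open('outy_3.csv', 'w', newline='') as nL3:
    writer = csv.writer(nL3)
    for row in csv.reader(nL):
        row[1] = row[1].split('.')[0]  # keep only the part before the period, if any
        writer.writerow(row)           # writerow() adds the line terminator itself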
load_datafile() takes a single string parameter representing the filename of a datafile. This function must read the content of the file, convert all letters to lowercase, store the result in a string, and finally return that string. I will refer to this string as data throughout this specification; you may rename it. You must also handle all exceptions in case the datafile is not available.
Sample output:
data = load_datafile('harry.txt')
print(data)
the hottest day of the summer so far was drawing to a close and a drowsy silence
lay over the large, square houses of privet drive.
load_wordfile() takes a single string argument representing the filename of a wordfile. This function must read the content of the wordfile, store all words in a one-dimensional list, and return the list. Make sure that the words do not have any additional whitespace or newline characters in them. You must also handle all exceptions in case the file is not available.
Sample outputs:
pos_words = load_wordfile("positivewords.txt")
print(pos_words[2:9])
['abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed',
'acclamation']
neg_words = load_wordfile("negativewords.txt")
print(neg_words[10:19])
['aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence',
'absent-minded', 'absentee']
MY CODE BELOW
def load_datafile('harryPotter.txt'):
    data = ""
    with open('harryPotter.txt') as file:
        lines = file.readlines()
        temp = lines[-1].lower()
    return data
Your code has two main problems. The first one is that you assign an empty string to the variable data and return it, so no matter what you do with the contents of the file, you always return an empty string. The second one is that file.readlines() returns a list of strings, where each line in the file is an element of the list, and you are only converting the last element, lines[-1], to lowercase.
To fix your code you should make sure that you store the contents of the file in the data variable, and you should apply lower() to each line in the file, not just the last one. Something like this:
def load_datafile(file_name):
    data = ''
    with open(file_name) as file:
        lines = file.readlines()
        for line in lines:
            data = data + line.lower()  # each line already ends in its own newline
    return data
The previous example is not the best way of doing this, but it's very easy to understand what is happening, and I think that matters more when you are starting out. To make it more efficient you might want to change it to:
def load_datafile(file_name):
    with open(file_name) as file:
        return ''.join(line.lower() for line in file)
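The spec also asks you to handle all exceptions in case the datafile is not available, which neither version above does yet. A minimal sketch, assuming that returning an empty string on failure is acceptable (the spec doesn't say what to return in that case):

def load_datafile(file_name):
    try:
        with open(file_name) as file:
            return file.read().lower()  # read everything and lowercase it in one go
    except OSError:  # covers FileNotFoundError, PermissionError, etc.
        return ''

The same try/except pattern works for load_wordfile(), where str.split() with no arguments already strips surrounding whitespace and newlines from each word.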
I have the following .txt file (a modified bash emboss-dreg report; the original report is in seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "Sequence" only, to compare them with some variables and delete the whole line if the comparison does not give the desired result (using Levenshtein distance for the comparison).
But I can't even get started .... :(
I am searching for something like the Linux cut -f option, to get directly to the right "field" in the line to do my comparison.
I came across re.split:
with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to splitting my lines into elements. I feel like I'm totally going the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with the file as plain .txt. Maybe there is another approach better suited for it?
Python is the main language I am learning and I am not so firm in Bash, but Bash answers for dealing with the issue would be OK for me, too.
I am thankful for any hint/link/help :)
The format itself seems to be using blank lines as record delimiters, while your r'\t' is not doing anything here: it tells Python to split on tab characters, and based on what you've pasted the data is not tab-delimited anyway, but padded with a variable number of spaces to line up the table.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check if there is any data left, and if there is, split it further on whitespace to get to your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason: you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file, and finally rename the temporary file to match your original file. Something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use

shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in a TC and our compare_func() returns False in that case.
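If you do want something closer to the Levenshtein-distance comparison mentioned in the question, the standard library's difflib offers a similarity ratio you could drop in as a stand-in (a sketch only; the reference sequence and the 0.8 threshold are made-up values):

from difflib import SequenceMatcher

REFERENCE = "TGATCGCACGCCGAATGGAAACACGTTTT"  # hypothetical reference sequence

def compare_func(seq):
    # keep rows whose sequence is at least 80% similar to the reference
    return SequenceMatcher(None, REFERENCE, seq).ratio() >= 0.8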
For a bit less complexity, instead of using temporary files you can load your whole source file into working memory and then just overwrite it. That only works for files that fit into working memory, though, while the above approach can work with files as large as your free storage space.
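A minimal sketch of that in-memory variant, assuming the same compare_func and header layout as above:

with open(SOURCE_FILE, "r") as f:
    lines = f.readlines()  # load the whole file into memory

header = {v: i for i, v in enumerate(lines[0].split())}  # header map from the first line
kept = [lines[0]]  # always keep the header
for line in lines[1:]:
    row = line.strip()
    if not row or compare_func(row.split()[header["Sequence"]]):
        kept.append(line)  # keep blank lines and rows that pass the comparison

with open(SOURCE_FILE, "w") as f:  # overwrite the source in place
    f.writelines(kept)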
I have a number of txt files that represent spatial data in a grid form, essentially arrays of the same dimensions in which each value signifies a trait about the corresponding parcel of land. I have been trying to script a sequence that imports each file, adds "-9999" on the border of the entire grid, and saves out to an otherwise identical txt file.
The first 6 rows of each txt file are header rows, and shouldn't be changed.
My progress is as follows:
for datfile in spatialfiles:
    results = []
    borderrow = []
    with open('{}.txt'.format(datfile)) as inputfile:
        #header = inputfile.readlines()
        for line in inputfile:
            row = ['-9999'] + line.strip().split(' ') + ['-9999']
            results.append(row)
        for cell in range(len(row)):
            borderrow.append('-9999')
    results = [borderrow] + results[6:] + [borderrow]
    with file("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header[:6]:
            outputFile.write(row)
        for row in results:
            outputFile.write(row)
"header = inputfile.readlines()" has been commented out because it seems to cause a NameError in which "row" is no longer recognized. At the same time, I haven't found another way to retain the 6 header rows for exporting later.
Why does readlines() seem to alter the ability to iterate through the lines of the inputfile when it is only being used to write to a variable? What am I missing? (Any other pointers on my undoubtedly bloated code always welcome!)
readlines() reads the whole file into memory, parses it into a list of lines, and leaves the file position at the end of the file. When you then iterate over the same file again, reading resumes from that position, which is already at the end, so the loop body never runs and row is never assigned (hence the NameError). Call readlines() once and work from the resulting list, either with a counter that changes the loop's behavior after 6 lines or simply by slicing.
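A sketch of that pattern applied to the loop above (a guess at intent: slicing replaces the counter since the header size is fixed, and rows are joined back into strings before writing, assuming space-separated values as in your split):

for datfile in spatialfiles:
    with open('{}.txt'.format(datfile)) as inputfile:
        lines = inputfile.readlines()   # read the file exactly once

    header = lines[:6]                  # the first 6 lines are the header
    results = []
    for line in lines[6:]:              # the rest is the grid data
        row = ['-9999'] + line.strip().split(' ') + ['-9999']
        results.append(row)

    borderrow = ['-9999'] * len(results[0])  # one border cell per column
    results = [borderrow] + results + [borderrow]

    with open('{}-new.txt'.format(datfile), 'w') as outputFile:
        outputFile.writelines(header)   # header lines still end in '\n'
        for row in results:
            outputFile.write(' '.join(row) + '\n')  # rows are lists, so join them first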
I have multiple entries in a file as mentioned below.
"Item_1";"Item_1";"Products///Item///ABC///XYZ";"Item_1.jpg}";"";"Buy item
<br><strong>Items</strong>
<br><strong>Time</strong>";"";"";"";"";"";"Category: M[Item]";"";"";"Y";"N";"N";"None";""
"Item_2";....
In the above text, there is a newline after "Buy item" in the first line and after '</strong>' in the second line.
The changes which I want to make are:
1. Replace Products///Item///ABC///XYZ with Products///ABC///XYZ
2. Replace "Category: M[Item]" with "Category: M[ABC]"
3. In case Entry 1 is Products///Item///ABC or Products///ABC, I don't want to change "Category: M[Item]" to "Category: M[ABC]", just change Products///Item///ABC to Products///ABC
I am trying to read the entire file line by line and then split by '///', storing the number of entries and the third entry. But this creates issues because I have multiple newlines.
Is there a simpler way of doing it by using regex or something else?
Like @Casimir suggested, you can use the csv module to parse your file (because it will handle the newlines), like this:
import csv

with open(your_filename) as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    rows = list(reader)
and then do what you want with the parsed result (I'm not quite sure what you want to achieve here; comment if this is not what you want):
for row in rows:
    if 'Products///Item///ABC///XYZ' in row:
        index = row.index('Products///Item///ABC///XYZ')
        row[index] = 'Products///ABC///XYZ'
        continue  # If we replaced the first thing, skip to next row
    elif 'Category: M[Item]' in row:
        index = row.index('Category: M[Item]')
        row[index] = 'Category: M[ABC]'
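To write the modified rows back out, csv.writer with matching settings should round-trip the format (a sketch; out_filename is a placeholder, and QUOTE_ALL mirrors the always-quoted fields in your sample):

with open(out_filename, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writerows(rows)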
I have a csv file which I want to sort by taking one row at a time. While sorting each row, I want to ignore the whitespace (or empty cells). Also, I want to ignore the first row and the first column while sorting.
This is what my code looks like:
import csv, sys, operator

fname = "Source.csv"
new_fname = "Dest.csv"
data = csv.reader(open(fname,"rb"),delimiter=',')
num = 1
sortedlist = []
ind = 0
for row in data:
    if num == 1:
        sortedlist.append(row)
        with open(new_fname,"wb") as f:
            filewriter = csv.writer(f,delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    elif num > 1:
        sortedlist.append(sorted(row))
        with open(new_fname,"ab") as f:
            filewriter = csv.writer(f,delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    num += 1
I was able to ignore the first row. But, I am not sure how to ignore the whitespace and the first column while sorting. Any suggestions are welcome.
I simplified your code significantly and here's what I got. (I didn't understand the part about empty columns, though; they are values as well. Did you mean that you wanted to keep empty columns in the same place instead of putting them at the start?)
import csv

if __name__ == '__main__':
    reader = csv.reader(open("Source.csv","r"), delimiter=',')
    out_file = open("Dest.csv","w")
    writer = csv.writer(out_file, delimiter=",")
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow([row[0]] + sorted(row[1:]))
    out_file.close()
Always write executable code under if __name__ == '__main__':; this is done so that your code is not executed if your script is imported by another script rather than run directly.
We keep the out_file variable so we can call out_file.close() cleanly later; the code will work without it, but it's a clean way to write files.
Do not use "wb", "rb", "ab" for text files; the "b" part stands for "binary" and should be reserved for binary files.
next(reader) gets the first line of the csv file (or crashes if the file is empty).
for row in reader: already starts from the second line (because we consumed the first one with next(reader)), so we don't need any line-number conditionals anymore.
row[0] gets the first element of the list, row[1:] gets all elements of the list, except the first one. For example, row[3:] would ignore first 3 elements and return the rest of the list. In this case, we only sort the row without its first element by doing sorted(row[1:])
EDIT: If you really want to remove empty columns from your csv, replace sorted(row[1:]) with sorted(filter(lambda x: x.strip()!='', row[1:])). This will remove empty columns from the list before sorting, but keep in mind that empty values in csv are still values.
EDIT2: As correctly pointed out by @user3468054, values will be sorted as strings; if you want them to be sorted as numbers, add the named parameter key=int to the sorted function, or key=float if your values are floats.
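Putting the two edits together, a small sketch of the per-row handling with empty cells dropped and numeric sorting (assuming the remaining values all parse as floats; the sample row is made up):

row = ['2017-01-15', '1.0113', '', '0.9987']       # hypothetical row
cells = [c for c in row[1:] if c.strip() != '']    # drop empty cells
print([row[0]] + sorted(cells, key=float))         # sort numerically, keep first column
# ['2017-01-15', '0.9987', '1.0113']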