Questions about split() and for-each loops - Python

I am given a file that looks like this, with many more lines than shown here:
4
5 r begin
20 wr Dark tunnel
I have created a class to handle each part of the line, which I split on spaces using split(). The problem is that in the third line, "Dark tunnel" gets split apart as well, but I need it to stay together as the single value "Dark tunnel".
My other question concerns the for-each loop: I want to perform the same operation on each line except the first, which is just the number 4. For that line I need to multiply it by itself minus 1, i.e. 4 * (4 - 1).
I created a class that takes the split line and assigns each part. I have also written the for-each loop, but as of now it performs the same operation on every line, including the first one.
class point:
    def __init__(self, val, route, title):
        self.value = val
        self.route = route
        self.title = title
I want to properly split the lines, and perform a different operation on the first line than on the rest.

For split, you could do:
parts = s.split()
val, route, title = parts[0], parts[1], ' '.join(parts[2:])
For the for loop, you could do:
for index, line in enumerate(lines):
    if index == 0:
        result = int(line) * (int(line) - 1)
    else:
        pass  # do something else
All together:
for index, line in enumerate(lines):
    if index == 0:
        result = int(line) * (int(line) - 1)
    else:
        parts = line.split()
        val, route, title = parts[0], parts[1], ' '.join(parts[2:])
        p = point(val, route, title)
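Putting it together with file reading (a minimal sketch; the filename input.txt is an assumption, and point is the class defined above), collecting the parsed points could look like:
points = []
with open('input.txt') as f:  # assumed filename
    lines = f.read().splitlines()
for index, line in enumerate(lines):
    if index == 0:
        result = int(line) * (int(line) - 1)  # 4 * (4 - 1) for the sample file
    else:
        parts = line.split()
        val, route, title = parts[0], parts[1], ' '.join(parts[2:])
        points.append(point(val, route, title))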

Related

How to extract values from a CSV, splitting in the correct place (no imports)?

How can I read a csv file without using any external import (e.g. csv or pandas) and turn it into a list of lists? Here's the code I worked out so far:
m = []
for line in myfile:
    m.append(line.split(','))
This for loop works pretty well, but if a ',' appears inside one of the fields, it wrongly breaks the line there.
So, for example, if one of the lines I have in the csv is:
12,"This is a single entry, even if there's a coma",0.23
The relative element of the list is the following:
['12', '"This is a single entry', 'even if there is a coma"','0.23\n']
While I would like to obtain:
['12', '"This is a single entry, even if there is a coma"','0.23']
I would avoid trying to use a regular expression here; instead, you need to process the text one character at a time to track where the quote characters are. Note that the quote characters are normally not included as part of a field.
A quick example approach would be the following:
def split_row(row, quote_char='"', delim=','):
    in_quote = False
    fields = []
    field = []
    for c in row:
        if c == quote_char:
            in_quote = not in_quote
        elif c == delim:
            if in_quote:
                field.append(c)
            else:
                fields.append(''.join(field))
                field = []
        else:
            field.append(c)
    if field:
        fields.append(''.join(field))
    return fields

fields = split_row('''12,"This is a single entry, even if there's a coma",0.23''')
print(len(fields), fields)
Which would display:
3 ['12', "This is a single entry, even if there's a coma", '0.23']
The csv library, though, does a far better job of this. The script above does not handle any special cases beyond your test string.
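For comparison, here is a minimal sketch of what the standard-library csv module (excluded by the question's no-imports constraint) produces for the same line:
import csv
from io import StringIO

# csv handles quoted fields, embedded delimiters, and escaped quotes.
line = '''12,"This is a single entry, even if there's a coma",0.23'''
row = next(csv.reader(StringIO(line)))
print(row)  # ['12', "This is a single entry, even if there's a coma", '0.23']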
Here is my go at it:
line = '12, "This is a single entry, more bits in here ,even if there is a coma",0.23 , 12, "This is a single entry, even if there is a coma", 0.23\n'
line_split = line.replace('\n', '').split(',')
quote_loc = [idx for idx, l in enumerate(line_split) if '"' in l]
quote_loc.reverse()
assert len(quote_loc) % 2 == 0, "value was odd, should be even"
for m, n in zip(quote_loc[::2], quote_loc[1::2]):
    line_split[n] = ','.join(line_split[n:m+1])
    del line_split[n+1:m+1]
print(line_split)

Find first line of text according to value in Python

How can I search for a value in the first "latitude,longitude" coordinate in a "file.txt" list in Python and get the 3 rows above and the 3 rows below the match?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
How can I search for a value in the first "latitude,longitude" coordinate in a "file.txt" list in Python and get the 3 rows above and the 3 rows below the match?
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
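A simpler variant (a sketch under the same assumptions, not part of the original answer) prints the context in file order and clamps at the start of the file:
with open("text_filter.txt") as f:
    text = f.readlines()

target = "37.0459"
match = [i for i, x in enumerate(text) if target in x]
if match:
    i = match[0]
    # max() keeps the lower slice bound at 0 when the match is near the top;
    # slicing past the end of a list is already safe in Python
    print("".join(text[max(0, i - 3):i + 4]).strip())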
You can use pandas to load the data into a DataFrame and then manipulate it easily. Since, per your question, the value to check is not an exact match, I have converted it to a string and compared prefixes.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # imports the text file as a DataFrame
value_to_check = 37.0459  # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break
print(data.iloc[i-3:i+4, :])
Output:
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
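Iterating row by row with iloc is slow on large frames; a vectorized version of the same prefix match (a sketch, not part of the original answer) could be:
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])
value_to_check = 37.0459

# Vectorized prefix match against the string form of the latitude column.
mask = data["latitude"].astype(str).str.startswith(str(value_to_check))
if mask.any():
    i = mask.idxmax()  # position of the first True in the mask
    print(data.iloc[max(0, i - 3):i + 4])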
A solution with iterators that keeps only the necessary lines in memory and doesn't read the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
#  '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would otherwise erroneously match '37.0444'.
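A quick illustration of that pitfall (hypothetical values, just for demonstration):
line = '37.0444,-95.57274'
print('37.04' in line)                        # True: substring match is too loose
print(37.04 in map(float, line.split(',')))   # False: float comparison is exact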
This solution prints the before and after elements even if there are fewer than 3 of them. I am also matching on strings, since the question implies that partial matches are wanted too, i.e. 37.0459 should match 37.04597.
search_term = '37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n'
for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break
left = 0
right = 0
for k in range(1, 4):  # because the last value is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1
for i in range(index - left, index + right + 1):  # because the last value is not included
    print(lines[i][0], lines[i][1])

How to concatenate lists from a loop into a single list in Python

I need to concatenate some lists from a loop. I am using a regex to do some extraction; re.findall returns each line's results as an individual list, but I need to get all of the results into one list. Here is my code:
import re

boy = open('mbko.txt')
girl = open('xxx.txt', 'w')
for line in boy:
    line = line.rstrip()
    z = re.findall(r'[a-zA-Z0-9]\S*#\S*[a-zA-Z]', line)
    if len(z) > 0:
        leng = len(z)
        girl.write("the total length is :{}\n".format(leng))
        girl.write(str(z))
        print z
After this I get the result:
['stephen.marquard#uct.ac.za']
['postmaster#collab.sakaiproject.org']
['source#collab.sakaiproject.org']
['apache#localhost']
['stephen.marquard#uct.ac.za']
Don't mind the naming, but I need to get the results in one single list, like:
['stephen.marquard#uct.ac.za',
'postmaster#collab.sakaiproject.org',
'source#collab.sakaiproject.org',
'apache#localhost',
'stephen.marquard#uct.ac.za']
I actually added this part after the if statement to join each line's results together,
final = []
for i in z:
    #for j in i:
    final += i
but I still can't get a good result.
One way is to use a generator:
import re

boy = open('mbko.txt')
girl = open('xxx.txt', 'w')

def yielder(boy, girl):
    for line in boy:
        line = line.rstrip()
        z = re.findall(r'[a-zA-Z0-9]\S*#\S*[a-zA-Z]', line)
        if z:
            girl.write("the total length is :{}\n".format(len(z)))
            girl.write(str(z))
            yield z[0]

print(list(yielder(boy, girl)))
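Alternatively (a sketch, not from the original answer), you can keep the original loop and flatten with list.extend, which also collects every match on a line rather than only the first:
import re

final = []
with open('mbko.txt') as boy:
    for line in boy:
        # extend appends each found address as one item; `final += i` with a
        # string i would append it character by character, which was the bug
        final.extend(re.findall(r'[a-zA-Z0-9]\S*#\S*[a-zA-Z]', line))
print(final)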

How do I quickly extract data from this massive csv file?

I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. There can be equal scaffolds and positions in different nuclei.
Using input for start and end positions (scaffold and position of each), I'm supposed to output a csv file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this by having 16 columns (one for each nucleus), and then showing the data from top to bottom. The leftmost region would be a reference genome in that range, which I accessed by creating a dictionary for each of its scaffolds.
In my code, I have a defaultdict of lists, so the key is a string which combines the scaffold and the location, while the data is an array of lists, so that for each nucleus, the data can be appended to the same location, and in the end each location has data from every nucleus.
Of course, this is very slow. How should I be doing it instead?
Code:
# let's plan this
# input is start and finish - when you hit the first, add it and keep going until you hit the next or larger
# dictionary of arrays
# loop through everything, output data for each nucleus
import csv
from collections import defaultdict

inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41,51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(lambda: defaultdict(dict))
scaffold = ''
for line in open('Allpaths_SL1_corrected.fasta', 'r'):
    if line[0] == '>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] = line.rstrip()
print('Genome dictionary done.')

with open('automated.csv', 'rt') as read:
    for line in csv.reader(read, delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3], line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count % 1000000 == 0:
            print('Checkpoint ' + str(count) + '!')

with open('region.csv', 'w') as fp:
    wr = csv.writer(fp, delimiter=',', lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0, 16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['', ''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',')+1:]) - 1], key, nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2GB and 77,822,354 lines. Of those lines, you seem to only be focused on 26,804,253 lines or about 1/3.
As a general suggestion, you can speed things up by doing the following:
Avoid processing the data you are not interested in (2/3 of the file);
Speed up identifying the data you are interested in;
Avoid the operations that are repeated millions of times and tend to be slow (processing each line as csv, reassembling strings, etc.);
Avoid reading all the data at once when you can break it up into blocks or lines (memory will get tight);
Use faster tools like numpy, pandas and PyPy.
Your data is block oriented, so you can use a FlipFlop type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python in operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        rtr = True if self.state else False
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct, since I spent no time trying to understand your files overall):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[2], row[3]), []).append([row[3], row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in CPython 2.7.

How to Check if a RE Substitution in Python Was Performed

I'm trying to check whether a regular expression substitution was performed on a specific line of the opened document, and if so, increment a count variable by 1. If the count exceeds 2 I want it to stop. The code below is what I have so far.
for line in book:
    if count <= 2:
        reg1 = re.sub(r'Some RE', r'Replaced with..', line)
        f.write(reg1)
        # "if reg1 was performed, add 1 to the count variable"
Definitely the best way of doing this is to use re.subn() instead of re.sub().
re.subn() returns a tuple (new_string, number_of_changes_made), so it's perfect for this:
count = 0
for line in book:
    if count <= 2:
        reg1, num_of_changes = re.subn(r'Some RE', r'Replaced with..', line)
        f.write(reg1)
        if num_of_changes > 0:
            count += 1
If the idea is to determine if a substitution was performed on the line, it is fairly simple:
count = 0
for line in book:
    if count <= 2:
        reg1 = re.sub(r'Some RE', r'Replaced with..', line)
        f.write(reg1)
        count += int(reg1 != line)  # the line changed, so a substitution was performed
You can pass a function to re.sub as the replacement value. This lets you do things like the following (though a simple search-then-sub approach, while slower, would be easier to reason about):
import re

class Counter(object):
    def __init__(self, start=0):
        self.value = start

    def incr(self):
        self.value += 1

book = """This is some long text
with the text 'Some RE' appearing twice:
Some RE see?
"""

def countRepl(replacement, counter):
    def replacer(matchobject):
        counter.incr()
        return replacement
    return replacer

counter = Counter(0)
print re.sub(r'Some RE', countRepl('Replaced with..', counter), book)
print counter.value
This produces the following output:
This is some long text
with the text 'Replaced with..' appearing twice:
Replaced with.. see?
2
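In Python 3 (a sketch, not part of the original answer) the same counting can be done with a closure over a nonlocal variable instead of a Counter class:
import re

def make_replacer(replacement):
    count = 0
    def replacer(match):
        nonlocal count
        count += 1
        return replacement
    return replacer, lambda: count  # the lambda reads the shared count cell

replacer, get_count = make_replacer('Replaced with..')
result = re.sub(r'Some RE', replacer, "Some RE here and Some RE there")
print(result)       # Replaced with.. here and Replaced with.. there
print(get_count())  # 2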
You could compare it to the original string to see if it changed:
count = 0
for line in book:
    if count <= 2:
        reg1 = re.sub(r'Some RE', r'Replaced with..', line)
        f.write(reg1)
        if line != reg1:
            count += 1
subn will tell you how many substitutions were made in the line and the count parameter will limit the number of substitutions that will be attempted. Put them together and you have code that will stop after two substitutions, even if there are multiple subs on a single line.
look_count = 2
for line in book:
    reg1, sub_count = re.subn(r'Some RE', r'Replaced with..', line, count=look_count)
    f.write(reg1)
    look_count -= sub_count
    if not look_count:
        break
