I have data in a text file that is space separated into right aligned columns. I would like to be able to take each column and put it in a list, basically like you would do with an array. I can't seem to find an equivalent to
left(strname,#ofcharacters)/mid(strname,firstcharacter,lastcharacter)/right(strname,#ofcharacters)
like you would normally use in VB to accomplish the task. How do I separate off the data and put each like 'unit' with its value next from the next line in Python.
Is it possible? Oh yeah, some spacing is 12 characters apart(right aligned) while others are 15 characters apart.
-1234 56 32452 68584.4 Extra_data
-5356 9 546 12434.5 Extra_data
- 90 12 2345 43522.1 Extra_data
Desired output:
[-1234, -5356, -90]
[56, 9, 12]
[32452, 546, 2345]
etc
The equivalent method in python you are looking for is str.split() without any arguments to split the string on whitespaces. It will also take care of any trailing newline/spaces and as in your VB example, you do not need to care about data width.
Example
with open("data.txt") as fin:
data = map(str.split, fin) #Split each line of data on white-spaces
data = zip(*data) #Transpose the Data
But if you have columns with whitespaces, you need some to split the data, based on column position
>>> def split_on_width(data, pos):
if pos[-1] != len(data):
pos = pos + (len(data), )
indexes = zip(pos, pos[1:]) #Create an index pair with current start and
#end as next start
return [data[start: end].strip() for start, end in indexes] #Slice the data using
#the indexes
>>> def trynum(n):
try:
return int(n)
except ValueError:
pass
try:
return float(n)
except ValueError:
return n
>>> pos
(0, 5, 13, 22, 36)
>>> with open("test.txt") as fin:
data = (split_on_width(data.strip(), pos) for data in fin)
data = [[trynum(n) for n in row] for row in zip(*data)]
>>> data
[[-1234, -5356, -90], [56, 9, 12], [32452, 546, 2345], [68584.4, 12434.5, 43522.1], ['Extra_data', 'Extra_data', 'Extra_data']]
Just use str.split() with no arguments; it splits an input string on arbitrary width whitespace:
>>> ' some_value another_column 123.45 42 \n'.split()
['some_value', 'another_column', '123.45', '42']
Note that any columns containing whitespace would also be split.
If you wanted to have lists if columns, you need to transpose the rows:
with open(filename) as inputfh:
columns = zip(*(l.split() for l in inputfh))
Related
I have a file with multiple FASTA sequences such as:
File1.fa
>seq1
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
And I have a table such as:
tab
Seq positions
seq1 3:10
seq2 10:20,45:60
And I would like for each tab['Seq'] to replace letters by a X for each corresponding seqn positions within File1.fa
As you can see for the second row, I can have multiple positions to replace (these positions are separated by , in the tab['positions'] column.
Here I should then get a new_File1.fa such as:
>seq1
AAXXXXXXXXATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTAXXXXXXXXXXXATTACCATTACCATTACCATTACCXXXXXXXXXXXXXXXXTATTATTATTATATACCACACA
where for seq1 I replace X from positions 3 to 10, and for seq2 I replaced X from positions 10 to 20 and from positions 45 to 60 positions.
I guess using biopython package should be a solution here?
So far I tried the following:
record_dict = SeqIO.to_dict(SeqIO.parse("File1.fa, "fasta"))
import re
for index, row in tab.iterrows():
start= re.sub(":*.","",row['positions'])
end= re.sub(".*:","",row['positions'])
print(record_dict[Seq].seq[start-end])
But as you can see I only manage to extract the part I want to replace with X and I cannot figure out how to take into account when there are multiple positions to replace in the sequence.
Convert the sequences to lists, replace your chosen ranges then covert back to a string. For example,
seq = "AAABBBCCC"
s = list(seq)
for idx in range(3, 6):
s[idx] = "X"
new_seq = ''.join(s)
print(new_seq)
How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.
How can I do a search of a value of the first "latitude, longitude"
coordinate in a "file.txt" list in Python and get 3 rows above and 3
rows below?*
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data in a dataframe and then easily manipulate it. As per your question the value to check is not the exact match and therefore I have converted it to string.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude","longitude"]) #imports text file as dataframe
value_to_check = 37.0459 # user defined
for i in range(len(data)):
if str(value_to_check) == str(data.iloc[i,0])[:len(str(value_to_check))]:
break
print(data.iloc[i-3:i+4,:])
output
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
queue = deque(maxlen=before)
with open(file) as f:
for line in f:
if target in map(float, line.split(',')):
out = list(queue) + [line] + list(islice(f, 3))
return out
queue.append(line)
else:
raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04565,-95.58073\n', '37.04565,-95.58073\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there is less than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
This solution will print the before and after elements even if they are less than 3.
Also I am using string as it is implied from the question that you want partial matches also. ie. 37.0459 will match 37.04597
search_term='37.04462'
with open('file.txt') as f:
lines = f.readlines()
lines = [line.strip().split(',') for line in lines] #remove '\n'
for lat,lon in lines:
if search_term in lat:
index=lines.index([lat,lon])
break
left=0
right=0
for k in range (1,4): #bcoz last one is not included
if index-k >=0:
left+=1
if index+k<=(len(lines)-1):
right+=1
for i in range(index-left,index+right+1): #bcoz last one is not included
print(lines[i][0],lines[i][1])
I have a text file that has 5000 lines. It's format is like that:
1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build
I want to make different lists for example:
for second column:
second_build=[3,1]
second_demolish=[6,6,4]
I've tried something like that:
with open('cons.data') as file:
second_build=[line.split(',')[1] for line in file if line.split(',')[7]=='build']
But It did not work.
You can get values for each column/action as follows:
lines = """1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build""".split(
"\n"
)
build_cols = [list() for _ in range(7)]
demolish_cols = [list() for _ in range(7)]
data = {"build": build_cols, "demolish": demolish_cols}
for line in lines:
tokens = line.split(",")
for bc, tok in zip(data[tokens[-1]], tokens):
bc.append(tok)
# to access second column build values:
print(build_cols[1])
# ['3', '1']
For example, build_cols stores a list of lists, each entry represents a column. For each build line, you append items from an appropriate column to the corresponding position in the build_cols.
Just simply first make the readlines a variable, then in the list comprehension simply add a rstrip then will work, because the values (except the last) all have '\n' at the end, so strip them out, and make them integers:
with open('cons.data') as file:
f=file.readlines()
second_build=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='build']
second_demolish=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='demolish']
And now:
print(second_build)
print(second_demolish)
Is:
[3, 1]
[6, 6, 4]
This is the first line of my txt.file
0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00
There should be 8 columns, sometimes separated with '-', sometimes with '.'. It's very confusing, I just have to work with the file, I didn't generate it.
And second question: How can I work with the different columns? There is no header, so maybe:
df.iloc[:,0] .. ?
As stated in comments, this is likely a list of numbers in scientific notation, that aren't separated by anything but simply glued together.
It could be interpreted as:
0.112296E+02
-.121994E-010
.158164E-030
.158164E-030
.000000E+000
.340000E+030
.328301E-010
.000000E+00
or as
0.112296E+02
-.121994E-01
0.158164E-03
0.158164E-03
0.000000E+00
0.340000E+03
0.328301E-01
0.000000E+00
Assuming the second interpretation is better, the trick is to split evenly every 12 characters.
data = [line[i:i+12] for i in range(0, len(line), 12)]
If really the first interpretation is better, then I'd use a REGEX
import re
line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
pattern = '[+-]?\d??\.\d+E[+-]\d+'
data = re.findall(pattern, line)
Edit
Obviously, you'd need to iterate over each line in the file, and add it to your dataframe. This is a rather inefficient thing to do in Pandas. Therefore, if your preferred interpretation is the fixed width one, I'd go with #Ev. Kounis ' answer: df = pd.read_fwf(myfile, widths=[12]*8)
Otherwise, the inefficient way is:
df = pd.DataFrame(columns=range(8))
with open(myfile, 'r') as f_in:
for i, lines in enumerate(f_in):
data = re.findall(pattern, line)
df.loc[i] = [float(d) for d in data]
The two things to notice here is that the DataFrame must be initialized with column names (here [0, 1, 2, 3..7] but perhaps you know of better identifiers); and that the regex gave us strings that must be casted to floats.
As i said in the comments, it is not a case of multiple separators, it is just a fixed width format. Pandas has a method to read such files. try this:
df = pd.read_fwf(myfile, widths=[12]*8)
print(df) # prints -> [0.112296E+02, -.121994E-01, 0.158164E-03, 0.158164E-03.1, 0.000000E+00, 0.340000E+03, 0.328301E-01, 0.000000E+00.1]
for the widths you have to provide the cell width which looks like its 12 and the number of columns which as you say must be 8.
As you might notice the results of the read are not perfect (notice the .1 just before the comma in the 4th and last element) but i am working on it.
Alternatively, you can do it "manually" like so:
myfile = r'C:\Users\user\Desktop\PythonScripts\a_file.csv'
width = 12
my_content = []
with open(myfile, 'r') as f_in:
for lines in f_in:
data = [float(lines[i * width:(i + 1) * width]) for i in range(len(lines) // width)]
my_content.append(data)
print(my_content) # prints -> [[11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]]
and every row would be a nested list.
A possible solution is the following:
row = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
chunckLen = 12
for i in range(0, len(row), chunckLen):
print(row[0+i:chunckLen+i])
You can easly extend the code to handle more general cases.
I have a text file like this:
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
.
.
51 21 41 42
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
.
.
The four-tuples are tab delimited.
For each ID, I'm trying to make a dictionary with the ID being the key and the four-tuples (in list/tuple form) as elements of that key.
dict = {31: (1,2,12,40),(2,3,4,21)....., 32:(3,2,12,40), (4,3,4,21)..
My string parsing knowledge is limited to adding using a reference object for file.readlines(), using str.replace() and str.split() on 'ID = '. But there has to be a better way. Here some beginnings of what I have.
file = open('text.txt', 'r')
fp = file.readlines()
B = [];
for x in fp:
x.replace('\t',',')
x.replace('\n',')')
B.append(x)
something like this:
ll = []
for line in fp:
tt = tuple(int(x) for x in line.split())
ll.append(tt)
that will produce a list of tuples to assign to the key for your dictionary
Python's great for this stuff, why not write up a 5-10 liner for it? It's kind of what the language is meant to excel at.
$ cat test
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
data = {}
for block in open('test').read().split('ID = '):
if not block:
continue #empty line
lines = block.split('\n')
ID = int(lines[0])
tups = map(lambda y: int(y), [filter(lambda x: x, line.split('\t')) for line in lines[4:]])
data[ID] = tuple(filter(lambda x: x, tups))
print(data)
# {34: ([3, 2, 12, 40], [4, 3, 4, 21]), 31: ([1, 2, 12, 40], [2, 3, 4, 21])}
Only annoying thing is all the filters - sorry, that's just the result of empty strings and stuff from extra newlines, etc. For a one-off little script, it's no biggie.
I think this will do the trick for you:
import csv
def parse_file(filename):
"""
Parses an input data file containing tags of the form "ID = ##" (where ## is a
number) followed by rows of data. Returns a dictionary where the ID numbers
are the keys and all of the rows of data are stored as a list of tuples
associated with the key.
Args:
filename (string) name of the file you want to parse
Returns:
my_dict (dictionary) dictionary of data with ID numbers as keys
"""
my_dict = {}
with open(filename, "r") as my_file: # handles opening and closing file
rows = my_file.readlines()
for row in rows:
if "ID = " in row:
my_key = int(row.split("ID = ")[1]) # grab the ID number
my_list = [] # initialize a new data list for a new ID
elif row != "\n": # skip rows that only have newline char
try: # if this fails, we don't have a valid data line
my_list.append(tuple([int(x) for x in row.split()]))
except:
my_dict[my_key] = my_list # stores the data list
continue # repeat until done with file
return my_dict
I made it a function so that you can it from anywhere, just passing the filename. It makes assumptions about the file format, but if the file format is always what you showed us here, it should work for you. You would call it on your data.txt file like:
a_dictionary = parse_file("data.txt")
I tested it on the data that you gave us and it seems to work just fine after deleting the "..." rows.
Edit: I noticed one small bug. As written, it will add an empty tuple in place of a new line character ("\n") wherever that appears alone on a line. To fix this, put the try: and except: clauses inside of this:
elif row != "\n": # skips rows that only contain newline char
I added this to the full code above as well.