How to extract specific data from a text file in python

I am trying to extract specific data from a text file so I can use it in another function. I found an approach that looks like it should work, but it fails. Am I doing something wrong, or is there a better way to do this? I want the first column of the text file, the "distances", without the "km" unit.
This is the text file:
Distances Times Dates Total distance & time
00 km 00:00:00 h 0000-00-00 00 km ; 00:00:00 h
28 km 01:30:21 h 2020-3-2 28 km ; 01:30:21 h
50 km 02:12:18 h 2020-4-8 78 km ;
This is the code:
all_distances = []
with open("Bike rides.txt", "r") as f:
    lines = f.readlines()
for l in lines[1:]:
    all_distances.append(l.split()[0])
print(all_distances)
The error I get is this:
IndexError: list index out of range

Try this one:
#!/usr/bin/env python3
all_distances = []
for l in open("rides.txt").readlines()[1:]:
    l = l.split(" ")[0]
    all_distances.append(l)
print(all_distances)

Since the columns are separated by whitespace, you can split them with the str.split() method. Below is an example of its application.
column = 0  # First column
with open("data.txt") as file:
    data = file.readlines()
columns = list(map(lambda x: x.strip().split()[column], data))

The error I get is this: "IndexError: list index out of range"
This suggests a problem with blank lines: .split() returns an empty list for them. To ignore such lines, replace:
all_distances.append(l.split()[0])
with:
splitted = l.split()
if splitted:
    all_distances.append(splitted[0])
Explanation: in Python, empty lists are falsy and non-empty lists are truthy, so the code inside the if block runs only when the list has at least one element.
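Putting the pieces together, here is a minimal sketch that skips the header, ignores blank lines, and converts the distances to integers. The function name `read_distances` and the in-memory sample are illustrative assumptions, not part of the original question.

```python
def read_distances(lines):
    """Return the first column of each data row as an int, skipping blank lines."""
    distances = []
    for line in lines[1:]:          # lines[0] is the header row
        parts = line.split()        # split on any run of whitespace
        if parts:                   # an empty list means the line was blank
            distances.append(int(parts[0]))
    return distances


sample = [
    "Distances Times Dates Total distance & time\n",
    "00 km 00:00:00 h 0000-00-00 00 km ; 00:00:00 h\n",
    "28 km 01:30:21 h 2020-3-2 28 km ; 01:30:21 h\n",
    "\n",                           # a blank line like this causes the IndexError
    "50 km 02:12:18 h 2020-4-8 78 km ;\n",
]
print(read_distances(sample))  # [0, 28, 50]
```

With a real file, the same function could be called as `read_distances(f.readlines())` inside a `with open(...) as f:` block.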

Related

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
You can try:
with open("text_filter.txt") as f:
    text = f.readlines()  # read text lines into a list
target = "37.0459"
match = [i for i, x in enumerate(text) if target in x]  # indices of lines containing the target
if match:
    first = match[0]
    start = max(0, first - 3)  # clamp at the start of the file
    print("".join(text[start:first + 4]).strip())  # 3 lines before, the match, 3 lines after
Output:
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948

You can use pandas to import the data into a DataFrame and then manipulate it easily. Since the value to check is not an exact match, I convert both sides to strings and compare prefixes.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # import text file as a DataFrame
value_to_check = 37.0459  # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break
print(data.iloc[i-3:i+4, :])
Output:
    latitude  longitude
9   37.04502  -95.58204
10  37.04516  -95.58184
11  37.04572  -95.58139
12  37.04597  -95.58127
13  37.04565  -95.58073
14  37.04546  -95.58033
15  37.04516  -95.57948

A solution with iterators that keeps only the necessary lines in memory and does not read the rest of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)  # holds at most `before` previous lines
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
It also works when there are fewer than the expected number of lines before or after the match. We match floats, not strings, because '37.04' would otherwise erroneously match '37.0444'.

This solution prints the rows before and after the match even if there are fewer than 3 of them.
I am also using strings, since the question implies that you want partial matches too, i.e. 37.0459 will match 37.04597.
search_term = '37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n' and split into [lat, lon]
for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break
left = 0
right = 0
for k in range(1, 4):  # k = 1, 2, 3; the end of range() is excluded
    if index - k >= 0:
        left += 1
    if index + k <= len(lines) - 1:
        right += 1
for i in range(index - left, index + right + 1):  # +1 because the end of range() is excluded
    print(lines[i][0], lines[i][1])
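The same idea can be sketched more compactly with slicing, which clamps at the end of the list on its own. The function name `context_window` and the choice of a prefix match on the latitude column are assumptions made for illustration; the answers above use substring or float comparisons instead.

```python
def context_window(lines, prefix, before=3, after=3):
    """Return the first line whose latitude starts with `prefix`, plus its neighbours."""
    for i, line in enumerate(lines):
        latitude = line.split(",")[0].strip()
        if latitude.startswith(prefix):
            start = max(0, i - before)      # clamp so the slice never wraps around
            return lines[start:i + after + 1]
    return []                               # no line matched


rows = ["37.04502,-95.58204", "37.04516,-95.58184", "37.04572,-95.58139",
        "37.04597,-95.58127", "37.04565,-95.58073", "37.04546,-95.58033",
        "37.04516,-95.57948", "37.04508,-95.57914"]
for row in context_window(rows, "37.0459"):
    print(row)
```

Returning an empty list when nothing matches avoids the undefined `index` that the loop-based version would hit in that case.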

Python read .txt File -> list

I have a .txt File and I want to get the values in a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values with:
direction = []
for line in open(filename):
    value, t = line.strip().split(',')
    value = float(value)
    t = long(t)
    direction.append([value, t])
return direction
But I have a big problem: when creating the data I forgot to insert a "\n" after each row.
That's why I have this format:
value0,timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get this data into a list the way I want it? It would be a lot of work to collect the data again.
Thanks
Max

import re
data = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", data):
    print(value, timestamp)

You will have to strip the last comma, but you can insert a comma after the 13 characters that follow each existing comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.46467665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,1466615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,1466615025741-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})",r"\1"+",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.46467665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,1466615025652,-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,1466615025741,-1.65375,1466615025772,-1.6092329,1466615025801,

I coded a quick example using your sample, with len("timestamp") instead of 13 so you can adapt it:
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i, c in enumerate(instr):
    if c == ",":
        next_i = i + len("timestamp") + 1
        print(instr[previous_i:next_i])
        previous_i = next_i
The output is unscrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp

I think you could do something like this:
direction = []
for line in open(filename):
    parts = line.split(',')
    v = parts[0]
    for s in parts[1:]:
        t = s[:13]
        direction.append([float(v), long(t)])
        v = s[13:]
If you're using Python 3.x, the long function no longer exists; use int instead.
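For Python 3, the approaches above can be combined into one hedged sketch that returns the list of pairs directly. The helper name `parse_records` is hypothetical, and the pattern assumes each value is a plain decimal number (optional minus sign, no exponent) followed by a comma and exactly 13 digits.

```python
import re

# One record = a signed decimal value, a comma, then exactly 13 digits.
RECORD = re.compile(r"(-?\d+(?:\.\d+)?),(\d{13})")


def parse_records(text):
    """Split run-together 'value,timestamp' records into [float, int] pairs."""
    return [[float(value), int(stamp)] for value, stamp in RECORD.findall(text)]


raw = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531"
for record in parse_records(raw):
    print(record)
```

Because the timestamp is matched as exactly 13 digits, the next value starts right where the previous timestamp ends, which is what makes the missing newlines recoverable.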

How to sort a large number of lists to get a top 10 of the longest lists

So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temporary list.
How would I create a top-ten list of the lists with the most elements in the whole file?
This is the code I have now:
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0
for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like:
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']

# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())
# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))
# print the first 10 lines
print(lines[:10])
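With around 400,000 rows, sorting everything just to keep ten items is more work than needed. As an alternative sketch, the standard library's `heapq.nlargest` keeps only the best `n` candidates while streaming through the rows. The function name `longest_lists` and the tiny sample are assumptions for illustration.

```python
import heapq


def longest_lists(lines, n=10):
    """Return the n rows with the most whitespace-separated items."""
    rows = (line.split() for line in lines)     # lazy: one row at a time
    return heapq.nlargest(n, rows, key=len)


sample = ["1 2 3", "4 5", "6 7 8 9", "10"]
print(longest_lists(sample, n=2))  # [['6', '7', '8', '9'], ['1', '2', '3']]
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines.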

Why not use collections to display the top 10? i.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))

When wanting to count and then find the one(s) with the highest counts, collections.Counter comes to mind:
from collections import Counter
lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)
print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']

Use a for loop and max(), maybe? You say you've got a for loop that's placing the values into a temporary list. From that you could use max() to pick out the largest value and put it into a list.
As an explicit for loop, appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to repeat the same process on the new list(s) until you get to the desired top 10.
EDIT: I've just realised that I misread your question. You want the lists with the most elements, not the highest values.
len() is a good fit for this:
longest = []
for x in data:
    if len(x) > len(longest):
        longest = x
That gives you the current longest list, and from there you could build a top 10 of lengths, of the lists themselves, or both.

If your data is really as shown with each number the same length, then I would make a dictionary with key = line, value = length, get the top value / key pairs in the dictionary and voila. Sounds easy enough.

Counting DNA sequences with python/biopython

My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:
>contig00001
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT
The script counts the CCCCAAAA sequence here 3 times
CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)
Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.
I've been unsuccessful in my attempts so far.
My script is as follows...
from Bio import SeqIO
input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')
for cur_record in SeqIO.parse(input_file, "fasta") :
    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')
    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)
    splittedContig1=contig.split(CCCCAAAA_count)
    splittedContig2=contig.split(GGGGTTTT_count)
    cnt1=len(splittedContig1)-1
    cnt2=len(splittedContig2)
    cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
    cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count
    output_line = '%s\t%i\t%i\n' % \
        (CONTIG, CCCCAAAA_count, GGGGTTTT_count)
    output_file.write(output_line)
output_file.close()
input_file.close()

You can use split and a list comprehension with startswith as follows:
contig = "CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase = "CCCCAAAA"
halfBase = "CCCC"
splittedContig = contig.split(splitbase)
cnt = len(splittedContig) - 1
print(cnt + sum([0.5 for e in splittedContig if e.startswith(halfBase)]))
Output:
3.5
Split the string on CCCCAAAA. This gives a list from which every CCCCAAAA occurrence has been removed.
The length of the list minus 1 is the number of occurrences of CCCCAAAA.
Among the list elements, look for those that start with CCCC. Add 0.5 to the count for each one found.
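The three steps above can be wrapped in a small reusable function so the same logic serves both motifs from the question. This is only a sketch: the name `count_with_half` is hypothetical, and it works on plain Python strings, so a Biopython record would presumably need `str(cur_record.seq)` first.

```python
def count_with_half(sequence, motif, half):
    """Count `motif`, adding 0.5 for each leftover piece that starts with `half`."""
    pieces = sequence.split(motif)
    full = len(pieces) - 1                       # one split point per occurrence
    partial = sum(0.5 for piece in pieces if piece.startswith(half))
    return full + partial


contig = "CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
print(count_with_half(contig, "CCCCAAAA", "CCCC"))  # 3.5
print(count_with_half(contig, "GGGGTTTT", "GGGG"))  # 0
```

When writing these counts to the TSV file, a float format such as `%.1f` would be needed in place of `%i`, otherwise the half count is truncated.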

python how to split text into new list

I have numerous lines of text that I would like to put into a list:
123456 123456 123456 234567 234567 4567890
243564 194563 432423 764575 542354 6564536
I think you get the idea. The values are space separated, and each should be its own element. There are 73 values per line and about 144 lines. I know how to split based on the column:
d = list(zip(*(e.split() for e in b)))
How do I split based on the row? I want d[0] = '123456,123456,123456,234567,234567,4567890'
not d[0] = '123456,243564'
The above line splits the list up in a way I don't want.
EXTRA: Let me add one more thing.
The values in the list are decimal numbers. Is there a way to also round the numbers when I split the list?
f = np.round(float([e.split() for e in d]),2)
That only gives me the error 'float() argument must be a string or a number'

Remove the zip(); a list comprehension is enough here:
d = [e.split() for e in b]
If you need integers, you could use:
d = [[int(v) for v in e.split()] for e in b]
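For the rounding part of the question, the same nested comprehension works with `float` and the built-in `round`. The error in the question arises because `float()` was given a whole list instead of one string at a time. The sample values in `b` below are made up for illustration.

```python
b = ["1.23456 2.34567 3.45678", "4.56789 5.67891 6.78912"]

# float() accepts one string at a time, so convert inside the inner loop.
d = [[round(float(v), 2) for v in e.split()] for e in b]
print(d[0])  # [1.23, 2.35, 3.46]
```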

If you insist on the commas:
with open('data.txt', 'r') as f:
    d = [",".join(var.rstrip().split()) for var in f.readlines()]
print(d[0])
print(d[1])
Output:
123456,123456,123456,234567,234567,4567890
243564,194563,432423,764575,542354,6564536
