read text file in python and extract specific value in each line? - python

I have a text file where each line looks like this:
n:1 mse_avg:8.46 mse_y:12.69 mse_u:0.00 mse_v:0.00 psnr_avg:38.86 psnr_y:37.10 psnr_u:inf psnr_v:inf
n:2 mse_avg:12.20 mse_y:18.30 mse_u:0.00 mse_v:0.00 psnr_avg:37.27 psnr_y:35.51 psnr_u:inf psnr_v:inf
I need to read each line and extract psnr_y and its value into a matrix. Does Python have functions for reading a text file like this? I need to extract psnr_y from each line. I have MATLAB code that does this, but I need a Python version and I am not familiar with Python's functions. Could you please help me with this?
This is the MATLAB code:
opt = {'Delimiter',{':',' '}};
fid = fopen('data.txt','rt');
nmc = nnz(fgetl(fid)==':');
frewind(fid);
fmt = repmat('%s%f',1,nmc);
tmp = textscan(fid,fmt,opt{:});
fclose(fid);
fnm = [tmp{:,1:2:end}];
out = cell2struct(tmp(:,2:2:end),fnm(1,:),2)

You can use a regex like below:
import re
with open('textfile.txt') as f:
    a = f.readlines()
pattern = r'psnr_y:([\d.]+)'
for line in a:
    print(re.search(pattern, line)[1])
This code prints only psnr_y's value. You can replace [1] with [0] to get the full matched string, like "psnr_y:37.10".
If you want to collect the values in a list, the code would look like this:
import re
a_list = []
with open('textfile.txt') as f:
    a = f.readlines()
pattern = r'psnr_y:([\d.]+)'
for line in a:
    a_list.append(re.search(pattern, line)[1])

Use the regular expression
r'psnr_y:([\d.]+)'
on each line you read, and extract match.group(1) from the result.
If needed, convert it to a float: float(match.group(1))
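Putting those steps together, here is a minimal sketch (assuming the input file is called data.txt, as in the MATLAB snippet in the question):
import re
pattern = re.compile(r'psnr_y:([\d.]+)')
psnr_y_values = []
with open('data.txt') as f:              # file name assumed from the MATLAB example
    for line in f:
        match = pattern.search(line)
        if match:                        # skip any line without a psnr_y field
            psnr_y_values.append(float(match.group(1)))
print(psnr_y_values)                     # e.g. [37.1, 35.51]
Converting with float() gives you numeric values that are ready to be put into a matrix.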

Since I hate regex, I would suggest:
s = 'n:1 mse_avg:8.46 mse_y:12.69 mse_u:0.00 mse_v:0.00 psnr_avg:38.86 psnr_y:37.10 psnr_u:inf psnr_v:inf \nn:2 mse_avg:12.20 mse_y:18.30 mse_u:0.00 mse_v:0.00 psnr_avg:37.27 psnr_y:35.51 psnr_u:inf psnr_v:inf'
lst = s.split('\n')
out = []
for line in lst:
    psnr_y_pos = line.index('psnr_y:')
    next_key = line[psnr_y_pos:].index(' ')          # offset of the space after the value
    psnr_y = line[psnr_y_pos+7:psnr_y_pos+next_key]  # 7 == len('psnr_y:')
    out.append(psnr_y)
print(out)
out is a list of the values of psnr_y in each line.

For a simple answer with no need to import additional modules, you could try:
rows = []
with open("my_file", "r") as f:
    for row in f.readlines():
        value_pairs = row.strip().split(" ")
        print(value_pairs)
        values = {pair.split(":")[0]: pair.split(":")[1] for pair in value_pairs}
        print(values["psnr_y"])
        rows.append(values)
print(rows)
This gives you a list of dictionaries (essentially a JSON-like structure, but with Python objects).
This probably won't be the fastest solution, but the structure is nice and you don't have to use regex.
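If you want numeric values instead of strings (closer to what the MATLAB cell2struct call produces), float() parses both the decimal values and the inf entries, so a variation on the same idea could look like this (a sketch, reusing the file name from the answer above):
rows = []
with open("my_file") as f:               # same file name as in the answer above
    for row in f:
        pairs = (item.split(":") for item in row.split())
        rows.append({key: float(value) for key, value in pairs})
print([r["psnr_y"] for r in rows])       # e.g. [37.1, 35.51]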

import fileinput
import re
for line in fileinput.input():
    row = dict([s.split(':') for s in re.findall(r'[\S]+:[\S]+', line)])
    print(row['psnr_y'])
To verify:
python script_name.py < /path/to/your/dataset.txt

Related

How to extract multiple numbers delimited by brackets from a txt document into a python list?

So I have a .txt file which looks like the following:
[some_strings] id:[1227194]
[some_strings] id:[1227195]
[some_strings] id:[1227196]
What I need to do is extract all the numbers between the brackets [] and append them to a list, which I will then use for further analysis.
The final result should then be:
list = [1227194,1227195,1227196]
What would be the most pythonic way to achieve this?
Try this:
import re
filename = 'text.txt'
with open(filename) as file:
    lines = file.readlines()
lines = " ".join(line.rstrip() for line in lines)
num_list = re.findall(r'\d+', lines)
print(num_list)
output:
['1227194', '1227195', '1227196']
import re
s = """[some_strings] id:[1227194]
[some_strings] id:[1227195]
[some_strings] id:[1227196]"""
l = re.findall(r'\[(\d+)\]', s, re.MULTILINE)
print(l) #['1227194', '1227195', '1227196']
You can use .split; another way would be to use regex:
s = """
[some_strings] id:[1227194]
[some_strings] id:[1227195]
[some_strings] id:[1227196]
"""
[line.rstrip(']').split('[')[-1] for line in s.split('\n') if line != '']
# ['1227194', '1227195', '1227196']
A way I think would be:
ids = []
with open('text.txt') as file:
    for line in file.readlines():
        id_with_bracket = line.split('id:[')[1].strip()
        ids.append(int(id_with_bracket[:-1]))
import re
with open('text.txt', 'r') as f:
    data = re.findall(r'\d+', f.read())
print(data)  # ['1227194', '1227195', '1227196']
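Note that all of the answers above return the numbers as strings; if you want actual integers in the final list, as in the expected output in the question, you can convert the matches, for example:
import re
with open('text.txt') as f:              # same file name as in the answers above
    ids = [int(n) for n in re.findall(r'\d+', f.read())]
print(ids)  # [1227194, 1227195, 1227196]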

Removing punctuation and change to lowercase in python CSV file

The code below allows me to open the CSV file and change all the text to lowercase. However, I am having difficulty also removing the punctuation in the CSV file. How can I do that? Do I use string.punctuation?
file = open('names.csv','r')
lines = [line.lower() for line in file]
with open('names.csv','w') as out:
    out.writelines(sorted(lines))
print(lines)
A sample of a few lines from the file:
Justine_123
ANDY*#3
ADRIAN
hEnNy!
You can achieve this by importing string and making use of the example code below.
The other way you can achieve this is by using regex.
import string
str(lines).translate(None, string.punctuation)
(Note that this two-argument form of translate() only works on Python 2 strings; the Python 3 equivalent using str.maketrans is shown in the working example below.)
Also, you may want to learn more about how the string module works and its features.
The working example you asked for:
import string
with open("sample.csv") as csvfile:
    lines = [line.lower() for line in csvfile]
print(lines)
will give you ['justine_123\n', 'andy*#3\n', 'adrian\n', 'henny!']
punc_table = str.maketrans({key: None for key in string.punctuation})
new_res = str(lines).translate(punc_table)
print(new_res)
The result will be justine123n andy3n adriann henny (the stray n characters are left over from the \n escape sequences once their backslashes are stripped as punctuation).
Example with regular expressions:
import csv
import re
filename = 'names.csv'
def reg_test(name):
    reg_result = ''
    with open(name, 'r') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row = re.sub('[^A-Za-z0-9]+', '', str(row))
            reg_result += row + ','
    return reg_result
print(reg_test(filename).lower())
justine123,andy3,adrian,henny,
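If the goal is simply to rewrite the file in lowercase with the punctuation stripped, the two steps can also be combined into one pass. A minimal sketch, assuming the input is names.csv and the cleaned lines are written to a new file (the output file name is just an example):
import string
punc_table = str.maketrans('', '', string.punctuation)   # deletes every punctuation character
with open('names.csv') as f:
    cleaned = [line.lower().translate(punc_table) for line in f]
with open('names_clean.csv', 'w') as out:                 # hypothetical output file name
    out.writelines(cleaned)
print(cleaned)  # e.g. ['justine123\n', 'andy3\n', 'adrian\n', 'henny']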

how to create a list and sum up

I am relatively new to python and got stuck on the below:
Below is the code I am working with
import re
handle = open('RegExWeek2.txt')
for line in handle:
    line = line.rstrip()
    x = re.findall('[0-9]+', line)
    if len(x) > 0:
        print x
The return from this code looks like this:
['7430']
['9401', '9431']
['2248', '2047']
['5517']
['3184', '1241']
['9939']
['2185', '9450', '8428']
['369']
['3683', '6442', '7654']
Question: how do I combine these into one list and sum up the numbers?
Please help
You may change your code like this:
import re
handle = open('RegExWeek2.txt')
num = []
for line in handle:
    num.extend(re.findall('[0-9]+', line))
print sum(int(i) for i in num)
Since you're using re.findall, the line.rstrip() call is not necessary.
Also, because the pattern uses + after [0-9] (one or more repetitions) rather than * (zero or more), findall only ever returns whole runs of digits and never empty-string matches, so the if len(x) > 0 check can be dropped as well.
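To illustrate the + versus * point, here is a quick demonstration (the sample string is made up just for this example):
import re
sample = 'abc 12 de3'
print(re.findall(r'[0-9]+', sample))  # ['12', '3'] - only complete runs of digits
print(re.findall(r'[0-9]*', sample))  # also returns an empty string at every position with no digits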
There's no need to rstrip, and you should open files using with:
import re
all_numbers = []
with open('RegExWeek2.txt') as file:
    for line in file:
        numbers = re.findall('[0-9]+', line)
        for number in numbers:
            all_numbers.append(int(number))
print(sum(all_numbers))
This is really beginner code, and a direct translation of yours. Here's how I would write it:
with open('RegExWeek2.txt') as file:
    all_numbers = [int(num) for num in re.findall('[0-9]+', file.read())]
print(sum(all_numbers))

Extracting data from a file using regular expressions and storing in a list to be compiled into a dictionary- python

I've been trying to extract both the species name and the sequence from a file as shown below, in order to compile a dictionary with the key corresponding to the species name (FOXP2_MOUSE, for example) and the value corresponding to the amino acid sequence.
Sample fasta file:
>sp|P58463|FOXP2_MOUSE
MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
>sp|Q8MJ98|FOXP2_PONPY
MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
I've tried using the code below:
import re
InFileName = "foxp2.fasta"
InFile = open(InFileName, 'r')
Species = []
Sequence = []
reg = re.compile('FOXP2_\w+')
for Line in InFile:
    Species += reg.findall(Line)
print Species
reg = re.compile('(^\w+)')
for Line in InFile:
    Sequence += reg.findall(Line)
print Sequence
dictionary = dict(zip(Species, Sequence))
InFile.close()
However, the output for my two lists is:
['FOXP2_MOUSE', 'FOXP2_PONPY']
[]
Why is my second list empty? Are you not allowed to use re.compile() twice? Any suggestions on how to circumvent my problem?
Thank you,
Christy
If you want to read a file twice, you have to seek back to the beginning before the second loop:
InFile.seek(0)
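A minimal illustration of re-reading the same file object (using the file name from the question):
with open("foxp2.fasta") as InFile:
    headers = [line for line in InFile if line.startswith(">")]
    InFile.seek(0)                        # rewind to the start of the file
    all_lines = InFile.readlines()        # the second read now sees the whole file again
print(len(headers), len(all_lines))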
You can do it in a single pass, and without regular expressions:
def load_fasta(filename):
    data = {}
    species = ""
    sequence = []
    with open(filename) as inf:
        for line in inf:
            line = line.strip()
            if line.startswith(";"):  # is comment?
                # skip it
                pass
            elif line.startswith(">"):  # start of new record?
                # save previous record (if any)
                if species and sequence:
                    data[species] = "".join(sequence)
                species = line.split("|")[2]
                sequence = []
            else:  # continuation of previous record
                sequence.append(line)
    # end of file - finish storing last record
    if species and sequence:
        data[species] = "".join(sequence)
    return data

data = load_fasta("foxp2.fasta")
On your given file, this produces data ==
{
'FOXP2_PONPY': 'MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK',
'FOXP2_MOUSE': 'MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK'
}
You could also do this in a single pass with a multiline regex:
import re
reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta", 'r') as file:
    data = dict(reg.findall(file.read()))
The downside is that you have to read the whole file in at once. Whether this is a problem depends on likely file sizes.
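One extra note (not from the original answer): the second capture group spans several lines, so the dictionary values keep their embedded newlines. If you want them to match the single-string output of load_fasta above, you can strip them while building the dictionary, roughly like this:
import re
reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta") as file:
    data = {species: seq.replace('\n', '') for species, seq in reg.findall(file.read())}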

Python load list in list from text file

I want to load a list of lists from a text file. I went through many examples but found no solution. This is what I want to do; I am new to Python.
def main():
    mainlist = [[]]
    infile = open('listtxt.txt','r')
    for line in infile:
        mainlist.append(line)
    infile.close()
    print mainlist
[[],['abc','def', 1],['ghi','jkl',2]]
However, what I want is something like this:
[['abc','def',1],['ghi','jkl',2]]
My text file contains:
'abc','def',1
'ghi','jkl',2
'mno','pqr',3
What I want is that when I access the list with
print mainlist[0]
it should return
'abc','def',1
any help will be highly appreciated
Thanks,
It seems to me that you could do this as:
from ast import literal_eval
with open('listtxt.txt') as f:
    mainlist = [list(literal_eval(line)) for line in f]
This is the easiest way to make sure that the types of the elements are preserved, e.g. a line like:
"foo","bar",3
will be parsed into two strings and an integer. Of course, the lines themselves need to be formatted as a Python tuple... and this probably isn't the fastest approach due to its generality and simplicity.
Maybe something like this:
mainlist = []
infile = open('listtxt.txt','r')
for line in infile:
    mainlist.append(line.strip().split(','))
infile.close()
print mainlist
You're initializing mainlist with an empty list as first element, rather than as an empty list itself. Change:
mainlist = [[]]
to
mainlist = []
I'd try something like:
with open('listtxt.txt', 'r') as f:
    mainlist = [line for line in f]
mainlist = []
infile = open('filelist.txt', 'r')
for line in infile:
    line = line.replace('\n', '').replace('[', '').replace(']', '').replace("'", "").replace(' ', '')
    mainlist.append(line.split(','))
infile.close()
You can use the json module like below (Python 3.x):
import json
def main():
    mainlist = [[]]
    infile = open('listtxt.txt','r')
    data = json.loads(infile.read())
    mainlist.extend(data)      # extend, so each loaded row is added as its own sub-list
    infile.close()
    print(mainlist)
>>> [[],['abc','def', 1],['ghi','jkl',2]]
Your "listtxt.txt" file should look like this:
[["abc","def", 1],["ghi","jkl",2]]
To export your list, do this:
def export():
    with open("listtxt.txt", 'w') as export_file:
        json.dump(mainlist, export_file)
The json module can load lists.
I had a list of surnames - one per line - in a text file which I wanted to read into a list. Here's how I did it (remembering to strip the newline character).
surnames = [name.strip("\n") for name in open("surnames.txt", "r")]
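A small variation on the same idea (my addition, not part of the original answer) that also closes the file explicitly by using a with block:
with open("surnames.txt") as f:
    surnames = [name.strip("\n") for name in f]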
