Extracting strings using a list of words - python

I am trying to solve the problem (for example)
I have a name file
VOG00001
VOG00002
VOG00004
and database file
VOG00001!962!834!Xu!sp|O31936|YOPB_BACSU_Putative_antitoxin_YopB
VOG00002!206!17!Xh!sp|Q5UPJ9|YL122_MIMIV_Putative_ankyrin_repeat_protein_L12
VOG00003!1284!960!Xr!sp|O22001|VXIS_BPMD2_Excisionase
VOG00004!353!304!Xu!sp|P03795|Y28_BPT7_Protein_2.8
VOG00005!253!60!Xu!REFSEQ_hypothetical_protein
I need to extract rows from the database that match the words in the file names
results:
VOG00001!962!834!Xu!sp|O31936|YOPB_BACSU_Putative_antitoxin_YopB
VOG00002!206!17!Xh!sp|Q5UPJ9|YL122_MIMIV_Putative_ankyrin_repeat_protein_L12
VOG00004!353!304!Xu!sp|P03795|Y28_BPT7_Protein_2.8
def log(x, y):
output = open('output.txt', 'a')
output.write(x + y)
output.close
def main():
i = 0
nfile = 'input/' + input('Enter file with names: ')
dfile = 'input/' + input('Enter file with data: ')
names = list(open(nfile, 'r'))
data = list(open(dfile, 'r'))
while i != len(data):
line = data[i]
if 'VOG' in line:
line1 = line.replace("!*" , "")
if line1 in names:
log(line, data[i + 1])
i += 1
return(0)
main()
I want to trim the unnecessary and compare with the list of names
line1 = line.replace("!*" , "")

The best way to do this is to create a dictionary based on the database file. The database file uses '!' to delimit the key so we just split the strings to find the key and associated the entire line with that key (the dictionary value). Then iterate over the "names" file and do lookups in the dictionary.
names = input('Enter path to names file: ')
db = input('Enter path to database file: ')
with open(names) as n, open(db) as d:
dict_ = {}
for line in map(str.strip, d):
key, *_ = line.split('!')
dict_[key] = line
for name in map(str.strip, n):
if (v := dict_.get(name)):
print(v)

Related

Return a dictionary of a function

I want to define a function, that reads a table of a textfile as a dictionary and than use it for returning specific values. The keys are chemical symbols (like "He" for Helium,...). The values return their specific atom masses.
I don't understand, what I have to do...
The first five lines of the textfile read:
H,1.008
He,4.0026
Li,6.94
Be,9.0122
B,10.81
Here are my attempts: (I don't know where to place the parameter key so that I can define it)
def read_masses():
atom_masses = {}
with open["average_mass.csv") as f:
for line in f:
(key, value) = line.split(",")
atom_masses[key] = value
return(value)
m = read_masses("average_mass.csv)
print(m["N"]) #for the mass of nitrogen ```
once return has called, the code below it doesn't execute. What you need to return is the atom_masses not value and you have to place it outside the for loop
def read_masses(file):
atom_masses = {}
with open(file) as f:
for line in f:
(key, value) = line.split(",")
atom_masses[key] = value
return (atom_masses)
m = read_masses("average_mass.csv")
print(m["H"])
>>> 1.008
Try:
def read_masses(name):
data = {}
with open(name, "r") as f_in:
for line in map(str.strip, f_in):
if line == "":
continue
a, b = map(str.strip, line.split(",", maxsplit=1))
data[a] = float(b)
return data
m = read_masses("your_file.txt")
print(m.get("He"))
Prints:
4.0026

Converting text file to dictionary then to JSON object using Python - But I need to split data from the text file

I am trying to convert txt file data to a Python dictionary before dumping it to JSON and then writing JSON data into avro. My text file contains unwanted data that I need to remove, I only need field names and field types:
mappings = ['name', 'type']
with open('xxx.txt', 'r') as fn:
dict1 = {}
names = []
types = []
for line in file.readlines():
names_lines = line.split()[0] # index 0 = names
names.append(names_lines)
types_lines = line.split()[1] # index 1 = types
types.append(types_lines)
id = 1
for d in fn:
desc = list(d.strip().split(' ', 1))
name = desc[0]
i = 0
dict2 = {}
while i < len(mappings):
dict2[mappings[i]] = desc[i]
i = i + 1
dict1[name] = dict2
id = id + 1
print(dict1)
1- My Error is:
name 'file' is not defined. Did you mean: 'filter'?***
replace file with fn:
for line in fn.readlines():
From what you described you wanted in the comments, try this:
records = []
with open('xxx.txt', 'r') as f:
for line in f:
name, type_, *_ = line.split()
records.append({"name": name, "type": type_})
print(records)

filteration using separators in python

I have many lines like the following:
>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG
>ENSG00000003137|ENST00000412253|CYP26B1|72370133;72362405|72370213;72362548|4025
AGCCTTTTTCTTCGACGATTTCCG
In this example ENSG00000003137 is name and 4732 which is the last one is length. as you see some names are repeated but they have different length.
I want to make a new file in which I only have those with the longest length. meaning the results would be like this:
>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG
I have made this code to split but don't know how to make the file I want:
file = open(“file.txt”, “r”)
for line in file:
if line.startswith(“>”):
line = line.split(“|”)
You'll need to read the file twice; the first time round, track the largest size per entry:
largest = {}
with open(inputfile) as f:
for line in f:
if line.startswith('>'):
parts = line.split('|')
name, length = parts[0][1:], int(parts[-1])
largest[name] = max(length, largest.get(name, -1))
then write out the copy in a second pass, but only those sections whose name and length match the extracted largest length from the first pass:
with open(inputfile) as f, open(outpufile, 'w') as out:
copying = False
for line in f:
if line.startswith('>'):
parts = line.split('|')
name, length = parts[0][1:], int(parts[-1])
copying = largest[name] == length
if copying:
out.write(line)
you have to do two types of handling in the loop, one that compares your 'length', and one that stores the CGTA when its needed. I wrote an example for you that reads those into dicts:
file = open("file.txt", "r")
myDict = {}
myValueDict = {}
action = 'remember'
geneDict = {}
for line in file:
if line.startswith(">"):
line = line.rstrip().split("|")
line_name = line[0]
line_number = int(line[-1])
if line_name in myValueDict:
if myValueDict[line_name] < line_number:
action = 'remember'
myValueDict[line_name] = line_number
myDict[line_name] = line
else:
action = 'forget'
else:
myDict[line_name] = line
myValueDict[line_name] = line_number
else:
if action == 'remember':
geneDict[line_name] = line.rstrip()
for key in myDict:
print(myDict[key])
for key in geneDict:
print(geneDict[key])
this ignores the lower length items. you can now store those dicts any way you want.

How to format the output of dictionary in python

I would like to format the values of a dictionary in python. Here is the script that i have used to generate the output
entries = {}
entries1 = {}
with open('no_dup.txt', 'r') as fh_in:
for line in fh_in:
if line.startswith('E'):
line = line.strip()
line = line.split()
entry = line[0]
if entry in entries:
entries[entry].append(line)
else:
entries[entry] = [line]
with open('no_dup_out.txt', 'w') as fh_out:
for kee, val in entries.iteritems():
if len(val) == 1:
fh_out.write("{} \n".format(val))
with open('no_dup_out.txt', 'r') as fh_in2:
for line in fh_in2:
line = line.strip()
line = line.split()
entry = line[1]
if entry in entries1:
entries1[entry].append(line)
else:
entries1[entry] = [line]
with open('no_dup_out_final.txt', 'w') as fh_out2:
for kee, val in entries1.iteritems():
if len(val) == 1:
fh_out2.write("{} \n".format(val))
For example by running the above script i generated the following output
[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]]
[["[['ENSGMOG00000003752',", "'ENSORLG00000005385']]"]]
[["[['ENSGMOG00000003760',", "'ENSORLG00000005379']]"]]
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]
[["[['ENSGMOG00000003761',", "'ENSORLG00000005382']]"]]
And i would like to format it such as way that i remove all the parentheses and commas (ENSGMOG00000003747 ENSORLG00000006947) and output the rest as it is using tab delimited format. How can i do that?
If your list of lists is full_list, then you could have the following code give your desired output:
desired_list = ['\t'.join([element.split('\'')[1] for element in list_item[0]]) for list_item in full_list]

Cutting character values according to value from file

This is the which i am doing
import csv
output = open('output.txt' , 'wb')
# this functions return the min for num.txt
def get_min(num):
return int(open('%s.txt' % num, 'r+').readlines()[0])
# temporary variables
last_line = ''
input_list = []
#iterate over input.txt in sort the input in a list of tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()):
if i%2 == 0:
last_line = line
else:
input_list.append((last_line, line))
filtered = [(header, data[:get_min(header[-2])] + '\n' ) for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()
In this code input.txt is something like this
>012|013|0|3|M
AFDSFASDFASDFA
>005|5|67|0|6
ACCTCTGACC
>029|032|4|5|S
GGCAGGGAGCAGGCCTGTA
and num.txt is something like this
M 4
P 10
I want that in above input.txt check the amount of value from the num.txt by looking at its last column which is same like in num.txt and cut its character according to that values
I think the error in my code is that it only accept the integer text file , where it should also accept file which contain alphabets
The totally revised version, after a long chat with the OP;
import os
import re
# Fetch all hashes and counts
file_c = open('num.txt')
file_c = file_c.read()
lines = re.findall(r'\w+\.txt \d+', file_c)
numbers = {}
for line in lines:
line_split = line.split('.txt ')
hash_name = line_split[0]
count = line_split[1]
numbers[hash_name] = count
#print(numbers)
# The input file
file_i = open('input.txt')
file_i = file_i.read()
for hash_name, count in numbers.iteritems():
regex = '(' + hash_name.strip() + ')'
result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)
if len(result) > 0:
data_original = result[0][2]
stripped_data = result[0][2][int(count):]
file_i = file_i.replace(data_original, '\n' + stripped_data)
#print(data_original)
#print(stripped_data)
#print(file_i)
# Write the input file to new input_new.txt
f = open('input_new.txt', 'wt')
f.write(file_i)
You can do it like so;
import re
min_count = 4 # this variable will contain that count integer from where to start removing
str_to_match = 'EOG6CC67M' # this variable will contain the filename you read
input = '' # The file input (input.txt) will go in here
counter = 0
def callback_f(e):
global min_count
global counter
counter += 1
# Check your input
print(str(counter) + ' >>> ' + e.group())
# Only replace the value with nothing (remove it) after a certain count
if counter > min_count:
return '' # replace with nothing
result = re.sub(r''+str_to_match, callback_f, input)
With this tactic you can keep count with a global counter and there's no need to do hard line-loops with complex structures.
Update
More detailed version with file access;
import os
import re
def callback_f(e):
global counter
counter += 1
# Check your input
print(str(counter) + ' >>> ' + e.group())
# Fetch all hash-file names and their content (count)
num_files = os.listdir('./num_files')
numbers = {}
for file in num_files:
if file[0] != '.':
file_c = open('./num_files/' + file)
file_c = file_c.read()
numbers[file.split('.')[0]] = file_c
# Now the CSV files
csv_files = os.listdir('./csv_files')
for file in csv_files:
if file[0] != '.':
for hash_name, min_count in numbers.iteritems():
file_c = open('./csv_files/' + file)
file_c = file_c.read()
counter = 0
result = re.sub(r''+hash_name, callback_f, file_c)
# Write the replaced content back to the file here
Considered directory/file structure;
+ Projects
+ Project_folder
+ csv_files
- input1.csv
- input2.csv
~ etc.
+ num_files
- EOG6CC67M.txt
- EOG62JQZP.txt
~ etc.
- python_file.py
The CSV files contain the big chunks of text you state in your original question.
The Num files contain the hash-files with an Integer in them
What happens in this script;
Collect all Hash files (in a dictionary) and it's inner count number
Loop through all CSV files
Subloop through the collected numbers for each CSV file
Replace/remove (based on what you do in callback_f()) hashes after a certain count
Write the output back (it's the last comment in the script, would contain the file.write() functionality)

Categories