Merge two txt files together by one common column in Python

How do I read in two tab-delimited .txt files and map them together by one common column?
For example, from these two files, create a mapping of gene to pathway:
First file, pathway.txt
Pathway Protein
Binding and Uptake of Ligands by Scavenger Receptors P69905
Erythrocytes take up carbon dioxide and release oxygen P69905
Metabolism P69905
Amyloids P02647
Metabolism P02647
Hemostasis P68871
Second file, gene.txt
Gene Protein
Fabp3 P11404
HBA1 P69905
APOA1 P02647
Hbb-b1 P02088
HBB P68871
Hba P01942
The output would be like:
Gene Protein Pathway
Fabp3 P11404
HBA1 P69905 Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
APOA1 P02647 Amyloids, Metabolism
Hbb-b1 P02088
HBB P68871 Hemostasis
Hba P01942
Leave the Pathway column blank if no pathway corresponds to the gene, based on the protein ID.
UPDATE:
import pandas as pd
file1= pd.read_csv("gene.csv")
file2= pd.read_csv("pathway.csv")
output = pd.concat([file1,file2]).fillna(" ")
output= output[["Gene","Protein"]+list(output.columns[1:-1])]
output.to_csv("mapping of gene to pathway.csv", index=False)
So this only gives me the two tables stacked on top of each other, which is not what I expected.
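A merge-based approach may be closer to what is wanted here: aggregate the pathways per protein first, then left-join onto the gene table so that genes without a match keep an empty cell. A sketch using standard pandas `groupby` and `merge`, with the sample rows from the question inlined (reading the real files with `pd.read_csv(..., sep="\t")` is assumed to produce the same frames):

```python
import pandas as pd

# Inline sample data mirroring the two tab-delimited files from the question.
genes = pd.DataFrame({
    "Gene": ["Fabp3", "HBA1", "APOA1", "Hbb-b1", "HBB", "Hba"],
    "Protein": ["P11404", "P69905", "P02647", "P02088", "P68871", "P01942"],
})
pathways = pd.DataFrame({
    "Pathway": [
        "Binding and Uptake of Ligands by Scavenger Receptors",
        "Erythrocytes take up carbon dioxide and release oxygen",
        "Metabolism", "Amyloids", "Metabolism", "Hemostasis",
    ],
    "Protein": ["P69905", "P69905", "P69905", "P02647", "P02647", "P68871"],
})

# Collapse multiple pathways per protein into one comma-separated string.
grouped = pathways.groupby("Protein")["Pathway"].agg(", ".join).reset_index()

# Left-join so genes without a matching protein keep an empty Pathway cell.
merged = genes.merge(grouped, on="Protein", how="left").fillna("")
print(merged.to_string(index=False))
```

The `how="left"` is what keeps rows like Fabp3 and Hba in the output with a blank pathway.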

>>> from collections import defaultdict
>>> my_dict = defaultdict()
>>> f = open('pathway.txt')
>>> for x in f:
...     x = x.strip().split()
...     value, key = " ".join(x[:-1]), x[-1]
...     if my_dict.get(key, 0) == 0:
...         my_dict[key] = [value]
...     else:
...         my_dict[key].append(value)
...
>>> my_dict
defaultdict(None, {'Protein': ['Pathway'], 'P69905': ['Binding and Uptake of Ligands by Scavenger Receptors', 'Erythrocytes take up carbon dioxide and release oxygen', 'Metabolism'], 'P02647': ['Amyloids', 'Metabolism'], 'P68871': ['Hemostasis']})
>>> f1 = open('gene.txt')
>>> for x in f1:
...     value, key = x.strip().split()
...     if my_dict.get(key, 0) == 0:
...         print("{:<15}{:<15}".format(value, key))
...     else:
...         print("{:<15}{:<15}{}".format(value, key, ", ".join(my_dict[key])))
...
Gene           Protein        Pathway
Fabp3          P11404
HBA1           P69905         Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
APOA1          P02647         Amyloids, Metabolism
Hbb-b1         P02088
HBB            P68871         Hemostasis
Hba            P01942

class Protein:
    def __init__(self, protein, pathway=None, gene=""):
        self.protein = protein
        self.pathways = []
        self.gene = gene
        if pathway is not None:
            self.pathways.append(pathway)

    def __str__(self):
        return "%s\t%s\t%s" % (
            self.gene,
            self.protein,
            ", ".join(self.pathways))

# protein -> Protein map
proteins = {}

# get the pathways
with open("pathways.txt") as f1:
    for line in f1.readlines()[1:]:
        tokens = line.split()
        pathway = " ".join(tokens[:-1])
        protein = tokens[-1]
        if protein in proteins:
            proteins[protein].pathways.append(pathway)
        else:
            proteins[protein] = Protein(protein=protein, pathway=pathway)

# get the genes
with open("genes.txt") as f2:
    for line in f2.readlines()[1:]:
        gene, protein = line.split()
        if protein in proteins:
            proteins[protein].gene = gene
        else:
            proteins[protein] = Protein(protein=protein, gene=gene)

# print the results
print("Gene\tProtein\tPathway")
for protein in proteins.values():
    print(protein)


how to sum/aggregate by group without using pandas or import

So I am basically not allowed to use any imports or other libraries like pandas or groupby,
and I have to categorize the data and sum up the corresponding values. The data is in a csv file.
For example,
**S** C **T**
A T 100
A. B 102
A. T. 200
A B. 100
C T 203
C. T. 200
C B 200
C T 200
C. B 200
my expected result should be
S C T
A T 300
A B. 202
C T 403
C B. 200
C T. 200
C B. 200
Considering that you have a csv file (i.e., columns split by comma):
with open('myfile.csv', 'r') as file:
    header = file.readline().rstrip()
    data = {}
    for row in file:
        state, candidate, value = row.split(',')
        k, value = (state, candidate), int(value)
        data[k] = data.get(k, 0) + value
result_csv = '\n'.join([header] + [f"{','.join(k)},{v}" for k, v in data.items()])
print(result_csv)
Output:
state,candidate,total votes
Alaska,Trump,300
Alaska,Biden,202
colorado,Trump,403
colorado,Biden,200
California,Trump,200
California,Biden,200
Original content of myfile.csv is (use str.replace if necessary):
state,candidate,total votes
Alaska,Trump,100
Alaska,Biden,102
Alaska,Trump,200
Alaska,Biden,100
colorado,Trump,203
colorado,Trump,200
colorado,Biden,200
California,Trump,200
California,Biden,200
mylist = []
with open("data", "r") as msg:
    for line in msg:
        mylist.append(line.strip().replace(".", ""))

headers = mylist[0].replace("*", "").split()
del mylist[0]
headers[2] = headers[2] + " " + headers[3]

mydict = {}
for line in mylist:
    state = line.split()[0]
    mydict[state] = {}
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    mydict[state][candidate] = 0
for line in mylist:
    state = line.split()[0]
    candidate = line.split()[1]
    votes = line.split()[2]
    mydict[state][candidate] = mydict[state][candidate] + int(votes)

print("%-15s %-15s %-15s \n\n" % (headers[0], headers[1], headers[2]))
for state in mydict.keys():
    for candidate in mydict[state].keys():
        print("%-15s %-15s %-15s" % (state, candidate, str(mydict[state][candidate])))
Output:
state candidate total votes
Alaska Trump 300
Alaska Biden 202
colorado Trump 403
colorado Biden 200
California Trump 200
California Biden 200
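For comparison, if imports were allowed, the standard library's csv module would also handle quoted fields that contain commas, which the plain split(',') approach above cannot. A sketch over a few of the sample rows (the data is inlined here via io.StringIO just to keep the example self-contained):

```python
import csv
import io

raw = """state,candidate,total votes
Alaska,Trump,100
Alaska,Biden,102
Alaska,Trump,200
"""

totals = {}
reader = csv.reader(io.StringIO(raw))
header = next(reader)  # skip the header row
for state, candidate, value in reader:
    key = (state, candidate)
    totals[key] = totals.get(key, 0) + int(value)

for (state, candidate), total in totals.items():
    print(f"{state},{candidate},{total}")
```

The aggregation logic is identical to the no-import answers; only the line parsing is delegated to csv.reader.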

Finding relationships between values based on their name in Python with pandas

I want to make relationships between values by their Name, based on the rules below:
1- I have a CSV file (with more than 100000 rows) that consists of lots of values; some examples are below:
Name:
A02-father
A03-father
A04-father
A05-father
A07-father
A08-father
A09-father
A17-father
A18-father
A20-father
A02-SA-A03-SA
A02-SA-A04-SA
A03-SA-A02-SA
A03-SA-A05-SA
A03-SA-A17-SA
A04-SA-A02-SA
A04-SA-A09-SA
A05-SA-A03-SA
A09-SA-A04-SA
A09-SA-A20-SA
A17-SA-A03-SA
A17-SA-A18-SA
A18-SA-A17-SA
A20-SA-A09-SA
A05-NA
B02-Father
B04-Father
B06-Father
B02-SA-B04-SA
B04-SA-BO2-SA
B04-SA-B06-SA
B06-SA-B04-SA
B06-NA
2- Now I have another CSV file which lets me know which value I should start from. In this case the values are
A03-father & B02-father & ..., which don't have any influence on each other; they all have separate paths, so each path starts from its mentioned start point.
father.csv
A03-father
B02-father
....
3- Based on the naming I want to make the relationships. As A03-father has been determined as a father, I should check for any value which starts with A03 (all of them are A03's babies).
Also, as B02 is a father, we will check for any value which starts with B02. (B02-SA-B04-SA)
4- Now if I find A03-SA-A02-SA, this is A03's baby.
If I find A03-SA-A05-SA, this is A03's baby.
If I find A03-SA-A17-SA, this is A03's baby.
And after that I must check any node which starts with A02 & A05 & A17:
As you see, A02-father exists, so it is a father, and now we will search for any string which starts with A02 and doesn't contain A03, which has already been detected as a father (it must be ignored).
This must be checked until the end of the values in the CSV file.
So I should check the path based on the name (regex) and go forward until the end of the path.
The expected result:
Father Baby
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A02-father A02-SA-A04-SA
A09-father A09-SA-A20-SA
B02-father B02-SA-B04-SA
B04-father B04-SA-B06-SA
B06-father B06-NA
I have coded it as below with pandas:
import pandas as pd
import numpy as np
import re

# Read the file which consists of all values
df = pd.read_csv("C:\\total.csv")
# Read the file which lets me know who is a father
Fa = pd.read_csv("C:\\Father.csv")
# Get the first part of the father name (e.g. A03)
Fa['sub'] = Fa['Name'].str.extract(r'(\w+\s*)', expand=False)
r2 = []
# Check the whole csv file and find anything which starts with that prefix and is not a father
for f in Fa['sub']:
    baby = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains('Father')]
    baby['sub'] = baby['Name'].str.extract(r'(\w+\s*)', expand=False)
    r1 = pd.merge(Fa, baby, left_on='sub', right_on='sub', suffixes=('_f', '_c'))
    r2.append(r1)
out_df = pd.concat(r2)
out_df = out_df.replace(np.nan, '', regex=True)
# find A0-N-A2-M and A0-N-A4-M
out_df.to_csv('C:\\child1.csv')
# Check the whole csv file and find anything which starts with the second part of child1, which is A2 and A4
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3 = out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    # I want to exclude A0 which has been detected.
    l = ['A0']
    regstr = '|'.join(l)
    baby1 = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)]
    baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
    r3 = pd.merge(baby3, baby1, left_on='baby2', right_on='sub', suffixes=('_f', '_c'))
    r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
I want to put the code below in a loop that goes through the file, checks it based on the naming process, and detects further fathers and babies until it is finished. However, this code is not generalized and doesn't give the exact result I expected.
My question is about how to make the loop.
I should go through the path and also consider the regstr value for any string.
# Check the whole csv file and find anything which starts with the second part of child1, which is A2 and A4
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3 = out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    # I want to exclude A0 which has been detected.
    l = ['A0']
    regstr = '|'.join(l)
    baby1 = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)]
    baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
    r3 = pd.merge(baby3, baby1, left_on='baby2', right_on='sub', suffixes=('_f', '_c'))
    r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
Start with import collections (it will be needed soon).
I assume that you have already read the df and Fa DataFrames.
The first part of my code creates a children Series (index: parent,
value: child):
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)
Print children to see the result.
The second part creates the actual result, starting from each
starting point in Fa:
nodes = collections.deque()
father = []; baby = []  # Containers for source data
# Loop over each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()  # Take a node name from the queue
        # Children of this node
        myChildren = children[children.index == node]
        # Process children (ind - father, val - child)
        for ind, val in myChildren.items():
            parts = val.split('-')  # Parts of the child name
            # Child "actual" name (if it exists)
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:  # val_2 not "visited" before
                # Add father / child name to the containers
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)  # Add to the queue, to be processed later
        # Drop rows for "node" from "children" (if any exist)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)

# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'  # Add "-father" to the "bare" names
I added -father with a lower-case "f", but I think this is not a
significant detail.
The result, for your data sample, is:
Father Baby
0 A03-father A03-SA-A02-SA
1 A03-father A03-SA-A05-SA
2 A03-father A03-SA-A17-SA
3 A02-father A02-SA-A04-SA
4 A05-father A05-NA
5 A17-father A17-SA-A18-SA
6 A04-father A04-SA-A09-SA
7 A09-father A09-SA-A20-SA
8 B02-father B02-SA-B04-SA
9 B04-father B04-SA-B06-SA
10 B06-father B06-NA
And two remarks concerning your data sample:
You wrote B04-SA-BO2-SA with a capital letter O instead of the digit 0
(zero). I corrected it in my source data.
The row A02-father A02-SA-A04-SA in your expected result is doubled.
I assume it should occur only once.
Commented inline
def find(data, from_pos=0):
    fathers = {}
    skip = []
    for x in data[from_pos:]:
        tks = x.split("-")
        # Is it a father?
        if tks[1].lower() == "father":
            fathers[tks[0]] = x
        else:
            if tks[0] in fathers and tks[-2] not in skip:
                print(fathers[tks[0]], x)
                # Skip this father appearing as a child later
                skip.append(tks[0])
Testcase:
data = [
    'A0-Father',
    'A0-N-A2-M',
    'A0-N-A4-M',
    'A2-Father',
    'A2-M-A0-N',
    'A2-N-A8-M',
    'A8-father',
    'A8-M-A11-N',
    'A8-M-A2-N']

find(data, from_pos=0)
Output:
A0-Father A0-N-A2-M
A0-Father A0-N-A4-M
A2-Father A2-N-A8-M
A8-father A8-M-A11-N
Edit 1:
Start with some data for testing:
data = [
    'A02-father',
    'A03-father',
    'A04-father',
    'A05-father',
    'A07-father',
    'A08-father',
    'A09-father',
    'A17-father',
    'A18-father',
    'A20-father',
    'A02-SA-A03-SA',
    'A02-SA-A04-SA',
    'A03-SA-A02-SA',
    'A03-SA-A05-SA',
    'A03-SA-A17-SA',
    'A04-SA-A02-SA',
    'A04-SA-A09-SA',
    'A05-SA-A03-SA',
    'A09-SA-A04-SA',
    'A09-SA-A20-SA',
    'A17-SA-A03-SA',
    'A17-SA-A18-SA',
    'A18-SA-A17-SA',
    'A20-SA-A09-SA',
    'A05-NA',
]
father = [
    'A03-father',
]
First, let us make a data structure so that manipulation is easy and relationship lookups are fast, since you have a lot of data:
def make_data_structure(data):
    all_fathers, all_relations = {}, {}
    for x in data:
        tks = x.split("-")
        if tks[1].lower() == "father":
            all_fathers[tks[0]] = x
        else:
            if len(tks) == 2:
                tks.extend(['NA', 'NA'])
            if tks[0] in all_relations:
                all_relations[tks[0]][0].append(tks[-2])
                all_relations[tks[0]][1].append(x)
            else:
                all_relations[tks[0]] = [[tks[-2]], [x]]
    return all_fathers, all_relations

all_fathers, all_relations = make_data_structure(data)
all_fathers, all_relations
Output:
{'A02': 'A02-father',
'A03': 'A03-father',
'A04': 'A04-father',
'A05': 'A05-father',
'A07': 'A07-father',
'A08': 'A08-father',
'A09': 'A09-father',
'A17': 'A17-father',
'A18': 'A18-father',
'A20': 'A20-father'},
{'A02': [['A03', 'A04'], ['A02-SA-A03-SA', 'A02-SA-A04-SA']],
'A03': [['A02', 'A05', 'A17'],
['A03-SA-A02-SA', 'A03-SA-A05-SA', 'A03-SA-A17-SA']],
'A04': [['A02', 'A09'], ['A04-SA-A02-SA', 'A04-SA-A09-SA']],
'A05': [['A03', 'NA'], ['A05-SA-A03-SA', 'A05-NA']],
'A09': [['A04', 'A20'], ['A09-SA-A04-SA', 'A09-SA-A20-SA']],
'A17': [['A03', 'A18'], ['A17-SA-A03-SA', 'A17-SA-A18-SA']],
'A18': [['A17'], ['A18-SA-A17-SA']],
'A20': [['A09'], ['A20-SA-A09-SA']]}
As you can see, all_fathers holds all the parents and, most importantly, all_relations holds the father-child relationships, which can be indexed by the father for faster lookups.
Now let's do the actual parsing of the relationships:
def find(all_fathers, all_relations, from_father):
    fathers = [from_father]
    skip = []
    while len(fathers) > 0:
        current_father = fathers[0]
        fathers = fathers[1:]
        for i in range(len(all_relations[current_father][0])):
            if all_relations[current_father][0][i] not in skip:
                print(all_fathers[current_father], all_relations[current_father][1][i])
                if all_relations[current_father][0][i] != 'NA':
                    fathers.append(all_relations[current_father][0][i])
        skip.append(current_father)

for x in father:
    find(all_fathers, all_relations, x.split("-")[0])
Output:
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A09-father A09-SA-A20-SA
Edit 2:
New test cases (you will have to load the values in father.csv into a list called father).
data = [
    'A02-father',
    'A03-father',
    'A04-father',
    'A05-father',
    'A07-father',
    'A08-father',
    'A09-father',
    'A17-father',
    'A18-father',
    'A20-father',
    'A02-SA-A03-SA',
    'A02-SA-A04-SA',
    'A03-SA-A02-SA',
    'A03-SA-A05-SA',
    'A03-SA-A17-SA',
    'A04-SA-A02-SA',
    'A04-SA-A09-SA',
    'A05-SA-A03-SA',
    'A09-SA-A04-SA',
    'A09-SA-A20-SA',
    'A17-SA-A03-SA',
    'A17-SA-A18-SA',
    'A18-SA-A17-SA',
    'A20-SA-A09-SA',
    'A05-NA',
    'B02-Father',
    'B04-Father',
    'B06-Father',
    'B02-SA-B04-SA',
    'B04-SA-B02-SA',
    'B04-SA-B06-SA',
    'B06-SA-B04-SA',
    'B06-NA',
]
father = [
    'A03-father',
    'B02-father',
]

# Rebuild the lookup structures for the new data before searching
all_fathers, all_relations = make_data_structure(data)
for x in father:
    find(all_fathers, all_relations, x.split("-")[0])
Output:
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A09-father A09-SA-A20-SA
B02-Father B02-SA-B04-SA
B04-Father B04-SA-B06-SA
B06-Father B06-NA

Looping through tree to create a dictionary_NLTK

I'm new to Python and trying to solve a problem looping through a tree in NLTK. I'm stuck on the final output; it is not entirely correct.
I'm looking to create a dictionary with 2 variables, and if there is no quantity, then add the value 1.
This is the desired final output:
{quantity: 1, food: pizza}, {quantity: 1, food: coke},
{quantity: 2, food: beers}, {quantity: 1, food: sandwich}
Here is my code, any help is much appreciated!
import nltk
# nltk.download('punkt')  # uncomment on first run
# nltk.download('averaged_perceptron_tagger')

grammar = r"""Food: {<DT>?<VRB>?<NN.*>+}
       }<>+{
Quantity: {<CD>|<JJ>|<DT>}
"""
rp = nltk.RegexpParser(grammar)

def RegPar(menu):
    output = rp.parse(menu)
    return output

Sentences = ['A pizza margherita', 'one coke y 2 beers', 'Sandwich']
tagged_array = []
output_array = []
for s in Sentences:
    tokens = nltk.word_tokenize(s)
    tags = nltk.pos_tag(tokens)
    tagged_array.append(tags)
    output = rp.parse(tags)
    output_array.append(output)
    print(output)

dat = []
tree = RegPar(output_array)
for subtree in tree.subtrees():
    if subtree.label() == 'Food' or subtree.label() == 'Quantity':
        dat.append({(subtree.label(), subtree.leaves()[0][0])})
print(dat)
# [{('Food', 'A')}, {('Quantity', 'one')}, {('Food', 'coke')}, {('Quantity', '2')}, {('Food', 'beers')}, {('Food', 'Sandwich')}]
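To get from those (label, token) pairs to the desired list of dicts, one option is a small fold that attaches each Quantity to the Food that follows it and defaults the quantity to 1 when none was captured. This sketch works on flat pairs like the ones printed above; the helper name and the hard-coded pairs are illustrative, not from NLTK:

```python
def pair_up(labeled):
    """Fold (label, token) pairs into [{'quantity': ..., 'food': ...}] dicts.

    A Quantity applies to the next Food; a Food with no preceding
    Quantity defaults to quantity 1.
    """
    result = []
    pending_qty = None
    for label, token in labeled:
        if label == "Quantity":
            pending_qty = token
        elif label == "Food":
            result.append({"quantity": pending_qty if pending_qty is not None else 1,
                           "food": token})
            pending_qty = None
    return result

pairs = [("Quantity", "one"), ("Food", "coke"), ("Quantity", "2"),
         ("Food", "beers"), ("Food", "Sandwich")]
print(pair_up(pairs))
```

Normalizing word quantities like "one" to integers would be a separate mapping step on top of this.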

read txt file into dictionary

I have the following type of document, where each person might have a couple of names and an associated description of features:
New person
name: ana
name: anna
name: ann
feature: A 65-year old woman that has no known health issues but has a medical history of Schizophrenia.
New person
name: tom
name: thomas
name: thimoty
name: tommy
feature: A 32-year old male that is known to be deaf.
New person
.....
What I would like is to read this file into a Python dictionary, where each new person gets an ID.
I.e. the person with ID 1 will have the names ['ann', 'anna', 'ana']
and will have the feature ['A 65-year old woman that has no known health issues but has a medical history of Schizophrenia.']
Any suggestions?
Assuming that your input file is lo.txt, it can be read into a dictionary this way:
final_data = []
feature = []
names = []
with open('lo.txt') as file:
    for line in file:
        if "feature" in line:
            data = line.replace("\n", "").split(":")
            feature = data[1]
            final_data.append({
                'names': names,
                'feature': feature,
            })
            names = []
            feature = []
        if "name" in line:
            data = line.replace("\n", "").split(":")
            names.append(data[1])
print(final_data)
Something like this might work
result = {}
with open("document.txt") as f:
    contents = f.read()
# Split into one block per person and drop empty blocks
blocks = [b for b in contents.split('New person') if b.strip()]
for i, block in enumerate(blocks):
    names = []
    features = []
    for line in block.split('\n'):
        parts = line.split(':', 1)
        if parts[0].strip() == 'name':
            names.append(parts[1].strip())
        elif parts[0].strip() == 'feature':
            features.append(parts[1].strip())
    result[i] = {'names': names, 'features': features}
print(result)
This should give you something like:
{0: {'names': ['ana', 'anna', 'ann'], 'features': ['...']}}
etc.
Here is code that may work for you:
f = open("documents.txt").readlines()
f = [i.strip('\n') for i in f]
final_condition = f[len(f) - 1]
f.remove(final_condition)
names = [i.split(":")[1] for i in f]
the_dict = {}
the_dict["names"] = names
the_dict["features"] = final_condition
print(the_dict)
All it does is split each name line at ":", take the last element of the resulting list (the name), and keep it in the names list.
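For completeness, here is a sketch that handles any number of "New person" blocks and numbers people from 1, as asked. The function name and inlined sample are illustrative:

```python
def parse_people(text):
    """Parse a 'New person' document into {id: {'names': [...], 'feature': ...}}."""
    people = {}
    person_id = 0
    for block in text.split("New person"):
        names, feature = [], None
        for raw in block.splitlines():
            line = raw.strip()
            if line.startswith("name:"):
                names.append(line.split(":", 1)[1].strip())
            elif line.startswith("feature:"):
                feature = line.split(":", 1)[1].strip()
        if names or feature:  # skip the empty chunk before the first header
            person_id += 1
            people[person_id] = {"names": names, "feature": feature}
    return people

sample = """New person
name: ana
name: anna
name: ann
feature: A 65-year old woman that has no known health issues.
New person
name: tom
name: thomas
feature: A 32-year old male that is known to be deaf.
"""
print(parse_people(sample))
```

Splitting on the literal header line keeps the parser independent of how many names each person has.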

Print a line of HTML, keeping the right format

I must print all the raw text of this HTML page.
Each line has this format:
ENSG00000001461'&nbsp';'&nbsp';'&nbsp';'&nbsp';ENST00000432012'&nbsp';'&nbsp';'&nbsp';'&nbsp';NIPAL3'&nbsp';'&nbsp';'&nbsp';'&nbsp';5'&nbsp';'&nbsp';'&nbsp';'&nbsp';1'&nbsp';'&nbsp';'&nbsp';'&nbsp';Forward'&nbsp';'&nbsp';'&nbsp';'&nbsp';NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]<'br/'>
I want the following output:
ENSG00000001461 ENST00000432012 NIPAL3 5 1 Forward NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
But the output is only:
ENSG00000001461
This is my code:
import urllib
from bs4 import BeautifulSoup

species = ['HomoSapiens', 'MusMusculus', 'DrosophilaMelanogaster', 'CaenorhabditisElegans']
rna_target = ['mRNA', 'lincRNA', 'lncRNA']
db = ['MB21E78v2', 'MB19E65v2', 'MB16E62v1']
species_input = input("Selezionare Specie: ")
target_input = input("Selezionare tipo di RNA: ")
db_input = input("Selezionare DataBase: ")
check = 0
for i in range(len(species)):
    if species_input == species[i]:
        for j in range(len(rna_target)):
            if target_input == rna_target[j]:
                for k in range(len(db)):
                    if db_input == db[k]:
                        check = 1
if check == 1:
    print("Dati Inseriti Correttamente!")
else:
    print("Error: Dati inseriti in modo errato!")
    exit()
url = urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/OptionController?>" + "species=" + species_input + "&type=" + target_input + "&version=" + db_input)
print(url.geturl())
identifier = []
seq_input = input("Digitare ID miRNA: ")
seq = ""
seq = seq_input.split()
print(seq)
for i in range(len(seq)):
    identifier.append(seq[i] + "%20")
s = ""
string = s.join(identifier)
url_tab = urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/InputController?>" + "identifier=" string + "&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" + "version=" + db_input + "&species=" + species_input + "&type=" + target_input)
print(url_tab.geturl())
download = urllib.request.urlopen("
<http://cm.jefferson.edu/rna22/Precomputed/InputController?>download=ALL" + "&ident=" + string + "&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" + "version=" + db_input + "&species=" + species_input + "&type=" + target_input)
down_string = download.geturl()
print(down_string)
soup = BeautifulSoup(download, "html5lib")
for match in soup.findAll('br'):
    match.unwrap()
s2 = soup
s1 = s2.body.extract()
print(s1.prettify(formatter=lambda s: s.strip(u'xa0')))
There is no notion of lines in the source; there is just one long run of text, which you need to separate into lines at the br tags.
If you have to parse the source, you can replace the br tags with newlines and just pull the text:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://cm.jefferson.edu/rna22/Precomputed/InputController?download=ALL&ident=hsa_miR_107%20hsa_miR_5011_5p%20hsa_miR_326&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&version=MB21E78v2&species=HomoSapiens&type=mRNA")
soup = BeautifulSoup(r.content, "html.parser")
for b in soup.find_all("br"):
    b.replace_with("\n")
print(soup.text)
Which will give you:
ENSG00000001461    ENST00000432012    NIPAL3    5    1    Forward    NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
ENSG00000001631    ENST00000340022    KRIT1    5    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394503    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394505    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394507    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000412043    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000002834    ENST00000318008    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834    ENST00000433206    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834    ENST00000435347    LASP1    5    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000005381    ENST00000225275    MPO    5    17    Reverse    myeloperoxidase [Source:HGNC Symbol;Acc:HGNC:7218]
ENSG00000005889    ENST00000539115    ZFX    4    23 X    Forward    zinc finger protein, X-linked [Source:HGNC Symbol;Acc:HGNC:12869]
ENSG00000006432    ENST00000554752    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000006432    ENST00000611979    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000007216    ENST00000314669    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]
ENSG00000007216    ENST00000444914    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]
And a whole lot more of the same.
I tested your code and replaced my previous answer.
If you fix the following errors, your code seems to work:
Remove the angle brackets (< and >) from the urls
Remove the EOL in line 42 (the urlopen call for download is split across two lines)
Add a + between "identifier=" and string
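Applying those three fixes, the query-URL construction can be factored into a small helper (a sketch; the function name is illustrative, the query parameters are the fixed ones from the question):

```python
def build_input_url(identifier, db_version, species, rna_type):
    # Assemble the InputController URL with the fixes applied:
    # no angle brackets around the URL, the whole expression on one
    # logical line, and an explicit '+' before the identifier string.
    return ("https://cm.jefferson.edu/rna22/Precomputed/InputController?"
            + "identifier=" + identifier
            + "&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1"
            + "&version=" + db_version
            + "&species=" + species
            + "&type=" + rna_type)

url = build_input_url("hsa_miR_107%20hsa_miR_326", "MB21E78v2", "HomoSapiens", "mRNA")
print(url)
```

The helper keeps the concatenation in one place, so a missing + becomes a SyntaxError at a single known spot instead of being buried in a long urlopen call.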
Here are some of the lines of the output I get:
ENSG00000272325    ENST00000607016    NUDT3    4    6    Reverse    nudix (nucleoside diphosphate linked moiety X)-type motif 3 [Source:HGNC Symbol;Acc:HGNC:8050]
ENSG00000272980    ENST00000400926    CCR6    5    6    Forward    chemokine (C-C motif) receptor 6 [Source:HGNC Symbol;Acc:HGNC:1607]
ENSG00000274211    ENST00000612932    SOCS7    8    17    Forward    suppressor of cytokine signaling 7 [Source:HGNC Symbol;Acc:HGNC:29846]
ENSG00000274588    ENST00000611977    DGKK    4    23 X    Reverse    diacylglycerol kinase, kappa [Source:HGNC Symbol;Acc:HGNC:32395]
ENSG00000275004    ENST00000613655    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
ENSG00000275004    ENST00000619852    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
ENSG00000275832    ENST00000622683    ARHGAP23    6    17    Forward    Rho GTPase activating protein 23 [Source:HGNC Symbol;Acc:HGNC:29293]
ENSG00000277258    ENST00000616199    PCGF2    3    17    Reverse    polycomb group ring finger 2 [Source:HGNC Symbol;Acc:HGNC:12929]
ENSG00000278871    ENST00000623344    KDM5D    8    24 Y    Reverse    lysine (K)-specific demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
ENSG00000279096    ENST00000622918    AL356289.1    11    1    Forward    HCG1780467 {ECO:0000313|EMBL:EAX06861.1}; PRO0529 {ECO:0000313|EMBL:AAF16687.1} [Source:UniProtKB/TrEMBL;Acc:Q9UI23]
