Keeping data together that are linked in a dataframe when only one is included in an API


I am fairly new to Python, so I am hoping to hear your thoughts on this scenario and question:
I have a dataframe with 1803 rows and 4 columns.
Two of the columns are "Sequence" and "Gene names", the latter being the genes from which the sequences derive.
My code for a loop that iterates over each sequence for the IEDB API is working fine:
import requests
url = "http://tools-cluster-interface.iedb.org/tools_api/mhci/"
method = "netmhcpan_el"
allele = "HLA-A*02:01"
length = "9"
for index, row in df_SEQS.iterrows():
    sequence = row["Sequence"]
    # Form fields for one prediction request (the same name is used in the post call below)
    netmhcpan_el_query = {"method": method, "sequence_text": sequence, "allele": allele, "length": length}
    netmhcpan_el_resp = requests.post(url, data=netmhcpan_el_query)
    output_netmhcpan_el = netmhcpan_el_resp.text
But the output is a list of sequences that are no longer paired with their gene of origin.
How can I get the output to include sequence-gene pairs when the API does not have a variable for "Gene names" from the original df?
I've run out of ideas.
Thanks very much for any suggestions
LG
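One way to keep each prediction paired with its gene is to attach the gene name yourself, since the loop already knows which row produced each response. The sketch below has not been run against the real service. It assumes the API returns tab-separated text with a header row, and `fake_predict`, `annotate_predictions`, the column `query_sequence`, and the toy sequences and gene names are all made up for illustration so that the example runs offline.

```python
import io

import pandas as pd


def annotate_predictions(df, predict, seq_col="Sequence", gene_col="Gene names"):
    """Call predict(sequence) for each row and tag every result line with that row's gene."""
    frames = []
    for _, row in df.iterrows():
        text = predict(row[seq_col])
        # Assumption: the response body is tab-separated text with a header line.
        result = pd.read_csv(io.StringIO(text), sep="\t")
        result["query_sequence"] = row[seq_col]
        result[gene_col] = row[gene_col]
        frames.append(result)
    return pd.concat(frames, ignore_index=True)


def fake_predict(sequence):
    """Hypothetical stand-in for requests.post(url, data=query).text."""
    return f"allele\tpeptide\tscore\nHLA-A*02:01\t{sequence[:9]}\t0.5\n"


# Made-up input with the same two column names as the question.
df_SEQS = pd.DataFrame({
    "Sequence": ["SLYNTVATLY", "GILGFVFTLA"],
    "Gene names": ["GENE1", "GENE2"],
})
combined = annotate_predictions(df_SEQS, fake_predict)
```

In the real loop, `predict` would wrap the existing `requests.post(...)` call and return its `.text`, so every output line carries the gene it came from.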

Related

Improve performance of 8million iterations over a dataframe and query it

There is a for loop of 8 million iterations. Each iteration takes 2 sample values from a column of a 1-million-record dataframe (say df_original_nodes) and then queries another dataframe (say df_original_rel) for those 2 samples. If the pair does not exist, it is added as a new row to the queried dataframe (df_original_rel). Finally, the dataframe (df_original_rel) is written to a CSV.
This loop takes roughly 24+ hours to complete. How can this be made more performant? I would be happy if it took 8 hours rather than 12+.
Here is the piece of code:
for j in range(1, n_8000000):
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    df_ran_rel = df_original_nodes["UID"].sample(2, ignore_index=True)
    FROM = df_ran_rel[0]
    TO = df_ran_rel[1]
    if df_original_rel.query("@FROM == FROM and @TO == TO").empty:
        k += 1
        new_row = {"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]}
        df_original_rel = df_original_rel.append(new_row, ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
My assumption is that querying the dataframe df_original_rel is the heavy-lifting part, especially as it keeps growing whenever a new row is added.
In my view, lists are faster to traverse and maybe to query, but then there would be another layer of conversion from dataframe to lists and vice versa, which could add further complexity.

Some things that should probably help – most of them around "do less Pandas".
Since I don't have your original data or anything like it, I can't test this.
# Grab a regular list of UIDs that we can use with `random.sample`
original_nodes_uid_list = df_original_nodes["UID"].tolist()
# Make a regular set of FROM-TO tuples
rel_from_to_pairs = set(df_original_rel[["FROM", "TO"]].apply(tuple, axis=1).tolist())
# Store new rows here instead of putting them in the dataframe; we'll also update rel_from_to_pairs as we go.
new_rows = []
for j in range(1, 8_000_000):
    # These two lines could probably also be a `random.choice`
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    # Grab a from-to pair from the UID list
    FROM, TO = random.sample(original_nodes_uid_list, 2)
    # If this pair isn't in the set of known pairs...
    if (FROM, TO) not in rel_from_to_pairs:
        # ... prepare a new row to be added later
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]})
        # ... and since this from-to pair _would_ exist had df_original_rel
        # been updated, update the pairs set.
        rel_from_to_pairs.add((FROM, TO))
# Finally, make a dataframe of the new rows, concatenate it with the old, and output.
df_new_rel = pd.DataFrame(new_rows)
df_original_rel = pd.concat([df_original_rel, df_new_rel], ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
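Since the code above could not be tested against the real data, here is a small self-contained run of the same idea on toy data. The three input objects are made-up stand-ins for df_original_nodes, df_original_rel and rel_type, the loop is cut to 1000 iterations, and the pair set is built with `zip` rather than `apply(tuple, axis=1)`, which is a swapped-in alternative and not part of the answer above.

```python
import random

import pandas as pd

random.seed(0)  # only to make the toy run repeatable

# Made-up stand-ins for df_original_nodes, df_original_rel and rel_type.
df_original_nodes = pd.DataFrame({"UID": list(range(50))})
df_original_rel = pd.DataFrame(
    {"FROM": [0, 1], "TO": [1, 2], "TYPE": ["t0", "t1"], "PART_OF": ["p0", "p1"]}
)
rel_type = [("t0", "p0"), ("t1", "p1")]

uids = df_original_nodes["UID"].tolist()
# Plain Python tuples in a set give average O(1) membership tests.
known_pairs = set(zip(df_original_rel["FROM"].tolist(), df_original_rel["TO"].tolist()))
n_before = len(known_pairs)

new_rows = []
for _ in range(1000):
    rel = random.choice(rel_type)
    FROM, TO = random.sample(uids, 2)
    if (FROM, TO) not in known_pairs:
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": rel[0], "PART_OF": rel[1]})
        known_pairs.add((FROM, TO))

# One concatenation at the end instead of growing the dataframe row by row.
result = pd.concat([df_original_rel, pd.DataFrame(new_rows)], ignore_index=True)
```

The dataframe is touched only twice, once to read the inputs and once to append all new rows, so the loop itself does no Pandas work.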

How can I align columns if rows have different number of values?

I am scraping data with python. I get a csv file and can split it into columns in excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status is thus moving the other values in the columns to the right and as a result the dates are not all in the same column which would be useful to sort the rows.
Do you have any idea how to make the columns merge if there are two statuses for example or other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.

make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)
def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    # Turn a site-relative href into an absolute URL
    if link and link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    # Store the (possibly completed) link rather than re-reading the raw attribute
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
[Image: the output I got for the first 3 pages]

I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
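To illustrate why naming each field keeps the columns aligned, here is a minimal sketch with made-up records. The titles, dates and the key names `status_1`, `dates_1`, `status_2` and `dates_2` are hypothetical and not taken from the site. Pandas builds the columns from the union of the dictionary keys, so a record with only one status leaves the second pair empty instead of shifting values sideways.

```python
import pandas as pd

# Made-up records: the second one has an extra status and date range.
rows = [
    {"title": "Initiative A", "status_1": "OPEN", "dates_1": "01 Jan - 31 Jan"},
    {
        "title": "Initiative B",
        "status_1": "CLOSED",
        "dates_1": "01 Feb - 28 Feb",
        "status_2": "UPCOMING",
        "dates_2": "01 Mar - 31 Mar",
    },
]

# Columns are the union of all keys; missing values become NaN.
df = pd.DataFrame(rows)
```

Sorting by a date column then works on every row, because each date always lands in the column named after its key.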

GCP Datastore NDB: Filter for KindA elements keys that are NOT IN KindB documentId

Here is my situation: I have two Datastore kinds, and I need to create a Python query for all the data that are not present in Kind B. In the sample those are Data 3 and Data 4.
The constraint here is that I need to filter for elements in KindA which have a key that is different from a specific KindB property.
Kind A | Kind B
Data 1 | Data 1
Data 2 | Data 2
Data 3 |
Data 4 |
Data 5 | Data 5
According to documentation, I can create a query in this way:
query = Account.query(Account.userid == 42)
I've tried this:
myquery = KindA.query(KindA.key.id() != KindB.documentId)
But it throws:
AttributeError: 'ModelKey' object has no attribute 'id'
I've tried following this stack overflow question:
but it seems infeasible because the number of elements in KindB is dynamic, and I can't list them all.
Written in English, my query would be: filter KindA elements whose keys are NOT IN KindB documentId.
Could you help?

Try this
# keys_only=True returns only the keys, which is faster
kindB_Ids = [a.id() for a in KindB.query().fetch(keys_only=True)]
kindA_Ids = [a.id() for a in KindA.query().fetch(keys_only=True)]
# This gives you the keys of KindA entities whose ids are not in KindB
diff = [ndb.Key(KindA, a) for a in kindA_Ids if a not in kindB_Ids]
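The "not in" step above is an ordinary Python membership test, so it can be sketched without Datastore at all. The IDs below are made up and stand in for the results of the two keys-only queries. If the link lives in KindB's documentId property rather than in its key, the second list would be built from that property instead. Converting the KindB IDs to a set is an optional tweak, not part of the answer above, that avoids rescanning a list for every KindA ID.

```python
# Made-up IDs standing in for the results of the two keys-only queries.
kind_a_ids = ["data1", "data2", "data3", "data4", "data5"]
kind_b_document_ids = ["data1", "data2", "data5"]

# Build the set once so each membership test is O(1) on average.
kind_b_lookup = set(kind_b_document_ids)

# Preserve KindA's order while dropping anything that also appears in KindB.
missing_from_b = [a_id for a_id in kind_a_ids if a_id not in kind_b_lookup]
```

With the sample data from the question this leaves the IDs for Data 3 and Data 4, which can then be turned into keys with `ndb.Key(KindA, ...)` as in the answer.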

According to this post, quoted by @SarahRemo in the comments, here is a working solution.
list_of_id = []
entries = KindB.query()
for request in entries:
    list_of_id.append(request.key.id())
keys = [ndb.Key(KindA, unique_id) for unique_id in list_of_id]
objs = ndb.get_multi(keys)

Sort a list based on values from dataframe

Is my approach here the right way to do it in Python? As I'm new to Python, I appreciate any feedback you can provide, especially if I'm way off here.
My task is to order a list of file names based on values from a dataset. Specifically, these are file names that I need to sort based on site information. The resulting list is the order in which the reports will be printed.
Site Information
key_info = pd.DataFrame({
    'key_id': ['1010','3030','2020','5050','4040','4040']
    , 'key_name': ['Name_A','Name_B','Name_C','Name_D','Name_E','Name_E']
    , 'key_value': [1,2,3,4,5,6]
})
key_info = key_info[['key_id','key_name']].drop_duplicates()
key_info['key_id'] = key_info.key_id.astype('str').astype('int64')
Filenames
These are the file names I need to sort. In this example, I sort by just the key_id, but I assume I could easily add a column to site information and sort it by that as well.
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
Sorting
The resulting "filenames" is the final sorted list.
names_df = pd.DataFrame({'filename': filenames})
names_df['key_id'] = names_df.filename.str[:4].astype('str').astype('int64')
merged_df = pd.merge(key_info, names_df, on='key_id', how='right')
merged_df = merged_df.sort_values('key_id')
filenames = merged_df['filename'].tolist()
I'm looking for any solutions that might be better or more Pythonic. Or, if there is a more appropriate place to post "code review" questions.

I like your use of Pandas, but it isn't the most Pythonic as it uses data structures that are a superset of Python. Nevertheless, I think we can improve on what you have. I will show an improved version and I will show a completely native Python way to do it. Either is fine I suppose?
The strictly Python version is best for people who do not know Pandas, as there's a large learning curve associated with that.
Common
For both examples, let's assume a function like this:
def trim_filenames(filename):
    return filename[0:4]
I use this in both examples.
Improvements
# Load the DataFrame and give it a proper index (I added some data)
key_info = pd.DataFrame(index=['2020','5050','4040','4040','6000','7000','1010','3030'], data={'key_name':['Name_C','Name_D','Name_E','Name_E','Name_F','Name_G','Name_A','Name_B'], 'key_value' :[1,2,3,4,5,6,7,8]})
# Eliminate duplicates and sort in one step
key_info = key_info.groupby(key_info.index).first()
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
names_df = pd.DataFrame({'filename': filenames})
# Let's give this an index too so we can match on the index (not the function call)
names_df.index=names_df.filename.transform(trim_filenames)
combined = pd.concat([key_info,names_df], axis=1)
combined matches by index, but there are some keys with no filenames. It looks like this now:
     key_name  key_value       filename
1010   Name_A          7  1010_Filename
2020   Name_C          1  2020_Filename
3030   Name_B          8  3030_Filename
4040   Name_E          3  4040_Filename
5050   Name_D          2  5050_Filename
6000   Name_F          5            NaN
7000   Name_G          6            NaN
Now we drop the NaN values from the filename column and create the list of filenames:
combined.filename.dropna().values.tolist()
['1010_Filename', '2020_Filename', '3030_Filename', '4040_Filename', '5050_Filename']
Python Only Version (no framework)
key_info = {'2020' : {'key_name':'Name_C', 'key_value':1},'5050' : {'key_name':'Name_D', 'key_value':2},'4040' : {'key_name':'Name_E', 'key_value':3},'4040' : {'key_name':'Name_E', 'key_value':4},'6000' : {'key_name':'Name_F', 'key_value':5},'7000' : {'key_name':'Name_G', 'key_value':6},'1010' : {'key_name':'Name_A', 'key_value':7},'3030' : {'key_name':'Name_B', 'key_value':8}}
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
# Let's get a dictionary of filenames that is keyed by the same key as in key_info:
hashed_filenames = {}
for filename in filenames:
    # Note here I'm using the function again
    hashed_filenames[trim_filenames(filename)] = filename
# We'll store the new filenames in new_filenames:
new_filenames = []
# sort the key info and loop it
for key in sorted(key_info.keys()):
    # for each key, if the key matches in the hashed_filenames, then add it to the list
    if key in hashed_filenames:
        new_filenames.append(hashed_filenames[key])
Summary
Both solutions are concise, and I like Pandas, but I like better something that is immediately readable by anyone that knows Python. The Python only solution (of course, they are both Python) is the one you should go with, in my opinion.

out_list = []
for x in key_info.key_id:
    for f in filenames:
        if str(x) in f:
            out_list.append(f)
out_list
['1010_Filename', '3030_Filename', '2020_Filename', '5050_Filename', '4040_Filename']
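The nested loop above can also be written with the built-in `sorted` and a lookup dictionary, which avoids scanning every filename for every key. This is a minimal sketch and not part of the answers above: `key_order` and `rank` are illustrative names, the order is copied from the question's key_info after de-duplication, and the 4-character prefix is an assumption based on the example filenames.

```python
# key_id values in the order the reports should be printed.
key_order = ["1010", "3030", "2020", "5050", "4040"]
filenames = ["1010_Filename", "2020_Filename", "3030_Filename", "5050_Filename", "4040_Filename"]

# Map each key to its position so the sort key is a cheap dictionary lookup.
rank = {key: position for position, key in enumerate(key_order)}

# Filenames whose prefix is not a known key sort last.
sorted_filenames = sorted(filenames, key=lambda name: rank.get(name[:4], len(rank)))
```

To sort by another column instead, `key_order` would be built from the dataframe after sorting it by that column.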

Extracting similar data from dictionary and putting into a new list or array

I have a dictionary with 5 keys, where each key refers to a list (a CSV file with 100 rows and 5 columns). Each row of a list holds the data of one person. I would like to extract the matching row from each list and put them into a new list or array, so that at the end I have 100 lists/arrays, each containing one user’s data. Then I want to do some experiments such as machine learning.
This is my example:
My_dict={0,1,2,3}
0={id,var1,var2,var3
User1,med,high,low
User2,med,low,low
…,…,..,..,
User100,high,low,med}
1={id,var1,var2,var3
User1,high,med,low
User2,high,med,low
…,…,..,..,
User100,low,low,med}
2={id,var1,var2,var3
User1,low,med,low
User2,med,med,low
…,…,..,..,
User100,med,low,med}
So I want to have a list of lists or array of arrays that I can experiment with. Something like this:
User1={id,var1,var2,var3
User1,med,high,low
User1,high,med,low
User1,low,med,low
}
User2={id,var1,var2,var3
User2,med,high,low
User2,high,med,low
User2,low,med,low
}

input_data = {"0":[["U1","med","low","high"],["U2","low","low","high"],["U3","high","low","high"]], "1": [["U1","med","low","high"],["U2","low","low","high"],["U3","high","low","high"]]}
# Assuming that above kind of data you have then below dict will be your output
users_dict = dict()
# Python 3: dict.items() replaces the Python 2 iteritems()
for key, users in input_data.items():
    for user in users:
        users_dict.setdefault(user[0], []).append(user)
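If the per-user data is headed for machine-learning experiments, a table per user may be handier than nested lists. The sketch below is one possible variant using Pandas, which is a swapped-in approach and not what the answer above does. The data is made up in the shape of input_data, and the names `columns`, `stacked`, `per_user` and the `table` column are illustrative.

```python
import pandas as pd

# Made-up data shaped like input_data above: table key -> list of rows.
input_data = {
    "0": [["U1", "med", "high", "low"], ["U2", "med", "low", "low"]],
    "1": [["U1", "high", "med", "low"], ["U2", "high", "med", "low"]],
}
columns = ["id", "var1", "var2", "var3"]

# Stack every table into one frame, remembering which table each row came from.
stacked = pd.concat(
    {key: pd.DataFrame(rows, columns=columns) for key, rows in input_data.items()},
    names=["table", "row"],
).reset_index(level="table")

# One DataFrame per user, each holding that user's row from every table.
per_user = {user: group.reset_index(drop=True) for user, group in stacked.groupby("id")}
```

Each value in `per_user` has one row per source table, so `per_user["U1"]` holds the rows for U1 from tables "0" and "1".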
