Hi, please help me either speed up this dictionary comprehension, offer a better way to do it, or gain a better understanding of why it is so slow internally (for example, is the calculation slowing down as the dictionary grows in memory?). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06606|DB06607|DB06608|DB06609',
'DB06606|DB06607',
'DB06608']
}
pd.DataFrame(data = data_ , index=range(0,5) )
I am performing it on a df with 4000 rows; the column df['column'] contains a string of IDs separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, and this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, but now the comprehension has been running for 24 hours. Maybe it is just the sheer size of the job, but I feel like I could speed it up. At this point I want to stop it and re-engineer; however, I am scared that will set me back without much increase in speed, so before I do that I wanted to get some thoughts, ideas and suggestions.
Beyond the 4000x4000 size, I suspect that using the Series and Index objects is another problem and that I would be better off using lists, but given the size of the task I am not sure how much speed that will gain. Maybe I am better off using some other method, such as pd.apply(df, f(write line by line to json)). I am not sure; any help and education appreciated, thanks.
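One likely cause of the slowness is that `df['column'].str.split("|")` sits inside the inner loop, so the whole column is split again for every element, roughly 4000 x 4000 times. A minimal pure-Python sketch that splits each row only once and uses sets for membership tests is shown here. It assumes the intent is to flag, for each row, which `drugbank_id` values occur in that row's `drug-interactions` string; the names `id_sets` and `ids` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "drugbank_id": ["DB06605", "DB06606", "DB06607", "DB06608", "DB06609"],
    "drug-interactions": [
        "DB06605|DB06695|DB01254|DB01609|DB01586|DB0212",
        "DB06605|DB06695|DB01254|DB01609|DB01586|DB0212",
        "DB06606|DB06607|DB06608|DB06609",
        "DB06606|DB06607",
        "DB06608",
    ],
})

# Split every row exactly once and keep the IDs in a set for fast lookups.
id_sets = [set(s.split("|")) for s in df["drug-interactions"]]
ids = df["drugbank_id"].tolist()

# Assumption: each list marks which drugbank_id values appear in that row.
classes = {
    idx: [1 if drug_id in id_set else 0 for drug_id in ids]
    for idx, id_set in zip(df.index, id_sets)
}
print(classes)
```

The dictionary itself is not what slows things down; with the split hoisted out of the loops, the remaining work is about 4000 x 4000 set lookups.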
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create dummy ID (this will become True in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
        .fillna(0)
        .astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col  1  2  3  4          test
idx
1    1  1  0  0  [1, 1, 0, 0]
2    1  1  1  0  [1, 1, 1, 0]
3    0  1  1  0  [0, 1, 1, 0]
4    1  0  0  1  [1, 0, 0, 1]
The output you want can be achieved using dummies. We split the column, stack, and use max to turn it into dummy indicators based on the original index. Then we use reindex to get it in the order you want based on the 'drugbank_id' column.
Finally to get the dictionary you want we will transpose and use to_dict
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
.max(level=0)
.reindex(df['drugbank_id'], axis=1)
.fillna(0, downcast='infer')
.T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0], #Has DB06605, No DB06606, No DB06607, No DB06608, No DB06609
1: [1, 0, 0, 0, 0],
2: [0, 1, 1, 1, 1],
3: [0, 1, 1, 0, 0],
4: [0, 0, 0, 1, 0]}
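In pandas 2.0 and later, `max(level=0)` is no longer accepted, and the `downcast` argument of `fillna` has since been deprecated. A sketch of the same idea using `explode` and `groupby(level=0)` instead, assuming the example frame from the question (the intermediate name `dummies` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "drugbank_id": ["DB06605", "DB06606", "DB06607", "DB06608", "DB06609"],
    "drug-interactions": [
        "DB06605|DB06695|DB01254|DB01609|DB01586|DB0212",
        "DB06605|DB06695|DB01254|DB01609|DB01586|DB0212",
        "DB06606|DB06607|DB06608|DB06609",
        "DB06606|DB06607",
        "DB06608",
    ],
})

# One row per (original index, interacting ID), then one indicator column per ID.
dummies = pd.get_dummies(df["drug-interactions"].str.split("|").explode())

classes = (dummies.groupby(level=0).max()                  # collapse back to one row per original index
                  .reindex(columns=df["drugbank_id"], fill_value=0)  # keep only drugbank_id columns, in order
                  .astype(int)
                  .T.to_dict("list"))
print(classes)
```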
I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns where the column names are complements. I currently have a system that:
Gets the complement of a column name
Checks the column names for the complement
Adds the columns together if there is a match
Then deletes the complement column
However, this is slow (it checks every column name) and gives different column names depending on the ordering of the columns (i.e. it deletes different complement columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC and CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in list(df.columns):
    # skip columns that were already merged away earlier in the loop
    if item in df.columns and complement(item) in df.columns:
        df[item] = df[item] + df[complement(item)]
        del df[complement(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate the column names, then assign each column whichever of the original name or its translation sorts first. This allows you to group complements.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#    AAAA  ATTG  CGGG
# 0     2     3     4
# 1     1     7     4
# 2     0     0     1
# 3     1     2     4
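Newer pandas releases deprecate `groupby(..., axis=1)`. A minimal sketch of the same idea that avoids it by grouping the transposed frame with a dictionary, assuming the canonical name of a column is whichever of the name and its complement sorts first (the names `trans` and `canonical` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0],
                   "TTTT": [2, 1, 0, 1]})

trans = str.maketrans("ATCG", "TAGC")

# Map every column to a canonical name: the alphabetically smaller
# of the column name and its complement.
canonical = {col: min(col, col.translate(trans)) for col in df.columns}

# Group the rows of the transposed frame by canonical name, sum, transpose back.
result = df.T.groupby(canonical).sum().T
print(result)
```

Because the canonical name does not depend on column order, the output stays the same between runs. As in the answer above, an unpaired column such as TTTT is renamed to its smaller complement AAAA.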
Is it possible to sort the columns of a data frame by a custom key, with the keywords specified in list or dict format as follows?
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2]})
# before
col_cc_7, col_aa_7, col_bb_7
0, 1, 2
0, 1, 2
0, 1, 2
# sort
custom_sort_key = ["aa", "bb", "cc"]
# ... sort codes ...
# after
col_aa_7, col_bb_7, col_cc_7
1, 2, 0
1, 2, 0
1, 2, 0
For me, your question is a little confusing.
If you only want to sort your column values, a simple Google search would do the trick; if not, I could not understand the question.
df = df.sort_values(by=['col', 'col2', 'col3'], ascending=[True, True, False])
The by= argument sets the order of the sorting, and ascending is self-explanatory.
We can split by the middle value and create a dictionary of your columns, then apply a sort before we assign this back. I've added some extra columns not in your sort to show what will happen to them.
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2],
"col_ee_7": [2, 2, 2],
"col_dd_7": [2, 2, 2]})
custom_sort_key = ["bb", "cc", "aa"]
col_dict = dict(zip(df.columns,[x.split('_')[1] for x in df.columns.tolist()]))
#{'col_cc_7': 'cc',
# 'col_aa_7': 'aa',
# 'col_bb_7': 'bb',
# 'col_ee_7': 'ee',
# 'col_dd_7': 'dd'}
d = {v:k for k,v in enumerate(custom_sort_key)}
# this will only work on python 3.6 +
new_cols = dict(sorted(col_dict.items(), key=lambda x: d.get(x[1], float('inf'))))
df[new_cols.keys()]
   col_bb_7  col_cc_7  col_aa_7  col_ee_7  col_dd_7
0         2         0         1         2         2
1         2         0         1         2         2
2         2         0         1         2         2
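A slightly shorter variant of the same approach is to sort the column labels directly with a key function. This is a sketch that assumes the keyword is always the middle part of `col_<keyword>_7`; the names `rank` and `ordered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col_cc_7": [0, 0, 0],
    "col_aa_7": [1, 1, 1],
    "col_bb_7": [2, 2, 2],
    "col_ee_7": [2, 2, 2],
    "col_dd_7": [2, 2, 2]})

custom_sort_key = ["bb", "cc", "aa"]

# Position of each keyword in the custom order.
rank = {key: pos for pos, key in enumerate(custom_sort_key)}

# Columns whose keyword is not listed get the largest rank; because sorted()
# is stable, they stay at the end in their original relative order.
ordered = sorted(df.columns, key=lambda c: rank.get(c.split("_")[1], len(rank)))
result = df[ordered]
print(result.columns.tolist())
```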
I need to go through each row in a dataframe looking at a column called 'source_IP_address', and look at the previous 100 rows, so that I can find if there are any rows with the same 'source_IP_address' and where another column states 'authentication failure'.
I have written some code that does this, as I couldn't use Pandas rolling over two columns. Problem is, it is not very fast and I want to know if there is a better way to do it?
# Count, within the previous window of n rows (including the current row), the rows whose
# axis value matches the current row and whose attribute column equals attr_var.
def check_window_for_match(df_w, window_size, axis_col, attr_col, attr_var):
    l = []
    n_rows = df_w.shape[0]
    for i in range(n_rows):
        # create a temp dataframe with the previous n rows including current row;
        # clamp the start at 0, since a negative start would wrap around to the end of the frame
        start = max(0, i - (window_size - 1))
        temp_df = df_w.iloc[start:i + 1]
        # assign the current axis value (e.g. 'B') as the axis variable
        current_value = df_w[axis_col].iloc[i]
        # given the temp dataframe of the previous window of n rows, count axis matches against fails
        l_temp = temp_df.loc[(temp_df[axis_col] == current_value) & (temp_df[attr_col] == attr_var)].shape[0]
        l.append(l_temp)
    return l
e.g.
df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7], 'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
df_test
matches_list = check_window_for_match(df_test, window_size=3, axis_col='B', attr_col='C', attr_var='fail')
output: [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]
I want to know if my code is correct and if it is the best way to do it, or there is a better alternative.
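One possible way to avoid slicing a new dataframe for every row is to compare each column with shifted copies of itself, so that every step is a vectorized operation over the whole frame. This is a sketch rather than a drop-in replacement: the function name `count_window_matches` is made up here, and it makes `window_size` passes over the data, which has not been benchmarked against the loop above.

```python
import numpy as np
import pandas as pd

def count_window_matches(df, window_size, axis_col, attr_col, attr_var):
    axis = df[axis_col]
    is_attr = df[attr_col] == attr_var           # True where the attribute matches
    counts = pd.Series(0, index=df.index)
    for k in range(window_size):
        # Row i is compared with row i - k (k = 0 is the current row itself).
        same_axis = axis == axis.shift(k)        # NaN never compares equal
        attr_k = is_attr.shift(k, fill_value=False)
        counts += (same_axis & attr_k).astype(int)
    return counts.tolist()

df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7],
                        'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
matches_list = count_window_matches(df_test, window_size=3, axis_col='B',
                                    attr_col='C', attr_var='fail')
print(matches_list)
```

On the example frame this gives the same list as the loop-based function in the question.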
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples for some reason.
I found the example below for randomly selecting the elements of a single-key groupby; however, it does not work with a multi-key groupby. From "How to access pandas groupby dataframe by key":
# create groupby object
grouped = df.groupby('some_key')
# pick N dataframes and grab their indices
sampled_df_i = random.sample(list(grouped.indices), N)
# grab the groups using the groupby object 'get_group' method
df_list = [grouped.get_group(df_i) for df_i in sampled_df_i]
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values in df.some_key, use that to slice the df, and finally group by on the result:
In [337]:
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
               val
some_key
0         1.000000
2         3.666667
If there is more than one groupby key:
In [358]:
df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print(gby.mean().loc[random.sample(list(gby.indices), 2)])
                     val
some_key1 some_key2
1         1            5
3         2            8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
In [372]:
idx = random.sample(list(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())),
                    2)
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
                     val
some_key1 some_key2
2         0            3
3         1            5
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# >    some_key  val
# 2           2    3
# 3           3    4
# 6           2    1
# 7           3    5
# 10          2    7
# 11          3    8
Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
.groupby(by=["some_key1", "some_key2"]) \
.sample(n_samples_by_group)
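Note that `groupby(...).sample(n)` draws n rows from within every group. If the goal is instead to keep N whole groups chosen at random, as in the original question, one option is to label each row with its group number via `ngroup()` and sample those labels. A sketch, with `n_groups`, `group_id` and the seed chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
        "some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
        "val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
    }
)

n_groups = 2                       # number of whole groups to keep
rng = np.random.default_rng(0)     # seeded only to make the sketch repeatable

# Integer label of the (some_key1, some_key2) group each row belongs to.
group_id = df.groupby(["some_key1", "some_key2"]).ngroup()

# Pick n_groups distinct labels and keep every row of those groups.
chosen = rng.choice(group_id.unique(), size=n_groups, replace=False)
sampled_df = df[group_id.isin(chosen)]
print(sampled_df)
```

This works the same way for one or several groupby keys, since the sampling is done on the group labels rather than on the key values.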