Combine Pandas Columns by Corresponding Dictionary Values - python

I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns where the column names are complements. I have a currently have a system that
Gets the complement of a column name
Checks the columns names for the compliment
Adds together the columns if there is a match
Then deletes the compliment column
However, this is slow (checking every column name) and gives different column names based on the ordering of the columns (i.e. deletes different compliment columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC & CGGG|GCCC are compliments).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns():
if compliment(item) in df.columns():
df[item] = df[item] + df[compliment(item)]
del df[compliment(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})

Translate the columns, then assign the columns the translation or original that is sorted first. This allows you to group compliments.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
# AAAA ATTG CGGG
#0 2 3 4
#1 1 7 4
#2 0 0 1
#3 1 2 4

Related

How to group consecutive data in 2d array in python

I have a 2d NumPy array that looks like this:
array([[1, 1],
[1, 2],
[2, 1],
[2, 2],
[3, 1],
[5, 1],
[5, 2]])
and I want to group it and have an output that looks something like this:
Col1 Col2
group 1: 1-2, 1-2
group 2: 3-3, 1-1
group 3: 5-5, 1-2
I want to group the columns based on if they are consecutive.
So, for a unique value In column 1, group data in the second column if they are consecutive between rows. Now for a unique grouping of column 2, group column 1 if it is consecutive between rows.
The result can be thought of as corner points of a grid. In the above example, group 1 is a square grid, group 2 is a a point, and group 3 is a flat line.
My system won't allow me to use pandas so I cannot use group_by in that library but I can use other standard libraries.
Any help is appreciated. Thank you
Here you go ...
Steps are:
Get a list xUnique of unique column 1 values with sort order preserved.
Build a list xRanges of items of the form [col1_value, [col2_min, col2_max]] holding the column 2 ranges for each column 1 value.
Build a list xGroups of items of the form [[col1_min, col1_max], [col2_min, col2_max]] where the [col1_min, col1_max] part is created by merging the col1_value part of consecutive items in xRanges if they differ by 1 and have identical [col2_min, col2_max] value ranges for column 2.
Turn the ranges in each item of xGroups into strings and print with the required row and column headings.
Also package and print as a numpy.array to match the form of the input.
import numpy as np
data = np.array([
[1, 1],
[1, 2],
[2, 1],
[2, 2],
[3, 1],
[5, 1],
[5, 2]])
xUnique = list({pair[0] for pair in data})
xRanges = list(zip(xUnique, [[0, 0] for _ in range(len(xUnique))]))
rows, cols = data.shape
iRange = -1
for i in range(rows):
if i == 0 or data[i, 0] > data[i - 1, 0]:
iRange += 1
xRanges[iRange][1][0] = data[i, 1]
xRanges[iRange][1][1] = data[i, 1]
xGroups = []
for i in range(len(xRanges)):
if i and xRanges[i][0] - xRanges[i - 1][0] == 1 and xRanges[i][1] == xRanges[i - 1][1]:
xGroups[-1][0][1] = xRanges[i][0]
else:
xGroups += [[[xRanges[i][0], xRanges[i][0]], xRanges[i][1]]]
xGroupStrs = [ [f'{a}-{b}' for a, b in row] for row in xGroups]
groupArray = np.array(xGroupStrs)
print(groupArray)
print()
print(f'{"":<10}{"Col1":<8}{"Col2":<8}')
[print(f'{"group " + str(i) + ":":<10}{col1:<8}{col2:<8}') for i, (col1, col2) in enumerate(xGroupStrs)]
Output:
[['1-2' '1-2']
['3-3' '1-1']
['5-5' '1-2']]
Col1 Col2
group 0: 1-2 1-2
group 1: 3-3 1-1
group 2: 5-5 1-2

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe that has duplicated time indices and I would like to get the mean across all for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked pandas documentation and read previous posts on Stackoverflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of how my data frame look like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t
v2
1
-
2
-
3
4.167
4
5
5
6.667
A rough proposal to concatenate 2 copies of the input frame in which values in 't' are replaced respectively by values of 't+1' and 't+2'. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
len = df.shape[0]
incr = pd.DataFrame({'id': [0]*len, 't': [1]*len, 'v1':[0]*len}) # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1] # Drop the days that have no full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling (2 Day), and then drop duplicates (keep the last observation).

Why is this dictionary comprehension so slow? Please suggest way to speed it up

Hi Please help me either: speed up this dictionary compression; offer a better way to do it or gain a higher understanding of why it is so slow internally (like for example is calculation slowing down as the dictionary grows in memory size). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06606|DB06607|DB06608|DB06609',
'DB06606|DB06607',
'DB06608']
}
pd.DataFrame(data = data_ , index=range(0,5) )
I am preforming it in a df with 4000 rows, the column df['column'] contains a string of Ids separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, however, this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, now the comprehension has been running for 24hrs. So maybe it is just the sheer size of the job, but feel like I could speed it up and at this point I want to stop it an re-engineer, however, I am scared that will set me back without much increase in speed, so before I do that wanted to get some thoughts, ideas and suggestions.
Beyond 4000x4000 size I suspect that using the Series and Index Objects is the another problem and that I would be better off using lists, but given the size of the task I am not sure how much speed that will gain and maybe I am better off using some other method such as pd.apply(df, f(write line by line to json)). I am not sure - any help and education appreciated, thanks.
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create dummy ID (this will become True in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
.fillna(0)
.astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col 1 2 3 4 test
idx
1 1 1 0 0 [1, 1, 0, 0]
2 1 1 1 0 [1, 1, 1, 0]
3 0 1 1 0 [0, 1, 1, 0]
4 1 0 0 1 [1, 0, 0, 1]
The output you want can be achieved using dummies. We split the column, stack, and use max to turn it into dummy indicators based on the original index. Then we use reindex to get it in the order you want based on the 'drugbank_id' column.
Finally to get the dictionary you want we will transpose and use to_dict
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
.max(level=0)
.reindex(df['drugbank_id'], axis=1)
.fillna(0, downcast='infer')
.T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0], #Has DB06605, No DB06606, No DB06607, No DB06608, No DB06609
1: [1, 0, 0, 0, 0],
2: [0, 1, 1, 1, 1],
3: [0, 1, 1, 0, 0],
4: [0, 0, 0, 1, 0]}

how to iterate though 5 rows from get Max per row of 4 columns and set value of a different column accordingly

I have a DataFrame, i've made a subset df2 of 4 columns from df1 and create a list of 5 items containing max value from each row. Now depending on which column the max value for that row is in i.e column 1, 2, 3, 4, determines the int label i.e. 1, 2, 3, or 4 in the label column in df1.
The df2 is because some of the other columns not including those 4 have a higher value than the 4 to compare and obviously screws that up. Starting to think it should be a list or series?
code
df1= pd.DataFrame({'x_1': [xvalues[0][0], xvalues[0][1], xvalues[0][2],
xvalues[0][3], xvalues[0][4]],
'x_2': [yvalues[0][0], yvalues[0][1], yvalues[0][2],
yvalues[0][3], yvalues[0][4]],
'True labels': [truelabels[0], truelabels[1],
truelabels[2],truelabels[3], truelabels[4]],
'g11': [classifier1[0][0],classifier1[0][1],
classifier1[0][2],classifier1[0][3],
classifier1[0][4],],
'g12': [classifier1[1][0],classifier1[1][1],
classifier1[1][2],classifier1[1][3],
classifier1[1][4],],
'g13': [classifier1[2][0],classifier1[2][1],
classifier1[2][2],classifier1[2][3],
classifier1[2][4],],
'g14': [classifier1[3][0],classifier1[3][1],
classifier1[3][2],classifier1[3][3],
classifier1[3][4],],
'L1': [2, 5, 6, 7, 8],
'g21': [classifier2[0][0],classifier2[0][1],
classifier2[0][2],classifier2[0][3],
classifier2[0][4],],
'g22': [classifier2[1][0],classifier2[1][1],
classifier2[1][2],classifier2[1][3],
classifier2[1][4],],
'g23': [classifier2[2][0],classifier2[2][1],
classifier2[2][2],classifier2[2][3],
classifier2[2][4],],
'g24': [classifier2[3][0],classifier2[3][1],
classifier2[3][2],classifier2[3][3],
classifier2[3][4],],
'L2': [0, 0, 0, 0, 0],
'g31': [classifier3[0],classifier3[0],
classifier3[0],classifier3[0],
classifier3[0],],
'g32': [classifier3[1][0],classifier3[1][1],
classifier3[1][2],classifier3[1][3],
classifier3[1][4],],
'g33': [classifier3[2][0],classifier3[2][1],
classifier3[2][2],classifier3[2][3],
classifier3[2][4],],
'g34': [classifier3[3][0],classifier3[3][1],
classifier3[3][2],classifier3[3][3],
classifier3[3][4],],
'L3': [0, 0, 0, 0, 0],
'Assigned L':[1, 1, 1, 1,1]}, index =['Datapoint1', 'D2', 'D3',
'D4', 'D5'])
df2= df1[['g11','g12','g13','g14']]
hdf = df2.max(axis = 1)
g11 = df1['g11'].to_list()
g12 = df1['g12'].to_list()
g13 = df1['g13'].to_list()
g14 = df1['g14'].to_list()
for item, label in zip(hdf, table['L1']):
if hdf[item] in g11:
df1['L1'][label] = labels[0]
print(item, label)
elif hdf[item] in g12:
df1['L1'][label] = labels[1]
print(item, label)
elif hdf[item] in g13:
df1['L1'][label] = labels[2]
print(item, label)
elif hdf[item] in g14:
df1['L1'][label] = labels[3]
print(item, label)
I have tried using .loc, .at but when it didn't work I just scrapped it and tried something else, maybe those approaches would be better? This is where I'm at so far.
The error is coming from the for loop for hdf,
The issue i'm having is "cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.0311272081] of <class 'float'>"
I don't think the other values in the data frame are relvant, its just there so people know I have made one. The 5 relavant columns in the dataframe are g11, g12, g13, g14 and L1.

Detect Missing Column Labels in Pandas

I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying create a general function to be able to parse any categorical data following these two rules:
Must have a column labeled class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys
dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
inputX = dataframe.loc[:, columns].as_matrix()
inputY = dataframe.loc[:, ['Class']].as_matrix()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
[1, 2, 1, 1],
[1, 2, 1, 3],
[2, 2, 4, 5]])
inputY = array([['B'],
['L'],
['R'],
['R'],
['R'],
['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
[2, 1, 1],
[2, 1, 3],
[2, 4, 5]])
inputY = array([[1],
[1],
[1],
[2]])
This indicates that it reads label values from right to left instead of left to right, which means that if any data is input into this function that doesn't have the right amount of labels, it's not going to work correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
LW LD RW RD
Class
B 1 1 1 1
L 1 2 1 1
R 1 2 1 3
R 2 2 4 5
That way the assertion check can be:
if "Class" in df.columns:
sys.exit("class must be the first and only the column and number of columns must match all rows")
and then check that the there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
B C
A
1 2 NaN
3 4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...

Categories