I have two NumPy arrays whose top rows serve as column headers. Both arrays share the same columns except for two: arr2 has a different C column (C2 instead of C1) as well as an additional column (C3).
How can I combine all of these columns into a single NumPy array?
import numpy as np
arr1 = [ ['A', 'B', 'C1'], [1, 1, 0], [0, 1, 1] ]
arr2 = [ ['A', 'B', 'C2', 'C3'], [0, 1, 0, 1], [0, 0, 1, 0] ]
a1 = np.array(arr1)
a2 = np.array(arr2)
b = np.append(a1, a2, axis=0)
print(b)
# Desired Result
# A B C1 C2 C3
# 1 1 0 - -
# 0 1 1 - -
# 0 1 - 0 1
# 0 0 - 1 0
NumPy arrays aren't great for handling data with named columns, which might contain different types. Instead, I would use pandas for this. For example:
import pandas as pd
arr1 = [[1, 1, 0], [0, 1, 1] ]
arr2 = [[0, 1, 0, 1], [0, 0, 1, 0] ]
df1 = pd.DataFrame(arr1, columns=['A', 'B', 'C1'])
df2 = pd.DataFrame(arr2, columns=['A', 'B', 'C2', 'C3'])
df = pd.concat([df1, df2], sort=False)
df.to_csv('mydata.csv', index=False)
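This results in a 'dataframe', a spreadsheet-like data structure. Printed, it looks as follows (cells with no data are shown as NaN):
   A  B   C1   C2   C3
0  1  1  0.0  NaN  NaN
1  0  1  1.0  NaN  NaN
0  0  1  NaN  0.0  1.0
1  0  0  NaN  1.0  0.0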
You might notice there's an extra column on the left; this is the "index", which you can think of as row labels. It is left out of the CSV here because of index=False, but if you carry on doing things in the dataframe, you might want to do df = df.reset_index(drop=True) to relabel the rows in a more useful way (after the concat, the labels repeat as 0, 1, 0, 1).
If you want the dataframe back as a NumPy array, you can do df.values (or df.to_numpy() in pandas 0.24 and later) and away you go. The array doesn't carry the column names, though; those stay in df.columns.
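As a minimal sketch of the round trip described above (the names arr and header are only illustrative), the frames can be stacked, relabelled, and converted back like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 1, 0], [0, 1, 1]], columns=['A', 'B', 'C1'])
df2 = pd.DataFrame([[0, 1, 0, 1], [0, 0, 1, 0]], columns=['A', 'B', 'C2', 'C3'])

# Stack the rows; columns missing from one frame are filled with NaN
df = pd.concat([df1, df2], sort=False)

# Relabel the rows 0..3 and discard the old, repeating labels
df = df.reset_index(drop=True)

# Plain NumPy array of the values; the column names stay behind in df.columns
arr = df.to_numpy()
header = list(df.columns)

print(header)
print(arr)
```

Because of the NaN placeholders, the C1, C2 and C3 columns become floats, so the resulting array is a float array rather than an integer one.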
Last thing: if you really want to stay in NumPy-land, then check out structured arrays, which give you another way to name the columns, essentially, in an array. Honestly, since pandas came along, I hardly ever see these in the wild.
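For the curious, here is a small sketch of what a structured array looks like for the first table. The field types are an assumption (64-bit integers), chosen only for illustration:

```python
import numpy as np

# Each field gets a name and a dtype; here all three are 64-bit integers
a1 = np.array([(1, 1, 0), (0, 1, 1)],
              dtype=[('A', 'i8'), ('B', 'i8'), ('C1', 'i8')])

print(a1.dtype.names)  # field names act as column headers
print(a1['A'])         # select a whole column by name
print(a1[0])           # a single row, still carrying its field names
```

Note that this only names the columns of one array; combining two structured arrays with different fields still takes extra work, which is part of why pandas is the more convenient tool here.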
Related
I have a Dataframe as follows:
import pandas as pd
df = pd.DataFrame({'Target': [0 ,1, 2],
'Source': [1, 0, 3],
'Count': [1, 1, 1]})
I have to count how many pairs of Sources and Targets there are. (1,0) and (0,1) should be treated as duplicates, so the count for that pair will be 2.
I need to do it several times as I have 79 nodes in total. Any help will be much appreciated.
import pandas as pd
# instantiate without the 'count' column to start over
In[1]: df = pd.DataFrame({'Target': [0, 1, 2],
'Source': [1, 0, 3]})
Out[1]: Target Source
0 0 1
1 1 0
2 2 3
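Pairs can be counted regardless of their order by converting to a numpy.ndarray and sorting each row, so that (1, 0) and (0, 1) become identical. np.sort returns a sorted copy, which leaves the original DataFrame untouched:
In[1]: import numpy as np
In[2]: array = np.sort(df.to_numpy(), axis=1)
In[3]: array
Out[3]: array([[0, 1],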
[0, 1],
[2, 3]])
And then turn it back to a DataFrame to perform .value_counts():
In[1]: df_sorted = pd.DataFrame(array, columns=['value1', 'value2'])
In[2]: df_sorted.value_counts()
Out[2]: value1 value2
0 1 2
2 3 1
dtype: int64
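The same steps can be put together as a standalone script. This is only a sketch: the column names low and high are made up for illustration, and groupby(...).size() is used in place of DataFrame.value_counts(), which only exists in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Target': [0, 1, 2],
                   'Source': [1, 0, 3]})

# Sort within each row so that (1, 0) and (0, 1) both become (0, 1)
pairs = pd.DataFrame(np.sort(df[['Target', 'Source']].to_numpy(), axis=1),
                     columns=['low', 'high'])

# Count how often each unordered pair occurs
counts = pairs.groupby(['low', 'high']).size()
print(counts)
```

The result is a Series indexed by the (low, high) pair, so an individual count can be looked up with counts.loc[(0, 1)].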
I would like to turn the names of columns into values, in order to create a factor variable whose levels are the column names. I am hoping to get x2 from x1. In R it would be like using the model.matrix() function.
Thank you
x1 = pd.DataFrame({'A': [1,0,0],
'B': [0,1,0],
'C': [0,1,1]})
x2 = pd.DataFrame({'All': ['A','BC','C']})
You can also use list comprehension, as follows:
cols = x1.columns.values
x2 = pd.DataFrame({'All': [''.join(cols[x]) for x in x1.eq(1).values]})
Or simply:
x2 = pd.DataFrame({'All': [''.join(x1.columns[x]) for x in x1.eq(1).values]})
Result:
print(x2)
All
0 A
1 BC
2 C
That's one way, there should be a simpler solution:
x1.astype(bool).apply(lambda row: ''.join(x1.columns[row]), axis=1)
Use the @ (matrix multiplication operator) to multiply the columns vector by the boolean matrix:
import pandas as pd
x1 = pd.DataFrame({'A': [1, 0, 0],
'B': [0, 1, 0],
'C': [0, 1, 1]})
# create result DataFrame
x2 = pd.DataFrame({"all": x1 @ x1.columns})
print(x2)
Output
all
0 A
1 BC
2 C
Hi, please help me do one of the following: speed up this dictionary comprehension; find a better way to do it; or get a better understanding of why it is so slow internally (for example, is the calculation slowing down as the dictionary grows in memory?). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
'DB06606|DB06607|DB06608|DB06609',
'DB06606|DB06607',
'DB06608']
}
pd.DataFrame(data = data_ , index=range(0,5) )
I am performing it on a df with 4000 rows; the column df['column'] contains a string of IDs separated by |. The number of IDs in each row that need splitting varies from 1 to 1000, and this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, but now the comprehension has been running for 24 hours. Maybe it is just the sheer size of the job, but I feel like I could speed it up. At this point I want to stop it and re-engineer, but I am worried that will set me back without much increase in speed, so before I do that I wanted to get some thoughts, ideas and suggestions.
Beyond the 4000x4000 size, I suspect that using the Series and Index objects is another problem and that I would be better off using lists. Given the size of the task, I am not sure how much speed that would gain, and maybe I am better off using some other method such as pd.apply(df, f(write line by line to json)). I am not sure; any help and education is appreciated, thanks.
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create a dummy indicator column (this supplies the 1s in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
.fillna(0)
.astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col 1 2 3 4 test
idx
1 1 1 0 0 [1, 1, 0, 0]
2 1 1 1 0 [1, 1, 1, 0]
3 0 1 1 0 [0, 1, 1, 0]
4 1 0 0 1 [1, 0, 0, 1]
The output you want can be achieved using dummies. We split the column, stack, and use max to turn it into dummy indicators based on the original index. Then we use reindex to get it in the order you want based on the 'drugbank_id' column.
Finally to get the dictionary you want we will transpose and use to_dict
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
.groupby(level=0).max()
.reindex(df['drugbank_id'], axis=1)
.fillna(0).astype(int)
.T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0], #Has DB06605, No DB06606, No DB06607, No DB06608, No DB06609
1: [1, 0, 0, 0, 0],
2: [0, 1, 1, 1, 1],
3: [0, 1, 1, 0, 0],
4: [0, 0, 0, 1, 0]}
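One reason the original comprehension is so slow is that df['column'].str.split("|") sits inside the inner loop, so the whole column is re-split for every single element. As an alternative sketch (not the method used in the answers above), Series.str.get_dummies can split and build the indicator columns in one call. The data is the sample from the question, and the variable names dummies and classes are only illustrative:

```python
import pandas as pd

data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
         'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06606|DB06607|DB06608|DB06609',
                               'DB06606|DB06607',
                               'DB06608']}
df = pd.DataFrame(data_)

# One 0/1 column per distinct ID found anywhere in the '|'-separated strings
dummies = df['drug-interactions'].str.get_dummies(sep='|')

# Keep only the IDs listed in 'drugbank_id', in that order;
# IDs that never occur in any string become all-zero columns
dummies = dummies.reindex(columns=df['drugbank_id'].tolist(), fill_value=0)

# Row label -> list of indicators, one per drugbank_id
classes = dummies.T.to_dict('list')
print(classes)
```

On the sample data this gives the same dictionary as shown above. For 4000 rows the intermediate table is at most a few thousand columns wide, which should be manageable, though this has not been timed on the real data.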
I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns whose names are complements. I currently have a system that:
Gets the complement of a column name
Checks the column names for the complement
Adds the columns together if there is a match
Then deletes the complement column
However, this is slow (it checks every column name) and gives different column names depending on the ordering of the columns (i.e. it deletes different complement columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC and CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns:
    if complement(item) in df.columns:
        df[item] = df[item] + df[complement(item)]
        del df[complement(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate each column name to its complement, then rename every column to whichever of the two (the original or its complement) sorts first. Complementary columns then share a name, which lets you group and sum them.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
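# group on the (now duplicated) column names; transposing avoids groupby(axis=1), which is deprecated in recent pandas
df.T.groupby(level=0).sum().T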
# AAAA ATTG CGGG
#0 2 3 4
#1 1 7 4
#2 0 0 1
#3 1 2 4
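Since the question asks about a dictionary, here is a sketch of the same idea with an explicit key:value mapping from each column name to a canonical name. The names trans, canonical and result are made up for illustration, and "canonical" here simply means whichever of the name and its complement sorts first:

```python
import pandas as pd

df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0],
                   "TTTT": [2, 1, 0, 1]})

# Complement table: A<->T, C<->G
trans = str.maketrans("ATCG", "TAGC")

# Map every column to the alphabetically smaller of itself and its complement,
# so both members of a pair land on the same key regardless of column order
canonical = {col: min(col, col.translate(trans)) for col in df.columns}

# Group the transposed frame by the mapping, sum each group, transpose back
result = df.T.groupby(canonical).sum().T
print(result)
```

Because the canonical name depends only on the sequence itself, the output column names are the same no matter how the input columns are ordered. As in the answer above, TTTT ends up under the name AAAA.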
I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function that can parse any categorical data following these two rules:
Must have a column labeled class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys
dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
inputX = dataframe.loc[:, columns].to_numpy()
inputY = dataframe.loc[:, ['Class']].to_numpy()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
[1, 2, 1, 1],
[1, 2, 1, 3],
[2, 2, 4, 5]])
inputY = array([['B'],
['L'],
['R'],
['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
[2, 1, 1],
[2, 1, 3],
[2, 4, 5]])
inputY = array([[1],
[1],
[1],
[2]])
This indicates that it reads label values from right to left instead of left to right, which means that if any data is input into this function that doesn't have the right amount of labels, it's not going to work correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
LW LD RW RD
Class
B 1 1 1 1
L 1 2 1 1
R 1 2 1 3
R 2 2 4 5
That way the assertion check can be:
if "Class" in df.columns:
    sys.exit("'Class' must be the first column and appear only once, and the number of column labels must match the number of values in every row")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
B C
A
1 2 NaN
3 4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
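A different way to catch both cases, sketched here with the standard csv module instead of pandas, is to compare the number of fields in the header with the number of fields in every data row before loading anything. The helper name check_row_widths is hypothetical, and the in-memory files stand in for Balance_Data.csv:

```python
import csv
import io


def check_row_widths(lines):
    """Return (row_number, width) for each data row whose width differs from the header's."""
    reader = csv.reader(lines)
    header = next(reader)
    return [(n, len(row))
            for n, row in enumerate(reader, start=1)
            if len(row) != len(header)]


# Header and rows agree: five fields each
good = io.StringIO("Class,LW,LD,RW,RD\nB,1,1,1,1\nL,1,2,1,1\n")
# Header is one label short of the data rows
short_header = io.StringIO("Class,LW,LD,RW\nB,1,1,1,1\nL,1,2,1,1\n")

good_mismatches = check_row_widths(good)
bad_mismatches = check_row_widths(short_header)
print(good_mismatches)
print(bad_mismatches)
```

An empty list means the labels line up with the data. Any entries identify the offending rows, which makes for a more specific error message than a NaN check after the fact.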