I have a 2d NumPy array that looks like this:
array([[1, 1],
       [1, 2],
       [2, 1],
       [2, 2],
       [3, 1],
       [5, 1],
       [5, 2]])
and I want to group it and have an output that looks something like this:
          Col1  Col2
group 1:  1-2,  1-2
group 2:  3-3,  1-1
group 3:  5-5,  1-2
I want to group the columns based on whether they are consecutive.
That is, for each unique value in column 1, group the values in the second column if they are consecutive between rows. Then, for each unique grouping of column 2, group column 1 if it is consecutive between rows.
The result can be thought of as the corner points of a grid. In the above example, group 1 is a square grid, group 2 is a point, and group 3 is a flat line.
My system won't allow me to use pandas, so I cannot use groupby from that library, but I can use other standard libraries.
Any help is appreciated. Thank you
Here you go ...
Steps are:
Get a list xUnique of unique column 1 values with sort order preserved.
Build a list xRanges of items of the form [col1_value, [col2_min, col2_max]] holding the column 2 ranges for each column 1 value.
Build a list xGroups of items of the form [[col1_min, col1_max], [col2_min, col2_max]] where the [col1_min, col1_max] part is created by merging the col1_value part of consecutive items in xRanges if they differ by 1 and have identical [col2_min, col2_max] value ranges for column 2.
Turn the ranges in each item of xGroups into strings and print with the required row and column headings.
Also package and print as a numpy.array to match the form of the input.
import numpy as np

data = np.array([
    [1, 1],
    [1, 2],
    [2, 1],
    [2, 2],
    [3, 1],
    [5, 1],
    [5, 2]])

# Unique column 1 values in ascending order (a bare set does not guarantee order)
xUnique = sorted({pair[0] for pair in data})
# One [col2_min, col2_max] placeholder per column 1 value
xRanges = list(zip(xUnique, [[0, 0] for _ in range(len(xUnique))]))
rows, cols = data.shape
iRange = -1
for i in range(rows):
    if i == 0 or data[i, 0] > data[i - 1, 0]:
        # New column 1 value: start its column 2 range
        iRange += 1
        xRanges[iRange][1][0] = data[i, 1]
    # Every row moves the upper end of the current range
    xRanges[iRange][1][1] = data[i, 1]

xGroups = []
for i in range(len(xRanges)):
    if i and xRanges[i][0] - xRanges[i - 1][0] == 1 and xRanges[i][1] == xRanges[i - 1][1]:
        # Consecutive column 1 value with an identical column 2 range: extend the group
        xGroups[-1][0][1] = xRanges[i][0]
    else:
        xGroups += [[[xRanges[i][0], xRanges[i][0]], xRanges[i][1]]]

xGroupStrs = [[f'{a}-{b}' for a, b in row] for row in xGroups]
groupArray = np.array(xGroupStrs)
print(groupArray)
print()
print(f'{"":<10}{"Col1":<8}{"Col2":<8}')
for i, (col1, col2) in enumerate(xGroupStrs):
    print(f'{"group " + str(i) + ":":<10}{col1:<8}{col2:<8}')
Output:
[['1-2' '1-2']
 ['3-3' '1-1']
 ['5-5' '1-2']]

          Col1    Col2
group 0:  1-2     1-2
group 1:  3-3     1-1
group 2:  5-5     1-2
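For comparison, the same steps can be sketched without NumPy, using only itertools.groupby from the standard library. This is a rough sketch rather than a drop-in replacement: the function name group_ranges is made up for illustration, and it assumes (as the steps above do) that the column 2 values belonging to one column 1 value are themselves consecutive.

```python
from itertools import groupby


def group_ranges(pairs):
    """Merge (col1, col2) pairs into [[col1_min, col1_max], (col2_min, col2_max)] groups.

    Hypothetical helper; assumes the col2 values for each col1 value are consecutive.
    """
    # Step 1: one (col2_min, col2_max) range per distinct col1 value
    ranges = []
    for value, rows in groupby(sorted(pairs), key=lambda p: p[0]):
        col2 = [r[1] for r in rows]
        ranges.append((value, (min(col2), max(col2))))

    # Step 2: merge neighbouring col1 values that differ by 1 and share a col2 range
    groups = []
    for value, col2_range in ranges:
        if groups and value - groups[-1][0][1] == 1 and groups[-1][1] == col2_range:
            groups[-1][0][1] = value
        else:
            groups.append([[value, value], col2_range])
    return groups


data = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (5, 1), (5, 2)]
groups = group_ranges(data)
for n, ((a, b), (c, d)) in enumerate(groups, start=1):
    print(f"group {n}: {a}-{b}, {c}-{d}")
# group 1: 1-2, 1-2
# group 2: 3-3, 1-1
# group 3: 5-5, 1-2
```

Numbering the groups from 1 here matches the layout requested in the question.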
I want to randomly assign 20 people to 4 tables to do 4 tasks without repetition, and each person must be at each table only once.
There are five people at each table.
There are four tables.
Five people must evenly rotate each table.
import random

# generate 1 to 20
members_list = [x for x in range(1, 21)]
# assign to 4 groups
chunks = [members_list[x:x + 5] for x in range(0, len(members_list), 5)]
print(chunks)

final_result = []
count = 0
start_assign = 4
# generate a new random list
while start_assign:
    new_member_list = [x for x in range(1, 21)]
    random.shuffle(new_member_list)
    print(f"random List {start_assign}: {new_member_list} \n")
    for i in chunks:
        result = []
        count += 1
        print(f"The Original List {count}: {i}")
        for x in new_member_list:
            if x not in i:
                result.append(x)
        result = result[:5]
        new_member_list = [x for x in new_member_list if x not in result]
        result.sort()
        if len(result) < 5:
            result.extend(new_member_list)
        print(f"Second List Result {count}: {result}")
        result.extend(i)
        final_result.append(result)
        print(f"combine with the previous list: {count}: {final_result}\n")
    chunks = []
    chunks = final_result
    start_assign -= 1
print(f"Final new list: {final_result}")
Let's say that you have a 20x4 matrix mapping people to tables. Rows are people, columns are iterations of the process. Each row contains some permutation of the numbers 0 through 3 to indicate the order in which each person traverses the tables. Each column contains each table number exactly five times, meaning that at each iteration, all tables are filled evenly. Here is a sample starting configuration:
np.array([
    [0, 1, 2, 3],  # Person 0 goes to tables 0, 1, 2, 3, in that order
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [1, 2, 3, 0],  # Person 5 goes to tables 1, 2, 3, 0, in that order
    [1, 2, 3, 0],
    [1, 2, 3, 0],
    [1, 2, 3, 0],
    [1, 2, 3, 0],
    [2, 3, 0, 1],
    [2, 3, 0, 1],
    [2, 3, 0, 1],
    [2, 3, 0, 1],
    [2, 3, 0, 1],
    [3, 0, 1, 2],
    [3, 0, 1, 2],
    [3, 0, 1, 2],
    [3, 0, 1, 2],
    [3, 0, 1, 2],
])
This is a valid configuration. Any configuration with 5 of each element in a column and unique elements in each row is a valid configuration. One way to move to another valid configuration is to swap any of the rows: two people will swap schedules, but the constraint on the rows and columns will not change. Similarly, you can swap any two columns: the order of visits will change, but everything will remain valid. Shuffling entire rows and columns like this is not very interesting: groups of five people will end up sticking together for all the steps, even if their paths will be somewhat randomized.
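To make the validity rule concrete, here is a small check written with plain lists. It is only a sketch: the name is_valid and its arguments are chosen for illustration and do not come from any library.

```python
def is_valid(paths, n_tables):
    """Return True if every row is a permutation of the table numbers and
    every column holds each table number equally often (hypothetical helper)."""
    per_table = len(paths) // n_tables
    rows_ok = all(sorted(row) == list(range(n_tables)) for row in paths)
    cols_ok = all(
        col.count(t) == per_table
        for col in zip(*paths)          # zip(*paths) iterates over columns
        for t in range(n_tables)
    )
    return rows_ok and cols_ok


# Same starting configuration as the array above, built with lists
start = [[(t + s) % 4 for s in range(4)] for t in range(4) for _ in range(5)]
print(is_valid(start, 4))  # True

# Swapping two whole rows keeps the configuration valid
swapped = [row[:] for row in start]
swapped[0], swapped[19] = swapped[19], swapped[0]
print(is_valid(swapped, 4))  # True

# Swapping two elements inside one row breaks two columns
broken = [row[:] for row in start]
broken[0][0], broken[0][3] = broken[0][3], broken[0][0]
print(is_valid(broken, 4))  # False
```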
More generally, you can swap a single pair of elements in a given row or column. That will break the validity of the configuration, so you will have to swap other elements a bunch of times to restore it. Here is an example of what happens if you swap the first and last element of the first row:
3, 1, 2, 0
Now you need to find another row that starts with 3 and swap the 3 and the 0 in that row. That will restore the first column, but will possibly invalidate the column that now contains the 3. So you find another 3 elsewhere in the new column, and swap with 0 in the row that contains the 3. You keep doing that until the 3 ends up in the last column for some row, and order is restored.
You can implement something like a Fisher-Yates shuffle for each row, with the added step of resolving the conditions as you go. Here is a sample implementation. I'm using numpy arrays for storage and indexing convenience, but you can do this with lists just as well:
import numpy as np

P = 20
M = 4
N = P // M
idx = np.arange(M)

# Start with a valid configuration, same as the array written out above
paths = ((idx + idx[:, None]) % M).repeat(N, axis=0)

for i in range(P):
    # Shuffle each row
    for j in range(M - 1):
        swap = j + np.random.choice(M - j)
        if swap == j:
            continue
        n0 = paths[i, j]     # First number to swap
        n1 = paths[i, swap]  # Second number to swap
        # fix up the consequences
        ix = swap  # Index of the other column
        while j != swap:
            # Swap the elements in the previous row
            paths[i, [j, ix]] = paths[i, [ix, j]]
            # Find a new row with n1 at column j
            i = np.random.choice(np.flatnonzero(paths[:, j] == n1))
            # Find the location of n0 in the new row
            ix = j
            j = np.flatnonzero(paths[i] == n0)[0]
        paths[i, [j, ix]] = paths[i, [ix, j]]
        # Check the results:
        assert (np.sort(paths, axis=1) == idx).all(None)
        assert (np.sum(paths, axis=0) == N * M * (M - 1) // 2).all()
Comments in the code should help you follow along. I can't prove that the solution is unbiased, but it seems pretty good (and fast) from the cursory tests I ran. The assertions at the end can be moved out of the loop to check the final result, or removed entirely if you are confident in the implementation.
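Since the same idea works with plain lists, here is a rough list-based sketch of it. The function name shuffle_paths and the seed parameter are introduced here for illustration. One deliberate difference from the NumPy version: the working row and column are copied into local variables, so each inner iteration starts from the outer loop's row. Like the original, this makes no claim about being unbiased.

```python
import random


def shuffle_paths(n_people=20, n_tables=4, seed=None):
    """List-based sketch of the swap-and-repair shuffle (hypothetical helper)."""
    rng = random.Random(seed)
    per_table = n_people // n_tables
    # Valid starting configuration: blocks of identical cyclic schedules
    paths = [[(t + s) % n_tables for s in range(n_tables)]
             for t in range(n_tables) for _ in range(per_table)]

    for row in range(n_people):
        for col in range(n_tables - 1):
            i, j = row, col
            swap = rng.randrange(j, n_tables)
            if swap == j:
                continue
            n0, n1 = paths[i][j], paths[i][swap]
            ix = swap  # index of the other column
            while j != swap:
                # Swap the two entries in the current row
                paths[i][j], paths[i][ix] = paths[i][ix], paths[i][j]
                # Pick a row that has n1 in column j
                i = rng.choice([r for r in range(n_people) if paths[r][j] == n1])
                # Locate n0 in that row
                ix = j
                j = paths[i].index(n0)
            # Final swap restores the column counts
            paths[i][j], paths[i][ix] = paths[i][ix], paths[i][j]
    return paths


paths = shuffle_paths(seed=0)
for person, schedule in enumerate(paths):
    print(person, schedule)
```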
With the conditions as you've specified, there are 6 possible paths for each person starting at a given table (3 other tables to visit). Since there are 5 people starting at each table, it works out that in almost all cases, one or more pairs of folks starting at a given table will still end up on the same path. I'm not sure if this just happens because it's likely (~91% for M=4, N=5 without priors [see here]), or because there is something in the conditions that effectively requires it.
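The ~91% figure can be reproduced with a birthday-problem style calculation. The sketch below assumes each of the five people independently picks one of the six paths uniformly at random, which is a simplification of what the shuffle actually does; the function name collision_probability is made up for this example.

```python
from math import perm


def collision_probability(n_people, n_paths):
    """Chance that at least two people share a path, assuming uniform,
    independent choices (an idealised model, not the shuffle itself)."""
    if n_people > n_paths:
        return 1.0
    # perm(n_paths, n_people) counts the ways for everyone to pick distinct paths
    return 1 - perm(n_paths, n_people) / n_paths ** n_people


p = collision_probability(5, 6)
print(f"{p:.4f}")  # 0.9074
```

Here 6 * 5 * 4 * 3 * 2 = 720 of the 6**5 = 7776 equally likely assignments have no shared path, which leaves about 90.7%.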
I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns where the column names are complements. I currently have a system that
Gets the complement of a column name
Checks the column names for the complement
Adds together the columns if there is a match
Then deletes the complement column
However, this is slow (checking every column name) and gives different column names based on the ordering of the columns (i.e. deletes different complement columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC & CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns:
    if compliment(item) in df.columns:
        df[item] = df[item] + df[compliment(item)]
        del df[compliment(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2], "CGGG": [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate the column names, then relabel each column with whichever of the original name and its translation sorts first. This allows you to group complements.
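import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
# groupby(axis=1) is deprecated in recent pandas releases;
# df.T.groupby(level=0).sum().T gives the same result there
df.groupby(level=0, axis=1).sum()
#    AAAA  ATTG  CGGG
# 0     2     3     4
# 1     1     7     4
# 2     0     0     1
# 3     1     2     4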
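Since the question asks about a dictionary-based approach, the same canonical-name idea can be sketched with a plain dict of column lists. This is an illustration only: merge_complements is a made-up name, and the data is copied from the example frame rather than read from a DataFrame.

```python
# Complement table for the four bases
TRANS = str.maketrans("ATCG", "TAGC")


def merge_complements(columns):
    """Sum columns whose names are complements (hypothetical helper).

    Each column is filed under whichever of its name and its complement
    sorts first, so the result does not depend on column order.
    """
    merged = {}
    for name, counts in columns.items():
        key = min(name, name.translate(TRANS))
        if key in merged:
            merged[key] = [a + b for a, b in zip(merged[key], counts)]
        else:
            merged[key] = list(counts)
    return merged


columns = {
    "ATTG": [3, 6, 0, 1],
    "CGGG": [0, 2, 1, 4],
    "TAAC": [0, 1, 0, 1],
    "GCCC": [4, 2, 0, 0],
    "TTTT": [2, 1, 0, 1],
}
merged = merge_complements(columns)
print(merged)
# {'ATTG': [3, 7, 0, 2], 'CGGG': [4, 4, 1, 4], 'AAAA': [2, 1, 0, 1]}
```

As in the pandas version, TTTT ends up labelled AAAA because that name sorts first.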
Suppose I create a pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, this can generate the below:
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute value terms. For example, for the first row, I would expect [0,3,4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use argsort along axis=1 on the absolute values.
Given dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
            0          1          2          3          4
0   17.640523   4.001572   9.787380  22.408932  18.675580
1   -9.772779   9.500884  -1.513572  -1.032189   4.105985
2    1.440436  14.542735   7.610377   1.216750   4.438632
3    3.336743  14.940791  -2.051583   3.130677  -8.540957
4  -25.529898   6.536186   8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
       [0, 1, 4],
       [1, 2, 4],
       [1, 4, 0],
       [0, 4, 2]])
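If only the top n indices are needed, a full sort is not strictly required. As a small illustration outside pandas, the standard library's heapq.nlargest can select them per row; the helper name top_n_abs_indices is made up here, and the two rows are copied from the dataset above.

```python
import heapq


def top_n_abs_indices(row, n):
    """Indices of the n largest values by absolute value, largest first
    (hypothetical helper for illustration)."""
    return heapq.nlargest(n, range(len(row)), key=lambda i: abs(row[i]))


rows = [
    [17.640523, 4.001572, 9.787380, 22.408932, 18.675580],
    [-9.772779, 9.500884, -1.513572, -1.032189, 4.105985],
]
result = [top_n_abs_indices(row, 3) for row in rows]
print(result)  # [[3, 4, 0], [0, 1, 4]]
```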
Try this (this is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # Rank by absolute value, as the question asks
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end you will have a dictionary with:
the index of the row as key
the indices of the n largest absolute values of that row as value
I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function to be able to parse any categorical data following these two rules:
Must have a column labeled class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys

dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
    sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
    sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
# .values replaces as_matrix(), which was removed in pandas 1.0
inputX = dataframe.loc[:, columns].values
inputY = dataframe.loc[:, ['Class']].values
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This indicates that it reads label values from right to left instead of left to right, which means that if any data passed to this function doesn't have the right number of labels, it's not going to work correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)

In [12]: df
Out[12]:
       LW  LD  RW  RD
Class
B       1   1   1   1
L       1   2   1   1
R       1   2   1   3
R       2   2   4   5
That way the assertion check can be:
if "Class" in df.columns:
    sys.exit("'Class' must be the first column and appear only once, and the number of column labels must match every row")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)

In [33]: df
Out[33]:
   B   C
A
1  2 NaN
3  4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
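If you would rather catch the mismatch before pandas gets involved, one option is to compare the header length with each row's length using the standard csv module. This is only a sketch: header_matches_rows is a made-up name, and the inline strings stand in for the contents of a file such as Balance_Data.csv.

```python
import csv
import io


def header_matches_rows(text):
    """Return True if every non-empty data row has as many fields as the header
    (hypothetical helper; takes the CSV contents as a string)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return all(len(row) == len(header) for row in reader if row)


good = "Class,LW,LD,RW,RD\nB,1,1,1,1\nL,1,2,1,1\nR,1,2,1,3\nR,2,2,4,5\n"
missing_label = "Class,LW,LD,RW\nB,1,1,1,1\nL,1,2,1,1\nR,1,2,1,3\nR,2,2,4,5\n"

print(header_matches_rows(good))           # True
print(header_matches_rows(missing_label))  # False
```

For a real file, the same check could read from open(path, newline='') instead of io.StringIO.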
The question asks, given a list of lists like the following, to return a tuple that contains, for each column, a tuple of the run lengths of the number 2. If there are X consecutive 2s, the inner tuple should contain a single element X for that run, like this:
[[1, 2, 2, 1],
[2, 1, 1, 2],
[1, 1, 2, 2]]
Gives
((1,), (1,), (1, 1), (2,))
While
[[2, 2, 2, 2],
[2, 1, 2, 2],
[2, 2, 1, 2]]
Gives
((3,), (1, 1), (2,), (3,))
What about the same thing, but this time for the rows instead of the columns? Is there a "one-line" method to do it? I mean:
[[1, 2, 2, 1],
[2, 1, 1, 2],
[1, 1, 2, 2]]
Gives
((2,), (1, 1), (2,))
While
[[2, 2, 2, 2],
[2, 1, 2, 2],
[2, 2, 1, 2]]
Gives
((4,), (1, 2), (2, 1))
I have tried a few things. This is one attempt, which I can't finish and don't know how to continue:
l = [[2,2,2],[2,2,2],[2,2,2]]
t = (((1,1),(2,),(2,)),((2,),(2,),(1,1)))
if [x.count(0) for x in l] == [0 for x in l]:
    espf = []*len(l)
    espf2 = []
    espf_atual = 0
    contador = 0
    for x in l:
        for c in x:
            celula = x[c]
            if celula == 2:
                espf_atual += 1
            else:
                if celula == 1:
                    espf[contador] = [espf_atual]
                    contador += 1
                    espf_atual = 0
        espf2 += [espf_atual]
        espf_atual = 0
    print(tuple(espf2))
output
(3, 3, 3)
This output is correct, but if I change the list l it doesn't work.
So, you have some mistakes in the code.
Indexing:
for c in x:
    celula = x[c]
It should be celula = c as c already points to each element of x.
Intermediate results
For each row you store intermediate results as:
espf_atual = 0
...
espf_atual += 1
...
espf2 += [espf_atual]
but this only stores the last run of 2s in each row. That is, if a row is [2,1,2,2], then espf_atual = 2 at the end and you store only the last run. The first run (before the 1) is overwritten.
To avoid this, you need to store intermediate results for each row. You got it halfway with espf = []*len(l), but you never used it properly later.
Find below a working example (not too different from your initial solution):
espf = []
for x in l:
    # Restart counters for every row
    espf_current = []  # Will store any sequences of 2
    contador = 0  # Will count consecutive 2's
    for c in x:
        celula = c
        if celula == 2:
            contador += 1  # Count number of 2
        elif celula == 1:
            if contador > 0:  # Store any 2 before 1
                espf_current += [contador]
            contador = 0
    if contador > 0:  # Check if the row ends in 2
        espf_current += [contador]
    # Store results of this row in the final results
    espf += [tuple(espf_current)]
print(tuple(espf))
The key to switching between rows and columns is to change the indexing method. Currently you iterate directly over the elements of the list, which doesn't allow you to switch between rows and columns.
Another way to see the iteration is to iterate indexes of the matrix (i, j for rows and columns) as follows:
numRows = len(l)
numCols = len(l[0])
for i in range(numRows):
    for j in range(numCols):
        celula = l[i][j]
The above is the indexing equivalent to the previous code. It assumes all the rows have the same length (which is true in your examples). Changing it from rows to columns is straightforward (tip: switch the loops), I leave it to you :P
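On the "one-line" part of the question, a compact alternative is to let itertools.groupby find the runs and to use zip(*l) to walk columns instead of rows. This is a sketch of a different technique from the index-based loop above, and the name runs_of_twos is made up for illustration.

```python
from itertools import groupby


def runs_of_twos(lines):
    """For each line, return a tuple with the lengths of its runs of 2s
    (hypothetical helper)."""
    return tuple(
        tuple(len(list(run)) for value, run in groupby(line) if value == 2)
        for line in lines
    )


grid = [[1, 2, 2, 1],
        [2, 1, 1, 2],
        [1, 1, 2, 2]]

by_rows = runs_of_twos(grid)
by_cols = runs_of_twos(zip(*grid))  # zip(*grid) yields the columns
print(by_rows)  # ((2,), (1, 1), (2,))
print(by_cols)  # ((1,), (1,), (1, 1), (2,))
```

Like the loop version, this assumes all rows have the same length, since zip stops at the shortest row.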