Replicating rows in a pandas data frame - python

I have the following DataFrame:
N numbers
n1 1,2,3
n2 4,6,2
n4 2,5
....
frequency=[0.45, 0.5, 0.05]
Activ = [ 1, 2, 3]
df = shuffle(df)[:20]
Activs=np.random.choice(Activ , len(df), p=frequency)
df['index']=pd.Series(Activs.tolist())
df_new = df.loc[np.repeat(df.index.values,df.index)]
I want to get a data frame of the type of:
df_new:
N numbers index
n1 1,2,3 3
n1 1,2,3 3
n2 4,6,2 2
n2 4,6,2 2
n2 4,6,2 2
n1 1,2,3 1
n4 2,5 2
....
I get an error: in my frame there is a date value in the index column, and the numbers column is NaN.

I think the index column is not necessary; you can pass the Activs array directly to np.repeat:
df = pd.DataFrame({'numbers': ['1,2,3', '4,6,2', '2,5'], 'N': ['n1', 'n2', 'n4']})
print (df)
N numbers
0 n1 1,2,3
1 n2 4,6,2
2 n4 2,5
frequency=[0.45, 0.5, 0.05]
Activ = [ 1, 2, 3]
df = df[:20]
#for testing
np.random.seed(100)
Activs=np.random.choice(Activ , len(df.index), p=frequency)
print (Activs)
[2 1 1]
df_new = df.loc[np.repeat(df.index,Activs)]
print (df_new)
N numbers
0 n1 1,2,3
0 n1 1,2,3
1 n2 4,6,2
2 n4 2,5
If you do need a new column built from Activs, it is better not to name it index unless really necessary. Here it is named val:
np.random.seed(100)
Activs=np.random.choice(Activ , len(df.index), p=frequency)
print (Activs)
[2 1 1]
df['val'] = Activs
df_new = df.loc[np.repeat(df.index,Activs)]
print (df_new)
N numbers val
0 n1 1,2,3 2
0 n1 1,2,3 2
1 n2 4,6,2 1
2 n4 2,5 1
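The same replication can also be written with Index.repeat, followed by reset_index if duplicated index labels are not wanted. This is a minimal sketch, assuming fixed repeat counts (the hypothetical counts array below stands in for the random Activs draw so the result is deterministic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"N": ["n1", "n2", "n4"],
                   "numbers": ["1,2,3", "4,6,2", "2,5"]})

# Assumption: fixed repeat counts instead of np.random.choice, for a stable result
counts = np.array([2, 1, 1])
df["val"] = counts

# Repeat each index label counts[i] times, select those rows,
# then renumber the rows so the index is unique again
df_new = df.loc[df.index.repeat(counts)].reset_index(drop=True)
print(df_new)
```

Dropping the old index is optional; keeping it shows which original row each copy came from.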

Related

Getting row/column name and integer-based index in a dataframe where a condition holds

Given a dataframe df with a simple Index (not a MultiIndex) - that corresponds to a 2-D real matrix with names for rows and columns - and a boolean expression e over the elements in df, I would like to get:
the name and the integer-based index of the rows
the name and the integer-based index of the columns
of all the elements satisfying the expression e. The expression e is nothing special: I am interested in the rows/columns of the elements greater than a threshold.
After reading the documentation and plenty of questions and answers here, I wrote the code given below. It contains two solutions:
one based on numpy. Basically, I extract the numbers from the dataframe and treat them as a numpy array. This solution seems reasonable: given the basic nature of the task, the code is simple enough.
one based on methods provided by pandas. Even if pandas is designed for more complex scenarios than a simple matrix with numbers, this solution seems way too complex for what I am trying to accomplish.
set up the data
import numpy as np
import pandas as pd
n_rows, n_cols, v = 4, 5, 3
rows = [ "r" + str(i) for i in range(n_rows) ]
columns = [ "c" + str(i) for i in range(n_cols) ]
values = np.zeros( (n_rows, n_cols), dtype=int)
ii = np.random.randint(n_rows, size=(2,))
jj = np.random.randint(n_cols, size=(2,))
poss = zip(ii, jj)
for pos in poss:
    print(f"target set at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
    values[pos] = v + 1
print(" === values ===")
print(values)
df = pd.DataFrame(values, index=rows, columns=columns)
print(" === df === ")
print(df)
with output:
target set at (2, 4) -> (r2, c4)
target set at (1, 0) -> (r1, c0)
=== values ===
[[0 0 0 0 0]
[4 0 0 0 0]
[0 0 0 0 4]
[0 0 0 0 0]]
=== df ===
c0 c1 c2 c3 c4
r0 0 0 0 0 0
r1 4 0 0 0 0
r2 0 0 0 0 4
r3 0 0 0 0 0
solution with numpy
print("\n === USING NUMPY ===")
data = df.to_numpy()
indexes = np.argwhere(data > v)
for ind in indexes:
    print(f"(numpy) target found at {ind} -> ({rows[ind[0]]}, {columns[ind[1]]})")
with output:
=== USING NUMPY ===
(numpy) target found at [1 0] -> (r1, c0)
(numpy) target found at [2 4] -> (r2, c4)
solution with pandas
print("\n === WITH PANDAS ===")
# select the rows with at least one column satisfying the condition
cond = (df > v).any(axis=1)
df2 = df[cond]
print(df2, "\n")
# stack
stacked = df2.stack()
print(stacked, "\n")
# filter (again!)
stacked2 = stacked.loc[stacked>v]
print("indexes in stacked:", stacked2.index.to_list(), "\n")
# get index (it is a MultiIndex at this point)
target_rows = [a for (a, _) in stacked2.index.to_list()]
target_cols = [b for (_, b) in stacked2.index.to_list()]
target_rows_idx = [df.index.get_loc(row_name) for row_name in target_rows]
target_cols_idx = [columns.index(col_name) for col_name in target_cols]
for pos in zip(target_rows_idx, target_cols_idx):
    print(f"(pandas) target found at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
with output:
=== WITH PANDAS ===
c0 c1 c2 c3 c4
r1 4 0 0 0 0
r2 0 0 0 0 4
r1  c0    4
    c1    0
    c2    0
    c3    0
    c4    0
r2  c0    0
    c1    0
    c2    0
    c3    0
    c4    4
dtype: int64
indexes in stacked: [('r1', 'c0'), ('r2', 'c4')]
(pandas) target found at (1, 0) -> (r1, c0)
(pandas) target found at (2, 4) -> (r2, c4)
Is there a simpler way to write the code using only pandas?
Since stack drops NaN values by default, we could mask out the values first then stack (this avoids the need to filter twice). Then just grab the index and use get_loc on both index and columns to convert the labels to integer values:
stacked = df[df > v].stack()
label_idx = stacked.index.tolist()
integer_idx = [(df.index.get_loc(r), df.columns.get_loc(c))
               for r, c in label_idx]
for i, j in zip(integer_idx, label_idx):
    print(f'(pandas 2) target found at {i} -> {j}')
Output:
(pandas 2) target found at (0, 0) -> ('r0', 'c0')
(pandas 2) target found at (1, 4) -> ('r1', 'c4')
stacked:
r0 c0 4.0
r1 c4 4.0
dtype: float64
label_idx:
[('r0', 'c0'), ('r1', 'c4')]
integer_index:
[(0, 0), (1, 4)]
Reproducible with:
np.random.seed(22)
I'd use pd.Series.items() (iteritems() was removed in pandas 2.0):
>>> [x for x, y in df.gt(3).stack().items() if y]
[('r1', 'c3'), ('r2', 'c3')]
For the integer positions:
>>> [(df.index.get_loc(a), df.columns.get_loc(b)) for (a, b), y in df.gt(3).stack().items() if y]
[(1, 3), (2, 3)]
>>>
df in this case:
>>> df
c0 c1 c2 c3 c4
r0 0 0 0 0 0
r1 0 0 0 4 0
r2 0 0 0 4 0
r3 0 0 0 0 0
>>>
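Both answers above go through stack(). As a hedged alternative, the integer positions can be taken from the underlying array with np.nonzero, and the labels recovered by indexing df.index and df.columns with those positions. This sketch mixes NumPy with pandas label lookup rather than being pandas-only, and the frame below is a hand-built stand-in for the question's random data:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame: two cells set above the threshold
df = pd.DataFrame(np.zeros((4, 5), dtype=int),
                  index=[f"r{i}" for i in range(4)],
                  columns=[f"c{j}" for j in range(5)])
df.loc["r1", "c0"] = 4
df.loc["r2", "c4"] = 4
v = 3

# Integer positions of every element satisfying the condition
row_pos, col_pos = np.nonzero(df.to_numpy() > v)

# Labels come from indexing the axes with those positions
row_names = df.index[row_pos]
col_names = df.columns[col_pos]

hits = list(zip(row_pos.tolist(), col_pos.tolist(), row_names, col_names))
for i, j, r, c in hits:
    print(f"target found at ({i}, {j}) -> ({r}, {c})")
```

This avoids calling get_loc once per hit, which may matter when many elements satisfy the condition.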

Creating matrix of 0 and 1 from a string vector in R or python

I want to create a matrix of 0 and 1 from a vector where each string contains the two names I want to map to the matrix. For example, if I have the following vector
vector_matrix <- c("A_B", "A_C", "B_C", "B_D", "C_D")
I would like to transform it into the following matrix
A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
I am open to any suggestion, but a built-in function that handles this would be best. I am trying to do something very similar at a much larger scale, where the resulting matrix will have 25 million cells.
I prefer if the code is R, but doesn't matter if there is some pythonic solution :)
Edit:
So when I say "A_B", I want a "1" in row A column B. It doesn't matter if it is the contrary (column A row B).
Edit:
I would like to have a matrix where its rownames and colnames are the letters.
Create a two-column data frame d from the data, calculate the levels, generate a list in which each column of d is a factor, and finally run table. The second line sorts each row; that isn't needed for the input shown and could be omitted, but you might need it for other data if B_A is to be regarded as A_B.
d <- read.table(text = vector_matrix, sep = "_")
d[] <- t(apply(d, 1, sort))
tab <- table( lapply(d, factor, levels = levels(factor(unlist(d)))) )
tab
giving this table:
V2
V1 A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
heatmap(tab[nrow(tab):1, ], NA, NA, col = 2:3, symm = TRUE)
library(igraph)
g <- graph_from_adjacency_matrix(tab, mode = "undirected")
plot(g)
The following should work in Python. It splits the input data in two lists, converts the characters to indexes and sets the indexes of a matrix to 1.
import numpy as np
vector_matrix = ("A_B", "A_C", "B_C", "B_D", "C_D")
# Split data in two lists
rows, cols = zip(*(s.split("_") for s in vector_matrix))
print(rows, cols)
>>> ('A', 'A', 'B', 'B', 'C') ('B', 'C', 'C', 'D', 'D')
# With inspiration from: https://stackoverflow.com/a/5706787/10603874
row_idxs = np.array([ord(char) - 65 for char in rows])
col_idxs = np.array([ord(char) - 65 for char in cols])
print(row_idxs, col_idxs)
>>> [0 0 1 1 2] [1 2 2 3 3]
# Use one size for both axes so the matrix is square and covers every name
n = max(row_idxs.max(), col_idxs.max()) + 1
print(n)
>>> 4
mat = np.zeros((n, n), dtype=int)
mat[row_idxs, col_idxs] = 1
print(mat)
>>>
[[0 1 1 0]
 [0 0 1 1]
 [0 0 0 1]
 [0 0 0 0]]
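The ord-based conversion above relies on single uppercase letters. A sketch for arbitrary names, assuming each string contains exactly one underscore separating the two names, builds a label-to-position dictionary and returns a square, labelled pandas DataFrame (the names position and adjacency are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

vector_matrix = ["A_B", "A_C", "B_C", "B_D", "C_D"]

# Assumption: exactly one "_" per string, so each split gives two names
pairs = [s.split("_") for s in vector_matrix]

# Every distinct name, sorted, with its position along both axes
labels = sorted({name for pair in pairs for name in pair})
position = {name: i for i, name in enumerate(labels)}

row_idxs = np.array([position[r] for r, _ in pairs])
col_idxs = np.array([position[c] for _, c in pairs])

# int8 keeps a 5000 x 5000 matrix (25 million cells) at roughly 25 MB
mat = np.zeros((len(labels), len(labels)), dtype=np.int8)
mat[row_idxs, col_idxs] = 1

adjacency = pd.DataFrame(mat, index=labels, columns=labels)
print(adjacency)
```

For very sparse data at that scale, a sparse matrix type might be a better fit than a dense array.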

Python: dynamic column sum for each row

I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3) and a column titled 'input' (6 columns in total) and n rows. For each row, I want the index n of the last column for which the running sum x1 + x2 + ... + xn still does not exceed input, i.e. input - (x1 + ... + xn) >= 0.
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5) {
  data[, i] <- tmp$input - rowSums(tmp[, 3:i])
}
output<- apply((data[,3:5]), 1, function(x) max(which(x>0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1  ID2  X1  X2  X3  INPUT  OUTPUT  (explanation)
a    b    1   2   3   3      2       (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6; after 2 sums, INPUT < sum)
a1   a2   5   2   1   4      0       (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8; even for 1 sum, INPUT < sum)
a2   b2   0   4   5   100    3       (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9; even after 3 sums, INPUT > sum)
You can use the pandas module, which handles row-wise operations like this effectively in Python.
import pandas as pd
# Sample data
df = pd.DataFrame([
    ['A', 'B', 1, 3, 4, 0.1],
    ['K', 'L', 10, 3, 14, 0.5],
    ['P', 'H', 1, 73, 40, 0.6]], columns=['ID1', 'ID2', 'X2', 'X3', 'X4', 'INPUT'])
# Note: this takes the row-wise maximum of X2, X3 and X4; it does not compare running sums against INPUT.
df['new_column'] = df[['X2', 'X3', 'X4']].max(axis=1)
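A minimal sketch of the running-sum rule implied by the question's sample data, assuming OUTPUT is the last position n at which X1 + ... + Xn is still less than or equal to INPUT, and 0 when even X1 exceeds it. The approach (cumsum along the row, then the largest passing position) is one possible reading, not the asker's R code translated line by line:

```python
import numpy as np
import pandas as pd

# The sample rows from the question
df = pd.DataFrame(
    [["a", "b", 1, 2, 3, 3],
     ["a1", "a2", 5, 2, 1, 4],
     ["a2", "b2", 0, 4, 5, 100]],
    columns=["ID1", "ID2", "X1", "X2", "X3", "INPUT"])

x_cols = ["X1", "X2", "X3"]  # works for any number M of such columns

# Running sum across the X columns of each row
running = df[x_cols].cumsum(axis=1)

# True where the running sum still fits within INPUT (compared row by row)
within = running.le(df["INPUT"], axis=0).to_numpy()

# 1-based column positions; failing cells contribute 0, so the row maximum
# is the last passing position, or 0 if no position passes
positions = np.arange(1, len(x_cols) + 1)
df["OUTPUT"] = (within * positions).max(axis=1)
print(df)
```

Taking the maximum passing position rather than counting True values keeps the "last time" meaning even if some X values are negative.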

How to remove rows from DataFrame

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inliers). Is there a pythonic way to remove the DataFrame rows whose positions match the elements of the ndarray marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317 0.342051 -0.787938 0.464699
2 -0.965481 0.727143 -0.887149 -0.430592 -2.074865
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
4 0.228538 0.799445 -0.217787 0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
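Because arr is a plain ndarray, the mask arr == 1 is applied by position rather than by index label, so it also works when the frame does not have the default 0..n-1 index. A small sketch with a hypothetical letter index, including an equivalent drop-based form:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index
df = pd.DataFrame({"x": [10, 11, 12, 13, 14]}, index=list("abcde"))
arr = np.array([1, -1, -1, 1, -1])

# The mask must have one entry per row
if len(arr) != len(df):
    raise ValueError("arr must have the same length as df")

# Keep inliers: the boolean ndarray is matched to rows by position
inliers = df[arr == 1]

# Equivalent: look up the labels of the outlier rows and drop them
inliers_alt = df.drop(df.index[arr == -1])

print(inliers)
```

The drop form assumes the index labels are unique; with duplicate labels it would remove every row sharing an outlier's label.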

How to return all opposite pairs in a Pandas DataFrame?

For the dataframe below, how to return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below:
(1) sum of all rows is 0
(2) as there are 3 "1" and 2 "-1" in the original data, the output includes 2 "1" and 2 "-1".
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.
Well, I thought this would take fewer lines (and probably can) but this does work. First just create a couple of new columns to simplify the later syntax:
>>> import numpy as np
>>> df1['abs_a'] = np.abs( df1['a'] )
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
          ones
abs_a a
1     -1     2
       1     3
2     -2     1
       2     2
>>> df3 = df2.groupby(level=0).min()
       ones
abs_a
1         2
2         1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2
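The result above rebuilds the values but loses the original row labels shown in the question's expected output. A hedged sketch that filters the original rows instead: number each occurrence of a value with groupby(...).cumcount(), and keep a row only while the opposite value has enough occurrences to pair with. The names occurrence, opposite_count and paired are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, -2, 2, -1, -1, 1, 1], columns=["a"])

# 0 for the first occurrence of a value, 1 for the second, and so on
occurrence = df1.groupby("a").cumcount()

# For each row, how often the opposite value appears in the column
counts = df1["a"].value_counts()
opposite_count = (-df1["a"]).map(counts).fillna(0).astype(int)

# Keep the k-th occurrence of x only if -x occurs more than k times
paired = df1[occurrence < opposite_count]
print(paired)
```

This keeps the earliest occurrences of each value, which matches the expected output in the question (rows 0, 1, 2, 4, 5, 6).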
