How to transform an R dataframe to rows of indicator values - python

I have a dataframe
A 2
B 4
C 3
and I would like to make a data frame with the following
A 0
A 1
B 0
B 0
B 0
B 1
C 0
C 0
C 1
So for B, I want to make 4 rows, each one 0 except for the last one, which is 1. Similarly, for A, I'll have 2 rows: the first has a 0 and the second has a 1.
In general, for a row X n in the original table, I want to return n rows in the new table: n-1 of them being X 0 and the final one being X 1.
Is there a way to do this in R? Or Python or SQL?

In R, we can use uncount to replicate each row by the count in the second column, then rebuild the second column as a binary indicator: duplicated(v1, fromLast = TRUE) is FALSE only on the last row of each group, so negating it and coercing to integer puts a 1 there and a 0 everywhere else.
library(tidyr)
library(dplyr)
df1 %>%
  uncount(v2) %>%
  mutate(v2 = +(!duplicated(v1, fromLast = TRUE)))
Output:
v1 v2
1 A 0
2 A 1
3 B 0
4 B 0
5 B 0
6 B 1
7 C 0
8 C 0
9 C 1
Or in Python:
import pandas as pd
df1 = pd.DataFrame({"v1": ["A", "B", "C"], "v2": [2, 4, 3]})
# repeat each row v2 times (the original index values repeat with the rows)
df2 = df1.reindex(df1.index.repeat(df1.v2))
# mark the last row of each v1 group; subsetting on v1 matches the R logic
# and stays correct even when two groups share the same count
df2['v2'] = (~df2.duplicated(subset=['v1'], keep="last")).astype(int)
df2
v1 v2
0 A 0
0 A 1
1 B 0
1 B 0
1 B 0
1 B 1
2 C 0
2 C 0
2 C 1
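Note that reindex keeps the repeated index values (0, 0, 1, 1, ...); if a clean RangeIndex is preferred, a one-line fix:
df2 = df2.reset_index(drop=True)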
data:
df1 <- structure(list(v1 = c("A", "B", "C"), v2 = c(2L, 4L, 3L)),
                 class = "data.frame", row.names = c(NA, -3L))

It's not that hard with base R...
d <- data.frame(x = LETTERS[1:3], n = c(2L, 4L, 3L))
d
## x n
## 1 A 2
## 2 B 4
## 3 C 3
# repeat each label n times; the indicator is a zero vector with 1s at the
# cumulative end positions, i.e. the last row of each group
data.frame(x = rep.int(d$x, d$n), i = replace(integer(sum(d$n)), cumsum(d$n), 1L))
## x i
## 1 A 0
## 2 A 1
## 3 B 0
## 4 B 0
## 5 B 0
## 6 B 1
## 7 C 0
## 8 C 0
## 9 C 1
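The same positional trick ports to Python; a minimal numpy sketch (not from the original answers), assuming the df1 defined earlier:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"v1": ["A", "B", "C"], "v2": [2, 4, 3]})
# zero vector of total length, with a 1 at the last position of each group
i = np.zeros(df1.v2.sum(), dtype=int)
i[np.cumsum(df1.v2.to_numpy()) - 1] = 1
out = pd.DataFrame({"v1": np.repeat(df1.v1.to_numpy(), df1.v2.to_numpy()), "i": i})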

# load package
library(data.table)
# set as data.table; this answer assumes columns a (group) and b (count)
setDT(df)
# replicate each row b times, then add a within-group row counter c
df1 <- df[rep(seq(.N), b), ][, c := 1:.N, a]
# d is 1 only where the counter reaches the count, i.e. on the last row of each group
df1[, d := 0][b == c, d := 1][, b := d][, c('c', 'd') := NULL]

There is another way in Python to replicate what @akrun did in R:
>>> from datar.all import f, tibble, uncount, mutate, duplicated, as_integer
>>> df1 = tibble(v1=list("ABC"), v2=[2, 4, 3])
>>> df1 >> uncount(f.v2) >> mutate(
... v2=as_integer(~duplicated(f.v1, from_last=True))
... )
v1 v2
<object> <object>
0 A 0
1 A 1
2 B 0
3 B 0
4 B 0
5 B 1
6 C 0
7 C 0
8 C 1
I am the author of datar, which is backed by pandas.

Related

How to create a dataframe based on a matrix?

I have two dataframes, df1 and df2, and one matrix, res:
df1 = [a, b, c, d]
df2 = [a, c, e]
There are 4 records in df1 and 3 records in df2, so res is a 4x3 matrix whose rows are indexed by df1 and whose columns are indexed by df2:
res =
     0    1    2
0  100    0    0
1    0    0    0
2    0  100    0
3    0    0    0
Based on this matrix, I want the following output in the form of a dataframe:
df1 df2 score
a a 100
a c 0
a e 0
b a 0
b c 0
b e 0
c a 0
c c 100
c e 0
d a 0
d c 0
d e 0
Set the index and column names from df1 and df2:
res.index = df1[:len(res.index)]
res.columns = df2[:len(res.columns)]
Then reshape with DataFrame.melt:
df = res.rename_axis(index='df1', columns='df2').melt(ignore_index=False)
Or with DataFrame.stack:
df = res.rename_axis(index='df1', columns='df2').stack().reset_index(name='value')
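A runnable sketch of the stack() variant, assuming df1 and df2 are plain lists of labels and res starts as a numpy array:
import numpy as np
import pandas as pd

df1 = list("abcd")  # row labels
df2 = list("ace")   # column labels
res = np.zeros((4, 3), dtype=int)
res[0, 0] = res[2, 1] = 100
res = pd.DataFrame(res, index=df1, columns=df2)
# name the two axes, then stack the columns into rows
out = res.rename_axis(index='df1', columns='df2').stack().reset_index(name='score')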

Pandas: find indices of rows in each group that meet a certain condition and assign values to those rows

I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
Now I want to group by name_id and assign -1 to rows in groups whose length is 1 (i.e. fewer than 2 rows). My current approach collects the indices and loops:
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
    df.loc[index, 'name_id'] = -1
I am wondering what the best way to do this is. The resulting df should be:
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
An alternative is numpy.where:
import numpy as np

df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
                         -1, df['name_id'])
print (df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Alternatively, to test for duplicates, use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Multiply name_id by -1 for singleton groups and by 1 otherwise (this relies on the original ids being positive):
df.name_id *= (df.groupby('name_id').name.transform(len) == 1).map({True: -1, False: 1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.DataFrame.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c

Drop pandas dataframe rows AND columns in a batch fashion based on value

Background: I have a matrix which represents the distance between two points. In this matrix both rows and columns are the data points. For example:
A B C
A 0 999 3
B 999 0 999
C 3 999 0
In this toy example let's say I want to drop B, because it is far away from every other point. So I first count, for each point, how many of its distances are 999:
df["far_count"] = df[df == 999].count()
and then batch remove the far points (B is 999 away from both other points, so its far_count is 2):
df = df[df["far_count"] != 2]
In this example this looks a bit redundant, but please imagine that I have many data points (on the order of tens of thousands).
The problem with the above batch removal is that it only removes rows; I would like to remove the matching columns at the same time, and it is unclear to me how to do so elegantly. A naive way is to collect the points to drop and loop:
for item in points_to_drop:
    df = df.drop(item, axis=1).drop(item, axis=0)
But I was wondering if there is a better way. (Bonus if we could skip the intermediate far_count step.)
import numpy as np
import pandas as pd

np.random.seed([3, 14159])
idx = pd.Index(list('ABCDE'))
a = np.random.randint(3, size=(5, 5))
# a symmetric toy distance-like matrix with a zero diagonal
df = pd.DataFrame(a.T.dot(a) * (1 - np.eye(5, dtype=int)), idx, idx)
df
A B C D E
A 0 4 2 4 2
B 4 0 1 5 2
C 2 1 0 2 6
D 4 5 2 0 3
E 2 2 6 3 0
l = ['A', 'C']
# one boolean mask works for both axes because rows and columns share labels
m = df.index.isin(l)
df.loc[~m, ~m]
B D E
B 0 5 2
D 5 0 3
E 2 3 0
For your specific case, because the array is symmetric you only need to check one dimension.
m = (df.values == 999).sum(0) == len(df) - 1
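Indexing with the negated mask then drops B from both axes in one step, with no intermediate far_count column (applied to the 3x3 toy frame):
df.loc[~m, ~m]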
Another option is to take the upper triangle, where each pairwise distance appears exactly once, and keep a point if its row or its column is free of 999:
In [66]: x = pd.DataFrame(np.triu(df), df.index, df.columns)
In [67]: x
Out[67]:
A B C
A 0 999 3
B 0 0 999
C 0 0 0
In [68]: mask = x.ne(999).all(1) | x.ne(999).all(0)
In [69]: df.loc[mask, mask]
Out[69]:
A C
A 0 3
C 3 0

Transform relationship data with weights into a matrix in Python

The input data (data.txt) looks like this:
col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
I want the output (result.txt) in this format:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
I would use pandas this way:
import pandas as pd
# Read the whitespace-separated input
df = pd.read_csv('data.txt', sep=r'\s+')
# Pivot table
mat = pd.pivot_table(df, index='col1', columns='col2', values='weight')
# Rebuild the index
index = mat.index.union(mat.columns)
# Build the new full matrix and fill NaN values with 0
mat = mat.reindex(index=index, columns=index).fillna(0)
# Make the matrix symmetric
m = mat + mat.T
This returns:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
EDIT: instead of pivot_table() you can also use pivot(), which requires unique (col1, col2) pairs but skips the implicit aggregation:
mat = df.pivot(index='col1', columns='col2', values='weight')
Alternatively, assign a, b, c, d numeric positions, set col1 = i and col2 = j, and fill the matrix row by row. For example, for the first row, i = 0, j = 1, so weights(i, j) = 1.
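A minimal sketch of that manual fill, assuming df is the three-column frame read above (the label-to-position mapping and variable names are illustrative):
import numpy as np
import pandas as pd

df = pd.read_csv('data.txt', sep=r'\s+')          # columns: col1, col2, weight
labels = sorted(set(df.col1) | set(df.col2))      # ['a', 'b', 'c', 'd']
pos = {lab: k for k, lab in enumerate(labels)}    # a=0, b=1, c=2, d=3
weights = np.zeros((len(labels), len(labels)), dtype=int)
for _, (c1, c2, w) in df.iterrows():
    weights[pos[c1], pos[c2]] = weights[pos[c2], pos[c1]] = w  # symmetric fill
out = pd.DataFrame(weights, index=labels, columns=labels)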

Pandas - Remove duplicates across multiple columns

I am trying to efficiently remove duplicates in Pandas in which duplicates are inverted across two columns. For example, in this data frame:
import pandas as pd
key = pd.DataFrame({'p1': ['a','b','a','a','b','d','c'],
                    'p2': ['b','a','c','d','c','a','b'],
                    'value': [1,1,2,3,5,3,5]})
df = pd.DataFrame(key, columns=['p1','p2','value'])
print(df)
p1 p2 value
0 a b 1
1 b a 1
2 a c 2
3 a d 3
4 b c 5
5 d a 3
6 c b 5
I would want to remove rows 1, 5 and 6, leaving me with just:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
Thanks in advance for ideas on how to do this.
Reorder the p1 and p2 values so they appear in a canonical order:
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
yields
In [149]: df
Out[149]:
p1 p2 value first second
0 a b 1 a b
1 b a 1 a b
2 a c 2 a c
3 a d 3 a d
4 b c 5 b c
5 d a 3 a d
6 c b 5 b c
Then you can drop_duplicates:
df = df.drop_duplicates(subset=['value', 'first', 'second'])
import pandas as pd
key = pd.DataFrame({'p1': ['a','b','a','a','b','d','c'],
                    'p2': ['b','a','c','d','c','a','b'],
                    'value': [1,1,2,3,5,3,5]})
df = pd.DataFrame(key, columns=['p1','p2','value'])
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]
yields
In [151]: df
Out[151]:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
