The input data (data.txt) looks like this:
col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
I want the output (result.txt) to look like this:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
I would use pandas in this way
import pandas as pd
# Read your data from a whitespace-separated text file
df = pd.read_csv('data.txt', sep=r'\s+')
# Pivot table
mat = pd.pivot_table(df,index='col1',columns='col2',values='weight')
# Rebuild the index
index = mat.index.union(mat.columns)
# Build the new full matrix and fill NaN values with 0
mat = mat.reindex(index=index, columns=index).fillna(0)
# Make the matrix symmetric
m = mat + mat.T
This returns:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
EDIT: instead of pivot_table() you can also use:
mat = df.pivot(index='col1',columns='col2',values='weight')
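For reference, here is a self-contained sketch of the same steps run on the sample data from the question. Reading from an in-memory string and the variable names `raw`, `labels` and `sym` are assumptions made for illustration. Since every (col1, col2) pair occurs only once here, pivot() is enough; pivot_table() would average duplicate pairs instead of raising an error.

```python
import io
import pandas as pd

# Sample data from the question, whitespace-separated
raw = """col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Each (col1, col2) pair occurs once, so pivot() is sufficient
mat = df.pivot(index="col1", columns="col2", values="weight")

# Use the union of row and column labels so the matrix is square
labels = mat.index.union(mat.columns)
mat = mat.reindex(index=labels, columns=labels).fillna(0)

# Mirror the upper triangle onto the lower one and cast back to integers
sym = (mat + mat.T).astype(int)
print(sym)
```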
Map a, b, c, d to integer positions, then treat col1 as the row index i and col2 as the column index j. Go through the data row by row: for example, for the first row (a b 1), i = 0 and j = 1, so weights(i, j) = 1.
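A minimal sketch of that row-by-row idea, assuming the rows are already available as (col1, col2, weight) tuples. The names `edges`, `pos` and `weights` are made up for this example, and writing each weight to both (i, j) and (j, i) is what makes the result symmetric.

```python
import numpy as np
import pandas as pd

# Rows from the question as (col1, col2, weight) tuples
edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 0),
         ("b", "c", 3), ("b", "d", 0), ("c", "d", 0)]

# Give every label an integer position: a -> 0, b -> 1, ...
labels = sorted({name for u, v, _ in edges for name in (u, v)})
pos = {label: k for k, label in enumerate(labels)}

# Fill the matrix one input row at a time
weights = np.zeros((len(labels), len(labels)), dtype=int)
for u, v, w in edges:
    i, j = pos[u], pos[v]
    weights[i, j] = w
    weights[j, i] = w  # mirror the entry to keep the matrix symmetric

result = pd.DataFrame(weights, index=labels, columns=labels)
print(result)
```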
Related
I have a pandas data frame as follows
A B C D ... Z
and another data frame in which every row has zero or more letters as follows:
Letters
A,C,D
A,B,F
A,H,G
A
B,F
None
I want to match the two dataframes to have something like this
A B C D ... Z
1 0 1 1 0 0
Please include a minimal example and the desired output so the question can be answered.
Example:
import pandas as pd
data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])
df :
letter
0 A,C,D
1 A,B,F
2 A,E,G
3 None
Use get_dummies and groupby:
pd.get_dummies(df['letter'].str.split(',').explode()).groupby(level=0).sum()
output:
A B C D E F G
0 1 0 1 1 0 0 0
1 1 1 0 0 0 1 0
2 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0
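The question asks for one column per letter from A to Z, while the output above only has columns for letters that actually occur. A possible way to get the full set, sketched here on the assumption that the target columns are exactly the 26 uppercase ASCII letters, is to reindex the columns afterwards (the name `full` is just for illustration):

```python
import string
import pandas as pd

df = pd.DataFrame({"letter": ["A,C,D", "A,B,F", "A,E,G", None]})

# Same approach as above: split, explode, one-hot encode, then sum per original row
dummies = (pd.get_dummies(df["letter"].str.split(",").explode())
             .groupby(level=0)
             .sum())

# Add the letters that never occur as all-zero columns (assumes A..Z is the target set)
full = dummies.reindex(columns=list(string.ascii_uppercase), fill_value=0)
print(full)
```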
I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate column Data what I did is:
df['Data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with expand=True, then apply value_counts to each row with axis=1.
Finally, fill the NaN values with zero using fillna and convert the result to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
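A small self-contained sketch of this approach on the sample data, including joining the flags back onto the original column. The names `flags` and `result` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One 0/1 column per distinct item found after splitting on commas
flags = df["Data"].str.get_dummies(sep=",")

# Attach the flag columns to the original frame (the indexes line up)
result = df.join(flags)
print(result)
```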
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
I have two dataframes, "df1" and "df2", and one matrix, "res":
df1   df2
a     a
b     c
c     e
d
There are 4 records in df1 and 3 records in df2, so res is a 4*3 matrix:
                 df2 (index)
                 0    1    2
df1 (index)  0   100  0    0
             1   0    0    0
             2   0    100  0
             3   0    0    0
Based on this matrix, I want the following output as a dataframe:
df1 df2 score
a a 100
a c 0
a e 0
b a 0
b c 0
b e 0
c a 0
c c 100
c e 0
d a 0
d c 0
d e 0
Set the index and column labels of res from df1 and df2:
res.index = df1[:len(res.index)]
res.columns = df2[:len(res.columns)]
And then reshape by DataFrame.melt:
df = res.rename_axis(index='df1', columns='df2').melt(ignore_index=False)
Or DataFrame.stack:
df = res.rename_axis(index='df1', columns='df2').stack().reset_index(name='value')
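To make the reshape concrete, here is a sketch on the data from the question. It assumes `res` is already a DataFrame and that df1 and df2 are available as plain lists of labels; passing name='score' is only there to match the column name the question asks for.

```python
import pandas as pd

# Labels from the question, assumed here to be plain lists
df1 = ["a", "b", "c", "d"]
df2 = ["a", "c", "e"]

# The 4x3 score matrix, labelled by df1 (rows) and df2 (columns)
res = pd.DataFrame([[100, 0, 0],
                    [0, 0, 0],
                    [0, 100, 0],
                    [0, 0, 0]], index=df1, columns=df2)

# stack() walks the matrix row by row, giving one (df1, df2, score) record per cell
pairs = (res.rename_axis(index="df1", columns="df2")
            .stack()
            .reset_index(name="score"))
print(pairs)
```

stack() keeps the row-major order shown in the desired output, whereas melt() returns the same records grouped by column.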
I have a dataframe
A 2
B 4
C 3
and I would like to make a data frame with the following
A 0
A 1
B 0
B 0
B 0
B 1
C 0
C 0
C 1.
So for B, I want to make 4 rows and each one is 0 except for the last one which is 1. Similarly, for A, I'll have 2 rows and the first one has a 0 and the second one has a 1.
In general, if I have a row in the original table with X n, I want to return n rows in the new table with n-1 of them being X 0 and the final one as X 1.
Is there a way to do this in R? Or Python or SQL?
In R, we can use uncount to replicate each row as many times as the second column says, then replace that column with a binary flag that is 1 only for the last occurrence of each value in the first column (using duplicated with fromLast = TRUE)
library(tidyr)
library(dplyr)
df1 %>%
uncount(v2) %>%
mutate(v2 = +(!duplicated(v1, fromLast = TRUE)))
-output
v1 v2
1 A 0
2 A 1
3 B 0
4 B 0
5 B 0
6 B 1
7 C 0
8 C 0
9 C 1
Or in Python
import pandas as pd
df1 = pd.DataFrame({"v1":["A", "B", "C"], "v2": [2, 4, 3]})
df2 = df1.reindex(df1.index.repeat(df1.v2))
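# Flag the last row of each v1 group; keying on v1 stays correct even if two groups share the same count
df2['v2'] = (~df2.duplicated(subset = ['v1'], keep = "last")) + 0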
df2
v1 v2
0 A 0
0 A 1
1 B 0
1 B 0
1 B 0
1 B 1
2 C 0
2 C 0
2 C 1
data
df1 <- structure(list(v1 = c("A", "B", "C"), v2 = c(2L, 4L, 3L)),
class = "data.frame", row.names = c(NA,
-3L))
It's not that hard with base R...
d <- data.frame(x = LETTERS[1:3], n = c(2L, 4L, 3L))
d
## x n
## 1 A 2
## 2 B 4
## 3 C 3
data.frame(x = rep.int(d$x, d$n), i = replace(integer(sum(d$n)), cumsum(d$n), 1L))
## x i
## 1 A 0
## 2 A 1
## 3 B 0
## 4 B 0
## 5 B 0
## 6 B 1
## 7 C 0
## 8 C 0
## 9 C 1
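The same cumsum trick can be sketched in Python with NumPy. This is a rough translation rather than an exact equivalent, and it assumes every count n is at least 1; the names `flag` and `out` are chosen only for this example.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"x": ["A", "B", "C"], "n": [2, 4, 3]})

# Repeat each label n times
x = np.repeat(d["x"].to_numpy(), d["n"].to_numpy())

# The last row of each group sits at 0-based position cumsum(n) - 1
flag = np.zeros(int(d["n"].sum()), dtype=int)
flag[d["n"].cumsum().to_numpy() - 1] = 1

out = pd.DataFrame({"x": x, "i": flag})
print(out)
```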
# load package
library(data.table)
# set as data table
setDT(df)
# work
df1 <- df[rep(seq(.N), b), ][, c := 1:.N, a]
df1[, d := 0][b == c, d := 1][, b := d][, c('c', 'd') := NULL]
There is another way in Python to replicate what @akrun did in R:
>>> from datar.all import f, tibble, uncount, mutate, duplicated, as_integer
>>> df1 = tibble(v1=list("ABC"), v2=[2, 4, 3])
>>> df1 >> uncount(f.v2) >> mutate(
... v2=as_integer(~duplicated(f.v1, from_last=True))
... )
v1 v2
<object> <object>
0 A 0
1 A 1
2 B 0
3 B 0
4 B 0
5 B 1
6 C 0
7 C 0
8 C 1
I am the author of datar, which is backed by pandas.
I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If I have the column names in lists instead of the one-hot format, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
.reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math!
import numpy as np
# df is the one-hot dataframe (named data in the question)
u = np.diag(np.ones(df.shape[1], dtype=bool))
df.T.dot(df) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
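Here is a self-contained version of the same idea on the one-hot frame from the question. It zeroes the diagonal with np.fill_diagonal on a copy of the values instead of multiplying by a mask, which is just one possible variant; the names `cooc` and `adj` are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [0, 0, 1, 0],
                     "b": [1, 1, 1, 0],
                     "c": [0, 1, 0, 1],
                     "d": [1, 1, 1, 0]})

# Entry (x, y) of X^T X counts the rows in which both x and y are 1
cooc = data.T.dot(data)

# The diagonal holds each column's own total, so zero it out on a copy
values = cooc.to_numpy(copy=True)
np.fill_diagonal(values, 0)

adj = pd.DataFrame(values, index=cooc.index, columns=cooc.columns)
print(adj)
```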