Pandas / Numpy: How to Turn Column Data Into Sparse Matrix - python

I'm working on an IPython project with pandas and NumPy. I'm still learning, so this question is probably pretty basic. Let's say I have two columns of data:
---------------
| col1 | col2 |
---------------
| a | b |
| c | d |
| b | e |
---------------
I want to transform this data into the following form:
---------------------
| a | b | c | d | e |
---------------------
| 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 |
---------------------
Then I want to take a three column version
---------------------
| col1 | col2 | val |
---------------------
| a | b | .5 |
| c | d | .3 |
| b | e | .2 |
---------------------
and turn it into
---------------------------
| a | b | c | d | e | val |
---------------------------
| 1 | 1 | 0 | 0 | 0 | .5 |
| 0 | 0 | 1 | 1 | 0 | .3 |
| 0 | 1 | 0 | 0 | 1 | .2 |
---------------------------
I'm very new to Pandas and Numpy, how would I do this? What functions would I use?

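I think you're looking for the pandas.get_dummies() function and a way to add the resulting frames. The original answer used pandas.DataFrame.combineAdd, which was deprecated and later removed from pandas; df1.add(df2, fill_value=0) does the same job.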
In [7]: df = pd.DataFrame({'col1': list('acb'),
   ...:                    'col2': list('bde'),
   ...:                    'val': [.5, .3, .2]})
In [8]: df1 = pd.get_dummies(df.col1)
In [9]: df2 = pd.get_dummies(df.col2)
This produces the following two dataframes:
In [16]: df1
Out[16]:
a b c
0 1 0 0
1 0 0 1
2 0 1 0
[3 rows x 3 columns]
In [17]: df2
Out[17]:
b d e
0 1 0 0
1 0 1 0
2 0 0 1
[3 rows x 3 columns]
Which can be combined as follows:
In [10]: dummies = df1.add(df2, fill_value=0).astype(int)  # combineAdd(df2) in old pandas releases
In [18]: dummies
Out[18]:
a b c d e
0 1 1 0 0 0
1 0 0 1 1 0
2 0 1 0 0 1
[3 rows x 5 columns]
The last step is to copy the val column into the new dataframe.
In [19]: dummies['val'] = df.val
In [20]: dummies
Out[20]:
a b c d e val
0 1 1 0 0 0 0.5
1 0 0 1 1 0 0.3
2 0 1 0 0 1 0.2
[3 rows x 6 columns]
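The same result can be sketched without building one dummy frame per column, which helps when there are more than two label columns. This is only a sketch under a few assumptions: the column names are the ones from the question, `label_cols` is a name introduced here for illustration, and `dtype=int` is passed because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

df = pd.DataFrame({"col1": list("acb"),
                   "col2": list("bde"),
                   "val": [0.5, 0.3, 0.2]})

label_cols = ["col1", "col2"]  # hypothetical name: the columns holding labels

# stack() turns the label columns into one long Series indexed by (row, column).
stacked = df[label_cols].stack()

# One indicator column per distinct label, then collapse back to one row per
# original row by summing over the first index level.
dummies = pd.get_dummies(stacked, dtype=int).groupby(level=0).sum()

# A label that appears in both columns of the same row would sum to 2;
# clip keeps the result a 0/1 indicator.
dummies = dummies.clip(upper=1)

# The indexes still match the original rows, so plain assignment aligns.
dummies["val"] = df["val"]
print(dummies)
```

For genuinely large data, the indicator part could then be handed to a sparse container, but that step is not shown here.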

Related

Pandas transform columns into counts grouped by ID

I have a dataframe like this:
| |ID |sex|est|
| 0 |aaa| M | S |
| 1 |aaa| M | C |
| 2 |aaa| F | D |
| 3 |bbb| F | D |
| 4 |bbb| M | C |
| 5 |ccc| F | C |
I need to change it to this:
| |ID | M | F | S | C | D |
| 0 |aaa| 2 | 1 | 1 | 1 | 1 |
| 1 |bbb| 1 | 1 | 0 | 1 | 1 |
| 2 |ccc| 0 | 1 | 0 | 1 | 0 |
For each unique ID, I need to count how many times each value occurs, but I can't do it manually because there are too many rows and columns.
Try this:
out = (df
       .set_index('ID')
       .stack()
       .str.get_dummies()
       .groupby(level=0)
       .sum()
       .reset_index()
      )
print(out)
ID C D F M S
0 aaa 1 1 1 2 1
1 bbb 1 1 1 1 0
2 ccc 1 0 1 0 0
Use pd.get_dummies directly, to avoid the stack step, before computing on the groupby:
(pd
 .get_dummies(
     df,
     columns=['sex', 'est'],
     prefix_sep='',
     prefix='')
 .groupby('ID', as_index=False)
 .sum()
)
ID F M C D S
0 aaa 1 2 1 1 1
1 bbb 1 1 1 1 0
2 ccc 1 0 1 0 0
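Another way to get the same counts, sketched here as an alternative rather than as part of either answer, is to melt the frame into long form and cross-tabulate. The names `long` and `counts` are introduced only for this example. Note the assumption that no value is shared between the `sex` and `est` columns; if one were, its counts would be merged into a single column.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["aaa", "aaa", "aaa", "bbb", "bbb", "ccc"],
                   "sex": list("MMFFMF"),
                   "est": list("SCDDCC")})

# Long form: one row per (ID, original column, value).
long = df.melt(id_vars="ID", value_name="category")

# Count how often each value occurs for each ID.
counts = pd.crosstab(long["ID"], long["category"]).reset_index()
counts.columns.name = None  # drop the leftover 'category' axis label
print(counts)
```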

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is observing a subset of the features, and if there are duplicate rows, to keep the first and then denote which id: label pair is the duplicate.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is for a particular row, determine which rows are duplicates of it by saving those rows as id: label combination. So while I'm able to extract the id and label for each duplicate, I have no ability to map it back to the original row for which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is awkward in pandas, but here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Mask marking every duplicate except the first occurrence
m = sub_df.duplicated()
# Build (id, label) tuples, then aggregate each group's tuples into a dict
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# Add the new column
df = df.join(s.rename('duplicates'), on=cols)
# Replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
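An alternative uses a custom function that, for each group, writes a dict of all duplicates except the first row into the first row of a new column. The mask is changed so that every row that is not the first of a duplicated group is replaced with an empty string: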
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
# group_keys=False keeps the original index so the mask below still aligns
df = df.groupby(cols, group_keys=False).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
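The mapping itself can also be built as a plain Python dictionary first and attached afterwards, which avoids writing dicts into cells inside `groupby.apply`. This is a minimal sketch on the toy data from the question, not a benchmarked solution for the full-size frame; the name `duplicates` for the lookup dictionary is an assumption made here.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [f"id_{n}" for n in range(100, 111)],
    "ft1": [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "ft2": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "ft3": [43, 33, 12, 46, 10, 99, 0, 6, 29, 27, 32],
    "ft4": [1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    "ft5": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "label": ["High", "Medium", "Low", "Low", "High", "Low",
              "High", "High", "Medium", "Medium", "High"],
})
cols = ["ft1", "ft2", "ft4", "ft5"]

# For every group of identical feature rows, map the first row's id to an
# {id: label} dict of the remaining rows. Groups of size one are skipped.
duplicates = {}
for _, group in df.groupby(cols, sort=False):
    first, rest = group.iloc[0], group.iloc[1:]
    if not rest.empty:
        duplicates[first["id"]] = dict(zip(rest["id"], rest["label"]))

# Rows without duplicates get None instead of an empty string.
df["duplicates"] = df["id"].map(duplicates.get)
print(df[["id", "label", "duplicates"]])
```

Keeping the lookup as a separate dictionary also makes it easy to inspect or serialize without going through the dataframe.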

Group by as a list and create new column for each value

I have a dataframe where every row pairs a user id with an item that user has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of distinct values of item_id, collapse the data to one row per user, and fill each column with 1 or 0 depending on whether the user has that item.
| user | item_a | item_b | item_c |
|------|---------|---------|----------|
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 |
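Use pivot_table:
import pandas as pd
df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
# Keep the original df intact so the other approaches below can reuse it
pivoted = df.assign(val=1).pivot_table(values='val',
                                       index='user',
                                       columns='item_id',
                                       fill_value=0)
Or use crosstab, which also lets you add the requested prefix:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()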
Yet another approach is to use get_dummies and then group by user and sum:
df = pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
Result:
user item_id_a item_id_b item_id_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
To strip _id from the column names:
df.columns = df.columns.str.replace(r"_id", "")
df
user item_a item_b item_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
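One caveat worth illustrating: `crosstab` and the `get_dummies` sum both produce counts, so a repeated (user, item) pair would give 2 rather than 1. The sketch below adds a hypothetical repeated row (user 4, item a), which is not in the question's data, and clips the counts back to a 0/1 flag. The name `flags` is introduced only for this example.

```python
import pandas as pd

# Same data as in the question plus one hypothetical repeated pair (4, 'a').
df = pd.DataFrame({"user": [1, 1, 2, 3, 4, 4, 4],
                   "item_id": list("abbcaca")})

flags = (pd.crosstab(df["user"], df["item_id"])
           .clip(upper=1)          # counts above 1 become a plain presence flag
           .add_prefix("item_")
           .reset_index())
flags.columns.name = None          # drop the leftover 'item_id' axis label
print(flags)
```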

Iterate and assign value in Pandas dataframe based on condition

I have a pandas dataframe composed of 8 columns (c1 to c7, plus a last one called total). c1 to c7 contain only 0 and 1.
The column total should hold the length of the longest run of consecutive 1s within c1 to c7. c1 to c7 represent weekdays, so the sequence wraps around: c7 is followed by c1.
For example, suppose we have an initial dataframe like df:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
My initial thought was to create a loop with an if statement within to evaluate the criteria within the columns and assign the value to the column total.
i = "c1"
d =
for i in df.iloc[:,0:7]:
if df[i] == 1 and df[i-1] == 1:
df["total"]:= df["total"] + 1
I would expect df to look like:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 5 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 6 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
I haven't been able to get a result. I was trying to build it step by step but kept getting an error when the if statement is evaluated:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
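import pandas as pd
df = pd.DataFrame([[1,1,1,1,1,0,0], [0,0,1,0,0,0,0], [1,1,0,1,1,0,0]])
def fun(x):
    # Longest run of 1s, treating the row as circular (last column wraps to first)
    values = list(x)
    n = len(values)
    count = 0
    result = 0
    # Walk the row twice so a run that wraps around the end is counted
    for i in range(2 * n):
        if values[i % n] == 0:
            count = 0
        else:
            count += 1
            result = max(result, count)
    # An all-ones row would otherwise be counted twice
    return min(result, n)
df['total'] = df.apply(fun, axis=1)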
0 1 2 3 4 5 6 total
0 1 1 1 1 1 0 0 5
1 0 0 1 0 0 0 0 1
2 1 1 0 1 1 0 0 2
Bugs in your loop:
df[i-1] will throw an error for the first column, which has no predecessor
df[i] gives the values of the ith column for all the rows, not a single value
"7 should then flip to 1": this part is missing in your code
To wrap the tail (c7) of the row back to the head (c1), append a copy of the row to its end and then check for consecutive 1s. Equivalently, loop over the row twice and index it with the modulus operator.
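The "append a copy of the row" idea can be sketched on the full seven-row example from the question. The function name `longest_circular_run` is made up for this illustration, and the cap at the row length is an added assumption so that a week of all 1s reports 7 rather than 14.

```python
import pandas as pd

def longest_circular_run(row):
    """Length of the longest run of 1s when the row wraps around."""
    values = list(row)
    best = run = 0
    # Scanning the row followed by a copy of itself lets a run that crosses
    # the c7 -> c1 boundary be seen as one contiguous stretch.
    for v in values + values:
        run = run + 1 if v == 1 else 0
        best = max(best, run)
    # A run can never be longer than the week itself.
    return min(best, len(values))

cols = [f"c{i}" for i in range(1, 8)]
df = pd.DataFrame(
    [[1, 1, 1, 1, 1, 0, 0],
     [0, 0, 1, 0, 0, 0, 0],
     [1, 1, 0, 1, 1, 0, 0],
     [1, 1, 1, 0, 1, 1, 1],
     [1, 0, 1, 1, 1, 0, 1],
     [1, 0, 1, 0, 1, 0, 1],
     [0, 1, 0, 1, 0, 1, 0]],
    columns=cols,
)
df["total"] = df[cols].apply(longest_circular_run, axis=1)
print(df)
```

Applying the function row by row is fine for small frames; no claim is made here about performance on large ones.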

Dataframe conditional column subtract until zero

This is different than the usual 'subtract until 0' questions on here as it is conditional on another column. This question is about creating that conditional column.
This dataframe consists of three columns.
Column 'quantity' tells you how much to add/subtract.
Column 'in' tells you when to subtract.
Column 'cumulative_in' tells you how much you have.
+----------+----+---------------+
| quantity | in | cumulative_in |
+----------+----+---------------+
| 5 | 0 | |
| 1 | 0 | |
| 3 | 1 | 3 |
| 4 | 1 | 7 |
| 2 | 1 | 9 |
| 1 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
| 1 | -1 | |
| 2 | 0 | |
| 1 | 0 | |
| 2 | 0 | |
| 3 | 0 | |
| 3 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
+----------+----+---------------+
Whenever column 'in' equals -1, I want to create a column 'out' (0/1), starting from the next row, that says to keep subtracting until 'cumulative_in' reaches 0. Doing it by hand:
Column 'out' tells you when to keep subtracting.
Column 'cumulative_subtracted' tells you how much you have already subtracted.
I reduce 'cumulative_in' by 'cumulative_subtracted' until it reaches 0, and the output looks something like this:
+----------+----+---------------+-----+-----------------------+
| quantity | in | cumulative_in | out | cumulative_subtracted |
+----------+----+---------------+-----+-----------------------+
| 5 | 0 | | | |
| 1 | 0 | | | |
| 3 | 1 | 3 | | |
| 4 | 1 | 7 | | |
| 2 | 1 | 9 | | |
| 1 | 0 | | | |
| 1 | 0 | | | |
| 3 | 0 | | | |
| 1 | -1 | | | |
| 2 | 0 | 7 | 1 | 2 |
| 1 | 0 | 6 | 1 | 3 |
| 2 | 0 | 4 | 1 | 5 |
| 3 | 0 | 1 | 1 | 8 |
| 3 | 0 | 0 | 1 | 9 |
| 1 | 0 | | | |
| 3 | 0 | | | |
+----------+----+---------------+-----+-----------------------+
I couldn't find a vector solution to this. I would love to see one. However, the problem is not that hard when going through row by row. I hope your dataframe is not too big!!
First set up the data.
import numpy as np
import pandas as pd
data = {
    "quantity": [
        5,1,3,4,2,1,1,3,1,2,1,2,3,3,1,3
    ],
    "in": [
        0,0,1,1,1,0,0,0,-1,0,0,0,0,0,0,0
    ],
    # np.nan rather than np.NaN, which NumPy 2.0 removed
    "cumulative_in": [
        np.nan,np.nan,3,7,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan
    ]
}
Then set up the dataframe and extra columns. I used np.nan for 'out', but 0 was easier for 'cumulative_subtracted'.
df = pd.DataFrame(data)
df['out'] = np.nan
df['cumulative_subtracted'] = 0
Set the initial variables
last_in = 0.
reduce = False
Go through the dataframe row by row, unfortunately.
for i in df.index:
    # check if necessary to adjust last_in value.
    if not np.isnan(df.at[i, "cumulative_in"]) and not reduce:
        last_in = df.at[i, "cumulative_in"]
    # check if -1 and change reduce to true
    elif df.at[i, "in"] == -1:
        reduce = True
    # check if reduce is true, then implement reductions
    elif reduce:
        df.at[i, "out"] = 1
        if df.at[i, "quantity"] <= last_in:
            last_in -= df.at[i, "quantity"]
            df.at[i, "cumulative_in"] = last_in
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + df.at[i, "quantity"]
            )
        elif df.at[i, "quantity"] > last_in:
            df.at[i, "cumulative_in"] = 0
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + last_in
            )
            last_in = 0
            reduce = False
This works for the data given, and hopefully for all your dataset.
print(df)
quantity in cumulative_in out cumulative_subtracted
0 5 0 NaN NaN 0
1 1 0 NaN NaN 0
2 3 1 3.0 NaN 0
3 4 1 7.0 NaN 0
4 2 1 9.0 NaN 0
5 1 0 NaN NaN 0
6 1 0 NaN NaN 0
7 3 0 NaN NaN 0
8 1 -1 NaN NaN 0
9 2 0 7.0 1.0 2
10 1 0 6.0 1.0 3
11 2 0 4.0 1.0 5
12 3 0 1.0 1.0 8
13 3 0 0.0 1.0 9
14 1 0 NaN NaN 0
15 3 0 NaN NaN 0
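The same row-by-row logic can be sketched as a standalone function that works on plain sequences and leaves the dataframe untouched until the end. The helper name `drain` and its local variable names are hypothetical, introduced only here. One deliberate difference from the loop above, stated as an assumption: draining stops on the row where the remaining amount reaches exactly 0, rather than one row later.

```python
import numpy as np
import pandas as pd

def drain(quantity, signal, cumulative_in):
    """Hypothetical helper: returns (cumulative_in, out, cumulative_subtracted)."""
    balance = 0.0      # last known value of cumulative_in
    subtracted = 0.0   # amount removed since the most recent -1 marker
    draining = False
    new_in, out, cum_sub = [], [], []
    for q, s, c in zip(quantity, signal, cumulative_in):
        if draining:
            step = min(q, balance)   # never subtract more than what is left
            balance -= step
            subtracted += step
            new_in.append(balance)
            out.append(1)
            cum_sub.append(subtracted)
            draining = balance > 0   # stop once nothing is left
            continue
        if not np.isnan(c):
            balance = c
        if s == -1:                  # subtraction starts on the next row
            draining, subtracted = True, 0.0
        new_in.append(c)
        out.append(np.nan)
        cum_sub.append(0)
    return new_in, out, cum_sub

df = pd.DataFrame({
    "quantity": [5, 1, 3, 4, 2, 1, 1, 3, 1, 2, 1, 2, 3, 3, 1, 3],
    "in": [0, 0, 1, 1, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
    "cumulative_in": [np.nan, np.nan, 3, 7, 9] + [np.nan] * 11,
})
df["cumulative_in"], df["out"], df["cumulative_subtracted"] = drain(
    df["quantity"], df["in"], df["cumulative_in"]
)
print(df)
```

Because the function only reads sequences, it can be checked on small hand-written lists before being applied to a real frame.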
It is not clear to me what should happen when the quantity being subtracted has not yet reached zero and another '1' appears in the 'in' column.
Still, here is a rough solution for a simple case:
import pandas as pd
import numpy as np
size = 20
df = pd.DataFrame(
    {
        "quantity": np.random.randint(1, 6, size),
        "in": np.full(size, np.nan),
    }
)
# These are just to place a random 1 and -1 into 'in', not important
df.loc[np.random.choice(df.iloc[:size//3, :].index, 1), 'in'] = 1
df.loc[np.random.choice(df.iloc[size//3:size//2, :].index, 1), 'in'] = -1
df.loc[np.random.choice(df.iloc[size//2:, :].index, 1), 'in'] = 1
# Fill up with 1/-1 values the missing values after each entry up to the
# next 1/-1 entry.
df['in'] = df['in'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
# Calculates the cumulative sum with a negative value for subtractions
df["cum_in"] = (df["quantity"] * df['in']).cumsum()
# Subtraction indicator and cumulative column
df['out'] = (df['in'] == -1).astype(int)
df["cumulative_subtracted"] = df.loc[df['in'] == -1, 'quantity'].cumsum()
# Remove values when the 'cum_in' turns to negative
df.loc[
    df["cum_in"] < 0, ["in", "cum_in", "out", "cumulative_subtracted"]
] = np.nan
print(df)
