Transposing group of data in pandas dataframe - python

I have a large dataframe like this:
|type| qt | vol|
|----|---- | -- |
| A | 1 | 10 |
| A | 2 | 12 |
| A | 1 | 12 |
| B | 3 | 11 |
| B | 4 | 20 |
| B | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
How can I reshape the dataframe so that the groups are placed side by side (horizontally), like this?
|        A        |        B        |        C        |
| type | qt | vol | type | qt | vol | type | qt | vol |
|------|----|-----|------|----|-----|------|----|-----|
|  A   | 1  | 10  |  B   | 3  | 11  |  C   | 4  | 20  |
|  A   | 2  | 12  |  B   | 4  | 20  |  C   | 4  | 20  |
|  A   | 1  | 12  |  B   | 4  | 20  |  C   | 4  | 20  |
|      |    |     |      |    |     |  C   | 4  | 20  |

You can group the dataframe on type, then create key-value pairs of the groups inside a dict comprehension, and finally use concat along axis=1, passing the optional keys parameter, to get the final result:
d = {k:g.reset_index(drop=True) for k, g in df.groupby('type')}
pd.concat(d.values(), keys=d.keys(), axis=1)
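For a quick, self-contained check, here is a minimal sketch that rebuilds the sample frame from the question and applies the same two lines:
import pandas as pd

# sample data assumed from the table in the question
df = pd.DataFrame({
    'type': list('AAABBBCCCC'),
    'qt':   [1, 2, 1, 3, 4, 4, 4, 4, 4, 4],
    'vol':  [10, 12, 12, 11, 20, 20, 20, 20, 20, 20],
})

# one sub-frame per type, re-indexed from 0 so the rows line up side by side
d = {k: g.reset_index(drop=True) for k, g in df.groupby('type')}
out = pd.concat(d.values(), keys=d.keys(), axis=1)
print(out)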
Alternatively, you can use groupby + cumcount to create a sequential counter per group, then build a two-level MultiIndex where the first level is the counter and the second level is the type column itself, and finally use stack followed by unstack to reshape:
c = df.groupby('type').cumcount()
df.set_index([c, df['type'].values]).stack().unstack([1, 2])
A B C
type qt vol type qt vol type qt vol
0 A 1 10 B 3 11 C 4 20
1 A 2 12 B 4 20 C 4 20
2 A 1 12 B 4 20 C 4 20
3 NaN NaN NaN NaN NaN NaN C 4 20

This is pretty much a pivot by one column:
(df.assign(idx=df.groupby('type').cumcount())
   .pivot(index='idx', columns='type', values=df.columns)
   .swaplevel(0, 1, axis=1)
   .sort_index(axis=1)
)
Output:
type A B C
qt type vol qt type vol qt type vol
idx
0 1 A 10 3 B 11 4 C 20
1 2 A 12 4 B 20 4 C 20
2 1 A 12 4 B 20 4 C 20
3 NaN NaN NaN NaN NaN NaN 4 C 20
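Note that sort_index(axis=1) also alphabetizes the inner level (qt, type, vol, as shown above). If the original type/qt/vol order is preferred within each A/B/C block, one option is a level-wise reindex on the columns (a sketch, assuming the chained result above is stored in a variable out):
# restore the original type/qt/vol order inside each A/B/C block
out = out.reindex(columns=['type', 'qt', 'vol'], level=1)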

Related

How to build sequence of purchases for each ID?

I want to create a dataframe that shows the sequence of products users purchase, according to the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:
def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    g['price'] = g.price.cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]
df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0
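A sketch of a vectorized alternative that avoids groupby.apply (column names as in the question; cum_total_price is the within-user cumulative price shifted back one row):
tmp = df.sort_values(['user_id', 'sequence']).copy()
grp = tmp.groupby('user_id')
tmp['source_product'] = tmp['product']
tmp['target_product'] = grp['product'].shift(-1)
# cumulative price per user, then shifted within the user so each row carries
# the total up to and including the target purchase
tmp['cum_total_price'] = grp['price'].cumsum().groupby(tmp['user_id']).shift(-1)
out = tmp.dropna(subset=['target_product'])[
    ['user_id', 'source_product', 'target_product', 'cum_total_price']]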

How to count the occurrence of a value and set that count as a new value for that value's row

Title is probably confusing, but let me make it clearer.
Let's say I have a df like this:
+----+------+---------------+
| Id | Name | reports_to_id |
+----+------+---------------+
| 0 | A | 10 |
| 1 | B | 10 |
| 2 | C | 11 |
| 3 | D | 12 |
| 4 | E | 11 |
| 10 | F | 20 |
| 11 | G | 21 |
| 12 | H | 22 |
+----+------+---------------+
I would want my resulting df to look like this:
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 0 |
| 1 | B | 10 | 0 |
| 2 | C | 11 | 0 |
| 3 | D | 12 | 0 |
| 4 | E | 11 | 0 |
| 10 | F | 20 | 2 |
| 11 | G | 21 | 2 |
| 12 | H | 22 | 1 |
+----+------+---------------+-------+
But this is what I currently get as a result of my code (which is wrong):
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 2 |
| 1 | B | 10 | 2 |
| 2 | C | 11 | 2 |
| 3 | D | 12 | 1 |
| 4 | E | 11 | 2 |
| 10 | F | 20 | 0 |
| 11 | G | 21 | 0 |
| 12 | H | 22 | 0 |
+----+------+---------------+-------+
with this code:
df['COUNT'] = df.groupby(['reports_to_id'])['Id'].transform('count')
Any suggestions or directions on how to get the result I want? All help is appreciated! and thank you in advance!
Use value_counts to count how often each value occurs in reports_to_id, then map that onto Id:
df['COUNT'] = df['Id'].map(df['reports_to_id'].value_counts()).fillna(0)
Output:
Id Name reports_to_id COUNT
0 0 A 10 0.0
1 1 B 10 0.0
2 2 C 11 0.0
3 3 D 12 0.0
4 4 E 11 0.0
5 10 F 20 2.0
6 11 G 21 2.0
7 12 H 22 1.0
Similar idea with reindex:
df['COUNT'] = df['reports_to_id'].value_counts().reindex(df['Id'], fill_value=0).values
which gives a better-looking COUNT (integers rather than floats):
Id Name reports_to_id COUNT
0 0 A 10 0
1 1 B 10 0
2 2 C 11 0
3 3 D 12 0
4 4 E 11 0
5 10 F 20 2
6 11 G 21 2
7 12 H 22 1
You can try the following:
l = list(df['reports_to_id'])
df['Count'] = df['Id'].apply(lambda x: l.count(x))
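Note that list.count rescans the whole list for every row, which is quadratic on a large frame. A Counter-based sketch gives the same result (missing ids count as 0) in a single pass:
from collections import Counter

counts = Counter(df['reports_to_id'])
df['Count'] = df['Id'].map(lambda x: counts[x])  # Counter returns 0 for missing keys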

Rolling quantiles over a column in pandas

I have a table as such
+------+------------+-------+
| Idx | date | value |
+------+------------+-------+
| A | 20/11/2016 | 10 |
| A | 21/11/2016 | 8 |
| A | 22/11/2016 | 12 |
| B | 20/11/2016 | 16 |
| B | 21/11/2016 | 18 |
| B | 22/11/2016 | 11 |
+------+------------+-------+
I'd like to create a new column 'rolling_quantile_value', based on the column 'value', that for each row computes a quantile over the past rows, separately for each possible Idx.
For the example above, if the quantile chosen is the median, the output should look like this:
+------+------------+-------+-----------------------+
| Idx | date | value | rolling_median_value |
+------+------------+-------+-----------------------+
| A | 20/11/2016 | 10 | NaN |
| A | 21/11/2016 | 8 | 10 |
| A | 22/11/2016 | 12 | 9 |
| A | 23/11/2016 | 14 | 10 |
| B | 20/11/2016 | 16 | NaN |
| B | 21/11/2016 | 18 | 16 |
| B | 22/11/2016 | 11 | 17 |
+------+------------+-------+-----------------------+
I've done it the naive way, with a function that builds the column row by row from the preceding rows of value and flags the jump from one Idx to another, but I'm sure that it's neither the most efficient nor the most elegant way to do it.
Looking forward to your suggestions!
I think you want expanding
df['rolling_median_value'] = (df.groupby('Idx', sort=False)
                                .expanding(1)['value']
                                .median()
                                .groupby(level=0)
                                .shift()
                                .reset_index(drop=True))
print(df)
Idx date value rolling_median_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.0
3 A 23/11/2016 14 10.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.0
UPDATE
df['rolling_quantile_value'] = (df.groupby('Idx', sort=False)
                                  .expanding(1)['value']
                                  .quantile(0.75)
                                  .groupby(level=0)
                                  .shift()
                                  .reset_index(drop=True))
print(df)
Idx date value rolling_quantile_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.5
3 A 23/11/2016 14 11.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.5
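If a fixed look-back window is actually wanted (the title says rolling, while expanding uses all past rows), a sketch with transform keeps the original row order; the window size of 2 and the new column name here are assumptions:
df['rolling_median_value_w2'] = (
    df.groupby('Idx', sort=False)['value']
      .transform(lambda s: s.shift().rolling(2).median())  # median of the previous 2 rows per Idx
)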

Pandas, how to count occurrences within a grouped dataframe and create a new column?

How do I get the count of each value within the group using pandas?
In the table below, I have the Group and Value columns, and I want to generate a new column called Count, which should contain the total number of occurrences of that value within the group.
My df dataframe is as follows (without the Count column):
-------------------------
| Group| Value | Count? |
-------------------------
| A | 10 | 3 |
| A | 20 | 2 |
| A | 10 | 3 |
| A | 10 | 3 |
| A | 20 | 2 |
| A | 30 | 1 |
-------------------------
| B | 20 | 3 |
| B | 20 | 3 |
| B | 20 | 3 |
| B | 10 | 1 |
-------------------------
| C | 20 | 2 |
| C | 20 | 2 |
| C | 10 | 2 |
| C | 10 | 2 |
-------------------------
I can get the counts using this:
df.groupby(['Group', 'Value']).Value.count()
but this is just for viewing; I am having difficulty putting the results back into the dataframe as new columns.
Using transform:
df['Count?'] = df.groupby(['Group', 'Value']).Value.transform('count').values
Try a merge:
df
Group Value
0 A 10
1 A 20
2 A 10
3 A 10
4 A 20
5 A 30
6 B 20
7 B 20
8 B 20
9 B 10
10 C 20
11 C 20
12 C 10
13 C 10
g = df.groupby(['Group', 'Value']).Group.count()\
      .to_frame('Count?').reset_index()
df = df.merge(g)
df
Group Value Count?
0 A 10 3
1 A 10 3
2 A 10 3
3 A 20 2
4 A 20 2
5 A 30 1
6 B 20 3
7 B 20 3
8 B 20 3
9 B 10 1
10 C 20 2
11 C 20 2
12 C 10 2
13 C 10 2

graphlab adding variable columns from existing sframe

I have an SFrame, e.g.:
a | b
-----
2 | 31 4 5
0 | 1 9
1 | 2 84
Now I want to get the following result:
a | b      | c  | d  | e
------------------------
2 | 31 4 5 | 31 | 4  | 5
0 | 1 9    | 1  | 9  | 0
1 | 2 84   | 2  | 84 | 0
Any idea how to do it? Or maybe I have to use some other tools?
Thanks!
Using pandas:
In [409]: sf
Out[409]:
Columns:
a int
b str
Rows: 3
Data:
+---+--------+
| a | b |
+---+--------+
| 2 | 31 4 5 |
| 0 | 1 9 |
| 1 | 2 84 |
+---+--------+
[3 rows x 2 columns]
In [410]: df = sf.to_dataframe()
In [411]: newdf = pd.DataFrame(df.b.str.split().tolist(), columns = ['c', 'd', 'e']).fillna('0')
In [412]: df.join(newdf)
Out[412]:
a b c d e
0 2 31 4 5 31 4 5
1 0 1 9 1 9 0
2 1 2 84 2 84 0
Converting back to SFrame:
In [498]: SFrame(df.join(newdf))
Out[498]:
Columns:
a int
b str
c str
d str
e str
Rows: 3
Data:
+---+--------+----+----+---+
| a | b | c | d | e |
+---+--------+----+----+---+
| 2 | 31 4 5 | 31 | 4 | 5 |
| 0 | 1 9 | 1 | 9 | 0 |
| 1 | 2 84 | 2 | 84 | 0 |
+---+--------+----+----+---+
[3 rows x 5 columns]
If you want integers/floats, you can also do:
In [506]: newdf = pd.DataFrame(map(lambda x: [int(y) for y in x], df.b.str.split().tolist()), columns = ['c', 'd', 'e'])
In [507]: newdf
Out[507]:
c d e
0 31 4 5.0
1 1 9 NaN
2 2 84 NaN
In [508]: SFrame(df.join(newdf))
Out[508]:
Columns:
a int
b str
c int
d int
e float
Rows: 3
Data:
+---+--------+----+----+-----+
| a | b | c | d | e |
+---+--------+----+----+-----+
| 2 | 31 4 5 | 31 | 4 | 5.0 |
| 0 | 1 9 | 1 | 9 | nan |
| 1 | 2 84 | 2 | 84 | nan |
+---+--------+----+----+-----+
[3 rows x 5 columns]
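On the pandas side, str.split(expand=True) can also build the new columns directly, without constructing the list by hand (a sketch; shorter rows are padded with '0' to match the expected output):
parts = df['b'].str.split(expand=True).fillna('0')
parts.columns = ['c', 'd', 'e']
df = df.join(parts)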
def customsplit(string, column):
    val = string.split(' ')
    diff = column - len(val)
    val += ['0'] * diff
    return val

a = sf['b'].apply(lambda x: customsplit(x, 3))
sf['c'] = [i[0] for i in a]
sf['d'] = [i[1] for i in a]
sf['e'] = [i[2] for i in a]
sf
Output:
a | b      | c  | d  | e
------------------------
2 | 31 4 5 | 31 | 4  | 5
0 | 1 9    | 1  | 9  | 0
1 | 2 84   | 2  | 84 | 0
This can be done with SFrame itself, without using pandas: just use the 'unpack' function.
Pandas provides a variety of functions for handling datasets, but converting between SFrame and pandas DataFrame and back is inconvenient.
If you handle more than about 10 gigabytes of data, pandas may not handle the dataset properly, but SFrame can.
import sframe

# your SFrame
sf = sframe.SFrame({'a': [2, 0, 1], 'b': [[31, 4, 5], [1, 9], [2, 84]]})
# just use the 'unpack()' function
sf2 = sf.unpack('b')
# change the column names
sf2.rename({'b.0': 'c', 'b.1': 'd', 'b.2': 'e'})
# fill the missing values with zero
sf2['e'] = sf2['e'].fillna(0)
# merge the original SFrame and the new SFrame on 'a'
sf.join(sf2, 'a')
