GraphLab: adding variable columns from an existing SFrame - Python

I have an SFrame, e.g.
a | b
---------
2 | 31 4 5
0 | 1 9
1 | 2 84
Now I want to get the following result:
a | b      | c  | d  | e
------------------------
2 | 31 4 5 | 31 | 4  | 5
0 | 1 9    | 1  | 9  | 0
1 | 2 84   | 2  | 84 | 0
Any idea how to do it? Or maybe I have to use some other tools?
Thanks.

Using pandas:
In [409]: sf
Out[409]:
Columns:
a int
b str
Rows: 3
Data:
+---+--------+
| a |   b    |
+---+--------+
| 2 | 31 4 5 |
| 0 |  1 9   |
| 1 |  2 84  |
+---+--------+
[3 rows x 2 columns]
In [410]: df = sf.to_dataframe()
In [411]: newdf = pd.DataFrame(df.b.str.split().tolist(), columns = ['c', 'd', 'e']).fillna('0')
In [412]: df.join(newdf)
Out[412]:
   a       b   c   d  e
0  2  31 4 5  31   4  5
1  0     1 9   1   9  0
2  1    2 84   2  84  0
Converting back to SFrame:
In [498]: SFrame(df.join(newdf))
Out[498]:
Columns:
a int
b str
c str
d str
e str
Rows: 3
Data:
+---+--------+----+----+---+
| a |   b    | c  | d  | e |
+---+--------+----+----+---+
| 2 | 31 4 5 | 31 | 4  | 5 |
| 0 |  1 9   | 1  | 9  | 0 |
| 1 |  2 84  | 2  | 84 | 0 |
+---+--------+----+----+---+
[3 rows x 5 columns]
If you want integers/floats, you can also do:
In [506]: newdf = pd.DataFrame(map(lambda x: [int(y) for y in x], df.b.str.split().tolist()), columns = ['c', 'd', 'e'])
In [507]: newdf
Out[507]:
    c   d    e
0  31   4  5.0
1   1   9  NaN
2   2  84  NaN
In [508]: SFrame(df.join(newdf))
Out[508]:
Columns:
a int
b str
c int
d int
e float
Rows: 3
Data:
+---+--------+----+----+-----+
| a |   b    | c  | d  |  e  |
+---+--------+----+----+-----+
| 2 | 31 4 5 | 31 | 4  | 5.0 |
| 0 |  1 9   | 1  | 9  | nan |
| 1 |  2 84  | 2  | 84 | nan |
+---+--------+----+----+-----+
[3 rows x 5 columns]
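On Python 3, where map returns an iterator, an equivalent and slightly more explicit sketch of the same integer conversion uses a list comprehension (rows shorter than three values are still padded with NaN by the DataFrame constructor):
newdf = pd.DataFrame(
    [[int(y) for y in row.split()] for row in df['b']],
    columns=['c', 'd', 'e'],
)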

def customsplit(string, column):
    val = string.split(' ')
    diff = column - len(val)
    val += ['0'] * diff
    return val

a = sf['b'].apply(lambda x: customsplit(x, 3))
sf['c'] = [i[0] for i in a]
sf['d'] = [i[1] for i in a]
sf['e'] = [i[2] for i in a]
sf
Output:
a | b      | c  | d  | e
------------------------
2 | 31 4 5 | 31 | 4  | 5
0 | 1 9    | 1  | 9  | 0
1 | 2 84   | 2  | 84 | 0
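If you prefer to stay inside the SFrame API instead of materializing Python lists, the column assignments can also be written against the SArray directly (a sketch, assuming the GraphLab Create SArray.apply behaviour used above):
# pull each element out of the split SArray; no intermediate Python list is needed
a = sf['b'].apply(lambda x: customsplit(x, 3))
sf['c'] = a.apply(lambda v: v[0])
sf['d'] = a.apply(lambda v: v[1])
sf['e'] = a.apply(lambda v: v[2])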

This can be done with SFrame itself, without using pandas; just use the unpack function.
Pandas provides a variety of functions for handling a dataset, but it is inconvenient to convert an SFrame to a pandas DataFrame and back.
Also, if you handle more than about 10 gigabytes of data, pandas cannot handle the dataset properly (but SFrame can).
# your SFrame
sf = sframe.SFrame({'a': [2, 0, 1], 'b': [[31, 4, 5], [1, 9], [2, 84]]})
# just use the unpack() function
sf2 = sf.unpack('b')
# change the column names
sf2.rename({'b.0': 'c', 'b.1': 'd', 'b.2': 'e'})
# fill the missing values with zero
sf2['e'] = sf2['e'].fillna(0)
# merge the original SFrame and the new SFrame
sf.join(sf2, 'a')
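If your b column is still a string (as in the original question) rather than a list, you would split it first; a minimal sketch under that assumption, using the same GraphLab-era API as above:
# split the string column into a list column, then unpack it
sf['b_list'] = sf['b'].apply(lambda s: [int(v) for v in s.split()])
sf2 = sf.unpack('b_list')   # produces b_list.0, b_list.1, b_list.2 next to a and b
sf2.rename({'b_list.0': 'c', 'b_list.1': 'd', 'b_list.2': 'e'})
sf2['e'] = sf2['e'].fillna(0)
Because unpack keeps the other columns, sf2 already contains a and b here, so no extra join is needed.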

Related

Transposing group of data in pandas dataframe

I have a large dataframe like this:
|type| qt | vol|
|----|---- | -- |
| A | 1 | 10 |
| A | 2 | 12 |
| A | 1 | 12 |
| B | 3 | 11 |
| B | 4 | 20 |
| B | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
How can I transpose the dataframe so that the groups sit side by side, like this?
|A.            |B.            |C.            |
|--------------|--------------|--------------|
|type| qt | vol|type| qt | vol|type| qt | vol|
|----|----|----|----|----|----|----|----|----|
| A  | 1  | 10 | B  | 3  | 11 | C  | 4  | 20 |
| A  | 2  | 12 | B  | 4  | 20 | C  | 4  | 20 |
| A  | 1  | 12 | B  | 4  | 20 | C  | 4  | 20 |
|    |    |    |    |    |    | C  | 4  | 20 |
You can group the dataframe on type, create key-value pairs of the groups inside a dict comprehension, and finally use concat along axis=1, passing the optional keys parameter, to get the final result:
d = {k:g.reset_index(drop=True) for k, g in df.groupby('type')}
pd.concat(d.values(), keys=d.keys(), axis=1)
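For reference, a self-contained version of this approach; the sample data below is just reconstructed from the question's table:
import pandas as pd

# reconstructed from the question's table (assumption, for illustration only)
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'qt':   [1, 2, 1, 3, 4, 4, 4, 4, 4, 4],
    'vol':  [10, 12, 12, 11, 20, 20, 20, 20, 20, 20],
})

# one sub-frame per type, re-indexed from 0 so the rows line up side by side
d = {k: g.reset_index(drop=True) for k, g in df.groupby('type')}
out = pd.concat(d.values(), keys=d.keys(), axis=1)
print(out)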
Alternatively, you can use groupby + cumcount to create a sequential counter per group, then build a two-level index where the first level is the counter and the second level is the type column itself, and finally use stack followed by unstack to reshape:
c = df.groupby('type').cumcount()
df.set_index([c, df['type'].values]).stack().unstack([1, 2])
      A            B            C
   type qt vol  type qt vol  type qt vol
0     A  1  10     B  3  11     C  4  20
1     A  2  12     B  4  20     C  4  20
2     A  1  12     B  4  20     C  4  20
3   NaN NaN NaN  NaN NaN NaN    C  4  20
This is pretty much pivot by one column:
(df.assign(idx=df.groupby('type').cumcount())
   .pivot(index='idx', columns='type', values=df.columns)
   .swaplevel(0, 1, axis=1)
   .sort_index(axis=1)
)
Output:
type     A             B             C
        qt type vol   qt type vol   qt type vol
idx
0        1    A  10    3    B  11    4    C  20
1        2    A  12    4    B  20    4    C  20
2        1    A  12    4    B  20    4    C  20
3      NaN  NaN NaN  NaN  NaN NaN    4    C  20

Python: how do I filter data as long as a group contains any of a particular value

df = pd.DataFrame({'VisitID':[1,1,1,1,2,2,2,3,3,4,4], 'Item':['A','B','C','D','A','D','B','B','C','D','C']})
I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I want to return all rows for a VisitID as long as that VisitID had an occurrence of item A OR B. How do I go about it? Expected result:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
In R, I can do this via
library(dplyr)
df %>% group_by(VisitID) %>% filter(any(Item %in% c('A', 'B')))
How can I perform this in Python?
Something like df.groupby(['VisitID']).query(any(['A','B']))?
The syntax is similar, just use groupby.filter:
df.groupby('VisitID').filter(lambda g: g.Item.isin(['A','B']).any())
   VisitID Item
0        1    A
1        1    B
2        1    C
3        1    D
4        2    A
5        2    D
6        2    B
7        3    B
8        3    C
To extract the groups that contain either value, we can just use groupby().transform('any') on the isin() mask:
s = (df.Item.isin(['A', 'B'])
       .groupby(df['VisitID'])
       .transform('any'))
df[s]
Output:
   VisitID Item
0        1    A
1        1    B
2        1    C
3        1    D
4        2    A
5        2    D
6        2    B
7        3    B
8        3    C
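Another common idiom for this kind of "keep whole groups that contain a value" filter is to collect the matching VisitIDs first and then select with isin; a sketch that gives the same rows as the two answers above:
# find the VisitIDs that contain A or B, then keep all of their rows
keep = df.loc[df['Item'].isin(['A', 'B']), 'VisitID'].unique()
df[df['VisitID'].isin(keep)]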

How do I add or subtract a row to an entire pandas dataframe?

I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to add the row to the entire dataframe to obtain this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
Use DataFrame.add or DataFrame.sub and convert the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
# alternative: select by row label
# df = df1.add(df2.loc[0])
print(df)
   a  b  c
0  1  2  3
1  6  7  8
Detail:
print(df2.iloc[0])
a    1
b    2
c    3
Name: 0, dtype: int64
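The subtraction case from the title works the same way, just with DataFrame.sub (sketch):
res = df1.sub(df2.iloc[0])
print(res)
   a  b  c
0 -1 -2 -3
1  4  3  2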
You can also convert the second dataframe to a NumPy array:
df1 + df2.values
Output:
   a  b  c
0  1  2  3
1  6  7  8

Pandas, how to count the occurrences within a grouped dataframe and create a new column?

How do I get the count of each value within the group using pandas?
In the table below, I have a Group and a Value column, and I want to generate a new column called Count, which should contain the total number of occurrences of that value within the group.
My df dataframe is as follows (without the Count column):
-------------------------
| Group| Value | Count? |
-------------------------
| A | 10 | 3 |
| A | 20 | 2 |
| A | 10 | 3 |
| A | 10 | 3 |
| A | 20 | 2 |
| A | 30 | 1 |
-------------------------
| B | 20 | 3 |
| B | 20 | 3 |
| B | 20 | 3 |
| B | 10 | 1 |
-------------------------
| C | 20 | 2 |
| C | 20 | 2 |
| C | 10 | 2 |
| C | 10 | 2 |
-------------------------
I can get the counts using this:
df.groupby(['Group', 'Value']).Value.count()
but this is just for viewing; I am having difficulty putting the results back into the dataframe as new columns.
Using transform:
df['Count?'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
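A self-contained sketch of the transform approach, with the data reconstructed from the question's table:
import pandas as pd

# reconstructed from the question's table (assumption, for illustration only)
df = pd.DataFrame({
    'Group': list('AAAAAA') + list('BBBB') + list('CCCC'),
    'Value': [10, 20, 10, 10, 20, 30, 20, 20, 20, 10, 20, 20, 10, 10],
})

# transform('count') broadcasts the per-(Group, Value) count back onto every row
df['Count?'] = df.groupby(['Group', 'Value'])['Value'].transform('count')
print(df)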
Try a merge:
df
    Group  Value
0       A     10
1       A     20
2       A     10
3       A     10
4       A     20
5       A     30
6       B     20
7       B     20
8       B     20
9       B     10
10      C     20
11      C     20
12      C     10
13      C     10
g = df.groupby(['Group', 'Value']).Group.count()\
      .to_frame('Count?').reset_index()
df = df.merge(g)
df
    Group  Value  Count?
0       A     10       3
1       A     10       3
2       A     10       3
3       A     20       2
4       A     20       2
5       A     30       1
6       B     20       3
7       B     20       3
8       B     20       3
9       B     10       1
10      C     20       2
11      C     20       2
12      C     10       2
13      C     10       2

Pandas: replace zero value with value of another column

How do I replace a zero value in a column with the value from the same row of another column, but only where the previous row's value in that column is also zero, i.e. replace only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+-----+
|    |   a |   b |   c |
|----+-----+-----+-----|
|  0 |   2 |   0 |   0 |
|  1 |   5 |   0 |   0 |
|  2 |   3 |   4 |   0 |
|  3 |   2 |   0 |   3 |
|  4 |   1 |   8 |   1 |
+----+-----+-----+-----+
Replace the zero values in b and c with the values of a where the previous value is zero:
+----+-----+-----+-----+
|    |   a |   b |   c |
|----+-----+-----+-----|
|  0 |   2 |   2 |   2 |
|  1 |   5 |   5 |   5 |
|  2 |   3 |   4 |   3 |
|  3 |   2 |   0 |   3 |  <-- the zero in this row is not replaced because of the
|  4 |   1 |   8 |   1 |      non-zero value (4) in the row before it.
+----+-----+-----+-----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
    ...:    .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
    ...:                         columns=df.columns, index=df.index))
    ...:    .astype(int)
    ...: )
Out[90]:
   a  b  c
0  2  2  2
1  5  5  5
2  3  4  3
3  2  0  3
4  1  8  1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (which is built as three concatenated copies of column a):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
   a  b  c
0  2  2  2
1  5  5  5
2  3  3  3
3  2  2  2
4  1  1  1
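The same condition can also be expressed with DataFrame.mask, which avoids building the tiled fill frame; a sketch under the same assumptions (numeric columns, fill values taken from column a):
# a cell is replaced only if it is 0 and the cell above it (same column) is also 0
cond = df.eq(0) & df.shift().fillna(0).eq(0)
df.mask(cond, df['a'], axis=0)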
