Group by / Pivot - Python

import pandas as pd

dummy_df = pd.DataFrame({
    'accnt': [101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104],
    'value': [10, 20, 30, 40, 5, 2, 6, 48, 22, 23, 24, 25, 18, 25, 26, 14, 78, 72, 54, 6],
    'category': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 3, 2, 3]
})
dummy_df
accnt value category
101 10 1
102 20 1
103 30 1
104 40 1
101 5 2
102 2 2
103 6 2
104 48 2
101 22 1
102 23 1
103 24 2
104 25 2
101 18 3
102 25 3
103 26 3
104 14 3
101 78 1
102 72 3
103 54 2
104 6 3
I want to get a dataframe like below:
accnt sum_val_c1 count_c1 sum_val_c2 count_c2 sum_val_c3 count_c3
101 110 3 5 1 18 1
102 43 2 2 1 97 2
103 30 1 84 3 26 1
104 40 1 73 2 20 2
That is, grouped by accnt, count_c# is the number of rows in each category and sum_val_c# is the sum of value for that category. I have tried pivot() and groupby(), but I know I'm missing something.

Use groupby, agg, and unstack:
u = dummy_df.groupby(['accnt', 'category'])['value'].agg(['sum', 'count']).unstack(1)
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
Similarly, with pivot_table:
u = dummy_df.pivot_table(index=['accnt'],
                         columns='category',
                         values='value',
                         aggfunc=['sum', 'count'])
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
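To get the exact column names and ordering asked for in the question (sum_val_c1, count_c1, sum_val_c2, ...), the unstacked columns can be reordered and renamed before flattening. A minimal sketch, assuming the question's dummy_df; the names wide, ordered and rename_stat are only illustrative and do not come from the answers above:

```python
import pandas as pd

dummy_df = pd.DataFrame({
    'accnt': [101, 102, 103, 104] * 5,
    'value': [10, 20, 30, 40, 5, 2, 6, 48, 22, 23, 24, 25, 18, 25, 26, 14, 78, 72, 54, 6],
    'category': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 3, 2, 3],
})

# One row per accnt, MultiIndex columns of (statistic, category).
wide = (dummy_df.groupby(['accnt', 'category'])['value']
                .agg(['sum', 'count'])
                .unstack('category'))

# Interleave the columns: sum then count for category 1, then category 2, ...
categories = sorted(dummy_df['category'].unique())
ordered = [(stat, cat) for cat in categories for stat in ('sum', 'count')]
wide = wide[ordered]

# Flatten the MultiIndex into the names used in the question.
rename_stat = {'sum': 'sum_val', 'count': 'count'}
wide.columns = [f"{rename_stat[stat]}_c{cat}" for stat, cat in wide.columns]
wide = wide.reset_index()
print(wide)
```

If some accnt had no rows for a category, unstack would leave NaN in those cells, so a fillna(0) might be needed in that case.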

pandas has a built-in method for this, pivot_table:
pivot2 = dummy_df.pivot_table(values='value', index='accnt', columns='category', aggfunc=['count', 'sum'])
That returns a dataframe like this:
count sum
category 1 2 3 1 2 3
accnt
101 3 1 1 110 5 18
102 2 1 2 43 2 97
103 1 3 1 30 84 26
104 1 2 2 40 73 20

Related

Filter rows with consecutive numbers

I have the data below.
I want to keep only the rows that belong to a run of at least 4 consecutive numbers within an ID. For example, if ID 1 has rows 100, 101, 102, 103, 105, the "105" should be excluded.
Data:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 1 105
5 2 100
6 2 102
7 2 103
8 2 104
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
18 3 112
19 4 100
20 4 102
21 4 103
22 4 104
23 4 105
24 4 107
Expected results:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 3 100
5 3 101
6 3 102
7 3 103
8 3 106
9 3 107
10 3 108
11 3 109
12 3 110
13 4 102
14 4 103
15 4 104
16 4 105
You can identify the consecutive values, then filter the groups by size with groupby.filter:
# label runs of consecutive X values (no need to group by ID here, that happens below)
g = df['X'].diff().gt(1).cumsum()
# keep only the (ID, run) groups with at least 4 rows
out = df.groupby(['ID', g]).filter(lambda grp: len(grp) >= 4)  # optionally .reset_index(drop=True)
Output:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
Another method:
out = df.groupby(df.groupby('ID')['X'].diff().ne(1).cumsum()).filter(lambda x: len(x) >= 4)
print(out)
# Output
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
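The same idea can also be written as a boolean mask, which makes the intermediate run labels easy to inspect. This is a sketch rather than one of the posted answers: it rebuilds the question's data and swaps groupby.filter for transform('size'); the names run_id and run_len are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1] * 5 + [2] * 4 + [3] * 10 + [4] * 6,
    'X': [100, 101, 102, 103, 105,
          100, 102, 103, 104,
          100, 101, 102, 103, 106, 107, 108, 109, 110, 112,
          100, 102, 103, 104, 105, 107],
})

# A new run starts wherever the step from the previous X (within the same ID)
# is not exactly 1; the first row of each ID has a NaN diff and also starts a run.
run_id = df.groupby('ID')['X'].diff().ne(1).cumsum()

# Length of the run each row belongs to.
run_len = df.groupby(run_id)['X'].transform('size')

out = df[run_len >= 4]
print(out)
```

Because the diff is taken per ID, the run labels never span two IDs, so grouping by run_id alone is enough.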
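A third option computes the run length per ID with a helper function:
def function1(dd: pd.DataFrame):
    # rk = size of the run of consecutive X values each row belongs to
    return dd.assign(rk=dd.assign(col1=(dd.X.diff() > 1).cumsum()).groupby('col1')['X'].transform('size'))
# group_keys=False keeps the original row index in the result
df.groupby('ID', group_keys=False).apply(function1).loc[lambda x: x.rk > 3, :'X']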
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105

Count how many times a pair of values in one pandas dataframe appears in another

I have a pandas dataframe df1 that looks like this:
import pandas as pd
d = {'node1': [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77], 'node2': [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24], 'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']}
df1 = pd.DataFrame(data=d)
df1
node1 node2 user
47 24 A
24 19 A
19 77 A
77 24 A
24 19 A
19 77 B
77 24 B
24 19 B
56 92 C
92 32 C
32 77 C
77 24 C
And a second pandas dataframe df2 that looks like this:
d2 = {'way_id': [4, 3, 1, 8, 5, 2, 7, 9, 6, 10], 'source': [24, 19, 84, 47, 19, 16, 77, 56, 32, 92], 'target': [19, 43, 67, 24, 77, 29, 24, 92, 77, 32]}
df2 = pd.DataFrame(data=d2)
df2
way_id source target
4 24 19
3 19 43
1 84 67
8 47 24
5 19 77
2 16 29
7 77 24
9 56 92
6 32 77
10 92 32
In a new dataframe I would like to count how often the value pairs per row in the columns node1 and node2 in df1 occur in the rows of the source and target columns in df2. The order is relevant, but also the corresponding user should be added to a new column. That's why the desired output should be like this:
way_id source target count user
4 24 19 2 A
3 19 43 0 A
1 84 67 0 A
8 47 24 1 A
5 19 77 1 A
2 16 29 0 A
7 77 24 1 A
9 56 92 0 A
6 32 77 0 A
10 92 32 0 A
4 24 19 1 B
3 19 43 0 B
1 84 67 0 B
8 47 24 0 B
5 19 77 1 B
2 16 29 0 B
7 77 24 1 B
9 56 92 0 B
6 32 77 0 B
10 92 32 0 B
4 24 19 0 C
3 19 43 0 C
1 84 67 0 C
8 47 24 0 C
5 19 77 0 C
2 16 29 0 C
7 77 24 1 C
9 56 92 1 C
6 32 77 1 C
10 92 32 1 C
If the direction of a source/target pair did not matter, you would duplicate df1 with node1 and node2 swapped and then merge. Note that the question says the order is relevant, in which case the second, swapped frame should be left out of the concat:
(pd.concat([df1.rename(columns={'node1': 'source', 'node2': 'target'}),
            df1.rename(columns={'node2': 'source', 'node1': 'target'})])
   .merge(df2, on=['source', 'target'], how='outer')
   .groupby(['source', 'target', 'user'], as_index=False)['way_id'].count()
)
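The desired output in the question lists every way for every user, including those with a count of 0, which a plain merge and count does not produce. One way to get that shape, sketched here under the assumption that only directed (node1, node2) pairs should match, is to count the pairs per user first and then left-merge them onto a user by way grid. The names pair_counts, grid and result are illustrative, and how='cross' needs pandas 1.2 or later.

```python
import pandas as pd

df1 = pd.DataFrame({
    'node1': [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77],
    'node2': [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24],
    'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
})
df2 = pd.DataFrame({
    'way_id': [4, 3, 1, 8, 5, 2, 7, 9, 6, 10],
    'source': [24, 19, 84, 47, 19, 16, 77, 56, 32, 92],
    'target': [19, 43, 67, 24, 77, 29, 24, 92, 77, 32],
})

# How often each directed pair occurs for each user.
pair_counts = (
    df1.groupby(['user', 'node1', 'node2'])
       .size()
       .reset_index(name='count')
       .rename(columns={'node1': 'source', 'node2': 'target'})
)

# Every (user, way) combination, so unmatched ways can show up with count 0.
grid = df1[['user']].drop_duplicates().merge(df2, how='cross')

result = (
    grid.merge(pair_counts, on=['user', 'source', 'target'], how='left')
        .fillna({'count': 0})
        .astype({'count': int})
        [['way_id', 'source', 'target', 'count', 'user']]
)
print(result)
```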

Extract specific column with dict as rows

id numbers
1 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
2 {'105': 1, '65': 11, '75': 0, '85': 50, '95': 0}
3 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
4 {}
5 {}
6 {}
7 {'75 cm': 7, '85 cm': 52, '95 cm': 10}
8 {'75 cm': 51, '85 cm': 114, '95 cm': 10}
9 {'75 cm': 9, '85 cm': 60, '95 cm': 10}
This is the current table.
I know how to turn the dict into columns and rows (keys as columns, values as rows), but what I am looking for is for each key and value to be a row, with their own column headers:
test = pd.concat([df.drop(['numbers'], axis=1).sort_values(['id']),
                  df['numbers'].apply(pd.Series)], axis=1)
test2 = test.melt(id_vars=['id'],
                  var_name="name",
                  value_name="nameN").fillna(0)
I'm trying to get each key and value in the dictionary as its own row:
id name nameN
1 105 1
1 65 11
1 75 0
1 85 51
1 95 0
You can use comprehensions to build the data for a new DataFrame. If you can simply drop the ids where numbers is an empty dictionary, you can do:
test = pd.DataFrame([[x['id'], k, v] for _, x in df.iterrows()
                     for k, v in x['numbers'].items()], columns=['id', 'name', 'nameN'])
to get:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 105 1
11 3 65 11
12 3 75 0
13 3 85 51
14 3 95 0
15 7 75 cm 7
16 7 85 cm 52
17 7 95 cm 10
18 8 75 cm 51
19 8 85 cm 114
20 8 95 cm 10
21 9 75 cm 9
22 9 85 cm 60
23 9 95 cm 10
If you want a line with a specific value when numbers is empty:
test2 = pd.DataFrame([i for lst in [[[x['id'], '', '']] if x['numbers'] == {}
                                    else [[x['id'], k, v] for k, v in x['numbers'].items()]
                                    for _, x in df.iterrows()] for i in lst],
                     columns=['id', 'name', 'nameN']).sort_values('id').reset_index(drop=True)
giving:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 95 0
11 3 75 0
12 3 85 51
13 3 105 1
14 3 65 11
15 4
16 5
17 6
18 7 75 cm 7
19 7 85 cm 52
20 7 95 cm 10
21 8 75 cm 51
22 8 85 cm 114
23 8 95 cm 10
24 9 85 cm 60
25 9 75 cm 9
26 9 95 cm 10
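For reference, here is a self-contained variant of the comprehension approach. It is only a sketch: the DataFrame is a trimmed, three-row sample in the shape of the question's table, empty dictionaries are kept as a single row of missing values, and the names rows and long_df are illustrative.

```python
import pandas as pd

# Trimmed sample in the shape of the question's table (not the full data).
df = pd.DataFrame({
    'id': [1, 4, 7],
    'numbers': [{'105': 1, '65': 11}, {}, {'75 cm': 7, '85 cm': 52}],
})

# One output row per (key, value) pair; an empty dict yields one placeholder row.
rows = [
    (row_id, name, count)
    for row_id, numbers in zip(df['id'], df['numbers'])
    for name, count in (numbers.items() if numbers else [(None, None)])
]
long_df = pd.DataFrame(rows, columns=['id', 'name', 'nameN'])
print(long_df)
```

Iterating with zip over the two columns avoids iterrows and keeps the rows in their original order, so no sort is needed afterwards.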

VLookup (then replace) in Pandas with Dictionary?

I want to replace values in a pandas DataFrame using a dictionary.
DataFrame = games-u-q-s.csv:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
8 432 96 11 112 104 498 122 238 412 0
119 39 76 10 35 54 25 120 157 92 0
57 63 29 61 36 90 19 412 92 22 0
Columns 1 to 10 contain a champId; the winner column is the label.
Dictionary = champNum.csv:
champId champNum
266 1
103 2
84 3
12 4
32 5
34 6
1 7
. .
. .
143 138
I want to convert each champId into its champNum and save the result as dataset_feature_champion_number.csv. The expected output looks like this:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
125 11 59 70 124 36 129 20 135 111 0
23 40 77 53 95 67 73 37 132 91 0
69 13 116 81 22 68 127 111 91 8 0
This is the code:
import csv
import os
import numpy as np
import pandas as pd

def createDictionary(csvfile):
    with open(csvfile, mode='r') as data:
        reader = csv.reader(data)
        dict = {int(rows[0]):int(rows[1]) for rows in reader}
    return dict

def convertDataframeToChampNum(csvfile,dictionary):
    df = pd.read_csv(csvfile)
    temp1 = df.iloc[:,1:11]
    temp2 = df['winner']
    temp3 = temp1.applymap(dictionary.get)
    champNum = temp3.join(temp2)
    return champNum

def saveAsCSV(dataframe):
    dataframe.to_csv("dataset_feature_champion_number.csv")

def main():
    diction = createDictionary("champNum.csv")
    dataset = convertDataframeToChampNum("games-u-q-s.csv",diction)
    saveAsCSV(dataset)

if __name__ =='__main__':
    main()
And I got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-f86679fc49f9> in <module>()
27
28 if __name__ =='__main__':
---> 29 main()
<ipython-input-19-f86679fc49f9> in main()
22
23 def main():
---> 24 diction = createDictionary("champNum.csv")
25 dataset = convertDataframeToChampNum("games-u-q-s.csv",diction)
26 saveAsCSV(dataset)
<ipython-input-19-f86679fc49f9> in createDictionary(csvfile)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
<ipython-input-19-f86679fc49f9> in <dictcomp>(.0)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
ValueError: invalid literal for int() with base 10: 'champNum'
I think you're looking for pandas.DataFrame.transform:
>>> a = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])
>>> a
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
>>> a.transform(lambda x: -x)
0 1 2 3 4
0 -1 -2 -3 -4 -5
1 -6 -7 -8 -9 -10
or, applied to your problem
df = pd.DataFrame({'blue1': [8, 119, 57],
'blue2': [432, 39, 63],
'blue3': [96, 76, 29],
'blue4': [11, 10, 61],
'blue5': [112, 35, 36],
'red1': [104, 54, 90],
'red2': [498, 25, 19],
'red3': [122, 120, 412],
'red4': [238, 157, 92],
'red5': [412, 92, 22],
'winner': [0, 0, 0]})
transform_dict = {266: 1, 103: 2, ...}
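# transform passes each column to the function as a Series, so map the values
# column by column; ids missing from the dictionary become NaN
champ_cols = df.columns.drop('winner')
df[champ_cols] = df[champ_cols].transform(lambda col: col.map(transform_dict))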
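The ValueError in the traceback comes from the header row of champNum.csv: csv.reader yields ['champId', 'champNum'] first, and int('champNum') fails. One way around it is to let pandas read both files, so the header is handled automatically. The sketch below uses small in-memory stand-ins for the two CSV files: the champId/champNum pairs are the first three from the question, while the two game rows and the reduced column set are made up for illustration.

```python
import io
import pandas as pd

# Stand-ins for champNum.csv and games-u-q-s.csv (hypothetical, shortened content).
champ_csv = io.StringIO("champId,champNum\n266,1\n103,2\n84,3\n")
games_csv = io.StringIO("blue1,red1,winner\n266,103,0\n84,266,1\n")

# read_csv consumes the header row, so no int('champNum') problem arises.
mapping = pd.read_csv(champ_csv).set_index('champId')['champNum'].to_dict()

games = pd.read_csv(games_csv)
champ_cols = [c for c in games.columns if c != 'winner']

# Map every champion column through the dictionary; winner is left untouched.
games[champ_cols] = games[champ_cols].apply(lambda col: col.map(mapping))
print(games)
```

With the real files, the StringIO objects would be replaced by the file names, and the result could be written out with to_csv as in the question.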

Pivot Daily Time Series to Rows of Weeks in Pandas

I have a Pandas timeseries:
days = pd.DatetimeIndex([
'2011-01-01T00:00:00.000000000',
'2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000',
'2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000',
'2011-01-06T00:00:00.000000000',
'2011-01-07T00:00:00.000000000',
'2011-01-08T00:00:00.000000000',
'2011-01-09T00:00:00.000000000',
'2011-01-11T00:00:00.000000000',
'2011-01-12T00:00:00.000000000',
'2011-01-13T00:00:00.000000000',
'2011-01-14T00:00:00.000000000',
'2011-01-16T00:00:00.000000000',
'2011-01-18T00:00:00.000000000',
'2011-01-19T00:00:00.000000000',
'2011-01-21T00:00:00.000000000',
])
counts = [85, 97, 24, 64, 3, 37, 73, 86, 87, 82, 75, 84, 43, 51, 42, 3, 70]
df = pd.DataFrame(counts,
index=days,
columns=['count'],
)
df['day of the week'] = df.index.dayofweek
And it looks like this:
count day of the week
2011-01-01 85 5
2011-01-02 97 6
2011-01-03 24 0
2011-01-04 64 1
2011-01-05 3 2
2011-01-06 37 3
2011-01-07 73 4
2011-01-08 86 5
2011-01-09 87 6
2011-01-11 82 1
2011-01-12 75 2
2011-01-13 84 3
2011-01-14 43 4
2011-01-16 51 6
2011-01-18 42 1
2011-01-19 3 2
2011-01-21 70 4
Notice that there are some days that are missing which should be filled with zeros. I want to convert this so it looks like a calendar so that the rows are increasing by weeks, the columns are days of the week, and the values are the count for that particular day. So the end result should look like:
0 1 2 3 4 5 6
0 0 0 0 0 0 85 97
1 24 64 3 37 73 86 87
2 0 82 75 84 43 0 51
3 0 42 3 0 70 0 0
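# create week numbers: a new week starts whenever the day of the week decreases
df['weeks'] = (df['day of the week'].diff() < 0).cumsum()
# pivot the table (recent pandas versions require keyword arguments for pivot)
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0)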
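The diff-based week counter relies on every week having at least one row and would miscount if a whole week were missing. A sketch of a variant that derives the week number from the dates themselves, assuming weeks start on Monday as dayofweek implies; the names full, week_start and calendar are illustrative:

```python
import pandas as pd

days = pd.to_datetime([
    '2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05',
    '2011-01-06', '2011-01-07', '2011-01-08', '2011-01-09', '2011-01-11',
    '2011-01-12', '2011-01-13', '2011-01-14', '2011-01-16', '2011-01-18',
    '2011-01-19', '2011-01-21',
])
counts = [85, 97, 24, 64, 3, 37, 73, 86, 87, 82, 75, 84, 43, 51, 42, 3, 70]
df = pd.DataFrame({'count': counts}, index=days)

# Insert the missing days with a count of zero.
full = df.asfreq('D', fill_value=0)

# Monday of the week each date falls in, then weeks elapsed since the first one.
week_start = full.index - pd.to_timedelta(full.index.dayofweek, unit='D')
full['week'] = (week_start - week_start[0]).days // 7
full['dow'] = full.index.dayofweek

calendar = (full.pivot(index='week', columns='dow', values='count')
                .reindex(columns=range(7))
                .fillna(0)
                .astype(int))
print(calendar)
```

The reindex on columns guarantees all seven weekday columns exist even if some weekday never occurs in the data.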
