I cannot make my ideal DataFrame - python

I have CSV data like this:
No,User,A,B,C,D
1 Tom 100 120 110 90
1 Juddy 89 90 100 110
1 Bob 99 80 90 100
2 Tom 80 100 100 70
2 Juddy 79 90 80 70
2 Bob 88 90 95 90
・
・
・
I want to transform this CSV data into a DataFrame like this:
    Tom_A  Tom_B  Tom_C  Tom_D  Juddy_A  Juddy_B  Juddy_C  Juddy_D  Bob_A  Bob_B  Bob_C  Bob_D
No
1     100    120    110     90       89       90      100      110     99     80     90    100
2      80    100    100     70       79       90       80       70     88     90     95     90
I ran this code:
import pandas as pd
csv = pd.read_csv("user.csv", header=0, index_col='No', sep=r'\s|,', engine='python')
but the output is not what I want. I cannot work out how to get column names such as Tom_A, Tom_B and Juddy_A, which do not appear in the CSV.
How should I fix my code?

Setup
df = pd.DataFrame({'No': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}, 'User': {0: 'Tom', 1: 'Juddy', 2: 'Bob', 3: 'Tom', 4: 'Juddy', 5: 'Bob'}, 'A': {0: 100, 1: 89, 2: 99, 3: 80, 4: 79, 5: 88}, 'B': {0: 120, 1: 90, 2: 80, 3: 100, 4: 90, 5: 90}, 'C': {0: 110, 1: 100, 2: 90, 3: 100, 4: 80, 5: 95}, 'D': {0: 90, 1: 110, 2: 100, 3: 70, 4: 70, 5: 90}})
You want pivot_table:
out = df.pivot_table(index='No', columns='User')
        A               B               C               D
User  Bob Juddy  Tom  Bob Juddy  Tom  Bob Juddy  Tom  Bob Juddy  Tom
No
1      99    89  100   80    90  120   90   100  110  100   110   90
2      88    79   80   90    90  100   95    80  100   90    70   70
To get the prefix:
out.columns = out.columns.swaplevel(0,1).to_series().str.join('_')
Bob_A Juddy_A Tom_A Bob_B Juddy_B Tom_B Bob_C Juddy_C Tom_C Bob_D Juddy_D Tom_D
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
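If the column order from the question matters (all of Tom's columns first, then Juddy's, then Bob's), the following is a minimal sketch of one way to get it. It assumes the file looks exactly like the sample in the question (comma-separated header, whitespace-separated rows) and that each No/User pair occurs only once, so plain pivot is enough. The inline raw string stands in for user.csv, and the names raw, value_cols, users and wide are only illustrative.

```python
import io

import pandas as pd

# Stand-in for user.csv, copied from the sample in the question.
raw = """No,User,A,B,C,D
1 Tom 100 120 110 90
1 Juddy 89 90 100 110
1 Bob 99 80 90 100
2 Tom 80 100 100 70
2 Juddy 79 90 80 70
2 Bob 88 90 95 90
"""

# Accept either a comma or a run of whitespace as the delimiter.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+|,", engine="python")

value_cols = ["A", "B", "C", "D"]
users = list(df["User"].unique())  # order of first appearance: Tom, Juddy, Bob

# pivot yields MultiIndex columns of the form (value column, user).
wide = df.pivot(index="No", columns="User", values=value_cols)

# Flatten to "<user>_<column>" names, then reorder user by user.
wide.columns = [f"{user}_{col}" for col, user in wide.columns]
wide = wide[[f"{user}_{col}" for user in users for col in value_cols]]

print(wide)
```

Unlike pivot_table, pivot does not aggregate, so it raises an error if a No/User pair is duplicated; in that case pivot_table with an explicit aggfunc is probably the safer choice.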


Remove rows in dataframe by overlapping groups based on coordinates

I have a dataframe such as
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
F C2 350 400 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Within each Chrm, I would like to group the rows whose start-end ranges overlap and keep, from each group, the row with the longest length and, among ties, the highest score.
For example in C1:
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
The start-to-end coordinates of A, B, C and Cbis overlap one another, and those of D and E overlap each other.
In the A, B, C, Cbis group the longest rows are C and Cbis with 54, so I keep the one with the highest score, which is C (12). In the D, E group the longest is D with 50.
So I keep only rows C and D here.
If I do the same for other Chrm I should then get the following output:
Seq Chrm start end length score
C C1 6 60 54 12
D C1 70 120 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Here is the dataframe in dict format, in case it helps:
{'Seq': {0: 'A', 1: 'B', 2: 'C', 3: 'Cbis', 4: 'D', 5: 'E', 6: 'F', 7: 'A', 8: 'B'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1', 3: 'C1', 4: 'C1', 5: 'C1', 6: 'C2', 7: 'C2', 8: 'C2'}, 'start': {0: 1, 1: 3, 2: 6, 3: 6, 4: 70, 5: 78, 6: 350, 7: 349, 8: 450}, 'end': {0: 50, 1: 55, 2: 60, 3: 60, 4: 120, 5: 111, 6: 400, 7: 400, 8: 500}, 'length': {0: 49, 1: 52, 2: 54, 3: 54, 4: 50, 5: 33, 6: 50, 7: 51, 8: 50}, 'score': {0: 12, 1: 12, 2: 12, 3: 11, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12}}
Edit for Corralien:
If I use this table:
Seq Chrm start end length score
A C1 12414 14672 49 12
B C1 12414 14741 52 12
C C1 12414 14744 54 12
it does not put A, B and C in the same overlapping group:
{'Seq': {0: 'A', 1: 'B', 2: 'C'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1'}, 'start': {0: 12414, 1: 12414, 2: 12414}, 'end': {0: 14672, 1: 14741, 2: 14744}, 'length': {0: 49, 1: 52, 2: 54}, 'score': {0: 12, 1: 12, 2: 12}}
Create virtual groups and keep the best row (length, score) for each group:
Suppose this dataframe:
>>> df
Seq Chrm start end length score
0 A C1 1 50 49 12
1 B C1 3 55 52 12
2 C C1 6 60 54 12
3 Cbis C1 6 60 54 11
4 D C1 70 120 50 12
5 E C1 78 111 33 12
6 F C2 350 400 50 12
7 A C2 349 400 51 12
8 B C2 450 500 50 12
9 A C1 12414 14672 49 12
10 B C1 12414 14741 52 12
11 C C1 12414 14744 54 12
Create groups:
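# True where a row starts at or after the previous row's end, i.e. where a new group begins
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
df['group'] = df.sort_values(['Chrm', 'start', 'end']) \
                .groupby('Chrm').apply(is_overlapped).droplevel(0).cumsum()
# Best row per group: longest length first, then highest score
out = df.sort_values(['group', 'length', 'score'], ascending=[True, False, False]) \
        .groupby(df['group']).head(1)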
Output:
>>> out
Seq Chrm start end length score group
2 C C1 6 60 54 12 1
4 D C1 70 120 50 12 2
11 C C1 12414 14744 54 12 3
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
# Groups
>>> df
Seq Chrm start end length score group
0 A C1 1 50 49 12 1
1 B C1 3 55 52 12 1
2 C C1 6 60 54 12 1
3 Cbis C1 6 60 54 11 1
4 D C1 70 120 50 12 2
5 E C1 78 111 33 12 2
6 F C2 350 400 50 12 4
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
9 A C1 12414 14672 49 12 3
10 B C1 12414 14741 52 12 3
11 C C1 12414 14744 54 12 3
You can drop the group column with out.drop(columns='group') but I left it to illustrate the virtual groups.
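One caveat: comparing each start only with the previous row's end can split a group when an earlier, longer interval extends past a later, shorter one (for example a row starting at 115 after D 70-120 and E 78-111). The sketch below is a variation, not part of the answer above: it compares each start with the largest end seen so far in the same Chrm. The names s, prev_max_end, new_group and best are illustrative, and the data is the original nine-row table from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Seq": ["A", "B", "C", "Cbis", "D", "E", "F", "A", "B"],
    "Chrm": ["C1"] * 6 + ["C2"] * 3,
    "start": [1, 3, 6, 6, 70, 78, 350, 349, 450],
    "end": [50, 55, 60, 60, 120, 111, 400, 400, 500],
    "length": [49, 52, 54, 54, 50, 33, 50, 51, 50],
    "score": [12, 12, 12, 11, 12, 12, 12, 12, 12],
})

s = df.sort_values(["Chrm", "start", "end"])

# Largest end among the earlier rows of the same Chrm (NaN for the first row).
prev_max_end = s.groupby("Chrm")["end"].transform(lambda e: e.cummax().shift())

# A new group begins on the first row of a Chrm, or when a row starts at or
# after everything before it has ended (same >= rule as the code above).
new_group = prev_max_end.isna() | (s["start"] >= prev_max_end)
s["group"] = new_group.cumsum()

# Keep the longest row per group, breaking ties by the highest score.
best = (
    s.sort_values(["group", "length", "score"], ascending=[True, False, False])
     .drop_duplicates("group")
     .sort_index()
)
print(best)
```

On this sample the result matches the expected output in the question (C and D for C1, A and B for C2); the two approaches only differ when intervals are nested.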

Pandas: calculate weighted average by row using a dataframe and a series

I was trying to compute a weighted average and ran into a problem:
Problem
I want to create a new column named answer that holds, for each row, the weighted average of the row's values using a list of weights, here named months. If I use df.mean() I get a simple average across the months, which is not what I want. The idea is to give more importance to the end of the year and less to the demand at the beginning of the year, which is why I would like to use a weighted average.
In Excel I would use the formula below. I'm having trouble converting this calculation to a pandas data frame.
=SUMPRODUCT( demands[#[1]:[12]] ; month )/SUM(month)
I couldn't find a solution to this problem and would really appreciate help with it.
Thank you in advance.
Here's a dummy dataframe that serves as an example:
Example Code
demand = pd.DataFrame({'1': [360, 40, 100, 20, 55],
'2': [500, 180, 450, 60, 50],
'3': [64, 30, 60, 10, 0],
'4': [50, 40, 30, 60, 50],
'5': [40, 24, 45, 34, 60],
'6': [30, 34, 65, 80, 78],
'7': [56, 45, 34, 90, 58],
'8': [32, 12, 45, 55, 66],
'9': [32, 56, 89, 67, 56],
'10': [57, 35, 75, 48, 9],
'11': [56, 33, 11, 6, 78],
'12': [23, 65, 34, 8, 67]
})
months = [i for i in range(1,13)]
Just use numpy.average, specifying weights:
demand["result"]=np.average(demand, weights=months, axis=1)
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.average.html
Outputs:
1 2 3 4 5 6 ... 8 9 10 11 12 result
0 360 500 64 50 40 30 ... 32 32 57 56 23 58.076923
1 40 180 30 40 24 34 ... 12 56 35 33 65 43.358974
2 100 450 60 30 45 65 ... 45 89 75 11 34 58.884615
3 20 60 10 60 34 80 ... 55 67 48 6 8 43.269231
4 55 50 0 50 60 78 ... 66 56 9 78 67 55.294872
This can be done by the following:
demand['result'] = (demand * months).sum(axis=1)/sum(months)
You can try this code:
den = np.sum(months)
demand['average']=demand['1'].mul(1/den).add(demand['2'].mul(2/den)).add(demand['3'].mul(3/den)).add(demand['4'].mul(4/den)).add(demand['5'].mul(5/den)).add(demand['6'].mul(6/den)).add(demand['7'].mul(7/den)).add(demand['8'].mul(8/den)).add(demand['9'].mul(9/den)).add(demand['10'].mul(10/den)).add(demand['11'].mul(11/den)).add(demand['12'].mul(12/den))
The Output:
1 2 3 4 5 6 7 8 9 10 11 12 average
0 360 500 64 50 40 30 56 32 32 57 56 23 58.076923
1 40 180 30 40 24 34 45 12 56 35 33 65 43.358974
2 100 450 60 30 45 65 34 45 89 75 11 34 58.884615
3 20 60 10 60 34 80 90 55 67 48 6 8 43.269231
4 55 50 0 50 60 78 58 66 56 9 78 67 55.294872
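The approaches above all compute the same SUMPRODUCT/SUM quantity. As a small sketch, the version below selects the twelve month columns explicitly, so it still works if it is rerun after a result column has been added to demand (the one-liners above multiply the whole frame, which would then have thirteen columns). The names month_cols and values are illustrative only.

```python
import numpy as np
import pandas as pd

demand = pd.DataFrame({
    '1': [360, 40, 100, 20, 55],
    '2': [500, 180, 450, 60, 50],
    '3': [64, 30, 60, 10, 0],
    '4': [50, 40, 30, 60, 50],
    '5': [40, 24, 45, 34, 60],
    '6': [30, 34, 65, 80, 78],
    '7': [56, 45, 34, 90, 58],
    '8': [32, 12, 45, 55, 66],
    '9': [32, 56, 89, 67, 56],
    '10': [57, 35, 75, 48, 9],
    '11': [56, 33, 11, 6, 78],
    '12': [23, 65, 34, 8, 67],
})

months = np.arange(1, 13)               # weights 1..12
month_cols = [str(m) for m in months]   # the columns that hold demand values

# Row-wise sum of value * weight, divided by the sum of the weights.
values = demand[month_cols].to_numpy()
demand["result"] = values @ months / months.sum()

print(demand["result"])
```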

Pandas column values between values from another dataframe column

I have two pandas data-frames as follows:
import pandas as pd
import numpy as np
import string
size = 5
student_names = [''.join(np.random.choice(list(string.ascii_lowercase), size=4)) for i in range(size)]
marks = list(np.random.randint(50, high=100, size=size))
df1 = pd.DataFrame({'Student Names': student_names, 'Total': marks})
grade_leters = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D',
'D-', 'F']
grade_minimum_value = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0]
df2 = pd.DataFrame({'Grade Letters': grade_leters, 'Minimums': grade_minimum_value})
df1
Student Names Total
0 cjpv 83
1 iywm 98
2 jhhb 87
3 qwau 70
4 ppai 82
df2
Grade Letters Minimums
0 A+ 95
1 A 90
2 A- 85
3 B+ 80
4 B 75
5 B- 70
6 C+ 65
7 C 60
8 C- 55
9 D+ 50
10 D 45
11 D- 40
12 F 0
I want to give the grade letter as a new column to df1. For example, student cjpv having a total mark of 83 will receive a grade letter of B+, since 83 is between 80 (inclusive) and 85 (exclusive).
The desired output is as follows.
Student Names Total Grade
0 cjpv 83 B+
1 iywm 98 A+
2 jhhb 87 A-
3 qwau 70 B-
4 ppai 82 B+
Thanks in advance. My apologies if there is a similar question to this; I could not find one after a long search.
Use cut, building the bins and labels dynamically from the df2 columns; right=False makes the bins left-closed:
np.random.seed(123)
size = 20
student_names = [''.join(np.random.choice(list(string.ascii_lowercase), size=4)) for i in range(size)]
marks = list(np.random.randint(50, high=100, size=size))
df1 = pd.DataFrame({'Student Names': student_names, 'Total': marks})
grade_leters = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D',
'D-', 'F']
grade_minimum_value = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0]
df2 = pd.DataFrame({'Grade Letters': grade_leters, 'Minimums': grade_minimum_value})
df2 = df2.sort_values('Minimums')
df1['new'] = pd.cut(df1['Total'],
bins=df2['Minimums'].tolist() + [np.inf],
labels=df2['Grade Letters'],
right=False)
print (df1)
Student Names Total new
0 nccg 70 B-
1 rtkz 99 A+
2 wbar 62 C
3 pjao 68 C+
4 apzt 67 C+
5 oeaq 51 D+
6 erxd 94 A
7 cuhc 91 A
8 upyq 98 A+
9 hjdu 77 B
10 gbvw 99 A+
11 cbmi 72 B-
12 dkfa 53 D+
13 lckw 53 D+
14 nsep 61 C
15 lmug 71 B-
16 ntqg 75 B
17 ouhl 89 A-
18 whbl 91 A
19 fxzs 84 B+
As @Henry Yik commented, it is also possible to use merge_asof:
df1 = pd.merge_asof(df1.sort_values('Total'), df2, left_on='Total', right_on='Minimums')
print (df1)
Student Names Total new Grade Letters Minimums
0 oeaq 51 D+ D+ 50
1 lckw 53 D+ D+ 50
2 dkfa 53 D+ D+ 50
3 nsep 61 C C 60
4 wbar 62 C C 60
5 apzt 67 C+ C+ 65
6 pjao 68 C+ C+ 65
7 nccg 70 B- B- 70
8 lmug 71 B- B- 70
9 cbmi 72 B- B- 70
10 ntqg 75 B B 75
11 hjdu 77 B B 75
12 fxzs 84 B+ B+ 80
13 ouhl 89 A- A- 85
14 whbl 91 A A 90
15 cuhc 91 A A 90
16 erxd 94 A A 90
17 upyq 98 A+ A+ 95
18 rtkz 99 A+ A+ 95
19 gbvw 99 A+ A+ 95
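As a quick check against the question, here is the same cut call applied to the five students listed there (typed in by hand rather than generated randomly). It is a sketch only; the name bounds is illustrative, and the result column is called Grade to match the desired output.

```python
import numpy as np
import pandas as pd

# The five students from the question, entered literally.
df1 = pd.DataFrame({
    "Student Names": ["cjpv", "iywm", "jhhb", "qwau", "ppai"],
    "Total": [83, 98, 87, 70, 82],
})
df2 = pd.DataFrame({
    "Grade Letters": ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-",
                      "D+", "D", "D-", "F"],
    "Minimums": [95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 0],
})

# cut needs increasing bin edges, so sort by the minimum mark first.
bounds = df2.sort_values("Minimums")

df1["Grade"] = pd.cut(
    df1["Total"],
    bins=bounds["Minimums"].tolist() + [np.inf],  # 13 lower bounds + open top
    labels=bounds["Grade Letters"].tolist(),      # one label per interval
    right=False,                                  # [min, next_min) intervals
)
print(df1)
```

The Grade column comes back as a categorical; astype(str) converts it to plain strings if that is more convenient.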

VLookup (then replace) in Pandas with Dictionary?

I want to replace values in a Pandas dataframe using a dictionary.
DataFrame = games-u-q-s.csv:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
8 432 96 11 112 104 498 122 238 412 0
119 39 76 10 35 54 25 120 157 92 0
57 63 29 61 36 90 19 412 92 22 0
Columns 1-10 contain champId values, and the winner column is the label.
Dictionary = champNum.csv
champId champNum
266 1
103 2
84 3
12 4
32 5
34 6
1 7
. .
. .
143 138
I want to convert champId into champNum, save the result as dataset_feature_champion_number.csv, and get output like this:
blue1 blue2 blue3 blue4 blue5 red1 red2 red3 red4 red5 winner
125 11 59 70 124 36 129 20 135 111 0
23 40 77 53 95 67 73 37 132 91 0
69 13 116 81 22 68 127 111 91 8 0
This is the code:
import csv
import os
import numpy as np
import pandas as pd
def createDictionary(csvfile):
    with open(csvfile, mode='r') as data:
        reader = csv.reader(data)
        dict = {int(rows[0]):int(rows[1]) for rows in reader}
    return dict

def convertDataframeToChampNum(csvfile,dictionary):
    df = pd.read_csv(csvfile)
    temp1 = df.iloc[:,1:11]
    temp2 = df['winner']
    temp3 = temp1.applymap(dictionary.get)
    champNum = temp3.join(temp2)
    return champNum

def saveAsCSV(dataframe):
    dataframe.to_csv("dataset_feature_champion_number.csv")

def main():
    diction = createDictionary("champNum.csv")
    dataset = convertDataframeToChampNum("games-u-q-s.csv",diction)
    saveAsCSV(dataset)

if __name__ =='__main__':
    main()
And I got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-f86679fc49f9> in <module>()
27
28 if __name__ =='__main__':
---> 29 main()
<ipython-input-19-f86679fc49f9> in main()
22
23 def main():
---> 24 diction = createDictionary("champNum.csv")
25 dataset = convertDataframeToChampNum("games-u-q-s.csv",diction)
26 saveAsCSV(dataset)
<ipython-input-19-f86679fc49f9> in createDictionary(csvfile)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
<ipython-input-19-f86679fc49f9> in <dictcomp>(.0)
7 with open(csvfile, mode='r') as data:
8 reader = csv.reader(data)
----> 9 dict = {int(rows[0]):int(rows[1]) for rows in reader}
10 return dict
11
ValueError: invalid literal for int() with base 10: 'champNum'
I think you're looking for pandas.DataFrame.transform:
>>> a = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])
>>> a
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
>>> a.transform(lambda x: -x)
0 1 2 3 4
0 -1 -2 -3 -4 -5
1 -6 -7 -8 -9 -10
or, applied to your problem
df = pd.DataFrame({'blue1': [8, 119, 57],
'blue2': [432, 39, 63],
'blue3': [96, 76, 29],
'blue4': [11, 10, 61],
'blue5': [112, 35, 36],
'red1': [104, 54, 90],
'red2': [498, 25, 19],
'red3': [122, 120, 412],
'red4': [238, 157, 92],
'red5': [412, 92, 22],
'winner': [0, 0, 0]})
transform_dict = {266: 1, 103: 2, ...}
# transform passes each column as a Series, so map it through the dict; IDs missing from the dict become NaN
df.transform(lambda col: col.map(transform_dict))
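The ValueError in the question comes from the header row of champNum.csv being passed to int(). One way around it, sketched below, is to let pandas read the lookup file (it treats the first row as a header by default) and map each champion column through the resulting Series. The two inline CSV strings are tiny made-up stand-ins for champNum.csv and games-u-q-s.csv, assumed to be comma-separated with a header row; the names lookup, games and champ_cols are illustrative.

```python
import io

import pandas as pd

# Made-up stand-ins for the two files; only the layout mirrors the question.
champ_csv = io.StringIO("champId,champNum\n266,1\n103,2\n84,3\n12,4\n")
games_csv = io.StringIO(
    "blue1,blue2,red1,red2,winner\n"
    "266,103,84,12,0\n"
    "12,84,103,266,1\n"
)

# read_csv consumes the header row, so no int('champNum') conversion happens.
lookup = pd.read_csv(champ_csv).set_index("champId")["champNum"]

games = pd.read_csv(games_csv)
champ_cols = [c for c in games.columns if c != "winner"]  # leave the label alone

# Map every champion column through the lookup; unknown IDs would become NaN.
games[champ_cols] = games[champ_cols].apply(lambda col: col.map(lookup))

print(games)
# games.to_csv("dataset_feature_champion_number.csv", index=False) would save it.
```

Excluding winner by name avoids the positional slice in the question (iloc[:, 1:11]), which skips the first column and includes the label when the champion columns are the first ten.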

Pandas Very Simple Percent of total size from Group by

I'm having trouble with a seemingly very easy operation. What is the most succinct way to get the percent of total from a group-by operation such as df.groupby('col1').size()? My DF after grouping looks like the one below, and I just want each group's percent of the total. I remember using a variation of this statement in the past but cannot get it to work now: percent = totals.div(totals.sum(1), axis=0)
Original DF:
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 44 95 27
Result:
df1.groupby('A').size() / df1.groupby('A').size().sum()
A
34 0.1
44 0.1
72 0.1
77 0.7
Here is what I have come up with so far, which seems a pretty reasonable way to do this:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
I don't know if I'm missing something, but looks like you could do something like this:
df.groupby('A').size() * 100 / len(df)
or
df.groupby('A').size() * 100 / df.shape[0]
Getting good performance (3.73s) on DF with shape (3e6,59) by using:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
How about:
df = pd.DataFrame({'A': {0: 77, 1: 77, 2: 77, 3: 77, 4: 77, 5: 77, 6: 77, 7: 72, 8: 34, 9: None},
'B': {0: 3, 1: 52, 2: 58, 3: 3, 4: 31, 5: 53, 6: 2, 7: 25, 8: 41, 9: 95},
'C': {0: 98, 1: 99, 2: 61, 3: 93, 4: 99, 5: 51, 6: 9, 7: 78, 8: 34, 9: 27}})
>>> df.groupby('A').size().divide(sum(df['A'].notnull()))
A
34 0.111111
72 0.111111
77 0.777778
dtype: float64
>>> df
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 NaN 95 27
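For a single grouping column, value_counts with normalize=True gives the same shares in one call. The sketch below uses column A of the original ten-row frame from the question and compares the result with the groupby version; the names pct and by_size are illustrative.

```python
import pandas as pd

# Column A of the original DF in the question.
df = pd.DataFrame({"A": [77, 77, 77, 77, 77, 77, 77, 72, 34, 44]})

# Share of rows per value of A, as a percentage, ordered by the value of A.
pct = df["A"].value_counts(normalize=True).sort_index() * 100

# The groupby formulation from the answers above, for comparison.
by_size = df.groupby("A").size() / len(df) * 100

print(pct)
```

Both value_counts and groupby skip NaN keys by default, but value_counts(normalize=True) divides by the number of non-null values while len(df) counts every row, so the two differ when A contains NaN (as in the last example above).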
