Can't get count with 2 arguments in Python pandas dataframe?

I have a dataframe called new_df, which collates the following data and prints this:
Pass Profit Trades MA2
0 69 10526.0 14 119
1 47 10420.0 13 97
2 68 10406.0 14 118
3 50 10376.0 13 100
4 285 10352.0 16 335
... ... ... ... ...
21643 117 -10376.0 14 167
21644 116 -10376.0 14 166
21645 115 -10376.0 14 165
21646 114 -10376.0 14 164
21647 113 -10376.0 14 163
[21648 rows x 4 columns]
and then I can see that 69 appears 48 times in the Pass column, etc.
#counts the number of times each pass number is listed in pass column
new_df['Pass'].value_counts()
69 48
219 48
184 48
185 48
186 48
..
59 48
16 48
20 48
70 48
113 48
Name: Pass, Length: 451, dtype: int64
Right now I am trying to create a new df called sorted_df.
The columns I can't get working are below:
Total Pass - counts the number of times a unique number in the Pass column also has the Profit column above 110000
Pass % - Total Pass / Total Weeks
Total Fail - counts the number of times a unique number in the Pass column also has the Profit column below 100000
Fail % - Total Fail / Total Weeks
sorted_df = pd.DataFrame(columns=['Pass','Total Profit','Total Weeks','Average per week','Total Pass','Pass %','Total Fail','Fail %','MA2'])
#group the original df by Pass and get first MA2 value of each group
pass_to_ma2 = new_df.groupby('Pass')['MA2'].first()
total_pass = 0
total_fail = 0
for value in new_df['Pass'].unique():
    mask = new_df['Pass'] == value
    pass_value = new_df[mask]
    total_profit = pass_value['Profit'].sum()
    total_weeks = pass_value.shape[0]
    average_per_week = total_profit / total_weeks
    total_pass = pass_value[pass_value['Profit'] > 110000].shape[0]
    pass_percentage = total_pass / total_weeks * 100 if total_weeks > 0 else 0
    total_fail = pass_value[pass_value['Profit'] < 100000].shape[0]
    fail_percentage = total_fail / total_weeks * 100 if total_weeks > 0 else 0
    sorted_df = sorted_df.append({'Pass': value, 'Total Profit': total_profit, 'Total Weeks': total_weeks, 'Average per week': average_per_week, 'In Profit': in_profit, 'Profit %': profit_percentage, 'Total Pass': total_pass, 'Pass %': pass_percentage, 'Total Fail': total_fail, 'Fail %': fail_percentage}, ignore_index=True)
# Add the MA2 value to the sorted_df DataFrame
sorted_df["MA2"] = sorted_df["Pass"].map(pass_to_ma2)
Pass Total Profit Total Weeks Average per week Total Pass Pass % \
0 69.0 505248.0 48.0 10526.0 0.0 0.0
1 47.0 500160.0 48.0 10420.0 0.0 0.0
2 68.0 499488.0 48.0 10406.0 0.0 0.0
3 50.0 498048.0 48.0 10376.0 0.0 0.0
4 285.0 496896.0 48.0 10352.0 0.0 0.0
.. ... ... ... ... ... ...
446 117.0 -498048.0 48.0 -10376.0 0.0 0.0
447 116.0 -498048.0 48.0 -10376.0 0.0 0.0
448 115.0 -498048.0 48.0 -10376.0 0.0 0.0
449 114.0 -498048.0 48.0 -10376.0 0.0 0.0
450 113.0 -498048.0 48.0 -10376.0 0.0 0.0
Total Fail Fail % MA2 In Profit Profit %
0 48.0 100.0 119 0.0 0.0
1 48.0 100.0 97 0.0 0.0
2 48.0 100.0 118 0.0 0.0
3 48.0 100.0 100 0.0 0.0
4 48.0 100.0 335 0.0 0.0
.. ... ... ... ... ...
446 48.0 100.0 167 0.0 0.0
447 48.0 100.0 166 0.0 0.0
448 48.0 100.0 165 0.0 0.0
449 48.0 100.0 164 0.0 0.0
450 48.0 100.0 163 0.0 0.0
[451 rows x 11 columns]
What am I doing wrong?
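For reference, here is a minimal vectorized sketch of the per-Pass aggregation the loop attempts (the 110000/100000 thresholds and per-row comparison are copied verbatim from the loop; summary, n_above, and n_below are hypothetical names):
import pandas as pd

def n_above(s):
    # rows whose per-row Profit exceeds the question's pass threshold
    return (s > 110000).sum()

def n_below(s):
    # rows whose per-row Profit is under the question's fail threshold
    return (s < 100000).sum()

summary = (
    new_df.groupby('Pass')
          .agg(**{'Total Profit': ('Profit', 'sum'),
                  'Total Weeks': ('Profit', 'size'),
                  'Total Pass': ('Profit', n_above),
                  'Total Fail': ('Profit', n_below),
                  'MA2': ('MA2', 'first')})
          .reset_index()
)
summary['Average per week'] = summary['Total Profit'] / summary['Total Weeks']
summary['Pass %'] = summary['Total Pass'] / summary['Total Weeks'] * 100
summary['Fail %'] = summary['Total Fail'] / summary['Total Weeks'] * 100
Note that every per-row Profit in the sample is around ±10000, so a per-row comparison against 110000/100000 can only ever yield 0 passes and 48 fails, which is exactly the output shown; the thresholds may be meant for the group totals instead.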

Related

Joining 2 dataframe based on a column [duplicate]

Following is one of my dataframe structures:
strike coi chgcoi
120 200 20
125 210 15
130 230 12
135 240 9
and the other one is:
strike poi chgpoi
125 210 15
130 230 12
135 240 9
140 225 12
What I want is:
strike coi chgcoi strike poi chgpoi
120 200 20 120 0 0
125 210 15 125 210 15
130 230 12 130 230 12
135 240 9 135 240 9
140 0 0 140 225 12
First, you need to create two dataframes using pandas:
df1 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
df2 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
Then you can use an outer join:
df1.merge(df2, on='common_column_name', how='outer')
db1
strike coi chgcoi
0 120 200 20
1 125 210 15
2 130 230 12
3 135 240 9
db2
strike poi chgpoi
0 125 210 15
1 130 230 12
2 135 240 9
3 140 225 12
merge = db1.merge(db2,how="outer",on='strike')
merge
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 NaN NaN
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 NaN NaN 225.0 12.0
merge.fillna(0)
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 0.0 0.0
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 0.0 0.0 225.0 12.0
This is your expected result, with the only difference being that 'strike' is not repeated.
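For completeness, a self-contained sketch of the whole step (the frame values are copied from the question's samples; merged is a hypothetical name):
import pandas as pd

df1 = pd.DataFrame({'strike': [120, 125, 130, 135],
                    'coi':    [200, 210, 230, 240],
                    'chgcoi': [20, 15, 12, 9]})
df2 = pd.DataFrame({'strike': [125, 130, 135, 140],
                    'poi':    [210, 230, 240, 225],
                    'chgpoi': [15, 12, 9, 12]})

# An outer join keeps every strike found in either frame; the side with
# no match gets NaN, which fillna(0) then turns into the requested zeros.
merged = df1.merge(df2, on='strike', how='outer').fillna(0)
print(merged)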

replace works on str but does not work on object dtype

ab = '1 234'
ab = ab.replace(" ", "")
ab
'1234'
It's easy to use replace() to get rid of the whitespace, but when I have a column of a pandas dataframe:
gbpusd['Profit'] = gbpusd['Profit'].replace(" ", "")
gbpusd['Profit'].head()
3 7 000.00
4 6 552.00
11 4 680.00
14 3 250.00
24 1 700.00
Name: Profit, dtype: object
But it didn't work; I googled many times but found no solutions...
gbpusd['Profit'].sum()
TypeError: can only concatenate str (not "int") to str
Then, as the whitespace is still there, I cannot do further analysis, like sum().
The thing is harder than I thought: the raw data is
gbpusd.head()
Ticket Open Time Type Volume Item Price S / L T / P Close Time Price.1 Commission Taxes Swap Profit
84 50204109.0 2019.10.24 09:56:32 buy 0.5 gbpusd 1.29148 0.0 0.0 2019.10.24 09:57:48 1.29179 0 0.0 0.0 15.5
85 50205025.0 2019.10.24 10:10:13 buy 0.5 gbpusd 1.29328 0.0 0.0 2019.10.24 15:57:02 1.29181 0 0.0 0.0 -73.5
86 50207371.0 2019.10.24 10:34:10 buy 0.5 gbpusd 1.29236 0.0 0.0 2019.10.24 15:57:18 1.29197 0 0.0 0.0 -19.5
87 50207747.0 2019.10.24 10:40:32 buy 0.5 gbpusd 1.29151 0.0 0.0 2019.10.24 15:57:24 1.29223 0 0.0 0.0 36
88 50212252.0 2019.10.24 11:47:14 buy 1.5 gbpusd 1.28894 0.0 0.0 2019.10.24 15:57:12 1.29181 0 0.0 0.0 430.5
when I did
gbpusd['Profit'] = gbpusd['Profit'].str.replace(" ", "")
gbpusd['Profit']
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
90 NaN
91 NaN
92 NaN
93 NaN
94 NaN
95 NaN
96 NaN
97 NaN
98 NaN
99 NaN
100 NaN
101 NaN
102 NaN
103 NaN
104 NaN
105 NaN
106 NaN
107 NaN
108 NaN
109 NaN
110 NaN
111 NaN
112 NaN
113 NaN
...
117 4680.00
118 NaN
119 NaN
120 NaN
121 NaN
122 NaN
123 NaN
124 NaN
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
130 -2279.00
131 -2217.00
132 -2037.00
133 -5379.00
134 -1620.00
135 -7154.00
136 -4160.00
137 1144.00
138 NaN
139 NaN
140 NaN
141 -1920.00
142 7000.00
143 3250.00
144 NaN
145 1700.00
146 NaN
Name: Profit, Length: 63, dtype: object
The whitespace is replaced, but some data which had no space is NaN now... someone may have the same problem...
You also need to use str:
gbpusdprofit = gbpusd['Profit'].str.replace(" ", "")
Output:
0 7000.00
1 6552.00
2 4680.00
3 3250.00
4 1700.00
Name: Profit, dtype: object
and for sum:
gbpusd['Profit'].str.replace(" ", "").astype('float').sum()
Result:
23182.0
You can convert to string and sum in a one-liner:
gbpusd['Profit'].str.replace(' ', "").astype(float).sum()
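Since the raw column mixes genuine floats (the rows that came back NaN) with spaced strings, one hedged fix is to force everything through str first; a sketch, assuming gbpusd is loaded as in the question:
# astype(str) makes every value a string, so .str.replace no longer
# returns NaN for rows that were already numeric.
profit = (gbpusd['Profit']
          .astype(str)
          .str.replace(' ', '', regex=False)
          .astype(float))
print(profit.sum())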

Can I pass a list of column names into get_dummies() to use as the column label for all possible answers?

(EDITED: I just realised I am asking a question that cannot be answered, but I'm not sure how to delete this question... please ignore, or advise on how I can delete it. I think I need a different way to approach this problem.)
I have a DataFrame called user_answers; this DataFrame is formed using get_dummies(). It looks like this:
Index,Q1_1,Q1_2,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0
2,0,0,1,0,1,0
3,0,1,0,0,1,1
4,1,0,0,0,1,0
5,0,0,0,1,1,0
6,0,0,1,0,1,1
7,0,1,0,0,1,1
I need to do a comparison against a similar DataFrame called DF_answers. That DataFrame looks like this:
Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,1,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,0,0,1,0,0,1,0
5,0,0,0,0,1,1,0
6,1,0,0,0,0,1,1
7,0,0,0,1,0,1,1
The problem I am having is that get_dummies does not create a column in the user_answers dataframe for Q1_3 when the user didn't select Q1_3 as an answer in any of the 7 questions in the original questionnaire. I need to get my output of user_answers to look like this: even if the user did not answer Q1_3 on any of the 7 questions, get_dummies should still output a column Q1_3 filled with zeros, as per the illustration below.
Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,1,0,0,0,0,1,0
5,0,0,0,1,0,1,0
6,1,0,0,0,0,1,1
7,1,0,0,0,0,1,1
I think I have overthought this so much that I'm possibly overcomplicating things. I read that you can pass a list of column names into get_dummies().
Sorry for the delay; find my attempt below. From what I understand, the following applies:
You have a dataframe which only has the questions the user filled out.
You need to merge this onto a frame which has every question for some sort of further analysis.
If this is true, this is my noobish attempt:
import numpy as np
import pandas as pd

cols = ['ID','Q1_1','Q1_2','Q1_4','Q1_5','mas_Y','fhae_Y']
data = []
for _ in cols:
    data.append(np.random.randint(0, 150, size=150))
df = pd.DataFrame(dict(zip(cols, data)))
print(df.head())
ID Q1_1 Q1_2 Q1_4 Q1_5 mas_Y fhae_Y
0 7 76 41 46 57 75 139
1 11 118 65 38 17 116 75
2 111 104 109 110 32 53 106
3 131 14 92 128 14 22 65
4 83 72 148 99 103 133 144
## Create a dummy frame
cols_b = ['ID']
for i in range(1, 101):
    cols_b.append('Q1_' + str(i))
data_b = []
for _ in cols_b:
    data_b.append(np.nan)
df2 = pd.DataFrame(dict(zip(cols_b, data_b)), index=[0])
final_cols = list(df2.columns)
final_cols.append('fhae_Y')
final_cols.append('mas_Y')
df = pd.merge(df,df2,how='left')
print(df[final_cols].fillna(0).head(5))
ID Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6 Q1_7 Q1_8 Q1_9 ... Q1_93 Q1_94 Q1_95 Q1_96 Q1_97 Q1_98 Q1_99 Q1_100 fhae_Y mas_Y
0 7 76 41 0.0 46 57 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 139 75
1 11 118 65 0.0 38 17 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 75 116
2 111 104 109 0.0 110 32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 106 53
3 131 14 92 0.0 128 14 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 65 22
4 83 72 148 0.0 99 103 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 144 133
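As a side note, and only a sketch of an alternative (full_cols is a hypothetical name): reindex can align a get_dummies() result against a full column list directly, filling any missing answer column with zeros:
full_cols = ['Q1_1', 'Q1_2', 'Q1_3', 'Q1_4', 'Q1_5', 'mas_Y', 'fhae_Y']
# Adds Q1_3 (or any other absent column) filled with 0 and fixes the order.
user_answers = user_answers.reindex(columns=full_cols, fill_value=0)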

Round before convert to string in pandas

I have a problem with rounding; it seems so common, but I can't find the answer by googling, so I decided to ask here.
Here's my data
day reg log ad trans paid
1111 20171005 172 65 39.0 14.0 3.0
1112 20171006 211 90 46.0 17.0 4.0
1113 20171007 155 70 50.0 17.0 1.0
1114 20171008 174 71 42.0 18.0 0.0
1115 20171009 209 63 43.0 21.0 2.0
Here's what I did; I still want the % sign in the number:
table['% log'] = (table.log / table.reg * 100).astype(str) + '%'
table['% ad'] = (table.ad / table.reg * 100).astype(str) + '%'
table['% trans'] = (table.trans / table.reg* 100).astype(str) + '%'
table['% paid'] = (table.paid / table.reg * 100).astype(str) + '%'
Here's what I get; it needs a final touch of rounding:
day reg log ad trans paid % log % ad % trans % paid
1111 20171005 172 65 39.0 14.0 3.0 37.7906976744% 22.6744186047% 8.13953488372% 1.74418604651%
1112 20171006 211 90 46.0 17.0 4.0 42.654028436% 21.8009478673% 8.05687203791% 1.89573459716%
1113 20171007 155 70 50.0 17.0 1.0 45.1612903226% 32.2580645161% 10.9677419355% 0.645161290323%
1114 20171008 174 71 42.0 18.0 0.0 40.8045977011% 24.1379310345% 10.3448275862% 0.0%
1115 20171009 209 63 43.0 21.0 2.0 30.1435406699% 20.5741626794% 10.04784689% 0.956937799043%
What I want is for the percentage to not be too long, just rounded to two digits.
You need round:
table['% log'] = (table.log / table.reg * 100).round(2).astype(str) + '%'
A better solution is to select all the columns as a subset and join the output to the original df:
cols = ['log','ad','trans','paid']
table = (table.join(table[cols].div(table.reg, 0)
                               .mul(100)
                               .round(2)
                               .astype(str)
                               .add('%')
                               .add_prefix('% ')))
print (table)
day reg log ad trans paid % log % ad % trans % paid
1111 20171005 172 65 39.0 14.0 3.0 37.79% 22.67% 8.14% 1.74%
1112 20171006 211 90 46.0 17.0 4.0 42.65% 21.8% 8.06% 1.9%
1113 20171007 155 70 50.0 17.0 1.0 45.16% 32.26% 10.97% 0.65%
1114 20171008 174 71 42.0 18.0 0.0 40.8% 24.14% 10.34% 0.0%
1115 20171009 209 63 43.0 21.0 2.0 30.14% 20.57% 10.05% 0.96%
Also, if you need nicer output, pad to 2 decimals with a format string:
table = (table.join(table[cols].div(table.reg, 0)
                               .mul(100)
                               .applymap("{0:.2f}".format)
                               .add('%')
                               .add_prefix('% ')))
print (table)
day reg log ad trans paid % log % ad % trans % paid
1111 20171005 172 65 39.0 14.0 3.0 37.79% 22.67% 8.14% 1.74%
1112 20171006 211 90 46.0 17.0 4.0 42.65% 21.80% 8.06% 1.90%
1113 20171007 155 70 50.0 17.0 1.0 45.16% 32.26% 10.97% 0.65%
1114 20171008 174 71 42.0 18.0 0.0 40.80% 24.14% 10.34% 0.00%
1115 20171009 209 63 43.0 21.0 2.0 30.14% 20.57% 10.05% 0.96%

Difference between dates in Pandas dataframe

This is related to this question, but now I need to find the difference between dates that are stored in 'YYYY-MM-DD' format. Essentially the difference between values in the count column is what we need, but normalized by the number of days between each row.
My dataframe is:
date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0
And I'd like to find the difference between each date after grouping by date+site+country+kind+ID tuples.
[date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1]
One option would be to convert the date column to a pandas datetime one using pd.to_datetime() and use the diff function, but that results in values like "x days", of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in even a single/less painful step, that would work well.
You can use the .dt.days accessor:
In [72]: df['date'] = pd.to_datetime(df['date'])
In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
.diff().dt.days.fillna(0)
In [74]: df
Out[74]:
date site country_code kind ID rank votes sessions avg_score count day_diff
0 2017-03-20 website1 US 0 84 226 0.0 15.0 3.370812 53.0 0.0
1 2017-03-21 website1 US 0 84 214 0.0 15.0 3.370812 53.0 1.0
2 2017-03-22 website1 US 0 84 226 0.0 16.0 3.370812 53.0 1.0
3 2017-03-23 website1 US 0 84 234 0.0 16.0 3.369048 54.0 1.0
4 2017-03-24 website1 US 0 84 226 0.0 16.0 3.369048 54.0 1.0
5 2017-03-25 website1 US 0 84 212 0.0 16.0 3.369048 54.0 1.0
6 2017-03-27 website1 US 0 84 228 0.0 16.0 3.369048 58.0 2.0
7 2017-02-15 website2 AU 1 91 144 4.0 148.0 4.727272 521.0 0.0
8 2017-02-16 website2 AU 1 91 144 3.0 147.0 4.727272 524.0 1.0
9 2017-02-20 website2 AU 1 91 100 4.0 148.0 4.727272 531.0 4.0
10 2017-02-21 website2 AU 1 91 118 6.0 149.0 4.727272 533.0 1.0
11 2017-02-22 website2 AU 1 91 114 4.0 151.0 4.727272 534.0 1.0
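To get the normalized daily count the question asks about, a sketch of the follow-on step under the same grouping (daily_avg is a hypothetical name; the first row of each group stays NaN because diff has no previous value there):
group_cols = ['site', 'country_code', 'kind', 'ID']
# Change in count since the previous row of the same group, per elapsed day.
df['daily_avg'] = df.groupby(group_cols)['count'].diff() / df['day_diff']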
