Can't get count with 2 arguments in Python pandas DataFrame?
I have a dataframe called new_df, which collates the following data and prints this:
Pass Profit Trades MA2
0 69 10526.0 14 119
1 47 10420.0 13 97
2 68 10406.0 14 118
3 50 10376.0 13 100
4 285 10352.0 16 335
... ... ... ... ...
21643 117 -10376.0 14 167
21644 116 -10376.0 14 166
21645 115 -10376.0 14 165
21646 114 -10376.0 14 164
21647 113 -10376.0 14 163
[21648 rows x 4 columns]
From that I can see that 69 appears 48 times in the Pass column, and so on:
#counts the number of times each pass number is listed in pass column
new_df['Pass'].value_counts()
69 48
219 48
184 48
185 48
186 48
..
59 48
16 48
20 48
70 48
113 48
Name: Pass, Length: 451, dtype: int64
Right now I am trying to create a new df called sorted_df. The columns I can't get working are below:
Total Pass - counts the number of rows for a unique number in the Pass column whose Profit value is above 110000
Pass % - Total Pass / Total Weeks
Total Fail - counts the number of rows for a unique number in the Pass column whose Profit value is below 100000
Fail % - Total Fail / Total Weeks
sorted_df = pd.DataFrame(columns=['Pass','Total Profit','Total Weeks','Average per week','Total Pass','Pass %','Total Fail','Fail %','MA2'])
#group the original df by Pass and get first MA2 value of each group
pass_to_ma2 = new_df.groupby('Pass')['MA2'].first()
total_pass = 0
total_fail = 0
for value in new_df['Pass'].unique():
    mask = new_df['Pass'] == value
    pass_value = new_df[mask]
    total_profit = pass_value['Profit'].sum()
    total_weeks = pass_value.shape[0]
    average_per_week = total_profit / total_weeks
    total_pass = pass_value[pass_value['Profit'] > 110000].shape[0]
    pass_percentage = total_pass / total_weeks * 100 if total_weeks > 0 else 0
    total_fail = pass_value[pass_value['Profit'] < 100000].shape[0]
    fail_percentage = total_fail / total_weeks * 100 if total_weeks > 0 else 0
    sorted_df = sorted_df.append({'Pass': value, 'Total Profit': total_profit, 'Total Weeks': total_weeks, 'Average per week': average_per_week, 'In Profit': in_profit, 'Profit %': profit_percentage, 'Total Pass': total_pass, 'Pass %': pass_percentage, 'Total Fail': total_fail, 'Fail %': fail_percentage}, ignore_index=True)
# Add the MA2 value to the sorted_df DataFrame
sorted_df["MA2"] = sorted_df["Pass"].map(pass_to_ma2)
Pass Total Profit Total Weeks Average per week Total Pass Pass % \
0 69.0 505248.0 48.0 10526.0 0.0 0.0
1 47.0 500160.0 48.0 10420.0 0.0 0.0
2 68.0 499488.0 48.0 10406.0 0.0 0.0
3 50.0 498048.0 48.0 10376.0 0.0 0.0
4 285.0 496896.0 48.0 10352.0 0.0 0.0
.. ... ... ... ... ... ...
446 117.0 -498048.0 48.0 -10376.0 0.0 0.0
447 116.0 -498048.0 48.0 -10376.0 0.0 0.0
448 115.0 -498048.0 48.0 -10376.0 0.0 0.0
449 114.0 -498048.0 48.0 -10376.0 0.0 0.0
450 113.0 -498048.0 48.0 -10376.0 0.0 0.0
Total Fail Fail % MA2 In Profit Profit %
0 48.0 100.0 119 0.0 0.0
1 48.0 100.0 97 0.0 0.0
2 48.0 100.0 118 0.0 0.0
3 48.0 100.0 100 0.0 0.0
4 48.0 100.0 335 0.0 0.0
.. ... ... ... ... ...
446 48.0 100.0 167 0.0 0.0
447 48.0 100.0 166 0.0 0.0
448 48.0 100.0 165 0.0 0.0
449 48.0 100.0 164 0.0 0.0
450 48.0 100.0 163 0.0 0.0
[451 rows x 11 columns]
What am I doing wrong?
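Editorial note on the likely cause: the per-row Profit values shown are around ±10,000, while the thresholds are 110,000 and 100,000, so pass_value['Profit'] > 110000 never matches a row and pass_value['Profit'] < 100000 matches every row; that is exactly the 0.0 / 48.0 pattern in the output, which suggests the thresholds may be meant for the summed Total Profit instead. Separately, the append call references in_profit and profit_percentage, which are never defined in the snippet shown. For reference, a minimal vectorized sketch of the same summary, keeping the question's thresholds verbatim:

import pandas as pd

# One summary row per unique Pass value; thresholds taken as-is from the question.
grouped = new_df.groupby('Pass')
summary = pd.DataFrame({
    'Total Profit': grouped['Profit'].sum(),
    'Total Weeks': grouped['Profit'].size(),
    'Total Pass': grouped['Profit'].apply(lambda s: (s > 110000).sum()),
    'Total Fail': grouped['Profit'].apply(lambda s: (s < 100000).sum()),
    'MA2': grouped['MA2'].first(),
}).reset_index()
summary['Average per week'] = summary['Total Profit'] / summary['Total Weeks']
summary['Pass %'] = summary['Total Pass'] / summary['Total Weeks'] * 100
summary['Fail %'] = summary['Total Fail'] / summary['Total Weeks'] * 100

Note also that DataFrame.append was removed in pandas 2.0; on newer versions the loop body would need sorted_df = pd.concat([sorted_df, pd.DataFrame([row])], ignore_index=True), where row stands for the dict passed to append above.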
Related
Joining 2 dataframes based on a column [duplicate]
This question already has answers here: Pandas Merging 101 (8 answers). Closed 11 months ago.
The following is one of my dataframe structures:

strike  coi  chgcoi
120     200  20
125     210  15
130     230  12
135     240  9

and the other one is:

strike  poi  chgpoi
125     210  15
130     230  12
135     240  9
140     225  12

What I want is:

strike  coi  chgcoi  strike  poi  chgpoi
120     200  20      120     0    0
125     210  15      125     210  15
130     230  12      130     230  12
135     240  9       135     240  9
140     0    0       140     225  12
First, you need to create the two dataframes using pandas:

df1 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
df2 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})

Then you can use an outer join:

df1.merge(df2, on='common_column_name', how='outer')
db1
   strike  coi  chgcoi
0     120  200      20
1     125  210      15
2     130  230      12
3     135  240       9

db2
   strike  poi  chgpoi
0     125  210      15
1     130  230      12
2     135  240       9
3     140  225      12

merge = db1.merge(db2, how="outer", on='strike')
merge
   strike    coi  chgcoi    poi  chgpoi
0     120  200.0    20.0    NaN     NaN
1     125  210.0    15.0  210.0    15.0
2     130  230.0    12.0  230.0    12.0
3     135  240.0     9.0  240.0     9.0
4     140    NaN     NaN  225.0    12.0

merge.fillna(0)
   strike    coi  chgcoi    poi  chgpoi
0     120  200.0    20.0    0.0     0.0
1     125  210.0    15.0  210.0    15.0
2     130  230.0    12.0  230.0    12.0
3     135  240.0     9.0  240.0     9.0
4     140    0.0     0.0  225.0    12.0

This is your expected result, with the only difference that 'strike' is not repeated.
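One small caveat worth adding: fillna returns a new DataFrame rather than modifying merge in place (its inplace parameter defaults to False), so to keep the zeros you would assign it back:

merge = merge.fillna(0)  # fillna returns a copy; reassign to keep the filled values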
replace() works on a str but does not work on an object dtype column
ab = '1 234'
ab = ab.replace(" ", "")
ab
'1234'

It's easy to use replace() to get rid of the whitespace in a plain string, but when I have a column of a pandas dataframe:

gbpusd['Profit'] = gbpusd['Profit'].replace(" ", "")
gbpusd['Profit'].head()
3     7 000.00
4     6 552.00
11    4 680.00
14    3 250.00
24    1 700.00
Name: Profit, dtype: object

it doesn't work; I googled many times but found no solutions. Because the whitespace is still there, I cannot do further analysis, like sum():

gbpusd['Profit'].sum()
TypeError: can only concatenate str (not "int") to str

The thing is harder than I thought: the raw data is

gbpusd.head()
      Ticket      Open Time            Type  Volume  Item    Price    S / L  T / P  Close Time           Price.1  Commission  Taxes  Swap  Profit
84    50204109.0  2019.10.24 09:56:32  buy   0.5     gbpusd  1.29148  0.0    0.0    2019.10.24 09:57:48  1.29179  0           0.0    0.0   15.5
85    50205025.0  2019.10.24 10:10:13  buy   0.5     gbpusd  1.29328  0.0    0.0    2019.10.24 15:57:02  1.29181  0           0.0    0.0   -73.5
86    50207371.0  2019.10.24 10:34:10  buy   0.5     gbpusd  1.29236  0.0    0.0    2019.10.24 15:57:18  1.29197  0           0.0    0.0   -19.5
87    50207747.0  2019.10.24 10:40:32  buy   0.5     gbpusd  1.29151  0.0    0.0    2019.10.24 15:57:24  1.29223  0           0.0    0.0   36
88    50212252.0  2019.10.24 11:47:14  buy   1.5     gbpusd  1.28894  0.0    0.0    2019.10.24 15:57:12  1.29181  0           0.0    0.0   430.5

and when I did

gbpusd['Profit'] = gbpusd['Profit'].str.replace(" ", "")
gbpusd['Profit']
84          NaN
85          NaN
86          NaN
...
117     4680.00
...
130    -2279.00
131    -2217.00
132    -2037.00
133    -5379.00
134    -1620.00
135    -7154.00
136    -4160.00
137     1144.00
...
141    -1920.00
142     7000.00
143     3250.00
144         NaN
145     1700.00
146         NaN
Name: Profit, Length: 63, dtype: object

the whitespace is replaced, but some values which had no space are NaN now. Someone may have the same problem...
You also need to use the .str accessor:

gbpusdprofit = gbpusd['Profit'].str.replace(" ", "")

Output:
0    7000.00
1    6552.00
2    4680.00
3    3250.00
4    1700.00
Name: Profit, dtype: object

and for the sum:

gbpusd['Profit'].str.replace(" ", "").astype('float').sum()

Result: 23182.0
You can strip the spaces, convert to float, and sum in a one-liner:

gbpusd['Profit'].str.replace(' ', "").astype(float).sum()
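As for the NaNs in the question: they most likely come from a mixed-dtype object column, where some entries are already floats (the values with no thousands separator) and the .str accessor yields NaN for anything that is not a string. A sketch of a version that should handle both cases, assuming the column only contains numbers and space-separated number strings:

gbpusd['Profit'] = (
    gbpusd['Profit']
        .astype(str)                        # make every entry a string first
        .str.replace(' ', '', regex=False)  # then strip the thousands separator
        .astype(float)                      # and convert back to a numeric dtype
)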
Can I pass a list of column names into get_dummies() to use as the column label for all possible answers?
(EDITED: I just realised I may be asking a question that cannot be answered, but I'm not sure how to delete it; please ignore or advise on how I can delete. I think I need a different way to approach this problem.)
I have a DataFrame called user_answers, formed using get_dummies(). It looks like this:

Index,Q1_1,Q1_2,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0
2,0,0,1,0,1,0
3,0,1,0,0,1,1
4,1,0,0,0,1,0
5,0,0,0,1,1,0
6,0,0,1,0,1,1
7,0,1,0,0,1,1

I need to do a comparison against a similar DataFrame called DF_answers, which looks like this:

Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,1,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,0,0,1,0,0,1,0
5,0,0,0,0,1,1,0
6,1,0,0,0,0,1,1
7,0,0,0,1,0,1,1

The problem I am having is that get_dummies does not create a column in the user_answers dataframe for Q1_3, because the user didn't select Q1_3 as an answer to any of the 7 questions in the original questionnaire. I need my output of user_answers to look like the illustration below: even if the user did not answer Q1_3 on any of the 7 questions, get_dummies should still output a column Q1_3 filled with zeros.

Index,Q1_1,Q1_2,Q1_3,Q1_4,Q1_5,mas_Y,fhae_Y
1,1,0,0,0,0,0,0
2,1,0,0,0,0,1,0
3,0,1,0,0,0,1,1
4,1,0,0,0,0,1,0
5,0,0,0,1,0,1,0
6,1,0,0,0,0,1,1
7,1,0,0,0,0,1,1

I'm possibly over-thinking this. I read that you can pass a list of column names into get_dummies().
Sorry for the delay; find my attempt below. From what I understand, the following applies: you have a dataframe which only has the questions the user filled out, and you need to merge it onto a frame which has every question for some sort of further analysis. If that is true, this is my noobish attempt:

import numpy as np
import pandas as pd

cols = ['ID', 'Q1_1', 'Q1_2', 'Q1_4', 'Q1_5', 'mas_Y', 'fhae_Y']
data = []
for x in enumerate(cols):
    data.append(np.random.randint(0, 150, size=150))
df = pd.DataFrame(dict(zip(cols, data)))
print(df.head())

    ID  Q1_1  Q1_2  Q1_4  Q1_5  mas_Y  fhae_Y
0    7    76    41    46    57     75     139
1   11   118    65    38    17    116      75
2  111   104   109   110    32     53     106
3  131    14    92   128    14     22      65
4   83    72   148    99   103    133     144

# Create a dummy frame with every question column
cols_b = ['ID']
x = 0
for i in range(1, 101):
    cols_b.append('Q1_' + str(x + i))
data_b = []
for x in enumerate(cols_b):
    data_b.append(np.nan)
df2 = pd.DataFrame(dict(zip(cols_b, data_b)), index=[0])

final_cols = list(df2.columns)
final_cols.append('fhae_Y')
final_cols.append('mas_Y')
df = pd.merge(df, df2, how='left')
print(df[final_cols].fillna(0).head(5))

    ID  Q1_1  Q1_2  Q1_3  Q1_4  Q1_5  Q1_6  Q1_7  Q1_8  Q1_9  ...  Q1_93  Q1_94  Q1_95  Q1_96  Q1_97  Q1_98  Q1_99  Q1_100  fhae_Y  mas_Y
0    7    76    41   0.0    46    57   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0     139     75
1   11   118    65   0.0    38    17   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0      75    116
2  111   104   109   0.0   110    32   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0     106     53
3  131    14    92   0.0   128    14   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0      65     22
4   83    72   148   0.0    99   103   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0     0.0     144
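A more direct route, added here as a sketch rather than an answer from the original thread, is to reindex the dummy frame against the full list of expected columns; reindex adds any missing column, and fill_value=0 fills it with zeros. The data below is illustrative:

import pandas as pd

# Hypothetical answers in which option 3 was never chosen.
answers = pd.Series([1, 4, 2, 1, 5, 4, 2])
user_answers = pd.get_dummies(answers).add_prefix('Q1_')

expected = ['Q1_1', 'Q1_2', 'Q1_3', 'Q1_4', 'Q1_5']
user_answers = user_answers.reindex(columns=expected, fill_value=0)
# Q1_3 now exists as a column of zeros, matching DF_answers' layout.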
Round before converting to string in pandas
I have a problem with rounding. This seems so common, but I can't find the answer by googling, so I decided to ask it here. Here's my data:

           day  reg  log    ad  trans  paid
1111  20171005  172   65  39.0   14.0   3.0
1112  20171006  211   90  46.0   17.0   4.0
1113  20171007  155   70  50.0   17.0   1.0
1114  20171008  174   71  42.0   18.0   0.0
1115  20171009  209   63  43.0   21.0   2.0

Here's what I did (I still want the '%' sign appended to the numbers):

table['% log'] = (table.log / table.reg * 100).astype(str) + '%'
table['% ad'] = (table.ad / table.reg * 100).astype(str) + '%'
table['% trans'] = (table.trans / table.reg * 100).astype(str) + '%'
table['% paid'] = (table.paid / table.reg * 100).astype(str) + '%'

Here's what I get; it needs a final touch of rounding:

           day  reg  log    ad  trans  paid           % log            % ad          % trans           % paid
1111  20171005  172   65  39.0   14.0   3.0  37.7906976744%  22.6744186047%   8.13953488372%   1.74418604651%
1112  20171006  211   90  46.0   17.0   4.0   42.654028436%  21.8009478673%   8.05687203791%   1.89573459716%
1113  20171007  155   70  50.0   17.0   1.0  45.1612903226%  32.2580645161%   10.9677419355%  0.645161290323%
1114  20171008  174   71  42.0   18.0   0.0  40.8045977011%  24.1379310345%   10.3448275862%             0.0%
1115  20171009  209   63  43.0   21.0   2.0  30.1435406699%  20.5741626794%     10.04784689%  0.956937799043%

What I want is for the percentages to not be too long, just rounded to two digits.
You need round:

table['% log'] = (table.log / table.reg * 100).round(2).astype(str) + '%'

A better solution is to select all the columns as a subset and join the output to the original df:

cols = ['log', 'ad', 'trans', 'paid']
table = (table.join(table[cols].div(table.reg, 0)
                               .mul(100)
                               .round(2)
                               .astype(str)
                               .add('%')
                               .add_prefix('% ')))
print (table)
           day  reg  log    ad  trans  paid   % log    % ad % trans  % paid
1111  20171005  172   65  39.0   14.0   3.0  37.79%  22.67%   8.14%   1.74%
1112  20171006  211   90  46.0   17.0   4.0  42.65%   21.8%   8.06%    1.9%
1113  20171007  155   70  50.0   17.0   1.0  45.16%  32.26%  10.97%   0.65%
1114  20171008  174   71  42.0   18.0   0.0   40.8%  24.14%  10.34%    0.0%
1115  20171009  209   63  43.0   21.0   2.0  30.14%  20.57%  10.05%   0.96%

Also, if you need nicer output, format to 2 decimals so trailing zeros are kept:

table = (table.join(table[cols].div(table.reg, 0)
                               .mul(100)
                               .applymap("{0:.2f}".format)
                               .add('%')
                               .add_prefix('% ')))
print (table)
           day  reg  log    ad  trans  paid   % log    % ad % trans  % paid
1111  20171005  172   65  39.0   14.0   3.0  37.79%  22.67%   8.14%   1.74%
1112  20171006  211   90  46.0   17.0   4.0  42.65%  21.80%   8.06%   1.90%
1113  20171007  155   70  50.0   17.0   1.0  45.16%  32.26%  10.97%   0.65%
1114  20171008  174   71  42.0   18.0   0.0  40.80%  24.14%  10.34%   0.00%
1115  20171009  209   63  43.0   21.0   2.0  30.14%  20.57%  10.05%   0.96%
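A version note, assuming a recent pandas: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so on newer versions the formatting step could read:

table[cols].div(table.reg, 0).mul(100).map("{0:.2f}".format)  # pandas >= 2.1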
Difference between dates in Pandas dataframe
This is related to this question, but now I need to find the difference between dates that are stored as 'YYYY-MM-DD'. Essentially, the difference between values in the count column is what we need, but normalized by the number of days between each row. My dataframe is:

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0

And I'd like to find the difference between each date after grouping by site+country_code+kind+ID tuples:

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1

One option would be to convert the date column to a pandas datetime one using pd.to_datetime() and use the diff function, but that results in values like "x days", of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in even a single/less painful step, that would work well.
You can use the .dt.days accessor:

In [72]: df['date'] = pd.to_datetime(df['date'])

In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
                            .diff().dt.days.fillna(0)

In [74]: df
Out[74]:
         date      site country_code  kind  ID  rank  votes  sessions  avg_score  count  day_diff
0  2017-03-20  website1           US     0  84   226    0.0      15.0   3.370812   53.0       0.0
1  2017-03-21  website1           US     0  84   214    0.0      15.0   3.370812   53.0       1.0
2  2017-03-22  website1           US     0  84   226    0.0      16.0   3.370812   53.0       1.0
3  2017-03-23  website1           US     0  84   234    0.0      16.0   3.369048   54.0       1.0
4  2017-03-24  website1           US     0  84   226    0.0      16.0   3.369048   54.0       1.0
5  2017-03-25  website1           US     0  84   212    0.0      16.0   3.369048   54.0       1.0
6  2017-03-27  website1           US     0  84   228    0.0      16.0   3.369048   58.0       2.0
7  2017-02-15  website2           AU     1  91   144    4.0     148.0   4.727272  521.0       0.0
8  2017-02-16  website2           AU     1  91   144    3.0     147.0   4.727272  524.0       1.0
9  2017-02-20  website2           AU     1  91   100    4.0     148.0   4.727272  531.0       4.0
10 2017-02-21  website2           AU     1  91   118    6.0     149.0   4.727272  533.0       1.0
11 2017-02-22  website2           AU     1  91   114    4.0     151.0   4.727272  534.0       1.0
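To get the per-day average change in count that the question mentions, a possible follow-up sketch (the count_diff and daily_avg names are illustrative; the first row of each group divides 0 by 0 and comes out NaN, hence the final fillna):

df['count_diff'] = df.groupby(['site','country_code','kind','ID'])['count'].diff().fillna(0)
df['daily_avg'] = (df['count_diff'] / df['day_diff']).fillna(0)  # change in count per elapsed day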