I have a column like this:

User   time   Column
User1  time1  44 db
User1  time2  55 db
User1  time3  43 db
User1  time4  no_available
How can I calculate the average, min, and max for each user, taking just 44, 55, 43 (without "db") and ignoring values like 'no_available' and 'no_power'?
Bonus: how can I take the last value of the day if a user has, for example, 10 values for 10 times?
Thank you.
If the numbers are all integers, you can use str.extract() to pull them out, then take the mean, max, etc.:
import pandas as pd

df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
                   'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
                   'Column': {0: '44 db', 1: '55 db', 2: '43 db', 3: 'no_available'}})

# Rows with no digits become NaN, which mean()/max() skip automatically.
df['Numbers'] = df['Column'].str.extract(r'(\d+)').astype(float)
print(df['Numbers'].mean(), df['Numbers'].max())
Out [1]:
47.333333333333336 55.0
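The question also asks for per-user statistics and the last value of the day; a minimal sketch building on the df above (assuming rows are already in chronological order per user):

# Per-user statistics; NaN rows such as 'no_available' are ignored by agg.
print(df.groupby('User')['Numbers'].agg(['mean', 'min', 'max']))

# Bonus: last non-missing value per user, relying on existing row order
# (sort by a real timestamp column first if the data is unordered).
print(df.dropna(subset=['Numbers']).groupby('User')['Numbers'].last())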
Example with -, ., or , in the number:
import pandas as pd
df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
                   'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
                   'Column': {0: '44 db', 1: '-45.32 db', 2: '4,452.03 db', 3: 'no_available'}})

# Strip thousands separators first, then capture an optional sign and decimal part.
df['Numbers'] = df['Column'].str.replace(',', '').str.extract(r'(-?\d+\.?\d*)').astype(float)
print(df['Numbers'])
0 44.00
1 -45.32
2 4452.03
3 NaN
Name: Numbers, dtype: float64
Here is another way to get the max and min for that column:
import pandas as pd

a = ["user1", "user1", "user1", "user1"]
a2 = ["time1", "time2", "time3", "time4"]
a3 = ['45 db', '55 db', '43 db', 'no_available']
a = pd.DataFrame(a, columns=["user"])
a2 = pd.DataFrame(a2, columns=["time"])
a3 = pd.DataFrame(a3, columns=["column"])
data = pd.concat([a, a2, a3], axis=1)

h = []
for i in data["column"]:
    try:
        # Take the numeric token before the unit, e.g. '45' from '45 db'.
        h.append(int(i.split()[0]))
    except ValueError:
        # Skip non-numeric entries such as 'no_available'.
        pass

print(max(h), min(h))
I have a vertical data frame that I am looking to make more horizontal by "duplicating" columns for each item in the groupby column.
I have the following data frame:
import pandas as pd

df = pd.DataFrame({'posteam': {0: 'ARI', 1: 'ARI', 2: 'ARI', 3: 'ARI', 4: 'ARI'},
                   'offense_grouping': {0: 'personnel_00',
                                        1: 'personnel_01',
                                        2: 'personnel_02',
                                        3: 'personnel_10',
                                        4: 'personnel_11'},
                   'snap_ct': {0: 1, 1: 6, 2: 4, 3: 396, 4: 1441},
                   'personnel_epa': {0: 0.1539720594882965,
                                     1: 0.7805194854736328,
                                     2: -0.2678736448287964,
                                     3: 0.1886662095785141,
                                     4: 0.005721719935536385}})
And in its current state, there are 5 duplicate values in the 'posteam' column and 5 different values in the 'offense_grouping' column. Ideally, I would like to group by 'posteam' (so the team only has one row) and by 'offense_grouping'. Each 'offense_grouping' value corresponds to 'snap_ct' and 'personnel_epa' values. I would like the end result of this group to look something like this:
posteam  personnel_00_snap_ct  personnel_00_personnel_epa  personnel_01_snap_ct  personnel_01_personnel_epa  personnel_02_snap_ct  personnel_02_personnel_epa
ARI      1                     .1539...                    6                     .7805...                    4                     -.2679
And so on. How can this be achieved?
Given the data you provided, the following gives the expected result, but there might be more complex cases in your data.
z = (
df
.set_index(['posteam', 'offense_grouping'])
.unstack('offense_grouping')
.swaplevel(axis=1)
.sort_index(axis=1, ascending=[True, False])
)
# or, alternatively (might be better if you have multiple values
# for some given indices/columns):
z = (
df
.pivot_table(index='posteam', columns='offense_grouping', values=['snap_ct', 'personnel_epa'])
.swaplevel(axis=1)
.sort_index(axis=1, ascending=[True, False])
)
>>> z
offense_grouping personnel_00 personnel_01 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 1 0.153972 6 0.780519
offense_grouping personnel_02 personnel_10 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 4 -0.267874 396 0.188666
offense_grouping personnel_11
snap_ct personnel_epa
posteam
ARI 1441 0.005722
Then you can join the two levels of columns:
res = z.set_axis([f'{b}_{a}' for a, b in z.columns], axis=1)
>>> res
snap_ct_personnel_00 personnel_epa_personnel_00 snap_ct_personnel_01 personnel_epa_personnel_01 snap_ct_personnel_02 personnel_epa_personnel_02 snap_ct_personnel_10 personnel_epa_personnel_10 snap_ct_personnel_11 personnel_epa_personnel_11
posteam
ARI 1 0.153972 6 0.780519 4 -0.267874 396 0.188666 1441 0.005722
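Note that this produces names like snap_ct_personnel_00; if you prefer the question's personnel_00_snap_ct ordering, swap the f-string to f'{a}_{b}'.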
I am working on a school project, so please no exact answers.
I have a pandas dataframe that has numerators and denominators rating images of dogs out of 10. When there are multiple dogs in the image, the rating is out of number of dogs * 10. I am trying to adjust it so that for example... if there are 5 dogs, and the rating is 40/50, then the new numerator/denominator is 8/10.
Here is an example of my code. I am aware that the syntax does not work in line 3, but I believe it accurately represents what I am trying to accomplish. twitter_archive is the dataframe.
twitter_archive['new_denom'] = 10
twitter_archive['new_numer'] = 0
for numer, denom in twitter_archive['rating_numerator', 'rating_denominator']:
    if (denom > 10) & (denom % 10 == 0):
        num_denom = denom / 10
        new_numer = numer / num_denom
        twitter_archive['new_numer'] = new_numer
So basically I am checking whether the denominator is above 10 and, if so, whether it is divisible by 10. If it is, I find how many times 10 goes into it and divide the numerator by that value to get the new numerator. I think my logic for that works fine; the issue I have is grabbing that row and then adding the new value to the new column I created, in that row.
edit: added df head
   tweet_id      timestamp                  text                                                rating_numerator  rating_denominator  name      doggo  floofer  pupper  puppo  avg_numerator  avg_denom  avg_numer
0  8.924206e+17  2017-08-01 16:23:56+00:00  This is Phineas. He's a mystical boy. Only eve...  13.0              10.0                phineas   None   None     None    None   0.0            10         0
1  8.921774e+17  2017-08-01 00:17:27+00:00  This is Tilly. She's just checking pup on you....  13.0              10.0                tilly     None   None     None    None   0.0            10         0
2  8.918152e+17  2017-07-31 00:18:03+00:00  This is Archie. He is a rare Norwegian Pouncin...  12.0              10.0                archie    None   None     None    None   0.0            10         0
3  8.916896e+17  2017-07-30 15:58:51+00:00  This is Darla. She commenced a snooze mid meal...  13.0              10.0                darla     None   None     None    None   0.0            10         0
4  8.913276e+17  2017-07-29 16:00:24+00:00  This is Franklin. He would like you to stop ca...  12.0              10.0                franklin  None   None     None    None   0.0            10         0
copy/paste head below:
{'tweet_id': {0: 8.924206435553362e+17,
1: 8.921774213063434e+17,
2: 8.918151813780849e+17,
3: 8.916895572798587e+17,
4: 8.913275589266883e+17},
'timestamp': {0: Timestamp('2017-08-01 16:23:56+0000', tz='UTC'),
1: Timestamp('2017-08-01 00:17:27+0000', tz='UTC'),
2: Timestamp('2017-07-31 00:18:03+0000', tz='UTC'),
3: Timestamp('2017-07-30 15:58:51+0000', tz='UTC'),
4: Timestamp('2017-07-29 16:00:24+0000', tz='UTC')},
'text': {0: "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 ",
1: "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 ",
2: 'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 ',
3: 'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us ',
4: 'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek '},
'rating_numerator': {0: 13.0, 1: 13.0, 2: 12.0, 3: 13.0, 4: 12.0},
'rating_denominator': {0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 10.0},
'name': {0: 'phineas', 1: 'tilly', 2: 'archie', 3: 'darla', 4: 'franklin'},
'doggo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'floofer': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'pupper': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'puppo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'}}
If you want to use a for loop to get row values, you can use the iterrows() function:
for idx, row in twitter_archive.iterrows():
    denom = row['rating_denominator']
    numer = row['rating_numerator']
    # You can collect the new values in a list and concat it with the df.
A faster way to iterate over the df is itertuples():
for row in twitter_archive.itertuples():
    # Attribute access is more robust than positional indexing, since the
    # first element of each tuple is the index.
    denom = row.rating_denominator
    numer = row.rating_numerator
But I think the best way to create a new column from old ones is to use the pandas apply() function:
df = pd.DataFrame(data={'a' : [1,2], 'b': [3,5]})
df['c'] = df.apply(lambda x: 'sum_is_odd' if (x['a'] + x['b']) % 2 == 1 else 'sum_is_even', axis=1)
In this case, 'c' is a new column and value is calculated using 'a' and 'b' columns.
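Applied to the rating problem, a vectorized boolean mask avoids row iteration entirely. A minimal sketch, assuming only the two rating columns from the question (the sample frame here is hypothetical):

import pandas as pd

# Hypothetical stand-in for twitter_archive.
twitter_archive = pd.DataFrame({'rating_numerator': [13.0, 40.0],
                                'rating_denominator': [10.0, 50.0]})

twitter_archive['new_denom'] = 10
twitter_archive['new_numer'] = twitter_archive['rating_numerator']

# Rescale rows whose denominator is a multiple of 10 greater than 10.
mask = (twitter_archive['rating_denominator'] > 10) & \
       (twitter_archive['rating_denominator'] % 10 == 0)
twitter_archive.loc[mask, 'new_numer'] = (
    twitter_archive['rating_numerator'] / (twitter_archive['rating_denominator'] / 10))

print(twitter_archive)  # 40/50 becomes 8/10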
I wrote a little script that loops through constraints to filter a dataframe. Example and follow up explaining the issue are below.
constraints = [['stand', '==', 'L'], ['zone', '<', '20']]
for x in constraints:
    vari = x[2]
    # Local variables are referenced with @ inside query().
    df = df.query("{0} {1} @vari".format(x[0], x[1]))
   zone stand  speed type
0     2     L   83.7   CH
1     7     L   95.9   SI
2    14     L   94.9   FS
3    11     L   93.3   FS
4    13     L   86.9   CH
5     7     L   96.4   SI
6    13     L   82.6   SL
I can't figure out a way to filter when there is an OR condition. For example, in the table above I'd like to return a dataframe using the constraints in the code example along with any rows that contain SI or CH in the type column. Does anyone have ideas on how to accomplish this? Any help would be greatly appreciated.
This seems to have gotten the job done but there is probably a much better way of going about it.
for x in constraints:
    vari = x[2]
    if isinstance(vari, list):
        # A list value means an OR condition: keep rows whose value is in the list.
        frame = frame[frame[x[0]].isin(vari)]
    else:
        frame = frame.query("{0} {1} @vari".format(x[0], x[1]))
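With that branch in place, an OR condition can be expressed by putting a list in the value slot of a constraint (the 'in' operator string below is just a hypothetical placeholder, since the list branch ignores it):

constraints = [['stand', '==', 'L'],
               ['zone', '<', 20],
               ['type', 'in', ['SI', 'CH']]]  # a list value triggers the isin() branch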
IIUC (see my question in the comment), you can do it like this.
I made a slightly different df to show the result (I guess the table you showed is already filtered):
import pandas as pd

df = pd.DataFrame(
    {'zone': {0: 2, 1: 11, 2: 25, 3: 11, 4: 23, 5: 7, 6: 13},
     'stand': {0: 'L', 1: 'L', 2: 'L', 3: 'C', 4: 'L', 5: 'K', 6: 'L'},
     'speed': {0: 83.7, 1: 95.9, 2: 94.9, 3: 93.3, 4: 86.9, 5: 96.4, 6: 82.6},
     'type': {0: 'CH', 1: 'SI', 2: 'FS', 3: 'FS', 4: 'CH', 5: 'SI', 6: 'SL'}})
print(df)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
2 25 L 94.9 FS
3 11 C 93.3 FS
4 23 L 86.9 CH
5 7 K 96.4 SI
6 13 L 82.6 SL
res = df.loc[((df['type'] == 'SI') | (df['type'] == 'CH'))
             & (df['zone'] < 20) & (df['stand'] == 'L')]
print(res)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
Let me know if that is what you are searching for.
I have a dataset like the one below: multiple groups, completed values, with over 200 columns (denoting days).
Input

Series      1     2    3    4    5    6    7   GROUP
01/08/2021  100%  75%  60%  50%  40%  30%  0%  A
08/08/2021  100%  95%  80%  60%  30%  10%  0%  A
15/08/2021  100%  85%  60%  40%  20%  10%  5%  A
01/08/2021  100%  70%  65%  55%  45%  35%  0%  B
08/08/2021  100%  90%  80%  60%  30%  10%  0%  B
15/08/2021  100%  95%  60%  40%  30%  20%  5%  B
Now I have an incomplete dataset like the one below. I would like to compute a similarity metric for each group and state which series is most similar.
For the similarity, I am currently using CORREL in Excel, and in case of a tie I take the latest week. Only values that are complete in both rows are compared (i.e. the missing values in the expected output are not used in the similarity calculation).
This is a VBA macro that I am porting to Python (either pandas or PySpark).
I am confused about how best to proceed. Any other similarity metric can be tried too. Thanks.
Expected Output

Series      1  2  3    4    5    6    7   Similarity_Score  Similarity_Week  Group
01/09/2021                 39%  28%  0%   0.99              01/08/2021       A
08/09/2021       62%  44%  21%  12%  7%   0.99              15/08/2021       A
15/09/2021                      8%   0%   1.00              08/08/2021       A
15/09/2021                 30%  19%  0%   1.00              15/08/2021       B
This solution involves iterating over each group, taking a subset of each dataframe, and taking the product of the two dataframes' rows, so that each row can be compared to every other row.
We can use some nested zip/filter/reversed trickery to keep only the columns that are filled in. Putting that in a list along with the dates from both dfs and the group, we can create a dataframe, sort, group, and keep the top score from each.
Joining this back to the second df should give you the output you want.
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score','Similarity_Week'], inplace=True)
l = []
for g, data in df.groupby('GROUP'):
    x = df2.loc[df2['Group'] == g]
    for c in product(data.values, x.values):
        a = c[0][1:-1]
        b = c[1][1:-1]
        # Pair the trailing (filled) values of each row, dropping NaNs in b.
        a, b = list(zip(*(zip(reversed(a), list(filter(lambda v: v == v, b))))))
        a = [int(x.replace('%', '')) / 100 for x in a]
        b = list(reversed([int(x.replace('%', '')) / 100 for x in b]))
        l.append([g, c[0][0], c[1][0], np.corrcoef(a, b)[1, 0]])

out = df2.merge(
    pd.DataFrame(l, columns=['Group', 'Similarity_Week', 'Series', 'Similarity_Score'])
      .sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False)
      .groupby(['Group', 'Series'])
      .head(1),
    on=['Group', 'Series'])
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
The scores for group A on 15/09/2021 are extremely close, such that if you were to round them you would get a different (more recent) date. You can validate this by checking:
[x for x in l if x[2]=='15/09/2021' and x[0]=='A']
Yields
[['A', '01/08/2021', '15/09/2021', 1.0],
['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]
So in theory 15/08/2021 would be the date if you rounded to a few decimal places, which you could do by wrapping round() around the np.corrcoef call.
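For instance, rounding inside the loop (4 places is an arbitrary choice):

l.append([g, c[0][0], c[1][0], round(np.corrcoef(a, b)[1, 0], 4)])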
If you prefer a solution without for loops, you could merge the two data frames on Group and use groupby to apply the similarity metric.
Building on the data frames constructed by @Chris:
df.rename(columns={"GROUP": "Group"}, inplace=True)

def similarity(arr1, arr2):
    """Similarity between two arrays of percent strings, nans ignored"""
    df = pd.DataFrame({"arr1": arr1, "arr2": arr2}).dropna() \
           .apply(lambda s: s.str.strip("%").astype(float) / 100)
    return df.arr1.corr(df.arr2)

# Convert data columns to an array in each row.
df_xformed = df.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
    .reset_index().rename(columns={"Series": "df_Series", 0: "df"})
df2_xformed = df2.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
    .reset_index().rename(columns={"Series": "df2_Series", 0: "df2"})

# Merge on Group and calculate similarities.
df_combined = df_xformed.merge(df2_xformed, on="Group")
df_combined["similarity"] = df_combined.apply(
    lambda row: similarity(row["df"], row["df2"]), axis=1)

# Find the max similarity of each df2_Series within its Group.
df_combined["df2_sim_max"] = df_combined.groupby(["df2_Series", "Group"])["similarity"] \
    .transform(max)
idx = df_combined["similarity"] == df_combined["df2_sim_max"]
result = df_combined[idx][["df2_Series", "Group", "df2", "df_Series", "similarity"]]
result
result
# df2_Series Group ... df_Series similarity
# 0 01/09/2021 A ... 01/08/2021 0.999405
# 2 15/09/2021 A ... 01/08/2021 1.000000
# 7 08/09/2021 A ... 15/08/2021 0.999005
# 11 15/09/2021 B ... 15/08/2021 0.999286
I have this input data:
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
The output is supposed to be as follows:
drug_name,num_prescriber,total_cost
CHLORPROMAZINE,2,3000
BENZTROPINE MESYLATE,1,1500
AMBIEN,2,300
But instead I get the following output:
AMBIEN 2 300
CHLORPROMAZINE 0 0
BENZTROPINE MESYLATE 0 0
Any suggestions would be appreciated! My code is below:
fileHandle = """
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
"""
input_data = re.sub(r'(\n)', r',\1', fileHandle)
fields = input_data.split(',')
del fields[0]
NumberOfRows = round(len(fields)/5)
NumberOfCols = 5
length_of_fields=len(fields)
# Expected output: drug_name,number_of_prescribers,total_cost
# drug_name at indices 3 (title), 8, 13, 18, 23, 28
# drug_cost at indices 4 (title), 9, 14, 19, 24, 29
#outputfile = open('/output/top_cost_drug.txt','w')
# get list of drug names
i=8
j=0
drug_name_list=list()
drug_name_indices=list()
while i<=length_of_fields:
drug_name_list.append(fields[i])
drug_name_indices.append(i)
i=i+5;
j=j+1;
# find unique names (same as drug_name_list but without repetition)
unique_drug_list = list()
# traverse for all elements
for x in drug_name_list:
# check if exists in unique_name_list or not
if x not in unique_drug_list:
unique_drug_list.append(x)
i=0
j=0
number_of_unique_drugs=len(unique_drug_list)
unique_cost_list=[0]*number_of_unique_drugs
number_of_prescribers = [0]*number_of_unique_drugs
#while i<len(drug_name_list):
# while j<number_of_unique_drugs:
# if drug_name_list[i]==unique_drug_list[j]:
# drug_name_index=drug_name_indices[i]
# cost_of_drug=int(fields[drug_name_index+1])
# unique_cost_list[j]=int(unique_cost_list[j])+cost_of_drug
# number_of_prescribers[j]=number_of_prescribers[i]+1
# j=j+1
# i=i+1
while j<number_of_unique_drugs:
while i<len(drug_name_list):
if drug_name_list[i]==unique_drug_list[j]:
drug_name_index=drug_name_indices[i]
cost_of_drug=int(fields[drug_name_index+1])
unique_cost_list[j]=int(unique_cost_list[j])+cost_of_drug
number_of_prescribers[j]=number_of_prescribers[i]+1
i=i+1
j=j+1
# print output values
counter=0
print("drug_name,number_of_prescribers,total_cost \n")
while counter<number_of_unique_drugs:
print(unique_drug_list[counter], number_of_prescribers[counter], unique_cost_list[counter])
print("\n")
counter=counter+1
Also, I changed the print statements to outputfile.write, but I'm not getting any output file. Why is that?
outputfile = open('/output/top_cost_drug.txt', 'w')
# print output values
counter = 0
outputfile.write("drug_name,number_of_prescribers,total_cost \n")
while counter < number_of_unique_drugs:
    outputfile.write(unique_drug_list[counter], ',', number_of_prescribers[counter], ',', unique_cost_list[counter])
    print("\n")
    counter = counter + 1
To get your expected output, use Pandas groupby() aggregation methods:
df.groupby("drug_name").drug_cost.agg(["count", "sum"])
count sum
drug_name
AMBIEN 2 300
BENZTROPINE MESYLATE 1 1500
CHLORPROMAZINE 2 3000
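If you also want the exact header and row order from the expected output, rename the aggregates and sort (a sketch; note that count counts rows, which equals the number of prescribers only if each prescriber appears at most once per drug):

out = (df.groupby("drug_name").drug_cost.agg(["count", "sum"])
         .rename(columns={"count": "num_prescriber", "sum": "total_cost"})
         .sort_values("total_cost", ascending=False))
print(out)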
To write to file, use to_csv():
df.groupby("drug_name").drug_cost.agg(["count", "sum"]).to_csv("output.csv")
Data:
import pandas as pd
data = {'id': {0: 1000000001,
1: 1000000002,
2: 1000000003,
3: 1000000004,
4: 1000000005},
'prescriber_last_name': {0: 'Smith',
1: 'Garcia',
2: 'Johnson',
3: 'Rodriguez',
4: 'Smith'},
'prescriber_first_name': {0: 'James',
1: 'Maria',
2: 'James',
3: 'Maria',
4: 'David'},
'drug_name': {0: 'AMBIEN',
1: 'AMBIEN',
2: 'CHLORPROMAZINE',
3: 'CHLORPROMAZINE',
4: 'BENZTROPINE MESYLATE'},
'drug_cost': {0: 100, 1: 200, 2: 1000, 3: 2000, 4: 1500}}
df = pd.DataFrame(data)
df
id prescriber_last_name prescriber_first_name \
0 1000000001 Smith James
1 1000000002 Garcia Maria
2 1000000003 Johnson James
3 1000000004 Rodriguez Maria
4 1000000005 Smith David
drug_name drug_cost
0 AMBIEN 100
1 AMBIEN 200
2 CHLORPROMAZINE 1000
3 CHLORPROMAZINE 2000
4 BENZTROPINE MESYLATE 1500
I agree with @andrew_reece, but if you want the data to be in a string, do this:
df = """
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
"""
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
TESTDATA = StringIO(df)
df = pd.read_csv(TESTDATA)
print(df.groupby("drug_name").drug_cost.agg(["count", "sum"]))
Output:
count sum
drug_name
AMBIEN 2 300
BENZTROPINE MESYLATE 1 1500
CHLORPROMAZINE 2 3000
Another way, as @andrew_reece said in the comments, is to copy the data string and then do:
import pandas as pd

df = pd.read_clipboard()
df = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
df.to_csv(filename)  # filename: your output path