I have data about electric cars in the USA and I am trying to calculate the standard deviation for each state. I already calculated the mean this way:
df = pd.read_csv('https://gist.githubusercontent.com/AlbertKozera/6396b4333d1a9222193e11401069ed9a/raw/ab8733a2135bcf61999bbcac4f92e0de5fd56794/Pojazdy%2520elektryczne%2520w%2520USA.csv')
for col in df.columns:
    df[col] = df[col].astype(str)
df['range'] = pd.to_numeric(df['range'])
.
.
.
df_avg_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].mean()
And here is my output after that:
code range
0 AK 154.553600
1 AL 156.959936
2 AR 153.950400
3 AZ 152.756000
4 CA 152.359200
5 CO 159.084800
6 CT 155.212000
7 DE 156.322400
8 FL 153.728000
9 GA 154.748800
10 HI 154.503200
11 IA 155.746400
12 ID 157.851200
13 IL 155.200800
14 IN 153.338400
15 KS 154.240000
16 KY 154.162400
17 LA 156.728800
18 MA 134.643200
19 MD 137.080800
20 ME 142.263200
21 MI 132.828000
22 MN 135.828000
23 MO 138.376000
24 MS 132.704000
25 MT 132.552000
26 NC 133.800000
27 ND 136.096800
28 NE 137.150400
29 NH 131.498400
30 NJ 137.760800
31 NM 133.325600
32 NV 137.522400
33 NY 137.476000
34 OH 137.784800
35 OK 134.277600
36 OR 134.504000
37 PA 141.052000
38 RI 137.572000
39 SC 143.348000
40 SD 141.189600
41 TN 139.981600
42 TX 139.233600
43 UT 138.615200
44 VA 141.334400
45 VT 143.104000
46 WA 137.880800
47 WI 143.916800
48 WV 141.008000
49 WY 147.109600
Now I am trying to calculate the standard deviation in the same way:
df_dev_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].std()
And here is the error I get:
*** TypeError: loop of ufunc does not support argument 0 of type str which has no callable sqrt method
Can someone explain what I am doing wrong?
Try removing as_index=False from the groupby. When as_index is False, the standard deviation is applied to all columns, including the groupby column code; its string values have no sqrt method, which is what raises the TypeError.
In order to retain the index 0-49, try the syntax below:
df_dev_range = df.drop(columns=['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False).agg({'range': 'std'})
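Alternatively, drop as_index=False entirely and restore code as a column afterwards; a minimal sketch of the same computation:
# group without as_index=False, then bring 'code' back as a column
df_dev_range = df.groupby('code')['range'].std().reset_index()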
occupation gender number
administrator F 36
M 43
artist F 13
M 15
doctor M 7
educator F 26
M 69
How do I take the counts in the first two columns and find the average share of male (M) and female (F) in each occupation?
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
sep='|', index_col='user_id')
users.head()
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
# create a data frame and apply count to gender
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# create a DataFrame and apply count for each occupation
occup_count = users.groupby(['occupation']).agg('count')
# divide the gender_ocup per the occup_count and multiply per 100
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100
# present all rows from the 'gender column'
occup_gender.loc[: , 'gender']
Courtesy of:
https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercises_with_solutions.ipynb
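As an aside, a more compact variant (a sketch, assuming the same users frame) gets the percentages directly from value_counts:
# share of each gender within each occupation, as a percentage
occup_gender = users.groupby('occupation')['gender'].value_counts(normalize=True) * 100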
I have a DF that has the results of a NER classifier such as the following:
df =
s token pred tokenID
17 hakawati B-Loc 3
17 theatre L-Loc 3
17 jerusalem U-Loc 7
56 university B-Org 5
56 of I-Org 5
56 texas I-Org 5
56 here L-Org 6
...
5402 dwight B-Peop 1
5402 d. I-Peop 1
5402 eisenhower L-Peop 1
There are many other columns in this DataFrame that are not relevant. Now I want to group the tokens by their sentence ID (= s) and their predicted tags, combining them into a single entity:
df2 =
s token pred
17 hakawati theatre Location
17 jerusalem Location
56 university of texas here Organisation
...
5402 dwight d. eisenhower People
Normally I would do this by simply using a line like
data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join)
and then a rename function. However, since the data contains different kinds of tag strings (B, I, L - Loc/Org ..), I don't know exactly how to do it.
Any ideas are appreciated.
One solution, via a helper column:
df['pred_cat'] = df['pred'].str.split('-').str[-1]
res = df.groupby(['s', 'pred_cat'])['token'] \
        .apply(' '.join).reset_index()
print(res)
s pred_cat token
0 17 Loc hakawati theatre jerusalem
1 56 Org university of texas here
2 5402 Peop dwight d. eisenhower
Note this doesn't exactly match your desired output; there seems to be some data-specific treatment involved.
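If you also want the spelled-out names from your desired output, a small mapping on top of res should do; the dictionary below is an assumption based on your example:
# hypothetical mapping from tag suffixes to full category names
name_map = {'Loc': 'Location', 'Org': 'Organisation', 'Peop': 'People'}
res['pred_cat'] = res['pred_cat'].map(name_map)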
You could group by both s and tokenID and aggregate like so:
def aggregate(df):
    # join all tokens in the group into one string
    token = " ".join(df.token)
    # take the tag of the first row and strip the B/I/L prefix
    pred = df.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)
# Output
token pred
s tokenID
17 3 hakawati theatre Loc
7 jerusalem Loc
56 5 university of texas Org
6 here Org
5402 1 dwight d. eisenhower Peop
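If you prefer plain columns, as in your desired df2, you can flatten the result afterwards (a sketch):
# turn the (s, tokenID) index into columns, then drop the helper tokenID
df2 = df.groupby(["s", "tokenID"]).apply(aggregate).reset_index().drop(columns="tokenID")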
I'm using this txt file named Gradedata.txt and it looks like this:
Sarah K.,10,9,7,9,10,20,19,19,45,92
John M.,9,9,8,9,8,20,20,18,43,95
David R.,8,7,7,9,6,18,17,17,40,83
Joan A.,9,10,10,10,10,20,19,20,47,99
Nick J.,9,7,10,10,10,20,20,19,46,98
Vicki T.,7,7,8,9,9,17,18,19,44,88
I'm looking for the averages of each column. Each column has its own title (Homework #1, Homework #2, etc., in that order). What I am trying to do should look exactly like this:
Homework #1 8.67
Homework #2 8.17
Homework #3 8.33
Homework #4 9.33
Homework #5 8.83
Quiz #1 19.17
Quiz #2 18.83
Quiz #3 18.67
Midterm #1 44.17
Final #1 92.50
Here is my attempt at accomplishing this task:
with open("GradeData.txt", "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
for line in f:
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
sums[i] += int(values[i])
numRows += 1
for index, summedRowValue in enumerate(sums):
print columns[index], 1.0 * summedRowValue / numRows
I'm getting errors, and I also realize I have to name each assignment's average. I'd appreciate some help here.
numpy can chew this up in one line:
>>> np.loadtxt('Gradedata.txt', delimiter=',', usecols=range(1,11)).mean(axis=0)
array([ 8.66666667, 8.16666667, 8.33333333, 9.33333333,
8.83333333, 19.16666667, 18.83333333, 18.66666667,
44.16666667, 92.5 ])
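To pair those averages with the assignment titles, a short follow-up could look like this (the titles list is taken from the question's desired output):
import numpy as np

titles = ['Homework #1', 'Homework #2', 'Homework #3', 'Homework #4',
          'Homework #5', 'Quiz #1', 'Quiz #2', 'Quiz #3', 'Midterm #1', 'Final #1']
means = np.loadtxt('Gradedata.txt', delimiter=',', usecols=range(1, 11)).mean(axis=0)
for title, avg in zip(titles, means):
    print(title, round(avg, 2))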
Just transpose and use statistics.mean to get the average, skipping the first col:
import csv
from itertools import islice
from statistics import mean
with open("in.txt") as f:
for col in islice(zip(*csv.reader(f)), 1, None):
print(mean(map(float,col)))
Which will give you:
8.666666666666666
8.166666666666666
8.333333333333334
9.333333333333334
8.833333333333334
19.166666666666668
18.833333333333332
18.666666666666668
44.166666666666664
92.5
If the columns are actually named and you want to pair them:
import csv
from itertools import islice
from statistics import mean
with open("in.txt") as f:
# get column names
cols = next(f).split(",")
for col in islice(zip(*csv.reader(f)),1 ,None):
# keys are column names, values are averages
data = dict(zip(cols[1:],mean(map(float,col))))
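Then printing in the requested "title average" format is just:
for name, avg in data.items():
    print(name, round(avg, 2))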
Or using pandas.read_csv:
import pandas as pd
df = pd.read_csv("in.txt",index_col=0,header=None)
print(df)
print(df.mean(axis=0))
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
1 8.666667
2 8.166667
3 8.333333
4 9.333333
5 8.833333
6 19.166667
7 18.833333
8 18.666667
9 44.166667
10 92.500000
dtype: float64
Table
Roll Class Country Rights CountryAcc
1 x IND 23 US
1 x1 IND 32 Ind
2 s US 12 US
3 q IRL 33 CA
4 a PAK 12 PAK
4 e PAK 12 IND
5 f US 21 CA
5 g US 31 PAK
6 h US 21 BAN
I want to display only those Rolls whose CountryAcc is never US or CA. For example: if Roll 1 has one row with CountryAcc US, then I don't want its other row with CountryAcc Ind either, and the same goes for Roll 5, since it has one row with CountryAcc CA. So my final output would be:
Roll Class Country Rights CountryAcc
4 a PAK 12 PAK
4 e PAK 12 IND
6 h US 21 BAN
I tried getting that output in the following way:
Home_Country = ['US', 'CA']
# First I saved the two home countries in a variable
Account_Other_Count = df.loc[~df.CountryAcc.isin(Home_Country)]
Account_Other_Count_Var = df.loc[~df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
# Then I made two variables, one with CountryAcc in US or CA and the other with the remaining rows, and took their Rolls
Account_Home_Count = df.loc[df.CountryAcc.isin(Home_Country)]
Account_Home_Count_Var = df.loc[df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
# Here I got the common Rolls
Common_ROLL = list(set(Account_Home_Count_Var).intersection(list(Account_Other_Count_Var)))
Final_Output = Account_Other_Count.loc[~Account_Other_Count.Roll.isin(Common_ROLL)]
Is there a better, more pandas-like or Pythonic way to do it?
One solution could be
In [37]: df.loc[~df['Roll'].isin(df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll'])]
Out[37]:
Roll Class Country Rights CountryAcc
4 4 a PAK 12 PAK
5 4 e PAK 12 IND
8 6 h US 21 BAN
This is one way to do it:
sortdata = df[~df['CountryAcc'].isin(['US', 'CA'])].sort_index()
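Note that this filter alone keeps every non-US/CA row, so Roll 1's Ind row would still survive. If a whole Roll should be dropped once any of its rows is US or CA, as the question asks, a groupby-filter sketch could be:
# keep only Rolls that have no US/CA account at all
out = df.groupby('Roll').filter(lambda g: not g['CountryAcc'].isin(['US', 'CA']).any())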