Hi, I need to create a column with values 1 or 0 based on certain conditions. My dataframe is enormous, so a plain for loop or even apply is extremely slow. I want to use Pandas or, even better, NumPy vectorization. Below is a sample of the data and my code that does not work:
election_year D_president
1992 0
1992 0
1996 0
1996 0
2000 0
2004 0
2008 0
2012 0
test_df['D_president'] = 0
election_year = test_df['election_year']
test_df['D_president'] = test_df.loc[((election_year == 1992) |
(election_year == 1996) |
(election_year == 2008)|
(election_year == 2012)), 'D_president'] = 1
So basically I need to get a value of 1 in the 'D_president' column for these particular years. However, when I execute this code I get all 1s, even for 2000 and 2004. I can't understand what's wrong.
Also how could I transform this into a Numpy vectorization with .values?
It looks like you have two "=" assignments on the same line. Python evaluates a chained assignment by storing the rightmost value (here 1) into every target from left to right, so the whole column gets overwritten with 1. Try removing the leftmost one, test_df['D_president'] =. Also, for the test, you can replace the chain of comparisons with election_year.isin([1992, 1996, 2008, 2012]).
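Putting that together, a minimal sketch on the sample data (the NumPy variant via np.isin/np.where is one way to do the .values version you asked about):

import numpy as np
import pandas as pd

test_df = pd.DataFrame({'election_year': [1992, 1992, 1996, 1996,
                                          2000, 2004, 2008, 2012]})

# Pandas: a single boolean-mask assignment
test_df['D_president'] = test_df['election_year'].isin([1992, 1996, 2008, 2012]).astype(int)

# NumPy: operate on the raw array obtained with .values (or .to_numpy())
years = test_df['election_year'].values
test_df['D_president'] = np.where(np.isin(years, [1992, 1996, 2008, 2012]), 1, 0)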
I am trying to randomly select records from a 17-million-row dataframe using np.random.choice, as it runs faster compared to other methods, but I am getting incorrect values in the output against each record... example below:
data = {
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'yr'], as_index=False).agg(np.random.choice)
Output:
id    yr  calories  duration  mth
 1  2003       420        50    6
 2  2009       390        45    9
 2  2012       200       450    3
 3  2003       500       210    6
The problem in the output is for id 3: for calories 500, duration and mth should be 600 and 12 instead of 210 and 6... can anyone please explain why it is choosing values from a different row?
Expected output:
Same row value should be retained after random selection
This doesn't work because Pandas applies aggregates to each column independently. Try putting a print statement in, e.g.:
def fn(x):
    print(x)  # x is a single column's values for one group, not a whole row
    return np.random.choice(x)

df.groupby(['id', 'yr'], as_index=False).agg(fn)
would let you see when the function was called and what it was called with.
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"calories":[420,380,390,500,200,100],
"duration":[50,40,45,600,450,210],
"id":[1,1,2,3,2,3],
"yr":[2003,2003,2009,2003,2012,2003],
"mth":[3,6,9,12,3,6],
})
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
calories duration id yr mth
0 1 380 40 1 2003 6
1 2 390 45 2 2009 9
2 4 200 450 2 2012 3
3 5 100 210 3 2003 6
The two numbers at the beginning are there because you end up with a multi-index. If you want to know where the rows were selected from, this contains useful information; otherwise you can discard the index, as shown below.
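For example (a small sketch of the discard option), resetting the index gives a plain RangeIndex:

df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1)).reset_index(drop=True)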
Note that there are warnings in the docs that this might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!
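One extra detail, assuming you also care about reproducibility: GroupBy.sample takes a random_state argument, so the draw can be fixed:

df.groupby(['id', 'yr']).sample(n=1, random_state=42)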
I have the following multi-index data frame, where ID and Year form the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and likewise for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) |
                 (df['ROA'] > df['ROA'].quantile(0.99))]

for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the function's purpose: as it stands, it only excludes one year from the calculation of the mean, not several.
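A vectorized alternative I've been considering (just a sketch, untested on the real data; it assumes the percentiles are taken over the whole frame and relies on mean skipping NaN, so all outlier years of a company are excluded at once):

lo, hi = df['ROA'].quantile(0.01), df['ROA'].quantile(0.99)
outlier = (df['ROA'] < lo) | (df['ROA'] > hi)

# Mask outliers to NaN, then take each company's mean over the remaining years
clean_mean = df['ROA'].mask(outlier).groupby(level='ID').transform('mean')
df['ROA'] = df['ROA'].mask(outlier, clean_mean)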
This is a continuation of the method used in this question.
Say we have a dataframe
Make Model Year HP Cylinders Transmission MPG-H MPG-C Price
0 BMW 1 Series M 2011 335.0 6.0 MANUAL 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL 28 18 34500
...
Using the interquartile range (IQR) (i.e. the middle 50%), I created two variables, upper and lower. The specific calculation isn't important in this discussion, but to give an example of upper:
Year 2029.50
HP 498.00
Cylinders 9.00
MPG-H 42.00
MPG-C 31.00
Price 75291.25
As expected, it only calculates values for columns that have int64 values.
When I want to filter out values that lie outside of the IQR,
correct_df = df[~((df < lower) |(df > upper)).any(axis=1)]
it gives me the right answer. However, when I invert the logic to use & instead of |, I get an empty dataframe. Here is the code:
another_df = df[((df >= lower) & (df <= upper)).all(axis=1)]
which gives an empty result:

Make Model Year HP Cylinders Transmission Drive Mode MPG-H MPG-C Price
----------------------------------------------------------------------------------------------

It can, however, be fixed by converting the index of upper/lower into a list ('lst'):

another_df = df[((df[lst] >= lower) & (df[lst] <= upper)).all(axis=1)]
It seems like & and | behave differently for non-numerical columns? Why does that happen?
& and | behave just as you'd expect; they're not the problem. The problem is that you're using all in the code that doesn't work, but in the code that does work, you're using any.
In the first example you say "select all rows where any column of the row is less than lower OR greater than upper".
In the second example you say "select all rows where ALL columns of the row are greater than or equal to lower AND less than or equal to upper".
Change all to any and you should be fine.
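As a side note on the lst workaround in the question: one way to build that list (a sketch, assuming lower/upper were computed with something like df.quantile, which only covers the numeric columns) is select_dtypes:

lst = df.select_dtypes(include='number').columns.tolist()
another_df = df[((df[lst] >= lower) & (df[lst] <= upper)).all(axis=1)]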
Good afternoon.
I have this question I am trying to solve using "pandas" data structures and related syntax from the Python scripting language. I have already graduated from a US university and am employed, and I am currently taking the Coursera.org course "Python for Data Science" just for professional development, which is offered online on Coursera's platform by the University of Michigan. I'm not sharing answers with anyone either, as I abide by Coursera's Honor Code.
First, I was given this pandas dataframe concerning Olympic medals won by countries around the world:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
Second, the question asked is, "Which country has won the most gold medals in summer games?"
Third, a hint given to me as to how to answer using pandas syntax is this:
"This function should return a single string value."
Fourth, I tried entering this as the answer in pandas syntax:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
def answer_one():
    if df.columns[:2] == '00':
        df.rename(columns={col: 'Country' + col[4:]}, inplace=True)
    df_max = df[df[max('Gold')]]
    return df_max['Country']

answer_one()
Fifth, I have tried various other answers like this in Coursera's auto-grader, but it keeps giving this error message:
There was a problem evaluating function answer_one; it threw an exception and was thus counted as incorrect.
0.125 points were not awarded.
Could you please help me solve that question? Any hints/suggestions/comments are welcome for that.
Thanks, Kevin
You can use pandas' loc function to find the country name corresponding to the maximum of the "Gold" column:
import pandas as pd

data = [('Afghanistan', 13),
        ('Algeria', 12),
        ('Argentina', 23)]
df = pd.DataFrame(data, columns=['Country', 'Gold'])
df['Country'].loc[df['Gold'] == df['Gold'].max()]
The last line returns Argentina as the answer.
Edit 1:
I just noticed you import the .csv file using pd.read_csv('olympics.csv', index_col=0, skiprows=1). If you leave out the skiprows argument, you will get a dataframe where the first line in the .csv file corresponds to the column names in the dataframe. This makes handling your dataframe much easier in pandas and is encouraged. Second, I see that with the index_col=0 argument you use the country names as the index of the dataframe. In this case you should use the index rather than the loc function, as follows:
df.index[df['Gold'] == df['Gold'].max()][0]
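To mirror that with a toy example (my own made-up frame; the real course file will differ), with the countries as the index:

import pandas as pd

df = pd.DataFrame({'Gold': [13, 12, 23]},
                  index=['Afghanistan', 'Algeria', 'Argentina'])
df.index[df['Gold'] == df['Gold'].max()][0]  # -> 'Argentina'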
import pandas as pd

def answer_one():
    df1 = pd.Series.max(df['Gold'])   # maximum of the 'Gold' column
    df1 = df[df['Gold'] == df1]       # row(s) achieving that maximum
    return df1.index[0]               # index label, i.e. the country name

answer_one()
The function idxmax() returns the index label of the maximum element, which here is the country name:
return df['Gold'].idxmax()
(In older pandas versions this was also available as argmax(); in current versions argmax() returns the integer position rather than the label.)
I have to use survey data from IPUMS to get the average number of people who are unemployed in two successive periods. I wrote a function that takes an index and a dataframe as input,
def u1(x, df):
    if df.loc[x]['LABFORCE'] == 2 and df.loc[x]['CPSIDP'] == df.loc[x + 1]['CPSIDP']:
        if df.loc[x]['EMPSTAT'] == 21 or df.loc[x]['EMPSTAT'] == 22:
            return True
        else:
            return False
where x is the index and df is the dataframe. CPSIDP identifies the survey respondent, LABFORCE checks that the respondent is in the labor force, and EMPSTAT is what I need to check the respondent's employment status.
And then I planned to use apply as
result = df.apply(u1, axis=1)
It is not clear what arguments I should pass in my function (and please let me know if this approach is just philosophically wrong). Passing a number or a variable for the index gives me a 'bool' object is not callable error.
The smallest dataframe subset that generates the error (the leftmost column is the observation number; it is the x I need to pass to u1):
YEAR MONTH CPSIDP EMPSTAT LABFORCE
15285896 2018 7 20180707096701 10 2
15285926 2018 7 20180707098301 10 2
15285927 2018 7 20180707098302 10 2
15285928 2018 7 20180707098303 0 0
15285929 2018 7 20180707098304 0 0
15285930 2018 7 20180707098305 10 2
15286095 2018 7 20180707108203 21 2
IIUC it would be more efficient to create a boolean Series using the logic from your function.
Here & is the AND operator, and shift(-1) lines each row up against the next one (your x+1 comparison).
result = (df['LABFORCE'].eq(2) &                     # respondent is in the labor force
          df['CPSIDP'].eq(df['CPSIDP'].shift(-1)) &  # same respondent appears in the next row
          df['EMPSTAT'].isin([21, 22]))              # unemployed status codes
result
15285896 False
15285926 False
15285927 False
15285928 False
15285929 False
15285930 False
15286095 False
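If the end goal is the number (or share) of such respondents, the boolean Series reduces directly (a small follow-up sketch; how you average across the two periods depends on your exact definition):

result.sum()   # count of rows matching all three conditions
result.mean()  # share of rows matching all three conditions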