This is a continuation of the method used in this question.
Say we have a dataframe
Make Model Year HP Cylinders Transmission MPG-H MPG-C Price
0 BMW 1 Series M 2011 335.0 6.0 MANUAL 26 19 46135
1 BMW 1 Series 2011 300.0 6.0 MANUAL 28 19 40650
2 BMW 1 Series 2011 300.0 6.0 MANUAL 28 20 36350
3 BMW 1 Series 2011 230.0 6.0 MANUAL 28 18 29450
4 BMW 1 Series 2011 230.0 6.0 MANUAL 28 18 34500
...
Using the interquartile range (IQR) (i.e., the middle 50%), I created two variables, upper and lower. The specific calculation isn't important for this discussion, but to give an example of upper:
Year 2029.50
HP 498.00
Cylinders 9.00
MPG-H 42.00
MPG-C 31.00
Price 75291.25
As expected, it only calculates values for the numeric columns.
When I want to filter out values that lie outside of the IQR,
correct_df = df[~((df < lower) |(df > upper)).any(axis=1)]
it gives me the right answer. However, when I invert the logic to use & instead of |, I get an empty dataframe. Here is the code:
another_df = df[((df >= lower) & (df <= upper)).all(axis=1)]
Which gives an empty result:
Make Model Year HP Cylinders Transmission Drive Mode MPG-H MPG-C Price
----------------------------------------------------------------------------------------------
It can be fixed, however, by restricting the comparison to the numeric columns, i.e. by converting the index of upper/lower into a list ('lst'):
another_df = df[((df[lst] >= lower) & (df[lst] <= upper)).all(axis=1)]
It seems like & and | behave differently for non-numerical columns? Why does that happen?
& and | behave just as you'd expect; they're not the problem. The problem is that you're using all in the code that doesn't work, while in the code that does work you're using any.
In the first example you say "select all rows where any column of the row is less than lower OR greater than upper" (and then negate that with ~).
In the second example you say "select all rows where ALL columns of the row are greater than or equal to lower AND less than or equal to upper".
Change all to any and you should be fine.
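As an aside, the lst of numeric columns used in the question's fix doesn't have to be typed out by hand. A minimal sketch, assuming upper and lower were produced by quantile calls on df, so that their index already holds the numeric column names:
lst = upper.index.tolist()
# or, equivalently, straight from the dataframe:
# lst = df.select_dtypes(include='number').columns.tolist()
another_df = df[((df[lst] >= lower) & (df[lst] <= upper)).all(axis=1)]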
I had other issues that I resolved, but this problem has set me back a little.
I have the following columns (there are 50,000 rows of data in my actual file):
Area Date SpeedOver Risk Accident
Wendly 8/8/2010 15 L No
Wendly 2/9/2010 35 L Yes
Reet 1/5/2010 65 M Yes
Reet 9/11/2010 10 M Yes
Sarall 14/3/2010 18 M No
Sarall 7/6/2010 23 H No
Sarall 23/6/2014 25 H Yes
I am trying to print the top 3 locations based on accidents in the year of 2010. So the output should be:
Reet
Wendly
Sarall
top_loc_accident = df[(df.index.year==2010)]['Accident'].nlargest(n=3)
print(top_loc_accident)
But the above code prints the dates and the accident values, not the actual location names, so I have it only partly right and it's a bit confusing at the moment.
You first need to aggregate the number of accidents:
# select rows of 2010
# the original method can be used here
m1 = df['Date'].str.endswith('2010')
# m1 = df.index.year==2010
# identify rows with accidents
m2 = df['Accident'].eq('Yes')
# count the accidents of 2010
# keep the top 3
m2[m1].groupby(df['Area']).sum().nlargest(3)
Output:
Area
Reet 2
Wendly 1
Sarall 0
Name: Accident, dtype: int64
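If only the area names are wanted, as in the expected output, you can print the index of that result; a small sketch reusing the variables above:
top3 = m2[m1].groupby(df['Area']).sum().nlargest(3)
for area in top3.index:
    print(area)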
I have the following multi-index data frame, where ID and Year are part of the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and do the same for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) |
                 (df['ROA'] > df['ROA'].quantile(0.99))]
for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the purpose of the function: as written, it only excludes one outlier year at a time from the calculation of the mean, not all of them.
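For what it's worth, a vectorized sketch of the intended behaviour (computing each company's mean over its non-outlier years only, all at once) could look like the code below. It runs against a hand-reconstructed copy of the sample frame and is only a sketch, not a drop-in fix:
import numpy as np
import pandas as pd

# hand-reconstructed sample, with ID and Year as the MultiIndex
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Year': [2016, 2017, 2018, 2016, 2017, 2018],
                   'ROA': [1.5, 0.8, np.nan, 0.7, 0.8, 0.4]}).set_index(['ID', 'Year'])

lo, hi = df['ROA'].quantile(0.01), df['ROA'].quantile(0.99)
outlier = (df['ROA'] < lo) | (df['ROA'] > hi)

# per-company mean with every outlier year masked out of the calculation
clean_mean = df['ROA'].mask(outlier).groupby(level='ID').transform('mean')

# replace only the outlier rows
df['ROA'] = df['ROA'].where(~outlier, clean_mean)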
I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work; it filters out any country whose values are all NaN in either inflation or GDP:
(
df.groupby(['country'])
.filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
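As a quick sanity check, a sketch applying this to a hand-built copy of the sample data keeps only the USA rows, including the 2001 row where just one value is missing:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2002, 2003] * 3,
    'country': ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP': [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

kept = df.groupby(['country']).filter(
    lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all()
)
print(kept['country'].unique())  # ['USA']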
Note: if you have more than two columns, you can use a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
If you want this to consider only a specific range of years instead of all of them, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# a sum of 0 means there are no values in that column for a given country
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# extract only the countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
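For reference, a variant that counts non-null values instead of summing them avoids treating a column that legitimately sums to 0 as missing. This is not part of the answer above, just a sketch of an alternative:
counts = df.groupby('country')[['inflation', 'GDP']].count()  # non-null counts per country
final_countries = counts[(counts > 0).all(axis=1)].index
df = df[df['country'].isin(final_countries)]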
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls after it's reshaped:
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)  # delete rows where any value is null
To convert back to long, you can use pd.melt.
So, my dataframe is
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0 automatic SUV NaN True 2018-06-23 19
1 25500 NaN ford f-150 good 6.0 gas 88705.0 automatic pickup white True 2018-10-19 50
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red False 2019-02-07 79
3 1500 2003.0 ford f-150 fair 8.0 gas NaN automatic pickup NaN False 2019-03-22 9
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black False 2019-04-02 28
As you can see, row 1's model is the same as row 3's, but row 1's model year is missing. It would naturally follow that I can replace row 1's model year with row 3's so there isn't a NaN there. I'm aware I can change it manually, but the dataframe is over 50,000 rows long and there are many more values just like that. Is there an automated way to go about replacing these values?
Edit: After looking over the df just now, I've realized that I can't really replace the model year like that, as it can change even within the same model. I would still love to know how it's done, if possible, for future reference.
You can merge the dataframe with itself and fillna it:
df_want = df.merge(df[['model_year', 'model']].dropna().drop_duplicates(), on='model', how='left')
df_want['model_year'] = df_want['model_year_x'].fillna(df_want['model_year_y'])
df_want = df_want.drop(['model_year_x', 'model_year_y'], axis=1)
Yes, you can replace each NaN model year with a non-NaN entry for the same model, like this:
models = df['model'].unique()
for m in models:
    # assumes every model has at least one non-NaN model_year
    year = df.loc[(df['model_year'].notna()) & (df['model'] == m), 'model_year'].values[0]
    df.loc[(df['model_year'].isna()) & (df['model'] == m), 'model_year'] = year
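A shorter variant of the same idea, not taken from either answer above but just a sketch, fills the gaps per model with a groupby transform:
# 'first' picks the first non-NaN model_year within each model group
df['model_year'] = df['model_year'].fillna(
    df.groupby('model')['model_year'].transform('first')
)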
Hi, I need to create a column with values 1 or 0 based on certain conditions. My dataframe is enormous, so a general for loop or even apply is extremely slow. I want to use Pandas or, even more preferably, NumPy vectorization. Below is a sample of the data and my code that does not work:
election_year D_president
1992 0
1992 0
1996 0
1996 0
2000 0
2004 0
2008 0
2012 0
test_df['D_president'] = 0
election_year = test_df['election_year']
test_df['D_president'] = test_df.loc[((election_year == 1992) |
(election_year == 1996) |
(election_year == 2008)|
(election_year == 2012)), 'D_president'] = 1
So basically I need to get a value of 1 in the 'D_president' column for those particular years. However, when I execute this code I get all 1s, even for 2000 and 2004, and I can't understand what's wrong.
Also, how could I turn this into NumPy vectorization with .values?
It looks like you have two "=" assignments in the same statement. Try removing the leftmost one, test_df['D_president'] =. Also, the test can be replaced with election_year.isin([1992, 1996, 2008, 2012]).
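A sketch of both variants, assuming test_df as shown in the question:
import numpy as np

# pandas version: start from 0, then flag the listed election years
test_df['D_president'] = 0
test_df.loc[test_df['election_year'].isin([1992, 1996, 2008, 2012]), 'D_president'] = 1

# NumPy version using .values, as asked
years = test_df['election_year'].values
test_df['D_president'] = np.isin(years, [1992, 1996, 2008, 2012]).astype(int)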