How to modify the Pandas DataFrame and insert new columns - python

I have some data; the relevant information is provided below.
The output of df.info() is:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value 6662 non-null float64
country 6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None
The list of columns:
[u'value' 'country']
The df is below:
value country
0 550.00 USA
1 118.65 CHINA
2 120.82 CHINA
3 86.82 CHINA
4 112.14 CHINA
5 113.59 CHINA
6 114.31 CHINA
7 111.42 CHINA
8 117.21 CHINA
9 111.42 CHINA
--------------------
--------------------
6655 500.00 USA
6656 500.00 USA
6657 390.00 USA
6658 450.00 USA
6659 420.00 USA
6660 420.00 USA
6661 450.00 USA
I need to add another column, namely outlier, and put 1
if the data point is an outlier for its respective country;
otherwise, I need to put 0. I emphasize that the outliers need to be computed per country and NOT for all countries taken together.
I found some formulas for calculating outliers which may be of help, for example:
# keep only the ones that are within +3 to -3 standard deviations
def exclude_the_outliers(df):
    df = df[np.abs(df.col - df.col.mean()) <= (3 * df.col.std())]
    return df
def exclude_the_outliers_extra(df):
    LOWER_LIMIT = .35
    HIGHER_LIMIT = .70
    filt_df = df.loc[:, df.columns == 'value']
    # Then, computing percentiles.
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])
    # Next, filtering values based on the computed percentiles. To do that I use
    # an apply by columns and that's it!
    filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[LOWER_LIMIT, x.name]) &
                                        (x < quant_df.loc[HIGHER_LIMIT, x.name])], axis=0)
    filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1)
    filt_df.dropna(inplace=True)
    return filt_df
I was not able to use those formulas properly for this purpose, but they are provided as a suggestion.
Finally, I will need to compute the percentage of outliers for
USA and CHINA present in the data.
How to achieve that?
Note: adding the outlier column with all zeros is easy in
pandas and should look like this:
df['outlier'] = 0
However, the issue remains of finding the outliers and overwriting the
zeros with 1 for the respective country.

You can slice the dataframe by each country, calculate the quantiles for the slice, and set the value of outlier at the index of the country.
There might be a way to do it without iteration, but it is beyond me.
# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False
# set the quantile limits
low_q = 0.35
high_q = 0.7
# iterate over each country
for c in df.country.unique():
    # subset the dataframe where the country == c, get the quantiles
    q = df.value[df.country == c].quantile([low_q, high_q])
    # at the row index where the country column equals `c` and the column is `outlier`,
    # set the value to True or False based on whether the `value` column is within
    # the quantiles
    df.loc[df.index[df.country == c], 'outlier'] = (df.value[df.country == c]
                                                    .apply(lambda x: x < q[low_q] or x > q[high_q]))
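For what it's worth, the same flags can also be computed without an explicit loop by letting groupby/transform compute each country's quantiles; a minimal sketch under the same 0.35/0.70 thresholds and column names as above:
# per-row quantile bounds, computed within each row's own country group
low = df.groupby('country')['value'].transform(lambda s: s.quantile(low_q))
high = df.groupby('country')['value'].transform(lambda s: s.quantile(high_q))
# flag values outside the bounds; .astype(int) would give the 1/0 form from the question
df['outlier'] = (df['value'] < low) | (df['value'] > high)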
Edit: To get the percentage of outliers per country, you can groupby the country column and aggregate using the mean.
gb = df[['country','outlier']].groupby('country').mean()
for row in gb.itertuples():
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1]))
# output:
# Percentage of outliers for China : 54.0%
# Percentage of outliers for USA : 56.0%
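Since outlier is a 0/1 (or True/False) flag, the per-country percentage is just the group mean, so the printing loop can also be skipped; a one-line sketch on the same frame:
# mean of a boolean/0-1 column per group is the fraction of outliers
print(df.groupby('country')['outlier'].mean().mul(100).round(1))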

Related

Linear interpolation

I am working with a dataset containing columns for GDP and GDP per capita for a number of countries. These columns contain missing values. Due to the nature of the data I was hoping to play around with linear interpolation in order to fill in the missing values without losing the general shape of the data.
My code looks as follows:
grouped_df = df.groupby("Country")
# Iterate over the groups
for country, group in grouped_df:
    # Select the rows that contain missing values
    missing_values = group[group["GDP percapita"].isnull()]
    if not missing_values.empty:
        # Interpolate to fill the missing values
        filled_values = missing_values["GDP percapita"].interpolate(method="linear")
        # Update the original dataframe
        df.update(filled_values)
When I run this, however, the missing values are still present in my dataset, and I can't find the issue with my code.
Replace
filled_values = missing_values["GDP percapita"].interpolate(method="linear")
with
filled_values = group["GDP percapita"].interpolate(method="linear")
Here is a little working sample:
d = {"country": ["Brazil", "Brazil", "India", "India"],
     "cities": ["Brasilia", "Rio", "New Dehli", "Bombay"],
     "population": [200.4, None, 100.10, None]}
p = pd.DataFrame(d)
print(p)  # should give you:
country cities population
0 Brazil Brasilia 200.4
1 Brazil Rio NaN
2 India New Dehli 100.1
3 India Bombay NaN
# to interpolate entire df
p.interpolate(method='linear')
country cities population
0 Brazil Brasilia 200.40
1 Brazil Rio 150.25
2 India New Dehli 100.10
3 India Bombay 100.10
# to interpolate group-wise
grouped_df = p.groupby("country")
# Iterate over the groups
for country, group in grouped_df:
    missing_values = group[group["population"].isnull()]
    if not missing_values.empty:
        filled_values = group["population"].interpolate(method="linear")
        p.update(filled_values)
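As a side note, the per-group interpolation can also be written without an explicit loop; a minimal sketch using groupby/transform on the same sample frame p:
# interpolate the population column within each country group, without a loop
p["population"] = p.groupby("country")["population"].transform(
    lambda s: s.interpolate(method="linear")
)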

How to iterate over columns and check condition by group

I have data for many countries over a period of time (2001-2003). It looks something like this:
index  year  country  inflation  GDP
1      2001  AFG      nan        48
2      2002  AFG      nan        49
3      2003  AFG      nan        50
4      2001  CHI      3.0        nan
5      2002  CHI      5.0        nan
6      2003  CHI      7.0        nan
7      2001  USA      nan        220
8      2002  USA      4.0        250
9      2003  USA      2.5        280
I want to drop countries for which there is no data at all (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work; it filters out countries where all values are NaN in either of (inflation, GDP):
(
    df.groupby(['country'])
      .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
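Applied to the sample above, either version keeps only the USA rows, since AFG has no inflation values and CHI has no GDP values; a quick sketch to reproduce that (column names as in the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year':      [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003],
    'country':   ['AFG'] * 3 + ['CHI'] * 3 + ['USA'] * 3,
    'inflation': [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    'GDP':       [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

# drop any country that has an all-NaN column
kept = df.groupby('country').filter(lambda x: not x.isnull().all().any())
print(kept['country'].unique())  # ['USA']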
If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
You can also try this:
# check where the sum is equal to 0 - this means there are no values in that column for a specific country
group_by = df.groupby(['country']).agg({'inflation': sum, 'GDP': sum}).reset_index()
# extract only countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), :]['country'])
# keep only the rows containing those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use pivot functions. See this question too.
Here's code for dropping nulls, after its reshaped:
df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null
To convert back to long, you can use pd.melt.

How can I use value_counts() to get a count of values after using a boolean on a data frame?

I'm trying to analyze a data set in colab, and it looks a bit like this:
import pandas as pd
df = pd.read_csv('gdrive/My Drive/python_for_data_analysts/Agora Data.csv')
df.info()
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Vendor 109689 non-null object
1 Category 109689 non-null object
2 Item 109687 non-null object
3 Item Description 109660 non-null object
4 Price 109684 non-null object
5 Origin 99807 non-null object
6 Destination 60528 non-null object
7 Rating 109674 non-null object
8 Remarks 12616 non-null object
There are columns for category and origin, and what I'm trying to do is get a value count of the categories for rows with an origin of, say, China or USA only. Something that looks like:
df[' Origin'].value_counts().head(30)
USA 33729
UK 10336
Australia 8767
Germany 7876
Netherlands 7707
Canada 5126
EU 4356
China 4185
I've filtered out everything other than rows with an origin of China, but when I try to get a value count of the different categories within China, it doesn't output a proper list like the one above.
china_transactions = (df[' Origin'] == 'China') & (df[' Category']).value_counts()
china_transactions.head(50)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
Create a Boolean Series where 'Origin' == 'China' and subset the DataFrame to only those rows. Then take the value_counts of the Category column. You can use DataFrame.loc to combine row and column selections at once.
df.loc[df[' Origin'].eq('China'), ' Category'].value_counts()
#      -------------------------  ----------- ---------------
#      only these rows            take this    apply this
#                                 column       method
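Equivalently, a plain boolean-mask version (a small sketch assuming the same leading-space column names used in the question):
# filter the rows first, then count the categories within China
china_category_counts = df[df[' Origin'] == 'China'][' Category'].value_counts()
print(china_category_counts.head(30))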

How to calculate the percentage of the sum value of the column?

I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column per country:
df1 = df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate each country's percentage of the total of the "Sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by the grouped sums, broadcast back to the length of the original DataFrame by using transform:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country').transform(pd.DataFrame.sum)['Sold']
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
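The same per-row fraction can also be written with the string form of transform, which is a bit more idiomatic; a minimal sketch assuming the Country/Sold columns above:
# each row's Sold divided by the total Sold of its own country
df['pct_per'] = df['Sold'] / df.groupby('Country')['Sold'].transform('sum')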
Simple Solution
You were almost there.
First, you need to group by country.
Then create the new percentage column (by dividing the grouped sales by the sum of all sales):
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2

Pandas - Access a value in Column B based on a value in Column A

So I have a small set of data, TFR.csv taking the form:
Year State1 State2 State3
1993 3 4 5
1994 6 2 1.4
...
I am supposed to determine when State 2's value is at its lowest (1994), and extract whatever value State 3 has in that year (1.4).
To do this, I've written a short filter:
State1Min = min(TFR['State1']) #Determine the minimum value for State1
filt = (TFR['State1']==State1Min) #Filter to select the row where the value exists
TFR[filt]['State3'] #Apply this filter to the original table, and return the value in the State3 column.
It returns the right value I'm looking for, but also the row number at the front:
2 1.4
Name: NT, dtype: float64
I need to print this value of 1.4, so I'm trying to find a way to extract it out of this output.
Thanks for helping out.
Use pandas.DataFrame.set_index and idxmin:
df = df.set_index('Year')
df.loc[df['State2'].idxmin(), 'State3']
Output:
1.4
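If you would rather keep the boolean-filter style from the question, the scalar can be pulled out of the one-row Series with .iloc[0] (or .item() if exactly one row matches); a small sketch assuming the column names shown above:
# find the row where State2 is lowest, then read off State3 for that row
State2Min = TFR['State2'].min()
value = TFR.loc[TFR['State2'] == State2Min, 'State3'].iloc[0]
print(value)  # 1.4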
