Find groups where a certain condition applies to every value in it - python

grouped = exp.groupby(["country", "year"])["value"].sum().reset_index().sort_values(["country", "year"])
grouped["prev_year"] = grouped.groupby("country")["value"].shift(1)
grouped["increase_vs_prev_year"] = (100 * grouped.value / grouped.prev_year - 100).round(1)
grouped
I want to find countries where increase_vs_prev_year was more than 0 in every year.

If you need countries where all values (one per year) are greater than 0, compare the values with Series.gt, then aggregate per country with GroupBy.all, and finally filter the index for the matching countries:
s = grouped["increase_vs_prev_year"].gt(0).groupby(grouped["country"]).all()
out = s.index[s].tolist()
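A minimal sketch with made-up numbers (the countries, years and increases below are assumptions, not the asker's data), showing the pattern end to end:

import pandas as pd

# Hypothetical frame: country B dips below zero in 2021, so only A qualifies
grouped = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "increase_vs_prev_year": [1.5, 2.0, 3.0, -0.5],
})

s = grouped["increase_vs_prev_year"].gt(0).groupby(grouped["country"]).all()
out = s.index[s].tolist()
print(out)  # ['A']

Note that in the real pipeline the first year of each country has a NaN increase, and NaN > 0 evaluates to False, so you may want to dropna first.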

Related

Get info about other columns in same row from pandas search

I have a .csv file that looks like the following:
Country Number
United 19
Ireland 17
Afghan 20
My goal is to use python-pandas to find the row with the smallest number, and get the country name of that row.
I know I can use this to get the value of the smallest number.
min = df['Number'].min()
How can I get the country name for the row with the smallest number?
I couldn't figure out how to use the variable "min" in an expression.
I would use a combination of finding the min and an iloc lookup:
min_number = df['Number'].min()
min_index = df.loc[df['Number'] == min_number].index.values[0]
df['Country'].iloc[min_index]
The only downside to this is if you have multiple countries with the same minimal number, but if that is the case you would have to provide more specs to determine the desired country.
If you expect the minimal value to be unique, use idxmin:
df.loc[df['Number'].idxmin(), 'Country']
Output: Ireland
If there are multiple minima, this yields the first one.
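If you instead want every country tied at the minimum, a boolean mask works; a small self-contained sketch using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'Country': ['United', 'Ireland', 'Afghan'],
                   'Number': [19, 17, 20]})

# All rows that share the minimal Number, not just the first one
print(df.loc[df['Number'] == df['Number'].min(), 'Country'].tolist())  # ['Ireland']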

Calculating the moving average for each unique value in a column

I have a csv file with a value that increases with time for n cities, like this:
city,date,value
saopaulo,2020-01-01,5
riodejaneiro,2020-01-01,3
curitiba,2020-01-01,7
...
saopaulo,2020-05-01,31
riodejaneiro,2020-05-01,55
curitiba,2020-05-01,41
What I want to do is to calculate the moving average of the column "value", but for each "city" separately.
I loaded the csv into a pandas dataframe, but if I calculate df["value"].rolling(3), it will calculate the moving average but for all the cities together.
What I want is to create a new column with the moving average but for each city. I was thinking about groupby, but I don't know exactly how to implement this.
You can groupby:
df.groupby('city')['value'].rolling(3).mean()
To assign:
df['roll'] = df.groupby('city')['value'].rolling(3).mean().droplevel(0)
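For example, with a frame in the spirit of the question's CSV (the dates and values below are made up to fill a window of three per city):

import pandas as pd

df = pd.DataFrame({
    "city": ["saopaulo", "riodejaneiro", "curitiba"] * 3,
    "date": pd.to_datetime(["2020-01-01"] * 3 + ["2020-02-01"] * 3 + ["2020-03-01"] * 3),
    "value": [5, 3, 7, 10, 20, 15, 31, 55, 41],
})

# droplevel(0) removes the 'city' level that groupby adds, so the result
# aligns back with df's original index when assigning
df["roll"] = df.groupby("city")["value"].rolling(3).mean().droplevel(0)
print(df)  # 'roll' is NaN until each city has three observations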
Here you go:
def rolling_mean(group: pd.DataFrame) -> pd.DataFrame:
    # Whatever operation you want to do with the cities goes here.
    # For each city, `group` is a dataframe of that city's rows.
    # I'm guessing you'd like to set the date as a sorted index
    # and calculate your moving average based on that; if not, modify this function.
    return group.set_index('date').sort_index()[['value']].rolling(3).mean()

df.groupby("city").apply(rolling_mean)  # Use .reset_index() if you don't need the MultiIndex.
Maybe do this (suppose your dataframe is named df):
from collections import defaultdict

data = defaultdict(list)
for (place, date, value) in df.values:
    data[place].append(value)
new_df = pd.DataFrame(dict(data))
Now you have a new dataframe with each city in a column, so you can apply your function to each column (in a for loop):
   saopaulo  riodejaneiro  curitiba
0         5             3         7
1        31            55        41
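The defaultdict loop is essentially a pivot; if each (date, city) pair occurs once, pandas can do the same reshape in one call (a sketch of the equivalent idiom, not part of the original answer):

# One column per city; rolling then works column-wise in a single call
wide = df.pivot(index="date", columns="city", values="value")
rolled = wide.rolling(3).mean()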

returning a specific part of column name in a dataframe on a given condition

I have a dataframe with the following column names (for reference I have mentioned only one row).
I would like to get the maximum size_cd among encoded_feature.size_cd_0, encoded_feature.size_cd_1, encoded_feature.size_cd_2 and encoded_feature.size_cd_3. In this case size_cd_0 has the maximum value, i.e. 24. I would like to return the number 0 from the column name as the return value.
Input dataframe:

type_z  encoded_feature.size_cd_0  encoded_feature.size_cd_1  encoded_feature.size_cd_2  encoded_feature.size_cd_3
0       24                         0                          0                          0
Required output:
0, i.e. the part following encoded_feature.size_cd_ (as this column has the max value)
I would appreciate your feedback on this.
First use filter to select the columns, then apply pd.Series.argmax:
df.filter(regex='encoded_feature.size_cd').apply(pd.Series.argmax, axis=1)
Another approach, using idxmax:
(
df
.filter(regex='encoded_feature.size_cd')
.idxmax(axis=1)
.str
.slice(-1)
)
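Note that str.slice(-1) keeps only the last character, so it would break once a suffix reaches two digits; a regex extraction is safer (a sketch assuming the suffix is always numeric):

(
    df
    .filter(regex=r'encoded_feature\.size_cd')
    .idxmax(axis=1)
    .str.extract(r'(\d+)$', expand=False)  # full numeric suffix, e.g. '0' or '12'
)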

In pandas, can I avoid a loop when assigning value based on specific row values?

I have three columns: ['date'], which holds the date; ['id'], which holds product ids; and ['rating'], which holds the product rating for each product on each date. I want to create a dummy variable ['threshold'] that equals 1 when, within the same value of ['id'], the rating went from anywhere above 5 to anywhere below 6.
My code would use a for loop as follows:
df['threshold'] = np.zeros(df.shape[0])
for i in range(1, df.shape[0]):  # start at 1 so i-1 refers to the actual previous row
    if df.iloc[i]['id'] == df.iloc[i-1]['id'] and df.iloc[i-1]['rating'] > 5 and df.iloc[i]['rating'] < 6:
        df.loc[df.index[i], 'threshold'] = 1  # .loc writes back; chained df.iloc[i][...] would not
Is there a way to perform this without using a for loop?
Use Series.shift to align each row with the previous one, compare the ids with Series.eq, compare the shifted ratings with Series.gt, and convert the resulting boolean mask to integers 0/1 with Series.astype (Series.view is deprecated in recent pandas):
df['threshold'] = (df['id'].eq(df['id'].shift()) &
                   df['rating'].shift().gt(5) &
                   df['rating'].lt(6)).astype('i1')
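A small self-contained check with made-up ids and ratings (the data is an assumption; only the technique comes from the answer):

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-01", "2021-01-02"]),
    "id": [1, 1, 1, 2, 2],
    "rating": [7, 4, 8, 3, 2],
})

df["threshold"] = (df["id"].eq(df["id"].shift()) &
                   df["rating"].shift().gt(5) &
                   df["rating"].lt(6)).astype("i1")
print(df)  # only row 1 (id 1, rating 7 -> 4) gets threshold == 1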

Creating an Availability Dataframe

I have a pandas dataFrame that contains the values of several parameters against timestamps that are 15 minutes apart. The parameters may contain NaN values (np.nan). My aim is to find the total number of available values per month for each parameter, i.e. the total number of values in that month that are not 0 or np.nan.
I tried turning all the valid values (values that are not zero or np.nan) into 1; and all the invalid values into 0. That way I can just sum all the values of a parameter in a month and I'd get the total number of available values for that month.
df = df.fillna(0)  # fillna returns a new frame, so assign it back
for col in selected_parameters:
    df.loc[df[col] > 0, col] = 1
This generates the df having 1 for valid and 0 for invalid values.
What I can't do is create a new dataFrame that'll have the timestamps a month apart (instead of 15 min apart) and against each month, I can have the total number of available values for that month.
Use a groupby with sum as your aggregator function:
df.groupby([df.index.year, df.index.month]).agg('sum')
This assumes that your timestamps are in the index (a DatetimeIndex exposes .year and .month directly; the .dt accessor is only for Series).
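Putting it together, the fillna/loc steps can be folded into a single boolean mask followed by a monthly resample (a sketch, assuming a DatetimeIndex and numeric parameter columns):

# True where a value is present and non-zero, then count Trues per calendar month
available = df.notna() & df.ne(0)
monthly_counts = available.resample('MS').sum()  # one row per month start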
