Pandas dataframe, max of selection - python

I need to write a function that takes a df with data and returns (as a string) the country whose GDP is the maximum among countries whose area (sq km) is less than 200 OR whose population is less than 1000.
How do I write this code correctly?
def find_country(df):
    df.loc[((df.Area < 200).Max(df.GDP)) | (df.Population < 1000)]

First of all, you should make your first column your index. This can be done with the following command:
df.set_index('Country', inplace=True)
Assuming you want to replace your dataframe with the reworked version.
To find your desired country, you simply look for the row which has the maximum GDP and return its index. The subscript on the index is needed to get the actual value of the index label.
def find_Country(df):
    return df[df['GDP'] == max(df['GDP'])].index[0]
I hope this will help,
Fabian
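For reference, a sketch that also applies the filter from the original question; the column names Area, Population and GDP are taken from the question, and Country is assumed to already be the index as set above:
def find_country(df):
    # keep only countries with area under 200 sq km or population under 1000
    small = df[(df['Area'] < 200) | (df['Population'] < 1000)]
    # idxmax returns the index label (the country name) of the row with the max GDP
    return small['GDP'].idxmax()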

Related

Python (If, Then) with DataFrames

I have a data frame with months (by year) and ID numbers. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts when a row's month equals a certain month in pandas.
ID.   Month
1     Sept. 2022
2     Oct. 2022
...and so on, with possible duplicates in ID and 1.75 years' worth of data.
import pandas as pd

path = "some path on my computer"
data = pd.read_excel(path)

if data["Month"] == "Sept. 2022":
    ID_SEPT = data["ID."].unique()
    return ID_SEPT
I am trying to figure out what I am doing incorrectly in this if-then statement. Ideally I am trying to collect all the unique ID values for each month of each year, to then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an if-then statement and was expecting unique value counts of ID per month.
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.items():
    if columnName == 'Month':
        [code]
The way you do this with a dataframe, conceptually, is to filter the entire dataframe to be just the rows where your comparison is true, and then do whatever (get uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique())
The first line there can look a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022' will return an array/column/series (it actually returns a series) of True/False values, depending on whether or not the comparison is, well, true or false.
You then run that series of bools through df[series_of_bools], which filters the dataframe to return only the rows where it is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique() and save yourself the step later of getting the length of the list.
You are looking for pandas.groupby.
Use it like this to get the unique values of each Group (Month)
data.groupby("Month")["ID."].unique() # You have a . after ID in your example, check if thats correct
try this
data[data.Month=='Sept. 2022']['ID.'].unique()
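To get from unique IDs per month to an attrition rate, here is a rough sketch building on the groupby approach above; the attrition formula is only an assumption, since the question does not define one:
# unique ID count for every month
counts = data.groupby('Month')['ID.'].nunique()
# Month is a string like 'Sept. 2022', so reorder the counts chronologically first if needed
# assumed definition: attrition = relative drop in unique IDs versus the previous month
attrition_rate = -counts.pct_change()
print(attrition_rate)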

Convert Dataframe into Series to return directly the value in that row and column

I'm trying to convert a DF I have so that a piece of code like this returns true (cat_totals being a df):
assert_equal(cat_totals["Interdisciplinary"],12296)
assert_equal(cat_totals["Engineering"],537583)
For this, I'm supposed to filter the main dataframe (df) into a subset containing only the Subject and the Total number of students. Then I group by the subject (Major_category) and sum. The numbers are correct in my dataframe, but if I try to run the above code, it throws an error. How can I convert the dataframe so that the above assert_equal calls return True?
cat_totals = df[['Major_category', 'Total']]
cat_totals = cat_totals.groupby(['Major_category']).sum()
cat_totals['Total'] = cat_totals['Total'].astype(int)
display(cat_totals.head(10))
With this the DF looks intuitively correct, but cat_totals['Interdisciplinary'] does not equal the value I'm looking for. In the table, the number that corresponds to the Major is correct, so the calculation is correct, but the format of the return value does not seem right.
Any help would be much appreciated! I'm quite new to working with Pandas, so it's a bit of a struggle.
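A likely explanation is that after groupby(...).sum(), cat_totals is still a DataFrame, so cat_totals['Interdisciplinary'] is treated as a column lookup rather than a row lookup. One sketch of a fix is to select the Total column during the groupby, so the result is a Series keyed by Major_category and label indexing returns a scalar:
cat_totals = df.groupby('Major_category')['Total'].sum().astype(int)
# cat_totals is now a Series, so this returns a single number instead of a column-lookup error
print(cat_totals['Interdisciplinary'])
Equivalently, keeping the original code and finishing with cat_totals = cat_totals['Total'] squeezes the one-column frame down to a Series.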

How to loop through a dataframe column name with conditions in Python Pandas to get min value?

I would like to learn how to loop through column names with conditions in pandas.
For example, I have a list T = [400, 500, 600]. I have a dataframe with columns named G_ads_400, G_ads_500, ...
I would like to get the min value of a G_ads_ column if T's value matches G_ads_..., using a for-loop and an if-statement (I am only familiar with these 2; open to other suggestions).
For example: take the min value of G_ads_400 when T = 400.
Here is my code:
T = [400,500,600,700]
for t in T:
    if t in df.columns[df.columns.str.contains('t')]:
        min_value = df.columns.min()
        print(min_value)
I tried a few other ways but it didn't work. It either returned an error or only the name of the columns.
Thank you!
I think you can do it easily like this. This will return the min values for all columns matching the T values.
T = [400,500,600,700]
columns = [f"G_ads_{i}" for i in T]
res = df[columns].min()
If you need the min values as a dataframe:
res = df[columns].min().to_frame().T
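If sticking with the for-loop and if-statement style mentioned in the question, here is a sketch along those lines (assuming the columns really are named G_ads_400, G_ads_500, and so on):
T = [400, 500, 600, 700]
for t in T:
    col = f"G_ads_{t}"
    if col in df.columns:           # skip values of T with no matching column
        min_value = df[col].min()   # min of the column's values, not of the column names
        print(col, min_value)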

Given an index label, how would you extract the index position in a dataframe?

New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as an Index type, but I need a string value for the submission.
The csv has countries as the row indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0, skiprows=1)
# for loop to get the most gold medals:
mostMedals = iterator
getIndex = (df[df['medals' == mostMedals]).index  # check the column medals
# for the mostMedals cell to see what country won that many
ind = dataframe.index.get_loc[getIndex]  # doesn't like the key
What I'm going for is to get the index position from getIndex so I can run something like dataframe.index[getIndex], which will give me the string I need, but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways; pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame the same shape as df, with the rows that don't match the condition turned into NaN. dropna() then removes those NaN rows, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
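For what it's worth, if only one country is needed rather than every country tied for the max, idxmax returns the index label of the (first) maximum directly, and get_loc can turn that label into an integer position if one is ever needed:
best_country = df['medals'].idxmax()   # index label, i.e. the country name string
pos = df.index.get_loc(best_country)   # integer position of that label, if needed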

Highest number in each row in a DF and return column of numbers

Hi, I have a list of stock prices and calculated 5 moving averages.
I want to find the max number in each ROW. The code is returning the max number for the entire array.
Here is the code:
# For stock in df:
# Create 10, 30, 50, 100 and 200D MAvgs
MA10D = stock.rolling(10).mean()
MA30D = stock.rolling(30).mean()
MA50D = stock.rolling(50).mean()
MA100D = stock.rolling(100).mean()
MA200D = stock.rolling(200).mean()
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D],axis=0).max()
I want to create a new column with the max number (either the 10D, 30D, 50D, 100D or 200D MA), so I should get a value on each row.
Right now all I get is the max number of the entire array. I tried axis=1 and that did not work either.
Seems like a simple question but I cannot get it written properly. Please let me know if you can help. Thanks!
The axis=0 in your code refers to the concatenation. You need to make that axis=1 so each moving average becomes a separate column, then use axis=1 in your call to max as well. It should look like this:
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D], axis=1).max(1)
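For what it's worth, here is a small end-to-end sketch of the same idea that also attaches the row-wise max as a new column; the price series is made up just to make it runnable:
import pandas as pd

# made-up price series standing in for "stock" from the question
stock = pd.Series(range(1, 301), dtype="float64")

MA10D = stock.rolling(10).mean()
MA30D = stock.rolling(30).mean()
MA50D = stock.rolling(50).mean()
MA100D = stock.rolling(100).mean()
MA200D = stock.rolling(200).mean()

# axis=1 makes each moving average its own column; max(axis=1) then picks
# the largest value across those columns for every row
result = stock.to_frame(name="price")
result["max_MA"] = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D], axis=1).max(axis=1)
print(result.tail())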
