Get info about other columns in same row from pandas search - python

I have a .csv file that looks like the following:
Country Number
United 19
Ireland 17
Afghan 20
My goal is to use python-pandas to find the row with the smallest number, and get the country name of that row.
I know I can use this to get the value of the smallest number.
min = df['Number'].min()
How can I get the country name at the smallest number?
I couldn't figure out how to put in the variable "min" in an expression.

I would use a combination of finding the min and a iloc
df = pd.DataFrame(data)
min_number = df['Column_2'].min()
iloc_number = df.loc[df['Column_2'] == min_number].index.values[0]
df['Column_1'].iloc[iloc_number]
The only downside to this is if you have multiple countries with the same minimal number, but if that is the case you would have to provide more specs to determine the desired country.

If you expect the minimal value to be unique, use idxmin:
df.loc[df['Number'].idxmin(), 'Country']
Output: Ireland
If you have multiple min, this will yield the first one.

Related

How to sum over column in pandas dropping value of a given category if values of another are present?

I downloaded food production data from FAOSTAT.
For a given year, production data for a certain foodstuff may be provided as an official value, an estimate or it may be of another category. However, the production values are all given in one column like this:
Area Y2017 Y2017flags
0 France 10 official
1 USA 11 estimate
2 Germany 12 official
3 Germany 10 estimate
For some areas multiple production values are available, e.g. an estimate, an official value, and an unofficial value.
I'd now like to sum over all values in the column Y2017 but in a conditional way: If an official figure is available for a country, take that value, if not take the estimate, if not take the unofficial value, etc.
Is there a way to do this without splitting the dataframe?
You can group the DataFrame by the Area column and apply the first() method to the Y2017 column with the parameter key, which allows you to specify a function to order the groups before selecting the first value.
df.sort_values("Y2017flags", inplace=True)
df = df.groupby("Area").first()
total_sum = df['Y2017'].sum()
This will group the dataframe by Area and sort the groups by the Y2017flags in ascending order. Then, it selects the first value of each group and returns the result in a new dataframe (in case of both official and estimated, it will give the offical.

Find groups where a certain condition applies to every value in it

grouped=exp.groupby(["country","year"])["value"].sum().reset_index().sort_values(["country","year"])
grouped["prev_year"]=grouped.groupby("country")["value"].apply(lambda x:x.shift(1))
grouped["increase_vs_prev_year"]=((100*grouped.value-grouped.prev_year)/grouped.prev_year-100).round(1)
grouped
Here is the output for the code:
I want to find countries where increase_vs_prev_year was more than 0 in every year
If need countries with greater like 0 per all values (year)s comapre values for greater by Series.gt and then aggregate GroupBy.all, last filter indices for countries:
s = grouped["increase_vs_prev_year"].gt(0).groupby(grouped["country"]).all()
out = s.index[s].tolist()

Given an index label, how would you extract the index position in a dataframe?

New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as a type Index but need a string value for the submission.
csv has rows of countries as the indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
for loop to get the most gold medals:
mostMedals= iterator
getIndex = (df[df['medals' == mostMedals]).index #check the column medals
#for mostMedals cell to see what country won that many
ind = dataframe.index.get_loc[getIndex] #doesn't like the key
What I'm going for is to get the index position of the getIndex so I can run something like dataframe.index[getIndex] and that will give me the string I need but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression, the where method returns a frame based on df that matches the condition expressed. dropna() tells us to remove any rows that are NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary but I prefer working with simple built-in types unless I have a greater need.

Drop duplicates keeping the row with the highest value in another column

a = [['John', 'Mary', 'John'], [10,22,50]]
df1 = pd.DataFrame(a, columns=['Name', 'Count'])
Given a data frame like this I want to compare all similar string values of "Name" against the "Count" value to determine the highest. I'm not sure how to do this in a dataframe in Python.
Ex: In the case above the Answer would be:
Name Count
Mary 22
John 50
The lower value John 10 has been dropped (I only want to see the highest value of "Count" based on the same value for "Name").
In SQL it would be something like a Select Case query (wherein I select the Case where Name == Name & Count > Count recursively to determine the highest number. Or a For loop for each name, but as I understand loops in DataFrames is a bad idea due to the nature of the object.
Is there a way to do this with a DF in Python? I could create a new data frame with each variable (one with Only John and then get the highest value (df.value()[:1] or similar. But as I have many hundreds of unique entries that seems like a terrible solution. :D
Either sort_values and drop_duplicates,
df1.sort_values('Count').drop_duplicates('Name', keep='last')
Name Count
1 Mary 22
2 John 50
Or, like miradulo said, groupby and max.
df1.groupby('Name')['Count'].max().reset_index()
Name Count
0 John 50
1 Mary 22

Pandas dataframe, max of selection

I need to write a function, which takes df with data and returns string with country, which GDP is maximum among countries with area(sq km) is less than 200 OR which population is less than 1000.
How to write this code correctly?
def find_country(df):
df.loc[((df.Area < 200).Max(df.GDP))|(df.Population < 1000)]
First of all you should make your first column to be your Index. This could be done using the following command:
df.set_index('Country', inlace = True)
Assuming you want to replace your dataframe with the reworked version.
To find your desired country you simply look for the date which has the maximum GDP, for instance, and return its index. The subscript of the index is needed to get the actual value of the index.
def find_Country(df):
return df[df['GDP'] == max(df['GDP'])].index[0]
I hope this will help,
Fabian

Categories