Why does dataframe.iloc[1,5] return an error but dataframe.iloc[[1,5]] does not?
dataframe.iloc[1,5] returns an error because it asks for a single scalar value at the intersection of row position 1 and column position 5. If the DataFrame has fewer than six columns, that column position is out of bounds and pandas raises an IndexError.
dataframe.iloc[[1,5]], on the other hand, does not fail because the list [1,5] is interpreted as row positions: it selects rows 1 and 5 and returns a new DataFrame containing just those rows.
The .iloc indexer accesses locations in the DataFrame by integer position. When you pass two comma-separated values, the first is taken as the row position and the second as the column position; when you pass a single list of integers, all of them are treated as row positions.
So, in short, dataframe.iloc[1,5] tries to read one value at a specific row/column location that does not exist in your DataFrame, while dataframe.iloc[[1,5]] selects the rows at positions 1 and 5, which do exist.
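A minimal sketch of both behaviours; the 6-row, 2-column DataFrame below is invented for illustration, assuming the original frame has enough rows but fewer than six columns:
import pandas as pd

df = pd.DataFrame({'a': range(6), 'b': range(6, 12)})   # 6 rows, only 2 columns

print(df.iloc[[1, 5]])    # a list -> row positions 1 and 5, returned as a DataFrame

try:
    df.iloc[1, 5]         # row 1, column 5 -> column position 5 does not exist here
except IndexError as exc:
    print('IndexError:', exc)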
I have been struggling with how to find the indices of all rows with null values in a particular column of a pandas dataframe in Python. If A is one of the entries in df.columns, then I need to find the indices of each row with null values in A.
Supposing you need the indices as a list, one option would be:
df[df['A'].isnull()].index.tolist()
np.where(df['column_name'].isnull())[0]
np.where(Series_object) returns the indices of True occurrences in the column. So, you will be getting the indices where isnull() returned True.
The [0] is needed because np.where returns a tuple and you need to access the first element of the tuple to get the array of indices.
Similarly, if you want to get the indices of all non-null values in the column, you can run
np.where(df['column_name'].isnull() == False)[0]
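Putting both approaches together, a small self-contained sketch (the column name 'A' and the sample values are invented). Note that the boolean-mask approach returns index labels while np.where returns integer positions; with a default RangeIndex the two coincide:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, None, 3.0, None, 5.0]})

print(df[df['A'].isnull()].index.tolist())        # index labels of the null rows: [1, 3]
print(np.where(df['A'].isnull())[0])              # positional indices of the null rows: [1 3]
print(np.where(df['A'].isnull() == False)[0])     # positions of the non-null rows: [0 2 4]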
I am using geopandas with Python. I would like to select the row with the maximum value of the column "pop".
I am trying the solution given here Find maximum value of a column and return the corresponding row values using Pandas :
city_join.loc[city_join['pop'].idxmax()]
However, it does not work. It says "TypeError: reduction operation 'argmax' not allowed for this dtype". I think the reason is that that solution works for pandas and not for geopandas. Am I right? How can I select the row with the maximum value of column "pop" in a geopandas dataframe?
Check your column's type with print(df['columnName'].dtype) and make sure it is numeric (i.e. integer, float, ...). If it returns object, then use df['columnName'].astype(float) instead.
Try with city_join.loc[city_join['pop'].astype(float).idxmax()] if the pop column is of object dtype.
Or
You can convert the column to numeric first
city_join['pop'] = pd.to_numeric(city_join['pop'])
and then run your original code: city_join.loc[city_join['pop'].idxmax()]
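A hedged sketch of that second suggestion on a plain pandas DataFrame with invented values; the same pattern should apply to a GeoDataFrame, since its attribute columns are ordinary pandas Series:
import pandas as pd

city_join = pd.DataFrame({'city': ['A', 'B', 'C'],
                          'pop': ['100', '2500', '730']})   # 'pop' stored as strings

print(city_join['pop'].dtype)                    # object, which is what triggers the argmax error

city_join['pop'] = pd.to_numeric(city_join['pop'])
print(city_join.loc[city_join['pop'].idxmax()])  # the row for city 'B'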
I have a large pandas dataframe with ~500,000 rows and 6 columns, and each row is uniquely identified by the 'Names' column. The other 5 columns contain characteristic information about the corresponding 'Names' entry. I also have a separate list of ~40,000 individual names, all of which appear in the larger dataframe. I want to use this smaller list to extract all of the corresponding information from the larger dataframe, and am using:
info = df[df['Names'].isin(ListNames)]
where df is the large dataframe and ListNames is the list of names I want to get the information for. However, when I run this, only one row is extracted from the overall dataframe as opposed to ~40000. I have also tried using ListNames as a 'Series' datatype instead of 'List' datatype but this returned the same thing as before. Would be super grateful for any advice - thanks!
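For reference, a minimal sketch of how the isin-based filter is expected to behave, with invented names and data. If a call like this returns far fewer rows than expected, a common cause is that the values in 'Names' and in the list do not match exactly, e.g. because of stray whitespace or differing types:
import pandas as pd

df = pd.DataFrame({'Names': ['alice', 'bob', 'carol', 'dave'],
                   'Score': [1, 2, 3, 4]})
ListNames = ['alice', 'carol']

info = df[df['Names'].isin(ListNames)]
print(info)                                # the rows for 'alice' and 'carol'

# Hypothetical fix if the mismatch is caused by surrounding whitespace:
# df['Names'] = df['Names'].str.strip()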
df_collection = dict(tuple(data.groupby('LOCATION').head(2)))
I want to group the df by geographic information and create a separate df for each country in a dict, so that I can assign different names. I also want only the first two years of each country to go into the new DataFrame, which is why I used head(2), but I receive this error message:
dictionary update sequence element #0 has length 8; 2 is required
df_collection[('AUS')].head(2)
this works but what is the difference?
The problem is where head(2) sits relative to tuple().
Iterating a GroupBy object yields (key, sub-DataFrame) pairs, so dict(tuple(data.groupby('LOCATION'))) builds exactly the mapping you want: one DataFrame per country. Appending .head(2) collapses the groups back into a single DataFrame, and iterating a DataFrame yields its column labels instead. dict() then tries to unpack each column label as a (key, value) pair, and the first label (presumably 'LOCATION', which is 8 characters long) is why it complains "element #0 has length 8; 2 is required".
So build the dict from the GroupBy object and take the first two rows per group afterwards:
df_collection = dict(tuple(data.groupby('LOCATION')))
df_collection['AUS'].head(2)
That is presumably also why your df_collection[('AUS')].head(2) call works: the dict was built from the GroupBy, and head(2) is applied to the per-country DataFrame after the lookup.
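A self-contained sketch of the difference, with an invented two-column frame standing in for the real data:
import pandas as pd

data = pd.DataFrame({'LOCATION': ['AUS', 'AUS', 'AUS', 'DEU', 'DEU', 'DEU'],
                     'TIME': [2000, 2001, 2002, 2000, 2001, 2002]})

# Iterating a GroupBy yields (key, sub-DataFrame) pairs, so dict() accepts it.
df_collection = dict(tuple(data.groupby('LOCATION')))
print(df_collection['AUS'].head(2))        # first two years for 'AUS'

# Iterating a DataFrame yields its column labels, so this raises
# "dictionary update sequence element #0 has length 8; 2 is required".
try:
    dict(tuple(data.groupby('LOCATION').head(2)))
except ValueError as exc:
    print(exc)

# If the dict itself should only hold two rows per country:
df_collection = {key: grp.head(2) for key, grp in data.groupby('LOCATION')}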
I am creating a new column to append to a dataframe, and have written a function for the new column:
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex
My question is about the line of code that executes the function:
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis = 1)
Why is titanic_df[['Age','Sex']] a nested list? I guess I'm wondering why titanic_df['Age','Sex'] wouldn't work? Also, why is argument axis = 1 given?
Thank you
Welcome to python! :)
First you should study up on list methods, specifically the subscript notation that lets you refer to indexes and slices of a given list.
https://docs.python.org/2/tutorial/introduction.html#lists
Now, you're using pandas, right? titanic_df[['Age','Sex']] is not a nested list, it's a subscript call on the object titanic_df, which isn't a list but a pandas.DataFrame. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
A pandas DataFrame has methods that allow you to refer only to specified columns or rows, using similar subscript notation. The differences between DataFrame subscript notation and standard list subscript notation are that
with a DataFrame, you can select in 2 dimensions, columns and rows, e.g. titanic_df[columns][rows] (a column selection followed by a row selection)
you can pass a list or tuple of non-contiguous column or row names or indices to operate on, e.g. not just titanic_df[1][1] (the column labelled 1, then the row labelled 1) or titanic_df['Age'][1] (the Age column, then the row labelled 1) but also titanic_df[['Age','Sex']] (the Age and Sex columns together)
The subscript can be specified as any of the following (a few of these forms are sketched below):
an integer (representing the index of a single column you wish to operate on)
a string (representing the name of a single column you wish to operate on)
a list or tuple of integers or strings that refer to multiple names or indices of columns you wish to operate on
contiguous column or row indices specified by slice notation, as in x[1:5][2:4]
a colon, :, indicating all columns or rows
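A small sketch of a few of these forms on a toy frame with invented columns; note that with plain square brackets a single label or a list of labels selects columns, while a slice like [1:3] selects rows:
import pandas as pd

df = pd.DataFrame({'Age': [22, 38, 26, 35],
                   'Sex': ['male', 'female', 'female', 'male'],
                   'Fare': [7.25, 71.28, 7.92, 53.1]})

print(df['Age'])           # a single column name -> one column, as a Series
print(df[['Age', 'Sex']])  # a list of column names -> a two-column DataFrame
print(df[1:3])             # a slice -> rows 1 and 2
print(df['Age'][1])        # chained indexing: the Age column, then the row labelled 1
print(df.loc[:, 'Age'])    # a colon selects everything along that axis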
Oh, and axis = 1 means you're applying your function row by row, instead of column by column (axis = 0):
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
I am assuming that titanic_df is a pandas dataframe. The df[['Age','Sex']] syntax says that you are applying your function only to the Age and Sex columns of the dataframe (selecting a new dataframe that contains only those columns). See: http://pandas.pydata.org/pandas-docs/stable/indexing.html
If you wrote titanic_df['Age','Sex'], you would be attempting to select from a MultiIndex. See: http://pandas.pydata.org/pandas-docs/stable/advanced.html#basic-indexing-on-axis-with-multiindex.
In short, you can't use the second syntax because pandas has that set aside for a different operation.
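Putting it together, a minimal sketch of the apply call from the question on an invented three-row frame:
import pandas as pd

def male_female_child(passenger):
    # passenger is one row of the two-column selection, handed over by apply(axis=1)
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex

titanic_df = pd.DataFrame({'Age': [22, 8, 30],
                           'Sex': ['male', 'female', 'female'],
                           'Fare': [7.25, 21.08, 13.0]})

# Select only the Age and Sex columns, then apply the function to each row.
titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
print(titanic_df)          # the 8-year-old becomes 'child'; the adults keep their Sex value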