I want to use the function .head() for a groupby object - python

df_collection = dict(tuple(data.groupby('LOCATION').head(2)))
I want to group the df by geographic information and build a separate df for each country in a dict, so that I can give each one its own name. I also want only the first two years of each country to be assigned to the new DataFrame, which is why I used head(2), but I receive the error message:
dictionary update sequence element #0 has length 8; 2 is required
df_collection[('AUS')].head(2)
This works, but what is the difference?

The problem is the call to tuple().
Iterating over a DataFrame, which is what tuple() does here, yields the column names rather than the rows, so dict() then tries to unpack each column name string as a (key, value) pair. The first column name has 8 characters rather than the 2 items dict() requires, hence the error. Calling dict() directly on the DataFrame, by contrast, treats each column as one name/Series pair.
This should work for you:
df_collection = dict(data.groupby('LOCATION').head(2))
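If the goal is one DataFrame per country, each limited to its first two rows, a dict comprehension over the groupby object is a common alternative (a minimal sketch, reusing data and the LOCATION column from the question):
# one DataFrame per LOCATION value, keeping only the first two rows of each group
df_collection = {loc: grp.head(2) for loc, grp in data.groupby('LOCATION')}
df_collection['AUS']  # first two rows for Australia
Iterating over a groupby object yields (key, sub-DataFrame) pairs, so each country key maps to its own trimmed DataFrame.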

single positional indexer is out-of-bounds: an error

Why does dataframe.iloc[1,5] return an error but dataframe.iloc[[1,5]] doesn't?
dataframe.iloc[1,5] returns an error because it tries to access a single scalar value at the intersection of row 1 and column 5, and that position does not exist in the DataFrame.
On the other hand, dataframe.iloc[[1,5]] does not return an error because it accesses whole rows (rows 1 and 5). In this case, the output is a new DataFrame containing those specific rows.
The .iloc attribute accesses locations in the DataFrame by integer position. A single integer is treated as the index of one row (or, in the second slot, one column), while a list of integers is treated as the indices of several rows.
So in short, dataframe.iloc[1,5] tries to read one cell at a position that doesn't exist in the dataframe, whereas dataframe.iloc[[1,5]] selects the rows with those indices, which do exist.
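A minimal sketch of the difference on toy data (not the asker's frame), assuming a frame with six rows and only two columns:
import pandas as pd
df = pd.DataFrame({'a': range(6), 'b': range(6, 12)})  # 6 rows, 2 columns
df.iloc[[1, 5]]  # rows 1 and 5, returned as a new DataFrame
df.iloc[1, 5]    # IndexError: single positional indexer is out-of-bounds (there is no column 5)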

.isin() only returning one value as opposed to entire list

I have a large pandas dataframe with ~500,000 rows and 6 columns and each row is uniquely identified by the 'Names' column. The other 5 columns contain characteristic information about the corresponding 'Names' entry. I also have a separate list of ~40,000 individual names, all of which are subsumed within the larger dataframe. I want to use this smaller list to extract all of the corresponding information in the larger dataframe, and am using:
info = df[df['Names'].isin(ListNames)]
where df is the large dataframe and ListNames is the list of names I want to get the information for. However, when I run this, only one row is extracted from the overall dataframe as opposed to ~40000. I have also tried using ListNames as a 'Series' datatype instead of 'List' datatype but this returned the same thing as before. Would be super grateful for any advice - thanks!
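For reference, this is how .isin() is expected to behave on toy data (the names below are invented, not the asker's): it returns every row whose 'Names' value appears in the list.
import pandas as pd
df = pd.DataFrame({'Names': ['ann', 'bob', 'cat', 'dan'], 'x': [1, 2, 3, 4]})
ListNames = ['ann', 'cat', 'dan']
info = df[df['Names'].isin(ListNames)]  # three rows, one per matching name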

Creating a new dataframe column as a function of other columns [duplicate]

This question already has answers here: Pandas Vectorized lookup of Dictionary (2 answers). Closed 2 years ago.
I have a dataframe with a Country column. It has rows for around 15 countries. I want to add a Continent column using a mapping dictionary, ContinentDict, that maps country names to continent names.
I see that these two work
df['Population'] = df['Energy Supply'] / df['Energy Supply per Capita']
df['Continent'] = df.apply(lambda x: ContinentDict[x['Country']], axis='columns')
but this does not
df['Continent'] = ContinentDict[df['Country']]
It looks like the issue is that df['Country'] is a Series object, so statement 3 is not treated the same way as statement 2.
Questions
Would love to understand why statement 1 works but not 3. Is it because dividing two Series objects is defined as an element-wise divide?
Is there any way to change 3 to say I want an element-wise operation, without having to go the apply route?
From your statement "a mapping dictionary, ContinentDict", it looks like ContinentDict is a plain Python dictionary. In this case,
ContinentDict[some_key]
is a pure Python lookup, regardless of what object some_key is. That's why the 3rd call fails: the Series df['Country'] is not a key in the dictionary (and it never can be, since a Series is mutable and therefore cannot be a dictionary key).
Indexing a plain dictionary only works with an exact key, and Python throws an error when the key is not present.
Pandas does provide a tool for you to replace/map the values:
df['Continent'] = df['Country'].map(ContinentDict)
df['Continent']=df['Country'].map(ContinentDict)
In statement 1, you are dealing with two pandas Series, so pandas knows how to divide them element-wise.
In statement 3, you have a Python dictionary and a pandas Series, and a plain dictionary lookup doesn't know what to do with a Series: df['Country'] is a Series, not a key in the dictionary.
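A minimal, self-contained sketch of the element-wise mapping (the country and continent values below are invented for illustration):
import pandas as pd
df = pd.DataFrame({'Country': ['France', 'Japan', 'Brazil']})
ContinentDict = {'France': 'Europe', 'Japan': 'Asia', 'Brazil': 'South America'}
# .map looks each element of the Series up in the dictionary, element-wise
df['Continent'] = df['Country'].map(ContinentDict)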

Accessing data in a Pandas dataframe with one row

I use Pandas dataframes to manipulate data and I usually visualise them as virtual spreadsheets, with rows and columns defining the positions of individual cells. I'm happy with the methods to slice and dice the dataframes, but there seems to be some odd behaviour when the dataframe contains a single row. Basically, I want to select rows of data from a large parent dataframe that meet certain criteria and then pass those results as a daughter dataframe to a separate function for further processing.
Sometimes there will only be a single record in the parent dataframe that meets the defined criteria and, therefore, the daughter dataframe will only contain a single row. Nevertheless, I still need to be able to access data in the daughter in the same way as for the parent dataframe. To illustrate my point, consider the following dataframe:
import pandas as pd
tempDF = pd.DataFrame({'group':[1,1,1,1,2,2,2,2],
                       'string':['a','b','c','d','a','b','c','d']})
print(tempDF)
Which looks like:
   group string
0      1      a
1      1      b
2      1      c
3      1      d
4      2      a
5      2      b
6      2      c
7      2      d
As an example, I can now select those rows where 'group' == 2 and 'string' == 'c', which yields just a single row. As expected, the length of the dataframe is 1 and it's possible to print just a single cell using .loc based on index values in the original dataframe:
tempDF2 = tempDF.loc[((tempDF['group']==2) & (tempDF['string']=='c')),['group','string']]
print(tempDF2)
print('Length of tempDF2 = ',tempDF2.index.size)
print(tempDF2.loc[6,['string']])
Output:
   group string
6      2      c
Length of tempDF2 = 1
string c
However, if I select a single row using .loc with a scalar label, the dataframe is printed in a transposed form and the length of the dataframe is now given as 2 (rather than 1). Clearly, it's no longer possible to select single cell values based on the index of the original parent dataframe:
tempDF3 = tempDF.loc[6,['group','string']]
print(tempDF3)
print('Length of tempDF3 = ',tempDF3.index.size)
Output:
group     2
string    c
Name: 6, dtype: object
Length of tempDF3 = 2
In my mind, both these methods are actually doing the same thing, namely selecting a single row of data. However, in the second example, the rows and columns are transposed making it impossible to extract data in an expected way.
Why should these 2 behaviours exist? What is the point of transposing a single row of a dataframe as a default behaviour? How can I make sure that a dataframe containing a single row isn't transposed when I pass it to another function?
tempDF3 = tempDF.loc[6,['group','string']]
The scalar 6 in the first position of the .loc selection dictates that the return type will be a Series, and hence your problem. Instead, use the list [6]:
tempDF3 = tempDF.loc[[6],['group','string']]
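A quick check with the tempDF from the question (a sketch, assuming the same data as above):
tempDF3 = tempDF.loc[[6], ['group', 'string']]  # list of labels -> still a one-row DataFrame
print(tempDF3.index.size)        # 1
print(tempDF3.loc[6, 'string'])  # 'c', cell access works as in the parent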

Beginner Python - Why the nested list?

I am creating a new column to add to a dataframe and have created a function for the new column:
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex
My question is about the line of code that executes the function:
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis = 1)
Why is titanic_df[['Age','Sex']] a nested list? I guess I'm wondering why titanic_df['Age','Sex'] wouldn't work. Also, why is the argument axis = 1 given?
Thank you
Welcome to python! :)
First you should study up on list methods, specifically the subscript notation that lets you refer to indexes and slices of a given list.
https://docs.python.org/2/tutorial/introduction.html#lists
Now, you're using pandas, right? titanic_df[['Age','Sex']] is not a nested list, it's a subscript call on the object titanic_df, which isn't a list but a pandas.DataFrame. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
A pandas DataFrame has methods that allow you to refer only to specified columns or rows, using similar subscript notation. The differences between DataFrame subscript notation and standard list subscript notation are that
with a DataFrame, you can index or slice in 2 dimensions, rows and columns, e.g. titanic_df[columns][rows]
you can pass a list or tuple of non-contiguous column or row names or indices to operate on, e.g. not just titanic_df[1][1] (the column labelled 1, then row 1 of it) or titanic_df['Age'][1] (the Age column, then row 1 of it) but titanic_df[['Age','Sex']] (a list of column names at once).
The subscript can be specified as any of:
an integer (representing the index of a single column you wish to operate on)
a string (representing the name of a single column you wish to operate on)
A list or tuple of integers or strings that refer to multiple names or indices of columns you wish to operate on
Contiguous column or row indices specified by slice notation, as in x[1:5][2:4]
a colon, :, indicating all columns or rows.
Oh, and axis = 1 means you're applying your function row by row, instead of column by column (axis = 0):
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
I am assuming that titanic_df is a pandas dataframe. The df[['Age','Sex']] syntax says that you are applying your function only to the Age and Sex columns of the dataframe (selecting a new dataframe that contains only those columns). See: http://pandas.pydata.org/pandas-docs/stable/indexing.html
If you used titanic_df['Age','Sex'], you would be attempting to select from a MultiIndex. See: http://pandas.pydata.org/pandas-docs/stable/advanced.html#basic-indexing-on-axis-with-multiindex
In short, you can't use the second syntax because pandas has that set aside for a different operation.
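A small self-contained sketch of both points, reusing the male_female_child function from the question (the ages and sexes below are invented, not the actual Titanic data):
import pandas as pd

titanic_df = pd.DataFrame({'Age': [22, 8, 35], 'Sex': ['male', 'female', 'female']})

# titanic_df[['Age', 'Sex']] selects a two-column sub-DataFrame;
# axis=1 hands each of its rows to the function as a length-2 sequence.
titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)

# titanic_df['Age', 'Sex'] would instead look for a single column named
# ('Age', 'Sex') and raise a KeyError on this ordinary (non-MultiIndex) frame.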
