Beginner Python - Why the nested list? - python

I am creating a new column and appending a dataframe and have created a function for the new column:
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex
My question is about the line of code that executes the function:
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis = 1)
Why is titanic_df[['Age','Sex']] a nested list? I guess I'm wondering why titanic_df['Age','Sex'] wouldn't work? Also, why is argument axis = 1 given?
Thank you

Welcome to python! :)
First you should study up on list methods, specifically the subscript notation that lets you refer to indexes and slices of a given list.
https://docs.python.org/2/tutorial/introduction.html#lists
Now, you're using pandas, right? titanic_df[['Age','Sex']] is not a nested list, it's a subscript call on the object titanic_df, which isn't a list but a pandas.DataFrame. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
A pandas DataFrame has methods that allow you to refer only to specified columns or rows, using similar subscript notation. The differences between DataFrame subscript notation and standard list subscript notation are that
with a DataFrame, you can index or slice in 2 dimensions, rows and columns, e.g. titanic_df[columns][rows]
you can pass a list or tuple of non-contiguous column or row names or indices to operate on, e.g. not just titanic_df[1][1] (second row, second column) or titanic_df['Age'][1] (Age column, first row) but also a list of labels like titanic_df[['Age','Sex']].
The subscript can be specified as any of:
an integer (representing the index of a single column you wish to operate on)
a string (representing the name of a single column you wish to operate on)
A list or tuple of integers or strings that refer to multiple names or indices of columns you wish to operate on
Contiguous column or row indices specified by slice notation, as in x[1:5][2:4]
a colon, :, indicating all columns or rows.
Oh, and axis = 1 means you're applying your function row by row, instead of column by column (axis = 0):
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
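To make the axis = 1 behaviour concrete, here is a minimal sketch using the question's function on a tiny made-up frame (not the actual Titanic data): with axis=1, apply hands the function one row at a time, and the row unpacks into the Age and Sex values.

```python
import pandas as pd

def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex

# Tiny stand-in for the Titanic data
titanic_df = pd.DataFrame({'Age': [10, 30], 'Sex': ['male', 'female']})

# axis=1: each row of the two-column selection is passed to the function
titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
```

With axis=0 the function would receive each whole column instead, and the tuple unpacking would fail.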

I am assuming that titanic_df is a pandas DataFrame. The df[['Age','Sex']] syntax says that you are applying your function only to the Age and Sex columns of the dataframe (selecting a new dataframe that contains only those columns). See: http://pandas.pydata.org/pandas-docs/stable/indexing.html
If you gave titanic_df['Age','Sex'], you would be attempting to select from a MultiIndex. See: http://pandas.pydata.org/pandas-docs/stable/advanced.html#basic-indexing-on-axis-with-multiindex.
In short, you can't use the second syntax because pandas has that set aside for a different operation.
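A quick illustration of the difference, assuming an ordinary single-level column index (column names invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'Age': [10], 'Sex': ['male'], 'Fare': [7.25]})

# Inner list selects columns -> a new 2-column DataFrame
sub = df[['Age', 'Sex']]

# Without the inner list, pandas treats ('Age', 'Sex') as one single key,
# which only makes sense for a MultiIndex, so a plain frame raises KeyError
raised = False
try:
    df['Age', 'Sex']
except KeyError:
    raised = True
```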

Related

How can I assign a list's elements to corresponding rows of a dataframe in pandas?

I have numbers in a List that should get assigned to certain rows of a dataframe consecutively.
List=[2,5,7,12….]
In my dataframe that looks similar to the below table, I need to do the following:
Every row where Frame_Index == 1 should get the next element of List as "Sequence_number":
the first row with Frame_Index == 1 gets the first element of List as Sequence_number,
the next row with Frame_Index == 1 gets the second element of List as Sequence_number, and so on.
So my goal is to achieve a new dataframe like this:
I don't know which functions to use. If this weren't Python, I would use a for loop and check where frame_index == 1, but my dataset is large and I need a pythonic way to achieve the described solution. I appreciate any help.
EDIT: I tried the following to fill with my List values to use fillna with ffill afterwards:
concatenated_df['Sequence_number'] = [List[i] for i in concatenated_df.index if (concatenated_df['Frame_Index'] == 1).any()]
But of course I'm getting "list index out of range" error.
I think you could do that in two steps.
Add column and fill with your list where frame_index == 1.
Use df.fillna() with method="ffill" kwarg.
import pandas as pd
df = pd.DataFrame({"frame_index": [1,2,3,4,1,2]})
sequence = [2,5]
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True) # alias for df.fillna(method="ffill")
This stores sequence_number as float64, which might be acceptable in your use case; if you want it to be int64, you can force it when creating the column (line 4) or cast it later.
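Putting the two steps and the cast together, a sketch with a toy list (this assumes frame_index == 1 occurs exactly len(sequence) times, as in the question):

```python
import pandas as pd

df = pd.DataFrame({"frame_index": [1, 2, 3, 4, 1, 2]})
sequence = [2, 5]

# Step 1: fill only the rows where frame_index == 1
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence

# Step 2: forward-fill the gaps, then cast back to int
df.ffill(inplace=True)
df["sequence_number"] = df["sequence_number"].astype(int)
```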

split contents of a column in a python pandas Dataframe and create a new Dataframe with the newly separated list of strings

I have my original pandas dataframe, where the 'comment' column consists of unseparated lists of strings and another column called 'direction' indicates whether the overall content in the 'comment' column suggests positive or negative comments, where 1 represents positive comments and 0 represents negative comments.
Now I wish to create a new Dataframe by separating all the strings under 'comment' by the delimiter '<END>' and assign each new list of strings as a separate row with their original 'direction' respectively. So it would look something like this new dataframe.
I wonder how should I achieve so?
Try:
df['comment'] = df['comment'].str.split('<END>')
df = df.explode('comment')
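For example, with a small made-up two-row frame (column names taken from the question): str.split turns each cell into a list, and explode then gives each list element its own row, repeating the 'direction' value.

```python
import pandas as pd

df = pd.DataFrame({'comment': ['good<END>great', 'bad'],
                   'direction': [1, 0]})

df['comment'] = df['comment'].str.split('<END>')   # cells become lists of strings
df = df.explode('comment').reset_index(drop=True)  # one row per string, direction repeated
```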

Create a list from rows values with no duplicates

I would need to extract the following words from a dataframe.
car+ferrari
The dataset is
                 Owner    Sold
type
car+ferrari      J.G      £500000
car+ferrari      R.R.T.   £276,550
car+ferrari
motobike+ducati
motobike+ducati
...
I need to create a list with words from type, but distinguishing them separately. So in this case I need only car and ferrari.
The list should be
my_list=['car','ferrari']
no duplicates.
So what I should do is select type car+ferrari and extract all the words, adding them to a list as shown above, without duplicates (I have many car+ferrari rows, but since I need to create a list of the terms, I only need to extract them once).
Any help will be appreciated
EDIT: type column is an index
def lister(x):  # function to split by '+'
    return set(x.split('+'))

df['listcol'] = df['type'].apply(lister)  # applying the function on the type column and saving output to a new column
Adding #AMC's suggestion of a rather inbuilt solution to split series in pandas:
df['type'].str.split(pat='+')
for details refer to pandas.Series.str.split
Converting pandas index to series:
pd.Series(df.index)
Apply a function on index:
pd.Series(df.index).apply(lister)
or
pd.Series(df.index).str.split(pat = '+')
or
df.index.to_series().str.split("+")
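Tying those pieces together, here is one sketch that builds the duplicate-free list straight from the index (the data is invented to match the question's table; dict.fromkeys is used instead of set to keep first-seen order):

```python
import pandas as pd

df = pd.DataFrame({'Owner': ['J.G', 'R.R.T.', None],
                   'Sold': ['£500000', '£276,550', None]},
                  index=pd.Index(['car+ferrari', 'car+ferrari', 'car+ferrari'],
                                 name='type'))

# Split every index label on '+' and flatten to one word per row
words = df.index.to_series().str.split('+').explode()

# dict.fromkeys drops duplicates while preserving order (a set would not)
my_list = list(dict.fromkeys(words))
```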

I want to use the function .head() for a groupby object

df_collection = dict(tuple(data.groupby('LOCATION').head(2)))
I want to group the df by geographic information and make for each country its own df in a dict so that I can assign different names. Moreover, I want only the first two years of each country assigned to this new DataFrame, which is why I wanted to use head(2), but I receive the error message:
dictionary update sequence element #0 has length 8; 2 is required
df_collection[('AUS')].head(2)
this works but what is the difference?
The problem is the call to tuple().
Calling tuple() on the DataFrame returned by head(2) iterates over its column names, so dict() receives a sequence of strings. dict() expects each element to be a (key, value) pair of length 2, but the first column name has 8 characters, hence "length 8; 2 is required".
This should work for you:
df_collection = dict(data.groupby('LOCATION').head(2))
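If the goal is a dict mapping each country to its own two-row DataFrame, one common pattern is a dict comprehension over the groupby object itself, since iterating a groupby yields (key, sub-DataFrame) pairs. A sketch with invented toy data:

```python
import pandas as pd

data = pd.DataFrame({'LOCATION': ['AUS', 'AUS', 'AUS', 'DEU', 'DEU'],
                     'Year': [2000, 2001, 2002, 2000, 2001]})

# Each group key becomes a dict key; head(2) keeps the first two rows per country
df_collection = {loc: g.head(2) for loc, g in data.groupby('LOCATION')}
```

Then df_collection['AUS'] is already the trimmed per-country frame, with no extra head(2) call needed afterwards.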

How to create a new empty pandas columns with a specific dtype?

I have a DataFrame df with columns 'a'. How would I create a new column 'b' which has dtype=object?
I know this may be considered poor form, but at the moment I have a dataframe df where the column 'a' contains arrays (each element is an np.array). I want to create a new column 'b' where each element is a new np.array that contains the logs of the corresponding element in 'a'.
At the moment I tried these two methods, but neither worked:
for i in df.index:
    df.set_value(i, 'b', log10(df.loc[i, 'a']))
and
for i in df.index:
    df.loc[i, 'b'] = log10(df.loc[i, 'a'])
Both give me ValueError: Must have equal len keys and value when setting with an iterable.
I'm assuming the error comes about because the dtype of the new column is defaulted to float although I may be wrong.
As each row of your column is an array, it's better to use the standard NumPy mathematical functions for computing their element-wise logarithms to the base 10:
df['log_a'] = df.a.apply(lambda x: np.log10(x))
