Pandas Apply with multiple columns as input - python

For a dataframe with 4 columns of coordinates (longitude, latitude), I would like to create a 5th column that holds the distance between the two places in each row. The following illustrates the input:
import pandas as pd
dict = [{'x1': '1','y1': '1','x2': '3','y2': '2'},
{'x1': '1','y1': '1','x2': '3','y2': '2'}]
data = pd.DataFrame(dict)
As an outcome I would like to have this:
dict1 = [{'x1': '1','y1': '1','x2': '3','y2': '2','distance': '2.6'},
{'x1': '1','y1': '1','x2': '3','y2': '2','distance': '2.9'}]
data2 = pd.DataFrame(dict1)
Where distance is computed with great_circle from geopy (from geopy.distance import great_circle).
This is what I tried:
data['distance']=data[['x1','y1','x2','y2']].apply(lambda x1,y1,x2,y2: great_circle(x1,y1,x2,y2).miles, axis=1)
But that gives me a type error:
TypeError: <lambda>() missing 3 required positional arguments: 'y1', 'x2', and 'y2'
Any help is appreciated.

That is because with axis=1, apply passes each row to the lambda as a single Series, not as four separate arguments, so the lambda must take one parameter and index into it. Note also that great_circle expects two (latitude, longitude) points rather than four scalars. Hope this helps!
data['distance'] = data[['x1','y1','x2','y2']].apply(lambda row: great_circle((row['y1'], row['x1']), (row['y2'], row['x2'])).miles, axis=1)
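For reference, here is a minimal runnable sketch of the same idea, assuming the coordinates are stored as numbers rather than strings:
import pandas as pd
from geopy.distance import great_circle

data = pd.DataFrame([{'x1': 1.0, 'y1': 1.0, 'x2': 3.0, 'y2': 2.0},
                     {'x1': 1.0, 'y1': 1.0, 'x2': 3.0, 'y2': 2.0}])

# y is the latitude and x the longitude, so each point is (y, x)
data['distance'] = data.apply(
    lambda row: great_circle((row['y1'], row['x1']),
                             (row['y2'], row['x2'])).miles,
    axis=1)
print(data)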

Related

Getting a dictionary of lists that contains elements from a column using a groupby

I have a dataframe that looks like this, with 1 string column and 1 int column.
import pandas as pd
import random
columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all the values of column_B grouped by column_A.
What I did to achieve this was to use a groupby to get the number of occurrences for column_B:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
And then use some list comprehensions to build the lists by hand for each column_A value and add them to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a method that achieves this within the groupby statement itself, but I think you could try something like this instead:
import random
import pandas as pd
columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist() for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: for each unique column_A value, it collects all the corresponding column_B values into a list and stores that list in the dict under the column_A value as its key.
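For example, with a small fixed frame (illustrative values, not the random data above), the comprehension produces:
import pandas as pd

my_df = pd.DataFrame({'column_A': ['EG', 'EC', 'EG'],
                      'column_B': [1, 2, 3]})
# One list of column_B values per unique column_A value
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist()
              for val in my_df['column_A'].unique()}
print(final_dict)  # {'EG': [1, 3], 'EC': [2]}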

Bokeh: Column DataSource part giving error

I am trying to create an interactive bokeh plot that holds multiple columns of data, and I am not sure why I am getting the error
ValueError: expected an element of ColumnData(String, Seq(Any)), got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
source = ColumnDataSource(data={
'x' : data.loc[1970].fertility,
'y' : data.loc[1970].life,
'pop' : (data.loc[1970].population / 20000000) + 2,
'region' : data.loc[1970].region,
})
I have tried two different datasets imported from Excel and have been running into issues working out exactly why this is happening.
As the name suggests, the ColumnDataSource is a data structure for storing columns of data. This means that the value of every key in .data must be a column, i.e. a Python list, a NumPy array, or a Pandas series. But you are trying to assign plain numbers as the values, which is what the error message is telling you:
expected an element of ColumnData(String, Seq(Any))
This is saying that the acceptable, expected values are mappings from strings to sequences. But what you passed is clearly not that:
got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
The value for x for instance is just the number 6.794 and not an array or list, etc.
You can easily do this:
source = ColumnDataSource({str(c): v.values for c, v in df.items()})
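Here df.items() yields (column name, Series) pairs, so str(c) is the column name and v.values is a NumPy array, which is a valid column for a ColumnDataSource.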
Another possible solution: I think the problem is in how the data is selected from the dataframe.
source = ColumnDataSource(data={
'x' : data[data['Year'] == 1970]['fertility'],
'y' : data[data['Year'] == 1970]['life'],
'pop' : (data[data['Year'] == 1970]['population']/20000000) + 2,
'region' : data[data['Year'] == 1970]['region']
})
I had this same problem using this same dataset.
My solution was to import the csv in pandas using "Year" as the index column:
data = pd.read_csv(csv_path, index_col='Year')
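Putting the last two answers together, a minimal sketch, assuming a hypothetical gapminder-style csv with Year, fertility, life, population and region columns as in the question:
import pandas as pd
from bokeh.models import ColumnDataSource

# Indexing by Year means data.loc[1970] selects all rows for that year,
# so each value below is a Series rather than a single scalar
data = pd.read_csv('gapminder.csv', index_col='Year')
source = ColumnDataSource(data={
    'x' : data.loc[1970].fertility,
    'y' : data.loc[1970].life,
    'pop' : (data.loc[1970].population / 20000000) + 2,
    'region' : data.loc[1970].region,
})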

Pandas Series Insert Error: Missing Positional Argument and Does Not Match Index Length

I have a list that I am attempting to prepend a value to, but I am uncertain of the best approach to doing just that. I have read that .insert() is the best method, but after trying two different variations of the method I can't seem to get it to work.
I have tried
df_full_modified = df_full[['date', 'b_clicks', 'b_cpc']].insert(0, ['date', 'b_clicks', 'b_cpc'])
which returns
TypeError: insert() missing 1 required positional argument: 'value'
and also tried adding in a value for the columns parameter
df_full_modified = df_full[['date', 'b_clicks', 'b_cpc']].insert(0, ['date', 'b_clicks', 'b_cpc'], ['date', 'b_clicks', 'b_cpc'])
which returns
ValueError: Length of values does not match length of index
Am I missing something with trying to map an array to the insert() method?
Here is the format of the data frame df_full:
[['2018-01-01', '72', 2.43], ['2018-01-02', '232', 2.8], ['2018-01-03', '255', 2.6], ...
and I am trying to prepend ['date', 'b_clicks', 'b_cpc'] to make it
[['date', 'b_clicks', 'b_cpc'], ['2018-01-01', '72', 2.43], ['2018-01-02', '232', 2.8], ['2018-01-03', '255', 2.6], ...
IIUC you already have a df like this:
         date  b_clicks  b_cpc
0  2018-01-01        72   2.43
1  2018-01-02       232   2.80
2  2018-01-03       255   2.60
And you want to insert a row at the top. df.insert inserts a column at a specified position, not a row. Since you already know how to build the row as a list, you can create the header row separately and concat your existing df onto it:
data = []
data.insert(0, {'date': 'date', 'b_clicks': 'b_clicks', 'b_cpc': 'b_cpc'})
df_full_modified = pd.concat([pd.DataFrame(data), df_full], ignore_index=True)
output:
>>> df_full_modified
   b_clicks b_cpc        date
0  b_clicks b_cpc        date
1        72  2.43  2018-01-01
2       232   2.8  2018-01-02
3       255   2.6  2018-01-03
If I understand your starting point correctly, the provided lists are stored in a pandas Series, like:
a = pd.Series([['2018-01-01', '72', 2.43], ['2018-01-02', '232', 2.8], ['2018-01-03', '255', 2.6]])
If so, then you could simply create a Series for the ['date', 'b_clicks', 'b_cpc'] row, like:
b = pd.Series([['date', 'b_clicks', 'b_cpc']])
and finally prepend b to a (Series.append was removed in pandas 2.0, so pd.concat is the durable spelling):
b = pd.concat([b, a])
However, with this approach you will have two 0 indices at the beginning of the series. I do not know if that bothers you.
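If the duplicated 0 label does bother you, a small sketch of the fix is to pass ignore_index=True so the combined series is renumbered:
import pandas as pd

a = pd.Series([['2018-01-01', '72', 2.43], ['2018-01-02', '232', 2.8]])
b = pd.Series([['date', 'b_clicks', 'b_cpc']])
# ignore_index=True renumbers the result 0..n-1, avoiding duplicate labels
combined = pd.concat([b, a], ignore_index=True)
print(combined)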

Converting a set to a list with the Pandas groupby agg function causes 'ValueError: Function does not reduce'

Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can sometimes cause weird problems in pandas: when an aggregation function returns a list, pandas appears to treat it as an array-like result and decides the function did not reduce the group to a single value, whereas tuples and sets pass through. You can either:
Use tuples (as you've already noticed), or
If you really need lists, do the conversion in a second operation like this:
dfAgg = dfAgg.applymap(lambda x: list(x))
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Transform the sets into lists (applymap returns a new frame, so assign it back)
dfAgg = dfAgg.applymap(lambda x: list(x))
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.
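For completeness, a sketch of the tuple route combined with the second-step conversion, reusing tempGroupby from the example above:
# Aggregate to tuples (which pandas accepts), then convert to lists
dfAgg = tempGroupby.agg(lambda x: tuple(set(x.dropna())))
dfAgg = dfAgg.applymap(list)
print(dfAgg)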

Pandas query throws error when column name starts with a number

I'm trying to perform a query on the following dataframe:
data = {'ab': [1,2,3], 'c1': [1,2,3], 'd': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
for cl in df.columns:
    print(len(df.query('%s==2' % cl)))
This works fine. However, if a column name starts with a number then it throws a syntax error.
data = {'ab': [1,2,3], 'c1': [1,2,3], '1d': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
for cl in df.columns:
    print(len(df.query('%s==2' % cl)))
File "", line 1
1 d ==2
^
SyntaxError: invalid syntax
I think that the problem is related to the format of the string. I was wondering what will be the correct way to form this query.
query uses pandas.eval, which is documented to "evaluate a Python expression as a string". Your query is not a valid Python expression, because 1d is not valid syntax in Python, so you can't use query to refer to this column that way.
Things in pandas are generally easier if you make sure all your columns are valid Python identifiers.
You could always get a list of the column names, which gives you the columns as strings, and then filter on them with boolean indexing, which works regardless of the name.
data = {'ab': [1,2,3], 'c1': [1,2,3], '1d': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
cols = list(df)
So for example cols[0] would be 'ab' and cols[2] would be '1d'.
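For example, boolean indexing sidesteps query entirely, and pandas 0.25+ also lets you backtick-quote awkward names inside query (a sketch; check your pandas version for the backtick form):
import pandas as pd

data = {'ab': [1, 2, 3], 'c1': [1, 2, 3], '1d': [1, 2, 3], 'e_f': [1, 2, 3]}
df = pd.DataFrame(data)

# Boolean indexing works for any column name
for cl in df.columns:
    print(len(df[df[cl] == 2]))

# pandas >= 0.25 accepts backtick-quoted names in query
print(len(df.query('`1d` == 2')))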
