I am trying to extract values from a column of dictionaries in pandas and assign them to their respective columns that already exist. I have hardcoded an example below of the data set that I have:
import numpy as np
import pandas as pd

df_have = pd.DataFrame({
    'value_column': [np.nan, np.nan, np.nan],
    'date': [np.nan, np.nan, np.nan],
    'string_column': [np.nan, np.nan, np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]})
df_have
df_want = pd.DataFrame({
    'value_column': [40, 30, 10],
    'date': ['2017-08-01', np.nan, '2016-12-01'],
    'string_column': [np.nan, 'abc', np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]})
df_want
I have managed to extract the values out of the dictionaries using loops:
for row in range(len(df_have)):
    row_holder = df_have.dict[row]
    number_of_dictionaries_in_the_row = len(row_holder)
    for dictionary in range(number_of_dictionaries_in_the_row):
        variable_holder = df_have.dict[row][dictionary].keys()
        variable = list(variable_holder)[0]
        value = df_have.dict[row][dictionary].get(variable)
I now need to somehow conditionally turn df_have into df_want. I am happy to take a completely new approach and recreate the whole thing from scratch. We could even assume that I only have a dataframe with the dictionaries and nothing else.
You could use pandas string methods to pull the data out, although nesting data structures within pandas is inefficient:
df_have.loc[:, "value_column"] = df_have["dict"].str.get(0).str.get("value_column")
df_have.loc[:, "date"] = df_have["dict"].str.get(-1).str.get("date")
df_have.loc[:, "string_column"] = df_have["dict"].str.get(-1).str.get("string_column")
value_column date string_column dict
0 40 2017-08-01 None [{'value_column': 40}, {'date': '2017-08-01'}]
1 30 None abc [{'value_column': 30}, {'string_column': 'abc'}]
2 10 2016-12-01 None [{'value_column': 10}, {'date': '2016-12-01'}]
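If you'd rather not rely on the position of each dict inside the list, another option (sketched on the same toy data) is to merge each row's list of single-key dicts into one flat dict and let pandas expand the keys into columns:

```python
import pandas as pd

df_have = pd.DataFrame({
    'dict': [
        [{'value_column': 40}, {'date': '2017-08-01'}],
        [{'value_column': 30}, {'string_column': 'abc'}],
        [{'value_column': 10}, {'date': '2016-12-01'}],
    ]
})

# Merge each row's list of single-key dicts into one flat dict per row...
merged = df_have['dict'].apply(lambda ds: {k: v for d in ds for k, v in d.items()})
# ...then expand the keys into columns; missing keys become NaN
df_want = df_have.join(pd.DataFrame(merged.tolist(), index=df_have.index))
print(df_want)
```

This creates a column for every key that appears anywhere in the series and fills NaN where a row lacks that key, so it does not matter at which list position 'date' or 'string_column' occurs.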
I have a pandas dataframe where I listed items and categorised them:
col_name |col_group
-------------------------
id | Metadata
listing_url | Metadata
scrape_id | Metadata
name | Text
summary | Text
space | Text
To reproduce:
import pandas
df = pandas.DataFrame([
['id','metadata'],
['listing_url','metadata'],
['scrape_id','metadata'],
['name','Text'],
['summary','Text'],
['space','Text']],
columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name','summary','space']
This is to allow me to pass these lists of columns to pandas and drop columns.
I googled a lot and got stuck: all answers are about converting lists to a dataframe, not vice versa. Should I aim to convert into a dictionary, or a list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
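For example, a self-contained sketch (with made-up column names and values) of the column-restricted iteration:

```python
import pandas as pd

df = pd.DataFrame([
    [1, 'url_a', 'name_a'],
    [2, 'url_b', 'name_b']],
    columns=['id', 'listing_url', 'name'])

rows = []
# iterrows() on a column subset yields (index, Series) pairs
# containing only the selected columns
for _, row in df[['listing_url', 'name']].iterrows():
    rows.append(row.to_list())
print(rows)  # [['url_a', 'name_a'], ['url_b', 'name_b']]
```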
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage would be simply my_vars['Text'] to access the Text list, and so on. If you must have these as distinct names, you can force them upon your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However I would advise against that as you might unwittingly overwrite some of your other objects, or they might not be in the proper scope you needed (e.g. locals).
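To close the loop on the stated goal (dropping columns by group), here is a minimal sketch that sticks with plain dict access rather than globals(); the wide dataframe and its values are made up for illustration:

```python
import pandas as pd

cols = pd.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text']],
    columns=['col_name', 'col_group'])

# Same groupby-to-dict trick as above
my_vars = cols.groupby('col_group').agg(list)['col_name'].to_dict()

# A toy wide dataframe whose columns match col_name
df = pd.DataFrame([[1, 'u', 'n', 's']],
                  columns=['id', 'listing_url', 'name', 'summary'])

# Drop every column belonging to the 'Text' group
trimmed = df.drop(columns=my_vars['Text'])
print(trimmed.columns.tolist())  # ['id', 'listing_url']
```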
I am trying to get the mean value for a list of percentages from an Excel file which has data. My current code is as follows:
import numpy as pd
data = pd.DataFrame =({'Percentages': [.20, .10, .05], 'Nationality':['American', 'Mexican', 'Russian'],
'Gender': ['Male', 'Female'], 'Question': ['They have good looks']})
pref = data[data.Nationality == 'American']
prefPref = pref.pivot_table(data.Percentage.mean(), index=['Question'], column='Gender')
The error is coming from where I try to get the .mean() from my ['Percentage'] list. So, how can I get the mean from the list of Percentages? Do I need to create a variable for the mean value, and if so, how do I implement that into the code?
["Percentage"] is a list containing the single string item "Percentage". It isn't possible to calculate a mean from a list of text.
In addition, the method .mean() doesn't exist in Python for generic lists; have a look at numpy for calculating means and other mathematical operations.
For example:
import numpy
numpy.array([4,2,6,5]).mean()
Here is a reworked version of your pd.pivot_table. See also How to pivot a dataframe.
import pandas as pd, numpy as np
data = pd.DataFrame({'Percentages': [0.20, 0.10, 0.05],
'Nationality': ['American', 'American', 'Russian'],
'Gender': ['Male', 'Female', 'Male'],
'Question': ['Q1', 'Q2', 'Q3']})
pref = data[data['Nationality'] == 'American']
prefPref = pref.pivot_table(values='Percentages', index='Question',
                            columns='Gender', aggfunc='mean')
# Gender Female Male
# Question
# Q1 NaN 0.2
# Q2 0.1 NaN
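If you prefer, the same Question-by-Gender table of means can be produced without pivot_table, via groupby plus unstack; this is just an equivalent spelling, not a correction:

```python
import pandas as pd

data = pd.DataFrame({'Percentages': [0.20, 0.10, 0.05],
                     'Nationality': ['American', 'American', 'Russian'],
                     'Gender': ['Male', 'Female', 'Male'],
                     'Question': ['Q1', 'Q2', 'Q3']})

pref = data[data['Nationality'] == 'American']
# Group on both keys, take the mean, then pivot Gender into columns
table = pref.groupby(['Question', 'Gender'])['Percentages'].mean().unstack()
print(table)
```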
Sometimes, it seems that the more I use Python (and pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here, but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
tempList = set(tempVar.dropna()) # Drop NaNs and create set of unique values
print(tempList)
return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
tempList = list(set(tempVar.dropna())) # This is the only difference
print(tempList)
return tempList
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can sometimes cause weird problems in pandas. You can either:
Use tuples (as you've already noticed)
If you really need lists, just do the conversion in a second operation like this:
dfAgg = dfAgg.applymap(lambda x: list(x))
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
tempList = set(tempVar.dropna()) # Drop NaNs and create set of unique values
print(tempList)
return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Transform sets into lists (assign the result back; applymap is not in-place)
dfAgg = dfAgg.applymap(lambda x: list(x))
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.
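For what it's worth, the tuple detour from the question's EDIT combines with the second-pass conversion into one compact sketch (toy data, not the original frame):

```python
import pandas as pd

tempDF = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'gender': ['male', 'female', 'male', 'female'],
    'age': ['young', 'young', 'old', 'old']})

# Tuples are accepted as a reducing result where lists raise
# "Function does not reduce"
dfAgg = tempDF.groupby(['gender', 'age']).agg(lambda x: tuple(set(x.dropna())))

# Second pass: element-wise conversion of each tuple to a list
dfAgg = dfAgg.applymap(list)
print(dfAgg)
```

On newer pandas (2.1+), DataFrame.map is the preferred name for applymap, but the two-step shape of the workaround stays the same.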