Create a pandas DataFrame where each cell is a set of strings - python

I am trying to create a DataFrame like so:
col_a       col_b
{'soln_a'}  {'soln_b'}
In case it helps, here are some of my failed attempts:
import pandas as pd
my_dict_a = {"col_a": set(["soln_a"]), "col_b": set("soln_b")}
df_0 = pd.DataFrame.from_dict(my_dict_a) # ValueError: All arrays must be of the same length
df_1 = pd.DataFrame.from_dict(my_dict_a, orient="index").T # splits 'soln_b' into individual letters
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b).apply(set) # TypeError: 'set' type is unordered
df_3 = pd.DataFrame.from_dict(my_dict_b, orient="index").T # creates DataFrame of lists
df_3.apply(set, axis=1) # combines into single set of {soln_a, soln_b}
What's the best way to do this?

You just need to ensure your input data structure is formatted correctly.
The default dictionary -> DataFrame constructor expects each value in the dictionary to be a collection of some type, with one element per row. So make sure each key maps to a collection of set objects rather than directly to a set.
So, if I change my input dictionary to have a list of sets, then it works as expected.
import pandas as pd
my_dict = {
"col_a": [{"soln_a"}, {"soln_c"}],
"col_b": [{"soln_b", "soln_d"}, {"soln_c"}]
}
df = pd.DataFrame.from_dict(my_dict)
print(df)
col_a col_b
0 {soln_a} {soln_d, soln_b}
1 {soln_c} {soln_c}
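To make the difference concrete, here is a minimal sketch contrasting the inputs from the question with a list-of-sets input. The variable names chars and whole are illustrative and do not come from the original posts.

```python
import pandas as pd

# set("soln_b") iterates over the string, so it yields single characters;
# a set literal (or set(["soln_b"])) keeps the whole string as one element.
chars = set("soln_b")
whole = {"soln_b"}

# Wrapping each set in a list gives pandas one cell per list element.
df = pd.DataFrame({"col_a": [{"soln_a"}], "col_b": [whole]})

print(sorted(chars))               # the individual characters of "soln_b"
print(type(df.loc[0, "col_b"]))    # each cell holds a Python set
```

This is also why df_1 in the question ended up with individual letters: the set had already been split before pandas saw it.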

You could apply a list comprehension on the columns:
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b)
df_2 = df_2.apply(lambda col: [set([x]) for x in col])
Output:
col_a col_b
0 {soln_a} {soln_b}
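A possible variation on the same idea, sketched under the assumption that every cell holds a single string: DataFrame.apply hands each column to the function as a Series, and Series.map then wraps each cell individually. The name df_sets is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col_a": ["soln_a"], "col_b": ["soln_b"]})

# DataFrame.apply calls the lambda once per column (a Series);
# Series.map then converts each cell into a one-element set.
df_sets = df.apply(lambda col: col.map(lambda x: {x}))

print(df_sets)
```

Calling apply(set) directly passes the whole column to set, which is why the per-cell step is needed.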

Why not something like this?
df = pd.DataFrame({
'col_a': [set(['soln_a'])],
'col_b': [set(['soln_b'])],
})
Output:
>>> df
col_a col_b
0 {soln_a} {soln_b}

Related

Dataframe with empty column in the data

I have a list of lists with a header row followed by the value rows.
In some cases the last "column" is empty for every row (if even one row has a value, it works fine), and DataFrame raises an error because the number of columns differs from the number of headers.
I'm thinking of appending a None value to the first value row before creating the DataFrame, but I'm wondering whether there is a better way to handle this case.
data = [
["data1", "data2", "data3"],
["value11", "value12"],
["value21", "value22"],
["value31", "value32"]]
headers = data.pop(0)
dataframe = pandas.DataFrame(data, columns = headers)
You could do this:
import pandas as pd
data = [
["data1", "data2", "data3"],
["value11", "value12"],
["value21", "value22"],
["value31", "value32"]
]
# create dataframe
df = pd.DataFrame(data)
# set new column names
# this will use ["data1", "data2", "data3"] as new columns, because they are in the first row
df.columns = df.iloc[0].tolist()
# now that you have the right column names, skip the first row
df = df.iloc[1:].reset_index(drop=True)
df
data1 data2 data3
0 value11 value12 None
1 value21 value22 None
2 value31 value32 None
Is this what you want?
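You can use the DataFrame.reindex method to add the missing columns. You could do something like this:
import pandas as pd
# assumes headers = data.pop(0) has already run, as in the question
df = pd.DataFrame(data)
# Name only the columns that actually exist, to avoid a length-mismatch exception.
df.columns = headers[:df.shape[1]]
# Add any header that has no data yet as an empty (NaN) column.
df = df.reindex(headers, axis=1)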
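For completeness, here is the reindex approach as a self-contained sketch, using the sample data from the question. It assumes the short rows are only ever missing trailing columns.

```python
import pandas as pd

data = [
    ["data1", "data2", "data3"],
    ["value11", "value12"],
    ["value21", "value22"],
    ["value31", "value32"],
]
headers = data.pop(0)

df = pd.DataFrame(data)                # only two columns are inferred from the rows
df.columns = headers[:df.shape[1]]     # name the columns that exist
df = df.reindex(columns=headers)       # add the missing ones, filled with NaN

print(df)
```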

Filtering a Pandas DataFrame through a list dictionary

[Image: Movie DataFrame]
I have a DataFrame that contains movie information and I'm trying to filter the rows so that if the list of dictionaries contains 'name' == 'specified genre' it will display movies containing that genre.
I have tried using a list comprehension
filter = ['Action']
expectedResult = [d for d in df if d['name'] in filter]
however I end up with an error:
TypeError: string indices must be integers
d is a column name in your code. That's why you are getting this error.
See the following example:
import pandas as pd
df = pd.DataFrame({"abc": [1,2,3], "def": [4,5,6]})
for d in df:
    print(d)
Gives:
abc
def
I think what you are trying to do could be achieved by:
df = pd.DataFrame({"genre": ["something", "something else"], "abc": ["movie1", "movie2"]})
movies = df.to_dict("records")
[m["abc"] for m in movies if m["genre"] == "something"]
Which gives:
['movie1']
Your loop, for d in df, iterates over the column headings, not the rows, so d holds a column name (a string) such as the genres heading.
Try running:
for d in df:
    print(d)
and you will see the column names printed.
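Since the question describes cells that hold a list of dictionaries, a row-wise mask may be closer to what is wanted. The following is a sketch only: the column names title and genres and the sample rows are assumptions, because the original DataFrame was posted as an image.

```python
import pandas as pd

# Hypothetical data: each "genres" cell is a list of dicts with a "name" key.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        [{"id": 1, "name": "Action"}, {"id": 2, "name": "Comedy"}],
        [{"id": 3, "name": "Drama"}],
        [{"id": 1, "name": "Action"}],
    ],
})

wanted = {"Action"}

# True for rows where at least one genre dict has a wanted name.
mask = df["genres"].apply(lambda genres: any(g["name"] in wanted for g in genres))
action_movies = df[mask]

print(action_movies["title"].tolist())
```

Naming the list wanted rather than filter also avoids shadowing Python's built-in filter function.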

How to merge multiple columns with same names in a dataframe

I have the following dataframe as below:
df = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes',
'Data':'Blank',
'Data':'No',
'Logline':'10'})
I need dataframe:
df = pd.DataFrame({'Field':['FAPERF','FAPERF'],
'Form':['LIVERID','LIVERID'],
'Folder':['ALL','ALL'],
'Logline':['9','10'],
'Data':['Yes','Blank','No']})
I had tried using the below code but not able to achieve desired output.
res3.set_index(res3.groupby(level=0).cumcount(), append=True)['Data'].unstack(0)
Can anyone please help me.
I believe your best option is to create a separate DataFrame for each repeated column name (for example, three DataFrames that each have a "Data" column) and then concatenate them:
import pandas as pd
# Scalar values must be wrapped in lists, otherwise pandas asks for an index.
df1 = pd.DataFrame({'Field': ['FAPERF'],
'Form': ['LIVERID'],
'Folder': ['ALL'],
'Logline': ['9'],
'Data': ['Yes']})
df2 = pd.DataFrame({
'Data': ['No'],
'Logline': ['10']})
df3 = pd.DataFrame({'Data': ['Blank']})
frames = [df1, df2, df3]
# Columns missing from a frame are filled with NaN in the result.
result = pd.concat(frames)
You just need to build a list of one-row DataFrames, specifying the Logline and Data value for each row, and then concatenate them.
import pandas as pd
list_df = []
data_type_list = ["yes", "no", "Blank"]
logline_type = ["9", "10", "10"]
for x in range(len(data_type_list)):
    # one single-row DataFrame per (Data, Logline) pair
    new_dict = {'Field': ['FAPERF'], 'Form': ['LIVERID'], 'Folder': ['ALL'], "Data": [data_type_list[x]], "Logline": [logline_type[x]]}
    df = pd.DataFrame(new_dict)
    list_df.append(df)
new_df = pd.concat(list_df)
print(new_df)
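It may help to see why the dictionary in the question cannot hold three Data values in the first place: a Python dict literal keeps only the last value for a repeated key. The sketch below shows that, then builds the rows from a list of records instead. The pairing of Logline and Data values in pairs is an assumption, since the question does not state which values belong together.

```python
import pandas as pd

# Repeated keys in a dict literal overwrite each other; only the last value survives.
row = {"Logline": "9", "Data": "Yes", "Data": "Blank", "Data": "No", "Logline": "10"}
print(row)  # {'Logline': '10', 'Data': 'No'}

# Values shared by every row.
shared = {"Field": "FAPERF", "Form": "LIVERID", "Folder": "ALL"}

# Assumed (Logline, Data) pairing, for illustration only.
pairs = [("9", "Yes"), ("10", "Blank"), ("10", "No")]

# One dict per row; pandas turns the list of records into a DataFrame.
df = pd.DataFrame([{**shared, "Logline": logline, "Data": value} for logline, value in pairs])
print(df)
```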

Separate column data with a comma to two columns for dataframe

The data set I pulled from an API return looks like this:
([['Date', 'Value']],
[[['2019-08-31', 445000.0],
['2019-07-31', 450000.0],
['2019-06-30', 450000.0]]])
I'm trying to create a DataFrame with two columns from the data:
Date & Value
Here's what I've tried:
df = pd.DataFrame(city_data, index =['a', 'b'], columns =['Names'] .
['Names1'])
city_data[['Date','Value']] = city_data['Date'].str.split(',',expand=True)
city_data
city_data.append({"header": column_value,
"Value": date_value})
city_data = pd.DataFrame()
This code was used to create the dataset. I pulled the lists from the API return:
column_value = data["dataset"]["column_names"]
date_value = data["dataset"]["data"]
city_data = ([column_value], [date_value])
city_data
Instead of creating a dataframe with two columns from the data, in most cases I get the "TypeError: list indices must be integers or slices, not str"
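Is this what you are looking for?
import pandas as pd
d = ([['Date', 'Value']],
[[['2019-08-31', 445000.0],
['2019-07-31', 450000.0],
['2019-06-30', 450000.0]]])
# d[1][0] is the list of rows; d[0][0] is the list of column names
pd.DataFrame(d[1][0], columns=d[0][0])
Returns:
         Date     Value
0  2019-08-31  445000.0
1  2019-07-31  450000.0
2  2019-06-30  450000.0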
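The extra indexing is only needed because the two lists were wrapped in additional lists and a tuple. As a sketch, assuming column_value and date_value have the shapes shown in the question, they can be passed to the constructor directly. The literal values below stand in for the API response.

```python
import pandas as pd

# Stand-ins for data["dataset"]["column_names"] and data["dataset"]["data"].
column_value = ["Date", "Value"]
date_value = [
    ["2019-08-31", 445000.0],
    ["2019-07-31", 450000.0],
    ["2019-06-30", 450000.0],
]

# Rows and column names go straight into the constructor, with no wrapping.
city_data = pd.DataFrame(date_value, columns=column_value)

# Optional: parse the date strings into datetime values.
city_data["Date"] = pd.to_datetime(city_data["Date"])

print(city_data)
```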

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame that consists of one column ('Vals') whose values are dictionaries. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another with MeasureValues?
Try this:
frames = []
# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
for idx, row in df['Vals'].items():
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    frames.append(temp_df)
pd.concat(frames, axis=0)
Here's a way to get TradeId and MeasureValues (using your sample row above twice to illustrate the iteration):
new_df = pd.DataFrame()
for _, data in fff.iterrows():
    vals = data.iloc[0]  # the dictionary in the 'Vals' cell; .ix was removed in pandas 1.0
    d = {'TradeId': vals['TradeId']}
    d.update(vals['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Output (on Python 3.7+ the columns follow the dict's insertion order):
TradeId Amount
0 17304M -18.64
0 17304M -18.64
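Both loops above build many small DataFrames. An alternative sketch, assuming every dictionary has a TradeId key and a Measures list whose items contain a MeasureValues list, collects plain records first and constructs the DataFrame once. The second sample row is hypothetical.

```python
import pandas as pd

fff = pd.DataFrame({"Vals": [
    # Trimmed version of the sample row from the question.
    {"TradeId": "17304M",
     "Measures": [{"AssetName": "Ie0", "MeasureValues": [{"Amount": -18.64}]}]},
    # Hypothetical second row, for illustration only.
    {"TradeId": "EXAMPLE2",
     "Measures": [{"AssetName": "Ie0", "MeasureValues": [{"Amount": 1.5}]}]},
]})

# One record per measure value, tagged with the TradeId it belongs to.
records = [
    {"TradeId": vals["TradeId"], **measure_value}
    for vals in fff["Vals"]
    for measure in vals["Measures"]
    for measure_value in measure["MeasureValues"]
]
new_df = pd.DataFrame(records)

print(new_df)
```

Unlike the loops above, this version walks every entry in Measures and MeasureValues rather than only the first one.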
