Merging a single-level DataFrame with a 3-level DataFrame - python

Single-level DataFrame:
data1 = {'Sr.No.': Sr_no,
         'CompanyNames': Company_Names,
         'YourChoice1': Your_Choice,
         'YourChoice2': Your_Choice}
df1 = pd.DataFrame(data1, columns=pd.Index(['Sr.No.', 'CompanyNames', 'YourChoice1', 'YourChoice2'], name='key'))
Output of single-level dataframe in csv file:
3-level dataframe:
form = {'I1': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F3': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']}},
        'I2': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']}}}
headers,values,data = CSV_trial.DATA(form)
cols = pd.MultiIndex.from_tuples(headers, names=['ind','field','data'])
df2 = pd.DataFrame(data, columns=cols)
Output of 3-level dataframe in csv file:
I want to merge these dataframes, with df1 on the left and df2 on the right...
Desired Output:
Can anyone help me with this?

An easy way is to transform the single-level df into a 3-level one, then concat the two df's of the same structure.
Importing necessary packages:
import pandas as pd
import numpy as np
Creating a native 3-level index. You can read it from a csv, xml, etc.
native_lvl_3_index_tup = [('A', 'foo1', 1), ('A', 'foo2', 3),
                          ('B', 'foo1', 1), ('B', 'foo2', 3),
                          ('C', 'foo1', 1), ('C', 'foo2', 3)]
variables = [33871648, 37253956,
             18976457, 19378102,
             20851820, 25145561]
native_lvl_3_index = pd.MultiIndex.from_tuples(native_lvl_3_index_tup)
A function converting a native single-level index to a 3-level one:
def single_to_3_lvl(single_index_list, val_lvl_0, val_lvl_1):
    multiindex_tuple = [(val_lvl_0, val_lvl_1, i) for i in single_index_list]
    return pd.MultiIndex.from_tuples(multiindex_tuple)
Use this function to get an artificial 3-level index:
single_index = [1,2,3,4,5,6]
artificial_multiindex = single_to_3_lvl(single_index,'A','B')
Creating dataframes, transposing to move multiindex to columns (as in the question):
df1 = pd.DataFrame(variables,artificial_multiindex).T
df2 = pd.DataFrame(variables,native_lvl_3_index).T
I used the same variables in both dataframes. You can control the concatenation by setting join='outer' or join='inner' in pd.concat():
result = pd.concat([df1,df2],axis = 1)
The variable result contains the concatenated dataframes. If you have a single-level indexed dataframe, you can reindex it:
single_level_df = pd.DataFrame(single_index,variables)
reindexed = single_level_df.reindex(artificial_multiindex).T
Again, I transpose (.T) to work with columns. This can be set up differently when creating the dataframes.
Hope my answer helped.
I used some code from the link: https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
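Applied back to the question's setup, here is a minimal end-to-end sketch of the same idea (the toy data and the placeholder upper-level values 'X' and 'Y' are assumptions, not the asker's real values):
import pandas as pd

# df1: a small single-level-column frame standing in for the question's df1
df1 = pd.DataFrame({'Sr.No.': [1, 2], 'CompanyNames': ['a', 'b']})

# lift df1's columns into a 3-level MultiIndex using placeholder upper levels
df1.columns = pd.MultiIndex.from_tuples(
    [('X', 'Y', c) for c in df1.columns], names=['ind', 'field', 'data'])

# df2: a small 3-level-column frame standing in for the question's df2
cols = pd.MultiIndex.from_tuples(
    [('I1', 'F1', 'PD'), ('I1', 'F1', 'CD')], names=['ind', 'field', 'data'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# df1 on the left, df2 on the right, aligned on the row index
result = pd.concat([df1, df2], axis=1)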

Related

Getting a dictionary of lists that contain elements from a column using a groupby

I have a dataframe that looks like this, with 1 string column and 1 int column.
import random
import pandas as pd

columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all values of column_B grouped by column_A, like this:
What I did to achieve this was to use a groupby to get the number of occurrences for column_B:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
And then use some list comprehensions to build the list for each column_A value by hand and add it to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a method that is able to achieve that within the groupby statement. But I think you could try something like this alternatively:
import random
import pandas as pd
columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist() for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: for each distinct column_A value, it takes all corresponding column_B values and stores them in the dict as a list, keyed by that column_A value.
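For reference, a groupby-based sketch that should produce the same mapping, given the same my_df as above:
# collect the column_B values of each column_A group into a list, then into a dict
final_dict_gb = my_df.groupby('column_A')['column_B'].apply(list).to_dict()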

Extract values from dictionary and conditionally assign them to columns in pandas

I am trying to extract values from a column of dictionaries in pandas and assign them to their respective columns that already exist. I have hardcoded an example below of the data set that I have:
import numpy as np
import pandas as pd

df_have = pd.DataFrame({
    'value_column': [np.nan, np.nan, np.nan],
    'date': [np.nan, np.nan, np.nan],
    'string_column': [np.nan, np.nan, np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]],
})
df_have
df_want = pd.DataFrame({
    'value_column': [40, 30, 10],
    'date': ['2017-08-01', np.nan, '2016-12-01'],
    'string_column': [np.nan, 'abc', np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]],
})
df_want
I have managed to extract the values out of the dictionaries using loops:
for row in range(len(df_have)):
    row_holder = df_have.dict[row]
    number_of_dictionaries_in_the_row = len(row_holder)
    for dictionary in range(number_of_dictionaries_in_the_row):
        variable_holder = df_have.dict[row][dictionary].keys()
        variable = list(variable_holder)[0]
        value = df_have.dict[row][dictionary].get(variable)
I now need to somehow conditionally turn df_have into df_want. I am happy to take a completely new approach and recreate the whole thing from scratch. We could even assume that I only have a dataframe with the dictionaries and nothing else.
You could use pandas string methods to pull the data out, although nesting data structures within pandas like this is inefficient:
df_have.loc[:, "value_column"] = df_have["dict"].str.get(0).str.get("value_column")
df_have.loc[:, "date"] = df_have["dict"].str.get(-1).str.get("date")
df_have.loc[:, "string_column"] = df_have["dict"].str.get(-1).str.get("string_column")
   value_column        date string_column                                               dict
0            40  2017-08-01          None     [{'value_column': 40}, {'date': '2017-08-01'}]
1            30        None           abc   [{'value_column': 30}, {'string_column': 'abc'}]
2            10  2016-12-01          None     [{'value_column': 10}, {'date': '2016-12-01'}]
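If the position of each dict inside the list can vary, a more order-agnostic sketch (assuming every inner dict holds exactly one key) is to merge the dicts of each row first:
# merge the one-key dicts of each row into a single dict per row
merged = df_have['dict'].apply(lambda ds: {k: v for d in ds for k, v in d.items()})
extracted = pd.DataFrame(merged.tolist(), index=df_have.index)
for col in ['value_column', 'date', 'string_column']:
    df_have[col] = extracted.get(col)  # None for keys that never occur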

Convert pandas dataframe group of values to multiple lists

I have pandas dataframe, where I listed items, and categorised them:
col_name    | col_group
------------|----------
id          | Metadata
listing_url | Metadata
scrape_id   | Metadata
name        | Text
summary     | Text
space       | Text
To reproduce:
import pandas
df = pandas.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['scrape_id', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text'],
    ['space', 'Text']],
    columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name','summary','space']
This is to allow me to pass these lists of columns to pandas in order to drop columns.
I googled a lot and got stuck: all answers are about converting lists to df, not vice versa. Should I aim to convert into dictionary, or list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage would be just my_vars['Text'] to access the Text list, etc. If you must have these as distinct names, you can force them upon your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However, I would advise against that, as you might unwittingly overwrite some of your other objects, or they might not end up in the scope you need (e.g. locals).
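Tying this back to the asker's stated goal of dropping columns by category, a short self-contained usage sketch (both frames rebuilt from the question's own snippets):
import pandas

# the category mapping and the wide data frame from the question
mapping = pandas.DataFrame(
    [['id', 'metadata'], ['listing_url', 'metadata'], ['scrape_id', 'metadata'],
     ['name', 'Text'], ['summary', 'Text'], ['space', 'Text']],
    columns=['col_name', 'col_group'])
wide = pandas.DataFrame(
    [[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])

my_vars = mapping.groupby('col_group').agg(list)['col_name'].to_dict()
# drop every column categorised as metadata
wide = wide.drop(columns=my_vars['metadata'])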

Handle nested lists in pandas

How can I turn a nested list with dicts inside into extra columns in a dataframe in Python?
I received information within a dict from an API,
{'orders': [
    {'orderId': '2838168630',
     'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361764421',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 1}]},
    {'orderId': '2708182540',
     'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361749496',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 3}]},
    {'orderId': '2490844970',
     'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
     'orderItems': [{'orderItemId': 'BFC0000287505870',
                     'ean': '234234234234234',
                     'cancelRequest': True,
                     'quantity': 1}]}
]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(received_data.get('orders'))
output:
orderId  date   orderItems
1        1-12   [{'orderItemId': 'dfs13', 'ean': '34234'}]
2        etc.
...
I would like to have something like this
orderId  date   orderItemId  ean
1        1-12   dfs13        34234
2        etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so I could extract the values. However, I then still end up with a list from which I need to extract another list, which holds the dict.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat the temp_df and the original df column-wise
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = json_normalize(temp_df[0])
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
Just to make it clear: you are receiving JSON from the API, so you can try the function json_normalize.
Try this:
import pandas as pd
from pandas.io.json import json_normalize
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Serializing inner dict
sub_df = json_normalize(df["oderItems"])
# Dropping the unserialized column
df = df.drop(["oderItems"], axis=1)
# joining both dataframes.
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
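For the full 'orders' payload shown in the question, a sketch using json_normalize's record_path and meta arguments might flatten everything in one call (received_data is assumed to be the API dict from the question; newer pandas exposes the same function as pd.json_normalize):
# one row per order item, with the order-level fields repeated alongside
flat = json_normalize(received_data['orders'],
                      record_path='orderItems',
                      meta=['orderId', 'dateTimeOrderPlaced'])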

How to convert the data frame values and its headers into a list with python?

I am trying to convert a data frame into a list while keeping the header names, but I am unable to achieve this.
I am downloading the data from a SQL database and then I convert this data into a data frame:
import pypyodbc
import pandas as pd
from datetime import datetime

initial_date = datetime(2017, 1, 1, 0, 0, 0)
end_date = datetime(2017, 6, 1, 0, 0, 0)
sql_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="you-dont-know-me",
                                  pwd="guess...", Trusted_Connection="No")
# execute the SP to retrieve data
retrieve_database_values = "[DEV].[SP].[QA_ExportV2] @start_date='{start_date:%Y-%m-%d}', " \
                           "@end_date='{end_date:%Y-%m-%d}'".format(start_date=initial_date, end_date=end_date)
df = pd.read_sql_query(retrieve_database_values, sql_connection)
Then the way I convert this data frame into lists is with the following code:
df.values.tolist()
Which gives me the results:
[[100008115, 'CAS.Santa', 'CAS.Santa-2', 'Yes', 'Transferred', Timestamp('2017-03-11 08:15:00'), ...],
[100008116, 'Springfield', 'Springfield:H3', 'Yes','Traffic Variation', Timestamp('2017-09-11 00:00:00'), ...],
[...],[...]]
However, I want to be able to retrieve the data values and the header names of the data frame, something like this:
[['id', 100008115, 'site','CAS.Santa', 'site name','CAS.Santa-2', 'new','Yes', 'status','Transferred', 'initial date' ,Timestamp('2017-03-11 08:15:00'), ...],
['id',100008116, 'site','Springfield', 'site name','Springfield:H3', 'new','Yes', 'status','Traffic Variation', 'initial date' ,Timestamp('2017-09-11 00:00:00'), ...],
[...],[...]]
or if possible something like this:
[[('id', 100008115), ('site','CAS.Santa'), ('site name','CAS.Santa-2'), ('new','Yes'), ('status','Transferred'), ('initial date' ,Timestamp('2017-03-11 08:15:00')), (...)],
[('id',100008116), ('site','Springfield'), ('site name','Springfield:H3'), ('new','Yes'), ('status','Traffic Variation'), ('initial date' ,Timestamp('2017-09-11 00:00:00')), (...)],
[...],[...]]
One of the options of DataFrame.to_dict() should work.
import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 3]})
>>> df
a b
0 1 2
1 2 3
2 3 3
>>>
>>> df.to_dict('records')
[{'a': 1, 'b': 2}, {'a': 2, 'b': 3}, {'a': 3, 'b': 3}]
>>> result = df.to_dict('records')
>>> for thing in result:
... print(list(thing.items()))
[('a', 1), ('b', 2)]
[('a', 2), ('b', 3)]
[('a', 3), ('b', 3)]
>>>
Data frame is just an intermediate step to achieve my desired result.
Seems like you could get your result directly from the output of the stored procedure. I have no way to test this, but looking through the pypyodbc wiki
I came up with this alternative to the DataFrame...
Create a cursor from your connection object
cursor = sql_connection.cursor()
Reformat the stored procedure call and execute it
sp = "{{CALL {}}}".format(retrieve_database_values)
cursor.execute(sp)
Then, from method three of the nice Hello World script:
query_results = [dict(zip([column[0] for column in cursor.description], row)) for row in cursor.fetchall()]
query_results should be a list of dicts like result from my DataFrame.to_dict() solution.
If I am reading that comprehension correctly, zip produces tuples so I think what you want is
query_results = [list(zip([column[0] for column in cursor.description], row)) for row in cursor.fetchall()]
# OR
query_results = []
for row in cursor.fetchall():
    column_names = [column[0] for column in cursor.description]
    query_results.append(list(zip(column_names, row)))
I imagine that could be refined.
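For instance, a lightly refined sketch that reads the column names from the cursor description only once:
# cursor.description is the same for every row, so compute the names up front
column_names = [column[0] for column in cursor.description]
query_results = [list(zip(column_names, row)) for row in cursor.fetchall()]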
