Single-level DataFrame:
data1 = {'Sr.No.': Sr_no,
'CompanyNames': Company_Names,
'YourChoice1': Your_Choice,
'YourChoice2': Your_Choice}
df1 = pd.DataFrame(data1, columns = pd.Index(['Sr.No.', 'CompanyNames','YourChoice1','YourChoice2'], name='key'))
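(Sr_no, Company_Names and Your_Choice are not defined in the question; placeholder lists like the following make the snippet above runnable:)
Sr_no = [1, 2, 3]
Company_Names = ['CompA', 'CompB', 'CompC']
Your_Choice = ['x', 'y', 'z']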
Output of single-level dataframe in csv file:
3-level dataframe:
form = {'I1': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F3': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']}
},
'I2': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']}
}
}
headers,values,data = CSV_trial.DATA(form)
cols = pd.MultiIndex.from_tuples(headers, names=['ind','field','data'])
df2 = pd.DataFrame(data, columns=cols)
Output of 3-level dataframe in csv file:
I want to merge these dataframes, with df1 on the left and df2 on the right...
Desired Output:
Can anyone help me with this?
An easy way is to transform the single-level df into a 3-level one, then concat two DataFrames with the same structure.
Importing necessary packages:
import pandas as pd
import numpy as np
Creating a native 3-level index (in practice you might read it from a CSV, XML, etc.):
native_lvl_3_index_tup = [('A','foo1', 1), ('A','foo2', 3),
('B','foo1', 1), ('B','foo2', 3),
('C','foo1', 1), ('C','foo2', 3)]
variables = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
native_lvl_3_index = pd.MultiIndex.from_tuples(native_lvl_3_index_tup)
A function converting a native single-level index into a 3-level one:
def single_to_3_lvl(single_index_list, val_lvl_0, val_lvl_1):
    multiindex_tuple = [(val_lvl_0, val_lvl_1, i) for i in single_index_list]
    return pd.MultiIndex.from_tuples(multiindex_tuple)
Use this function to get an artificial 3-level index:
single_index = [1,2,3,4,5,6]
artificial_multiindex = single_to_3_lvl(single_index,'A','B')
Creating the dataframes and transposing, to move the multiindex onto the columns (as in the question):
df1 = pd.DataFrame(variables,artificial_multiindex).T
df2 = pd.DataFrame(variables,native_lvl_3_index).T
I used the same variables in both dataframes. You can control the concatenation by passing join='outer' or join='inner' to pd.concat():
result = pd.concat([df1,df2],axis = 1)
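For example, join='inner' keeps only the row labels common to both frames (with the frames above both joins give the same result, since each has the single row label 0):
result_inner = pd.concat([df1, df2], axis=1, join='inner')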
The variable result contains the concatenated dataframes. If you have a dataframe with a single-level index, you can attach the 3-level index to it (note that reindex would try to align on the existing labels and produce NaNs here, so set_axis is the safer choice):
single_level_df = pd.DataFrame(single_index, variables)
reindexed = single_level_df.set_axis(artificial_multiindex).T
Again, I transpose (.T) to work with columns. This can be set up differently when creating the dataframes.
Hope my answer helped.
I used some code from the link: https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
I have a dataframe that looks like this, with 1 string column and 1 int column.
import random
import pandas as pd

columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all values of column_B grouped by column_A, like this:
What I did to achieve this was to use a groupby to get the number of occurrences for column_B:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
And then use some list comprehensions to build the list for each column_A value by hand and add them to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a method that can achieve this within the groupby statement itself, but you could try something like this instead:
import random
import pandas as pd
columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist() for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: for each distinct column_A value it takes all corresponding column_B values and stores them in the dict as a list, keyed by that column_A value.
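For completeness, a groupby can in fact get there directly; a sketch using the same my_df:
final_dict = my_df.groupby('column_A')['column_B'].apply(list).to_dict()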
I am trying to extract values from a column of dictionaries in pandas and assign them to their respective columns that already exist. I have hardcoded an example below of the data set that I have:
import numpy as np
import pandas as pd

df_have = pd.DataFrame({
    'value_column': [np.nan, np.nan, np.nan],
    'date': [np.nan, np.nan, np.nan],
    'string_column': [np.nan, np.nan, np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]
})
df_have
df_want = pd.DataFrame({
    'value_column': [40, 30, 10],
    'date': ['2017-08-01', np.nan, '2016-12-01'],
    'string_column': [np.nan, 'abc', np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]
})
df_want
I have managed to extract the values out of the dictionaries using loops:
for row in range(len(df_have)):
    row_holder = df_have.dict[row]
    number_of_dictionaries_in_the_row = len(row_holder)
    for dictionary in range(number_of_dictionaries_in_the_row):
        variable_holder = df_have.dict[row][dictionary].keys()
        variable = list(variable_holder)[0]
        value = df_have.dict[row][dictionary].get(variable)
I now need to somehow conditionally turn df_have into df_want. I am happy to take a completely new approach and recreate the whole thing from scratch. We could even assume that I only have a dataframe with the dictionaries and nothing else.
You could use the pandas string methods to pull the data out, although nesting data structures inside a DataFrame like this is inefficient:
df_have.loc[:, "value_column"] = df_have["dict"].str.get(0).str.get("value_column")
df_have.loc[:, "date"] = df_have["dict"].str.get(-1).str.get("date")
df_have.loc[:, "string_column"] = df_have["dict"].str.get(-1).str.get("string_column")
value_column date string_column dict
0 40 2017-08-01 None [{'value_column': 40}, {'date': '2017-08-01'}]
1 30 None abc [{'value_column': 30}, {'string_column': 'abc'}]
2 10 2016-12-01 None [{'value_column': 10}, {'date': '2016-12-01'}]
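Note that .str.get(0) and .str.get(-1) depend on the position of each dict within the list. A position-independent sketch would merge the dicts in each row first and build the columns from that (same df_have as above):
merged = pd.DataFrame(
    [{k: v for d in row for k, v in d.items()} for row in df_have['dict']]
)
df_filled = df_have[['dict']].join(merged)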
I have a pandas dataframe where I listed items and categorised them:
col_name     | col_group
-------------|----------
id           | Metadata
listing_url  | Metadata
scrape_id    | Metadata
name         | Text
summary      | Text
space        | Text
To reproduce:
import pandas
df = pandas.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['scrape_id', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text'],
    ['space', 'Text']],
    columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name','summary','space']
This is to allow me to pass these lists of columns to pandas and drop columns.
I googled a lot and got stuck: all answers are about converting lists to a df, not vice versa. Should I aim to convert into a dictionary, or a list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas

df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)

for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
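For example, reusing the printing loop from above:
for row in df[['name', 'summary', 'space']].iterrows():
    print(row[1].to_list())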
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage would be just my_vars['Text'] to access the Text list, and so on. If you must have these as distinct names, you can force them upon your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However, I would advise against that, as you might unwittingly overwrite some of your other objects, or they might not end up in the scope you need (e.g. locals).
How can I turn a nested list with dicts inside into extra columns in a dataframe in Python?
I received the following information within a dict from an API:
{'orders':
[
{ 'orderId': '2838168630',
'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361764421',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 1}
]},
{ 'orderId': '2708182540',
'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361749496',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 3}
]},
{ 'orderId': '2490844970',
'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
'orderItems': [{ 'orderItemId': 'BFC0000287505870',
'ean': '234234234234234',
'cancelRequest': True,
'quantity': 1}
]}
]
}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(received_data.get('orders'))
output:
orderId    date    orderItems
1          1-12    [{'orderItemId': 'dfs13', 'ean': '34234'}]
2          etc.
...
I would like to have something like this:
orderId    date    orderItemId    ean
1          1-12    dfs13          34234
2          etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so I can extract the values from it. However, I then still end up with a list from which I need to extract another list, which contains the dict.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat temp_df and the original df column-wise
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = json_normalize(temp_df[0])
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I used .join() to combine both tables based on the existing index.
Just to make it clear: you are receiving JSON from the API, so you can try the function json_normalize.
Try this:
import pandas as pd
from pandas.io.json import json_normalize
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"],
                   "orderItems": [{'orderItemId': 'dfs13', 'ean': '34234'}]})
# Serializing the inner dicts
sub_df = json_normalize(df["orderItems"])
# Dropping the unserialized column
df = df.drop(["orderItems"], axis=1)
# Joining both dataframes
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
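As an aside, in pandas 1.0+ json_normalize is available at the top level as pd.json_normalize (the pandas.io.json import is deprecated), and it can flatten the original API response in one step via record_path and meta. A sketch, assuming the orders dict from the question is bound to received_data:
import pandas as pd

flat = pd.json_normalize(
    received_data['orders'],                  # the list of order dicts
    record_path='orderItems',                 # one row per order item
    meta=['orderId', 'dateTimeOrderPlaced']   # carry order-level fields along
)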
I am trying to convert a data frame into a list, and I want to keep the header names, but I am unable to achieve this.
I am downloading the data from a SQL database and then I convert this data into a data frame:
import pypyodbc
import pandas as pd
from datetime import datetime

initial_date = datetime(2017, 1, 1)
end_date = datetime(2017, 6, 1)
sql_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="you-dont-know-me",
                                  pwd="guess...", Trusted_Connection="No")
# execute the SP to retrieve data
retrieve_database_values = "[DEV].[SP].[QA_ExportV2] @start_date='{start_date:%Y-%m-%d}', " \
                           "@end_date='{end_date:%Y-%m-%d}'".format(start_date=initial_date, end_date=end_date)
df = pd.read_sql_query(retrieve_database_values, sql_connection)
Then the way I convert this data frame into lists is with the following code:
df.values.tolist()
Which gives me the results:
[[100008115, 'CAS.Santa', 'CAS.Santa-2', 'Yes', 'Transferred', Timestamp('2017-03-11 08:15:00'), ...],
[100008116, 'Springfield', 'Springfield:H3', 'Yes','Traffic Variation', Timestamp('2017-09-11 00:00:00'), ...],
[...],[...]]
However, I want to retrieve both the data values and the header names of the data frame, something like this:
[['id', 100008115, 'site','CAS.Santa', 'site name','CAS.Santa-2', 'new','Yes', 'status','Transferred', 'initial date' ,Timestamp('2017-03-11 08:15:00'), ...],
['id',100008116, 'site','Springfield', 'site name','Springfield:H3', 'new','Yes', 'status','Traffic Variation', 'initial date' ,Timestamp('2017-09-11 00:00:00'), ...],
[...],[...]]
or if possible something like this:
[[('id', 100008115), ('site','CAS.Santa'), ('site name','CAS.Santa-2'), ('new','Yes'), ('status','Transferred'), ('initial date' ,Timestamp('2017-03-11 08:15:00')), (...)],
[('id',100008116), ('site','Springfield'), ('site name','Springfield:H3'), ('new','Yes'), ('status','Traffic Variation'), ('initial date' ,Timestamp('2017-09-11 00:00:00')), (...)],
[...],[...]]
One of the options of DataFrame.to_dict() should work.
import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 3]})
>>> df
a b
0 1 2
1 2 3
2 3 3
>>>
>>> df.to_dict('records')
[{'a': 1, 'b': 2}, {'a': 2, 'b': 3}, {'a': 3, 'b': 3}]
>>> result = df.to_dict('records')
>>> for thing in result:
...     print(list(thing.items()))
[('a', 1), ('b', 2)]
[('a', 2), ('b', 3)]
[('a', 3), ('b', 3)]
>>>
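For the list-of-tuple-pairs format from the question, the same idea works as a one-liner:
result = [list(record.items()) for record in df.to_dict('records')]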
Data frame is just an intermediate step to achieve my desired result.
Seems like you could get your result directly from the output of the stored procedure. I have no way to test this, but looking through the pypyodbc wiki I came up with this alternative to the DataFrame...
Create a cursor from your connection object:
cursor = sql_connection.cursor()
Reformat the stored procedure call and execute it:
sp = "{{CALL {}}}".format(retrieve_database_values)
cursor.execute(sp)
Then, from method three of the nice Hello World script:
query_results = [dict(zip([column[0] for column in cursor.description], row)) for row in cursor.fetchall()]
query_results should be a list of dicts, like result from my DataFrame.to_dict() solution.
If I am reading that comprehension correctly, zip produces tuples, so I think what you want is:
query_results = [list(zip([column[0] for column in cursor.description], row)) for row in cursor.fetchall()]
# OR
query_results = []
for row in cursor.fetchall():
    column_names = [column[0] for column in cursor.description]
    query_results.append(list(zip(column_names, row)))
I imagine that could be refined.
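One obvious refinement (a sketch): cursor.description does not change between rows, so the column names can be computed once, outside the loop:
column_names = [column[0] for column in cursor.description]
query_results = [list(zip(column_names, row)) for row in cursor.fetchall()]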