I have a pandas DataFrame in which I listed items and categorised them:
col_name    | col_group
------------|----------
id          | Metadata
listing_url | Metadata
scrape_id   | Metadata
name        | Text
summary     | Text
space       | Text
To reproduce:
import pandas
df = pandas.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['scrape_id', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text'],
    ['space', 'Text']],
    columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group"?
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name', 'summary', 'space']
This is to allow me to pass these lists of columns to pandas and drop columns.
I googled a lot and got stuck: all answers are about converting lists to a df, not vice versa. Should I aim to convert it into a dictionary, or a list of lists?
I have over 100 rows belonging to 10 categories, so I would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage is simply my_vars['Text'] to access the Text list, and so on. If you must have these as distinct names, you can force them into your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However, I would advise against that, as you might unwittingly overwrite some of your other objects, or they might not end up in the scope you need (e.g. locals).
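Since the stated goal is to drop columns by group, a minimal usage sketch (assuming a hypothetical frame named listings that actually contains those columns):
# my_vars as built above, e.g. {'Text': [...], 'metadata': [...]}
# errors='ignore' skips any group member missing from the frame
listings_no_meta = listings.drop(columns=my_vars['metadata'], errors='ignore')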
Related
I have a dataframe that looks like this, with one string column and one int column.
import random
import pandas as pd

columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all the values of column_B grouped by column_A.
What I did to approach this was to use a groupby to get the number of occurrences for column_B:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
And then use some list comprehensions to create my lists by hand for each column_A value and add them to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a method that can achieve this within the groupby statement itself, but I think you could try something like this instead:
import random
import pandas as pd
columns = ['EG', 'EC', 'FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist()
              for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: it takes all the column_B values that correspond to a given column_A value and stores them in the dict as a list, with the column_A values as keys.
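For reference, a groupby-based route should produce the same mapping more directly (a sketch using the same my_df as above):
my_dict = my_df.groupby('column_A')['column_B'].apply(list).to_dict()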
How can I turn a nested list with dicts inside into extra columns in a dataframe in Python?
I received information as a dict from an API:
{'orders': [
    {'orderId': '2838168630',
     'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361764421',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 1}]},
    {'orderId': '2708182540',
     'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
     'orderItems': [{'orderItemId': 'BFC0000361749496',
                     'ean': '234234234234234',
                     'cancelRequest': False,
                     'quantity': 3}]},
    {'orderId': '2490844970',
     'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
     'orderItems': [{'orderItemId': 'BFC0000287505870',
                     'ean': '234234234234234',
                     'cancelRequest': True,
                     'quantity': 1}]}
]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(received_data.get('orders'))
output:
orderId   date   orderItems
1         1-12   [{'orderItemId': 'dfs13', 'ean': '34234'}]
2         etc.
...
I would like to have something like this:
orderId   date   orderItemId   ean
1         1-12   dfs13         34234
2 etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so I could extract the values again. However, I then still end up with a list from which I need to extract another list, which has the dict in it.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat the temp_df and original df side by side (axis=1)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
from pandas.io.json import json_normalize

# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = json_normalize(temp_df[0])
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
Just to make it clear: you are receiving JSON from the API, so you can try the function json_normalize.
Try this:
import pandas as pd
from pandas.io.json import json_normalize
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Serializing inner dict
sub_df = json_normalize(df["oderItems"])
# Dropping the unserialized column
df = df.drop(["oderItems"], axis=1)
# joining both dataframes.
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
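If you still have the raw API payload, json_normalize can also flatten the nested order items in a single call via its record_path and meta arguments (a sketch assuming the received_data dict from the question):
flat = json_normalize(received_data['orders'],
                      record_path='orderItems',
                      meta=['orderId', 'dateTimeOrderPlaced'])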
Single-level DataFrame:
# Sr_no, Company_Names and Your_Choice are lists defined elsewhere
data1 = {'Sr.No.': Sr_no,
         'CompanyNames': Company_Names,
         'YourChoice1': Your_Choice,
         'YourChoice2': Your_Choice}
df1 = pd.DataFrame(data1, columns=pd.Index(['Sr.No.', 'CompanyNames', 'YourChoice1', 'YourChoice2'], name='key'))
Output of single-level dataframe in csv file:
3-level dataframe:
form = {'I1': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F3': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']}
},
'I2': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']},
'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
'CD': ['1','2','3','4','5','6','7','8','9']}
}
}
# CSV_trial.DATA is a helper (not shown) that flattens the nested dict
headers, values, data = CSV_trial.DATA(form)
cols = pd.MultiIndex.from_tuples(headers, names=['ind','field','data'])
df2 = pd.DataFrame(data, columns=cols)
Output of 3-level dataframe in csv file:
I want to merge these dataframes, with df1 on the left and df2 on the right...
Desired Output:
Can anyone help me with this?
An easy way is to transform the single-level df into a 3-level one, then concat two dfs of the same structure.
Importing necessary packages:
import pandas as pd
import numpy as np
Creating a native 3-level index. You can read it from a csv, xml, etc.
native_lvl_3_index_tup = [('A','foo1', 1), ('A','foo2', 3),
('B','foo1', 1), ('B','foo2', 3),
('C','foo1', 1), ('C','foo2', 3)]
variables = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
native_lvl_3_index = pd.MultiIndex.from_tuples(native_lvl_3_index_tup)
Function, converting native single-level index to a 3-level:
def single_to_3_lvl(single_index_list, val_lvl_0, val_lvl_1):
    multiindex_tuple = [(val_lvl_0, val_lvl_1, i) for i in single_index_list]
    return pd.MultiIndex.from_tuples(multiindex_tuple)
Use this function to get an artificial 3-level index:
single_index = [1,2,3,4,5,6]
artificial_multiindex = single_to_3_lvl(single_index,'A','B')
Creating dataframes, transposing to move multiindex to columns (as in the question):
df1 = pd.DataFrame(variables, artificial_multiindex).T
df2 = pd.DataFrame(variables, native_lvl_3_index).T
I used the same variables in both dataframes. You can control the concatenation by setting join='outer' or join='inner' in pd.concat():
result = pd.concat([df1, df2], axis=1)
The variable result contains the concatenated dataframes. If you have a dataframe with a single-level index, you can reindex it:
single_level_df = pd.DataFrame(single_index, variables)
reindexed = single_level_df.reindex(artificial_multiindex).T
Again, I transpose (.T) to work with columns. This can be set up differently when creating the dataframes.
Hope my answer helped.
I used some code from the link: https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).
When I throw the dictionary into pd.DataFrame to try to convert it to a pandas DataFrame, it throws an "Arrays must be the same length" error. This is because it cannot process the keys that have multiple values (i.e. the keys whose values are lists).
How do I get pandas to treat the lists as 'single values'?
As a hypothetical example:
data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
I want to turn it into a DataFrame like this:
ix building DC? occupants
0 'White House' True ['Barack', 'Michelle', 'Sasha', 'Malia']
Compare passing the dict directly, which broadcasts the scalar values against the list (one row per occupant), with passing a list (of rows), which keeps the list as a single cell:
In [11]: pd.DataFrame(data)
Out[11]:
DC? building occupants
0 True White House Barack
1 True White House Michelle
2 True White House Sasha
3 True White House Malia
In [12]: pd.DataFrame([data])
Out[12]:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
This turns out to be very trivial in the end:
data = {'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
df = pandas.DataFrame([data])
print(df)
Which results in:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
A solution for making a dataframe from a dictionary of lists where the keys become a sorted index and the column names are provided. Good for creating dataframes from scraped HTML tables. (On Python 3.7+, d.keys() and d.values() iterate in the same insertion order, so the pairing is safe.)
d = {'B': [10, 11], 'A': [20, 21]}
df = pd.DataFrame(list(d.values()), columns=['C1', 'C2'], index=list(d.keys())).sort_index()
df
C1 C2
A 20 21
B 10 11
Would it be acceptable if instead of having one entry with a list of occupants, you had individual entries for each occupant? If so you could just do
n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n * [val]
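With that pre-processing done, every value has length n, so the plain constructor should build one row per occupant:
df = pd.DataFrame(data)  # four rows, one occupant each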
EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?
I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:
result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
'zombie': {'confidence': 1, 'ids': ['id3']}}
When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. The (for me) correct import would be to have these as row titles, and confidence and ids as column titles. To achieve this, I used pd.DataFrame.from_dict:
In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
confidence ids
hamster 1 [id1, id2]
zombie 1 [id3]
This works for me with python 3.8 + pandas 1.2.3.
If you know the keys of the dictionary beforehand, why not create an empty data frame first and then keep adding rows?
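A minimal sketch of that idea, reusing the example data from the question (note this is usually slower than building the frame in one go):
import pandas as pd

data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
df = pd.DataFrame(columns=list(data.keys()))  # empty frame with known columns
df.loc[len(df)] = [data['building'], data['DC?'], data['occupants']]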
I have what should be a simple problem, but three hours into trying different things I can't solve it.
I have pymysql returning results from a query. I can't share the exact example, but this straw man should do:
cur.execute("select name, address, phonenum from contacts")
This returns results perfectly, which I grab with
results = cur.fetchall()
and then convert to a list object exactly as I want it
data = list(results)
Unfortunately this doesn't include the header, but you can get it with cur.description (which contains metadata, including but not limited to the header). I push this into a list:
header = []
for n in cur.description:
    header.append(str(n[0]))
so my header looks like:
['name','address','phonenum']
and my results look like:
[['Tom','dublin','12345'],['Bob','Kerry','56789']]
I want to create a dataframe in pandas and then pivot it, but it needs column headers to work properly. I had previously been importing a completed csv into a pandas DF, which included the header, so this all worked smoothly. Now I need to get this data directly from the DB, so I was thinking: that's easy, I just join the two lists and hey presto, I have what I am looking for. But when I try to append, I actually wind up with this:
['name','address','phonenum',['Tom','dublin','12345'],['Bob','Kerry','56789']]
when i need this
[['name','address','phonenum'],['Tom','dublin','12345'],['Bob','Kerry','56789']]
Anyone any ideas?
Much appreciated!
Addition of lists concatenates contents:
In [17]: [1] + [2,3]
Out[17]: [1, 2, 3]
This is true even if the contents are themselves lists:
In [18]: [[1]] + [[2],[3]]
Out[18]: [[1], [2], [3]]
So:
In [13]: header = ['name','address','phonenum']
In [14]: data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
In [15]: [header] + data
Out[15]:
[['name', 'address', 'phonenum'],
['Tom', 'dublin', '12345'],
['Bob', 'Kerry', '56789']]
In [16]: pd.DataFrame(data, columns=header)
Out[16]:
name address phonenum
0 Tom dublin 12345
1 Bob Kerry 56789
Note that loading a DataFrame with data from a database can also be done with pandas.read_sql.
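For example (a sketch assuming an open pymysql connection named conn):
import pandas as pd

df = pd.read_sql("select name, address, phonenum from contacts", conn)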
Is that what you are looking for?
first = ['name', 'address', 'phonenum']
second = [['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
second = [first] + second
print(second)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
Other possibilities:
You could insert it into data at location 0 as a list:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0, header)
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
But if you are going to manipulate the header variable afterwards, you can insert a shallow copy of it instead:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0, header[:])
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]