I have this dataframe:
where the column 'items' is a list of dictionaries, each with 'item_id' and 'item_name' keys. I would like to split all this information within the same dataframe, like this:
You could simply use the explode method:
import pandas as pd
df = pd.DataFrame({'user_id': [1, 2], 'items': [[{'a':'1'},{'b':'2'}], [{'c':'3'},{'d':'4'}]]})
print(df)
print('===============================')
print(df.explode('items').reset_index(drop=True))
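If each element of items is a dict with 'item_id' and 'item_name' keys, as described in the question, you can also expand those dicts into their own columns after exploding. A minimal sketch with hypothetical item data:
import pandas as pd
# hypothetical data shaped like the question: each row holds a list of item dicts
df = pd.DataFrame({'user_id': [1, 2],
                   'items': [[{'item_id': 'A1', 'item_name': 'apple'},
                              {'item_id': 'A2', 'item_name': 'pear'}],
                             [{'item_id': 'B1', 'item_name': 'plum'}]]})
exploded = df.explode('items').reset_index(drop=True)
# turn each item dict into its own columns and join them back to user_id
result = pd.concat([exploded.drop(columns='items'),
                    pd.json_normalize(exploded['items'].tolist())], axis=1)
print(result)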
I have the following empty dataframe.
columns = [
'image_path',
'label',
'nose',
'neck',
'r_sho',
'r_elb',
'r_wri',
'l_sho',
'l_elb',
'l_wri',
'r_hip',
'r_knee',
'r_ank',
'l_hip',
'l_knee',
'l_ank',
'r_eye',
'l_eye',
'r_ear',
'l_ear']
df = pd.DataFrame(columns = columns)
And in every loop, I want to append a new dictionary to this dataframe, whose keys are the column names. I tried df = df.append(dictionary, ignore_index=True), yet only this message appears and the script never ends:
df = df.append(dictionary, ignore_index = True)
/home/karantai/opendr/projects/perception/lightweight_open_pose/demos/inference_demo.py:133: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Can I append a new dictionary as new row in dataframe, in each loop?
As @Amer implied, it's not a good idea to append rows to a dataframe one by one, because each time a new dataframe object has to be created, which is inefficient. Instead, you can merge all your dictionaries into one and then call pd.DataFrame() only once.
defaultdict is useful here, to automatically initialize the values of the full dictionary as lists. I would also use the get method for dictionaries, in case any one of your dictionaries is missing one of the keys. You can assign NumPy's nan as a placeholder for the missing values in that case.
In the end, pd.DataFrame() will pick up the column names from the dictionary keys. Here is an example with just three columns for simplicity:
from collections import defaultdict
import numpy as np
import pandas as pd
columns = ['image_path', 'label', 'nose']
dictionaries = [{'image_path': 'file1.jpg', 'label': 1, 'nose': 3},
                {'label': 2, 'nose': 2},
                {'image_path': 'file3.png', 'label': 3, 'nose': 1}]
full_dictionary = defaultdict(list)
for d in dictionaries:
    for key in columns:
        full_dictionary[key].append(d.get(key, np.nan))
df = pd.DataFrame(full_dictionary)
df
image_path label nose
0 file1.jpg 1 3
1 NaN 2 2
2 file3.png 3 1
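As a side note, since the row dictionaries are already collected in a list here, pd.DataFrame can also build the frame from that list directly, filling missing keys with NaN automatically. A minimal sketch reusing the dictionaries and columns defined above:
df = pd.DataFrame(dictionaries, columns=columns)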
I have columns with similar names but numeric suffixes that represent different occurrences of each column. For example, I have columns (company_1, job_title_1, location_1, company_2, job_title_2, location_2). I would like to order these columns grouped together by the prefix (before the underscore) and then sequentially by the suffix (after the underscore).
How I would like the columns to be: company_1, company_2, job_title_1, job_title_2, location_1, location_2.
Here's what I tried from this question:
df = df.reindex(sorted(df.columns), axis=1)
This resulted in the order: company_1, company_10, company_11 (skipping over 2-9)
This type of sorting is called natural sorting. (There are more details in Naturally sorting Pandas DataFrame which demonstrates how to sort rows using natsort)
Setup with natsort
import pandas as pd
from natsort import natsorted
df = pd.DataFrame(columns=[f'company_{i}' for i in [5, 2, 3, 4, 1, 10]])
# Before column sort
print(df)
df = df.reindex(natsorted(df.columns), axis=1)
# After column sort
print(df)
Before sort:
Empty DataFrame
Columns: [company_5, company_2, company_3, company_4, company_1, company_10]
Index: []
After sort:
Empty DataFrame
Columns: [company_1, company_2, company_3, company_4, company_5, company_10]
Index: []
Compared to lexicographic sorting with sorted:
df = df.reindex(sorted(df.columns), axis=1)
Empty DataFrame
Columns: [company_1, company_10, company_2, company_3, company_4, company_5]
Index: []
Just for the sake of completeness, you can also get the desired result by passing a function to sorted that splits the strings into name, index tuples.
def index_splitter(x):
    """Example input: 'job_title_1'
    Output: ('job_title', 1)
    """
    *name, index = x.split("_")
    return '_'.join(name), int(index)
df = df.reindex(sorted(df.columns, key=index_splitter), axis=1)
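For example, applied to a small hypothetical set of column names, the key groups by prefix and then sorts numerically within each group:
cols = ['job_title_2', 'company_10', 'company_1', 'job_title_1']
print(sorted(cols, key=index_splitter))
# ['company_1', 'company_10', 'job_title_1', 'job_title_2']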
I feel like this is a simple question, but I do not use pandas that often.
Suppose I have a dataframe like this:
import pandas as pd

d = {'repeat': [23, 25, 23], 'unique': [1, 2, 3]}
df = pd.DataFrame(data=d)
How do I obtain, using pandas built-in functions, a dataframe of this form:
grouped = {'unique': [[1,3],[2]], 'repeat': [23,25]}
dg = pd.DataFrame(data=grouped)
df -- pandas built-in operations --> dg?
Thanks
You can use groupby and agg:
df.groupby("repeat").agg(lambda x: list(x)).reset_index()
Given a dataframe df with columns 'serialnumber' and 'basicinfo', the column "basicinfo" is a dict {'name': xxx, 'model': xxx, 'studyid': xxx}.
Is there an easy way to filter this dataframe by the dict key "model"?
We can filter by 'serialnumber' since it is an integer:
df = df[df.serialnumber == <value>]
How to do that for dict column?
You won't get vectorized operations directly, but you can use apply to extract the dictionary value:
df = df[df.basicinfo.apply(lambda x: x['model'] == <value>)]
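A minimal sketch with hypothetical data, assuming every 'basicinfo' dict contains a 'model' key:
import pandas as pd

df = pd.DataFrame({'serialnumber': [100, 101, 102],
                   'basicinfo': [{'name': 'dev-a', 'model': 'X1', 'studyid': 7},
                                 {'name': 'dev-b', 'model': 'X2', 'studyid': 8},
                                 {'name': 'dev-c', 'model': 'X1', 'studyid': 9}]})
# keep only the rows whose dict has model == 'X1'
filtered = df[df.basicinfo.apply(lambda x: x['model'] == 'X1')]
print(filtered)  # rows with serialnumber 100 and 102
If some dictionaries might be missing the key, x.get('model') avoids a KeyError.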
I have a dataframe in which the third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[[1,2,'a']],[[1,2,'b']],[[1,2,'c']]])
Note, this is a simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. I imagine I could take each member of the nested list while keeping the other column values, then use a list comprehension to build more lists and combine them into a new dataframe... but that seems a bit too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
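For completeness, on pandas 0.25+ the list column can also be expanded directly with explode, the same method used in the first answer above; a minimal sketch with the question's frame (the column labels default to 0, 1, 2):
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]])
print(df.explode(2).reset_index(drop=True))
#    0  1  2
# 0  1  2  a
# 1  1  2  b
# 2  1  2  c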
Not exactly the same issue the OP described, but related (and more pandas-like) is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to keep the list position in the index
# NOTE this last step results in duplicate indexes