Append successive dictionaries to a DataFrame as rows - python

I have the following empty dataframe.
columns = [
    'image_path',
    'label',
    'nose',
    'neck',
    'r_sho',
    'r_elb',
    'r_wri',
    'l_sho',
    'l_elb',
    'l_wri',
    'r_hip',
    'r_knee',
    'r_ank',
    'l_hip',
    'l_knee',
    'l_ank',
    'r_eye',
    'l_eye',
    'r_ear',
    'l_ear']
df = pd.DataFrame(columns = columns)
In every loop iteration, I want to append a new dictionary to this dataframe, with the column names as keys. I tried df = df.append(dictionary, ignore_index=True), yet only this message appears and the script never ends:
df = df.append(dictionary, ignore_index = True)
/home/karantai/opendr/projects/perception/lightweight_open_pose/demos/inference_demo.py:133: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Can I append a new dictionary as new row in dataframe, in each loop?

As @Amer implied, it's not a good idea to append rows to a dataframe one by one, because each time a whole new dataframe object has to be created, which is inefficient. Instead, you can merge all your dictionaries into one and then call pd.DataFrame() only once.
defaultdict is useful here, to automatically initialize the values of the full dictionary as lists. I would also use the get method for dictionaries, in case any one of your dictionaries is missing one of the keys. You can assign NumPy's nan as a placeholder for the missing values in that case.
In the end, pd.DataFrame() will pick up the column names from the dictionary keys. Here is an example with just three columns for simplicity:
from collections import defaultdict
import numpy as np
import pandas as pd
columns = ['image_path', 'label', 'nose']
dictionaries = [{'image_path': 'file1.jpg', 'label': 1, 'nose': 3},
                {'label': 2, 'nose': 2},
                {'image_path': 'file3.png', 'label': 3, 'nose': 1}]
full_dictionary = defaultdict(list)
for d in dictionaries:
    for key in columns:
        full_dictionary[key].append(d.get(key, np.nan))
df = pd.DataFrame(full_dictionary)
df
  image_path  label  nose
0  file1.jpg      1     3
1        NaN      2     2
2  file3.png      3     1
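A shorter variant of the same idea, if it fits your setup: pd.DataFrame() also accepts a list of dicts directly and fills any missing keys with NaN, so the loop only needs to collect the dicts:

```python
import pandas as pd

# Collect one dict per loop iteration, then build the DataFrame once at the end.
rows = [{'image_path': 'file1.jpg', 'label': 1, 'nose': 3},
        {'label': 2, 'nose': 2},  # missing 'image_path' becomes NaN
        {'image_path': 'file3.png', 'label': 3, 'nose': 1}]
df = pd.DataFrame(rows)
```

This skips the defaultdict entirely, at the cost of letting pandas infer the column order from the dicts' keys.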

Related

How to correctly append values of a dictionary to an empty dataframe?

Hi, I am trying to create a dataframe that will have rows added to it in a for loop. So I decided to first create an empty version of the dataframe, then create a dictionary for a new row and append that dictionary to the dataframe in each iteration of the loop. The problem is that the values in the dataframe do not match the values that were in the dictionary:
I create an empty dataframe as follows:
import pandas
df = pandas.DataFrame({"a":[], "b":[], "c":[]})
I would then create a dictionary and append it to the dataframe like:
dict = {"a":1, "b":2, "c":True}
df = df.append(dict, ignore_index=True)
The problem is that instead of getting a=1, b=2, c=True, the dataframe has a=1.0, b=2.0, c=1.0.
So how can I make columns a and b integers, and column c a boolean value?
You should not convert the dictionary to a dataframe in the first line. Better to convert it at the end:
import pandas as pd
dicts = {"a": [], "b": [], "c": []}  # do not convert the dict to a df yet
new_dict = {"a": 1, "b": 2, "c": True}
for k, v in new_dict.items():
    dicts[k].append(v)
df = pd.DataFrame(dicts)
Output:
df
a b c
0 1 2 True
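If you already have a DataFrame whose columns were upcast to float by repeated appends, an alternative is to restore the intended dtypes afterwards with astype, passing a column-to-type mapping:

```python
import pandas as pd

# Simulate the upcast result of df.append: everything became float64.
df = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [1.0]})
# Cast each column back to its intended type.
df = df.astype({"a": int, "b": int, "c": bool})
```

This works column by column, so mixed-type rows are no longer a problem.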

Keep rows in dataframe where value in each list is in a list

I have a dataframe that looks like the following.
import pandas as pd
# initialise data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Lists': ["4,67,3,4,53,32", "7,3,44,2,5,6,9", "8,9,23", "9,36,21,32"]}
# Create DataFrame
df = pd.DataFrame(data)
I want to keep the rows where the 'Lists' entry contains any value from the pre-defined list [1,2,3,4,5].
What would be the most efficient and fastest way of doing it?
I'd like to avoid a for loop, so I'm asking the pandas-proficient: what's the best way to achieve this?
In the example above, this would keep only the rows for 'Tom' and 'nick'.
Many thanks!
This would work:
values = set(str(i) for i in [1, 2, 3, 4, 5]) # note the set
idx = df['Lists'].str.split(',').map(lambda x: len(values.intersection(x)) > 0)
df.loc[idx, 'Name']
0 Tom
1 nick
Name: Name, dtype: object
First convert the values to a set for faster membership tests (if you have many values), then filter rows where 'Lists' intersects the values.
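An equivalent spelling of the same test, if you prefer it: set.isdisjoint answers the membership question directly without building the intersection set:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Lists': ["4,67,3,4,53,32", "7,3,44,2,5,6,9",
                             "8,9,23", "9,36,21,32"]})
values = {str(i) for i in [1, 2, 3, 4, 5]}
# True for rows whose comma-separated items share at least one value with the set
mask = df['Lists'].str.split(',').map(lambda items: not values.isdisjoint(items))
result = df.loc[mask, 'Name']
```

isdisjoint can short-circuit as soon as one common element is found, so it never allocates an intermediate set.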

Dictionary to Dataframe Error: "If using all scalar values, you must pass an index"

Currently, I am using a for loop to read csv files from a folder.
After reading the csv file, I am storing the data into one row of a dictionary.
When I print the data types using "print(list_of_dfs.dtypes)" I receive:
dtype: object
DATETIME : object
VALUE : float64
ID : int64
ID Name: object.
Note that this is a nested dictionary with thousands of values stored in each of these data fields. I have 26 rows of the structure listed above. I am trying to append the dictionary rows into a dataframe where I will have only 1 row consisting of the datafields:
Index DATETIME VALUE ID ID Name.
Note: I am learning python as I go.
I tried using an array to store the data and then convert the array to a dataframe but I could not append the rows of the dataframe.
Using the dictionary method I attempted df = pd.DataFrame(list_of_dfs).
This throws an error.
list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])
    # regionLoadArray contains my file names from the list directory.
dataframe = pd.DataFrame(list_of_dfs)
# this method was suggested at thispoint.com for nested dictionaries.
# This is where my error occurs^
ValueError: If using all scalar values, you must pass an index
I appreciate any assistance with this issue as I am new to python.
My current goal is to simply produce a dataframe with my headers that I can then send to a csv.
Depending on your needs, a simple workaround could be:
dct = {'col1': 'abc', 'col2': 123}
dct = {k:[v] for k,v in dct.items()} # WORKAROUND
df = pd.DataFrame(dct)
which results in
print(df)
col1 col2
0 abc 123
This error occurs because pandas needs an index. At first this seems confusing because you think of list indexing. What pandas is essentially asking for is a row label to pair with each scalar value. You can set this like so:
import pandas as pd
values = ['a', 'b', 'c', 'd']  # avoid shadowing the built-in name 'list'
df = pd.DataFrame(values, index=[0, 1, 2, 3])
The data frame then yields:
   0
0  a
1  b
2  c
3  d
For you specifically, this might look something like this using numpy (not tested):
import numpy as np

list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])
ind = np.arange(len(list_of_dfs))
dataframe = pd.DataFrame(list_of_dfs, index=ind)
Pandas unfortunately always needs an index when creating a DataFrame.
You can either set it yourself, or use an object with the following structure so pandas can determine the index itself:
data= {'a':[1],'b':[2]}
Since it won't be easy to edit the data in your case, a hacky solution is to wrap the data in a list:
dataframe = pd.DataFrame([list_of_dfs])
import pandas as pd
d = [{"a": 1, "b": 2, "c": 3},
     {"a": 4, "b": 5, "c": 6},
     {"a": 7, "b": 8, "c": 9}]
pd.DataFrame(d, index=list(range(len(d))))
returns:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
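Note that for a list of dicts the index argument is actually optional, since pandas can count the rows and generate the default RangeIndex on its own:

```python
import pandas as pd

d = [{"a": 1, "b": 2, "c": 3},
     {"a": 4, "b": 5, "c": 6},
     {"a": 7, "b": 8, "c": 9}]
df = pd.DataFrame(d)  # no explicit index needed for a list of dicts
```

The "you must pass an index" error only arises when every value is a scalar, because then pandas has no way to infer how many rows you want.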

Pandas: create dataframe without auto ordering column names alphabetically

I am creating an initial pandas dataframe to store results generated from other codes: e.g.
result = pd.DataFrame({'date': datelist, 'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)})
with datelist a predefined list. Then other codes will output some number for total and TT for each date, which I will store in the result dataframe.
So I want the first column to be date, second total and third TT. However, pandas will automatically reorder it alphabetically to TT, date, total at creation. While I can manually reorder this again afterwards, I wonder if there is an easier way to achieve this in one step.
I figured I can also do
result = pd.DataFrame(np.transpose([datelist, [0]*l, [0]*l]),  # l = len(datelist)
                      columns=['date', 'total', 'TT'])
but it somehow also looks tedious. Any other suggestions?
You can pass the (correctly ordered) list of columns as a parameter to the constructor, or use an OrderedDict:
# option 1:
result = pd.DataFrame({'date': datelist, 'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)}, columns=['date', 'total', 'TT'])
# option 2:
od = collections.OrderedDict()
od['date'] = datelist
od['total'] = [0]*len(datelist)
od['TT'] = [0]*len(datelist)
result = pd.DataFrame(od)
result = pd.DataFrame({'date': [23, 24], 'total': 0,
                       'TT': 0}, columns=['date', 'total', 'TT'])
Use pandas >= 0.23 in combination with Python >= 3.6.
result = pd.DataFrame({'date': datelist,
                       'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)})
retains the dict's insertion order when creating a DataFrame (or Series) from a dict, when using pandas >= 0.23.0 in combination with Python >= 3.6.
See https://pandas.pydata.org/pandas-docs/version/0.23.0/whatsnew.html#whatsnew-0230-api-breaking-dict-insertion-order.
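A quick check of the insertion-order behaviour (assuming pandas >= 0.23 on Python >= 3.6, per the changelog above):

```python
import pandas as pd

datelist = ['2021-01-01', '2021-01-02']
result = pd.DataFrame({'date': datelist,
                       'total': [0] * len(datelist),
                       'TT': [0] * len(datelist)})
print(list(result.columns))
```

On those versions the columns come out in the order the dict keys were inserted, with no columns= argument needed.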

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to unnest that list and create more rows with identical values in the first and second columns.
The end result should be something like:
pd.DataFrame([[1,2,'a'],[1,2,'b'],[1,2,'c']])
Note, this is a simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. I imagine that I could take each member of the nested list while keeping the other column values, then use a list comprehension to build more lists and combine them into a new dataframe... But this seems a bit too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue that the OP described, but related - and more pandas-like - is the situation where you have a dict of lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack()  # reshape into long form
df = df.dropna()   # drop the NaN values generated by lists of unequal length
df.index = df.index.droplevel(level=0)  # if you don't want the list positions in the index
# NOTE this last step results in duplicate indexes
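For the original question, newer pandas (>= 0.25) also ships a built-in for exactly this: DataFrame.explode repeats the other columns once per list element:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]], columns=['x', 'y', 'z'])
# Each element of the list in 'z' gets its own row; 'x' and 'y' are repeated.
expanded = df.explode('z').reset_index(drop=True)
```

reset_index(drop=True) just renumbers the rows, since explode keeps the original index (all zeros here) for every generated row.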
