Create a Pandas Dataframe by appending one row at a time - python

How do I create an empty DataFrame, then add rows, one by one?
I created an empty DataFrame:
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row at the end and fill a single field with:
df = df._set_value(index=len(df), col='qty1', value=10.0)
It works for only one field at a time. What is a better way to add a new row to df?

You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
...
>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
Create a list of dictionaries in which each dictionary corresponds to an input data row.
Create a data frame from this list.
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)
    rows_list.append(dict1)
df = pd.DataFrame(rows_list)
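Here is a minimal runnable sketch of the list-of-dictionaries approach described above. The contents of input_rows are made up for illustration; the column names are the ones from the question.

```python
import pandas as pd

# Hypothetical input: each tuple stands in for one raw input row.
input_rows = [("name0", 3, 3), ("name1", 2, 4), ("name2", 2, 8)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # One dictionary per row; the keys are the column names.
    rows_list.append({"lib": lib, "qty1": qty1, "qty2": qty2})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows_list)
print(df)
```

The DataFrame is constructed a single time, so no intermediate frames are copied while the rows are being gathered.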

In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.
Performance
Using .append (NPE's answer)
Using .loc (fred's answer)
Using .loc with preallocating (FooBar's answer)
Using dict and create DataFrame in the end (ShikharDua's answer)
Runtime results (in seconds):
Approach               1000 rows  5000 rows  10 000 rows
.append                    0.69       3.39         6.78
.loc without prealloc      0.74       3.90         8.35
.loc with prealloc         0.24       2.58         8.70
dict                       0.012      0.046        0.084
So I use addition through the dictionary for myself.
Code:
import pandas as pd
import numpy as np
import time
# Note: DataFrame.append was removed in pandas 2.0, so the first test needs pandas < 2.0.
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows-4):
    df1 = df1.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'])
startTime = time.perf_counter()
for i in range(1, numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range(1, numOfRows-4):
    dict1 = dict((a, np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
P.S.: I believe my implementation isn't perfect, and maybe there is some optimization that could be done.
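For readers on pandas 2.0 or later, where DataFrame.append no longer exists, a comparison in the same spirit can be sketched with pd.concat standing in for .append. This is not the benchmark above: the function names build_with_concat and build_from_list and the row count of 500 are assumptions chosen for illustration, and the timings will depend on your machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

COLUMNS = ["A", "B", "C", "D", "E"]


def build_with_concat(n):
    # Grows the frame one row at a time; every concat copies all existing rows.
    df = pd.DataFrame([np.random.randint(100, size=5)], columns=COLUMNS)
    for _ in range(n - 1):
        row = pd.DataFrame([np.random.randint(100, size=5)], columns=COLUMNS)
        df = pd.concat([df, row], ignore_index=True)
    return df


def build_from_list(n):
    # Collects plain dictionaries and constructs the frame once at the end.
    rows = [dict(zip(COLUMNS, np.random.randint(100, size=5))) for _ in range(n)]
    return pd.DataFrame(rows, columns=COLUMNS)


for builder in (build_with_concat, build_from_list):
    start = time.perf_counter()
    result = builder(500)
    elapsed = time.perf_counter() - start
    print(f"{builder.__name__}: {elapsed:.3f} s, shape {result.shape}")
```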

You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.
For example:
def append_row(df, row):
    return pd.concat([
        df,
        pd.DataFrame([row], columns=row.index)]
    ).reset_index(drop=True)
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})
df = append_row(df, new_row)
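If several rows are available at once, the same idea can be applied with a single concat call instead of one call per row. The sketch below is only an illustration; the helper name append_rows and the sample values are assumptions, not part of the answer above.

```python
import pandas as pd


def append_rows(df, rows):
    """Return a new frame with all `rows` (a list of dicts) added in one concat."""
    new_rows = pd.DataFrame(rows, columns=df.columns)
    return pd.concat([df, new_rows], ignore_index=True)


df = pd.DataFrame({"lib": ["A"], "qty1": [1], "qty2": [2]})
df = append_rows(df, [
    {"lib": "B", "qty1": 3, "qty2": 4},
    {"lib": "C", "qty1": 5, "qty2": 6},
])
print(df)
```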

NEVER grow a DataFrame!
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.
This is The Right Way™ to accumulate your data
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
These options are horrible
append or concat inside a loop
append and concat aren't inherently bad in isolation. The problem starts when you iteratively call them inside a loop - this results in quadratic memory usage.
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.Series({'A': a, 'B': b, 'C': c})],
    #     ignore_index=True)
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with object (slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks to get the right answer with the right code, not the right answer with the wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
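The dtype point made above can be checked directly. The following is a small sketch with made-up data, comparing a preallocated frame with one built once from a list.

```python
import pandas as pd

# Preallocated frame: no data is given, so every column starts as object dtype.
preallocated = pd.DataFrame(columns=["A", "B", "C"], index=range(3))
print(preallocated.dtypes)

# Frame built once from a list: pandas infers a dtype for each column.
data = [[1, 2.5, "x"], [2, 3.5, "y"], [3, 4.5, "z"]]
built_once = pd.DataFrame(data, columns=["A", "B", "C"])
print(built_once.dtypes)
```

In the second frame the integer and float columns get numeric dtypes, so vectorized arithmetic on them works without a conversion step.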

If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
    # loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
1000 loops, best of 3: 1.23 ms per loop
In[31]: %timeit tryOther() # function wrapper without index (see, for example, #fred)
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s

mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row

You can append a single row as a dictionary using the ignore_index option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black

For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows by assigning through loc to an index key that does not exist yet (the ix indexer mentioned in older answers has since been removed from pandas). For example:
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
Or:
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
.....: columns=['A','B'])
.....:
In [2]: dfi
Out[2]:
A B
0 0 1
1 2 3
2 4 5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
Out[4]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [5]: dfi.loc[3] = 5
In [6]: dfi
Out[6]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5

For the sake of a Pythonic way:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN

You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)
df = pd.DataFrame(rows, columns=columns)
giving
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25

If you always want to add a new row at the end, use this:
df.loc[len(df)] = ['name5', 9, 0]

I figured out a simple and nice way:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments.

This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!
import pandas as pd
BaseData = pd.DataFrame({'Customer': ['Acme','Mega','Acme','Acme','Mega','Acme'],
                         'Territory': ['West','East','South','West','East','South'],
                         'Product': ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list = []
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd = {}  # initialise an empty dict
    RecordtoAdd.update({'Customer': name})
    RecordtoAdd.update({'Num Unique Products': len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products': pd.unique(group['Product'])})
    rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)

You can use a generator object to create a DataFrame, which will be more memory efficient than a list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (it can be consumed only once and cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num))
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add a row to an existing DataFrame, you can use the append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])

Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists, versus one list of n dictionaries. The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists, which overall is the same time complexity (adding entries to list and dictionary are both amortized constant time) but may have less overhead due to being a simple operation.
# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# At the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
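The warning about giving every column a value can be enforced with a small helper. This is only a sketch: the function name add_row and its key check are assumptions added for illustration, not part of the approach as described above.

```python
import pandas as pd


def add_row(data, row):
    """Append one row (a dict) to a dict of column lists, keeping columns aligned."""
    if set(row) != set(data):
        raise ValueError(f"row keys {sorted(row)} do not match columns {sorted(data)}")
    for column, value in row.items():
        data[column].append(value)


data = {"Animal": ["cow", "horse"], "Color": ["blue", "red"]}
add_row(data, {"Animal": "mouse", "Color": "black"})

# Construct the DataFrame once, at the end.
df = pd.DataFrame(data)
print(df)
```

Checking the keys before appending keeps all column lists the same length, which pd.DataFrame requires when it is given a dictionary of lists.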

Create a new record (data frame) and add to old_data_frame.
Pass a list of values and the corresponding column names to create a new_record (data_frame):
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])
old_data_frame = pd.concat([old_data_frame, new_record])

Here is the way to add/append a row in a Pandas DataFrame:
def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in an empty or populated Pandas DataFrame.

If you want to add a row at the end, append it as a list:
valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)

Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row

All you need is loc[df.shape[0]] or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
or
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
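One caveat worth keeping in mind: len(df) is only a safe label when the index is the default 0..n-1 range. The sketch below, using a made-up one-column frame, shows what happens when it is not, and one possible way around it.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
df = df.drop(index=0)           # index labels are now [1, 2]

# len(df) is 2, and label 2 already exists, so this overwrites that row.
df.loc[len(df)] = [99]
print(df)

# Resetting the index first makes len(df) a label that is not in use yet.
df = df.reset_index(drop=True)  # index labels are now [0, 1]
df.loc[len(df)] = [100]
print(df)
```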

You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).
So, I put the data for the new row in a dict() and the index in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
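A concrete version of the template above might look like the following. The column names, values, and index labels are invented for illustration.

```python
import pandas as pd

existing_df = pd.DataFrame(
    {"A": [1], "B": [2], "C": [3]},
    index=["one"],
)

# Hypothetical new row: values keyed by column name, plus its index label.
new_dict = {"A": [4], "B": [5], "C": [6]}
new_list = ["two"]

new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
print(df)
```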

initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use a for loop to iterate through values or can add arrays of values.
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43

Make it simple. By taking a list as input which will be appended as a row in the data-frame:
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)

pandas.DataFrame.append
DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'
Code
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
With ignore_index set to True:
df.append(df2, ignore_index=True)
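DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, pd.concat can be used to get the same two results; the sketch below reuses the frames from the example above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Counterpart of df.append(df2): the original index labels are kept.
kept = pd.concat([df, df2])
print(kept.index.tolist())        # [0, 1, 0, 1]

# Counterpart of df.append(df2, ignore_index=True): labels are renumbered.
renumbered = pd.concat([df, df2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```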

If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df under data frame df, then you can use:
df.append(new_df)

We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far.
But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So we should consider df.values[subscript] = … more often. Note, however, that .values takes a zero-based positional subscript, which may differ from the labels in DataFrame.index.

Before going to add a row, we have to convert the dataframe to a dictionary. There you can see the keys as columns in the dataframe and the values of the columns are again stored in the dictionary, but there the key for every column is the index number in the dataframe.
That idea led me to write the code below.
df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns: # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i] # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
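The snippet above stops at the modified dictionary. A minimal end-to-end sketch, including the step back to a DataFrame, could look like this; the three-column frame and its values are assumptions made up to keep the example short.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": ["s_100"], "city": ["delhi"], "score": [10]},
    index=[100],
)

# to_dict() gives {column: {index_label: value}}.
as_dict = df.to_dict()

# Hypothetical new row, in column order, stored under index label 101.
values = ["s_101", "hyderabad", 20]
for column, value in zip(df.columns, values):
    as_dict[column][101] = value

# Rebuild the DataFrame; the inner keys become the index again.
df = pd.DataFrame(as_dict)
print(df)
```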

If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end.
It seems to be even faster than converting a list of dicts.
import time
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    # write each row directly into the preallocated array
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
print(df5.shape)

This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.
import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
        # print(x, y)
        dict1 = dict([('x',x),('y',y)])
        # print(f'dict 1 {dict1}')
        feat_list.append(dict1)
        # print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

Related

to get individual rows of the list in a single data frame Python [duplicate]

How do I create an empty DataFrame, then add rows, one by one?
I created an empty DataFrame:
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row at the end and fill a single field with:
df = df._set_value(index=len(df), col='qty1', value=10.0)
It works for only one field at a time. What is a better way to add new row to df?
You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>> df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
Create a list of dictionaries in which each dictionary corresponds to an input data row.
Create a data frame from this list.
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
dict1 = {}
# get input row in dictionary format
# key = col_name
dict1.update(blah..)
rows_list.append(dict1)
df = pd.DataFrame(rows_list)
In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.
Performance
Using .append (NPE's answer)
Using .loc (fred's answer)
Using .loc with preallocating (FooBar's answer)
Using dict and create DataFrame in the end (ShikharDua's answer)
Runtime results (in seconds):
Approach
1000 rows
5000 rows
10 000 rows
.append
0.69
3.39
6.78
.loc without prealloc
0.74
3.90
8.35
.loc with prealloc
0.24
2.58
8.70
dict
0.012
0.046
0.084
So I use addition through the dictionary for myself.
Code:
import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
P.S.: I believe my realization isn't perfect, and maybe there is some optimization that could be done.
You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.
For example:
def append_row(df, row):
return pd.concat([
df,
pd.DataFrame([row], columns=row.index)]
).reset_index(drop=True)
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})
df = append_row(df, new_row)
NEVER grow a DataFrame!
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.
This is The Right Way™ to accumulate your data
data = []
for a, b, c in some_function_that_yields_data():
data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
These options are horrible
append or concat inside a loop
append and concat aren't inherently bad in isolation. The
problem starts when you iteratively call them inside a loop - this
results in quadratic memory usage.
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)
# This is equally bad:
# df = pd.concat(
# [df, pd.Series({'A': i, 'B': b, 'C': c})],
# ignore_index=True)
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with
object (slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
df.loc[len(df)] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
#loc or iloc both work here since the index is natural numbers
df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, #fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
df.loc[len(df)] = row
You can append a single row as a dictionary using the ignore_index option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows through loc/ix on non existing key index data. For example:
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
Or:
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
.....: columns=['A','B'])
.....:
In [2]: dfi
Out[2]:
A B
0 0 1
1 2 3
2 4 5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
Out[4]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [5]: dfi.loc[3] = 5
In [6]: dfi
Out[6]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
For the sake of a Pythonic way:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN
You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
row = [i, i*2, i*i]
rows.append(row)
df = pd.DataFrame(rows, columns=columns)
giving
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
If you always want to add a new row at the end, use this:
df.loc[len(df)] = ['name5', 9, 0]
I figured out a simple and nice way:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments.
This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list=[]
for name, group in BaseData.groupby('Customer'):
RecordtoAdd={} #initialise an empty dict
RecordtoAdd.update({'Customer' : name}) #
RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
You can use a generator object to create a Dataframe, which will be more memory efficient over the list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
for i in range(num):
yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add raw to existing DataFrame you can use append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists, versus one list of n dictionaries. The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists, which overall is the same time complexity (adding entries to list and dictionary are both amortized constant time) but may have less overhead due to being a simple operation.
# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# At the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
Create a new record (data frame) and add to old_data_frame.
Pass a list of values and the corresponding column names to create a new_record (data_frame):
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])
old_data_frame = pd.concat([old_data_frame, new_record])
Here is the way to add/append a row in a Pandas DataFrame:
def add_row(df, row):
df.loc[-1] = row
df.index = df.index + 1
return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in an empty or populated Pandas DataFrame.
If you want to add a row at the end, append it as a list:
valuestoappend = [va1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)
Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
colnames = list(df.columns)
ncol = len(colnames)
assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
All you need is loc[df.shape[0]] or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
or
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
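One caveat worth illustrating: len(df) is only a fresh label when the index is the default 0..n-1. A minimal sketch with made-up data, showing what happens otherwise:

```python
import pandas as pd

# Default index (0, 1): len(df) == 2 is a new label, so a row is added.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.loc[len(df)] = [5, 6]

# Index labels (1, 2): len(other) == 2 already exists, so that row is overwritten.
other = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[1, 2])
other.loc[len(other)] = [5, 6]
```

In the second case no row is added; the existing row labelled 2 is replaced.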
You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).
So, I put the data for the new row in a dict() and the index label in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use a for loop to iterate through values or can add arrays of values.
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
Make it simple. By taking a list as input which will be appended as a row in the data-frame:
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)
pandas.DataFrame.append
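DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples below need an older version.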
Code
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
With ignore_index set to True:
df.append(df2, ignore_index=True)
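On pandas versions that no longer have DataFrame.append, pd.concat gives the same two results. A minimal sketch using the same df and df2 (the names stacked and renumbered are just for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Keeps the original labels, so the index becomes 0, 1, 0, 1.
stacked = pd.concat([df, df2])

# Renumbers the rows 0..3, like append(..., ignore_index=True).
renumbered = pd.concat([df, df2], ignore_index=True)
```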
If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df under data frame df, then you can use:
df.append(new_df)
We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far.
But if we replace df3.loc[i] = … (with the preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So we should consider df.values[subscript] = … more often. Note, however, that .values takes a zero-based positional subscript, which may differ from the labels in DataFrame.index.
Before adding a row, we convert the dataframe to a dictionary. The keys of that dictionary are the dataframe's columns, and each column's value is itself a dictionary whose keys are the index labels of the dataframe.
That idea led me to write the code below.
df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns: # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i] # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end.
It seems to be even faster than converting a list of dicts.
import time
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.
import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
        # print(x, y)
        dict1 = dict([('x',x),('y',y)])
        # print(f'dict 1 {dict1}')
        feat_list.append(dict1)
        # print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

Nearest neighbor matching in Pandas

Given two DataFrames (t1, t2), both with a column 'x', how would I append a column to t1 with the ID of t2 whose 'x' value is the nearest to the 'x' value in t1?
t1:
id x
1 1.49
2 2.35
t2:
id x
3 2.36
4 1.5
output:
id id2
1 4
2 3
I can do this by creating a new DataFrame, iterating over t1.groupby(), doing lookups on t2 and then merging, but this takes incredibly long given a 17 million row t1 DataFrame.
Is there a better way to accomplish this? I've scoured the pandas docs regarding groupby, apply, transform, agg, etc., but an elegant solution has yet to present itself, despite my thought that this would be a common problem.
Using merge_asof
df = pd.merge_asof(df1.sort_values('x'),
df2.sort_values('x'),
on='x',
direction='nearest',
suffixes=['', '_2'])
print(df)
Out[975]:
id x id_2
0 3 0.87 6
1 1 1.49 5
2 2 2.35 4
Method 2 reindex
df1['id2']=df2.set_index('x').reindex(df1.x,method='nearest').values
df1
id x id2
0 1 1.49 4
1 2 2.35 3
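For reference, here is the merge_asof approach applied to the t1 and t2 from the question, as a minimal sketch. Renaming id to id2 before the merge is my own choice to get the column names from the desired output; the names matched and result are just for illustration.

```python
import pandas as pd

t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
t2 = pd.DataFrame({"id": [3, 4], "x": [2.36, 1.5]})

# merge_asof requires both frames to be sorted by the key column.
matched = pd.merge_asof(
    t1.sort_values("x"),
    t2.sort_values("x").rename(columns={"id": "id2"}),
    on="x",
    direction="nearest",
)
result = matched[["id", "id2"]]
```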
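Convert t1 and t2 to lists, sort both by x, and then match the ids pairwise with the zip() function. This pairs rows by rank, so it assumes both tables have the same length.
list1 = t1.values.tolist()
list2 = t2.values.tolist()
list1.sort(key=lambda row: row[1]) # ascending or descending order, you decide
list2.sort(key=lambda row: row[1])
list3 = list(zip(list1, list2))
print(list3)
# after that you should see pairs of rows whose ids are (1, 4) and (2, 3)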
You can calculate a new array with the distance from each element in t1 to each element in t2, and then take the argmin along the rows to get the right index. This has the advantage that you can choose whatever distance function you like, and it does not require the dataframes to be of equal length.
It creates one intermediate array of size len(t1) * len(t2). Using a pandas builtin might be more memory-efficient, but this should be as fast as you can get as everything is done on the C side of numpy. You could always do this method in batches if memory is a problem.
import numpy as np
import pandas as pd
t1 = pd.DataFrame({"id": [1, 2], "x": np.array([1.49, 2.35])})
t2 = pd.DataFrame({"id": [3, 4], "x": np.array([2.36, 1.5])})
Now comes the part doing the actual work. The .to_numpy() bit is important since otherwise Pandas tries to align on the indices. The first line uses broadcasting to create horizontal and vertical "repetitions" in a memory-efficient way; dist has one row per element of t1 and one column per element of t2.
dist = np.abs(t1["x"].to_numpy()[:, np.newaxis] - t2["x"].to_numpy()[np.newaxis, :])
closest_idx = np.argmin(dist, axis=1)
closest_id = t2["id"].to_numpy()[closest_idx]
output = pd.DataFrame({"id1": t1["id"], "id2": closest_id})
print(output)
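The batching idea mentioned above could look roughly like this. It is a sketch, not a tuned implementation: nearest_ids_batched and batch_size are hypothetical names, and the distance is hard-coded to the absolute difference.

```python
import numpy as np

def nearest_ids_batched(x1, x2, ids2, batch_size=1024):
    """For each value in x1, return the id of the closest value in x2.

    Hypothetical helper: processes x1 in chunks so that the intermediate
    distance array holds at most batch_size * len(x2) elements.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    ids2 = np.asarray(ids2)
    out = np.empty(len(x1), dtype=ids2.dtype)
    for start in range(0, len(x1), batch_size):
        chunk = x1[start:start + batch_size]
        # Shape (len(chunk), len(x2)): distance from each chunk value to each x2 value.
        dist = np.abs(chunk[:, np.newaxis] - x2[np.newaxis, :])
        out[start:start + batch_size] = ids2[np.argmin(dist, axis=1)]
    return out

closest = nearest_ids_batched([1.49, 2.35], [2.36, 1.5], [3, 4], batch_size=1)
```

The total work is unchanged; only the peak memory drops, so batch_size can be picked to fit the available RAM.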
Alternatively, you can round x to 1 decimal place and merge on the rounded values. This only matches rows whose x values round to the same number.
t1 = {'id': [1, 2], 'x': [1.49,2.35]}
t2 = {'id': [3, 4], 'x': [2.36,1.5]}
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
df = df1.round(1).merge(df2.round(1), on='x', suffixes=('','2')).drop(columns='x')
print(df)
id id2
0 1 4
1 2 3
Add .drop(columns='x') to remove the merge column 'x' from the output.
Add suffixes=('','2') to rename the overlapping column titles.

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].ffill()
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].ffill()
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
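The same steps can also be sketched as a plain-Python function over the column tuples, which makes the logic easy to test without a file. clean_header is a hypothetical helper, and the 'Unnamed: N_level_0' placeholder names in raw are an assumption about what read_csv produces for blank header cells.

```python
import pandas as pd

def clean_header(tuples):
    """Forward-fill 'Unnamed:' placeholders in level 0 of two-level column tuples.

    Hypothetical helper. Leading placeholders (with nothing to fill from)
    are replaced by their level-1 name, with '' as the second level.
    """
    cleaned = []
    last = None
    for top, bottom in tuples:
        if str(top).startswith("Unnamed:"):
            if last is None:
                cleaned.append((bottom, ""))
                continue
            top = last
        last = top
        cleaned.append((top, bottom))
    return pd.MultiIndex.from_tuples(cleaned)

# Assumed shape of df.columns.tolist() for the CSV above.
raw = [("Unnamed: 0_level_0", "A"), ("Unnamed: 1_level_0", "B"),
       ("C", "X"), ("Unnamed: 3_level_0", "Y"), ("Unnamed: 4_level_0", "Z"),
       ("D", "X"), ("Unnamed: 6_level_0", "Y"), ("Unnamed: 7_level_0", "Z")]
columns = clean_header(raw)
```

With a real frame this would be used as df.columns = clean_header(df.columns.tolist()).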
There is no magical way of making pandas aware of how you want your index to look. The closest you can get is to specify a lot yourself, like this:
names = ['A', 'B',
('C','X'), ('C', 'Y'), ('C', 'Z'),
('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
You can read using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten the multi-index columns into a single level. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, then re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples)
df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.

Appending a list or series to a pandas DataFrame as a row?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?
df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]
Sometimes it's easier to do all the appending outside of pandas, then, just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
col1 col2
0 a b
1 e f
Here's a simple and dumb solution:
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)
Could you do something like this?
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
Does anyone have a more elegant solution?
Following onto Mike Chirico's answer... if you want to append a list after the dataframe is already populated...
>>> list = [['f','g']]
>>> df = df.append(pd.DataFrame(list, columns=['col1','col2']),ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
2 f g
There are several ways to append a list to a Pandas Dataframe in Python. Let's consider the following dataframe and list:
import pandas as pd
# Dataframe
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["col1", "col2"])
# List to append
list = [5, 6]
Option 1: append the list at the end of the dataframe with pandas.DataFrame.loc.
df.loc[len(df)] = list
Option 2: convert the list to dataframe and append with pandas.DataFrame.append().
df = df.append(pd.DataFrame([list], columns=df.columns), ignore_index=True)
Option 3: convert the list to series and append with pandas.DataFrame.append().
df = df.append(pd.Series(list, index = df.columns), ignore_index=True)
Each of the above options should output something like:
>>> print (df)
col1 col2
0 1 2
1 3 4
2 5 6
Reference : How to append a list as a row to a Pandas DataFrame in Python?
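Options 2 and 3 rely on DataFrame.append. Where that method is not available, option 2 can be written with pd.concat instead. A minimal sketch with the same data (new_row is just an illustrative name that avoids shadowing the built-in list):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"])
new_row = [5, 6]

# Wrap the list in a one-row DataFrame that shares df's columns, then concatenate.
df = pd.concat(
    [df, pd.DataFrame([new_row], columns=df.columns)],
    ignore_index=True,
)
```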
Converting the list to a data frame within the append function works, also when applied in a loop
import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame(data=[mylist]))
Here's a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you're adding then it shouldn't be an issue.
import pandas as pd
import numpy as np
def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.
    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """
    numEl = len(ls)
    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))
    df = df.append(newRow, ignore_index=True)
    return df
If you want to add a Series and use the Series' index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without ignore_index=True you don't get a proper index.
simply use loc:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
As mentioned here - https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you'll need to first convert the list to a series then append the series to dataframe.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)
Consider a DataFrame A of N x 2 dimensions. To add one more row, use the following.
A.loc[A.shape[0]] = [3,4]
The simplest way (note that this adds the list as a new column, not as a row):
my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values
Edit:
Don't forget that the length of the new list must be the same as the number of rows in the DataFrame.
