How to insert rows at specific positions into a dataframe in Python?

Suppose you have a dataframe
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]})
and another dataframe
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})
as well as a list of indices
pos = [0, 2]
What is the most pythonic way to create a new dataframe df2 in which df1 is inserted into df right before the index positions of df given in pos?
So the new dataframe should look like this:
df2 =
Age Name
0 20 Anna
1 28 Tom
2 34 Jack
3 50 Susie
4 29 Steve
5 42 Ricky
Thank you very much.
Best,
Nathan

The behavior you are searching for is implemented by numpy.insert. It does not play very well with pandas.DataFrame objects, but no matter: a pandas.DataFrame carries NumPy data inside it (sort of; depending on the dtypes it may be multiple arrays, but you can think of it as one array, accessible via the .values attribute).
You will simply have to reconstruct the columns of your dataframe afterwards, but otherwise I suspect this is the easiest and fastest way:
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':
...: [28,34,29,42]})
In [3]: df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]})
In [4]: np.insert(df.values, (0, 2), df1.values, axis=0)
Out[4]:
array([['Anna', 20],
       ['Tom', 28],
       ['Jack', 34],
       ['Susie', 50],
       ['Steve', 29],
       ['Ricky', 42]], dtype=object)
So this returns an array, but this array is exactly what you need to make a dataframe! And you already have the other pieces, i.e. the columns, on the original dataframe, so you can just do:
In [5]: pd.DataFrame(np.insert(df.values, (0,2), df1.values, axis=0), columns=df.columns)
Out[5]:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
So that single line is all you need.
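If you need this in more than one place, the one-liner generalizes into a small helper. A minimal sketch (the name insert_rows is mine, not a pandas API; note that going through .values yields an object-dtype array, so the original dtypes are restored at the end):

import numpy as np
import pandas as pd

def insert_rows(df, other, pos):
    """Return a new frame with other's rows inserted before the
    positions in pos (positional, as numpy.insert interprets them)."""
    values = np.insert(df.values, pos, other.values, axis=0)
    # np.insert hands back an object-dtype array; cast the columns
    # back to df's original dtypes.
    return pd.DataFrame(values, columns=df.columns).astype(dict(df.dtypes))

With the frames above, insert_rows(df, df1, [0, 2]) reproduces the df2 shown in the question.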

Tricky solution with float indexes:
df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age': [28,34,29,42]})
df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]}, index=[-0.5, 1.5])
result = pd.concat([df, df1]).sort_index().reset_index(drop=True)  # df.append was removed in pandas 2.0
print(result)
Output:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
Pay attention to the index parameter in the df1 creation. You can construct that index from pos with a simple list comprehension:
[x - 0.5 for x in pos]
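Putting it together, a minimal sketch that builds the fractional index from pos (pd.concat stands in for the removed df.append):

import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})
pos = [0, 2]

# Give each new row an index just below its target position, so that
# sort_index() slots it in right before that row of df.
df1.index = [x - 0.5 for x in pos]
df2 = pd.concat([df, df1]).sort_index().reset_index(drop=True)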

Related

Add Series as a new row into DataFrame triggers FutureWarning

Trying to add a new row of type Series into a DataFrame, both share the same columns/index:
df.loc[df.shape[0]] = r
Getting:
FutureWarning: In a future version, object-dtype columns with all-bool
values will not be included in reductions with bool_only=True.
Explicitly cast to bool dtype instead.
The warning comes from pandas' inference module.
I got the same warning, and it is caused by pandas version 1.5.0, which may be why some of the answers here do not solve the issue. From the 1.5.0 changelog:
Deprecated treating all-bool object-dtype columns as bool-like in DataFrame.any() and DataFrame.all() with bool_only=True, explicitly cast to bool instead (GH46188)
So I tried to understand it, and eventually I found a solution. The cause is that columns with boolean values are not properly cast. I was using concat, and for me the problem was in the existing DataFrame.
Because I didn't want to define the corresponding dtype for every column of the DataFrame (which would also be possible), I changed it only for the necessary columns:
df["var1"] = df["var1"].astype(bool)
Or for several at once:
df = df.astype({"var1": bool, "var2": bool})
Then the concat worked for me without the FutureWarning.
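A minimal sketch of the whole pattern, assuming the offending column starts out as object dtype (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({"var1": [True, False], "var2": [1, 2]})
df["var1"] = df["var1"].astype(object)  # simulate the problematic all-bool object column

# Cast the all-bool object column explicitly before concatenating;
# this is what silences the FutureWarning on pandas 1.5.x.
df = df.astype({"var1": bool})

new_row = pd.DataFrame({"var1": [True], "var2": [3]})
df = pd.concat([df, new_row], ignore_index=True)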
Try:
df
   c1  c2     c3   c4
0   1   3   True  abc
1   2   4  False  def
d = {'c1': 3, 'c2': 5, 'c3': True, 'c4': 'ghi'}
s = pd.Series(d)
s
c1       3
c2       5
c3    True
c4     ghi
dtype: object
df.loc[df.shape[0]] = s.to_numpy()
df
   c1  c2     c3   c4
0   1   3   True  abc
1   2   4  False  def
2   3   5   True  ghi
base:
import pandas as pd
data = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
df = pd.DataFrame(data)
df
   Name  Age  Location
0   Nik   31   Toronto
1  Kate   30    London
2  Evan   40  Kingston
3  Kyra   33  Hamilton
solution:
import pandas as pd
data = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
df = pd.DataFrame(data)
# Using pandas.concat() to add a row
r = pd.DataFrame({'Name':'Creuza', 'Age':69, 'Location':'São Gonçalo'}, index=[0])
df2 = pd.concat([r,df.loc[:]]).reset_index(drop=True)
df2
     Name  Age     Location
0  Creuza   69  São Gonçalo
1     Nik   31      Toronto
2    Kate   30       London
3    Evan   40     Kingston
4    Kyra   33     Hamilton
This happened to me as well; when I searched for the message on Google I got here.
The reason it happened in my case: when converting a dict to a DataFrame, the conversion does not turn a boolean column into <class 'pandas.core.arrays.boolean.BooleanArray'>; it converts it to <class 'numpy.ndarray'>.
So you need to convert it "manually" and then concat it. The command that worked for me was:
_item = pd.DataFrame([dictionary])
_item["column"] = _item["column"].astype("boolean")
data_frame = pd.concat([data_frame, _item], ignore_index=True)
see also:
https://github.com/pandas-dev/pandas/issues/46662

How to extract data from dataframe in pandas and assign them to normal variables

I am trying to get individual values out of a groupby() result in pandas and assign them to variables, but I don't know how.
For example:
df
Names Grades Ages
0 Bob 4 20
1 Jessica 3 21
3 Bob 3 22
4 John 2 20
5 Bob 4 24
print(df.groupby('Names').Ages.mean())
Names
Bob        22.0
Jessica    21.0
John       20.0
Now I want to get Bob's mean into a scalar variable, like:
Bob_mean = 22.0  <-- how do I extract this value from the DataFrame object in pandas?
Please help.
Thanks.
You can try:
import pandas as pd

df = pd.DataFrame([['Bob', 4, 20],
                   ['Jessica', 3, 21],
                   ['Bob', 3, 22],
                   ['John', 2, 20],
                   ['Bob', 4, 24]], columns=['Names', 'Grades', 'Ages'])
bob_mean = df.groupby(by='Names').Ages.mean()['Bob']
You have two options.
Simply amend your code to select the index 'Bob':
df.groupby('Names').Ages.mean()['Bob']
However, groupby operations can become very slow on large frames; instead we can filter with df.loc:
df.loc[df['Names']=='Bob'].Ages.mean()
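If you need the means for several names, it can be cheaper to compute the grouped Series once and turn it into a plain dict; a small sketch using the sample data above:

means = df.groupby('Names').Ages.mean().to_dict()
bob_mean = means['Bob']  # 22.0 with the sample data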

Fast splitting of pandas dataframe by column value

I have a pandas dataframe:
0 1
0 john 14
1 jack 2
2 emma 6
3 john 23
4 john 53
5 jack 43
that is really large (1+ GB). I want to split the dataframe by name and execute code on each of the resulting dataframes. This is my code, which works:
df.sort_values(by=[0], inplace=True)  # df.sort was removed; sort_values is the modern equivalent
df.set_index(keys=[0], drop=False, inplace=True)
names = df[0].unique().tolist()
for name in names:
    name_df = df.loc[df[0] == name]
    do_stuff(name_df)
However it runs really slow. Is there a faster way to accomplish this task?
Here is a dictionary-comprehension example that simply sums each sub-dataframe grouped on name:
>>> {k: gb['1'].sum() for k, gb in df.groupby('0')}
{'emma': 6, 'jack': 45, 'john': 90}
For something more complicated, you can create a function and then apply it to the group.
def foo(df):
    df += 1
    df *= 2
    df = df.sum()
    return df

{k: g['1'].apply(foo) for k, g in df.groupby('0')}
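For the original loop-over-names use case, iterating the groupby object directly avoids the sort/set_index/.loc round trip entirely; a sketch, with do_stuff standing in for the question's function:

# groupby yields (key, sub-frame) pairs, one per distinct name, so no
# explicit unique()/filtering pass is needed.
for name, name_df in df.groupby('0'):
    do_stuff(name_df)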

Pandas data frame sum of column and collecting the results

Given the following dataframe:
import pandas as pd
p1 = {'name': 'willy', 'age': 11, 'interest': "Lego"}
p2 = {'name': 'willy', 'age': 11, 'interest': "games"}
p3 = {'name': 'zoe', 'age': 9, 'interest': "cars"}
df = pd.DataFrame([p1, p2, p3])
df
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
I want to know the number of interests of each person and have each person show up only once in the list. I do the following:
Interests = df[['age', 'name', 'interest']].groupby(['age', 'name']).count()
Interests.reset_index(inplace=True)
Interests.sort_values('interest', ascending=False, inplace=True)  # DataFrame.sort was removed; use sort_values
Interests
age name interest
1 11 willy 2
0 9 zoe 1
This works, but I have the feeling that I'm doing it wrong. Using the column 'interest' to hold my counts is okay, but as I said, I expect there to be a nicer way to do this.
I saw many questions about counting/summing in Pandas, but for me the key part is leaving out the 'duplicates'.
You can use size (the length of each group) rather than count (the number of non-NaN entries in each column of the group).
In [11]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size()
Out[11]:
age name
9 zoe 1
11 willy 2
dtype: int64
In [12]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size().reset_index(name='count')
Out[12]:
age name count
0 9 zoe 1
1 11 willy 2
In [2]: df
Out[2]:
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
In [3]: for name, group in df.groupby('name'):
   ...:     print(name)
   ...:     print(group.interest.count())
   ...:
willy
2
zoe
1
