Trying to add a new row of type Series into a DataFrame; both share the same columns/index:
df.loc[df.shape[0]] = r
Getting:
FutureWarning: In a future version, object-dtype columns with all-bool
values will not be included in reductions with bool_only=True.
Explicitly cast to bool dtype instead.
The warning comes from the inference module.
I got the same warning, and it is caused by pandas version 1.5.0, which is why some of the answers here may not solve the issue. From the release notes:
Deprecated treating all-bool object-dtype columns as bool-like in DataFrame.any() and DataFrame.all() with bool_only=True, explicitly cast to bool instead (GH46188)
So I tried to understand it, and I was able to find a solution. The cause is that columns with boolean values are not properly cast. I used concat, and in my case the problem was in the existing DataFrame.
Because I did not want to define the corresponding dtype for every column of the DataFrame (which would also be possible), I changed it only for the necessary columns:
df["var1"]=df["var1"].astype(bool)
Or for multiple ones:
df=df.astype({"var1":bool,"var2":bool})
Then the concat worked for me without the FutureWarning.
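For instance, a minimal sketch of the cast-then-concat flow (the df, var1, and new_row names here are made up for illustration):
import pandas as pd

# all-bool values stored as object dtype, which is what triggers the warning
df = pd.DataFrame({"var1": pd.Series([True, False], dtype=object), "var2": [1, 2]})
df["var1"] = df["var1"].astype(bool)  # explicit cast, as the release notes ask
new_row = pd.DataFrame({"var1": [True], "var2": [3]})
df = pd.concat([df, new_row], ignore_index=True)  # concats without the FutureWarning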
Try converting the Series to a NumPy array before assigning; bypassing the object-dtype Series appears to avoid the warning:
df
c1 c2 c3 c4
0 1 3 True abc
1 2 4 False def
d = {'c1': 3, 'c2': 5, 'c3': True, 'c4': 'ghi'}
s = pd.Series(d)
s
c1 3
c2 5
c3 True
c4 ghi
dtype: object
df.loc[df.shape[0]] = s.to_numpy()
df
c1 c2 c3 c4
0 1 3 True abc
1 2 4 False def
2 3 5 True ghi
base:
import pandas as pd
data = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
df = pd.DataFrame(data)
df
   Name  Age  Location
0   Nik   31   Toronto
1  Kate   30    London
2  Evan   40  Kingston
3  Kyra   33  Hamilton
solution:
import pandas as pd
data = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
df = pd.DataFrame(data)
# Using pandas.concat() to add a row
r = pd.DataFrame({'Name':'Creuza', 'Age':69, 'Location':'São Gonçalo'}, index=[0])
df2 = pd.concat([r,df.loc[:]]).reset_index(drop=True)
df2
     Name  Age     Location
0  Creuza   69  São Gonçalo
1     Nik   31      Toronto
2    Kate   30       London
3    Evan   40     Kingston
4    Kyra   33     Hamilton
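This prepends the new row. If you instead want it appended at the end, swap the concat order:
df2 = pd.concat([df, r], ignore_index=True)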
This happened to me as well; when I searched the message on Google, I got here.
The reason it happened in my case:
when converting a dict to a DataFrame/Series, the conversion does not turn a boolean column into <class 'pandas.core.arrays.boolean.BooleanArray'>;
it converts it to <class 'numpy.ndarray'>.
So you need to convert it "manually" and then concat it. The commands that worked for me were:
_item = pd.DataFrame([dictionary])                    # dict -> one-row DataFrame
_item["column"] = _item["column"].astype("boolean")   # cast to the nullable BooleanDtype
data_frame = pd.concat([data_frame, _item], ignore_index=True)
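To verify the cast took effect (a quick check, with data_frame/column as above):
print(type(data_frame["column"].array))  # <class 'pandas.core.arrays.boolean.BooleanArray'>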
see also:
https://github.com/pandas-dev/pandas/issues/46662
I would like to create a DataFrame from a DataFrame I already have in Python.
The DataFrame I have looks like below:
Nome Dept
Maria A1
Joao A2
Anna A1
Jorge A3
The DataFrame I want to create is like the below:
Dept Funcionario 1 Funcionario 2
A1 Maria Anna
A2 Joao
I tried the below code:
df_func.merge(df_dept, how='inner', on='Dept')
But I got the error: TypeError: merge() got multiple values for argument 'how'
Would anyone know how I can do this?
Thank you in Advance! :)
Even if you get that merge to run, you will not get the right answer; in fact, the key is going to be duplicated four times:
d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')
Out[8]:
Name_x dept Name_y
0 maria 1 maria
1 maria 1 anna
2 anna 1 maria
3 anna 1 anna
4 joao 2 joao
5 jorge 3 jorge
The best way around this is to groupby:
dd = df.groupby('dept').agg(list)
Out[10]:
Name
dept
1 [maria, anna]
2 [joao]
3 [jorge]
Then you apply pd.Series
dd['Name'].apply(pd.Series)
Out[21]:
0 1
dept
1 maria anna
2 joao NaN
3 jorge NaN
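Putting the two steps together, with a rename to match the asked-for headers (the Funcionario labels are my assumption from the question):
out = df.groupby('dept')['Name'].agg(list).apply(pd.Series)
out.columns = [f'Funcionario {i + 1}' for i in out.columns]  # 0, 1 -> Funcionario 1, Funcionario 2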
This is how I have merged two data frames recently.
rpt_data = connect_to_presto()  # data returned from a db
df_rpt = pd.DataFrame(rpt_data, columns=["domain", "revenue"])
# adding the sellers.json seller {} objects into a pandas df
sj_data = data  # response returned from the requests module
df_sj = pd.json_normalize(sj_data, record_path="sellers", errors="ignore")
# merging both dataframes
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
Notice how I stored the data in a variable each time, then created a DataFrame from it? Then I merged them like so:
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
This may not be the best approach but it works.
I have two different dataframes which i need to compare.
These two DataFrames have different numbers of rows and do not have a single PK; it is a composite primary key of (id||ver||name||prd||loc).
df1:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
b 1 alex 1b y
b 2 david 1b z
df2:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
I tried the code below, and it works if there are the same number of rows, but in a case like the above it does not.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(Source)
df1 = df1.astype(str)  # converting all elements to strings for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str)  # converting all elements to strings for easy comparison
header_list = df1.columns.tolist()  # column names from df1, as both dfs have the same structure
df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)):
    df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')
df3.to_csv('Output', index=False)
Please let me know how to compare the datasets if there are different numbers of rows.
You can try this:
~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()
Let's consider a quick example:
df1 = pd.DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl'],
'Quantity': [18, 3, 5]})
# Buyer Quantity
# 0 Carl 18
# 1 Carl 3
# 2 Carl 5
df2 = pd.DataFrame({
'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
'Quantity': [2, 1, 18, 5]})
# Buyer Quantity
# 0 Carl 2
# 1 Mark 1
# 2 Carl 18
# 3 Carl 5
~df2.isin(df1)
# Buyer Quantity
# 0 False True
# 1 True True
# 2 False True
# 3 True True
df2[~df2.isin(df1)].dropna()
# Buyer Quantity
# 1 Mark 1
# 3 Carl 5
Another idea is to merge on the same column names, as sketched below.
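A possible merge-based sketch (my addition; note this compares rows by value rather than by position, unlike isin):
merged = df2.merge(df1, how='left', indicator=True)  # merges on all shared columns
only_in_df2 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
# rows of df2 whose (Buyer, Quantity) combination never appears in df1;
# duplicates in df1 may fan out rows, so drop_duplicates() first if needed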
Sure, tweak the code to your needs. Hope this helped :)
suppose you have a dataframe
df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':
[28,34,29,42]})
and another dataframe
df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]})
as well as a list with indices
pos = [0, 2]
What is the most pythonic way to create a new dataframe df2 where df1 is integrated into df right before the index positions of df specified in pos?
So, the new DataFrame should look like this:
df2 =
Age Name
0 20 Anna
1 28 Tom
2 34 Jack
3 50 Susie
4 29 Steve
5 42 Ricky
Thank you very much.
Best,
Nathan
The behavior you are searching for is implemented by numpy.insert. However, it will not play very well with pandas.DataFrame objects directly. No matter: pandas.DataFrame objects have a numpy.ndarray inside them (sort of; depending on various factors it may be multiple arrays, but you can think of it as one array, accessible via the .values attribute).
You will simply have to reconstruct the columns of your data-frame, but otherwise, I suspect this is the easiest and fastest way:
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':
...: [28,34,29,42]})
In [3]: df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]})
In [4]: np.insert(df.values, (0,2), df1.values, axis=0)
Out[4]:
array([['Anna', 20],
['Tom', 28],
['Jack', 34],
['Susie', 50],
['Steve', 29],
['Ricky', 42]], dtype=object)
So this returns an array, but this array is exactly what you need to make a data-frame! And you already have the other elements, i.e. the columns, on the original data-frame, so you can just do:
In [5]: pd.DataFrame(np.insert(df.values, (0,2), df1.values, axis=0), columns=df.columns)
Out[5]:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
So that single line is all you need.
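One caveat worth adding: going through .values produces an object array, so the numeric dtypes are lost in the new frame; infer_objects() can restore them (a small sketch):
df2 = pd.DataFrame(np.insert(df.values, (0, 2), df1.values, axis=0),
                   columns=df.columns).infer_objects()
df2.dtypes  # Age comes back as int64 instead of object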
Tricky solution with float indexes:
df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age': [28,34,29,42]})
df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]}, index=[-0.5, 1.5])
result = df.append(df1, ignore_index=False).sort_index().reset_index(drop=True)
print(result)
Output:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
Pay attention to the index parameter in the df1 creation. You can construct the index from pos using a simple list comprehension:
[x - 0.5 for x in pos]
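Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same float-index trick works with pd.concat (a sketch):
result = pd.concat([df, df1]).sort_index().reset_index(drop=True)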
What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, giving user-defined fill values for columns that aren't specified? Here's an example:
df = pd.DataFrame({
'name': ['Bob', 'Sue', 'Tom'],
'age': [45, 40, 10],
'weight': [143.2, 130.2, 34.9],
'has_children': [True, True, False]
})
Assume that I want to add a new record passing just name and age. To maintain data types, I can copy rows from df, modify values and then append df to the copy, e.g.
columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
But that converts the bool column to an object.
Here's a really hacky solution that doesn't feel like the "right way" to do this:
columns = ('name', 'age')
copy_df = df.loc[0:0].copy()
missing_remap = {
'int64': 0,
'float64': 0.0,
'bool': False,
'object': ''
}
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42
I know I must be missing something.
As you found, since NaN is a float, adding NaN to a Series may cause it to be either upcast to float or converted to object. You are right in determining that this is not a desirable outcome.
There is no straightforward approach. My suggestion is to store your input row data in a dictionary and combine it with a dictionary of defaults before appending. Note that this works because pd.DataFrame.append accepts a dict argument.
In Python 3.5+, you can use the syntax {**d1, **d2} to combine two dictionaries, with preference for the second.
default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}
df = df.append({**default, **row}, ignore_index=True)
print(df)
age has_children name weight
0 45 True Bob 143.2
1 40 True Sue 130.2
2 10 False Tom 34.9
3 42 False Cindy 0.0
print(df.dtypes)
age int64
has_children bool
name object
weight float64
dtype: object
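On pandas 2.0+, where DataFrame.append has been removed, the same dict-merging idea works with pd.concat; a sketch under that assumption:
new_row = pd.DataFrame([{**default, **row}])
df = pd.concat([df, new_row], ignore_index=True)
# dtypes survive because every column receives a value of the matching type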
It's because NaN is a float, but True and False are bool. With mixed dtypes in one column, pandas will automatically convert it to object.
Another instance of this: if you have a column of all integer values and append a float, pandas changes the entire column to float, adding '.0' to the existing values.
Edit
Based on the comments, here is another hacky way to convert object back to bool dtype.
df = pd.DataFrame({
'name': ['Bob', 'Sue', 'Tom'],
'age': [45, 40, 10],
'weight': [143.2, 130.2, 34.9],
'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}
df = df.append(row, ignore_index=True)
df['has_children'] = df['has_children'].fillna(False).astype('bool')
Now the new dataframe looks like this :
age has_children name weight
0 45 True Bob 143.2
1 40 True Sue 130.2
2 10 False Tom 34.9
3 12 False Cindy NaN