How to apply pandas.DataFrame.replace on selected columns with inplace = True? - python

import pandas as pd
df = pd.DataFrame({
'col1':[99,99,99],
'col2':[4,5,6],
'col3':[7,None,9]
})
col_list = ['col1','col2']
df[col_list].replace(99,0,inplace=True)
This generates a Warning and leaves the dataframe unchanged.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I want to be able to apply the replace method on a subset of the columns specified by the user. I also want to use inplace = True to avoid making a copy of the dataframe, since it is huge. Any ideas on how this can be accomplished would be appreciated.

When you select the columns for replacement with df[col_list], a slice (a copy) of your dataframe is created. The copy is updated, but never written back into the original dataframe.
You should either replace one column at a time or use a nested dictionary mapping:
df.replace(to_replace={'col1' : {99 : 0}, 'col2' : {99 : 0}},
inplace=True)
The nested dictionary for to_replace can be generated automatically:
d = {col : {99:0} for col in col_list}
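The other option mentioned above, replacing one column at a time, could look like the following minimal sketch. It reuses the question's sample frame, except that col3 contains a 99 here (a change made only for illustration) to show that columns outside col_list stay untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [99, 99, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, 99],  # 99 added for illustration; not in the question
})
col_list = ['col1', 'col2']

for col in col_list:
    # Assigning back to df[col] writes the result into the original frame,
    # so nothing is lost on a temporary copy.
    df[col] = df[col].replace(99, 0)

print(df)
```

Only one column is copied at a time, which may matter for a very large frame.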

You can use replace with loc. Here is a slightly modified version of your sample df:
d = {'col1':[99,99,9],'col2':[99,5,6],'col3':[7,None,99]}
df = pd.DataFrame(data=d)
col_list = ['col1','col2']
df.loc[:, col_list] = df.loc[:, col_list].replace(99,0)
You get
col1 col2 col3
0 0 0 7.0
1 0 5 NaN
2 9 6 99.0
Here is a nice explanation for a similar issue.

Related

Trying to overwrite subset of pandas dataframe but get empty values

I have a dataset, df, with some empty values in its second column, col2.
So I create a new table with the same column names, whose length equals the number of missing values in col2 of df. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means that df[df['col2'].isna()] is now missing everywhere and not only in col2.
Why is that, and how can I fix it?
Assuming that by df2 you really meant a Series (renamed s here):
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it relies on the number of NaNs in df being equal to the length of s. It would be better to know how you create the missing values. With that information, we could probably propose a better and more robust solution.
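One way to avoid relying on positional order, sketched here under the assumption (not stated in the question) that the replacement values can be labelled with the index of the rows they belong to, is to let fillna align on the index:

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [nan, 0, nan]})

# Assumption: each replacement value carries the label of the row it belongs to
# (rows 0 and 2 here).
s = pd.Series([10, 11], index=[0, 2])

# fillna aligns s with df['col2'] on the index, so only the NaN rows are filled.
df['col2'] = df['col2'].fillna(s)

print(df)
```

With this approach a mismatch in length leaves some NaNs unfilled and does not raise a shape error.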

How to insert Pandas dataframe into another Pandas dataframe without wrapping it in a Series?

import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
df2['y'] = [df1]
# df2.loc[:, 'y'].shape == (1,)
# type(df2.iloc[:, 1][0]) == pandas.core.frame.DataFrame
I want to make a df a column in an existing row. However Pandas wraps this df in a Series object so that I cannot access it with dot notation such as df2.y.a to get the value 1. Is there a way to make this not occur or is there some constraint on object type for df elements such that this is impossible?
the desired output is a df like:
x y
0 100 a b
0 1 2
and type(df2.y) == pd.DataFrame
You can combine two DataFrame objects along the columns axis, which I think achieves what you're trying to do. Let me know if this is what you're looking for.
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
a b x
0 1 2 100

Check for numeric value in text column - python

5 columns (col1 - col5) in a 10-column dataframe (df) should be either blank or have text values only. If any row in these 5 columns has an all-numeric value, I need to trigger an error. I wrote the following code to identify rows where the value is all-numeric in 'col1' (I will cycle through all 5 columns using the same code):
df2 = df[df['col1'].str.isnumeric()]
I get the following error: ValueError: cannot mask with array containing NA / NaN values
This is triggered because the blank values create NaNs instead of False. I see this when I created a list instead using the following:
lst = df['col1'].str.isnumeric()
Any suggestions on how to solve this? Thanks
Try this to work around the NaN
import pandas as pd
df = pd.DataFrame([{'col1':1}, {'col1': 'a'}, {'col1': None}])
lst = df['col1'].astype(str).str.isnumeric()
if lst.any():
    raise ValueError()
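Note that astype(str) turns a missing value into the string 'None'. If the column holds only strings and missing values, another option is to keep the values as they are and treat a missing result as "not numeric". The sample values below are hypothetical; this is a sketch, not the only way to do it.

```python
import pandas as pd

# Hypothetical sample: a numeric string, a text string and a blank value.
df = pd.DataFrame({'col1': pd.Series(['12', 'abc', None], dtype=object)})

# .str.isnumeric() gives a missing result for the blank value;
# comparing with True turns that into False, so the mask is purely boolean.
mask = df['col1'].str.isnumeric().eq(True)

numeric_rows = df[mask]
print(numeric_rows)
```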
Here's a way to do it:
import string
df['flag'] = (df
.applymap(lambda x: any(i for i in x if i in string.digits))
.apply(lambda x: f'Fail: {",".join(df.columns[x].tolist())} is numeric', 1))
print(df)
col1 col2 flag
0 a 2.04 Fail: col2 is numeric
1 2.02 b Fail: col1 is numeric
2 c c Fail: is numeric
3 d e Fail: is numeric
Explanation:
We iterate through each value of the dataframe, check whether it contains any digit, and return a boolean value.
We use that boolean value to subset the column names.
Sample Data
df = pd.DataFrame({'col1': ['a','2.02','c','d'],
'col2' : ['2.04','b','c','e']})
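The check above flags any value that contains a digit, and str.isnumeric() is False for a string such as '2.04' because of the decimal point. If "numeric" should mean "parses as a number", a possible sketch using pd.to_numeric on the same sample data is shown below; the names numeric_like and bad_rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', '2.02', 'c', 'd'],
                   'col2': ['2.04', 'b', 'c', 'e']})

# errors='coerce' turns anything that does not parse as a number into NaN,
# so notna() marks the values that do parse.
numeric_like = df.apply(lambda col: pd.to_numeric(col, errors='coerce').notna())

# Rows where at least one of the checked columns holds a numeric-looking value.
bad_rows = df[numeric_like.any(axis=1)]
print(bad_rows)
```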

Pandas mapping all, and a portion, of column value in another column

I am trying to search for values and portions of values from one column to another and return a third value.
Essentially, I have two dataframes: df and df2. The first has a part number in 'col1'. The second has the part number, or portion of it, in 'col1' and the value I want to put in df['col2'] in 'col2'.
import pandas as pd
df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
'2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
'col2': ['A', 'B', 'C', 'D']})
Of course this:
df['col1'].isin(df2['col1'])
Only covers everything that matches, not the portions:
df['col1'].isin(df2['col1'])
Out[27]:
0 True
1 True
2 True
3 False
4 False
5 False
Name: col1, dtype: bool
I tried:
df[df['col1'].str.contains(df2['col1'])]
but get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I also tried using a dictionary made from df2, with the same approaches as above and also by mapping it, with no luck.
The results for df I need would look like this:
col1 col2
'1-1-1' 'A'
'1-1-2' 'B'
'1-1-3' 'C'
'2-1-1' 'D'
'2-1-2' 'D'
'2-1-3' 'D'
I can't figure out how to get the 'D' value into 'col2' because df2['col1'] contains '2-1'--only a portion of the part number.
Any help would be greatly appreciated. Thank you in advance.
We can do str.findall
s=df.col1.str.findall('|'.join(df2.col1.tolist())).str[0].map(df2.set_index('col1').col2)
df['New']=s
df
col1 New
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
If your df and df2 have the specific format shown in the sample, another way is to map with a dict and fill the gaps with fillna, mapping the result of rsplit:
d = dict(df2[['col1', 'col2']].values)
df['col2'] = df.col1.map(d).fillna(df.col1.str.rsplit('-', n=1).str[0].map(d))
Out[1223]:
col1 col2
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
Otherwise, besides using findall as in Wen's solution, you may also use extract with the dict d from above:
df.col1.str.extract('('+'|'.join(df2.col1)+')')[0].map(d)
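If the part numbers could contain characters that are special in regular expressions, a slightly more defensive variant of the extract approach might escape each key and try longer keys first. This is a sketch on the question's sample data; the names keys and pattern are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
                            '2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
                    'col2': ['A', 'B', 'C', 'D']})

d = dict(zip(df2['col1'], df2['col2']))

# Longer keys first, so a full part number wins over a shorter partial key.
keys = sorted(d, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(k) for k in keys) + ')'

# extract returns a DataFrame with one column per capture group; take group 0.
df['col2'] = df['col1'].str.extract(pattern)[0].map(d)
print(df)
```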

Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"

This may be a simple question, but I cannot figure out how to do this. Let's say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.
The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2
You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')
You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')
Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
A DataFrame can be thought of as a collection of Series, hence you can:
Concatenate multiple Series into one data frame (as described here)
Add a Series variable into an existing data frame (example here)
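The two ideas above can be sketched as follows, using the a and b values from the question. The column name 'C' and its value are made up for illustration.

```python
import pandas as pd

a = 2
b = 3

# One single-element Series per variable; the Series name becomes the column name.
s1 = pd.Series([a], name='A')
s2 = pd.Series([b], name='B')

# Concatenate the Series side by side into one DataFrame.
df = pd.concat([s1, s2], axis=1)

# Add another Series to the existing DataFrame (hypothetical column 'C').
df['C'] = pd.Series([a + b])

print(df)
```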
Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas treats your dictionary keys as column names and repeats each scalar value for every entry in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can, however, be more explicit about it:
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0-based, though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier to read for other developers. Pandas has a lot of caveats; don't make other developers have to be experts in all of them in order to read your code.
You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
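To make the effect of orient='index' concrete, here is a small sketch with the a and b values from the question. The column label 'value' is an arbitrary name chosen for illustration.

```python
import pandas as pd

a = 2
b = 3

# With orient='index' the dict keys become row labels, not column names.
df2 = pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index', columns=['value'])
print(df2)

# Transposing gives the one-row layout the question asked for.
one_row = df2.T
print(one_row)
```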
I usually use the following to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(files.items(), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78
You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})
I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)
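The answer above does not show its arrays. One situation where this can arise, assumed here purely for illustration, is zero-dimensional NumPy arrays, which behave like scalars; flatten() turns each into a one-element 1-D array:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: zero-dimensional arrays (shape ()).
array1 = np.array(2.5)
array2 = np.array(7)

data = {
    'b': array1.flatten(),  # shape () becomes shape (1,)
    'a': array2.flatten(),
}
df = pd.DataFrame(data)
print(df)
```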
import pandas as pd
a = 2
b = 3
data = {'A': a, 'B': b}  # named data so the built-in dict is not shadowed
pd.DataFrame(pd.Series(data)).T
# .T transposes the DataFrame
Result:
A B
0 2 3
To make sense of this ValueError, you need to understand what a DataFrame and "scalar values" are.
To create a DataFrame from a dict, at least one array is needed.
IMO, an array is itself indexed.
Therefore, if there is an array-like value, there is no need to specify an index.
E.g. the indices of the elements in ['a', 's', 'd', 'f'] are 0, 1, 2, 3 respectively.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As the output shows, the index of the DataFrame is 0 and 1,
which coincides with the index of the array ['one array is arbitrary length', 'multi arrays should be the same length'].
If you comment out 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
because scalar values (integer, bool, and string) do not have an index.
Note that Index(...) must be called with a collection of some kind.
Since the index is used to locate all the rows of the DataFrame,
it should be an array, e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀
I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448
With from_records, the input does not have to be a list of records; it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2
This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!
If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although an index is not required for a dictionary of lists, the same idea extends to one:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)
Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3
The simplest option is:
import numpy as np
data = {'A': a, 'B': b}
df = pd.DataFrame(data, index=np.arange(1))
Another option is to convert the scalars into lists on the fly using a dictionary comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict whose values are one-element lists, such as:
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}
Convert the dictionary to a DataFrame:
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give new names to the columns:
col_dict_df.columns = ['col1', 'col2']
If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": d.keys(), "value": d.values()})
Just pass the dict in a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])
