I have a dataset, df, with some empty values in the second column, col2.
So I create a new dataframe with the same column names, whose length equals the number of missing values in col2 of df. I call this new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means that df[df['col1'].isna()] now selects rows that are missing everywhere, not only in col2.
Why is that, and how can I fix it?
This happens because assignment with df[mask] = df2 aligns df2 with df on index labels and column names, not on the positions selected by the mask; since df2 has a fresh 0..n-1 index that generally doesn't match the labels of the masked rows, the aligned values come out as NaN. You can bypass alignment by assigning the raw values. Assuming that by df2 you really meant a Series, so renaming it as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it relies on the number of NaNs in df['col2'] being exactly the length of s, in the right order. It would be better to know how you create the missing values; with that information, we could probably propose a better and more robust solution.
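For instance, reusing the example above: if the replacement values come as a Series labelled with the index positions they belong to (a sketch, with hypothetical values), fillna aligns by label and no longer depends on the count or order of the NaNs:
repl = pd.Series({0: 10, 2: 11})  # replacement values, labelled with df's index
df['col2'] = df['col2'].fillna(repl)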
Related
Suppose I have a data frame with three columns with dtypes float, int, and object:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
I need to replace the np.nan with None, which is an object (since None becomes NULL when imported into PostgreSQL).
df.replace({np.nan: None}, inplace=True)
I think (correct me if I'm wrong) None cannot be used in any NumPy/Pandas array except for arrays with dtype object. And so 'col1' above becomes an object column after replace. Now, if I wanted to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), which returns all object dtype columns, including 'col1'. I've been working around this by using this hacky solution:
# Select only the object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack: after this, 'col1' becomes float again, since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))
I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
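One less hacky possibility (a sketch, assuming pandas >= 1.0, where convert_dtypes and the nullable string dtype exist): let pandas re-infer the dtypes, then select the string columns unambiguously:
# 'col1' is re-inferred as a nullable numeric column, 'col3' as 'string'
inferred = df.convert_dtypes()
string_cols = inferred.select_dtypes(include='string').columns
df[string_cols]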
Another solution
Based on Mayank Porwal's solution below:
# The list comprehension returns a boolean list
df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns.tolist()]]
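Unpacked into a small helper for readability (same logic; the function name is made up). Note that it classifies a column of numeric strings as numeric, since those coerce successfully:
def is_string_col(s):
    # A column is "all strings" if coercing it to numeric yields only NaN
    return pd.to_numeric(s, errors='coerce').isna().all()

df.loc[:, [is_string_col(df[col]) for col in df.columns]]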
Based on your sample df, you can do something like this:
After replacing np.nan with None, col1 becomes an object column:
In [1413]: df.dtypes
Out[1413]:
col1 object
col2 int64
col3 object
dtype: object
To pick the columns which contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the column becomes all NaN using isna:
In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']
In [1424]: for i in cols:
...: if pd.to_numeric(df[i], errors='coerce').isna().all():
...: print(f'{i}: String col')
...: else:
...: print(f'{i}: number col')
...:
col1: number col
col3: String col
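To then subset the dataframe to just the string columns, the same check can be gathered into a list comprehension (a sketch):
string_cols = [c for c in cols if pd.to_numeric(df[c], errors='coerce').isna().all()]
df[string_cols]  # just 'col3'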
Reverse your 2 operations:
Extract the object columns and process them.
Convert NaN to None just before exporting to PostgreSQL.
>>> df.dtypes
col1 float64
col2 int64
col3 object
dtype: object
# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))
# Step 2: replace nan by None
>>> df.replace({np.nan: None}, inplace=True)
>>> df
col1 col2 col3
0 1.0 3 THIS IS A TEXT COLUMN
1 2.0 4 THIS IS A TEXT COLUMN
2 None 5 THIS IS A TEXT COLUMN
3 5.0 4 THIS IS A TEXT COLUMN
5 columns (col1 - col5) in a 10-column dataframe (df) should be either blank or have text values only. If any row in these 5 columns has an all-numeric value, I need to trigger an error. I wrote the following code to identify rows where the value is all-numeric in 'col1' (I will cycle through all 5 columns using the same code):
df2 = df[df['col1'].str.isnumeric()]
I get the following error: ValueError: cannot mask with array containing NA / NaN values
This is triggered because the blank values produce NaN instead of False. I can see this when I create a Series instead, using the following:
lst = df['col1'].str.isnumeric()
Any suggestions on how to solve this? Thanks
Try this to work around the NaN:
import pandas as pd
df = pd.DataFrame([{'col1':1}, {'col1': 'a'}, {'col1': None}])
# astype(str) turns missing values into the string 'None' (or 'nan'),
# which is not numeric, so isnumeric no longer returns NaN
lst = df['col1'].astype(str).str.isnumeric()
if lst.any():
    raise ValueError()
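Alternatively, if the column holds only strings and NaN (as it usually does when the data comes from a text file), you can keep str.isnumeric and fill the missing results with False before masking (a sketch; note that .str methods also return NaN for non-string values such as the integer 1 above):
# str.isnumeric returns NaN for missing values; treat those as "not numeric"
mask = df['col1'].str.isnumeric().fillna(False)
df2 = df[mask]  # no longer raises "cannot mask with array containing NA / NaN values"
if mask.any():
    raise ValueError("col1 contains all-numeric values")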
Here's a way to do it:
import string

# Flag, per row, every column whose value contains at least one digit
df['flag'] = (df
              .applymap(lambda x: any(i in string.digits for i in x))
              .apply(lambda x: f'Fail: {",".join(df.columns[x].tolist())} is numeric', axis=1))
print(df)
col1 col2 flag
0 a 2.04 Fail: col2 is numeric
1 2.02 b Fail: col1 is numeric
2 c c Fail: is numeric
3 d e Fail: is numeric
Explanation:
We iterate through each value of the dataframe, check whether it contains a digit, and return a boolean.
We use those booleans to subset the column names for the failure message.
Sample Data
df = pd.DataFrame({'col1': ['a','2.02','c','d'],
'col2' : ['2.04','b','c','e']})
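To tie this back to the original requirement (trigger an error if any value in the text-only columns is all-numeric), a sketch along the same lines; the column list here is hypothetical:
check_cols = ['col1', 'col2']  # in the original question: col1 - col5
# pd.to_numeric leaves blanks/NaN as NaN, so only true numeric values are flagged
numeric_mask = df[check_cols].apply(lambda s: pd.to_numeric(s, errors='coerce').notna())
if numeric_mask.any().any():
    bad = numeric_mask.any()
    raise ValueError(f"all-numeric values found in: {', '.join(bad.index[bad])}")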
What's the most pythonic way to drop the columns of a dataframe whose header is NaN? Preferably inplace.
There may or may not be data in the column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [4, 5, 6], np.nan: [7, np.nan, 9]})
df.dropna(axis='columns', inplace=True)
Doesn't do it, as it looks at the data in the columns rather than at the headers.
Wanted output
df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [4, 5, 6]})
Thanks in advance for the replies.
Simply try this:
df.drop(np.nan, axis=1, inplace=True)
However, if 'no header' includes None, then jpp's answer will work perfectly in one shot. But in case you have more than one np.nan header, I don't know how to make df.drop work.
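For what it's worth, one way that does handle several NaN headers at once (a sketch) is to collect all the NaN labels and pass them to drop, or to mask the columns by label:
# Drop every column whose header is NaN, in place
df.drop(columns=df.columns[df.columns.isna()], inplace=True)
# Or, without inplace:
# df = df.loc[:, df.columns.notna()]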
You can use pd.Index.dropna:
df = df[df.columns.dropna()]
print(df)
col1 col2
0 1.0 4
1 2.0 5
2 NaN 6
I have a DataFrame (df1) with dimensions 2000 rows x 500 columns (excluding the index), and I want to divide each of its rows by another DataFrame (df2) with dimensions 1 row x 500 columns. Both have the same column headers. I tried:
df1.divide(df2) and
df1.divide(df2, axis='index') and multiple other solutions, and I always get a DataFrame with NaN values in every cell. What argument am I missing in df.divide?
In df.divide(df2, axis='index'), you need to provide the row of df2 as a Series (e.g. df2.iloc[0]), so that the division aligns on the column labels:
import pandas as pd
data1 = {"a":[1.,3.,5.,2.],
"b":[4.,8.,3.,7.],
"c":[5.,45.,67.,34]}
data2 = {"a":[4.],
"b":[2.],
"c":[11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
Or you can use df1 / df2.values[0, :], which divides by the underlying NumPy row and so matches positionally instead of by label.
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
A B
0 0.2 0.2
1 NaN NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
A B
0 0.2 0.2
1 0.6 0.4
Small clarification, just in case: the reason you got NaN everywhere, while Andy's first example (df.div(df2)) works for the first line, is that div tries to match indexes (and columns). In Andy's example, index 0 is found in both dataframes, so the division is made; index 1 is not, so a line of NaN is added. This behavior becomes even more obvious if you run the following (only the 't' line is divided):
import numpy as np

df_a = pd.DataFrame(np.random.rand(3, 5), index=['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2, 5), index=['z', 't'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice the first row, you get a Series whose index is composed of the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3, 5), index=['x', 'y', 't'], columns=range(5))
df_b = pd.DataFrame(np.random.rand(2, 5), index=['z', 't'], columns=[1, 2, 3, 4, 5])
df_a.div(df_b)
If you want to divide each row of a column by a specific value, you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this code divided each row of 'column_name' by 10,000.
To divide a single row (across one or more columns) by a value, you can do the following, where 'index_value' is the label of the row:
df.loc['index_value'] = df.loc['index_value'].div(10000)
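For example, with the df1 defined earlier in this thread (integer index), dividing only the row labelled 2:
df1.loc[2] = df1.loc[2].div(10000)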