I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x    y    xy
1         1
2         2
     4    4
     8    8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # if the blanks are NaN, you will need this line first
df['xy'] = df['x'] + df['y']
Note that your columns are currently strings, not numeric. Convert them first, then sum across each row:
df = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
df['xy'] = df.sum(axis=1)
Alternatively, keep the columns as strings and join them:
df['xy'] = df[['x', 'y']].astype(str).apply(''.join, axis=1)
# df[['x', 'y']].astype(str).apply(''.join, axis=1) gives:
Out[655]:
0 1.0
1 2.0
2
3 4.0
4 8.0
dtype: object
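If you need numbers again afterwards, the joined column can be coerced back with the same pd.to_numeric pattern shown above (blank strings become NaN):

df['xy'] = pd.to_numeric(df['xy'], errors='coerce')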
You can also use NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
'y': [np.nan, np.nan, 4, 8]})
# this assumes exactly one non-NaN value per row, scanned in row order
arr = df.values
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
x y xy
0 1.0 NaN 1
1 2.0 NaN 2
2 NaN 4.0 4
3 NaN 8.0 8
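combine_first also expresses the precedence rule directly; here is a minimal sketch, assuming the frame from the NumPy example above (y wins, x only fills the gaps):

# y takes precedence; x is used only where y is NaN
df['xy'] = df['y'].combine_first(df['x'])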
Related
I have a table:
A    B    C
x    1    NA
y    NA   4
z    2    NA
p    NA   5
t    6    7
I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):
A    B    C    D
x    1    NA   1
y    NA   4    4
z    2    NA   2
p    NA   5    5
t    6    7    error
In case both columns contain a value, it should return the text 'error' inside the cell.
You could first compute a mask of the rows where both values are present, then fill the NA values of, say, column B with the values from column C. Using the mask from the first step, simply assign NA where needed.
error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df
   A     B     C     D
0  x     1  <NA>     1
1  y  <NA>     4     4
2  z     2  <NA>     2
3  p  <NA>     5     5
4  t     6     7  <NA>
Or, to use the string marker instead:
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'
I would suggest against assigning the string 'error' where both values are present, since that would make the whole D column an object dtype.
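As a quick illustration of that side effect (a sketch, assuming a plain numeric column):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 4.0])
print(s.dtype)                          # float64
print(s.mask(s.isna(), 'error').dtype)  # object: one string taints the whole column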
There are several ways to achieve this.
Using fillna and mask
df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna() & df['C'].notna(), 'error')
Or numpy.select:
import numpy as np

m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1 & m2, m1], ['error', df['B']], df['C'])
Output:
A B C D
0 x 1.0 NaN 1.0
1 y NaN 4.0 4.0
2 z 2.0 NaN 2.0
3 p NaN 5.0 5.0
4 t 6.0 7.0 error
Adding to the previous answer, you can address this with a series of .apply() methods paired with lambda functions.
Consider the dataframe that you presented, with np.nan as the NA values:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'B': [1, np.nan, 2, np.nan, 6],
    'C': [np.nan, 4, np.nan, 5, 7]})
First generate a list of the elements from the series in question:
df['D'] = df.apply(lambda x: list(x), axis=1)
This will net you a pd.Series with a list of values as elements, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements by exploiting the fact that np.nan != np.nan in NumPy (see also this answer: How can I remove Nan from list Python/NumPy):
df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
Finally, create the error by filtering based on length.
df['F'] = df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')
The resulting dataframe looks like this:
B C D E F
0 1.0 NaN [1.0, nan] [1.0] 1.0
1 NaN 4.0 [nan, 4.0] [4.0] 4.0
2 2.0 NaN [2.0, nan] [2.0] 2.0
3 NaN 5.0 [nan, 5.0] [5.0] 5.0
4 6.0 7.0 [6.0, 7.0] [6.0, 7.0] error
Of course you could chain all this together in a not-so-pythonic, yet single-line answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')
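The single-line version returns the same Series as column F above, so it can be assigned straight back to the frame:

df['F'] = a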
Have a look at the function combine_first:
df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')
Output:
0 1.0
1 4.0
2 2.0
3 5.0
4 error
Name: C, dtype: object
I've asked a similar question to this and it got no reply, so I thought I'd take a different approach and see if anyone knows how to do this.
First I'll tell you my goal and what I already know:
I am currently cleaning a dataset and need to backward fill some of its NaN values.
Given the example frame reproduced under "Initial Frame" in the answer below, I would like to backward fill the NaN cells in the Z column, filling each NaN from a following row with the same X value, but only in rows whose Y value is 1. The desired outcome is the filled frame shown at the top of the answer.
I already know I can use
df.loc[df['Y'] == 1] = df.loc[:,].bfill(limit=1)
to get it to only fill cells that are matching with a Y value row of 1 (hence the bottom Na cell is not filled).
Here is my question: using the code above, the middle NaN gets filled because the Y value in its row is 1. That is fine for the top cell, where the source cell and the NaN cell share the X value 1, but the middle NaN sits in a row with X = 2 while its source row has X = 3. So, is there a way to fill cells only from rows that share the same X value? (The X values of the source and the NaN must match; if not, nothing should happen.)
Thanks!
We can try with loc + groupby bfill:
df.loc[df['Y'] == 1, 'Z'] = df.groupby('X')['Z'].bfill()
groupby ensures that each group of X values is treated independently, and bfill backfills within each group. The df['Y'] == 1 mask ensures that only rows with a Y value of 1 are updated.
df:
X Y Z
0 1 1 2.0
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Initial Frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 2, 3, 3, 4],
'Y': [1, 2, 1, 1, 2, 1],
'Z': [np.nan, 2, np.nan, 3, np.nan, 4]})
df:
X Y Z
0 1 1 NaN
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Edit: to bfill all columns except X and Y, use:
df.loc[df['Y'] == 1, df.columns.difference(['X', 'Y'])] = df.groupby('X').bfill()
Try using shift, requiring that the next row (the one supplying the backfill) shares the same X value:
df.loc[df['Y'].eq(1) & df['X'].shift(-1).eq(df['X']), 'Z'] = df['Z'].bfill(limit=1)
I am new to Python; I am attempting what would be a conditional mutate in R's dplyr.
In short, I would like to create a new column in the dataframe called Result where: if df['Test'] is greater than 1, df['Result'] equals the respective df['Count'] for that row; if it is lower than 1, then df['Result'] is df['Count'] * df['Test'].
I have tried df['Result'] = df['Test'].apply(lambda x: df['Count'] if x >= 1 else ...), but unfortunately this returns a whole Series for each row. I have also attempted to write small functions, which also return a Series.
I would like the final Dataframe to look like this...
no_ Test Count Result
1 2 1 1
2 3 5 5
3 4 1 1
4 6 2 2
5 0.5 2 1
You can use np.where:
import numpy as np

df['Result'] = np.where(df['Test'] > 1, df['Count'], df['Count'] * df['Test'])
Output:
No_ Test Count Result
0 1 2.0 1 1.0
1 2 3.0 5 5.0
2 3 4.0 1 1.0
3 4 6.0 2 2.0
4 5 0.5 2 1.0
You can work it out with a list comprehension:
df['Result'] = [df['Count'][i] if df['Test'][i] > 1 else
                df['Count'][i] * df['Test'][i]
                for i in range(df.shape[0])]
Here is a way to do this:
import pandas as pd
df = pd.DataFrame(columns = ['Test', 'Count'],
data={'Test':[2, 3, 4, 6, 0.5], 'Count':[1, 5, 1, 2, 2]})
df['Result'] = df['Count']
df.loc[df['Test'] < 1, 'Result'] = df['Test'] * df['Count']
Output:
Test Count Result
0 2.0 1 1.0
1 3.0 5 5.0
2 4.0 1 1.0
3 6.0 2 2.0
4 0.5 2 1.0
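Another vectorized option (a sketch, using the same Test/Count column names as above) is Series.mask, which keeps Count and overwrites it only where Test is below 1:

df['Result'] = df['Count'].mask(df['Test'] < 1, df['Test'] * df['Count'])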
I have a dataframe as follows
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values i.e. np.nan in generic way. For this I have created a function as follows
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col) > 0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum() / len(df_nu[i]) > 0.10:
                df_final_nu = df_nu.drop([i], axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(), inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
   A     B  C
0  1  1.45  4
1  2  2.33  5
2  3   NaN  6
So it has correctly removed column D, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Still, I might be missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3,5,7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
'C': [4, 5, 6,8,7], 'D': [4.55, 7.36, np.nan,9,10],
'E':list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # get only the numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to remove, using mean instead of sum/len (it is the same thing)
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill the missing values in the original columns, then drop those above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
I would recommend looking at the sklearn Imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way': for example, filling in missing values with the median of the relevant column.
You could use it as such:
from sklearn.preprocessing import Imputer  # in newer scikit-learn this is sklearn.impute.SimpleImputer

imputer = Imputer(strategy='median')
num_df = df.values
names = df.columns.values
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make you could consider making a transformation Pipeline or could even make your own transformers to do bespoke tasks.
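As a sketch of what such a pipeline might look like (using the newer SimpleImputer, with a StandardScaler as a stand-in for your own follow-up steps):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# impute medians first, then apply any further transformations
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
df_final = pd.DataFrame(num_pipeline.fit_transform(df), columns=df.columns)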
Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the length of the array is not the same for all of them.
How can I create a dataframe where each column holds a different entry?
When I try:
pd.DataFrame(my_dict)
I get:
ValueError: arrays must all be the same length
Any way to overcome this? I am happy to have Pandas use NaN to pad those columns for the shorter entries.
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x:
replace d.items() with d.iteritems().
Here's a simple way to do that:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
A way of tidying up your syntax, while still doing essentially the same thing as these other answers, is below:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists, too:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Another syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may additionally have to transpose the result and/or change the column data types (float, integer, etc).
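For instance, a sketch of that cleanup applied to list_df from the first example (Int64, with a capital I, is pandas' nullable integer dtype, so the padding survives as <NA>):

# one column per original list, integers restored, padding kept as <NA>
tidy = list_df.T.astype('Int64')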
Use pandas.DataFrame and pandas.concat
The following code will create a list of DataFrames with pandas.DataFrame, from a dict of uneven arrays, and then concat the arrays together in a list-comprehension.
This is a way to create a DataFrame of arrays, that are not equal in length.
For equal length arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
import pandas as pd
import numpy as np
# create the uneven 1-D arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10)
x2 = mu + sigma * np.random.randn(15)
x3 = mu + sigma * np.random.randn(20)
data = {'x1': x1, 'x2': x2, 'x3': x3}
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
Use pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills missing values with the fillvalue.
The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
plot
df.plot(marker='o', figsize=[10, 5])
dataframe
x1 x2 x3
0 232.06900 235.92577 173.19476
1 176.94349 209.26802 186.09590
2 194.18474 168.36006 194.36712
3 196.55705 238.79899 218.33316
4 249.25695 167.91326 191.62559
5 215.25377 214.85430 230.95119
6 232.68784 240.30358 196.72593
7 212.43409 201.15896 187.96484
8 188.97014 187.59007 164.78436
9 196.82937 252.67682 196.47132
10 NaN 223.32571 208.43823
11 NaN 209.50658 209.83761
12 NaN 215.27461 249.06087
13 NaN 210.52486 158.65781
14 NaN 193.53504 199.10456
15 NaN NaN 186.19700
16 NaN NaN 223.02479
17 NaN NaN 185.68525
18 NaN NaN 213.41414
19 NaN NaN 271.75376
While this does not directly answer the OP's question, I found it to be an excellent solution for my case of unequal arrays, and I'd like to share it. From the pandas documentation:
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Both of the following lines work perfectly:
pd.DataFrame.from_dict(df, orient='index').transpose() #A
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)
But with %timeit in Jupyter, I got a ratio of 4x speed for B vs A, which is quite impressive, especially when working with a huge data set (mainly one with a big number of columns/features).
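For reference, a sketch of how that comparison might be run in a Jupyter cell (df being the dict of uneven arrays, as in the two lines above):

%timeit pd.DataFrame.from_dict(df, orient='index').transpose()           # A
%timeit pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ]))   # B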
If you don't want it to show NaN and your lists have two particular lengths, padding each remaining cell of the shorter one with a 'space' also works.
import pandas as pd

long = [6, 4, 7, 3]
short = [5, 6]

# pad the short list with blanks until the lengths match
for n in range(len(long) - len(short)):
    short.append(' ')

df = pd.DataFrame({'A': long, 'B': short})

# write the result to an Excel file in the working directory
datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
df.to_excel(datatoexcel, sheet_name='Sheet1')
datatoexcel.save()  # in pandas >= 2.0, use datatoexcel.close()
A B
0 6 5
1 4 6
2 7
3 3
If you have entries of more than two lengths, it is advisable to write a function that uses a similar method, as sketched below.
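For example, a sketch of such a helper (a hypothetical pad_lists that pads every list to the longest length):

import pandas as pd

def pad_lists(columns, fill=' '):
    # hypothetical helper: pad every list to the length of the longest one
    longest = max(len(col) for col in columns.values())
    return {name: col + [fill] * (longest - len(col))
            for name, col in columns.items()}

df = pd.DataFrame(pad_lists({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [1]}))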