Python Pandas Replace Values with NAN from Tuple - python

Got the Following Dataframe:
A B
Temp1 1
Temp2 2
NaN NaN
NaN 4
Since A and B are correlated, I was able to create a new column C where I calculated the missing values of A and B and stored them as a tuple:
A B C
Temp1 1 (1,Temp1)
Temp2 2 (2, Temp2)
NaN NaN (3, Temp3)
NaN 4 (4, Temp4)
Now I need to drop column C and fill the NaN values in the corresponding columns.
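For reference, a minimal sketch of how this example frame might be constructed (the column names and the (B, A) tuple order are assumptions taken from the sample above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Temp1', 'Temp2', np.nan, np.nan],
                   'B': [1, 2, np.nan, 4]})
# C holds (B, A) pairs, e.g. derived from the relationship between A and B
df['C'] = [(1, 'Temp1'), (2, 'Temp2'), (3, 'Temp3'), (4, 'Temp4')]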

Use Series.fillna and select the values inside the tuples with the str accessor, then remove column C:
#if values are not in tuples
#df.C = df.C.str.strip('()').str.split(',').apply(tuple)
df.A = df.A.fillna(df.C.str[1])
df.B = df.B.fillna(df.C.str[0])
df = df.drop('C', axis=1)
print (df)
A B
0 Temp1 1
1 Temp2 2
2 Temp3 3
3 Temp4 4
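The .str[position] accessor works element-wise on tuple- and list-valued entries, not only on strings, which is what the lines above rely on; a quick illustrative sketch:
s = pd.Series([(1, 'Temp1'), (2, 'Temp2')])
print(s.str[0])   # 1, 2
print(s.str[1])   # Temp1, Temp2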
Or create a DataFrame from C, using DataFrame.pop so the column is extracted and removed in one step, set the new column names, and pass it to DataFrame.fillna:
#if values are not in tuples
#df.C = df.C.str.strip('()').str.split(',').apply(tuple)
df[['A','B']] = df[['A','B']].fillna(pd.DataFrame(df.pop('C').tolist(), columns=['B','A']))
print (df)
A B
0 Temp1 1
1 Temp2 2
2 Temp3 3
3 Temp4 4
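To inspect the helper frame that gets passed to DataFrame.fillna, you can build it separately before the pop (a sketch on the same assumed data; the alignment works because df uses a default RangeIndex):
helper = pd.DataFrame(df['C'].tolist(), columns=['B', 'A'])
print(helper)
# fillna then aligns on index and column labels, so only the
# missing cells of A and B are filled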

Related

pandas matching database with string keeping index of database

I have a database with strings and the index as below.
df0
idx name_id_code string_line_0
0 0.01 A
1 0.5 B
2 77.6 C
3 29.8 D
4 56.2 E
5 88.1000005 F
6 66.4000008 G
7 2.1 H
8 99 I
9 550.9999999 J
df1
idx string_line_1
0 A
1 F
2 J
3 G
4 D
Now, I want to match df1 with df0, taking the values where df1 equals df0 but keeping the index of df0, as below
df_result name_id_code string_line_0
0 0.01 A
5 88.1000005 F
9 550.9999999 J
6 66.4000008 G
3 29.8 D
I tried the code below, but it didn't work for the strings and only matched on the index
c = df0['name_id_code'] + ' (' + df0['string_line_0'].astype(str) + ')'
out = df1[df2['string_line_1'].isin(s)]
I also tried to keep simple just last column match like
c = df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(s)]
but blank output.
Because df0 is only filtered, its index values are unchanged if you use Series.isin with df1['string_line_1']; only the row order stays as in the original df0:
out = df0[df0['string_line_0'].isin(df1['string_line_1'])]
print (out)
name_id_code string_line_0
idx
0 0.010000 A
3 29.800000 D
5 88.100001 F
6 66.400001 G
9 551.000000 J
Or, if you use DataFrame.merge, it is necessary to add DataFrame.reset_index first to avoid losing df0's index:
out = (df1.rename(columns={'string_line_1':'string_line_0'})
.merge(df0.reset_index(), on='string_line_0'))
print (out)
string_line_0 idx name_id_code
0 A 0 0.010000
1 F 5 88.100001
2 J 9 551.000000
3 G 6 66.400001
4 D 3 29.800000
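If you also want idx back as the index (the goal was to keep df0's index), a small follow-up sketch:
out = (df1.rename(columns={'string_line_1':'string_line_0'})
          .merge(df0.reset_index(), on='string_line_0')
          .set_index('idx'))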
A similar solution, only here the same values appear in both the string_line_0 and string_line_1 columns:
out = (df1.merge(df0.reset_index(), left_on='string_line_1', right_on='string_line_0'))
print (out)
string_line_1 idx name_id_code string_line_0
0 A 0 0.010000 A
1 F 5 88.100001 F
2 J 9 551.000000 J
3 G 6 66.400001 G
4 D 3 29.800000 D
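Since this variant keeps both string columns with identical values, you may want to drop the redundant one afterwards, for example:
out = out.drop(columns='string_line_1')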
You can do:
out = df0.loc[(df0["string_line_0"].isin(df1["string_line_1"]))].copy()
out["string_line_0"] = pd.Categorical(out["string_line_0"], categories=df1["string_line_1"].unique())
out.sort_values(by=["string_line_0"], inplace=True)
The first line filters df0 to just the rows where string_line_0 is in the string_line_1 column of df1.
The second line converts string_line_0 in the output df to a Categorical feature, which is then custom sorted by the order of the values in df1.
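Note that string_line_0 is now a Categorical column; if you prefer plain strings once the sorting is done, a small follow-up:
out["string_line_0"] = out["string_line_0"].astype(str)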

How to split a pandas dataframe of different column sizes into separate dataframes?

I have a large pandas dataframe, consisting of a different number of columns throughout the dataframe.
Here is an example (image of the current dataframe in the original post).
I would like to split the dataframe into multiple dataframes, based on the number of columns it has.
Example output (image in the original post).
Thanks.
If you have a dataframe of, say, 10 columns and you want to put the records with 3 NaN values into a different result dataframe than those with 1 NaN, you can do this as follows:
# evaluate the number of NaNs per row
num_counts = df.isna().sum(axis='columns')

# group by this number and add each grouped
# dataframe to a dictionary
results = dict()
for key, sub_df in df.groupby(num_counts):
    results[key] = sub_df
After executing this code, results contains subsets of df where each subset contains the same number of NaNs (so the same number of non-NaNs).
If you want to write your results to an Excel file, you just need to execute the following code:
with pd.ExcelWriter('sorted_output.xlsx') as writer:
    for key, sub_df in results.items():
        # if you want to avoid the detour of using dictionaries,
        # just replace the previous line by
        # for key, sub_df in df.groupby(num_counts):
        sub_df.to_excel(
            writer,
            sheet_name=f'missing {key}',
            na_rep='',
            inf_rep='inf',
            float_format=None,
            index=True,
            index_label=True,
            header=True)
Example:
# create an example dataframe
df=pd.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=list('abbcac')))
df.loc[[2, 4, 5], 'c']= list('xyz')
df.loc[[2, 3, 4], 'd']= list('vxw')
df.loc[[1, 2], 'e']= list('qw')
It looks like this:
Out[58]:
a b c d e
0 1 a NaN NaN NaN
1 2 b NaN NaN q
2 3 b x v w
3 4 c NaN x NaN
4 5 a y w NaN
5 6 c z NaN NaN
If you execute the code above on this dataframe, you get a dictionary with the following content:
0: a b c d e
2 3 b x v w
1: a b c d e
4 5 a y w NaN
2: a b c d e
1 2 b NaN NaN q
3 4 c NaN x NaN
5 6 c z NaN NaN
3: a b c d e
0 1 a NaN NaN NaN
The keys of the dictionary are the number of NaNs in the row and the values are the dataframes which contain only rows with that number of NaNs in them.
If I'm getting you right, what you want to do is to split an existing dataframe with n columns into ceil(n/5) dataframes, each with 5 columns, where the last one gets the remainder of n/5.
If that's the case this will do the trick:
import pandas as pd
import math
max_cols=5
dt={"a": [1,2,3], "b": [6,5,3], "c": [8,4,2], "d": [8,4,0], "e": [1,9,5], "f": [9,7,9]}
df=pd.DataFrame(data=dt)
dfs=[df[df.columns[max_cols*i:max_cols*i+max_cols]] for i in range(math.ceil(len(df.columns)/max_cols))]
for el in dfs:
    print(el)
And output:
a b c d e
0 1 6 8 8 1
1 2 5 4 4 9
2 3 3 2 0 5
f
0 9
1 7
2 9
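An equivalent way to build the same column chunks is to step through the columns positionally with iloc (a sketch under the same max_cols assumption):
dfs = [df.iloc[:, i:i + max_cols] for i in range(0, df.shape[1], max_cols)]
for el in dfs:
    print(el)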

Map only first occurrence of key/value match in dataframe

Is it possible to map only the first occurrence of key in a dataframe?
Ex:
testDict = {'A': 1, 'B': 2}
df
Name Num
A
A
B
B
Expected output
Name Num
A 1
A
B 2
B
Use duplicated to flag the first occurrence of each name and then map:
df['Num'] = df.Name[~df.Name.duplicated()].map(testDict)
print(df)
Output
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
To remove the NaN values, if you wish, do:
df = df.fillna('')
map onto the drop_duplicates result, assuming you have a unique index for alignment. (It is probably best to keep the NaN so the column remains numeric.)
df['Num'] = df['Name'].drop_duplicates().map(testDict)
Name Num
0 A 1.0
1 A NaN
2 B 2.0
3 B NaN
You can use duplicated and map:
df['Num'] = np.where(~df['Name'].duplicated(), df['Name'].map(testDict), '')
Output:
Name Num
0 A 1
1 A
2 B 2
3 B
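Another variant, as a sketch, keeps Num numeric by mapping every row and then masking all but the first occurrence with Series.where:
df['Num'] = df['Name'].map(testDict).where(~df['Name'].duplicated())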

find indices of rows containing NaN

In a pandas dataframe matrix, I would like to find the rows (indices) containing NaN.
for finding NaN in columns I would do
idx_nan = matrix.columns[np.isnan(matrix).any(axis=1)]
but it doesn't work with matrix.rows
What is the equivalent for finding items in rows?
I think you need DataFrame.isnull with any and boolean indexing:
print (df[df.isnull().any(axis=1)].index)
Sample:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[np.nan,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 NaN 1 5 7
1 2 5 8.0 3 3 4
2 3 6 9.0 5 6 3
print (df[df.isnull().any(axis=1)].index)
Int64Index([0], dtype='int64')
Other solutions:
idx_nan = df[np.isnan(df).any(axis=1)].index
print (idx_nan)
Int64Index([0], dtype='int64')
idx_nan = df.index[np.isnan(df).any(axis=1)]
print (idx_nan)
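Note that np.isnan only works on purely numeric data; DataFrame.isnull also handles object columns. A quick sketch with assumed mixed-type data:
df_mixed = pd.DataFrame({'name': ['x', None, 'z'],
                         'val': [1.0, 2.0, np.nan]})
print(df_mixed.index[df_mixed.isnull().any(axis=1)])  # rows 1 and 2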

Apply Across Dynamic Number of Columns

I have a pandas dataframe and I want to set the last N columns to null values. N depends on the value in another column.
Here is an example:
df = pd.DataFrame(np.random.randn(4, 5))
df['lookup_key'] = df.index #(actual data does not use index here)
lkup_dict = {0:1,1:2,2:2,3:3}
In this DataFrame, I want to use the value in the 'lookup_key' column to determine which columns to set to null.
Row 0 -> df.ix[0,lkup_dict[0]:4] = np.nan #key = 0, value = 1
Row 1 -> df.ix[1,lkup_dict[1]:4] = np.nan #key = 1, value = 2
Row 2 -> df.ix[2,lkup_dict[2]:4] = np.nan #key = 2, value = 2
Row 3 -> df.ix[3,lkup_dict[3]:4] = np.nan #key = 3, value = 3
The end result looking like this:
0 1 2 3 4 lookup_key
0 -0.882864 NaN NaN NaN NaN 0
1 1.358663 -0.024898 NaN NaN NaN 1
2 0.885058 0.673621 NaN NaN NaN 2
3 -1.487506 0.031021 -1.313646 NaN NaN 3
In this example I have to manually type out the df.ix... for each row. I need something that will do this for all rows of my DataFrame
You can do this with a for loop. To demonstrate, I generate a DataFrame with some random values. I then insert a lookup_key column in the front with some random integers. Finally, I generate lkup_dict dictionary with some random values.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
>>> df.insert(0, 'lookup_key', np.random.randint(0, 5, 10))
>>> print(df)
lookup_key A B C D
0 0 0.048738 0.773304 -0.912366 -0.832459
1 3 -0.573221 -1.381395 -0.644223 1.888484
2 0 0.198043 -0.751243 0.138277 2.006188
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 -0.847591 -0.368447 0.510250 -0.172055
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 0.027285 0.177276 1.087456 -0.420614
7 4 -1.147004 -0.172367 -0.767347 -0.855318
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 1.397271
>>> lkup_dict = {i: np.random.randint(0, 5) for i in range(5)}
>>> print(lkup_dict)
{0: 3, 1: 0, 2: 0, 3: 4, 4: 1}
Now I iterate over the rows in the DataFrame. key gets the value under the lookup_key column for that row. nNulls uses the key to get the number of null values from lkup_dict. startIndex gets the index for the first column with a null value in that row. The final line replaces the relevant values with null values.
>>> for i, row in df.iterrows():
...     key = int(row['lookup_key'])
...     nNulls = lkup_dict[key]
...     startIndex = df.shape[1] - nNulls
...     # use positional indexing for the integer column slice
...     df.iloc[i, startIndex:] = np.nan
>>> print(df)
lookup_key A B C D
0 0 0.048738 NaN NaN NaN
1 3 NaN NaN NaN NaN
2 0 0.198043 NaN NaN NaN
3 2 -1.692605 -1.586282 -0.656690 0.647510
4 3 NaN NaN NaN NaN
5 1 0.927243 -0.447478 0.796221 0.372763
6 3 NaN NaN NaN NaN
7 4 -1.147004 -0.172367 -0.767347 NaN
8 1 -0.649695 -0.572409 -0.664149 0.863050
9 4 -0.820982 -0.499889 -0.624889 NaN
That's it. Hopefully that's what you're looking for.
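If the row loop is too slow on a large frame, a vectorized sketch of the same idea (assuming, as above, that lkup_dict maps lookup_key to the number of trailing columns to blank out):
value_cols = [c for c in df.columns if c != 'lookup_key']
n_nulls = df['lookup_key'].map(lkup_dict).to_numpy()
# boolean grid: True where a column position falls in the trailing n_nulls slots
mask = np.arange(len(value_cols)) >= (len(value_cols) - n_nulls)[:, None]
df[value_cols] = df[value_cols].mask(mask)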
