I am attempting to construct dataframes using large amount of data stored in txt files. I did not construct the data, however, so I am having to work with the frustrating formatting contained within. I couldn't get my code to work within the large data (and almost crashed my computer doing so), so set up a smaller dataframe like so:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311;21233332 ROB21288 99421
4 12412421 POW94277 12231;33221
5 54221721 IRS21231;YOU28137 13123
My frustration lies in the use of semicolons in the data. The data is meant to represent IDs, but multiple IDs have been assigned to multiple variables. I want to repeat these rows so that I can search through the data for individual IDs and have a datatable that looks like so:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311 ROB21288 99421
4 21233332 ROB21288 99421
5 12412421 POW94277 12231
6 12412421 POW94277 33221
7 54221721 YOU28137 13123
8 54221721 IRS21231 13123
Reindexing is not a problem, so long as the different IDs stay linked to each other and to their correct values.
Unfortunately, all my attempts to split the data have, so far, ended in abject failure. I have managed to set up a function that repeats data containing a semicolon, parse that through my function for each column, but then fail to split the data afterwards.
def delete_dup(df,column):
for a in column:
location = df.loc[df.duplicated(subset= column, keep=False)]
for x in location:
semicolon = df.loc[df[column].str.contains(';', regex=True)]
duplicate = semicolon.duplicated(subset= column, keep='first')
tiny_df = semicolon.loc[duplicate]
split_up = tiny_df[column].str.split(';')
return pd.concat([df, split_up])
'Value' ID_1 ID_2 0
11122222 ABC42123 33333 NaN
21219299 YOF21233 88821 NaN
00022011 ERE00091 23124 NaN
75643311;21233332 ROB21288 99421 NaN
12412421 POW94277 12231;33221 NaN
54221721 IRS21231;YOU28137 13123 NaN
75643311;21233332 ROB21288 99421 NaN
54221721 IRS21231;YOU28137 13123 NaN
12412421 POW94277 12231;33221 NaN
NaN NaN NaN [75643311, 21233332]
I feel like this is the closest I've come and it's still nowhere near what I want. Any "If" statements I try to do on dataframes are met with the "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." error, which is so frustrating to read. Any ideas on how to make pandas do what I want?
There are two parts to the solution. The first is to identify which rows have the semicolon, and the second it to create additional rows and concatenate them. The first part is done in contains_sc, and the second part is done by iterating over the rows and running the function create_additional_rows when a row with a semicolon is detected.
Hope this helps.
In[6]: import pandas as pd
In[7]: df = pd.DataFrame(
[['1', '2;3', '4', '5'],
['A', 'B', 'C', 'D;E'],
['T', 'U', 'V;W', 'X']],
index=['Val', 'ID1', 'ID2']
).T
In[8]: df
Out[8]:
Val ID1 ID2
0 1 A T
1 2;3 B U
2 4 C V;W
3 5 D;E X
In[9]: contains_sc = df.apply(lambda x: x.str.contains(';'))
In[10]: contains_sc
Out[10]:
Val ID1 ID2
0 False False False
1 True False False
2 False False True
3 False True False
In[11]:
def create_additional_rows(data_row, csc_row, split_char=';'):
"""Given a duplicated row return additional de-duplicated rows."""
if len(csc_row[csc_row].dropna()) > 1:
raise ValueError('Expect only a single column with a semicolon')
col_with_sc = csc_row[csc_row].dropna().index[0]
retval = []
for item in data_row.loc[col_with_sc].split(split_char):
copied = data_row.copy()
copied.loc[col_with_sc] = item
retval.append(copied)
return retval
In[11]:
new_rows = []
for (idx, data_row), (_, csc_row) in zip(df.iterrows(), contains_sc.iterrows()):
if True not in csc_row.values:
new_rows.append(data_row)
continue
new_rows.extend(create_additional_rows(data_row, csc_row))
final = pd.concat(new_rows, axis='columns').T.reset_index(drop=True)
In[13]: final
Out[13]:
Val ID1 ID2
0 1 A T
1 2 B U
2 3 B U
3 4 C V
4 4 C W
5 5 D X
6 5 E X
Perhaps not the most elegant way, but this just solves the problem:
Step 1
Data we have:
df
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311;21233332 ROB21288 99421
4 12412421 POW94277 12231;33221
5 54221721 IRS21231;YOU28137 13123
Step 2
Let's split misbehaving columns
df["'Value'_Dupe"] = df["'Value'"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["'Value'"] = df["'Value'"].apply(lambda x: x.split(";")[0])
df["ID_1_Dupe"] = df["ID_1"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_1"] = df["ID_1"].apply(lambda x: x.split(";")[0])
df["ID_2_Dupe"] = df["ID_2"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_2"] = df["ID_2"].apply(lambda x: x.split(";")[0])
df
'Value' ID_1 ID_2 'Value'_Dupe ID_1_Dupe ID_2_Dupe
0 11122222 ABC42123 33333 NaN NaN NaN
1 21219299 YOF21233 88821 NaN NaN NaN
2 00022011 ERE00091 23124 NaN NaN NaN
3 75643311 ROB21288 99421 21233332 NaN NaN
4 12412421 POW94277 12231 NaN NaN 33221
5 54221721 IRS21231 13123 NaN YOU28137 NaN
Step 3
Let's combine dupes with original data into single dataframe:
df2 = df[pd.notna(df["'Value'_Dupe"])][["'Value'_Dupe","ID_1","ID_2"]]
df2.columns = ["'Value'","ID_1","ID_2"]
df3 = df[pd.notna(df["ID_1_Dupe"])][["'Value'","ID_1_Dupe","ID_2"]]
df3.columns = ["'Value'","ID_1","ID_2"]
df4 = df[pd.notna(df["ID_2_Dupe"])][["'Value'","ID_1","ID_2_Dupe"]]
df4.columns = ["'Value'","ID_1","ID_2"]
df5 = df[["'Value'","ID_1","ID_2"]]
df_result = pd.concat([df5,df2,df3,df4])
df_result
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311 ROB21288 99421
4 12412421 POW94277 12231
5 54221721 IRS21231 13123
3 21233332 ROB21288 99421
5 54221721 YOU28137 13123
4 12412421 POW94277 33221
Please let me know if this solves your problem.
Related
I have a dataset that looks something like this:
index Ind. Code Code_2
1 1 NaN x
2 0 7 NaN
3 1 9 z
4 1 NaN a
5 0 11 NaN
6 1 4 NaN
I also created a list to indicate values in the column Code, something like this:
Code_List=['7', '9', '11']
I would like to create a new column for the indicator that is 1 so long as Ind. = 1, Code is in the above list, and Code 2 is not null
I would like to create a function containing an if statement. I tried this and am not sure if its a syntax issue, but i keep getting attribute errors such as the following:
def New_Indicator(x):
if x['Ind.'] == 1 and (x['Code'].isin[Code_List]) or (x['Code_2'].notnull()):
return 1
else:
return 0
df['NewIndColumn'] = df.apply(lambda x: New_Indicator(x), axis=1)
("'str' object has no attribute 'isin'", 'occurred at index 259')
("'float' object has no attribute 'notnull'", 'occurred at index
259')
The problem is that in your function, x['Code'] is a string, not a Series. I suggest you use numpy.where:
ind1 = df['Ind.'].eq(1)
codes = df.Code.isin(code_list)
code2NotNull = df.Code_2.notnull()
mask = ind1 & codes & code2NotNull
df['indicator'] = np.where(mask, 1, 0)
print(df)
Output
index Ind. Code Code_2 indicator
0 1 1 NaN x 0
1 2 0 7.0 NaN 0
2 3 1 9.0 z 1
3 4 1 NaN a 0
4 5 0 11.0 NaN 0
5 6 1 4.0 NaN 0
Update (as suggested by #splash58):
df['indicator'] = mask.astype(int)
I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df=pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv',nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement is :
I am trying to fill null values for columns with datatypes as int and object. I am trying to fill the nulls based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = df[df['target'] == 1][cols].mode()
df[( df['target'] == 0 )&( df[cols].isnull() )][cols] = df[df['target'] == 0][cols].mode()
I am able to get output in below print statement:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
also the able to print the values for df[df['target'] == 0][cols].mode() if I substitute cols.
But unable to replace the null values with mode values.
I tried df.loc, df.at options instead of df[] and df[...] == np.nan instead of df[...].isnull() but of no use.
Please assist if I need to do any changes in the code. Thanks.
Here is problem is select integers columns, then no contain missing values (because NaN is float), so cannot be replaced. Possible solution is select all numeric columns and in loop set first value of mode per conditions with DataFrame.loc for avoid chain indexing and Series.iat for return only first value (mode should return sometimes 2 values):
df=pd.read_csv('train.csv',nrows=5000)
#only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
#all columns
#df_obj = df.columns.to_list()
#print (df_obj)
for cols in df_obj:
m1 = df['target'] == 1
m2 = df['target'] == 0
df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
Another solution with replace missing values by Series.fillna:
for cols in df_obj:
m1 = df['target'] == 1
m2 = df['target'] == 0
df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
You don't have a sample data so I'll just give the methods I think you can use to solve your problem.
Try to read your DataFrame with na_filter = False that way your columns with np.nan or has null values will be replaced by blanks instead.
Then, during your loop use the '' as your identifier for null values. Easier to tag than trying to use the type of the value you are parsing.
I think pd.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 2, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed a mode value from a subset of values from given column to fill NaNs with.
# let's add 'type' column
A B C D type
0 NaN 2.0 0 1
1 3.0 2.0 1 1
2 NaN NaN 5 2
3 NaN 3.0 4 2
For example, if you want to fill df['B'] NaNs with the mode value of each row that is equal to df['type'] 2:
df.fillna({
'B': df.loc[df.type.eq(2)].B.mode()[0] # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been '2.0' hadn't we filtered the column with df.loc[]
Your problem is this
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT chain index, especially when assigning. See Why does assignment fail when using chained indexing? section in this doc.
Instead use loc:
df.loc[(df['target'] == 1) & (df[cols].isnull()),
cols] = df.loc[df['target'] == 1,
cols].mode()
I had a data which I pivoted using pivot table method , now the data looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as index. Now I compared one column to it's corresponding column and created another column with it's result. The idea is if the first column has a value other than 0 and the second column , to which the first column is compared to ,has 0 , then 100 should be updated in the newly created column, but if the situation is vice-versa then 'Null' should be updated. If both column have 0 , then also 'Null' should be updated. If the last column has value 0 , then 'Null' should be updated and other than 0 , then 100 should be updated. But if both the columns have values other than 0(like in the last row of my data) , then the comparison should be like this for column a and b:
value_of_b/value_of_a *50 + 50
and for column b and c:
value_of_c/value_of_b *25 + 25
and similarly if there are more columns ,then the multiplication and addition value should be 12.5 and so on.
I was able to achieve all the above things apart from the last result which is the division and multiplication stuff. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df ==0, m], [np.nan, df], 1*100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe which stores my pivoted table data which I mentioned at the start. After using this code my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you guys can help me get the desired data , I would greatly appreciate it.
Edit:
This is how my data looks:
The problem is that the coefficient to use when building the new compx column does not depend only on the columns position. In fact in each row it is reset to its maximum of 50 after each 0 value and is half of previous one after a non 0 value. Those resetable series are hard to vectorize in pandas, especially in rows. Here I would build a companion dataframe holding only those coefficients, and use directly the numpy underlying arrays to compute them as efficiently as possible. Code could be:
# transpose the dataframe to process columns instead of rows
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
top = 100 # start at 100
r = []
for i, v in enumerate(s):
if v == 0: # reset to 100 on a 0 value
top=100
else:
top = top/2 # else half the previous value
r.append(top)
coeff.loc[:, name] = r # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T
# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
np.where(df[col2] == 0, 100,
df[col2]/df[col1]*coeff[col1]
+coeff[col1]))
old = df.columns[0] # store name of first column
# Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
build_comp(old, col, i)
old = col # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
Ok, I think you can iterate over your dataframe df and use some if-else to get the desired output.
for i in range(len(df.index)):
if df.iloc[i,1]!=0 and df.iloc[i,2]==0: # column start from index 0
df.loc[i,'colname'] = 'whatever you want' # so rule_id is column 0
elif:
.
.
.
I have multiple DataFrames that I want to merge where I would like the fill value an empty string rather than nan. Some of the DataFrames have already nan values in them. concat sort of does what I want but fill empty values with nan. How does one not fill them with nan, or specify the fill_value to achieve something like this:
>>> df1
Value1
0 1
1 NaN
2 3
>>> df2
Value2
1 5
2 Nan
3 7
>>> merge_multiple_without_nan([df1,df2])
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
This is what concat does:
>>> concat([df1,df2], axis=1)
Value1 Value2
0 1 NaN
1 NaN 5
2 3 NaN
3 NaN 7
Well, I couldn't find any function in concat or merge that would handle this by itself, but the code below works without much hassel:
df1 = pd.DataFrame({'Value2': [1,np.nan,3]}, index = [0,1, 2])
df2 = pd.DataFrame({'Value2': [5,np.nan,7]}, index = [1, 2, 3])
# Add temporary Nan values for the data frames.
df = pd.concat([df1.fillna('X'), df2.fillna('Y')], axis=1)
df=
Value2 Value2
0 1 NaN
1 X 5
2 3 Y
3 NaN 7
Step 2:
df.fillna('', inplace=True)
df=
Value2 Value2
0 1
1 X 5
2 3 Y
3 7
Step 3:
df.replace(to_replace=['X','Y'], value=np.nan, inplace=True)
df=
Value2 Value2
0 1
1 NaN 5
2 3 NaN
3 7
After using concat, you can iterate over the DataFrames you merged, find the indices that are missing, and fill them in with an empty string. This should work for concatenating an arbitrary number of DataFrames, as long as your column names are unique.
# Concatenate all of the DataFrames.
merge_dfs = [df1, df2]
full_df = pd.concat(merge_dfs, axis=1)
# Find missing indices for each merged frame, fill with an empty string.
for partial_df in merge_dfs:
missing_idx = full_df.index.difference(partial_df.index)
full_df.loc[missing_idx, partial_df.columns] = ''
The resulting output using your sample data:
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
Suppose I have a dataframe as follows,
import pandas as pd
columns=['A','B','C','D', 'E', 'F']
index=['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns,index=index)
df['D']['1'] = 1
df['E'] = 1
df['F']['1'] = 1
df['A']['2'] = 1
df['B']['3'] = 1
df['C']['4'] = 1
df['A']['5'] = 1
df['B']['5'] = 1
df['C']['5'] = 1
df['D']['6'] = 1
df['F']['6'] = 1
df
A B C D E F
1 NaN NaN NaN 1 1 1
2 1 NaN NaN NaN 1 NaN
3 NaN 1 NaN NaN 1 NaN
4 NaN NaN 1 NaN 1 NaN
5 1 1 1 NaN 1 NaN
6 NaN NaN NaN 1 1 1
My condition is, I want to remove the columns which have value only when A,B,C(together) don't have a value. I want to find which column is mutually exclusive to A,B,C columns together. I am interested in finding the columns that have values only when A or B or C has values. The output here would be to remove D,F columns. But my dataframe has 400 columns and I want a way to check this for A,B,C vs rest of the columns.
One way I can think is,
Remove NA rows from A,B,C
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
and get NA count of all columns and check with the total number of rows,
df.isnull().sum()
and remove the counts that match.
Is there a better and efficient way to do this?
Thanks
Rather than delete rows, just select the others that don't have A, B, C equal to NaN at the same time.
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]