I have this as the ID numbers:
"00456, 0000456, 567, 00567" in a dataframe called "test".
I created a dataframe that has the IDs with leading zeros in a column called ID, a left-stripped version of the ID in a column named stripped, and an indicator column coded like this:
df.loc[df['stripped'].isin(test['ID']), 'indicator'] = 1
This tells me whether the left-stripped version of the ID is in the test df.
So for 00567 it would indicate 1, since 567 is in the test dataframe, and for 456 it would indicate NaN, since 456 itself is not in test.
I am trying to find a way to flag leading-zero differences. I want an indicator column that marks 00456 and 0000456 as the same ID apart from leading zeros, and lets me access that info easily.
I am a bit at a loss for how to do this. I tried groupby, but because they are not exact matches it wasn't working.
Any tips would be appreciated.
ID stripped indicator
00456 456 1
0000456 456 1
Currently, the indicator column is NaN for the rows shown above, but I want it to be what is shown above, indicator = 1, showing that there are leading-zero differences.
I am not sure how to compare row by row and by specific ID.
Updated question: how do I compare the semi-matching IDs 00456 and 0000456 and indicate a leading-zero difference? I am thinking of .str.contains, but I am not sure how to group that to only the same ID numbers.
Are either of these helpful to what you want to do?
Given a column of ids:
id
0 00456
1 0000456
2 456
3 00345
4 345
5 00000345
Doing:
df['id'].groupby(df.id.str.lstrip('0')).agg(list)
Output:
id
345 [00345, 345, 00000345]
456 [00456, 0000456, 456]
Name: id, dtype: object
Or doing:
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(lambda x: [x.tolist()]*len(x))
# or
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(' '.join).str.split()
Output:
id all
0 00456 [00456, 0000456, 456]
1 0000456 [00456, 0000456, 456]
2 456 [00456, 0000456, 456]
3 00345 [00345, 345, 00000345]
4 345 [00345, 345, 00000345]
5 00000345 [00345, 345, 00000345]
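If the end goal is just a 1/0 flag for "this ID appears more than once apart from leading zeros", here is a small sketch building on the same grouping idea (the indicator name just mirrors your column, and the extra 999 row is only there to show a non-duplicate):
import pandas as pd

df = pd.DataFrame({'id': ['00456', '0000456', '456', '00345', '345', '00000345', '999']})

# Strip leading zeros, then flag groups that contain more than one row,
# i.e. IDs that differ only by leading zeros.
stripped = df['id'].str.lstrip('0')
df['indicator'] = (stripped.groupby(stripped).transform('size') > 1).astype(int)
print(df)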
IIUC, you need an or condition to look in both ID and stripped:
df.loc[df['stripped'].isin(test['ID']) | df['ID'].isin(test['ID']), 'indicator'] = 1
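A minimal runnable sketch of that fix, using the sample IDs from the question (the rows in df are just for illustration):
import pandas as pd

test = pd.DataFrame({'ID': ['00456', '0000456', '567', '00567']})

df = pd.DataFrame({'ID': ['00456', '0000456', '00567']})
df['stripped'] = df['ID'].str.lstrip('0')

# 1 if either the raw ID or the stripped ID appears in test['ID'], else NaN.
df.loc[df['stripped'].isin(test['ID']) | df['ID'].isin(test['ID']), 'indicator'] = 1
print(df)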
I have a data frame containing a multi-parent hierarchy of employees. node (int64) is a key that identifies each unique combination of employee and manager. parent (float64) is a key that refers to the manager's node.
Due to some source data anomalies, there are parent keys present in the data frame that do not appear anywhere in node. I would like to delete all rows where this occurs.
empId  empName    mgrId  mgrName  node  parent
111    Alice      222    Bob      1     3
111    Alice      333    Charlie  2     4
222    Bob        444    Dave     3     5
333    Charlie    444    Dave     4     5
444    Dave                       5
555    Elizabeth  333    Charlie  6     7
In the above sample, it would be employee ID 555 because parent key 7 is not present anywhere in ‘node’ column.
Here's what I tried so far:
This removes some rows, but it does not remove all of them, and I'm not sure why:
df1 = df[df['parent'].isin(df['node'])]
I thought maybe it was because ‘parent’ is float and ‘node’ is int64, so I converted and tried again, but got the same result as before.
df1 = df[df['parent'].astype('int64').isin(df['node'])]
Something to consider is that the data frame contains around 1.5 million rows.
I also tried this, but it just keeps running forever. I assume this is because .map will loop through the entire data frame (which is around 1.5 million rows):
df[df['parent'].map(lambda x: np.isin(x, df['node']).all())]
With the first two code snippets, I'm especially perplexed as to why they consistently filter out some of the rows that fail the condition, but not all of them.
Again, 'parent' is float64 and has empty values. 'node' is int64 and has no empty values. A more realistic example of node and parent keys is as follows:
Node - 13192210227
Parent - 12668210227.0
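For reference, here is a minimal reproducible sketch of that filter on the sample table above. The blank mgrId/mgrName/parent cells are assumed to be NaN; note that rows with no parent at all would also fail .isin(), so they are kept explicitly here:
import pandas as pd

df = pd.DataFrame({
    'empId':   [111, 111, 222, 333, 444, 555],
    'empName': ['Alice', 'Alice', 'Bob', 'Charlie', 'Dave', 'Elizabeth'],
    'mgrId':   [222, 333, 444, 444, None, 333],
    'mgrName': ['Bob', 'Charlie', 'Dave', 'Dave', None, 'Charlie'],
    'node':    [1, 2, 3, 4, 5, 6],
    'parent':  [3.0, 4.0, 5.0, 5.0, None, 7.0],  # float64 with missing values, as described
})

# Keep rows whose parent exists as a node, plus rows with no parent (top of the hierarchy).
keep = df['parent'].isin(df['node']) | df['parent'].isna()
print(df[keep])  # drops the Elizabeth row, whose parent 7 is not a node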
I have a python pandas dataframe of stock data, and I'm trying to filter some of those tickers.
There are companies that have 2 or more tickers (different share classes, e.g. one preferred and the other not).
I want to drop the rows for those extra tickers and keep only the one with the highest volume. The dataframe also includes the company name, so maybe there is a way to use it in a condition and drop rows by comparing the volumes within the same company? How can I do this?
Use groupby and idxmax:
Suppose this dataframe:
>>> df
ticker volume
0 CEBR3 123
1 CEBR5 456
2 CEBR6 789 # <- keep for group CEBR
3 GOAU3 23 # <- keep for group GOAU
4 GOAU4 12
5 CMIN3 135 # <- keep for group CMIN
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
sort=False)['volume'].idxmax().tolist()]
ticker volume
2 CEBR6 789
3 GOAU3 23
5 CMIN3 135
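If you prefer, a hedged alternative with the same grouping key is sort_values plus drop_duplicates, which also keeps the highest-volume row per company prefix:
import pandas as pd

df = pd.DataFrame({'ticker': ['CEBR3', 'CEBR5', 'CEBR6', 'GOAU3', 'GOAU4', 'CMIN3'],
                   'volume': [123, 456, 789, 23, 12, 135]})

# Group key: the ticker without its trailing share-class digit.
company = df['ticker'].str.extract(r'^(.*)\d', expand=False)

out = (df.assign(company=company)
         .sort_values('volume', ascending=False)   # highest volume first
         .drop_duplicates('company')               # keep one row per company
         .drop(columns='company')
         .sort_index())
print(out)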
OK, I have a bit of a humdinger.
I have a dataframe that can be upwards of 120,000 entries.
The frames will be similar to this:
ID UID NAME DATE
1 1234 Bob 02/02/2020
2 1235 Jim 02/04/2020
3 1234 Bob
4 1234 Bob 02/02/2020
5 1236 Jan 20/03/2020
6 1235 Jim
I need to eliminate all duplicates; however, if any of the duplicates has a date, then one of the ones that does have a date should be the one kept, and the others removed. If there is no date in any of the duplicates, then just keep whichever is easiest.
I am struggling to come up with a way to do this elegantly.
My thought is:
iterate through all entries; for each entry, create a temp DF and place all of its duplicates in it; iterate through THAT df and, if I find a date, save the index and then delete every entry that is not that one... but that seems VERY bulky and slow.
Any better suggestions?
Since the blanks are empty string '', you can do this:
(sample_df.sort_values(['UID', 'NAME', 'DATE'],
                       ascending=(True, True, False))
          .groupby(['UID', 'NAME'])
          .first()
          .reset_index())
which gives you:
UID NAME DATE
0 1234 Bob 02/02/2020
1 1235 Jim 02/04/2020
2 1236 Jan 20/03/2020
Note the ascending flag in sort_values. Pandas sorts strings lexicographically, and the empty string '' sorts before any non-empty date, so to have non-empty DATE values come before empty ones you need to sort that column in descending order.
After sorting, you can simply group each pair of (UID, NAME) and keep the first entry.
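A self-contained version of that approach, rebuilding sample_df from the question's table (the ID column is left out so the result matches the output above, and the blank dates are assumed to be empty strings):
import pandas as pd

sample_df = pd.DataFrame({
    'UID':  [1234, 1235, 1234, 1234, 1236, 1235],
    'NAME': ['Bob', 'Jim', 'Bob', 'Bob', 'Jan', 'Jim'],
    'DATE': ['02/02/2020', '02/04/2020', '', '02/02/2020', '20/03/2020', ''],
})

# Sort so that non-empty dates come first within each (UID, NAME) group,
# then keep the first row of each group.
deduped = (sample_df.sort_values(['UID', 'NAME', 'DATE'],
                                 ascending=(True, True, False))
                    .groupby(['UID', 'NAME'])
                    .first()
                    .reset_index())
print(deduped)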
main_df:
Name Age Id DOB
0 Tom 20 A4565 22-07-1993
1 nick 21 G4562 11-09-1996
2 krish AKL F4561 15-03-1997
3 636A 18 L5624 06-07-1995
4 mak 20 K5465 03-09-1997
5 nits 55 56541 45aBc
6 444 66 NIT 09031992
column_info_df:
Column_Name Column_Type
0 Name string
1 Age integer
2 Id string
3 DOB Date
How can I find data-type error values in main_df? For example, from column_info_df we can see that 'Name' is a string column, so in main_df the 'Name' column should contain only string or alphanumeric values; anything else is an error. I need to collect those data-type error values in a separate df.
error output df:
Column_Name Current_Value Exp_Dtype Index_No.
0 Name 444 string 6
1 Age AKL integer 2
2 Id 56541 string 5
3 DOB 45aBc Date 5
4 DOB 09031992 Date 6
I tried this:
for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-z|A-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
        pass  # stuck here
I am stuck here because these regexes are not catching every error, and I don't know how to go further.
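For the Date branch you are stuck on, one hedged option (not in your code, just a common alternative to a regex) is to let pd.to_datetime decide what counts as a valid date:
import pandas as pd

# Assumed sample: the DOB column from main_df above.
dob = pd.Series(['22-07-1993', '11-09-1996', '15-03-1997',
                 '06-07-1995', '03-09-1997', '45aBc', '09031992'])

# Anything that cannot be parsed as dd-mm-yyyy becomes NaT, i.e. a data-type error.
parsed = pd.to_datetime(dob, format='%d-%m-%Y', errors='coerce')
print(dob[parsed.isna()])  # -> 45aBc and 09031992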
Here is one way of using df.eval().
Note: this checks based on patterns and returns the non-matching values. However, it cannot validate actual types; for example, if the date column has an entry that looks like a date but is an invalid date, this would not identify it:
d={"string":".str.contains(r'[a-z|A-Z]')","integer":".str.contains('^[0-9]*$')",
"Date":".str.contains('\d\d-\d\d-\d\d\d\d')"}
m=df.eval([f"~{a}{b}"
for a,b in zip(column_info_df['Column_Name'],column_info_df['Column_Type'].map(d))]).T
final=(pd.DataFrame(np.where(m,df,np.nan),columns=df.columns)
.reset_index().melt('index',var_name='Column_Name',
value_name='Current_Value').dropna())
final['Expected_dtype']=(final['Column_Name']
.map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)
Output:
index Column_Name Current_Value Expected_dtype
6 6 Name 444 string
9 2 Age AKL integer
19 5 Id 56541 string
26 5 DOB 45aBc Date
27 6 DOB 09031992 Date
I agree there can be better regex patterns for this job, but the idea should be the same.
If I understood what you did, you created separate dataframes which contain info about your main one.
What I would suggest instead is to use the built-in methods pandas offers for working with dataframes.
For instance, if you have a dataframe main, then:
main.info()
will give you the type of object for each column. Note that a column can contain only one type, as it is a Series, which is itself backed by an ndarray.
So your Name column cannot hold anything other than strings you might have missed; instead, you can have NaN values. You can check for them with the help of
main.describe()
I hope that helped :-)
I am trying to create a new column in my pandas dataframe, but only with a value if another column contains a certain string.
My dataframe looks something like this:
raw val1 val2
0 Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456
2 13445 07708-20-2019 US 432 676
3 79935 19028808-15-2019 US 444 234
4 Vendor: company Name 2 234 234
I am trying to create a new column, vendor that transforms the dataframe into:
raw val1 val2 vendor
0 Vendor Invoice Numbe Inv Date Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456 Vendor: Company Name 1
2 13445 07708-20-2019 US 432 676 NaN
3 79935 19028808-15-2019 US 444 234 NaN
4 Vendor: company Name 2 234 234 company Name 2
5 Vendor: company Name 2 928 528 company Name 2
However, whenever I try,
df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']
I get the error
ValueError: cannot reindex from a duplicate axis
I know that at index 4 and 5 it's the same value for the company, but what am I doing wrong, and how do I add the new column to my dataframe?
The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df.
You can try np.where, which builds an np.array of the same size as df, so it doesn't need index alignment.
df['vendor'] = np.where(df['raw'].str.contains('Vendor'), df['raw'], np.NaN)
You could .extract() the part of the string that comes after Vendor: using a positive lookbehind:
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')
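A short usage sketch of both answers on the sample raw column from the question (the vendor_full/vendor_name column names are just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'raw': ['Vendor Invoice Numbe Inv Date',
                           'Vendor: Company Name 1',
                           '13445 07708-20-2019 US',
                           '79935 19028808-15-2019 US',
                           'Vendor: company Name 2']})

# np.where: keep the whole cell wherever it mentions 'Vendor', else NaN.
df['vendor_full'] = np.where(df['raw'].str.contains('Vendor'), df['raw'], np.nan)

# .str.extract: keep only the part after 'Vendor: ' (other rows become NaN).
df['vendor_name'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')
print(df)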