ok, I have a bit of a humdinger.
I have a dataframe that can have upwards of 120,000 entries.
The frames will be similar to this:
ID  UID   NAME  DATE
1   1234  Bob   02/02/2020
2   1235  Jim   02/04/2020
3   1234  Bob
4   1234  Bob   02/02/2020
5   1236  Jan   20/03/2020
6   1235  Jim
I need to eliminate all duplicates, but with a catch: within each group of duplicates, if any entry has a date, then one of the entries that has a date must be the one kept, and all others removed. If no entry in the group has a date, just keep whichever is easiest.
I am struggling to come up with a way to do this elegantly.
My thought is:
iterate through all entries; for each entry, create a temp DF and place all of its duplicates in it, iterate through THAT df, and if I find a date, save the index and then delete every entry that is not that entry... but that seems VERY bulky and slow.
Any better suggestions?
Since the blanks are empty string '', you can do this:
(sample_df.sort_values(['UID', 'NAME', 'DATE'],
                       ascending=(True, True, False))
          .groupby(['UID', 'NAME'])
          .first()
          .reset_index())
which gives you:
UID NAME DATE
0 1234 Bob 02/02/2020
1 1235 Jim 02/04/2020
2 1236 Jan 20/03/2020
Note the ascending flag in sort_values. Pandas sorts strings lexicographically, and the empty string '' sorts before any non-empty string, so to have non-empty DATE values come before empty ones you need to sort that column in descending order.
After sorting, you can simply group by each (UID, NAME) pair and keep the first entry.
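For what it's worth, the same result can be had with drop_duplicates instead of the groupby/first round trip; a minimal sketch, assuming the blanks really are empty strings as above:

import pandas as pd

# sort so rows with a non-empty DATE come first within each (UID, NAME) group,
# then keep only the first row of each group
deduped = (sample_df.sort_values(['UID', 'NAME', 'DATE'],
                                 ascending=(True, True, False))
                    .drop_duplicates(subset=['UID', 'NAME'], keep='first'))

keep='first' retains the dated row because of the descending DATE sort, and both approaches scale fine to 120,000 rows.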
I have this as the ID numbers:
"00456, 0000456, 567, 00567" in a dataframe called "test".
I created a dataframe that has the IDs with leading zeros in a column called ID, a left-stripped version of the ID in a column named stripped, and an indicator column coded like this:
df.loc[df['stripped'].isin(test['ID']), 'indicator'] = 1
This tells me if the left-stripped version of the ID is in the test df.
So for 00567, it would indicate 1, since 567 is in the test dataframe. And for 456 it would indicate NaN, since 456 is not in test.
I am trying to find a way to indicate leading-zero differences. So I want an indicator column to indicate that 00456 and 0000456 are the same except for leading zeros, and let me access that info easily.
I am a bit at a loss for how to do this. I tried groupby, but because they are not exact matches it wasn't working.
Any tips would be appreciated.
ID stripped indicator
00456 456 1
0000456 456 1
Currently, the indicator column is NaN for the rows shown above. But I want it to be what is shown above, indicator = 1, showing that there are leading-zero differences.
I am not sure how to compare row by row, and by specific ID.
Updated question: how do I code a way of comparing the semi-matching IDs 00456 and 0000456 and indicating a leading-zero difference? I am thinking .str.contains, but I am not sure how to restrict that to only the same ID numbers.
Is either of these helpful for what you want to do?
Given a column of ids:
id
0 00456
1 0000456
2 456
3 00345
4 345
5 00000345
Doing:
df['id'].groupby(df.id.str.lstrip('0')).agg(list)
Output:
id
345 [00345, 345, 00000345]
456 [00456, 0000456, 456]
Name: id, dtype: object
Or doing:
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(lambda x: [x.tolist()]*len(x))
# or
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(' '.join).str.split()
Output:
id all
0 00456 [00456, 0000456, 456]
1 0000456 [00456, 0000456, 456]
2 456 [00456, 0000456, 456]
3 00345 [00345, 345, 00000345]
4 345 [00345, 345, 00000345]
5 00000345 [00345, 345, 00000345]
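If, as in the question, you only need a 1/NaN indicator rather than the full lists, a small sketch along the same lines (assuming the lowercase id column used above and an indicator column like the question's):

# flag ids whose zero-stripped form appears more than once in the column;
# those are exactly the ones that differ only by leading zeros
sizes = df.groupby(df['id'].str.lstrip('0'))['id'].transform('size')
df.loc[sizes > 1, 'indicator'] = 1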
IIUC you need an or condition to look in both ID and stripped:
df.loc[df['stripped'].isin(test['ID']) | df['ID'].isin(test['ID']), 'indicator'] = 1
I have a data frame containing a multi-parent hierarchy of employees. Node (int64) is a key that identifies each unique combination of employee and manager. Parent (float64) is the node key of that row's manager.
Due to some source data anomalies, there are parent keys present in the data frame that are not nodes. I would like to delete all rows where this occurs.
empId  empName    mgrId  mgrName  node  parent
111    Alice      222    Bob      1     3
111    Alice      333    Charlie  2     4
222    Bob        444    Dave     3     5
333    Charlie    444    Dave     4     5
444    Dave                       5     NaN
555    Elizabeth  333    Charlie  6     7
In the above sample, it would be employee ID 555, because parent key 7 is not present anywhere in the 'node' column.
Here's what I tried so far.
This removes some rows, but does not remove all of them; I'm not sure why:
df1 = df[df['parent'].isin(df['node'])]
I thought maybe it was because 'parent' is float64 and 'node' is int64, so I converted and tried again, but got the same result as before:
df1 = df[df['parent'].astype('int64').isin(df['node'])]
Something to consider is that the data frame contains around 1.5 million rows.
I also tried this, but it just keeps running forever - I assume this is because .map loops through the entire data frame (which is around 1.5 million rows):
df[df['parent'].map(lambda x: np.isin(x, df['node']).all())]
I'm especially perplexed by the first two code snippets: they consistently filter out a small subset of the rows that do not meet the filter condition, but not all of them.
Again, 'parent' is float64 and has empty values. 'node' is int64 and has no empty values. A more realistic example of node and parent keys is as follows:
Node - 13192210227
Parent - 12668210227.0
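One thing worth checking: isin never matches the empty (NaN) parents against the int64 node column, so df[df['parent'].isin(df['node'])] silently drops every top-level row, and .astype('int64') on a column containing NaN will raise or misbehave depending on the pandas version. Assuming the empty parents are genuine top-level rows that should be kept, a sketch:

# keep rows whose parent is empty (top of the hierarchy)
# or whose parent key actually appears as a node
mask = df['parent'].isna() | df['parent'].isin(df['node'])
df1 = df[mask]

Both isna and isin are vectorized, so this stays fast at 1.5 million rows.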
My dataframe is this:
Date Name Type Description Number
2020-07-24 John Doe Type1 NaN NaN
2020-08-10 Jo Doehn Type1 NaN NaN
2020-08-15 John Doe Type1 NaN NaN
2020-09-10 John Doe Type2 NaN NaN
2020-11-24 John Doe Type1 NaN NaN
I want the Number column to hold the instance number within a 60-day period. So for entry 1, Number should just be 1, since it's the first instance; same with entry 2, since it's a different name. Entry 3, however, should have 2 in the Number column, since it's the second instance of John Doe and Type1 in the 60-day period starting 7/24 (the first instance date). Entry 4 would be 1 as well, since the Type is different. Entry 5 would also be 1, since it's outside the 60-day period from 7/24; however, any entries after this with John Doe, Type1 would count against a new 60-day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp=1)
       .pivot_table(index='Date',
                    columns=['Name', 'Type'],
                    values='temp',
                    aggfunc='count',
                    fill_value=0))

# one row per calendar day, then sum each 60-day window;
# sum (not count) so that multiple same-day entries keep their counts
x.resample('1d').sum().rolling(60).sum()
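To then read off the windowed count for a particular row, a small usage sketch (min_periods=1 is an addition so the first 59 days don't come back as NaN):

counts = x.resample('1d').sum().rolling(60, min_periods=1).sum()

# John Doe / Type1 entries in the 60 days ending 2020-08-15
counts.loc['2020-08-15', ('John Doe', 'Type1')]

On the sample data this lookup gives 2.0, matching the expected Number for entry 3.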
Can you post sample data in text format (for copy/paste)?
main_df:
Name Age Id DOB
0 Tom 20 A4565 22-07-1993
1 nick 21 G4562 11-09-1996
2 krish AKL F4561 15-03-1997
3 636A 18 L5624 06-07-1995
4 mak 20 K5465 03-09-1997
5 nits 55 56541 45aBc
6 444 66 NIT 09031992
column_info_df:
Column_Name Column_Type
0 Name string
1 Age integer
2 Id string
3 DOB Date
How can I find data-type error values in the main df? For example, from column_info_df we can see 'Name' is a string column, so in main_df the 'Name' column should contain only string or alphanumeric values; anything else is an error. I need to collect those data-type error values in a separate df.
error output df:
  Column_Name Current_Value Exp_Dtype Index_No.
0 Name        444           string    6
1 Age         AKL           integer   2
2 Id          56541         string    5
3 DOB         45aBc         Date      5
4 DOB         09031992      Date      6
I tried this:
for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-z|A-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
        pass  # stuck here
I am stuck here, because these regexes are not catching every error, and I don't know how to go further.
Here is one way of using df.eval().
Note: this checks based on patterns and returns the non-matching values, but it cannot validate content - e.g. if the date column has an entry which looks like a date but is an invalid date, this will not identify it:
d = {"string": ".str.contains(r'[a-zA-Z]')",
     "integer": ".str.contains(r'^[0-9]*$')",
     "Date": ".str.contains(r'\d\d-\d\d-\d\d\d\d')"}
m = df.eval([f"~{a}{b}"
             for a, b in zip(column_info_df['Column_Name'],
                             column_info_df['Column_Type'].map(d))]).T
final = (pd.DataFrame(np.where(m, df, np.nan), columns=df.columns)
           .reset_index()
           .melt('index', var_name='Column_Name', value_name='Current_Value')
           .dropna())
final['Expected_dtype'] = (final['Column_Name']
                           .map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)
Output:
index Column_Name Current_Value Expected_dtype
6 6 Name 444 string
9 2 Age AKL integer
19 5 Id 56541 string
26 5 DOB 45aBc Date
27 6 DOB 09031992 Date
I agree there can be better regex patterns for this job, but the idea should be the same.
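To also catch values that match the date pattern but are not real dates (the limitation noted above), one option is pd.to_datetime with errors='coerce'; a sketch, assuming the DOB column follows dd-mm-yyyy as in the sample:

import pandas as pd

# anything that cannot be parsed as a real dd-mm-yyyy date becomes NaT,
# so the .isna() mask picks out exactly the invalid DOB entries
bad_dates = main_df[pd.to_datetime(main_df['DOB'],
                                   format='%d-%m-%Y',
                                   errors='coerce').isna()]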
If I understood correctly, you created separate dataframes which contain info about your main one.
What I suggest instead is to use the built-in methods offered by pandas for inspecting dataframes.
For instance, if you have a dataframe main, then:
main.info()
will give you the dtype of each column. Note that a column has a single dtype, as it is a Series, which is itself backed by an ndarray.
So your Name column cannot hold anything but strings that you might have missed; it can, however, contain NaN values. You can check for those with the help of
main.describe()
I hope that helped :-)
I have a dataframe like this:
name time session1 session2 session3
Alex 135 10 3 5
Lee 136 2 6 4
I want to make multiple dataframes based on each session. For example, dataframe one has name, time, and session1; dataframe two has name, time, and session2; and dataframe three has name, time, and session3. I would like to use a for loop (or any better way), but I don't know how to select columns 1, 2, 3 one time and then columns 1, 2, 4, etc. Does anyone have an idea? The data is saved in a pandas dataframe. I just don't know how to type it here on Stack Overflow. Thank you.
I don't think you need to create a new dictionary for that.
Just directly slice your data frame whenever needed.
df[['name', 'time', 'session1']]
If you think the following design can help you, you can set the name and time to be indexes (df.set_index(['name', 'time'])) and just simply
df['session1']
Organize it into a dictionary of dataframes:
dict_of_dfs = {f'df {i}': df[['name', 'time', i]] for i in df.columns[2:]}
Then you can access each dataframe as you would any other dictionary values:
>>> dict_of_dfs['df session1']
name time session1
0 Alex 135 10
1 Lee 136 2
>>> dict_of_dfs['df session2']
name time session2
0 Alex 135 3
1 Lee 136 6
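And since the question mentions a for loop: the comprehension above, written out explicitly, is just

# build one dataframe per session column, keyed by the column name
dict_of_dfs = {}
for col in df.columns[2:]:  # 'session1', 'session2', 'session3'
    dict_of_dfs[f'df {col}'] = df[['name', 'time', col]]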