I ran into this specific problem: I have a dataframe of account ID numbers, and some of them have dropped their leading zeros.
The dataframe is df.
ID
345
345
543
000922
000345
000345
000543
So what I'm trying to do is create a generalized way to check whether we have dropped leading zeros. In my real data set there would be millions of rows, so I want to use a pandas method: if an ID matches the non-zero part of a zero-padded ID, put those rows into another dataframe so I can examine them further.
I do that like this:
new_df = df.loc[df['ID'].str.lstrip('0').isin(df['ID'])]
My reasoning is that I want to filter the dataset down to the IDs whose stripped form also appears among the full IDs.
Now I have
ID
345
345
543
000345
000345
000543
I can use a .unique() to get a series of each unique combo.
ID
345
543
000345
000543
This is fine for a small dataset, but for millions of rows I am wondering how I can make this check easier.
I'm trying to find a way to create a dictionary where the keys are the stripped IDs and the values are their full IDs, or vice versa.
Any tips on that would be appreciated.
If anyone also has tips on a different way of checking for dropped zeros, other than the dictionary approach, that would be helpful too.
Note: it is not always 3 digits. It could be 4567, for example, where the real value would be 004567.
One option is to strip leading "0"s:
out = df['ID'].str.lstrip('0').unique()
Output:
array(['345', '543', '922'], dtype=object)
or prepend "0"s:
out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()
Output:
array(['000345', '000543', '000922'], dtype=object)
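Either normalized form can then be used as a key for the check you describe. As a minimal sketch (assuming the sample column above is stored as strings; key, n_forms and suspects are just illustrative names), you could flag the rows whose stripped value appears under more than one spelling:
import pandas as pd

df = pd.DataFrame({'ID': ['345', '345', '543', '000922', '000345', '000345', '000543']})

# normalize every ID by stripping its leading zeros
key = df['ID'].str.lstrip('0')

# count how many distinct raw spellings share each normalized key
n_forms = df.groupby(key)['ID'].transform('nunique')

# rows whose key occurs under more than one spelling are the suspects
suspects = df[n_forms > 1]
print(suspects)
Here 000922 stays out of suspects because it only ever appears in its padded form.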
Use:
print (df)
ID
0 345
1 345
2 540
3 2922
4 002922
5 000344
6 000345
7 000543
# filter IDs starting with 0 into a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
# create the index of the Series by removing the zeros from the left side
d.index = d.str.lstrip('0')
print (d)
ID
2922 002922
344 000344
345 000345
543 000543
Name: ID, dtype: object
# dict of all possible values
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}
# keep only the indices that exist in the original ID column and create the dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
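One possible use of the resulting dictionary (a sketch; ID_fixed is just an illustrative column name) is to restore the padded form wherever a padded counterpart exists:
# map stripped IDs to their padded form; IDs without a padded twin keep their original value
df['ID_fixed'] = df['ID'].map(d).fillna(df['ID'])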
Create a dictionary for finding potentially affected records.
import pandas as pd

# Create a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()
df
ID stripped affected_id
0 00456 456 False
1 0000456 456 False
2 567 567 True
3 00567 567 False
# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}
If you want to include the stripped records into the list, then:
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
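For millions of rows, a groupby-based variant of the same idea builds the groups once instead of filtering the frame for every unique value (a sketch, not a drop-in replacement for the code above):
# spellings that kept their zeros, grouped by their numeric value
padded = df[~df.affected_id]
affected_values = set(df.loc[df.affected_id, 'stripped'])

d = {value: group.unique().tolist()
     for value, group in padded.groupby('stripped')['ID']
     if value in affected_values}
# {567: ['00567']}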
You can convert the column to int, cast it back to str, and compare with the original:
m = df['ID'].ne(df['ID'].astype(int).astype(str))
print(m)
0 False
1 False
2 False
3 True
4 True
5 True
Name: ID, dtype: bool
print(df[m])
ID
3 000345
4 000345
5 000543
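If you also want the dictionary you asked about, the same mask can feed it (a sketch; lookup is just an illustrative name):
# map each zero-padded ID to its stripped counterpart (swap the zip arguments for the reverse mapping)
lookup = dict(zip(df.loc[m, 'ID'], df.loc[m, 'ID'].str.lstrip('0')))
# {'000345': '345', '000543': '543'}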
Related
I have 3 columns of IDs that I want to combine into a single column, as in the example below. The goal here is simply to replace all 0s in the main column with the values in either ID_1 or ID_2 AND maintain the SCORE column on the far right.
Note: the Main ID column also has cases where there is already a value, as shown in row 3; in that case nothing needs to be done. Ultimately I'm trying to get a single column as shown in the desired output. I tried an iterative loop, but it was not a pythonic approach.
Data Table
Main ID ID_1 ID_2 SCORE
0 0 121231 212
0 54453 0 199
12123 12123 0 185
343434 0 343434 34
2121 0 0 66
0 0 11 565
Desired output:
MAIN ID SCORE
121231 212
54453 199
12123 185
343434 34
2121 66
11 565
Update: applying the bfill method changed all the 'Main ID' numbers into scientific notation, like 3.43559e+06.
This one works for me. It's simple but functional :D
import pandas as pd
d = {'MAIN ID' : [0,0,12123,343434,2121,0], 'ID_1': [0,54453,12123,0,0,0],'ID_2':[121231,0,0,343434,0,11]}
df = pd.DataFrame(data=d)
for i in range(len(df)):
    if df.loc[i, 'MAIN ID'] == 0:
        if df.loc[i, 'ID_1'] != 0:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_1']
        else:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_2']
df = df.drop(['ID_1', 'ID_2'], axis=1)
Try bfill with mask:
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']]
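Regarding the scientific-notation update: masking the zeros introduces NaN, which forces the columns to float. Casting back to int after the backfill restores plain integers and keeps SCORE. A sketch, assuming (as in the sample) that at least one ID column is non-zero whenever Main ID is 0:
import pandas as pd

df = pd.DataFrame({'Main ID': [0, 0, 12123, 343434, 2121, 0],
                   'ID_1':    [0, 54453, 12123, 0, 0, 0],
                   'ID_2':    [121231, 0, 0, 343434, 0, 11],
                   'SCORE':   [212, 199, 185, 34, 66, 565]})

# mask zeros to NaN, backfill across the row, keep Main ID, cast back to int
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']].astype(int)
out['SCORE'] = df['SCORE']
print(out)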
I have a data frame in which some features hold cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around: first identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code, which gives me the names of the features with cumulative values:
def accmulate_col(value):
    count = 0
    count_1 = False
    name = []
    for i in range(len(value)-1):
        if value[i+1]-value[i] >= 0:
            count += 1
        if value[i+1]-value[i] > 0:
            count_1 = True
    name.append(1) if count == len(value)-1 and count_1 else name.append(0)
    return name

df.apply(accmulate_col)
Afterwards, I manually save these feature names in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1,2,3,4],
'b': [4,3,2,1]
}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only those with True in them
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
*(The term cumulative doesn't really describe the conditions you used. Did you want it to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
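Putting the detection and the reverting together, a minimal sketch (reusing np.diff with prepend=0 from the question, on the question's sample data) might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [346, 76, 459, 680, 679, 246],
                   'b': [17, 52, 70, 96, 167, 180]})

# columns whose values only ever increase are treated as cumulative
cum_features = df.columns[(df.diff().dropna() > 0).all()]

# revert them on a copy
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
print(df_clean)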
I have a pandas DataFrame like so:
from_user to_user
0 123 456
1 894 135
2 179 890
3 456 123
Where each row contains two IDs that reflect whether the from_user "follows" the to_user. How can I count the total number of mutual followers in the DataFrame using pandas?
In the example above, the answer should be 1 (users 123 & 456).
One way is to use MultiIndex set operations:
In [11]: i1 = df.set_index(["from_user", "to_user"]).index
In [12]: i2 = df.set_index(["to_user", "from_user"]).index
In [13]: (i1 & i2).levels[0]
Out[13]: Int64Index([123, 456], dtype='int64')
To get the count you have to divide the length of this index by 2:
In [14]: len(i1 & i2) // 2
Out[14]: 1
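In recent pandas versions the & operator on Index objects is deprecated in favour of .intersection(); a self-contained version of the same idea might look like this sketch:
import pandas as pd

df = pd.DataFrame({'from_user': [123, 894, 179, 456],
                   'to_user':   [456, 135, 890, 123]})

i1 = df.set_index(['from_user', 'to_user']).index
i2 = df.set_index(['to_user', 'from_user']).index

mutual = i1.intersection(i2)   # contains both (123, 456) and (456, 123)
print(len(mutual) // 2)        # 1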
Another way is to concatenate the values as strings and sort the characters of each concatenation.
Then count how many times each sorted value occurs:
# concat the values as string type
df['concat'] = df.from_user.astype(str) + df.to_user.astype(str)
# sort the string values of the concatenation
df['concat'] = df.concat.apply(lambda x: ''.join(sorted(x)))
# count the occurrences of each and subtract 1
count = (df.groupby('concat').size() -1).sum()
Out[64]: 1
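Note that character-level sorting can collide for different pairs (for example '12' + '3' and '1' + '23' both sort to '123'). A variant of the same counting idea that avoids this is to sort each pair numerically before grouping (a sketch; like the string version, it assumes each directed edge appears at most once):
import numpy as np
import pandas as pd

df = pd.DataFrame({'from_user': [123, 894, 179, 456],
                   'to_user':   [456, 135, 890, 123]})

# sort each (from_user, to_user) pair so both directions map to the same key
pairs = pd.DataFrame(np.sort(df[['from_user', 'to_user']].to_numpy(), axis=1),
                     columns=['a', 'b'])
count = (pairs.groupby(['a', 'b']).size() - 1).sum()
print(count)   # 1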
Here is another, slightly more hacky way to do this:
(df.loc[df.to_user.isin(df.from_user)]
   .assign(hacky=df.from_user * df.to_user)
   .drop_duplicates(subset='hacky', keep='first')
   .drop('hacky', axis=1))
from_user to_user
0 123 456
The whole multiplication hack exists to ensure we don't return both 123 --> 456 and 456 --> 123, since both are valid given the conditional we provide to loc.
I have a data frame (df)
df = pd.DataFrame({'No': [123,234,345,456,567,678], 'text': ['60 ABC','1nHG','KL HG','21ABC','K 200','1g HG'], 'reference':['ABC','HG','FL','','200',''], 'result':['','','','','','']}, columns=['No', 'text', 'reference', 'result'])
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL
3 456 21ABC
4 567 K 200 200
5 678 1g HG
and a list with elements
list
['ABC','HG','FL','200','CP1']
Now I have the following code:
for idx, row in df.iterrows():
    for item in list:
        if row['text'].strip().endswith(item):
            if pd.isnull(row['reference']):
                df.at[idx, 'result'] = item
            elif pd.notnull(row['reference']) and row['reference'] != item:
                df.at[idx, 'result'] = 'wrong item'
        if pd.isnull(row['result']):
            break
I run through df and the list and check for matches.
Output:
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL wrong item
3 456 21ABC ABC
4 567 K 200 200
5 678 1g HG HG
The break instruction is important because otherwise a second element could be found in the list, and this second element would then overwrite the content in result.
Now I need another solution because the data frame is huge and for loops are inefficient. I think apply could work, but how?
Thank you!
Instead of iterating rows, you can iterate your suffixes, which is likely a much smaller iterable. This way, you can take advantage of series-based methods and Boolean indexing.
I've also created an extra series to identify when a row has been updated. The cost of this extra check should be small versus the expense of iterating by row.
L = ['ABC', 'HG', 'FL', '200', 'CP1']
df['text'] = df['text'].str.strip()
null = df['reference'].eq('')
df['updated'] = False
for item in L:
    ends = df['text'].str.endswith(item)
    diff = df['reference'].ne(item)
    m1 = ends & null & ~df['updated']
    m2 = ends & diff & ~null & ~df['updated']
    df.loc[m1, 'result'] = item
    df.loc[m2, 'result'] = 'wrong item'
    df.loc[m1 | m2, 'updated'] = True
Result:
No text reference result updated
0 123 60 ABC ABC False
1 234 1nHG HG False
2 345 KL HG FL wrong item True
3 456 21ABC ABC True
4 567 K 200 200 False
5 678 1g HG HG True
You can drop the final column, but you may find it useful for other purposes.
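If the suffixes can be expressed as one regular expression, another loop-free option is str.extract (a sketch; unlike the break in the original loop, the regex picks the leftmost match when one suffix ends another, which may or may not matter for your data):
import re

# one alternation anchored at the end of the string
pattern = '(' + '|'.join(map(re.escape, L)) + ')$'
suffix = df['text'].str.strip().str.extract(pattern, expand=False)

df['result'] = ''
df.loc[suffix.notna() & df['reference'].eq(''), 'result'] = suffix
df.loc[suffix.notna() & df['reference'].ne('') & df['reference'].ne(suffix), 'result'] = 'wrong item'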
Please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Based on the condition contained in the column condition, I have to define a new column in this data frame which counts how many ids are in that condition.
However, please note that since the DataFrame is ordered by the timestamp column, there can be multiple entries for the same id, so a simple .cumsum() is not a viable option.
I have come up with the following code, which works correctly but is extremely slow:
# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
    df.loc[r, 'count'] = ids_with_condition_a.size
Keeping these NumPy arrays is very useful to me because they give the list of ids in a particular condition. I would also like to be able to dynamically put these arrays into a corresponding cell of the df DataFrame.
Can you come up with a better solution in terms of performance?
You need to use groupby on the column 'condition' and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):
df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0
with your input sample, you get:
id condition count
0 1234 A 1
1 2323 B 1
2 3843 B 2
3 1234 C 1
4 8574 A 2
5 9483 A 3
which is faster than using a for loop.
And if you just want the rows with condition A, for example, you can use a mask: print (df[df['condition'] == 'A']) shows only the rows where condition equals A. So to get an array:
arr_A = df.loc[df['condition'] == 'A','id'].values
print (arr_A)
array([1234, 8574, 9483])
EDIT: to create two columns per condition, you can do, for example for condition A:
import numpy as np

# put 1 in the column where the condition is met
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, None)
# then use cumsum to increment the number, ffill to fill the same number down
# where the condition is not met, and fillna(0) for the remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (other approaches may exist, but this is one way)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])
the output looks like this:
id condition nb_cond_A partial_arr_A
0 1234 A 1 [1234]
1 2323 B 1 [1234]
2 3843 B 1 [1234]
3 1234 C 1 [1234]
4 8574 A 2 [1234, 8574]
5 9483 A 3 [1234, 8574, 9483]
Then do the same thing for B and C. A loop like for cond in set(df['condition']) could be practical for generalisation; a sketch of that follows below.
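That generalisation might look like the following sketch (a simplification of the above: a boolean cumsum gives the same running count as the where/cumsum/ffill chain, and column names such as nb_cond_B are just illustrative):
for cond in df['condition'].unique():
    is_cond = df['condition'].eq(cond)

    # running count of rows seen so far in this condition
    df['nb_cond_{}'.format(cond)] = is_cond.cumsum()

    # ids belonging to this condition, in order of appearance
    arr = df.loc[is_cond, 'id'].to_numpy()
    df['partial_arr_{}'.format(cond)] = df['nb_cond_{}'.format(cond)].apply(
        lambda n, a=arr: list(a[:n]))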
EDIT 2: one idea to do what you explained in the comments, though I'm not sure it improves the performance:
# array of unique conditions
arr_cond = df.condition.unique()

# use apply to create, row-wise, the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name]
                                        .drop_duplicates('id', keep='last')
                                        .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x, list) else x))
Some explanations: for each row, select the dataframe up to this row (loc[:row.name]) and drop duplicated 'id's, keeping the last one (drop_duplicates('id', keep='last'); in your example this means that once we reach row 3, row 0 is dropped, as the id 1234 appears twice). The data is then grouped by condition (groupby('condition')) and the ids for each condition are collected into a list (id.apply(list)). The applymap at the end fills the missing cells with an empty list (you can't use fillna([]); it's not possible).
For the length for each condition, you can do:
for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)
The result looks like this:
id condition A B C len_A len_B len_C
0 1234 A [1234] [] [] 1 0 0
1 2323 B [1234] [2323] [] 1 1 0
2 3843 B [1234] [2323, 3843] [] 1 2 0
3 1234 C [] [2323, 3843] [1234] 0 2 1
4 8574 A [8574] [2323, 3843] [1234] 1 2 1
5 9483 A [8574, 9483] [2323, 3843] [1234] 2 2 1