I have a data frame (df)
df = pd.DataFrame({'No': [123,234,345,456,567,678], 'text': ['60 ABC','1nHG','KL HG','21ABC','K 200','1g HG'], 'reference':['ABC','HG','FL','','200',''], 'result':['','','','','','']}, columns=['No', 'text', 'reference', 'result'])
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL
3 456 21ABC
4 567 K 200 200
5 678 1g HG
and a list with elements
list
['ABC','HG','FL','200','CP1']
Now I have the following coding:
for idx, row in df.iterrows():
    for item in list:
        if row['text'].strip().endswith(item):
            if pd.isnull(row['reference']):
                df.at[idx, 'result'] = item
            elif pd.notnull(row['reference']) and row['reference'] != item:
                df.at[idx, 'result'] = 'wrong item'
        if pd.isnull(row['result']):
            break
I run through df and the list and check for matches.
Output:
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL wrong item
3 456 21ABC ABC
4 567 K 200 200
5 678 1g HG HG
The break is important because otherwise a second matching element in the list could overwrite the content already written to result.
Now I need another solution because the data frame is huge and for loops are inefficient. I think apply could work, but how?
Thank you!
Instead of iterating rows, you can iterate your suffixes, which is likely a much smaller iterable. This way, you can take advantage of series-based methods and Boolean indexing.
I've also created an extra series to identify when a row has been updated. The cost of this extra check should be small versus the expense of iterating by row.
L = ['ABC', 'HG', 'FL', '200', 'CP1']
df['text'] = df['text'].str.strip()
null = df['reference'].eq('')
df['updated'] = False
for item in L:
    ends = df['text'].str.endswith(item)
    diff = df['reference'].ne(item)
    m1 = ends & null & ~df['updated']
    m2 = ends & diff & ~null & ~df['updated']
    df.loc[m1, 'result'] = item
    df.loc[m2, 'result'] = 'wrong item'
    df.loc[m1 | m2, 'updated'] = True
Result:
No text reference result updated
0 123 60 ABC ABC False
1 234 1nHG HG False
2 345 KL HG FL wrong item True
3 456 21ABC ABC True
4 567 K 200 200 False
5 678 1g HG HG True
You can drop the final column, but you may find it useful for other purposes.
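If you only want to keep the result column, one way to remove the helper column afterwards (assuming the df built above) is simply:
df = df.drop(columns='updated')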
Related
I ran into this specific problem where I have a dataframe of ID numbers. Some of these account numbers have dropped leading zeros.
dataframe is df.
ID
345
345
543
000922
000345
000345
000543
What I'm trying to do is create a generalized way to check whether leading zeros have been dropped. My real data set has millions of rows, so I want to use a pandas method that flags an ID when it matches another ID once the zeros are ignored, and puts those rows into another dataframe so I can examine them further.
I do that like this:
new_df = df.loc[df['ID'].isin(df['ID'])]
My reasoning for this is that I want to filter that dataset to find if any of the IDs are inside the full IDs.
Now I have
ID
345
345
543
000345
000345
000543
I can use a .unique() to get a series of each unique combo.
ID
345
543
000345
000543
This is fine for a small dataset, but with millions of rows I'm wondering how to make this check more efficient.
I'm trying to find a way to create a dictionary where the keys are the stripped IDs and the values are the full IDs, or vice versa.
Any tips on that would be appreciated.
Tips on a different approach to checking for dropped zeros, other than the dictionary idea, would also be helpful.
Note: It is not always 3 digits. Could be 4567 for example, where the real value would be 004567.
One option is to strip leading "0"s:
out = df['ID'].str.lstrip('0').unique()
Output:
array(['345', '543', '922'], dtype=object)
or prepend "0"s:
out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()
Output:
array(['000345', '000543', '000922'], dtype=object)
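If you also want to pull out the rows where the same underlying number appears in more than one spelling (the new_df the question describes), one possible follow-up on top of the stripped values; this assumes df['ID'] holds strings as in the question:
norm = df['ID'].str.lstrip('0')
# flag normalized IDs that occur with more than one distinct original spelling
mask = df.groupby(norm)['ID'].transform('nunique') > 1
affected = df[mask]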
Use:
print (df)
ID
0 345
1 345
2 540
3 2922
4 002922
5 000344
6 000345
7 000543
#filter IDs starting with 0 into a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
#set the Series index to the IDs with left zeros stripped
d.index = d.str.lstrip('0')
print (d)
ID
2922 002922
344 000344
345 000345
543 000543
Name: ID, dtype: object
#dict all possible values
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}
#keep only indices that exist in the original ID column and create dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
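To pull the potentially affected rows into a separate dataframe for inspection, one possible follow-up using the dict d from above:
#rows whose ID is either a stripped key or a zero-padded value in the mapping
suspect = df[df['ID'].isin(list(d)) | df['ID'].isin(list(d.values()))]
print (suspect)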
Create a dictionary for finding potentially affected records.
# Creates a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()
df
ID stripped affected_id
0 00456 456 False
1 0000456 456 False
2 567 567 True
3 00567 567 False
# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}
If you want to include the stripped records in the list as well, then:
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
You can compare each ID against itself after an int round-trip, which drops the leading zeros:
m = df['ID'].ne(df['ID'].astype(int).astype(str))
print(m)
0 False
1 False
2 False
3 True
4 True
5 True
Name: ID, dtype: bool
print(df[m])
ID
3 000345
4 000345
5 000543
I would like to compare two consecutive rows in two columns and, where they have the same values, create a new column based on the difference of a third column's values. See input and expected output below:
Input:
df = pd.DataFrame({'Account Number': [123,123,123,456,456,456], 'Value':['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],'Positions':[10,10,20,20,20,15]})
Expected Output:
df = pd.DataFrame({'Account Number': [123,123,123,456,456,456],'Value':['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],'Positions':[10,10,20,20,20,15], 'new_col': [0,0,10,0,0,-5]})
In excel, the formula is simply:
IF(AND(B2=B1,C2=C1), D2-D1, 0)
where: B = Account Number, C = Value, D = Positions, and the formula lives in the new_col column
I've tried two approaches so far: (1) using iloc, which yields an "IndexError: single positional indexer is out-of-bounds", and (2) using rolling(n), which I can't even get to run. My attempt at (1) is below. Any help would be great. Thanks!
a = 0
if a != len(df):
    for a in range(len(df)):
        df['new_col'] = np.where((df["Account Number"].iloc[a+1] == df["Account Number"].iloc[a]) and (df["Value"].iloc[a+1] == df["Value"].iloc[a]), df["Positions"].iloc[a+1] - df["Positions"].iloc[a], 0)
        a += 1
Instead of a loop, you should use a simple and more performant pandas method called .diff():
df['new_col'] = df.groupby('Account Number')['Positions'].diff().fillna(0).astype(int)
Full code:
df = pd.DataFrame({'Account Number': [123,123,123,456,456,456], 'Value':['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],'Positions':[10,10,20,20,20,15]})
df['new_col'] = df.groupby('Account Number')['Positions'].diff().fillna(0).astype(int)
df
Out[1]:
Account Number Value Positions new_col
0 123 ABC 10 0
1 123 ABC 10 0
2 123 ABC 20 10
3 456 DEF 20 0
4 456 DEF 20 0
5 456 DEF 15 -5
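If the difference should only be taken when both Account Number and Value repeat (as the Excel formula does), a minimal variant is to group by both columns; this assumes the same df as above:
df['new_col'] = df.groupby(['Account Number', 'Value'])['Positions'].diff().fillna(0).astype(int)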
for a in range(len(df)-1):
Notice the -1.
Without it, yes, you will run into an IndexError.
Looping over pandas structures should generally be avoided, but I don't see how to help here without a loop or reworking your data structure.
new_col = [0]
for a in range(len(df)-1):
    if (df["Account Number"].iloc[a+1] == df["Account Number"].iloc[a]) and (df["Value"].iloc[a+1] == df["Value"].iloc[a]):
        new_col.append(df["Positions"].iloc[a+1] - df["Positions"].iloc[a])
    else:
        new_col.append(0)
df["new_col"] = new_col
print(df)
gives the following output
Account Number Value Positions new_col
0 123 ABC 10 0
1 123 ABC 10 0
2 123 ABC 20 10
3 456 DEF 20 0
4 456 DEF 20 0
5 456 DEF 15 -5
I have a pandas dataframe that looks something like this:
Item Status
123 B
123 BW
123 W
123 NF
456 W
456 BW
789 W
789 NF
000 NF
And I need to create a new column Value which will be either 1 or 0 depending on the values in the Item and Status columns. The assignment of the value 1 is prioritized by this order: B, BW, W, NF. So, using the sample dataframe above, the result should be:
Item Status Value
123 B 1
123 BW 0
123 W 0
123 NF 0
456 W 0
456 BW 1
789 W 1
789 NF 0
000 NF 1
Using Python 3.7.
Taking your original dataframe as the input df, the following code will produce your desired output:
#dictionary assigning order of priority to status values
priority_map = {'B':1,'BW':2,'W':3,'NF':4}
#new temporary column that converts Status values to order of priority values
df['rank'] = df['Status'].map(priority_map)
#create dictionary with Item as key and lowest rank value per Item as value
lowest_val_dict = df.groupby('Item')['rank'].min().to_dict()
#new column that assigns the same Value to all rows per Item
df['Value'] = df['Item'].map(lowest_val_dict)
#set Value to 1 where the row's rank equals the Item's lowest rank, otherwise 0
df['Value'] = np.where(df['Value'] == df['rank'],1,0)
#delete rank column
del df['rank']
I would prefer an approach where the status is an ordered pd.Categorical, because a) that's what it is and b) it's much more readable: if you have that, you just compare if a value is equal to the max of its group:
df['Status'] = pd.Categorical(df['Status'], categories=['NF', 'W', 'BW', 'B'],
                              ordered=True)
df['Value'] = df.groupby('Item')['Status'].apply(lambda x: (x == x.max()).astype(int))
# Item Status Value
#0 123 B 1
#1 123 BW 0
#2 123 W 0
#3 123 NF 0
#4 456 W 0
#5 456 BW 1
#6 789 W 1
#7 789 NF 0
#8 0 NF 1
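Depending on the pandas version, groupby().apply() can hand back an awkwardly indexed result; a sketch of a version-robust variant of the same idea, assuming Status is already the ordered Categorical from above, compares the category codes against the groupwise maximum instead:
codes = df['Status'].cat.codes    # ordered categories, so a larger code means higher priority
df['Value'] = (codes == codes.groupby(df['Item']).transform('max')).astype(int)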
I might be able to help you conceptually, by explaining some steps that I would do:
Create the new column Value and fill it with zeros, e.g. df['Value'] = 0
Group the dataframe by Item: grouped = df.groupby('Item')
Iterate through all the groups found: for name, group in grouped:
Using a simple function with ifs, a custom priority queue, custom sorting criteria, or any other preferred method, determine which entry has the highest priority ("the value 1 is prioritized by this order: B, BW, W, NF") and assign 1 to its Value column, e.g. df.loc[entry, 'Value'] = 1. A rough sketch of these steps follows the example below.
Let's say we are looking at group '123':
Item Status Value
-------------------------
123 B 0 (before 0, after 1)
123 BW 0
123 W 0
123 NF 0
Because the row [123, 'B', 0] had the highest priority based on your criteria, you change it to [123, 'B', 1]
When finished, build the dataframe back from the groupby object and you're done. There are several ways to do that; see: Converting a Pandas GroupBy object to DataFrame
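A minimal sketch of the steps above, using the sample data from the question (the priority list and the idxmin trick are just one way to pick the highest-priority row):
import pandas as pd

df = pd.DataFrame({'Item': ['123', '123', '123', '123', '456', '456', '789', '789', '000'],
                   'Status': ['B', 'BW', 'W', 'NF', 'W', 'BW', 'W', 'NF', 'NF']})

priority = ['B', 'BW', 'W', 'NF']               # highest priority first

df['Value'] = 0                                 # step 1: new column filled with zeros
for name, group in df.groupby('Item'):          # steps 2-3: group by Item and iterate
    # step 4: the row whose Status comes earliest in the priority list gets a 1
    best_idx = group['Status'].map(priority.index).idxmin()
    df.loc[best_idx, 'Value'] = 1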
I have a pandas DataFrame like so:
from_user to_user
0 123 456
1 894 135
2 179 890
3 456 123
Where each row contains two IDs that reflect whether the from_user "follows" the to_user. How can I count the total number of mutual followers in the DataFrame using pandas?
In the example above, the answer should be 1 (users 123 & 456).
One way is to use MultiIndex set operations:
In [11]: i1 = df.set_index(["from_user", "to_user"]).index
In [12]: i2 = df.set_index(["to_user", "from_user"]).index
In [13]: (i1 & i2).levels[0]
Out[13]: Int64Index([123, 456], dtype='int64')
To get the count you have to divide the length of this index by 2:
In [14]: len(i1 & i2) // 2
Out[14]: 1
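Note that recent pandas versions deprecate using & on Index objects for set operations, so the explicit spelling may be safer; a small variant using the same i1 and i2 as above:
mutual = i1.intersection(i2)   # explicit set intersection
len(mutual) // 2               # each mutual pair appears twice: (123, 456) and (456, 123)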
Another way is to build an order-independent key for each row from the sorted pair of user IDs, then count how many times each key occurs:
# build a key from the sorted pair of IDs, so (123, 456) and (456, 123) collapse to the same key
df['pair'] = df.apply(lambda r: '-'.join(sorted([str(r.from_user), str(r.to_user)])), axis=1)
# each mutual relationship produces its key twice; count the extra occurrences
count = (df.groupby('pair').size() - 1).sum()
Out[64]: 1
Here is another slightly more hacky way to do this:
(df.loc[df.to_user.isin(df.from_user)]
   .assign(hacky=df.from_user * df.to_user)
   .drop_duplicates(subset='hacky', keep='first')
   .drop(columns='hacky'))
from_user to_user
0 123 456
The whole multiplication hack exists to ensure we don't return 123 --> 456 and 456 --> 123 since both are valid given the conditional we provide to loc
I have a Dataframe that look like this (small version):
A B C
0 125 ADB [AF:12]
1 189 ACB [AF:78, AF:85, AF:98]
2 148 ADB []
3 789 ARF [AF:89, AF:85, AF:12]
4 789 BCD [AF:76, AF:25]
How can I see if some of the items in column "C" are in a list?
knowing that when I do type(df.C) I get class 'pandas.core.series.Series'
if for example the list is:
['AF:12', 'AF:25']
The expected output would be:
A B C D
0 125 ADB [AF:12] True
1 189 ACB [AF:78, AF:85, AF:98] False
2 148 ADB [] False
3 789 ARF [AF:89, AF:85, AF:12] True
4 789 BCD [AF:76, AF:25] True
I have tried df['D'] = df['C'].isin(list)
but I get False everywhere, probably because "C" is a column of lists.
Is there a way to get around that?
Any help would be greatly appreciated
If the elements of the C column are lists, then one way to do this is to take the set intersection between your list and each element of C using the Series.apply method. Example -
setlst = set(yourlist)
df['D'] = df['C'].apply(lambda x: bool(setlst.intersection(x)))
You can confirm that the elements of C are lists by checking that type(df['C'][0]) is list.
Also, please note that using list as a variable name is not recommended, as it shadows the built-in type list.
data = {'B':['ADB','ACB','ADB','ARF','BCD'],
'A':[125,189,148,789,789],
'C':[['AF:12'],['AF:78', 'AF:85', 'AF:98'],[],
['AF:89', 'AF:85', 'AF:12'],['AF:76', 'AF:25']]}
df = pd.DataFrame(data)
def in_list(list_to_search, terms_to_search):
    # True if any element of the row's list is among the search terms
    results = [item for item in list_to_search if item in terms_to_search]
    if len(results) > 0:
        return True
    else:
        return False
df['D'] = df['C'].apply(lambda x: in_list(x, ['AF:12', 'AF:25']))
Result:
A B C D
0 125 ADB [AF:12] True
1 189 ACB [AF:78, AF:85, AF:98] False
2 148 ADB [] False
3 789 ARF [AF:89, AF:85, AF:12] True
4 789 BCD [AF:76, AF:25] True
def is_in_list(row_list, search_terms):
    # True if any element of this row's list is in the search terms
    for ele in row_list:
        if ele in search_terms:
            return True
    return False

df['D'] = df['C'].apply(lambda x: is_in_list(x, ['AF:12', 'AF:25']))
Maybe this function will do it.