Match standalone words in string in DataFrame column - python

I have a dataframe such as:
import pandas as pd
import re
df = pd.DataFrame({"Name": ["D1", "D2", "D3", "D4", "M1", "M2", "M3"],
"Requirements": ["3 meters|2/3 meters|3.5 meters",
"3 meters",
"3/5 meters|3 meters",
"2/3 meters",
"steel|g1_steel",
"steel",
"g1_steel"]})
which gives:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 M1 steel|g1_steel
5 M2 steel
6 M3 g1_steel
I have a list of words req_list = ['3 meters', 'steel'] and I am trying to extract rows from df where the strings in column Requirements contain standalone words that are from req_list. This is what I have done:
This one prints just D2 and M2
df[df.Requirements.apply(lambda x: any(len(x.replace(y, '')) == 0 for y in req_list))]
This one prints all rows
df[df['Requirements'].str.contains(fr"\b(?:{'|'.join(req_list)})\b")]
My desired result is as follows:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
In this desired output, D4 and M3 are eliminated because they do not contain words from req_list as standalone strings. Is there any way to achieve this, preferably as a one-liner without custom functions?
EDIT
The strings in the column Requirements can come in any pattern such as:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 D5 3::3 meters # New pattern which needs to be eliminated
5 D6 3.3 meters # New pattern which needs to be eliminated
6 D7 3?3 meters # New pattern which needs to be eliminated
7 M1 steel|g1_steel
8 M2 steel
9 M3 g1_steel

Since you want to make sure you do not match 3 meters when it is preceded by a digit + /, you can add a (?<!\d/) negative lookbehind after the initial word boundary:
df[df['Requirements'].str.contains(fr"\b(?<!\d/)(?:{'|'.join(req_list)})\b")]
Output:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
Notes
Since req_list contains phrases (multiword strings), you may have to sort the items by length in descending order before joining them with the | OR operator, so you'd better use fr"\b(?<!\d/)(?:{'|'.join(sorted(req_list, key=len, reverse=True))})\b" as the regex
If req_list ever contains items with special characters, you should also escape them and use adaptive dynamic word boundaries, i.e. fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)".
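Putting these notes together, a minimal runnable sketch (using the df and req_list from the question):
import re

# longest-first alternation of escaped items, with adaptive word boundaries
pattern = fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)"
# rows whose Requirements contain at least one standalone item from req_list
print(df[df['Requirements'].str.contains(pattern)])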

Here is one more way to do it:
def chk(row):
    for r in row:
        if r.strip() in req_list:
            return True
    return False

df[df.assign(lst=df['Requirements'].str.split('|'))['lst'].apply(chk)]
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
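The same idea can also be collapsed into the one-liner the question asked for, by splitting on | and checking for set overlap (a sketch using the df and req_list from the question; it produces the desired output above):
df[df['Requirements'].str.split('|').apply(lambda parts: bool(set(parts) & set(req_list)))]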

Related

Applying a for loop on rows that meet certain conditions in a DataFrame

I am looking to apply two different for loops on a single dataframe.
The data I have is taken from a PDF and looks like this upon reading into a DataFrame
Output
            Summary  Prior Years  1  2  3   4    5  6  7  8  9  10  Total
0 Total Value 3,700          110  -  -  -   5  NaN  -  -  -  -  --  3,815
1       Total Value      115 100  -  -  -  10  NaN  -  -  -  -  --    225
The expected table output is
Expected Output
       Summary  Prior Years    1  2  3  4   5  6  7  8  9  10  Total
0  Total Value        3,700  110  -  -  -   5  -  -  -  -  --  3,815
1  Total Value          115  100  -  -  -  10  -  -  -  -  --    225
To resolve the errors from the original output I did as follows
test.loc[:,"1":"5"]=test.loc[:,"Prior Years":"5"].shift(axis=1)
test[['Summary','Prior Years']]=test['Summary'].str.strip().str.extract(r'(\D*).*?([\d\,\.]*)' )
and
test.loc[:,"1":"5"]=test.loc[:,"Prior Years":"5"].shift(axis=1)
test[['Prior Years', '1']]=test['Prior Years'].str.split(' ',expand=True)
These solve the respective issues in both columns when isolated but I am looking to utilize both these conditions simultaneously
When I attempt to write 'for' loops using these conditions above, it affects the whole dataframe, rather than just the row where individual conditions are met
An example of this is
for i in test.loc[:, 'Summary']:
    if len(i) > 12:
        test.loc[:, "1":"5"] = test.loc[:, "Prior Years":"5"].shift(axis=1)
        test[['Summary', 'Prior Years']] = test['Summary'].str.strip().str.extract(r'(\D*).*?([\d\,\.]*)')
Which then outputs
Output
       Summary  Prior Years    1  2  3   4  5  6  7  8  9  10  Total
0  Total Value        3,700  110  -  -   -  5  -  -  -  -  --  3,815
1  Total Value      115 100    -  -  -  10  -  -  -  -  -  --    225
I am using the string length as the trigger for the for loop, since the 'Summary' and 'Prior Years' columns will have fairly uniform string lengths.
Right now your operations are affecting the whole column. If you loop through the index instead, you can limit the operation to just the rows you want to change:
import re  # for the single-row equivalent of .str.extract below

for idx in test.index:
    if len(test.loc[idx, "Summary"]) > 12:
        # a row slice is already 1-D, so shift it without axis=1
        test.loc[idx, "1":"5"] = test.loc[idx, "Prior Years":"5"].shift()
        # scalar equivalent of .str.strip().str.extract(...) for one row
        test.loc[idx, ["Summary", "Prior Years"]] = re.match(
            r'(\D*).*?([\d\,\.]*)', test.loc[idx, "Summary"].strip()).groups()
    if len(test.loc[idx, "1"]) > 5:
        test.loc[idx, "1":"5"] = test.loc[idx, "Prior Years":"5"].shift()
        # plain str.split for a single cell
        test.loc[idx, ["Prior Years", "1"]] = test.loc[idx, "Prior Years"].split(" ")
If this code is too slow, it's also possible to vectorize this:
# compare the string lengths, not the strings themselves
mask = test["Summary"].str.len() > 12
test.loc[mask, "1":"5"] = test.loc[mask, "Prior Years":"5"].shift(axis=1)
# .to_numpy() stops pandas from realigning the extracted columns by label
test.loc[mask, ["Summary", "Prior Years"]] = test.loc[mask, "Summary"].str.strip().str.extract(r'(\D*).*?([\d\,\.]*)').to_numpy()
mask = test["1"].str.len() > 5
test.loc[mask, "1":"5"] = test.loc[mask, "Prior Years":"5"].shift(axis=1)
test.loc[mask, ["Prior Years", "1"]] = test.loc[mask, "Prior Years"].str.split(" ", expand=True).to_numpy()

Data manipulation in Python

I am trying to get the mode of the model names across the iteration columns and then find the error associated with that model.
If two or more modes are returned, we select the model with the least error among them.
import pandas as pd
data1 = {'Iteration1': ["M2", 'M1', "M3", "M5", "M4", "M6"],
         'Iteration1_error': [96, 98, 34, 19, 22, 9],
         'Iteration2': ["M3", 'M1', "M1", "M5", "M6", "M4"],
         'Iteration2_error': [76, 88, 54, 12, 92, 19],
         'Iteration3': ["M3", 'M1', "M1", "M5", "M6", "M4"],
         'Iteration3_error': [66, 68, 84, 52, 72, 89]}
Input1 = pd.DataFrame(data1,
                      columns=['Iteration1', 'Iteration1_error', 'Iteration2',
                               'Iteration2_error', 'Iteration3', 'Iteration3_error'],
                      index=['I1', 'I2', 'I3', 'I4', 'I5', 'I6'])
print(Input1)
data2 = {'Iteration1': ["M2", 'M1', "M3", "M5", "M4", "M6"],
         'Iteration1_error': [96, 98, 34, 19, 22, 9],
         'Iteration2': ["M3", 'M1', "M1", "M5", "M6", "M4"],
         'Iteration2_error': [76, 88, 54, 12, 92, 19],
         'Iteration3': ["M3", 'M1', "M1", "M5", "M6", "M4"],
         'Iteration3_error': [66, 68, 84, 52, 72, 89],
         'Mode of model name in all iterations': ['M3', 'M1', 'M1', 'M5', 'M6', 'M4'],
         'Best model error': [66, 68, 54, 12, 72, 19]}
Output1 = pd.DataFrame(data2,
                       columns=['Iteration1', 'Iteration1_error', 'Iteration2',
                                'Iteration2_error', 'Iteration3', 'Iteration3_error',
                                'Mode of model name in all iterations', 'Best model error'],
                       index=['I1', 'I2', 'I3', 'I4', 'I5', 'I6'])
print(Output1)
Question: So we are expecting an output with two extra columns at the end: one gives the mode of the model names per row, the second gives the error of that mode. The first 6 columns are the input dataframe. In case two or more modes are returned, for example ("M1", "M2", "M3") where all three values are different and each is technically a mode, the model with the least error is selected.
What I tried: I was able to get the mode across the columns using .mode(numeric_only=False), but I am stuck on how to get that mode's error from the 2nd, 4th and 6th columns.
Use:
# keep only the model-name columns (Iteration followed by a number)
df = Input1.filter(regex=r'Iteration\d+$')
# take the first mode per row
s = df.mode(axis=1).iloc[:, 0]
# compare df against the mode and add the _error suffix so the boolean mask
# aligns with the error columns, then take the row-wise minimum of the
# matching errors
s1 = Input1.where(df.eq(s, axis=0).add_suffix('_error')).min(axis=1)
# add the new columns
Output1 = Input1.assign(best_mode=s, best_error=s1)
print(Output1)
Iteration1 Iteration1_error Iteration2 Iteration2_error Iteration3 \
I1 M2 96 M3 76 M3
I2 M1 98 M1 88 M1
I3 M3 34 M1 54 M1
I4 M5 19 M5 12 M5
I5 M4 22 M6 92 M6
I6 M6 9 M4 19 M4
Iteration3_error best_mode best_error
I1 66 M3 66.0
I2 68 M1 68.0
I3 84 M1 54.0
I4 52 M5 12.0
I5 72 M6 72.0
I6 89 M4 19.0
Another idea, if it is possible to rely on the column positions (the model/error columns have to strictly alternate, in order), is to take every second column:
df = Input1.iloc[:, ::2]
s = df.mode(axis=1).iloc[:, 0]
s1 = Input1.iloc[:, 1::2].where(df.eq(s, axis=0).to_numpy()).min(axis=1)
Output1 = Input1.assign(best_mode = s, best_error=s1)
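To see why the add_suffix alignment works in the first approach, you can inspect the intermediate result of the where call (a quick check, reusing Input1, df and s from above):
mask = df.eq(s, axis=0).add_suffix('_error')
# mask's columns are now Iteration1_error, Iteration2_error, Iteration3_error,
# so Input1.where(mask) keeps only the error values whose model column matched
# the row's mode; every other cell becomes NaN before min(axis=1) is taken
print(Input1.where(mask))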

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience using these and thought it would be a good learning experience. I was able to complete the assignment using traditional loops that I know from traditional computer programming, and it ran okay over thousands of rows, but it brought my laptop down to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
The two dataframes above correspond to an illustration of the vehicles' positions along the road. The start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be appended to the cars dataframe along with the following new columns:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
id start end back_id back_start back_end back_distance across_id across_start across_end front_id front_start front_end front_distance
0 C1 200 215 T1 115 175 25 T2 200 260
1 C2 110 125 T4 25 85 25 T1 115 175 -10
2 C3 240 255 T2 200 260 T3 280 340 25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
So with pandas you can use merge_asof. Here is one way; it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}'
                                               for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward')\
            .query('end > back_end')\
            .assign(back_distance=lambda x: x['start'] - x['back_end'])

# across condition: note that cars is the first of the two dataframes here
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}'
                                                 for col in trucks.columns}),
                          on='across_start', direction='backward')\
              .query('end <= across_end')

# front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}'
                                                for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward')\
             .query('start < front_start')\
             .assign(front_distance=lambda x: x['front_start'] - x['end'])

# merge everything back onto cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')
and you get
print (df_f)
id start end back_id back_start back_end back_distance across_start \
0 C2 110 125 T4 25.0 85.0 25.0 NaN
1 C1 200 215 T1 115.0 175.0 25.0 200.0
2 C3 240 255 NaN NaN NaN NaN 240.0
across_id across_end front_id front_start front_end front_distance
0 NaN NaN T1 115.0 175.0 -10.0
1 T2 260.0 NaN NaN NaN NaN
2 T2 260.0 T3 280.0 340.0 25.0
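To present the result in the same car order as the desired output, a final sort can be appended (a small follow-up, assuming the ids sort lexicographically as in the sample):
df_f = df_f.sort_values('id').reset_index(drop=True)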

How do I group by time interval and do a calculation based on ID using Python

TIME X Y class ID
16:00:06 103.78499 71.45633073 1 A
16:05:06 109.09495 92.82448711 2 A2
16:08:06 105.98095 67.84456255 1 A6
16:11:05 103.461044 71.42452176 3 F4
16:15:05 103.7674 71.47382163 1 C55
16:17:26 107.476204 73.34450468 3 E4
16:18:06 105.27734 87.58118231 2 O8
16:19:21 103.785805 71.44292864 2 H77
16:33:18 103.720085 71.37973541 1 A
17:40:11 107.343185 88.27521372 1 A6
17:43:06 100.84964 67.38097318 2 D
17:45:06 110.1006 66.26799123 2 A
17:50:12 105.854195 87.77078816 2 D4
17:55:09 96.142845 61.99788643 2 F4
18:02:08 103.78293 71.48327491 2 RR3
18:09:33 103.927475 71.49321361 2 U7
18:25:12 104.722595 90.43501546 3 S
18:30:05 109.6942 66.22699393 3 S33
18:33:04 109.71278 97.2428849 3 T17
18:40:44 56.124245 71.13521775 3 G22
18:44:02 93.29675 63.89221211 3 II1
18:50:13 109.70228 71.47756311 1 S9
18:55:11 104.626045 89.71097044 1 B
19:00:06 93.210075 63.85872612 1 A
19:05:04 57.414974 67.63951569 3 D4
19:22:15 103.91814 71.67476075 3 E65
19:24:10 93.26354 67.95513579 3 P9
19:30:16 104.45209 82.10376889 3 Q12
19:40:07 103.7826 71.49702169 2 XX6
19:44:07 103.79914 71.71096234 2 OL4
1- Above is a sample of a data frame which has thousands of records.
2- How can I group it by time (20 min) and class number, then calculate the Euclidean distance between the x, y points using:
dist = sqrt( (x2 - x1)**2 + (y2 - y1)**2 )
3- Iterate over every 20 minutes of the data, with no duplicate values within a single iteration.
I have tried to implement this sample, but it is still not working:
python split a pandas data frame by week or month and group the data based on these sp
The expected results for the first two 20-minute iterations:
Iter ID CLASS distance
1 A,A6 1 4.22695
1 A,C55 1 0.024806
1 A6, C55 1 4.251038
1 A2,O8 2 6.485861
1 A2,H77 2 22.030843
1 O8,H77 2 16.207033
1 F4, E4 3 4.4506
2 A,A6 1 17.279585
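A minimal sketch of one way to approach this, assuming df holds the sample above, the TIME strings parse as %H:%M:%S, and "every 20 min" means fixed clock-aligned 20-minute bins (the expected output may number its iterations differently):
import pandas as pd
from itertools import combinations

df['TIME'] = pd.to_datetime(df['TIME'], format='%H:%M:%S')
# bin each record into a fixed 20-minute window
df['window'] = df['TIME'].dt.floor('20min')

rows = []
for (window, cls), grp in df.groupby(['window', 'class']):
    # every unordered ID pair within the same window and class, no duplicates
    for (_, r1), (_, r2) in combinations(grp.iterrows(), 2):
        dist = ((r2['X'] - r1['X'])**2 + (r2['Y'] - r1['Y'])**2) ** 0.5
        rows.append({'window': window, 'ID': f"{r1['ID']},{r2['ID']}",
                     'CLASS': cls, 'distance': dist})

result = pd.DataFrame(rows)
print(result)
For the first window (16:00-16:20) and class 1, this produces the A,A6 / A,C55 / A6,C55 pairs with the distances shown in the expected output.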

How to delete rows with less than a certain amount of items or strings with Pandas?

I have searched a lot but couldn't find a solution to this particular case. I want to remove any rows whose lists contain 3 or fewer strings or items. My issues will be addressed more clearly further down.
I'm preparing a LDA topic modelling with a large Swedish database in pandas and have limited the test case to 1000 rows. I'm only concerned with a specific column and my approach so far has been as follows:
import sqlite3
import pandas as pd

con = sqlite3.connect('/Users/mo/EXP/NAV/afm.db')
sql = """
select * from stillinger limit 1000
"""
dfs = pd.read_sql(sql, con)
plb = """
select PLATSBESKRIVNING from stillinger limit 1000
"""
dfp = pd.read_sql(plb, con); dfp
Then I've defined a pair of regular expressions: the first removes any meta characters while keeping the Swedish and Norwegian language-specific letters; the second removes words of three characters or fewer:
rep = {
    'PLATSBESKRIVNING': {
        r'[^A-Za-zÅåÄäÖöÆØÅæøå]+': ' ',
        r'\W*\b\w{1,3}\b': ' '}
}
p0 = (pd.DataFrame(dfp['PLATSBESKRIVNING'].str.lower()).replace(rep, regex=True)
        .drop_duplicates('PLATSBESKRIVNING').reset_index(drop=True)); p0
PLATSBESKRIVNING
0 medrek rekrytering söker uppdrag manpower h...
1 familj barn tjejer kille söker pair ...
2 uppgift blir tillsammans medarbetare leda ...
3 behov operasjonssykepleiere langtidsoppdr...
4 detta perfekta jobbet arbetstiderna vardaga...
5 familj paris barn söker älskar barn v...
6 alla inom cafe restaurang förekommande arbets...
.
.
Creating a pandas Series:
s0 = p0['PLATSBESKRIVNING']
Then:
ts = s0.str.lower().str.split();ts
0 [medrek, rekrytering, söker, uppdrag, manpower...
1 [familj, barn, tjejer, kille, söker, pair, vil...
2 [uppgift, blir, tillsammans, medarbetare, leda...
3 [behov, operasjonssykepleiere, langtidsoppdrag...
4 [detta, perfekta, jobbet, arbetstiderna, varda...
5 [familj, paris, barn, söker, älskar, barn, vil...
6 [alla, inom, cafe, restaurang, förekommande, a...
7 [diskare, till, cafe, dubbel, sökes, arbetet, ...
8 [diskare, till, thelins, konditori, sökes, arb...
Removing the stop words (mswl is a predefined stop word list) from the text:
r = s0.str.split().apply(lambda x: [item for item in x if item not in mswl]);r
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 []
8 [thelins]
9 [paris, baby, månader, våning, delar, badrum, ...
Creating a new DataFrame and removing the empty brackets:
dr = pd.DataFrame(r)
dr0 = dr[dr.astype(str)['PLATSBESKRIVNING'] != '[]'].reset_index(drop=True); dr0
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 [thelins]
8 [paris, baby, månader, våning, delar, badrum, ...
Converting each list back to its string representation:
dr1 = dr0['PLATSBESKRIVNING'].apply(str); len(dr1),type(dr1), dr1
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['erfaring', 'sykepleier', 'legitimasjon']
4 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
5 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
6 ['förekommande', 'emot', 'utbildning']
7 ['thelins']
8 ['paris', 'baby', 'månader', 'våning', 'delar'...
My issue now is that I want to remove any rows whose lists contain 3 or fewer strings, e.g. rows 3, 6 and 7. The desired result would be like this:
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
4 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
5 ['paris', 'baby', 'månader', 'våning', 'delar'...
.
.
How can I obtain this? I'm also wondering if this could be done in a neater way? My approach seems so clumsy and cumbersome.
I would also like to remove both the index and the column name for the LDA topic modelling, so that I can write it to a text file without the header and the index digits. I have tried:
dr1.to_csv('LDA1.txt',header=None,index=False)
But this wraps each list of strings in the file in quotation marks: "['word1', 'word2', 't.. ]".
Any suggestions would be much appreciated.
Just measure the number of items in each list and filter out the rows with too few items:
dr0['length'] = dr0['PLATSBESKRIVNING'].apply(len)
cond = dr0['length'] > 3
dr0 = dr0[cond]
You can also apply len directly and select, storing the result in whatever dataframe variable you like, i.e.
df[df['PLATSBESKRIVNING'].apply(len)>3]
Output :
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, nice]
1 [föräldrarna, citycentre, stort, tomt]
2 [utveckla, övergripande, strategiska, fince]
4 [arbetstiderna, vardagar, härliga, männ]
5 [paris, utav, badrum, båda, yngsta]
8 [paris, baby, månader, våning, delar]
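For the secondary question about writing the lists to a text file without brackets and quotation marks, one option (a sketch, assuming one space-separated document per line is what the LDA tooling expects) is to join each token list back into a plain string before writing:
dr0['PLATSBESKRIVNING'].str.join(' ').to_csv('LDA1.txt', header=False, index=False)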
