I am trying to perform string matching between two pandas DataFrames.
df_1:
ID Text Text_ID
1 Apple 53
2 Banana 84
3 Torent File 77
df_2:
ID File_name
22 apple_mask.txt
23 melon_banana.txt
24 Torrent.pdf
25 Abc.ppt
Objective: I want to populate Text_ID against File_name in df_2 if the string in df_1['Text'] matches df_2['File_name']. If no match is found, populate df_2['Text_ID'] as -1. So the resulting df_2 looks like
ID File_name Text_ID
22 apple_mask.txt 53
23 melon_banana.txt 84
24 Torrent.pdf 77
25 Abc.ppt -1
I have tried this SO thread, but it gives a column listing the File_name-wise fuzz score.
I am trying a non-fuzzy way. Please see the code snippet below:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i, j in zip(text_ls, file_ls):
    if str(j) in str(i):
        t_i = df_1.loc[df_1['Text'] == i, 'Text_ID']
        text_id.append(t_i)
    else:
        t_i = -1
        text_id.append(t_i)
df_2['Text_ID'] = text_id
But I am getting a blank text_id list.
Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.
You can get it with the following code:
df2['Text_ID'] = -1  # set -1 by default for all the file names
for _, file_name in df2.iterrows():
    for _, text in df1.iterrows():
        if text[0].lower() in file_name[0]:  # compare strings
            df2.loc[df2.File_name == file_name[0], 'Text_ID'] = text[1]  # assign the Text_ID from df1 in df2
            break
Keep in mind:
String comparison: as written, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torent file is not in Torrent.pdf. Consider redefining the string comparison (see the sketch after the result below).
df.iterrows() returns two values, the index of the row and the values of the row; here the index is replaced by _ since it is not needed to solve this problem.
result:
df2
File_name Text_ID
0 apple_mask.txt 53
1 melon_banana.txt 84
2 Torrent.pdf -1
3 Abc.ppt -1
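Following up on the string-comparison caveat: a minimal sketch of a looser comparison using difflib from the standard library, so that a near-miss like torent file vs Torrent.pdf can still score a match (the 0.6 threshold is an assumption to tune; fuzzywuzzy's fuzz.partial_ratio would be a drop-in alternative):
from difflib import SequenceMatcher

def is_match(text, file_name, threshold=0.6):
    # ratio() returns a similarity score in [0, 1]; lowercase both sides first
    return SequenceMatcher(None, text.lower(), file_name.lower()).ratio() >= threshold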
You can try the following code:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
# note: zip pairs the two lists by position, so each Text is only compared
# with the File_name in the same row
for i, j in zip(text_ls, file_ls):
    if j.lower().find(i.lower()) == -1:
        t_i = -1
        df_2.loc[df_2['File_name'] == j, 'Text_ID'] = t_i
    else:
        t_i = df_1.loc[df_1['Text'] == i, 'Text_ID']
        df_2.loc[df_2['File_name'] == j, 'Text_ID'] = t_i
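Note that both the question and the answer above pair rows positionally via zip, so row n of df_1 is only ever compared with row n of df_2. Here is a hedged sketch of a cross comparison instead, checking every Text against every File_name (the -1 default and lowercasing follow the stated objective; matching is still plain substring, so the Torrent row would need the fuzzy handling discussed earlier):
def first_matching_id(file_name, df_1):
    # return the Text_ID of the first Text found inside file_name, else -1
    for text, text_id in zip(df_1['Text'], df_1['Text_ID']):
        if str(text).lower() in str(file_name).lower():
            return text_id
    return -1

df_2['Text_ID'] = df_2['File_name'].apply(lambda name: first_matching_id(name, df_1))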
I am getting myself very confused over a problem I am encountering with a short python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd

# this function will take a raw AGS file (saved as a CSV) and convert to a
# dataframe.
# it will take the AGS CSV and print the top 5 header lines
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    #print(raw_df.head())
    return raw_df
import_df = AGS_raw('test.csv')

def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells at the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: raw text of the CSV
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The reason that "nothing happens" is likely because of the conditions you're using in if and elif.
Neither "**PROJ" == True nor "**ABBR" == True will ever be True, because neither "**PROJ" nor "**ABBR" is equal to True. Your code is equivalent to:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df, df_new_row], ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df
Which is the same as:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
You also always return from inside the loop and df_new_row isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
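If the intent was to collect rows until the **ABBR marker appears in the first column, a minimal sketch of that loop might look like this (an assumption about the intent, based on the question's description):
def AGS_snip(raw_df):
    rows = []
    for _, row in raw_df.iterrows():
        if row.iloc[0] == "**ABBR":  # stop once the **ABBR marker is reached
            break
        rows.append(row)
    return pd.DataFrame(rows)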
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd

df = pd.read_csv('ags.csv', header=None)

# Drop rows which consist of all NaN (Not a Number) / missing values.
# Reset index order from 0 to the end of dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)

# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))

# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k], 0]
    dfs[table_name] = df.iloc[idx[k]+1:idx[k+1]].reset_index(drop=True)

# Print the titles and tables.
for k, v in dfs.items():
    print(k)
    print(v)
# **PROJ
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# **ABBR
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# You can rename the columns, for example, with this code
# (set_axis no longer accepts inplace=True in pandas 2.x):
dfs['**PROJ'] = dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns')
print(dfs['**PROJ'])
# data1 data2
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
I have a data frame which contains a text column, i.e. df["input"].
I would like to create new variables that check whether df["input"] contains any of the words in a given list, assigning a value of 1 only if no previous list already matched (the logic is: 1) create a dummy variable that equals zero; 2) set it to one if the text contains any word in the given list and was not matched by one of the previous lists).
import pandas as pd

# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
final dataset should look like:
input listings scripting medical
amazon listing subtitle 1 0 0
medical 0 0 1
film biotechnology dentist 0 1 0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np

d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    # note: str.contains matches plain substrings here, so a word like "medical"
    # would also hit inside a longer word; the nested-loop version below adds \b boundaries
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
But in this case, it might be more efficient to use a nested for-loop:
import re

def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
Here is a simpler, more pandas-vector-style solution:
import numpy as np
import pandas as pd

patterns = {}  # <-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})

#---------------------------------------------------------------#
# step 1, for each column create a reg-expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)

# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the rows to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
## (note: this only compares each column with its immediate left-hand neighbour)
mask = (df == df.shift(axis=1)).values

# subtract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:], 0, 1)
print(df)
Result
input listings scripting medical
0 amazon listing subtitle 1 0 0
1 medical 0 0 1
2 film biotechnology dentist 0 1 0
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
    f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
    for column, strings in search.items()
)
result = (
    df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
    .idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
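The empty "dummy" alternative appears to act as a catch-all: a row with no real hit resolves to the dummy column and comes out all zeros after the drop. To put the dummies next to the original text, a plain join should do (a sketch, not part of the original answer):
out = df.join(result)
print(out)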
So my previous, more simplified question is here - How to search for text across multiple rows in a pandas dataframe?
What I want to do is basically to be able to feed a text document containing multiple phrases, not just singular words (e.g. 'new jersey'), into a search, then search for the terms across multiple rows and output a new column in the table with True if the terms are present and False if not. For instance, this is a very small section of my table, and I would like to search for 'new jersey' and 'grew up', whose words sit in separate rows.
subtitle start end duration
14 new 71.986000 72.096000 0.110000
15 jersey 72.106000 72.616000 0.510000
16 grew 72.696000 73.006000 0.310000
17 up 73.007000 73.147000 0.140000
18 believing 73.156000 73.716000 0.560000
So far, thanks to kind help on the old thread, this is what I have, with terms.txt being the list of search terms:
import re
search = [term.strip() for term in open("terms.txt").readlines()]
search = fr"({'|'.join(search)})"
text = " ".join(df["subtitle"])
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False
Everything works fine up until this point:
for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True
I get the error message:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-9f347152f616> in <module>
1 for match in re.finditer(search, text, re.IGNORECASE):
----> 2 idx1 = df[df["start"] == match.start()].index[0]
3 idx2 = df[df["end"] == match.end()].index[0]
4 df.loc[idx1:idx2, "match"] = True
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
4099 if is_scalar(key):
4100 key = com.cast_scalar_indexer(key, warn_float=True)
-> 4101 return getitem(key)
4102
4103 if isinstance(key, slice):
IndexError: index 0 is out of bounds for axis 0 with size 0
Does anyone know how I could fix this, or if there are other methods I could use to achieve the desired result? All help is appreciated, and I apologise for any formatting issues since I am very new here.
The dataframe already has 'start' and 'end' columns, so keep the character offsets in separate Series instead of overwriting them:
import re

terms = [term.strip() for term in open("terms.txt").readlines()]

word = df["subtitle"].str.strip()
end = word.apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
text = " ".join(word)

df["match"] = False
for term in terms:
    for match in re.finditer(fr"\b{term}\b", text, re.IGNORECASE):
        idx1 = start[start == match.start()].index[0]
        idx2 = end[end == match.end()].index[0]
        df.loc[idx1:idx2, "match"] = True
Output:
$ cat terms.txt
new jersey
hello
>>> df
id subtitle start end duration match
0 14 new 71.986 72.096 0.11 True
1 15 jersey 72.106 72.616 0.51 True
2 16 grew 72.696 73.006 0.31 False
3 17 up 73.007 73.147 0.14 False
4 18 believing 73.156 73.716 0.56 False
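One caveat worth adding (my observation, not from the answer above): interpolating terms straight into fr"\b{term}\b" breaks if a term contains regex metacharacters; re.escape makes the term literal:
import re

# re.escape neutralises metacharacters such as ".", so a term like "mr. smith"
# is matched literally instead of "mr<any char> smith"
term = "mr. smith"
pattern = fr"\b{re.escape(term)}\b"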
I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of 25 and 26 files. Each pair of files has a speed threshold (speed_0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0) which is labeled in the file name. These files have the same data structure.
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, a plain concatenation is enough to join these two datasets. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way. Is there a good way to combine the pairs automatically and save each result?
The idea is to create a DataFrame from the list of files and add two new columns by splitting each name on the first _ with Series.str.split:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then group by the names column, loop over each group's rows with DataFrame.itertuples, create a new DataFrame per file with read_csv (adding a new column filled with the prefix values from g if necessary), append it to a list, concat the list, and finally save to a new file named from the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print(dfbig)
    dfbig.to_csv(g['names'].iat[0])
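The answer assumes a files list already exists; here is a minimal sketch for building it with the standard-library glob module (the working directory and the naming pattern are assumptions based on the names shown above):
import glob

# collect every CSV whose name matches the "<prefix>_13oct_speed_<threshold>.csv" scheme
files = sorted(glob.glob('*_13oct_speed_*.csv'))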
I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import os

import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)

keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])

groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in the keys list.
How can I do this?
You can find all the rows that contain a duplicate Document_ID after the first with the duplicated method. Then create a list of new ids beginning with one more than the max id. Use the loc indexing operator to overwrite the duplicate keys with the new ids.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
import numpy as np

# Create a list of unique doc IDs
DocList = groups['Document_ID'].unique().tolist()

# Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
    DocDict[x] = []
for index, row in groups.iterrows():
    DocDict[row['Document_ID']].append(row['Group'])

# For all doc IDs with multiple entries, create a new id with the group id as a decimal point.
# (caveat: Group / 10 only stays below 1 for single-digit groups; a group id of 17 adds 1.7
# and changes the integer part)
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10, groups["Document_ID"])
Hope that helps...