I am trying to find exact matches for names from one list in another list using Python.
# First List
file = 'Last_First.csv'
filename = file.split('_')
last = filename[0]
first = filename[1]
I then search a large list of names, where each name is saved as Last, First:
pattern = re.compile(re.escape(last+','+first))
# Second List
['63', 'Last, First', '65164345']
When I search the list line by line, I get an empty list:
matches = pattern.findall(line)
Printing the pattern, I get:
re.compile(r'Last\,First', re.UNICODE)
How can I get rid of the \ ?
The \ is an escape character added by re.escape, and it is harmless: an escaped comma still matches a literal comma. You are getting an empty list because 'Last, First' has a space after the comma, and your regular expression does not match that space.
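A minimal sketch of one possible fix, assuming the names in the second list always look like Last, First with optional whitespace after the comma. Stripping the .csv extension before splitting is an extra assumption about the real filenames, not something stated in the question.

```python
import re

file = 'Last_First.csv'

# Assumption: the filename ends in an extension that should not be part of the name.
last, first = file.rsplit('.', 1)[0].split('_')

# Escape each name part separately and allow optional whitespace after the comma.
pattern = re.compile(re.escape(last) + r',\s*' + re.escape(first))

line = "['63', 'Last, First', '65164345']"
matches = pattern.findall(line)
print(matches)
```

Using \s* rather than a literal space keeps the pattern working for both 'Last,First' and 'Last, First'.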
Related
I have a large list of strings and I would like to filter everything inside parentheses, so I am using the following regex:
text_list = [' 1__(this_is_a_string) 74_string__(anotherString_with_underscores) question__(stringWithAlot_of_underscores) 1.0__(another_withUnderscores) 23:59:59__(get_arguments_end) 2018-05-13 00:00:00__(get_arguments_start)']
import re
r = re.compile(r'\([^)]*\)')
a_lis = list(filter(r.search, text_list))
print(a_lis)
I tested my regex here, and it works. However, when I apply the above regex I end up with an empty list:
[]
Any idea of how to filter all the tokens inside parenthesis from a list?
Your regex is OK (though perhaps you don't want to capture the parentheses as part of the match), but search() is the wrong method to use. You want findall() to get the text of all the matches, rather than a match object for only the first match:
list(map(r.findall, text_list))
This will give you a list of lists, where each inner list contains the strings which were inside parentheses.
For example, given this input:
text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']
The result is:
[['(qwe)', '(gdfd)'], [], ['(rgf)']]
If you want to exclude the parentheses, change the regex slightly:
r'\(([^)]*)\)'
The unescaped parentheses within the escaped ones indicate what to capture.
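As a small sketch of the capture-group variant, using the same sample input as above. The names without_parens, per_string and flat are only illustrative. Note that filter(r.search, text_list) keeps or drops whole strings, which is why it cannot return the individual tokens.

```python
import re

text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']

# The inner, unescaped parentheses form a capture group, so findall
# returns only the text between the literal parentheses.
without_parens = re.compile(r'\(([^)]*)\)')

# One inner list per input string.
per_string = [without_parens.findall(text) for text in text_list]

# A single flat list of tokens across all input strings.
flat = [token for text in text_list for token in without_parens.findall(text)]

print(per_string)
print(flat)
```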
Question:
I need to match whole words in the pandas df column 'message' and replace them with dictionary values. Is there any way I can do this within the df["column"].replace command? Or do I need to find another way to replace whole words?
Background:
In my pandas data frame I have a column of text messages containing English first names. These names are the dictionary keys, and I'm trying to replace them with the dictionary value "First Name". The specific column in the data frame looks like this, where you can see "tommy" as a single name:
tester.df["message"]
message
0 what do i need to do
1 what do i need to do
2 hi tommy thank you for contacting app ...
3 hi tommy thank you for contacting app ...
4 hi we are just following up to see if you read...
The dictionary is created from a list I extracted from the 2000 census database. It has many different first names that could match inline text, including 'al' or 'tom', and if I'm not careful it could place my value "First Name" everywhere across the pandas df column 'message':
import re
import requests
# download the US Census first names
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# take the first token of each line (the name)
list1 = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# turn list to string, force lower case
str1 = ','.join(list1)
str1 = str1.lower()
# split back into names and turn into dictionary with "FirstName" as value
str1 = dict((el, 'FirstName') for el in str1.split(','))
Now I want to replace whole words within the DF column "message" that match the dictionary keys with the 'FirstName' value. Unfortunately, when I do the following it replaces text in the messages wherever it matches even the short names like "al" or "tom".
In [254]: tester["message"].replace(str1, regex = True)
Out[254]:
0 wFirstNamet do i neFirstName to do
1 wFirstNamet do i neFirstName to do
2 hi FirstNameFirstName tFirstName you for conFi...
3 hi FirstNameFirstName tFirstName you for conFi...
4 hi we are just followFirstNameg up to FirstNam...
Name: message, dtype: object
Any help matching and replacing the whole key with value is appreciated!
Update / attempt to fix 1: Tried adding some regular expression features to match whole words only
I tried adding a word-boundary marker around each word in the extracted string from which the dictionary is constructed. Unfortunately the single backslashes get turned into double backslashes, so the keys no longer work in the dictionary key -> value replace.
#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
l = requests.get('https://deron.meranda.us/data/popular-last.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#add regex before
string = 'r"\\'
endstring = '\\b'
list1 = [ string + x + endstring for x in list1]
#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)
str1 = ','.join(list1)
str1 = (str1.lower())
##if we do print(str1) it shows one backslash
##turn to list ..but print() doesn't let us have one backlash anymore
str1 = [x.strip() for x in str1.split(',')]
#turn to dictionary with "firstname"
str1 = dict((el, 'FirstName') for el in str1)
Then, when I try to match and replace using the updated dictionary keys with the word-boundary expressions, I get a bad escape error:
tester["message"].replace(str1, regex = True)
" Traceback (most recent call last):
error: bad escape \j "
This might be the right direction, but the backslash to double backslash conversion seems to be tricky...
First you need to prepare the list of names such that it matches the name preceded by either the beginning of the string (^) or a whitespace (\s) and followed by either a whitespace or the end of the string ($). Then you need to make sure to preserve the preceding and following element (via backreferences). Assuming you have a list first_names which contains all first names that should be replaced:
replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}
Let's take a look at the regex:
( # Start group.
^|\s # Match either beginning of string or whitespace.
) # Close group.
{} # This is where the actual name will be inserted.
(
$|\s # Match either end of string or whitespace.
)
And the replacement regex:
\1 # Backreference; whatever was matched by the first group.
FirstName
\2 # Backreference; whatever was matched by the second group.
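A rough sketch of how such a dictionary behaves, shown with plain re.sub so it runs without pandas. The sample names, the sample messages and the helper replace_names are made up for illustration, and wrapping each name in re.escape is an added precaution that is not part of the answer above. The same dictionary should be usable with Series.replace(replacement_dict, regex=True).

```python
import re

# Hypothetical sample data; the real names would come from the census file.
first_names = ['tommy', 'al', 'tom']
messages = [
    'what do i need to do',
    'hi tommy thank you for contacting app',
]

# Pattern -> replacement mapping. The two groups capture the surrounding
# whitespace (or string boundary) so the replacement can put it back.
replacement_dict = {
    r'(^|\s){}($|\s)'.format(re.escape(name)): r'\1FirstName\2'
    for name in first_names
}


def replace_names(text, mapping):
    """Apply every pattern in the mapping to the text, one after another."""
    for pattern, replacement in mapping.items():
        text = re.sub(pattern, replacement, text)
    return text


cleaned = [replace_names(message, replacement_dict) for message in messages]
print(cleaned)
```

One caveat of this design: because the pattern consumes the surrounding whitespace, two names separated by a single space are not both replaced in one pass. A word-boundary pattern such as r'\b{}\b' avoids that, at the cost of also matching names next to punctuation.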
x = re.findall(r'FROM\s(.*?\s)(WHERE|INNER|OUTER|JOIN|GROUP)', data, re.DOTALL)
I am using the above expression to parse an Oracle SQL query and get the result.
I get multiple matches and want to print each of them on its own line.
How can I do that?
Some results even have "," in them.
You can try this:
for elt in x:
    # the pattern has two groups, so each match is a tuple; elt[0] is the text after FROM
    print('\n'.join(elt[0].split(',')))
split returns a list of the comma-separated elements, which are then joined with \n (newline). Therefore, you get one result per line.
Your result is returned in a list.
From https://docs.python.org/2/library/re.html:
re.findall(pattern, string, flags=0) Return all non-overlapping
matches of pattern in string, as a list of strings.
If you are not familiar with data structures, more information here
You should be able to easily iterate over the returned list with a for loop:
for matchedString in x:
    # the pattern has two groups, so each match is a tuple; take the first group
    n = matchedString[0].replace(',', '')  # to replace commas
    # add to new list or print, do something, any other logic
    print(n)
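To make the shape of the result concrete, here is a small sketch with a made-up query. The query text and the names matches and tables are hypothetical. Because the pattern contains two groups, findall returns a list of tuples rather than a list of strings.

```python
import re

# Hypothetical query, for illustration only.
data = "SELECT a, b FROM emp e, dept d WHERE e.id = d.id"

matches = re.findall(
    r'FROM\s(.*?\s)(WHERE|INNER|OUTER|JOIN|GROUP)', data, re.DOTALL
)

# Each match is a (text_after_FROM, keyword) tuple.
tables = []
for table_part, _keyword in matches:
    # Split comma-separated entries and trim the surrounding whitespace.
    for item in table_part.split(','):
        tables.append(item.strip())

print('\n'.join(tables))
```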
I have a list of strings where all of the strings roughly follow the format 'foo\tbar\tfoo\n' in that there are three segments of variable length that are separated by two tabs (\t) and with a newline indicator at the end (\n).
I want to remove everything except for the text before the first \t, so that it would return as 'foo'. Given that the first segment is of variable length, I'm not sure how I can do that.
Use str.split():
>>> string = 'foo\tbar\tfoo\n'
>>> string.split('\t', 1)[0]
'foo'
This splits the string by the first occurrence of the '\t' tab character, which returns a list with two elements. The [0] selects the first element in the list, which is the part of the string before the first '\t' occurrence.
Just search for the first \t character, and get everything before it. Slicing makes this easy.
newstr = oldstr[:oldstr.find("\t")]
Try with:
t = 'foo\tbar\tfoo\n'
t[:t.index("\t")]
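The three suggestions agree when a tab is present but differ when it is missing, which may matter for "roughly" formatted input. A short comparison sketch; the function names are made up for illustration:

```python
def first_field_split(s):
    # split with maxsplit=1 returns the whole string when there is no tab.
    return s.split('\t', 1)[0]


def first_field_find(s):
    # find returns -1 when there is no tab, so the slice drops the last character.
    return s[:s.find('\t')]


def first_field_index(s):
    # index raises ValueError when there is no tab.
    return s[:s.index('\t')]


sample = 'foo\tbar\tfoo\n'
print(first_field_split(sample))
print(first_field_find(sample))
print(first_field_index(sample))
```

If some strings might lack a tab, the split version is the most forgiving of the three.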
I am new to Python and I am trying to get some contents from a file using regex. I upload a file, load it into memory and then run this regular expression. I want to take the names from the file, but it also needs to work with names that have spaces like "Marie Anne". So imagine that the array of names has these values:
all_names = [{"name": "Marie Anne", "id": 1}, {"name": "Johnathan", "id": 2}, {"name": "Marie", "id": 3}, {"name": "Anne", "id": 4}, {"name": "John", "id": 5}]
And the string that I am searching might have multiple occurrences, and it's multiline.
print(all_names)  # this is an array of id and name, ordered by descending name length
textToStrip = stdout.decode('ascii', 'ignore').lower()
for i in range(len(all_names)):
    print(all_names[i])
    # \W on each side requires a non-word character around the name
    pattern = r'\W' + re.escape(all_names[i]['name'].lower()) + r'\W'
    m = re.search(pattern, textToStrip)
    if m:
        # remove up to 100 occurrences so shorter names cannot match inside this one
        textToStrip = re.sub(pattern, "", textToStrip, 100)
        print("found " + all_names[i]['name'])
        print(textToStrip)
The script finds the names. The re.sub line removes each found name from the text, so that "Marie Anne" and "Marie" are not both taken from the same occurrence, but it also removes extra characters like "," or "." before or after the name.
Any help would be much appreciated... or if you have a better solution for this problem, even better.
The characters on both sides are deleted because you included \W in the regexp passed to re.sub(). Called with a plain replacement string, re.sub replaces everything the regexp matches, including the characters matched by \W.
There's an alternate way to do this. If you wrap the parts you want to keep in the matched regexp with grouping parens, and call re.sub with a callable (a function) instead of the new string, that function can extract the group values from the match object passed to it and assemble a return value that preserves them.
Read documentation for re.sub for details.
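A minimal sketch of that idea, assuming the goal is to drop the name but keep the punctuation or whitespace around it. The helper names remove_name and keep_delimiters and the sample text are made up for illustration.

```python
import re


def remove_name(text, name):
    """Remove a name from the text but keep the characters around it."""
    # Group 1 and group 2 capture the non-word characters on each side.
    pattern = r'(\W)' + re.escape(name.lower()) + r'(\W)'

    def keep_delimiters(match):
        # Rebuild the replacement from the two captured delimiters only.
        return match.group(1) + match.group(2)

    return re.sub(pattern, keep_delimiters, text)


text = "hello, marie anne. and marie, too."
step1 = remove_name(text, "Marie Anne")
step2 = remove_name(step1, "Marie")
print(step1)
print(step2)
```

As in the original pattern, a name at the very start or end of the text is not matched, because \W needs an actual character on each side.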