Pythonic way to solve a text normalization task

Pythonic way to solve a text normalization task - python

Basically, I have a Hive script file, from which I need to extract the names for all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is like:
Search for key phrases create table and create external table,
Extract the next token which should be the table name.
However, the input may not be canonical. For example,
Tab/newline may be used along with space as token delimiter
There may be multiple consecutive delimiters between tokens
Mixed use of upper and lower case letters like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. Then with some effort, I come up with the following
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or it may be flawed in the very first place? Is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?

Like some comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re
my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print items
It captures the next word after "table " (followed by at least one whitespace / newline).
To get the original case of the table names:
for x, item in enumerate(items):
i = my_str.lower().index(item)
items[x] = my_str[i:i+len(item)]
print items

Related

Replace a value on one data set with a value from another data set with a dynamic lookup

This question relates primarly to Alteryx, however if it can be done in Python, or R in Alteryx workflow using the R tool then that would work as well.
I have two data sets.
Address (contains address information: Line1, Line2, City, State, Zip)
USPS (contains USPS abbreviations: Street to ST, Boulevard to BLVD, etc.)
Goal: Look at the string on the Address data set for Line1. IF it CONTAINS one of the types of streets in the USPS data set, I want to replace that part of the string with its proper abbreviation which is in a different column of the USPS data set.
Example, 123 Main Street would become 123 Main St
What I have tried:
Imported the two data sets.
Union the two data sets with the instruction of Output All Fields for When Fields Differ.
Added a formula, but this is where I am getting stuck. So far it reads:
if [Addr1] Contains(Sting, Target)
Not sure how to have it look in the USPS for one of the values. I am also not certain if this sort of dynamic lookup can take place.
If this can be done in python (I know very basic Python so I don't have code for this yet because I do not know where to start other than importing the data) I can use python within Alteryx.
Any assistance would be great. Please let me know if you need additional information.
Thank you in advance.

Use the Find Replace tool in Alteryx. This tool is akin to a lookup. Furthermore, use the Alteryx Community as a go to for these types of questions.
Input the Address dataset into the top anchor of the Find Replace tool and the USPS dataset into the bottom anchor. You'll want to find any part of the address field using the lookup field and replace it with the abbreviation field. If you need to do this across several fields in the Address dataset, then you could replicate this logic or you could use a Record ID tool, Transpose, run this logic on one field, and then Cross Tab back to the original schema. It's an advanced recipe that you'll want to master in Alteryx.
https://help.alteryx.com/current/FindReplace.htm

The overall logic that can be used is here: Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup
However, in order to expand to Alteryx, you would need to add the Alteryx R tool. Also, some of the code would need to be changed to use the syntax that Alteryx likes.
read in the data with:
read.Alteryx('#Link Number', mode = 'data.frame')
After, the above linked question will provide the overall framework for the logic. Reiterated here:
usps[] = lapply(usps, as.character)
##Copies the original address data to a new column that will
##be altered. Preserves the orignal formatting for rollback
##if necessary
vendorData$new_addr1 = as.character(vendorData$Addr1)
##Loops through the dictionary replacing all of the common names
##with their USPS approved abbreviations for the Addr1 field.
for(i in 1:nrow(usps)) {
vendorData$new_addr1 = str_replace_all(
vendorData$new_addr1,
pattern = paste0("\\b", usps$Abbreviation[i], "\\b"),
replacement = usps$USPS_Abbrv_updated[i]
)
}
Finally, in order to be able to see the output, we would need to write a statement that will output it in one of the 5 output slots the R tool has. Here is the code for that:
write.Alteryx(data, #)

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

I am currently cleaning up a messy data sheet in which information is given in one excel cell where the different characteristics are not delimited (no comma, spaces are random).
Thus, my problem is to separate the different information without a delimitation I could use in my code (can't use a split command)
I assume that I need to include some characteristics of each part of information, such that the corresponding characteristic is recognized. However, I don't have a clue how to do that since I am quite new to Python and I only worked with R in the framework of regression models and other statistical analysis.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of the "x" should be classified in a new column with the corresponding header (Company, Security, Maturity, Account Number)
As you can see the input varies randomly but I want to have the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.)
First Problem is how to separate the information effectively without being able to use a systematic pattern.
Second Problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.
Thanks for your help!

I recommend to first introduce useful seperators where possible and construct a dictionary of replacements for processing with regular expressions.
import re
s = 'WMIN CBOND12/05/2022 23554132121'
# CAREFUL this not a real date regex, this should just
# illustrate the principle of regex
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')
# prepend a whitespace before the date
# this is achieved by searching the date within the string
# and replacing it with itself with a prepended whitespace
# /1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)
# split by one or more whitespaces and insert
# a seperator (';') to make working with the string
# easier
s = ';'.join(s.split())
# build a dictionary of replacements
replacements = {
'WMIN': 'Walmart Inc.',
'CBOND': 'Corporate Bond',
}
# for each replacement apply subsitution
# a better, but more replicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
s = re.sub(pattern, r, s)
# use our custom separator to split the parts
out = s.split(';')
print(out)

Using python and regular expressions:
import re
def make_filter(pattern):
pattern = re.compile(pattern)
def filter(s):
filtered = pattern.match(s)
return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
return filter
filter = make_filter("^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
filter("WMIN CBOND12/05/2022 23554132121")
The make_filter function is just an utility to allow you to modify the pattern. It returns a function that will filter the output according to that pattern. I use it with the "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern that considers some text, an space, some text, a date, an space, and a number. If you want to kodify this pattern provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").

welcome! Yeah, we would definitely need to see more examples and regex seems to be the way to go... but since there seems to be no structure, I think it's better to think of this as seperate steps.
We KNOW there's a date which is (X)X/(X)X/XXXX (ie. one or two digit day, one or two digit month, four digit year, maybe with or without the slashes, right?) and after that there's numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) but don't lose heart!
if these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121, or delimiters are not always delimiters for example IMAGINARY COMPANY X CBOND, then you're in deep trouble. :) BUT this is what we can do:
Gather a list of all the codes (hopefully you have that).
use str_detect() on each code and see if you can recognize the exact string in any of the dataset (if you do have the codes lemme know I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last... what's left of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever CBOND stuff is.
ONLY AFTER YOU'VE IDENTIFIED EVERYTHING, you can then replace the codes for what they stand for.
If you have the code-list let me know and I'll post the code.

edit
s = c("WMIN CBOND12/05/2022 23554132121",
"WalMaInCBND 12/05/2022-23554132121",
"WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)
ID ID2 date num
1 WMIN CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3 WalmartI CorpBond 12/05/2022 23554132121
Works for cases 1 and 3 but I haven't figured out a logic for the second case, how can we know where to split the string containing the company and security if they are not separated?

Substring with multiple instances of the same character

So I am using a Magtek USB reader that will read card information,
As of right now I can swipe a card and I get a long string of information that goes into a Tkinter Entry textbox that looks like this
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
All of the data has been randomized, but that's the format
I've got a tkinter button (it gets the text from the entry box in the format I included above and runs this)
def printCD(self):
print(self.carddata.get())
self.card_data_get = self.carddata.get()
self.creditnumber =
self.card_data_get[self.card_data_get.find("B")+1:
self.card_data_get.find("^")]
print(self.creditnumber)
print(self.card_data_get.count("^"))
This outputs:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
8954756016548963
This yields no issues, but if I wanted to get the next two variables firstname, and lastname
I would need to reuse self.variable.find("^") because in the format it's used before LAST and after INITIAL
So far when I've tried to do this it hasn't been able to reuse "^"
Any takers on how I can split that string of text up into individual variables:
Card Number
First Name
Last Name
Expiration Date

Regex will work for this. I didn't capture everything because you didn't detail what's what but here's an example of capturing the name:
import re
data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"
matches = re.search(r"\^(?P<name>.+)\^", data)
print(matches.group('name'))
# LAST/FIRST INITIAL
If you aren't familiar with regex, here's a way of testing pattern matching: https://regex101.com/r/lAARCP/1 and an intro tutorial: https://regexone.com/
But basically, I'm searching for (one or more of anything with .+ between two carrots, ^).
Actually, since you mentioned having first and last separate, you'd use this regex:
\^(?P<last>.+)/(?P<first>.+)\^
This question may also interest you regarding finding something twice: Finding multiple occurrences of a string within a string in Python

If you find regex difficult you can divide the problem into smaller pieces and attack one at a time:
data = '%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?'
pieces = data.split('^') # Divide in pieces, one of which contains name
for piece in pieces:
if '/' in piece:
last, the_rest = piece.split('/')
first, initial = the_rest.split()
print('Name:', first, initial, last)
elif piece.startswith('%B'):
print('Card no:', piece[2:])

Search methods and string matching in python

I have a task to search for a group of specific terms(around 138000 terms) in a table made of 4 columns and 187000 rows. The column headers are id, title, scientific_title and synonyms, where each column might contain more than one term inside it.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
for place_length in range(len(sentence_array)):
last_element = place_length + 1
phrase = ' '.join(sentence_array[0:last_element])
if phrase in literalhash:
final_dict.setdefault(id,[])
if not phrase in final_dict[id]:
final_dict[trial_id].append(phrase)
How should I be doing this?

The code on the website you link to is case-sensitive - it will only work when the terms in tumorabs.txt and neocl.xml are the exact same case. If you can't change your data then change:
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.

To clarify the problem: we are running small scientific project where we need to extract all text parts with particular keywords. We have used coded dictionary and python script posted on http://www.julesberman.info/coded.htm ! But it seems that something does not working properly.
For exemple the script do not recognize a keyword "Heart Disease" in string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! we are a biologist and medical doctor, with little bit knowlege of python!
If you need some more code i would post it online.

Irregular String Parsing on Python

I'm new to python/django and I am trying to suss out more effective information from my scraper. Currently, the scraper takes a list of comic book titles and correctly divides them into a CSV list in three parts (Published Date, Original Date, and Title). I then pass the current date and title through to different parts of my databse, which I do in my Loader script (convert mm/dd/yy into yyyy-mm-dd, save to "pub_date" column, title goes to "title" column).
A common string can look like this:
10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)
I am successfully grabbing the date, but the title is trickier. In this instance, I'd ideally like to fill three different columns with the information after the second "|". The Title should go to "title", a charfield. the number 12 (after the '#') should go into the DecimalField "issue_num", and everything between the '()' 's should go into the "Special" charfield. I am not sure how to do this kind of rigorous parsing.
Sometimes, there are multiple #'s (one comic in particular is described as a bundle, "Containing issues #90-#95") and several have multiple '()' groups (such as, "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)
)
What would be a good road to start onto crack this problem? My knowledge of If/else statements quickly fell apart for the more complicated lines. How can I efficiently and (if possible) pythonic-ly parse through these lines and subdivide them so I can later slot them into the correct place in my database?

Use the regular expression module re. For example, if you have the third |-delimited field of your sample record in a variable s, then you can do
match = re.match(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$", s)
title = match.groups('title')
issue = match.groups('num')
special = match.groups('special')
You'll get an IndexError in the last three lines for a missing field. Adapt the RE until it parses everything your want.

Parsing the title is the hard part, it sounds like you can handle the dates etc yourself. The problem is that there is not one rule that can parse every title but there are many rules and you can only guess which one works on a particular title.
I usually handle this by creating a list of rules, from most specific to general and try them out one by one until one matches.
To write such rules you can use the re module or even pyparsing.
The general idea goes like this:
class CantParse(Exception):
pass
# one rule to parse one kind of title
import re
def title_with_special( title ):
""" accepts only a title of the form
<text> #<issue> (<special>) """
m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
if m:
return m.group(1), m.group(2)
else:
raise CantParse(title)
def parse_extra(title, rules):
""" tries to parse extra information from a title using the rules """
for rule in rules:
try:
return rule(title)
except CantParse:
pass
# nothing matched
raise CantParse(title)
# lets try this out
rules = [title_with_special] # list of rules to apply, add more functions here
titles = ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
"Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover) )"]
for title in titles:
try:
issue, special = parse_extra(title, rules)
print "Parsed", title, "to issue=%s special='%s'" % (issue, special)
except CantParse:
print "No matching rule for", title
As you can see the first title is parsed correctly, but not the 2nd. You'll have to write a bunch of rules that account for every possible title format in your data.

Regular expression is the way to go. But if you fill uncomfortably writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format (PEP 3101) to a regular expression. In your case, you will do the following:
>>> from stringparser import Parser
>>> p = Parser(r"{date:s}\|{date2:s}\|{title:s}#{issue:d} \({special:s}\)")
>>> x = p("10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
OrderedDict([('date', '10/12/11'), ('date2', '10/12/11'), ('title', "Stan Lee's Traveler "), ('issue', 12), ('special', '10 Copy Incentive Cover')])
>>> x.issue
12
The output in this case is an (ordered) dictionary. This will work for any simple cases and you might tweak it to catch multiple issues or multiple ()
One more thing: notice that in the current version you need to manually escape regex characters (i.e. if you want to find |, you need to type \|). I am planning to change this soon.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pythonic way to solve a text normalization task - python

Related

Replace a value on one data set with a value from another data set with a dynamic lookup

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

Substring with multiple instances of the same character

Search methods and string matching in python

Irregular String Parsing on Python

Categories

Resources