How to clean up text data in a pandas DataFrame column - Python

I am playing around with some banking information and I have a csv file of all my transactions. I have opened it up as a dataframe and it looks something like this:
[screenshot of the dataframe: banking.csv]
As you can see in the 2nd column, there is a whole bunch of text that I do not need; all I am interested in is the store name, which typically comes at the end.
I have managed to get rid of the part that says 'Point of sale - INTERAC RETAIL PURCHASE'
by using
checking['POS'] = checking['POS'].str.replace('Point of Sale - Interac RETAIL PURCHASE', '')
My issue now comes when I try to delete the numbers that come after that, just before the store names. I wanted to do something similar to the above, but the numbers are all unique, so I am not sure how to do this.
Thanks for the help.

You could do a replace with regular expressions:
import re
checking['POS'] = checking['POS'].apply(lambda x: re.sub(r"Point of Sale - Interac RETAIL PURCHASE \d+", "", x))
\d+ means "one or more digits", so this will only work if there are only digits between the fixed text and the store name.
Python's built-in str.replace does not allow regular expressions; that's why you have to use the re module here. (pandas' own .str.replace can take a regex as well, as the sketch below shows.)
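If you would rather stay in pandas, here is a minimal sketch of the same idea using .str.replace with regex=True (the one-row DataFrame and the store name 'STORE A' are made up for illustration; the column name 'POS' is taken from the question):
import pandas as pd

checking = pd.DataFrame(
    {'POS': ['Point of Sale - Interac RETAIL PURCHASE 12345 STORE A']}
)
# .str.replace accepts a regular expression when regex=True:
# strip the fixed text plus the run of digits, then trim the
# leftover whitespace around the store name
checking['POS'] = (
    checking['POS']
    .str.replace(r'Point of Sale - Interac RETAIL PURCHASE \d+', '', regex=True)
    .str.strip()
)
print(checking)  # the POS column now holds 'STORE A'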

Related

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

I am currently cleaning up a messy data sheet in which information is given in one Excel cell, where the different characteristics are not delimited (no commas, spaces are random).
Thus, my problem is to separate the different pieces of information without a delimiter I could use in my code (I can't use a split command).
I assume that I need to specify some characteristics of each part of the information, so that the corresponding part can be recognized. However, I don't have a clue how to do that, since I am quite new to Python and have only worked with R in the framework of regression models and other statistical analyses.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of the parts should be classified into a new column with the corresponding header (Company, Security, Maturity, Account Number).
As you can see, the input varies randomly, but I want the same output for each of the three inputs given above (I have over 200k data points with different companies, securities, etc.).
The first problem is how to separate the information effectively without being able to rely on a systematic pattern.
The second problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.
Thanks for your help!
I recommend first introducing useful separators where possible, then constructing a dictionary of replacements to process with regular expressions.
import re

s = 'WMIN CBOND12/05/2022 23554132121'

# CAREFUL: this is not a real date regex, it only illustrates
# the principle; see https://stackoverflow.com/a/15504877/5665958
# for a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')

# prepend a whitespace before the date: search for the date
# within the string and replace it with itself, prefixed by a
# space; \1 means "insert the first capture group", which in
# our case is the date
s = re.sub(date_re, r' \1', s)

# split on one or more whitespaces and insert a separator (';')
# to make working with the string easier
s = ';'.join(s.split())

# build a dictionary of replacements
replacements = {
    'WMIN': 'Walmart Inc.',
    'CBOND': 'Corporate Bond',
}

# apply each substitution in turn; a better, but more involved,
# solution for this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
    s = re.sub(pattern, r, s)

# use our custom separator to split the parts
out = s.split(';')
print(out)
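For the example string above, this should print ['Walmart Inc.', 'Corporate Bond', '12/05/2022', '23554132121'], matching the expected output.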
Using Python and regular expressions:
import re

def make_filter(pattern):
    pattern = re.compile(pattern)
    def filter(s):
        filtered = pattern.match(s)
        return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
    return filter

filter = make_filter(r"^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
print(filter("WMIN CBOND12/05/2022 23554132121"))
The make_filter function is just a utility that lets you modify the pattern. It returns a function that filters the output according to that pattern. I use it with the r"^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern, which expects some text, a space, some text, a date, a space, and a number. If you want to modify this pattern, provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").
Welcome! Yeah, we would definitely need to see more examples, and regex seems to be the way to go... but since there seems to be no structure, I think it's better to treat this as separate steps.
We KNOW there's a date which is (X)X/(X)X/XXXX (i.e. one- or two-digit day, one- or two-digit month, four-digit year, maybe with or without the slashes, right?) and after that there are numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) so don't lose heart!
If these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121), or delimiters are not always delimiters (for example IMAGINARY COMPANY X CBOND), then you're in deep trouble. :) BUT this is what we can do:
Gather a list of all the codes (hopefully you have that).
Use str_detect() on each code and see if you can recognize the exact string anywhere in the dataset (if you do have the codes, let me know and I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last: whatever remains of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever the CBOND stuff is.
ONLY AFTER YOU'VE IDENTIFIED EVERYTHING can you then replace the codes with what they stand for.
If you have the code list, let me know and I'll post the code.
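For illustration, here is a minimal Python sketch of that lookup idea (this answer uses R's str_detect(); the two-entry code list below is a hypothetical stand-in for the real one):
# hypothetical code list; in practice this would be the full
# gathered list of company/security codes
codes = {'WMIN': 'Walmart Inc.', 'CBOND': 'Corporate Bond'}

s = 'WMIN CBOND12/05/2022 23554132121'

# detect which known codes occur in the string; a plain
# substring test per code mirrors str_detect()
found = {code: label for code, label in codes.items() if code in s}
print(found)  # {'WMIN': 'Walmart Inc.', 'CBOND': 'Corporate Bond'}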
Edit:
s = c("WMIN CBOND12/05/2022 23554132121",
      "WalMaInCBND 12/05/2022-23554132121",
      "WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*", "\\1", s)
ID2 = gsub(".* ([a-zA-Z]+).*", "\\1", s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*", "\\1", s)
num = gsub("^.*[^0-9](.*$)", "\\1", s)
data.frame(ID = ID, ID2 = ID2, date = date, num = num, stringsAsFactors = FALSE)
ID ID2 date num
1 WMIN CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3 WalmartI CorpBond 12/05/2022 23554132121
This works for cases 1 and 3, but I haven't figured out a logic for the second case: how can we know where to split a string containing the company and the security if they are not separated at all?

Pythonic way to solve a text normalization task

Basically, I have a Hive script file from which I need to extract the names of all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is:
Search for the key phrases create table and create external table,
Extract the next token, which should be the table name.
However, the input may not be canonical. For example:
Tab/newline may be used along with space as token delimiters
There may be multiple consecutive delimiters between tokens
Mixed use of upper- and lower-case letters, like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. With some effort, I came up with the following:
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or whether it is flawed in the first place? Also, is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?
As some of the comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re

my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print(items)
It captures the next word after "table", separated from it by at least one whitespace character (including newlines).
To get the original case of the table names:
for x, item in enumerate(items):
    i = my_str.lower().index(item)
    items[x] = my_str[i:i + len(item)]
print(items)
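On the streaming part of the question: here is a minimal sketch that scans the file one line at a time, assuming a create ... table statement is never split across lines (the filename script.hql is hypothetical):
import re

# matches "create table X" and "create external table X";
# \s+ tolerates tabs and runs of spaces, re.IGNORECASE handles
# mixed case like "create TABLE"
pattern = re.compile(r"create\s+(?:external\s+)?table\s+(\w+)", re.IGNORECASE)

with open('script.hql') as f:
    for line in f:  # the file object yields one line at a time
        for name in pattern.findall(line):
            print(name)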

Python - how to pull an address from a string, or how to get the word before something that's on a different line?

My sample content is below:
content ="""
Dear Customer,
Detail of service affected:
Bobs Builders
Retail park
The Aavenue
London
LDN 4DX
Start Time & Date: 04/01/2017 00:05
Completion Time & Date: 04/01/2017 06:00
Details of Work:
....
I'm already pulling out the postcode with
postcodes = re.findall(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}", content)
I would also like to get the city from this content. Is that even possible? Would I have to provide a list of cities first and then check against that?
Or is there a way of getting the line before the postcode, as the addresses are always sent that way?
Could I use the postcode regex to get the word before the postcode?
Thanks
Here's an example:
import re

postcodes = re.findall(r"(\w+)\s+([A-Z]{3} \d[A-Z]{2})", content)
print(postcodes)
# => [('London', 'LDN 4DX')]
You get 2 groups: the first is the word right before the postcode (possibly on another line); the second is the postcode itself.
The postcode regex has been simplified in order to make the example more readable.
If you want to match any UK code, here is a good reference.
The regex you mentioned doesn't match LDN 4DX, by the way. Adding a ? after [0-9R] would fix that:
postcodes = re.findall(r"[A-Z]{1,2}[0-9R]?[0-9A-Z]? [0-9][A-Z]{2}", content)
There are multiple ways to approach this problem:
1- Use the Google geocoding API
If you can extract the address part by doing pattern matching, you can pass the address to the Google Maps Geocode API and let it parse the address for you.
2- Regex search
If you are sure that the address is always well-formatted and the postcode is always preceded by the city name, you can use a regex to handle these situations (see the sketch after this list):
(\w*)\s+([A-Z]{3}\s+\d[A-Z]{2})
3- Use a database of city names
If the addresses are not always well-formatted, your best bet would be to use a database of city names, such as OpenAddresses.
4- Use an entity extraction API [BEST]
This is a classic application of information extraction in Natural Language Processing. You can implement your own using nltk, or, even better, use a web service such as AlchemyAPI. Copy and paste your text into their demo and see for yourself how powerful it is.
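For illustration, here is a minimal Python sketch of approach 2, reusing part of the sample content from the question (the simplified postcode pattern matches the LDN 4DX example, not every UK format):
import re

content = """
Bobs Builders
Retail park
The Aavenue
London
LDN 4DX
"""

# named groups make the two captures self-documenting;
# \s+ lets the city sit on the line above the postcode
addr_re = re.compile(r"(?P<city>\w+)\s+(?P<postcode>[A-Z]{3} \d[A-Z]{2})")
m = addr_re.search(content)
if m:
    print(m.group('city'), m.group('postcode'))  # London LDN 4DX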

Search methods and string matching in python

I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a csv table containing the id where a term has been found and the term itself. What would be the best and fastest way to do this?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then change the code as follows.
After:
for line in text:
add:
    line = line.lower()
(this line is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
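For illustration, here is a self-contained sketch of the lowercasing fix, using the "Heart Disease" example from the follow-up below (the one-entry literalhash is a hypothetical stand-in for the coded dictionary):
literalhash = {'heart disease'}  # hypothetical stand-in for the coded dictionary

title = 'Efficacy of Sarpogrelate on Ischemic Heart Disease After Stent Implantation'
words = title.lower().split()

# test every contiguous run of words against the dictionary
for i in range(len(words)):
    for j in range(i + 1, len(words) + 1):
        phrase = ' '.join(words[i:j])
        if phrase in literalhash:
            print('found:', phrase)  # found: heart disease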
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with a little bit of knowledge of Python!
If you need some more code, I can post it online.

How to change the position of data columns in a comma-separated CSV file using regex?

I am giving up on this, it's almost the due date. I enrolled in a regex class this summer (biggest mistake of my life), and we have this topic (where we choose an old piece of software and make updates to it). I'm almost done with everything except this: I have a .txt document that is a database of monster attributes.
Anyway, the logic is that each variable represents a column/key and the columns are separated by commas. We need to delete/add/reposition the columns using any available tool (regex is the only thing I know that can help me; do you know of anything else?).
Here is the OLD form:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,ExpPer,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
First, delete the 7th column from the last (deleting all ExpPer entries):
Results to:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Second, duplicate JName column to next column:
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Third, pull the last 7 columns and put them starting at the 31st column, i.e. from ...,dMotion,Drop1id,Drop1per,... to ...,dMotion,MEXP,...,MVP3per,Drop1id,...
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per
Fourth and finally, append these columns at the end: ,0,0,DONE,1:
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,0,0,DONE,1
Hence, after running however many regex search-and-replace operations it takes, the original:
1052,ROCKER,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,0,0,0,0,0,0
would result to:
1052,ROCKER,Rocker,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,0,0,0,0,0,0,0,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,DONE,1
Hope somebody can help me, there are 500+ monsters in this old database .txt file.
Thanks!
Microsoft Excel has a Text Import Wizard to import data in any CSV format from any text file into an empty Excel worksheet. For small CSV files, this wizard can be used to load the data, then delete/move/copy data columns, and finally export/save the modified data in CSV format to a file.
But the question was about reformatting the CSV file using a text editor with regular expressions.
I used UltraEdit v21.20 with the Perl regular expression engine selected, but the replacements below should work with any text editor supporting Perl regular expressions. The search and replace strings should work with Python as well, with one caveat noted in the sketch at the end.
Important:
The regular expressions below work only if the CSV file does not contain commas inside double-quoted values.
First, delete 7th column from the last (deleting all ExpPer entries):
Search:   ,[^,\r\n]*?(,(?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1
Second, duplicate JName column to next column:
Search:   ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)
Replace: \1\2\2
Third, pull the last 7 columns and put them starting at the 31st column:
Search:   ^((?:[^,\r\n]*?,){30})((?:[^,\r\n]*?,){15}[^,]*?),((?:[^,\r\n]*?,){6}[^,\r\n]*)$
Replace: \1\3,\2
Fourth, finally, add ,0,0,DONE,1 to the last:
Search:   (.)$
Replace: \1,0,0,DONE,1
But those 4 replaces can be done also with a single regular expression replace:
Search:   ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,((?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1\2\2\3\5\6,\40,0,DONE,1
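If you would rather script it, here is a minimal Python sketch of the single combined replace (the filename monsters.txt is hypothetical; note that Python would parse \40 as a reference to group 40, so the group-4 backreference must be written \g<4> there):
import re

# the combined search pattern from above
pattern = re.compile(
    r"^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})"
    r"((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,"
    r"((?:[^,\r\n]*?,){5}[^,\r\n]*)$"
)
# group 4 (the drop columns) already ends with a comma, so the
# literal 0,0,DONE,1 follows it directly
replacement = r"\1\2\2\3\5\6,\g<4>0,0,DONE,1"

# assumes no commas inside double-quoted values, as noted above
with open('monsters.txt') as src, open('monsters_new.txt', 'w') as dst:
    for line in src:
        dst.write(pattern.sub(replacement, line.rstrip('\n')) + '\n')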
