How to clean a list of strings - python

I'm trying to clean the following data:
from sklearn import datasets
data = datasets.fetch_20newsgroups(categories=['rec.autos', 'rec.sport.baseball', 'soc.religion.christian'])
texts, targets = data['data'], data['target']
Where texts is a list of articles and targets is a vector containing the index of the category to which each article belongs.
I need to clean all articles. The cleaning task means:
Remove headers
Remove punctuation
Remove parentheses
Remove consecutive blank spaces
Remove emails
Remove tokens of length 1
Remove line breaks
I'm quite new to Python. I've tried to remove the punctuation and everything else using replace(), but I think there must be an easier way to do this.
def clean_articles(article):
    # keep everything after the header block, then chain replace() calls
    body = article[article.find('\n\n'):]
    return ' '.join(body.replace('.', '').replace('[', '').split())
clean_articles(data['data'][1])
For the following article:
print(data['data'][1])
Uncleaned Article:
'From: aas7@po.CWRU.Edu (Andrew A. Spencer)\nSubject: Re: Too fast\nOrganization: Case Western Reserve University, Cleveland, OH (USA)\nLines: 25\nReply-To: aas7@po.CWRU.Edu (Andrew A. Spencer)\nNNTP-Posting-Host: slc5.ins.cwru.edu\n\n\nIn a previous article, wrat@unisql.UUCP (wharfie) says:\n\n>In article <1qkon8$3re@armory.centerline.com> jimf@centerline.com (Jim Frost) writes:\n>>larger engine. That\'s what the SHO is -- a slightly modified family\n>>sedan with a powerful engine. They didn\'t even bother improving the\n>>brakes.\n>\n>\tThat shows how much you know about anything. The brakes on the\n>SHO are very different - 9 inch (or 9.5? I forget) discs all around,\n>vented in front. The normal Taurus setup is (smaller) discs front, \n>drums rear.\n\none i saw had vented rears too...it was on a lot.\nof course, the sales man was a fool..."titanium wheels"..yeah, right..\nthen later told me they were "magnesium"..more believable, but still\ncrap, since Al is so m uch cheaper, and just as good....\n\n\ni tend to agree, tho that this still doesn\'t take the SHO up to "standard"\nfor running 130 on a regular basis. The brakes should be bigger, like\n11" or so...take a look at the ones on the Corrados.(where they have\nbraking regulations).\n\nDREW\n'
Cleaned Article:
In previous article UUCP wharfie says In article centerline com com Jim Frost writes larger engine That's what the SHO is slightly modified family sedan with powerful engine They didn't even bother improving the *brakes That shows how much you know about anything The brakes on the SHO are very different inch or forget discs all around vented in front The normal Taurus setup is smaller discs front drums rear one saw had vented rears too it was on lot of course the sales man was fool titanium wheels yeah right then later told me they were magnesium more believable but still crap since Al is so uch cheaper and just as good tend to agree tho that this still doesn't take the SHO up to standard for running 130 on regular basis The brakes should be bigger like 11 or so take look at the ones on the Corrados where they have braking regulations DREW

Note: this is not a complete answer, but the following will at least get you halfway there. It will:
remove punctuation
remove line breaks
remove consecutive white space
remove parentheses
import re
s = ';\n(a b.,'
print('before:', s)
s = re.sub(r'[.,;\n(){}\[\]]', '', s)
s = re.sub(r'\s+', ' ', s)
print('after:', s)
this will print:
before: ;
(a b.,
after: a b
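
Building on that, here is a hedged sketch of one complete cleaning pass covering all six tasks from the question (the e-mail pattern and the helper name clean_article are my assumptions, not from the original answers):
import re

def clean_article(article):
    # drop the header block: everything before the first blank line
    body = article[article.find('\n\n'):]
    # remove e-mail addresses (rough assumption: non-space@non-space)
    body = re.sub(r'\S+@\S+', ' ', body)
    # replace punctuation, parentheses and brackets with spaces
    body = re.sub(r'[^\w\s]', ' ', body)
    # splitting on whitespace handles line breaks and runs of blanks;
    # drop tokens of length 1 and rejoin with single spaces
    return ' '.join(tok for tok in body.split() if len(tok) > 1)

print(clean_article(texts[1]))
Note that, unlike the sample output above, this also strips apostrophes (That's becomes That), so widen the character class if you want to keep contractions.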

Related

How to efficiently search for similar substring in a large text python?

Let me try to explain my issue with an example. I have a large corpus and a substring like the ones below:
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
The substring and the corpus are very similar, but not exactly the same. If I do something like
import re
re.search(substring, corpus, flags=re.I)  # this fails: the substring is not exact, only very similar
In the corpus, the passage actually looks like the text below, which is a bit different from the substring I have; because of that, the regular expression search fails. Can someone suggest a good alternative for similar-substring lookup?
until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now
I did try the difflib library, but it did not satisfy my use case.
Some background information:
The substring I have right now was obtained some time ago from the pre-processed corpus using this regex: re.sub("[^a-zA-Z]", " ", corpus).
But now I need to use that substring to do a reverse lookup in the corpus text and find its start and end indices in the corpus.
You don't actually need to fuzzy match all that much, at least for the example given; text can only change in spaces within substring, and it can only change by adding at least one non-alphabetic character (which can replace a space, but the space can't be deleted without a replacement). This means you can construct a regex directly from substring with wildcards between words, search (or finditer) the corpus for it, and the resulting match object will tell you where the match(es) begin and end:
import re
# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z]+'.join(substring.split()), re.IGNORECASE)
if m := ssre.search(corpus):
    print(m.start(), m.end())
    print(repr(m.group(0)))
This correctly identifies where the match begins (index 217) and ends (index 771) in corpus; .group(0) can directly extract the matching text for you if you prefer (it's uncommon to need the indices, so there's a decent chance you were asking for them solely to extract the real text, and .group(0) does that directly). The output is:
217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"
If spaces might be deleted without being replaced, just change the + quantifier to * (the regex will run a little slower since it can't short-circuit as easily, but would still work, and should run fast enough).
If you need to handle non-ASCII alphabetic characters, the regex joiner can change from r'[^a-z]+' to the equivalent r'[\W\d_]+' (which means "match all non-word characters [non-alphanumeric and not underscore], plus numeric characters and underscores"); it's a little more awkward to read, but it handles stuff like é properly (treating it as part of a word, not a connector character).
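For instance, the compile line from above becomes:
# treat accented letters as word characters too
ssre = re.compile(r'[\W\d_]+'.join(substring.split()), re.IGNORECASE)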
While it's not as flexible as difflib, when you know no words are removed or added and it's just a matter of spacing and punctuation, this works perfectly, and it should run significantly faster than a true fuzzy-matching solution (which has to do far more work to handle the concept of close matches).
While you cannot find an exact match if the strings differ by even one character, you can find similar strings.
So here I made use of the built-in difflib SequenceMatcher to check the similarity of two differing strings.
In case you need the indices of where the substring starts within the corpus, that could easily be added. If you have any questions, please comment.
Hope it helps. (Adapted to your edited question; implemented @ShadowRanger's note.)
import re
from difflib import SequenceMatcher

def similarity(a, b) -> float:
    """Return similarity between 2 strings"""
    return SequenceMatcher(None, a, b).ratio()

def find_similar_match(a, b, threshold=0.7) -> list:
    """Find string b in a, even while the strings differ"""
    corpus_lst = a.split()
    substring_lst = b.split()
    nonalpha = re.compile(r"[^a-zA-Z]")
    # candidate start/end positions: words that equal the first/last
    # substring word once non-alphabetic characters are stripped
    start_indices = [i for i, x in enumerate(corpus_lst)
                     if nonalpha.sub("", x) == substring_lst[0]]
    end_indices = [i for i, x in enumerate(corpus_lst)
                   if nonalpha.sub("", x) == substring_lst[-1]]
    rebuild_substring = " ".join(substring_lst)
    max_sim = 0
    result = []
    for start_idx in start_indices:
        for end_idx in end_indices:
            corpus_search_string = " ".join(corpus_lst[start_idx:end_idx])
            sim = similarity(corpus_search_string, rebuild_substring)
            if sim > max_sim:
                max_sim = sim
                result = [start_idx, end_idx]
    print(f"Found a match with similarity : {max_sim}")
    return result
The result of calling find_similar_match(corpus, substring):
Found a match with similarity : 0.8429752066115702
[38, 156]
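For reference (my addition, not part of the answer above): the returned values are word indices into corpus.split(), so the matched stretch of text can be recovered with something like:
start_idx, end_idx = find_similar_match(corpus, substring)
print(" ".join(corpus.split()[start_idx:end_idx + 1]))
Since the function compares corpus_lst[start_idx:end_idx], add 1 to end_idx if you want the final matched word included.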
Not exactly the best solution but this might help.
from difflib import SequenceMatcher

match = SequenceMatcher(None, corpus, substring).find_longest_match(0, len(corpus), 0, len(substring))
print(match)
print(corpus[match.a:match.a + match.size])
print(substring[match.b:match.b + match.size])
This may help you to visualise the similarity of the two strings based on the
percentage of words in the corpus that are in the substring.
The code below aims to:
use the substring as a bag of words
find these words in the corpus (and, if found, make them uppercase)
display the modifications in the corpus
calculate the percentage of modified words in the corpus
show the number of words in the substring that were not in the corpus
This way you can see which of the substring words were matched in the corpus, and then identify the percentage similarity by word (but not necessarily in the right order).
Code:
import re
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
sub_list = set(substring.split(" "))
unused_words = []
for word in sub_list:
    if word in corpus:
        r = r"\b" + word + r"\b"
        ru = f"{word.upper()}"
        corpus = re.sub(r, ru, corpus)
    else:
        unused_words.append(word)
print(corpus)
lower_strings = len(re.findall("[a-z']+", corpus))
upper_strings = len(re.findall("[A-Z']+", corpus))
print(f"\nWords Matched = {(upper_strings)/(upper_strings + lower_strings)*100:.1f}%")
print(f"Unused Substring words: {len(unused_words)}")
Output:
very quick service, polite workers(cory, I think THAT'S his name), I
basically just drove there AND got A quote(which SEEMS TO be very fair
priced), THEN DROPPED OFF MY CAR 4 days later(because THEY WERE fully
booked UNTIL THEN), THEN I DROPPED OFF MY CAR ON MY APPOINTMENT DAY, THEN
THE SAME DAY THE SHOP CALLED ME AND NOTIFIED ME THAT THE THE JOB IS DONE I
CAN GO PICKUP MY CAR. WHEN I GO CHECKED OUT MY CAR I WAS AMAZED BY THE JOB
THEY'VE DONE TO IT, AND THEY EVEN GAVE THAT DIRTY CAR A WASH( PROB EVEN
WAXED IT OR COATED IT, CUZ IT WAS SHINY AS HELL), TIRES SHINE, MATS WERE
VACUUMED TOO. I GAVE THEM A DIRTY, BROKEN CAR, THEY GAVE ME BACK A WHAT
SEEMS LIKE A BRAND NEW CAR. I'M HAPPY WITH THE RESULT, AND I WILL DEF HAVE
ALL MY CAR'S WORK DONE BY THIS PLACE FROM NOW.
Words Matched = 82.1%
Unused Substring words: 0

Python Regex: Why does my pattern not match?

Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around parts of your pattern will solve a lot of issues. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is that Risk Factors is spread over two lines; it is actually Risk\nFactors.
Using a general whitespace class \s (or a newline \n) instead of the literal space makes the pattern match.
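For example, a minimally adjusted version of the question's pattern (my adjustment, keeping its anchors and flags):
import re
pattern_1a = re.compile(
    # \s+ lets "Risk Factors" span the line break ("Risk\nFactors")
    r"(?:^|\n)Item\s*1A\.?.{0,50}Risk\s+Factors.*?\nItem\s*1B(?!u)",
    flags=re.I | re.S,
)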

matching similar but not identical strings

I have two CSVs: one with a large chunk of text and the other with annotations/strings. I want to find the position of each annotation in the text. The problem is that some of the annotations have extra spaces/characters that are not in the text. I cannot trim whitespace/characters from the original text, since I need the exact position. I started out using regex, but it seems there is no way to search for partial matches.
Example
text = ' K. Meney & L. Pantelic, Int. J. Sus. Dev. Plann. Vol. 10, No. 4 (2015) 544?561\n? 2015 WIT Press, www.witpress.com\nISSN: 1743-7601 (paper format), ISSN: 1743-761X (online), http://www.witpress.com/journals\nDOI: 10.2495/SDP-V10-N4-544-561\nNOVEL DECISION MODEL FOR DELIVERING SUSTAINABLE \nINFRASTRUCTURE SOLUTIONS ? AN AUSTRALIAN \nCASE STUDY\nK. MENEY & L. PANTELIC\nSyrinx Environmental PL, Australia.\nABSTRACT\nConventional approaches to water supply and wastewater treatment in regional towns globally are failing \ndue to population growth and resource pressure, combined with prohibitive costs of infrastructure upgrades. '
seg = 'water supply and wastewater ¿treatment'
m = re.search(seg, text, re.M | re.DOTALL | re.I)
this matches only about 15% of the segs
m = re.match(r'(water).*(treatment)$', text, re.M)
This did not work. I thought it would be possible to match on the first and last words and get their positions, but this has numerous problems, such as multiple occurrences of 'water'.
import mmap

with open(file_path) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    # note: mmap's find() works on bytes, so seg would need to be
    # encoded first, e.g. s.find(seg.encode())
    if s.find(seg) != -1:
        print('true')
I had no luck with this at all for some reason.
Am I on the right path with any of these or is there a better way to do this?
Extra Example
From Text
The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)
From Seg
The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)
From Text
\nSIDM? is intended to be used both as a design and evaluation tool. As a design tool, it i) guides \nthe design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the \nlevel of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-\ntially provides the scope of work required to advance the design process. As an evaluation tool it can \nact both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally \nacceptable, and as a detailed evaluation tool where various options can be compared in detail in \norder to establish a preferred solution.
From Seg
SIDM is intended to be used both as a design and evaluation tool. As a design tool, it i) guides the design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the level of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-tially provides the scope of work required to advance the design process. As an evaluation tool it can act both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally acceptable, and as a detailed evaluation tool where various options can be compared in detail in order to establish a preferred solution.
List of substitutions applied to each seg prior to matching:
seg = re.sub(r'\(', r'\\(', seg)  # need to escape parentheses due to regex
seg = re.sub(r'\)', r'\\)', seg)
seg = re.sub(r'\?', ' ', seg)
seg = re.sub(r'[^\x00-\x7F]+', ' ', seg)
seg = re.sub(r'\s+', ' ', seg)
seg = re.sub(r'\\r', ' ', seg)
As casimirethippolyte pointed out, patseg = re.sub(r'\W+', r'\\W+', seg) solved the problem for me (the replacement needs the doubled backslash so the resulting pattern contains a literal \W+).
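A minimal sketch of that idea as a helper (find_seg is my name, not from the post): every run of non-word characters in the seg is allowed to match any run of non-word characters in the text, and the match object gives the exact positions:
import re

def find_seg(seg, text):
    # build a pattern where runs of non-word chars match any such runs
    patseg = re.sub(r'\W+', r'\\W+', seg.strip())
    m = re.search(patseg, text, flags=re.I)
    # return exact start/end positions in the original text, or None
    return (m.start(), m.end()) if m else None
With the text and seg from the example above, this returns the span of 'water supply and wastewater treatment' despite the stray '¿' in the seg.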

find (two) words next to each other in texts with python with proximity operators. scalable solution

I am a relative beginner in Python and I am looking for a solution to the following problem.
I have to "scan" texts looking for concepts.
A concept looks like this: (electrical 3d car)
I have to look for appearances of the word "car" where, within a proximity of 3 words ('3d'), there is another word, 'electrical' (for instance electrical conventional car, electrical propulsed car, electrical driven autonomous car, etc.).
I know that you can work with a text as if it were a list of words and punctuation.
I thought about the following solution:
with open(filepath) as fp:
    concept = ['motor', '3d', 'car']
    concept_appearances = 0  # counter
    concept_positions = []
    concept_list = []
    word1 = concept[0]
    word2 = concept[2]
    separator = concept[1]
    # the separator encodes the allowed distance, e.g. '3d' (both sides) or '3w' (right only)
    distance = [int(s) for s in separator if s.isdigit()]
    distance = int(distance[0])
    distanceright = distance
    if 'd' in separator: distanceleft = distance
    if 'w' in separator: distanceleft = 0
    for line in fp:
        # look for the concepts in every line, which is like a paragraph
        for index, word in enumerate(line.split()):
            if word.upper() == word1.upper():
                # found the first concept word; scan the window around it
                for i in range(index - distanceleft, index + distanceright, 1):
                    if line.split()[i] == word2:
                        # print(line.split()[i], word2)
                        print('i found the concept in position', i)
                        start, end = i, index
                        if index < i:
                            start, end = index, i
                        print('check:', line.split()[start:end + 1])
                        concept_appearances += 1
                        concept_list.append(line.split()[start:end + 1])
                        concept_positions.append(start)

print('the concept appeared {} times'.format(concept_appearances))
print('in positions', concept_positions)
print('list of concepts', concept_list)
Note: not yet implemented is the case where there is a full stop between the two words, which should exclude the hit from being a concept (like: "blah blah electrical. The car of my aunt blah blah...", which should not be a hit, for obvious reasons).
Probably not a super-pythonic code, but it works so far.
The questions here are:
First:
This seems to me to be a quite common problem. Is there any library specific for that?
I don't even know the "technical" name for such a thing beyond "proximity operators".
Note: I read quite a bit about NLTK (a NL library) but did not really find a solution for that.
Second:
Any idea how to make this code scalable? Meaning: (electrical 3d car) could itself become a concept within a concept, for instance when looking for (electrical 3d car) in the surroundings of "gasoline", with gasoline not more than 10 words away:
((electrical 3d car) 10w gasoline)
Third:
If there is no library for such a thing, any comment on speed is welcome; I have to look for thousands of concepts within a 100-page text.
Thanks a lot.
Re-edit: adding an input and an output example, as asked by @Mathieu. Thanks.
INPUT TEXT:
Referring to FIGS. 1 to 3, an electrical car 1 comprises bodywork 2, pairs of ground engaging wheels 3, 4 front and rear, an electric motor 5 driving the front wheels 3 through a suitable transmission (not shown) and electrical battery sets 6A, 6B for supplying electrical power to the motor 5. Suitable control gear (not shown) adapted for operation by the car driver serves to control operation of the electric motor 5 and hence motion of the car 1. An old electrical driven car found in the garage was painted yellow instead of blue.
output:
the concept appeared 2 times
in positions [7, 82]
list of concepts [['electrical', 'car'], ['electrical', 'driven', 'car']]
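
No answer is quoted here, so the following is only a hedged sketch of one scalable approach to the speed question (all names are mine): index every word's position once, then each concept lookup becomes a cheap scan of two position lists instead of a rescan of the whole text.
from collections import defaultdict

def build_index(words):
    # map each lower-cased, punctuation-stripped word to its positions
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word.strip('.,()').lower()].append(pos)
    return index

def find_concept(index, word1, word2, distance):
    # positions of word2 that have word1 within `distance` words
    return [p2 for p2 in index.get(word2.lower(), [])
            if any(abs(p2 - p1) <= distance
                   for p1 in index.get(word1.lower(), []))]

words = open(filepath).read().split()   # filepath as in the question
index = build_index(words)              # built once, reused for every concept
print(find_concept(index, 'electrical', 'car', 3))
With thousands of concepts, the one-time index pays for itself, because each lookup only touches the positions of the two words involved.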

lists and sublists

I use this code to split some data into a list with three sublists, splitting whenever there is a * or a -. But it also keeps the \n\n * sequences, and I don't know why; I don't want those. Can someone tell me what I'm doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
which gives me almost the list I want, except for those stray \n\n parts.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[s.replace("\\n", "\n").strip() for s in item] for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
This is going to split on any asterisk you have in the text as well.
A better implementation would be something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()])  # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem, I believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on newline + * or -; this way it won't split on dashes inside the text. Also, I replace the textual character sequence \n\n (r'\n\n') with a real newline character '\n'. And the one-liner expression is a list comprehension, a way to construct lists in one gulp, without multiple .append() calls or +.
