Python- reformat text file, create full sentences from fragmented sentences - python

Python question, as it's the only language I know.
I've got many very long text files (8,000+ lines) with the sentences fragmented and split across multiple lines, i.e.
Research both sides and you should then
formulate your own answer. It's simple,
straightforward advice that is common sense.
And when it comes to vaccinations, climate
change, and the novel coronavirus SARS-CoV-2
etc.
I need to concatenate the fragments into full sentences, breaking them at full stops (periods), question marks, quoted full stops, etc., and write them to a new cleaned-up text file, but I am unsure of the best way to go about it.
I tried looping through the lines, but the results showed me that this method was not going to work.
I have never coded Generators (not sure if that is what is called for in this instance) before as I am an amateur developer and use coding to make my life easier and solve problems.
Any help would be very greatly appreciated.

If you open the file as f, you can access the text one line at a time (f behaves much like a list of strings). The methods that might be helpful to you are str.join and str.split. join takes a list of strings and joins them with the given string in between: 'z'.join(["a", "b", "c"]) produces "azbzc". split takes a separator string, finds each instance of it, and splits the text there: "azbzc".split('z') produces ["a", "b", "c"] again. Remove the newline at the end of every line, then join the lines with something like a space to rebuild the text as a single string; then split on things like full stops and question marks to break it up the way you want. Note that split drops the separator itself, so the punctuation has to be added back afterwards.
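A minimal sketch of the join-then-split idea, not a definitive implementation. It swaps str.split for re.split with a lookbehind so the punctuation stays attached to each sentence. The names rebuild_sentences and clean_file are made up for illustration, and the rule "a sentence ends at . ? or !, optionally followed by a closing quote" is an assumption that will mis-split abbreviations such as "etc." in the middle of a sentence.

```python
import re

# Split on whitespace that follows ., ? or !, or one of those plus a closing quote.
# Lookbehinds keep the punctuation attached to the sentence it ends.
SENTENCE_BREAK = re.compile(r"(?<=[.?!])\s+|(?<=[.?!][\"'])\s+")


def rebuild_sentences(lines):
    """Join fragmented lines with spaces, then split into sentences."""
    # Strip the newline from each line and skip blank lines.
    text = " ".join(line.strip() for line in lines if line.strip())
    return SENTENCE_BREAK.split(text)


def clean_file(src_path, dst_path):
    """Read fragmented text from src_path, write one sentence per line to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        sentences = rebuild_sentences(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
```

Calling clean_file("input.txt", "output.txt") (both paths are placeholders) would write one sentence per line. For 8,000-line files this reads everything into memory at once, which should be fine at that size.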

Related

Do I need to do any text cleaning for Spacy NER?

I am new to NER and Spacy. Trying to figure out what, if any, text cleaning needs to be done. Seems like some examples I've found trim the leading and trailing whitespace and then muck with the start/stop indexes. I saw one example where the guy did a bunch of cleaning and his accuracy was really bad because all the indexes were messed up.
Just to clarify, the dataset was annotated with DataTurks, so you get json like this:
"Content": <original text>
"label": [
"Skills"
],
"points": [
{
"start": 1295,
"end": 1621,
"text": "\n• Programming language...
So by "mucking with the indexes", I mean, if you strip off the leading \n, you need to update the start index, so it's still aligned properly.
So that's really the question, if I start removing characters from the beginning, end or middle, I need to apply the rule to the content attribute and adjust start/end indexes to match, no? I'm guessing an obvious "yes" :), so I was wondering how much cleaning needs to be done.
So you would remove the \ns, bullets, leading / trailing whitespace, but leave standard punctuation like commas, periods, etc?
What about stuff like lowercasing, stop words, lemmatizing, etc?
One concern I'm seeing with a few samples I've looked at, is the start/stop indexes do get thrown off by the cleaning they do because you kind of need to update EVERY annotation as you remove characters to keep them in sync.
I.e.
A 0 -> 100
B 101 -> 150
if I remove a char at position 50, then I need to adjust B to 100 -> 149.
First, spaCy does no transformation of the input - it takes it literally as-is and preserves the format. So you don't lose any information when you provide text to spaCy.
That said, input to spaCy with the pretrained pipelines will work best if it is in natural sentences with no weird punctuation, like a newspaper article, because that's what spaCy's training data looks like.
To that end, you should remove meaningless white space (like newlines, leading and trailing spaces) or formatting characters (maybe a line of ----?), but that's about all the cleanup you have to do. The spaCy training data won't have bullets, so they might get some weird results, but I would leave them in to start. (Also, bullets are obviously printable characters - maybe you mean non-ASCII?)
I have no idea what you mean by "muck with the indexes", but for some older NLP methods it was common to do more extensive preprocessing, like removing stop words and lowercasing everything. Doing that will make things worse with spaCy because it uses the information you are removing for clues, just like a human reader would.
Note that you can train your own models, in which case they'll learn about the kind of text you show them. In that case you can get rid of preprocessing entirely, though for actually meaningless things like newlines / leading and following spaces you might as well remove them anyway.
To address your new info briefly...
Yes, character indexes for NER labels must be updated if you do preprocessing. If they aren't updated they aren't usable.
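As a rough sketch of what "updating the indexes" can look like: the helper below drops unwanted characters and remaps every annotation in one pass, so the offsets cannot drift apart. The function name remove_chars and the (start, end, label) tuple layout are assumptions for illustration, not part of spaCy or DataTurks. It assumes end offsets are exclusive; if your export uses inclusive ends, convert them first.

```python
def remove_chars(text, spans, should_drop):
    """Drop characters for which should_drop(ch) is true and remap spans.

    spans is a list of (start, end, label) tuples with exclusive end offsets.
    Returns the cleaned text and the spans shifted to match it.
    """
    new_index = []  # new_index[i] = number of kept characters before position i
    kept = []
    for ch in text:
        new_index.append(len(kept))
        if not should_drop(ch):
            kept.append(ch)
    new_index.append(len(kept))  # lets an end offset equal len(text)

    new_spans = [(new_index[start], new_index[end], label)
                 for start, end, label in spans]
    return "".join(kept), new_spans


# Example: strip newlines and bullet characters from a resume-like snippet.
raw = "\n• Python\nSQL"
cleaned, shifted = remove_chars(raw, [(3, 9, "Skills")], lambda ch: ch in "\n•")
```

Here cleaned[1:7] is still "Python" after the shift. Because newlines are removed rather than replaced, neighbouring words can end up glued together; replacing them with a space instead keeps the length unchanged and needs no remapping at all.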
It looks like you're trying to extract "skills" from a resume. That has many bullet point lists. The spaCy training data is newspaper articles, which don't contain any lists like that, so it's hard to say what the right thing to do is. I don't think the bullets matter much, but you can try removing or not removing them.
Regarding "What about stuff like lowercasing, stop words, lemmatizing, etc?":
I already addressed this, but do not do this. This was historically common practice for NLP models, but for modern neural models, including spaCy, it is actively unhelpful.

How do I make Python find words that look similar to a bad word, but are not necessarily proper English words?

I'm making a cyberbullying detection discord bot in python, but sadly there are some people who may find their way around conventional English and spell a bad word in a different manner, like the n-word with 3 g's or the f word without the c. There are just too many variants of bad words some people may use. How can I make python find them all?
I've tried pyenchant but it doesn't do what I want it to do. If I put suggest("racist slur"), "sucker" is in the array. I can't seem to find anything that works.
Will I have to consider every possibility separately and add all the possibilities into a single dictionary? (I hope not.)
It's not necessarily python's job to do the heavy lifting but rather its ecosystem. You may want to look into Natural Language Understanding algorithms and find a way that suits your specific needs. This takes some time and further expertise to figure out.
You may want to start with pytorch, it has helped my learning curve a lot. Their docs regarding text: https://pytorch.org/text/stable/index.html
Also, I'd suggest, you have a look around at kaggle, several datascience challenges have a prize on them to tackle the same task you are aiming to solve.
https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
These competitions usually have public starter notebooks to get you started with your own implementation.
You could try looping through the string that you are moderating and putting it into an array.
For example, if you wanted to blacklist "foo":
x = [["f", "o", "o"], [" "], ["f", "o", "o", "o"]]
then count how many times each letter occurs in each word:
y = [{"f": 1, "o": 2}, {" ": 1}, {"f": 1, "o": 3}]
then see that y[2] is very similar to y[0] (the banned word).
While this method is not perfect, it is a start.
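A small sketch of the letter-counting comparison, using collections.Counter from the standard library. The function names and the idea of summing the count differences are assumptions for illustration; the threshold at which two words count as "very similar" is left to you.

```python
from collections import Counter


def letter_counts(word):
    """Count how many times each letter occurs, ignoring case."""
    return Counter(word.lower())


def count_distance(a, b):
    """Total number of letters by which the two words' counts differ."""
    ca, cb = letter_counts(a), letter_counts(b)
    # Counter subtraction keeps only positive counts, so add both directions.
    return sum(((ca - cb) + (cb - ca)).values())
```

With this, count_distance("foo", "fooo") is 1, while an unrelated word like "bar" scores 6. Note that letter order is ignored, so any anagram of a banned word scores 0, which is one reason this is only a start.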
Another thing to look into is using a neural language interpreter that detects if a word is being used in a derogatory way. A while back, Google designed one of these.
The other answer is just that no bot is perfect.
You might just have to put these common misspellings in the blacklist.
However, the automatic approach would be awesome if you got it working with 100% accuracy.
Unfortunately, spell checking (for different languages) alone is still an open problem that people do research on, so there is no perfect solution for this, let alone for the case when the user intentionally tries to insert some "errors".
Fortunately, there is a conceptually limited number of ways people can intentionally change the input word in order to obtain a new word that resembles the initial one enough to be understood by other people. For example, bad actors could try to:
duplicate some letters multiple times
add some separators (e.g. "-", ".") between characters
delete some characters (e.g. the f word without "c")
reverse the word
potentially others
My suggestion is to initially keep it simple, if you don't want to delve into machine learning. As a possible approach you could try to:
1. Manually create a set of lower-case bad words with their duplicated letters removed (e.g. "killer" -> "kiler").
2. Manually or automatically add to this set variants of these words with one or more letters missing that can still be easily understood (e.g. "kiler" -> "kilr").
3. Extract the words in the message (e.g. by message_str.split()).
4. For each word and its reversed version:
a. remove possible separators (e.g. "-", ".")
b. convert it to lower case and remove consecutive, duplicate letters
c. check if this new form of the word is present in the set; if so, censor it or the entire message
This solution lacks the protection against words with characters separated by one or multiple white spaces / newlines (e.g. "killer" -> "k i l l e r").
Depending on how long the messages are (I believe they are generally short in chat rooms), you can try to consider each substring of the initial message with the white space removed, instead of each word detected by the white space separator in step 3. This will take more time, as a message has O(message_length^2) substrings to generate and check.
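The word-by-word steps above can be sketched as follows. This is only an illustration under stated assumptions: the separator set, the names normalize and contains_blocked, and the tiny BLOCKED set (built from the harmless "killer" example used above) are all placeholders, and it does not cover the spaced-out "k i l l e r" case.

```python
import re

# Assumed separator characters; extend as needed.
SEPARATORS = re.compile(r"[-._*]")
# A run of the same character, e.g. "lll".
REPEATS = re.compile(r"(.)\1+")


def normalize(word):
    """Lower-case, strip separators, and collapse repeated letters."""
    word = SEPARATORS.sub("", word.lower())
    return REPEATS.sub(r"\1", word)


# Steps 1 and 2: normalized bad words plus hand-picked shortened variants.
BLOCKED = {normalize("killer"), "kilr"}


def contains_blocked(message, blocked=BLOCKED):
    """Steps 3 and 4: check every word and its reverse against the set."""
    for word in message.split():
        for candidate in (word, word[::-1]):
            if normalize(candidate) in blocked:
                return True
    return False
```

For example, contains_blocked("you K-i-l-l-e-r") and contains_blocked("rellik") both return True, while contains_blocked("k i l l e r") returns False because each letter is treated as its own word.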

How to make a random name matcher with multiple arguments?

So this is my first year getting into code as a hobby. For my personal side project I want to make a date-matcher (not for a friend haha). This is mainly for me trying to get a better understanding for python structures.
To summarize: People fill 2 lists of names and the matcher returns back a list with random matches. (NO DUPLICATES)
Also, coming with these rules:
1. I want to make every 'user' (name) choose whether they are Open, Not Interested, or Taken, and match the strings accordingly.
2. When there are more items in one list than the other, the leftover strings get printed out too.
3. [Optional] When users fill in their name, they can fill in a certain 'preference string', giving them a higher chance of being matched with that string.
I'm kinda stuck at the first phase, this is what I have:
import random
VNamen=["Sarah","Annelotte","Kelsey","Mika","Ilse","Yara","Sjouke"]
MNamen=["Kelvin","Xander","Kolten","Ezekiel","Misael","Landon","Noel"]
VR= random.choices(VNamen)
MR= random.choices(MNamen)
print(VR, "together with",MR)
How do I randomly match the strings together?
How do I remove the duplicates in the resulting list?
Maybe some suggestions on the rest of the functions above?
I hope someone has the time for this (for me) complicated question!
Greetings,
Quinten
Now, there are things like "re" that I would suggest (like dper did in the comment on your code), but if you want to do it with your own code, I would suggest using random.choice(list) after importing random (which you have done), which will choose a random person from that list. Do this with both lists, put the two chosen names together into another list, and remove them from the original lists. Repeat until one of the lists is empty, then print out everything left in the non-empty list.
Woah that was a lot of lists...
Preference settings would be a little more complicated. You would have to use a list, which goes everywhere the name used to go, and in that list there would be all the information they have, but this way it would be impossible (as far as I am aware) to change the likelihood of getting a certain name.
If you would like me to actually show you it with your code, comment and ask me to do so, but I would suggest giving it a go yourself (if you choose to do it this way, that is).
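In case it helps, here is a minimal sketch of the pick-and-remove loop described above, using the name lists from the question. The function name match_names is made up, and the Open / Not Interested / Taken rule and the preference weighting are not handled here.

```python
import random


def match_names(group_a, group_b):
    """Pair random names from two lists until one list runs out.

    Returns the list of pairs and whatever names were left over.
    """
    a, b = list(group_a), list(group_b)  # copies, so the inputs stay intact
    pairs = []
    while a and b:
        first = random.choice(a)
        second = random.choice(b)
        a.remove(first)   # removing the picks is what prevents duplicates
        b.remove(second)
        pairs.append((first, second))
    return pairs, a + b


VNamen = ["Sarah", "Annelotte", "Kelsey", "Mika", "Ilse", "Yara", "Sjouke"]
MNamen = ["Kelvin", "Xander", "Kolten", "Ezekiel", "Misael", "Landon", "Noel"]

pairs, leftovers = match_names(VNamen, MNamen)
for v, m in pairs:
    print(v, "together with", m)
print("Left over:", leftovers)
```

Shuffling both lists with random.shuffle and zipping them together would give the same kind of result in fewer lines; the loop is shown because it follows the description step by step.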

Split string using Regular Expression that includes lowercase, camelcase, numbers

I'm working on getting Twitter trends using tweepy in Python, and I'm able to find the world's top 50 trends. As a sample, I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore non English words)
So here I need to parse every hashtag and convert them into proper English words, Also I checked how people write hashtag and found below ways -
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(some time hashtags have numbers as well)
So keeping all these in mind I thought if I'm able to split below string then all above cases will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub but it is not giving me desired results.
Regex is the wrong tool for the job. You need a clearly-defined pattern in order to write a good regex, and in this case, you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and differentiate between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same inputs? Now you'd need to get a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.
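If the requirement cannot be dropped, a case-and-digit split is about as far as a regex will go. The sketch below is only an illustration: the pattern and the name split_hashtag are assumptions, it handles ASCII letters only, and it shows the limits described above, since it has to pick one reading of THATSand and cannot split all-lowercase tags at all.

```python
import re

# Three kinds of token, tried in order:
#   a run of capitals not followed by a lowercase letter (SPLIT, AWESOME),
#   an optional capital followed by lowercase letters (please, Help),
#   a run of digits (8989).
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(tag):
    """Split a hashtag on case changes and digit runs."""
    return TOKEN.findall(tag.lstrip("#"))


print(split_hashtag("pleaseHelpMeSPLITThisString8989"))
print(split_hashtag("THATSand"))        # always read as THAT + Sand
print(split_hashtag("#thisisawesome"))  # stays one token
```

The sample string from the question comes out as please, Help, Me, SPLIT, This, String, 8989, but that is only because its word boundaries happen to line up with case changes.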

SPSS and Python integration: Splitting on a String Variable

I have my SPSS dataset with customer comments housed as a string variable. I'm trying to come up with a piece of syntax that will pull out a word either before or after a specific keyword (One example might be "you were out of organic milk"). If you are wanting to know what type of milk they are talking about, you would want to pull out the word directly before it ("organic").
Through a series of string searches/manipulations, I have it look for the first space before the word and pull out the characters in between. However, I feel like there should be an easier way if I were solely using Python (split on spaces, identify the position of the keyword, and return the word X places before or after). However, I don't know how to accomplish this in SPSS using Python. Any ideas for how to approach this problem?
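For reference, the plain-Python part of that idea (split on spaces, find the keyword, return the word X places away) might look like the sketch below. The function name neighbor_word and its parameters are hypothetical, and how to call it from SPSS is not shown here.

```python
import string


def neighbor_word(comment, keyword, offset=-1):
    """Return the word `offset` places from the first occurrence of keyword.

    offset=-1 is the word directly before, offset=1 the word directly after.
    Returns None if the keyword is missing or the neighbor is out of range.
    """
    # Lower-case and strip surrounding punctuation so "milk." matches "milk".
    words = [w.strip(string.punctuation) for w in comment.lower().split()]
    try:
        position = words.index(keyword.lower())
    except ValueError:
        return None
    target = position + offset
    if 0 <= target < len(words):
        return words[target]
    return None


print(neighbor_word("you were out of organic milk", "milk"))
```

Only the first occurrence of the keyword is considered; comments that mention it more than once would need a loop over all matching positions.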
