Split Big String by Specific Word In Python


I want to split a big string on a specific word, and that word occurs repeatedly in the string.
Example of what I expect:
We have tried to split it with the code below:
string.split("RFF+AAJ:")
What we need is a separate list for each part, as described in my screenshot above.

You can get the result with a regex:
import re
string = 'helloisworldisbyeishi'
re.split('(is)', string)  # split on 'is'; the capturing group keeps each separator in the result
Output
['hello', 'is', 'world', 'is', 'bye', 'is', 'hi']
I hope this helps.

split returns a single list that contains the pieces of the original string. Here the list contains the part before the first "RFF+AAJ:", then the part between the two "RFF+AAJ:"s, and finally the part after the second "RFF+AAJ:". If you want three different variables, use:
parts = string.split("RFF+AAJ:")  # not named `all`, which would shadow the built-in
first = parts[0]
second = parts[1]
third = parts[2]
The elements will then be stored in first, second and third.
If you want to turn each piece into a list (of its characters), use first = list(first), and so on.
Hope that helped.
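When the number of occurrences is not known in advance, unpacking by fixed index will not work. The following is a minimal sketch of a more general variant; the sample text is made up for illustration, since the real input is not shown in the question.

```python
# Hypothetical sample text; the real input from the question is not available.
text = "HEADER'RFF+AAJ:111'LIN+1'RFF+AAJ:222'LIN+2'"
marker = "RFF+AAJ:"

# Everything before the first marker goes into `head`; the rest are the records.
head, *chunks = text.split(marker)

# str.split drops the marker, so re-attach it to each record if it is needed.
records = [marker + chunk for chunk in chunks]

print(head)
print(records)
```

Each element of records is a plain string; whether the marker should be re-attached depends on what the downstream code expects.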

Related

Why are there empty items after re.split()? [duplicate]

I assume I misunderstand how re.split() works.
Here is a real and simple example.
>>> import re
>>> re.split('(abc)', 'abc')
['', 'abc', '']
I'm confused about the first and last empty ('') elements in the resulting list. The result I expected is this:
['abc']
This was a very simplified example. Please let me give something more complex.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', '[[one][two]]')
['', 'one', 'two', '']
Here the result I expected would be:
['one', 'two']
This third example with words before and after works as expected.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', 'zero [[one][two]] three')
['zero ', 'one', 'two', ' three']
My final goal is to split (tokenize) a string with a regex and get both the split parts and the separators (the regex matches) as results. That is why I can't handle this with re.findall().

If you use capturing groups in the re.split expression, the text matched by the separator (here abc) is also returned in the output. This can be very useful for tasks such as tokenization.
With a single capturing group, every second item in the return value is the captured separator; e.g. if (a..) is the splitter and dabcdagde is the string being split, you get ['d', 'abc', 'd', 'agd', 'e'].
In your first example, since the split expression matches the whole string, you get empty strings "on the sides".
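As a small illustration of the behaviour described above, the sketch below splits with one capturing group, picks out the separators by position, and then drops the empty strings. The input string is an arbitrary example, and filtering out empty strings is only one possible way to get the result the question expected.

```python
import re

# One capturing group: separators land at the odd indices of the result.
parts = re.split(r"(abc)", "xabcyabc")
print(parts)  # the trailing '' appears because the string ends with a separator

separators = parts[1::2]

# Drop empty strings if they are not wanted (this loses the fixed positions).
non_empty = [p for p in parts if p]
print(non_empty)
```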

My answer is based on that answer in a similar question.
The behavior is as specified in the docs:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
That way, separator components are always found at the same relative indices within the result list.
The last sentence in particular explains why this behavior is useful.
In short: when using capturing groups, the developer can always identify the separators (the matches) in the resulting list, because they appear at predictable positions.
With one capturing group, every second element in the result is the matched separator (the captured group).
With two capturing groups, as in my example, the relative positions change and you have to count in threes: index 0 is the split token, 1 is the first captured group, 2 is the second captured group, and then the cycle repeats.
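The counting rule can be sketched with slicing. This example reuses the pattern and the third input string from the question; the variable names are only illustrative.

```python
import re

pattern = re.compile(r"\[\[(.+?)\]\[(.+?)\]\]")
parts = pattern.split("zero [[one][two]] three")

# With n capturing groups, the result repeats in cycles of n + 1 items.
stride = pattern.groups + 1

tokens = parts[0::stride]         # text between the separators
first_groups = parts[1::stride]   # first captured group of each match
second_groups = parts[2::stride]  # second captured group of each match

print(tokens, first_groups, second_groups)
```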

Python: Only using "maxsplit" in one of them, I need to split a string multiple times

I started studying Python yesterday and I wanted to study a little about the string split method.
I wasn't looking for anything specific, I was just trying to learn it. I saw that it's possible to split multiple characters of a string, but what if I want to use the maxsplit parameter in only one of those characters?
I searched a little about it and found nothing, so I'm here to ask how. Here's an example:
Let's suppose I have this string:
normal_string = "1d30 drake dreke"
I want this to be a list like this:
['1', '30', 'drake', 'dreke']
Now let's suppose I use a method that splits on multiple characters, so I split on the character 'd' and on the space character ' '.
The thing is:
I don't want to remove the "d" from "drake" and "dreke", only the one in "1d30". At the same time, I don't want that limit applied to the spaces: I want to split on every space character.
I need the maxsplit parameter to apply ONLY to the character "d". How can I do that?

Do the following:
normal_string = "1d30 drake dreke"
# first split by d
start, end = normal_string.split("d", maxsplit=1)
# then split on whitespace and concatenate the results
res = start.split() + end.split()
print(res)
Output
['1', '30', 'drake', 'dreke']
A more general approach, albeit more advanced, is to do:
res = [w for s in normal_string.split("d", maxsplit=1) for w in s.split()]
print(res)
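If the "d" to split on can be described by its context, a single regular expression is another option. This is a different technique from the maxsplit approach, and the sketch assumes (the question does not say so) that the only "d" to split on sits between two digits, as in dice notation like 1d30.

```python
import re

normal_string = "1d30 drake dreke"

# Split on a "d" that has a digit on both sides, or on any run of whitespace.
# The lookbehind/lookahead check the digits without consuming them.
res = re.split(r"(?<=\d)d(?=\d)|\s+", normal_string)
print(res)
```

Unlike maxsplit=1, this would also split a later token such as "2d6", which may or may not be what is wanted.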

Python - Delete a character from list or string

If the verb ends in e, drop the e and add -ing.
I'm inputting a string (an English verb). My goal is to delete the last character of the word if it is "e", and then add three more characters: "i", "n" and "g".
I'd like to know how to delete a list element or, if possible, a string character, and how to turn a list back into a string.
This is what I have so far:
if verb_list[-1] == "e":  # verb_list is the input string converted to a list
    verb_list[-1] = "i"
    verb_list.append("n")
    verb_list.append("g")
This isn't a proper solution for me. I'd like to know how to delete, for example, the [-1] element from a list or from a string. Also, here I'm left with a list, and I want my output to be a string.
Thanks for any help!

You can use re.sub:
re.sub('e$', 'ing', s)
The $ in the regex matches the pattern only if it's at the end of a string.
Example usage:
import re
data = ['date', 'today', 'done', 'cereal']
print([re.sub('e$', 'ing', s) for s in data])
#['dating', 'today', 'doning', 'cereal']
I know the words in data aren't verbs but those were words off the top of my head.

This should suffice:
if verb[-1]=='e':
    verb = verb[:-1]+"ing"
For more about slicing in Python, see "Understanding slice notation".

Try this:
li=list(verb)
if li[-1]=='e':
    li[-1]='ing'
verb=''.join(li)
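The slicing approach can be wrapped in a small function. This is only a sketch: the function name is made up, and, like the answers above, it leaves verbs that do not end in "e" unchanged. Using endswith instead of indexing avoids an IndexError on an empty string.

```python
def drop_e_add_ing(verb):
    """Return verb with a trailing 'e' replaced by 'ing' (hypothetical helper)."""
    if verb.endswith("e"):
        # verb[:-1] is the string without its last character.
        return verb[:-1] + "ing"
    return verb


for word in ["date", "today", "done", ""]:
    print(repr(word), "->", repr(drop_e_add_ing(word)))
```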

Detecting which words are the same between two pieces of text

I need some Python advice to implement an algorithm.
What I need is to detect which words from text 1 are in text 2:
Text 1: "Mary had a dog. The dog's name was Ethan. He used to run down
the meadow, enjoying the flower's scent."
Text 2: "Mary had a cat. The cat's name was Coco. He used to run down
the street, enjoying the blue sky."
I'm thinking I could use some pandas datatype to check repetitions, but I'm not sure.
Any ideas on how to implement this would be very helpful. Thank you very much in advance.

Since you do not show any work of your own, I'll just give an overall algorithm.
First, split each text into its words. This can be done in several ways. You could remove any punctuation then split on spaces. You need to decide if an apostrophe as in dog's is part of the word--you probably want to leave apostrophes in. But remove periods, commas, and so forth.
Second, place the words for each text into a set.
Third, use the built-in set operations to find which words are in both sets.
This will answer your actual question. If you want a different question that involves the counts or positions of the words, you should make that clear.
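The three steps above can be sketched as follows. The regular expression is one possible way to do the first step and is an assumption, not part of the answer: it treats runs of letters, digits and apostrophes as words, so dog's stays in one piece while periods and commas are dropped.

```python
import re


def word_set(text):
    # Steps 1 and 2: extract the words and put them into a set.
    return set(re.findall(r"[A-Za-z0-9']+", text))


text1 = ("Mary had a dog. The dog's name was Ethan. He used to run down "
         "the meadow, enjoying the flower's scent.")
text2 = ("Mary had a cat. The cat's name was Coco. He used to run down "
         "the street, enjoying the blue sky.")

# Step 3: set intersection gives the words present in both texts.
common = word_set(text1) & word_set(text2)
print(sorted(common))
```

The comparison is case-sensitive, so "The" and "the" count as different words unless the texts are lower-cased first.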

You can use a dictionary to store the words from the first text and then simply look them up while iterating over the second text. But this will take extra space.
So it may be better to use regular expressions.

First, extract the words from both strings into lists. I assume you want to ignore any trailing periods or commas. Add one of the lists to a set (for expected constant-time lookup). For each word in the other list, check whether it is also present in the set; that gives you the words common to both texts. I assumed that duplicate words are counted only once. The following code does this:
def get_words(text):
    words = text.split()
    for i in range(len(words)):
        words[i] = words[i].strip('.,')
    return words

def common_words(text1, text2):
    words1 = get_words(text1)
    words2 = set(get_words(text2))
    common = set()
    for word in words1:
        if word in words2:
            common.add(word)
    return common
For your example, it would return:
{'enjoying', 'had', 'to', 'Mary', 'used', 'the', 'The', 'was', 'down', 'name', 'He', 'run', 'a'}
Note that the words "the" and "The" are counted as distinct. If you don't want that, you can convert all words to lower case: words[i] = words[i].strip('.,').lower()

Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that as I have several substrings to replace, they can overlap words like the example above. If I analyze the longest substring at first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I need to replace from the longest substring to the shortest, but I wonder how to make sure that text already handled in the first replacement is not touched again in the second. Also remember that a substring can appear several times in the string.
Note: Assume the string ** never occurs in the original string, so we can use it to bold our words.

You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
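For an arbitrary list of substrings, the same single-pass idea can be sketched as below. The function name is made up for illustration. Sorting longest first is a precaution for cases where one substring is a prefix of another, and re.escape protects substrings that contain regex metacharacters.

```python
import re


def bold_all(text, substrings):
    # Longest first, so a longer alternative wins over a shorter prefix of it.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # One pass over the text: already-bolded text is never scanned again.
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)


mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all(mystring, ["house", "the house"]))
```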

You could use a group that is not captured and is not required. In the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " before "house"; if it is present, it is included in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> import re
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> refers to the whole string matched by the group named repl.

You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through and replace the placeholders like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'

Replace the strings with some unique values, and then replace those values with the original strings enclosed in ** to make them bold. The unique values must not contain any of the search strings (here they are uppercase so that 'house' does not match inside them).
For example:
'the house' with 'TEMP_THE_HOUSE'
'house' with 'TEMP_HOUSE'
then 'TEMP_HOUSE' with '**house**'
'TEMP_THE_HOUSE' with '**the house**'
Should work fine. You can automate this by using two lists.
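A minimal sketch of automating the two passes is shown below. The function name and the placeholder format are assumptions made for illustration: placeholders are built from NUL characters and an index, which only works if neither the text nor the substrings contain NUL characters and no substring is made purely of digits.

```python
def bold_two_pass(text, substrings):
    # Longest first, so 'the house' is protected before 'house' is searched.
    ordered = sorted(substrings, key=len, reverse=True)
    # Hypothetical placeholder scheme: "\x00<index>\x00".
    placeholders = {s: f"\x00{i}\x00" for i, s in enumerate(ordered)}

    # Pass 1: hide every occurrence behind its placeholder.
    for s, token in placeholders.items():
        text = text.replace(s, token)

    # Pass 2: swap each placeholder for the bolded original.
    for s, token in placeholders.items():
        text = text.replace(token, f"**{s}**")
    return text


mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_two_pass(mystring, ["house", "the house"]))
```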
