How to find all the substrings in a .txt file in Python

Okay, so I have this text file:
hello there hello print hello there print lolol
This is what I want to do in Python (pseudo-code below):
when a print statement is found:
print the next five letters (not including the space);
This is the result I want:
>>>[hello, lolol]
How do I solve this problem in Python?

If there are always 5 letters that follow a print and a space, you can use the following regex with lookbehind:
import re
print(re.findall(r'(?<=\bprint ).{5}', 'hello there hello print hello there print lolol'))
This outputs:
['hello', 'lolol']
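Since the question is about a text file, here is a minimal sketch that reads the text from disk before searching. The file name sample.txt and the helper five_after_print are made up for illustration. The pattern also differs slightly from the one above: it tolerates any run of whitespace after print and captures five non-space characters.

```python
import re
import tempfile
from pathlib import Path


def five_after_print(text):
    """Return the five non-space characters following each standalone 'print'."""
    return re.findall(r"\bprint\s+(\S{5})", text)


# Demo: write the sample text to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_text(
        "hello there hello print hello there print lolol\n", encoding="utf-8"
    )
    found = five_after_print(path.read_text(encoding="utf-8"))

print(found)  # ['hello', 'lolol']
```

If fewer than five non-space characters follow a print, that occurrence is simply skipped.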

Split on 'print ' and slice the first 5 characters from each piece after the first:
In [252]: s = 'hello there hello print hello there print lolol'
In [253]: [res[:5] for res in s.split('print ')[1:]]
Out[253]: ['hello', 'lolol']

Related

Split string but replace with another string and get list

I am trying to split a string so that the separator is replaced by another string and the result is returned as a list. It's hard to explain, so here is an example:
I have string in variable a:
a = "Hello World!"
I want a list such that:
a.split("Hello").replace("Hey") == ["Hey"," World!"]
That is, I want to split a string and put another string in place of the separator in the resulting list. So if a is
a = "Hello World! Hello Everybody"
and I use something like a.split("Hello").replace("Hey") , then the output should be:
a = ["Hey"," World! ","Hey"," Everybody"]
How can I achieve this?

From your examples it sounds a lot like you want to replace all occurrences of Hello with Hey and then split on spaces.
What you are currently doing can't work, because replace needs two arguments and it's a method of strings, not lists. When you split your string, you get a list.
>>> a = "Hello World!"
>>> a = a.replace("Hello", "Hey")
>>> a
'Hey World!'
>>> a.split(" ")
['Hey', 'World!']

x = "HelloWorldHelloYou!"
y = x.replace("Hello", "\nHey\n").lstrip("\n").split("\n")
print(y) # ['Hey', 'World', 'Hey', 'You!']
This is a rather brute-force approach. You can replace \n with any character you don't expect to find in your string (or even something like XXXXX). The lstrip removes the leading \n if your string starts with Hello.
Alternatively, there's regex :)
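The regex route hinted at above could look like the following sketch. It relies on re.split keeping the separator in the result when the pattern contains a capturing group. The helper name split_and_replace is hypothetical, not something from the standard library.

```python
import re


def split_and_replace(text, old, new):
    """Split text on old, keeping each separator but swapping it for new."""
    # The capturing group makes re.split return the separators as list items.
    parts = re.split(f"({re.escape(old)})", text)
    # Drop empty pieces (e.g. when text starts with old) and swap separators.
    return [new if part == old else part for part in parts if part]


print(split_and_replace("Hello World! Hello Everybody", "Hello", "Hey"))
# ['Hey', ' World! ', 'Hey', ' Everybody']
```

re.escape is there so that a separator containing regex metacharacters is matched literally.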

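This function can do it:
def replace_split(s, old, new):
    # Pair every piece with the replacement, flatten, then drop the extra
    # replacement at the end. If s starts with old, the first item is ''.
    return sum([[blk, new] for blk in s.split(old)], [])[:-1]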

It wasn't clear whether you wanted to split on spaces or on uppercase letters.
import re
#Replace all 'Hello' with 'Hey'
a = 'HelloWorldHelloEverybody'
a = a.replace('Hello', 'Hey')
#This will separate the string by uppercase character
re.findall('[A-Z][^A-Z]*', a) #['Hey', 'World' ,'Hey' ,'Everybody']

You can do this with iteration:
a = a.split(' ')
for i, word in enumerate(a):
    if word == 'Hello':
        a[i] = 'Hey'

Strip function/print in python - why there is no difference for removing whitespace in end

I was just trying out the strip function:
>> a = "hello world "
>> print(a)
hello world
>> print(a.strip())
hello world
There is no difference in the output even though the string has spaces at the end. Could someone explain why?

There is a difference if you check the lengths; you just can't see it when printing:
a = "hello world    "  # four trailing spaces
print(len(a))
print(len(a.strip()))
Output:
15
11

There is a difference; you just can't see it because it's whitespace. Try replacing the spaces with a visible character:
a = "hello world "
print(a.replace(' ', '+'))
print(a.strip().replace(' ', '+'))

Trailing spaces leave no visible mark, so the output looks the same. To see the difference, try adding and then stripping a visible character:
a = "hello world____"
print(a)
print(a.strip('_'))
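Another quick way to make the whitespace visible is repr(), which prints the string with its quotes. A small sketch, assuming four trailing spaces purely for illustration:

```python
a = "hello world    "  # four trailing spaces (assumed for illustration)

original = repr(a)
stripped = repr(a.strip())

print(original)  # 'hello world    '
print(stripped)  # 'hello world'
print(len(a), len(a.strip()))  # 15 11
```

The closing quote shows exactly where each string ends.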

How to remove line break in array

When I run my code, I get this output:
vivek
Hello World!
There is an extra line break between "vivek" and "Hello World!", but I want the output without it:
vivek
Hello World!
# Hello World program in Python
arr =['vivek\n','singh\n']
arr[0].replace('\n','')
print arr[0]
print "Hello World!"

Just replace the line
arr[0].replace('\n','')
with
arr[0] = arr[0].replace('\n', '')
because str.replace only returns a modified copy and does not modify the original. See the str.replace documentation.
Other suggestions
You could also use str.strip to remove surrounding whitespace. A neat way to strip every string in a list is
yourlist = [strelement.strip() for strelement in yourlist]
This is called a list comprehension.
You might also want to use print as a function instead of a statement, that is, print("whatever") instead of print "whatever". The function form with a single argument works in both Python 2 and Python 3, whereas the statement works only in Python 2.
Then you might want to take a look at http://pep8online.com/ and https://www.python.org/dev/peps/pep-0008/

Because arr[0] is a string and strings are immutable, str.replace returns an updated copy and leaves the original string unchanged.
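To see that behaviour in isolation, here is a short sketch using the list from the question. The names unchanged and cleaned are only for illustration.

```python
arr = ['vivek\n', 'singh\n']

# The return value is discarded here, so arr[0] still ends with a newline.
arr[0].replace('\n', '')
unchanged = arr[0]

# Build a new list with the trailing newline removed from every item.
cleaned = [item.rstrip('\n') for item in arr]

print(repr(unchanged))  # 'vivek\n'
print(cleaned[0])
print("Hello World!")
```

With cleaned, the two print calls produce vivek and Hello World! on consecutive lines, with no blank line in between.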

You can also remove it using a regular expression:
>>> import re
>>> result = re.sub(r"\n", "", "vivek\n Hello World!")
>>> result
'vivek Hello World!'

You can also join and split, which removes the newlines from the list:
arr = ['vivek\n', 'singh\n']
arr = ''.join(arr).split('\n')  # ['vivek', 'singh', ''] (note the trailing empty string)
print(arr[0] + " Hello World!")
or
print(arr[0])
print(" Hello World!")
The question is about removing the newline from the strings in the list; joining and splitting is one way to do so.

You can use regular expressions to filter out unwanted characters from your string.
import re
s = "hello\nworld"
print(re.sub(r'\n', ' ', s))
This prints: hello world
So you just need to do
print(re.sub(r'\n', '', arr[0]))

Python Regex for Words & single space

I am using re.sub in order to forcibly convert a "bad" string into a "valid" string via regex. I am struggling with creating the right regex that will parse a string and "remove the bad parts". Specifically, I would like to force a string to be all alphabetical, and allow for a single space between words. Any values that disagree with this rule I would like to substitute with ''. This includes multiple spaces. Any help would be appreciated!
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for str in list_of_strings:
    print re.sub(r'[^A-Za-z]+([^\s][A-Za-z])*', '' , str)
I would like the output to be:
Hello World
Hello World number two
Hello World number three

Try the following. It matches every run of characters to remove, but replaces a run with a single space when the run contains at least one space; otherwise it removes the run entirely.
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for text in list_of_strings:
    print(re.sub(r'((?:[^A-Za-z\s]|\s)+)', lambda x: ' ' if ' ' in x.group(0) else '', text))
It yields:
Hello World
Hello World number two
Hello World number three

I would prefer two passes to simplify the regex. The first pass removes non-alphabetic characters; the second collapses runs of whitespace.
pass1 = re.sub(r'[^A-Za-z\s]', '', text)  # remove non-alpha; text is one input string
pass2 = re.sub(r'\s+', ' ', pass1)  # collapse whitespace runs to a single space
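Wrapped into a function and run over the strings from the question, the two-pass idea might look like this sketch. The function name clean is made up, and the final strip() is an extra step I added to drop leading or trailing spaces.

```python
import re


def clean(text):
    """Keep only letters, with single spaces between words."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)    # pass 1: drop non-letters
    return re.sub(r"\s+", " ", letters_only).strip()   # pass 2: collapse spaces


list_of_strings = [
    "3He2l2lo Wo45rld!",
    "Hello World- -number two-",
    "Hello World number .. three",
]
results = [clean(text) for text in list_of_strings]
for line in results:
    print(line)
# Hello World
# Hello World number two
# Hello World number three
```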

Finding words after keyword in python [duplicate]

I want to find the words that appear after a keyword (specified and searched for by me) and print out the result. I know that I am supposed to use regex to do it, and I tried it out too, like this:
import re
s = "hi my name is ryan, and i am new to python and would like to learn more"
m = re.search("^name: (\w+)", s)
print m.groups()
The output is just:
"is"
But I want to get all the words and punctuation that come after the word "name".

Instead of using regexes you could just (for example) separate your string with str.partition(separator) like this:
mystring = "hi my name is ryan, and i am new to python and would like to learn more"
keyword = 'name'
before_keyword, keyword, after_keyword = mystring.partition(keyword)
>>> before_keyword
'hi my '
>>> keyword
'name'
>>> after_keyword
' is ryan, and i am new to python and would like to learn more'
You have to strip the surrounding whitespace separately, though (for example with after_keyword.strip()).
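A small sketch of how the partition approach could be packaged, including the whitespace clean-up and the case where the keyword is missing. The helper name text_after is hypothetical.

```python
def text_after(text, keyword):
    """Return the text following the first keyword, or None if it is absent."""
    _, found, rest = text.partition(keyword)
    # partition returns an empty separator when the keyword is not present.
    return rest.strip() if found else None


s = "hi my name is ryan, and i am new to python and would like to learn more"
after = text_after(s, "name")
print(after)  # is ryan, and i am new to python and would like to learn more
print(text_after(s, "surname"))  # None
```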

Your example will not work, but as I understand the idea:
import re
regexp = re.compile("name(.*)$")
print(regexp.search(s).group(1))
# prints " is ryan, and i am new to python and would like to learn more"
This prints everything after "name" up to the end of the line.

Another alternative:
import re
m = re.search('(?<=name)(.*)', s)
print(m.groups())

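import re
s = "hi my name is ryan, and i am new to python and would like to learn more"
# "^name: (\w+)" never matches this string (it does not start with "name: "),
# so search() would return None; drop the anchor and the colon instead.
m = re.search(r"name (\w+)", s)
print(m.group(1))  # is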

Instead of "^name: (\w+)", which requires the string to start with "name: " and captures only one word, use:
"name(.*)"

The pattern that gives the output you describe (only the next word):
re.search(r"name (\w+)", s)
To match everything after the keyword, use:
re.search(r"name (.*)", s)
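Here are the two patterns side by side as a runnable sketch. The \b word boundary and \s+ are small additions of mine so that "name" is matched as a whole word and any amount of whitespace is skipped.

```python
import re

s = "hi my name is ryan, and i am new to python and would like to learn more"

# (\w+) stops at the first non-word character, so only one word is captured.
next_word = re.search(r"\bname\s+(\w+)", s).group(1)

# (.*) runs to the end of the line, so everything after the keyword is captured.
rest = re.search(r"\bname\s+(.*)", s).group(1)

print(next_word)  # is
print(rest)       # is ryan, and i am new to python and would like to learn more
```

Note that re.search returns None when the keyword is absent, so real code should check the result before calling group().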

You could simply do
s = "hi my name is ryan, and i am new to python and would like to learn more"
s.split('name')
This splits your string and returns a list like this: ['hi my ', ' is ryan, and i am new to python and would like to learn more']
Depending on what you want to do, this may or may not help.

This pattern will work for you: name\s\w+\s(\w+)
>>> import re
>>> s = 'hi my name is ryan, and i am new to python and would like to learn more'
>>> m = re.search(r'name\s\w+\s(\w+)', s)
>>> m.group(0)
'name is ryan'
>>> m.group(1)
'ryan'

Without using regex, you can:
strip punctuation (consider making everything a single case, including the search term)
split your text into individual words
find the index of the searched word
get the neighbouring word from the list (index + 1 for the word after, index - 1 for the word before)
Code snippet:
import string
s = 'hi my name is ryan, and i am new to python and would like to learn more'
t = 'name'
# Strip punctuation, split into words, and find the position of the keyword.
i = s.translate(str.maketrans("", "", string.punctuation)).split().index(t)
print(s.split()[i+1])
>> is
For multiple occurrences, you need to save multiple indices:
import string
s = 'hi my NAME is ryan, and i am new to NAME python and would like to learn more'
t = 'NAME'
il = [i for i, x in enumerate(s.translate(str.maketrans("", "", string.punctuation)).split()) if x == t]
print([s.split()[x+1] for x in il])
>> ['is', 'python']
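The steps above can also be folded into one case-insensitive helper. This is only a sketch: words_after is a made-up name, and it returns the following words in lowercase with punctuation removed, which may or may not be what you want.

```python
import string


def words_after(text, keyword):
    """Return the word following each occurrence of keyword (case-insensitive)."""
    table = str.maketrans("", "", string.punctuation)
    # Normalise every word: strip punctuation and lowercase it.
    words = [w.translate(table).lower() for w in text.split()]
    key = keyword.lower()
    # words[:-1] skips a keyword in last position, which has no following word.
    return [words[i + 1] for i, w in enumerate(words[:-1]) if w == key]


s = 'hi my NAME is ryan, and i am new to NAME python and would like to learn more'
following = words_after(s, 'name')
print(following)  # ['is', 'python']
```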
