String out of first words in a text file in Python

I cannot solve the following exercise:
In the given function "a_open()", open the file "mytext" and create a string out of the first word of each line of the file. Each word should be separated by a blank (" ").
I am stuck at this point:
def a_open():
    f = open("mytext", "r")
    for line in f:
        print(line.split(' ')[0])
I am aware I should use the .join() function, but I do not know how. Any suggestions?
Thank you in advance!

Assuming this is Python, one approach is to pass the filename to the function.
Create an empty list words outside of the loop to hold the first word of every line.
For each line, strip the leading and trailing whitespace (including the newline), split on a space and filter out the "empty" entries.
If the resulting list is not empty, append its first item to words.
After processing all the lines, join the words list with a space to return a single string.
def a_open(filename):
    words = []
    for line in open(filename, "r"):
        parts = list(filter(None, line.strip().split(' ')))
        if parts:
            words.append(parts[0])
    return ' '.join(words)

print(a_open("mytext"))
If the contents of the file are, for example:
This abc
is def
a
test k lm
The output will be
This is a test
Another option, using a regex, is to read the whole file and use re.findall to return a list of the captured groups.
The pattern ^\s*(\S+) matches optional whitespace chars \s* at the start of a line ^ (with re.MULTILINE) and captures 1 or more non-whitespace chars in group 1 (\S+), which is what findall returns.
import re
def a_open(filename):
    return ' '.join(
        re.findall(r"^\s*(\S+)",
                   open(filename, "r").read(),
                   re.MULTILINE)
    )

print(a_open("mytext"))
Output
This is a test

Related

How to use a secondary delimiter for every 6th string generated by using split function on a primary delimiter in Python?

I have a pipe-delimited file that ends each record with a newline delimiter after every 6 pipe-delimited fields, as follows.
uid216|Banana
bunches
nurture|Fail|76|7645|Singer
uid342|Orange
vulture|Pass|56
87|3547|Actor
I was using the split function in Python to convert the records in the file to a list of strings.
parts = file_str.split('|')
However, I don't seem to understand how I can use a newline character as the delimiter for every 6th string alone. Can someone please help me?
The right way to do this is probably to use Python's csv module for reading delimited files and to stream the data from the file rather than reading it all into memory at once. When you read the whole file into a string, you essentially have to iterate over it twice.
import csv

def process_file(path):
    with open(path, 'r') as file_handle:
        reader = csv.reader(file_handle, delimiter='|')
        for row in reader:
            # row is a list whose entries are the fields of the delimited row;
            # do what you want with it, for example:
            print(row)
It seems that the words can span multiple lines between the pipes.
You could read the whole file, and then use a pattern that matches a pipe char 5 times together with all the preceding and following words.
^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}
Explanation
^ Start of string
[^|\n]+ Match 1+ chars other than | or a newline
(?:\n?[^|\n]+)* Optionally match an optional newline and 1+ chars other than | or a newline
(?: Non capture group to repeat as a whole part
\|[^|\n]+ Match | and 1+ chars other than | or a newline
(?:\n?[^|\n]+)* Optionally repeat an optional newline and 1+ chars other than | or a newline
){5} Close the non capture group and repeat it 5 times to match 5 pipe chars
For example
import re
file = open('file', mode='r')
allText = file.read()
pattern = r"^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}"
file.close()
for s in re.findall(pattern, allText, re.M):
    print(s.split("|"))
Output
['uid216', 'Banana\nbunches\nnurture', 'Fail', '76', '7645', 'Singer']
['uid342', 'Orange\nvulture', 'Pass', '56\n87', '3547', 'Actor']
If each match has to be followed by either 2 newlines or the end of the string:
^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}(?=\n\n|\Z)
parts = []
# Iterate over each line by splitting on \n,
# then extend to gather all strings in a single list.
for line in file_str.split("\n"):
    parts.extend(line.split("|"))
print(parts)
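If the goal is a list per record rather than one flat list, here is a small follow-up sketch. It assumes, unlike the sample above, that a newline never occurs inside a field, so the flat field list can simply be chunked into groups of six; the file_str value below is a reworked, one-line-per-record version of the data, used only for illustration.
# Sketch only: assumes a newline never occurs inside a field, so every
# record contributes exactly six consecutive fields.
file_str = "uid216|Banana bunches nurture|Fail|76|7645|Singer\nuid342|Orange vulture|Pass|56 87|3547|Actor"

fields = []
for line in file_str.split("\n"):
    fields.extend(line.split("|"))

# Group the flat field list into records of six fields each.
records = [fields[i:i + 6] for i in range(0, len(fields), 6)]
print(records)
# [['uid216', 'Banana bunches nurture', 'Fail', '76', '7645', 'Singer'],
#  ['uid342', 'Orange vulture', 'Pass', '56 87', '3547', 'Actor']]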

Regex for listing words that start and end with a symbol

My file content has token words that start and end with the symbol #. There could also be two pairs in a single line.
e.g.
line1 ncghtdhj #token1# jjhhja #token2# hfyuj.
line2 hjfuijgt #token3# ghju
line3 hdhjii#jk8ok#token4#hj
How do I get a list of the tokens, like
[token1, token2, token3, jk8ok, token4]
using Python re?
I tried
mlist = re.findall(r'#.+#', content)
but it is not working as expected.
If jk8ok can also be a match and there should be no spaces in the token, you might use a negated character class in a capturing group and a positive lookahead to assert that what is directly on the right is a #.
#([^\s#]+)(?=#)
For example
import re
regex = r"#([^\s#]+)(?=#)"
test_str = ("line1 ncghtdhj #token1# jjhhja #token2# hfyuj. \n"
"line2 hjfuijgt #token3# ghju \n"
"line3 hdhjii#jk8ok#token4#hj")
print(re.findall(regex, test_str))
Result
['token1', 'token2', 'token3', 'jk8ok', 'token4']
If the tokens should be on the same line and spaces are allowed, you might use
#([^\r\n#]+)(?=#)
If you only want to match token followed by a digit:
#(token\d+)(?=#)
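As a quick check of that last variant against the same sample text (a sketch; the expected result in the trailing comment excludes jk8ok, which does not start with token):
import re

test_str = ("line1 ncghtdhj #token1# jjhhja #token2# hfyuj. \n"
            "line2 hjfuijgt #token3# ghju \n"
            "line3 hdhjii#jk8ok#token4#hj")

# Only capture tokens of the form 'token' followed by digits.
print(re.findall(r"#(token\d+)(?=#)", test_str))
# ['token1', 'token2', 'token3', 'token4']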
First, you need to separate the words that have a # at the beginning and end. Then you can filter out the words between the #s.
with open("filename", "r") as fp:
lines = fp.readlines()
lines_string = " ".join(lines)
# Seperating the words with # on the beginning and end.
temp1 = re.findall("#([^\s#]+)(?=#)", lines_string)
# Filtering out the words between the #s.
temp2 = list(map(lambda x: re.findall("\w+", x), temp1))
# Flattening the list.
tokens = [val for sublist in temp2 for val in sublist]
Output:
['token1', 'token2', 'token3', 'jk8ok', 'token4']
I have used the regex mentioned by @The fourth bird.

How to insert commas between any letter and any following digit using regex

import re

fileinput = open('INFILE.txt', 'r')
fileoutput = fileinput.read()
replace = re.sub(r'([A-Za-z]),([A-Za-z])', r'\1\2', fileoutput)
print(replace)
replaceout = open('OUTFILE.txt', 'w')
replaceout.write(replace)
The code above deletes commas between any two letters, whether uppercase or not. How do I insert commas between any letter and a digit? I tried the code
replace = re.sub(r"([a-z])([0-9])", r",\1", fileoutput)
but it does not work. Any suggestion on how to insert commas between any letter and any digit?
This may help you understand how to add the comma and reference the parts you want. The parentheses around a pattern capture a value that you can reference later: the first capture is referenced as \1, the second as \2, and so on.
Inside the square brackets you tell the regex which characters to match, and without further instructions the pattern matches a single character at a time. So the code below will put a comma after each matched character.
import re
test = "123frogger"
replace = re.sub(r'([A-Za-z0-9])', r'\1,', test)
creating the output
1,2,3,f,r,o,g,g,e,r,
Here's an update based on one of your comments above about the content of what you are trying to adjust.
import re
test = "Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"
replace = re.sub(r'([A-Za-z])([0-9].*)', r'\1,\2', test)
It will output the following.
Vilniausnuoma,483,NuomaVilniuiiraplinkVilniu
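If instead you want a comma at every boundary between a letter and a digit, in either direction, a hedged alternative is to use zero-width lookarounds so no characters are consumed; the test string below is made up for illustration:
import re

# Sketch: insert a comma wherever a letter is followed by a digit or a digit by a letter.
test = "abc123def456"
print(re.sub(r'(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])', ',', test))
# abc,123,def,456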

extract English words from string in python

I have a document in which each line is a string. It might contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).
My code is the following; it is the map function of my map-reduce job. However, based on the final result, this mapper function only produces frequency counts for letters (such as a, b, c). Can anyone help me find the bug? Thanks.
import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
Next, even if you didn't do that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words, this gives you a single string, with all those words concatenated back together. (Basically, the original line with all runs of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a single string. Just skip the join:
words = line.split()
for word in words:
Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
… or just:
words = re.split(r"[^A-Za-z]", line.strip())
for word in words:
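Putting those pieces together, a corrected mapper might look like the sketch below; it follows the second option (replacing non-letters with spaces) and writes print with parentheses so it also runs on Python 3:
import re
import sys

for line in sys.stdin:
    # Turn every non-letter into a space so digits and punctuation
    # separate words instead of gluing them together.
    cleaned = re.sub(r"[^A-Za-z]", " ", line).lower()
    for word in cleaned.split():
        # Emit one tab-separated (word, 1) pair per word.
        print('%s\t%s' % (word, 1))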
There are two issues here:
1. line = re.sub("[^A-Za-z]", "", line.strip()) removes all the non-letter characters, including spaces, making it impossible to split into words in the subsequent stage. One alternative solution is words = re.findall('[A-Za-z]+', line).
2. As mentioned by @abarnert, in the existing code words is a string, so for word in words iterates over each letter. To get a list of words, you can follow point 1.
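As a quick illustration of that findall approach (the sample line here is made up):
import re

# Hypothetical sample line, for illustration only.
line = "Hello w0rld! 123 foo_bar"
print(re.findall('[A-Za-z]+', line))
# ['Hello', 'w', 'rld', 'foo', 'bar']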

copy required data from one file to another file in python

I am new to Python and am stuck at this. I have a file a.txt which contains 10-15 lines of HTML code and text. I want to copy data which matches my regular expression from a.txt to b.txt. Suppose I have a line Hello "World" How "are" you and I want the data between double quotes, i.e. World and are, to be copied to the new file.
This is what I have done.
if x in line:
    p = re.compile("\"*\"")
    q = p.findall(line)
    print q
But this is displaying only " " (double quotes) as output. I think there is a mistake in my regular expression.
Any help is greatly appreciated.
Thanks.
Your regex (which translates to "*" without all the string escaping) matches zero or more quotes, followed by a quote.
You want
p = re.compile(r'"([^"]*)"')
Explanation:
" # Match a quote
( # Match and capture the following:
[^"]* # 0 or more characters except quotes
) # End of capturing group
" # Match a quote
This assumes that you never have to deal with escaped quotes, e.g.
He said: "The board is 2\" by 4\" in size"
Capture the group you're interested in (i.e., between quotes), extract the matches from each line, then write them one per line to the new file, e.g.:
import re
with open('input') as fin, open('output', 'w') as fout:
    for line in fin:
        matches = re.findall('"(.*?)"', line)
        fout.writelines(match + '\n' for match in matches)
