copy required data from one file to another file in python

I am new to Python and am stuck at this. I have a file a.txt which contains 10-15 lines of HTML code and text. I want to copy the data that matches my regular expression from a.txt to b.txt. Suppose I have the line Hello "World" How "are" you and I want the data between the double quotes, i.e. World and are, copied to the new file.
This is what I have done:
if x in line:
    p = re.compile("\"*\"")
    q = p.findall(line)
    print q
But this just displays " " (double quotes) as output. I think there is a mistake in my regular expression.
Any help is greatly appreciated.
Thanks.

Your regex (which translates to "*" without all the string escaping) matches zero or more quotes, followed by a quote.
You want
p = re.compile(r'"([^"]*)"')
Explanation:
"        # Match a quote
(        # Match and capture the following:
 [^"]*   # 0 or more characters except quotes
)        # End of capturing group
"        # Match a quote
This assumes that you never have to deal with escaped quotes, e. g.
He said: "The board is 2\" by 4\" in size"
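If you ever do need to handle such escaped quotes, one possible pattern (an illustrative sketch, not something the question requires) allows backslash-escaped characters inside the quotes:
import re

# quote, then runs of characters that are neither quotes nor backslashes,
# interleaved with escaped characters (backslash + anything), then the closing quote
p = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')
print(p.findall(r'He said: "The board is 2\" by 4\" in size"'))
# ['The board is 2\\" by 4\\" in size']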

Capture the group you're interested in (i.e., the part between quotes), extract the matches from each line, then write them one per line to the new file, e.g.:
import re
with open('input') as fin, open('output', 'w') as fout:
    for line in fin:
        matches = re.findall('"(.*?)"', line)
        fout.writelines(match + '\n' for match in matches)
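For the example line from the question, the findall call picks out exactly the quoted parts:
>>> import re
>>> re.findall('"(.*?)"', 'Hello "World" How "are" you')
['World', 'are']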

Related

String out of first words in a textfile in python

I cannot solve the following exercise:
In the given function "a_open()" open the file "mytext" and create a string out of the first words in each line of the file. Each word should be separated by a blank (" ").
I am stuck at this point:
def a_open():
    f = open("mytext", "r")
    for line in f:
        print(line.split(' ')[0])
I am aware I should use the function .join but I do not know how. Any suggestions?
Thank you in advance!
Assuming this is Python, you might use an approach like passing the filename to the function.
Create an empty list words outside of the loop to hold all the first words per line.
Per line, split on a space, use strip to remove the leading and trailing whitespaces and newlines and filter out the "empty" entries.
If the list is not empty, add the first item to the list.
After processing all the lines, use join with a space on the words list to return a string of all the words.
def a_open(filename):
    words = []
    for line in open(filename, "r"):
        parts = list(filter(None, line.strip().split(' ')))
        if len(parts):
            words.append(parts[0])
    return ' '.join(words)
print(a_open("mytext"))
If the contents of the file are, for example:
This abc
is def
a
test k lm
The output will be
This is a test
Another option using a regex could be to read the whole file and use re.findall to return a list of groups.
The pattern ^\s*(\S+) matches optional whitespace chars \s* at the start of a line (^, with re.MULTILINE) and captures 1 or more non-whitespace chars in group 1 (\S+), which is what gets returned.
import re
def a_open(filename):
    return ' '.join(
        re.findall(r"^\s*(\S+)",
                   open(filename, "r").read(),
                   re.MULTILINE)
    )
print(a_open("mytext"))
Output
This is a test

delete whitespace in regular expression

I'm learning Python (and also English). I have a problem that might be easy, but I can't solve it. I have a folder of .txt files, and I was able to extract a sequence of 17 numbers from each one with a regular expression. I need to rename each file with the sequence I extracted from its .txt.
import os
import re
path_txt = (r'C:\Users\usuario\Desktop\files')
name_files = os.listdir(path_txt)
for TXT in name_files:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
        if search is not None:
            print(search.group(0))
            f = open(os.path.join("Processes", search.group(0) + ".txt"), "w")
            for line in content:
                print(line)
                f.write(line)
            f.close()
There are .txt files where the sequences appear with spaces between characters, and my regular expression cannot find them (example: 00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5).
Edit: they are serial numbers that were typed by hand, so sometimes they appear with "." and "-" and other times without them. Sometimes spaces appear because of typos.
You want this regex:
search = re.search(r'(\d{5}.*\d{4}.*\d{3}.*\d{2}.*\d{2}-.*\d)', content.read())
Dot . matches any character. By putting \ in front of the dot you escaped it, so it searched for literal dots rather than any character.
You can use \D in your regular expression to match any non-numeric character (including white space) and + to match one or more (or * to match zero or more), so you could rewrite your expression as:
pattern = r'(\d{5}\D+\d{4}\D+\d{3}\D+\d{2}\D+\d{2}\D+\d)'
re.findall(pattern, '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5')
# ['00372.2004 .442.02.00-1', '00572.2008.872.02.00- 5']
Note I am using re.findall to find every match in the string and return them in a list.
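If you then want a clean value to use as the new filename, one option (an illustrative sketch; the sample text and variable names here are mine) is to strip any whitespace out of whatever the pattern matched:
import re

pattern = r'(\d{5}\D+\d{4}\D+\d{3}\D+\d{2}\D+\d{2}\D+\d)'
text = "case 00572.2008.872.02.00- 5 registered"  # sample line with a stray space
match = re.search(pattern, text)
if match is not None:
    sequence = re.sub(r'\s+', '', match.group(0))  # remove spaces introduced by typos
    print(sequence)  # 00572.2008.872.02.00-5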

How to insert commas amongst any letter and following any digit using regex

import re

fileinput = open('INFILE.txt', 'r')
fileoutput = fileinput.read()
replace = re.sub(r'([A-Za-z]),([A-Za-z])', r'\1\2', fileoutput)
print replace
replaceout = open('OUTFILE.txt', 'w')
replaceout.write(replace)
The code above deletes commas between any two letters, whether uppercase or not. How can I insert commas between letters and digits? I tried the code
replace = re.sub(r"([a-z])([0-9])", r",\1", fileoutput)
but it does not work. Any suggestion on how to insert commas between any letter and any digit?
This may help you understand how to add in the comma and reference the parts you want. The parentheses around part of the pattern capture a value that you can refer back to later: the first captured group is referenced as \1, the second as \2, and so on.
Inside the square brackets you tell the regex which characters it is allowed to match; without further quantifiers, a character class matches a single character. So the code below will put a comma after each character it matches.
import re
test = "123frogger"
replace = re.sub(r'([A-Za-z0-9])', r'\1,', test)
creating the output
1,2,3,f,r,o,g,g,e,r,
Here's an update based on one of your comments above about the content of what you are trying to adjust.
import re
test = "Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"
replace = re.sub(r'([A-Za-z])([0-9].*)', r'\1,\2', test)
It will output the following.
Vilniausnuoma,483,NuomaVilniuiiraplinkVilniu
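If the goal is specifically to insert a comma at every boundary between a letter and a digit, in either order, another way to sketch it (the test string below is made up) is to use zero-width lookarounds instead of capture groups:
import re

test = "123frogger456abc"
# add a comma wherever a letter is followed by a digit, or a digit by a letter
replace = re.sub(r'(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])', ',', test)
print(replace)  # 123,frogger,456,abc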

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
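As a quick check, with some made-up sample text shaped like the question's (the uppercase lines are placeholders):
import re

text = ("some Varying TEXT\n"
        "\n"
        "DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n"
        "MOREUPPERCASETEXTONTHENEXTLINE\n"
        "\n")
m = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE).search(text)
print(m.group(1))  # some Varying TEXT
print(m.group(2))  # the captured block, each line still preceded by its newline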
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
... title, sequence = match.groups()
... title = title.strip()
... sequence = rx_blanks.sub("",sequence)
... print "Title:",title
... print "Sequence:",sequence
... print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
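Applied to the pattern above, that substitution gives something like this (a mechanical rewrite; test it against your own data):
import re

rx_sequence = re.compile(r"^(.+?)(?:\n|\r\n?)(?:\n|\r\n?)((?:[A-Z]+(?:\n|\r\n?))+)",
                         re.MULTILINE)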
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline()           # read 1st line
        aminoacid_sequence = sequence_file.read()  # read the rest
    # some cleanup, if necessary
    title = title.strip()  # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
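Usage would then be something like this (the filename is just a placeholder):
title, sequence = read_amino_acid_sequence("protein.txt")  # hypothetical path
print(title)
print(sequence)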
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print 'Name: %s\nSequence:%s' % (m[0], m[1])
It can sometimes be convenient to specify the flag directly inside the string, as an inline flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
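A minimal sketch of what that can look like (the test case, function name, and error message below are invented for illustration):
import unittest

def fail_with_report():
    # hypothetical function that raises with a multi-line message
    raise ValueError("summary line\ndetail: bad input")

class InlineFlagExample(unittest.TestCase):
    def test_error_message(self):
        # (?m) switches on MULTILINE inside the pattern itself, so ^ and $
        # anchor to individual lines of the message; no re import needed here
        with self.assertRaisesRegex(ValueError, r"(?m)^detail: bad input$"):
            fail_with_report()

if __name__ == "__main__":
    unittest.main()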
My preference.
lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
        someVaryingText= line
        break
assert len( lineIter.next().strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Parsing semi-structured json data(Python/R)

I'm not good with regular expressions or programming.
I have my data like this in a text file:
RAMCHAR#HOTMAIL.COM ():
PATTY.FITZGERALD327#GMAIL.COM ():
OHSCOACHK13#AOL.COM (19OB3IRCFHHYO): [{"num":1,"name":"Bessey VAS23 Vario Angle Strap Clamp","link":"http:\/\/www.amazon.com\/dp\/B0000224B3\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1YMLERDXCK3UU&psc=1","old-price":"N\/A","new-price":"","date-added":"October 19, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51VMDDHT20L._SL500_SL135_.jpg","page":1},{"num":2,"name":"Designers Edge L-5200 500-Watt Double Bulb Halogen 160 Degree Wide Angle Surround Portable Worklight, Red","link":"http:\/\/www.amazon.com\/dp\/B0006OG8MY\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1BZH206RPRW8B","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/5119Z4RDFYL._SL500_SL135_.jpg","page":1},{"num":3,"name":"50 Pack - 12"x12" (5) Bullseye Splatterburst Target - Instantly See Your Shots Burst Bright Florescent Yellow Upon Impact!","link":"http:\/\/www.amazon.com\/dp\/B00C88T12K\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I31RJXFVF14TBM","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"67","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51QwsvI43IL._SL500_SL135_.jpg","page":1},{"num":4,"name":"DEWALT DW618PK 12-AMP 2-1\/4 HP Plunge and Fixed-Base Variable-Speed Router Kit","link":"http:\/\/www.amazon.com\/dp\/B00006JKXE\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I39QDQSBY00R56&psc=1","old-price":"N\/A","new-price":"","date-added":"September 3, 2012","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/416a5nzkYTL._SL500_SL135_.jpg","page":1}]
Could anybody suggest an easy way of separating this data into two columns (email ID in the first column and JSON data in the second)? Some rows might just have email IDs (like in row 1) and no corresponding JSON data.
Please help. Thanks!
Please try the following solution (for Python 2). This assumes that each entry is on a single line (which means that there may be no linebreaks within the JSON substring). I've chosen in.txt as the filename for your data file - change that to the actual filename/path:
import csv
import re
regex = re.compile("""
    ([^:]*)   # Match and capture any characters except colons
    :[ ]*     # Match a colon, followed by optional spaces
    (.*)      # Match and capture the rest of the line""",
    re.VERBOSE)
with open("in.txt") as infile, open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        writer.writerow(regex.match(line).groups())
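As a quick illustration of what the pattern captures, here it is written without re.VERBOSE and applied to the first sample line (rows that only have an email ID simply get an empty second column):
>>> import re
>>> re.match(r"([^:]*):[ ]*(.*)", "RAMCHAR#HOTMAIL.COM ():").groups()
('RAMCHAR#HOTMAIL.COM ()', '')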
If you are in a Linux/Unix environment you can use sed like so (a.txt is your input file):
<a.txt sed 's/\(^[^ (]*\)[^:]*: */\1 /'
The regular expression ^[^ (]* matches the start of each line (^) and zero or more characters that are not a space or a left parenthesis ([^ (]*), and by wrapping it in \( and \) you make sed "remember" the matching string as \1. The [^:]*: * expression then matches any characters up to and including the colon, plus zero or more spaces after it. The whole matched expression is replaced in each line with the remembered \1 string, which is the email. The rest of the line is the JSON data and is left intact.
If you want a CSV or a Tab separated file you need to replace the space character after \1, e.g.
<a.txt sed 's/\(^[^ (]*\)[^:]*:/\1,/'
