Parsing semi-structured JSON data (Python/R)

I'm not good with regular expressions or programming.
I have my data like this in a text file:
RAMCHAR#HOTMAIL.COM ():
PATTY.FITZGERALD327#GMAIL.COM ():
OHSCOACHK13#AOL.COM (19OB3IRCFHHYO): [{"num":1,"name":"Bessey VAS23 Vario Angle Strap Clamp","link":"http:\/\/www.amazon.com\/dp\/B0000224B3\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1YMLERDXCK3UU&psc=1","old-price":"N\/A","new-price":"","date-added":"October 19, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51VMDDHT20L._SL500_SL135_.jpg","page":1},{"num":2,"name":"Designers Edge L-5200 500-Watt Double Bulb Halogen 160 Degree Wide Angle Surround Portable Worklight, Red","link":"http:\/\/www.amazon.com\/dp\/B0006OG8MY\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1BZH206RPRW8B","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/5119Z4RDFYL._SL500_SL135_.jpg","page":1},{"num":3,"name":"50 Pack - 12"x12" (5) Bullseye Splatterburst Target - Instantly See Your Shots Burst Bright Florescent Yellow Upon Impact!","link":"http:\/\/www.amazon.com\/dp\/B00C88T12K\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I31RJXFVF14TBM","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"67","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51QwsvI43IL._SL500_SL135_.jpg","page":1},{"num":4,"name":"DEWALT DW618PK 12-AMP 2-1\/4 HP Plunge and Fixed-Base Variable-Speed Router Kit","link":"http:\/\/www.amazon.com\/dp\/B00006JKXE\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I39QDQSBY00R56&psc=1","old-price":"N\/A","new-price":"","date-added":"September 3, 2012","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/416a5nzkYTL._SL500_SL135_.jpg","page":1}]
Could anybody suggest an easy way of separating this data into two columns (email ID in the first column and the JSON data in the second column)? Some rows might have just an email ID (like row 1) and no corresponding JSON data.
Please help. Thanks!

Please try the following solution (for Python 2). This assumes that each entry is on a single line (which means that there may be no linebreaks within the JSON substring). I've chosen in.txt as the filename for your data file - change that to the actual filename/path:
import csv
import re

regex = re.compile("""
    ([^:]*)  # Match and capture any characters except colons
    :[ ]*    # Match a colon, followed by optional spaces
    (.*)     # Match and capture the rest of the line""",
    re.VERBOSE)

with open("in.txt") as infile, open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        writer.writerow(regex.match(line).groups())
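For Python 3, a regex is not strictly needed. The following is a minimal sketch, assuming every entry sits on one line in the form EMAIL (ID): JSON as in the question; the helper names split_line and convert are hypothetical. Unlike the code above, it drops the (ID) part and keeps only the email in the first column:

```python
import csv


def split_line(line):
    """Split 'EMAIL (ID): JSON' into (email, json_text); json_text may be ''."""
    # Everything before the first colon is the header, the rest is the payload.
    head, _, payload = line.rstrip("\n").partition(":")
    # The email is the header up to the first space (the '(ID)' part is dropped).
    email = head.split(" ", 1)[0]
    return email, payload.strip()


def convert(infile, outfile):
    """Write one CSV row (email, json_text) per non-empty input line."""
    writer = csv.writer(outfile)
    for line in infile:
        if line.strip():
            writer.writerow(split_line(line))
```

It could be called as: with open("in.txt") as src, open("out.csv", "w", newline="") as dst: convert(src, dst). Note that the sample row contains unescaped quotes inside a name (12"x12"), so passing the second column to json.loads would probably fail for that row unless the data is repaired first.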

If you are in a Linux/Unix environment, you can use sed like so (a.txt is your input file):
<a.txt sed 's/\(^[^ (]*\)[^:]*: */\1 /'
The regular expression ^[^ (]* matches the start of each line (^) followed by zero or more characters that are neither a space nor a left parenthesis ([^ (]*). Wrapping it in \( and \) makes sed "remember" the matched string as \1. The [^:]*: * expression then matches everything up to and including the colon, plus zero or more spaces after it. The whole matched text is replaced in each line with the remembered \1 string, which is the email, followed by a space. The rest of the line is the JSON data, which is left intact.
If you want a CSV or a Tab separated file you need to replace the space character after \1, e.g.
<a.txt sed 's/\(^[^ (]*\)[^:]*:/\1,/'

Related

How to use a secondary delimiter for every 6th string generated by using split function on a primary delimiter in Python?

I have a pipe delimited file that ends a record with a newline delimiter after every 6 pipe delimited fields as follows.
uid216|Banana
bunches
nurture|Fail|76|7645|Singer
uid342|Orange
vulture|Pass|56
87|3547|Actor
I was using split function in python to convert the records in the file to a list of strings.
parts = file_str.split('|')
However, I don't seem to understand how I can use a newline character as delimiter for every 6th string alone. Can someone please help me?
The right way to do this is probably to use Python's csv module for reading delimited files and stream the data from the file rather than reading it all into memory at once. When you read the whole file into a string you essentially have to iterate over it twice.
import csv

def process_file(path):
    with open(path, 'r') as file_handle:
        # csv.reader (lowercase) is the factory function of the csv module.
        reader = csv.reader(file_handle, delimiter='|')
        for row in reader:
            # row is a list whose entries are the fields of the delimited row;
            # do what you want with it.
            print(row)
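It seems that the words can span multiple lines between the pipes.
You could read the whole file and then use a pattern that matches a pipe char 5 times, together with all the preceding and following words. Note that the pattern can only tell where a record ends if the records are separated by a blank line (or the input ends there); otherwise the last field also swallows the first field of the next record.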
^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}
Explanation
^ Start of string (or start of a line when re.M is used)
[^|\n]+ Match 1+ chars other than | or a newline
(?:\n?[^|\n]+)* Optionally repeat an optional newline and 1+ chars other than | or a newline
(?: Non capture group to repeat as a whole part
\|[^|\n]+ Match | and 1+ chars other than | or a newline
(?:\n?[^|\n]+)* Optionally repeat an optional newline and 1+ chars other than | or a newline
){5} Close the non capture group and repeat it 5 times to match 5 pipe chars
For example
import re

pattern = r"^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}"

with open('file', mode='r') as file:
    allText = file.read()

for s in re.findall(pattern, allText, re.M):
    print(s.split("|"))
Output
['uid216', 'Banana\nbunches\nnurture', 'Fail', '76', '7645', 'Singer']
['uid342', 'Orange\nvulture', 'Pass', '56\n87', '3547', 'Actor']
If there have to be either 2 newlines following or the end of the string:
^[^|\n]+(?:\n?[^|\n]+)*(?:\|[^|\n]+(?:\n?[^|\n]+)*){5}(?=\n\n|\Z)
parts = []
# Iterate over each line by splitting on \n,
# then extend parts to gather all strings in a single list.
for line in file_str.split("\n"):
    parts.extend(line.split("|"))
print(parts)
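Another option that avoids a regex is to split on the pipe first and then treat the newline as a delimiter only in every 6th piece. This is a minimal sketch, assuming each record has exactly 6 fields and that the last field of a record contains no newline itself; split_records is a hypothetical helper name:

```python
def split_records(text, fields_per_record=6):
    """Group pipe-delimited pieces into records of fields_per_record fields."""
    records = []
    current = []
    for piece in text.strip().split("|"):
        if len(current) == fields_per_record - 1 and "\n" in piece:
            # This piece holds the last field of one record and, after the
            # first newline, the first field of the next record.
            last, _, first_of_next = piece.partition("\n")
            records.append(current + [last])
            current = [first_of_next]
        else:
            current.append(piece)
    if current:
        records.append(current)
    return records
```

Newlines inside the first five fields are kept as part of the field, so they can be replaced or stripped afterwards if needed.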

Remove every 10 digit number

I have a huge collection of files that I am trying to rename in bulk. The patterns of these filenames are somewhat consistent, but there are a few bumps that render my basic regex knowledge inadequate.
The filenames usually go like this:
1050327473 {913EDD51} 1st Filename [2nd Edition].txt
I could remove the strings between {}, [], and a few other special characters with this piece of code:
new_file_name = re.sub(r'{.+?}', '', filename)
new_file_name = re.sub(r'\[.+?]', '', new_file_name)
new_file_name = ((new_file_name.split(" .pdf", 1)[0]) + '.pdf').translate({ord(i):None for i in '/\:*?"<>|_'})
and it successfully outputs this:
1050327473 1st Filename
However, some of the original filenames deviate from this pattern, and I still have to remove the 10-digit number. A few of the other patterns look like this:
785723041X, 4844004976 {2C5ACB07} 1st Filename.txt
0383948600 {6A7528B5} 2nd Filename.txt
3263031418, 7966530910, 8070331430 {DCBAD13B} 3rd Filename.txt
The expected output is
1st Filename.txt
2nd Filename.txt
3rd Filename.txt
Now, I could remove every digit, but the file name would then lose a meaningful part and become st Filename.txt. Slicing the string with something like [10:] would not work either, because the length of the numeric prefix varies.
I thought the most logical thing would be to remove every 10-digit sequence, but some of these sequences end with an X instead of a 10th digit, like 785723041X. Also, if the 10-digit sequence is followed by a comma, that should be removed too.
What would be the best approach to solve this problem? Is it doable with regex only?
With specific regex pattern:
import re
filenames = ['785723041X, 4844004976 {2C5ACB07} 1st Filename.txt',
             '0383948600 {6A7528B5} 2nd Filename.txt',
             '3263031418, 7966530910, 8070331430 {DCBAD13B} 3rd Filename.txt']
pat = re.compile(r'\{[^{}]+\}|\[[^\[\]]+\]|\b\d{9}[\dX],?')
filenames = [pat.sub('', f).strip() for f in filenames]
print(filenames)
The output:
['1st Filename.txt', '2nd Filename.txt', '3rd Filename.txt']
Regex details:
..|..|.. - alternation group (to match a single regular expression out of several possible regular expressions)
\{[^{}]+\} - match any characters enclosed with {} (except the braces themselves, ensured by the character class [^{}]+)
\[[^\[\]]+\] - match any characters enclosed with [] (except the brackets themselves, ensured by the character class [^\[\]]+; the brackets must be escaped inside the class)
\b\d{9}[\dX],? - match a 9-digit sequence followed by either a 10th digit or an X char, and an optional trailing , char
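The first example in the question also has a [2nd Edition] tag in front of the extension, which leaves stray spaces behind once it is removed. The following is a minimal sketch of one way to handle that, assuming the extension should be kept as-is; clean_name and ID_OR_TAG are hypothetical names, and the extra \b after the ID is an addition so that longer digit runs are not cut in the middle:

```python
import os
import re

# Curly/square bracket tags, or a 10-character ID (9 digits plus a digit or X)
# with an optional trailing comma.
ID_OR_TAG = re.compile(r'\{[^{}]+\}|\[[^\[\]]+\]|\b\d{9}[\dX]\b,?')


def clean_name(filename):
    """Remove IDs and bracketed tags from the stem, keep the extension."""
    stem, ext = os.path.splitext(filename)
    stem = ID_OR_TAG.sub('', stem)
    # Collapse the runs of spaces left behind by the removals.
    return ' '.join(stem.split()) + ext
```

Working on the stem only means a space left in front of the extension is removed by the same whitespace cleanup.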

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It succeeds only if the whole string matches.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). re.match anchors the match at the start of the string, and $ anchors it at the end.
Actually, you could also add ^ at the start of the regex (keeping $ at the end) and call re.search(pattern, string), but it would be a very strange combination.
I also have a remark concerning how you specified your conditions, which may be incomplete: you gave the $40 million string as an example and stated that the only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: apparently you have forgotten to prefix the pattern with r. If you use an r-string literal, you do not have to double the backslashes inside.
So I think the most natural solution is to call a function devoted to matching the whole string (i.e. fullmatch), without adding start / end anchors, and the whole script can be:
import re

pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And an optional space.
)? - End of the non-capturing group; the ? states that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash escaping is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.]?[\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.]? - An optional single comma or dot.
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group; it may occur zero or more times.
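When the lines come from a file, each one usually still ends with a newline, and fullmatch would then reject it. The following is a minimal sketch of the stricter variant applied to such lines, assuming surrounding whitespace should be ignored; numeric_lines and NUMBER_ONLY are hypothetical names, and the separator is written as a required [,.] here, which accepts the same strings:

```python
import re

# Optional "$" (with an optional space), then digit groups joined by a single "," or ".".
NUMBER_ONLY = re.compile(r'(?:\$\s?)?\d+(?:[,.]\d+)*')


def numeric_lines(lines):
    """Return the stripped lines that consist of a number and nothing else."""
    result = []
    for line in lines:
        text = line.strip()  # drop the trailing newline before matching
        if NUMBER_ONLY.fullmatch(text):
            result.append(text)
    return result
```

For example, numeric_lines(open("letter.txt")) would work on a file object directly, where letter.txt is a placeholder filename.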
With a little modification to your code:
import re

letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$')  # Groups of three digits separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re

pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any line made up only of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or ., use the pattern '^\d+[,.]?\d+\s*$' instead:
import re

pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
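The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits (\d+) and optional trailing whitespace (\s*). Note that it requires at least two digits, so a single-digit line such as 7 would not match.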

copy required data from one file to another file in python

I am new to Python and am stuck on this. I have a file a.txt which contains 10-15 lines of HTML code and text. I want to copy the data that matches my regular expression from a.txt to b.txt. Suppose I have a line Hello "World" How "are" you and I want to copy the data between double quotes, i.e. World and are, to the new file.
This is what I have done.
if x in line:
    p = re.compile("\"*\"")
    q = p.findall(line)
    print q
But this displays only " " (double quotes) as output. I think there is a mistake in my regular expression.
Any help is greatly appreciated.
Thanks.
Your regex (which translates to "*" without all the string escaping) matches zero or more quotes, followed by a quote.
You want
p = re.compile(r'"([^"]*)"')
Explanation:
" # Match a quote
( # Match and capture the following:
[^"]* # 0 or more characters except quotes
) # End of capturing group
" # Match a quote
This assumes that you never have to deal with escaped quotes, e.g.
He said: "The board is 2\" by 4\" in size"
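If backslash-escaped quotes do occur, the character class can be extended so that a backslash always consumes the character after it. This is a minimal sketch under that assumption (quotes escaped with a backslash, nothing else special); quoted_parts and QUOTED are hypothetical names:

```python
import re

# Between the quotes, accept either a character that is neither a quote nor
# a backslash, or a backslash followed by any single character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def quoted_parts(line):
    """Return the text between each pair of double quotes in line."""
    return QUOTED.findall(line)
```

The captured text keeps the backslashes, so they would still need to be removed if the plain text is wanted.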
Capture the group you're interested in (i.e. the text between quotes), extract the matches from each line, then write them one per line to the new file, e.g.:
import re

with open('input') as fin, open('output', 'w') as fout:
    for line in fin:
        matches = re.findall('"(.*?)"', line)
        fout.writelines(match + '\n' for match in matches)

How to ignore multiple whitespace chars and words in python regex

I have a pattern which is looking for word1 followed by word2 followed by word3 with any number of characters in between.
My file however contains many random newline and other white space characters - which means that between word 1 and 2 or word 2 and 3 there could be 0 or more words and/or 0 or more newlines randomly
Why isn't this code working? (It's not matching anything.)
strings = re.findall('word1[.\s]*word2[.\s]*word3', f.read())
[.\s]* - What I mean by this - find either '.'(any char) or '\s'(newline char) multiple times(*)
Your regex is not working for two reasons. Inside a character class, . is a literal dot, so [.\s]* only matches dots and whitespace and cannot skip over the words in between. And outside a character class, . matches any character except a newline by default, so a plain .* cannot cross line boundaries either.
To make . match newline characters as well, pass re.DOTALL as the third argument (flags) to findall:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
You have two problems:
1) . doesn't mean anything special inside brackets [].
Change your [] to use () instead, like this: (.|\s)
2) \ doesn't mean what you think it does inside regular strings.
Try using raw strings:
re.findall(r'word1 ..blah..')
Notice the r prefix of the string.
Putting them together:
strings = re.findall(r'word1(.|\s)*word2(.|\s)*word3', f.read())
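However, do note that this changes the returned list: because the pattern now contains capturing groups, findall returns a list of tuples with the groups' contents instead of the full matches. Use non-capturing groups, (?:.|\s), if you want the whole matched text.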
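The difference between these variants can be checked on a small string. This is a minimal sketch; the sample text is made up for illustration:

```python
import re

# Made-up sample with words and newlines between word1, word2 and word3.
text = "word1 foo\nbar\n\nword2\n  baz word3 tail"

# Pattern from the question: [.\s] only allows literal dots and whitespace
# between the words, so "foo" and "bar" block the match.
original = re.findall(r'word1[.\s]*word2[.\s]*word3', text)

# re.DOTALL lets . match newlines; .*? keeps each gap as short as possible.
with_dotall = re.findall(r'word1.*?word2.*?word3', text, re.DOTALL)

# Capturing groups make findall return the groups, not the whole match.
with_groups = re.findall(r'word1(.|\s)*word2(.|\s)*word3', text)

print(original)
print(with_dotall)
print(with_groups)
```

With more than one word1 ... word3 sequence in the file, the non-greedy .*? matters, since a greedy .* would run from the first word1 to the last word3.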
