Delete a repeating pattern in a string using Python

I have a JavaScript file with an array of data.
info = [ {
Date = "YR-MM-DDT00:00:10"
}, ....
What I'm trying to do is remove the T and everything after it in the Date field.
Here's what I've tried:
import re
with open("info.js", "r") as myFile:
    data = myFile.read();
    data = re.sub('\0-9T,'',data);
Desired output for each Date field in the array:
Date = "YR-MM-DD"

You should match the T and the characters that come after it. This works for a single timestamp:
import re
print(re.sub('T.*$', '', 'YR-MM-DDT00:00:10'))
Or if you have text containing a bunch of timestamps, match the closing double quote as well, and replace with a double quote:
import re
text = """
info = [ {
Date = "YR-MM-DDT00:00:10",
Date = "YR-MM-DDT01:02:03",
Date = "YR-MM-DDT11:22:33"
}
"""
new_text = re.sub('T.*"', '"', text)
print(new_text)
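To apply this to the whole file from the question, here is a minimal sketch; it assumes the file is the info.js from the question and uses a character-class variant of the pattern so that each substitution stops at the first closing quote:
import re

# Read the JavaScript file, trim the time part of every quoted timestamp,
# and write the result back.
with open("info.js", "r") as my_file:
    data = my_file.read()

# "YR-MM-DDT00:00:10" -> "YR-MM-DD"
data = re.sub(r'T[^"]*"', '"', data)

with open("info.js", "w") as my_file:
    my_file.write(data)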


Extract only the specific value from string with Regex Using Python

I am trying to extract specific text values from a string using regex, but because there are no spaces between the keyword and the value that follows it, I am getting an error.
I am looking to extract the values that follow certain keywords.
I tried using PyPDF2 and pdfminer, but I keep getting the error shown below.
fr = PyPDF2.PdfFileReader(file)
data = fr.getPage(0).extractText()
Output: ['Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']
I am looking to capture Ack No, Date of Issue, and CIN from the above output.
Using the script:
regex_ack_no = re.compile(r"Ack No(\d+)")
regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")
ack_no = re.search(regex_ack_no, data).group(1)
due_date = re.search(regex_due_date, data).group(1)
cin = re.search(regex_CIN, data).group(1)
return[ack_no, due_date, cin]
Error:
AttributeError: 'NoneType' object has no attribute 'group'
When I use the same script with another PDF file that has its data in a table format, it works.
You need to change the regexp patterns to match the data format. The keywords are followed by spaces and a colon, so you have to match those too. The date format is not what you have in your pattern, and neither is the CIN format.
Before calling .group(1), check that the match was successful. In my code below I return default values when there's no match.
import re

data = 'Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....'

regex_ack_no = re.compile(r"Ack No\s*:\s*(\d+)")
regex_due_date = re.compile(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})")
regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:")

ack_no = re.search(regex_ack_no, data)
if ack_no:
    ack_no = ack_no.group(1)
else:
    ack_no = 'Ack No not found'

due_date = re.search(regex_due_date, data)
if due_date:
    due_date = due_date.group(1)
else:
    due_date = 'Due date not found'

cin = re.search(regex_CIN, data)
if cin:
    cin = cin.group(1)
else:
    cin = 'CIN not found'

print([ack_no, due_date, cin])
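If the same three-step pattern (search, check, fall back to a default) gets repetitive, it can be folded into a small helper; this is just a sketch built on the data and patterns above:
import re

def search_or_default(pattern, text, default):
    """Return group(1) of the first match, or a default string when there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# `data` is the extracted PDF text from the snippet above.
ack_no = search_or_default(r"Ack No\s*:\s*(\d+)", data, "Ack No not found")
due_date = search_or_default(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})", data, "Due date not found")
cin = search_or_default(r"CIN:\s*(\w+?)GSTIN:", data, "CIN not found")
print([ack_no, due_date, cin])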

How to get readable unicode string from single bibtex entry field in python script

Suppose you have a .bib file containing bibtex-formatted entries. I want to extract the "title" field from an entry, and then format it to a readable unicode string.
For example, if the entry was:
@article{mypaper,
author = {myself},
title = {A very nice {title} with annoying {symbols} like {\^{a}}}
}
what I want to extract is the string:
A very nice title with annoying symbols like â
I am currently trying to use the pybtex package, but I cannot figure out how to do it. The command-line utility pybtex-format does a good job in converting full .bib files, but I need to do this inside a script and for single title entries.
Figured it out:
def load_bib(filename):
    from pybtex.database.input.bibtex import Parser
    parser = Parser()
    DB = parser.parse_file(filename)
    return DB

def get_title(entry):
    from pybtex.plugin import find_plugin
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    sentence = style.format_title(entry, 'title')
    data = {'entry': entry,
            'style': style,
            'bib_data': None}
    T = sentence.f(sentence.children, data)
    title = T.render(backend)
    return title

DB = load_bib("bibliography.bib")
print(get_title(DB.entries["entry_label"]))
where entry_label must match the label you use in LaTeX to cite the bibliography entry.
Building upon the answer by Daniele, I wrote this function that lets one render fields without having to use a file.
from io import StringIO
from pybtex.database.input.bibtex import Parser
from pybtex.plugin import find_plugin

def render_fields(author="", title=""):
    """The arguments are in bibtex format. For example, they may contain
    things like \'{i}. The output is a dictionary with these fields
    rendered in plain text.

    If you run tests by defining a string in Python, use r'''string''' to
    avoid issues with escape characters.
    """
    parser = Parser()
    istr = r'''
    @article{foo,
        Author = {''' + author + r'''},
        Title = {''' + title + '''},
    }
    '''
    bib_data = parser.parse_stream(StringIO(istr))
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    entry = bib_data.entries["foo"]
    data = {'entry': entry, 'style': style, 'bib_data': None}

    sentence = style.format_author_or_editor(entry)
    T = sentence.f(sentence.children, data)
    rendered_author = T.render(backend)[0:-1]  # exclude period

    sentence = style.format_title(entry, 'title')
    T = sentence.f(sentence.children, data)
    rendered_title = T.render(backend)[0:-1]  # exclude period

    return {'title': rendered_title, 'author': rendered_author}
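A short usage example (the field values are made up for illustration, and the exact rendered output depends on the pybtex plain style):
fields = render_fields(author=r"Erd\H{o}s, Paul",
                       title=r"A very nice {title} with annoying {symbols} like {\^{a}}")
print(fields['title'])   # expected along the lines of: A very nice title with annoying symbols like â
print(fields['author'])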

How to split a date time string and replace with date only string?

I am having a problem with splitting this string:
"published": "2018-08-15T08:04:57Z",
I would like to split the 2018-08-15 part from the T08... part, and then remove the T08... part. This needs to be applied to every "published" field in the .json file.
I'll have to do this with Python, as I also convert the XML file to JSON.
So during the conversion process I would like to remove the T08... part.
I hope someone can help me; if more clarification is needed, I don't mind giving it :)
I've searched the internet and had a look at the .split(), .pop(), etc. methods. I am still a rookie at Python, but I want to learn.
Here is my current code:
import xmltodict
import json

# Searching for .xml file to convert
with open('../../get_url/chocolatey.xml') as fd:
    xmlString = fd.read()

# Converting .xml file
print("XML Input (../../get_url/chocolatey.xml):")
print(xmlString)

# Removing certain characters from strings in the file
jsonString = json.dumps(xmltodict.parse(xmlString), indent=4)
jsonString = jsonString.replace("#", "")
jsonString = jsonString.replace("m:", "")
jsonString = jsonString.replace("d:", "")
#jsonString = jsonString.replace('"', '')

# Printing output in JSON format
print("\nJson Output (../../get_url/chocolatey.json):")
print(jsonString)

# Applying output to .json file
with open("chocolatey.json", 'w') as fd:
    fd.write(jsonString)
Example of the JSON file
},
"published": "2018-08-15T08:04:57Z",
"updated": "2018-08-15T08:04:57Z",
"author": {
"name": "Microsoft"
},
You can try it like this:
timestamp = "2018-08-15T08:04:57Z"
timestamp = timestamp.split("T")[0]
Output:
2018-08-15
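To use this inside the conversion script from the question, one option is to walk the dictionary returned by xmltodict and trim every date field before dumping it to JSON. A sketch (the key names "published" and "updated" come from the example; if they carry namespace prefixes in your feed, adjust the keys accordingly):
import json
import xmltodict

def trim_dates(node, keys=("published", "updated")):
    """Recursively replace values like '2018-08-15T08:04:57Z' with '2018-08-15'."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in keys and isinstance(value, str):
                node[key] = value.split("T")[0]
            else:
                trim_dates(value, keys)
    elif isinstance(node, list):
        for item in node:
            trim_dates(item, keys)

with open('../../get_url/chocolatey.xml') as fd:
    parsed = xmltodict.parse(fd.read())

trim_dates(parsed)
jsonString = json.dumps(parsed, indent=4)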
You can use the dateutil.parser for this.
from dateutil import parser
d = "2018-08-15T08:04:57Z"
dt = parser.parse(d) # parses date string of any format and returns a date time object
print(dt,type(dt))
# outputs 2018-08-15 08:04:57+00:00 <class 'datetime.datetime'>
You can then use strftime to get just the date, or the date and time, in any format.
print(dt.strftime('%Y-%m-%d')) # You can specify any format you need
# outputs 2018-08-15
Read more about how to get date string from datetime object in any format here.
Example code:
import json
from dateutil import parser
jsonDict = {"published": "2018-08-15T08:04:57Z", "updated": "2018-08-15T08:04:57Z", "author": { "name": "Microsoft"},}
# converting a dictionary object to json String
jsonString = json.dumps(jsonDict)
# converting a json string to json object
jsonObj = json.loads(jsonString)
# replacing the "published" value with date only
jsonObj["published"] = parser.parse("2018-08-15T08:04:57Z").strftime('%Y-%m-%d')
# printing the result
print(jsonObj["published"])
# outputs 2018-08-15
# converting back to json string to print
jsonString = json.dumps(jsonObj)
# printing the json string
print(jsonString)
# outputs
'''
{"published": "2018-08-15", "updated": "2018-08-15T08:04:57Z", "author":{"name": "Microsoft"}}
'''
Something like this, using a custom JSONEncoder:
import json

class PublishedEncoder(json.JSONEncoder):
    def encode(self, o):
        if 'published' in o:
            o['published'] = o['published'][:o['published'].find('T')]
        return super().encode(o)

data = {1: 'X', 'published': '2018-08-15T08:04:57Z'}
json_str = json.dumps(data, cls=PublishedEncoder)
print(json_str)
output
{"1": "X", "published": "2018-08-15"}

Python backreference replacing doesn't work as expected

There are two named groups in my pattern, myFlag and id, and I want to insert another copy of myFlag immediately before group id.
Here is my current code:
# i'm using Python 3.4.2
import re
import os

contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
'''

pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+(?P<id>\(UINT\)0 *,)'
res = re.search(pattern, contents, re.DOTALL)
if None != res:
    print(res.groups())  # the output is (b'xdlg', b'(UINT)0,')
    # 'replPattern' becomes b'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+((?P=myFlag)\\(UINT\\)0 *,)'
    replPattern = pattern.replace(b'?P<id>', b'(?P=myFlag)', re.DOTALL)
    print(replPattern)
    contents = re.sub(pattern, replPattern, contents)
    print(contents)
The expected results should be:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg(xdlg(UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
but the actual result is the same as the original:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
The issue appears to be the pattern syntax, particularly the ending:
0 *,)
That part doesn't really make sense; fixing it solves most of the issues, although I would recommend ditching DOTALL and going with MULTILINE instead:
# `s` is the source text as a str; decode the question's bytes first, e.g. s = contents.decode()
p = re.compile(r'([a-zA-Z0-9_]+)::\1(.*\n\W+:.*)(\(UINT\)0,.*)', re.MULTILINE)
sub = r"\1::\1\2\1\3"
result = re.sub(p, sub, s)
print(result)
Result:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg(xdlg(UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
https://regex101.com/r/hG3lV7/1
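For completeness, here is a sketch of how the original bytes-based approach could also be made to work. The key point is that inside a replacement string a named group is referenced with \g<name>; the (?P=name) syntax is only valid inside a pattern. The pattern below also captures the middle part explicitly so the replacement can reproduce it:
import re

contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
    : customized_dlg((UINT)0, pParent, pApp)
    , m_pReaderApp(pApp)
    , m_info(pApp)
{
}
'''

pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag)(?P<middle>.+:.+)(?P<id>\(UINT\)0 *,)'
# \g<myFlag> etc. are replacement-string backreferences to the named groups above.
repl = rb'\g<myFlag>::\g<myFlag>\g<middle>\g<myFlag>\g<id>'
print(re.sub(pattern, repl, contents, flags=re.DOTALL).decode())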

Work with Chinese in Python

I'm trying to work with Chinese text and big data in Python.
Part of the work is cleaning the text of some unneeded data. For this I am using regexes. However, I have run into some problems, both with the Python regexes and with the PyCharm application:
1) The data is stored in PostgreSQL and displays fine in the columns; however, after selecting it and pulling it into a variable, the PyCharm debugger shows it as squares.
When the value is printed to the console it looks like:
Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)
So I presume the problem is not with the application encoding but with the debugger's handling of the encoding; however, I did not find any solution for this behaviour.
2) An example of a regex I need to handle is one that removes the values between Chinese brackets, including the brackets themselves. The code I used is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
from pprint import pprint
import sys, locale, os

columnString = row[columnName]
startFrom = valuestoremove["startsTo"]
endWith = valuestoremove["endsAt"]
isInclude = valuestoremove["include"]
escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
nonASCIIregex = re.compile('([^\x00-\x7F])')

if escapeCharsRegex.match(startFrom):
    startFrom = re.escape(startFrom)
if escapeCharsRegex.match(endWith):
    endWith = re.escape(endWith)

if isInclude:
    regex = startFrom + '(.*)' + endWith
else:
    regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'

if nonASCIIregex.match(regex):
    p = re.compile(ur'' + regex)
else:
    p = re.compile(regex)

row[columnName] = p.sub("", columnString).strip()
But the regex has no effect on the given string.
I've made a test with the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string
And it works fine for me.
The only difference between the two code examples is that in the first, the regex values come from a txt file containing JSON, encoded as UTF-8:
{
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "1"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}
The Chinese brackets from the file are also displayed as squares in the debugger.
I can't find an explanation or any solution for this behavior, so I'm asking for the community's help.
Thanks for the help.
The problem is that the text you're reading in isn't getting understood as Unicode correctly (this is one of the big gotchas that prompted sweeping changes for Python 3k). Instead of:
data_file = myfile.read()
You need to tell it to decode the file:
data_file = myfile.read().decode("utf8")
Then continue with json.loads, etc, and it should work out fine. Alternatively,
data = json.load(myfile, "utf8")
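Putting that together with the question's setup, a minimal sketch follows; the file name rules.json is hypothetical, the rules are assumed to be stored as a JSON array of the "between" objects shown above, and io.open is used so the same code decodes the file on both Python 2 and 3:
# -*- coding: utf-8 -*-
import io
import json
import re

# Read the JSON rules file as Unicode text, so the bracket characters arrive
# as decoded characters rather than raw bytes.
with io.open("rules.json", encoding="utf-8") as myfile:  # hypothetical file name
    rules = json.load(myfile)

between = rules[0]["between"]           # assumes a JSON array of "between" objects
start = re.escape(between["startsTo"])  # e.g. a Chinese bracket
end = re.escape(between["endsAt"])

pattern = re.compile(start + u'(.*?)' + end)  # Unicode pattern, non-greedy between the brackets
print(pattern.sub(u'', u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"))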
After many searches and consultations, here is a solution that works for Chinese text (as well as for mixed and non-mixed language text):
import codecs
import re

def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)
    if isInclude:
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'
    p = re.compile(codecs.encode(unicode(regex), "utf-8"))
    delimiter = ' '
    if localization == 'CN':
        delimiter = ''
    row[columnName] = p.sub(delimiter, columnString).strip()
As you can see, we encode every regex to UTF-8, so that it matches the values coming from the PostgreSQL database.
