Text mining attempts here, I would like to turn the below:
a=['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n'
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
to this:
{'Colors.of.the universe':{Black:111,Grey:222,White:11},
'Movies of the week':{Mission Impossible:121,Die_Hard:123,Jurassic Park:33},
'Lands.categories.said': {Desert:33212,forest:4532,grassland:431,tundra:243451}}
Tried this code below but it was not good:
{words[1]:words[1:] for words in a}
which gives
{'o': 'olors.of.the universe:\n',
' ': ' tundra : 243451\n',
'a': 'ands.categories.said:\n'}
It only takes the first word as the key which is not what's needed.
A dict comprehension is an interesting approach.
a = ['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n',
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
result = dict()
current_key = None
for w in a:
# If starts with tab - its an item (under category)
if w.startswith(' '):
# Splitting item (i.e. ' Desert: 33212\n' -> [' Desert', ' 33212\n']
splitted = w.split(':')
# Setting the key and the value of the item
# Removing redundant spaces and '\n'
# Converting value to number
k, v = splitted[0].strip(), int(splitted[1].replace('\n', ''))
result[current_key][k] = v
# Else, it's a category
else:
# Removing ':' and '\n' form category name
current_key = w.replace(':', '').replace('\n', '')
# If category not exist - create a dictionary for it
if not current_key in result.keys():
result[current_key] = {}
# {'Colors.of.the universe': {'Black': 111, 'Grey': 222, 'White': 11}, 'Movies of the week': {'Mission Impossible': 121, 'Die_Hard': 123, 'Jurassic Park': 33}, 'Lands.categories.said': {'Desert': 33212, 'forest': 4532, 'grassland': 431, 'tundra': 243451}}
print(result)
That's really close to valid YAML already. You could just quote the property labels and parse. And parsing a known format is MUCH superior to dealing with and/or inventing your own. Even if you're just exploring base python, exploring good practices is just as (probably more) important.
import re
import yaml
raw = ['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n',
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
# Fix spaces in property names
fixed = []
for line in raw:
match = re.match(r'^( *)(\S.*?): ?(\S*)\s*', line)
if match:
fixed.append('{indent}{safe_label}:{value}'.format(
indent = match.group(1),
safe_label = "'{}'".format(match.group(2)),
value = ' ' + match.group(3) if match.group(3) else ''
))
else:
raise Exception("regex failed")
parsed = yaml.load('\n'.join(fixed), Loader=yaml.FullLoader)
print(parsed)
def text_process(text):
text = text.translate(str.maketrans('', '', string.punctuation))
return " ".join(text)
Input text: 'Transaction value was - RS.3456.63 '
Output : 'Transaction value was RS 345663 '
Could someone suggest me how to remove special characters (including '.' ) during text pre-processing but retain the decimal numbers?
Required Output : 'Transaction value was RS 3456.63 '
You can use a more generic regex to replace all special characters except .
import re
def text_process(text):
text = re.sub('[^\w.]+', ' ', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
You get
'Transaction value was 3456.63 Rupees'
EDIT: The following function returns only the number with decimals.
def text_process(text):
text = re.sub('[^\d.]+', '', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
'3456.63'
If I understand your question correctly, this code is for you:
text = 'Transaction value was, - 3456.63 Rupees'
regex = r"(?<!\d)[" + string.punctuation + "](?!\d)"
result = re.sub(regex, "", text)
# output: 'Transaction value was 3456.63 Rupees'
To solve your second question, try using this trick:
text = 'Transaction value was, - Rs.3456.63'
regex_space = r"([0-9]+(\.[0-9]+)?)"
regex_punct = r'[^\w.]+'
re.sub(r'[^\w.]+', ' ', re.sub(regex_space,r" \1 ", text).strip())
# output: 'Transaction value was Rs. 3456.63 Rupees'
In a program - the program doesn't matter -, only the first lines, I open an empty file (named empty.txt).
Then I define functions, but never use them on main ... so, I do not actually write anything.
This the nearly complete code:
from os import chdir
chdir('C:\\Users\\Julien\\Desktop\\PS BOT')
fic=open('empty.txt','r+')
def addtodic(txt):
"""Messages de la forme !add id,message ; txt='id,message' """
fic.write(txt+'\n')
fic.seek(0)
def checkdic(txt):
"""Messages de la forme !lien id ; txt='id' """
for i in fic.readlines().split('\n'):
ind=i.index(',')
if i[:ind]==txt:
fic.seek(0)
return i[ind+1:]
fic.seek(0)
return 'Not found'
Then I launch it, and using the console, I simply ask "fic.write( 'tadam' )", like, to check if the writing works well before moving on.
%run "C:/Users/Julien/Desktop/PS BOT/dic.py"
fic
Out[8]: <open file 'empty.txt', mode 'r+' at 0x0000000008D9ED20>
fic.write('tadam')
fic.readline()
Out[10]: 'os import chdir\n'
fic.readline()
Out[11]: "chdir('C:\\\\Users\\\\Julien\\\\Desktop\\\\PS BOT')\n"
fic.readline()
Out[12]: '\n'
fic.readline()
Out[13]: "fic=open('empty.txt','r+')\n"
fic.readlines()
Out[14]:
['\n',
'def addtodic(txt):\n',
' """Messages de la forme !add id,message ; txt=\'id,message\' """\n',
' fic.seek(0)\n',
" fic.write(txt)+'\\n'\n",
'\n',
'def checkdic(txt):\n',
' """Messages de la forme !lien id ; txt=\'id\' """\n',
" for i in fic.readline().split('\\n'):\n",
" ind=i.index(',')\n",
' if i[:ind]==txt:\n',
' fic.seek(0)\n',
' return i[ind+1:]\n',
' fic.seek(0)\n',
" return 'Not found'\n",
' \n',
'def removedic(txt):\n',
' """Messages de la forme !remove id ; txt=\'id\' """\n',
' check=True\n',
' while check:\n',
' i=fic.readline()\n',
' if i[:len(txt)]==txt: \n',
' fic.seek(0)\n',
' return check\n',
'#removedic fauxeturn check\r\n',
"#removedic faux tmp_file = open(filename,'w')\n",
' tmp_file.write(data)\n',
' tmp_file.close()\n',
' return filename\n',
'\n',
' # TODO: This should be removed when Term is refactored.\n',
' def write(self,data):\n',
' """Write a string to the default output"""\n',
' io.stdout.write(data)\n',
'\n',
' # TODO: This should be removed when Term is refactored.\n',
' def write_err(self,data):\n',
' """Write a string to the default error output"""\n',
' io.stderr.write(data)\n',
'\n',
' def ask_yes_no(self, prompt, default=None):\n',
' if self.quiet:\n',
' return True\n',
' return ask_yes_no(prompt,default)\n',
'\n',
' def show_usage(self):\n',
' """Show a usage message"""\n',
' page.page(IPython.core.usage.interactive_usage)\n',
'\n',
' def extract_input_lines(self, range_str, raw=False):\n',
' """Return as a string a set of input history slices.\n',
'\n',
' Parameters\n',
' ----------\n',
' range_str : string\n',
' The set of slices is given as a string, like "~5/6-~4/2 4:8 9",\n',
' since this function is for use by magic functions which get their\n',
' arguments as strings. The number before the / is the session\n',
' number: ~n goes n back from the current session.\n',
'\n',
' Optional Parameters:\n',
' - raw(False): by default, the processed input is used. If this is\n',
' true, the raw input history is used instead.\n',
'\n',
' Note that slices can be called with two notations:\n',
'\n',
' N:M -> standard python form, means including items N...(M-1).\n',
'\n',
' N-M -> include items N..M (closed endpoint)."""\n',
' lines = self.history_manager.get_range_by_str(range_str, raw=raw)\n',
' return "\\n".join(x for _, _, x in lines)\n',
'\n',
' def find_user_code(self, target, raw=True, py_only=False, skip_encoding_cookie=True):\n',
' """Get a code string from history, file, url, or a string or macro.\n',
'\n',
' This is mainly used by magic functions.\n',
'\n',
' Parameters\n',
' ----------\n',
'\n',
' target : str\n',
'\n',
' A string specifying code to retrieve. This will be tried respectively\n',
' as: ranges of input history (see %history for syntax), url,\n',
' correspnding .py file, filename, or an expression evaluating to a\n',
' string or Macro in the user namespace.\n',
'\n',
' raw : bool\n',
' If true (default), retrieve raw history. Has no effect on the other\n',
' retrieval mechanisms.\n',
'\n',
' py_only : bool (default False)\n',
' Only try to fetch python code, do not try alternative methods to decode file\n',
' if unicode fails.\n',
'\n',
' Returns\n',
' -------\n',
' A string of code.\n',
'\n',
' ValueError is raised if nothing is found, and TypeError if it evaluates\n',
' to an object of another type. In each case, .args[0] is a printable\n',
' message.\n',
' """\n',
' code = self.extract_input_lines(target, raw=raw) # Grab history\n',
' if code:\n',
' return code\n',
' utarget = unquote_filename(target)\n',
' try:\n',
" if utarget.startswith(('http://', 'https://')):\n",
' return openpy.read_py_url(utarget, skip_encoding_cookie=skip_encoding_cookie)\n',
' except UnicodeDecodeError:\n',
' if not py_only :\n',
' from urllib import urlopen # Deferred import\n',
' response = urlopen(target)\n',
" return response.read().decode('latin1')\n",
' raise ValueError(("\'%s\' seem to be un']
KABOOM ! Has anybody an explanation ? By the way, I use Python 2.7 with Enthought Canopy.
When you open a file with 'r+', it doesn't get truncated, it still retains its old contents. To truncate it to 0 bytes, call fic.truncate(0) right after opening it.
You must seek between read and write operations on the same file object (otherwise the results are undefined because of buffering), e.g. add a fic.seek(0, 0) (or any other seek) after the write call.
I am currently trying to use python to parse the notes file for my kindle so that I can keep them more organized than the chronologically ordered list that the kindle automatically saves notes in. Unfortunately, I'm having trouble using regex to parse the file. Here's my code so far:
import re
def parse_file(in_file):
read_file = open(in_file, 'r')
file_lines = read_file.readlines()
read_file.close()
raw_note = "".join(file_lines)
# Regex parts
title_regex = "(.+)"
title_author_regex = "(.+) \((.+)\)"
loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"
date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)" # Date
time_regex = "([0-9]+):([0-9]+) (AM|PM)" # Time
content_regex = "(.*)"
footer_regex = "=+"
nl_re = "\r*\n"
# No author
regex_noauthor_str =\
title_regex + nl_re +\
"- Your " + loc_range_regex + " | Added on " +\
date_regex + ", " + time_regex + nl_re +\
content_regex + nl_re +\
footer_regex
regex_noauthor = re.compile(regex_noauthor_str)
print regex_noauthor.findall(raw_note)
parse_file("testnotes")
Here is the contents of "testnotes":
Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM
Note content goes here
==========
What I want:
[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',
But when I run the program, I get:
[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]
I'm fairly new to regex, but I feel like this should be fairly straightforward.
When you say " | Added on ", you need to escape the |.
Replace that string with " \| Added on "
You need to escape the | in "- Your " + loc_range_regex + " | Added on " +\
to: "- Your " + loc_range_regex + " \| Added on " +\
| is the OR operator in a regex.
Should anyone need an update to this, the following works with Paperwhite & Voyage Kindles in 2017 : https://gist.github.com/laffan/7b945d256028d2ffaacd4d99be40ca34