Translate unicode characters in a string to ISO 8859-1 Characters - python
I have the following string:
I want to transform this string into the following JSON-object:
{"h":[{"id":"242611","minute":"2","result":"MissedShots","X":"0.9359999847412109","Y":"0.534000015258789","xG":"0.1072189137339592","player":"Ari","h_a":"h","player_id":"2930","situation":"OpenPlay","season":"2018","shotType":"Head","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":"Wanderson","lastAction":"Chipped"},{"id":"242612","minute":"4","result":"SavedShot","X":"0.8059999847412109","Y":"0.7069999694824218","xG":"0.021672379225492477","player":"Cristian Ramirez","h_a":"h","player_id":"5477","situation":"OpenPlay","season":"2018","shotType":"LeftFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":null,"lastAction":"None"},{"id":"242613","minute":"4","result":"SavedShot","X":"0.7780000305175782","Y":"0.505","xG":"0.023817993700504303","player":"Mauricio Pereyra","h_a":"h","player_id":"2922","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":"Viktor Claesson","lastAction":"Pass"},{"id":"242614","minute":"17","result":"MissedShots","X":"0.9330000305175781","Y":"0.41","xG":"0.01863950863480568","player":"Ari","h_a":"h","player_id":"2930","situation":"FromCorner","season":"2018","shotType":"Head","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":"Mauricio Pereyra","lastAction":"Aerial"},{"id":"242617","minute":"21","result":"SavedShot","X":"0.710999984741211","Y":"0.534000015258789","xG":"0.015956614166498184","player":"Ivan Ignatyev","h_a":"h","player_id":"6025","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 
11:00:00","player_assisted":"Ari","lastAction":"Pass"},{"id":"242621","minute":"31","result":"MissedShots","X":"0.7959999847412109","Y":"0.4640000152587891","xG":"0.03898102045059204","player":"Viktor Claesson","h_a":"h","player_id":"5478","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":null,"lastAction":"None"},{"id":"242622","minute":"36","result":"MissedShots","X":"0.759000015258789","Y":"0.3509999847412109","xG":"0.05237437039613724","player":"Mauricio Pereyra","h_a":"h","player_id":"2922","situation":"DirectFreekick","season":"2018","shotType":"LeftFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":null,"lastAction":"Standard"},{"id":"242624","minute":"42","result":"BlockedShot","X":"0.919000015258789","Y":"0.37","xG":"0.10843519121408463","player":"Sergei Petrov","h_a":"h","player_id":"2920","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":"Viktor Claesson","lastAction":"Pass"},{"id":"242625","minute":"48","result":"MissedShots","X":"0.7719999694824219","Y":"0.385","xG":"0.023656079545617104","player":"Aleksandr Martynovich","h_a":"h","player_id":"2790","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural","h_goals":"2","a_goals":"0","date":"2018-12-02 11:00:00","player_assisted":"Yuri Gazinskiy","lastAction":"Pass"},{"id":"242626","minute":"49","result":"MissedShots","X":"0.715999984741211","Y":"0.4879999923706055","xG":"0.013118931092321873","player":"Yuri Gazinskiy","h_a":"h","player_id":"2929","situation":"OpenPlay","season":"2018","shotType":"RightFoot","match_id":"9071","h_team":"FC Krasnodar","a_team":"Ural"}]}
With the help of a table, I worked out what each of the following Unicode characters represents:
\x7B = {
\x22 = "
\x3A = :
\x5B = [
\x7D = }
\x5D = ]
As a result, I have written the following script:
import json

string = '\x7B\x22h\x22\x3A\....\\x22\x22\x7D\x5D\x7D'
cleaned_string = string.replace("\x7B","{")
cleaned_string = cleaned_string.replace('\x22', '"')
cleaned_string = cleaned_string.replace("\x3A", ":")
cleaned_string = cleaned_string.replace("\x5B", "[")
cleaned_string = cleaned_string.replace("\x7D", "}")
cleaned_string = cleaned_string.replace("\x5D", "]")
s = json.loads(cleaned_string)
However, the problem arises in the player and player_assisted fields. French or Spanish names in particular can contain many different special characters. Currently, my script only transforms the "standard" characters. In theory, there are many other characters that would need to be transformed. How can I do this easily?
Thanks in advance,
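One way to avoid listing every character by hand is to let Python's unicode_escape codec decode all the escapes in one pass. This is a minimal sketch, assuming the scraped string really contains literal backslash escapes of the form \xNN; the sample value and the names raw, decoded and data are made up for illustration. Each \xNN escape maps to the code point with that number, which for values up to FF coincides with ISO 8859-1.

```python
import json

# Hypothetical sample: a raw string, so every \xNN below is a literal
# backslash sequence, as it would be when scraped from a page.
raw = r'\x7B\x22player\x22\x3A\x22Andr\xE9 Silva\x22\x7D'

# unicode_escape works on bytes, so encode first, then decode the escapes.
decoded = raw.encode('latin-1').decode('unicode_escape')

data = json.loads(decoded)
print(decoded)
print(data['player'])
```

This relies on everything outside the escapes being plain ASCII or Latin-1 text; the encode step would fail on other characters.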
How to indicate a raw string with regex if my pattern comes from another string?
I have a CSV table from which I get my regex pattern, e.g. \bconden

Problem: I can't manage to tell Python that this is a raw string. How do I put r before a pattern when it comes from a string?

import re
a = 'de la matière condensée'
fromcsv = '\bconden'
print(re.search('r' + fromcsv, a))

The result is None.
You can use the str_to_raw function below to make a raw string out of an already declared plain string variable:

import re

a = 'de la matière condensée'
pattern = '\bconden'

escape_dict = {
    '\a': r'\a', '\b': r'\b', '\c': r'\c', '\f': r'\f',
    '\n': r'\n', '\r': r'\r', '\t': r'\t', '\v': r'\v',
    '\'': r'\'', '\"': r'\"',
    '\0': r'\0', '\1': r'\1', '\2': r'\2', '\3': r'\3', '\4': r'\4',
    '\5': r'\5', '\6': r'\6', '\7': r'\7', '\8': r'\8', '\9': r'\9'
}

def str_to_raw(s):
    # Map each interpreted control character back to its two-character escape.
    return r''.join(escape_dict.get(c, c) for c in s)

print(re.search(r'\bconden', a))
print(re.search(str_to_raw(pattern), a))

Output:

<re.Match object; span=(14, 20), match='conden'>
<re.Match object; span=(14, 20), match='conden'>

Note: I got escape_dict from this page.
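As a side note, a pattern that really is read from a CSV file usually needs no conversion at all. The file stores the backslash and the letter b as two separate characters; only a Python string literal such as '\bconden' turns them into a backspace. A minimal sketch, where the in-memory file fake_csv is a hypothetical stand-in for the real table:

```python
import csv
import io
import re

a = 'de la matière condensée'

# Hypothetical stand-in for the CSV file: the cell holds the characters
# backslash, b, c, o, n, d, e, n exactly as they would appear in the file.
fake_csv = io.StringIO('\\bconden\n')

# Read the first cell of the first row; the backslash is kept as-is.
fromcsv = next(csv.reader(fake_csv))[0]

match = re.search(fromcsv, a)
print(repr(fromcsv))
print(match)
```

The r prefix only affects how source code literals are parsed, so there is nothing to "add" to a string that was read from a file.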
How to replace some special characters from user input for different Python platforms
I need to replace some special characters from user input for different platforms (i.e. Linux and Windows) using Python. Here is my code:

if request.method == 'POST':
    rname1 = request.POST.get('react')

Here I am getting the user input via the POST method. I need to remove the following characters from the user input (if there are any).

1- Escape or filter special characters for Windows: ( ) < > * ‘ = ? ; [ ] ^ ~ ! . ” % # / \ : + , `
2- Escape or filter special characters for Linux: { } ( ) < > * ‘ = ? ; [ ] $ – # ~ ! . ” % / \ : + , `

The special characters are given above. I need to remove them for both Linux and Windows.
Python strings have a built-in method, translate, for substitution/deletion of characters. You need to build a translation table and then call the method.

import sys

# startswith avoids a false positive on macOS, where sys.platform is 'darwin'
if sys.platform.startswith("win"):
    special = r"""( ) < > * ‘ = ? ; [ ] ^ ~ ! . ” % # / \ : + , `""".split()
else:
    special = r"""{ } ( ) < > * ‘ = ? ; [ ] $ – # ~ ! . ” % / \ : + , `""".split()

# Mapping a character to None deletes it
trans_dict = {character: None for character in special}
trans_table = str.maketrans(trans_dict)
print("Lo+=r?e~~m ipsum dol;or sit!! amet, consectet..ur ad%".translate(trans_table))

This will print Lorem ipsum dolor sit amet consectetur ad.

If you want to use a replacement character instead of deleting, replace None above with that character. You can also build a translation table with specific substitutions, e.g. {"a": "m", "b": "n", ...}.

Edit: The above snippet is indeed for Python 3. In Python 2 (TiO) it's easier to delete characters:

>>> import sys
>>> import string
>>> if sys.platform.startswith("win"):
...     special = r"""()<>*'=?;[]^~!%#/\:=,`"""
... else:
...     special = r"""{}()<>*'=?;[]$-#~!."%/\:+"""
...
>>> s = "Lo+r?e~~/\#<>m ips()u;m"
>>> string.translate(s, None, special)
'Lorem ipsum'

Note that I've substituted ‘ with ' and ” with " because I think you're only dealing with ASCII strings.
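Since the question asks to remove the characters for both platforms regardless of where the code runs, the two lists can also be merged into a single deletion table. The following is only a sketch: the names WINDOWS_SPECIAL, LINUX_SPECIAL, DELETE_TABLE and strip_special are hypothetical, and the typographic quotes and dash from the question are assumed to stand for their ASCII counterparts.

```python
# Characters from the two lists in the question, with the typographic
# quotes and dash replaced by ASCII ones (an assumption about the input).
WINDOWS_SPECIAL = r"""()<>*'=?;[]^~!."%#/\:+,`"""
LINUX_SPECIAL = r"""{}()<>*'=?;[]$-#~!."%/\:+,`"""

# Union of both lists: anything special on either platform is removed.
ALL_SPECIAL = "".join(sorted(set(WINDOWS_SPECIAL + LINUX_SPECIAL)))

# The three-argument form of str.maketrans deletes every character
# listed in its third argument.
DELETE_TABLE = str.maketrans("", "", ALL_SPECIAL)


def strip_special(text):
    """Return text with all listed special characters removed."""
    return text.translate(DELETE_TABLE)


cleaned = strip_special("Lo+=r?e~~m ipsum dol;or sit!! amet, consectet..ur ad%")
print(cleaned)
```

Building the table once at import time keeps the per-request work down to a single translate call on the POST value.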
prevent pandas from removing spaces in numbers in text columns
I'm trying to load a CSV file into a pandas dataframe. The CSV is semicolon-delimited. Values in text columns are in double quotation marks.

File in question:

In one of the text columns ('TYTUL') I have the following value:

"00 307 1457 212"

I specify the column as str, but when I print or export the results to Excel I get

003071457212

instead of

00 307 1457 212

How do I prevent pandas from removing the spaces? Here is my code:

import pandas

df = pandas.read_csv(r'file_01.csv'
    ,sep = ';'
    ,quotechar = '"'
    ,names = ['DATA_OPERACJI'
        ,'DATA_KSIEGOWANIA'
        ,'OPIS_OPERACJI'
        ,'TYTUL'
        ,'NADAWCA_ODBIORCA'
        ,'NUMER_KONTA'
        ,'KWOTA'
        ,'SALDO_PO_OPERACJI'
        ,'KOLUMNA_9']
    ,usecols = [0,1,2,3,4,5,6,7]
    ,skiprows = 38
    ,skipfooter = 3
    ,encoding = 'cp1250'
    ,thousands = ' '
    ,decimal = ','
    ,parse_dates = [0,1]
    ,converters = {'OPIS_OPERACJI': str
        ,'TYTUL': str
        ,'NADAWCA_ODBIORCA': str
        ,'NUMER_KONTA': str}
    ,engine = 'python'
    )

df.TYTUL.replace([' +', '^ +', ' +$'], [' ', '', ''],regex=True,inplace=True) #this only removes excessive spaces
print(df.TYTUL)

I also came up with a workaround (marked with the comment #workaround), but I would like to ask if there is a better way.

import pandas

df = pandas.read_csv(r'file_01.csv'
    ,sep = ';'
    ,quotechar = '?' #workaround
    ,names = ['DATA_OPERACJI'
        ,'DATA_KSIEGOWANIA'
        ,'OPIS_OPERACJI'
        ,'TYTUL'
        ,'NADAWCA_ODBIORCA'
        ,'NUMER_KONTA'
        ,'KWOTA'
        ,'SALDO_PO_OPERACJI'
        ,'KOLUMNA_9']
    ,usecols = [0,1,2,3,4,5,6,7]
    ,skiprows = 38
    ,skipfooter = 3
    ,encoding = 'cp1250'
    ,thousands = ' '
    ,decimal = ','
    ,parse_dates = [0,1]
    ,converters = {'OPIS_OPERACJI': str
        ,'TYTUL': str
        ,'NADAWCA_ODBIORCA': str
        ,'NUMER_KONTA': str}
    ,engine = 'python'
    )

df.TYTUL.replace([' +', '^ +', ' +$'], [' ', '', ''],regex=True,inplace=True) #this only removes excessive spaces
df.TYTUL.replace(['^"', '"$'], ['', ''],regex=True,inplace=True) #workaround
print(df.TYTUL)
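Remove this line from your read_csv call:

,thousands = ' '

That option tells the parser to treat a space as a thousands separator, so spaces inside values that look like numbers are stripped. I tested it, and without this option the output is correct: '00 307 1457 212'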
Python regex sub confusion
There are four keywords: title, blog, tags, state

Excess keyword occurrences are being removed from their respective matches. Example: blog: blog state title tags and returns state title tags and instead of blog state title tags and

The sub function should be matching .+ after it sees blog:, so I don't know why it treats blog as an exception to .+

Regex:

re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)

Code:

def n15():
    import re
    a = """blog: blog: fooblog
state: private
title: this is atitle bun
and text"""
    kwargs = {}
    def matcher(string):
        v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '')
        if string.group(3) == 'title':
            kwargs['title'] = v
        elif string.group(3) == 'blog':
            kwargs['blog_url'] = v
        elif string.group(3) == 'tags':
            kwargs['comma_separated_tags'] = v
        elif string.group(3) == 'state':
            kwargs['post_state'] = v
        return ''
    a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
    a = a.replace('\n', '<br />')
    a = a.replace('\r', '')
    a = a.replace('"', r'\"')
    a = '<p>' + a + '</p>'
    kwargs['body'] = a
    print(kwargs)

Output:

{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'}

Edit: Desired Output:

{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
replace(string.group(3), '') is replacing all occurrences of 'blog' with ''. Rather than trying to strip all the other parts out of the matched string, which is hard to get right, I suggest capturing the string you actually want in the original match:

r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'

This has () around the .+ to capture that part of the string; then use v = string.group(5) at the start of matcher.
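Put together, the suggestion could look like the sketch below. It is written for Python 3 and uses a shortened sample input; the names text, KEYS and body are introduced here for illustration and are not from the question.

```python
import re

# Shortened sample input (an assumption, not the asker's exact text).
text = "blog: fooblog\nstate: private\ntitle: this is atitle bun\nand text"

# Keyword in the text -> key expected in kwargs.
KEYS = {
    'title': 'title',
    'blog': 'blog_url',
    'tags': 'comma_separated_tags',
    'state': 'post_state',
}

kwargs = {}


def matcher(match):
    # Group 3 is the keyword, group 5 is the captured value after "keyword: ".
    kwargs[KEYS[match.group(3)]] = match.group(5)
    return ''  # drop the whole keyword line from the body


body = re.sub(
    r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))',
    matcher,
    text,
)
kwargs['body'] = '<p>' + body.replace('\n', '<br />') + '</p>'
print(kwargs)
```

Because the value comes straight from its own capture group, a keyword that also appears inside the value (as in fooblog or atitle) is left untouched.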
Using regex to parse kindle "My Clippings.txt" file
I am currently trying to use Python to parse the notes file for my Kindle so that I can keep them more organized than the chronologically ordered list that the Kindle automatically saves notes in. Unfortunately, I'm having trouble using regex to parse the file. Here's my code so far:

import re

def parse_file(in_file):
    read_file = open(in_file, 'r')
    file_lines = read_file.readlines()
    read_file.close()
    raw_note = "".join(file_lines)

    # Regex parts
    title_regex = "(.+)"
    title_author_regex = "(.+) \((.+)\)"
    loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
    loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"
    date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)" # Date
    time_regex = "([0-9]+):([0-9]+) (AM|PM)" # Time
    content_regex = "(.*)"
    footer_regex = "=+"
    nl_re = "\r*\n"

    # No author
    regex_noauthor_str =\
        title_regex + nl_re +\
        "- Your " + loc_range_regex + " | Added on " +\
        date_regex + ", " + time_regex + nl_re +\
        content_regex + nl_re +\
        footer_regex
    regex_noauthor = re.compile(regex_noauthor_str)

    print(regex_noauthor.findall(raw_note))

parse_file("testnotes")

Here are the contents of "testnotes":

Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM
Note content goes here
==========

What I want:

[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',

But when I run the program, I get:

[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]

I'm fairly new to regex, but I feel like this should be fairly straightforward.
When you say " | Added on ", you need to escape the |. Replace that string with " \| Added on "
You need to escape the | in

"- Your " + loc_range_regex + " | Added on " +\

to:

"- Your " + loc_range_regex + " \| Added on " +\

| is the OR operator in a regex.
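For reference, here is a small self-contained sketch of the pattern with the escaped pipe, run against an in-memory copy of the sample note instead of a file. Raw strings are used for the regex parts, and the name sample is introduced here for illustration.

```python
import re

# Regex parts from the question, written as raw strings.
title_regex = r"(.+)"
loc_range_regex = r"(.+) (Location|on Page) ([0-9]+)-([0-9]+)"
date_regex = r"([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)"
time_regex = r"([0-9]+):([0-9]+) (AM|PM)"
content_regex = r"(.*)"
footer_regex = r"=+"
nl_re = r"\r*\n"

# The only change from the question: the pipe is escaped as \| so it
# matches a literal "|" instead of acting as the OR operator.
regex_noauthor = re.compile(
    title_regex + nl_re
    + "- Your " + loc_range_regex + r" \| Added on "
    + date_regex + ", " + time_regex + nl_re
    + content_regex + nl_re
    + footer_regex
)

# In-memory copy of the sample note.
sample = (
    "Title\n"
    "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM\n"
    "Note content goes here\n"
    "==========\n"
)

matches = regex_noauthor.findall(sample)
print(matches)
```

With the unescaped pipe, the pattern splits into two alternatives and the first one stops right after the location range, which is why the later groups came back empty.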
Should anyone need an update to this, the following works with Paperwhite & Voyage Kindles in 2017 :