How to convert a 'raw' string into a 'decoded' string in Python? - python

I have the following string:
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(raw_text)
#result : The Walt Disney Company, (2006\u2013present)
My question is: how can I get a decoded string decoded_text from raw_text, so that I get
print(decoded_text)
#result : The Walt Disney Company, (2006-present)
other than this trivial way:
decoded_text = raw_text.replace(r"\u2013", "-")
In fact, I have big strings that contain a lot of \u escapes (like \u2013, \u00c9, and so forth), so I'm looking for a way to convert all of them at once, correctly.

You can use the built-in codecs module for this:
import codecs
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(codecs.unicode_escape_decode(raw_text)[0])
output:
The Walt Disney Company, (2006–present)
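As an alternative sketch, the \uXXXX sequences can be replaced one by one with a regular expression. This may be safer when the string already contains non-ASCII characters, since the unicode_escape codec treats its input as Latin-1 and can garble such text. The helper name decode_unicode_escapes is made up for this example, and it assumes only four-digit \u escapes need handling (no surrogate pairs, no \x or \n escapes).

```python
import re

def decode_unicode_escapes(text):
    # Replace each literal backslash + "u" + four hex digits with the
    # character that has that code point; everything else is left untouched.
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

raw_text = r"The Walt Disney Company, (2006\u2013present)"
decoded_text = decode_unicode_escapes(raw_text)
print(decoded_text)
```

Note that the result contains the en dash U+2013, not the ASCII hyphen shown in the question.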

Related

Question on regex not performing as expected

I am trying to normalize company suffixes to a common form, for example Limited and Limiteed both to LTD.
Here is my code:
re.sub(r"\s+?(CORPORATION|CORPORATE|CORPORATIO|CORPORATTION|CORPORATIF|CORPORATI|CORPORA|CORPORATN)", r" CORP", 'ABC CORPORATN')
I'm trying 'ABC CORPORATN' and it's not converting it to CORP. I can't see what the issue is. Any help would be great.
Edit: I have tried the other endings that I included in the regex, and they all work except for CORPORATN (the one I mentioned above).
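The original pattern fails because alternation is tried left to right: CORPORA matches before CORPORATN is reached, so only "CORPORA" is replaced and "TN" is left behind (the result is 'ABC CORPTN').
All the patterns begin with "CORPORA", so we can simply match that prefix plus any trailing word characters:
import re
print(re.sub(r"CORPORA\w*", "CORP", 'ABC CORPORATN'))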
Output:
ABC CORP
The same works for the variants of "Limited"; if they all begin with "Limit", you can do:
import re
print(re.sub(r"Limit\w*", "LTD", 'Shoe Shop Limited.'))
Output:
Shoe Shop LTD.
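If you would rather keep the explicit list of misspellings, one possible sketch is to sort the alternatives longest first and require a word boundary after the suffix, so that a shorter entry such as CORPORA cannot win over CORPORATN. The names SUFFIXES and normalize_suffix are invented for this illustration.

```python
import re

# The alternatives from the question, as a plain list.
SUFFIXES = ["CORPORATION", "CORPORATE", "CORPORATIO", "CORPORATTION",
            "CORPORATIF", "CORPORATI", "CORPORA", "CORPORATN"]

# Longest alternatives first, plus \b so a match must end at a word boundary.
_alternatives = "|".join(sorted(SUFFIXES, key=len, reverse=True))
SUFFIX_RE = re.compile(r"\s+(?:" + _alternatives + r")\b")

def normalize_suffix(name):
    # Replace any listed suffix (and the whitespace before it) with " CORP".
    return SUFFIX_RE.sub(" CORP", name)

print(normalize_suffix("ABC CORPORATN"))
```

Unlike the prefix-based pattern, this only touches the spellings that are actually listed.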

python regex for people names

Hello, I have tried to extract all the names from the following string:
import re
def Find(string):
    url = re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
    return url
string = 'Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone used to run a restaurant with J. Edgar Hoover.'
print(Find(string))
but I have a problem with the output (it doesn't include the "J." before Edgar):
['Arnold Schwarzenegger', 'Sylvester Stalone', 'Edgar Hoover']
Another question for you :)
I tried to extract the URLs from a second string, but I ran into a problem.
I need a regex that returns them without www, http, or https, as in this example:
import re
def Find(string):
    url = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', string)
    return url
string = 'To learn about pros/cons of data science, go to http://datascience.net. Alternatively, go to datascience.net/2020/'
print(Find(string))
output is:
['http://datascience.net.']
thanks
Question 1
Here's a regex that works for that specific case of three names:
((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)
yields
Arnold Schwarzenegger
Sylvester Stalone
J. Edgar Hoover
Question 2
(?:https?://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)
capture group 1 yields
datascience.net
datascience.net/2020/
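Both patterns can be tried from Python as in the sketch below. The function names find_names and find_domains are invented here, and the URL pattern uses explicit [A-Za-z] ranges because [A-z] also matches the punctuation characters that sit between Z and a in ASCII. This is only tuned to the two sample strings from the question, not a general name or URL matcher.

```python
import re

# Optional initial ("J. "), then two capitalized words.
NAME_RE = re.compile(r"(?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+")

# Optional scheme and "www.", then capture domain plus any path segments.
URL_RE = re.compile(
    r"(?:https?://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)"
)

def find_names(text):
    # No capture groups in NAME_RE, so findall returns the full matches.
    return NAME_RE.findall(text)

def find_domains(text):
    # One capture group in URL_RE, so findall returns only that group.
    return URL_RE.findall(text)

names_text = ('Arnold Schwarzenegger was born in Austria. He and Sylvester '
              'Stalone used to run a restaurant with J. Edgar Hoover.')
urls_text = ('To learn about pros/cons of data science, go to '
             'http://datascience.net. Alternatively, go to datascience.net/2020/')

print(find_names(names_text))
print(find_domains(urls_text))
```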

convert API response to json

I get the following response using Flickr API photo.search
jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})
I added ... in the middle because the response is too long to share in full. How can I convert this to a JSON object (a dict that contains a list of dicts under "photo")? If I use json.dumps directly it gets messed up, i.e. it is treated as a single string.
If you know for sure the text will start with "jsonFlickrApi(", you can slice that prefix off and parse the rest. You can compute the variable start from some other starting string if the wrapper differs.
A regular expression is the right tool if you need more advanced matching.
str = r"""jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})"""
import json
start = len("jsonFlickrApi(")
# drop the wrapper prefix and the trailing ")" before parsing
json.loads(str[start:-1])
Update for re
Since it is a response from someone else's API, I would assume the JSON object is valid. The pattern is simple with re too:
str = r"""jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})"""
import re
import json
jsonStr = re.findall(r"{.*}", str)[0]
json.loads(jsonStr)
You can use regex to extract the JSON formatted data and then use json.loads:
import re
import json
text = 'jsonFlickrApi({"photos":[{"id":"A", "title":"Hello"}, {"id":"B", "title":"World"}]})'
result = re.fullmatch(r'[ ]*jsonFlickrApi[ ]*\((.+?)\)[ ]*', text)
print(json.loads(result.group(1)))
Output:
{'photos': [{'id': 'A', 'title': 'Hello'}, {'id': 'B', 'title': 'World'}]}
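You are probably getting the body as a string.
json.parse(response) is the JavaScript call; the Python equivalent is json.loads(response).
Either way, the jsonFlickrApi( ... ) wrapper has to be stripped first, because the wrapped text is not valid JSON by itself.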
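The slicing idea from the answers above can be wrapped in a small helper that also checks the wrapper is really there. This is only a sketch: parse_jsonp and its callback parameter are invented names, and the sample body is a shortened version of the response in the question.

```python
import json

def parse_jsonp(body, callback="jsonFlickrApi"):
    # Strip 'callback(' from the front and ')' from the end, then parse.
    body = body.strip()
    prefix = callback + "("
    if not (body.startswith(prefix) and body.endswith(")")):
        raise ValueError("body is not wrapped in %s(...)" % callback)
    return json.loads(body[len(prefix):-1])

# Shortened sample; the real response contains many more photos.
sample = ('jsonFlickrApi({"photos":{"page":1,"photo":[{"id":"48196265577",'
          '"title":"190702_028.jpg"}]},"stat":"ok"})')

result = parse_jsonp(sample)
print(result["photos"]["photo"][0]["title"])
```

Flickr's API documentation also describes a nojsoncallback=1 request parameter that returns the bare JSON without the wrapper, which may make this step unnecessary.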

Convert string to dataframe, separated by period

I have a string that comes from an article with a few hundred sentences. I want to convert the string to a dataframe, with each sentence as a row. For example,
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I hope it becomes:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a python newbie, this is what I tried:
import pandas as pd
from io import StringIO
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep = ".")
With the code above, all sentences become column names. I actually want them in rows of a single column.
Don't use read_csv. Just split by '.' and use the standard pd.DataFrame:
import pandas as pd
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# strip() drops the space left after each '.', and empty pieces are skipped
data_df = pd.DataFrame([s.strip() for s in data.split('.') if s.strip()],
                       columns=['sentences'])
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
Keep in mind that this will break if any of the sentences contain floating point numbers. In that case you will need to change the format of your string (e.g. use '\n' instead of '.' to separate sentences).
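This is a quick solution, but it solves your issue (read_csv expects a path or file-like object, so the string is wrapped in StringIO):
from io import StringIO
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T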
You can achieve this via a list comprehension:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# rstrip('.') first, otherwise the last sentence ends up with two periods
df = pd.DataFrame({'sentence': [i + '.' for i in data.rstrip('.').split('. ')]})
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
What you are trying to do is called tokenizing sentences. The easiest way would be to use a Text-Mining library such as NLTK for it:
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
Otherwise you could simply try something like:
pd.DataFrame(data.split('. '))
However, this will fail if you run into sentences like this:
problem = 'Tim likes to jump... but not always!'
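A middle ground between a plain split and a full tokenizer is to split on whitespace that follows '.', '!' or '?'. The sketch below is only an approximation (split_sentences is an invented name): it keeps the punctuation and leaves numbers like 3.5 alone, but it still cuts after an ellipsis, and it would also cut after abbreviations.

```python
import re

def split_sentences(text):
    # Split wherever whitespace is preceded by '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
sentences = split_sentences(data)
print(sentences)

# The ellipsis case is still split in two pieces.
print(split_sentences('Tim likes to jump... but not always!'))
```

The resulting list can be passed to pd.DataFrame(sentences, columns=['sentence']) to get one sentence per row.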

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use regular expressions if needed; if there is any other more efficient and faster way, please point it out.
Also, I want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
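Applied to the string from the question, the same no-regex idea might look like the sketch below. It assumes, as in the sample, that the year is the last whitespace-separated token; the variable names are chosen for this example only.

```python
col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"

# partition() splits at the first occurrence of the keyword and returns
# (text before, keyword, text after); sep is empty if the keyword is missing.
title, sep, rest = col4.partition(" CiteSeerX")

# Take the last token after the keyword as the year, if the keyword was found.
year = rest.split()[-1] if sep and rest.split() else None

print(title)
print(year)
```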
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match: unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
Update:
For the case where there is metadata following the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    # title, then two dates after CiteSeerX, then the year we want to capture
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match: return the (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
