I get the following response using the Flickr API photos.search method:
jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})
I added "..." in the middle because the response is too long to share in full. Anyway, how can I convert this to a JSON object (a dict that contains a list of dicts under "photo")? If I use json.dumps directly it gets messed up, i.e. it is treated as a single string.
If you know for sure the text will start with "jsonFlickrApi(", you can slice that prefix off and parse the rest. You can change the start variable to some other starting string.
A regular expression is the right tool if you need more advanced matching.
str = r"""jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})"""
start = len("jsonFlickrApi(");
json.loads(str[start: -1]);
Update: using re
Since it is a response from someone else's API, I would assume the JSON object is valid. The pattern is simple with re too:
str = r"""jsonFlickrApi({"photos":{"page":1,"pages":3786,"perpage":100,"total":"378562","photo":[{"id":"48197008707","owner":"22430972#N05","secret":"36b279092c","server":"65535","farm":66,"title":"Callum and our Cat friend, 5th June 2019.","ispublic":1,"isfriend":0,"isfamily":0},{"id":"48196846446","owner":"156701458#N02","secret":"d650bc4c35","server":"65535","farm":66,"title":"\u2606 Post Nr. 294 SENSE \u2013 Celestinas Kids, Parke Ave. & Posh and Tm:.Creation \u2606","ispublic":1,"isfriend":0,"isfamily":0}...{"id":"48196265577","owner":"61762095#N08","secret":"db8d31c2b2","server":"65535","farm":66,"title":"190702_028.jpg","ispublic":1,"isfriend":0,"isfamily":0}]},"stat":"ok"})"""
jsonStr = re.findall("{.*}",str)[0]
json.loads(jsonStr)
You can use a regex to extract the JSON-formatted data and then use json.loads:
import re
import json
text = 'jsonFlickrApi({"photos":[{"id":"A", "title":"Hello"}, {"id":"B", "title":"World"}]})'
result = re.fullmatch(r'[ ]*jsonFlickrApi[ ]*\((.+?)\)[ ]*', text)
print(json.loads(result.group(1)))
Output:
{'photos': [{'id': 'A', 'title': 'Hello'}, {'id': 'B', 'title': 'World'}]}
You must be getting the body as a string. Use json.loads(response) (there is no json.parse in Python) after stripping the jsonFlickrApi( ... ) wrapper, and it should solve the problem.
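For example, here is a minimal sketch of that idea, assuming response holds the wrapped body text (shortened here):

import json

# `response` stands in for the body text from the question (shortened)
response = 'jsonFlickrApi({"photos": {"page": 1, "photo": []}, "stat": "ok"})'

# drop the "jsonFlickrApi(" prefix and the trailing ")", then parse the JSON
payload = response[len("jsonFlickrApi("):-1]
data = json.loads(payload)
print(data["stat"])  # ok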
I have a pandas dataframe which contains a column of Twitter profile descriptions. In some of these descriptions there are strings like 'insta: profile_name'.
How can I write a line of code that searches for a string (e.g. 'insta:' or 'instagram:') and returns whatever comes after it?
1252: 'lad who loves to cook 🥘 • insta: xxx',
1254: 'founder and head chef | insta: xxx |',
1992: '🇬🇧 |bakery instagram - xxx',
2291: 'insta: #xxx for enquiries'
2336: 'self taught baker. ig:// xxxx 🍥🧆',
You can use a regex to match each of the keywords, such as insta.
The code should be something like this:
import re

container = []
for word in ["insta", "face"]:  # your list of keywords
    # 'Regex Syntax' is a placeholder: replace it with a pattern that matches
    # whatever follows the keyword in your descriptions
    _tag = re.findall(word + 'Regex Syntax', the_string_to_find_from)
    container.append([word, _tag])
Then you can unpack the resulting container variable when you want the results. I can help you write the regex pattern, but I need more information on how the required information is wrapped in the text.
Answer provided by Nk03 in the comments:
df['name'].str.extract(pat = r'(insta:|ig:)(.*)')[1].str.strip('\',')
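For example, a small sketch of how that one-liner behaves on data shaped like the descriptions above (the column name and handles are made up):

import pandas as pd

# toy frame modelled on the descriptions above; the handles are hypothetical
df = pd.DataFrame({'name': ['lad who loves to cook • insta: chef_handle',
                            'founder and head chef | ig: bake_daily']})

handles = df['name'].str.extract(pat=r'(insta:|ig:)(.*)')[1].str.strip('\',')
print(handles.tolist())
# [' chef_handle', ' bake_daily']

The second capture group keeps the leading space after the colon, so you may want to chain an extra .str.strip() if you need the bare handle.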
I have the following string:
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(raw_text)
#result : The Walt Disney Company, (2006\u2013present)
My question is: how can I get a decoded string decoded_text from raw_text so that I get
print(decoded_text)
#result : The Walt Disney Company, (2006-present)
except this trivial way:
decoded_text = raw_text.replace("\u2013", "-")
In fact, I have big strings which contain a lot of \u escapes (like \u2013, \u00c9, and so forth), so I'm looking for a way to convert all of them at once in the right way.
You might use the built-in codecs module for this task, as follows:
import codecs
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(codecs.unicode_escape_decode(raw_text)[0])
output:
The Walt Disney Company, (2006–present)
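For what it's worth, the same decoding can also be done with the unicode_escape codec via an encode/decode round trip; this sketch assumes the escaped text itself only contains latin-1 characters:

raw_text = r"The Walt Disney Company, (2006\u2013present)"

# encode to bytes first (latin-1 keeps code points 0-255 intact), then let the
# unicode_escape codec expand every \uXXXX sequence in one pass
decoded_text = raw_text.encode('latin-1').decode('unicode_escape')
print(decoded_text)  # The Walt Disney Company, (2006–present)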
Hello, I have tried to extract all the names from the following string:
import re
def Find(string):
url = re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
return url
string = 'Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone used to run a restaurant with J. Edgar Hoover.'
print(Find(string))
but I have got a problem with the output (it doesn't print the 'J.' in 'J. Edgar Hoover'):
['Arnold Schwarzenegger', 'Sylvester Stalone', 'Edgar Hoover']
Another question for you :)
I have tried to extract the second URL as well, but I get a problem.
I need to write a regex that prints the URLs without www, http, or https, like in the example:
import re
def Find(string):
    url = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', string)
    return url
string = 'To learn about pros/cons of data science, go to http://datascience.net. Alternatively, go to datascience.net/2020/'
print(Find(string))
output is:
['http://datascience.net.']
thanks
Question 1
Here's a regex that works for that specific case of three names:
((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)
yields
Arnold Schwarzenegger
Sylvester Stalone
J. Edgar Hoover
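As a quick check, dropping that pattern into re.findall (or into the original Find function) gives:

import re

string = ('Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone '
          'used to run a restaurant with J. Edgar Hoover.')

# an optional leading initial such as "J. ", followed by two capitalised words
names = re.findall(r'((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)', string)
print(names)
# ['Arnold Schwarzenegger', 'Sylvester Stalone', 'J. Edgar Hoover']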
Question 2
(?:http)?s?(?:://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)
yields
http://datascience.net
datascience.net/2020/
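Note that with re.findall the capturing group is what gets returned, so the scheme and www prefix are already dropped, which is what the question asked for:

import re

string = ('To learn about pros/cons of data science, go to '
          'http://datascience.net. Alternatively, go to datascience.net/2020/')

# everything before the capturing group is optional and stays outside the result
pattern = r'(?:http)?s?(?:://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)'
print(re.findall(pattern, string))
# ['datascience.net', 'datascience.net/2020/']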
I'd like to use part of a string ('project') that is returned from an API. The string looks like this:
{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
I'd like to store the 'LS003942_EP... ' part in a new variable called foldername. I thought a good way would be to use a regex to find the text after Title. Here's my code:
orders = api.get_all(view='Folder', fields='Project Title', maxRecords=1)
for new in orders:
    print("Found 1 new project")
    print(new['fields'])
    project = new['fields']
    s = re.search('Title(.+?)', project)
    if s:
        foldername = s.group(1)
        print(foldername)
This gives me an error -
TypeError: expected string or bytes-like object.
I'm hoping for foldername = 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'
You can use ast.literal_eval to safely evaluate a string containing a Python literal:
import ast
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(ast.literal_eval(s)['Project Title'])
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000
It seems (to me) that you have a dictionary and not a string. Considering this case, you may try:
s = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
print(s['Project Title'])
If you have time, take a look at dictionaries.
I don't think you need a regex here:
string = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
foldername = string[string.index(":") + 3: len(string) - 2]
Essentially, I'm finding the position of the first colon, then adding 3 to skip the colon, the space, and the opening apostrophe, which gives the starting index of your foldername. I then use index slicing to take everything from that index up to, but not including, the second-last character, which drops the closing apostrophe and the brace.
However, if your string is always going to be in the form of a valid Python dict, you could simply do foldername = list(eval(string).values())[0]. Here, I'm treating your string as a dict and am getting the first value from it, which is your desired foldername. But, as #AKX notes in the comments, eval() isn't safe, as somebody could pass malicious code as a string. Unless you're absolutely sure your input strings will never contain code, it's best to use ast.literal_eval(), as it only evaluates literals.
But, as #MaximilianPeters notes in the comments, your response looks like valid JSON, so you could easily parse it using json.loads().
You could try this pattern: (?<='Project Title': )[^}]+.
Explanation: it uses a positive lookbehind to assert that the match occurs after 'Project Title': . Then it matches until } is encountered: [^}]+.
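A short sketch of that pattern in action; note the match still carries the surrounding quotes, which can be stripped afterwards:

import re

s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
m = re.search(r"(?<='Project Title': )[^}]+", s)
foldername = m.group(0).strip("'")
print(foldername)
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000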
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches from the above website. Now I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character buffer object. It's a beginner question, I know, but how can I get around this?
Since you are using BeautifulSoup already, you can directly use obama_4427_div.text instead of str(obama_4427_div) to get the correctly formatted text. The text you get would then not contain any residual HTML elements, etc.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then call str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char, '')
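If the list of tags grows, an alternative is a single regex pass that removes them all at once; a sketch reusing the same remove_char list from above:

import re

# build one alternation of escaped literals and strip every tag in one sub() call
pattern = '|'.join(re.escape(char) for char in remove_char)
obama_4427_str = re.sub(pattern, '', obama_4427_str)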