I've got a series of malformed JSON data. I need to use regex to extract the data I want from it, then use regex again to isolate a specific part of the result, i.e. the main category (in the example below it's 'games').
Part 1 works, the second part does not.
I've limited experience with Python and next to no experience with regex.
The final output I'm after is: games
I'm getting the error:
ValueError: pattern contains no capture groups
The series of data contains information formatted like this:
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}
The Python calls I'm using are these:
First I extract the slug from the JSON:
ksdata.cat_slug_raw = ksdata.category.str.extract('\"slug\"\:\"(.+?)\"', expand=False)
Then I try to keep only the part before the /:
ksdata.cat_slug = ksdata.cat_slug_raw.str.extract('^[^/]+(?=/)', expand=False)
I'd really appreciate some help with where I'm going wrong... and if you think my solution as a whole sucks, please tell me :)
You can use ast.literal_eval:
s = '{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}'
import ast
final_data = ast.literal_eval(s)
Output:
{'name': 'Playing Cards', 'color': 51627, 'slug': 'games/playing cards', 'parent_id': 12, 'urls': {'web': {'discover': 'http://www.kickstarter.com/discover/categories/games/playing%20cards'}}, 'position': 4, 'id': 273}
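If the end goal is just the main category (games, per the question), the slug value of the parsed dict can be split on the first /, for example:

main_category = final_data['slug'].split('/')[0]
print(main_category)  # games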
Based on an amended suggestion from TomSitter, I used:
ksdata.cat_slug_raw.str.split('/').str[0]
This was the simplest way to get around it.
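Put together, a minimal sketch of the whole flow (assuming ksdata is a pandas DataFrame whose category column holds the raw strings, as in the question; the one-row frame below is just a stand-in):

import pandas as pd

# Hypothetical one-row frame standing in for the real data.
ksdata = pd.DataFrame({'category': ['{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}},"color":51627,"parent_id":12,"name":"Playing Cards","id":273,"position":4,"slug":"games/playing cards"}']})

# Extract the slug, then keep only the part before the first '/'.
ksdata['cat_slug_raw'] = ksdata.category.str.extract(r'"slug":"(.+?)"', expand=False)
ksdata['cat_slug'] = ksdata.cat_slug_raw.str.split('/').str[0]
print(ksdata.cat_slug.iloc[0])  # games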
Related
I'm having a hard time trying to extract an ID number from a string.
I could get it using an index, but that would fail for the other rows of the data-frame.
How do I extract campaignid=351154190 in such a way that it works for all rows?
The only pattern is the word campaignid; I need to extract the value and store it in a new column of the data-frame. Performance is not crucial in this task.
Original string
https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension
Splitting the string:
x = cw.captureurl.str.split('&').str[:-1]
Printing one row:
print(x[25])
['https:_utm_source=googlebrand', 'utm_medium=ppc', 'utm_campaign=brand',
 'utm_campaignid=35119190', 'keyword=co',
 'utm_matchtype=e', 'device=m', 'utm_network=g', 'utm_adposition=1t1',
 'geo=9027258',
 'gclid=CjwKCAjwnMTqBRAzEiwAEF3ndo3-CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
 'affiliate_id=CjwKCAjwnMTqBRAzEiwAEF3ndo3-CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
 'utm_content=search', 'utm_contentid=1211732930', 'placement']
It would be great if I could use something that searches for the word "campaignid" (which is my target)
and then stores the value in another column of the same data-frame.
I tried doing a split after a split; it didn't work.
I also tried using a for loop with an if statement; that didn't work either.
Use regex:
campaign_id = cw['captureurl'].str.extract(r'campaignid=(\d+)')[0]
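For example, to store it in a new column (a sketch assuming cw is a DataFrame with a captureurl column, as in the question; the one-row frame below is hypothetical):

import pandas as pd

# Hypothetical single-row frame standing in for the real data.
cw = pd.DataFrame({'captureurl': ['https:_utm_source=googlebrand&utm_medium=ppc&utm_campaignid=351154190&keyword=co']})

cw['campaign_id'] = cw['captureurl'].str.extract(r'campaignid=(\d+)')[0]
print(cw['campaign_id'].iloc[0])  # 351154190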
I'd recommend using urllib. In particular, the parse_qs function will give you a dictionary of the query-string arguments: https://docs.python.org/3/library/urllib.parse.html
Using your example URL we get:
from urllib.parse import parse_qs
test = 'https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension'
print(parse_qs(test))
{'https:_utm_source': ['googlebrand'],
'utm_medium': ['ppc'],
'utm_campaign': ['brand'],
'utm_campaignid': ['351154190'],
'keyword': ['aihdisadjiajdutm_matchtype=e'],
'device': ['m'],
'utm_network': ['g'],
'utm_adposition': ['1t1'],
'geo': ['9027258'],
'gclid': ['CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE'],
'affiliate_id': ['asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE'],
'utm_content': ['search'],
'utm_contentid': ['1251489456158180']}
To get the campaign IDs for the entire dataframe, we can use .apply:
# After parsing each url's arguments, we extract the first campaignid from the dictionary's list.
df['utm_campaignid'] = df['url'].apply(lambda x: parse_qs(x)['utm_campaignid'][0])
df.head()
url utm_campaignid
0 https:_utm_source=googlebrand&utm_medium=ppc&u... 351154190
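If some rows might not contain utm_campaignid at all (an assumption; the sample row does have it), a slightly more defensive variant avoids a KeyError:

# .get falls back to [None] when the key is missing, so the apply never raises.
df['utm_campaignid'] = df['url'].apply(lambda x: parse_qs(x).get('utm_campaignid', [None])[0])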
How can I get various words from a string (URL) in Python?
From a URL like:
http://www.sample.com/level1/level2/index.html?id=1234
I want to get words like:
http, www, sample, com, level1, level2, index, html, id, 1234
Any solutions using Python?
Thanks.
This is how you can do it for any URL:
import re
def getWordsFromURL(url):
    # Split on runs of separator characters; note the '.' in the class so that
    # www.sample.com and index.html are split as in the desired output.
    return re.compile(r'[.:/?=\-&]+', re.UNICODE).split(url)
Now you can use it as:
url = "http://www.sample.com/level1/level2/index.html?id=1234"
words = getWordsFromURL(url)
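For the URL in the question, that should give the word list you asked for:

print(words)
# ['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']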
Just regex-split on the longest runs of non-alphanumeric characters:
import re
l = re.split(r"\W+","http://www.sample.com/level1/level2/index.html?id=1234")
print(l)
yields:
['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']
This is simple, but as someone noted, it doesn't behave well when there are _, -, ... in URL names. So the less fun solution is to list all the possible separator characters between path parts:
l = re.split(r"[/:\.?=&]+","http://stackoverflow.com/questions/41935748/splitting-a-string-url-into-words-using-python")
(I admit that I may have forgotten some separation symbols)
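For reference, on the Stack Overflow URL above that split would give something like this (hyphens are not in the class, so the question's slug stays in one piece):

print(l)
# ['http', 'stackoverflow', 'com', 'questions', '41935748',
#  'splitting-a-string-url-into-words-using-python']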
I'm using Python 2.7 here. I've got a bit of code to extract certain MP3 tags, like this:
from mutagen.easyid3 import EasyID3  # EasyID3 comes from the mutagen package

mp3info = EasyID3(fileName)
print mp3info
print mp3info['genre']
print mp3info.get('genre', default=None)
print str(mp3info['genre'])
print repr(mp3info['genre'])
genre = unicode(mp3info['genre'])
print genre
I have to use the name ['genre'] instead of an index like [2], as the order can vary between tracks. It produces output like this:
{'artist': [u'Really Cool Band'], 'title': [u'Really Cool Song'], 'genre': [u'Rock'], 'date': [u'2005']}
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
At first I was like, "Why thank you, I do rock" but then I got on with trying to debug the code. As you can see, I've tried a few different approaches, but none of them work. All I want is for it to output
Rock
I reckon I could possibly use split, but that could get very messy very quickly, as there's a distinct possibility that the artist or title could contain a ' character.
Any suggestions?
It's not a string that you can use split on, it's a list; that list usually (always?) contains one item. So you can get that first item:
genre = mp3info['genre'][0]
[u'Rock']
is a list of length 1; its single element is a Unicode string.
Try
print genre[0]
to print only the first element of the list.
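If a track might be missing the genre tag entirely (an assumption; your sample output always has one), a guarded version avoids a KeyError:

# .get returns a list; take the first element only if the list is non-empty.
genre_list = mp3info.get('genre', [])
genre = genre_list[0] if genre_list else None
print(genre)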
I am working on merging a few datasets regarding over 200 countries in the world. In cleaning the data I need to convert some three-letter codes for each country into the countries' full names.
The three-letter codes and country full names come from a separate CSV file, which shows a slightly different set of countries.
My question is: Is there a better way to write this?
str.replace("USA", "United States of America")
str.replace("CAN", "Canada")
str.replace("BHM", "Bahamas")
str.replace("CUB", "Cuba")
str.replace("HAI", "Haiti")
str.replace("DOM", "Dominican Republic")
str.replace("JAM", "Jamaica")
and so on. It goes on for another 200 rows. Thank you!
Since the number of substitutions is high, I would instead iterate over the words in the string and replace them based on a dictionary lookup:
mapofcodes = {'USA': 'United States of America', ...}
finalstr = ''
for word in mystring.split():
    # Keep the word unchanged if it is not a known code.
    finalstr += mapofcodes.get(word, word) + ' '
Try reading the CSV file into a dictionary or a 2D array; you can then look up whichever code you want.
That is, if I understand your question correctly.
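A minimal sketch of that idea, assuming the CSV has two columns (code, full name) and using a hypothetical filename country_codes.csv, reusing mystring from the answer above:

import csv

# Build a {code: full name} dictionary straight from the CSV.
with open('country_codes.csv', newline='') as f:
    code_to_name = dict(csv.reader(f))

# Replace word by word, keeping anything that is not a known code.
cleaned = ' '.join(code_to_name.get(word, word) for word in mystring.split())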
Here's a regular expression solution:
import re
COUNTRIES = {'USA': 'United States of America', 'CAN': 'Canada'}
def repl(m):
    country_code = m.group(1)
    return COUNTRIES.get(country_code, country_code)
p = re.compile(r'([A-Z]{3})')
my_string = p.sub(repl, my_string)
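Hypothetical usage (in practice COUNTRIES would hold all ~200 mappings):

my_string = "USA and CAN signed a trade agreement"
print(p.sub(repl, my_string))
# United States of America and Canada signed a trade agreement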
I have a file, named a particular way. Let's say it's:
tv_show.s01e01.episode_name.avi
It's the standard way a video file of a TV show's episode is named on the net. The pattern is much the same all over the web, so I want to extract some information from a file named this way. Basically I want to get:
the show's title;
the season number s01;
the episode number e01;
the extension.
I'm using a Python 3 script to do so. This test file is pretty simple because all I have to do is this
import re
def acquire_info(f="tv_show.s01e01.episode_name.avi"):
    tvshow_title = title_p.match(f).group()
    numbers = numbers_p.search(f).group()
    season_number = numbers.split("e")[0].split("s")[1]
    ep_number = numbers.split("e")[1]
    return [tvshow_title, season_number, ep_number]

if __name__ == '__main__':
    # re.I stands for the option "ignorecase"
    title_p = re.compile("^[a-z]+", re.I)
    numbers_p = re.compile(r"s\d{1,2}e\d{1,2}", re.I)
    print(acquire_info())
and the output is as expected ['tv_show', '01', '01']. But what if my file name is like this other one? some.other.tv.show.s04e05.episode_name.avi.
How can I build a regex that gets all the text BEFORE the "s\d{1,2}e\d{1,2}" pattern is found?
P.S. I know I didn't include the code to get the extension in the example, but that's not my problem here, so it doesn't matter.
Try this:
show_p=re.compile("(.*)\.s(\d*)e(\d*)")
show_p.match(x).groups()
where x is your string
Edit: I forgot to include the extension; here is the revision:
show_p=re.compile("^(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
show_p.match(x).groups()
And here is the test result:
>>> show_p=re.compile("(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
>>> x="tv_show.s01e01.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '01', '01', 'avi')
>>> x="tv_show.s2e1.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '2', '1', 'avi')
>>> x='some.other.tv.show.s04e05.episode_name.avi'
>>> show_p.match(x).groups()
('some.other.tv.show', '04', '05', 'avi')
Here is one option: use capturing groups to extract all of the info you want in one step:
>>> show_p = re.compile(r'(.*?)\.s(\d{1,2})e(\d{1,2})')
>>> show_p.match('some.other.tv.show.s04e05.episode_name.avi').groups()
('some.other.tv.show', '04', '05')
I'm not a Python expert, but if it can do named captures, something general like this might work:
^(?<Title>.+)\.s(?<Season>\d{1,2})e(?<Episode>\d{1,2})\..*?(?<Extension>[^.]+)$
If named groups aren't available, just use normal groups.
A problem could occur if the title has a .s2e1. part that masks the real season/episode part. That would require more logic. The regex above assumes that the title/season/episode/extension all exist, and that the s/e marker is the rightmost one.
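Python's re module does support named groups, but spells them (?P<name>...); here is a sketch of the same pattern adapted to Python, under the same assumptions as the answer above:

import re

pattern = re.compile(r'^(?P<Title>.+)\.s(?P<Season>\d{1,2})e(?P<Episode>\d{1,2})\..*?(?P<Extension>[^.]+)$')
m = pattern.match('some.other.tv.show.s04e05.episode_name.avi')
if m:
    print(m.groupdict())
    # {'Title': 'some.other.tv.show', 'Season': '04', 'Episode': '05', 'Extension': 'avi'}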