I'm a new user of Scrapy (scrapy.org) and a newbie to Python. The brand and title properties (in Java OOP terms) hold values that contain tabs and newlines. How can I trim them so that these two properties hold the following plain string values?
item['brand'] = "KORAL ACTIVEWEAR"
item['title'] = "Boom Leggings"
Below is the data structure
{'store_id': 870, 'sale_price_low': [], 'brand': [u'\n KORAL ACTIVEWEAR\n '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n Boom Leggings\n '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u' https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}
I was able to trim the prices to only get the number and decimal by using regex search method, which I think might be wrong when there is a price comma separator.
price = re.compile('[0-9\.]+')
item['retail_price'] = filter(price.search, item['retail_price'])
It looks like all you need to do, at least for this example, is strip all whitespace off the edges of the brand and title values. You don't need a regex for that, just call the strip method.
However, your brand isn't a single string; it's a list of strings (even if there's only one string in the list). So, if you try to just strip it, or run a regex on it, you're going to get an AttributeError or TypeError from trying to treat that list as a string.
To fix this, you need to map the strip over all of the strings, with either the map function or a list comprehension:
item['brand'] = [brand.strip() for brand in item['brand']]
item['title'] = map(str.strip, item['title'])
… whichever of the two is easier for you to understand.
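If the goal is the plain string from the question rather than a one-element list, a small helper can strip the first entry. This is only a sketch: `first_stripped` is a made-up name (not part of Scrapy), and it assumes each field is a list holding at most one meaningful string.

```python
def first_stripped(values, default=None):
    """Return the first string in `values` with edge whitespace removed.

    Hypothetical helper for illustration; returns `default` for an empty list.
    """
    if values:
        return values[0].strip()
    return default


item = {
    'brand': [u'\n KORAL ACTIVEWEAR\n '],
    'title': [u'\n Boom Leggings\n '],
    'sale_price_low': [],
}
item['brand'] = first_stripped(item['brand'])
item['title'] = first_stripped(item['title'])
item['sale_price_low'] = first_stripped(item['sale_price_low'])

print(item['brand'])  # KORAL ACTIVEWEAR
print(item['title'])  # Boom Leggings
```

Whether an empty list should become None, an empty string, or stay a list is a design choice that depends on what consumes the item.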
If you have other examples that have embedded runs of whitespace, and you want to turn every such run into exactly one space character, you need to use the sub method with your regex:
item['brand'] = [re.sub(ur'\s+', u' ', brand.strip()) for brand in item['brand']]
Notice the u prefixes. In Python 2, you need a u prefix to make a unicode literal instead of a str (encoded bytes) literal. And it's important to use Unicode patterns against Unicode strings, even if the pattern itself doesn't care about any non-ASCII characters. (If all of this seems like a pointless pain and a bug magnet—well, it is; that's the main reason Python 3 exists.)
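For comparison, here is roughly how the whitespace-collapsing step might look in Python 3, where string literals are already Unicode and no u prefix is needed. The name `collapse_whitespace` and the sample value are assumptions made for this sketch.

```python
import re

# Any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r'\s+')


def collapse_whitespace(text):
    """Strip the edges, then turn each inner whitespace run into one space."""
    return _WHITESPACE_RUN.sub(' ', text.strip())


brands = ['\n KORAL \t ACTIVEWEAR\n ']  # made-up value with an embedded tab
print([collapse_whitespace(brand) for brand in brands])
```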
As for the retail_price, the same basic observations apply. Again, it's a list of strings, not just a string. And again, you probably don't need regex. Assuming the price is always a $ (or other single-character currency marker) followed by a number, just slice off the $ and call float or Decimal on it:
item['retail_price'] = [float(price[1:]) for price in item['retail_price']]
… but if you have examples that look different, with arbitrary extra characters on both sides of the price, you can use re.search here, but you'll still need to map it, and to use a Unicode pattern.
You also need to grab the matching group out of the search, and to handle empty/invalid strings in some way (they'll return None for the search, and you can't convert that to a float). You have to decide what to do about it, but from your attempt with filter it looks like you just want to skip them. This is complicated enough that I'd do it in multiple steps:
prices = item['retail_price']
matches = (re.search(ur'[0-9.]+', price) for price in prices)
groups = (match.group() for match in matches if match)
item['retail_price'] = map(float, groups)
… or maybe wrap that in a function.
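One way to wrap those steps in a function, and to cope with the comma separator the question worries about, is sketched below. It assumes the comma is a thousands separator and the dot is the decimal point (so it would misread European-style "1.299,50"); `parse_prices` is a hypothetical name, and Decimal is used instead of float to avoid rounding surprises with money.

```python
import re
from decimal import Decimal

# A digit, then more digits or commas, then an optional decimal part.
_PRICE = re.compile(r'\d[\d,]*(?:\.\d+)?')


def parse_prices(values):
    """Return a Decimal for each string containing a number; skip the rest."""
    prices = []
    for value in values:
        match = _PRICE.search(value)
        if match:  # strings without any number are silently skipped
            prices.append(Decimal(match.group().replace(',', '')))
    return prices


print(parse_prices(['$140.00', ' AUD 1,299.50 ', 'n/a', '']))
```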
You can define a method like below which takes an object and returns all the leaves normalized.
import six

def normalize(obj):
    if isinstance(obj, six.string_types):
        return ' '.join(obj.split())
    elif isinstance(obj, list):
        return [normalize(x) for x in obj]
    elif isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    return obj
This is a recursive function. It does not modify the original object; instead it returns a normalized copy. You can also use it to normalize individual strings.
For your example item
>> item = {'store_id': 870, 'sale_price_low': [], 'brand': [u'\n KORAL ACTIVEWEAR\n '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n Boom Leggings\n '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u' https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}
>> print (normalize(item))
>> {'category': [u'Activewear'], 'store_id': 870, 'sale_price_low': [], 'title': [u'Boom Leggings'], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'brand': [u'KORAL ACTIVEWEAR'], 'currency': 'AUD', 'image_url': [u'https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'sale_price_high': [], 'retail_price': [u'$140.00'], 'store': 'SampleStore'}
I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+) which does not allow for any number of sub-accounts and possible spaces. I don't think I want to have OR operators in there either, as this is going to be concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.
You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
import re

s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
    categories, price = mat.groups()
    categories = categories.split(':')
Now categories will be a list containing the categories, and price a string with the price. For your sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'
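Because [^\s:]+ excludes whitespace, the pattern above will not match a sub-account such as "food and wine". A possible variant that tolerates inner spaces, and still avoids the | operator, is sketched here; the names `LEDGER_LINE` and `parse_posting` are invented for the example, and it assumes the amount is always the last whitespace-separated field.

```python
import re

# Account path (ending in a non-space), a run of whitespace, then the amount.
LEDGER_LINE = re.compile(r'^\s*(.*\S)\s+(\d+\.\d+)\s*$')


def parse_posting(line):
    """Return (list_of_account_parts, amount_string), or None if no match."""
    match = LEDGER_LINE.match(line)
    if match is None:
        return None
    account, amount = match.groups()
    return [part.strip() for part in account.split(':')], amount


print(parse_posting('expenses:food:food and wine      20.99'))
print(parse_posting('expenses:food:wine:speciality 19.99'))
```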
You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
    entries = line.split(":")  # first split all the entries
    last = entries.pop()  # take the last entry
    return entries + last.rsplit(" ", 1)  # split on last space and return all together
print(split_ledger("expenses:food:food and wine  20.99"))  # two spaces before the amount
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality  19.99"))  # two spaces before the amount
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
    entries = [e.strip() for e in line.split(":")]
    last = entries.pop()
    return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']
I'm using python 2.7 for this here. I've got a bit of code to extract certain mp3 tags, like this here
mp3info = EasyID3(fileName)
print mp3info
print mp3info['genre']
print mp3info.get('genre', default=None)
print str(mp3info['genre'])
print repr(mp3info['genre'])
genre = unicode(mp3info['genre'])
print genre
I have to use the name ['genre'] instead of [2] as the order can vary between tracks. It produces output like this
{'artist': [u'Really Cool Band'], 'title': [u'Really Cool Song'], 'genre': [u'Rock'], 'date': [u'2005']}
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
At first I was like, "Why thank you, I do rock" but then I got on with trying to debug the code. As you can see, I've tried a few different approaches, but none of them work. All I want is for it to output
Rock
I reckon I could possibly use split, but that could get very messy very quickly as there's a distinct possibility that artist or title could contain '
Any suggestions?
It's not a string that you can use split on; it's a list, and that list usually (always?) contains one item. So you can get that first item:
genre = mp3info['genre'][0]
[u'Rock']
is a list of length 1; its single element is a Unicode string.
Try
print mp3info['genre'][0]
to print only the first element of the list.
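If some tracks lack a genre tag, indexing with [0] directly would raise an error. A defensive helper might look like the sketch below; `first_tag` is a hypothetical name, and a plain dict stands in for the EasyID3 object, on the assumption that it supports .get the way the question's code uses it.

```python
def first_tag(tags, key, default=None):
    """Return the first value stored under `key`, or `default` if absent or empty."""
    values = tags.get(key) or []
    return values[0] if values else default


# A plain dict stands in for the EasyID3 object from the question.
mp3info = {
    'artist': [u'Really Cool Band'],
    'title': [u'Really Cool Song'],
    'genre': [u'Rock'],
    'date': [u'2005'],
}

print(first_tag(mp3info, 'genre'))             # Rock
print(first_tag(mp3info, 'album', 'Unknown'))  # Unknown
```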
I am implementing a simple DSL. I have the following input string:
txt = 'Hi, my name is <<name>>. I was born in <<city>>.'
And I have the following data:
{
'name': 'John',
'city': 'Paris',
'more': 'xxx',
'data': 'yyy',
...
}
I need to implement the following function:
def tokenize(txt):
    ...
    return fmt, vars
Where I get:
fmt = 'Hi, my name is {name}. I was born in {city}.'
vars = ['name', 'city']
That is, fmt can be passed to the str.format() function, and vars is a list of the detected tokens (so that I can perform lookup in the data, which can be more complex than what I described, since it can be split in several namespaces)
After this, processing the format would be simple:
def expand(fmt, vars, data):
    params = get_params(vars, data)
    return fmt.format(**params)
Where get_params is performing simple lookup of the data, and returning something like:
params = {
'name': 'John',
'city': 'Paris',
}
My question is:
How can I implement tokenize? How can I detect the tokens, knowing that the delimiters are << and >>? Should I go for regexes, or is there an easier path?
This is something similar to what pystache, or even .format itself, are doing, but I would like a light-weight implementation. Robustness is not very critical at this stage.
Yes, this is a perfect target for a regexp. Find the opening and closing delimiters, replace them with braces, and extract the symbol names into a list. Do you have a solid description of legal symbols? You'll want a search such as
/\<\<([a-zA-Z]+[a-zA-Z0-9_]*)\>\>/
for classical variable names (note that this excludes leading underscores). Are you familiar enough with regexps to take it from here?
import re
def tokenize(text):
    found_variables = []

    def replace_and_capture(match):
        found_variables.append(match.group(1))
        return "{{{}}}".format(match.group(1))

    return re.sub(r'<<([^>]+)>>', replace_and_capture, text), found_variables
fmt, vars = tokenize('Hi, my name is <<name>>. I was born in <<city>>.')
print(fmt)
print(vars)
# Output:
# Hi, my name is {name}. I was born in {city}.
# ['name', 'city']
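To connect the tokenizer back to the expansion step from the question, a sketch of the whole round trip follows. It assumes a flat data dict and token names that are simple identifiers (str.format would interpret dots or brackets in a name specially); the brace escaping is an addition of this sketch, not something the question asked for.

```python
import re

TOKEN = re.compile(r'<<([^>]+)>>')


def tokenize(text):
    """Return (format_string, token_names) for a <<name>> style template."""
    names = []

    def replace_and_capture(match):
        names.append(match.group(1))
        return '{' + match.group(1) + '}'

    # Escape literal braces first so str.format does not treat them as fields.
    escaped = text.replace('{', '{{').replace('}', '}}')
    return TOKEN.sub(replace_and_capture, escaped), names


def expand(text, data):
    """Fill the template using only the keys it actually mentions."""
    fmt, names = tokenize(text)
    params = {name: data[name] for name in names}  # simple flat lookup
    return fmt.format(**params)


data = {'name': 'John', 'city': 'Paris', 'more': 'xxx', 'data': 'yyy'}
print(expand('Hi, my name is <<name>>. I was born in <<city>>.', data))
```

A missing key raises KeyError here; a namespaced lookup would replace the dict comprehension.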
I have a dictionary named dicionario1. I need to replace the content of dicionario1[chave][1], which is a list, with the list lista_atributos.
lista_atributos uses the content of dicionario1[chave][1] to build a list where:
All the information is separated by "," except when it finds the characters "(#" and ")". In that case, it should create a list with the content between those characters (also separated by ","). There can be one or more occurrences of '(#', and I need to handle every one of them.
Although this might be easy, I'm stuck with the following code:
dicionario1 = {'#998' : [['IFCPROPERTYSET'],["'0siSrBpkjDAOVD99BESZyg',#41,'Geometric Position',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]],
               '#1000' : [['IFCRELDEFINESBYPROPERTIES'],["'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"]]}
for chave in dicionario1:
    lista_atributos = []
    ini = 0
    for i in dicionario1[chave][1][0][ini:]:
        if i == '(' and dicionario1[chave][1][0][dicionario1[chave][1][0].index(i) + 1] == '#':
            ini = dicionario1[chave][1][0].index(i) + 1
            fim = dicionario1[chave][1][0].index(')')
            lista_atributos.append(dicionario1[chave][1][0][:ini-2].split(','))
            lista_atributos.append(dicionario1[chave][1][0][ini:fim].split(','))
            lista_atributos.append(dicionario1[chave][1][0][fim+2:].split(','))
    print lista_atributos
Result:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", '#41', "'Geometric Position'", '$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757'], ['']]
Unfortunately I can't figure out how to iterate over dicionario1[chave][1][0] to get this result:
[["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'"], ['#41'], ["'Geometric Position'"], ['$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
I need the leading ["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'] in the result to also turn into ["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'].
Also If I modify "Geometric Position" to "(Geometric Position)" the result becomes:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
SOLUTION: (thanks to Rob Watts)
import re
dicionario1 =["'0siSrBpkjDAOVD99BESZyg',#41,'(Geometric) (Position)',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]
dicionario1 = re.findall('\([^)]*\)|[^,]+', dicionario1[0])
for i in range(len(dicionario1)):
    if dicionario1[i].startswith('(#'):
        dicionario1[i] = dicionario1[i][1:-1].split(',')
    else:
        pass
print dicionario1
["'0siSrBpkjDAOVD99BESZyg'", '#41', "'(Geometric) (Position)'", '$', ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
One problem I see with your code is the use of index:
ini = dicionario1[chave][1][0].index(i) + 2
fim = dicionario1[chave][1][0].index(')')
index returns the index of the first occurrence of the character. So if you have two ('s in your string, then both times it will give you the index of the first one. That (and your break statement) is why in your example you've got ['2.1', '2.2', '2.3'] correctly but also have '(#5.1', '5.2', '5.3)'.
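To see the first-occurrence behaviour concretely, and how the optional start argument of str.index changes it, here is a small sketch. The test string is made up for illustration and is not the data from the question.

```python
s = "'1',(#2,#3),$,(#5,#6)"

# Without a start argument, index always reports the first '('.
first_open = s.index('(')
first_close = s.index(')', first_open)

# Passing a start position makes the search resume after the first group.
second_open = s.index('(', first_close)
second_close = s.index(')', second_open)

print(s[first_open + 1:first_close])    # #2,#3
print(s[second_open + 1:second_close])  # #5,#6
```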
You can get around this by specifying a starting index to the index method, but I'd suggest a different strategy. If you don't have any commas in the parsed strings, you can use a fairly simple regex to find all your groups:
'\([^)]*\)|[^,]+'
This will find everything inside parentheses, and also every run of characters that doesn't contain a comma. For example:
>>> import re
>>> teststr = "'1',$,#41,(#10,#5)"
>>> re.findall('\([^)]*\)|[^,]+', teststr)
["'1'", '$', '#41', '(#10,#5)']
This leaves you with everything grouped appropriately. You still have to do a little bit of processing on each entry, but it should be fairly straightforward.
During your processing, the startswith method should be helpful. For example:
>>> '(something)'.startswith('(')
True
>>> '(something)'.startswith('(#')
False
>>> '(#1,#2,#3)'.startswith('(#')
True
This will make it easy for you to distinguish between (...) and (#...). If there are commas in the (...), you could always split on comma after you've used the regex.
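Putting the regex and startswith together, the nested result the question asks for could be produced roughly like this. `parse_attributes` is a hypothetical name, and the sketch assumes, as stated above, that there are no commas inside quoted strings or inside plain (...) groups.

```python
import re

# Either a parenthesised group or a run of non-comma characters.
TOKEN = re.compile(r"\([^)]*\)|[^,]+")


def parse_attributes(text):
    """Wrap each plain token in a list; split each (#...) group on commas."""
    result = []
    for token in TOKEN.findall(text):
        if token.startswith('(#'):
            result.append(token[1:-1].split(','))  # drop the parentheses
        else:
            result.append([token])
    return result


line = "'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"
print(parse_attributes(line))
```

If the flat shape from the SOLUTION block is preferred, appending token instead of [token] in the else branch gives that.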
I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to directly get these elements out, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 =re.search('(?<=-)\d', line)
and got stuck.
The first one can correctly get the first elements, e.g. 'dep', 'amod', 'conj_or', but I actually have not totally figured out why it is working...
Second line is trying to get the second elements, e.g. 'Company', 'rent', 'information', but I can only get the number after the word. I cannot figure out how to lookbefore rather than lookbehind...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give some hints or help. Highly appreciated.
It is difficult to give an optimal answer without knowing the full range of possible outputs, however, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all runs of contiguous letters ([A-Za-z] covers capital A to Z and small a to z), together with the characters "_", "'" and "-", in each line.
Furthermore, it enforces the rule that the last character of each match must not be one of a given list of characters ([^...] is the syntax for "any character except these"; replace "..." with the list). That is what stops a match from running into the trailing "-" and the digits after it.
The character \ escapes those characters like "(" or ")" that would otherwise be parsed by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
Here is something you're looking for:
([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)
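The parentheses around [\w-]* are for grouping, so that you can access the data as:
ex = r'([\w-]*)\(([\w-]*)-\d*, ([\w-]*)-\d*\)'
m = re.match(ex, line)
print(m.group(1), m.group(2), m.group(3))
Note that group(0) is the whole match, so the three captured pieces are groups 1 to 3. Also, [\w-] does not match an apostrophe, so a line such as aux(requested-29, 've-28) would need ' added to the character classes.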
Btw, I recommend using "Kodos" program written in Python+PyQT to learn and test regular expressions. It's my favourite tool to test regexs.
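As a variation on the grouped pattern, the sketch below uses lazy .+? groups so that hyphens and apostrophes inside a word are kept, and it tolerates apostrophes after the index as in the first sample line. `DEP` and `parse_dependency` are names invented here, and the pattern is only checked against the sample lines from the question, not the parser's full output format.

```python
import re

# relation(word-index, word-index), with optional apostrophes after an index.
DEP = re.compile(r"^(\w+)\((.+?)-\d+'*,\s*(.+?)-\d+'*\)$")


def parse_dependency(line):
    """Return [relation, governor, dependent], or None if the line differs."""
    match = DEP.match(line.strip())
    return list(match.groups()) if match else None


for sample in ("dep(Company-1, rent-5')",
               "amod(companies-20, non-affiliated-19)",
               "aux(requested-29, 've-28)"):
    print(parse_dependency(sample))
```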
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]