Python: How to get string from item without [u''] - python

I'm using Python 2.7 here. I've got a bit of code to extract certain MP3 tags, like this:
mp3info = EasyID3(fileName)
print mp3info
print mp3info['genre']
print mp3info.get('genre', default=None)
print str(mp3info['genre'])
print repr(mp3info['genre'])
genre = unicode(mp3info['genre'])
print genre
I have to use the name ['genre'] instead of [2] because the order can vary between tracks. It produces output like this:
{'artist': [u'Really Cool Band'], 'title': [u'Really Cool Song'], 'genre': [u'Rock'], 'date': [u'2005']}
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
[u'Rock']
At first I was like, "Why thank you, I do rock" but then I got on with trying to debug the code. As you can see, I've tried a few different approaches, but none of them work. All I want is for it to output
Rock
I reckon I could possibly use split, but that could get very messy very quickly as there's a distinct possibility that artist or title could contain '
Any suggestions?

It's not a string that you can use split on; it's a list, and that list usually (always?) contains one item. So you can get that first item:
genre = mp3info['genre'][0]

[u'Rock']
is a list of length 1; its single element is a Unicode string.
Try
print mp3info['genre'][0]
to print only the first element of the list.
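For what it's worth, here is a minimal sketch of the same idea that also copes with a missing or empty tag. It uses a plain dict in place of the EasyID3 object, and first_tag_value is a hypothetical helper name, not part of mutagen:

```python
# Stand-in for the tag mapping from the question (each value is a list).
tags = {
    "artist": ["Really Cool Band"],
    "title": ["Really Cool Song"],
    "genre": ["Rock"],
    "date": ["2005"],
}

def first_tag_value(tags, key, default=None):
    """Return the first item of the list stored under key, or default."""
    values = tags.get(key) or []
    return values[0] if values else default

genre = first_tag_value(tags, "genre")
missing = first_tag_value(tags, "album", default="Unknown")
print(genre)
print(missing)
```

Indexing with [0] is all that is needed when the tag is known to exist; the helper only adds a fallback for tracks where it does not.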

Related

How to access the tuple values from .partition()?

Today I learned that str.partition(sep) gives me a tuple with the part before the separator, the separator itself, and the part after it.
Let's say I have this:
string = 'Plans for this weekend include turning wine into water.'
print(string.partition('weekend '))
and it prints out:
('Plans for this ', 'weekend ', 'include turning wine into water.')
How do I grab the third value (index 2)?
Thanks in advance! (I'm pretty new to Python :)
Just store the result in a new variable and access it by index.
string = 'Plans for this weekend include turning wine into water.'
res = string.partition('weekend ')
print(res[2])
string = 'Plans for this weekend include turning wine into water.'
print(string.partition('weekend ')[2][0])
This will print out i
print(string.partition('weekend ')[2])
Output: include turning wine into water.
It works like array/list referencing.
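As an alternative to indexing, the three parts can be unpacked into names in one step. A small sketch (the variable names before, sep and after are just illustrative):

```python
string = 'Plans for this weekend include turning wine into water.'

# partition() always returns exactly three items, so unpacking is safe.
before, sep, after = string.partition('weekend ')
print(after)

# If the separator is not found, the whole string ends up in the first
# item and the other two are empty strings.
head, found, tail = string.partition('holiday ')
print(repr(found), repr(tail))
```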

How do I search for specific text in a string using python

I am using this dataset to try to make a dynamic bubble graph, but I came across an issue. When I search for a genre, e.g. Western, it matches only that exact string. The problem is that dataset_by_year["genre"] holds entries with multiple genres (e.g. western, comedy, action), which are not matched.
for genre in genres:
    dataset_by_year = BubbleGV[BubbleGV["year"] == year1]
    dataset_by_year_and_cont = dataset_by_year[
        dataset_by_year["genre"] == genre]
All I want to do is to search for the genre within the multiple genres and match the string.
Any help would be greatly appreciated.
Sorry, I didn't check this against your full data, but try this; it might work:
dataset_by_year_and_cont = dataset_by_year[dataset_by_year["genre"].str.contains(genre)]
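A small self-contained sketch of that substring filter, assuming pandas is installed. The films frame and its values are made up for illustration and stand in for BubbleGV; re.escape, case=False and na=False are optional extras that guard against regex metacharacters in the genre, differences in capitalisation, and missing values:

```python
import re

import pandas as pd

# Made-up stand-in for the BubbleGV dataset from the question.
films = pd.DataFrame({
    "year": [1990, 1990, 1990, 1991],
    "genre": ["Western, Comedy", "Action", "Drama, western", "Western"],
})

year1 = 1990
genre = "Western"

dataset_by_year = films[films["year"] == year1]
# Substring match instead of exact equality.
mask = dataset_by_year["genre"].str.contains(
    re.escape(genre), case=False, na=False)
dataset_by_year_and_cont = dataset_by_year[mask]

matched = dataset_by_year_and_cont["genre"].tolist()
print(matched)
```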

Using Python to parse pdf and extract Author and Book name

I have a mailing reference list in the form of a PDF. The list has a very general format, i.e. the author name followed by the name of the book.
Consider the following examples:
American Reading List
Democratic Theory
• Dahl, Preface to Democratic Theory
• Schumpeter, Capitalism, Socialism, and Democracy (Introduction and part IV only)
• Machperson, Life and Times of Liberal Democracy
• Dahl, Democracy and its Critics
Now I am trying to parse the PDF using pdfminer and create a list in which the first element is the author name and the second is the name of the book, like this:
[Dahl, Preface to Democratic Theory]
I am trying to use split because the author name is followed by a comma and a space. However, I don't get the correct results.
Can somebody help?
def extract():
    string = convert_pdf_to_txt("/Users/../../names.pdf")
    lines = list(filter(bool, string.split('\n')))
    for i in lines:
        check.extend(i.split(','))
    x=remove_numbers(check)
    remove_blank= [x for x in x if x]
    combine_two = [remove_blank[x:x + 2] for x in xrange(0,len(remove_blank), 2)]
    print combine_two
Let's see what's going wrong here. I'm making some guesses, but hopefully they are the relevant ones.
Your convert_pdf_to_txt() function returns a single long string containing all the text of the PDF.
You split the text on ", " which results in a list of strings.
Given your example data, this list looks something like this (each element is on a separate line here):
Dahl
Preface to Democratic Theory(line break)(bullet)(tab)Schumpeter
Capitalism
Socialism
and Democracy (Introduction and part IV only)(line break)(bullet)(tab)Machperson
Life and Times of Liberal Democracy(line break)(bullet)(tab)Dahl
Democracy and its Critics
Because you split on ", " without regard for the fact that the data is formatted as lines, you end up with stuff from multiple lines in each item.
Now you use filter() to iterate over this list and filter out all the ones that aren't true. A non-empty string is true, and all of the elements are non-empty strings, so all the elements get through. Your filter() therefore doesn't do anything.
What you seem to want is something more like this:
lines = [line.split(", ", 1) for line in string.splitlines() if ", " in line]
Here we first split the lines, filter out any that don't have comma-space in them, and return a list of lists based on splitting the string on the first comma-space.
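To make that concrete, here is a rough sketch run against the sample from the question. The text variable stands in for whatever convert_pdf_to_txt() returns, and stripping the leading bullet with lstrip() is my own addition, on the assumption that the extracted text keeps the bullet character:

```python
# Stand-in for the string returned by the PDF-to-text step.
text = (
    "American Reading List\n"
    "Democratic Theory\n"
    "•\tDahl, Preface to Democratic Theory\n"
    "•\tSchumpeter, Capitalism, Socialism, and Democracy "
    "(Introduction and part IV only)\n"
    "•\tMachperson, Life and Times of Liberal Democracy\n"
    "•\tDahl, Democracy and its Critics\n"
)

pairs = [
    # Drop the bullet and whitespace, then split on the first ", " only,
    # so commas inside the book title are left alone.
    line.lstrip("• \t").split(", ", 1)
    for line in text.splitlines()
    if ", " in line
]

for author, title in pairs:
    print(author, "|", title)
```

The headings have no comma-space in them, so they are skipped, and the Schumpeter title keeps its internal commas because of the maxsplit argument of 1.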

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print " ".join(y[:y.index('1080p')-1])
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
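A variation on the same field-based idea that does not depend on knowing the resolution: cut at the first field that looks like a release year. This is only a sketch; title_from_release is a hypothetical name, and the 19xx/20xx year check is an assumption about the file names:

```python
import re

# A field counts as a release year if it is exactly 19xx or 20xx.
YEAR = re.compile(r"(?:19|20)\d{2}")

def title_from_release(name):
    """Join the dot-separated fields that come before the year field."""
    fields = name.split(".")
    for i, field in enumerate(fields):
        # Skip index 0 so a title that itself starts with a year survives.
        if i > 0 and YEAR.fullmatch(field):
            return " ".join(fields[:i])
    return " ".join(fields)

print(title_from_release(
    "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(title_from_release("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```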
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=\.\d{4}.*\d{3,}p)
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
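For reference, a short sketch of what that substitution leaves behind. The replacement puts a space where the year started, so each line ends in trailing spaces; the strip() per line is my addition:

```python
import re

p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = (
    "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\n"
    "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG\n"
    "Iron Man 3"
)

# Dots before the year become spaces, and the year plus everything after
# it on the same line is replaced by a single space.
result = p.sub(" ", test_str)

# Strip the trailing spaces from each line.
cleaned = [line.strip() for line in result.splitlines()]
print(cleaned)
```

A name without a year, such as the third line, passes through unchanged.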
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by:
"(?:\.|\d{4,4}.+$)"
with a blank and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar

How to save a regular expression user input value (Python)

I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<name>\w*) | Hi {name}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<word>\w*) | You said your name is {name} already!
my name is (?P<name>\w*) | Hi {name}!
Having no value for 'name' yet, the bot would first output 'Hi steve!'; once it does have this value, the 'word' rule would apply instead. I'm not sure whether this is easily feasible given the way I have structured my program. The text file is read into a dictionary, with the key and value separated by the | character. When the user inputs some text, the program checks whether it matches one of the patterns stored in the dictionary and prints the corresponding bot response (there is also an 'else' case if no match is found).
I must need something to happen at the comparing part of the process so that the user's regular expression text is saved and then substituted back into the dictionary somehow. All of my regular expressions have different names associated with them (there are no two instances of 'word', for example...there is 'word', 'word2', etc), I did this as I thought it would make this part of the process easier. I may have structured the thing completely wrong to do this task though.
Edit: code
import re
io = {}
with open("rules.txt") as brain:
    for line in brain:
        key, value = line.split('|')
        io[key] = value
string = str(raw_input('> ')).lower()+' word'
x = 1
while x == 1:
    for regex, output in io.items():
        match = re.match(regex, string)
        if match:
            print(output.format(**match.groupdict()))
            string = str(raw_input('> ')).lower()+' word'
        else:
            print ' Sorry?'
            string = str(raw_input('> ')).lower()+' word'
I had some difficulty understanding the principle of your algorithm because I'm not used to working with named groups.
The following code is the way I would solve your problem; I hope it gives you some ideas.
I think that having only one dictionary isn't a good principle: it makes both the reasoning and the algorithm more complex. So I based the code on two dictionaries, direg and memory.
The keys of these two dictionaries are group indexes. Not all the indexes, only particular ones: the index of the last group in each individual pattern.
That's because, for fun, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups across all the individual patterns.
So creating the dictionaries needs some additional care, because the index of the last matching group (which I get from the lastindex attribute of a regex MatchObject) may not correspond to the position of the individual pattern within the combined regex; it's harder to explain than to understand. That's why the function distr() counts the occurrences of the strings {0} {1} {2} {3} {4} etc., whose number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found the suggestion of Laurence D'Oliveiro to use '||' instead of '|' as separator interesting.
My code simulates a session in which several inputs are done:
import re
regi = ("[mM]y name [Ii][sS] (\w*)"
        "||Hi {0}!"
        "||You said that your name was {0} !!!",
        "[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
        "||OK here's your file {0}\\{1} :"
        "||I already gave you the file {0}\\{1} !",
        "[Ii] [wW][aA][nN][tT] to ([ \w]*)"
        "||OK, I will do {0}"
        "||You already did {0}. Do yo really want again ?")
direg = {}
memory = {}
def distr(regi,cnt = 0,di = direg,mem = memory,
          regnb = re.compile('{\d+}')):
    for i,el in enumerate(regi,start=1):
        sp = el.split('||')
        cnt += len(regnb.findall(sp[1]))
        di[cnt] = sp[1]
        mem[cnt] = sp[2]
        yield sp[0]
regx = re.compile('|'.join(distr(regi)))
print 'direg :\n',direg
print
print 'memory :\n',memory
for inp in ('I say that my name is Armano the 1st',
            'In repertory ONE I want file SPACE',
            'I want to record music',
            'In repertory ONE I want file SPACE',
            'I say that my name is Armstrong',
            'But my name IS Armstrong now !!!',
            'In repertory TWO I want file EARTH',
            'Now my name is Helena'):
    print '\ninput ==',inp
    mat = regx.search(inp)
    if direg[mat.lastindex]:
        print 'output ==',direg[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        direg[mat.lastindex] = None
        memory[mat.lastindex] = memory[mat.lastindex]\
                                .format(*(d for d in mat.groups() if d))
    else:
        print 'output ==',memory[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        if not memory[mat.lastindex].startswith('Sorry'):
            memory[mat.lastindex] = 'Sorry, ' \
                                    + memory[mat.lastindex][0].lower()\
                                    + memory[mat.lastindex][1:]
Result:
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!
OK, let me see if I understand this:
You want a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?
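To illustrate the idea of attaching more information to each rule, here is a minimal Python 3 sketch. The three-part rule layout (pattern, reply for new values, reply for values already remembered) is an assumption of mine, not the format of the rules file in the question, and respond is a hypothetical function name:

```python
import re

# Hypothetical rule format: (pattern, reply when the captured names are
# new, reply when they are already in memory).
RULES = [
    (re.compile(r"my name is (?P<name>\w+)"),
     "Hi {name}!",
     "You said your name is {name} already!"),
]

def respond(text, memory):
    """Return the bot's reply, updating the memory dict in place."""
    for pattern, first_reply, repeat_reply in RULES:
        match = pattern.search(text.lower())
        if not match:
            continue
        captured = match.groupdict()
        if all(key in memory for key in captured):
            # Everything this rule captures is already known, so answer
            # from memory and leave the stored values untouched.
            return repeat_reply.format(**memory)
        memory.update(captured)
        return first_reply.format(**memory)
    return "Sorry?"

memory = {}
first = respond("My name is Steve", memory)
second = respond("my name is Bob", memory)
print(first)
print(second)
```

Because the repeat reply is formatted from memory rather than from the new match, the second rule never overwrites the stored name, which avoids the "only applies once" problem described above.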
