get string with parsing in python list - python

I have a list like this:
["<name:john student male age=23 subject=\computer\sience_{20092973}>",
"<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]
I want to look up the role (for example, student) using the IDs {20092973} and {20092931}.
Expected result 1 (input is {20092973}):
"student"
Expected result 2 (input is {20092931}):
"professor"
I have already searched but couldn't find an answer.
How can I do this?

I don't think you should be doing this in the first place. Unlike your toy example, your real problem doesn't involve a string in some clunky format; it involves a Scapy NetworkInterface object. Which has attributes that you can just access directly. You only have to parse it because for some reason you stored its string representation. Just don't do that; store the attributes you actually want when you have them as attributes.
The NetworkInterface object isn't described in the documentation (because it's an implementation detail of the Windows-specific code), but you can interactively inspect it like any other class in Python (e.g., dir(ni) will show you all the attributes), or just look at the source. The values you want are name and win_name. So, instead of print ni, just do something like print '%s,%s' % (ni.name, ni.win_name). Then, parsing the results in some other program will be trivial, instead of a pain in the neck.
Or, better, if you're actually using this in Scapy itself, just make the dict directly out of {ni.win_name: ni.name for ni in nis}. (Or, if you're running Scapy against Python 2.5 or something, dict((ni.win_name, ni.name) for ni in nis).)
But to answer the question as you asked it (maybe you already captured all the data and it's too late to capture new data, so now we're stuck working around your earlier mistake…), there are three steps to this: (1) Figure out how to parse one of these strings into its component parts. (2) Do that in a loop to build a dict mapping the numbers to the names. (3) Just use the dict for your lookups.
For parsing, I'd use a regular expression. For example:
<name:\S+\s(\S+).*?\{(\d+)\}>
Now, let's build the dict:
import re
# things is assumed to be the list of strings from the question
r = re.compile(r'<name:\S+\s(\S+).*?\{(\d+)\}>')
matches = (r.match(thing) for thing in things)
d = {match.group(2): match.group(1) for match in matches}
And now:
>>> d['20092973']
'student'

Code:
def grepRole(role, lines):
    return [line.split()[1] for line in lines if role in line][0]
l = ["<name:john student male age=23 subject=\computer\sience_{20092973}>",
"<name:Ahn professor female age=61 subject=\compute\math_{20092931}>"]
print(grepRole("{20092973}", l))
print(grepRole("{20092931}", l))
Output:
student
professor

current_list = ["<name:john student male age=23 subject=\computer\sience_{20092973}>", "<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]
def get_identity(code):
    print([row.split(' ')[1] for row in current_list if code in row][0])
get_identity("{20092973}")
Regular expressions are good, but for me, a rookie, regular expressions are another big problem...
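If the same lookup is needed many times, the list could be scanned once to build a dict, still without regular expressions. This is only a sketch: the names records, build_role_index and role_by_id are made up for illustration, and it assumes every row has the role as its second whitespace-separated token and ends with a braced ID.

```python
# Build a lookup table once, using only str methods (no regex).
# Raw strings keep the backslashes in the sample data literal.
records = [
    r"<name:john student male age=23 subject=\computer\sience_{20092973}>",
    r"<name:Ahn professor female age=61 subject=\computer\math_{20092931}>",
]

def build_role_index(rows):
    """Map each braced ID, e.g. '{20092973}', to the role (second token)."""
    index = {}
    for row in rows:
        role = row.split()[1]        # 'student', 'professor', ...
        start = row.rindex("{")      # position of the ID's opening brace
        end = row.rindex("}") + 1    # one past the closing brace
        index[row[start:end]] = role
    return index

role_by_id = build_role_index(records)
print(role_by_id["{20092973}"])  # student
print(role_by_id["{20092931}"])  # professor
```

After the dict is built, each lookup is a plain key access instead of another pass over the list.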

Related

Pythonic way to solve a text normalization task

Basically, I have a Hive script file, from which I need to extract the names for all the tables created. For example, from the contents
...
create table Sales ...
...
create external table Persons ...
...
Sales and Persons should be extracted. To accomplish this, my basic idea is like:
Search for key phrases create table and create external table,
Extract the next token which should be the table name.
However, the input may not be canonical. For example,
Tab/newline may be used along with space as token delimiter
There may be multiple consecutive delimiters between tokens
Mixed use of upper and lower case letters like create TABLE
Therefore, I'm thinking about first normalizing the input to a canonical form before applying the basic algorithm. Then with some effort, I come up with the following
' '.join(input.split()).lower()
As a Python newcomer, I'm wondering whether this is the Pythonic way to solve the problem, or it may be flawed in the very first place? Is there a simple way to do this in a streaming fashion, i.e., avoiding loading the whole input into memory at once?
Like some comments stated, regex is a neat and easy way to get what you want. If you don't mind getting lowercase results, this one should work:
import re
my_str = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""
pattern = r"table\s+(\w+)\b"
items = re.findall(pattern, my_str.lower())
print(items)
It captures the next word after "table " (followed by at least one whitespace / newline).
To get the original case of the table names:
for x, item in enumerate(items):
    # find where the lowercased name occurs and slice the original text there
    i = my_str.lower().index(item)
    items[x] = my_str[i:i+len(item)]
print(items)
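An alternative that keeps the original case in one step is to match case-insensitively instead of lowercasing the input. The sketch below is an assumption about what "create table" statements look like in the script, not a Hive parser: the name TABLE_NAME is made up, and a statement such as "create table if not exists X" would yield "if" rather than "X".

```python
import re

# Sample input mirroring the test string above.
script = """
...
create table Sales ...
create TabLE
test
create external table Persons ...
...
"""

# \s+ absorbs any mix of spaces, tabs and newlines; IGNORECASE handles 'TabLE'.
TABLE_NAME = re.compile(r"\bcreate\s+(?:external\s+)?table\s+(\w+)", re.IGNORECASE)

table_names = TABLE_NAME.findall(script)
print(table_names)  # ['Sales', 'test', 'Persons']
```

Because the pattern requires the create keyword, a stray word after some unrelated "table" in a comment is less likely to be picked up than with the shorter pattern. It still reads the whole input into memory, so it does not address the streaming part of the question.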

Python - any property file or data format that is mostly free-form?

I'm about to roll my own property file parser. I've got a somewhat odd requirement where I need to be able to store metadata in an existing field of a GUI. The data needs to be easily parse-able and human readable, preferably with some flexibility in defining the data (no yaml for example).
I was thinking I could do something like this:
this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla
I was thinking I could use something like '.metadata.' to denote that anything below that line is metadata to be parsed. Then, I would treat the properties almost like java properties where I would read each line in and build a map (or object) to hold the metadata, which would then be outputted and searchable via a simple web app.
My real question before I roll this on my own, is can anyone suggest a better method for solving this problem? A specific data format or library that would fit this use case? I would normally use something like yaml or the like, but there's no good way for me to validate that the data is indeed in yaml format when it is saved.
You have 3 problems:
1. How to fit two different things into one box.
If you are mixing free form text and something that is more tightly defined, you are always going to end up with stuff that you can't parse. Then you will have a never ending battle of trying to deal with the rubbish that gets put in. Is there really no other way?
2. How to define a simple format for metadata that is robust enough for simple use.
This is a hard problem - all attempts to do so seem to expand until they become quite complicated (e.g. YAML). You will probably have custom requirements for your domain, so what you've proposed may be best.
3. How to parse that format.
For this I would recommend parsy.
It would be quite simple to split the text on .metadata. and then parse what remains.
Here is an example using parsy:
from parsy import *
attribute = letter.at_least(1).concat()
name = attribute.sep_by(string("."))
value = regex(r"[^\n]+")
definition = seq(name << string(":") << string(" ").many(), value)
metadata = definition.sep_by(string("\n"))
Example usage:
>>> metadata.parse_partial("""owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla""")
([[['owner', 'first'], 'rick'],
[['owner', 'second'], 'bob'],
[['property'], 'blue'],
[['pets', 'mammals', 'dog'], 'rufus'],
[['pets', 'mammals', 'cat'], 'ludmilla']],
'')
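For comparison, the split on .metadata. followed by line-by-line parsing could also be sketched with the standard library only. The function name parse_description is made up here, and the sketch assumes that the marker appears once, that each metadata line is "dotted.key: value", and that no key is used both as a leaf and as a parent (for example "pets: x" together with "pets.dog: y").

```python
def parse_description(text, marker=".metadata."):
    """Split free text from metadata and build a nested dict from dotted keys."""
    description, _, meta_block = text.partition(marker)
    metadata = {}
    for line in meta_block.splitlines():
        if ":" not in line:
            continue                       # skip blank or malformed lines
        dotted_key, _, value = line.partition(":")
        *parents, leaf = dotted_key.strip().split(".")
        node = metadata
        for part in parents:               # walk/create the intermediate dicts
            node = node.setdefault(part, {})
        node[leaf] = value.strip()
    return description.strip(), metadata

sample = """this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla"""

description, metadata = parse_description(sample)
print(description)
print(metadata["pets"]["mammals"]["dog"])  # rufus
```

Lines without a colon are silently skipped here; a real tool would probably want to report them to the user when the field is saved.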
YAML is a simple and nice solution. There is a YAML library in Python:
import yaml
output = {'a':1,'b':{'c':[2,3,4]}}
print(yaml.dump(output,default_flow_style=False))
Giving as a result:
a: 1
b:
  c:
  - 2
  - 3
  - 4
You can also parse YAML from a string, among other things. Explore the library and check whether it fits your requirements.
Good luck!

match hex string with list indice

I'm building a de-identify tool. It replaces all names by other names.
We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious.
Output:
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.
It can be done on multiple documents, and one name is always replaced by the same counterpart, so you can still understand who the text is talking about. BUT, all documents have an ID, referring to the person this file is about (I'm working with files in a public service) and only documents with the same people ID will be de-identified the same way, with the same names. (the goal is to watch evolution and people's history) This is a security measure, such as when I hand over the tool to a third party, I don't hand over the key to my own documents with it.
So the same input, with a different ID, produces :
We got a report that <name>Henry</name> met <name>Alicia</name> yesterday. <name>Henry</name> is suspicious.
Right now, I'm hashing each name with the document ID as a salt, converting the hash to an integer, then subtracting the length of the name list until I can use that integer as an index into the list. But I feel like there should be a quicker, more straightforward approach?
It's really more of an algorithmic question, but if it's of any relevance, I'm working with Python 2.7. Please ask for more explanation if needed. Thank you!
I hope it's clearer this way ô_o Sorry when you are neck-deep in your code you forget others need a bigger picture to understand how you got there.
As @LutzHorn pointed out, you could just use a dict to map real names to false ones.
You could also just do something like:
existing_names = []
for nameoccurrence in original_text:
    if nameoccurrence.name not in existing_names:
        # first time this name is seen: give it the next free id
        nameoccurrence.id = len(existing_names)
        existing_names.append(nameoccurrence.name)
    else:
        nameoccurrence.id = existing_names.index(nameoccurrence.name)
# swap every collected real name for a random replacement
for idx, _ in enumerate(existing_names):
    existing_names[idx] = gimme_random_name()
Try using a dictionary of names.
import re
names = {"Peter": "Billy", "Jane": "Elsa"}
# s is assumed to hold the input text from the question
for name in re.findall("<name>([a-zA-Z]+)</name>", s):
    s = re.sub("<name>" + name + "</name>", "<name>"+ names[name] + "</name>", s)
print(s)
Output:
'We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.'
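On the salted-hash approach described in the question: subtracting the list length until the integer fits is the same as taking the remainder, so a modulo does it in one step. The sketch below is an illustration under assumptions, not the asker's code: the pool REPLACEMENT_NAMES and the function pseudonym are made up, SHA-256 is just one possible hash, and two different real names can still land on the same replacement unless collisions are handled separately.

```python
import hashlib

# Hypothetical replacement pool; a real tool would use a much longer list.
REPLACEMENT_NAMES = ["Billy", "Elsa", "Henry", "Alicia", "Marta", "Oscar"]

def pseudonym(name, person_id, pool=REPLACEMENT_NAMES):
    """Pick a replacement deterministically from (person_id, name)."""
    salted = "{}:{}".format(person_id, name).encode("utf-8")
    digest = hashlib.sha256(salted).hexdigest()
    # The modulo replaces the repeated subtraction of len(pool).
    return pool[int(digest, 16) % len(pool)]

# Same ID and name always give the same replacement.
print(pseudonym("Peter", "case-001") == pseudonym("Peter", "case-001"))  # True
```

The same name under a different person ID is hashed with a different salt, so it will usually, though not always with such a small pool, map to a different replacement.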

Using Strings to Name Hash Keys?

I'm working through a book called "Head First Programming," and there's a particular part where I'm confused as to why they're doing this.
There doesn't appear to be any reasoning for it, nor any explanation anywhere in the text.
The issue in question is in using multiple-assignment to assign split data from a string into a hash (which doesn't make sense as to why they're using a hash, if you ask me, but that's a separate issue). Here's the example code:
line = "101;Johnny 'wave-boy' Jones;USA;8.32;Fish;21"
s = {}
(s['id'], s['name'], s['country'], s['average'], s['board'], s['age']) = line.split(";")
I understand that this will take the string line and split it up into each named part, but I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes.
The purpose of the individual parts is to be searched based on an individual element and then printed on screen. For example, being able to search by ID number and then return the entire thing.
The language in question is Python, if that makes any difference. This is rather confusing for me, since I'm trying to learn this stuff on my own.
My personal best guess is that it doesn't make any difference and that it was personal preference on part of the authors, but it bewilders me that they would suddenly change form like that without it having any meaning, and further bothers me that they don't explain it.
EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way. Therefore, I'd have to assume it's a matter of personal preference, but I still would like some info from someone who actually knows what they're doing as to whether it actually makes a difference, in the long run.
EDIT 2: Apparently, it doesn't make any sense as to how my Python interpreter is actually working with what I've given it, so I made a screen capture of it working https://www.youtube.com/watch?v=52GQJEeSwUA
"I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes"
The answer is right there. If there's no quote, mydict[s], then s is a variable, and you look up the key in the dict based on what the value of s is.
If it's a string, then you look up literally that key.
So, in your example s[name] won't work as that would try to access the variable name, which is probably not set.
"EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way."
That's just pure luck... There's a built-in function called id:
>>> id
<built-in function id>
Try another name, and you'll see that it won't work.
Actually, as it turns out, for dictionaries (Python's term for hashes) there is a semantic difference between having the quotes there and not.
For example:
s = {}
s['test'] = 1
s['othertest'] = 2
defines a dictionary called s with two keys, 'test' and 'othertest.' However, if I tried to do this instead:
s = {}
s[test] = 1
I'd get a NameError exception, because this would be looking for an undefined variable called test whose value would be used as the key.
If, then, I were to type this into the Python interpreter:
>>> s = {}
>>> s['test'] = 1
>>> s['othertest'] = 2
>>> test = 'othertest'
>>> print s[test]
2
>>> print s['test']
1
you'll see that using test as a key with no quotes uses the value of that variable to look up the associated entry in the dictionary s.
Edit: Now, the REALLY interesting question is why using s[id] gave you what you expected. The name "id" actually refers to a built-in function in Python that gives you a unique id for an object passed as its argument. What in the world the Python interpreter is doing with the expression s[id] is a total mystery to me.
Edit 2: Watching the OP's Youtube video, it's clear that he's staying consistent when assigning and reading the hash about using id or 'id', so there's no issue with the function id as a hash key somehow magically lining up with 'id' as a hash key. That had me kind of worried for a while.
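To make the s[id] case concrete: function objects are hashable in Python, so the built-in id can itself serve as a dictionary key, and it is a different key from the string 'id'. A small sketch (the stored values are made up for illustration):

```python
s = {}
s[id] = "stored under the function object"  # id is the built-in function
s["id"] = "stored under the string"

print(len(s))            # 2: two separate entries
print(s[id] == s["id"])  # False: the two keys do not collide
```

So using s[id] consistently for both writing and reading works, but only by accident, and it would fail with a NameError for a key name like name or country that is not already defined.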

Trouble calling Dictionary using Variable using Python

I have only been working with python for a few months,
so sorry if I am asking a stupid question. I am having
a problem calling a dictionary name using a variable.
The problem is, if I use a variable to call a dictionary & [] operators,
python interprets my code trying to return a single character in the string
instead of anything within the dictionary list.
To illustrate by an example ... let's say I
have a dictionary list like below.
USA={'Capital':'Washington',
'Currency':'USD'}
Japan={'Capital':'Tokyo',
'Currency':'JPY'}
China={'Capital':'Beijing',
'Currency':'RMB'}
country=input("Enter USA or JAPAN or China? ")
print(USA["Capital"]+USA["Currency"]) #No problem -> WashingtonUSD
print(Japan["Capital"]+Japan["Currency"]) #No problem -> TokyoJPY
print(China["Capital"]+China["Currency"]) #No problem -> BeijingRMB
print(country["Capital"]+country["Currency"]) #Error -> TypeError: string indices must be integers
In the example above, I understand the interpreter
is expecting an integer because it views the value
of "country" as a string instead of dictionary...
like if I put country[2] using Japan as input (for example),
it will return the character "p". But clearly that
is not what my intent is.
Is there a way I can work around this?
You should put your countries themselves into a dictionary, with the keys being the country names. Then you would be able to do COUNTRIES[country]["Capital"], etc.
Example:
COUNTRIES = dict(
USA={'Capital':'Washington',
'Currency':'USD'},
Japan={'Capital':'Tokyo',
'Currency':'JPY'},
...
)
country = input("Enter USA or Japan or China? ")
print(COUNTRIES[country]["Capital"])
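A self-contained version of the same idea, using the three countries from the question, might look like the following. The helper describe is made up for illustration; it uses dict.get so that an unknown name returns None instead of raising KeyError.

```python
COUNTRIES = {
    "USA": {"Capital": "Washington", "Currency": "USD"},
    "Japan": {"Capital": "Tokyo", "Currency": "JPY"},
    "China": {"Capital": "Beijing", "Currency": "RMB"},
}

def describe(country):
    """Return capital + currency, or None for an unknown country name."""
    info = COUNTRIES.get(country)
    if info is None:
        return None
    return info["Capital"] + info["Currency"]

print(describe("Japan"))     # TokyoJPY
print(describe("Atlantis"))  # None
```

Note that the keys are case-sensitive, so input such as "JAPAN" would need to be normalized before the lookup.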
Disclaimer: Any other way of doing it is definitely better than the way I'm about to show. This way will work, but it is not pythonic. I'm offering it for entertainment purposes, and to show that Python is cool.
USA={'Capital':'Washington',
'Currency':'USD'}
Japan={'Capital':'Tokyo',
'Currency':'JPY'}
China={'Capital':'Beijing',
'Currency':'RMB'}
country=input("Enter USA or Japan or China? ")
print(USA["Capital"]+USA["Currency"]) #No problem -> WashingtonUSD
print(Japan["Capital"]+Japan["Currency"]) #No problem -> TokyoJPY
print(China["Capital"]+China["Currency"]) #No problem -> BeijingRMB
# This works, but it is probably unwise to use it.
print(vars()[country]["Capital"] + vars()[country]['Currency'])
This works because the built-in function vars, when given no arguments, returns a dict of variables (and other stuff) in the current namespace. Each variable name, as a string, becomes a key in the dict.
But @tom's suggestion is actually a much better one.
