I have a long text file containing a number of strings. Here is part of the file:
tyh89= 13
kb2= 0
78%= yes
###bb1= 7634.0
iih54= 121
fgddd= no
#aa1= 0
#aa2= 1
#$ac3= 0
yt##hh= 0
#j= 12.1
##hf= no
So basically all elements have a common structure: header= value. My goal is to search for elements whose headers contain specific string parts and read out those elements' values.
At the moment I use a rather straightforward approach: open/read the whole file as a string, split it into a list of elements, and run if/elif conditions over all elements in a for loop. My code is below.
Is this the most efficient way to do it? Or is there a more efficient way that avoids the explicit loop?
def main():
    print(list(import_param()))

def import_param():
    fl = open('filename', 'r')
    cn = fl.read()
    cn = cn.split('\n')
    fl.close()
    for st in cn:
        if 'fgddd' in st:
            el = st.split(' ')
            yield float(el[1])
        elif '#j' in st:
            el = st.split(' ')
            yield float(el[1])

if __name__ == '__main__': main()
Yes, there is. Avoid testing whether a string contains a substring; focus on string equality instead.
Once you settle on equality, you can create a set of the known keywords, split each line on =, and test whether the set contains your key (an O(1) lookup):
key_set = {"fgddd","#j"}
for st in cn:
if '=' in st:
key,value = st.split("=",1)
if key in key_set:
el = value.strip()
yield float(el)
if you have different types, use a dictionary to convert to the proper type according to the key
key_set = {"fgddd":float ,"#j": float, "whatever":int , "something":str}
for st in cn:
if '=' in st:
key,value = st.split("=",1)
if key in key_set:
el = value.strip()
yield key_set[key](el) # apply type conversion
Note that if you don't want any conversion, str will do the job, since it returns the string unchanged when passed a string.
Final note: if you have a say in the input format, I suggest using JSON instead of a custom format. Parsing becomes trivial with the json module, and filtering can be done the same way I've shown.
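For illustration, here is a minimal sketch of what that could look like, assuming the same parameters were stored as a JSON object (the filename and file structure here are hypothetical):

import json

key_set = {"fgddd": float, "#j": float}

# params.json is a hypothetical file such as: {"fgddd": "12.5", "#j": "12.1", "kb2": "0"}
with open('params.json') as fl:
    data = json.load(fl)

values = [key_set[k](v) for k, v in data.items() if k in key_set]
print(values)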
I have a dataframe which contains the below column:
column_name
CUVITRU 8 gram
CUVITRU 1 grams
I want to replace these gram and grams to gm. So I have created a dictionary
dict_ = {'gram':'gm','grams':'gm'}
I am able to replace it but it is converting grams to gms. Below is the column after conversion:
column_name
CUVITRU 8 gm
CUVITRU 1 gms
How can I solve this issue?
Below is my code:
dict_ = {'gram': 'gm', 'grams': 'gm'}
for key, value in dict_.items():
    my_string = my_string.replace(key, value)
my_string = ' '.join(unique_list(my_string.split()))

def unique_list(l):
    ulist = []
    [ulist.append(x) for x in l if x not in ulist]
    return ulist
Because it finds 'gram' in 'grams'. One way is to use a regular expression instead of a plain string replacement, matching on word boundaries with a pattern like r"\b%s\b" % key. Look at the answer using .sub here for example: search-and-replace-with-whole-word-only-option
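A minimal sketch of that idea (the word-boundary pattern is an assumption about what the linked answer uses):

import re

dict_ = {'gram': 'gm', 'grams': 'gm'}
my_string = 'CUVITRU 8 gram CUVITRU 1 grams'

for key, value in dict_.items():
    # \b matches a word boundary, so 'gram' no longer matches inside 'grams'
    my_string = re.sub(r'\b%s\b' % re.escape(key), value, my_string)

print(my_string)  # CUVITRU 8 gm CUVITRU 1 gm

With word boundaries in place, the order of the replacements no longer matters.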
You don't actually care about the dict; you care about the key/value pairs produced by its items() method, so just store those in the first place. This lets you specify the order of replacements to try regardless of your Python version.
d = [('grams', 'gm'), ('gram', 'gm')]
for key, value in d:
    my_string = my_string.replace(key, value)
You can make replacements in the reverse order of the key lengths instead:
dict_ = {'gram': 'gm', 'grams': 'gm'}
for key in sorted(dict_, key=len, reverse=True):
    my_string = my_string.replace(key, dict_[key])
Put the longer string grams before the shorter one gram like this {'grams':'gm','gram':'gm'}, and it will work.
Well, I'm using a recent Python 3 like 3.7.2, which guarantees that items are retrieved in the same order they were inserted into the dictionary. For earlier Pythons that may happen (and this appears to be the problem) but isn't guaranteed.
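If you need to be explicit about ordering on those earlier versions, one option (my suggestion, not part of the answer above) is collections.OrderedDict with the longer key first:

from collections import OrderedDict

dict_ = OrderedDict([('grams', 'gm'), ('gram', 'gm')])  # 'grams' is tried first
my_string = 'CUVITRU 8 gram CUVITRU 1 grams'
for key, value in dict_.items():
    my_string = my_string.replace(key, value)
print(my_string)  # CUVITRU 8 gm CUVITRU 1 gm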
I have data that look like this:
data = 'somekey:value4thekey&second-key:valu3-can.be?anything&third_k3y:it%can have spaces;too'
In a nice human-readable way it would look like this:
somekey : value4thekey
second-key : valu3-can.be?anything
third_k3y : it%can have spaces;too
How should I parse the data so that data['somekey'] gives me value4thekey?
Note: the & connects all of the different items.
How I am currently tackling it
Currently, I use this ugly solution:
all = data.split('&')
for i in all:
    if i.startswith('somekey'):
        print i
This solution is very bad due to multiple obvious limitations. It would be much better if I could somehow parse it into a Python tree object.
I'd split the string by & to get a list of key-value strings, and then split each such string by : to get key-value pairs. Using dict and list comprehensions actually makes this quite elegant:
result = {k:v for k, v in (part.split(':') for part in data.split('&'))}
You can parse your data directly into a dictionary: split on the item separator &, then split each item again on the key/value separator ::
table = {
    key: value for key, value in
    (item.split(':') for item in data.split('&'))
}
This allows you direct access to elements, e.g. as table['somekey'].
If you don't have objects within a value, you can parse it to a dictionary
structure = {}
for ele in data.split('&'):
    ele_split = ele.split(':')
    structure[ele_split[0]] = ele_split[1]
You can now use structure to get the values:
print structure["somekey"]
#returns "value4thekey"
Since each item has the common format key:value, you can use the colon as a parameter to split on:
for i in x.split("&"):
    print(i.split(":"))
Each split produces a two-element list where index 0 is the key and index 1 is the value. Iterate through these pairs and load them into a dictionary, as sketched below, and you should be good!
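A minimal sketch of that loading step (variable names carried over from the snippet above):

x = 'somekey:value4thekey&second-key:valu3-can.be?anything&third_k3y:it%can have spaces;too'

table = {}
for i in x.split("&"):
    key, value = i.split(":", 1)  # split only on the first colon
    table[key] = value

print(table['somekey'])  # value4thekey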
I'd format data to YAML and parse the YAML
import re
import yaml

data = 'somekey:value4thekey&second-key:valu3-can.be?anything&third_k3y:it%can have spaces;too'

yaml_data = re.sub('[:]', ': ', re.sub('[&]', '\n', data))
y = yaml.load(yaml_data)  # with newer PyYAML, prefer yaml.safe_load

for k in y:
    print "%s : %s" % (k, y[k])
Here's the output:
third_k3y : it%can have spaces;too
somekey : value4thekey
second-key : valu3-can.be?anything
Let's say I know beforehand that the string
"key1:key2[]:key3[]:key4" should map to "newKey1[]:newKey2[]:newKey3"
then given "key1:key2[2]:key3[3]:key4",
my method should return "newKey1[2]:newKey2[3]:newKey3"
(the order of numbers within the square brackets should stay, like in the above example)
My solution looks like this:
import re

predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3"}

def transform(parent_key, parent_key_with_index):
    indexes_in_parent_key = re.findall(r'\[(.*?)\]', parent_key_with_index)
    target_list = predefined_mapping[parent_key].split(":")
    t = []
    i = 0
    for elem in target_list:
        try:
            sub_result = re.subn(r'\[(.*?)\]', '[{}]'.format(indexes_in_parent_key[i]), elem)
            if sub_result[1] > 0:
                i += 1
            new_elem = sub_result[0]
        except IndexError:
            new_elem = elem
        t.append(new_elem)
    print ":".join(t)

transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4")
prints newKey1[2]:newKey2[3]:newKey3 as the result.
Can someone suggest a better, more elegant solution (especially around the usage of regex)?
Thanks!
You can do it a bit more elegantly by simply splitting the mapped structure on [], then interspersing the indexes from the actual data and, finally, joining everything together:
import re
import itertools

# split the map immediately on [] so that you don't have to split each time on transform
predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3".split("[]")}

def transform(key, source):
    mapping = predefined_mapping.get(key, None)
    if not mapping:  # no mapping for this key found, return unaltered
        return source
    indexes = re.findall(r'\[.*?\]', source)  # get individual indexes
    return "".join(i for e in itertools.izip_longest(mapping, indexes) for i in e if i)

print(transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4"))
# newKey1[2]:newKey2[3]:newKey3
NOTE: On Python 3 use itertools.zip_longest() instead.
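For reference, a Python 3 version of the same sketch only changes that one call:

import re
import itertools

predefined_mapping = {"key1:key2[]:key3[]:key4": "newKey1[]:newKey2[]:newKey3".split("[]")}

def transform(key, source):
    mapping = predefined_mapping.get(key)
    if not mapping:
        return source
    indexes = re.findall(r'\[.*?\]', source)
    # zip_longest replaces Python 2's izip_longest
    return "".join(i for e in itertools.zip_longest(mapping, indexes) for i in e if i)

print(transform("key1:key2[]:key3[]:key4", "key1:key2[2]:key3[3]:key4"))
# newKey1[2]:newKey2[3]:newKey3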
I still think you're over-engineering this and that there is probably a much more elegant and far less error-prone approach to the whole problem. I'd advise stepping back and looking at the bigger picture instead of hammering out this particular solution just because it seems to be addressing the immediate need.
I am currently in the process of using Python to transmit a dictionary from one Raspberry Pi to another over a 433MHz link, using virtual wire (vw.py) to send the data.
The issue with vw.py is that the data being sent is in string format.
I am successfully receiving the data on PI_no2, and now I am trying to reformat the data so it can be placed back in a dictionary.
I have created a small snippet to test with, using a temporary string in the same format as it is received from vw.py.
So far I have successfully split the string at the colon, and I am now trying to get rid of the quotes, without much success.
my_status = {}

# temp is in the format the data is received
temp = "'mycode':['1','2','firstname','Lastname']"

key, value = temp.split(':')
print key
print value

key = key.replace("'", '')
value = value.replace("'", '')

my_status.update({key: value})
print my_status
Gives the result
'mycode'
['1','2','firstname','Lastname']
{'mycode': '[1,2,firstname,Lastname]'}
I require the value to be in the format
['1','2','firstname','Lastname']
but the replace gets rid of all the single quotes.
You can use ast.literal_eval
import ast
temp = "'mycode':['1','2','firstname','Lastname']"
key,value = map(ast.literal_eval, temp.split(':'))
status = {key: value}
Will output
{'mycode': ['1', '2', 'firstname', 'Lastname']}
This shouldn't be hard to solve. What you need to do is strip away the [ ] in your list string, then split by ,. Once you've done this, iterate over the elements and add them to a list. Your code should look like this:
string = "[1,2,firstname,lastname]"
string = string.strip("[")
string = string.strip("]")
values = string.split(",")
final_list = []
for val in values:
final_list.append(val)
print final_list
This will return:
> ['1','2','firstname','lastname']
Then take this list and insert it into your dictionary:
d = {}
d['mycode'] = final_list
The advantage of this method is that you can handle each value independently. If you need to convert 1 and 2 to int then you'll be able to do that while leaving the other two as str.
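A minimal sketch of that per-value handling (the digits-only rule for deciding what counts as an int is my assumption):

final_list = ['1', '2', 'firstname', 'lastname']

converted = []
for val in final_list:
    # convert purely numeric entries to int, leave the rest as str
    converted.append(int(val) if val.isdigit() else val)

print(converted)  # [1, 2, 'firstname', 'lastname']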
Alternatively to cricket_007's suggestion of using a syntax tree parser: your format is very similar to the standard YAML format. This is a pretty lightweight and intuitive framework, so I'll suggest it.
a = "'mycode':['1','2','firstname','Lastname']"
print yaml.load(a.replace(":",": "))
# prints the dictionary {'mycode': ['1', '2', 'firstname', 'Lastname']}
The only difference between your format and YAML is that the colon needs a space after it.
It also will distinguish between primitive data types for you, if that's important. Drop the quotes around 1 and 2 and it determines that they're numerical.
Tadhg McDonald-Jensen suggested pickling in the comments. This will allow you to store more complicated objects, though you may lose the human-readable format you've been experimenting with.
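A minimal sketch of that idea, assuming you can transmit raw bytes over the link (the vw.py specifics are left out):

import pickle

status = {'mycode': ['1', '2', 'firstname', 'Lastname']}

# sender side: serialize the dictionary to bytes
payload = pickle.dumps(status)

# receiver side: rebuild the original object from the bytes
restored = pickle.loads(payload)
print(restored)  # {'mycode': ['1', '2', 'firstname', 'Lastname']}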
I am reading a text file with Python, where the values in each column may be numeric or strings.
When those values are strings, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).
What would be an efficient way to do it?
Use a defaultdict with a default value factory that generates new ids:
import collections
import itertools

ids = collections.defaultdict(itertools.count().next)
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0
When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.
Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
defaultdict answer updated for Python 3, where .next is now .__next__, and for pylint compliance, where using "magic" __*__ methods is discouraged:
ids = collections.defaultdict(functools.partial(next, itertools.count()))
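Applied to the original per-column problem, a minimal sketch (the column count and the lookups are assumed for illustration):

import collections
import functools
import itertools

def make_id_dict():
    # each column gets its own independent counter
    return collections.defaultdict(functools.partial(next, itertools.count()))

column_ids = [make_id_dict() for _ in range(3)]  # assuming three columns

print(column_ids[0]['foo'])  # 0
print(column_ids[0]['bar'])  # 1
print(column_ids[1]['foo'])  # 0 -- independent of column 0
print(column_ids[0]['foo'])  # 0 -- same ID as before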
Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique id of each string. Use this ID when you are writing the file out again.
Here I am assuming the second column is the one you want to scan for text or integers.
import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # adds the string to the set

# print the unique ids for each string
for id, text in enumerate(seen):
    print("{}: {}".format(id, text))
Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can have a list of sets. Suppose the file has three columns:
unique_strings = [set(), set(), set()]

with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
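To finish the job as described, you could then number the strings in each column's set (a sketch; note that set iteration order is arbitrary, so build each mapping once and reuse it):

# build one {string: id} mapping per column
ids_per_column = [
    {text: id for id, text in enumerate(strings)}
    for strings in unique_strings
]

# print the unique ids assigned in column 0
for text, id in ids_per_column[0].items():
    print("{}: {}".format(id, text))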