Testing if strip in column was successful with polars - python

I have developed a function that strips whitespace from a dataset using polars. Now I want to write a test that checks whether the strip was successful. I want to use the following logic, but this code is written for pandas and plain Python loops. How can I do the same with polars?
def test_strip():
    df = pd.DataFrame({
        'ID': [1, 1, 1, 1, 1],
        'Entity': ['Entity 1 ', 'Entity 2', 'Entity 3', 'Entity 4', 'Entity 5'],
        'Table': ['Table 1', ' Table 2', 'Table 3', 'Table 4', None],
        'Local': ['Local 1', 'Local 2 ', None, 'Local 4', 'Local 5'],
        'Global': ['Global 1', ' Global 2', 'Global 3', None, ' Global 5'],
        'mandatory': ['M', 'M', 'M', 'CM ', 'M']
    })
    job = first_job(
        config=test_config,
        copying_list=copying,
    )
    result = job.run(df)
    df_clean, *_ = result
    for column in df_clean.columns:
        for value in df_clean[column]:
            if isinstance(value, str) and (value.startswith(" ") or value.endswith(" ")):
                raise AssertionError(f"Strip failed for column '{column}'")

This should do it...
def test_strip(df):
    bad_rows = df.filter(
        pl.any([pl.col(x).str.contains("(^ )|( $)") for x in df.columns])
    )
    if bad_rows.shape[0] == 0:
        return "all good"
    else:
        str_cols = ', '.join(
            bad_rows.melt()
            .filter(pl.col('value').str.contains("(^ )|( $)"))
            .get_column('variable')
            .unique()
            .to_list()
        )
        raise AssertionError(f"Strip failed for column(s): {str_cols}")
The meat and potatoes is the bad_rows assignment. It uses a list comprehension to build one regex check per column, where the regex matches a space at the beginning of the string (^ ) or at the end of the string ( $). The checks are wrapped in pl.any so that a match in any column keeps the row. If bad_rows has zero rows, everything worked and the function returns a message saying so. Otherwise it raises the error and tells you which columns were bad.

Related

How can I return the entire dictionary?

This is my method. I am having trouble with returning the entire dictionary
def get_col(amount):
    letter = 0
    value = []
    values = {}
    for i in range(amount):
        letter = get_column_letter(i + 1)
        [value.append(row.value) for row in ws[letter]]
        values = dict(zip(letter, [value]))
        value = []
    return values
I want it to output it like this:
{'A': ['ID', 'value is 1', 'value is 2', 'value is 3', 'value is 4', 'value is 5', 'value is 6']}
{'B': ['Name', 'value is 1', 'value is 2', 'value is 3', 'value is 4', 'value is 5', 'value is 6']}
{'C': ['Math', 'value is 1', 'value is 2', 'value is 3', 'value is 4', 'value is 5', 'value is 6']}
But when the return is inside the 'for' loop, it only returns
{'A': ['ID', 'value is 1', 'value is 2', 'value is 3', 'value is 4', 'value is 5', 'value is 6']}
and when the return is outside the 'for' loop, it returns
{'C': ['Math', 'value is 1', 'value is 2', 'value is 3', 'value is 4', 'value is 5', 'value is 6']}
Any help would be appreciated. Thank you!
I am assuming you want all of the data in one dictionary:
values = dict(zip(letter, [value]))
Currently this part of your code overwrites the dictionary on every iteration. That is why you get only the "A" dict when you return before the for loop finishes, and why you get only the "C" dict when you return after the loop: the "A" and "B" dicts were overwritten.
Put the return outside the for loop afterwards, and instead of
values = dict(zip(letter, [value]))
use
values[letter] = value
as this will append more keys/values to the dict.
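To make the difference concrete, here is a minimal sketch of the single-dictionary version. It does not touch a real workbook: the `sheet` dict and the use of `ascii_uppercase` are stand-ins I made up for `ws` and `get_column_letter`, so treat them as assumptions (and note the stand-in only covers columns A to Z).

```python
from string import ascii_uppercase

# Hypothetical stand-in for the worksheet: column letter -> cell values.
sheet = {
    "A": ["ID", "value is 1", "value is 2"],
    "B": ["Name", "value is 1", "value is 2"],
    "C": ["Math", "value is 1", "value is 2"],
}

def get_col(amount):
    values = {}
    for i in range(amount):
        # Stand-in for get_column_letter(i + 1); only valid for A..Z.
        letter = ascii_uppercase[i]
        # Assign under a new key instead of rebuilding the whole dict.
        values[letter] = list(sheet[letter])
    # Return once, after every column has been collected.
    return values

print(get_col(3))
```

Because each iteration adds a key rather than replacing `values`, the returned dictionary holds all three columns.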
ps. This is my first post, I hope it helps and is understandable.
edit: If you want a list of three dictionaries, as your desired output shows, do this:
def get_col(amount):
    letter = 0
    value = []
    values = []
    for i in range(amount):
        letter = get_column_letter(i + 1)
        [value.append(row.value) for row in ws[letter]]
        values.append(dict(zip(letter, [value])))
        value = []
    return values
Your desired output is not a single dictionary. It's a list of dictionaries.
In the for loop, you create a new dictionary at each iteration. When you return, you get either the first dictionary (if the return is inside the loop) or the last one (if the return is outside the loop).
You need to return a list of the created dictionaries:
def get_col(amount):
    letter = 0
    value = []
    values = {}
    values_list = []
    for i in range(amount):
        letter = get_column_letter(i + 1)
        [value.append(row.value) for row in ws[letter]]
        values = dict(zip(letter, [value]))
        value = []
        values_list.append(values)
    return values_list

Adding values from one list to another when they share value

I'm trying to add values from List2 if the type is the same in List1. All the data is strings within lists. This isn't the exact data I'm using, just a representation. This is my first programme so please excuse any misunderstandings.
List1 = [['Type A =', 'Value 1', 'Value 2', 'Value 3'], ['Type B =', 'Value 4', 'Value 5']]
List2 = [['Type Z =', 'Value 6', 'Value 7', 'Value 8'], ['Type A =', 'Value 9', 'Value 10', 'Value 11'], ['Type A =', 'Value 12', 'Value 13']]
Desired result:
new_list =[['Type A =', 'Value 1', 'Value 2', 'Value 3', 'Value 9', 'Value 10', 'Value 11', 'Value 12', 'Value 13'], ['Type B =', 'Value 4', 'Value 5']]
Current attempt:
newlist = []
for values in List1:
    for valuestoadd in List2:
        if values[0] == valuestoadd[0]:
            newlist = [List1 + [valuestoadd[1:]]]
        else:
            print("Types don't match")
return newlist
This would work if there weren't two Type A entries in List2, but as it is, my code creates two instances of List1. If I could add the values at a specific index of the list, that would be great, but I can work around that.
It's probably easier to use a dictionary for this:
def merge(d1, d2):
    return {k: v + d2[k] if k in d2 else v for k, v in d1.items()}

d1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}
d2 = {'A': [7, 8, 9], 'C': [0]}
print(merge(d1, d2))
If you must use a list, it's fairly easy to temporarily convert to a dictionary and back to a list:
from collections import defaultdict

def list_to_dict(xss):
    d = defaultdict(list)
    for xs in xss:
        d[xs[0]].extend(xs[1:])
    return d

def dict_to_list(d):
    return [[k, *v] for k, v in d.items()]
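As a rough sketch of how these pieces might be chained on the lists from the question (the composition into a single expression is my own assumption about the intended use, not something spelled out above):

```python
from collections import defaultdict

def merge(d1, d2):
    # Keep only the keys of d1; extend a value when d2 has the same key.
    return {k: v + d2[k] if k in d2 else v for k, v in d1.items()}

def list_to_dict(xss):
    d = defaultdict(list)
    for xs in xss:
        # The first element is the type, the rest are its values.
        d[xs[0]].extend(xs[1:])
    return d

def dict_to_list(d):
    return [[k, *v] for k, v in d.items()]

List1 = [['Type A =', 'Value 1', 'Value 2', 'Value 3'],
         ['Type B =', 'Value 4', 'Value 5']]
List2 = [['Type Z =', 'Value 6', 'Value 7', 'Value 8'],
         ['Type A =', 'Value 9', 'Value 10', 'Value 11'],
         ['Type A =', 'Value 12', 'Value 13']]

new_list = dict_to_list(merge(list_to_dict(List1), list_to_dict(List2)))
print(new_list)
```

Both 'Type A =' rows of List2 end up under one key before the merge, so the duplicate type no longer matters, and 'Type Z =' is dropped because merge only walks the keys of the first dictionary.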
Rather than using List1 + [valuestoadd[1:]], you should be using newlist[0].append(valuestoadd[1:]) so that it doesn't ever create a new list and only appends to the old one. The [0] is necessary so that it appends to the first sublist rather than the whole list.
newlist = List1 #you're doing this already - might as well initialize the new list with this code
for values in List1:
    for valuestoadd in List2:
        if values[0] == valuestoadd[0]:
            newlist[0].append(valuestoadd[1:]) #adds the values on to the end of the first list
        else:
            print("Types don't match")
Output:
[['Type A =', 'Value 1', 'Value 2', 'Value 3', ['Value 9', 'Value 10', 'Value 11'], ['Value 12', 'Value 13']], ['Type B =', 'Value 4', 'Value 5']]
This does, sadly, input the values as a list - if you want to split them into individual values, you would need to iterate through the lists you're adding on, and append individual values to newlist[0].
This could be achieved with another for loop, like so:
        if values[0] == valuestoadd[0]:
            for subvalues in valuestoadd[1:]: #splits the list into subvalues
                newlist[0].append(subvalues) #appends those subvalues
Output:
[['Type A =', 'Value 1', 'Value 2', 'Value 3', 'Value 9', 'Value 10', 'Value 11', 'Value 12', 'Value 13'], ['Type B =', 'Value 4', 'Value 5']]
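One caveat: appending to newlist[0] only works while the matching type is the first sublist, and newlist = List1 means List1 itself gets modified. A possible variant, sketched under the assumption that any sublist of List1 may match and that List1 should stay untouched (the function name merge_lists is made up for this example):

```python
import copy

def merge_lists(base, extra):
    # Work on a deep copy so the caller's list is not mutated.
    merged = copy.deepcopy(base)
    for values in merged:
        for valuestoadd in extra:
            if values[0] == valuestoadd[0]:
                # extend adds the individual values, not a nested list,
                # and targets the sublist that actually matched.
                values.extend(valuestoadd[1:])
    return merged

List1 = [['Type A =', 'Value 1', 'Value 2', 'Value 3'],
         ['Type B =', 'Value 4', 'Value 5']]
List2 = [['Type Z =', 'Value 6', 'Value 7', 'Value 8'],
         ['Type A =', 'Value 9', 'Value 10', 'Value 11'],
         ['Type A =', 'Value 12', 'Value 13']]

print(merge_lists(List1, List2))
```

Using values.extend(...) replaces both the [0] index and the inner loop over subvalues.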
I agree with the other answers that it would be better to use a dictionary right away. But if, for some reason, you want to stick to the data structure you have, you could transform it into a dictionary and back:
type_dict = {}
for tlist in List1+List2:
    curr_type = tlist[0]
    type_dict[curr_type] = tlist[1:] if curr_type not in type_dict else type_dict[curr_type]+tlist[1:]
new_list = [[k] + type_dict[k] for k in type_dict]
In the creation of new_list, you can take the keys from a subset of type_dict only if you do not want to include all of them.

how to extract values from python sublists

data_sets = [
    ['O'],
    ['X'],
    # These data sets put Sheet A in all possible locations and orientations
    # Data sets 2 - 9
    ['O', ['Sheet A', 'Location 1', 'Upright']],
    ['O', ['Sheet A', 'Location 2', 'Upright']],
    ['O', ['Sheet A', 'Location 3', 'Upright']],
    ['O', ['Sheet A', 'Location 4', 'Upright']],
    ['O', ['Sheet A', 'Location 1', 'Upside down']],
    ['O', ['Sheet A', 'Location 2', 'Upside down']],
    ['O', ['Sheet A', 'Location 3', 'Upside down']],
    ['O', ['Sheet A', 'Location 4', 'Upside down']]
]
for each in data_sets:
    if 'Sheet A' in each:
        print('1')
When I run this, it doesn't print anything. I don't think it's going through all the sublists. How can I get this to work?
You can use itertools.chain.from_iterable:
import itertools
for each in data_sets:
    if "Sheet A" in itertools.chain.from_iterable(each):
        print("1")
1
1
1
1
1
1
1
1
in is not recursive. It tries to find the item in the list itself. If the item is a list, in won't go down in the list to look for the string.
In your case, you could check that the list has at least 2 items and then perform in on the second item, like this:
for each in data_sets:
    if len(each)>1 and 'Sheet A' in each[1]:
        print('1')
Of course, if the structure is more complex or not fixed, you have to use a recursive approach that tests the item type, as in: Python nested list recursion search
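A minimal sketch of such a recursive check, assuming the nesting consists only of lists and strings (the helper name contains and the shortened data_sets are made up for this example):

```python
def contains(item, target):
    """Return True if target occurs anywhere inside a nested list."""
    if isinstance(item, list):
        # Recurse into every element of the list.
        return any(contains(sub, target) for sub in item)
    # Leaf value: compare directly.
    return item == target

data_sets = [
    ['O'],
    ['X'],
    ['O', ['Sheet A', 'Location 1', 'Upright']],
    ['O', ['Sheet A', 'Location 2', 'Upside down']],
]

matches = [each for each in data_sets if contains(each, 'Sheet A')]
print(len(matches))  # prints 2
```

The isinstance test is what makes the search go down into sublists, which a plain in does not do.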
def listChecker(list_elems):
    for list_elem in list_elems:
        if "Sheet A" in list_elem:
            print("1")
        if any(isinstance(elem, list) for elem in list_elem):
            listChecker(list_elem)

listChecker(data_sets)
You can also use this function. It prints 1 for every nested list that contains Sheet A. Just pass your list object to it.
You can also check it with count.
for each in data_sets:
    if len(each)>1 and each[1].count("Sheet A"):
        print('1')
len(each)>1 checks the number of items in the list.
each[1] is the second item of the given list, which here is a sublist, and .count("Sheet A") returns the number of occurrences of Sheet A in it.

programmatically generate key and values to put in python dictionary

I want to generate a large number of key value pairs to put in my dictionary using a for loop. For example, the dictionary looks like this:
my_dict = dict()
my_dict["r0"] = "tag 0"
my_dict["r1"] = "tag 1"
my_dict["r2"] = "tag 2"
...
Note that both the key and the value follow a pattern, i.e., the number increases by 1. I cannot write this out 1M times and would prefer an automatic way to initialize my dictionary.
The most efficient way to do this is probably with a dict comprehension:
mydict={'r%s'%n : 'tag %s'%n for n in range(10)}
Which is equivalent to:
mydict=dict()
for n in range(10):
    mydict.update({'r%s'%n:'tag %s'%n})
... but more efficient. Just change range(10) as necessary.
You could also use .format() formatting instead of percent (C-like) formatting in the dict:
mydict={'r{}'.format(n) : 'tag {}'.format(n) for n in range(10)}
If you are using Python2 replace all the range() functions with xrange() functions
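On Python 3.6 or newer, the same comprehension could also be written with f-strings. This is just a sketch of that variant; n_items is a name I introduced so the size is easy to change:

```python
n_items = 10  # hypothetical size; use 1_000_000 for the full dictionary

# Same dict comprehension as above, using f-strings for key and value.
my_dict = {f"r{n}": f"tag {n}" for n in range(n_items)}

print(my_dict["r2"])  # prints: tag 2
```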
my_dict = dict()
for i in range(0, 1000000):
    key = "r{}".format(i)
    value = "tag {}".format(i)
    my_dict[key] = value
EDIT: As pointed out by others, if you are using python 2 use xrange instead since it is lazy (so more efficient). In Python 3 range does the same thing as xrange in python 2
my_dict = dict()
for i in xrange(1000000):
    my_dict["r%s" % i] = "tag %s" % i
my_dict = dict()
for x in range(1000000):
    key="r"+str(x)
    val="tag " +str(x)
    my_dict[key]=val
A simple way is to do the following:
#using python format strings
keyf = "r{}"
valf = "tag {}"
#dictionary comprehension
a = {keyf.format(i) : valf.format(i) for i in range(5)}
# can modify range to handle 1,000,000 if you wanted
print(a)
{'r0': 'tag 0', 'r1': 'tag 1', 'r2': 'tag 2', 'r3': 'tag 3', 'r4': 'tag 4'}
If you want to quickly add this to another dictionary, use the dictionary equivalent of extend, which is called update.
b = {"x": 1, "y": 2}
b.update(a)
print(b)
{'x': 1, 'y': 2, 'r0': 'tag 0', 'r1': 'tag 1', 'r2': 'tag 2', 'r3': 'tag 3', 'r4': 'tag 4'}
You could also shorten the original comprehension by doing this:
a = {"r{}".format(i) : "tag {}".format(i) for i in range(5)}
Then you wouldn't even need to define keyf or valf.
Python can build dicts from lists:
$ python2 -c "print dict(map(lambda x: ('r' + str(x), 'tag ' + str(x)), range(10)))"
{'r4': 'tag 4', 'r5': 'tag 5', 'r6': 'tag 6', 'r7': 'tag 7', 'r0': 'tag 0', 'r1': 'tag 1', 'r2': 'tag 2', 'r3': 'tag 3', 'r8': 'tag 8', 'r9': 'tag 9'}

Reduce a list of 2-dicts, to a list unique by the one item, concatenating the other items in python3

I'm trying to transform some data:
[
{'delim_type': '', 'arcana_list': 'Life 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Mind 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Prime 3'},
]
To look like this:
[
{'delim_type': '', 'arcana_list': 'Life 3'},
{'delim_type': ' and/or ', 'arcana_list': ['Mind 3', 'Prime 3']},
]
Basically where the delim type is the same, append one of the arcana_list items to the other.
I've tried to look up zipping, chaining iterators and so on, but I can't find a short, Pythonic way of doing this without unpacking the data and then repacking it. I feel it should be doable with a list comprehension, but my python-fu is weak.
I'm sure there's a slicker way to do this, but here's what I came up with
def combineDicts(old):
    from itertools import groupby
    new = []
    # group consecutive items that share the same delim_type
    for k,g in groupby(old, lambda i: i['delim_type']):
        d = {'delim_type':k}
        d['arcana_list'] = ', '.join(i['arcana_list'] for i in g)
        new.append(d)
    return new
Testing
>>> old = [
{'delim_type': '', 'arcana_list': 'Life 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Mind 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Prime 3'},
]
>>> combineDicts(old)
[{'arcana_list': 'Life 3', 'delim_type': ''},
{'arcana_list': 'Mind 3, Prime 3', 'delim_type': ' and/or '}]
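One thing to keep in mind is that itertools.groupby only merges items that sit next to each other. If the input is not already ordered by delim_type, a sketch like the following sorts first. The function name combine_dicts_sorted and the shuffled sample data are my own additions, and sorting changes the order of the groups:

```python
from itertools import groupby

def combine_dicts_sorted(old):
    key = lambda item: item['delim_type']
    # sorted() is stable, so items keep their relative order within a group.
    return [
        {'delim_type': k, 'arcana_list': ', '.join(i['arcana_list'] for i in g)}
        for k, g in groupby(sorted(old, key=key), key)
    ]

shuffled = [
    {'delim_type': ' and/or ', 'arcana_list': 'Mind 3'},
    {'delim_type': '', 'arcana_list': 'Life 3'},
    {'delim_type': ' and/or ', 'arcana_list': 'Prime 3'},
]
print(combine_dicts_sorted(shuffled))
```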
[
{'delim_type': '', 'arcana_list': 'Life 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Mind 3, Prime 3'},
]
Unless you are stuck with the above format, I would use the following format instead (which allows for much easier lookups):
{' and/or ': ['Mind 3', 'Prime 3'],
 '': ['Life 3']}
This can be done very easily:
from collections import defaultdict
d = defaultdict(list)
for node in my_list:
    d[node['delim_type']].append(node['arcana_list'])
You can of course do defaultdict(str) and += ', ' if strings are preferable to a list (though I'm not sure when that would be).
Example:
my_list = [
{'delim_type': '', 'arcana_list': 'Life 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Mind 3'},
{'delim_type': ' and/or ', 'arcana_list': 'Prime 3'},
]
output:
defaultdict(<class 'list'>, {'': ['Life 3'], ' and/or ': ['Mind 3', 'Prime 3']})
usage:
In [19]: d['']
Out[19]: ['Life 3']
In [20]: d[' and/or ']
Out[20]: ['Mind 3', 'Prime 3']
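If the list-of-dicts shape from the question is still needed afterwards, the grouped dictionary can be turned back into it. This is only a sketch: the function name combine is made up, and keeping a single value as a plain string (rather than a one-item list) is an assumption based on the desired output shown in the question.

```python
from collections import defaultdict

def combine(records):
    grouped = defaultdict(list)
    for node in records:
        grouped[node['delim_type']].append(node['arcana_list'])
    # Rebuild the original shape; a lone value stays a plain string.
    return [
        {'delim_type': k, 'arcana_list': v[0] if len(v) == 1 else v}
        for k, v in grouped.items()
    ]

my_list = [
    {'delim_type': '', 'arcana_list': 'Life 3'},
    {'delim_type': ' and/or ', 'arcana_list': 'Mind 3'},
    {'delim_type': ' and/or ', 'arcana_list': 'Prime 3'},
]
print(combine(my_list))
```

Unlike groupby, this does not require items with the same delim_type to be adjacent.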
