I need to write a parameterized for loop.
# This works but...
df["ID"]=np_get_defined(df["methodA"+"ID"], df["methodB"+"ID"],df["methodC"+"ID"])
# I need a for loop as follows
df["ID"]=np_get_defined(df[sm+"ID"] for sm in strmethods)
and I get the following error:
ValueError: Length of values does not match length of index
Remaining definitions:
import numpy as np
import pandas as pd
df is a pandas.DataFrame
strmethods = ['methodA', 'methodB', 'methodC']
def get_defined(*args):
    # keep only the values that are defined: not null, no 'N/A', not '0'
    strs = [str(arg) for arg in args if not pd.isnull(arg) and 'N/A' not in str(arg) and arg != '0']
    return ''.join(strs) if strs else None
np_get_defined = np.vectorize(get_defined)
df["ID"]=np_get_defined(df[sm+"ID"] for sm in strmethods) passes a single generator object as the only argument to the called function.
To expand the generated sequence into separate arguments, use the * operator:
df["ID"] = np_get_defined(*(df[sm + "ID"] for sm in strmethods))
# or:
df["ID"] = np_get_defined(*[df[sm + "ID"] for sm in strmethods])
The first unpacks a generator expression and the second unpacks a list comprehension; the result is the same in either case.
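A minimal sketch of the difference, using a hypothetical helper count_args (not part of the original code) that only reports how many positional arguments it received:

```python
def count_args(*args):
    """Hypothetical helper: report how many positional arguments arrived."""
    return len(args)

strmethods = ['methodA', 'methodB', 'methodC']

# Without *, the whole generator object is passed as one argument.
without_star = count_args(sm + "ID" for sm in strmethods)

# With *, each generated element becomes its own positional argument.
with_star = count_args(*(sm + "ID" for sm in strmethods))

print(without_star, with_star)  # 1 3
```

The same applies to np_get_defined: without the *, it sees one generator instead of three columns.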
I think the reason why it doesn't work is that your DataFrame consists of columns with different lengths.
I have two variables, one="{}{}{}T{}:{}:{}" and two='2021,-10,-28,05,40,33'. When I try print(one.format(two)) I get an error: IndexError: Replacement index 1 out of range for positional args tuple. I am guessing this is because the variable two is treated as a single string.
So I switched and tried to do the *args method.
for item in two:
    item = str(item)
    data[item] = ''
result = template.format(*data)
However, if I switch to two='2021,-10,-28,00,00,00.478' it fails because dictionary keys are unique, so the repeated values are lost. How do I get the first method to work, or is there a better solution?
You should split two into a list and then unpack it with *:
one="{}{}{}T{}:{}:{}"
two='2021,-10,-28,05,40,33'
print(one.format(*two.split(','))) # 2021-10-28T05:40:33
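Because a list keeps duplicates, the same approach also handles the second input with its repeated 00 values. A small sketch, where fill is a hypothetical helper name introduced only for this illustration:

```python
one = "{}{}{}T{}:{}:{}"

def fill(template, csv_values):
    """Hypothetical helper: split on commas and unpack into the template."""
    return template.format(*csv_values.split(','))

first = fill(one, '2021,-10,-28,05,40,33')
second = fill(one, '2021,-10,-28,00,00,00.478')

print(first)   # 2021-10-28T05:40:33
print(second)  # 2021-10-28T00:00:00.478
```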
I need to extract path component from url string at different depth levels.
If the input is:
http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv
Output should be:
folder1_path = 'http//10.6.7.9:5647/folder1'
folder2_path = 'http//10.6.7.9:5647/folder1/folder2'
folder3_path = 'http//10.6.7.9:5647/folder1/folder2/folder3'
folder4_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4'
The goal is to create these new string variables using string operations on my_url_path.
You can use a clever combination of string split and join. Something like this should work:
def path_to_folder_n(url, n):
    """
    url: str, full url as string
    n: int, level of directories to include from root
    """
    base = 3
    s = url.split('/')
    return '/'.join(s[:base+n])
my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
# folder 1
print(path_to_folder_n(my_url_path, 1))
# folder 4
print(path_to_folder_n(my_url_path, 4))
# folder 3
print(path_to_folder_n(my_url_path, 3))
Output:
>> http//10.6.7.9:5647/folder1
>> http//10.6.7.9:5647/folder1/folder2/folder3/folder4
>> http//10.6.7.9:5647/folder1/folder2/folder3
Keep in mind you may want to add error checks so that n cannot exceed the number of folders in the path.
See it in action here: https://repl.it/repls/BelovedUnhealthyBase#main.py
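One possible way to add such a check is sketched below. The function name path_to_folder_n_checked and the choice to raise ValueError are assumptions made for illustration, and the sketch assumes the last path component is a file name rather than a folder:

```python
def path_to_folder_n_checked(url, n, base=3):
    """Like path_to_folder_n, but rejects n outside the available depth."""
    parts = url.split('/')
    # Assumption: the final component (e.g. df.csv) is a file, not a folder.
    max_depth = len(parts) - base - 1
    if not 1 <= n <= max_depth:
        raise ValueError(f"n must be between 1 and {max_depth}, got {n}")
    return '/'.join(parts[:base + n])

my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
print(path_to_folder_n_checked(my_url_path, 2))
# http//10.6.7.9:5647/folder1/folder2
```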
To get the name of the parent directory from a string in this format, you could simply do
my_url_path.split('/')[-2]
For a parent further up, use a more negative index, for example [-3] for the directory one level above that.
I've made this function that addresses your problem.
It just uses split() and join() methods of the str class, and also the takewhile() function of the itertools module, which basically takes elements from an iterable while the predicate (its first argument) is true.
from itertools import takewhile
def manipulate_path(target, url):
    path_parts = url.split('/')
    partial_output = takewhile(lambda x: x != target, path_parts)
    return "/".join(partial_output) + f'/{target}'
You can use it as follows:
manipulate_path('folder1', my_url_path) # returns 'http//10.6.7.9:5647/folder1'
manipulate_path('folder2', my_url_path) # returns 'http//10.6.7.9:5647/folder1/folder2'
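One caveat: if target never appears in the URL, takewhile consumes every part and the function returns the full URL with target appended. A hedged variant that swaps takewhile for list.index and raises instead is sketched below; the name manipulate_path_checked and the ValueError are assumptions for illustration:

```python
def manipulate_path_checked(target, url):
    """Return the URL up to and including target, or raise if it is absent."""
    path_parts = url.split('/')
    if target not in path_parts:
        raise ValueError(f"{target!r} is not a component of {url!r}")
    cut = path_parts.index(target)  # position of the first match
    return '/'.join(path_parts[:cut + 1])

my_url_path = 'http//10.6.7.9:5647/folder1/folder2/folder3/folder4/df.csv'
print(manipulate_path_checked('folder2', my_url_path))
# http//10.6.7.9:5647/folder1/folder2
```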
Say I have multiple lists called data_w1, data_w2, data_w3, ..., data_wn. I have a function that takes an integer band as an input, and I'd like the function to operate only on the corresponding list.
I am familiar with string substitution, but I'm not dealing with strings here, so how would I do this? Some pseudocode:
def my_function(band):
    wband_new = []
    for entry in data_wband:  # intended: the list matching band, e.g. data_w1
        # do stuff
        wband_new.append(entry)  # placeholder for the new stuff
    return wband_new
But doing the above doesn't work as expected because I get the errors that anything with wband in it isn't defined. How can I do this?
Not exactly sure what you're asking, but if you mean to have lists 1, 2, ..., n then an integer i and you want to get the i'th list, simply have a list of lists and index the outer list with the integer i (in your case called band).
l = [data_w1, data_w2, data_w3]
list_to_operate_on = l[band - 1]  # lists are zero-indexed, so band 1 is l[0]
func(list_to_operate_on)
Suppose you have your data variables in the script before the function. What you need to do is substitute data_wband with globals()['data_w'+str(band)]:
data_w1 = [1,2,3]
data_w2 = [4,5,6]
def my_function(band):
    wband_new = []
    # look up the module-level variable whose name is built from band
    for entry in globals()['data_w'+str(band)]:
        # do stuff
        wband_new.append(entry)  # placeholder for the new stuff
    return wband_new
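If you control how the data is defined, a dictionary keyed by band avoids building variable names at all. This is only a sketch: data_by_band is a hypothetical name, and multiplying by 10 stands in for the real "do stuff" step.

```python
# Hypothetical container replacing the separate data_w1, data_w2 variables.
data_by_band = {
    1: [1, 2, 3],  # was data_w1
    2: [4, 5, 6],  # was data_w2
}

def my_function(band):
    wband_new = []
    for entry in data_by_band[band]:
        wband_new.append(entry * 10)  # placeholder for the real computation
    return wband_new

print(my_function(2))  # [40, 50, 60]
```

An unknown band then fails with a KeyError on the dictionary lookup rather than on a constructed variable name.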
A small snippet of a data set I have is the following:
import numpy
fns = numpy.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
The first integer I call run, whereas the second integer I denote as order. I want to sort this data set. First based on the order number, and second based on the run number. Every order number exists twice while the run number is unique. When I only sort based on the order number using numpy.argsort():
order_nrs = numpy.array([int(fn.split("_")[2]) for fn in fns])
fns = numpy.copy(fns)[numpy.argsort(order_nrs)]
I obtain
['filename_0001_0001_info.hdf5' 'filename_0002_0001_info.hdf5' 'filename_0006_0002_info.hdf5' 'filename_0005_0002_info.hdf5' 'filename_0004_0003_info.hdf5' 'filename_0003_0003_info.hdf5']
Although fns is sorted based on order it should afterwards also be sorted by run. The results should be:
filename_0001_0001_info.hdf5
filename_0002_0001_info.hdf5
filename_0005_0002_info.hdf5
filename_0006_0002_info.hdf5
filename_0003_0003_info.hdf5
filename_0004_0003_info.hdf5
How do I achieve this?
fns = ["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"]
def sortfunc(s):
    words = s.split('_')
    run, order = int(words[1]), int(words[2])
    return order, run
fns.sort(key=sortfunc)
A bit verbose, but useful to see the 'order' parameter in argsort.
import numpy as np
fns = np.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5",
                "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
o = np.array([fn.split("_") for fn in fns])
# np.rec.fromarrays is the public name for np.core.records.fromarrays
a_r = np.rec.fromarrays(o.transpose(),
                        names=['a', 'b', 'c', 'd'],
                        formats=['U9']*4)
# sort by field 'c' (order) first, then by field 'b' (run)
idx = np.argsort(a_r, order=['c', 'b'])
out = fns[idx]
out
Out[22]:
array(['filename_0001_0001_info.hdf5', 'filename_0002_0001_info.hdf5',
'filename_0005_0002_info.hdf5', 'filename_0006_0002_info.hdf5',
'filename_0003_0003_info.hdf5', 'filename_0004_0003_info.hdf5'],
dtype='<U28')
Start with the filenames, split on the _ as you have done, and convert the result to a recarray or structured array (a simple example is given). Then use argsort, ordering on the two numeric fields you want, and use the idx index to reorder the original.
There are probably other elegant ways, but I like to see exactly what is going on, and the order parameter of argsort makes it clear for me. Too bad you can't just use an index number instead of the named fields.
One way is engineering the dtype such that only the two numbers are visible and the second comes first:
>>> fns = numpy.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5", "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5", "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])
>>> fns.view(numpy.dtype({'names':['f1','f2'], 'formats':['<U4','<U4'], 'offsets':[56,36], 'itemsize':112})).sort()
>>> fns
array(['filename_0001_0001_info.hdf5', 'filename_0002_0001_info.hdf5',
'filename_0005_0002_info.hdf5', 'filename_0006_0002_info.hdf5',
'filename_0003_0003_info.hdf5', 'filename_0004_0003_info.hdf5'],
dtype='<U28')
This is a direct in-place sort, but argsort, of course, also works.
Look at np.lexsort; it uses a stable sort on a series of keys to do what you want.
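A sketch of how that could look for the filenames from the question. The variable names run_nrs, order_nrs and sorted_fns are chosen here for illustration. Note that np.lexsort treats the last key in the tuple as the primary one:

```python
import numpy as np

fns = np.array(["filename_0004_0003_info.hdf5", "filename_0003_0003_info.hdf5",
                "filename_0001_0001_info.hdf5", "filename_0002_0001_info.hdf5",
                "filename_0006_0002_info.hdf5", "filename_0005_0002_info.hdf5"])

parts = [fn.split("_") for fn in fns]
run_nrs = np.array([int(p[1]) for p in parts])    # first integer
order_nrs = np.array([int(p[2]) for p in parts])  # second integer

# Last key is primary: sort by order, break ties by run.
idx = np.lexsort((run_nrs, order_nrs))
sorted_fns = fns[idx]
print(sorted_fns)
```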
I need to pass a list as arguments for a certain UDF I have in pyspark. Example:
def cat(mine,mine2):
    if mine is not None and mine2 is not None:
        return "2_"+mine+"_"+mine2
udf_cat = UserDefinedFunction(cat, "string")
l = ["COLUMN1","COLUMN2"]
df = df.withColumn("NEW_COLUMN", udf_cat(l))
But I always get an error.
After a while, I figured out that all I need is to pass the list using the character '*' before it. Example:
df = df.withColumn("NEW_COLUMN", udf_cat(*l))
That way, it will work.