Scoring a pandas column vs other columns - python

I want to score each row by how many of the other columns in the df are greater than or equal to a reference column. Given testdf:
testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]})
I am using the helper function:
def sorter(row):
    sortedrow = row.sort_values()
    return sortedrow.index.get_loc('RefCol')
as:
testdf['Score'] = testdf.apply(sorter, axis=1)
With my actual data this method is very slow; how can I speed it up? Thanks

Looks like you need to compare against RefCol and count how many columns are less than it; use:
testdf.lt(testdf['RefCol'],axis=0).sum(1)
0 0
1 1
2 2
3 2
For greater than or equal to, use:
testdf.drop(columns='RefCol').ge(testdf['RefCol'], axis=0).sum(axis=1)
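For reference, here is the whole thing end to end (a minimal sketch using the testdf above; Score counts how many non-reference columns are greater than or equal to RefCol):
import pandas as pd

testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]})

# Row-wise count of non-reference columns that are >= RefCol
testdf['Score'] = (testdf.drop(columns='RefCol')
                         .ge(testdf['RefCol'], axis=0)
                         .sum(axis=1))
print(testdf['Score'].tolist())  # [3, 2, 1, 1]
Both comparisons are fully vectorized, so they avoid the per-row Python overhead of apply.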

Retrieve Pandas dataframe rows whose column values are consecutively equal to the values of a list

How do I retrieve Pandas dataframe rows whose column values are consecutively equal to the values of a list?
Example, given this:
import pandas as pd
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
I want to extract the dataframe rows from 30 to 50, but just the first sequence of consecutive values (just the rows at index 2 to 4).
This should do the trick:
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
ans = []
# note: this finds the first occurrence of each value; it does not
# check that the matches are actually consecutive
for i, num in enumerate(df['col1']):
    if num in lst:
        lst.remove(num)  # consume each target value once
        ans.append(i)
print(ans)
You can use a rolling comparison:
s = df['col1'][::-1].rolling(len(lst)).apply(lambda x: x.eq(lst[::-1]).all())[::-1].eq(1)
if s.any():
    idx = s.idxmax()
    out = df.iloc[idx:idx + len(lst)]
    print(out)
else:
    print('Not found')
output:
col1
2 30
3 40
4 50
Try:
lst = [30, 40, 50]
if any(lst == (found := s).to_list() for s in df["col1"].rolling(len(lst))):
    print(df.loc[found.index])
Prints:
col1
2 30
3 40
4 50
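If the frame is large, a NumPy-only variant may be faster still. This is a sketch assuming NumPy >= 1.20 (for sliding_window_view); it compares every window of len(lst) values against lst at once:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]

windows = sliding_window_view(df['col1'].to_numpy(), len(lst))
matches = np.flatnonzero((windows == lst).all(axis=1))
if matches.size:
    start = matches[0]  # keep only the first consecutive match
    print(df.iloc[start:start + len(lst)])
else:
    print('Not found')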

modifying a nan's position in the dataframe

I'm hoping I can explain this well. I have this df with 2 columns: group and numbers. I'm trying to take that np.nan and pop it into its new group.
def check_for_nan():
    # for example, let's say my new value is 14.5
    new_nan_value = 14.5
    data = {"group:": [-1, 0, 1, 2, 3],
            'numbers': [[np.nan], [11, 12], [14, 15], [16, 17], [18, 19]],
            }
    df = pd.DataFrame(data=data)
    # *** add some code ***
    # I created a new dataframe to show how it should look, but we would
    # want to operate only on the same df from above
    data_2 = {"group": [0, 1, 2, 3],
              'numbers': [[11, 12], [14, np.nan, 15], [16, 17], [18, 19]],
              }
    df_2 = pd.DataFrame(data=data_2)
    # should return the new group number where the nan would live
    return data_2["group"][1]
Output:
current:
group: numbers
0 -1 [nan]
1 0 [11, 12]
2 1 [14, 15]
3 2 [16, 17]
4 3 [18, 19]
Desired output when new_nan_value = 14.5:
group numbers
0 0 [11, 12]
1 1 [14, nan, 15]
2 2 [16, 17]
3 3 [18, 19]
return 1
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "group": [-1, 0, 1, 2, 3],
        "numbers": [[pd.NA], [11, 12], [14, 15], [16, 17], [18, 19]],
    }
)
new_nan_value = 14.5
Here is one way to do it:
def move_nan(df, new_nan_value):
    """Helper function.

    Args:
        df: input dataframe.
        new_nan_value: insertion value.

    Returns:
        Dataframe with nan value at insertion point, new group.
    """
    # Reshape dataframe along row axis
    df = df.explode("numbers").dropna().reset_index(drop=True)
    # Insert a placeholder row after the last value below new_nan_value
    insert_pos = df.loc[df["numbers"] < new_nan_value, "numbers"].index[-1] + 1
    df = pd.concat(
        [
            df.loc[: insert_pos - 1, :],
            pd.DataFrame({"group": [pd.NA], "numbers": pd.NA}, index=[insert_pos]),
            df.loc[insert_pos:, :],
        ]
    )
    df["group"] = df["group"].bfill()  # the new row inherits the next row's group
    # Find new group value
    new_group = df.loc[df["numbers"].isna(), "group"].values[0]
    # Groupby and reshape dataframe along column axis
    df = df.groupby("group").agg(list).reset_index(drop=False)
    return df, new_group
So that:
df, new_group = move_nan(df, 14.5)
print(df)
# Output
group numbers
0 0 [11, 12]
1 1 [14, nan, 15]
2 2 [16, 17]
3 3 [18, 19]
print(new_group) # 1
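For comparison, a plain-Python sketch of the same idea: drop the nan's old home, then walk each group's list and insert the nan in front of the first value greater than new_nan_value (this assumes the lists are sorted, as in the example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [-1, 0, 1, 2, 3],
    "numbers": [[pd.NA], [11, 12], [14, 15], [16, 17], [18, 19]],
})

def move_nan_simple(df, new_nan_value):
    # Drop the nan's old home (group -1), then scan the remaining lists
    df = df[df["group"] != -1].reset_index(drop=True)
    for i, row in df.iterrows():
        nums = row["numbers"]
        for j, v in enumerate(nums):
            if v > new_nan_value:
                # Insert the nan right before the first larger value
                df.at[i, "numbers"] = nums[:j] + [np.nan] + nums[j:]
                return df, row["group"]
    return df, None  # new value is larger than everything

df, new_group = move_nan_simple(df, 14.5)
print(new_group)  # 1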

How to use Python Pandas to calculate the mean for skipped backward rows?

Here is the data:
data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)
I want to create a new column skipped_mean. Only the last 3 rows have a valid value for this variable. What it does is look 6 rows back at a time, three times (the current row, 6 rows back, and 12 rows back), and take the average of the three numbers.
How can it be done?
You could do it with a weighted rolling mean approach:
import numpy as np
# nonzero weights pick out lags 12, 6 and 0 within each 13-row window
weights = np.array([1/3, 0, 0, 0, 0, 0, 1/3, 0, 0, 0, 0, 0, 1/3])
df['skipped_mean'] = df['col1'].rolling(13).apply(lambda x: np.sum(weights * x))
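Equivalently, since the weights only pick out lags 0, 6 and 12, you can write it with shift (a sketch; NaN propagates automatically for the first 12 rows):
import pandas as pd

data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)

# Average the current value with the values 6 and 12 rows back
df['skipped_mean'] = (df['col1']
                      + df['col1'].shift(6)
                      + df['col1'].shift(12)) / 3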

pandas read text file into a dataframe

I have a .txt file
[7, 9, 20, 30, 50] [1-8]
[9, 14, 27, 31, 45] [2-5]
[7, 10, 22, 27, 38] [1-7]
that I am trying to read into a data frame of two columns using df = pd.read_fwf(readfile, header=None)
Instead of two columns, it forms a data frame with three columns, and sometimes reads the first list of numbers into five separate columns:
0 1 2
0 [7, 9, 20, 30, 50] [1-8]
1 [9, 14, 27, 31, 45] [2-5]
2 [7, 10, 22, 27, 38] [1-7]
I do not understand what I am doing wrong. Could someone please help?
You can exploit the two spaces between the lists:
pd.read_csv(readfile, sep=r'\s\s', header=None, engine='python')
Out:
0 1
0 [7, 9, 20, 30, 50] [1-8]
1 [9, 14, 27, 31, 45] [2-5]
2 [7, 10, 22, 27, 38] [1-7]
pd.read_fwf without an explicit widths argument tries to infer the fixed widths. But the length of the first list varies, so there is no fixed width that separates each line into two columns.
The widths argument is very useful if your data has no delimiter but a fixed number of characters per value. 40 years ago this was a common data format.
# data.txt
20200810ITEM02PRICE30COUNT001
20200811ITEM03PRICE31COUNT012
20200812ITEM12PRICE02COUNT107
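For example, a sketch for the sample above (the widths and column names are assumptions read off that layout):
import pandas as pd

# 8 chars of date, 6 of item, 7 of the price field, 8 of the count field
df = pd.read_fwf('data.txt', widths=[8, 6, 7, 8], header=None,
                 names=['date', 'item', 'price', 'count'])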
pd.read_csv's sep argument accepts multi-character and regex delimiters. Often this is more flexible for splitting strings into columns.
You can read the file in a single line using pandas:
import pandas as pd
df = pd.read_csv(readfile, sep=r'\s\s', header=None, engine='python')

np reshape within pandas apply

I get the exception: Data must be 1-dimensional.
I'll present the problem with a toy example to be clear.
import pandas as pd
import numpy as np
Initial dataframe:
df = pd.DataFrame({"A": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2]})
>>> df
A C R
0 [10, 15, 12, 14] 2 2
1 [20, 30, 10, 43] 2 2
Conversion to numpy array and reshape:
df['A'] = df['A'].apply(lambda x: np.array(x))
df.apply(lambda x: print(x['A'], (x['R'], x['C'])), axis=1)  # debug: inspect each row
df['A_reshaped'] = df.apply(lambda x: np.reshape(x['A'], (x['R'], x['C'])), axis=1)
df
Expected result:
   A                 C  R  A_reshaped
0  [10, 15, 12, 14]  2  2  [[10, 15], [12, 14]]
1  [20, 30, 10, 43]  2  2  [[20, 30], [10, 43]]
Does someone know the reason? It seems that pandas does not accept 2-dimensional arrays in cells, but that's strange...
Thanks in advance for any help!
Using apply directly doesn't work: the return value is a NumPy 2-D array, and placing it back in the DataFrame confuses pandas, for some reason.
This seems to work, though:
df['reshaped'] = pd.Series([a.reshape((c, r)) for (a, c, r) in zip(df.A, df.C, df.R)])
>>> df
A C R reshaped
0 [10, 15, 12, 14] 2 2 [[10, 15], [12, 14]]
1 [20, 30, 10, 43] 2 2 [[20, 30], [10, 43]]
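If you prefer to stay with apply, a sketch that converts each reshaped array to a plain nested list also seems to work, because pandas then stores one list object per cell instead of trying to interpret a 2-D array:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [[10, 15, 12, 14], [20, 30, 10, 43]], "R": [2, 2], "C": [2, 2]})

# Each cell becomes a nested Python list, which pandas treats as a single object
df['A_reshaped'] = df.apply(
    lambda x: np.reshape(x['A'], (x['R'], x['C'])).tolist(), axis=1)
print(df)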
