Pandas : Top N results with float and string list - python

idx  float  str+list
1    -0.2   [A,B]
1    -0.1   [A,D]
1     0.2   [B,C]
To get the single best result I use:
df.loc[df['float'].idxmax()]['str+list']
How can I get the top 2 results instead of just the idxmax one?
nlargest gives me an error.

Use DataFrame.nlargest:
s = df.nlargest(2, 'float')['str+list']
print(s)
2 [B,C]
1 [A,D]
Name: str+list, dtype: object
Or sort by the column and select the top N values:
df.sort_values('float', ascending=False)['str+list'].head(2)
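Putting the accepted approach together on the question's data (the str+list column is rebuilt here as Python lists, which is an assumption about the original dtype):

```python
import pandas as pd

# Reconstruction of the question's frame
df = pd.DataFrame({
    'idx': [1, 1, 1],
    'float': [-0.2, -0.1, 0.2],
    'str+list': [['A', 'B'], ['A', 'D'], ['B', 'C']],
})

# Top-2 rows by 'float', keeping only the list column
top2 = df.nlargest(2, 'float')['str+list']
print(top2.tolist())  # [['B', 'C'], ['A', 'D']]
```

Note that nlargest keeps the original index labels, which is why the printed result shows 2 before 1.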


Modifying pandas row value based on its length

I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0 40
1 40h / week
2 46.25h/week on average
3 11
I would like to check every row, and if the value is longer than 2 characters, extract only the number of hours from it.
I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However I am getting the AttributeError: 'str' object has no attribute 'extract' error.
It looks like you could require an h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0 NaN
1 40
2 46.25
3 NaN
Name: Hours_per_week, dtype: object
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract('(\d+)')
Why not extract a float pattern directly, i.e. \d+\.?\d+?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract(r"(\d+\.?\d+)")
0
0 40
1 40
2 46.25
3 11
2 digits will still match either way.
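A combined sketch of the extraction on the question's sample data, using the slightly looser pattern \d+\.?\d* (a small variation on the answers above, so that single-digit values would also match), then converting to float:

```python
import pandas as pd

s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])

# Extract an int-or-float pattern; \d+\.?\d* also matches single digits
hours = s.str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
print(hours.tolist())  # [40.0, 40.0, 46.25, 11.0]
```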

Iterating with conditions over Distinct Values in a Column

I have a dataset that looks something like this:
category  value  ID
A         1      x
A         0.5    y
A         0.33   y
B         0.5    z
B         0.33   z
C         5      w
C         0.33   w
For each category, I want to grab all the instances that have a value of <= 0.5. I want to have a count of those instances for each category.
My ideal end goal would be to have a dataframe or list with the counts for each of these categories.
Thanks so much for your help.
EDIT:
To get more complex, let's say I want the count for each category where value is <=0.5 but only count each ID once.
Whereas before the values would be:
cat A -> 2, cat B -> 2, cat C -> 1
Now ideal values would be:
Cat A -> 1, cat B -> 1, cat C -> 1
You can use groupby.sum on the boolean Series of the comparison of value to 0.5:
out = df['value'].le(0.5).groupby(df['category']).sum()
Alternatively, use boolean indexing and value_counts:
df.loc[df['value'].le(0.5), 'category'].value_counts()
output:
category
A 2
B 2
C 1
Name: value, dtype: int64
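For the edited requirement (count each ID at most once per category), one sketch is to filter first and then drop duplicate (category, ID) pairs before counting; the drop_duplicates step is an addition not shown in the answers above:

```python
import pandas as pd

# Reconstruction of the question's frame
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'value':    [1, 0.5, 0.33, 0.5, 0.33, 5, 0.33],
    'ID':       ['x', 'y', 'y', 'z', 'z', 'w', 'w'],
})

# Filter to value <= 0.5, then keep one row per (category, ID) pair
# so each ID is only counted once per category
out = (df.loc[df['value'].le(0.5)]
         .drop_duplicates(['category', 'ID'])['category']
         .value_counts())
print(out.to_dict())  # {'A': 1, 'B': 1, 'C': 1}
```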

pandas rounding when converting the series to int

How can I round each value to a number of decimals given by another column?
My sample data is like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(1, 5, size=(4, 1)), columns=['Results'])
df['groups'] = ['A', 'B', 'C', 'D']
df['decimal'] = [1, 0, 2, 3]
This produces a dataframe like:
Results groups decimal
0 2.851325 A 1
1 1.397018 B 0
2 3.522660 C 2
3 1.995171 D 3
Next: each result needs to be rounded to the number of decimals shown in decimal. What I tried below raised TypeError: cannot convert the series to <class 'int'>:
df['new'] = df['Results'].round(df['decimal'])
I want the results like:
Results groups decimal new
0 2.851325 A 1 2.9
1 1.397018 B 0 1
2 3.522660 C 2 3.52
3 1.995171 D 3 1.995
You can pass a dict-like object to DataFrame.round to set different precision levels for different columns. So you can transpose a single-column DataFrame (built from the Results column), round with the decimal Series, and transpose back:
df['Results'] = df[['Results']].T.round(df['decimal']).T
Another option is a list comprehension:
df['Results'] = [round(num, rnd) for num, rnd in zip(df['Results'], df['decimal'])]
Output:
Results groups decimal
0 2.500 A 1
1 2.000 B 0
2 2.190 C 2
3 1.243 D 3
Note that since it's a single column, its displayed decimal places are determined by the highest precision; but if you look at the underlying values of this DataFrame, you'll see that the precisions have indeed changed:
>>> df[['Results']].to_dict('list')
{'Results': [2.5, 2.0, 2.19, 1.243]}
Try this:
df['new'] = df['Results'].copy()
df = df.round({'new': 1})
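A quick sketch checking that the list comprehension and the double-transpose trick agree on the question's sample values:

```python
import pandas as pd

# The question's sample values (fixed seed-free reconstruction)
df = pd.DataFrame({
    'Results': [2.851325, 1.397018, 3.522660, 1.995171],
    'groups': ['A', 'B', 'C', 'D'],
    'decimal': [1, 0, 2, 3],
})

# Per-row rounding via a list comprehension
by_comp = [round(num, rnd) for num, rnd in zip(df['Results'], df['decimal'])]

# Same result via the double-transpose trick: DataFrame.round accepts
# per-column precision, so transposing makes each row a column
by_transpose = df[['Results']].T.round(df['decimal']).T['Results'].tolist()

print(by_comp)  # [2.9, 1.0, 3.52, 1.995]
```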

How to remove a string selectively from pandas series?

I am having a hard time getting rid of a string in my pandas series. I'd like to remove the first two '-' strings but keep the last two numeric values. Example below.
import pandas as pd
temp = pd.Series(['-', '-', '-0.3', '-0.9'])
print(temp)
Out[135]:
0 -
1 -
2 -0.3
3 -0.9
dtype: object
Can't use temp.str.replace("-", "") since it removes the minus sign from the last two numeric values as well. Can anyone help me with this? Thanks in advance!
Use a regular expression:
temp = pd.Series(['-', '-', '-0.3', '-0.9'])
print(temp.str.replace('^-$', '', regex=True))
Output
0
1
2 -0.3
3 -0.9
dtype: object
Or simply use replace:
print(temp.replace('-', '')) # notice that there is no .str
Output
0
1
2 -0.3
3 -0.9
dtype: object
You can convert strings to numbers:
pd.to_numeric(temp, errors='coerce').fillna('')
Output:
0
1
2 -0.3
3 -0.9
You can drop the rows containing the unwanted string like this:
import pandas as pd
temp = pd.Series(['-', '-', '-0.3', '-0.9'])
# drop the entries that match '-'
new_temp = temp[temp != '-']
print(new_temp)
Output:
2 -0.3
3 -0.9
dtype: object
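A minimal sketch contrasting the whole-cell Series.replace with the regex-anchored str.replace (both from the answers above); they give the same result here because '^-$' only matches cells that are exactly '-':

```python
import pandas as pd

temp = pd.Series(['-', '-', '-0.3', '-0.9'])

# Series.replace (no .str) matches the whole cell value, so numeric
# strings like '-0.3' are left untouched
cleaned = temp.replace('-', '')
print(cleaned.tolist())  # ['', '', '-0.3', '-0.9']
```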

pandas.DataFrame.round() not accepting pd.NA or pd.NAN

pandas version: 1.2
I have a dataframe whose columns should be 'float64' but whose null values are represented as pd.NA. Is there a way to round without converting to string and then to decimal?
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18),(pd.NA, .18)],
columns=['dogs', 'cats'])
df
dogs cats
0 0.21 0.32120
1 0.01 0.61237
2 0.66123 0.03000
3 0.21 0.18000
4 <NA> 0.18000
Here is what I wanted to do, but it is erroring:
df['dogs'] = df['dogs'].round(2)
TypeError: float() argument must be a string or a number, not 'NAType'
Here is another way I tried but this silently fails and no conversion occurs:
df.round({'dogs': 1})
dogs cats
0 0.21 0.32120
1 0.01 0.61237
2 0.66123 0.03000
3 0.21 0.18000
4 <NA> 0.18000
While annoying, pandas.NA is still relatively new and doesn't support all numpy ufuncs. Oddly, I'm also encountering errors trying to change the "dogs" column's dtype from object to float, which seems like a bug to me. There are a couple of alternatives you can use to achieve your desired result, though:
Mask the NA values away and round the rest of the column:
na_mask = df["dogs"].notnull()
df.loc[na_mask, "dogs"] = df.loc[na_mask, "dogs"].astype(float).round(1)
print(df)
dogs cats
0 0.2 0.32120
1 0.0 0.61237
2 0.7 0.03000
3 0.2 0.18000
4 <NA> 0.18000
Replace the pd.NA with np.nan and then round:
df = df.replace(pd.NA, np.nan).round({"dogs": 1})
print(df)
dogs cats
0 0.2 0.32120
1 0.0 0.61237
2 0.7 0.03000
3 0.2 0.18000
4 NaN 0.18000
df['dogs'] = df['dogs'].apply(lambda x: x if pd.isna(x) else round(x, 2))
Does the following code work?
import pandas as pd
import numpy as np
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18), (np.nan, .18)],
columns=['dogs', 'cats'])
df['dogs'] = df['dogs'].round(2)
print(df)
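The mask-based workaround above can be sketched end to end like this:

```python
import pandas as pd

df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03),
                   (.21, .18), (pd.NA, .18)],
                  columns=['dogs', 'cats'])

# Round only the non-null entries; pd.NA rows are left alone
mask = df['dogs'].notnull()
df.loc[mask, 'dogs'] = df.loc[mask, 'dogs'].astype(float).round(1)
print(df['dogs'].tolist())
```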
