I have a dataframe where the top scores/instances have parentheses. I would like to remove the parentheses and leave only the number. How would I do so?
I have tried the code below, but it leaves me with NaNs for all the other numbers that do not have parentheses.
.str.replace(r"\(.*\)","")
This is what the columns look like:
0 1(1P)
1 3(3P)
2 2(2P)
3 4(RU)
4 5(RU)
5 6(RU)
6 8
7 7
8 11
9 13
I want clean columns with only numbers.
Thanks!
The reason is the mixed values (numbers together with strings); a possible solution is to cast everything to string first, strip the parenthesized part, and convert back to int:
df['a'] = df['a'].astype(str).str.replace(r"\(.*\)", "", regex=True).astype(int)
print (df)
a
0 1
1 3
2 2
3 4
4 5
5 6
6 8
7 7
8 11
9 13
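Alternatively, you can extract just the leading number instead of stripping the parentheses; a minimal sketch, assuming the column is named 'a' as above:

# Extract the leading run of digits and convert to int; plain numbers match
# the pattern too, so no row becomes NaN.
df['a'] = df['a'].astype(str).str.extract(r'(\d+)', expand=False).astype(int)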
I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()); however, I'm worried this might be quite slow for much larger dataframes (in terms of both columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np

a = df.to_numpy()
# np.rot90 turns the columns into lexsort keys with the first column as the
# primary key (lexsort sorts by its last key first).
order = np.lexsort(np.rot90(a))
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)
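A quick sanity check (a sketch, using the df from the question) that the reordered values match what sort_values produces with the same keys:

# Compare the lexsort-based result with the straightforward sort_values call.
expected = df.sort_values(df.columns.to_list())
assert (out.to_numpy() == expected.to_numpy()).all()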
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[benchmark graphs not reproduced: one shows absolute runtime, the other the same timings as a speedup relative to pandas' sort_values]
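For reference, a rough way to reproduce such a comparison yourself (a sketch, not the exact benchmark behind the graphs):

import timeit

import numpy as np
import pandas as pd

n = 1_000  # side of the square DataFrame; scale up as needed
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

t_pandas = timeit.timeit(lambda: df.sort_values(df.columns.to_list()), number=5)
t_lexsort = timeit.timeit(
    lambda: df.to_numpy()[np.lexsort(np.rot90(df.to_numpy()))], number=5)
print(f"sort_values: {t_pandas:.3f}s, lexsort: {t_lexsort:.3f}s")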
By still using df.sort_values() you can speed it up a bit by selecting the sorting algorithm. By default it is set to quicksort, but there are the alternatives 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
Background
Pretty new to Python and dataframes. I'm on a Mac (Sierra) running Jupyter Notebook in Firefox (87.0). I've got a dataframe like this:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
                   'SubGroup':[1,5,6,5,8,6,8,6,6,5],
                   'Price':[7,1,0,10,2,3,0,0,10,0]})
A SubGroup Price
0 1 1 7
1 2 5 1
2 3 6 0
3 4 5 10
4 5 8 2
5 6 6 3
6 7 8 0
7 8 6 0
8 9 6 10
9 10 5 0
I want to add a Boolean column to this dataframe that checks whether (a) the price in this row is zero and (b) it's the first occurrence of a zero price for this subgroup (reading from top to bottom). If both (a) and (b) are true, then return True, otherwise False. So it should look like this:
A SubGroup Price Test
0 1 1 7 False
1 2 5 1 False
2 3 6 0 True
3 4 5 10 False
4 5 8 2 False
5 6 6 3 False
6 7 8 0 True
7 8 6 0 False
8 9 6 10 False
9 10 5 0 True
What I've Tried
The first condition (Price == 0) is easy. Checking whether it's the first occurrence for the subgroup is where I could use some help. I have an Excel background, so I started by thinking about how to solve this using a MINIFS function. The idea was to find the minimum Price for the Subgroup, looking only at the rows above the current row. If that min was greater than zero, then I'd know this was the first zero occurrence. The closest I could find (from this post) was a line like...
df['subgroupGlobalMin'] = df.groupby('SubGroup')['Price'].transform('min')
...which works but takes a global minimum across all rows for the Subgroup, not just the ones above the current row. So I tried to specify the target range for my min using iloc, like this...
df['subgroupPreviousMin'] = df.iloc[:df.index].groupby('SubGroup')['Price'].transform('min')
...but this produces the error "cannot do positional indexing on RangeIndex with these indexers [RangeIndex(start=0, stop=10, step=1)] of type RangeIndex". I couldn't figure out how to dynamically specify my rows/indices.
So I changed strategies and instead tried to find the index of the first occurrence of the minimum value for a subgroup using idxmin (like this post):
df['minIndex'] = df.groupby(['SubGroup'])[['Price']].idxmin()
The plan was to check this against the current row index with df.index, but I get unexpected output here:
A SubGroup Price minIndex
0 1 1 7 NaN
1 2 5 1 0.0
2 3 6 0 NaN
3 4 5 10 NaN
4 5 8 2 NaN
5 6 6 3 9.0
6 7 8 0 2.0
7 8 6 0 NaN
8 9 6 10 6.0
9 10 5 0 NaN
I know what it's doing here, but I don't know why or how to fix it.
Questions
Which strategy is best for what I'm trying to achieve - using a min function, checking the index with something like idxmin, or something else?
How should I add a column to my dataframe that checks if the price is 0 for that row and if it's the first occurrence of a zero for that subgroup?
Let us try your logic:
is_zero = df.Price.eq(0)
is_first_zero = is_zero.groupby(df['SubGroup']).cumsum().eq(1)
df['Test'] = is_zero & is_first_zero
Output:
A SubGroup Price Test
0 1 1 7 False
1 2 5 1 False
2 3 6 0 True
3 4 5 10 False
4 5 8 2 False
5 6 6 3 False
6 7 8 0 True
7 8 6 0 False
8 9 6 10 False
9 10 5 0 True
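To see why this works, it can help to print the intermediate series (a small sketch, reusing the df and names from above): the grouped cumulative sum of the zero flags is exactly 1 on the first zero of each SubGroup and greater than 1 on any later zero.

# is_zero flags the zero prices; the grouped cumsum counts how many zeros have
# been seen so far within each SubGroup, reading top to bottom.
is_zero = df.Price.eq(0)
zeros_so_far = is_zero.groupby(df['SubGroup']).cumsum()
print(pd.concat([df, is_zero.rename('is_zero'),
                 zeros_so_far.rename('zeros_so_far')], axis=1))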
Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']
I think you need drop; for selecting the columns to remove you can use range or numpy.arange:
import numpy as np
import pandas as pd

df = pd.DataFrame({'1':[1,2,3],
                   '2':[4,5,6],
                   '3':[7,8,9],
                   '4':[1,3,5],
                   '5':[7,8,9],
                   '6':[1,3,5],
                   '7':[5,3,6],
                   '8':[5,3,6],
                   '9':[7,4,3]})
print (df)
1 2 3 4 5 6 7 8 9
0 1 4 7 1 7 1 5 5 7
1 2 5 8 3 8 3 3 3 4
2 3 6 9 5 9 5 6 6 3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
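A small variation (a sketch, starting again from the original string-labelled df): since the labels are strings, you can also build the labels to drop as strings and skip the dtype conversion entirely.

# Drop columns '1' through '6' by their original string labels.
df = df.drop(columns=[str(i) for i in range(1, 7)])
print(df)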
You can do this without modifying the columns, by building a slice over df.columns and passing the selected columns to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'),df.columns.tolist().index('6')+1)], axis=1)
Out[29]:
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
This looks up the ordinal positions of the lower and upper column endpoints, builds a slice object from them, and uses it to select the matching columns from the columns array, which are then passed to drop.
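Another option along the same lines (a sketch, assuming the column labels are unique and both endpoints exist) is to let .loc do the label slicing and drop the resulting columns:

# Select the contiguous block of columns '1' through '6' by label, then drop it.
to_drop = df.loc[:, '1':'6'].columns
df.drop(to_drop, axis=1)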
I suppose this is something rather simple, but I can't find how to do it. I've been searching tutorials and Stack Overflow.
Suppose I have a dataframe df looking like this:
Group Id_In_Group SomeQuantity
1 1 10
1 2 20
2 1 7
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
4 1 1
4 2 18
4 3 14
4 4 7
5 1 36
I would like to select only the lines belonging to groups with at least 4 objects (so there are at least 4 rows with the same Group number) and for which the 4th SomeQuantity of the group, when sorted by ascending SomeQuantity, is at least 20 (for example).
In the given dataframe, for example, it would only return the 3rd group, since it has 5 (>=4) members and its 4th SomeQuantity (after sorting) is 22 (>=20), so it should construct this dataframe:
Group Id_In_Group SomeQuantity
3 1 16
3 2 22
3 3 5
3 4 12
3 5 28
(sorted by SomeQuantity or not, it doesn't matter).
Could somebody be kind enough to help me? :)
I would use .groupby() + .filter() methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].sort_values().iloc[3] >= 20)
Out[66]:
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
A slightly different approach using map, value_counts, groupby, and filter:
import numpy as np

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts to compute the total counts of the distinct elements present in the Group column:
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map, which behaves like a dictionary lookup (the index becomes the keys and the series elements become the values), to map these counts back onto the original DF:
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then, we check for the elements having a count of 4 or more, which is our threshold, and keep only that subset of the entire DF.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 18
10 4 3 14
11 4 4 7
In order to use the groupby.filter operation on this, we must make sure that we return a single boolean value for each grouped key when we sort and compare the fourth element against the threshold of 20.
np.any reduces the comparison to a single boolean, so every group matching our filter is kept.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we compare the fourth element, .iloc[3] (0-based indexing), against the threshold and keep all the matching groups.
This is how I have worked through your question, warts and all. I'm sure there are much nicer ways to do this.
Find the groups with "at least 4 objects in the group":
import collections
groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3}); groups
Out:[3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity', ascending=False)
df3.groupby('Group').filter(lambda grp: grp['SomeQuantity'].sort_values().iloc[3] >= 20).sort_index()
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
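For completeness, here is one more way to express both conditions in a single pass (a sketch, assuming the same df as in the question): compute the 4th-smallest SomeQuantity per group with transform, which broadcasts the value to every row of the group and is NaN for groups with fewer than 4 rows, then keep the rows where it reaches the threshold.

import numpy as np

# 4th-smallest SomeQuantity per group; NaN for groups with fewer than 4 rows,
# and NaN >= 20 evaluates to False, so small groups are dropped automatically.
fourth_smallest = df.groupby('Group')['SomeQuantity'].transform(
    lambda s: s.nsmallest(4).iloc[-1] if len(s) >= 4 else np.nan)
print(df[fourth_smallest.ge(20)])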
I'm new to Python and using pandas dataframes to store and work with a large dataset.
I'm interested in knowing whether it's possible to compare values between similarly named columns of two dataframes. For example, the functionality I'm after would be similar to comparing the column 'A' in this dataframe:
A
0 9
1 9
2 5
3 8
4 7
5 9
6 2
7 2
8 5
9 7
to the column 'A' in this one:
A
0 6
1 3
2 7
3 8
4 2
5 5
6 1
7 8
8 4
9 9
Then, for each row I would determine which of the two 'A' values is smaller and add it to, say, a new column in the first dataframe called 'B':
A B
0 9 6
1 9 3
2 5 5
3 8 8
4 7 2
5 9 5
6 2 1
7 2 2
8 5 4
9 7 7
I'm aware of the
pandas.DataFrame.min
method, but as I understand it this will only locate the smallest value within one column and can't be used to compare columns of different dataframes. I'm not sure of any other way to achieve this.
Any suggestions for solving this (probably) very simple question would be much appreciated! Thank you.
You can use numpy.minimum():
import numpy as np
df1['B'] = np.minimum(df1.A, df2.A)
Or use Series.where() to replace values:
df1['B'] = df1['A'].where(df1.A < df2.A, df2.A)
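Put together, a minimal end-to-end sketch (assuming df1 and df2 hold the two 'A' columns shown above); the concat/min variant ties back to the DataFrame.min method you mentioned, just applied row-wise across the two aligned columns:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [9, 9, 5, 8, 7, 9, 2, 2, 5, 7]})
df2 = pd.DataFrame({'A': [6, 3, 7, 8, 2, 5, 1, 8, 4, 9]})

# Element-wise minimum of the two aligned columns.
df1['B'] = np.minimum(df1.A, df2.A)

# Equivalent, using DataFrame.min row-wise across the two columns.
df1['B_alt'] = pd.concat([df1.A, df2.A], axis=1).min(axis=1)
print(df1)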