How to select specific rows based on specific condition - python

I am working with pandas in Python. I am trying to select specific rows based on a specific condition. From the dataset below, I want the System groups that contain Type 1. System groups that don't contain Type 1 can be ignored.
System  Type
A       1
A       2
A       2
A       3
B       1
B       2
C       2
D       3
Required Output
System  Type
A       1
A       2
A       2
A       3
B       1
B       2
Systems A and B appear in the required output because they contain a Type 1 value. Groups C and D have been ignored because there is no Type 1 in them. I am trying to do this with groupby but I am unable to extend it to check for the presence of Type 1 within each group. Please help.
Code to generate dataframe
import pandas as pd
data = [['A', 1], ['A', 2], ['A', 2],['A',3],['B',1],['B',2],['C',2],['C',3],['D',3]]
df = pd.DataFrame(data, columns=['System', 'Type'])

df[df['System'].isin(df[df['Type'] == 1]['System'])]
System Type
0 A 1
1 A 2
2 A 2
3 A 3
4 B 1
5 B 2
First, you filter the df to only the rows with Type == 1 and select the System column. Then you filter the df to only include rows whose System is in that list.
It might be faster to put the values into a set and search in that:
df[df['System'].isin(set(df[df['Type'] == 1]['System'].values))]

Let's say the table variable is df. Then the code will be:
sys = list(df[df['Type'] == 1]['System'].values)
ans = df[df['System'].isin(sys)]
ans is your desired table. I am sure there are better ways, but hopefully this works.

You can use a where clause; it will be much faster than conditional selection:
selectedvariables=df.where(df['Type']==1)["System"].dropna()
print(df[df["System"].isin(selectedvariables)])
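Since the question mentions groupby, here is a minimal groupby-based sketch (my own addition, not one of the answers above), using GroupBy.filter to keep every row of any System group that contains at least one Type 1:
import pandas as pd
data = [['A', 1], ['A', 2], ['A', 2], ['A', 3], ['B', 1], ['B', 2], ['C', 2], ['C', 3], ['D', 3]]
df = pd.DataFrame(data, columns=['System', 'Type'])
# filter() keeps or drops whole groups; the lambda returns True only for groups containing a Type 1 row
out = df.groupby('System').filter(lambda g: (g['Type'] == 1).any())
print(out)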

Related

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain numbers on top. A smaller version of it, just to demonstrate the problem here, would be this input:
import numpy as np
import pandas as pd
M = pd.DataFrame(np.random.rand(4, 5))
What I want to accomplish is using another given DF as reference that has a structure like this
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different versions of something like this:
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying:
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be the best ?
Is this job better to accomplish by converting the DF's into a numpy arrays ?
Thanks for help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
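If 'A' and 'B' really hold integer positions, a vectorized alternative (my own addition, not part of the answer above) is plain NumPy fancy indexing on the underlying array, which avoids the per-row apply:
import numpy as np
import pandas as pd
M = pd.DataFrame(np.random.rand(4, 5))
N = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 3, 4]})
# Row positions come from 'A', column positions from 'B'; one element is picked per (row, col) pair
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]
print(N)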

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional columns whose values only contains names of the other columns, how do I select values from the data-columns with the additional columns as keys?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. If converting to numpy-arrays, it's faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (assuming numpy is imported as np):
df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'], [df['A'], df['B']])
No need to specify anything for the code below!
def value(row):
    a = row.name        # index label of this row
    b = row['keys']     # which data column to read for this row
    c = df.loc[a, b]
    return c
df.apply(value, axis=1)
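Presumably you would then assign the result back to get the requested column (my addition, not in the original answer):
df['val_keys'] = df.apply(value, axis=1)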
Have you tried filtering then mapping:
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A['keys'], df_A['A']))
B_dict = dict(zip(df_B['keys'], df_B['B']))
df['val_keys'] = df['keys'].map(A_dict)
df['val_keys'] = df['keys'].map(B_dict).fillna(df['val_keys'])  # non-exhaustive mapping for the second one
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))
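For larger frames, a fully vectorized sketch (my own addition, assuming the data columns and the 'keys' column live in the same DataFrame as in the example) looks up each key's column position and indexes the underlying array directly:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'keys': ['A', 'B', 'B', 'A']})
# Positional index of each row's key column, then one pick per row from the raw array
col_pos = df.columns.get_indexer(df['keys'])
df['val_keys'] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df[['val_keys']])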

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id columns. I want the number of unique ids for each track/type combination as a new column in the table (without collapsing the track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that performs some function. transform will take either a function or a string that it recognizes as a function name, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
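To get the new column the question asks for, you would assign that result back, e.g.:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')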

Change an element in the last row of a DataFrame

I set up a simple DataFrame in pandas:
a = pandas.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a','b','c'])
>>> print a
a b c
0 1 2 3
1 4 5 6
2 7 8 9
I would like to be able to alter a single element in the last row of a. In pandas==0.13.1 I could use the following:
a.iloc[-1]['a'] = 77
>>> print a
a b c
0 1 2 3
1 4 5 6
2 77 8 9
but after updating to pandas==0.14.1, I get the following warning when doing this:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
The problem, of course, is that -1 is not an index of a, so I can't use loc. As the warning indicates, I have not changed column 'a' of the last row; I've only altered a discarded local copy.
How do I do this in the newer version of pandas? I realize I could use the index of the last row like:
a.loc[2,'a'] = 77
But I'll be working with tables where multiple rows have the same index, and I don't want to reindex my table every time. Is there a way to do this without knowing the index of the last row beforehand?
Taking elements from the solutions of @PallavBakshi and @Mike, the following works in Pandas >= 0.19:
a.loc[a.index[-1], 'a'] = 4.0
Just using iloc[-1, 'a'] won't work as 'a' is not a location.
Alright, I've found a way to solve this problem without chaining and without worrying about duplicate indices.
a.iloc[-1, a.columns.get_loc('a')] = 77
>>> a
a b c
0 1 2 3
1 4 5 6
2 77 8 9
I wasn't able to use iloc before because I couldn't supply the column index as an int, but get_loc solves that problem. Thanks for the helpful comments everyone!
For pandas 0.22,
a.at[a.index[-1], 'a'] = 77
This is just one of the ways.
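One caveat worth adding (my note, not from the answers above): with duplicated index labels, the label-based form a.loc[a.index[-1], 'a'] = ... writes to every row carrying that label, while the positional iloc form touches only the physically last row. A quick sketch:
import pandas as pd
# deliberately duplicated index labels
a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                 columns=['a', 'b', 'c'], index=[0, 1, 1])
a.loc[a.index[-1], 'a'] = 77                 # sets 'a' in BOTH rows labelled 1
a.iloc[-1, a.columns.get_loc('a')] = 99      # sets only the last row
print(a)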

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform a DataFrame such that some of the rows are replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas right now.
You could use groupby:
def f(group):
    row = group.iloc[0]   # first row of the group (irow has been removed from pandas)
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column has the counts in it, and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count':   [3, 3, 3, 1]})
print(df, "\n")

def f(group, *args):
    # First row of the group (irow has been removed from pandas; iloc[0] is the equivalent)
    row = group.iloc[0]
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand

df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and when expanded it is ~100,000 rows; it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise since it's using all the columns to group on, but hey, it's only 7 seconds.
There is even a simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure for generating a new index in which each row index is repeated multiple times according to its given count, and use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
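A related idiom (my addition, assuming a reasonably recent pandas and a unique index) is Index.repeat, which builds the repeated selection without constructing the position array by hand:
import pandas as pd
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
# Repeat each index label 'count' times, select those rows, then drop the helper column
out = df.loc[df.index.repeat(df['count']), ['class']].reset_index(drop=True)
print(out)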
