Contains one of several values - python

I have a dataframe df with a column x and a list lst = ["apple", "peach", "pear"]:

df
x
apple234
pear231
banana233445

If a row in df["x"] contains any of the values in lst, then y should be 1, else 0.
The final data should look like this:

df
x            y
apple234     1
pear231      1
banana233445 0

Use str.contains with a regex | joining all values of the list, then cast the boolean mask to 0/1 with astype:

lst = ["apple", "peach", "pear"]
df['y'] = df['x'].str.contains('|'.join(lst)).astype(int)
print (df)

              x  y
0      apple234  1
1       pear231  1
2  banana233445  0
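
One caveat worth adding: str.contains treats the pattern as a regular expression, so if the list values could contain regex metacharacters (e.g. "." or "+"), escape them first. A minimal sketch using re.escape:

import re
import pandas as pd

df = pd.DataFrame({'x': ['apple234', 'pear231', 'banana233445']})
lst = ["apple", "peach", "pear"]

# escape each value so metacharacters are matched literally
pattern = '|'.join(re.escape(s) for s in lst)
df['y'] = df['x'].str.contains(pattern).astype(int)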

Groupby count of values - pandas

I'm hoping to count specific values from a pandas df. Using the code below, I'm subsetting on Item == 'Up' and grouping by Num and Label to count the values in Item. The values in the output are correct, but I want to drop Label and include Up in the column headers.
import pandas as pd

df = pd.DataFrame({
    'Num'   : [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    'Label' : ['A','B','A','B','B','B','A','B','B','A','A','A','B','A','B','A'],
    'Item'  : ['Up','Left','Up','Left','Down','Right','Up','Down','Right','Down','Right','Up','Up','Right','Down','Left'],
})

df1 = (df[df['Item'] == 'Up']
       .groupby(['Num','Label'])['Item']
       .count()
       .unstack(fill_value=0)
       .reset_index()
       )
Intended output:

Num  A_Up  B_Up
1    3     0
2    1     1
With your approach, you can include Item in the grouper:

out = (df[df['Item'] == 'Up'].groupby(['Num','Label','Item']).size()
       .unstack(['Label','Item'], fill_value=0))
out.columns = out.columns.map('_'.join)
print(out)

     A_Up  B_Up
Num
1       3     0
2       1     1
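
A note on the flattening step: out.columns.map('_'.join) works here because every level of the MultiIndex is a string; with non-string levels you would need something like out.columns.map(lambda t: '_'.join(map(str, t))).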
You can use GroupBy.transform to get the counts while keeping all the columns, then use df.pivot_table and a list comprehension to build the desired column names:

x = df[df['Item'] == 'Up'].copy()  # copy to avoid SettingWithCopyWarning
x['c'] = x.groupby(['Num','Label'])['Item'].transform('count')
x = x.pivot_table(index='Num', columns=['Label', 'Item'], aggfunc='first', fill_value=0)
x.columns = [j + '_' + k for i, j, k in x.columns]
print(x)

     A_Up  B_Up
Num
1       3     0
2       1     1
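
For comparison, pd.crosstab can build the same table in one step, using the df defined above; this is a sketch, not part of the original answers:

sub = df[df['Item'] == 'Up']
out = pd.crosstab(sub['Num'], [sub['Label'], sub['Item']])  # counts per (Num, Label, Item)
out.columns = out.columns.map('_'.join)
print(out)

     A_Up  B_Up
Num
1       3     0
2       1     1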

How to get new pandas dataframe with certain columns and rows depending on list elements?

I have a list:

l = ['A','B']

And a dataframe df:

Name x y
A    1 2
B    2 1
C    2 2

I now want a new dataframe where only the rows whose Name is in l are kept, with just the Name and x columns.
new_df should look like this:

Name x
A    1
B    2

I was playing around with isin but could not solve the problem.
Use DataFrame.loc with Series.isin:
new_df = df.loc[df.Name.isin(l), ["Name", "x"]]
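
For completeness, a runnable version with the sample data (assuming Name is an ordinary column rather than the index):

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'x': [1, 2, 2], 'y': [2, 1, 2]})
l = ['A', 'B']

# keep rows whose Name is in l, and only the Name and x columns
new_df = df.loc[df.Name.isin(l), ['Name', 'x']]
print(new_df)

  Name  x
0    A  1
1    B  2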
This should do it:
# assuming Name is the index
new_df = df[df.index.isin(l)]
# if you only want column x
new_df = df.loc[df.index.isin(l), "x"]
simple as that:

l = ['A','B']

def make_empty(row):
    # blank out any value that is not in l
    for idx, value in enumerate(row):
        row[idx] = value if value in l else ''
    return row

df_new = df[df['Name'].isin(l) | df['x'].isin(l)][['Name','x']]
df_new = df_new.apply(make_empty)  # applied column-wise (default axis=0)

Output:

  Name x
0    A
1    B

How to split/group a list of dataframes by the length of each data frame

For example, I have a list of 100 data frames, some have column length of 8, others 10, others 12. I want to be able to split these into groups based on their column length. I have tried dictionaries but couldn't get it to append properly in a loop.
Previously tried code:

col_count = [8, 10, 12]
d = dict.fromkeys(col_count, [])
for df in df_lst:
    for i in col_count:
        if i == len(df.columns):
            d[i] = df

but this just seems to replace the values in the dict each time. I have tried .append also, but that seems to append to all keys.
Instead of assigning a df to d[column_count], you should append it.
You initialized d with d = dict.fromkeys(col_count, []), but beware: fromkeys binds every key to the same empty list object, which is exactly why your .append attempt seemed to append to all keys. Build the dict with a comprehension instead: d = {k: [] for k in col_count}.
When you do d[i] = df you replace the list with a DataFrame, so d becomes a dictionary of DataFrames. If you do d[i].append(df) (with per-key lists) you get a dictionary of lists of DataFrames, which is what you want AFAIU.
Also, I'm not sure that you need the col_count variable. You could just do d[len(df.columns)].append(df).
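
A minimal sketch of that fix with collections.defaultdict, which creates a fresh list per key on first access (the df_lst here is a stand-in for your own list):

import pandas as pd
from collections import defaultdict

# stand-in list of dataframes with 8, 10 and 12 columns
df_lst = [pd.DataFrame(columns=range(n)) for n in (8, 10, 12, 8)]

d = defaultdict(list)  # every new key gets its own empty list
for df in df_lst:
    d[len(df.columns)].append(df)

print({k: len(v) for k, v in d.items()})  # {8: 2, 10: 1, 12: 1}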
I think this should suffice for you. Think about how to solve your problems dynamically to make better use of Python.

import pandas as pd

# make a testing list of dataframes with variable lengths
for i in range(1, 6):
    exec(f"df{i} = pd.DataFrame(0, index=range({i}), columns=list('ABCD'))")

df1  # one row df
   A  B  C  D
0  0  0  0  0

df2  # two row df
   A  B  C  D
0  0  0  0  0
1  0  0  0  0

df3  # three row df
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0

L = [df1, df2, df3, df4, df5]  # I assume all your dataframes are put into a container like this

my_3_length_shape_dfs = []  # create a container for each length you care about (an extra exec could build these too)
for i in L:
    if i.shape[0] == 3:  # add more of these if needed; you mentioned your lengths are known [8, 10, 12]
        my_3_length_shape_dfs.append(i)  # add the df to the container, grouping dfs whose row length equals 3
        print(i)

   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0

trim last rows of a pandas dataframe based on a condition

Let's assume a dataframe like this:

idx x y
0   a 3
1   b 2
2   c 0
3   d 2
4   e 5
how can I trim the bottom rows, based on a condition, so that any row after the last one matching the condition would be removed?
For example, with the following condition: y == 0, the output would be:

idx x y
0   a 3
1   b 2
2   c 0
the condition can happen many times, but the last one is the one that triggers the cut.
Method 1:
Using index.max & iloc:
index.max to get the last row matching the condition y == 0
iloc to slice the dataframe up to that index

idx = df.query('y.eq(0)').index.max() + 1
# idx = df.query('y==0').index.max() + 1 -- if pandas < 0.25
df.iloc[:idx]
Output

   x  y
0  a  3
1  b  2
2  c  0
Method 2:
Using np.where:

import numpy as np
idx = np.where(df['y'].eq(0), df.index, 0).max() + 1
df.iloc[:idx]

Output

   x  y
0  a  3
1  b  2
2  c  0
You could do it like this: np.where returns a tuple, so we access the array of indexes as its first element with np.where(df.y == 0)[0]; the last occurrence is then the last element of this vector; finally we add 1 to the index so the slice includes the row of that last occurrence:

df_cond = df.iloc[:np.where(df.y == 0)[0][-1] + 1, :]

Or, since the cumulative sum of the boolean mask reaches its maximum at the last match and idxmax returns the first label attaining that maximum, you could do:

df_cond = df[:df.y.eq(0).cumsum().idxmax() + 1]
Set up your dataframe:

data = [
    ['a', 3],
    ['b', 2],
    ['c', 0],
    ['d', 2],
    ['e', 5],
]
df = pd.DataFrame(data, columns=['x', 'y']).reset_index().rename(columns={'index': 'idx'}).sort_values('idx')

Then find your cutoff (assuming the idx column is already sorted):

cutoff = df[df['y'] == 0].idx.max()

The df['y'] == 0 is your condition. Then get the max idx that meets that condition (since the last match triggers the cut) and save it as our cutoff.
Finally, create a new dataframe using your cutoff:

df_new = df[df.idx <= cutoff].copy()

Output:

df_new
   idx  x  y
0    0  a  3
1    1  b  2
2    2  c  0
I would do something like this:

df.iloc[:df['y'].eq(0).idxmax() + 1]

Just look for the largest index where your condition is true.

EDIT

The above code won't work, because idxmax() only takes the first index where the value is true. So we can do the following to trick it:

df.iloc[:df['y'].eq(0).sort_index(ascending=False).idxmax() + 1]

Flip the index, so the last match is the first one that idxmax picks up.
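
One more way to express the same cut, not taken from the answers above: reverse the boolean mask with [::-1] so idxmax returns the label of the last match, then slice with .loc, which includes the stop label (this assumes at least one row matches):

import pandas as pd

df = pd.DataFrame({'x': list('abcde'), 'y': [3, 2, 0, 2, 5]})

last = df['y'].eq(0)[::-1].idxmax()  # label of the last row where y == 0
print(df.loc[:last])                 # .loc slicing includes the stop label

   x  y
0  a  3
1  b  2
2  c  0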

Find column names when row element meets a criteria Pandas

This is a basic question. I've got a square array with the rows and columns summed up. Eg:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]], index = ["a","b","c","d"], columns = ["a","b","c","d"])
df["sumRows"] = df.sum(axis = 1)
df.loc["sumCols"] = df.sum()
This returns:

         a  b  c  d  sumRows
a        0  0  1  0        1
b        0  0  1  0        1
c        1  0  0  0        1
d        0  1  0  0        1
sumCols  1  1  2  0        4
I need to find the column labels in the sumCols row that match 0. At the moment I am doing this:

df.loc["sumCols"][df.loc["sumCols"] == 0].index

But this returns a strange index-type object. All I want is a list of the values that match the criteria, i.e. ['d'] in this case.
There are two ways (the index object can be converted to an iterable like a list).
Do it with the columns:

columns = df.columns[df.sum() == 0]
columns = list(columns)

Or you can transpose the DataFrame and treat columns as rows:

list(df.T[df.T.sumCols == 0].index)
You can use a lambda expression to filter the series, and if you want a list instead of an index as the result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols row, if you want to check which columns have a sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: the lambda expression approach is as fast as the vectorized method for subsetting, is functional in style, and can easily be written as a one-liner if you prefer.
Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()
