Get substring in one column based on the value in another column - python

My dataframe looks like this:
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df
Length name
0 2 a12
1 1 b1
2 0 c
I would like to have a result like this:
Length name
0 2 a
1 1 b
2 0 c
With this code, taken from "Getting substring based on another column in a pandas dataframe":
test_df.apply(lambda x: x['name'][:-x['Length']],axis = 1)
test_df
I got the same dataframe as before:
Length name
0 2 a12
1 1 b1
2 0 c

Modify your apply a bit so that it slices relative to len(x['name']):
def f(x):
    return x['name'][:len(x['name']) - x['Length']]
test_df.apply(f, axis=1)
0 a
1 b
2 c
dtype: object
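To see why the original attempt misbehaves, here is a small sketch using the question's test_df. Two things go wrong: the result of apply is never assigned back to the dataframe, and for a Length of 0 the slice [:-0] is the same as [:0], which yields an empty string. The names naive and fixed are only illustrative.

```python
import pandas as pd

test_df = pd.DataFrame({'name': ['a12', 'b1', 'c'], 'Length': [2, 1, 0]})

# Negative slicing: 'c'[:-0] is 'c'[:0], i.e. an empty string.
naive = test_df.apply(lambda x: x['name'][:-x['Length']], axis=1)

# Slicing relative to the string length handles Length == 0 correctly.
fixed = test_df.apply(lambda x: x['name'][:len(x['name']) - x['Length']], axis=1)

print(naive.tolist())
print(fixed.tolist())

# apply returns a new Series; assign it to change the dataframe.
test_df['name'] = fixed
```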

Try this:
import pandas as pd
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df['name']=test_df.apply(lambda x: x['name'][:len(x['name'])-x['Length']],axis = 1)
test_df
This will output as you intended
Length name
0 2 a
1 1 b
2 0 c

One can use list functions for this:
outlist = list(map(lambda x, y: x[0:(len(x) - y)], test_df['name'], test_df['Length']))
test_df['name'] = outlist
print(test_df)
Output:
Length name
0 2 a
1 1 b
2 0 c

Related

Output as zero in data frame with size() in python

I have a file that consists of three columns, A, B and C, each holding integers. Using Python, let's say I would like to groupby() column 'A' and get the size() of each group, counting only rows where column 'B' is greater than 4, 6 and 8 respectively. So I implemented the code below:
>>> import pandas as pd
>>>
>>> df = pd.read_csv("test.txt", sep="\t")
>>> df
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
>>>
>>> out1 = df[df['B'] > 4].groupby(['A']).size().reset_index()
>>> out1
A 0
0 1 1
1 2 2
>>> out2 = df[df['B'] > 6].groupby(['A']).size().reset_index()
>>> out2
A 0
0 2 1
>>> out3 = df[df['B'] > 8].groupby(['A']).size().reset_index()
>>> out3
Empty DataFrame
Columns: [A, 0]
Index: []
>>>
out1 is the output that I want. But for out2 and out3, how do I get a data frame like out1, with zeros for the groups that have no matching rows, as below?
out2:
A 0
0 1 0
1 2 1
out3:
A 0
0 1 0
1 2 0
Thanks in advance.
The idea is to create a boolean mask, convert it to integers, and aggregate with sum. Here it is necessary to group by the Series df['A'] rather than by the column name 'A', because the mask is a Series and has no column A:
out3 = (df['B'] > 8).astype(int).groupby(df['A']).sum().reset_index()
#alternative
#out3 = (df['B'] > 8).view('i1').groupby(df['A']).sum().reset_index()
print (out3)
A B
0 1 0
1 2 0
Another idea is to create a helper column, e.g. overwrite B with the integer mask via assign, and then aggregate with sum:
out3 = df.assign(B = (df['B'] > 8).astype(int)).groupby('A')['B'].sum().reset_index()
print (out3)
A B
0 1 0
1 2 0
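To cover all three thresholds from the question in one pass, the mask-and-sum idea can be wrapped in a small helper. This is only a sketch: the function name count_above and the column name count are made up for illustration, and the dataframe is rebuilt inline from the question's printed data instead of being read from test.txt.

```python
import pandas as pd

# Same data as printed in the question.
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [4, 5, 2, 7, 4, 6],
    'C': [3, 4, 10, 2, 4, 6],
})

def count_above(frame, threshold):
    """Count rows per group of A where B > threshold, keeping zero counts."""
    mask = frame['B'] > threshold
    # Group the integer mask by the Series frame['A'], so every group survives.
    return mask.astype(int).groupby(frame['A']).sum().reset_index(name='count')

results = {t: count_above(df, t) for t in (4, 6, 8)}
for threshold, table in results.items():
    print(threshold)
    print(table)
```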

pandas groupby apply list from column based on binary column

I have a dataframe:
id to from flag
1 a x 1
1 a y 0
2 c z 1
2 c m 1
2 b v 0
2 b p 0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have a flag 1 only. If no element has a flag 1, then the resulting output should be 'None'. The desired output should be:
id to from
1 a ['x']
2 c ['z','m']
2 b None
I can do it with apply i.e.
out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    # positions of the rows whose flag is non-zero
    matches = list(to.iloc[flag.to_numpy().nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long and I think there must be a better way that I am missing.
Any help/insights would be very appreciated! TIA
IIUC, first create the full index as a MultiIndex, then do the groupby with agg on the flagged rows and reindex:
idx=pd.MultiIndex.from_tuples(list(map(tuple,df[['id','to']].drop_duplicates().values.tolist())))
yourdf=df.loc[df.flag==1].groupby(['id','to'])['from'].agg(list).reindex(idx).reset_index()
yourdf
Out[13]:
level_0 level_1 from
0 1 a [x]
1 2 c [z, m]
2 2 b NaN
Or just use apply, which is less efficient but more readable:
df.groupby(['id','to']).apply(lambda x : x['from'][x['flag']==1].tolist() if (x['flag']==1).any() else None).reset_index()
Out[17]:
id to 0
0 1 a [x]
1 2 b None
2 2 c [z, m]
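A variant of the first approach that keeps the id and to column names is to aggregate the flagged rows and then left-merge onto the distinct (id, to) pairs. This is a sketch only; the names flagged, keys and out are illustrative, and groups without a flagged row come back as NaN rather than None.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 2],
    'to': ['a', 'a', 'c', 'c', 'b', 'b'],
    'from': ['x', 'y', 'z', 'm', 'v', 'p'],
    'flag': [1, 0, 1, 1, 0, 0],
})

# Lists of 'from' values for flagged rows only; unflagged groups drop out here.
flagged = (df[df['flag'] == 1]
           .groupby(['id', 'to'])['from']
           .agg(list)
           .reset_index())

# Every (id, to) pair, in order of first appearance.
keys = df[['id', 'to']].drop_duplicates()

# The left merge brings back groups with no flagged rows, filled with NaN.
out = keys.merge(flagged, on=['id', 'to'], how='left')
print(out)
```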

Python: how to drop duplicates with duplicates?

I have a dataframe like the following
df
Name Y
0 A 1
1 A 0
2 B 0
3 B 0
5 C 1
I want to drop the duplicates of Name and keep the ones that have Y=1 such as:
df
Name Y
0 A 1
1 B 0
2 C 1
Use the drop_duplicates method:
df.sort_values('Y', ascending= False).drop_duplicates(subset=['Name'])
groupby + max
Assuming your Y series consists only of 0 and 1 values:
res = df.groupby('Name', as_index=False)['Y'].max()
print(res)
Name Y
0 A 1
1 B 0
2 C 1
Does the 'Y' column contain only 0 and 1? In that case, you can try the following:
df = df.sort_values(['Y'], ascending= False)
df = df.drop_duplicates(['Name'])
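One thing to be aware of with the sort-then-drop approach is that the surviving rows come out ordered by Y rather than by their original position. A minimal sketch, assuming the question's data, that restores the original order and renumbers the index:

```python
import pandas as pd

# Same data and index labels as in the question.
df = pd.DataFrame(
    {'Name': ['A', 'A', 'B', 'B', 'C'], 'Y': [1, 0, 0, 0, 1]},
    index=[0, 1, 2, 3, 5],
)

res = (df.sort_values('Y', ascending=False)   # rows with Y == 1 come first
         .drop_duplicates(subset='Name')      # keep the first row per Name
         .sort_index()                        # back to original row order
         .reset_index(drop=True))             # renumber 0..n-1
print(res)
```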

Pandas Python : how to create multiple columns from a list

I have a list with columns to create :
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero :
df[new_cols] = 0
Get error :
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate as I want to create them...
EDIT: This is a duplicate of the question "Add multiple empty columns to pandas DataFrame". However, I am keeping this one too, because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer there.
EDIT 2 : While the accepted answer is the most simple, interesting one-liner solutions were posted below
You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
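As one such alternative, reindex can add several missing columns at once with a fill value. A short sketch, assuming a small example dataframe that is not from the question:

```python
import pandas as pd

# Hypothetical starting dataframe, for illustration only.
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [0, 1, 2]})
new_cols = ['new_1', 'new_2', 'new_3']

# Columns that do not exist yet are created and filled with fill_value.
out = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)
print(out)
```

Unlike the loop, this returns a new dataframe and leaves df unchanged.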
Use assign by dictionary:
df = pd.DataFrame({
    'A': ['a','a','a','a','b','b','b','c','d'],
    'B': list(range(9))
})
print (df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for?
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
Try looping through the column names and creating each column in turn:
for col in new_cols:
    df[col] = 0
We can use the apply function to pull each element out of a list-valued column and assign it to a new column.
For instance, suppose the dataframe has a column named keys whose cells each hold a list such as
[10,20,30]
In your case, since the values are all 0, you can assign 0 directly without any of this. But if there are values to populate, you can do it as below:
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])
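The three apply calls can also be collapsed into one step by building a dataframe from the lists. This is a sketch under the assumption that every list in keys has the same length; the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each cell of 'keys' holds a list of three values.
df = pd.DataFrame({'keys': [[10, 20, 30], [40, 50, 60]]})
new_cols = ['new_01', 'new_02', 'new_03']

# One row per list, one column per list position, aligned on df's index.
expanded = pd.DataFrame(df['keys'].tolist(), index=df.index, columns=new_cols)
df = df.join(expanded)
print(df)
```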

Is there a way to put a dataframe as the value of a specific column in pandas python?

I have a set of data that has column names and values to create a dataframe.
However, one of the column values is itself another dataframe. Is it possible to do this in pandas, or is each column value meant to be a single value?
For example what I am trying to achieve would look something like this;
df
out:
A B C
0 A1 B1 D E
D1 E1
F G
F1 G1
This is where letters that have numbers with them are the values, and just letters are the column names.
Yes, it is possible to put another dataframe (or any type of object) in a pandas cell.
In[2]: df1 = pd.DataFrame({'a':range(2)})
df1
Out[2]:
a
0 0
1 1
In[3]: df2 = pd.DataFrame({'x':range(3), 'y':range(3)})
df2
Out[3]:
x y
0 0 0
1 1 1
2 2 2
In[4]: df1['b'] = [df2, {'cat':'meow', 'otter':'clap'}]
df1
Out[4]:
a b
0 0 x y 0 0 0 1 1 1 2 2 2
1 1 {u'otter': u'clap', u'cat': u'meow'}
In[5]: df1.at[0, 'b']
Out[5]:
x y
0 0 0
1 1 1
2 2 2
As you can see, a dataframe containing another dataframe is not very readable when printed. If you want it to look like your example, you should go with a MultiIndex, as Wen suggested.
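For the MultiIndex route, one possible sketch is to concatenate the pieces side by side under top-level column labels. The labels outer, C_first and C_second are invented here for illustration, and the values follow the letters used in the question's example.

```python
import pandas as pd

outer = pd.DataFrame({'A': ['A1'], 'B': ['B1']})
first = pd.DataFrame({'D': ['D1'], 'E': ['E1']})
second = pd.DataFrame({'F': ['F1'], 'G': ['G1']})

# keys become the top level of a column MultiIndex.
combined = pd.concat([outer, first, second], axis=1,
                     keys=['outer', 'C_first', 'C_second'])
print(combined)

# Selecting a top-level label returns that nested block as a plain dataframe.
print(combined['C_first'])
```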
