I have two dataframes; simplified, they look like this:
Dataframe A

ID  item
1   apple
2   peach

Dataframe B

ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this:

Dataframe C

ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried splitting dataframe B into two dataframes by flag value and merging them with dataframe A afterwards, but there must be an easier solution.
Thank you in advance! :)
*edit: removed the pictures
You can use pd.merge and pd.pivot_table for this (note the values column is named 'price ($)'):

df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
   ID   item  price_A  price_B
0   1  apple        3        2
1   2  peach        2        4
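Since each (ID, flag) pair occurs only once, a sketch of an alternative is to pivot dataframe B first and merge afterwards (df_A, df_B and the 'price ($)' column name follow the tables above):

```python
import pandas as pd

df_A = pd.DataFrame({'ID': [1, 2], 'item': ['apple', 'peach']})
df_B = pd.DataFrame({'ID': [1, 1, 2, 2],
                     'flag': ['A', 'B', 'B', 'A'],
                     'price ($)': [3, 2, 4, 2]})

# Pivot B so each flag value becomes its own price column, then merge with A.
wide = (df_B.pivot(index='ID', columns='flag', values='price ($)')
            .add_prefix('price_')
            .reset_index())
df_C = df_A.merge(wide, on='ID')
print(df_C)
```

pivot (rather than pivot_table) works here because no (ID, flag) combination is duplicated, so no aggregation is needed.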
Or as a single method chain:

(dfb
.merge(dfa, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)
Here I have an example dataframe:
dfx = pd.DataFrame({
    'name': ['alex', 'bob', 'jack'],
    'age': ["0,26,4", "1,25,4", "5,30,2"],
    'job': ["x,abc,0", "y,xyz,1", "z,pqr,2"],
    'gender': ["0,1", "0,1", "0,1"]
})
I want to first split column dfx['age'] and insert 3 separate columns for it, one for each substring in the age value, naming them dfx['age1'], dfx['age2'], dfx['age3']. I used the following code for this:
dfx = dfx.assign(age1=dfx['age'].str.split(',', expand=True)[0],
                 age2=dfx['age'].str.split(',', expand=True)[1],
                 age3=dfx['age'].str.split(',', expand=True)[2])
dfx = dfx[['name', 'age', 'age1', 'age2', 'age3', 'job', 'gender']]
dfx
So far so good!
Now, I want to repeat the same operations on the other columns, job and gender.
Desired Output
name age age1 age2 age3 job job1 job2 job3 gender gender1 gender2
0 alex 0,26,4 0 26 4 x,abc,0 x abc 0 0,1 0 1
1 bob 1,25,4 1 25 4 y,xyz,1 y xyz 1 0,1 0 1
2 jack 5,30,2 5 30 2 z,pqr,2 z pqr 2 0,1 0 1
I have no problem doing this individually for a small data frame like this one, but the actual data file has many such columns, so I need iteration.
I had difficulty iterating over the columns and naming the individual new columns.
I would be very glad to have a better solution for it.
Thanks!
Use a list comprehension to split each column in a defined list into its own DataFrame, renaming the new columns on the fly. Then concatenate those DataFrames with the original columns (sorting the column names) and join the unmatched columns back with DataFrame.join:
cols = ['age','job','gender']
L = [dfx[x].str.split(',',expand=True).rename(columns=lambda y: f'{x}{y+1}') for x in cols]
df1 = dfx[dfx.columns.difference(cols)]
df = df1.join(pd.concat([dfx[cols]] + L, axis=1).sort_index(axis=1))
print (df)
name age age1 age2 age3 gender gender1 gender2 job job1 job2 job3
0 alex 0,26,4 0 26 4 0,1 0 1 x,abc,0 x abc 0
1 bob 1,25,4 1 25 4 0,1 0 1 y,xyz,1 y xyz 1
2 jack 5,30,2 5 30 2 0,1 0 1 z,pqr,2 z pqr 2
Thanks again @jezrael for your answer. Inspired by the use of f-strings, I solved the problem with iteration as follows:
for col in dfx.columns[1:]:
    for i in range(len(dfx[col][0].split(','))):
        dfx[f'{col}{i+1}'] = dfx[col].str.split(',', expand=True)[i]
dfx = dfx[['name', 'age', 'age1', 'age2', 'age3', 'job', 'job1', 'job2', 'job3',
           'gender', 'gender1', 'gender2']]
dfx
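A sketch that also builds the column order on the fly, so nothing has to be hard-coded (the names order and parts are my own):

```python
import pandas as pd

dfx = pd.DataFrame({
    'name': ['alex', 'bob', 'jack'],
    'age': ["0,26,4", "1,25,4", "5,30,2"],
    'job': ["x,abc,0", "y,xyz,1", "z,pqr,2"],
    'gender': ["0,1", "0,1", "0,1"]
})

order = ['name']
for col in ['age', 'job', 'gender']:
    # Split the column once, then add one new column per piece.
    parts = dfx[col].str.split(',', expand=True)
    order.append(col)
    for i in parts.columns:
        dfx[f'{col}{i + 1}'] = parts[i]
        order.append(f'{col}{i + 1}')
dfx = dfx[order]
print(dfx.columns.tolist())
```

This avoids re-splitting the column for every piece and keeps each original column next to its derived columns.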
I have a dataframe with only one column, named 'ALL Categories'. Each row contains between 1 and 3 names, separated by the delimiters '|', '||' or '|||', which can appear at the beginning, between, or at the end of the words in every row. I want to split the column into multiple columns so that the new columns contain the names. How can I do it?
Below is the code to generate the dataframe:
x = {'ALL Categories': ['Rakesh||Ramesh|','||Rajesh|','HARPRIT|||','Tushar||manmit|']}
df = pd.DataFrame(x)
When I used the code below to modify the above dataframe, it didn't give me any result:
data = data.ALL_HOLDS.str.split(r'w', expand = True)
I believe you need Series.str.extractall if you want each word in a separate column:
df1 = df['ALL Categories'].str.extractall(r'(\w+)')[0].unstack()
print (df1)
match 0 1
0 Rakesh Ramesh
1 Rajesh NaN
2 HARPRIT NaN
3 Tushar manmit
Or a slightly modified version of @Chris A's code from the comments, with Series.str.strip and Series.str.split on one or more |:
df1 = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)
print (df1)
0 1
0 Rakesh Ramesh
1 Rajesh None
2 HARPRIT None
3 Tushar manmit
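If you also want meaningful column labels, the split result can be prefixed. This is just a sketch; the 'name_' prefix is my own choice:

```python
import pandas as pd

df = pd.DataFrame({'ALL Categories': ['Rakesh||Ramesh|', '||Rajesh|',
                                      'HARPRIT|||', 'Tushar||manmit|']})

# Strip surrounding pipes, split on runs of pipes, and label the columns.
df1 = (df['ALL Categories']
       .str.strip('|')
       .str.split(r'\|+', expand=True)
       .add_prefix('name_'))
print(df1)
```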
I have two dataframes like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count the occurrences of each of df1's values in df2 separately. The correct answer would be:

No  Amount
1   3
2   2

Instead of:

No  Amount
1   5
2   5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
Since pandas 0.21.0 you can use set_axis to rename columns in a method chain. Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1, inplace=False)
Output:
No Amount
0 1 3
1 2 2
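Note that the inplace argument of set_axis was later removed (in pandas 2.0), so on recent versions the same chain reads like this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

result = (df2[df2.a.isin(df1.a)]
          .squeeze()               # one-column frame -> Series
          .value_counts()
          .reset_index()
          .set_axis(['No', 'Amount'], axis=1))
print(result)
```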
You can simply take the value_counts of the second df and map them onto the first df, i.e.
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2
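An equivalent sketch with groupby instead of value_counts (the intermediate name counts is mine):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

# Count every value in df2 once, then look up only df1's values.
counts = df2.groupby('a').size()
result = df1.assign(Amount=df1['a'].map(counts)).rename(columns={'a': 'No'})
print(result)
```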
I have the following data frame:
import pandas as pd
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1],'tag':['apple','orange','grapes','lemon']})
df = df[["tag","a","b"]]
That looks like this:
In [37]: df
Out[37]:
tag a b
0 apple 0 0
1 orange 0 1
2 grapes 1 0
3 lemon 1 1
What I want to do is to remove rows where all the numerical columns are zero, resulting in this:
tag a b
orange 0 1
grapes 1 0
lemon 1 1
How can I achieve that?
Note that in actuality the number of columns can be more than 2 and the column names can vary, so we need a general solution.
I tried this, but it has no effect:
df[(df.T != 0).any()]
There are a few different things going on in this answer; let me know if anything is particularly confusing:
df.loc[~ (df.select_dtypes(include=['number']) == 0).all(axis='columns'), :]
So:
Filtering to find just the numeric columns
Applying the .all() method across columns rather than rows (rows is the default)
Negating with ~
Passing the resulting boolean series to df.loc[]
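The steps above can be sketched end-to-end; an equivalent form keeps rows where any numeric column is non-zero:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
                   'tag': ['apple', 'orange', 'grapes', 'lemon']})
df = df[['tag', 'a', 'b']]

# Keep rows where at least one numeric column is non-zero.
mask = (df.select_dtypes(include=['number']) != 0).any(axis='columns')
result = df.loc[mask]
print(result)
```

Because select_dtypes picks only the numeric columns, the always-truthy 'tag' column no longer defeats the mask, which is why the original df[(df.T != 0).any()] attempt had no effect.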
Get the numeric columns (this requires numpy and assumes the numeric columns are int64):

import numpy as np
numcols = df.dtypes == np.int64

Create the indexer and filter:

I = np.sum(df.loc[:, numcols] != 0, axis=1) != 0
df[I]
tag a b
1 orange 0 1
2 grapes 1 0
3 lemon 1 1