Creating new records in dataframe based on character - python

I have fields in a pandas dataframe like the sample data below. The values in one of the fields are fractions of the form something/count(something). I would like to split each value into its numerator and denominator and create a new record for each part, as in the example output below. Some values contain more than one /, such as count(something)/count(thing)/count(dog), and those should be split into 3 records. Any tips on how to do this would be greatly appreciated.
Sample Data:
SampleDf=pd.DataFrame([['tom','sum(stuff)/count(things)'],['bob','count(things)/count(stuff)']],columns=['ReportField','OtherField'])
Example Output:
OutputDf=pd.DataFrame([['tom1','sum(stuff)'],['tom2','count(things)'],['bob1','count(things)'],['bob2','count(stuff)']],columns=['ReportField','OtherField'])

There might be a better way, but try this (with the sample data loaded as df):
df = df.set_index('ReportField')
df = pd.DataFrame(df.OtherField.str.split('/', expand = True).stack().reset_index(-1, drop = True)).reset_index()
You get
ReportField 0
0 tom sum(stuff)
1 tom count(things)
2 bob count(things)
3 bob count(stuff)

One possible way is as follows:
# split and stack
new_df = pd.DataFrame(SampleDf.OtherField.str.split('/').tolist(), index=SampleDf.ReportField).stack().reset_index()
print(new_df)
Output:
ReportField level_1 0
0 tom 0 sum(stuff)
1 tom 1 count(things)
2 bob 0 count(things)
3 bob 1 count(stuff)
Now, combine ReportField with level_1:
# combine strings for tom1, tom2 ,.....
new_df['ReportField'] = new_df.ReportField.str.cat((new_df.level_1+1).astype(str))
# remove level column
del new_df['level_1']
# rename columns
new_df.columns = ['ReportField', 'OtherField']
print (new_df)
Output:
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)

You can use:
split with expand=True to get a new DataFrame
stack and reset_index to reshape it
astype to convert the counter to str before appending it to the ReportField column
drop to remove the helper column level_1
OutputDf = (SampleDf.set_index('ReportField')['OtherField'].str.split('/',expand=True)
            .stack().reset_index(name='OtherField'))
OutputDf['ReportField'] = OutputDf['ReportField'] + OutputDf['level_1'].add(1).astype(str)
OutputDf = OutputDf.drop('level_1', axis=1)
print (OutputDf)
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)
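On pandas 0.25 or later, the same reshaping can also be sketched with explode instead of stack. This is only an illustration under assumptions: the 'amy' row is hypothetical and was added to show a value with three parts, and the helper column named index is simply what reset_index produces for an unnamed index.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical three-part row ('amy').
SampleDf = pd.DataFrame(
    [['tom', 'sum(stuff)/count(things)'],
     ['bob', 'count(things)/count(stuff)'],
     ['amy', 'count(something)/count(thing)/count(dog)']],
    columns=['ReportField', 'OtherField'],
)

# Turn each string into a list of parts, then give every part its own row.
# reset_index() keeps the source row label in a helper column named 'index'.
OutputDf = (
    SampleDf.assign(OtherField=SampleDf['OtherField'].str.split('/'))
    .explode('OtherField')
    .reset_index()
)

# Number the parts within each source row (1-based) and append to ReportField.
counter = OutputDf.groupby('index').cumcount().add(1).astype(str)
OutputDf['ReportField'] = OutputDf['ReportField'] + counter

# Remove the helper column.
OutputDf = OutputDf.drop(columns='index')
print(OutputDf)
```

Grouping on the source row label rather than on ReportField keeps the numbering correct even if two rows happen to share the same ReportField value.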

Related

Combining two pandas dataframes into one based on conditions

I got two dataframes, simplified they look like this:
Dataframe A
ID  item
1   apple
2   peach
Dataframe B
ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this
Dataframe C
ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried to split dataframe B into two dataframes by the different flag values and then merge them with dataframe A, but there must be an easier solution.
Thank you in advance! :)
*edit: removed the pictures
You can use pd.merge and pd.pivot_table for this:
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
(dfb
.merge(dfa, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)
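Both answers rely on pivot_table, which by default averages the prices if the same (ID, flag) pair appears more than once. If each pair is expected to occur exactly once, a possible variant is to use pivot, which raises an error on duplicates instead. The sketch below rebuilds the two dataframes from the tables in the question; the variable name wide is an assumption made for illustration.

```python
import pandas as pd

# Dataframes rebuilt from the tables in the question.
df_A = pd.DataFrame({'ID': [1, 2], 'item': ['apple', 'peach']})
df_B = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'flag': ['A', 'B', 'B', 'A'],
    'price ($)': [3, 2, 4, 2],
})

# One row per ID, one column per flag; pivot raises a ValueError if an
# (ID, flag) pair is duplicated, rather than aggregating it.
wide = (
    df_B.pivot(index='ID', columns='flag', values='price ($)')
    .add_prefix('price_')
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'flag' label on the columns

# Left merge keeps every item of df_A, even one without any price.
df_C = df_A.merge(wide, on='ID', how='left')
print(df_C)
```

Pivoting before merging also means the item column does not need to be carried through the reshape.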

inserting new columns by splitting each column and iterating for many columns in python pandas DataFrame

Here I have an example dataframe:
dfx = pd.DataFrame({
'name': ['alex','bob','jack'],
'age': ["0,26,4","1,25,4","5,30,2"],
'job': ["x,abc,0","y,xyz,1","z,pqr,2"],
'gender': ["0,1","0,1","0,1"]
})
I want to first split column dfx['age'] and insert 3 separate columns for it, one for each substring in age value, naming them dfx['age1'],dfx['age2'],dfx['age3'] . I used following code for this:
dfx = dfx.assign(**{'age1':(dfx['age'].str.split(',', expand = True)[0]),\
'age2':(dfx['age'].str.split(',', expand = True)[1]),\
'age3':(dfx['age'].str.split(',', expand = True)[2])})
dfx = dfx[['name', 'age','age1', 'age2', 'age3', 'job', 'gender']]
dfx
So far so good!
Now, I want to repeat the same operations to other columns job and gender.
Desired Output
name age age1 age2 age3 job job1 job2 job3 gender gender1 gender2
0 alex 0,26,4 0 26 4 x,abc,0 x abc 0 0,1 0 1
1 bob 1,25,4 1 25 4 y,xyz,1 y xyz 1 0,1 0 1
2 jack 5,30,2 5 30 2 z,pqr,2 z pqr 2 0,1 0 1
I have no problem doing it individually for small data frame like this. But, the actual datafile has many such columns. I need iterations.
I found difficulty in iteration over columns, and naming the individual columns.
I would be very glad to have better solution for it.
Thanks !
Use a list comprehension to split each column named in cols into its own DataFrame, renaming the new columns to the source column name plus a 1-based counter. Then concat the original columns with the split ones, sort the column names, and prepend the remaining columns (here name) with DataFrame.join:
cols = ['age','job','gender']
L = [dfx[x].str.split(',',expand=True).rename(columns=lambda y: f'{x}{y+1}') for x in cols]
df1 = dfx[dfx.columns.difference(cols)]
df = df1.join(pd.concat([dfx[cols]] + L, axis=1).sort_index(axis=1))
print (df)
name age age1 age2 age3 gender gender1 gender2 job job1 job2 job3
0 alex 0,26,4 0 26 4 0,1 0 1 x,abc,0 x abc 0
1 bob 1,25,4 1 25 4 0,1 0 1 y,xyz,1 y xyz 1
2 jack 5,30,2 5 30 2 0,1 0 1 z,pqr,2 z pqr 2
Thanks again @jezrael for your answer. Inspired by the use of f-strings, I solved the problem with iteration as follows:
for col in dfx.columns[1:]:
    # the number of parts is taken from the first row of each column
    for i in range(len(dfx[col][0].split(','))):
        dfx[f'{col}{i+1}'] = dfx[col].str.split(',', expand=True)[i]
dfx = dfx[['name', 'age', 'age1', 'age2', 'age3', 'job', 'job1', 'job2', 'job3',
           'gender', 'gender1', 'gender2']]
dfx
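The concat-based answer above sorts the column names, so gender ends up before job, and the loop splits each column once per part and needs the final column order written out by hand. A minimal sketch that splits each column once and keeps the order from the desired output might look like this; the names parts, split and result are assumptions for illustration.

```python
import pandas as pd

dfx = pd.DataFrame({
    'name': ['alex', 'bob', 'jack'],
    'age': ["0,26,4", "1,25,4", "5,30,2"],
    'job': ["x,abc,0", "y,xyz,1", "z,pqr,2"],
    'gender': ["0,1", "0,1", "0,1"],
})

cols = ['age', 'job', 'gender']   # columns to split, in the order to keep
parts = [dfx[['name']]]           # untouched columns go first

for col in cols:
    # Split once per column; the new columns are labelled 0, 1, 2, ...
    split = dfx[col].str.split(',', expand=True)
    split.columns = [f'{col}{i + 1}' for i in split.columns]
    # Keep the original column, followed by its pieces.
    parts.extend([dfx[[col]], split])

result = pd.concat(parts, axis=1)
print(result)
```

Because the pieces are appended in the order of cols, no manual reordering of the columns is needed afterwards.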

how to split the data in a column based on multiple delimiters, into multiple columns, in pandas

I have a dataframe with only one column, named 'ALL_category'. Each row contains between 1 and 3 names separated by the delimiters '|', '||' or '|||', which can appear at the beginning, in between, or at the end of the words in every row. I want to split the column into multiple columns such that the new columns contain the names. How can I do it?
Below is the code to generate the dataframe:
x = {'ALL Categories': ['Rakesh||Ramesh|','||Rajesh|','HARPRIT|||','Tushar||manmit|']}
df = pd.DataFrame(x)
When I used the code below to modify the above dataframe, it didn't give me any result.
data = data.ALL_HOLDS.str.split(r'w', expand = True)
I believe you need Series.str.extractall if you want each word in a separate column:
df1 = df['ALL Categories'].str.extractall(r'(\w+)')[0].unstack()
print (df1)
match 0 1
0 Rakesh Ramesh
1 Rajesh NaN
2 HARPRIT NaN
3 Tushar manmit
Or use a slightly modified version of the code from @Chris A in the comments, with Series.str.strip followed by Series.str.split on one or more | characters:
df1 = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)
print (df1)
0 1
0 Rakesh Ramesh
1 Rajesh None
2 HARPRIT None
3 Tushar manmit
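To keep the split names next to the original column, one option is to label the new columns and join them back. This is a sketch only; the name_1, name_2 labels are hypothetical and not part of the question.

```python
import pandas as pd

x = {'ALL Categories': ['Rakesh||Ramesh|', '||Rajesh|', 'HARPRIT|||', 'Tushar||manmit|']}
df = pd.DataFrame(x)

# Strip leading/trailing pipes, then split on runs of one or more pipes.
names = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)

# Hypothetical labels: name_1, name_2, ... instead of 0, 1, ...
names.columns = [f'name_{i + 1}' for i in names.columns]

# Join on the shared row index so each row keeps its original string.
result = df.join(names)
print(result)
```

Stripping first matters: without it, a leading or trailing delimiter would produce empty strings as extra pieces.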

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count how often each of the two numbers in df1 occurs in df2. The correct answer looks like this:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
Since pandas 0.21.0 you can use set_axis to rename columns as a chained method (the inplace argument was removed in pandas 2.0, so omit it there). Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1, inplace=False)
Output:
No Amount
0 1 3
1 2 2
You can simply compute the value_counts of the second df and map them onto the first df, i.e.:
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2
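One caveat with the map approach: a number in df1 that never occurs in df2 gets NaN rather than 0. A small sketch of how that could be handled, assuming a zero count is wanted in that case; the value 9 is a hypothetical addition to df1 for illustration.

```python
import pandas as pd

# 9 is a hypothetical value that does not occur in df2.
df1 = pd.DataFrame({'a': [1, 2, 9]})
df2 = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

# Count every value in df2 once, then look up only the values from df1.
counts = df2['a'].value_counts()

result = pd.DataFrame({
    'No': df1['a'],
    # Values missing from df2 map to NaN; turn those into a count of 0.
    'Amount': df1['a'].map(counts).fillna(0).astype(int),
})
print(result)
```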

How to remove rows where all numerical columns contain zero in Pandas Dataframe with mixed type of columns?

I have the following data frame:
import pandas as pd
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1],'tag':['apple','orange','grapes','lemon']})
df = df[["tag","a","b"]]
That looks like this:
In [37]: df
Out[37]:
tag a b
0 apple 0 0
1 orange 0 1
2 grapes 1 0
3 lemon 1 1
What I want to do is remove the rows where all numerical columns are zero, resulting in this:
tag a b
orange 0 1
grapes 1 0
lemon 1 1
How can I achieve that?
Note that in actuality, the number of columns can be more than 2 and column name can be varied. So we need a general solution.
I tried this but has no effect:
df[(df.T != 0).any()]
There are a few different things going on in this answer; let me know if anything is particularly confusing:
df.loc[~ (df.select_dtypes(include=['number']) == 0).all(axis='columns'), :]
So:
Filtering to find just the numeric columns
Applying the .all() method across columns rather than rows (rows is the default)
Negating with ~
Passing the resulting boolean series to df.loc[]
Get the numeric columns:
import numpy as np
numcols = df.dtypes == np.int64
Create the indexer:
I = np.sum((df.loc[:,numcols]) != 0,axis = 1) != 0
df[I]
tag a b
1 orange 0 1
2 grapes 1 0
3 lemon 1 1
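The attempt in the question has no effect because df.T != 0 also compares the tag strings with 0, and a string is never equal to 0, so any() is True for every row. The np.int64 check above would in turn skip float columns. A minimal sketch combining select_dtypes with a "keep rows that have any non-zero value" test, with the names numeric, keep and filtered assumed for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
                   'tag': ['apple', 'orange', 'grapes', 'lemon']})
df = df[['tag', 'a', 'b']]

# Only the numeric columns (ints and floats) take part in the test.
numeric = df.select_dtypes(include='number')

# True for rows where at least one numeric column is non-zero.
keep = numeric.ne(0).any(axis=1)

filtered = df[keep]
print(filtered)
```

This is the same condition as negating "all numeric columns equal zero", written without the negation.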
