How to explode multiple columns that contain a string? - python

I have a dataset that includes different types of tags. Each column has a string that contains a list of tags.
How can I explode the selected columns at the same time, going from this:
Unnamed: id Tag1 Tag2
0 A a,b,c d,e
1 B m,n x
to this:
Unnamed: id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x

First, split the string values of each Tag column into lists, using DataFrame.apply + Series.str.split. I'm using DataFrame.filter with like='Tag' to select only the columns whose names contain 'Tag'.
Then, use DataFrame.explode in a loop to explode each Tag column of the df in turn, so that the values of each list become new rows.
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
    df = df.explode(col, ignore_index=True)
print(df)
Output:
id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
Note that using just df.apply(lambda col: col.str.split(',').explode()) won't work in this case because some rows have strings/lists with a different number of elements. Therefore the rows can't be correctly aligned after exploding them, and apply will complain.
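For reference, the same approach can be wrapped in a small helper. This is only a sketch: the function name explode_columns and its sep parameter are made up for illustration, and it assumes pandas 1.1 or later (for ignore_index) and that every selected column holds delimiter-separated strings.

```python
import pandas as pd


def explode_columns(df, cols, sep=","):
    """Hypothetical helper: split each column in `cols` on `sep`, then explode it."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in cols:
        out[col] = out[col].str.split(sep)          # strings -> lists
        out = out.explode(col, ignore_index=True)   # one row per list element
    return out


df = pd.DataFrame({"id": ["A", "B"], "Tag1": ["a,b,c", "m,n"], "Tag2": ["d,e", "x"]})
tag_cols = df.filter(like="Tag").columns
out = explode_columns(df, tag_cols)
print(out)
```

Because the columns are exploded one after another, each row ends up as the cross product of its tag lists, which is what the expected output in the question shows.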

Related

perform df.loc to groupby df

I have a df consisting of person, origin and destination:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I grouped the df with df_grouped = df.groupby(['O','D']) and want to match it against another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped taxi by O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people.
O D PersonID TaxiID
count count
A B 2 1
B A 2 1
C B 1 1
Now, I want to use df.loc to keep only those PersonID values that were counted in the merged dataframe. How can I do this? I've tried:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What should I do instead?
Edit: I attach the complete code for this case using dummy data:
import pandas as pd
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
Select the MultiIndex column by its tuple key, and use Series.explode to turn the nested lists into scalars:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, you can flatten the nested lists into a set instead:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
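If the aggregated lists are not needed for anything else, a possible alternative is to skip the groupby and find the served people with a plain inner merge on O and D. This is a different technique from the one above, shown only as a sketch; the names served and served_ids are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'PersonID': ['1', '1', '2', '2', '2', '3'],
                   'O': ['A', 'B', 'C', 'B', 'A', 'X'],
                   'D': ['B', 'A', 'B', 'A', 'B', 'Y']})
taxi = pd.DataFrame({'TaxiID': ['T1', 'T2', 'T3'],
                     'O': ['B', 'A', 'C'],
                     'D': ['A', 'B', 'B']})

# Trips whose O-D pair is covered by at least one taxi
served = df.merge(taxi[['O', 'D']].drop_duplicates(), on=['O', 'D'], how='inner')

# People with at least one such trip
served_ids = served['PersonID'].unique()

# Keep every row of df that belongs to one of those people
seek = df.loc[df['PersonID'].isin(served_ids)]
print(seek)
```

Like the isin-based answers, this keeps all rows of a person as soon as one of their O-D pairs has a taxi. To keep only the matching trips, served itself could be used.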

String split using a delimiter on pandas column to create new columns

I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True), and the result looks like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
what I am trying to achieve is to get this one:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck. How can I get new columns formatted like this?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
Most probably there are more general approaches you can use but this worked for me. Please note that this is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as the delimiter to extract the column names. To do that, we strip the leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need a list of unique column names. To get it, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done, we can build a one-row DataFrame for each row of the source data:
result_list = []
result_list.append(result_df)  # Adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that will be used to represent each row in source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is a tuple whose first value is the row index, which we don't need
            if value != 'None':
                if value.split(' ')[1] == column:  # Checking for a correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # Adding one DataFrame per row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
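A shorter variant of the same idea is to build one dict per row, keyed by the trailing letter, and let the DataFrame constructor line the columns up. This is just a sketch under the same assumptions as above (every item looks like '<number> <letter>' and a letter occurs at most once per row); the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# One dict per row: {'A': '1 A', 'B': '2 B', ...}
rows = [
    {item.strip().split(' ')[-1]: item.strip() for item in cell.split(',')}
    for cell in df['Col1']
]

# Keys missing from a row become NaN; sort the columns so the order is stable
out = pd.DataFrame(rows).sort_index(axis=1).add_prefix('Col ')
print(out)
```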

Pandas with mapped modification

I have two Pandas DataFrames; df1 has two columns called A and B and looks like:
df1=pd.DataFrame(data={'A':[10,30], 'B':[5,4]})
A B
10 5
30 4
df2 has two columns B and C, and looks like:
df2=pd.DataFrame(data={'B':[4,7], 'C':[10,20]})
B C
4 10
7 20
I want to modify df1.A depending on whether df1.B matches df2.B. If it does, df1.A should be divided by the corresponding df2.C. That is, I want to get the following from the df1 and df2 above:
A B
10 5
3 4
Is there a one-line solution in Python?
This is essentially merge with some manipulation:
(df1.merge(df2, on='B', how='left')
.assign(C=lambda x: x.C.fillna(1)) # rows without a match get a `C` value of `1`
.assign(A=lambda x: x.A/x.C) # divide by `C` value
.drop('C', axis=1) # remove the `C` column
)
Output:
A B
0 10.0 5
1 3.0 4
Alternatively, using map:
d = dict(zip(df2.B, df2.C))
f = lambda x: d.get(x, 1)
df1.assign(A=df1.A / df1.B.map(f))
A B
0 10.0 5
1 3.0 4

How to merge 2 data frames?

I have this table1:
A B C D
0 1 2 k l
1 3 4 e r
df.dtypes gets me this:
A int64
B int64
C object
D object
Now, I want to create a table2 which only includes objects (column C and D) using this command table2=df.select_dtypes(include=[object]).
Then, I want to encode table2 using this command pd.get_dummies(table2).
It gives me this table2:
C D
0 0 1
1 1 0
The last thing I want to do is append both tables together (table 1 + table 2), so that the final table looks like this:
A B C D
0 1 2 0 1
1 3 4 1 0
Can somebody help?
This should do it:
table2=df.select_dtypes(include=[object])
table1.select_dtypes(include=[int]).join(table2.apply(lambda x:pd.factorize(x, sort=True)[0]))
It first factorizes the object-typed columns of table2 (instead of generating dummies) and then joins them back to the int-typed columns of the original dataframe.
This assumes that what you want from the question is a column C in which 1 replaces e, and a column D in which 1 replaces l. Otherwise, as mentioned elsewhere, there will be one column for each possible value.
df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df
A B C D
0 1 2 k l
1 2 4 e r
df.dtypes
A int64
B int64
C object
D object
dtype: object
Now, if you want to drop the e and l levels because you want k-1 dummy columns, you can use the drop_first argument.
df = pd.get_dummies(df, drop_first = True)
df
A B C_k D_r
0 1 2 1 0
1 2 4 0 1
Note that the dtypes are not int64 like columns A and B.
df.dtypes
A int64
B int64
C_k uint8
D_r uint8
dtype: object
If it's important that they are the same type, you can of course convert them as appropriate. In the general case, you may want to keep names like C_k and D_r so you know what the dummies correspond to. If not, you can always rename based on the _ (the default of the prefix_sep argument of get_dummies). So, you could build the rename dictionary by using '_' to split off the part of the column name after the prefix. Or, for a simple case like this:
df.rename({'C_k': 'C', 'D_r': 'D'}, axis = 1, inplace = True)
df
A B C D
0 1 2 1 0
1 2 4 0 1
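Both follow-up steps (matching the dtype and renaming on the '_') can be sketched as follows. It assumes the original column names contain no underscore and that each encoded column yields exactly one dummy, as in this example; encoded is a name made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})

# Ask get_dummies for int64 dummies directly instead of converting afterwards
encoded = pd.get_dummies(df, columns=['C', 'D'], drop_first=True, dtype='int64')

# 'C_k' -> 'C', 'D_r' -> 'D'; 'A' and 'B' have no '_' and stay as they are
encoded = encoded.rename(columns=lambda name: name.split('_')[0])
print(encoded)
print(encoded.dtypes)
```

With more than two levels per column, the renaming would produce duplicate names, so the prefixed names are safer in the general case.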

find the number of elements in a column of a dataframe

I want to find the number of elements in each row of a column of a dataframe.
dataframe name- df
sample data:
a b c
1 d ['as','the','is','are','we']
2 v ['a','an']
3 t ['we','will','pull','this','together','.']
expected result:
a b c len
1 d ['as','the','is','are','we'] 5
2 v ['a','an'] 2
3 t ['we','will','pull','this','together','.'] 6
So far, I have only tried:
df.loc[:,'len']=len(df.c)
but this gives me the total number of rows in the dataframe.
How can I get the number of elements in each row of a specific column of a dataframe?
One way is to use apply with len:
In [100]: dff
Out[100]:
a b c
0 1 d [as, the, is, are, we]
1 2 v [a, an]
2 3 t [we, will, pull, this, together, .]
In [101]: dff['len'] = dff['c'].apply(len)
In [102]: dff
Out[102]:
a b c len
0 1 d [as, the, is, are, we] 5
1 2 v [a, an] 2
2 3 t [we, will, pull, this, together, .] 6
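Another option is Series.str.len, which also works on list elements. A minimal sketch, assuming column c really holds Python lists (if it holds strings that merely look like lists, they would have to be parsed first):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['d', 'v', 't'],
    'c': [['as', 'the', 'is', 'are', 'we'],
          ['a', 'an'],
          ['we', 'will', 'pull', 'this', 'together', '.']],
})

# .str.len() returns the length of each element, here the length of each list
df['len'] = df['c'].str.len()
print(df)
```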
