Pandas: Conditional Split on Column

I have the following table:
A  B    C
1  A    A
2  A    A.B
3  B    B.C
4  A,B  A.A,A.B,B.C
Column A is an index (1 through 4). Column B lists the letters that appear in column C before the dot. If an entry in C has no dot, the letter before the dot is implicit and equal to the letter after it, so the entry in (C, 1) = A stands for A.A. Column C therefore lists either both letters around the dot or only the one after it.
The idea is to split these entries up. Column C should first be split on the comma into separate rows (that part works). The problem arises whenever several letters are possible in B: after the split, B should contain only the one letter that matches the entry in column C.
So the result should look like this:
A  B  C
1  A  A
2  A  B
3  B  C
4  A  A
4  A  B
4  B  C
Can someone help me ensure that column B contains the correct (i.e., matching) letter for each entry in column C?
Thanks and kind regards.
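For reference, the sample frame the answers below work with can be rebuilt like this (my own reconstruction, not part of the original question):

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['A', 'A', 'B', 'A,B'],
    'C': ['A', 'A.B', 'B.C', 'A.A,A.B,B.C'],
})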

First, stack your dataframe to get your combinations:
out = (
    df.set_index(['A', 'B']).C
      .str.split(',').apply(pd.Series)
      .stack()
      .reset_index([0, 1])
      .drop(columns='B')  # drop the original multi-letter B column
)
   A    0
0  1    A
1  2  A.B
2  3  B.C
3  4  A.A
4  4  A.B
5  4  B.C
Then replace single entries with their doubled counterpart and apply pd.Series:
(out.set_index('A')[0]
    .str.replace(r'^([A-Z])$', r'\1.\1', regex=True)  # 'A' -> 'A.A'
    .str.split('.').apply(pd.Series)
    .reset_index()
).rename(columns={0: 'B', 1: 'C'})
Output:
   A  B  C
0  1  A  A
1  2  A  B
2  3  B  C
3  4  A  A
4  4  A  B
5  4  B  C
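On newer pandas (0.25+), Series.explode shortens the whole pipeline; a sketch of the same idea, not part of the original answer:

# Explode the comma-separated C values to one row each, double up bare
# letters ('A' -> 'A.A'), then split on the dot into B and C.
out = df.assign(C=df['C'].str.split(',')).explode('C').reset_index(drop=True)
out['C'] = out['C'].str.replace(r'^([A-Z])$', r'\1.\1', regex=True)
out[['B', 'C']] = out['C'].str.split('.', expand=True).values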

With a comprehension:
def m0(x):
    """Take a string; return a one-pair dict split on '.', or a self-mapping."""
    if '.' in x:
        return dict([x.split('.')])
    else:
        return {x: x}

def m1(s):
    """Split a string on ',' and apply m0 to every piece."""
    return [*map(m0, s.split(','))]

pd.DataFrame([
    (a, b, m[b])
    for a, B, C in df.itertuples(index=False)
    for b in B.split(',')
    for m in m1(C) if b in m
], df.index.repeat(df.C.str.count(',') + 1), df.columns)  # repeat each index once per C entry
   A  B  C
0  1  A  A
1  2  A  B
2  3  B  C
3  4  A  A
3  4  A  B
3  4  B  C

Related

String split using a delimiter on pandas column to create new columns

I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True); the result looks like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck; how can I get the new columns formatted like that?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
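For reference, the whole answer as one runnable snippet (sample data rebuilt from the question):

import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# split and explode, re-key each piece by its letter, then unstack to wide form
s = df['Col1'].str.split(', ').explode()
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
out = s.unstack().add_prefix('Col ')
print(out)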
There are most probably more general approaches, but this worked for me. Note that it relies on a number of assumptions and constraints taken from your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as the delimiter to extract the column names. To do that, we remove all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need a list of unique columns. To get it, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done we can assign column values for each of the rows and use pd.concat to merge them back in to one DataFrame:
result_list = []
result_list.append(result_df)  # add the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that represents one row of the source DataFrame
    for column in columns_list:
        for value in row[1]:  # iterrows yields (index, Series); we only need the Series
            if value != 'None':
                if value.split(' ')[1] == column:  # check for the matching column
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # one single-row frame per source row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it produces the results you need.

python 3 get the column name depending of a condition [duplicate]

This question already has answers here:
Create a column in a dataframe that is a string of characters summarizing data in other columns
(3 answers)
Closed 4 years ago.
So I have a pandas df (Python 3.6) like this:
index A B C ...
A 1 5 0
B 0 0 1
C 1 2 4
...
As you can see, the index values are the same as the column names.
What I'm trying to do is get a new column in the dataframe that holds the names of the columns where the value is greater than 0:
index A B C ... NewColumn
A 1 5 0 [A,B]
B 0 0 1 [C]
C 1 2 4 [A,B,C]
...
I've been trying with iterrows with no success. I also know I can melt and pivot, but I think there should be a way with apply and a lambda, maybe?
Thanks in advance.
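For reference, the sample frame can be rebuilt like this (my own reconstruction):

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [5, 0, 2], 'C': [0, 1, 4]},
                  index=pd.Index(['A', 'B', 'C'], name='index'))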
If the new column should be a string, compare with DataFrame.gt, take the dot product with the column names, and finally remove the trailing separator:
df['NewColumn'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
A B C NewColumn
A 1 5 0 A, B
B 0 0 1 C
C 1 2 4 A, B, C
And for lists, use apply with a lambda function:
df['NewColumn'] = df.gt(0).apply(lambda x: x.index[x].tolist(), axis=1)
print (df)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Use:
df['NewColumn'] = df.apply(lambda x: list(x[x.gt(0)].index),axis=1)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
You could use .gt to check which values are greater than 0 and .dot to obtain the corresponding columns. Finally, .apply(list) turns the results into lists:
df.loc[:, 'NewColumn'] = df.gt(0).dot(df.columns).apply(list)
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Note: this works with single-letter columns; otherwise you could do:
df.loc[:, 'NewColumn'] = ((df.gt(0) @ df.columns.map('{},'.format))
                          .str.rstrip(',').str.split(','))
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
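The melt-and-group route the asker mentions also works; a sketch of my own, assuming the index is named 'index' as in the frame built above (run before NewColumn is added):

# Melt to long form, keep the positive entries, then collect the column
# names per original row.
long_form = df.reset_index().melt(id_vars='index')
df['NewColumn'] = (long_form[long_form['value'].gt(0)]
                   .groupby('index')['variable']
                   .agg(list))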

Navigate to a cell in a DataFrame using a set of criteria

I have a csv table like so:
a, b, c, d
value, value, value, value
value, value, value, value
which I'm loading into a DataFrame. I also have a dictionary that looks like this:
data = {'a': some_value, 'b': some_value, 'c': some_value}
I want to navigate to the cell in column d where the row has values a, b and c as specified by this dictionary. I know that there exists exactly one cell that matches these criteria. How can I do this?
You could convert the data into a dataframe, then use a merge:
data = pd.DataFrame({'a':[1,2,3,4], 'b':[1,2,3,4],'c':[1,2,3,4],'d':[1,2,3,4]})
lookup = {'a':2,'b':2, 'c':2}
lookupdf = pd.DataFrame(lookup, index=[1])  # need an index, as the values are all scalars
pd.merge(lookupdf, data)
a b c d
0 2 2 2 2
Another approach would be to use reduce on boolean conditions:
In [1034]: data[np.logical_and.reduce(pd.DataFrame(data[x] == lookup[x] for x in lookup))]
Out[1034]:
a b c d
1 2 2 2 2
In [1035]: data[reduce(lambda x, y: x & y, [data[x] == lookup[x] for x in lookup])]
Out[1035]:
a b c d
1 2 2 2 2
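Note: in Python 3, reduce is no longer a builtin, so the second variant needs an import first:

from functools import reduce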
Another way could be to use DataFrame.query:
In [1009]: query = ' and '.join(['%s == %s' % (x, lookup[x]) for x in lookup])
In [1010]: query
Out[1010]: 'a == 2 and c == 2 and b == 2'
In [1011]: data.query(query)
Out[1011]:
a b c d
1 2 2 2 2
Details
In [1012]: lookup
Out[1012]: {'a': 2, 'b': 2, 'c': 2}
In [1013]: data
Out[1013]:
a b c d
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
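Since the question asks for the single cell in column d, any of these can be reduced to a scalar; a sketch using the data and lookup frames from above:

import numpy as np

# Combine all lookup conditions into one boolean mask, then pull the d cell.
mask = np.logical_and.reduce([data[k] == v for k, v in lookup.items()])
d_value = data.loc[mask, 'd'].iloc[0]  # exactly one matching row is guaranteed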

find the number of elements in a column of a dataframe

I want to find the number of elements in each row of a column of the dataframe.
Dataframe name: df
Sample data:
a b c
1 d ['as','the','is','are','we']
2 v ['a','an']
3 t ['we','will','pull','this','together','.']
expected result:
a b c len
1 d ['as','the','is','are','we'] 5
2 v ['a','an'] 2
3 t ['we','will','pull','this','together','.'] 6
So far, I have just tried:
df.loc[:, 'len'] = len(df.c)
but this gives me the total number of rows in the dataframe.
How can I get the number of elements in each row of a specific column of a dataframe?
One way is to use apply and calculate len:
In [100]: dff
Out[100]:
a b c
0 1 d [as, the, is, are, we]
1 2 v [a, an]
2 3 t [we, will, pull, this, together, .]
In [101]: dff['len'] = dff['c'].apply(len)
In [102]: dff
Out[102]:
a b c len
0 1 d [as, the, is, are, we] 5
1 2 v [a, an] 2
2 3 t [we, will, pull, this, together, .] 6
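Alternatively, Series.str.len also works on list-valued columns and avoids the explicit apply:

dff['len'] = dff['c'].str.len()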

merge data frames with duplicated indices

Is there any way to merge two data frames when one of them has duplicated indices, such as the following?
dataframe A:
value
key
a 1
b 2
b 3
b 4
c 5
a 6
dataframe B:
number
key
a I
b X
c V
after merging, I want to have a data frame like the following:
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Or maybe there are better ways to do it using groupby?
>>> a.join(b).sort_values('value')
value number
key
a 1 I
b 2 X
b 3 X
b 4 X
c 5 V
a 6 I
Use join:
>>> a = pd.DataFrame(range(1,7), index=list('abbbca'), columns=['value'])
>>> b = pd.DataFrame(['I', 'X', 'V'], index=list('abc'), columns=['number'])
>>> a.join(b)
value number
a 1 I
a 6 I
b 2 X
b 3 X
b 4 X
c 5 V
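The same many-to-one alignment can also be spelled out with pd.merge on the indices; a sketch:

merged = pd.merge(a, b, left_index=True, right_index=True, how='left')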
