So I have a pandas DataFrame (Python 3.6) like this:
index A B C ...
A 1 5 0
B 0 0 1
C 1 2 4
...
As you can see, the index values are the same as the column names.
What I'm trying to do is add a new column to the DataFrame that lists the names of the columns where the value is greater than 0:
index A B C ... NewColumn
A 1 5 0 [A,B]
B 0 0 1 [C]
C 1 2 4 [A,B,C]
...
I've been trying with iterrows, with no success.
I also know I can melt and pivot, but I think there should be a way with apply and a lambda, maybe?
Thanks in advance
If the new column should be a string, compare with DataFrame.gt, take the dot product with the column names plus a separator, and finally strip the trailing separator:
df['NewColumn'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
A B C NewColumn
A 1 5 0 A, B
B 0 0 1 C
C 1 2 4 A, B, C
And for lists, use apply with a lambda function:
df['NewColumn'] = df.gt(0).apply(lambda x: x.index[x].tolist(), axis=1)
print (df)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Use:
df['NewColumn'] = df.apply(lambda x: list(x[x.gt(0)].index),axis=1)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
You could use .gt to check which values are greater than 0 and .dot to obtain the corresponding columns. Finally, use .apply(list) to turn the results into lists:
df.loc[:, 'NewColumn'] = df.gt(0).dot(df.columns).apply(list)
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Note: this only works with single-letter column names; otherwise you could do:
df.loc[:, 'NewColumn'] = ((df.gt(0) @ df.columns.map('{},'.format))
                          .str.rstrip(',').str.split(','))
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
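For reference, here is a minimal self-contained sketch that rebuilds the sample frame from the question and runs the list and string variants side by side. The frame is reconstructed from the three rows shown in the question, and the names mask, as_lists and as_strings are only illustrative.

```python
import pandas as pd

# Sample frame from the question; the index labels equal the column names.
df = pd.DataFrame(
    [[1, 5, 0], [0, 0, 1], [1, 2, 4]],
    index=list("ABC"),
    columns=list("ABC"),
)

mask = df.gt(0)  # boolean frame: True where the value is greater than 0

# Row by row, keep the column labels where the mask is True.
as_lists = mask.apply(lambda row: row.index[row.to_numpy()].tolist(), axis=1)

# Dot product of the mask with the labels: True * "A, " gives "A, ",
# False * "A, " gives "", and the sum concatenates the pieces.
as_strings = mask.dot(df.columns + ", ").str.rstrip(", ")

print(as_lists)
print(as_strings)
```

Computing the mask once, before NewColumn is added, keeps the new column itself out of the comparison if the snippets are run one after another.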
Related
My csv file row column data looks like this -
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When a row follows a pattern like rows 1-5, I want to return the value 0.
When a row is like row 6, or contains random letters (not like rows 1-5), I want to return the value 1.
How do I do this using Python? It must be done using the csv file.
You can read your CSV file into a pandas DataFrame using:
import pandas as pd
df = pd.read_csv('file.csv', header=None)  # 'file.csv' is a placeholder for the path to your file
output:
0 1 2 3 4
0 a a a a a
1 b b b b b
2 c c c c c
3 d d d d d
4 a b c d e
5 a d b c c
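Then use nunique to count the number of unique values per row. If the count is 1 or 5 (the maximum, i.e. the number of columns), the row matches one of the patterns; otherwise it does not. Use between to flag the counts that fall between those two extremes:
df.nunique(axis=1).between(2, len(df.columns)-1).astype(int)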
output:
0 0
1 0
2 0
3 0
4 0
5 1
dtype: int64
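The same steps as a self-contained sketch. It assumes the file is comma-separated and uses an in-memory io.StringIO buffer in place of the real file, so csv_text, n_unique and flags are illustrative names rather than part of the original answer.

```python
import io

import pandas as pd

# Stand-in for the csv file; assumes comma-separated values and no header row.
csv_text = (
    "a,a,a,a,a\n"
    "b,b,b,b,b\n"
    "c,c,c,c,c\n"
    "d,d,d,d,d\n"
    "a,b,c,d,e\n"
    "a,d,b,c,c\n"
)
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Number of distinct values in each row.
n_unique = df.nunique(axis=1)

# 1 distinct value (all the same) or 5 (all different) counts as a pattern -> 0;
# anything from 2 to 4 distinct values is flagged -> 1.
flags = n_unique.between(2, len(df.columns) - 1).astype(int)

print(flags)
```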
What's the simplest way to achieve the below with pandas?
df1 =
A B C
0 1 1 2
1 2 3 1
2 3 3 2
to
df_result =
1 2 3
0 [A, B] [C] []
1 [C] [A] [B]
2 [] [C] [A,B]
Thanks in advance
Use DataFrame.stack with Series.reset_index to get a DataFrame, aggregate to lists, reshape with Series.unstack, and finally remove the index and column names with DataFrame.rename_axis:
df = (df.stack()
        .reset_index(name='val')
        .groupby(['level_0','val'])['level_1']
        .agg(list)
        .unstack(fill_value=[])
        .rename_axis(index=None, columns=None))
print (df)
1 2 3
0 [A, B] [C] []
1 [C] [A] [B]
2 [] [C] [A, B]
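A variant of the same pipeline, sketched step by step on the question's df1. It is not the answer's exact code: instead of passing fill_value to unstack, it leaves the missing combinations as NaN and swaps them for empty lists afterwards. The names long, grouped and wide are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 3, 3], "C": [2, 1, 2]})

# Long format: one line per (row, column, value); the unnamed index levels
# become the columns level_0 (row label) and level_1 (column name).
long = df1.stack().reset_index(name="val")

# For every (row, value) pair, collect the column names that hold that value.
grouped = long.groupby(["level_0", "val"])["level_1"].agg(list)

# Values become columns; combinations that never occur are NaN at this point.
wide = grouped.unstack()

# Replace the NaN gaps with empty lists, then drop the axis names.
wide = wide.apply(lambda col: col.map(lambda v: v if isinstance(v, list) else []))
wide = wide.rename_axis(index=None, columns=None)

print(wide)
```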
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I used .str.split(',', expand=True), and the result looks like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck. How do I get the new columns laid out like this?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
Most probably there are more general approaches you can use, but this worked for me. Please note that it relies on a number of assumptions and constraints specific to your example.
import pandas as pd
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to make sure that we can later use ' ' as the delimiter to extract the column names. To do that, we strip the leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need a list of unique columns. To get it, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
Then create an empty base DataFrame with those columns, which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done, we can build a one-row DataFrame for each source row and collect them in a list:
result_list = []
result_list.append(result_df) # Adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {} # dict that will be used to represent each row in source DataFrame
    for column in columns_list:
        for value in row[1]: # row is returned in the format of tuple where first value is row_index that we don't need
            if value != 'None':
                if value.split(' ')[1] == column: # Checking for a correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object)) # Adding dicts per row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the result you need.
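As a shorter sketch of the same idea, each source row can be turned into a dict keyed by the trailing letter of every item, leaving pandas to line up the columns. This is a different technique from the loop above, not a rewrite of it, and it relies on the same assumptions (items separated by ', ' and ending in the column letter). The names rows and compact_df are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["1 A, 2 B, 3 C", "2 B, 4 C", "1 B, 2 C, 4 D"]})

# One dict per source row, keyed by the last token of each item ("1 A" -> "A").
rows = [
    {item.split()[-1]: item for item in cell.split(", ")}
    for cell in df["col_1"]
]

# Keys that are missing from a row become NaN in the resulting frame.
compact_df = pd.DataFrame(rows)

# Sort the columns so the order does not depend on which key appears first.
compact_df = compact_df[sorted(compact_df.columns)]

print(compact_df)
```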
I need to apply a function to all rows of a DataFrame.
I have used this function, which returns a list of the column names where the value is 1:
def find_column(x):
    a=[]
    for column in df.columns:
        if (df.loc[x,column] == 1):
            a = a + [column]
    return a
It works if I just pass in an index label, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column,axis=1)
does not work.
Any idea?
Thanks!
With axis=1, apply passes each row to the function, so x is a Series whose index is the same as the column names. That makes it possible to filter the index by the matching values and convert the result to a list:
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,1,4,5,5,1],
    'C':[7,1,9,4,2,3],
    'D':[1,1,5,7,1,1],
    'E':[5,1,6,9,1,4],
    'F':list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()
df['new'] = df.apply(find_column,axis=1)
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a mask from DataFrame.eq for equality, then remove the trailing separator and use Series.str.split:
df['new'] = df.eq(1).dot(df.columns + ',').str.rstrip(',').str.split(',')
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
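A small sketch comparing the two approaches on a cut-down, hypothetical three-row frame (columns B, C and D only). It also shows one difference worth knowing: for a row with no match, the dot and split route leaves a list holding one empty string, which prints as [] but is not an empty list. The names via_apply and via_dot are illustrative.

```python
import pandas as pd

# Hypothetical reduced frame: only numeric columns, last row has no 1 at all.
df = pd.DataFrame({"B": [4, 1, 4], "C": [7, 1, 9], "D": [1, 1, 5]})


def find_column(row):
    # apply(..., axis=1) hands over the row itself as a Series indexed by
    # column name, not the row's index label.
    return row.index[(row == 1).to_numpy()].tolist()


via_apply = df.apply(find_column, axis=1)
via_dot = df.eq(1).dot(df.columns + ",").str.rstrip(",").str.split(",")

print(via_apply.tolist())
print(via_dot.tolist())
```

If truly empty lists matter downstream, the apply version is the safer of the two.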
I am trying to add an incremental value to a column based on specific values of another column in a dataframe. So that...
col A col B
A 0
B 1
C 2
A 3
A 4
B 5
Would become something like this:
col A col B
A 1
B 2
C 3
A 1
A 1
B 2
C 3
I have tried using the groupby function but can't really get my head around setting incremental values in col B.
Any thoughts?
Thanks
I think you need factorize:
df['col B'] = pd.factorize(df['col A'])[0] + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
Another solution:
df['col B'] = pd.Categorical(df['col A']).codes + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
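The two solutions agree on the sample data because the labels first appear in alphabetical order. A short sketch on hypothetical data where they do not: factorize numbers the labels in order of first appearance, while Categorical codes follow the sorted categories. The names s, by_appearance and by_sorted_label are illustrative.

```python
import pandas as pd

# Hypothetical data: the first label seen ("B") is not the smallest one.
s = pd.Series(["B", "A", "C", "B"])

# Codes assigned in order of first appearance: B -> 1, A -> 2, C -> 3.
by_appearance = pd.factorize(s)[0] + 1

# Codes assigned by sorted category: A -> 1, B -> 2, C -> 3.
by_sorted_label = pd.Categorical(s).codes + 1

print(by_appearance.tolist())
print(by_sorted_label.tolist())
```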