new column with each element as a list pandas - python

I have some data frames where I want to add new columns, and in the new column each element should be a list. For example, with a data frame of two rows:
df
index colA colB
0 a a1
1 b b1
Now I can add a new column as
df['colC']=5
index colA colB colC
0 a a1 5
1 b b1 5
Now I want to add a third column where each element is a list:
index colA colB colC
0 a a1 ['m','n','p']
1 b b1 ['m','n','p']
But
df['colC']=['m','n','p'] gives the error
ValueError: Length of values does not match length of index
which is obvious.
I know that in this example I can do
df['colC']=[['m','n','p'],['m','n','p']]
but I want to set each element to the same list of strings when I do not know the number of rows.
Can anyone suggest an easy way to achieve this?

Adding an object (a list) to a cell is tricky:
df['colC']=[['m','n','p']]*len(df)
Or
df['colC'] = [list('mnp') for _ in range(len(df))]
df returns:
index colA colB colC
0 0 a a1 [m, n, p]
1 1 b b1 [m, n, p]
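One caveat worth noting (an addition of mine, not part of the original answer): with the multiplication form every cell ends up holding a reference to the same list object, so mutating the list through one cell changes all of them, while the list comprehension gives each row its own copy.
import pandas as pd

df = pd.DataFrame({'colA': ['a', 'b'], 'colB': ['a1', 'b1']})

df['colC'] = [['m', 'n', 'p']] * len(df)            # every cell references the same list
df['colD'] = [list('mnp') for _ in range(len(df))]  # each cell gets its own list

print(df['colC'].iloc[0] is df['colC'].iloc[1])  # True: one shared object
print(df['colD'].iloc[0] is df['colD'].iloc[1])  # False: independent lists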

Related

Pandas: conditionally concatenate original columns with a string

INPUT>df1
ColumnA ColumnB
A1 NaN
A1A2 NaN
A3 NaN
What I am trying to do is change ColumnB's value conditionally,
by iterating over ColumnA and appending remarks to ColumnB.
The previous value of ColumnB should be kept after the new string is added.
In the sample dataframe, what I want to do is:
Check if ColumnA contains "A1". If so, append the string "A1" to ColumnB (without clearing its previous value).
Check if ColumnA contains "A2". If so, append the string "A2" to ColumnB (without clearing its previous value).
OUTPUT>df1
ColumnA ColumnB
A1 A1
A1A2 A1_A2
A3 NaN
I have tried the following code, but it is not working well.
Could anyone give me some advice? Thanks.
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB']+"_A1",df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB']+"_A2",df1['ColumnB'])
One way using pandas.Series.str.findall with join:
key = ["A1", "A2"]
df["ColumnB"] = df["ColumnA"].str.findall("|".join(key)).str.join("_")
print(df)
Output:
ColumnA ColumnB
0 A1 A1
1 A1A2 A1_A2
2 A3
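If the rows with no match should stay NaN, as in the expected output above (the A3 row), one extra step that is not part of the original answer is to convert the empty joins afterwards:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ColumnA": ["A1", "A1A2", "A3"]})
key = ["A1", "A2"]

out = df["ColumnA"].str.findall("|".join(key)).str.join("_")
df["ColumnB"] = out.replace("", np.nan)  # rows with no match become NaN instead of ""
print(df)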
You cannot add or append strings to np.nan. That means you would always need to check whether each position in ColumnB is still np.nan or already a string before setting its new value. If all you want to do is work with text, you could initialize ColumnB with empty strings and append selected string pieces from ColumnA:
import pandas as pd
import numpy as np
I = pd.DataFrame({'ColA': ['A1', 'A1A2', 'A2', 'A3']})
I['ColB'] = ''
I.loc[I.ColA.str.contains('A1'), 'ColB'] += 'A1'
print(I)
I.loc[I.ColA.str.contains('A2'), 'ColB'] += 'A2'
print(I)
The output is:
ColA ColB
0 A1 A1
1 A1A2 A1
2 A2
3 A3
ColA ColB
0 A1 A1
1 A1A2 A1A2
2 A2 A2
3 A3
Note: this is a very verbose version as an example.

Python3 Pandas Filter by Columns with Unknown Column Names

I am working with a data set comparing rosters with different dates. It goes through a pivot, and we don't know the dates on which the rosters were pulled, but the resulting data set is structured like this:
colA ColB colC colD Date:yymmdd Date:yymmdd Date:yymmdd
Bob aa aa aa 0 0 1
Jack bb bb bb 1 1 1
Steve cc cc cc 0 1 1
Mary dd dd dd 1 1 1
Abu ee ee ee 1 1 0
I successfully did a fillna for every column after the first 4 columns (they are known).
df.iloc[:,4:] = df.iloc[:,4:].fillna(0) #<-- Fills blanks on every column after column 4.
Question: now I'm trying to filter the df on the columns that have a zero. Is there a way to filter by the columns after column 4? I tried:
df = df[(df.iloc[:,4:] == 0)] # error
df = df[(df.columns[:,4:] == 0)] # error
df = df[(df.columns.str.contains(':') == 0)] # unknown columns do have a ':', but didn't work.
Is there a better way to do this? Looking for a result that only shows the rows with a 0 in any column past #4.
The snippet below will give you a DataFrame containing True and False as the cell values of df:
df.iloc[:, 4:].eq(x)
If you want only those rows where x appears, you can add an any() clause,
the way #jpp has shown in his answer.
In your case it will be df[df.iloc[:, 4:].eq(0).any(axis=1)]
This will give you all the rows of the DataFrame that have at least one 0 as a data value.
If all values are 0 or bigger, use min:
df[df.iloc[:, 4:].min(axis=1) == 0]
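For reference, a self-contained sketch of the eq/any approach against data shaped like the roster above; the date column names are placeholders, since the real ones are unknown:
import pandas as pd

df = pd.DataFrame({
    "colA": ["Bob", "Jack", "Steve", "Mary", "Abu"],
    "ColB": ["aa", "bb", "cc", "dd", "ee"],
    "colC": ["aa", "bb", "cc", "dd", "ee"],
    "colD": ["aa", "bb", "cc", "dd", "ee"],
    "Date:220101": [0, 1, 0, 1, 1],
    "Date:220201": [0, 1, 1, 1, 1],
    "Date:220301": [1, 1, 1, 1, 0],
})

# Keep only the rows with at least one 0 in any column past the first four.
mask = df.iloc[:, 4:].eq(0).any(axis=1)
print(df[mask])  # Bob, Steve and Abu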

How to shift rows up in Pandas Dataframe based on specific column

How do I shift up all the values in a row for one specific column without affecting the order of the other columns?
For example, let's say i have the following code:
import pandas as pd
data= {'ColA':["A","B","C"],
'ColB':[0,1,2],
'ColC':["First","Second","Third"]}
df = pd.DataFrame(data)
print(df)
I would see the following output:
ColA ColB ColC
0 A 0 First
1 B 1 Second
2 C 2 Third
In my case I want to verify that ColB does not have any 0s; if it does, the 0 is removed and all the other values below it get pushed up, while the order of the other columns is not affected. Presumably, I would then see the following:
ColA ColB ColC
0 A 1 First
1 B 2 Second
2 C NaN Third
I can't figure out how to do this using either the drop() or shift() methods.
Thank you
Let us do a simple sorted:
invalid=0
df['ColX']=sorted(df.ColB,key=lambda x : x==invalid)
df.ColX=df.ColX.mask(df.ColX==invalid)
df
Out[351]:
ColA ColB ColC ColX
0 A 0 First 1.0
1 B 1 Second 2.0
2 C 2 Third NaN
The way I'd do this, IIUC, is to filter for the values in ColB which are not 0 and fill the top of the column with these values, according to how many valid values were obtained:
m = df.loc[~df.ColB.eq(0), 'ColB'].values
df['ColB'] = float('nan')
df.loc[:m.size-1, 'ColB'] = m
print(df)
ColA ColB ColC
0 A 1.0 First
1 B 2.0 Second
2 C NaN Third
You can swap 0s for nans and then move up the rest of the values:
import numpy as np
df.ColB.replace(0, np.nan, inplace=True)
df.assign(ColB=df.ColB.shift(df.ColB.count() - len(df.ColB)))
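Assembled into a self-contained form against the sample data (my own assembly of the answer above; note that in this example the single 0 sits in the first row):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': ["A", "B", "C"],
                   'ColB': [0, 1, 2],
                   'ColC': ["First", "Second", "Third"]})

# Swap 0s for NaN, then shift the column up by the number of NaNs
# introduced (count() ignores NaN), moving the remaining values to the top.
df['ColB'] = df['ColB'].replace(0, np.nan)
df = df.assign(ColB=df.ColB.shift(df.ColB.count() - len(df.ColB)))
print(df)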

df.unique() on whole DataFrame based on a column

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering to the whole DataFrame so that it keeps the structure but the duplicates (based on "Id") are removed?
It seems you need DataFrame.drop_duplicates with the parameter subset, which specifies the columns to check for duplicates:
#keep first duplicate value
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep last duplicate value
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#remove all duplicate values
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
It's also possible to call duplicated() to flag the duplicates and keep the rows where the flag is negated.
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to conditionally drop duplicates, e.g. drop duplicates of a specific value, etc. For example, the following code drops duplicate 'a1's from column Id (other duplicates are not dropped).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
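For reference, a quick runnable check of that conditional variant against the sample data from the question (the Type values are copied from the table above):
import pandas as pd

df = pd.DataFrame({"Id": ["a1", "a2", "b1", "b3", "a1"],
                   "Type": ["A", "A", "B", "B", "A"]})

# Only the repeated 'a1' row is dropped; a duplicate of any other Id would survive.
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
print(new_df)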

Join two pandas data frames with the indices of the first?

I have two dataframes, df1:
column1 column2
0 A B
1 A A
2 C A
3 None None
4 None None
and df2
id l
40 100005090 A
188 100020985 B
Now I want to join df1 and df2, but I don't know how to match the indices. If I simply do df1.join(df2), the indices are aligned to df2: it looks for the entry of df2 labeled 40 and treats it as the row matching index 40 of df1, instead of treating df2's first entry as df1's first row. How do I tell pandas to align the indices to df1, meaning that the first entry of df2 actually corresponds to index 40? That is, I would like to get:
id l column1 column2
40 100005090 A A B
188 100020985 B A A
...
You can take a slice of df1 that is the same length as df2, then overwrite its index values and join:
In [174]:
sub = df1.iloc[:len(df2)]
sub.index = df2.index
df2.join(sub)
Out[174]:
id l column1 column2
40 100005090 A A B
188 100020985 B A A
If the dfs are the same length, then the first line is not needed; you just overwrite the index with the index values from the other df.
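A compact variant of the same idea (the frame contents below are reconstructed from the tables in the question), using set_axis so the slice is re-labelled without assigning to its index in place:
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'A', 'C', None, None],
                    'column2': ['B', 'A', 'A', None, None]})
df2 = pd.DataFrame({'id': [100005090, 100020985], 'l': ['A', 'B']},
                   index=[40, 188])

# Take the first len(df2) rows of df1, give them df2's index labels, and join.
aligned = df1.iloc[:len(df2)].set_axis(df2.index)
print(df2.join(aligned))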
