Left shift with condition in pandas - python

I have a problem with a CSV file. I have tried several solutions with the pandas library, but none has worked for me. I want to left-shift three columns whenever one of them contains a certain code (in this case 11 or 22). For example, this would be my input:
code  name   %   code 2  name 2   % 2  code 3  name 3  % 3
11    John   34  44      Rob      23   33      Peter   15
22    Ken    45  33      Peter    45   44      Rob     25
33    Peter  34  66      Abraham  37   77      Harry   67
11    John   45  77      Harry    39   88      Mary    20
And I expect something like this:
code  name   %   code 2  name 2   % 2  code 3  name 3  % 3
44    Rob    23  33      Peter    15
33    Peter  45  44      Rob      25
33    Peter  34  66      Abraham  37   77      Harry   67
77    Harry  39  88      Mary     20
Any idea how I could solve this with pandas?
Thanks in advance!

Do you want this?
mask = df['code'].isin([11, 22])               # rows whose first code is 11 or 22
df.loc[mask] = df.loc[mask].shift(-3, axis=1)  # shift those rows 3 columns to the left
Output -
code name % code 2 name 2 % 2 code 3 name 3 % 3
0 44.0 Rob 23.0 33.0 Peter 15.0 NaN NaN NaN
1 33.0 Peter 45.0 44.0 Rob 25.0 NaN NaN NaN
2 33.0 Peter 34.0 66.0 Abraham 37.0 77.0 Harry 67.0
3 77.0 Harry 39.0 88.0 Mary 20.0 NaN NaN NaN
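For a self-contained check, here is a minimal sketch that rebuilds the sample frame and applies the two lines above (the column names are taken from the question; nothing else is assumed):

import pandas as pd

df = pd.DataFrame({
    'code': [11, 22, 33, 11], 'name': ['John', 'Ken', 'Peter', 'John'], '%': [34, 45, 34, 45],
    'code 2': [44, 33, 66, 77], 'name 2': ['Rob', 'Peter', 'Abraham', 'Harry'], '% 2': [23, 45, 37, 39],
    'code 3': [33, 44, 77, 88], 'name 3': ['Peter', 'Rob', 'Harry', 'Mary'], '% 3': [15, 25, 67, 20],
})

mask = df['code'].isin([11, 22])               # rows starting with code 11 or 22
df.loc[mask] = df.loc[mask].shift(-3, axis=1)  # drop their first (code, name, %) triple
print(df)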

Related

How to update multiple column values in pandas

I have been trying to crack this for a while, but I am stuck now.
This is my code:
import pandas as pd

l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The statement below prints the values correctly, but I want to
                    # filter the values and use them in a dataframe:
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'],
                    #       r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws an error:
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample dataset is below (the actual csv file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like:
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure whether my approach using dataframe.at is correct. Any pointers on how to efficiently keep only the column values that match the names in the Name column would be appreciated.
I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can reshape your dataset using melt:
# Note: this assumes df has already been indexed by Item/SerialNo/Location,
# so the melted frames keep those keys in their index
df_person = df.loc[:, 'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:, 'PredictedSales01':'PredictedSales06']
df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales
index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe using explode (splitting the ';'-separated strings first, so explode has lists to work on):
df_names = df[['Name']].assign(Name=df['Name'].str.split(';')).explode('Name').rename({'Name': 'SalesPerson'}, axis=1)
df_names = df_names.reset_index().set_index(['Item', 'SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
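For reference, a minimal end-to-end sketch of this approach. It assumes a recent pandas (melt's ignore_index parameter needs pandas >= 1.1), that Name still holds ';'-separated strings, and that Item/SerialNo/Location are moved into the index first:

import pandas as pd

df = df.set_index(['Item', 'SerialNo', 'Location'])

df_person = df.loc[:, 'SalesPerson01':'SalesPerson06'].melt(
    ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
# .values sidesteps index alignment, which would be ambiguous on the duplicated index
df_person['PredictedSales'] = df.loc[:, 'PredictedSales01':'PredictedSales06'].melt(
    ignore_index=False, value_name='PredictedSales')['PredictedSales'].values

df_names = (df[['Name']].assign(Name=df['Name'].str.split(';'))
            .explode('Name').rename(columns={'Name': 'SalesPerson'}))

result = df_names.reset_index().merge(
    df_person.reset_index(),
    on=['Item', 'SerialNo', 'Location', 'SalesPerson'])
print(result)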

How to select row before and after NaN in pandas?

I have a dataframe which looks like this :
Name Age Job
0 Alex 20 Student
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
4 Rosa 20 senior manager
5 johanes 25 Dentist
6 lina 23 Student
7 yaser 25 Pilot
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
.
.
.
.
I want to select the rows before and after any row that has a NaN in column Job, together with the NaN row itself. For that I have the following code:
Rows = df[df.shift(1, fill_value="dummy").Job.isna() |
          df.Job.isna() |
          df.shift(-1, fill_value="dummy").Job.isna()]
print(Rows)
The result is this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
The only problem here is row number 10: it should appear twice in the result, because it is both the row after the NaN in row 9 and the row before the NaN in row 11 (it sits between two rows with NaN values). So at the end I want to have this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
So every row that sits between two rows with NaN values should appear twice in the result (i.e. be duplicated). Is there any way to do this? Any help will be appreciated.
Use concat with the rows before, the rows after, and the rows matching the condition:
m = df.Job.isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)],
                df[m]]).sort_index()
print(df)
Name Age Job
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
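As a runnable check, a small sketch with just the rows around the NaNs (column names as in the question):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Sara', 'john', 'kevin', 'Ali', 'Ahmad', 'Joe', 'Donald'],
    'Age': [21, 23, 22, 23, 21, 24, 29],
    'Job': ['Doctor', np.nan, 'Teacher', np.nan, 'Professor', np.nan, 'Waiter'],
})

m = df.Job.isna()
rows = pd.concat([df[m.shift(fill_value=False)],     # rows just after a NaN
                  df[m.shift(-1, fill_value=False)], # rows just before a NaN
                  df[m]]).sort_index()               # the NaN rows themselves
print(rows)  # rows between two NaNs (kevin and Ahmad here) appear twice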

Conditional filling of column based on string

I have a dataset in which I have to conditionally fill, or drop, certain rows, but I have not been successful so far.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Now, I have some empty cells. I could fill them with fillna or a regex, or drop them.
But I only want to touch the leading empty cells, before the first string appears: either drop those rows or fill them with ".".
Like below:
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is this possible with pandas, or with some looping?
You can try this:
import numpy as np

df['Name'] = df['Name'].replace('', np.nan)  # treat empty strings as missing values
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')  # '.' only before the first name appears
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
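For the second desired output (dropping the leading rows instead of filling them), a small sketch along the same lines, applied to the original frame:

df2 = df[df['Name'].replace('', np.nan).ffill().notna()]
print(df2)  # starts at the first row that has a name (Peter)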

Converting duplicated data from rows to columns

I have data like:
name val trc
jin 23 apb
tim 52 nmq
tim 61 apb
tim 92 rrc
ron 13 apq
stark 34 rrc
stark 34 apq
ron 4 apq
sia 6 wer
I am looking for output like:
name val_1 trc1 val_2 trc2 val_3 trc3
jin 23 apb
tim 92 rrc 61 apb 52 nmq
ron 13 apq 4 apq
stark 34 rrc 34 apq
sia 6 wer
I want to transform the duplicated rows into columns, with the highest val in val_1, the next one in val_2, and so on. The trc1 value should correspond to val_1. Please let me know how to achieve this.
I tried this approach:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('name')}
pd.concat(d, axis=1).reset_index()
index jin ron sia stark tim \
name val trc name val trc name val trc name val trc name
0 0 jin 23.0 apb ron 13.0 apq sia 6.0 wer stark 34.0 rrc tim
1 1 NaN NaN NaN ron 4.0 apq NaN NaN NaN stark 34.0 apq tim
2 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN tim
Use:
df1 = df.sort_values(['name','val'], ascending=False)
df1 = df1.set_index('name').stack().groupby(level=0).apply(list).apply(pd.Series)
df1 = df1.reset_index().fillna("")
print(df1)
name 0 1 2 3 4 5
0 jin 23 apb
1 ron 13 apq 4 apq
2 sia 6 wer
3 stark 34 rrc 34 apq
4 tim 92 rrc 61 apb 52 nmq
Convert your object into a dictionary with the names as keys and your vals and trcs as connected values in a tuple or list.
You want to end up with something like this:
yourDict[name] = [ [val_1, trc1] , [val_2, trc2] ]
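A minimal sketch of that dictionary idea, assuming df has the name/val/trc columns from the question:

yourDict = (df.sort_values('val', ascending=False)
              .groupby('name')[['val', 'trc']]
              .apply(lambda g: g.values.tolist())
              .to_dict())
# yourDict['tim'] -> [[92, 'rrc'], [61, 'apb'], [52, 'nmq']]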
Here is an option using pivot:
df['index'] = df.groupby('name').cumcount()
df_vals = df.pivot(index='name', columns='index', values='val').rename(columns=lambda x: 'val_'+str(x))
df_trcs = df.pivot(index='name', columns='index', values='trc').rename(columns=lambda x: 'trc_'+str(x))
df_vals.join(df_trcs).fillna('').reset_index()
index name val_0 val_1 val_2 trc_0 trc_1 trc_2
0 jin 23.0 apb
1 ron 13.0 4 apq apq
2 sia 6.0 wer
3 stark 34.0 34 rrc apq
4 tim 52.0 61 92 nmq apb rrc
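To get exactly the requested layout (val_1 holding the highest val per name, with trc1 matching it), here is a sketch combining the sort and the pivot, assuming pandas >= 1.1 so that pivot accepts a list for values:

df = df.sort_values(['name', 'val'], ascending=[True, False])
df['idx'] = df.groupby('name').cumcount() + 1         # 1-based rank within each name
wide = df.pivot(index='name', columns='idx', values=['val', 'trc'])
wide.columns = [f'{v}_{i}' for v, i in wide.columns]  # val_1, val_2, ..., trc_1, ...
order = sorted(wide.columns,
               key=lambda c: (int(c.rsplit('_', 1)[1]), c.startswith('trc')))
wide = wide[order].fillna('').reset_index()           # val_1, trc_1, val_2, trc_2, ...
print(wide)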

Add a new column and remove duplicates, replacing null values column-wise

Duplication type:
Check this column only (default)
Check other columns only
Check all columns
Use Last Value:
True - retain the last duplicate value
False - retain the first of the duplicates (default)
This rule should add a new column to the dataframe which contains the same values as the source column for any unique rows and is null for any duplicate rows.
The basic code is df.loc[df.duplicated(), get_unique_column_name(df, "clean")] = df[get_column_name(df, column)], with the parameters for duplicated() set based on the duplication type.
See reference for this function above: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
You should specify the columns in the subset parameter based on the setting of duplication_type.
You should set the keep parameter ('first' or 'last') based on use_last_value above.
This is my file.
Jason Miller 42 4 25
Tina Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 42 4 25
Jake Milner 24 2 62
Amy Cooze 73 3 70
Jason Miller 42 4 25
Jason Miller 42 4 25
Jake Milner 24 2 62
Jake Miller 42 4 25
I want to get the output below by using pandas. In the file below I have chosen 2 columns.
Jason Miller 42 4 25
Jake Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 4 25
Jake Milner 2 62
Jake Cooze 73 3 70
Jason Miller 4 25
Jason Miller 4 25
Jake Milner 2 62
Jake Miller 4 25
Can anybody please reply to my query?
You can use DataFrame.duplicated and assign the values of column C only where the first occurrence of the values appears along columns A and B.
You can then fill the NaNs produced with empty strings to produce the required dataframe.
df = pd.read_csv(data, delim_whitespace=True, header=None, names=['A', 'B', 'C', 'D', 'E'])
df.loc[~df.duplicated(), "C'"] = df['C']  # copy C only for rows that are not duplicates
df.fillna('', inplace=True)               # blank out the duplicates' entries
df = df[["A", "B", "C'", "D", "E"]]
print(df)
print(df)
A B C' D E
0 Jason Miller 42 4 25
1 Tina Ali 36 31 57
2 Jake Milner 24 2 62
3 Jason Miller 4 25
4 Jake Milner 2 62
5 Amy Cooze 73 3 70
6 Jason Miller 4 25
7 Jason Miller 4 25
8 Jake Milner 2 62
9 Jake Miller 42 4 25
Another way of doing it would be to take a subset of the duplicated rows and replace the concerned column with empty strings. Then you can use update to modify the original dataframe, df, in place.
In [2]: duplicated_cols = df[df.duplicated(subset=['C', 'D', 'E'])]
In [3]: duplicated_cols
Out[3]:
A B C D E
3 Jason Miller 42 4 25
4 Jake Milner 24 2 62
6 Jason Miller 42 4 25
7 Jason Miller 42 4 25
8 Jake Milner 24 2 62
9 Jake Miller 42 4 25
In [4]: duplicated_cols.loc[:,'C'] = ''
In [5]: df.update(duplicated_cols)
In [6]: df
Out[6]:
A B C D E
0 Jason Miller 42 4.0 25.0
1 Tina Ali 36 31.0 57.0
2 Jake Milner 24 2.0 62.0
3 Jason Miller 4.0 25.0
4 Jake Milner 2.0 62.0
5 Amy Cooze 73 3.0 70.0
6 Jason Miller 4.0 25.0
7 Jason Miller 4.0 25.0
8 Jake Milner 2.0 62.0
9 Jake Miller 4.0 25.0
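As a footnote on the settings described in the question, here is a hedged sketch of how duplication_type and use_last_value could map onto DataFrame.duplicated (the function and value names here are illustrative, not from any library):

import pandas as pd

def add_clean_column(df, column, duplication_type='this', use_last_value=False):
    other = [c for c in df.columns if c != column]
    subset = {'this': [column], 'other': other, 'all': list(df.columns)}[duplication_type]
    keep = 'last' if use_last_value else 'first'
    dup = df.duplicated(subset=subset, keep=keep)
    df[column + '_clean'] = df[column].where(~dup)  # null for duplicate rows
    return df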
