I've been trying to crack this for a while, but I'm stuck now.
This is my code:
l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The print below shows the values correctly, but I want to
                    # filter the values and use them in a dataframe:
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'],
                    #       r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws this error:
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample of the dataset is below (the actual CSV file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like:
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure if my approach using DataFrame.at is correct. Any pointers on how I can efficiently filter only those column values that match the value in the Name column?
I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can reshape your dataset using melt:
# This assumes Item, SerialNo and Location are set as the index first:
df = df.set_index(['Item', 'SerialNo', 'Location'])

df_person = df.loc[:, 'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:, 'PredictedSales01':'PredictedSales06']
df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales.values
index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe using explode:
# 'Name' holds semicolon-joined strings, so split before exploding:
df_names = df['Name'].str.split(';').explode().rename('SalesPerson').to_frame()
df_names = df_names.reset_index().set_index(['Item', 'SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
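To see the whole pipeline end to end, here is a minimal, self-contained sketch on a cut-down two-row version of the sample data (only two SalesPerson/PredictedSales pairs; the values are invented for illustration):

```python
import pandas as pd

# Toy two-row version of the sample data (invented for illustration).
df = pd.DataFrame({
    'Item': ['TV', 'WashingMachine'],
    'Name': ['Joe;Mary', 'Mike'],
    'SerialNo': [11111, 22222],
    'Location': ['NY', 'NJ'],
    'SalesPerson01': ['Tom', 'Tom'],
    'SalesPerson02': ['Joe', 'Mike'],
    'PredictedSales01': [90, 80],
    'PredictedSales02': [30, 74],
})

df = df.set_index(['Item', 'SerialNo', 'Location'])

# Long format: one row per (item, salesperson slot).
df_person = df.loc[:, 'SalesPerson01':'SalesPerson02'].melt(
    ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
df_person['PredictedSales'] = df.loc[:, 'PredictedSales01':'PredictedSales02'].melt(
    ignore_index=False, value_name='PredictedSales')['PredictedSales'].values
df_person = df_person.reset_index().set_index(
    ['Item', 'SerialNo', 'Location', 'SalesPerson'])

# One row per name listed in 'Name' (split the semicolon-joined string first).
df_names = (df['Name'].str.split(';').explode()
            .rename('SalesPerson').to_frame()
            .reset_index()
            .set_index(['Item', 'SerialNo', 'Location', 'SalesPerson']))

# Keep only the salespersons that actually appear in 'Name'.
result = df_names.merge(df_person, left_index=True, right_index=True)
print(result)
```

Names that never appear in a SalesPersonNN column (Mary here) simply drop out of the inner merge, which matches the filtering the question describes.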
Related
I have a data set as below. I need to group a subset of the rows and fill the missing values using the mode: specifically, the missing Value entries for Tom from the UK. Within that group, the most frequently repeated value should replace each NaN.
The dataset:
Name   location  Value
Tom    USA       20
Tom    UK        NaN
Tom    USA       NaN
Tom    UK        20
Jack   India     NaN
Nihal  Africa    30
Tom    UK        NaN
Tom    UK        20
Tom    UK        30
Tom    UK        20
Tom    UK        30
Sam    UK        30
try:
df = (
    df.set_index(['Name', 'location'])
      .fillna(
          df[df.Name.eq('Tom') & df.location.eq('UK')]
            .groupby(['Name', 'location'])
            .agg(pd.Series.mode)
            .to_dict()
      )
      .reset_index()
)
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
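As a self-contained check of the approach above, here is the same fillna-with-group-mode chain on a toy frame (only Tom/UK rows, values invented):

```python
import numpy as np
import pandas as pd

# Toy frame (invented): only the Tom/UK group matters here.
df = pd.DataFrame({
    'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'Tom'],
    'location': ['UK', 'UK', 'UK', 'UK', 'UK'],
    'Value': [np.nan, 20, np.nan, 20, 30],
})

filled = (
    df.set_index(['Name', 'location'])
      .fillna(
          # mode of the Tom/UK group, keyed by (Name, location)
          df[df.Name.eq('Tom') & df.location.eq('UK')]
            .groupby(['Name', 'location'])
            .agg(pd.Series.mode)
            .to_dict()
      )
      .reset_index()
)
print(filled)
```

The dict produced by `.to_dict()` maps the column name to `{('Tom', 'UK'): mode}`, and `fillna` fills by index label, so only the Tom/UK NaNs are replaced.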
I have a data set where the values for different columns appear as separate entries, identified by first name.
For instance, James's gender is in the first row while James's age is in a later row.
DataFrame:
df1 =

   First Name   Age Gender  Weight in lb  Height in cm
0       James   NaN   Male           NaN           NaN
1        John   NaN    NaN         175.0           NaN
2    Patricia  23.0    NaN           NaN           NaN
3       James  22.0    NaN           NaN           NaN
4       James   NaN    NaN         185.0           NaN
5        John  29.0    NaN           NaN           NaN
6        John   NaN    NaN           NaN         176.0
I am trying to combine it into one DataFrame as below:
df1 =

   First Name   Age Gender  Weight  Height
0       James  22.0   Male   185.0     NaN
1        John  29.0    NaN   175.0   176.0
2    Patricia  23.0    NaN     NaN     NaN
I tried to do groupby but it is not working.
Assuming NaN in the empty cells, you can use groupby.first:
df.groupby('First Name', as_index=False).first()
output:
First Name Age Gender Weight in lb Height in cm
0 James 22.0 Male 185.0 NaN
1 John 29.0 None 175.0 176.0
2 Patricia 23.0 None NaN NaN
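A minimal, self-contained run of the groupby.first approach (the frame below reconstructs the question's data, with NaN in the empty cells):

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's frame, NaN in the empty cells.
df = pd.DataFrame({
    'First Name': ['James', 'John', 'Patricia', 'James', 'James', 'John', 'John'],
    'Age': [np.nan, np.nan, 23, 22, np.nan, 29, np.nan],
    'Gender': ['Male', None, None, None, None, None, None],
    'Weight in lb': [np.nan, 175, np.nan, np.nan, 185, np.nan, np.nan],
    'Height in cm': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 176],
})

# first() picks the first non-null value per column within each group.
out = df.groupby('First Name', as_index=False).first()
print(out)
```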
I have some problems with a CSV file; I have tried several solutions with the pandas library but none has worked for me. I want to left-shift 3 columns whenever a certain code (in this case 11 or 22) appears. For example, this would be my input:
   code   name   %  code 2   name 2  % 2  code 3 name 3  % 3
0    11   John  34      44      Rob   23      33  Peter   15
1    22    Ken  45      33    Peter   45      44    Rob   25
2    33  Peter  34      66  Abraham   37      77  Harry   67
3    11   John  45      77    Harry   39      88   Mary   20
And I expect something like this:
   code   name   %  code 2   name 2  % 2  code 3 name 3  % 3
0    44    Rob  23      33    Peter   15     NaN    NaN  NaN
1    33  Peter  45      44      Rob   25     NaN    NaN  NaN
2    33  Peter  34      66  Abraham   37      77  Harry   67
3    77  Harry  39      88     Mary   20     NaN    NaN  NaN
Any idea how I could solve my problem with pandas?
Thanks in advance!
Do you want this?
mask = df['code'].isin([11,22])
df.loc[mask] = df.loc[mask].shift(-3,axis=1)
Output -
code name % code 2 name 2 % 2 code 3 name 3 % 3
0 44.0 Rob 23.0 33.0 Peter 15.0 NaN NaN NaN
1 33.0 Peter 45.0 44.0 Rob 25.0 NaN NaN NaN
2 33.0 Peter 34.0 66.0 Abraham 37.0 77.0 Harry 67.0
3 77.0 Harry 39.0 88.0 Mary 20.0 NaN NaN NaN
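For reference, here is the mask-and-shift answer as a self-contained snippet on the question's data:

```python
import pandas as pd

cols = ['code', 'name', '%', 'code 2', 'name 2', '% 2', 'code 3', 'name 3', '% 3']
df = pd.DataFrame([
    [11, 'John', 34, 44, 'Rob', 23, 33, 'Peter', 15],
    [22, 'Ken', 45, 33, 'Peter', 45, 44, 'Rob', 25],
    [33, 'Peter', 34, 66, 'Abraham', 37, 77, 'Harry', 67],
    [11, 'John', 45, 77, 'Harry', 39, 88, 'Mary', 20],
], columns=cols)

# Shift the flagged rows three columns to the left; the vacated
# trailing columns become NaN.
mask = df['code'].isin([11, 22])
df.loc[mask] = df.loc[mask].shift(-3, axis=1)
print(df)
```

Note that the assignment upcasts the numeric columns to float wherever NaN is introduced, which is why the expected output shows `44.0`, `23.0`, etc.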
I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, with the mean average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
for example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
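The distinction to notice: transform broadcasts the group mean back onto every original row, while a plain mean() collapses to one row per (region, quarter), which is what the question asked for. A sketch on a four-row subset of dict3:

```python
import pandas as pd

# Four-row subset of dict3 from above.
df3 = pd.DataFrame({
    'Region_Name': ['London', 'Newyork', 'London', 'London'],
    'Date': ['1990Q1', '1990Q1', '1991Q1', '1991Q1'],
    'Average': [34, 56, 23, 89],
})

# transform: same length as df3, group mean repeated per row
row_means = df3.groupby(['Region_Name', 'Date'])['Average'].transform('mean')

# mean: one row per (region, quarter) group
new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(new_df)
```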
I have data like:
name val trc
jin 23 apb
tim 52 nmq
tim 61 apb
tim 92 rrc
ron 13 apq
stark 34 rrc
stark 34 apq
ron 4 apq
sia 6 wer
I am looking for output like:
name val_1 trc1 val_2 trc2 val_3 trc3
jin 23 apb
tim 92 rrc 61 apb 52 nmq
ron 13 apq 4 apq
stark 34 rrc 34 apq
sia 6 wer
I want to transform the duplicated rows into columns, with the highest val in val_1, the next lower val in val_2, and so on. The trc1 value should correspond to val_1. Please let me know how to achieve this.
I tried this approach:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('name')}
pd.concat(d, axis=1).reset_index()
index jin ron sia stark tim \
name val trc name val trc name val trc name val trc name
0 0 jin 23.0 apb ron 13.0 apq sia 6.0 wer stark 34.0 rrc tim
1 1 NaN NaN NaN ron 4.0 apq NaN NaN NaN stark 34.0 apq tim
2 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN tim
Use:
df1 = df.sort_values(['name','val'], ascending=False)
df1 = df1.set_index('name').stack().groupby(level=0).apply(list).apply(pd.Series)
df1 = df1.reset_index().fillna("")
print(df1)
name 0 1 2 3 4 5
0 jin 23 apb
1 ron 13 apq 4 apq
2 sia 6 wer
3 stark 34 rrc 34 apq
4 tim 92 rrc 61 apb 52 nmq
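As a self-contained check, the same sort/stack/group chain on a subset of the data (jin and tim rows only):

```python
import pandas as pd

# Subset of the question's data (jin and tim rows only).
df = pd.DataFrame({
    'name': ['jin', 'tim', 'tim', 'tim'],
    'val': [23, 52, 61, 92],
    'trc': ['apb', 'nmq', 'apb', 'rrc'],
})

df1 = df.sort_values(['name', 'val'], ascending=False)
# stack() interleaves each row's val/trc; grouping by name collects them
# into one list per name, which pd.Series then spreads across columns.
df1 = df1.set_index('name').stack().groupby(level=0).apply(list).apply(pd.Series)
df1 = df1.reset_index().fillna("")
print(df1)
```

The descending sort on val is what guarantees the highest val lands in the first column pair.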
Convert your object into a dictionary with the names as keys and your vals and trcs as connected values in a tuple or list.
You want to end up with something like this:
yourDict[name] = [ [val_1, trc1] , [val_2, trc2] ]
Here an option using pivot:
df['index'] = df.groupby('name').cumcount()
df_vals = df.pivot(index='name', columns='index', values='val').rename(columns=lambda x: 'val_'+str(x))
df_trcs = df.pivot(index='name', columns='index', values='trc').rename(columns=lambda x: 'trc_'+str(x))
df_vals.join(df_trcs).fillna('').reset_index()
index name val_0 val_1 val_2 trc_0 trc_1 trc_2
0 jin 23.0 apb
1 ron 13.0 4 apq apq
2 sia 6.0 wer
3 stark 34.0 34 rrc apq
4 tim 52.0 61 92 nmq apb rrc
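And the pivot option as a self-contained snippet, again on a jin/tim subset (note this keeps each name's original row order; sort by val first if you want the highest value in val_0):

```python
import pandas as pd

# Subset of the question's data (jin and tim rows only).
df = pd.DataFrame({
    'name': ['jin', 'tim', 'tim'],
    'val': [23, 52, 61],
    'trc': ['apb', 'nmq', 'apb'],
})

# cumcount numbers each name's rows 0, 1, 2, ... so pivot can
# spread them across val_N / trc_N columns.
df['index'] = df.groupby('name').cumcount()
df_vals = df.pivot(index='name', columns='index', values='val').rename(columns=lambda x: 'val_' + str(x))
df_trcs = df.pivot(index='name', columns='index', values='trc').rename(columns=lambda x: 'trc_' + str(x))
out = df_vals.join(df_trcs).fillna('').reset_index()
print(out)
```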