I am trying to add names to the columns of a pandas value_counts result, but I am failing. I want the two columns to be named "Job department" and "Amount".
df["sales"].value_counts()
Output:
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: sales, dtype: int64
Then I do:
job_frequency = pd.DataFrame(df["sales"].value_counts(), columns=['Job department','Amount'])
print(job_frequency)
but I get:
Empty DataFrame
Columns: [Job department, Amount]
Index: []
Use Series.rename_axis to set the index name, then Series.reset_index to convert the Series to a DataFrame:
job_frequency = (df["sales"].value_counts()
                            .rename_axis('Job department')
                            .reset_index(name='Amount'))
print(job_frequency)
Job department Amount
0 sales 4140
1 technical 2720
2 support 2229
3 IT 1227
4 product_mng 902
5 marketing 858
6 RandD 787
7 accounting 767
8 hr 739
9 management 630
Alternatively, build the DataFrame directly from the index and values of the value_counts result:
job_frequency = pd.DataFrame(
    data={
        'Job department': df["sales"].value_counts().index,
        'Amount': df["sales"].value_counts().values
    }
)
I have a data set that can be found here https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
What we need is, for every employee, a set of rows pairing them with every other employee who shares the same age.
The desired output would be to add these rows in the data frame like so
source  target
Bob     Tom
Bob     Carl
Tom     Bob
Tom     Carl
Carl    Bob
Carl    Tom
I am using pandas to create the data frame from the CSV file with pd.read_csv.
I am struggling with creating the loop to produce my desired output.
This is where I am at so far:
import pandas as pd

path = r"C:\CNT\IBM.csv"  # raw string so the backslashes are not treated as escapes
df = pd.read_csv(path)

def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val

df['source'] = ''
df['target'] = ''
df2 = df.loc[df['Age'] == 18]
print(df2)
This produces:
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420
301 18 302 1200
457 18 458 1878
727 18 728 1051
828 18 829 1904
972 18 973 1611
1153 18 1154 1569
1311 18 1312 1514
My desired output is this
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420 297 302
301 18 302 1200 297 458
457 18 458 1878 297 728
727 18 728 1051 297 829
828 18 829 1904 297 973
972 18 973 1611 297 1154
1153 18 1154 1569 297 1312
1311 18 1312 1514
Where do I go from here?
This will need some modification because I don't have the features you added. But this copies EmployeeNumber to a new column target and shifts the values up within each Age group, leaving the last EmployeeNumber in each group as NaN; I then blank out that last value so each group's final target is empty. The line would need further modification if the source column held strings as well. The main point is using .shift(periods=-1) with groupby().
import pandas as pd
import numpy as np

path = r"C:\CNT\IBM.csv"  # raw string so the backslashes are not treated as escapes
df = pd.read_csv(path)

def f(row):
    val = np.where(row['A'] == row['B'], 0, np.where(row['A'] >= row['B'], 1, -1))
    return val

df['source'] = df['EmployeeNumber']
# Shift EmployeeNumber up one row within each Age group; blank out each group's last row
df['target'] = (df.groupby('Age')['EmployeeNumber']
                  .shift(periods=-1)
                  .fillna(0).astype(int).replace(0, ''))
print(df)

df2 = df.loc[df['Age'] == 18]
df3 = df2[['Age', 'EmployeeNumber', 'source', 'target']]
print(df3)
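A minimal, self-contained sketch of the groupby().shift(periods=-1) idea (toy frame, not the IBM CSV; it uses pandas' nullable Int64 to keep the group ends blank instead of the fillna/replace trick):

```python
import pandas as pd

# Toy data: two age groups (hypothetical employee numbers)
df = pd.DataFrame({
    "Age": [18, 18, 18, 25, 25],
    "EmployeeNumber": [297, 302, 458, 500, 501],
})

df["source"] = df["EmployeeNumber"]
# Within each Age group, target is the next row's EmployeeNumber;
# the last row of each group gets <NA>
df["target"] = (df.groupby("Age")["EmployeeNumber"]
                  .shift(-1)
                  .astype("Int64"))
print(df)
```

Row 297 gets target 302, 302 gets 458, and 458 (the last 18-year-old) stays empty, matching the desired output shape.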
I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output: a grower_moo['Fat'] of 13.60 is less than a Wet_Fat of 14, and therefore gets a price per ton of $430.
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
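The asof merge is easy to verify end-to-end with toy frames built from the question's pricing rows (the Load_Ticket values here are made up):

```python
import pandas as pd

# A slice of the question's pricing index
pricing = pd.DataFrame({
    "Price_Per_Ton": [306, 339, 382, 430, 481],
    "Wet_Fat": [10.0, 11.0, 12.0, 13.0, 14.0],
})
# Hypothetical loads
grower_moo = pd.DataFrame({
    "Load_Ticket": ["L1", "L2"],
    "Fat": [13.60, 11.20],
})

# merge_asof requires both frames sorted on their keys
res = pd.merge_asof(grower_moo.sort_values("Fat"),
                    pricing.sort_values("Wet_Fat"),
                    left_on="Fat", right_on="Wet_Fat")
print(res)
```

The backward search picks the last Wet_Fat less than or equal to each Fat, so Fat 13.60 matches Wet_Fat 13 and gets 430, exactly as in the question's example.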
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']
I have a dataframe with the following structure:
Cluster 1 Cluster 2 Cluster 3
ID Name Revenue ID Name Revenue ID Name Revenue
1234 John 123 1235 Jane 761 1237 Mary 276
1376 Peter 254 1297 Paul 439 1425 David 532
However I am unsure how to perform basic functions like .unique() or .value_counts() for columns, as I am unsure how to refer to them in the code...
For example, if I want to see the unique values in the Cluster 2 Name column, how would I code that?
Usually I would type df.Name.unique() or df['Name'].unique() but neither of these work.
My original data looked like this:
ID Name Revenue Cluster
1234 John 123 1
1235 Jane 761 2
1237 Mary 276 3
1297 Paul 439 2
1376 Peter 254 1
1425 David 532 3
And I used this code to get me to my current point:
df = (df.set_index([df.groupby('Cluster').cumcount(), 'Cluster'])
        .unstack()
        .swaplevel(1, 0, axis=1)
        .sort_index(axis=1)
        .rename(columns=lambda x: f'Cluster {x}', level=0))
You just need to subset by the index in sequence.
So your first step would be to subset Cluster 2, then get unique names.
For example:
df["Cluster 2"]["Name"].unique()
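A quick way to convince yourself: rebuild a tiny frame with the same two-level column structure (toy IDs and revenues) and try both selection styles, chained and tuple-based:

```python
import pandas as pd

# Two clusters, each with ID/Name/Revenue sub-columns (toy data)
cols = pd.MultiIndex.from_product([["Cluster 1", "Cluster 2"],
                                   ["ID", "Name", "Revenue"]])
df = pd.DataFrame([[1234, "John", 123, 1235, "Jane", 761],
                   [1376, "Peter", 254, 1297, "Paul", 439]],
                  columns=cols)

# Subset the cluster first, then the inner column
print(df["Cluster 2"]["Name"].unique())
# Equivalently, index by the full tuple in one step
print(df[("Cluster 2", "Name")].unique())
```

Both lines select the same Series; the tuple form is generally preferred because it avoids chained indexing.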
I have a data frame look like
df1
UserID group day sp PU
213 test 12/11/14 3 311
314 control 13/11/14 4 345
354 test 13/08/14 5 376
and a second data frame df2, which holds UserID values that also appear in df1; the matching rows should be labeled test-Red, and the others should keep their existing group value.
df2
UserID
213
What I am aiming for is to append a new column group2 to df1, derived from the group column using the matching values from df2 as well as the values already in df1, as follows. For instance, UserID 213 appears in both df1 and df2, so its group2 value should be test-Red; otherwise group2 should carry over the value from the group column.
df1
UserID group day sp PU group2
213 test 12/11/14 3 311 test-Red
314 control 13/11/14 4 345 control
354 test 13/08/14 5 376 test-NonRed
This is what I tried:
def converters(df2, df1):
    if df1['UserId'] == df2['UserId']:
        val = "test-Red"
    elif df1['group'] == "test":
        val = "test-NonRed"
    else:
        val = "control"
    return val
But it throws the following error:
ValueError: Series lengths must match to compare
Use numpy.where:
df1['new'] = np.where(df1['UserID'].isin(df2['UserID']), 'test-Red',
                      np.where(df1['group'] == 'test', 'test-NonRed', df1['group']))
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
Or numpy.select:
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'] == 'test'
df1['new'] = np.select([m1,m2], ['test-Red', 'test-NonRed'],default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
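For reference, the np.select approach runs end-to-end on a minimal reconstruction of the question's frames (the day/sp/PU columns are omitted since they don't affect the logic):

```python
import numpy as np
import pandas as pd

# Minimal reconstruction of df1 and df2 from the question
df1 = pd.DataFrame({'UserID': [213, 314, 354],
                    'group': ['test', 'control', 'test']})
df2 = pd.DataFrame({'UserID': [213]})

m1 = df1['UserID'].isin(df2['UserID'])   # rows matching df2 -> test-Red
m2 = df1['group'] == 'test'              # remaining test rows -> test-NonRed
df1['group2'] = np.select([m1, m2], ['test-Red', 'test-NonRed'],
                          default=df1['group'])
print(df1)
```

The conditions in np.select are evaluated in order, so a row that matches both m1 and m2 (like UserID 213) still gets test-Red.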
More general solution:
print (df1)
UserID group day sp PU
0 213 test 12/11/14 3 311
1 314 control 13/11/14 4 345
2 354 test 13/08/14 5 376
3 2131 test1 12/11/14 3 311
4 314 control1 13/11/14 4 345
5 354 test1 13/08/14 5 376
df2 = pd.DataFrame({'UserID':[213, 2131]})
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'].isin(df1.loc[m1, 'group'])
df1['new'] = np.select([m1,m2],
[df1['group'] + '-Red', df1['group'] + '-NonRed'],
default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
3 2131 test1 12/11/14 3 311 test1-Red
4 314 control1 13/11/14 4 345 control1
5 354 test1 13/08/14 5 376 test1-NonRed
Can you use pd.merge and specify the how='outer' parameter? This would include all the data from both tables being joined, i.e.:
df1.merge(df2, how='outer', on='UserID')
How do I filter pivot tables to return specific columns? Currently my dataframe is this:
print(table)
sum
Sex Female Male All
Date (Intervals)
April 166 191 357
August 212 263 475
December 173 263 436
February 192 298 490
January 148 195 343
July 189 260 449
June 165 238 403
March 165 278 443
May 236 253 489
November 167 247 414
October 185 287 472
September 175 306 481
All 2173 3079 5252
I want to display results of only the male column. I tried the following code:
table.query('Sex == "Male"')
However I got this error
TypeError: Expected tuple, got str
How would I be able to filter my table by specified rows or columns?
It looks like table has a column MultiIndex:
sum
Sex Female Male All
One way to check if your table has a column MultiIndex is to inspect table.columns:
In [178]: table.columns
Out[178]:
MultiIndex(levels=[['sum'], ['All', 'Female', 'Male']],
labels=[[0, 0, 0], [1, 2, 0]],
names=[None, 'sex'])
To access a column of table you need to specify a value for each level of the MultiIndex:
In [179]: list(table.columns)
Out[179]: [('sum', 'Female'), ('sum', 'Male'), ('sum', 'All')]
Thus, to select the Male column, you would use
In [176]: table[('sum', 'Male')]
Out[176]:
date
April 42.0
August 34.0
December 32.0
...
Since the sum level is unnecessary, you could get rid of it by specifying the values parameter when calling df.pivot or df.pivot_table.
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
values='sum')
# sex Female Male All
# date
# April 40.0 40.0 80.0
# August 48.0 32.0 80.0
# December 48.0 44.0 92.0
For example,
import numpy as np
import pandas as pd
import calendar
np.random.seed(2016)
N = 1000
sex = np.random.choice(['Male', 'Female'], size=N)
date = np.random.choice(calendar.month_name[1:13], size=N)
df = pd.DataFrame({'sex':sex, 'date':date, 'sum':1})
# This reproduces a table similar to yours
table = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True)
print(table[('sum', 'Male')])
# table2 has a single level Index
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True,
values='sum')
print(table2['Male'])
Another way to remove the sum level would be to use table = table['sum'],
or table.columns = table.columns.droplevel(0).
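Both removal options are easy to check end-to-end with the synthetic data from the example above; here is the droplevel variant (same seed and frame construction):

```python
import calendar

import numpy as np
import pandas as pd

np.random.seed(2016)
N = 1000
df = pd.DataFrame({
    "sex": np.random.choice(["Male", "Female"], size=N),
    "date": np.random.choice(calendar.month_name[1:13], size=N),
    "sum": 1,
})
table = df.pivot_table(index="date", columns="sex", aggfunc="sum", margins=True)

# Drop the outer 'sum' level so plain single-label column access works
flat = table.copy()
flat.columns = flat.columns.droplevel(0)
print(flat["Male"])
```

After droplevel, the columns are just Female, Male, and All, so flat["Male"] selects the male counts directly.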