Update dataframe header with values from another dataframe - python

I'm working with census data (using the Census package). When I select variables with the census API, they come through in their raw format (e.g. B01001_007E) and I'd like to replace each column name with its label (e.g. male 18 to 19 years).
I know this can be done through
df.columns = ['male 18 to 19 years', 'male 20 years', 'male 21 years']
but this is tedious.
Is there a way to do some type of mapping that will auto-query into the header in my df below?
Sample data:
import pandas as pd
from pandas import DataFrame
variables_table = pd.DataFrame({'variable': ['B01001_007E','B01001_008E','B01001_009E'],
'label': ['male 18 to 19 years','male 20 years','male 21 years']
})
variables_table
label variable
male 18 to 19 years B01001_007E
male 20 years B01001_008E
male 21 years B01001_009E
Unclean output:
df = pd.DataFrame({'B01001_007E': ['100','200','300'],
'B01001_008E': ['300','200','100'],
'B01001_009E': ['500','100','200']})
df
B01001_007E B01001_008E B01001_009E
100 300 500
200 200 100
300 100 200

df.rename(columns=variables_table.set_index('variable')['label'])
Out:
male 18 to 19 years male 20 years male 21 years
0 100 300 500
1 200 200 100
2 300 100 200
Note that variables_table.set_index('variable')['label'] is a Series whose index is 'variable'. rename will do the mapping on that index.
This is not an inplace operation. If you want to change the actual dataframe, assign it back to df: df = df.rename(columns=variables_table.set_index('variable')['label']) or use the inplace parameter: df.rename(columns=variables_table.set_index('variable')['label'], inplace=True)
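If the mapping only exists as parallel lists rather than a DataFrame, you can build a plain dict and pass that to rename instead; a minimal sketch using the sample columns above:

```python
import pandas as pd

df = pd.DataFrame({'B01001_007E': ['100', '200', '300'],
                   'B01001_008E': ['300', '200', '100'],
                   'B01001_009E': ['500', '100', '200']})

# Build a plain dict from parallel lists of codes and labels
codes = ['B01001_007E', 'B01001_008E', 'B01001_009E']
labels = ['male 18 to 19 years', 'male 20 years', 'male 21 years']
mapping = dict(zip(codes, labels))

df = df.rename(columns=mapping)
```

rename silently ignores any keys not present as columns, so an over-complete mapping is harmless.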


Add a column in pandas based on sum of the subgroup values in another column [duplicate]

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
(4 answers)
Closed 12 days ago.
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
df = pd.DataFrame({'Person':['John','David','Mary','John','David','Mary'],
'Sales':[10,15,20,11,12,18],
})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this dataframe, containing the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which applies a function to each group and broadcasts the result back to the original shape:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
You can combine the pandas groupby and sum functions, then broadcast the result back by mapping on 'Person' (assigning the grouped sum directly would misalign, because its index is 'Person' rather than the original row numbers):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the grouped sum does not line up row-for-row with the original dataframe. I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = pd.DataFrame(df.groupby('Person')['Sales'].sum()).reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.

How to melt the pd.DataFrame to organize the data? (toy example included)

Issue
I am curious to know how to melt the data_df in the toy example provided below to the desired_df.
import pandas as pd
data_df = pd.DataFrame(data = [['FR','Aug',100], ['FR','Sep',170], ['FR','Oct',250],
['KR','Aug',9], ['KR','Sep',12],['KR','Oct',19],
['US','Aug',360], ['US','Sep',500], ['US','Oct',700]],
columns = ['country','time','covid19'])
data_df
>>> country time covid19
0 FR Aug 100
1 FR Sep 170
2 FR Oct 250
3 KR Aug 9
4 KR Sep 12
5 KR Oct 19
6 US Aug 360
7 US Sep 500
8 US Oct 700
My desired output desired_df is as follows: country names as columns, time as the index, and the number of Covid 19 patients as values.
desired_df
>>> FR KR US
Aug 100 9 360
Sep 170 12 500
Oct 250 19 700
I think pd.melt would help, but it does not create index and columns as I wanted.
Try pivot:
data = data_df.pivot(index='time', columns='country', values='covid19')
print(data)
Which gives:
country FR KR US
time
Aug 100 9 360
Oct 250 19 700
Sep 170 12 500
The indices are in alphabetical order; reorder them as you like. For ordering them calendrically, I'd suggest Brad Solomon's answer to Sort a pandas's dataframe series by month name?, which uses pd.Categorical.
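To make the ordering concrete, here is a small sketch that pivots the toy data and then imposes calendar order with pd.Categorical, along the lines of the linked answer:

```python
import pandas as pd

data_df = pd.DataFrame(data=[['FR', 'Aug', 100], ['FR', 'Sep', 170], ['FR', 'Oct', 250],
                             ['KR', 'Aug', 9], ['KR', 'Sep', 12], ['KR', 'Oct', 19],
                             ['US', 'Aug', 360], ['US', 'Sep', 500], ['US', 'Oct', 700]],
                       columns=['country', 'time', 'covid19'])

pivoted = data_df.pivot(index='time', columns='country', values='covid19')

# Impose calendar order via an ordered Categorical index, then sort
month_order = ['Aug', 'Sep', 'Oct']
pivoted.index = pd.Categorical(pivoted.index, categories=month_order, ordered=True)
pivoted = pivoted.sort_index()
```

For this fixed three-month toy example, a simple pivoted.reindex(month_order) would work just as well; the Categorical route scales to all twelve months.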

How to create a feature based on an average of X rows before? [duplicate]

This question already has answers here:
Moving Average Pandas
(4 answers)
Closed 2 years ago.
I have a dataframe with years of data and many features.
For each of those features I want to create a new feature that averages the last 12 weeks of data.
So say I have weekly data. I want a datapoint for feature1B to give me the average of the last 12 rows of data from feature1A. And if the data is hourly, I want the same done but for the last 2016 rows (24 hours * 7 days * 12 weeks)
So for instance, say the data looks like this:
Week Feature1
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318
14 8642
15 4181
16 3871
17 7919
18 2468
19 4981
20 9871
I need the code to loop through the multiple features, create a feature name such as 'TARGET.'+feature, and emit the averaged data based on my criteria (last 12 rows... last 2016 rows... depends on the format).
Week Feature1 Feature1-B
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318 5717.333333
14 8642 5590
15 4181 6102.083333
16 3871 6284.916667
17 7919 6166.333333
18 2468 6619
19 4981 6659.583333
20 9871 6326.916667
Appreciate any help.
Solved with the helpful comment from Chris A. Can't seem to mark that comment as an answer.
import pandas as pd

df = pd.read_csv('data.csv')
cols = df.iloc[:, 2:].columns
for c in cols:
    df['12W_AVG.' + c] = df[c].rolling(2016).mean()
    df['12W_AVG.' + c] = df['12W_AVG.' + c].fillna(df['12W_AVG.' + c][2015])
    df['12W_AVG.' + c + '_LAL'] = df['12W_AVG.' + c] * 0.9
    df['12W_AVG.' + c + '_UAL'] = df['12W_AVG.' + c] * 1.1
    df.drop(c, axis=1, inplace=True)
Does this work for you?
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=["week", "data"], data=[
[1, 8846],
[2,2497],
[3,1987],
[4,5294],
[5,2487],
[6,1981],
[7,8973],
[8,9873],
[9,8345],
[10,5481],
[11,4381],
[12,8463],
[13,7318],
[14,8642],
[15,4181],
[16,3871],
[17,7919],
[18,2468],
[19,4981],
[20,9871]])
df.insert(2, "average", 0.0, True)
for length in range(12, len(df.index)):
    values = df.iloc[length-12:length, 1]
    weekly_sum = np.sum(values, axis=0)
    df.at[length, 'average'] = weekly_sum / 12
print(df)
mind you, this is very bad code and requires you to do some work on it yourself
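For completeness, the whole desired column can be produced without an explicit loop: a 12-row rolling mean shifted down one row gives the average of the previous 12 rows. A minimal sketch on the question's weekly data (for hourly data you would swap 12 for 2016):

```python
import pandas as pd

df = pd.DataFrame({'Week': range(1, 21),
                   'Feature1': [8846, 2497, 1987, 5294, 2487, 1981, 8973, 9873,
                                8345, 5481, 4381, 8463, 7318, 8642, 4181, 3871,
                                7919, 2468, 4981, 9871]})

# Mean of the previous 12 rows: rolling window of 12, shifted one row forward
df['Feature1-B'] = df['Feature1'].rolling(12).mean().shift(1)
```

The first 12 rows come out as NaN, matching the blank cells in the desired output above.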

Make new column from slice of string from one column pandas python

I read the answer at Pandas make new column from string slice of another column, but it does not solve my problem.
df
SKU Noodles FaceCream BodyWash Powder Soap
Jan10_Sales 122 100 50 200 300
Feb10_Sales 100 50 80 90 250
Mar10_sales 40 30 100 10 11
and so on
Now I want month and year columns which take their values from the SKU column: Jan for the month and 10 for the year (2010).
df['month']=df['SKU'].str[0:3]
df['year']=df['SKU'].str[4:5]
I get KeyError: 'SKU'
Doing other things to understand why the error, I perform the following:
[IN]df.index.name
[OUT]None
[IN]df.columns
[OUT]Index(['Noodles','FaceCream','BodyWash','Powder','Soap'], dtype='object', name='SKU')
Please help
I think the first column is the index, so use .index. Also, for the year, change the 4:5 slice to 3:5; the 0 in 0:3 can be omitted:
df['month']=df.index.str[:3]
df['year']=df.index.str[3:5]
print (df)
Noodles FaceCream BodyWash Powder Soap month year
SKU
Jan10_Sales 122 100 50 200 300 Jan 10
Feb10_Sales 100 50 80 90 250 Feb 10
Mar10_sales 40 30 100 10 11 Mar 10
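If you also want the year as a proper integer (assuming all SKUs are from the 2000s), you can convert the slice directly; a small sketch with a cut-down version of the data:

```python
import pandas as pd

df = pd.DataFrame({'Noodles': [122, 100, 40]},
                  index=pd.Index(['Jan10_Sales', 'Feb10_Sales', 'Mar10_sales'], name='SKU'))

df['month'] = df.index.str[:3]
# Convert the two-digit year slice to a full four-digit integer
df['year'] = 2000 + df.index.str[3:5].astype(int)
```
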

Python Pandas - How to substitute a column of DataFrame1 with the values of a column in DataFrame2

Hi all, I have a dataframe with more than 50000 records. It has a column named "Country" which has duplicate values.
As part of a Machine Learning project I am doing Label Encoding on this column, which will replace the 50000 country values with integers. (Ok, for those who do not know about Label Encoding: it takes the unique values of the column and assigns an integer to each, mostly based on alphabetical order, but I am not sure.) Say this dataframe is DF1 and the column is "Country".
Now my requirement is that I have to do the same for another dataframe (DF2) manually, i.e. without using the Label Encoding function.
What I have tried so far, and where I get stuck:
I took the unique values of the DF1.Country column and kept them in a new dataframe (temp_df).
I tried a right join of DF2 and temp_df on="Country", but I get "NaN" in a few records. Not sure why.
I tried find-and-replace using the .isin method but still do not get the desired output.
So my basic question is how to fill a column in a dataframe with the values of a column in another dataframe by matching the values of two columns in two dataframes ?
UPDATED
Sample code output is given below for better understanding
The Country Column in DF2 has repeatable values like this :
0 us
1 us
2 gb
3 us
4 au
5 fr
6 us
7 us
8 us
9 us
10 us
11 us
12 ca
13 at
14 us
15 us
16 es
17 fi
18 fr
19 us
20 us
The temp_df dataframe will have an integer value for every unique country name, as mentioned below (note: this dataframe will only have unique values, not duplicates):
1 gb 49
2 ca 22
3 au 5
4 de 34
5 fr 48
6 br 17
7 jp 75
8 sv 136
9 no 111
10 se 132
11 es 43
12 nl 110
13 mx 103
14 dk 36
15 ro 127
16 ch 24
17 it 71
18 be 10
19 ru 129
20 kr 78
21 fi 44
22 hk 59
23 ie 65
24 sg 133
25 nz 112
26 ar 3
27 at 4
28 in 68
29 cl 26
30 il 66
Now I have to create a new column in DF2 by taking the integer values from temp_df for each country value in DF2. Hope this helps.
You could use pandas.Series.map to accomplish this:
from io import StringIO
import pandas as pd
# Your data ..
data = """
id,country
0,AT
1,DE
2,UK
3,FR
4,AT
5,UK
6,IT
7,DE
"""
df = pd.read_table(StringIO(data), sep=',', index_col=[0])
# Create a map from your current labels to numeric labels:
country_labels = {c: i for i, c in enumerate(df.country.unique())}
# Use map() to transform your column and re-assign it
df.country = df.country.map(lambda c: country_labels[c])
print(df)
which will transform the above data to
country
id
0 0
1 1
2 2
3 3
4 0
5 2
6 4
7 1
As suggested in one of the comments to your question, you could also use replace()
df = df.replace({'country': country_labels })
Try this:
import pandas as pd
# dataframe
df = pd.DataFrame({'Country' : ['z','x', 'x', 'a', 'a', 'b', 'c'], 'Something' : [10, 1, 2, 1, 2, 3, 4]})
# create dictionary for mapping `sorted` countries to integer
country_map = dict(zip(sorted(df.Country.unique()), range(len(df.Country.unique()))))
# country_map should look smthing like:
# {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'z': 4}, where a, b, .. are countries
# replace `Country` column with the mapping
df.replace({'Country': country_map })
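Back to the original question of filling a column in DF2 from temp_df: combining set_index with map does it directly. A sketch with hypothetical data; the country codes and the 'Code' column name are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for the question's DF2 and temp_df
df2 = pd.DataFrame({'Country': ['us', 'gb', 'au', 'us', 'fr']})
temp_df = pd.DataFrame({'Country': ['gb', 'au', 'fr', 'us'],
                        'Code': [49, 5, 48, 140]})

# Match DF2's 'Country' values against temp_df's mapping
df2['Country_encoded'] = df2['Country'].map(temp_df.set_index('Country')['Code'])
```

Any country present in DF2 but missing from temp_df would come out as NaN, which is also a quick way to spot unmapped values.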
