This question already has answers here:
Adding column to pandas DataFrame containing list of other columns' values
(2 answers)
Closed 2 years ago.
I have the following dataframe:
name age year salary1 salary2 salary3 salary4
Rodrigo 28 2021 1945 2312 4567 3214
Neil 26 2021 3546 6657 -3200 1855
Loti 34 2021 4671 3895 5512 7864
...
I would like to create a new column that holds a list of the values from the columns salary1, salary2, salary3 and salary4, so that the output looks like this:
name age year salary1 salary2 salary3 salary4 new_column
Rodrigo 28 2021 1945 2312 4567 3214 [1945,2312,4567,3214]
Neil 26 2021 3546 6657 -3200 1855 [3546,6657,-3200,1855]
Loti 34 2021 4671 3895 5512 7864 [4671,3895,5512,7864]
I have tried to concatenate the relevant columns by changing the type of each column to string and then adding them:
df['new_column'] = df['salary1'].astype(str) + ',' + \
df['salary2'].astype(str) + ',' + \
df['salary3'].astype(str) + ',' + \
df['salary4'].astype(str)
That does concatenate the columns, but it does not make them a list, and it also changes the type to string while I still need the values to be numerical.
My question is: how can I create a new column containing a list of the four column values?
Try this:
df['new_column'] = df[['salary1', 'salary2', 'salary3', 'salary4']].values.tolist()
Another possibility using apply is
df['new_col'] = df[['salary1', 'salary2', 'salary3', 'salary4']].apply(lambda r: list(r), axis=1)
Note, however, that this is probably slower than the .values.tolist() approach suggested in another answer.
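For reference, here is a minimal, self-contained sketch of the list-column approach on the sample data from the question. The salary_cols name is only an illustrative helper, and to_numpy() is used as the newer spelling of .values; treat it as one way to do it rather than the definitive one.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["Rodrigo", "Neil", "Loti"],
    "age": [28, 26, 34],
    "year": [2021, 2021, 2021],
    "salary1": [1945, 3546, 4671],
    "salary2": [2312, 6657, 3895],
    "salary3": [4567, -3200, 5512],
    "salary4": [3214, 1855, 7864],
})

# Hypothetical helper name: the columns to collect into one list per row
salary_cols = ["salary1", "salary2", "salary3", "salary4"]

# Convert the selected block to a 2-D array, then to a list of row lists
df["new_column"] = df[salary_cols].to_numpy().tolist()

print(df[["name", "new_column"]])
```

The values stay numeric inside each list. If the selected columns had mixed dtypes (say ints and floats), the array conversion would upcast them to a common type first.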
This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 14 days ago.
I have a data frame with one string column and I'd like to split it into multiple columns, separating on ','. I want each new column to be named after the string that comes before the ':'.
The column looks like this:
0 {"ID":"AP001","Name":"Anderson","Age":"23"}
1 {"ID":"AP002","Name":"Jasmine","Age":"36"}
2 {"ID":"AP003","Name":"Zack","Age":"28"}
3 {"ID":"AP004","Name":"Chole","Age":"39"}
And I want to split to this:
ID Name Age
AP001 Anderson 23
AP002 Jasmine 36
AP003 Zack 28
AP004 Chole 39
I have tried to split it by ',', but I'm not sure how to remove the string before ':' and use it as the column name.
data1 = data['demographic'].str.split(',',expand=True)
This is what I get after splitting it:
0 1 2
"ID":"AP001" "Name":"Anderson" "Age":"23"
"ID":"AP002" "Name":"Jasmine" "Age":"36"
"ID":"AP003" "Name":"Zack" "Age":"28"
"ID":"AP004" "Name":"Chole" "Age":"39"
Does anyone know how to do it?
You can use ast.literal_eval:
import ast
import pandas as pd
data1 = pd.json_normalize(data['demographic'].apply(ast.literal_eval))
print(data1)
# Output
ID Name Age
0 AP001 Anderson 23
1 AP002 Jasmine 36
2 AP003 Zack 28
3 AP004 Chole 39
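Since the strings in the example are valid JSON, json.loads from the standard library is another option. This is only a sketch under the assumption that every row parses as a JSON object; the sample frame below is rebuilt from the question.

```python
import json

import pandas as pd

# Sample column copied from the question
data = pd.DataFrame({
    "demographic": [
        '{"ID":"AP001","Name":"Anderson","Age":"23"}',
        '{"ID":"AP002","Name":"Jasmine","Age":"36"}',
        '{"ID":"AP003","Name":"Zack","Age":"28"}',
        '{"ID":"AP004","Name":"Chole","Age":"39"}',
    ]
})

# Parse each string into a dict, then build one column per key
parsed = data["demographic"].map(json.loads)
data1 = pd.DataFrame(parsed.tolist(), index=data.index)

print(data1)
```

Note that Age is still a string after parsing, because it is quoted in the source; data1["Age"].astype(int) would convert it if needed.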
I am using Python to analyze a data set that has a column with a year range (see below for example):
Name   Years Range
Andy   1985 - 1987
Bruce  2011 - 2018
I am trying to split the "Years Range" column, which holds the start and end years as one string, into two separate columns in the data frame: "Year Start" and "Year End".
Name   Years Range  Year Start  Year End
Andy   1985 - 1987  1985        1987
Bruce  2011 - 2018  2011        2018
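You can pass expand=True to str.split. Splitting on a pattern that also consumes any spaces around the hyphen keeps stray whitespace out of the new columns:
df[['Year Start', 'Year End']] = df['Years Range'].str.split(r'\s*-\s*', expand=True)
# Output
    Name Years_Range Year Start Year End
0   NAdy   1995-1987       1995     1987
1  bruce   1890-8775       1890     8775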
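I think str.extract can do the job.
Here is an example:
import pandas as pd
df = pd.DataFrame(["1985 - 1987"], columns=["Years Range"])
# raw strings keep \d from being read as an escape; expand=False returns a Series
df['Year Start'] = df['Years Range'].str.extract(r'(\d{4})', expand=False)
df['Year End'] = df['Years Range'].str.extract(r'- (\d{4})', expand=False)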
df['start'] = ''  # create a blank column named 'start'
df['end'] = ''  # create a blank column named 'end'
# loop over the data frame (assumes the default 0..n-1 index)
for i in range(len(df)):
    parts = df.loc[i, 'Years Range'].split('-')  # split each value on the hyphen
    df.loc[i, 'start'] = parts[0].strip()  # store the first element
    df.loc[i, 'end'] = parts[1].strip()  # store the second element
https://colab.research.google.com/drive/1Kemzk-aSUKRfE_eSrsQ7jS6e0NwhbXWp#scrollTo=esXNvRpnSN9I&line=1&uniqifier=1
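If the two new columns should be numeric rather than strings, a single str.extract call with two named groups followed by a cast is one way to do it. This is a sketch on the question's sample data; the group names start and end and the intermediate years frame are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Andy", "Bruce"],
    "Years Range": ["1985 - 1987", "2011 - 2018"],
})

# One pattern, two named groups; optional spaces around the hyphen are allowed
years = df["Years Range"].str.extract(r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})")

# Cast the captured text to integers
df["Year Start"] = years["start"].astype(int)
df["Year End"] = years["end"].astype(int)

print(df)
```

Rows that do not match the pattern come back as NaN from str.extract, so the astype(int) cast would raise for them; in that case pd.to_numeric(..., errors="coerce") may be a gentler choice.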
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 8 months ago.
I have data that contains several rows for each employee. Each row contains one attribute and its value. For example:
Worker ID  Last Name  First Name  Metric Name  Metric Value
1          Hanson     Scott       Attendance   98
1          Hanson     Scott       On time      35
2          Avery      Kara        Attendance   95
2          Avery      Kara        On time      57
I would like to combine rows based on worker id, taking metrics to their own columns like so:
Worker ID  Last Name  First Name  Attendance  On time
1          Hanson     Scott       98          35
2          Avery      Kara        95          57
I can do worker_data.pivot_table(values='Metric Value', index='Worker ID', columns=['Metric Name']), but that does not give me the first and last names as columns. What is the best Pandas way to merge these rows?
In your solution, pass a list to the index parameter, and to avoid a MultiIndex in the columns, remove the [] from the columns parameter:
df = (worker_data.pivot_table(index=['Worker ID', 'Last Name', 'First Name'],
                              columns='Metric Name',
                              values='Metric Value')
      .reset_index()
      .rename_axis(None, axis=1))
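Here is the same pivot as a runnable sketch on the sample rows from the question. The variable name wide is just for illustration.

```python
import pandas as pd

# Sample data copied from the question (long format: one metric per row)
worker_data = pd.DataFrame({
    "Worker ID": [1, 1, 2, 2],
    "Last Name": ["Hanson", "Hanson", "Avery", "Avery"],
    "First Name": ["Scott", "Scott", "Kara", "Kara"],
    "Metric Name": ["Attendance", "On time", "Attendance", "On time"],
    "Metric Value": [98, 35, 95, 57],
})

wide = (
    worker_data.pivot_table(
        index=["Worker ID", "Last Name", "First Name"],  # keep names as identifiers
        columns="Metric Name",                            # one column per metric
        values="Metric Value",
    )
    .reset_index()              # turn the index levels back into columns
    .rename_axis(None, axis=1)  # drop the leftover "Metric Name" columns label
)

print(wide)
```

pivot_table aggregates with the mean by default, so the metric columns come out as floats, and duplicate (worker, metric) pairs would be averaged. If each pair is known to be unique, passing aggfunc="first" is one way to keep the original values.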
My pandas df has a column containing the birth years of the household members, and it looks like this:
Birthyear_household_members
1960
1982 + 1989
1941
1951 + 1953
1990 + 1990
1992
I want to create a column with a variable that contains the number of people above 64 years old in a household.
Therefore, for each row, I need to separate the string and count the number of people with a birthyear before 1956.
How can I do this using pandas? My original df is very large.
Try using the apply method on that column (converting each year to int avoids relying on string comparison):
df['cnt'] = df['Birthyear_household_members'].apply(lambda x: sum(int(year) < 1956 for year in x.split(" + ")))
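Because the question mentions that the frame is very large, a vectorized variant without a Python-level lambda may be worth trying. The sketch below assumes every entry is one or more four-digit years separated by "+"; whether it is actually faster on your data is something to measure rather than assume.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Birthyear_household_members": [
        "1960", "1982 + 1989", "1941", "1951 + 1953", "1990 + 1990", "1992",
    ]
})

# Split on "+" (with optional surrounding spaces), one row per person,
# keeping the original row label as the index
years = (
    df["Birthyear_household_members"]
    .str.split(r"\s*\+\s*")
    .explode()
    .astype(int)
)

# Count, per original row, how many birth years fall before 1956
df["cnt"] = years.lt(1956).groupby(level=0).sum()

print(df)
```

The "+" is escaped in the pattern because a multi-character separator is treated as a regular expression by default, where an unescaped "+" would act as a quantifier.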
Suppose I have a dataframe whose index is a monthly time step. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data by year and apply other operations. Is there some way I could quickly group it by, let's say, decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year directly, you could still pass lambda x: (x.year//10)*10 to groupby.
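The decade trick can be checked with a small standalone script. This sketch rebuilds the same series as above but uses month-start dates (freq="MS"), which is an assumption made here so that it runs unchanged on both older and newer pandas versions; the decade sums are the same because each row falls in the same year either way.

```python
import numpy as np
import pandas as pd

# 500 monthly timestamps starting in January 2001 (month-start dates)
dates = pd.date_range("2001-01-01", periods=500, freq="MS")
df = pd.DataFrame({"A": 5 * np.arange(len(dates)) + 2}, index=dates)

# Integer-divide the year by 10, then multiply by 10: 2001..2009 -> 2000, etc.
decade = (df.index.year // 10) * 10

by_decade = df.groupby(decade).sum()
print(by_decade)
```

Unlike a fixed 10-year resample, which anchors its bins to the first timestamp in the data, this grouping always aligns to calendar decades (2000, 2010, ...).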
If your DataFrame has columns such as ['Population', 'Salary', 'vehicle count'], first make the year your index (resample needs a datetime-like index):
dataframe = dataframe.set_index('Year')
Then use the code below to resample the data into 10-year bins; it also gives you the sum of all the other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group like this:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: iloc[:, 0] here selects the first column in your dataframe. Older pandas versions spelled this with .ix and resample(..., how='count'), both of which have since been removed.
The available offset aliases are listed here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases