Sort a column containing string in Pandas - python

I am new to Pandas, and looking to sort a column containing strings and generate a numerical value to uniquely identify the string. My data frame looks something like this:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
First I would like to sort the 'year_week' column in ascending order (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.

You can first convert the year_week column with to_datetime, then sort by the result with sort_values, and finally use factorize:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
#http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
#sort by column date
df.sort_values('date', inplace=True)
#create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print (df)
   key year_week       date  num
1    1    2015_1 2015-01-11    0
0    0   2015_10 2015-03-15    1
2    2   2015_11 2015-03-22    2
5    5    2016_3 2016-01-24    3
3    3    2016_9 2016-03-06    4
6    6    2016_9 2016-03-06    4
4    4   2016_10 2016-03-13    5
7    7   2016_10 2016-03-13    5

## 1st method: suited to large datasets
## Split the "year_week" column into two columns
df[['year', 'week']] = df['year_week'].str.split("_", expand=True)
## Convert the new columns to integers so they sort numerically
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')
## Sort the dataframe by the new columns
df = df.sort_values(['year', 'week'], ascending=True)
## Drop the helper year and week columns
df.drop(['year', 'week'], axis=1, inplace=True)
## Sorted dataframe
df
## 2nd method: suited to small datasets
## Make sure the column holds strings
df['year_week'] = df['year_week'].astype('str')
## List the categories in the order you want
cats = ['2015_1', '2015_10', '2015_11', '2016_3', '2016_9', '2016_10']
## Use pd.Categorical() to turn the column into an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
## Sort by the 'year_week' column
df = df.sort_values('year_week')
## Sorted dataframe
df
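Neither method above produces the numeric identifier the question asks for. Here is a minimal sketch that combines an integer sort key with pd.factorize, without an intermediate date column. The names `parts`, `sort_key` and `out` are only illustrative, and the `year * 100 + week` key assumes week numbers stay below 100:

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Split "YYYY_W" into integer year and week parts (illustrative helper frame)
parts = df['year_week'].str.split('_', expand=True).astype(int)

# One sortable integer per row, e.g. 2015_1 -> 201501 (assumes week < 100)
sort_key = parts[0] * 100 + parts[1]

# Reorder the rows by the key; a stable sort keeps ties in their original order
out = df.loc[sort_key.sort_values(kind='stable').index].copy()

# factorize numbers labels in order of first appearance, so after sorting
# the codes increase together with year and week
out['num'] = pd.factorize(out['year_week'])[0]
print(out)
```

Because the sort key is numeric, '2015_10' sorts after '2015_1' and '2016_10' after '2016_9', which a plain string sort would not guarantee.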

Related

Stick the columns based on the one columns keeping ids

I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in a single column. For example, for id=1 I want to put 1 and -1 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note2: I have already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5
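A self-contained version of this answer, as a sketch built on the three sample columns from the question. With 100 columns the same chain should apply unchanged, since stack walks every non-id column; the intermediate name `stacked` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})

# Moving 'id' into the index first keeps it out of the stacked values
stacked = df.set_index('id').stack()

# Level 1 of the stacked index holds the old column names (c1, c2);
# dropping it leaves only 'id', which reset_index turns back into a column
out = stacked.droplevel(1).reset_index(name='c')
print(out)
```

The attempt in the question stacks 'id' together with the value columns, which is why the ids end up mixed into the values instead of labelling them.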

Create Pivot table for each column in Pandas df

I have a DataFrame where I have many columns (there is one dependent variable and many independent variables)
variable_id  dep_var  variable_1  variable_2
new          1        6           3
new          0        3           6
new          0        8           7
new          1        11          1
new          0        17          9
new          1        1           2
I want to create a Pivot table such as this:
pd.pivot_table(df,index=['variable_1'], columns=['dep_var'], values=['variable_id'],aggfunc='count')
I want to create it for each column separately (so I need to change the index in pd.pivot_table).
I have written some sample code:
def pivot_table(df):
    df_columns = list(df)
    for column in df_columns:
        print("indexing by: ", column)
        print(pd.pivot_table(df,index=[column], columns=['dep_var'], values=['variable_id'],aggfunc='count'))
but I want my results to be saved as pandas DataFrames rather than printed.
Desired output: a separate table for each variable.
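Use:
def pivot_table(df):
    # Collect one pivot table per column instead of printing them
    dfs = []
    # Skip dep_var itself, since it is the value being aggregated
    for column in df.columns.drop('dep_var'):
        print("indexing by: ", column)
        # Use a new name so the input df is not overwritten inside the loop;
        # the default aggfunc is the mean of dep_var per group
        pivoted = pd.pivot_table(df, index=[column], values=['dep_var'])
        dfs.append(pivoted)
    return dfs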
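If you want to keep the count layout from the question (one column per dep_var value), a possible variant is to return a dict keyed by column name. This is only a sketch: the function name `pivot_tables`, its `dep_var` and `id_col` parameters, and `fill_value=0` are my own choices, not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    'variable_id': ['new'] * 6,
    'dep_var': [1, 0, 0, 1, 0, 1],
    'variable_1': [6, 3, 8, 11, 17, 1],
    'variable_2': [3, 6, 7, 1, 9, 2],
})

def pivot_tables(df, dep_var='dep_var', id_col='variable_id'):
    """Return {column name: count pivot table} for each independent variable."""
    tables = {}
    # Only the independent variables are used as the pivot index
    columns = [c for c in df.columns if c not in (dep_var, id_col)]
    for column in columns:
        tables[column] = pd.pivot_table(
            df, index=[column], columns=[dep_var],
            values=[id_col], aggfunc='count', fill_value=0,
        )
    return tables

tables = pivot_tables(df)
print(tables['variable_1'])
```

Each value in `tables` is an ordinary DataFrame, so it can be saved or concatenated later, e.g. with pd.concat(tables).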

Python - Pandas resample dataframe with strings and floats

I have a dataframe where the index is a datetimeindex, and every row is every day over the course of a couple years. I need to resample the dataframe by month where the two float columns are summed, but the string columns are all the unique values during that month. I can do the resampling to a single column, but I don't know how to do it to everything, or how to combine them back together if I do it one at a time.
For the floats I was trying:
# go through the column list
for col in col_list:
    # process all run time columns for month
    if "float" in str(col):
        # resample for one month and sum
        df[col] = df[col].resample('M').sum()
        # rename the column
        df.rename(columns={col: col + " MONTHLY"}, inplace=True)
and for the strings:
    elif "string" in str(col):
        # get all the unique jobs run during the month
        df[col] = df[col].groupby(pd.Grouper(freq='M')).unique()
        df.rename(columns={col: col + " MONTHLY"}, inplace=True)
These resulted in the monthly data being inserted into the dataframe while every daily row still existed, which made the monthly values hard to find and is not what I need.
Some sample data:
            float_1  float_2  string_1  string_2
12/30/2019  1        2        a         a
12/31/2019  1        3        a         b
1/1/2020    2        4        a         c
1/2/2020    3        5        b         d
The expected output would be:
12/2019     2        5        a         a, b
1/2020      5        9        a, b      c, d
Not sure if it matters but the real data does have NaN in random days throughout the data.
Try aggregating the numeric columns and the non-numeric columns separately, then join them back together:
df.index = pd.to_datetime(df.index)
numerics = df.select_dtypes('number').resample('M').sum()
strings = df.select_dtypes('object').resample('M').agg(lambda x: ','.join(set(x)))
numerics.join(strings)
#            float_1  float_2 string_1 string_2
#2019-12-31        2        5        a      a,b
#2020-01-31        5        9      a,b      d,c
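Since the question mentions NaN on random days, here is a hedged variant of the same idea. Joining a set fails on NaN (it is a float, not a string) and does not keep a stable order, so this sketch drops NaN and uses unique(), which preserves first-seen order. The sample frame, the NaN I placed in string_2, and the helper `join_unique` are all assumptions for illustration. Grouping by monthly periods instead of resample('M') is also my choice, to avoid depending on the month-end alias, which newer pandas versions spell 'ME':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "float_1": [1.0, 1.0, 2.0, 3.0],
        "float_2": [2.0, 3.0, 4.0, 5.0],
        "string_1": ["a", "a", "a", "b"],
        "string_2": ["a", "b", np.nan, "d"],  # NaN added to mimic the real data
    },
    index=pd.to_datetime(["2019-12-30", "2019-12-31", "2020-01-01", "2020-01-02"]),
)

def join_unique(values):
    """Join the distinct non-missing strings of one month, in first-seen order."""
    return ", ".join(values.dropna().unique())

# One monthly period label per daily row
months = df.index.to_period("M")

numerics = df.select_dtypes(include="number").groupby(months).sum()
strings = df.select_dtypes(exclude="number").groupby(months).agg(join_unique)

monthly = numerics.join(strings)
print(monthly)
```

The result has one row per month and replaces the daily frame rather than being written back into it, which avoids the mixed daily and monthly rows described in the question.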

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as multi-index where repeated indexes are grouped.
The index looks like this:
So I would like all the rows with the index value 112335586 to be grouped under the same index entry.
I have looked at the question "Create pandas dataframe by repeating one row with new multiindex", but there the index values are pre-defined, which is not possible here because my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10    1
10    2
20    3
20    4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10        0    1
          1    2
20        0    3
          1    4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
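Both answers come down to numbering the rows inside each repeated index value with cumcount. A small sketch of the same idea on a DataFrame, without the reset_index round trip; the sample frame and the names `counter` and `out` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.Index([10, 10, 20, 20], name="EVENT_ID"),
)

# Number the rows within each repeated EVENT_ID: 0, 1, 0, 1
counter = df.groupby(level="EVENT_ID").cumcount().rename("sub_idx")

# Append the counter as a second index level next to EVENT_ID
out = df.set_index(counter, append=True)
print(out)
```

Nothing is hard-coded here, so it should scale to a large frame as long as the index level has a name to group on.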

Create new column based on another column for a multi-index Panda dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and get the values from another column ('some_var'); of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
# pd.datetime was removed from pandas; pd.Timestamp builds the same date
date = [pd.Timestamp(2017,1,31)+pd.offsets.MonthEnd(i) for i in [0,1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(('2017-02-28',3), level=None, inplace=True)
df
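# The desired output....
# MultiIndex.labels no longer exists in pandas; read the ids from the 'id' level instead
list1 = df.loc['2017-01-31'].index.get_level_values('id').tolist()
list2 = df.loc['2017-02-28'].index.get_level_values('id').tolist()
common = list(set(list1) & set(list2))
for i in common:
    # a single .loc call avoids chained assignment, which may write to a copy
    df.loc[('2017-01-31', i), 'new_var'] = df.loc[('2017-02-28', i), 'some_var']
df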
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then merge the values back onto the original dataframe with a left merge.
Out[28]:
                 some_var
date       id
2017-01-31 0     0.736003
           1     0.248275
           2     0.844170
           3     0.671364
           4     0.034331
2017-02-28 0     0.051586
           1     0.894579
           2     0.136740
           4     0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
                 some_var   new_var
date       id
2017-01-31 0     0.736003  0.051586
           1     0.248275  0.894579
           2     0.844170  0.136740
           3     0.671364       NaN
           4     0.034331  0.902409
2017-02-28 0     0.051586       NaN
           1     0.894579       NaN
           2     0.136740       NaN
           4     0.902409       NaN
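Another way to get the same alignment, as a rough sketch rather than a drop-in replacement: pivot the ids into columns, shift the rows up by one date, and stack back. This assumes the date level contains every month-end with no gaps, so that a one-row shift really means one month ahead. The deterministic sample values and the name `wide` are mine, not from the question:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-31", "2017-02-28"])
# Same shape as the question's data: id 3 is missing in February
rows = [(d, i) for d in dates for i in range(5)
        if not (d == pd.Timestamp("2017-02-28") and i == 3)]
index = pd.MultiIndex.from_tuples(rows, names=["date", "id"])
df = pd.DataFrame({"some_var": np.arange(len(index), dtype=float)}, index=index)

# One row per date, one column per id; missing (date, id) pairs become NaN
wide = df["some_var"].unstack("id")

# shift(-1) moves next month's values onto the current month's row;
# stacking restores the (date, id) index, and assignment aligns on it
df["new_var"] = wide.shift(-1).stack()
print(df)
```

Because the assignment aligns on the (date, id) index, ids missing in the following month simply come out as NaN, and no explicit loop over the common ids is needed.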
