So, I have some data in list form, such as:
Q=[2,3,4,5,6,7,8,9,10,11,12] #values
M=[11,0,1,2,3,4,5,6,7,8,9] #months
Y=[2010,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011] #years
And I want to get a dataframe with one row per year and one column per month, placing the values of Q at the positions given by M and Y.
So far I have tried a couple of things; my current code is as follows:
def save_data(data_list, year_info, month_info):
    # how many datapoints
    n_data = len(data_list)
    # how many years
    y0 = year_info[0]
    yf = year_info[n_data-1]
    n_years = yf - y0 + 1
    # creating the list I want to fill out
    df_list = [[math.nan]*12]*n_years
    ind = 0
    for y in range(n_years):
        for m in range(12):
            if ind < len(data_list):
                if year_info[ind]-y0 == y and month_info[ind] == m:
                    df_list[y][m] = data_list[ind]
                    ind += 1
    df = pd.DataFrame(df_list)
    return df
I get this output:
     0    1    2    3    4    5    6    7    8    9   10   11
0    3    4    5    6    7    8    9   10   11   12  nan    2
1    3    4    5    6    7    8    9   10   11   12  nan    2
And I want to get:
     0    1    2    3    4    5    6    7    8    9   10   11
0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan    2
1    3    4    5    6    7    8    9   10   11   12  nan  nan
I have tried a bunch of different things, but so far nothing has worked. I'm wondering if there's a more straightforward way of doing this. My code seems to be overwriting values in a weird way: for instance, I do not know why there is a 2 in the last position of the second row, since that's the first value of my list.
Thanks in advance!
Try pivot:
(pd.DataFrame({'Y': Y, 'M': M, 'Q': Q})
   .pivot(index='Y', columns='M', values='Q')
)
Output:
M       0    1    2    3    4    5    6     7     8     9    11
Y
2010  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   2.0
2011  3.0  4.0  5.0  6.0  7.0  8.0  9.0  10.0  11.0  12.0   NaN
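As for why the original loop put a 2 at the end of both rows: `[[math.nan]*12]*n_years` repeats a reference to one inner list instead of creating `n_years` separate lists, so every write shows up in all rows at once. Below is a minimal sketch of that effect. The `reindex` call at the end is an optional addition (not part of the answer above) that keeps all 12 month columns even when a month, here 10, never occurs in the data.

```python
import math
import pandas as pd

# Multiplying the outer list copies references: both "rows" are the same list.
aliased = [[math.nan] * 12] * 2
aliased[0][11] = 2
shared = aliased[1][11]  # also 2, because aliased[0] is aliased[1]

# A list comprehension builds a separate inner list for each year.
independent = [[math.nan] * 12 for _ in range(2)]
independent[0][11] = 2  # only the first row changes

Q = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]         # values
M = [11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]           # months
Y = [2010] + [2011] * 10                         # years

wide = (
    pd.DataFrame({"Y": Y, "M": M, "Q": Q})
    .pivot(index="Y", columns="M", values="Q")
    .reindex(columns=range(12))  # add month columns that never occur
)
print(wide)
```

Replacing the `df_list` line in `save_data` with the list comprehension form should be enough to stop the rows from overwriting each other, although the pivot is shorter.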
I've got 3 DataFrames that I would like to merge or join on "Label" so that I can compare all of their columns.
Examples of the DataFrames are below:
df1
Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2
df2
Label,col1,col1,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8
df3
Label,col1,col1,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8
and what I'd like to see is something similar to
Label,df1_col1,df2_col1,df3_col1,df1_col2,df2_col2,df3_col2,df1_col3,df2_col3,df3_col3
NF1,1,8,2,1,4,8,6,5,8
NF2,3,4,6,2,7,2,8,8,0
NF3,4,9,2,5,7,2,4,8,5
NF4,5,,2,7,,4,2,,9
NF5,6,,2,2,,5,2,,8
but I'm open to suggestions on how to make the comparisons more readable.
Thanks!
Use concat with a list of DataFrames, pass the keys parameter to get the prefixes, and sort by column name:
dfs = [df1, df2, df3]
k = ('df1','df2','df3')
df = (pd.concat([x.set_index('Label') for x in dfs], axis=1, keys=k)
        .sort_index(axis=1, level=1)
        .rename_axis('Label')
        .reset_index())
df.columns = df.columns.map('_'.join).str.strip('_')
print(df)
Label df1_col1 df2_col1 df3_col1 df2_col1.1 df3_col1.1 df1_col2 \
0 NF1 1 8.0 2 4.0 8 1
1 NF2 3 4.0 6 7.0 2 2
2 NF3 4 9.0 2 7.0 2 5
3 NF4 5 NaN 2 NaN 4 7
4 NF5 6 NaN 2 NaN 5 2
df1_col3 df2_col3 df3_col3
0 6 5.0 8
1 8 8.0 0
2 4 8.0 5
3 2 NaN 9
4 2 NaN 8
You can use df.merge:
In [1965]: res = df1.merge(df2, on='Label', how='left', suffixes=('_df1', '_df2')).merge(df3, on='Label', how='left').rename(columns={'col1': 'col1_df3','col2':'col2_df3','col3':'col3_df3'})
In [1975]: res = res.reindex(sorted(res.columns), axis=1)
In [1976]: res
Out[1976]:
Label col1_df1 col1_df2 col1_df3 col2_df1 col2_df2 col2_df3 col3_df1 col3_df2 col3_df3
0 NF1 1 8.00 2 1 4.00 8 6 5.00 8
1 NF2 3 4.00 6 2 7.00 2 8 8.00 0
2 NF3 4 9.00 2 5 7.00 2 4 8.00 5
3 NF4 5 nan 2 7 nan 4 2 nan 9
4 NF5 6 nan 2 2 nan 5 2 nan 8
We can use pandas' join method, by setting the Label column as the index and then joining the dataframes:
dfs = [df1,df2,df3]
keys = ['df1','df2','df3']
# set Label as index and prefix each frame's columns
df1, *others = [frame.set_index("Label").add_prefix(f"{prefix}_")
                for frame, prefix in zip(dfs, keys)]
# join df1 with others
outcome = df1.join(others, how='outer').rename_axis(index='Label').reset_index()
outcome
Label df1_col1 df1_col2 df1_col3 df2_col1 df2_col2 df2_col3 df3_col1 df3_col2 df3_col3
0 NF1 1 1 6 8.0 4.0 5.0 2 8 8
1 NF2 3 2 8 4.0 7.0 8.0 6 2 0
2 NF3 4 5 4 9.0 7.0 8.0 2 2 5
3 NF4 5 7 2 NaN NaN NaN 2 4 9
4 NF5 6 2 2 NaN NaN NaN 2 5 8
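On making the comparison more readable: one option is to skip the flattening step and keep the two-level columns that `concat` with `keys` produces, with the column name as the outer level. The sketch below assumes the second header in df2 and df3 was meant to be `col2` (the question lists `col1` twice). The names `wide`, `by_col` and `col1_view` are only illustrative.

```python
import pandas as pd

labels = ["NF1", "NF2", "NF3", "NF4", "NF5"]
df1 = pd.DataFrame({"Label": labels, "col1": [1, 3, 4, 5, 6],
                    "col2": [1, 2, 5, 7, 2], "col3": [6, 8, 4, 2, 2]})
# Assumption: the duplicated "col1" header in the question stands for "col2".
df2 = pd.DataFrame({"Label": labels[:3], "col1": [8, 4, 9],
                    "col2": [4, 7, 7], "col3": [5, 8, 8]})
df3 = pd.DataFrame({"Label": labels, "col1": [2, 6, 2, 2, 2],
                    "col2": [8, 2, 2, 4, 5], "col3": [8, 0, 5, 9, 8]})

# Outer level = source frame, inner level = original column name.
wide = pd.concat([f.set_index("Label") for f in (df1, df2, df3)],
                 axis=1, keys=["df1", "df2", "df3"])

# Swap the levels so the same column from each frame sits side by side.
by_col = wide.swaplevel(axis=1).sort_index(axis=1)

# Selecting one outer key gives a small frame: one column per source frame.
col1_view = by_col["col1"]
print(col1_view)
```

With this layout, `by_col["col2"]` or `by_col["col3"]` gives the matching comparison for the other columns, and missing labels (NF4 and NF5 in df2) appear as NaN.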
I have two dataframes as follows:
df1:
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
2 0 7 6 5 8
df2:
M N O P Q R S T
0 1 2 3
1 4 5 6
2 7 8 9
3 8 6 5
4 5 4 3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert data_1 into df2 starting at position (0, 'P') (row, column). Is there any way to do it? I do not want to disturb the other columns in df2.
I can extract the value of each cell and assign it individually, but since I have to do this for a large dataset, doing it cell by cell is not practical.
Cellwise method:
>var1 = df1.iat[0,0]
>var2 = df1.iat[0,1]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2
If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
M N O P Q R S T
0 1 2 3 8.0 6.0 4.0 9.0 7.0
1 4 5 6 2.0 6.0 3.0 8.0 5.0
2 7 8 9
3 8 6 5
4 5 4 3
You can rename the columns and index labels of data_1 so that they match the target area of the second DataFrame, and then use DataFrame.update. The target position is given by the tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
                       index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
P Q R S T
2 8 6 4 9 7
3 2 6 3 8 5
df2.update(data_1)
print (df2)
M N O P Q R S T
0 1 2 3 NaN NaN NaN NaN NaN
1 4 5 6 NaN NaN NaN NaN NaN
2 7 8 9 8.0 6.0 4.0 9.0 7.0
3 8 6 5 2.0 6.0 3.0 8.0 5.0
4 5 4 3 NaN NaN NaN NaN NaN
How the rename works: df2.loc[:, pos[1]:] selects the columns from the given column onward, and df2.loc[pos[0]:] selects the rows from the given index label onward. Zipping these with the columns and index of data_1 and converting the pairs to dictionaries gives the two mappings, so both the column names and the index labels of data_1 are replaced by the corresponding ones from df2.
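A positional variant of the same idea, as a rough sketch: look up the integer position of the target cell and assign the block with `iloc`. The helper name `paste_block` is hypothetical, and the empty columns P to T are assumed here to hold NaN.

```python
import numpy as np
import pandas as pd

def paste_block(target, block, row_label, col_label):
    """Return a copy of target with block written starting at (row_label, col_label).

    Hypothetical helper: assumes unique index/column labels and that the
    block fits inside target (otherwise pandas raises on the shape mismatch).
    """
    r0 = target.index.get_loc(row_label)    # integer row position
    c0 = target.columns.get_loc(col_label)  # integer column position
    n_rows, n_cols = block.shape
    out = target.copy()
    out.iloc[r0:r0 + n_rows, c0:c0 + n_cols] = block.to_numpy()
    return out

df1 = pd.DataFrame({"A": [8, 2, 0], "B": [6, 6, 7], "C": [4, 3, 6],
                    "D": [9, 8, 5], "E": [7, 5, 8]})
df2 = pd.DataFrame({"M": [1, 4, 7, 8, 5], "N": [2, 5, 8, 6, 4],
                    "O": [3, 6, 9, 5, 3]})
for col in "PQRST":
    df2[col] = np.nan  # assumption: the blank columns are NaN

result = paste_block(df2, df1.loc[0:1], 0, "P")
print(result)
```

Because the block is passed as a plain array, the labels of `data_1` are ignored and only the shape matters, so no renaming is needed. The original `df2` is left untouched.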
I have a dataset which looks like below:
File_no A B Date Batch State
0 1 2 3 23-1-2019 2 3
1 2 7 6 23-1-2019 2 4
2 3 9 2 24-1-2019 1 2
3 5 6 3 24-1-2019 2 3
4 6 4 3 24-1-2019 1 4
5 8 2 3 25-1-2019 1 4
I want to group columns 'A' and 'B' by date and batch, and then shift the rows of these columns following the sequence of file numbers. For instance, in the above dataframe file no. 4 is missing.
I am able to achieve the shift, but I am not able to do it for every group individually.
For example, files 6 and 8 are not in sequence, but they are from different dates. So the shift should not be performed because it is missing a sequence.
diff = data['File_no'].diff().ne(1).cumsum()
grouped=data.groupby(['Date','Batch'])
grouped.apply(lambda data: data.groupby(diff)[['A','B']].shift())
This performs a shift whenever there is a missing sequence, but it doesn't take the groups into account.
Expected output:
   File_no    A    B       Date  Batch  State
0        1  NaN  NaN  23-1-2019      2      3
1        2    2    3  23-1-2019      2      4
2        3    9    2  24-1-2019      1      2
3        5  NaN  NaN  24-1-2019      2      3
4        6    6    3  24-1-2019      1      4
5        8    2    3  25-1-2019      1      4
I think you can pass the column names together with the helper Series to a single groupby:
diff = data['File_no'].diff().ne(1).cumsum()
data[['A','B']] = data.groupby(['Date','Batch',diff])[['A','B']].shift()
print (data)
File_no A B Date Batch State
0 1 NaN NaN 23-1-2019 2 3
1 2 2.0 3.0 23-1-2019 2 4
2 3 NaN NaN 24-1-2019 1 2
3 5 NaN NaN 24-1-2019 2 3
4 6 NaN NaN 24-1-2019 1 4
5 8 NaN NaN 25-1-2019 1 4
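The grouped shift above can be tried end to end with the sample data from the question. This is only a sketch: `run_id` is an illustrative name for the helper Series, and the columns are selected with a list (`[['A', 'B']]`), since recent pandas versions no longer accept a tuple of keys there.

```python
import pandas as pd

data = pd.DataFrame({
    "File_no": [1, 2, 3, 5, 6, 8],
    "A": [2, 7, 9, 6, 4, 2],
    "B": [3, 6, 2, 3, 3, 3],
    "Date": ["23-1-2019", "23-1-2019", "24-1-2019",
             "24-1-2019", "24-1-2019", "25-1-2019"],
    "Batch": [2, 2, 1, 2, 1, 1],
    "State": [3, 4, 2, 3, 4, 4],
})

# A new run starts wherever File_no does not follow its predecessor by 1.
run_id = data["File_no"].diff().ne(1).cumsum()

# Shift A and B only inside rows sharing Date, Batch and the same run.
data[["A", "B"]] = data.groupby(["Date", "Batch", run_id])[["A", "B"]].shift()
print(data)
```

Only file 2 has a predecessor with the same date, the same batch and a consecutive file number, so it is the only row that receives shifted values.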
EDIT:
r = np.arange(data['File_no'].min(), data['File_no'].max() + 1)
data = data.set_index('File_no').reindex(r)
diff = data.index.to_series().diff().ne(1).cumsum()
data[['A','B']] = data.groupby(['Date','Batch',diff])[['A','B']].shift()
data = data.dropna(how='all').reset_index()
print (data)
File_no A B Date Batch State
0 1 NaN NaN 23-1-2019 2.0 3.0
1 2 2.0 3.0 23-1-2019 2.0 4.0
2 3 NaN NaN 24-1-2019 1.0 2.0
3 5 NaN NaN 24-1-2019 2.0 3.0
4 6 9.0 2.0 24-1-2019 1.0 4.0
5 8 NaN NaN 25-1-2019 1.0 4.0
I am trying to get a rolling sum of the past 3 rows for the same ID, lagged by 1 row. My attempt is shown below, where i is the column. There has to be a way to do this, but this method doesn't seem to work.
for i in df.columns.values:
    df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3
I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd
# Reproducible datasets are your friend!
d = pd.DataFrame({'grp': pd.Series(['A']*4 + ['B']*5 + ['C']*6),
                  'x': pd.Series(range(15))})
print(d)
grp x
A 0
A 1
A 2
A 3
B 4
B 5
B 6
B 7
B 8
C 9
C 10
C 11
C 12
C 13
C 14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
grp x y
A 0 NaN
A 1 NaN
A 2 NaN
A 3 3.0
B 4 NaN
B 5 NaN
B 6 NaN
B 7 15.0
B 8 18.0
C 9 NaN
C 10 NaN
C 11 NaN
C 12 30.0
C 13 33.0
C 14 36.0
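One caveat about the one-liner: `.rolling(3)` runs over the whole shifted column, not per group. It still gives per-group results here because the NaN that `shift(1)` leaves at the start of each group blocks every full window that would span two groups. With `min_periods` smaller than the window, as in the question's attempt, values from the previous group could leak in. A sketch that keeps both steps inside each group, using the question's sample data and its `mean` with `min_periods=2` (the name `lagged_mean` is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "id":      [1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5],
    "dollars": [6, 7, 6, 7, 4, 4, 3, 6, 5, 6, 12, 7],
})

def lagged_mean(s):
    # Lag by one row, then average up to 3 previous rows (at least 2 needed).
    return s.shift(1).rolling(window=3, min_periods=2).mean()

# transform applies the function to each group's Series separately,
# so neither the shift nor the window can cross an id boundary.
df["lag"] = df.groupby("id")["dollars"].transform(lagged_mean)
print(df)
```

Swapping `.mean()` for `.sum()` inside `lagged_mean` gives the lagged rolling sum instead.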