I'm translating some data-cleaning work previously done in SPSS Modeler to Python. SPSS has a node called Restructure, and I'm trying to figure out how to do the same operation in Python, but I'm struggling to achieve this. What it does is combine every value in column X with the values in the other columns A, B, C, etc.
So, original dataframe looks like this:
Code  Freq1  Freq2
A01       1      7
B02       0      6
C03      17      8
And after transforming it it should look like this:
Code  Freq1  Freq2  A01_Freq1  A01_Freq2  B02_Freq1  B02_Freq2  C03_Freq1  C03_Freq2
A01       1      7          1          7        NaN        NaN        NaN        NaN
B02       0      6        NaN        NaN          0          6        NaN        NaN
C03      17      8        NaN        NaN        NaN        NaN         17          8
I've tried some pivoting stuff, but I guess this cannot be done in one step in Python...
Use DataFrame.set_index with DataFrame.unstack and DataFrame.sort_index to build a new DataFrame with a MultiIndex in the columns, then flatten the column labels with an f-string and finally add them to the original with DataFrame.join:
# Move 'Code' into the index, unstack it into the columns, and group the columns per code.
df1 = df.set_index('Code', append=True).unstack().sort_index(axis=1, level=1)
# Flatten the (Freq, Code) column MultiIndex into 'Code_Freq' labels.
df1.columns = df1.columns.map(lambda x: f'{x[1]}_{x[0]}')
df = df.join(df1)
print(df)
Code Freq1 Freq2 A01_Freq1 A01_Freq2 B02_Freq1 B02_Freq2 C03_Freq1 \
0 A01 1 7 1.0 7.0 NaN NaN NaN
1 B02 0 6 NaN NaN 0.0 6.0 NaN
2 C03 17 8 NaN NaN NaN NaN 17.0
C03_Freq2
0 NaN
1 NaN
2 8.0
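For what it's worth, the same reshape can also be spelled with DataFrame.pivot; a minimal sketch, assuming the default RangeIndex of the example:
import pandas as pd

df = pd.DataFrame({'Code': ['A01', 'B02', 'C03'],
                   'Freq1': [1, 0, 17],
                   'Freq2': [7, 6, 8]})

# Spread each row's Freq values into per-Code columns, keyed by the row index.
wide = df.pivot(columns='Code').sort_index(axis=1, level=1)
# Flatten the (Freq, Code) column MultiIndex into 'Code_Freq' labels.
wide.columns = [f'{code}_{freq}' for freq, code in wide.columns]
print(df.join(wide))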
I am trying to use np.where to compare whether the values from two columns are equal, but I am getting inconsistent results.
df['compare'] = np.where(df['a'] == df['b'], '0', '1')
Output:
a b compare
1B NaN 1
NaN NaN 1
NaN NaN 1
32X NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN 321 1
NaN Z51 1
NaN 3Y 1
It seemed strange that the command would return pairs of NaN as non-matches. I confirmed that columns 'a' and 'b' are both string data types.
I double-checked the original CSV files. Using the IF formula in Excel, I found several additional pairs of non-matches; the NaN pairs, however, were not flagged as non-matches in Excel.
Any tips on troubleshooting this issue?
NaN is a special value that is not equal to itself, so it should not be used in an equality test. You need to fill the DataFrame with comparable values beforehand:
# Replace NaN with a shared sentinel so two missing values compare equal.
df_ = df.fillna(0)
df['compare'] = np.where(df_['a'] == df_['b'], '0', '1')
a b compare
0 1B NaN 1
1 NaN NaN 0
2 NaN NaN 0
3 32X NaN 1
4 NaN NaN 0
5 NaN NaN 0
6 NaN NaN 0
7 NaN NaN 0
8 NaN NaN 0
9 NaN NaN 0
10 NaN 321 1
11 NaN Z51 1
12 NaN 3Y 1
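To see the root cause directly, and for a fill-free variant that counts a pair of NaNs as a match, here is a small sketch (column names 'a' and 'b' as in the question; the sample data is made up):
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: IEEE 754 NaN is unequal to everything, itself included

df = pd.DataFrame({'a': ['1B', np.nan, np.nan], 'b': [np.nan, np.nan, '3Y']})
# Match when the values are equal, or when both are missing.
equal = df['a'].eq(df['b']) | (df['a'].isna() & df['b'].isna())
df['compare'] = np.where(equal, '0', '1')
print(df)
This sidesteps choosing a fill sentinel that might collide with a genuine value in one of the columns.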
I have edited this post with the specific case.
I have a list of dataframes like this (note that df1 and df2 have a row in common):
df1
index  Date        A
0      2010-06-19  4
1      2010-06-20  3
2      2010-06-21  2
3      2010-06-22  1
4      2012-07-19  5
df2
index  Date        B
0      2012-07-19  5
1      2012-07-20  6
df3
index  Date        C
0      2020-06-19  5
1      2020-06-20  2
2      2020-06-21  9
df_list = [df1, df2, df3]
I would like to merge all the dataframes into a single dataframe, without losing rows and placing NaN where there is nothing to merge. The criterion is to merge on the 'Date' column, which should contain all the dates of all the merged dataframes, ordered by date.
The resulting dataframe should look like this:
index  Date        A    B    C
0      2010-06-19  4    NaN  NaN
1      2010-06-20  3    NaN  NaN
2      2010-06-21  2    NaN  NaN
3      2010-06-22  1    NaN  NaN
4      2012-07-19  5    5    NaN
5      2012-07-20  NaN  6    NaN
6      2020-06-19  NaN  NaN  5
7      2020-06-20  NaN  NaN  2
8      2020-06-21  NaN  NaN  9
I tried something like this:
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'], how='outer'), df_list)
But the resulting dataframe is not as expected: I am missing some columns and it is not ordered by date. I think I am missing something.
Thank you very much.
Use pandas.concat(). It takes a list of dataframes and appends them along the index, aligning common columns and filling missing columns with NaN as necessary:
new_df = pd.concat([df1, df2, df3])
Output:
>>> new_df
index Date A B C
0 0 2010-06-19 4.0 NaN NaN
1 1 2010-06-20 3.0 NaN NaN
2 2 2010-06-21 2.0 NaN NaN
3 3 2010-06-22 1.0 NaN NaN
0 0 2012-07-19 NaN 5.0 NaN
1 1 2012-07-20 NaN 6.0 NaN
0 0 2020-06-19 NaN NaN 5.0
1 1 2020-06-20 NaN NaN 2.0
2 2 2020-06-21 NaN NaN 9.0
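Note that pd.concat only stacks the frames, so the date shared by df1 and df2 (2012-07-19) still occupies two rows. A sketch of collapsing those afterwards, assuming at most one row per date within each frame:
# groupby sorts the dates; first() keeps the first non-null value per column.
new_df = pd.concat([df1, df2, df3]).groupby('Date', as_index=False).first()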
For overlapping data, I had to add the option sort=True in the lambda function. It seemed I was missing the ordering for big dataframes and was only seeing the NaNs at the start and end of the frames. Thank you all ;-)
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'],
how='outer', sort=True), df_list)
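An alternative that avoids reduce entirely is to align the frames on a 'Date' index and concatenate column-wise; a sketch, assuming dates are unique within each frame:
import pandas as pd

# Outer-align all frames on Date, then sort chronologically.
df_merged = (pd.concat([d.set_index('Date') for d in df_list], axis=1)
               .sort_index()
               .reset_index())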
I want to turn a dataframe with non-distinct values under each column header into a dataframe with the distinct values under each column header, with their number of occurrences in that column next to them. An example:
My initial dataframe is visible underneath:
A    B      C    D
0    CEN    T2   56
2    DECEN  T2   45
3    ONBEK  T2   84
NaN  CEN    T1   59
3    NaN    T1   87
NaN  NaN    T2   NaN
0    NaN    NaN  98
NaN  CEN    NaN  23
NaN  CEN    T1   65
where A, B, C and D are the column headers, each with 9 values underneath (blanks included).
My preferred output dataframe should look like this: first a column of unique values for each column in the original dataframe, and next to it their occurrence counts in that particular column.
A    B      C    D   A    B    C    D
0    CEN    T2   56  2    4    4    1
2    DECEN  T1   45  1    1    3    1
3    ONBEK  NaN  84  2    1    NaN  1
NaN  NaN    NaN  59  NaN  NaN  NaN  1
NaN  NaN    NaN  87  NaN  NaN  NaN  1
NaN  NaN    NaN  98  NaN  NaN  NaN  1
NaN  NaN    NaN  23  NaN  NaN  NaN  1
NaN  NaN    NaN  65  NaN  NaN  NaN  1
where A, B, C and D are the column headers, with underneath them first the distinct values of each column from the original .csv file, and next to them the occurrence count of each element in that particular column.
Any ideas?
The code below gets the unique values of each column into a new dataframe. I tried to use .value_counts to get the occurrences per column, but I failed to combine those with the unique values into one dataframe again.
new_df = pd.concat([pd.Series(df[i].unique()) for i in df.columns], axis=1)
new_df.columns = df.columns
new_df
The difficult part is keeping the values of the columns aligned in each row. To do this, construct a new dataframe from the unique values of each column, then pd.concat it with each column's value_counts mapped onto the corresponding column of this new dataframe.
# Build a frame of per-column unique values (transposed so the original columns line up),
# then drop the all-NaN rows left over from columns with fewer uniques.
new_df = (pd.DataFrame([df[c].unique() for c in df], index=df.columns).T
            .dropna(how='all'))
# Map each unique value to its count in the original column and append those count columns.
df_final = pd.concat([new_df,
                      *[new_df[c].map(df[c].value_counts()).rename(f'{c}_Count')
                        for c in df]], axis=1).reset_index(drop=True)
Out[1580]:
A B C D A_Count B_Count C_Count D_Count
0 0 CEN T2 56 2.0 4.0 4.0 1
1 2 DECEN T1 45 1.0 1.0 3.0 1
2 3 ONBEK NaN 84 2.0 1.0 NaN 1
3 NaN NaN NaN 59 NaN NaN NaN 1
4 NaN NaN NaN 87 NaN NaN NaN 1
5 NaN NaN NaN 98 NaN NaN NaN 1
6 NaN NaN NaN 23 NaN NaN NaN 1
7 NaN NaN NaN 65 NaN NaN NaN 1
If you only need to keep the alignment between each pair of a column and its count, such as A with A_Count, B with B_Count, and so on, you can simply use value_counts with reset_index and a few commands to rename the axes:
cols = df.columns.tolist() + (df.columns + '_Count').tolist()
new_df = pd.concat([df[col].value_counts(sort=False)
                           .rename_axis(col)
                           .reset_index(name=f'{col}_Count')
                    for col in df], axis=1).reindex(cols, axis=1)
Out[1501]:
A B C D A_Count B_Count C_Count D_Count
0 0.0 ONBEK T2 56.0 2.0 1.0 4.0 1
1 2.0 CEN T1 45.0 1.0 4.0 3.0 1
2 3.0 DECEN NaN 84.0 2.0 1.0 NaN 1
3 NaN NaN NaN 59.0 NaN NaN NaN 1
4 NaN NaN NaN 87.0 NaN NaN NaN 1
5 NaN NaN NaN 98.0 NaN NaN NaN 1
6 NaN NaN NaN 23.0 NaN NaN NaN 1
7 NaN NaN NaN 65.0 NaN NaN NaN 1
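A more compact sketch of that pairwise-aligned variant (the pairs dict is my own helper; column names as in the example):
import pandas as pd

pairs = {}
for col in df:
    vc = df[col].value_counts(sort=False)
    pairs[col] = pd.Series(vc.index)                   # the unique values, reindexed 0..n-1
    pairs[f'{col}_Count'] = vc.reset_index(drop=True)  # their counts, aligned row-wise
df_out = pd.concat(pairs, axis=1)[list(df) + [f'{c}_Count' for c in df]]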
I'm preparing data for machine learning, where the data is in a pandas DataFrame that looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column  v1  v2  first-v1  first-v2  second-v1  second-v2  third-v1  third-v2
first    1   2         1         2        NaN        NaN       NaN       NaN
second   3   4       NaN       NaN          3          4       NaN       NaN
third    5   6       NaN       NaN        NaN        NaN         5         6
What I've tried is something like this:
# we know how many value columns there are, but
# the list ['v1', 'v2', ...] could be any length
values = ['v1', 'v2']
# the data described above is stored in `data`
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns with names like:
['first-v1' 'second-v1' ..], ['first-v2' 'second-v2' ..]
which do contain the correct values. What am I doing wrong? Is there a more optimal way to do this, because my data is big?
Thank you for your time!
In your loop, data['Column'] + '-' + value is a whole Series, and str() turns it into one long string, so each assignment creates a single strangely-named column instead of one column per code. Instead, you can use unstack with swapping and sorting the MultiIndex in the columns:
df = (data.set_index('Column', append=True)[values]
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Finally, join back to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
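For completeness, the same wide layout can also be produced with DataFrame.pivot; a sketch, assuming the default RangeIndex of the example:
# Pivot the value columns out per 'Column' label, then order and flatten as above.
wide = data.pivot(columns='Column').swaplevel(0, 1, axis=1).sort_index(axis=1)
wide.columns = wide.columns.map('-'.join)
print(data.join(wide))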
Consider the pandas Series object below:
import numpy as np
import pandas as pd

index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have put some effort into pd.pivot_table and unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level=1).unstack(level=1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
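One caveat: the reshape errors out if the length is not an exact multiple of the period. A defensive sketch (the period of 4 and the column labels are taken from the example):
import pandas as pd

period = 4  # length of the repeating 'abcd' cycle
if len(df) % period:
    raise ValueError('index does not cycle in complete groups')
out = pd.DataFrame(df.to_numpy().reshape(-1, period), columns=df.index[:period])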