Insert value into column which is named in known column pandas - python

I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 NaN NaN NaN NaN
second 3 4 NaN NaN 3 4 NaN NaN
third 5 6 NaN NaN NaN NaN 5 6
What I've tried is something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns with the names:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
and these do contain the correct values. What am I doing wrong? Is there a more efficient way to do this, since my data is big?
Thank you for your time!

You can use unstack, then swap and sort the MultiIndex levels in the columns:
df = (data.set_index('Column', append=True)[values]
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Finally, join the result to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
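As for what goes wrong in the question's loop: data['Column'] + '-' + value is a whole Series, so str() turns all of it into one string, and that string becomes a single column name. A minimal sketch of a loop that does what was probably intended, assuming the sample data from the question (the names out and mask are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    "Column": ["first", "second", "third"],
    "v1": [1, 3, 5],
    "v2": [2, 4, 6],
})
values = ["v1", "v2"]

out = data.copy()
for name in data["Column"].unique():
    mask = data["Column"] == name  # rows that carry this label
    for value in values:
        # keep the value on matching rows, NaN everywhere else
        out[f"{name}-{value}"] = data[value].where(mask)

print(out)
```

This loop runs once per distinct label, so for data with many labels the unstack-based answers above are likely the better choice.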


append specific amount of empty rows to pandas dataframe

I want to append a specific number of empty rows to this df:
df = pd.DataFrame({'cow': [2, 4, 8],
'shark': [2, 0, 0],
'pudle': [10, 2, 1]})
With df = df.append(pd.Series(), ignore_index=True) I can append one empty row. How can I append x rows?
You can use df.reindex to achieve this goal.
df.reindex(list(range(0, 10))).reset_index(drop=True)
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
The labels you pass to df.reindex determine the total number of rows in the new DataFrame. So if your DataFrame has 3 rows, passing the labels 0 through 9 adds 7 new rows.
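The reindex idea can be wrapped so that the caller passes the number of rows to add rather than the final length. A minimal sketch; add_empty_rows is a hypothetical helper name, and it assumes the existing index can be discarded:

```python
import pandas as pd

def add_empty_rows(df, n):
    """Return df with n all-NaN rows added at the bottom (illustrative helper)."""
    out = df.reset_index(drop=True)            # make sure labels are 0..len-1
    return out.reindex(range(len(out) + n))    # labels not present become NaN rows

df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
padded = add_empty_rows(df, 7)
print(padded.shape)  # (10, 3)
```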
I'm not too pandas savvy, but if you can already add one empty row, why not just try writing a for loop and appending x times?
for i in range(x):
    df = df.append(pd.Series(), ignore_index = True)
You could do:
import pandas as pd
df = pd.DataFrame({'cow': [2, 4, 8],
'shark': [2, 0, 0],
'pudle': [10, 2, 1]})
n = 10
df = df.append([[] for _ in range(n)], ignore_index=True)
print(df)
Output
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
Try reindex (chain .reset_index(drop=True) if you want the repeated index labels renumbered):
out = df.reindex(df.index.tolist() + [df.index.max() + 1] * 5)  # .reset_index(drop=True)
Out[93]:
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
Create an empty dataframe of the appropriate size and append it:
import numpy as np
df = df.append(pd.DataFrame([[np.nan] * df.shape[1]] * n,columns=df.columns),
ignore_index = True)
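DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea can be written with pd.concat. A minimal sketch, assuming n is the number of empty rows to add (the value 4 and the name empty are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
n = 4  # number of empty rows to add (assumed value)

# An all-NaN frame with the same columns, then one concat instead of n appends.
empty = pd.DataFrame(np.nan, index=range(n), columns=df.columns)
df = pd.concat([df, empty], ignore_index=True)
print(df)
```

Building the block once and concatenating avoids copying the whole frame on every iteration, which is what the loop over append does.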

Merge rows of the same dataframe by columns

I am new to Python and I cannot solve the following problem.
I have a dataframe that looks like this:
ID X1 x2 x3
1 15 NaN NaN
2 NaN 2 NaN
3 NaN NaN 5
1 NaN 16 NaN
2 1 NaN NaN
3 6 NaN NaN
4 NaN NaN 75
5 NaN 67 NaN
I want to merge the rows by ID, so that the result looks like this:
ID X1 x2 x3
1 15 16 NaN
2 1 2 NaN
3 6 NaN 5
4 NaN NaN 75
5 NaN 67 NaN
I have tried a lot with df.groupby("ID"), without success.
Can someone help me fix that and supply the code? Thanks.
You can change your existing groupby like this. You can remove the replace part if you would like 0.0 instead of NaN:
import numpy as np
df = df.fillna(0).astype(int).groupby('ID').sum().replace(0,np.nan)
print(df)
Output:
ID X1 x2 x3
1 15.0 16.0 NaN
2 1.0 2.0 NaN
3 6.0 NaN 5.0
4 NaN NaN 75.0
5 NaN 67.0 NaN
If you don't want ID as index, you can add reset_index:
import numpy as np
df = df.fillna(0).astype(int).groupby('ID').sum().replace(0,np.nan).reset_index()
print(df)
Output:
ID X1 x2 x3
0 1 15.0 16.0 NaN
1 2 1.0 2.0 NaN
2 3 6.0 NaN 5.0
3 4 NaN NaN 75.0
4 5 NaN 67.0 NaN
Try this:
df1 = df.groupby('ID',as_index=False,sort=False).last()
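Note that the sum-based answer also turns a genuine total of 0 into NaN, while first() and last() pick the first or last non-null value per column within each group. A minimal sketch with first(), assuming the sample data from the question and at most one value of interest per ID and column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 4, 5],
    "X1": [15, np.nan, np.nan, np.nan, 1, 6, np.nan, np.nan],
    "x2": [np.nan, 2, np.nan, 16, np.nan, np.nan, np.nan, 67],
    "x3": [np.nan, np.nan, 5, np.nan, np.nan, np.nan, 75, np.nan],
})

# first() skips NaN, so each group collapses to its first non-null value per column
merged = df.groupby("ID", as_index=False).first()
print(merged)
```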

Get unique values and their occurrence out of one dataframe into a new dataframe using Pandas DataFrame

I want to turn my dataframe, which has repeated values in each column, into a dataframe that lists the distinct values of each column with, next to them, the number of times each one occurs in that column. An example:
My initial dataframe is visible underneath:
A B C D
0 CEN T2 56
2 DECEN T2 45
3 ONBEK T2 84
NaN CEN T1 59
3 NaN T1 87
NaN NaN T2 NaN
0 NaN NaN 98
NaN CEN NaN 23
NaN CEN T1 65
where A, B, C and D are the column headers with each 9 values underneath it (blanks included).
My preferred output dataframe should look like: (first a column of unique values for each column in the original dataframe and next to it their occurrence in that particular column)
A B C D A B C D
0 CEN T2 56 2 4 4 1
2 DECEN T1 45 1 1 3 1
3 ONBEK NaN 84 2 1 NaN 1
NaN NaN NaN 59 NaN NaN NaN 1
NaN NaN NaN 87 NaN NaN NaN 1
NaN NaN NaN 98 NaN NaN NaN 1
NaN NaN NaN 23 NaN NaN NaN 1
NaN NaN NaN 65 NaN NaN NaN 1
where A, B, C and D are the column headers, with underneath them first the distinct values for each column from the original .csv file and next to them the occurrence of each element in its particular column.
Any ideas?
The code below is used to get the unique values out of each column into a new dataframe. I tried to do something with .value_counts to get the occurrence in each column, but I failed to get it back into one dataframe together with the unique values.
df
new_df=pd.concat([pd.Series(df[i].unique()) for i in df.columns], axis=1)
new_df.columns=df.columns
new_df
The difficult part is keeping the values of each column aligned row by row. To do this, build a new dataframe from the unique values of each column, then map value_counts onto each of its columns and pd.concat the counts alongside it.
new_df = (pd.DataFrame([df[c].unique() for c in df], index=df.columns).T
.dropna(how='all'))
df_final = pd.concat([new_df, *[new_df[c].map(df[c].value_counts()).rename(f'{c}_Count')
for c in df]], axis=1).reset_index(drop=True)
Out[1580]:
A B C D A_Count B_Count C_Count D_Count
0 0 CEN T2 56 2.0 4.0 4.0 1
1 2 DECEN T1 45 1.0 1.0 3.0 1
2 3 ONBEK NaN 84 2.0 1.0 NaN 1
3 NaN NaN NaN 59 NaN NaN NaN 1
4 NaN NaN NaN 87 NaN NaN NaN 1
5 NaN NaN NaN 98 NaN NaN NaN 1
6 NaN NaN NaN 23 NaN NaN NaN 1
7 NaN NaN NaN 65 NaN NaN NaN 1
If you only need to keep each column aligned with its own count (A with A_Count, B with B_Count, and so on), you can simply use value_counts with rename_axis and reset_index to name the resulting columns:
new_cols = df.columns.tolist() + (df.columns + '_Count').tolist()
new_df = pd.concat([df[col].value_counts(sort=False).rename_axis(col).reset_index(name=f'{col}_Count')
                    for col in df], axis=1).reindex(new_cols, axis=1)
Out[1501]:
A B C D A_Count B_Count C_Count D_Count
0 0.0 ONBEK T2 56.0 2.0 1.0 4.0 1
1 2.0 CEN T1 45.0 1.0 4.0 3.0 1
2 3.0 DECEN NaN 84.0 2.0 1.0 NaN 1
3 NaN NaN NaN 59.0 NaN NaN NaN 1
4 NaN NaN NaN 87.0 NaN NaN NaN 1
5 NaN NaN NaN 98.0 NaN NaN NaN 1
6 NaN NaN NaN 23.0 NaN NaN NaN 1
7 NaN NaN NaN 65.0 NaN NaN NaN 1
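The pairwise variant can be tried on a cut-down version of the sample data. A minimal sketch, assuming only columns A and B are kept (the names parts and result are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0, 2, 3, np.nan, 3, np.nan, 0, np.nan, np.nan],
    "B": ["CEN", "DECEN", "ONBEK", "CEN", np.nan, np.nan, np.nan, "CEN", "CEN"],
})

parts = []
for col in df.columns:
    counts = df[col].value_counts()  # NaN is excluded by default
    # index -> column named after the source column, counts -> "<col>_Count"
    parts.append(counts.rename_axis(col).reset_index(name=f"{col}_Count"))

# Side by side: each value column sits next to its own count column.
result = pd.concat(parts, axis=1)
print(result)
```

Each value stays aligned with its own count, but rows are not aligned across different source columns.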

Columns appending is troublesome with Pandas

Here is what I have tried and what error I received:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> df
A B C D
0 1 5 0 1
1 2 4 0 1
2 3 3 0 1
3 4 2 0 1
4 5 1 0 1
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> first = [2,2,2,2,2,2,2,2,2,2,2,2]
>>> first = pd.DataFrame(first).T
>>> first.index = [2]
>>> df = df.join(first)
>>> df
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
>>> second = pd.DataFrame(second).T
>>> second.index = [1]
>>> df = df.join(second)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6815, in join
rsuffix=rsuffix, sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6830, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 48, in merge
return op.get_result()
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 552, in get_result
rdata.items, rsuf)
File "C:\Python35\lib\site-packages\pandas\core\internals\managers.py", line 1972, in items_overlap_with_suffix
'{rename}'.format(rename=to_rename))
ValueError: columns overlap but no suffix specified: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='object')
I am trying to create new lists holding the extra columns, which I have to add at specific index positions of the main dataframe df.
When I tried it with first, it worked, as you can see in the output. But when I tried the same with second, I received the error above.
Kindly let me know what I can do in this situation to achieve the result I am expecting.
Use DataFrame.combine_first instead of join if you need to assign into the same columns created before, then use DataFrame.reindex with a list of columns to get the expected ordering:
df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
orig = df.columns.tolist()
first = [2,2,2,2,2,2,2,2,2,2,2,2]
first = pd.DataFrame(first).T
first.index = [2]
df = df.combine_first(first)
second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
second = pd.DataFrame(second).T
second.index = [1]
df = df.combine_first(second)
df = df.reindex(orig + first.columns.tolist(), axis=1)
print (df)
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yes, this is expected behaviour, because join works much like an SQL join: it joins on the index and concatenates all the columns together. The problem is that join will not produce two columns with the same name. If both dataframes have a column with the same name, it looks for a suffix to add to those columns to avoid the clash, and raises this error if none is given. This is controlled with the lsuffix and rsuffix arguments of the join method.
Conclusion: 2 ways to solve this:
Either provide a suffix so that pandas is able to resolve the name clashes; or
Make sure that you don't have overlapping columns
You have to specify the suffixes since the column names are the same. Assuming you are trying to add the second values as new columns horizontally:
df = df.join(second, lsuffix='first', rsuffix='second')
A B C D 0first 1first 2first 3first 4first 5first ... 10second 11second 12 13 14 15 16 17 18 19
0 1 5 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
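The overlap that triggers the error can be checked before joining. A minimal sketch with cut-down frames (the small shapes and the names overlap, join_failed and combined are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
first = pd.DataFrame([[2, 2]], index=[2])       # columns 0, 1
second = pd.DataFrame([[3, 3, 3]], index=[1])   # columns 0, 1, 2

df = df.join(first)                             # no shared column names yet

# Column labels present on both sides; join refuses these without a suffix.
overlap = df.columns.intersection(second.columns)

try:
    df.join(second)
    join_failed = False
except ValueError:
    join_failed = True

# combine_first fills the gaps in df from second and adds its new columns.
combined = df.combine_first(second)
print(list(overlap), join_failed)
print(combined)
```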

I want to subtract each column from the previous non-null column using the diff function

I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output.
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
                df.ffill(axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
Actually, diff is working as expected: you are computing differences against NaN values, and any such difference is NaN.
For your case, just restore the first column from the original dataframe and you should be fine:
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0
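Since diff(axis=1) subtracts the immediately preceding column, any column that follows an all-NaN column comes out as NaN. A minimal sketch that sidesteps this, assuming (as in the question's sample) that the columns to skip are entirely NaN; the names dense and out are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": [1, 3, 2],
    "B": [np.nan, np.nan, np.nan],
    "C": [3, 8, 6],
    "D": [7, 10, 11],
})

dense = df1.dropna(axis=1, how="all")      # drop the all-NaN columns
out = dense.diff(axis=1)                   # each column minus the previous kept column
out[dense.columns[0]] = dense.iloc[:, 0]   # the first column has nothing to subtract
out = out.reindex(columns=df1.columns)     # put the dropped columns back as NaN
print(out)
```

If NaN values are scattered within rows rather than filling whole columns, the where/ffill approach from the first answer is the one to use.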
