Handling Variable Number of Columns Dataframe - Python


I am trying to write a list of lists to an Excel sheet using pandas.
The list looks like:
List_of_Lists = [[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
........,
]
The main list could contain up to 1,000 of these inner lists.
I also want to label the columns, e.g. column1, column2, and so on up to column100, all on the same sheet. Can anyone familiar with pandas help me? This is probably easy for someone who knows the library.

I believe you can just pass the list into pd.DataFrame() and you will just get NaNs for the values that don't exist.
For example:
import pandas as pd
List_of_Lists = [[1,2,3,4],
[5,6,7],
[9,10],
[11]]
df = pd.DataFrame(List_of_Lists)
print(df)
0 1 2 3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Then to get the naming the way you want just use pandas.DataFrame.add_prefix
df = df.add_prefix('Column')
print(df)
Column0 Column1 Column2 Column3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
If you instead want each inner list to become a column, transpose List_of_Lists first. zip_longest pads the shorter lists with None, which pandas then shows as NaN:
from itertools import zip_longest
df = pd.DataFrame(list(map(list, zip_longest(*List_of_Lists))))
print(df)
0 1 2 3
0 1 5.0 9.0 11.0
1 2 6.0 10.0 NaN
2 3 7.0 NaN NaN
3 4 NaN NaN NaN
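Putting the pieces together for the original question, here is a minimal sketch that builds the frame from ragged lists and labels the columns starting at 1 rather than 0. The variable name `list_of_lists`, the `Column` prefix, and the file name `output.xlsx` are illustrative assumptions; writing to Excel also needs an engine such as openpyxl installed, so that call is left commented out.

```python
import pandas as pd

# Ragged input: inner lists of different lengths (sample data from the answer).
list_of_lists = [[1, 2, 3, 4], [5, 6, 7], [9, 10], [11]]

# Short rows are padded with NaN automatically.
df = pd.DataFrame(list_of_lists)

# Label the columns Column1, Column2, ... instead of the default 0, 1, ...
df.columns = [f"Column{i}" for i in range(1, df.shape[1] + 1)]

print(df)

# Writing to a sheet requires an Excel engine (for example openpyxl);
# the file name here is only a placeholder.
# df.to_excel("output.xlsx", index=False)
```

If every inner list should be a column instead of a row, the same renaming step can be applied to the transposed frame.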

Related

Is there a way to add NaN values based on another dataframe?

Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(Both dataframes are the same, except that Imputation has the NaNs filled in.)
I would like to reintroduce the NaN values into column A of the imputation df, so that it looks like this (columns B and C are filled in, but A keeps its NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
'B':[4,5,np.nan],
'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want, because this fills in every missing value. Is there a way to reintroduce the NaN values, or to fill NA only for specific columns? I'm not sure what the best approach is to get the intended outcome.

You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0

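You can also use DataFrame.update, which modifies the frame in place (here df is dfOrginal and im is dfImputation). Note that update overwrites every value for which the other frame has a non-NA entry, not only the NaNs: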
df.update(im[['B','C']])
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
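For comparison, a small self-contained sketch of both answers on the sample data from the question. The names `original`, `imputed`, `filled` and `updated` are illustrative, not from the original posts.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"A": [np.nan, 2, np.nan],
                         "B": [4, 5, np.nan],
                         "C": [7, np.nan, 9]})
imputed = pd.DataFrame({"A": [1, 2, 3],
                        "B": [4, 5, 6],
                        "C": [7, 8, 9]})

# fillna returns a new frame; only B and C are offered as fill values,
# so the NaNs in A are left alone.
filled = original.fillna(imputed[["B", "C"]])

# update works in place, so apply it to a copy to keep `original` intact.
updated = original.copy()
updated.update(imputed[["B", "C"]])

print(filled)
print(updated)
```

Both give the same result here because the two frames only differ where the original has NaN; with differing values, update would also overwrite the existing entries in B and C.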

Fill missing values of pandas based on values of other columns

I have the following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill the null values of A with the values from B or D, i.e. if the value in B is null, then check D. The resulting dataframe should look like this:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using the following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line, how can I do that?

You could simply chain both .fillna() calls:
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or using fillna with combine_first:
df['A'] = df.A.fillna(df.B.combine_first(df.D))

If you would rather not chain calls because there are many fallback columns, back-fill the missing values across those columns and then select the first column by position:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print (df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
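A runnable sketch of the back-fill approach on the sample frame from the question. The helper names `fallbacks` and `first_valid` are illustrative; the list can hold any number of fallback columns, in priority order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 3.0, np.nan, np.nan],
    "B": [2.0, 4.0, np.nan, 3.0],
    "C": [np.nan] * 4,
    "D": [0, 1, 5, 4],
})

# Fallback columns in priority order: try B first, then D.
fallbacks = ["B", "D"]

# Back-filling along the row pulls the first non-null fallback value
# into the leftmost column, which is then selected by position.
first_valid = df[fallbacks].bfill(axis=1).iloc[:, 0]

df["A"] = df["A"].fillna(first_valid)
print(df)
```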

Insert value into column which is named in known column pandas

I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 NaN NaN NaN NaN
second 3 4 NaN NaN 3 4 NaN NaN
third 5 6 NaN NaN NaN NaN 5 6
What I've tried is something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
data[ str(data['Column'] + '-' + value)] = data[ value]
The result is columns named:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
which contain the correct values. What am I doing wrong? Is there a more efficient way to do this, given that my data is large?
Thank you for your time!

You can use unstack, then swap and sort the levels of the resulting MultiIndex in the columns:
df = (data.set_index('Column', append=True)[values].unstack()
.swaplevel(0, 1, axis=1).sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Finally, join the result back to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
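As for why the loop in the question misbehaves: str(data['Column'] + '-' + value) converts the whole Series into a single string, so each pass creates one oddly named column rather than one column per row label. Below is a minimal runnable sketch of the unstack approach on the sample data from the question; the names `wide` and `result` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"Column": ["first", "second", "third"],
                     "v1": [1, 3, 5],
                     "v2": [2, 4, 6]})
values = ["v1", "v2"]

# Move 'Column' into the row index, then unstack it into the columns.
# The column MultiIndex is (value, label); swap it to (label, value)
# and sort so that first-v1, first-v2, second-v1, ... come out in order.
wide = (data.set_index("Column", append=True)[values]
            .unstack()
            .swaplevel(0, 1, axis=1)
            .sort_index(axis=1))

# Flatten the two column levels into 'label-value' names.
wide.columns = wide.columns.map("-".join)

# Attach the new columns to the original frame.
result = data.join(wide)
print(result)
```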

reshape a pandas dataframe index to columns

Consider the below pandas Series object,
index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index = index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have tried pd.pivot_table and unstack, and the solution probably lies in the correct use of one of them. The closest I have got is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df1.values, append = True, drop = False).unstack(level = 0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0

A slightly more general solution uses cumcount to get the new index values and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
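The same idea as a self-contained sketch with explicit column names, which avoids relying on the default 'index' and 0 labels that reset_index produces. The names `key`, `value`, `row`, `long` and `wide` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

index = list("abcdabcdabcd")
s = pd.Series(np.arange(len(index)), index=index)

# Turn the Series into a long frame with named columns.
long = s.rename_axis("key").reset_index(name="value")

# cumcount numbers the occurrences of each key: 0 for the first 'a',
# 1 for the second 'a', and so on. That number becomes the output row.
long["row"] = long.groupby("key").cumcount()

# Pivot: one row per occurrence number, one column per key.
wide = long.pivot(index="row", columns="key", values="value")

# Drop the axis names so the output matches the desired layout.
wide.index.name = None
wide.columns.name = None
print(wide)
```

Unlike the reshape approach, this does not assume that the index cycles in a fixed order or that the period is known in advance.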

Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

Pandas CONCAT() with merged columns in Creation

I am trying to create a very large dataframe, made up of one column from each of many smaller dataframes (renamed to the dataframe name). I am using CONCAT() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The CONCAT() join_axes is the index common to all the dataframes. This works fine, but I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use CONCAT() exactly as I am, but merge the columns to produce an output like so?

I think you need:
df = pd.concat([df1, df2])
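Or, if the column names are duplicated, group by column name so that overlapping values are summed. Passing min_count=1 keeps all-NaN groups as NaN instead of 0, and transposing avoids groupby(axis=1), which is deprecated in recent pandas versions:
print (df.T.groupby(level=0).sum(min_count=1).T)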
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
'B':[1,np.nan,np.nan,9],
'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
'B':[1,2,np.nan,np.nan],
'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.T.groupby(level=0).sum(min_count=1).T)
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0

What you want is df1.combine_first(df2). Refer to pandas documentation.
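A short sketch of combine_first on the same sample frames as the previous answer (the name `combined` is illustrative). Unlike the groupby sum, it does not add overlapping values: it keeps the value from df1 and only falls back to df2 where df1 is NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [5, 8, 7, np.nan],
                    "B": [1, np.nan, np.nan, 9],
                    "C": [7, 3, np.nan, 0]})
df2 = pd.DataFrame({"A": [np.nan, np.nan, np.nan, 2],
                    "B": [1, 2, np.nan, np.nan],
                    "C": [np.nan, 6, np.nan, 3]})

# Values from df1 win; gaps in df1 are filled from df2 where available.
combined = df1.combine_first(df2)
print(combined)
```

Which of the two is appropriate depends on whether overlapping entries should be added together or treated as alternatives.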
