Transpose dataframe with cells as sum over columns - python

I have a dataframe in the following form:
      x_30d  x_60d  y_30d  y_60d
127     1.0    1.0    0.0    1.0
223     1.0    0.0    1.0    NaN
1406    1.0    NaN    1.0    0.0
2144    1.0    0.0    1.0    1.0
2234    1.0    0.0    NaN    NaN
I need to transform it into the following form (where each cell is the sum over each column above):
   30d  60d
x    5    1
y    3    2
I've tried using dictionaries, splitting columns, melting the dataframe, transposing it, etc., but I cannot seem to get the correct pattern.
To make things slightly more complicated, here are some actual column names that have a mix of forms for date ranges: PASC_new_aches_30d_60d, PASC_new_aches_60d_180d, ... PASC_new_aches_360d, ..., PASC_new_jt_pain_180d_360d, ...

In [131]: new = df.sum()

In [132]: new.index = pd.MultiIndex.from_frame(
     ...:     new.index.str.extract(r"^(.*?)_(\d+d.*)$"))

In [133]: new
Out[133]:
0               1
PASC_new_aches  30d_60d     5.0
                60d_180d    1.0
x               30d         3.0
PASC_new_aches  360d        2.0
dtype: float64

In [134]: new.unstack()
Out[134]:
1               30d  30d_60d  360d  60d_180d
0
PASC_new_aches  NaN      5.0   2.0       1.0
x               3.0      NaN   NaN       NaN
The steps:

- sum as usual per column
- the original's columns are now at the index; we need to split them using a regex: ^(.*?)_(\d+d.*)$
  - ^: the beginning
  - (.*?): anything, but non-greedily, until...
  - _(\d+d.*): ...an underscore followed by the digits-plus-d pattern, and anything after it
  - $: the end
- while splitting, we extracted the parts before & after the underscore with the (...) groups
- make them the new index (a MultiIndex now)
- unstack the inner level to become the new columns, i.e., the parts after "_"
- note that the "1" and "0" at the top left are the "name"s of the axes of the frame; 0 is that of df.index, 1 is that of df.columns. They are there due to pd.MultiIndex.from_frame and can be removed with .rename_axis(index=None, columns=None).
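Putting the steps together on the simple x/y columns from the question (a self-contained sketch; the sample values just mirror the table at the top):

import pandas as pd
import numpy as np

# sample values mirror the table in the question
df = pd.DataFrame({
    "x_30d": [1.0, 1.0, 1.0, 1.0, 1.0],
    "x_60d": [1.0, 0.0, np.nan, 0.0, 0.0],
    "y_30d": [0.0, 1.0, 1.0, 1.0, np.nan],
    "y_60d": [1.0, np.nan, 0.0, 1.0, np.nan],
}, index=[127, 223, 1406, 2144, 2234])

new = df.sum()                                    # sum per column
new.index = pd.MultiIndex.from_frame(             # split "x_30d" -> ("x", "30d")
    new.index.str.extract(r"^(.*?)_(\d+d.*)$"))
out = new.unstack().rename_axis(index=None, columns=None)
print(out)
#    30d  60d
# x  5.0  1.0
# y  3.0  2.0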

One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_sep='_')
)

  other  30d  60d
0     x  5.0  1.0
1     y  3.0  2.0
The .value determines which parts of the columns remain as column headers.
If your dataframe looks complicated (based on the columns you shared):
      PASC_new_aches_30d_60d  PASC_new_aches_60d_180d  PASC_new_aches_360d  PASC_new_jt_pain_180d_360d
127                      1.0                      1.0                  0.0                         1.0
223                      1.0                      0.0                  1.0                         NaN
1406                     1.0                      NaN                  1.0                         0.0
2144                     1.0                      0.0                  1.0                         1.0
2234                     1.0                      0.0                  NaN                         NaN
then a regex, similar to @MustafaAydin's answer above, works better:
(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_pattern=r"(\D+)_(.+)")
)

              other  30d_60d  60d_180d  360d  180d_360d
0    PASC_new_aches      5.0       1.0   3.0        NaN
1  PASC_new_jt_pain      NaN       NaN   NaN        2.0
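If installing pyjanitor isn't an option, the str.extract / MultiIndex / unstack approach from the first answer handles these longer names as well; a rough, self-contained sketch (sample values mirror the table above):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "PASC_new_aches_30d_60d":     [1.0, 1.0, 1.0, 1.0, 1.0],
    "PASC_new_aches_60d_180d":    [1.0, 0.0, np.nan, 0.0, 0.0],
    "PASC_new_aches_360d":        [0.0, 1.0, 1.0, 1.0, np.nan],
    "PASC_new_jt_pain_180d_360d": [1.0, np.nan, 0.0, 1.0, np.nan],
}, index=[127, 223, 1406, 2144, 2234])

new = df.sum()
new.index = pd.MultiIndex.from_frame(
    new.index.str.extract(r"^(.*?)_(\d+d.*)$"))
print(new.unstack().rename_axis(index=None, columns=None))
#                   180d_360d  30d_60d  360d  60d_180d
# PASC_new_aches          NaN      5.0   3.0       1.0
# PASC_new_jt_pain        2.0      NaN   NaN       NaN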

Related

How to populate NaN by 0, starting after first non-nan value

I need to populate NaN values of my df by a static 0, starting from the first non-nan value.
In a way, combining method="ffill" (identify the first value per column, and only act on following NaN values) with value=0 (populating by 0, not the variable quantity in df).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df:

     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  NaN  3.0  NaN
3  NaN  NaN  4.0
Desired output:
     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  0.0  3.0  0.0
3  0.0  0.0  4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and time matters. We are talking ~60M rows and 4k columns - so looping is out of the question, and masking only if it's really, really fast.
You can try mask(), ffill() and fillna():
df = df.fillna(df.mask(df.ffill().notna(), 0))

# OR via where
df = df.fillna(df.where(df.ffill().isna(), 0))
output:

     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  0.0  3.0  0.0
3  0.0  0.0  4.0
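A self-contained sketch of the same idea on the question's small example, showing that the leading NaNs are left alone:

import pandas as pd
import numpy as np

df = pd.DataFrame({0: [np.nan, 6.0, np.nan, np.nan],
                   1: [np.nan, np.nan, 3.0, np.nan],
                   2: [np.nan, 1.0, np.nan, 4.0]})

# 0 wherever a value has already appeared in the column, NaN before that
fill_values = df.mask(df.ffill().notna(), 0)

# only the NaN cells of df pick those 0s up; leading NaNs stay NaN
print(df.fillna(fill_values))
#      0    1    2
# 0  NaN  NaN  NaN
# 1  6.0  NaN  1.0
# 2  0.0  3.0  0.0
# 3  0.0  0.0  4.0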

Conditional row-wise fillna when previous' row condition is also met

Suppose I have the following df
import pandas as pd
import numpy as np
test = pd.DataFrame(data={
    'a': [1, np.nan, np.nan, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1]
})
I would like to use pd.fillna(method='ffill'), but only when two separate conditions are both met:
Both elements of a row are NaN
The element of the previous row of column 'b' is 10
Note: the first row can never be NaN
I am looking for a smart way - maybe a lambda expression or a vectorized form - avoiding for loops or .iterrows().
Required result:
result = pd.DataFrame(data={
    'a': [1, 1, 1, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, 10, 10, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1]
})
You can test whether the forward-filled value in b is 10 and whether all columns in the row are missing, then pass the result to DataFrame.mask together with ffill:
mask = test['b'].ffill().eq(10) & test.isna().all(axis=1)
test = test.mask(mask, test.ffill())
print (test)

      a     b
0   1.0  10.0
1   1.0  10.0
2   1.0  10.0
3   4.0   1.0
4   NaN   NaN
5   5.0   1.0
6   6.0   1.0
7   7.0   NaN
8   8.0   1.0
9   NaN   1.0
10  NaN   NaN
11  6.0   1.0

How to identify empty cells in a CSV file using pandas

I am taking a column from a csv file and inputting the data from it into an array using pandas. However, many of the cells are empty and get saved in the array as 'nan'. I want to either identify the empty cells so I can skip them or remove them all from the array after. Something like the following pseudo-code:
if df.row(column number) == nan:
    skip

or

if df.row(column number) != nan:
    do stuff

Basically, how do I identify whether a cell from the CSV file is empty?
Best is to get rid of the NaN rows after you load it, by indexing:
df = df[df['column_to_check'].notnull()]
For example to get rid of NaN values found in column 3 in the following dataframe:
>>> df
     0    1    2    3    4
0  1.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
2  NaN  NaN  NaN  NaN  NaN
3  NaN  1.0  1.0  NaN  NaN
4  1.0  NaN  NaN  1.0  1.0

>>> df[df[3].notnull()]
     0    1    2    3    4
0  1.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
4  1.0  NaN  NaN  1.0  1.0
pd.isnull() and pd.notnull() are standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column as you suggest in your code above. You could then use this expression to do whatever you like with that value.
Example:
import pandas as pd
import numpy as np
a = np.nan
pd.isnull(a)
Out[4]: True
pd.notnull(a)
Out[5]: False
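To mirror the pseudocode in the question, a rough sketch of skipping empty cells while iterating; the column name my_column and the inline frame are only placeholders for your pd.read_csv() result:

import pandas as pd
import numpy as np

# stand-in for your pd.read_csv(...) result; empty CSV cells come in as NaN
df = pd.DataFrame({"my_column": [1.0, np.nan, 3.0, np.nan]})

for value in df["my_column"]:
    if pd.isnull(value):   # empty cell -> skip it
        continue
    print(value)           # "do stuff" with the non-empty value
# 1.0
# 3.0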
If you want to manipulate all (or certain) NaN values from a DataFrame, handling missing data is a big topic when working with tabular data and there are many methods of doing so. I'd recommend chapter 7 from this book; its first section would be most pertinent to your question.
If you just want to exclude missing values, you can use pd.DataFrame.dropna()
Below is an example based on the one described by @sacul:
>>> import pandas as pd
>>> df
     0    1    2    3    4
0  0.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
2  NaN  NaN  NaN  NaN  NaN
3  NaN  1.0  1.0  NaN  NaN
4  1.0  NaN  NaN  1.0  1.0

>>> df.dropna(axis=0, subset=['3'])
     0    1    2    3    4
0  0.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
4  1.0  NaN  NaN  1.0  1.0
axis=0 indicates that rows containing NaN are excluded.
subset=['3'] indicates that only column "3" is considered.
See the link above for details.

How to merge two data frames together?

I have two data frames:
Pre_data_inputs with the size of (4760,2)
Diff_course_Precourse with size of (4760,1).
I want to merge these two data frames together with name data_inputs. This new data frame should be (4760,3). I have this code so far:
data_inputs = pd.concat([pre_data_inputs, Diff_Course_PreCourse], axis=1)
But the size of data_inputs now is (4950,3).
I don't know what the problem is. I would appreciate it if anybody could help me. Thanks.
Well, if your index matches in both cases, you can go with:
pre_data_inputs.merge(Diff_Course_PreCourse, left_index=True, right_index=True)
Otherwise you might want to reset_index() on both dataframes.
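For example, a rough sketch of the reset_index route; the misaligned index on the second frame is only an assumption to reproduce the (4950, 3) symptom:

import pandas as pd
import numpy as np

# stand-ins with the shapes from the question
pre_data_inputs = pd.DataFrame(np.zeros((4760, 2)), columns=["a", "b"])
Diff_Course_PreCourse = pd.DataFrame(np.ones((4760, 1)), columns=["c"],
                                     index=range(190, 4950))  # assumed misaligned index

# reset both indexes so concat aligns on position instead of labels
data_inputs = pd.concat([pre_data_inputs.reset_index(drop=True),
                         Diff_Course_PreCourse.reset_index(drop=True)],
                        axis=1)

print(data_inputs.shape)   # (4760, 3) rather than (4950, 3)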
As @Parfait commented, the index of your data frames has to match for concat to work as you describe it.
For example:
d1 = pd.DataFrame(np.zeros(shape=(3, 1)))

     0
0  0.0
1  0.0
2  0.0

d2 = pd.DataFrame(np.ones(shape=(3, 2)), index=range(2, 5))

     0    1
2  1.0  1.0
3  1.0  1.0
4  1.0  1.0
Since the indexes don't match, the result data frame will have a number of rows equal to the size of the union of the two indexes (0, 1, 2, 3, 4):
pd.concat([d1, d2], axis=1)

     0    0    1
0  0.0  NaN  NaN
1  0.0  NaN  NaN
2  0.0  1.0  1.0
3  NaN  1.0  1.0
4  NaN  1.0  1.0
You could use reset_index before the concat, or force one of the data frames to use the index of the other:
pd.concat([d1, d2.set_index(d1.index)], axis=1)

     0    0    1
0  0.0  1.0  1.0
1  0.0  1.0  1.0
2  0.0  1.0  1.0
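The reset_index variant mentioned above would look roughly like this with the same d1/d2:

pd.concat([d1.reset_index(drop=True), d2.reset_index(drop=True)], axis=1)

     0    0    1
0  0.0  1.0  1.0
1  0.0  1.0  1.0
2  0.0  1.0  1.0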

Pandas CONCAT() with merged columns in Creation

I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using CONCAT() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The CONCAT() join_axes is the common index to all the dataframes. This works fine, however I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation - so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use CONCAT() exactly as I am, but merge the columns to produce an output like so?
I think you need:
df = pd.concat([df1, df2])
Or, if you have duplicates in columns, use groupby, where any overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})

df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})
print (df1)
     A    B    C
0  5.0  1.0  7.0
1  8.0  NaN  3.0
2  7.0  NaN  NaN
3  NaN  9.0  0.0

print (df2)
     A    B    C
0  NaN  1.0  NaN
1  NaN  2.0  6.0
2  NaN  NaN  NaN
3  2.0  NaN  3.0
df = pd.concat([df1, df2], axis=1)
print (df)
     A    B    C    A    B    C
0  5.0  1.0  7.0  NaN  1.0  NaN
1  8.0  NaN  3.0  NaN  2.0  6.0
2  7.0  NaN  NaN  NaN  NaN  NaN
3  NaN  9.0  0.0  2.0  NaN  3.0
print (df.groupby(level=0, axis=1).sum())
     A    B    C
0  5.0  2.0  7.0
1  8.0  2.0  9.0
2  7.0  NaN  NaN
3  2.0  9.0  3.0
What you want is df1.combine_first(df2). Refer to pandas documentation.
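For comparison, combine_first keeps df1's values and only fills its NaNs from df2 (it does not sum overlaps); a quick sketch with the same sample frames:

print (df1.combine_first(df2))
     A    B    C
0  5.0  1.0  7.0
1  8.0  2.0  3.0
2  7.0  NaN  NaN
3  2.0  9.0  0.0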
