How to merge two data frames together? - python

I have two data frames:
pre_data_inputs, with shape (4760, 2)
Diff_Course_PreCourse, with shape (4760, 1)
I want to merge these two data frames into a new one named data_inputs, which should have shape (4760, 3). I have this code so far:
data_inputs = pd.concat([pre_data_inputs, Diff_Course_PreCourse], axis=1)
But the shape of data_inputs is now (4950, 3).
I don't know what the problem is. I would appreciate it if anybody could help me. Thanks.

Well, if the index matches in both data frames you can go with:
pre_data_inputs.merge(Diff_Course_PreCourse, left_index=True, right_index=True)
Otherwise you might want to reset_index() on both dataframes.
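For example, a minimal sketch applied to the frames from the question (assuming the mismatched row labels are the only problem), aligning purely by position:
import pandas as pd

data_inputs = pd.concat(
    [pre_data_inputs.reset_index(drop=True),
     Diff_Course_PreCourse.reset_index(drop=True)],
    axis=1)
# expected shape from the question: (4760, 3)
print(data_inputs.shape)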

As @Parfait commented, the indexes of your data frames have to match for concat to work the way you describe.
For example:
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.zeros(shape=(3, 1)))
     0
0  0.0
1  0.0
2  0.0
d2 = pd.DataFrame(np.ones(shape=(3, 2)), index=range(2, 5))
     0    1
2  1.0  1.0
3  1.0  1.0
4  1.0  1.0
Since the indexes don't match, the resulting data frame will have one row per label in the union of the two indexes (0, 1, 2, 3, 4):
pd.concat([d1, d2], axis=1)
     0    0    1
0  0.0  NaN  NaN
1  0.0  NaN  NaN
2  0.0  1.0  1.0
3  NaN  1.0  1.0
4  NaN  1.0  1.0
You could use reset_index on both data frames before the concat, or force one data frame to use the index of the other:
pd.concat([d1, d2.set_index(d1.index)], axis=1)
     0    0    1
0  0.0  1.0  1.0
1  0.0  1.0  1.0
2  0.0  1.0  1.0
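For completeness, a short sketch of the reset_index variant mentioned above, reusing the same d1 and d2:
pd.concat([d1.reset_index(drop=True), d2.reset_index(drop=True)], axis=1)
     0    0    1
0  0.0  1.0  1.0
1  0.0  1.0  1.0
2  0.0  1.0  1.0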

Transpose dataframe with cells as sum over columns

I have a dataframe in the following form:
      x_30d  x_60d  y_30d  y_60d
127     1.0    1.0    0.0    1.0
223     1.0    0.0    1.0    NaN
1406    1.0    NaN    1.0    0.0
2144    1.0    0.0    1.0    1.0
2234    1.0    0.0    NaN    NaN
I need to transform it into the following form (where each cell is the sum over each column above):
   30d  60d
x    5    1
y    3    2
I've tried using dictionaries, splitting columns, melting the dataframe, transposing it, etc., but I cannot seem to get the correct pattern.
To make things slightly more complicated, here are some actual column names that have a mix of forms for date ranges: PASC_new_aches_30d_60d, PASC_new_aches_60d_180d, ... PASC_new_aches_360d, ..., PASC_new_jt_pain_180d_360d, ...
In [131]: new = df.sum()
In [132]: new.index = pd.MultiIndex.from_frame(
     ...:     new.index.str.extract(r"^(.*?)_(\d+d.*)$"))
In [133]: new
Out[133]:
0                 1
PASC_new_aches    30d_60d      5.0
                  60d_180d     1.0
                  360d         3.0
PASC_new_jt_pain  180d_360d    2.0
dtype: float64
In [134]: new.unstack()
Out[134]:
1                 180d_360d  30d_60d  360d  60d_180d
0
PASC_new_aches          NaN      5.0   3.0       1.0
PASC_new_jt_pain        2.0      NaN   NaN       NaN
- sum as usual per column
- the original's columns are now at the index; we need to split them
- using a regex here: ^(.*?)_(\d+d.*)$
  - ^: the beginning
  - (.*?): anything, but non-greedily, until...
  - _(\d+d.*): ...an underscore followed by the "digits + d" pattern, and anything after it
  - $: the end
- while splitting, we extracted the parts before & after the underscore with the (...) groups
- make them the new index (a MultiIndex now)
- unstack the inner level to become the new columns, i.e., the parts after "_"
- note that the "1" and "0" at the top left are the names of the frame's axes; 0 is that of df.index, 1 is that of df.columns. They are there due to pd.MultiIndex.from_frame and can be removed with .rename_axis(index=None, columns=None).
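Put together, a self-contained sketch of this approach, using the sample data implied by the question's column names as the assumed input:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"PASC_new_aches_30d_60d":     [1, 1, 1, 1, 1],
     "PASC_new_aches_60d_180d":    [1, 0, np.nan, 0, 0],
     "PASC_new_aches_360d":        [0, 1, 1, 1, np.nan],
     "PASC_new_jt_pain_180d_360d": [1, np.nan, 0, 1, np.nan]},
    index=[127, 223, 1406, 2144, 2234])

new = df.sum()                                        # sum per column
new.index = pd.MultiIndex.from_frame(                 # split "prefix_daterange"
    new.index.str.extract(r"^(.*?)_(\d+d.*)$"))
result = new.unstack().rename_axis(index=None, columns=None)
print(result)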
One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_sep='_')
)

  other  30d  60d
0     x  5.0  1.0
1     y  3.0  2.0
The .value determines which parts of the columns remain as column headers.
If your dataframe looks complicated (based on the columns you shared):
PASC_new_aches_30d_60d PASC_new_aches_60d_180d PASC_new_aches_360d PASC_new_jt_pain_180d_360d
127 1.0 1.0 0.0 1.0
223 1.0 0.0 1.0 NaN
1406 1.0 NaN 1.0 0.0
2144 1.0 0.0 1.0 1.0
2234 1.0 0.0 NaN NaN
then a regex, similar to @MustafaAydin's answer, works better:
(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_pattern=r"(\D+)_(.+)")
)

              other  30d_60d  60d_180d  360d  180d_360d
0    PASC_new_aches      5.0       1.0   3.0        NaN
1  PASC_new_jt_pain      NaN       NaN   NaN        2.0

How to populate NaN with 0, starting after the first non-NaN value

I need to populate the NaN values of my df with a static 0, starting from the first non-NaN value.
In a way, this combines method="ffill" (identify the first value per column and only act on the NaN values that follow) with value=0 (fill with 0, not the preceding value in df).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df:
     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  NaN  3.0  NaN
3  NaN  NaN  4.0
Desired output:
     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  0.0  3.0  0.0
3  0.0  0.0  4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and speed matters. We are talking ~60M rows and 4k columns, so looping is out of the question, and masking only if it is really, really fast.
You can try mask(), ffill() and fillna():
df=df.fillna(df.mask(df.ffill().notna(),0))
#OR via where
df=df.fillna(df.where(df.ffill().isna(),0))
output:
     0    1    2
0  NaN  NaN  NaN
1  6.0  NaN  1.0
2  0.0  3.0  0.0
3  0.0  0.0  4.0
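Given the size concern, a roughly equivalent variant (my own sketch, not from the answer above) that builds the mask once and writes the zeros in a single mask call:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, 6.0, np.nan, np.nan],
                   1: [np.nan, np.nan, 3.0, np.nan],
                   2: [np.nan, 1.0, np.nan, 4.0]})

# NaN cells that come after the first non-NaN value in their column
to_zero = df.isna() & df.ffill().notna()
print(df.mask(to_zero, 0))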

Average Pandas Dataframe with condition other Dataframe

I have two dataframes. One only contains binary values, the other floats between 0 and 1.
Eg.
df1:
col 1 col 2 col 3 col 4 col 5 col 6 col 7
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0
df2:
col 1 col 2 col 3 col 4 col 5 col 6 col 7
0 0.068467 0.099870 0.090778 0.087500 0.612955 0.081495 0.570557
1 0.091651 0.084946 0.082704 0.103070 0.517317 0.092595 0.603526
2 0.070380 0.104353 0.103062 0.086780 0.598848 0.101543 0.570064
3 0.052239 0.123760 0.215329 0.087608 0.581883 0.080650 0.574241
4 0.087564 0.104460 0.125887 0.079945 0.646284 0.081015 0.609308
What I need is to compute the average of df1 where df2 >= 0.5 (or any other threshold).
All I could find on this topic works on single columns only, and I could not get it to work on the entire dataframe.
Any help is appreciated.
First, both DataFrames need to have the same index and the same column names.
Then use DataFrame.where to replace values with NaN where the condition is False, and take the mean per column:
df = df1.where(df2 >= 0.5).mean()
If you need the mean of all values, use numpy.nanmean to exclude the missing values:
mean = np.nanmean(df1.where(df2 >= 0.5))
Another idea is to convert all values to a Series with DataFrame.stack and then take the mean:
mean = df1.where(df2 >= 0.5).stack().mean()
What about creating a dataframe that keeps the values of df1 where df2 >= 0.5 and has NaN everywhere else:
df = df1.where(df2 >= 0.5)
We then calculate the sum of the values and count the number of values to get the mean:
sum_values = df.sum().sum()
count_values = df.count().sum()
mean_value = sum_values / count_values
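A quick check on toy data (my own example, not from the answers) that the stack-based and manual sum/count approaches agree:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [0.0, 1.0, 1.0], 'b': [1.0, 0.0, 1.0]})
df2 = pd.DataFrame({'a': [0.7, 0.2, 0.9], 'b': [0.6, 0.8, 0.1]})

masked = df1.where(df2 >= 0.5)                 # keep df1 only where df2 >= 0.5
mean_stack = masked.stack().mean()             # overall mean via stack
mean_manual = masked.sum().sum() / masked.count().sum()
assert np.isclose(mean_stack, mean_manual)     # both give 0.5 here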

How to merge rows with combination of values in a DataFrame

I have a DataFrame (df1) as given below
    Hair  Feathers  Legs  Type  Count
R1     1       NaN     0     1      1
R2     1         0   NaN     1     32
R3     1         0     2     1      4
R4     1       NaN     4     1     27
I want to merge rows based on different combinations of the values in each column, and also add up the Count values for each merged row. The resulting dataframe (df2) will look like this:
    Hair  Feathers  Legs  Type  Count
R1     1         0     0     1     33
R2     1         0     2     1     36
R3     1         0     4     1     59
The merging is performed in such a way that any NaN value will be merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2), and the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) is the same as in R3 (df1) and the NaN value of Legs is merged with 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated
A possible way to do it is by replicating each of the rows containing NaN and filling them with the possible values for the column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)

# Keep the rows that do not contain NaN
# and then add the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each column of the row, replace
        # NaN by the possible values for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If one or more elements in a row are missing, the procedure should look for all possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools

unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
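For reuse, a minimal sketch that wraps the whole procedure into a helper; the name expand_and_sum is my own, and it assumes the last column holds the count while every other column defines the grouping:
import itertools
import pandas as pd

def expand_and_sum(df):
    feature_cols = list(df.columns[:-1])
    count_col = df.columns[-1]
    # possible non-null values per feature column
    unique_values = {c: df[c].dropna().unique().tolist() for c in feature_cols}
    mask = df[feature_cols].isnull().any(axis=1)
    rows = [r for _, r in df[~mask].iterrows()]
    for _, row in df[mask].iterrows():
        missing = row[feature_cols][row[feature_cols].isnull()].index.tolist()
        for combo in itertools.product(*[unique_values[c] for c in missing]):
            rows.append(row.fillna(dict(zip(missing, combo))))
    expanded = pd.concat(rows, axis=1, ignore_index=True).T.infer_objects()
    return expanded.groupby(feature_cols, as_index=False)[count_col].sum()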

How to identify empty cells in a CSV file using pandas

I am taking a column from a csv file and inputting the data from it into an array using pandas. However, many of the cells are empty and get saved in the array as 'nan'. I want to either identify the empty cells so I can skip them or remove them all from the array after. Something like the following pseudo-code:
if df.row(column number) == nan
skip
or
if df.row(column number) != nan
do stuff
Basically, how do I identify whether a cell from the CSV file is empty?
Best is to get rid of the NaN rows after you load it, by indexing:
df = df[df['column_to_check'].notnull()]
For example to get rid of NaN values found in column 3 in the following dataframe:
>>> df
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df[df[3].notnull()]
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
pd.isnull() and pd.notnull() are standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column as you suggest in your code above. You could then use this expression to do whatever you like with that value.
Example:
import pandas as pd
import numpy as np
a = np.nan
pd.isnull(a)
Out[4]: True
pd.notnull(a)
Out[5]: False
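Applied to the asker's pseudo-code, a minimal row-by-row sketch (the file and column names below are placeholders):
import pandas as pd

df = pd.read_csv('data.csv')          # placeholder file name
for _, row in df.iterrows():
    value = row['my_column']          # placeholder column name
    if pd.isnull(value):
        continue                      # skip empty cells
    # ... do stuff with value ...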
If you want to manipulate all (or certain) NaN values from a DataFrame: handling missing data is a big topic when working with tabular data, and there are many methods of doing so. I'd recommend chapter 7 from this book; its first section, on handling missing data, would be the most pertinent to your question.
If you just want to exclude missing values, you can use pd.DataFrame.dropna()
Below is an example based on the one described by @sacul:
>>> import pandas as pd
>>> df
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df.dropna(axis=0, subset=[3])
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
axis=0 indicates that rows (rather than columns) containing NaN are dropped.
subset=[3] indicates that only column 3 is considered.
See the link above for details.
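For the asker's original workflow, a minimal end-to-end sketch (file and column names are placeholders):
import pandas as pd

df = pd.read_csv('data.csv')                     # placeholder file name
clean = df.dropna(subset=['my_column'])          # drop rows where that cell is empty
values = clean['my_column'].to_numpy()           # array without NaN entries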
