Dropping a cell if it is NaN in a DataFrame in Python

I have a dataframe like this.
Project 4 Project1 Project2 Project3
0 NaN laptio AB NaN
1 NaN windows ten NaN
0 one NaN NaN NaN
1 two NaN NaN NaN
I want to delete the NaN values from the Project 4 column.
My desired output should be,
df,
Project 4 Project1 Project2 Project3
0 one laptio AB NaN
1 two windows ten NaN
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN

If your DataFrame's index is just standard 0-to-n ordered integers, you can pop the Project 4 column to a Series, drop the NaN values, reset the index, and then concatenate it back with the DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [np.nan, 4, 5, 6],
                   ['one', 7, 8, 9],
                   ['two', 10, 11, 12]], columns=['p4', 'p1', 'p2', 'p3'])
s = df.pop('p4')
pd.concat([df, s.dropna().reset_index(drop=True)], axis=1)
# returns:
p1 p2 p3 p4
0 1 2 3 one
1 4 5 6 two
2 7 8 9 NaN
3 10 11 12 NaN
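The same top-alignment can also be written as a per-column shift; a sketch (not from the answer above) that applies dropna plus reset_index to every column, which matches the desired output here because only p4 has leading NaNs:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [np.nan, 4, 5, 6],
                   ['one', 7, 8, 9],
                   ['two', 10, 11, 12]], columns=['p4', 'p1', 'p2', 'p3'])

# drop NaNs per column and re-align everything on a fresh 0..n index;
# shorter columns are padded with NaN at the bottom by apply's alignment
shifted = df.apply(lambda s: s.dropna().reset_index(drop=True))
```

Note this shifts every column's non-NaN values to the top, so it only coincides with pop-and-concat when the other columns have no leading NaNs.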

Related

Copy row values of Data Frame along rows till not null and replicate the consecutive not null value further

I have a DataFrame as shown below:
A B C D
0 1 2 3.3 4
1 NaT NaN NaN NaN
2 NaT NaN NaN NaN
3 5 6 7 8
4 NaT NaN NaN NaN
5 NaT NaN NaN NaN
6 9 1 2 3
7 NaT NaN NaN NaN
8 NaT NaN NaN NaN
I need to copy the first row's values (1, 2, 3, 4) down through the null rows up to index 2, then copy the row values (5, 6, 7, 8) down through index 5, and (9, 1, 2, 3) down through index 8, and so on. Is there any way to do this in Python or pandas? Quick help appreciated! It is also necessary not to replace column D.
When I tried ffill, column C gave 3.3456 as the value for the next row.
Expected Output:
A B C D
0 1 2 3.3 4
1 1 2 3.3 NaN
2 1 2 3.3 NaN
3 5 6 7 8
4 5 6 7 NaN
5 5 6 7 NaN
6 9 1 2 3
7 9 1 2 NaN
8 9 1 2 NaN
The question was changed, so to forward fill all columns except D, use Index.difference to get the list of column names and ffill those:
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
Or create a boolean mask for all column names except D:
mask = df.columns != 'D'
df.loc[:, mask] = df.loc[:, mask].ffill()
EDIT: I cannot replicate your problem:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [2114.201789, np.nan, np.nan, 1]})
print (df)
a
0 2114.201789
1 NaN
2 NaN
3 1.000000
print (df.ffill())
a
0 2114.201789
1 2114.201789
2 2114.201789
3 1.000000
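A runnable sketch of the Index.difference approach against the question's frame, using plain floats throughout (the question shows NaT in column A, i.e. datetimes, but the fill logic is identical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, np.nan, 5, np.nan, np.nan, 9, np.nan, np.nan],
    'B': [2, np.nan, np.nan, 6, np.nan, np.nan, 1, np.nan, np.nan],
    'C': [3.3, np.nan, np.nan, 7, np.nan, np.nan, 2, np.nan, np.nan],
    'D': [4, np.nan, np.nan, 8, np.nan, np.nan, 3, np.nan, np.nan],
})

# every column except D, forward filled in place; D keeps its NaNs
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
```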

How to merge multiple dataframe columns within a common dataframe in pandas in fastest way possible?

I need to perform the following operation on a pandas DataFrame df inside a for loop with 50 or more iterations:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4
The columns common to all five dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
The shapes of the dataframes all differ: df is the master dataframe with the maximum number of rows, and the other four dataframes each have fewer rows than df, varying among themselves. So while merging columns, the rows of both dataframes need to be matched first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
Which is the most efficient and fastest way to achieve this? I tried using four separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just one line of code instead?
Any help will be appreciated. Many thanks in advance.
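No answer is preserved here, but the four combine_first steps can be folded into one statement by keying every frame on the shared columns and reducing. A sketch, assuming A, B, C, D uniquely identify a row; the dfN frames below carry only their one value column instead of the all-NaN placeholders shown in the question:

```python
import pandas as pd
import numpy as np
from functools import reduce

key = ['A', 'B', 'C', 'D']

df = pd.DataFrame({'A': [1, 2, 5, 4], 'B': [2, 3, 9, 8],
                   'C': [3, 4, 7, 6], 'D': [4, 5, 8, 3],
                   'X': np.nan, 'Y': np.nan, 'Z': np.nan, 'W': np.nan})
df1 = pd.DataFrame({'A': [2, 4], 'B': [3, 8], 'C': [4, 6], 'D': [5, 3], 'X': [100, 200]})
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'Y': [50]})
df3 = pd.DataFrame({'A': [1, 4], 'B': [2, 8], 'C': [3, 6], 'D': [4, 3], 'Z': [1000, 2000]})
df4 = pd.DataFrame({'A': [2, 5, 4], 'B': [3, 9, 8], 'C': [4, 7, 6], 'D': [5, 8, 3], 'W': [25, 35, 45]})

# combine_first aligns on the index, so key every frame by A, B, C, D first;
# the reindex at the end restores df's original row order
merged = reduce(lambda acc, d: acc.combine_first(d.set_index(key)),
                [df1, df2, df3, df4], df.set_index(key))
out = merged.reindex(df.set_index(key).index).reset_index()[df.columns]
```

set_index(key) does the row matching that the question asks for, so frames of different lengths are handled without any explicit loop over rows.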

Pandas combines empty rows in Excel file to a single row in dataframe

I have different Excel files that I am processing with pandas. I need to remove a certain number of rows from the top of each file. These extra rows could be empty or they could contain text. Pandas combines some of the rows, so I am not sure how many need to be removed. For example:
Here is an example excel file (represented as csv):
,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim
Aem,2-Jan,workout
Here is my current python script:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0])
print ("dfs: ", dfs)
Here is the results when I print the dataframe:
dfs: Unnamed: 0 Unnamed: 1 Unnamed: 2
0 some text NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 name date task
5 Jason 2016-01-01 00:00:00 swim
6 Aem 2016-01-02 00:00:00 workout
From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?
I used Python 3 and pandas 0.18.1. The Excel load function is pandas.read_excel, not pandas.read_csv. You can try setting the parameter header=None. Here are sample codes:
(1) With default parameters, the result ignores leading blank lines:
In [12]: pd.read_excel('test.xlsx')
Out[12]:
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 text1 NaN NaN
1 NaN NaN NaN
2 n1 t2 c3
3 NaN NaN NaN
4 NaN NaN NaN
5 jim sum tim
(2) With header=None, the result keeps leading blank lines:
In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 text1 NaN NaN
3 NaN NaN NaN
4 n1 t2 c3
5 NaN NaN NaN
6 NaN NaN NaN
7 jim sum tim
Here is what you are looking for:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(skiprows=6)
print ("dfs: ", dfs)
Check the docs on ExcelFile for more details.
If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:
In [286]: df = pd.read_excel("test.xlsx", header=None)
In [287]: df
Out[287]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 something NaN NaN
3 NaN NaN NaN
4 name date other
5 1 2 3
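If the number of junk rows varies from file to file, another option (not shown in the answers above) is to read with header=None and locate the header row by a known cell value. A sketch; the 'name' marker cell is an assumption about the file layout, and the hand-built frame stands in for pd.read_excel('extra_rows.xlsx', header=None):

```python
import pandas as pd
import numpy as np

# stands in for: raw = pd.read_excel('extra_rows.xlsx', header=None)
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ['some text', np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ['name', 'date', 'task'],
    ['Jason', '1-Jan', 'swim'],
    ['Aem', '2-Jan', 'workout'],
])

# find the header row by its known first cell (an assumed marker)
header_idx = raw.index[raw[0] == 'name'][0]

# keep everything below the header, and promote the header row to columns
df = raw.iloc[header_idx + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_idx]
```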

How to merge two dataframes if the index and length both do not match?

I have two DataFrames, predictor_df and solution_df, like this:
predictor_df
1000 A B C
1001 1 2 3
1002 4 5 6
1003 7 8 9
1004 NaN NaN NaN
and a solution_df
0 D
1 10
2 11
3 12
The reason for the names is that predictor_df is used to do some analysis on its columns to arrive at solution_df. My analysis drops the rows with NaN values in predictor_df, hence the shorter solution_df.
Now I want to know how to join these two DataFrames to obtain my final dataframe as
A B C D
1 2 3 10
4 5 6 11
7 8 9 12
NaN NaN NaN NaN
Please guide me through it. Thanks in advance.
Edit: I tried to merge the two DataFrames, but the result comes out like this:
A B C D
1 2 3 NaN
4 5 6 NaN
7 8 9 NaN
NaN NaN NaN NaN
Edit 2: Also, when I do pd.concat([predictor_df, solution_df], axis = 1)
it becomes like this:
A B C D
NaN NaN NaN 10
NaN NaN NaN 11
NaN NaN NaN 12
NaN NaN NaN NaN
You could use reset_index with drop=True which resets the index to the default integer index.
pd.concat([df_1.reset_index(drop=True), df_2.reset_index(drop=True)], axis=1)
A B C D
0 1 2 3 10.0
1 4 5 6 11.0
2 7 8 9 12.0
3 NaN NaN NaN NaN
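A self-contained version of that answer with the question's frames spelled out (D becomes 10.0 etc. because the trailing NaN forces the column to float):

```python
import pandas as pd
import numpy as np

predictor_df = pd.DataFrame({'A': [1, 4, 7, np.nan],
                             'B': [2, 5, 8, np.nan],
                             'C': [3, 6, 9, np.nan]},
                            index=[1001, 1002, 1003, 1004])
solution_df = pd.DataFrame({'D': [10, 11, 12]}, index=[1, 2, 3])

# reset both indexes to 0..n-1 so concat aligns row-by-row
out = pd.concat([predictor_df.reset_index(drop=True),
                 solution_df.reset_index(drop=True)], axis=1)
```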

How can I use apply with pandas rolling_corr()

I posted this a while ago but no one could solve the problem.
First, let's create some correlated DataFrames and call rolling_corr(), with dropna() since I am going to sparse it up later, and no min_periods set since I want to keep the results robust and consistent with the set window:
import numpy as np
from pandas import DataFrame, rolling_corr  # rolling_corr was top-level in old pandas

hey = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hoo = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hey_corr = rolling_corr(hey.dropna(), hoo.dropna(), 4)
gives me
In [388]: hey_corr
Out[388]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991087 0.978383 0.992614
4 0.974117 0.974871 0.989411
5 0.966969 0.972894 0.997427
6 0.942064 0.994681 0.996529
7 0.932688 0.986505 0.991353
8 0.935591 0.966705 0.980186
9 0.969994 0.977517 0.931809
10 0.979783 0.956659 0.923954
11 0.987701 0.959434 0.961002
12 0.907483 0.986226 0.978658
13 0.940320 0.985458 0.967748
14 0.952916 0.992365 0.973929
now when I sparse it up it gives me...
hey.ix[5:8,0] = np.nan
hey.ix[6:10,1] = np.nan
hoo.ix[5:8,0] = np.nan
hoo.ix[6:10,1] = np.nan
hey_corr_sparse = rolling_corr(hey.dropna(),hoo.dropna(), 4)
hey_corr_sparse
Out[398]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
11 0.996801 0.998218 0.992538
12 0.994919 0.998656 0.995235
13 0.994899 0.997465 0.997950
14 0.971828 0.937512 0.994037
Chunks of data are missing; it looks like we only have data where dropna() can form a complete window across the dataframe.
I can solve the problem with an ugly iter-fudge as follows...
hey_corr_sparse = DataFrame(np.nan, index=hey.index, columns=hey.columns)
for i in hey_corr_sparse.columns:
    hey_corr_sparse.ix[:, i] = rolling_corr(hey.ix[:, i].dropna(), hoo.ix[:, i].dropna(), 4)
hey_corr_sparse
Out[406]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
5 NaN 0.944246 0.961917
6 NaN NaN 0.941467
7 NaN NaN 0.963183
8 NaN NaN 0.980530
9 0.993865 NaN 0.984484
10 0.997691 NaN 0.998441
11 0.978982 0.991095 0.997462
12 0.914663 0.990844 0.998134
13 0.933355 0.995848 0.976262
14 0.971828 0.937512 0.994037
Does anyone in the community know if it is possible to make this an array function to give this result? I've attempted to use .apply but drawn a blank. Is it even possible to .apply a function that works on two data structures (hey and hoo in this example)?
many thanks, LW
you can try this:
>>> def sparse_rolling_corr(ts, other, window):
... return rolling_corr(ts.dropna(), other[ts.name].dropna(), window).reindex_like(ts)
...
>>> hey.apply(sparse_rolling_corr, args=(hoo, 4))
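In current pandas, the top-level rolling_corr function and the .ix indexer no longer exist; the accepted answer's idea can be sketched with the .rolling accessor instead, under the question's assumption that hey and hoo share NaN positions:

```python
import numpy as np
import pandas as pd

hey = (pd.DataFrame(np.random.random((15, 3))) + .2).cumsum()
hoo = (pd.DataFrame(np.random.random((15, 3))) + .2).cumsum()
hey.iloc[5:9, 0] = np.nan   # .iloc excludes the slice end, so 5:9 matches .ix[5:8]
hey.iloc[6:11, 1] = np.nan
hoo.iloc[5:9, 0] = np.nan
hoo.iloc[6:11, 1] = np.nan

def sparse_rolling_corr(ts, other, window):
    # drop NaNs per column, correlate over the compacted series,
    # then reindex back to the full index so gaps reappear as NaN
    return ts.dropna().rolling(window).corr(other[ts.name].dropna()).reindex_like(ts)

result = hey.apply(sparse_rolling_corr, args=(hoo, 4))
```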
