Python Pandas Next Sequence Number - python

I have a dataframe
df = pd.DataFrame([1,5,8, np.nan,np.nan], columns = ["UserID"])
I want to fill the np.nan values with the next sequence numbers, starting from the highest value + 1.
Expected result of df.UserID:
[1, 5, 8, 9, 10]

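Use Series.isna with Series.cumsum to build a counter, then add it to the original data with missing values forward-filled. This continues from the last valid value before the gap, which is also the highest value in this example: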
df['UserID'] = df['UserID'].isna().cumsum().add(df['UserID'].ffill(), fill_value=0)
print (df)
   UserID
0     1.0
1     5.0
2     8.0
3     9.0
4    10.0
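If the NaN values do not all come after the largest value, forward filling would not start from the maximum. A minimal alternative sketch, assuming at least one non-missing UserID exists, takes the maximum explicitly and assigns a range to the missing positions (the names mask and start are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 5, 8, np.nan, np.nan], columns=["UserID"])

# Boolean mask of the rows that need a new ID
mask = df["UserID"].isna()

# First new ID is one above the current maximum (max() skips NaN)
start = df["UserID"].max() + 1

# Assign consecutive IDs to the missing rows only
df.loc[mask, "UserID"] = np.arange(start, start + mask.sum())

# No NaN is left, so the column can be converted back to integers
df["UserID"] = df["UserID"].astype(int)
filled = df["UserID"].tolist()
```

Here filled holds the IDs as plain Python integers, which avoids the float dtype shown in the output above.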

Related

Filling pandas DataFrame with values from another DataFrame of different shape

I have a dataframe df2 containing four columns: A, B, C, D. I want to fill this dataframe with the values from another data frame temp= (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4).
What I want to obtain is given in
Any idea on how to do this?
If the number of values is divisible by 4, select the first row as a Series with DataFrame.iloc, convert it to a NumPy array, and reshape it with -1 (so the number of rows is inferred) and 4 for the number of columns:
print (len(df.iloc[0]) % 4)
0
df2 = pd.DataFrame(df.iloc[0].to_numpy().reshape(-1, 4), columns=list('ABCD'))
print (df2)
     A    B    C    D
0  1.0  2.0  6.5  8.0
1  3.0  4.0  6.6  7.8
2  5.0  6.0  5.0  4.0
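The reshape step can be tried end to end with the sketch below. It assumes the source data is a one-row DataFrame named temp that holds the twelve values from the question; the exact shape of temp is not given in the question, so that layout is an assumption.

```python
import pandas as pd

# Assumed layout: a single row with the 12 values from the question
temp = pd.DataFrame([[1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4]])

# Take the first row as a flat NumPy array
values = temp.iloc[0].to_numpy()

# reshape(-1, 4) only works when the length is a multiple of 4
if len(values) % 4 != 0:
    raise ValueError("number of values must be divisible by 4")

# -1 lets NumPy infer the number of rows from the 4 columns
df2 = pd.DataFrame(values.reshape(-1, 4), columns=list("ABCD"))
```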

Filter for rows in pandas dataframe where values in a column are greater than x or NaN

I'm trying to figure out how to filter a pandas dataframe so that the values in a certain column are either greater than a certain value, or are NaN. Let's say my dataframe looks like this:
df = pd.DataFrame({"col1":[1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})
I've tried:
df = df[df["col2"] >= 5 | df["col2"] == np.nan]
and:
df = df[df["col2"] >= 5 | np.isnan(df["col2"])]
But the first causes an error, and the second excludes rows where the value is NaN. How can I get the result to be this:
pd.DataFrame({"col1":[2, 3, 4], "col2":[5, np.nan, 7]})
Try combining Series.isna with Series.ge:
df[df.col2.isna()|df.col2.ge(5)]
   col1  col2
1     2   5.0
2     3   NaN
3     4   7.0
Also, you can fill NaN in that column with the threshold before comparing:
df[df["col2"].fillna(5) >= 5]
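Both attempts in the question are affected by operator precedence: in Python, | binds more tightly than >= and ==, so each comparison needs its own parentheses. In addition, comparing with == np.nan is always False, so a NaN check needs isna (or np.isnan). A minimal sketch of the question's second attempt with the parentheses added:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})

# Parenthesize each condition so | combines two boolean Series
result = df[(df["col2"] >= 5) | df["col2"].isna()]
```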

Loop through dataframe (cols and rows) and replace data

I have:
df = pd.DataFrame([[1, 2,3], [2, 4,6],[3, 6,9]], columns=['A', 'B','C'])
and I need to calculate the difference between the i+1 and i values in each column, and store it back in the same column. The output needed would be:
Out[2]:
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
I have tried to do this, but I finally get a list with all values appended, and I need to have them stored separately (in lists, or in the same dataframe).
Is there a way to do it?
difs=[]
for column in df:
    for i in range(len(df)-1):
        a = df[column]
        b = a[i+1]-a[i]
        difs.append(b)
for x in difs:
    for column in df:
        df[column]=x
You can use the pandas shift function to achieve this. From the docs:
Shift index by desired number of periods with an optional time freq.
for col in df:
    df[col] = df[col] - df[col].shift(1).fillna(0)
df
Out[1]:
     A    B    C
0  1.0  2.0  3.0
1  1.0  2.0  3.0
2  1.0  2.0  3.0
Added
If you want to use a loop, a good approach is probably iterrows, as it yields (index, Series) pairs.
difs = []
for i, row in df.iterrows():
    if i == 0:
        x = row.values.tolist()  # preserve the first row as-is
    else:
        # subtract the previous row from the current one
        x = (row.values - df.loc[i-1, df.columns]).values.tolist()
    difs.append(x)
difs
Out[1]:
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
## Create new / replace old dataframe
cols = [col for col in df.columns]
new_df = pd.DataFrame(difs, columns=cols)
new_df
Out[2]:
     A    B    C
0  1.0  2.0  3.0
1  1.0  2.0  3.0
2  1.0  2.0  3.0
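The same difference can also be computed on the whole DataFrame at once, without looping over columns or rows. This is a minimal sketch of that variation, not the answer's original code. It uses the fill_value parameter of DataFrame.shift so the first row is kept and the integer dtype is preserved. (DataFrame.diff computes the same differences but leaves NaN in the first row.)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 4, 6], [3, 6, 9]], columns=["A", "B", "C"])

# Shift every column down by one row; the empty first row becomes 0,
# so subtracting leaves the first row unchanged
diffs = df - df.shift(1, fill_value=0)
```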

Find the minimum value of a column greater than another column value in Python Pandas

I'm working in Python. I have two dataframes df1 and df2:
d1 = {'timestamp1': [88148 , 5617900, 5622548, 5645748, 6603950, 6666502], 'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)
d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350], 'col02': [7, 8, 9, 10, 11], 'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 with the value of the minimum timestamp of df2 greater than the current df1 timestamp, where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']
for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03']==0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction:
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
   col01  timestamp1     colnew
0      1       88148  5629500.0
1      2     5617900  5629500.0
2      3     5622548  5629500.0
3      4     5645748  6578800.0
4      5     6603950        NaN
5      6     6666502        NaN
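By default merge_asof also matches equal keys, whereas the question asks for timestamps strictly greater than timestamp1. A self-contained sketch of the same approach with allow_exact_matches=False is shown here. It assumes both frames are already sorted by their timestamp columns, which merge_asof requires.

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp1": [88148, 5617900, 5622548, 5645748, 6603950, 6666502],
    "col01": [1, 2, 3, 4, 5, 6],
})
df2 = pd.DataFrame({
    "timestamp2": [5629500, 5643050, 6578800, 6583150, 6611350],
    "col02": [7, 8, 9, 10, 11],
    "col03": [0, 1, 0, 0, 1],
})

# Keep only the candidate timestamps where col03 is zero
candidates = df2.loc[df2["col03"] == 0, ["timestamp2"]]

# direction="forward" picks the nearest later key;
# allow_exact_matches=False makes the comparison strictly greater
result = pd.merge_asof(
    df1,
    candidates,
    left_on="timestamp1",
    right_on="timestamp2",
    direction="forward",
    allow_exact_matches=False,
).rename(columns={"timestamp2": "colnew"})
```

For the sample data no timestamps are equal, so the result matches the output above.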
Try the apply method with a separate helper function:
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03']==0)]
    if not values.empty:
        # first match; equals the minimum because df2 is sorted by timestamp2
        return values.iloc[0]
    else:
        return np.nan
df1["timestamp1"].apply(func)
The output is your new column:
0    5629500.0
1    5629500.0
2    5629500.0
3    6578800.0
4          NaN
5          NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.

Pandas fillna() not filling values from series

I'm trying to fill missing values in a column in a DataFrame with the value from another DataFrame's column. Here's the setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [2, 3, 5, np.nan, np.nan],
'b': [10, 11, 13, 14, 15]
})
df2 = pd.DataFrame({
'x': [1]
})
I can of course do this and it works:
df['a'] = df['a'].fillna(1)
However, this results in the missing values not being filled:
df['a'] = df['a'].fillna(df2['x'])
And this results in an error:
df['a'] = df['a'].fillna(df2['x'].values)
How can I use the value from df2['x'] to fill in missing values in df['a']?
If you can guarantee df2['x'] only has a single element, then use .item:
df['a'] = df['a'].fillna(df2.values.item())
Or,
df['a'] = df['a'].fillna(df2['x'].item())
df
     a   b
0  2.0  10
1  3.0  11
2  5.0  13
3  1.0  14
4  1.0  15
Otherwise, this isn't possible unless the two are the same length and/or index-aligned.
As a rule of thumb, either
pass a scalar, or
pass a dictionary mapping the index of each NaN value to its replacement value (e.g., df.a.fillna({3 : 1, 4 : 1})), or
pass an index-aligned Series.
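The index alignment mentioned above is why fillna(df2['x']) leaves the values missing: df2['x'] only has index 0, while the NaN values sit at index 3 and 4. A minimal sketch contrasting the unaligned call with an index-aligned Series (the names unfilled, aligned, and filled are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [2, 3, 5, np.nan, np.nan],
    "b": [10, 11, 13, 14, 15],
})
df2 = pd.DataFrame({"x": [1]})

# df2["x"] has a value only at index 0, so nothing lines up
# with the NaN rows at index 3 and 4
unfilled = df["a"].fillna(df2["x"])

# Broadcast the single value over df's index so every row has a match
aligned = pd.Series(df2["x"].iloc[0], index=df.index)
filled = df["a"].fillna(aligned)
```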
I think one general solution is to select the first value with [0] to get a scalar:
print (df2['x'].values[0])
1
df['a'] = df['a'].fillna(df2['x'].values[0])
#similar solution for select by loc
#df['a'] = df['a'].fillna(df2.loc[0, 'x'])
print (df)
     a   b
0  2.0  10
1  3.0  11
2  5.0  13
3  1.0  14
4  1.0  15
