Pandas new column replace only show specific pattern value in new column - python

Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 Date
5 11/20/2020
6 607722991-t-ptr-001-888
7 NaN
8 Date
9 10/25/2020
10 12/30/2019
11 967722944-t-ptq-020-888
I want a new column in the same dataframe that shows only the values matching a specific pattern, with all other values replaced by NaN, like this. The original table has 200k rows and 22 columns, and the pattern has more than 5000 combinations.
Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 NaN
5 Nan
6 607722991-t-ptr-001-888
7 NaN
8 NaN
9 NaN
10 NaN
11 967722944-t-ptq-020-888

df['new_value'] = df['value'].where(df['value'].str.contains('-t-', na=False))
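A substring test on "-t-" would also keep any unrelated text that happens to contain those characters. As a minimal sketch of a stricter check, a full regular-expression match can be used instead. The regex below and the column name `new_value` are assumptions inferred from the sample rows, not something stated in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [
        "880770000-t-ptt-018-108", np.nan, "760770000-t-ptm-001-107",
        "Date", "11/20/2020", "607722991-t-ptr-001-888",
    ]
})

# Assumed pattern: 9 digits, "-t-", 3 lowercase letters, then two 3-digit groups.
pattern = r"\d{9}-t-[a-z]{3}-\d{3}-\d{3}"

# fullmatch is vectorized; na=False makes missing values count as "no match".
is_code = df["value"].str.fullmatch(pattern, na=False)

# where() keeps values where the mask is True and writes NaN everywhere else.
df["new_value"] = df["value"].where(is_code)
print(df)
```

Because both steps are vectorized, this should scale to a few hundred thousand rows without a Python-level loop.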

Related

How to assign a column the value that is above it only if a condition is met?

So I have a dataframe where I have some empty values in a column. I need those empty values to be filled with the nearest real value above them, whether it is 1 row above or 4 rows above. But the caveat is that I only need those empty values to be filled in if a certain condition is met.
Dataframe currently looks like:
Column A  Column B
1         100
1         NaN
1         NaN
2         150
2         NaN
2         NaN
3         NaN
3         NaN
4         60
5         70
5         NaN
I need it to look like:
Column A  Column B
1         100
1         100
1         100
2         150
2         150
2         150
3         NaN
3         NaN
4         60
5         70
5         70
So the first value for each grouping in column A needs to be carried out for that grouping in column B...all rows with a 1 in column A should have the same column B value. All rows with a 2 in column A should have the same column B value. The value it should be will always be the first value. In other words, the first row a new value comes up in column A will contain the correct value in Column B that should be carried down.
I really have no idea how to approach this. I was thinking about using groupby but that didn't make much sense.
I think groupby is the way to go:
g = df.groupby('Column A')
df['Column B'] = g['Column B'].ffill()
Output:
Column A Column B
0 1 100.00
1 1 100.00
2 1 100.00
3 2 150.00
4 2 150.00
5 2 150.00
6 3 NaN
7 3 NaN
8 4 60.00
9 5 70.00
10 5 70.00
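For reference, here is a self-contained sketch of the grouped forward fill using the sample data from the question. It also shows `transform("first")` as an alternative, which is an addition of mine rather than part of the answer above: it broadcasts the first non-null value of each group, which matches the statement that the correct value is always the first one in the group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column A": [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5],
    "Column B": [100, np.nan, np.nan, 150, np.nan, np.nan,
                 np.nan, np.nan, 60, 70, np.nan],
})

# Forward fill inside each group; values never leak across group boundaries,
# so group 3 (all NaN) stays NaN instead of inheriting 150 from group 2.
filled = df.groupby("Column A")["Column B"].ffill()

# Alternative: give every row the first non-null value of its group.
first = df.groupby("Column A")["Column B"].transform("first")

df["Column B"] = filled
print(df)
```

The two approaches agree on this data. They differ only when a group holds more than one distinct non-null value: `ffill` carries the most recent one, `transform("first")` always uses the first.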

Copy row values of Data Frame along rows till not null and replicate the consecutive not null value further

I have a Dataframe as shown below
A B C D
0 1 2 3.3 4
1 NaT NaN NaN NaN
2 NaT NaN NaN NaN
3 5 6 7 8
4 NaT NaN NaN NaN
5 NaT NaN NaN NaN
6 9 1 2 3
7 NaT NaN NaN NaN
8 NaT NaN NaN NaN
I need to copy the first row's values (1, 2, 3, 4) down through the null rows up to index 2, then copy the row values (5, 6, 7, 8) down to index 5, and (9, 1, 2, 3) down to index 8, and so on. Is there any way to do this in Python or pandas? Quick help appreciated! Also, column D must not be filled.
When I use ffill on column C, it gives 3.3456 as the value for the next row.
Expected Output:
A B C D
0 1 2 3.3 4
1 1 2 3.3 NaN
2 1 2 3.3 NaN
3 5 6 7 8
4 5 6 7 NaN
5 5 6 7 NaN
6 9 1 2 3
7 9 1 2 NaN
8 9 1 2 NaN
The question was changed, so to forward fill all columns except D, use Index.difference to get the list of column names and then ffill:
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
Or create a mask for all column names except D:
mask = df.columns != 'D'
df.loc[:, mask] = df.loc[:, mask].ffill()
EDIT: I cannot replicate your problem:
df = pd.DataFrame({'a':[2114.201789, np.nan, np.nan, 1]})
print (df)
a
0 2114.201789
1 NaN
2 NaN
3 1.000000
print (df.ffill())
a
0 2114.201789
1 2114.201789
2 2114.201789
3 1.000000
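As a runnable sketch of the column-selective forward fill, the frame below approximates the question's data. Plain floats are assumed for every column (the question shows NaT in column A, so the real dtypes may differ), and only the first six rows are reproduced:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 5, np.nan, np.nan],
    "B": [2, np.nan, np.nan, 6, np.nan, np.nan],
    "C": [3.3, np.nan, np.nan, 7, np.nan, np.nan],
    "D": [4, np.nan, np.nan, 8, np.nan, np.nan],
})

# Every column name except "D".
cols = df.columns.difference(["D"])

# Forward fill only the selected columns; "D" keeps its gaps.
df[cols] = df[cols].ffill()
print(df)
```

Note that `ffill` copies values unchanged, so column C stays at 3.3 in the filled rows.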

How to fill in missing numeric values in a df column

So I am trying to add rows to a data frame so that it follows the numeric order 1 to 52,
but my data is missing some numbers, so I need to add these rows and fill them with NaN or null values.
df = pd.DataFrame({"Weeks": [1,2,3,15,16,20,21,52],
                   "Values": [10,10,10,10,50,60,70,40]})
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until it reaches Weeks = 52
My solution:
new_df = pd.DataFrame({"Weeks": range(1, 53), "Values": np.nan})
for x in range(1, 53):
    for i, week in enumerate(df.Weeks):
        if x == week:
            # copy the known value into the matching week row
            new_df.loc[new_df.Weeks == x, "Values"] = df.Values[i]
The problem is that it is super inefficient. Does anyone know a more efficient way to do it?
You could use set_index to set Weeks as the index and reindex with a range up to and including the maximum week (the end of a range is exclusive, hence the + 1):
df.set_index('Weeks').reindex(range(1, df.Weeks.max() + 1))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(df.Weeks.min(), df.Weeks.max() + 1))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
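Put together as a self-contained sketch, using the data from the question. The final `reset_index` step, which turns Weeks back into an ordinary column, is my addition and is optional:

```python
import pandas as pd

df = pd.DataFrame({
    "Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
    "Values": [10, 10, 10, 10, 50, 60, 70, 40],
})

full = (
    df.set_index("Weeks")
      .reindex(range(1, 53))   # range end is exclusive, so 53 covers week 52
      .rename_axis("Weeks")    # make sure the index keeps its name
      .reset_index()           # optional: Weeks back to a regular column
)
print(full)
```

Weeks that were absent from the input get NaN in Values, which is also why that column is converted from integers to floats.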

How to change consecutive repeating values in pandas dataframe series to nan or 0?

I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values to either NaN or 0.
2. Keep the first repeating value and change all the others to NaN or 0.
I have found solutions using "shift", but they drop repeating values. I do not want to drop repeating values. My data frame looks like this:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23
giving a dataframe like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 40.000000 -0.074763 -0.840403
6 40.000000 0.709794 -1.000048
7 40.000000 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 0.230000
12 0.116258 22.000000 1.119744
13 -0.501180 22.000000 0.558941
14 0.551586 22.000000 -0.993749
what I would like to be able to do is write some code that would filter the data and give me a data frame like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 NaN 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 NaN
11 1.187208 0.964340 NaN
12 0.116258 NaN 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
or even better keep the first value and change the rest to NaN. Like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Using shift & mask:
df.shift(1) == df compares the previous row to the current one to find consecutive duplicates.
df.mask(df.shift(1) == df)
# outputs
0 1 2
0 0.365329 0.153527 0.143244
1 0.688364 0.495755 1.065965
2 0.354180 -0.023518 3.338483
3 -0.106851 0.296802 -0.594785
4 40.000000 0.149378 1.507316
5 NaN -1.312952 0.225137
6 NaN -0.242527 -1.731890
7 NaN 0.798908 0.654434
8 2.226980 -1.117809 -1.172430
9 -1.228234 -3.129854 -1.101965
10 0.393293 1.682098 0.230000
11 -0.029907 -0.502333 NaN
12 0.107994 22.000000 0.354902
13 -0.478481 NaN 0.531017
14 -1.517769 NaN 1.552974
If you want to remove all the consecutive duplicates, also test whether the next row is the same as the current row:
df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Option 1
Specialized solution using diff. Gets at the final desired output.
df.mask(df.diff().eq(0))
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
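To make the two variants easy to compare, here is a small deterministic sketch. The one-column frame and its values are made up for illustration, in place of the random data in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 40.0, 40.0, 40.0, 2.0, 0.23, 0.23]})

# True where a value equals the one directly above / below it.
same_as_prev = df.eq(df.shift(1))
same_as_next = df.eq(df.shift(-1))

# Variant 1: keep the first value of each run, blank the repeats.
keep_first = df.mask(same_as_prev)

# Variant 2: blank every value that belongs to a run of repeats.
drop_all = df.mask(same_as_prev | same_as_next)

print(keep_first["x"].tolist())
print(drop_all["x"].tolist())
```

The `diff().eq(0)` form is equivalent to variant 1 for numeric columns. The `shift` comparison also works for non-numeric columns such as strings, where `diff` is not defined.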

fillna in clustered data in large pandas dataframes

Considering the following dataframe:
index group signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I want to modify the signals by ffill NANs in each group so that I can have the following dataframe:
index group signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
The dataframe is big (around 800,000 rows with about 16,000 different groups) and currently I put it into a groupby object and try to modify each group there, which is very slow. Then I tried to convert it into a pivot_table and ffill() there, but the dataframe is simply too large and the program gives errors. Any suggestions? Thank you!
Can you try this:
data_group = data.groupby('group').apply(lambda v: v.ffill())
I think NAN is a string in your data, not an empty element. Empty data will appear as NaN. If it is a string, replace NAN with a real NaN first, like this:
data_group = data.groupby('group').apply(lambda v: v.replace('NAN', float('nan')).ffill())
Or a better version as Jeff suggested
data['signal'] = data['signal'].replace('NAN', float('nan'))
data['signal'] = data.groupby('group')['signal'].ffill()
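A self-contained sketch of the vectorized version, run on the sample data from the question. It assumes the missing entries are the literal string "NAN" and uses `pd.to_numeric` with `errors="coerce"` to convert them, which is a substitution of mine for the `replace` call above:

```python
import pandas as pd

data = pd.DataFrame({
    "group": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "signal": [1, "NAN", "NAN", -1, "NAN", "NAN", -1, "NAN",
               "NAN", "NAN", "NAN", 1, "NAN", "NAN"],
})

# Anything that cannot be parsed as a number (here the "NAN" strings)
# becomes a real NaN, and the column ends up as float64.
data["signal"] = pd.to_numeric(data["signal"], errors="coerce")

# Built-in grouped forward fill: no Python-level lambda per group.
# Selecting the column keeps the "group" column in the frame untouched.
data["signal"] = data.groupby("group")["signal"].ffill()
print(data)
```

Avoiding `apply` with a lambda matters at this scale, since the lambda would be called once for each of the roughly 16,000 groups.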
