I have an example (the left one in the image).
There are several indexes in the first column, and the rows repeat in intervals. However, in the third repetition (and not just the third one, since I have data from more than a thousand repeating intervals) one of the repeating characters, 'GG', is missing.
Question: I want to add the missing rows (like 'GG') with the value NaN, and display the values in separate columns based on the characters of the repeated section (from 'II' to '//\n').
Is there any way to do this?
Assuming your data is a dataframe with all data as columns (if not, you first need to reset_index):
(df
#.reset_index() # uncomment if first column is index
 .assign(cols=lambda d: d.groupby('col1').cumcount())
.pivot(index='col1', columns='cols', values='col2')
)
output:
cols 0 1 2
col1
A 0.0 4.0 7.0
B 1.0 5.0 8.0
C 2.0 9.0 NaN
D 3.0 6.0 10.0
input:
col1 col2
0 A 0
1 B 1
2 C 2
3 D 3
4 A 4
5 B 5
6 D 6
7 A 7
8 B 8
9 C 9
10 D 10
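For reference, a minimal, self-contained sketch that rebuilds the input above and runs the chain end-to-end (column names col1/col2 as in the answer; the lambda keeps the cumcount working even if reset_index is uncommented):
import pandas as pd

df = pd.DataFrame({'col1': list('ABCDABDABCD'), 'col2': range(11)})

out = (df
 #.reset_index() # uncomment if first column is index
 .assign(cols=lambda d: d.groupby('col1').cumcount())
 .pivot(index='col1', columns='cols', values='col2')
)
print(out)  # C gets NaN in column 2, since its third occurrence is missing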
I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest is NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby(col).transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, but doesn't give me the wished result.
Anybody an idea?
It depends on the data. If there is always at most one non-missing value per group, you can sort and then fill forward with GroupBy.ffill; this works even if some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# simpler alternative if there is always exactly one non-missing value per group;
# fails if some group contains only NaNs:
# df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or, if there are multiple values per group and NaNs need to be replaced by the group mean, you can improve the performance of your solution by using GroupBy.transform only for the mean and passing the result to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
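As a side note (an alternative not shown in the answer above): if there is exactly one non-missing value per group, GroupBy.transform('first') broadcasts it to the whole group without any sorting, because first skips NaNs:
df["value"] = df.groupby('col')["value"].transform('first')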
You can use ffill, which is the same as fillna() with method='ffill' (see the docs):
df["value"] = df["value"].ffill()
I am trying to group by in pandas, sort the values, and add a result column that shows what you need to add to get to the next row in the group; if a row is the last one in its group, the value should be replaced with the number 3. Does anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2, 6, 6, 4, 16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: within each group you have to add 4 to 2 to get 6, so the groups are sorted by value. If there is no next value in the group, a NaN would be produced; it should be replaced with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up, but the problem is that the labels aren't sorted:
df['Results'] = df.groupby('label')['Val'].apply(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help! :D
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
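For clarity, the intermediate diff(-1) computes each value minus the next value within its group (hence the abs to flip the sign), and the last row of each group comes out as NaN, which fillna(3) then replaces:
print(df.groupby('label')['Val'].diff(-1))
# 0    -4.0
# 1     NaN
# 2   -10.0
# 3    -4.0
# 4     NaN
# 5     NaN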
Consider a dataframe which contains several groups of integers:
d = pd.DataFrame({'label': ['a','a','a','a','b','b','b','b'], 'value': [1,2,3,2,7,1,8,9]})
d
label value
0 a 1
1 a 2
2 a 3
3 a 2
4 b 7
5 b 1
6 b 8
7 b 9
For each of these groups of integers, each integer has to be greater than or equal to the previous one; if that is not the case, it takes on the value of the previous integer. I replace values using
s.where(~(s < s.shift()), s.shift())
which works fine for a single series. I can even group the dataframe, and loop through each extracted series:
grouped = d.groupby('label')['value']
for _, s in grouped:
print(s.where(~(s < s.shift()), s.shift()))
0 1.0
1 2.0
2 3.0
3 3.0
Name: value, dtype: float64
4 7.0
5 7.0
6 8.0
7 9.0
Name: value, dtype: float64
However, how do I now get these values back into my original dataframe?
Or, is there a better way to do this? I don't care for using .groupby and don't consider the for loop a pretty solution either...
IIUC, you can use cummax within the groupby, like this:
d['val_max'] = d.groupby('label')['value'].cummax()
print (d)
label value val_max
0 a 1 1
1 a 2 2
2 a 3 3
3 a 2 3
4 b 7 7
5 b 1 7
6 b 8 8
7 b 9 9
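If the goal is to write the clipped values back into the original column rather than a new one, the same cummax can simply overwrite value:
d['value'] = d.groupby('label')['value'].cummax()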
I'm new to Python and I need your assistance with getting a result where the column names become the rows and each row shows that column's average.
Here's an example:
columns
A B C
1 2 3
4 5 6
7 8 9
Expected result:
avg
A 4
B 5
C 6
I can do it easily in Excel by placing the columns in "Values" and then moving the values into rows to get the average, but I can't seem to do it in Python.
df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9
ser = df.mean()                  # the result is a Series
df = pd.DataFrame({'avg': ser})  # convert the Series into a DataFrame
df
avg
A 4.0
B 5.0
C 6.0
Using to_frame:
df.mean().to_frame('avg')
Out[186]:
     avg
A 4.0
B 5.0
C 6.0
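If you would rather have the labels as a regular column instead of the index, a small follow-up on the same idea (rename only gives the Series a column name before the index is reset):
df.mean().rename('avg').reset_index()
#   index  avg
# 0     A  4.0
# 1     B  5.0
# 2     C  6.0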
I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use pandas merge with an outer join:
df1.merge(df2, on='first_column', how='outer')
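The frames in the question are shown without headers, so the shared key column has to be named first; a minimal sketch with assumed names key/v1/v2:
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'v1': [4, 6, 8]})
df2 = pd.DataFrame({'key': ['A', 'B', 'F'], 'v2': [7, 4, 3]})
print(df1.merge(df2, on='key', how='outer'))
#   key   v1   v2
# 0   A  4.0  7.0
# 1   B  6.0  4.0
# 2   C  8.0  NaN
# 3   F  NaN  3.0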
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
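Both value columns come out labelled 1 (their original positional column name), so you may want to rename them afterwards, for example:
res.columns = ['from_df1', 'from_df2']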