Split pandas frames when a specific column reaches a given value - python

I want to use pd.cut() with two different bin sizes for two specific parts of a dataframe. I believe the easiest way to do that is to read my dataframe and split it in two, so I can use pd.cut() on the two independent dataframes with two independent sets of bins.
I understand I could use df.head(), but I keep changing the dataframe and the two parts don't always have the same size. For example, with the following dataframe:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
5 0.4 0.1773
6 0.43 0.91773
7 0.5 0.891773
I want to have two dataframes: one for values of A lower than 0.4, and one for values greater than or equal to 0.4.
So I would have df2:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
and df3:
A B
1 0.4 0.1773
2 0.43 0.91773
3 0.5 0.891773
Again, df.head(4) or df.tail(3) won't work.

df2 = df[df["A"] < 0.4]
df3 = df[df["A"] >= 0.4]
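Once the frames are split, each part can be binned independently. A minimal sketch of the original pd.cut() goal; the bin edges here are illustrative assumptions, not from the question:

import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.2, 0.3, 0.31, 0.4, 0.43, 0.5],
                   'B': [0.423655, 0.645894, 0.437587, 0.891773,
                         0.1773, 0.91773, 0.891773]})

# split on the threshold
df2 = df[df["A"] < 0.4]
df3 = df[df["A"] >= 0.4]

# bin each part with its own (hypothetical) edges
bins_low = pd.cut(df2["A"], bins=[0.0, 0.15, 0.3, 0.4])
bins_high = pd.cut(df3["A"], bins=[0.4, 0.45, 0.5], include_lowest=True)
print(bins_low)
print(bins_high)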

This should work:
import pandas as pd
data = {'A': [0.1,0.2,0.1,0.2,5,6,7,8], 'B': [5,0.2,4,8,11,9,10,14]}
df = pd.DataFrame(data)
df2 = df[df.A >= 0.4]
print(df2)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
df3 = df[df.A < 0.4]
print(df3)
# A B
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0

I added some fictitious data as an example:
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df[df.A > 4]
df2 = df[df.A < 13]
print(df1)
print(df2)
Output
>>> print(df1)
A B
4 5 11
5 6 12
6 7 13
7 8 14
>>> print(df2)
A B
0 1 5
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
6 7 13
7 8 14

Related

assign values of one dataframe column to another dataframe column based on condition

I am trying to compare two dataframes based on different columns and assign a value in one dataframe based on the comparison.
df1 :
date value1 value2
4/1/2021 A 1
4/2/2021 B 2
4/6/2021 C 3
4/4/2021 D 4
4/5/2021 E 5
4/6/2021 F 6
4/2/2021 G 7
df2:
Date percent
4/1/2021 0.1
4/2/2021 0.2
4/6/2021 0.6
output:
date value1 value2 per
4/1/2021 A 1 0.1
4/2/2021 B 2 0.2
4/6/2021 C 3 0.6
4/4/2021 D 4 0
4/5/2021 E 5 0
4/6/2021 F 6 0
4/2/2021 G 7 0.2
Code1:
df1['per'] = np.where(df1['date']==df2['Date'], df2['per'], 0)
error:
ValueError: Can only compare identically-labeled Series objects
Note: I changed the column name df2['Date'] to df2['date'] and then tried merging.
Code2:
new = pd.merge(df1, df2, on=['date'], how='inner')
error:
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
You can build a date-to-percent mapping from df2 and map it onto df1's date column:
df1['per'] = df1['date'].map(dict(zip(df2['Date'], df2['percent']))).fillna(0)
date value1 value2 per
0 4/1/2021 A 1 0.1
1 4/2/2021 B 2 0.2
2 4/6/2021 C 3 0.6
3 4/4/2021 D 4 0.0
4 4/5/2021 E 5 0.0
5 4/6/2021 F 6 0.6
6 4/2/2021 G 7 0.2
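An equivalent variant, as a sketch: Series.map also accepts a Series, matching on its index, so the dict(zip(...)) can be replaced with set_index (this assumes the dates in df2['Date'] are unique, as in the example):

import pandas as pd

df1 = pd.DataFrame({'date': ['4/1/2021', '4/2/2021', '4/6/2021', '4/4/2021',
                             '4/5/2021', '4/6/2021', '4/2/2021'],
                    'value1': list('ABCDEFG'),
                    'value2': range(1, 8)})
df2 = pd.DataFrame({'Date': ['4/1/2021', '4/2/2021', '4/6/2021'],
                    'percent': [0.1, 0.2, 0.6]})

# map matches df1['date'] against the index of the lookup Series
df1['per'] = df1['date'].map(df2.set_index('Date')['percent']).fillna(0)
print(df1)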
You could use pd.merge and perform a left join to keep all the rows from df1 and bring over the date-matching rows from df2:
pd.merge(df1,df2,left_on='date',right_on='Date', how='left').fillna(0).drop('Date',axis=1)
Prints:
date value1 value2 percent
0 04/01/2021 A 1 0.1
1 04/02/2021 B 2 0.2
2 04/06/2021 C 3 0.6
3 04/04/2021 D 4 0.0
4 04/05/2021 E 5 0.0
5 04/06/2021 F 6 0.6
6 04/02/2021 G 7 0.2
*I think there's a typo on your penultimate row. percent should be 0.6 IIUC.
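As a side note on the second error ("trying to merge on object and datetime64[ns] columns"): that message usually means one frame's date column was parsed as datetime while the other's is still plain strings. A minimal sketch of the usual fix, normalizing both keys with pd.to_datetime before merging; the frames below are cut-down stand-ins for the question's data:

import pandas as pd

# hypothetical stand-ins: df1['date'] is datetime64[ns], df2['Date'] is str
df1 = pd.DataFrame({'date': pd.to_datetime(['4/1/2021', '4/4/2021']),
                    'value2': [1, 4]})
df2 = pd.DataFrame({'Date': ['4/1/2021', '4/2/2021'],
                    'percent': [0.1, 0.2]})

# convert the string column so both merge keys share the same dtype
df2['Date'] = pd.to_datetime(df2['Date'])
out = pd.merge(df1, df2, left_on='date', right_on='Date', how='left').drop(columns='Date')
out['percent'] = out['percent'].fillna(0)
print(out)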

Unexplained behavior with Pandas Split (group) + Apply + Rejoin (concat), but only when sorting

I'm trying to calculate distances between a column and its lag (shift) for groups in a Pandas dataframe. The groups need to be sorted so that the shift is one timestep before. The standard way to do this is .groupby() (aka Split), then .apply() with the distance function over each group, then rejoin with .concat(). This works fine, but only when I don't explicitly sort the grouped dataframe. When I sort the grouped dataframe, I get an error in the rejoining step.
Here's my example code, for which I was able to reproduce the unexpected behavior:
import pandas as pd

def dist_apply(group):
    # when commented out, this code will run to completion (!)
    group.sort_values(by='T', inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})
print(df)
# split
df_g = df.groupby(['X'])
# apply
df_g = df_g.apply(dist_apply)
print(df_g)
# rejoin
df = pd.concat([df,df_g],axis=1)
print(df)
When the code that sorts the grouped dataframe is commented out, then the code prints this, which is expected:
X T Y
0 A 0.9 7
1 B 0.8 1
2 A 0.7 8
3 B 0.9 3
4 A 0.8 9
5 B 0.7 5
X T Y shift dist
0 A 0.9 7 NaN NaN
1 B 0.8 1 NaN NaN
2 A 0.7 8 7.0 1.0
3 B 0.9 3 1.0 2.0
4 A 0.8 9 8.0 1.0
5 B 0.7 5 3.0 2.0
X T Y X T Y shift dist
0 A 0.9 7 A 0.9 7 NaN NaN
1 B 0.8 1 B 0.8 1 NaN NaN
2 A 0.7 8 A 0.7 8 7.0 1.0
3 B 0.9 3 B 0.9 3 1.0 2.0
4 A 0.8 9 A 0.8 9 8.0 1.0
5 B 0.7 5 B 0.7 5 3.0 2.0
With the sorting line, the Traceback looks like this:
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = pd.concat([df,df_g],axis=1)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 420, in get_result
    indexers[ax] = obj_labels.reindex(new_labels)[1]
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2236, in reindex
    target = MultiIndex.from_tuples(target)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 396, in from_tuples
    arrays = list(lib.tuples_to_object_array(tuples).T)
  File "pandas/_libs/lib.pyx", line 2287, in pandas._libs.lib.tuples_to_object_array
TypeError: object of type 'int' has no len()
Sorting but not running the concat prints me this for df_g:
X T Y shift dist
X
A 2 A 0.7 8 NaN NaN
4 A 0.8 9 8.0 1.0
0 A 0.9 7 9.0 -2.0
B 5 B 0.7 5 NaN NaN
1 B 0.8 1 5.0 -4.0
3 B 0.9 3 1.0 2.0
which shows that the result now carries a MultiIndex, unlike the df_g printed without the sorting (above), but it's not clear why the concat breaks in this case.
Update: I thought I had solved it by renaming the offending column ('X' in this case) and also using .reset_index() on the grouped dataframe before the concat.
df_g.columns = ['X_g','T','Y','shift','dist']
df = pd.concat([df,df_g.reset_index()],axis=1)
runs as expected, and prints this:
X T Y X level_1 X_g T Y shift dist
0 A 0.9 7 A 2 A 0.7 8 NaN NaN
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
2 A 0.7 8 A 0 A 0.9 7 9.0 -2.0
3 B 0.9 3 B 5 B 0.7 5 NaN NaN
4 A 0.8 9 B 1 B 0.8 1 5.0 -4.0
5 B 0.7 5 B 3 B 0.9 3 1.0 2.0
But looking closely, this row shows that the join is misaligned:
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
I'm using Mac OSX with Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27)
Pandas 0.24.2 + Numpy 1.17.3
and also tried upgrading to Pandas 0.25.3 and Numpy 1.17.5 with the same result.
This is tentatively working.
Rename columns to avoid duplicate:
df_g.columns = ['X_g','T','Y','shift','dist']
Reset index to single from multiindex:
df_g = df_g.reset_index(level=[0,1])
Simple merge, put df_g first if you want to keep the sorted-group order:
df = pd.merge(df_g,df)
gives me
X level_1 X_g T Y shift dist
0 A 2 A 0.7 8 NaN NaN
1 A 4 A 0.8 9 8.0 1.0
2 A 0 A 0.9 7 9.0 -2.0
3 B 5 B 0.7 5 NaN NaN
4 B 1 B 0.8 1 5.0 -4.0
5 B 3 B 0.9 3 1.0 2.0
Full code:
import pandas as pd

def dist_apply(group):
    group.sort_values(by='T', inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})
print(df)
df_g = df.groupby(['X'])
df_g = df_g.apply(dist_apply)
#print(df_g)
df_g.columns = ['X_g', 'T', 'Y', 'shift', 'dist']
df_g = df_g.reset_index(level=[0, 1])
#print(df_g)
df = pd.merge(df_g, df)
print(df)
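For what it's worth, a sketch of an alternative that sidesteps the index problem entirely: sort the whole frame once, then use groupby(...).shift(), which returns a Series already aligned to the original index, so no concat or merge is needed afterwards. This is a different approach, not the answer's code:

import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
                   'Y': [7, 1, 8, 3, 9, 5]})

# sort by group key and time so shift() looks one timestep back within each group
df = df.sort_values(['X', 'T'])
df['shift'] = df.groupby('X')['Y'].shift()
df['dist'] = df['Y'] - df['shift']
print(df)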

Python Pandas - Returning Data based on specified conditions and prevent true false values

I have a simple dataframe that's somewhat providing the output I want. Below are the code and output.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df['A'] > 4, df['A']
df2 = df['B'] <13, df['B']
df3 = df1 + df2
print(df3)
Output
>>> print(df3)
(0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
Name: A, dtype: bool, 0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
Name: A, dtype: int64, 0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
Name: B, dtype: bool, 0 5
1 8
2 9
3 10
4 11
5 12
6 13
7 14
Name: B, dtype: int64)
My question is: how do I prevent the output from printing the True/False values? I'm just interested in the dataframes with the values.
Desired output
1 2
2 3
3 4
4 5
5 6
6 7
7 8
1 8
2 9
3 10
4 11
5 12
6 13
7 14
The desired output is two separate dataframes of just the values, without the True/False outputs.
This should work:
import pandas as pd
data = {'A': [0.1,0.2,0.1,0.2,5,6,7,8], 'B': [5,0.2,4,8,11,9,10,14]}
df = pd.DataFrame(data)
df1 = df[df.A >= 0.4]
print(df1)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
df2 = df[df.A < 0.4]
print(df2)
# A B
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0
df3 = pd.concat([df1, df2])
print(df3)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0
I figured this out. The below provides me with the expected output:
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df[df.A > 4]
df2 = df[df.A < 13]
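For completeness, the reason the original attempt printed booleans: df['A'] > 4, df['A'] is a tuple of two Series (a boolean mask and the column itself), and df1 + df2 just concatenates the two tuples. A small demonstration of the difference, as a sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': [5, 8, 9, 10, 11, 12, 13, 14]})

t = df['A'] > 4, df['A']         # a 2-tuple (mask, column) -- no filtering happens
print(type(t))                   # <class 'tuple'>

filtered = df['A'][df['A'] > 4]  # boolean indexing actually applies the mask
print(filtered)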

Assign values from pandas.quantile

I am trying to get the quantiles of a dataframe assigned onto another dataframe, like:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
the result is
0 NaN
...
5758 NaN
Name: pc, Length: 5759, dtype: float64
Any idea why dataframe['pc'] ends up with so many NaN values?
It is expected, because the indices differ, so the Series created by quantile does not align with the original DataFrame's index and you get NaNs:
#indices 0,1,2...6
dataframe = pd.DataFrame({'row':[2,0,8,1,7,4,5]})
print (dataframe)
row
0 2
1 0
2 8
3 1
4 7
5 4
6 5
#indices 0.1, 0.5, 0.7
print (dataframe['row'].quantile([.1,.5,.7]))
0.1 0.6
0.5 4.0
0.7 5.4
Name: row, dtype: float64
#not align
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
print (dataframe)
row pc
0 2 NaN
1 0 NaN
2 8 NaN
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
If you want to create a DataFrame of the quantiles, add rename_axis + reset_index:
df = dataframe['row'].quantile([.1,.5,.7]).rename_axis('a').reset_index(name='b')
print (df)
a b
0 0.1 0.6
1 0.5 4.0
2 0.7 5.4
But if some indices are the same (I think this is not what you want; it is only for a better explanation), add reset_index to get the default indices 0,1,2:
print (dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True))
0 0.6
1 4.0
2 5.4
Name: row, dtype: float64
The first 3 rows are aligned, because the Series and DataFrame share indices 0,1,2:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True)
print (dataframe)
row pc
0 2 0.6
1 0 4.0
2 8 5.4
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
EDIT:
For multiple columns you need DataFrame.quantile, which also excludes non-numeric columns:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4,5,4,5,5,4],
                   'C': [7,8,9,4,2,3],
                   'D': [1,3,5,7,1,0],
                   'E': [5,3,6,9,2,4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df1 = df.quantile([.1,.2,.3,.4])
print (df1)
B C D E
0.1 4.0 2.5 0.5 2.5
0.2 4.0 3.0 1.0 3.0
0.3 4.0 3.5 1.0 3.5
0.4 4.0 4.0 1.0 4.0
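If the underlying goal was to label each row with the quantile bucket it falls into (rather than to store the quantile values themselves), pd.qcut may be the better fit. A sketch, not part of the original answer; the labels are made up for illustration:

import pandas as pd

dataframe = pd.DataFrame({'row': [2, 0, 8, 1, 7, 4, 5]})

# qcut returns one label per row, so the result aligns with the original index
dataframe['pc'] = pd.qcut(dataframe['row'], q=[0, .1, .5, .7, 1],
                          labels=['bottom 10%', '10-50%', '50-70%', 'top 30%'])
print(dataframe)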

two different csv file data manipulation using pandas

I have two data frames, df1 and df2.
df1 has the following data (N rows):
Time(s) sv-01 sv-02 sv-03 Val1 val2 val3
1339.4 1 4 12 1.6 0.6 1.3
1340.4 1 12 4 -0.5 0.5 1.4
1341.4 1 6 8 0.4 5 1.6
1342.4 2 5 14 1.2 3.9 11
...... ..... .... ... ..
df2 has the following data, with more rows than df1:
Time(msec) channel svid value-1 value-2 valu-03
1000 1 2 0 5 1
1000 2 5 1 4 2
1000 3 2 3 4 7
..... .....................................
1339400 1 1 1.6 0.4 5.3
1339400 2 12 0.5 1.8 -4.4
1339400 3 4 -0.20 1.6 -7.9
1340400 1 1 0.3 0.3 1.5
1340400 2 6 2.3 -4.3 1.0
1340400 3 4 2.0 1.1 -0.45
1341400 1 1 2 2.1 0
1341400 2 8 3.4 -0.3 1
1341400 3 6 0 4.1 2.3
.... .... .. ... ... ...
What I am trying to achieve is:
1. First, multiply the Time(s) column by 1000 so that it matches the df2 millisecond column.
2. In df1, sv-01, sv-02 and sv-03 sit in independent columns, while in df2 those sv values appear in a single column, svid.
So the goal is: when the (converted) time of df1 matches a time in df2, copy the next three consecutive lines, i.e. copy all matched lines for that time instant.
Basically I want to look up each df1 time in the df2 time column and, on a match, copy the three matching rows to a new dataframe.
I have seen examples using the pandas merge function, but in my case the two frames have different headers.
Thanks.
I think you need double boolean indexing - first filter df2 with isin (mul handles the multiplication by 1000), and then count values per group with cumcount and keep the first 3:
df = df2[df2['Time(msec)'].isin(df1['Time(s)'].mul(1000))]
df = df[df.groupby('Time(msec)').cumcount() < 3]
print (df)
Time(msec) channel svid value-1 value-2 valu-03
3 1339400 1 1 1.6 0.4 5.30
4 1339400 2 12 0.5 1.8 -4.40
5 1339400 3 4 -0.2 1.6 -7.90
6 1340400 1 1 0.3 0.3 1.50
7 1340400 2 6 2.3 -4.3 1.00
8 1340400 3 4 2.0 1.1 -0.45
9 1341400 1 1 2.0 2.1 0.00
10 1341400 2 8 3.4 -0.3 1.00
11 1341400 3 6 0.0 4.1 2.30
Detail:
print (df.groupby('Time(msec)').cumcount())
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
dtype: int64
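Regarding the "different header" concern: merge can join on differently named columns (or you can rename one key to match the other), so a merge-based equivalent is possible. A sketch under minimal stand-in data; the round() guards against float artifacts from the seconds-to-milliseconds conversion:

import pandas as pd

# minimal stand-ins for the question's frames
df1 = pd.DataFrame({'Time(s)': [1339.4, 1340.4]})
df2 = pd.DataFrame({'Time(msec)': [1000, 1339400, 1340400, 1341400],
                    'svid': [2, 1, 1, 1],
                    'value-1': [0.0, 1.6, 0.3, 2.0]})

# build a one-column key frame in df2's units under a matching name
keys = (df1['Time(s)'] * 1000).round().astype(int).rename('Time(msec)').to_frame()

# inner merge keeps only the df2 rows whose time appears in df1
df_matched = df2.merge(keys.drop_duplicates(), on='Time(msec)', how='inner')
print(df_matched)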
