Find number of datapoints in each range - Python

I have a data frame that looks like this:
import pandas as pd

data = [['A', 0.20], ['B', 0.25], ['C', 0.11], ['D', 0.30], ['E', 0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of datapoints in the ranges 0-0.30 (both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. I could use pd.cut, but there the lower limit of each bin is not fixed, while my lower limit is always 0.
Can someone help?

One option using numpy broadcasting:
import numpy as np

step = 0.01
up = np.arange(0, 0.3 + step, step)  # upper bounds 0.00, 0.01, ..., 0.30
# compare every value against every upper bound, then count matches per bound
out = pd.Series((df['col2'].to_numpy()[:, None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
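For larger frames, the same counts can be computed without materializing the len(df) x len(up) boolean matrix. A minimal sketch using np.searchsorted, assuming the same up array as above:
import numpy as np
import pandas as pd

# side='right' returns, for each bound, how many sorted values are <= that
# bound -- exactly the cumulative count for the range [0, bound]
vals = np.sort(df['col2'].to_numpy())
out = pd.Series(np.searchsorted(vals, up, side='right'), index=up)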
With pandas.cut and cumsum:
step = 0.01
up = np.arange(0, 0.3 + step, step)
# label each bin by its upper bound, count per bin, then take the running total
(pd.cut(df['col2'], up, labels=up[1:].round(2))
   .value_counts(sort=False)
   .cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
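One caveat with the pd.cut approach: bins are open on the left by default, so a row with col2 exactly equal to 0 would fall outside every bin and be dropped. If such rows should be counted, a small adjustment (assuming the same up array):
# include_lowest=True closes the first bin on the left: [0, 0.01] instead of (0, 0.01]
(pd.cut(df['col2'], up, labels=up[1:].round(2), include_lowest=True)
   .value_counts(sort=False)
   .cumsum()
)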

Related

Adding values of a second dataFrame depending on the index values of a first dataFrame

I have a DataFrame that looks like this:
   base_rate  weighting_factor
0        NaN               NaN
1   1.792750               NaN
2   1.792944               NaN
I have a second DataFrame that looks like this:
min_index max_index weighting_factor
0 0 8 0.15
1 9 17 0.20
2 18 26 0.60
3 27 35 0.80
As you can see, the weighting_factor column in the first DataFrame is empty. How can I add the weighting_factor from the second DataFrame depending on the index?
For example, I want the weighting factor with the value 0.15 to be added in the index range 0-8 and the weighting factor 0.20 in the index range 9-17.
Thanks!
EDIT 1:
Instead of
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
I want:
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.20
5 0.871500 0.20
6 0.813326 0.20
7 0.054184 0.20
8 0.795688 0.60
9 0.560442 0.60
10 0.192447 0.60
11 0.712720 0.60
12 0.623351 0.80
13 0.805375 0.80
14 0.484269 0.80
15 0.360207 0.80
16 0.889750 1
17 0.503820 1
18 0.779739 1
19 0.116079 1
20 0.417814 1
21 0.423896 1
22 0.801999 1
23 0.034853 1
Since the length of df1 increases, the range of min_index and max_index increases as well.
A possible solution is to expand your second dataframe:
# repeat each df2 row (max_index - min_index + 1) times, then align by position
idx = df2.index.repeat(df2['max_index'] - df2['min_index'] + 1)
df1['weighting_factor'] = df2.reindex(idx)['weighting_factor'].values[:len(df1)]
>>> df1
base_rate weighting_factor
0 0.035007 0.15
1 0.427381 0.15
2 0.791881 0.15
3 0.282179 0.15
4 0.810117 0.15
5 0.871500 0.15
6 0.813326 0.15
7 0.054184 0.15
8 0.795688 0.15
9 0.560442 0.20
10 0.192447 0.20
11 0.712720 0.20
0 0.623351 0.20
1 0.805375 0.20
2 0.484269 0.20
3 0.360207 0.20
4 0.889750 0.20
5 0.503820 0.20
6 0.779739 0.60
7 0.116079 0.60
8 0.417814 0.60
9 0.423896 0.60
10 0.801999 0.60
11 0.034853 0.60
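An alternative that reads more plainly, assuming df1 has a unique, monotonically increasing index as in the original frame (label slices with .loc are inclusive on both ends):
# assign each (min_index, max_index) range in turn
for row in df2.itertuples(index=False):
    df1.loc[row.min_index:row.max_index, 'weighting_factor'] = row.weighting_factor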

Delete rows from a pandas DataFrame based on a conditional expression between rows (n) and (n+1)

Given a dataframe as follows:
col1
1 0.6
2 0.88
3 1.2
4 1.2
5 1.2
6 0.55
7 0.55
8 0.65
I want to delete rows where the value in row (n+1) is the same as the value in row (n), such that this would yield:
col1
1 0.6
2 0.88
3 1.2
4 row deleted
5 row deleted
6 0.55
7 row deleted
8 0.65
In [191]: df[df["col1"] != df["col1"].shift()]
Out[191]:
col1
1 0.60
2 0.88
3 1.20
6 0.55
8 0.65
Another option, using diff:
df = df[df['col1'].diff().ne(0)]
Try this:
df = df[~df['col1'].eq(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
Or:
df = df[df['col1'].ne(df['col1'].shift(1))]
print(df)
col1
0 0.60
1 0.88
2 1.20
5 0.55
7 0.65
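All of these keep the first row automatically, because shift() and diff() produce NaN there, which never compares equal to a value. If the duplicate check should span every column rather than just col1, a minimal sketch of the same idea:
# keep a row if any of its columns differs from the previous row
df = df[df.ne(df.shift()).any(axis=1)]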

How to combine Date object with float Vector

I am reading two data frames from two separate CSVs and trying to combine them into a single data frame. df1 and df2 should be combined row by row; df1 contains floating-point numbers and df2 contains dates.
import pandas as pd

df1 = pd.read_csv("Weights.csv")
print(df1.head(5))
df2 = pd.read_csv("Date.csv")
print(df2.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.06 0.06 -0.0 -0.0 0.11 0.06 0.37 0.01 0.05 0.10 -0.00 0.01 0.0
1 0.09 0.05 -0.0 -0.0 0.12 0.05 0.36 0.00 0.05 0.08 0.00 0.00 -0.0
2 0.14 0.07 -0.0 0.0 0.13 0.04 0.33 0.01 0.04 0.05 0.00 0.00 0.0
3 0.13 0.07 0.0 -0.0 0.12 0.06 0.34 0.01 0.05 0.04 0.01 0.00 -0.0
4 0.11 0.08 0.0 0.0 0.08 0.10 0.35 0.05 0.05 0.06 0.02 0.00 0.0
0
0 2010-12-29
1 2011-01-05
2 2011-01-12
3 2011-01-19
4 2011-01-26
I am facing problems using pd.concat in pandas.
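A minimal sketch of the row-by-row combination, assuming both frames have the same length and carry their default RangeIndex: concatenate along the column axis, which pairs rows by index position.
import pandas as pd

# axis=1 places the date column next to the weight columns, row by row;
# resetting the index first guards against misalignment if rows were filtered
combined = pd.concat([df2.reset_index(drop=True), df1.reset_index(drop=True)], axis=1)
print(combined.head())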

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row in the columns "mental_illness", "feeling", and "flavor". If all three values in a row are less than 0.5, I want to set that row's "unclassified" value to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
Here is my solution:
data.unclassified = data[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1).astype(int)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
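The expected output shows unclassified as a float (1.0/0.0), while the int conversions above yield 0/1. A variant that preserves the float dtype, assuming the same column names (note the strict < to match the question's "less than 0.5"):
import numpy as np

# rows where all three scores are strictly below 0.5 become 1.0; others keep their value
mask = df[['mental_illness', 'feeling', 'flavor']].lt(0.5).all(axis=1)
df['unclassified'] = np.where(mask, 1.0, df['unclassified'])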

Merging two pandas data frames with common columns

I have a lower triangular matrix, and I have its transpose.
I am trying to merge them together.
lower triangular:
Data :
0 1 2 3
0 1 0 0 0
1 0.21 0 0 0
2 0.31 0.32 0 0
3 0.41 0.42 0.43 0
4 0.51 0.52 0.53 0.54
transpose triangular:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0 0 0.32 0.42 0.52
2 0 0 0 0.43 0.53
3 0 0 0 0 0.54
Merged matrix:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0.21 0 0.32 0.42 0.52
2 0.31 0.32 0 0.43 0.53
3 0.41 0.42 0.43 0 0.54
4 0.51 0.52 0.53 0.54 0
I tried using pd.merge but I couldn't get it to work.
Let us use combine_first after mask:
df.mask(df==0).T.combine_first(df).fillna(0)
Out[1202]:
0 1 2 3 4
0 1.00 0.21 0.31 0.41 0.51
1 0.21 0.00 0.32 0.42 0.52
2 0.31 0.32 0.00 0.43 0.53
3 0.41 0.42 0.43 0.00 0.54
4 0.51 0.52 0.53 0.54 0.00
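To unpack that one-liner, a step-by-step sketch of the same idea:
t = df.mask(df == 0)          # zeros -> NaN so they cannot overwrite real values
t = t.T                       # now holds the upper-triangular values
merged = t.combine_first(df)  # wherever t is NaN or missing, fall back to df
merged = merged.fillna(0)     # cells absent from both frames become 0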
How about just adding the two dataframes?
df3 = df1.add(df2, fill_value=0)
Note that this doubles any cell that is nonzero in both frames, such as the 1 at position (0, 0), so it only matches the expected output when the overlap is all zeros.
