Python DataFrame: categorize values

I have data coming from the field and I want to categorize it into bins of a fixed width. I want bins of width 100, i.e. 0-100, 100-200, 200-300, and so on.
My code:
df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
   value
0    112
1    341
2    234
3     78
4    154
Expected answer:
   value value_range
0    112     100-200
1    341     300-400
2    234     200-300
3     78       0-100
4    154     100-200
My attempt:
df['value_range'] = df['value'].apply(lambda x:[a,b] if x>a and x<b for a,b in zip([0,100,200,300,400],[100,200,300,400,500]))
This raises:
SyntaxError: invalid syntax

You can use pd.cut:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400], labels=['0-100', '100-200', '200-300', '300-400'])
print(df)
Prints:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
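Note that pd.cut treats bin edges as right-closed by default, so a value of exactly 100 would land in '0-100', and a value of exactly 0 would fall outside every bin (becoming NaN). If you want left-closed bins instead, a minimal tweak:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400],
                           labels=['0-100', '100-200', '200-300', '300-400'],
                           right=False)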

You can also use pd.IntervalIndex.from_tuples. Set the tuple boundaries to the ranges you want and pass the resulting index as the bins to pd.cut:
df = pd.DataFrame([112,341,234,78,154],columns=['value'])
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 200), (200, 300), (300, 400)])
df['value_range'] = pd.cut(df['value'], bins)
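With an IntervalIndex the labels come out as intervals like (100, 200] rather than strings. If you would rather have plain string labels like in the first answer, one option (a sketch, not the only way) is to format the intervals afterwards:
df['value_range'] = (pd.cut(df['value'], bins).astype(str)
                     .str.strip('(]').str.replace(', ', '-'))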

Related

Filtering a dataframe based on one column with a different type than another column

I have the following problem:
import pandas as pd

data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see I have:
    ID  duration     next
0  420        50   390;50
1  380        40  880;222
2  390        45   520;50
3  540        33  380;111
4  520        19  810;111
5   50         1   22;888
6   22         3       11
Things to notice:
ID is of type int
next is a string of numbers separated by ; when there is more than one
I would like to keep only the rows where none of the numbers in next appear in ID.
For example, in this case:
420 has a follow-up in both 390 and 50, so it is dropped
380 has as next 880 and 222, both of which are not in ID, so it stays
540 has as next 380 and 111, and while 111 is not in ID, 380 is, so it is dropped
same with 50
In the end I want to get:
   ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
With only one value per row I used print(df[~df.next.astype(int).isin(df.ID)]), but here isin cannot be applied directly.
How can I do this?
Try split, then explode, then an isin check:
s = df.next.str.split(';').explode().astype(int)
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Output:
   ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
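For reference, the intermediate series s holds one integer per entry of next, still indexed by the original row label, which is why groupby(level=0).any() collapses the membership test back into one boolean per original row:
print(s)
0    390
0     50
1    880
1    222
2    520
2     50
3    380
3    111
4    810
4    111
5     22
5    888
6     11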
Use a regex with word boundaries for efficiency:
pattern = '|'.join(df['ID'].astype(str))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
   ID  duration     next
1  380        40  880;222
4  520        19  810;111
6   22         3       11
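One caveat with the regex approach: it works here because the IDs are plain integers. If the values could ever contain regex metacharacters, escape them when building the pattern:
import re
pattern = '|'.join(map(re.escape, df['ID'].astype(str)))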

Pandas: Select row pairs based on specific combination of strings in one column

I'm fairly new to python/pandas and have struggled to find an example specific enough for me to work with.
Say I have the following pandas dataframe, consisting of a column of event markers and a column displaying the time each marker was presented:
df = pd.DataFrame({'Marker': ['S200', 'S4', 'S44', 'Tone', 'S200', 'S1', 'S44', 'Tone'],
                   'Time': [0, 100, 150, 230, 300, 340, 380, 400]})
  Marker  Time
0   S200     0
1     S4   100
2    S44   150
3   Tone   230
4   S200   300
5     S1   340
6    S44   380
7   Tone   400
I would like to extract pairs of rows where S44 is followed by a Tone. The resulting output should be:
newdf = pd.DataFrame({'Marker': ['S44', 'Tone', 'S44', 'Tone'],
                      'Time': [150, 230, 380, 400]})
  Marker  Time
0    S44   150
1   Tone   230
2    S44   380
3   Tone   400
Any ideas would be appreciated!
One way to go about it is to use shift to find the indexes of the S44 rows whose next row is a Tone, add 1, and pull both sets with loc. Note that this assumes the index is numeric and monotonically increasing:
index = df.loc[df.Marker.shift(-1).eq('Tone') & df.Marker.eq('S44')].index
df.loc[index.union(index + 1)]
  Marker  Time
2    S44   150
3   Tone   230
6    S44   380
7   Tone   400
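For reference, on this data the intermediate index values are:
index                    # the S44 rows: [2, 6]
index + 1                # the Tone rows that follow: [3, 7]
index.union(index + 1)   # [2, 3, 6, 7]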
Another way:
s = df.Marker.eq('S44') & df.Marker.shift(-1).eq('Tone')
df = df[s | s.shift(fill_value=False)]
Output:
  Marker  Time
2    S44   150
3   Tone   230
6    S44   380
7   Tone   400
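For reference, on this data the two masks combine like this:
s                              # True at rows 2 and 6 (S44 followed by Tone)
s.shift(fill_value=False)      # True at rows 3 and 7 (the Tone rows)
s | s.shift(fill_value=False)  # True at rows 2, 3, 6, 7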

How can I group multiple columns and sum the last one?

I have this problem which I've been trying to solve:
I want the code to take this DataFrame and group multiple columns based on the most frequent number and sum the values on the last column. For example:
df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
      A    B    C   D
0  1000  380  380  45
1  1000  380  380  32
2  1000  270  270  67
3  1000  270  270  89
4  1000  270  270  51
5   200   45   88  90
6   200   45   88  90
7   500   45   88  90
8   500   55   88  90
I would like the code to produce the result below:
      A    B    C    D
0  1000  380  380  284
1  1000  380  380  284
2  1000  270  270  284
3  1000  270  270  284
4  1000  270  270  284
5   200   45   88  360
6   200   45   88  360
7   500   45   88  360
8   500   55   88  360
Notice that the most frequent value in the first five rows is 1000, so I group by column 'A' and get the sum 284 in column 'D'. In the last four rows, however, the most frequent value, 88, is not in column 'A' but in column 'C'; there I want to group by column 'C' and sum column 'D' to get 360. I hope that is clear.
I tried df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not produce the result shown above.
Is there a pandas-style way of solving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    return df.groupby(col).D.transform(func)

ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')

conditions = [
    (ga > gb) & (ga > gc),
    (gb > ga) & (gb > gc),
    (gc > ga) & (gc > gb),
]
choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]

df['D'] = np.select(conditions, choices)
df
Output
      A    B    C    D
0  1000  380  380  284
1  1000  380  380  284
2  1000  270  270  284
3  1000  270  270  284
4  1000  270  270  284
5   200   45   88  360
6   200   45   88  360
7   500   45   88  360
8   500   55   88  360
Explanation
Since we need to group by whichever of columns 'A', 'B' or 'C' contains the most frequent value, we first compute, for each row, the size of that row's group in each column, storing the results in ga, gb and gc for columns A, B and C respectively.
conditions checks which column has the largest group for each row.
choices holds the corresponding grouped sums of column 'D'.
np.select works like an if-elif-else chain: for each row it returns the entry of choices whose condition is true.
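One caveat: all three conditions use strict >, so if two columns tie for the largest group, no condition holds and np.select falls back to its default of 0. If a tie should fall back to, say, column A's grouped sum (a hypothetical tie-break, not something the question specifies), a minimal sketch:
# rows where no strict winner exists among A, B, C
tie = ~(conditions[0] | conditions[1] | conditions[2])
fallback = get_count_sum('A', 'sum')   # compute before D is overwritten
df['D'] = np.select(conditions, choices)
df.loc[tie, 'D'] = fallback[tie]       # hypothetical tie-break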

How to calculate the value of a column in a dataframe based on values/counts of other columns in python?

I have a pandas dataframe with data for all 24 hours of the day for a whole month, with the following fields:
(df1): date, hour, mid, rid, percentage, total
I need to create a second dataframe from it with the following fields:
(df2): date, hour, mid, rid, hour_total
hour_total is to be calculated as follows:
For each combination of (date, mid, rid) in df1, if the count of records where df1.percentage is 0 equals 24, then hour_total = df1.total / 24; otherwise hour_total = (df1.percentage / 100) * total.
For example, if dataframe 1 is as below (the count of records with perc equal to 0 for the group of (date, mid, rid) is 24):
date,hour,mid,rid,perc,total
2019-10-31,0,2,0,0,3170.87
2019-10-31,1,2,0,0,3170.87
2019-10-31,2,2,0,0,3170.87
2019-10-31,3,2,0,0,3170.87
2019-10-31,4,2,0,0,3170.87
.
.
2019-10-31,23,2,0,0,3170.87
then dataframe 2 should be (hour_total = df1.total / 24):
date,hour,mid,rid,hour_total
2019-10-31,0,2,0,132.12
2019-10-31,1,4,0,132.12
2019-10-31,2,13,0,132.12
2019-10-31,3,17,0,132.12
2019-10-31,4,7,0,132.12
.
.
2019-10-31,23,27,0,132.12
How can I accomplish this?
You can try the apply function.
For example:
import numpy as np
import pandas as pd
from datetime import datetime

a = np.random.randint(100, 200, size=5)
b = np.random.randint(100, 200, size=5)
c = [datetime.now() for x in range(100) if x % 20 == 0]
df1 = pd.DataFrame({'Time': c, 'A': a, 'B': b})
The above data frame looks like this:
                        Time    A    B
0 2019-10-24 20:37:38.907058  158  190
1 2019-10-24 20:37:38.907058  161  127
2 2019-10-24 20:37:38.908056  100  100
3 2019-10-24 20:37:38.908056  163  164
4 2019-10-24 20:37:38.908056  121  159
Now suppose we want to compute a new column whose value depends on the other columns of the row. You can define a function that does this computation:
def func(x):
    # x is one row (a Series); access values by column label
    t = x['Time']  # time (unused in this example)
    a = x['A']
    b = x['B']
    return a + b
and apply it to the data frame row-wise:
df1["new_col"] = df1.apply(func, axis=1)
which yields the following result:
                        Time    A    B  new_col
0 2019-10-24 20:37:38.907058  158  190      348
1 2019-10-24 20:37:38.907058  161  127      288
2 2019-10-24 20:37:38.908056  100  100      200
3 2019-10-24 20:37:38.908056  163  164      327
4 2019-10-24 20:37:38.908056  121  159      280
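For the original question itself, a minimal sketch of the hour_total rule using groupby/transform rather than a row-wise apply (assuming df1 really has the columns date, hour, mid, rid, percentage, total described above):
import numpy as np

# per (date, mid, rid) group, count the records where percentage == 0
zero_count = (df1['percentage'].eq(0)
                 .groupby([df1['date'], df1['mid'], df1['rid']])
                 .transform('sum'))

df2 = df1[['date', 'hour', 'mid', 'rid']].copy()
df2['hour_total'] = np.where(zero_count.eq(24),
                             df1['total'] / 24,
                             df1['percentage'] / 100 * df1['total'])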

Filter pandas dataframe based on another column

This might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this:
Volumes
Index  ProductA  ProductB  ProductC  ProductD  Limit
0      100       300       400       78        100
1      110       370       20        30        100
2      90        320       200       121       100
3      150       320       410       99        100
....
Prices
Index  ProductA  ProductB  ProductC  ProductD  Limit
0      50        110       30        90        0
1      51        110       29        99        0
2      49        120       25        88        0
3      51        110       22        96        0
....
I want to assign 0 to the cells of the Prices dataframe whose corresponding Volumes value is less than the value in the Limit column,
so the ideal output would be:
Prices
Index  ProductA  ProductB  ProductC  ProductD  Limit
0      50        110       30        0         0
1      51        110       0         0         0
2      0         120       25        88        0
3      51        110       22        0         0
....
I tried
import pandas as pd
import numpy as np

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96], 'Limit': [0] * 4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99], 'Limit': [100] * 4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit] = 0
but I do not obtain any changes to the Prices dataframe. I'm clearly having a hard time understanding boolean slicing; any help would be great.
The problem is in
Prices[Volumes > Volumes.Limit] = 0
Comparing a DataFrame with a Series aligns the Series index with the DataFrame columns, so this mask never matches anything. Since Limit is defined per row, you can use apply row-wise instead (note the comparison is x < x.Limit, since you want to zero out prices whose volume is below the limit):
Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
You can also use a boolean mask to solve this. I am not an expert either, but this solution does what you want (using .loc, since .ix has been removed from recent pandas):
test = Volumes.loc[:, 'ProductA':'ProductD'] >= Volumes.Limit.values
final = Prices[test].fillna(0)
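For completeness, a more idiomatic sketch without apply: DataFrame.lt with axis=0 compares every product column against the per-row Limit series in one vectorized step (assuming, as above, that you want to zero prices whose volume is below the limit):
mask = Volumes.drop(columns='Limit').lt(Volumes['Limit'], axis=0)
Prices[mask] = 0
Columns missing from the mask (here Limit) are left untouched.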
