Having this list:
list_price = ['1800', '5060', '6300', '6800', '10800', '3000', '7100']
how do I categorize the list into the categories (1000, 2000, 3000, 4000, 5000, 6000, 7000, 000)?
example:
2000: 1800
7000: 6800, 6300
And count them: 2000 (1), 7000 (2). If possible, use pandas for the example.
Using rounding to the upper thousand:
list_price = ['1800','5060','6300','6800','10800','3000','7100']
out = (pd.Series(list_price).astype(int)
         .sub(1).floordiv(1000)   # 1800 -> 1 (subtract 1 so exact thousands stay in their own bucket)
         .add(1).mul(1000)        # -> upper thousand: 1800 -> 2000, 6800 -> 7000
         .value_counts()
      )
output:
7000     2
2000     1
6000     1
11000    1
3000     1
8000     1
dtype: int64
Intermediate without value_counts:
0     2000
1     6000
2     7000
3     7000
4    11000
5     3000
6     8000
dtype: int64
I assumed the 000 at the end of the categories means 10000. Try:
s = pd.Series(list_price).astype(int)   # pd.cut needs numeric values, not strings
cut = pd.cut(s, bins=(1000, 2000, 3000, 4000, 5000, 6000, 7000, 10000))
s.groupby(cut).count()
Note that 10800 falls outside the last bin, so it becomes NaN and is dropped from the counts.
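If you also want the members listed under each category, as in the example output (2000: 1800), here is a minimal sketch building on the upper-thousand rounding above:

import pandas as pd

list_price = ['1800', '5060', '6300', '6800', '10800', '3000', '7100']
s = pd.Series(list_price).astype(int)

# same upper-thousand bucketing as above: 1800 -> 2000, 6800 -> 7000
buckets = s.sub(1).floordiv(1000).add(1).mul(1000)

# print each bucket with its members and their count
for bucket, prices in s.groupby(buckets):
    print(f"{bucket}: {', '.join(map(str, prices))} ({len(prices)})")

This prints lines such as 2000: 1800 (1) and 7000: 6300, 6800 (2).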
I have a piece of R code that I am trying to figure out how to reproduce in Python pandas.
It takes a column called INDUST_CODE, checks its value, and assigns a category to a new column according to the range the value falls in. May I ask how I can do something like that in Python, please?
industry_index <- full_table_update %>%
  mutate(industry = case_when(
    INDUST_CODE < 1000 ~ 'Military_service',
    INDUST_CODE < 1500 & INDUST_CODE >= 1000 ~ 'Public_service',
    INDUST_CODE < 2000 & INDUST_CODE >= 1500 ~ 'Private_sector',
    INDUST_CODE >= 2000 ~ 'Others'
  )) %>%
  select(industry)
You can use pandas.cut to organise this into bins in line with your example.
df = pd.DataFrame([500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499], columns=['INDUST_CODE'])
INDUST_CODE
0 500
1 1000
2 1001
3 1560
4 1500
5 2000
6 2300
7 7
8 1499
df['Categories'] = pd.cut(df['INDUST_CODE'], [0, 999, 1499, 1999, 100000], labels=['Military_service', 'Public_service', 'Private_sector', 'Others'])
INDUST_CODE Categories
0 500 Military_service
1 1000 Public_service
2 1001 Public_service
3 1560 Private_sector
4 1500 Private_sector
5 2000 Others
6 2300 Others
7 7 Military_service
8 1499 Public_service
Categories (4, object): [Military_service < Public_service < Private_sector < Others]
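As an aside, numpy.select reads as a close structural translation of R's case_when; a minimal sketch, assuming the same thresholds:

import numpy as np
import pandas as pd

df = pd.DataFrame({'INDUST_CODE': [500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499]})

# conditions are checked top to bottom, so each arm only sees what earlier arms left over
conditions = [
    df['INDUST_CODE'] < 1000,
    df['INDUST_CODE'] < 1500,
    df['INDUST_CODE'] < 2000,
]
choices = ['Military_service', 'Public_service', 'Private_sector']

# anything unmatched falls through to the default, like the final case_when arm
df['industry'] = np.select(conditions, choices, default='Others')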
I have a dataframe with area and price columns and have created a new column of empty lists called compList.
I am using a for loop to populate the compList for each row with the prices of any other houses with the same area value.
The result I am looking for is for data['compList'] to be [] for every row except the first and last, which both have an area of 1500; their compList values should each hold one value, 31000 and 30000 respectively. Instead I am getting [31000, 30000] for every compList value.
What is wrong with my code? I've been racking my brain for two hours trying to figure this out. Your help would be greatly appreciated.
import pandas as pd
import numpy as np
import collections
reqArea = 1200
area = [1500, 500, 1000, 2000, 2500, 1500]
price = [30000, 10000, 20000, 40000, 50000, 31000]
data = pd.DataFrame(list(zip(area,price)), columns = ['area','price'])
data['compList'] = [[]]*len(data['area'])
At this stage my dataframe looks like this:
area price compList
0 1500 30000 []
1 500 10000 []
2 1000 20000 []
3 2000 40000 []
4 2500 50000 []
5 1500 31000 []
Then I process it:
for i in range(len(data['area'])):
    sameArea = []
    sameArea = np.where(data['area'] == data['area'][i])[0]
    if len(sameArea) > 1:
        for j in range(len(sameArea)):
            if sameArea[j] != i:
                data['compList'][i].append(data['price'][sameArea[j]])
            else:
                pass
At the end my dataframe looks like this:
area price compList
0 1500 30000 [31000, 30000]
1 500 10000 [31000, 30000]
2 1000 20000 [31000, 30000]
3 2000 40000 [31000, 30000]
4 2500 50000 [31000, 30000]
5 1500 31000 [31000, 30000]
[[]]*n creates n references to the same list object. When you call data['compList'][i].append(data['price'][sameArea[j]]), you are appending to that single shared list, so every cell of the compList column ends up showing the same contents. Try this:
reqArea = 1200
area = [1500, 500, 1000, 2000, 2500, 1500]
price = [30000, 10000, 20000, 40000, 50000, 31000]
data = pd.DataFrame(list(zip(area,price)), columns = ['area','price'])
data['compList'] = np.empty((len(data), 0)).tolist()  # one distinct empty list per row
Output using the rest of your code is:
area price compList
0 1500 30000 [31000]
1 500 10000 []
2 1000 20000 []
3 2000 40000 []
4 2500 50000 []
5 1500 31000 [30000]
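To see the aliasing in isolation, here is a minimal sketch in plain Python:

# [[]] * 3 repeats a reference to ONE list, not three separate lists
shared = [[]] * 3
shared[0].append(1)
print(shared)      # [[1], [1], [1]] -- all three "copies" changed

# a list comprehension builds a distinct list per element
separate = [[] for _ in range(3)]
separate[0].append(1)
print(separate)    # [[1], [], []]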
I have pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'CATEGORY': [1, 1, 2, 2],
'GROUP': ['A', 'A', 'B', 'B'],
'XYZ': [3000, 2500, 3000, 3000],
'VAL': [3000, 2500, 3000, 3000],
'A_CLASS': [3000, 2500, 3000, 3000],
'B_CAL': [3000, 4500, 3000, 1000],
'C_CLASS': [3000, 2500, 3000, 3000],
'A_CAL': [3000, 2500, 3000, 3000],
'B_CLASS': [3000, 4500, 3000, 500],
'C_CAL': [3000, 2500, 3000, 3000],
'ABC': [3000, 2500, 3000, 3000]})
df
   CATEGORY GROUP   XYZ   VAL  A_CLASS  B_CAL  C_CLASS  A_CAL  B_CLASS  C_CAL   ABC
0         1     A  3000  3000     3000   3000     3000   3000     3000   3000  3000
1         1     A  2500  2500     2500   4500     2500   2500     4500   2500  2500
2         2     B  3000  3000     3000   3000     3000   3000     3000   3000  3000
3         2     B  3000  3000     3000   1000     3000   3000      500   3000  3000
I want the columns in the below order in my final dataframe:
GROUP, CATEGORY, all columns with the suffix "_CAL", all columns with the suffix "_CLASS", then all other fields.
My expected output:
GROUP  CATEGORY  B_CAL  A_CAL  C_CAL  A_CLASS  C_CLASS  B_CLASS   XYZ   VAL   ABC
    A         1   3000   3000   3000     3000     3000     3000  3000  3000  3000
    A         1   4500   2500   2500     2500     2500     4500  2500  2500  2500
    B         2   3000   3000   3000     3000     3000     3000  3000  3000  3000
    B         2   1000   3000   3000     3000     3000      500  3000  3000  3000
Fun with sorted:
first = ['GROUP', 'CATEGORY']
cols = sorted(df.columns.difference(first),
              key=lambda x: (not x.endswith('_CAL'), not x.endswith('_CLASS')))
df[first + cols]
GROUP CATEGORY A_CAL B_CAL C_CAL A_CLASS B_CLASS C_CLASS ABC VAL \
0 A 1 3000 3000 3000 3000 3000 3000 3000 3000
1 A 1 2500 4500 2500 2500 4500 2500 2500 2500
2 B 2 3000 3000 3000 3000 3000 3000 3000 3000
3 B 2 3000 1000 3000 3000 500 3000 3000 3000
XYZ
0 3000
1 2500
2 3000
3 3000
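To see why the key works: it maps each column name to a pair of booleans, and since False sorts before True, names matching a suffix float to the front, _CAL before _CLASS. A quick sketch:

# the key maps each name to (not _CAL, not _CLASS)
key = lambda x: (not x.endswith('_CAL'), not x.endswith('_CLASS'))
print(key('A_CAL'))    # (False, True)  -> sorts first
print(key('A_CLASS'))  # (True, False)  -> sorts second
print(key('XYZ'))      # (True, True)   -> sorts last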
For more details, here's a similar question with a detailed explanation.
You just need to play with the column-name strings:
cols = df.columns
cols_sorted = ["GROUP", "CATEGORY"] +\
[col for col in cols if col.endswith('_CAL')] +\
[col for col in cols if col.endswith('_CLASS')]
cols_sorted += sorted([col for col in cols if col not in cols_sorted])
df = df[cols_sorted]
I have a column in my dataframe comprised of numbers. I'd like another column in the dataframe that takes a running average of the values greater than 0, ideally computed in numpy without iteration (the data is huge).
 Vals   Output
 -350
 1000   1000
 1300   1150
 1600   1300
 1100   1250
 1000   1200
  450   1075
 1900   1192.857143
-2000   1192.857143
-3150   1192.857143
 1000   1168.75
 -900   1168.75
  800   1127.777778
 8550   1870
Code:
vals = [-350, 1000, 1300, 1600, 1100, 1000, 450,
        1900, -2000, -3150, 1000, -900, 800, 8550]
df = pd.DataFrame(vals, columns=['Vals'])  # name the column so df.Vals works below
Option 1
expanding and mean:
df.assign(Out=df.loc[df.Vals.gt(0)].Vals.expanding().mean()).ffill()
If you have other columns in your DataFrame that have NaN values, this method will ffill those too, so if that is a concern, you may want to consider using something like this:
df['Out'] = df.loc[df.Vals.gt(0)].Vals.expanding().mean()
df['Out'] = df.Out.ffill()
Which will only fill in the Out column.
Option 2
mask (using le(0) so zeros are excluded too, matching gt(0) above):
df.assign(Out=df.mask(df.Vals.le(0)).Vals.expanding().mean())
Both of these result in:
Vals Out
0 -350 NaN
1 1000 1000.000000
2 1300 1150.000000
3 1600 1300.000000
4 1100 1250.000000
5 1000 1200.000000
6 450 1075.000000
7 1900 1192.857143
8 -2000 1192.857143
9 -3150 1192.857143
10 1000 1168.750000
11 -900 1168.750000
12 800 1127.777778
13 8550 1870.000000
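Since the question asks for numpy without iteration, here is a cumsum-based sketch of the same running mean in plain numpy:

import numpy as np

vals = np.array([-350, 1000, 1300, 1600, 1100, 1000, 450,
                 1900, -2000, -3150, 1000, -900, 800, 8550], dtype=float)

pos = vals > 0
running_sum = np.cumsum(np.where(pos, vals, 0.0))  # running sum of positives only
running_cnt = np.cumsum(pos)                       # running count of positives

# divide, leaving NaN before the first positive value (count is 0 there)
out = np.divide(running_sum, running_cnt,
                out=np.full_like(vals, np.nan), where=running_cnt > 0)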
I want to group the following data frame df into bins of size 50:
ID FREQ
0 358081 6151
1 431511 952
2 410632 350
3 398149 220
4 177791 158
5 509179 151
6 485346 99
7 536655 50
8 389180 51
9 406622 45
10 410191 112
The result should be this one:
FREQ_BIN QTY_IDs
>200 4
150-200 2
100-150 1
50-100 3
<50 1
How can I do it? Should I use groupby or some other approach?
You could use pd.cut.
df.groupby(pd.cut(df.FREQ,
bins=[-np.inf, 50, 100, 150, 200, np.inf],
right=False)
).size()
right=False ensures that we take half-open intervals [a, b), as your expected output suggests. Unlike with np.digitize, we need to include np.inf in the bins to get the open-ended '<50' and '>200' buckets.
Demo
>>> df.groupby(pd.cut(df.FREQ,
bins=[-np.inf, 50, 100, 150, 200, np.inf],
right=False)
).size()
FREQ
[-inf, 50) 1
[50, 100) 3
[100, 150) 1
[150, 200) 2
[200, inf) 4
dtype: int64
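If you want the exact FREQ_BIN/QTY_IDs layout from the question, pd.cut also accepts labels; a minimal sketch:

labels = ['<50', '50-100', '100-150', '150-200', '>200']
out = (df.groupby(pd.cut(df.FREQ,
                         bins=[-np.inf, 50, 100, 150, 200, np.inf],
                         right=False, labels=labels))
         .size()
         .rename_axis('FREQ_BIN')
         .rename('QTY_IDs')
         .reset_index()
         .iloc[::-1])   # reverse so >200 comes first, as in the question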