Python - Running Average If number is greater than 0 - python

I have a column in my dataframe comprised of numbers. I'd like another column in the dataframe that takes a running average of the values greater than 0, ideally computed with numpy/pandas without iteration (the data is huge).
Vals    Output
-350
1000    1000
1300    1150
1600    1300
1100    1250
1000    1200
450     1075
1900    1192.857143
-2000   1192.857143
-3150   1192.857143
1000    1168.75
-900    1168.75
800     1127.777778
8550    1870
Code (note: the original used list as the variable name, which shadows the builtin, and pd.DataFrame(data=list) would name the column 0 rather than Vals as the answers below assume):
import pandas as pd

vals = [-350, 1000, 1300, 1600, 1100, 1000, 450,
        1900, -2000, -3150, 1000, -900, 800, 8550]
df = pd.DataFrame({'Vals': vals})

Option 1
expanding and mean
df.assign(out=df.loc[df.Vals.gt(0)].Vals.expanding().mean()).ffill()
If other columns in your DataFrame have NaN values, this method will ffill those too. If that is a concern, consider something like this instead:
df['Out'] = df.loc[df.Vals.gt(0)].Vals.expanding().mean()
df['Out'] = df.Out.ffill()
Which will only fill in the Out column.
Option 2
mask:
df.assign(Out=df.mask(df.Vals.lt(0)).Vals.expanding().mean())
Both of these result in:
Vals Out
0 -350 NaN
1 1000 1000.000000
2 1300 1150.000000
3 1600 1300.000000
4 1100 1250.000000
5 1000 1200.000000
6 450 1075.000000
7 1900 1192.857143
8 -2000 1192.857143
9 -3150 1192.857143
10 1000 1168.750000
11 -900 1168.750000
12 800 1127.777778
13 8550 1870.000000
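Put together, a minimal runnable version of the two-step variant of Option 1 looks like this (column names as in the question):

```python
import pandas as pd

vals = [-350, 1000, 1300, 1600, 1100, 1000, 450,
        1900, -2000, -3150, 1000, -900, 800, 8550]
df = pd.DataFrame({'Vals': vals})

# Expanding mean over the positive values only, then forward-fill
# the gaps left by the negative rows
df['Out'] = df.loc[df['Vals'].gt(0), 'Vals'].expanding().mean()
df['Out'] = df['Out'].ffill()
print(df)
```

The leading -350 stays NaN because no positive value precedes it; you could follow with fillna(0) if a default is preferable.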

Related

How To Categorize a List

Having this list:
list_price = ['1800','5060','6300','6800','10800','3000','7100']
how do I categorize the list into (1000, 2000, 3000, 4000, 5000, 6000, 7000, 000)?
example:
2000: 1800
7000:6800, 6300
And count them 2000(1),7000(2), if possible using pandas as an example.
Using rounding to the upper thousand:
list_price = ['1800','5060','6300','6800','10800','3000','7100']
out = (pd.Series(list_price).astype(int)
         .sub(1).floordiv(1000)
         .add(1).mul(1000)
         .value_counts()
      )
output:
7000     2
2000     1
6000     1
11000    1
3000     1
8000     1
dtype: int64
Intermediate without value_counts:
0     2000
1     6000
2     7000
3     7000
4    11000
5     3000
6     8000
dtype: int64
I assumed 000 at the end of the categories is 10000. Try (the prices must be converted to integers first, since pd.cut cannot bin strings):
prices = pd.Series(list_price).astype(int)
cut = pd.cut(prices, bins=(1000, 2000, 3000, 4000, 5000, 6000, 7000, 10000))
prices.groupby(cut).count()
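Putting the upper-thousand approach together, a runnable sketch that also prints the counts in the question's category(count) style (the final formatting line is my own addition, under the assumption that the buckets are whole thousands):

```python
import pandas as pd

list_price = ['1800', '5060', '6300', '6800', '10800', '3000', '7100']

# Round each price up to the next thousand, then count per bucket
upper = (pd.Series(list_price).astype(int)
           .sub(1).floordiv(1000)
           .add(1).mul(1000))
counts = upper.value_counts().sort_index()

# Format as "category(count)", e.g. 7000(2)
print(', '.join(f'{cat}({n})' for cat, n in counts.items()))
```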

Pandas: Select row pairs based on specific combination of strings in one column

I'm fairly new to python/pandas and have struggled to find an example specific enough for me to work with.
Say I have the following pandas dataframe, consisting of a column of event markers and a column displaying the time each marker was presented:
df = pd.DataFrame({'Marker': ['S200', 'S4', 'S44', 'Tone', 'S200', 'S1', 'S44', 'Tone'],
'Time': [0, 100, 150, 230, 300, 340, 380, 400]})
Marker Time
0 S200 0
1 S4 100
2 S44 150
3 Tone 230
4 S200 300
5 S1 340
6 S44 380
7 Tone 400
I would like to extract pairs of rows where S44 is followed by a Tone. The resulting output should be:
newdf = pd.DataFrame({'Marker': ['S44', 'Tone', 'S44', 'Tone'],
'Time': [150, 230, 380, 400]})
Marker Time
0 S44 150
1 Tone 230
2 S44 380
3 Tone 400
Any ideas would be appreciated!
One way to go about it is to use shift to get the indexes, add 1 and pull with loc - note that this assumes that the index is numeric and monotonic increasing:
index = df.loc[df.Marker.shift(-1).eq('Tone') & (df.Marker.eq('S44'))].index
df.loc[index.union(index +1)]
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400
Another way:
s = df.Marker.eq('S44') & df.Marker.shift(-1).eq('Tone')
df = df[s | s.shift(fill_value=False)]
OUTPUT:
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400
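The second approach can be made fully self-contained like this (reset_index is my addition, to match the 0..3 index of newdf in the question; fill_value=False keeps the shifted mask boolean):

```python
import pandas as pd

df = pd.DataFrame({'Marker': ['S200', 'S4', 'S44', 'Tone', 'S200', 'S1', 'S44', 'Tone'],
                   'Time': [0, 100, 150, 230, 300, 340, 380, 400]})

# Mark rows where S44 is immediately followed by Tone,
# then keep both members of each pair
s = df['Marker'].eq('S44') & df['Marker'].shift(-1).eq('Tone')
newdf = df[s | s.shift(fill_value=False)].reset_index(drop=True)
print(newdf)
```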

Assign a category, according to range of the value as a new column, python

I have a piece of R code that I am trying to translate to Python pandas.
It takes a column called INDUST_CODE and checks its value to assign a category, according to the range of the value, as a new column. May I ask how I can do something like that in Python, please?
industry_index <- full_table_update %>%
  mutate(industry = case_when(
    INDUST_CODE < 1000 ~ 'Military_service',
    INDUST_CODE < 1500 & INDUST_CODE >= 1000 ~ 'Public_service',
    INDUST_CODE < 2000 & INDUST_CODE >= 1500 ~ 'Private_sector',
    INDUST_CODE >= 2000 ~ 'Others'
  )) %>%
  select(industry)
You can use pandas.cut to organise this into bins in line with your example.
df = pd.DataFrame([500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499], columns=['INDUST_CODE'])
INDUST_CODE
0 500
1 1000
2 1001
3 1560
4 1500
5 2000
6 2300
7 7
8 1499
df['Categories'] = pd.cut(df['INDUST_CODE'], [0, 999, 1499, 1999, 100000], labels=['Military_service', 'Public_service', 'Private_sector', 'Others'])
INDUST_CODE Categories
0 500 Military_service
1 1000 Public_service
2 1001 Public_service
3 1560 Private_sector
4 1500 Private_sector
5 2000 Others
6 2300 Others
7 7 Military_service
8 1499 Public_service
Categories (4, object): [Military_service < Public_service < Private_sector < Others]
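If you would rather keep the explicit case_when-style conditions from the R code, numpy.select is a close analogue; a sketch, with the industry column name taken from the R snippet and conditions checked in order so the first match wins, just like case_when:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499],
                  columns=['INDUST_CODE'])

# Conditions are evaluated top to bottom; the first True one wins,
# so the later ones don't need explicit lower bounds
conditions = [df['INDUST_CODE'] < 1000,
              df['INDUST_CODE'] < 1500,
              df['INDUST_CODE'] < 2000]
choices = ['Military_service', 'Public_service', 'Private_sector']
df['industry'] = np.select(conditions, choices, default='Others')
print(df)
```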

Pandas: Using Append Adds New Column and Makes Another All NaN

I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. If they weren't NY or CA, in the example, I summed them and put them in an 'Other' category. The years were made from a normalized list of dates (originally mm/dd/yyyy and yyyy-mm-dd) like so, in case this is contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005

store dictionary in pandas dataframe

I want to store a dictionary in a data frame
dictionary_example = {1234: {'choice': 0, 'choice_set': {0: {'A': 100, 'B': 200, 'C': 300}, 1: {'A': 200, 'B': 300, 'C': 300}, 2: {'A': 500, 'B': 300, 'C': 300}}},
                      234: {'choice': 1, 'choice_set': {0: {'A': 100, 'B': 400}, 1: {'A': 100, 'B': 300, 'C': 1000}}},
                      1876: {'choice': 2, 'choice_set': {0: {'A': 100, 'B': 400, 'C': 300}, 1: {'A': 100, 'B': 300, 'C': 1000}, 2: {'A': 600, 'B': 200, 'C': 100}}}
                     }
and turn it into this shape:
id choice 0_A 0_B 0_C 1_A 1_B 1_C 2_A 2_B 2_C
1234 0 100 200 300 200 300 300 500 300 300
234 1 100 400 - 100 300 1000 - - -
1876 2 100 400 300 100 300 1000 600 200 100
I think the following is pretty close; the core idea is simply to convert those dictionaries into JSON and rely on pandas.read_json to parse them.
dictionary_example={
"1234":{'choice':0,'choice_set':{0:{'A':100,'B':200,'C':300},1:{'A':200,'B':300,'C':300},2:{'A':500,'B':300,'C':300}}},
"234":{'choice':1,'choice_set':{0:{'A':100,'B':400},1:{'A':100,'B':300,'C':1000}}},
"1876":{'choice':2,'choice_set':{0:{'A': 100,'B':400,'C':300},1:{'A':100,'B':300,'C':1000},2:{'A':600,'B':200,'C':100}}}
}
import json
import pandas as pd

df = pd.read_json(json.dumps(dictionary_example)).T

def to_s(r):
    return pd.read_json(json.dumps(r)).unstack()

flattened_choice_set = df["choice_set"].apply(to_s)
flattened_choice_set.columns = ['_'.join((str(col[0]), col[1])) for col in flattened_choice_set.columns]
result = pd.merge(df, flattened_choice_set,
                  left_index=True, right_index=True).drop("choice_set", axis=1)
result
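On pandas 1.0+, pd.json_normalize can do the flattening more directly. A sketch, under the assumption that stripping the choice_set_ prefix from the generated column names is acceptable to match the 0_A-style columns above:

```python
import pandas as pd

dictionary_example = {
    1234: {'choice': 0, 'choice_set': {0: {'A': 100, 'B': 200, 'C': 300},
                                       1: {'A': 200, 'B': 300, 'C': 300},
                                       2: {'A': 500, 'B': 300, 'C': 300}}},
    234: {'choice': 1, 'choice_set': {0: {'A': 100, 'B': 400},
                                      1: {'A': 100, 'B': 300, 'C': 1000}}},
    1876: {'choice': 2, 'choice_set': {0: {'A': 100, 'B': 400, 'C': 300},
                                       1: {'A': 100, 'B': 300, 'C': 1000},
                                       2: {'A': 600, 'B': 200, 'C': 100}}},
}

# Flatten each nested record; keys like choice_set -> 0 -> A become
# columns named choice_set_0_A, and missing keys come out as NaN
result = pd.json_normalize(list(dictionary_example.values()), sep='_')
result.index = list(dictionary_example.keys())
result.columns = [c.replace('choice_set_', '') for c in result.columns]
print(result)
```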
