Pandas: Using Append Adds New Column and Makes Another All NaN

I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here. Values that weren't NY or CA were summed into an 'Other' category. The years were derived from a normalized list of dates (originally a mix of mm/dd/yyyy and yyyy-mm-dd), in case that's contributing to my issue:
dict = {'Date': pd.to_datetime(my_df.Date).dt.year}
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a default integer index at the beginning, moves my 'Year' index to the end as a regular column, drops its label, and turns all the years in that last column into NaN.
Is there any way I can get this formatted properly? Thank you for your time.

I believe you need to create a Series with sum and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
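Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea needs pd.concat. A minimal sketch of the equivalent:
total = my_df.sum().rename('Total')
final_df = pd.concat([my_df, total.to_frame().T])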
Another solution is to use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another idea from a previous answer: add the parameters margins=True and margins_name='Total' to crosstab (note that df1, dct and the 'Firing' column below come from that answer's original context, not from this question's data):
df1 = df.assign(**dct)
out = (pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total'))
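Adapted to this question's data, the same idea would look roughly like this (a sketch only; the raw 'Year' and 'State' column names are assumptions, since the unaggregated data isn't shown in the question):
import pandas as pd

# hypothetical unaggregated records (column names assumed from the question's table)
raw = pd.DataFrame({'Year': [2003, 2003, 2004, 2005, 2005, 2006],
                    'State': ['NY', 'CA', 'NY', 'NY', 'Other', 'CA']})
# margins=True appends a 'Total' row and column in one step
out = pd.crosstab(raw['Year'], raw['State'], margins=True, margins_name='Total')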

Related

Filtering dataframes based on one column with a different type of other column

I have the following problem:
import pandas as pd

data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see I have
ID duration next
0 420 50 390;50
1 380 40 880;222
2 390 45 520;50
3 540 33 380;111
4 520 19 810;111
5 50 1 22;888
6 22 3 11
Things to notice:
ID type is int
next is a string of numbers separated by ; when there is more than one
I would like to keep only the rows whose next values do not appear anywhere in ID.
For example, in this case:
420 has a follow-up in both 390 and 50, so it is dropped
380 has as next 880 and 222, neither of which is in ID, so this row stays
540 has as next 380 and 111, and while 111 is not in ID, 380 is, so this row is dropped
same with 50
In the end I want to get
1 380 40 880;222
4 520 19 810;111
6 22 3 11
With only one value I used print(df[~df.next.astype(int).isin(df.ID)]), but in this case isin cannot be applied directly.
How can I do this?
Let us try split, then explode, then an isin check. explode keeps the original row index, so groupby(level=0).any() reports, per original row, whether any of its next values matched an ID:
s = df.next.str.split(';').explode().astype(int)
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Out[420]:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
Use a regex with word boundaries for efficiency:
pattern = '|'.join(df['ID'].astype(str))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
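Another option (my sketch, assuming next has no missing values): compare against a plain Python set, which avoids both the explode and the regex:
ids = set(df['ID'].astype(str))
# keep rows where none of the ';'-separated values is a known ID
out = df[df['next'].str.split(';').map(lambda vals: ids.isdisjoint(vals))]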

How to group specific items in a column and calculate the mean

I am trying to figure out how to group specific tag names in a column and then to calculate the mean of the raw data that have the same time. My dataframe looks something like this except with 10000+ rows:
tag_name  time  raw_data
happy        5       300
             8       340
angry        5       315
             8       349
sad          5       400
             8       480
...
I wish to keep the mean in the dataframe table, but I can't figure out how. I found that I can create a pivot table to calculate the mean, but I don't know how to single out the specific tag names I want. Below is what I have so far:
output = pd.pivot_table(data=dataset, index=['Timestamp'], columns=['Tag_Name'], values='Raw_Data', aggfunc='mean')
I am trying to get one of these outputs when I calculate the average of sad and happy:
1. optimal output:
tag_name  time  raw_data  sad_happy_avg
happy        5       300            350
             8       340            410
sad          5       400
             8       480
angry        5       315
             8       349
2. alright output:
tag_name  happy  sad  avg
time
5           300  400  350
8           340  480  410
Try as follows: use Series.isin to keep only "happy" and "sad", and apply df.pivot to get the data in the correct shape. Next, add a column for the mean on axis=1:
res = df[df.tag_name.isin(['happy', 'sad'])].pivot(index='time', columns='tag_name',
                                                   values='raw_data')
res['avg'] = res.mean(axis=1)
print(res)
tag_name happy sad avg
time
5 300 400 350.0
8 340 480 410.0
Your "optimal" output doesn't seem a very logical way to present/store this data, but you can achieve it as follows:
# assuming you have a "standard" index starting `0,1,2` etc.
df['sad_happy_avg'] = df[df.tag_name.isin(['happy','sad'])]\
    .groupby('time')['raw_data'].mean().reset_index(drop=True)
print(df)
tag_name time raw_data sad_happy_avg
0 happy 5 300 350.0
1 happy 8 340 410.0
2 angry 5 315 NaN
3 angry 8 349 NaN
4 sad 5 400 NaN
5 sad 8 480 NaN
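If you'd rather have the average filled in on every happy/sad row instead of only the first two (a variant I'm adding, not part of the original answer), groupby().transform aligns each group's mean back to the original index:
mask = df.tag_name.isin(['happy', 'sad'])
# transform('mean') broadcasts the per-time mean back onto every matching row
df.loc[mask, 'sad_happy_avg'] = df[mask].groupby('time')['raw_data'].transform('mean')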

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, and that condition needs to take input from both the current ("input") row and the row being compared (the "output" row).
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions with a custom function:
def f(x):
    # check boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against all rows; the count is simply the sum of the True values.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
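For larger frames the row-by-row apply gets slow; a NumPy broadcasting sketch (my addition, same logic as the function above) computes all pairwise comparisons at once:
import numpy as np

h = df['height'].to_numpy()
w = df['weight'].to_numpy()
# element (i, j) is True when row j is taller and lighter than row i
taller = h[None, :] > h[:, None]
lighter = w[None, :] < w[:, None]
df['num_heigher_and_leighter'] = (taller & lighter).sum(axis=1)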
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons both taller and heavier like so:
>>> for i, height, weight in zip(df.index, df.height, df.weight):
...     cnt = df.loc[((df.height > height) & (df.weight > weight)), 'height'].count()
...     df.loc[i, 'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a |, or to switch the > to a <.
Once you're more accustomed to pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)

store dictionary in pandas dataframe

I want to store a dictionary in a dataframe:
dictionary_example = {1234: {'choice': 0, 'choice_set': {0: {'A': 100, 'B': 200, 'C': 300}, 1: {'A': 200, 'B': 300, 'C': 300}, 2: {'A': 500, 'B': 300, 'C': 300}}},
                      234: {'choice': 1, 'choice_set': {0: {'A': 100, 'B': 400}, 1: {'A': 100, 'B': 300, 'C': 1000}}},
                      1876: {'choice': 2, 'choice_set': {0: {'A': 100, 'B': 400, 'C': 300}, 1: {'A': 100, 'B': 300, 'C': 1000}, 2: {'A': 600, 'B': 200, 'C': 100}}}}
and turn it into:
id    choice  0_A  0_B  0_C  1_A  1_B   1_C  2_A  2_B  2_C
1234       0  100  200  300  200  300   300  500  300  300
234        1  100  400    -  100  300  1000    -    -    -
1876       2  100  400  300  100  300  1000  600  200  100
I think the following is pretty close. The core idea is simply to convert those dictionaries into JSON and rely on pandas.read_json to parse them.
import json
import pandas as pd

dictionary_example = {
    "1234": {'choice': 0, 'choice_set': {0: {'A': 100, 'B': 200, 'C': 300}, 1: {'A': 200, 'B': 300, 'C': 300}, 2: {'A': 500, 'B': 300, 'C': 300}}},
    "234": {'choice': 1, 'choice_set': {0: {'A': 100, 'B': 400}, 1: {'A': 100, 'B': 300, 'C': 1000}}},
    "1876": {'choice': 2, 'choice_set': {0: {'A': 100, 'B': 400, 'C': 300}, 1: {'A': 100, 'B': 300, 'C': 1000}, 2: {'A': 600, 'B': 200, 'C': 100}}}
}

df = pd.read_json(json.dumps(dictionary_example)).T

def to_s(r):
    # flatten one nested choice_set dict into a Series keyed by (set number, option)
    return pd.read_json(json.dumps(r)).unstack()

flattened_choice_set = df["choice_set"].apply(to_s)
flattened_choice_set.columns = ['_'.join((str(col[0]), col[1])) for col in flattened_choice_set.columns]
result = pd.merge(df, flattened_choice_set,
                  left_index=True, right_index=True).drop("choice_set", axis=1)
result
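On recent pandas versions you can skip the JSON round-trips with pd.json_normalize (a sketch of my own, not the original answer; it emits columns like 0.A, renamed below to match the 0_A convention):
import pandas as pd

df = pd.DataFrame.from_dict(dictionary_example, orient='index')
# flatten the nested dicts; missing entries simply come out as NaN
flat = pd.json_normalize(df['choice_set'].tolist()).set_index(df.index)
flat.columns = [c.replace('.', '_') for c in flat.columns]  # '0.A' -> '0_A'
result = df.drop(columns='choice_set').join(flat)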

conditional cumulative sum based on column comparison in pandas dataframe

I'm relatively new to pandas, and I'm sure there is an easy solution, but I could not figure it out on my own. I have a dataframe of transactions that looks like this:
OrderId Size Price Side TimeSecO TimeUSecO TimeSecOT TimeUSecOT AmountBuy AmountSell
10 100 41.44000000 BUY 1403200077 47720 1403200100 640070
11 100 41.43000000 BUY 1403200077 47979 1403200112 43383
12 100 41.45000000 SELL 1403200077 48311 1403200090 61100
14 100 41.45000000 BUY 1403200092 253793 1403200092 374767
17 100 41.44000000 SELL 1403200103 24382 1403200125 929563
20 100 41.43000000 SELL 1403200116 208057 1403200116 226762
31 100 41.46000000 SELL 1403200214 874124 1403200259 751002
37 100 41.46000000 BUY 1403200278 494827 1403200300 729545
42 100 41.45000000 BUY 1403200335 601039 1403200361 925384
42 100 41.45000000 BUY 1403200335 601039 1403200361 925415
45 500 15.54000000 SELL 1403200365 997248 1403200741 26216
49 100 41.45000000 SELL 1403200375 419253 1403200402 959968
53 100 42.61000000 SELL 1403200377 403525 1403200377 403680
54 100 42.61000000 BUY 1403200377 501636 1403200377 501770
I want to calculate running cumulative sums for each OrderId, split by the Side column into two new columns, CumAmountBuy and CumAmountSell, counting only rows where the current row's TimeSecO > the other row's TimeSecOT.
For example, for the above dataframe the correct cumulative sums for OrderId 10, OrderId 11 and OrderId 12 would be CumAmountBuy = 0 and CumAmountSell = 0, because there are no records in the dataframe where 1403200077 > TimeSecOT.
For OrderId 14, CumAmountBuy = 0, and CumAmountSell = 100, as OrderId 12 has already happened at this point, and it was a Side=SELL, and it fulfilled the requirement of TimeSecO > TimeSecOT (1403200092 > 1403200090).
I can think of a dirty trick but when the dataframe gets huge, I don't think it's efficient.
In [42]: df['flag'] = df.TimeSecO.map(lambda sec: (sec > df.TimeSecOT).values)
In [43]: df['CumAmountBuy'] = df.flag.map(lambda f: np.dot(f,df['Size']*(df['Side']=='BUY')))
In [44]: df['CumAmountSell'] = df.flag.map(lambda f: np.dot(f,df['Size']*(df['Side']=='SELL')))
In [45]: df[['CumAmountBuy','CumAmountSell']]
Out[45]:
CumAmountBuy CumAmountSell
OrderId
10 0 0
11 0 0
12 0 0
14 0 100
17 200 100
20 300 100
31 300 300
37 300 400
42 400 400
42 400 400
45 600 400
49 600 400
53 600 400
54 600 400
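Since the question asks for something more efficient, here is a fully vectorized sketch of the same O(n²) comparison (my addition; it builds the pairwise mask once instead of calling a Python lambda per row):
import numpy as np

# pairwise mask: element (i, j) is True when row i's TimeSecO > row j's TimeSecOT
mask = df['TimeSecO'].to_numpy()[:, None] > df['TimeSecOT'].to_numpy()[None, :]

buy = (df['Size'] * (df['Side'] == 'BUY')).to_numpy()
sell = (df['Size'] * (df['Side'] == 'SELL')).to_numpy()

df['CumAmountBuy'] = mask.astype(np.int64) @ buy
df['CumAmountSell'] = mask.astype(np.int64) @ sell
This is still quadratic in memory; for truly huge frames, sorting by TimeSecOT and using np.searchsorted against cumulative sums of the buy/sell amounts would bring it down to O(n log n).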
