I have the following dataframe from a database download that I cleaned up a bit. Unfortunately, some of the single numbers got split across two columns (e.g. row 9) when they should be one value. I'm trying to merge the two columns while excluding the zero values.
city crashes crashes_1 total_crashes
1 ABERDEEN 710 0 710
2 ACHERS LODGE 1 0 1
3 ACME 1 0 1
4 ADVANCE 55 0 55
5 AFTON 2 0 2
6 AHOSKIE 393 0 393
7 AKERS CENTER 1 0 1
8 ALAMANCE 50 0 50
9 ALBEMARLE 1 58 59
So for row 9 I want to end with:
9 ALBEMARLE 1 58 158
I tried a few snippets but nothing seems to work:
df['total_crashes'] = df['crashes'].astype(str).str.zfill(0) + df['crashes_1'].astype(str).str.zfill(0)
df['total_crashes'] = df['total_crashes'].astype(str).replace('\0', '', regex=True)
df['total_crashes'] = df['total_crashes'].apply(lambda x: ''.join(x[x!=0]))
df['total_crashes'] = df['total_crashes'].str.cat(df['total_crashes'], x[x!=0])
df['total_crashes'] = df.drop[0].sum(axis=1)
Thanks for any help.
You can use a where condition:
df['total_crashes'] = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
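If you want total_crashes to end up numeric again, here is a small follow-up sketch (my assumption: both columns are integer dtype, as in your sample):

combined = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
# "1" + "58" -> "158", "710" + "" -> "710"; cast back to int
df['total_crashes'] = combined.astype(int)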
I'm writing code to merge several dataframes together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dfs, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value from each dataframe. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover the "present" values in a df by directly comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
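If you also want the 0 placeholders and the _df1/_df2 column names from your expected table, one possible sketch (the how="outer", the suffixes and the fillna(0) are my assumptions, not part of the answer above):

import pandas as pd

# keep every Value from either frame, suffix the overlapping Intensity columns,
# and fill missing intensities with 0
df3 = (df1.merge(df2, on="Values", how="outer", suffixes=("_df1", "_df2"))
          .fillna(0))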
I am struggling with a pandas where condition, especially inside a groupby.
I have the following dataframe:
data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()
df
script call_put strike premium t
0 a C 280 10 10
1 a P 260 20 30
2 a P 275 35 65
3 b C 280 38 103
4 b P 285 50 153
I want two additional columns with a running count based on script, call_put and premium > 0. Expected output:
k1 k2
a C 10 1 1 call_put is "C" so the first value should be 1; the k2 column should also be one as call_put "P" is 0
a P 30 1 1 the call_put value is "P", so the second column counts 1
a P 65 1 2 the value is "P" again, so increase the cumulative count by 1
b C 103 1 1 the script value changed; "C" is 1 and "P" = 0, so 1
b P 153 1 1 "C" = 1 and "P" = 1
can you please tell me how to do this?
Based on your explanation, this is what you need:
import numpy as np

df['k1'] = df.loc[df["premium"] > 0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x == 'C'))
df['k2'] = df.loc[df["premium"] > 0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x == 'P'))
Output
script call_put strike premium t k1 k2
a C 280 10 10 1 0
a P 260 20 30 1 1
a P 275 35 65 1 2
b C 280 38 103 1 0
b P 285 50 153 1 1
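An alternative sketch that avoids apply (my own suggestion, assuming df is built as in the question): turn the conditions into 0/1 flags and take a per-script cumulative sum.

# 0/1 flags for "C" and "P" rows with premium > 0, cumulatively summed per script
mask = df['premium'] > 0
df['k1'] = ((df['call_put'].eq('C') & mask).astype(int)
            .groupby(df['script']).cumsum())
df['k2'] = ((df['call_put'].eq('P') & mask).astype(int)
            .groupby(df['script']).cumsum())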
Maybe you need four columns to represent the cumsum, since there are 4 different combinations of script and call_put. The following code does what you described. The count starts from zero here.
import numpy as np
import pandas as pd

data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()
## column cond_col will have the unique combination of script, call_put and premium > 0
df["cond_col"] = df["script"] + "-" + df["call_put"] + "-" + (df["premium"] > 0).astype(str)
## add a new column for each unique combination
for col in np.unique(df["cond_col"]):
    df[col] = df["cond_col"] == col
## do cumsum in each unique combination column
for col in np.unique(df["cond_col"]):
    df[col] = df[col].cumsum()
## maybe the solution you want is up to here
## if you want to combine the columns then you can do the following
df["k1"] = df["a-C-True"].where(df["cond_col"] == "a-C-True", df["b-C-True"])
df["k2"] = df["a-P-True"].where(df["cond_col"] == "a-P-True", df["b-P-True"])
df
Output
script call_put strike premium t cond_col a-C-True a-P-True b-C-True b-P-True k1 k2
0 a C 280 10 10 a-C-True 1 0 0 0 1 0
1 a P 260 20 30 a-P-True 1 1 0 0 0 1
2 a P 275 35 65 a-P-True 1 2 0 0 0 2
3 b C 280 38 103 b-C-True 1 2 1 0 1 0
4 b P 285 50 153 b-P-True 1 2 1 1 1 1
I have a Pandas DataFrame as shown below. What I'm trying to do is partition (or groupby) by BlockID, LineID and WordID, and then within each group use the current WordStartX minus the previous word's (WordStartX + WordWidth) to derive another column, e.g. WordDistance, indicating the distance between this word and the previous word.
This post Row operations within a group of a pandas dataframe is very helpful but in my case multiple columns involved (WordStartX and WordWidth).
BlockID LineID WordID WordStartX WordWidth WordDistance
0 0 0 0 275 150 0
1 0 0 1 431 96 431-(275+150)=6
2 0 0 2 642 90 642-(431+96)=115
3 0 0 3 746 104 746-(642+90)=14
4 1 0 0 273 69 ...
5 1 0 1 352 151 ...
6 1 0 2 510 92
7 1 0 3 647 90
8 1 0 4 752 105
The diff() and shift() functions are usually helpful for calculations referring to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
.apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift()).fillna(0).values)
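An equivalent sketch without apply, in case that reads more easily (assuming df holds the table above): shift the previous word's right edge within each group and subtract.

# previous word's end (WordStartX + WordWidth) within each (BlockID, LineID) group
prev_end = (df['WordStartX'] + df['WordWidth']).groupby([df['BlockID'], df['LineID']]).shift()
df['WordDistance'] = (df['WordStartX'] - prev_end).fillna(0)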
I have a data frame of roughly 6 million rows, which I need to repeatedly analyse for simulations. The following is a very simple representation of the data.
For rows where action=1, I am trying to devise an efficient way to do this:
For index, row in df.iterrows():
    Result = the first next row where (price2 >= row.price1 + 4) and index > row.index
or, if that doesn't exist,
    return index + 100 (i.e. the activity times out).
import pandas as pd
df = pd.DataFrame({'Action(y/n)' : [0,1,0,0,1,0,1,0,0,0], 'Price1' : [1,8,3,1,7,3,8,2,3,1], 'Price2' : [2,1,1,5,3,1,2,11,12,1]})
print(df)
Action(y/n) Price1 Price2
0 0 1 2
1 1 8 1
2 0 3 1
3 0 1 5
4 1 7 3
5 0 3 1
6 1 8 2
7 0 2 11
8 0 3 12
9 0 1 1
Resulting in something like this:
Action(y/n) Price1 Price2 ExitRow(IndexOfRowWhereCriteriaMet)
0 0 14 2 9
1 1 8 1 8
2 0 3 1 102
3 0 1 5 103
4 1 7 3 7
5 0 3 1 105
6 1 8 2 8
7 0 2 11 107
8 0 3 12 108
9 0 1 1 109
I have tried a few methods, which are all really slow.
The best one maps it, but it's really not fast enough.
TimeOut = 100

def ATestFcn(dfIx, dfPrice1):
    ExitRow = df[(df.Price2 > (dfPrice1 + 4)) & (df.index > dfIx) & (df.index <= dfIx + TimeOut)].index.min()
    if pd.isnull(ExitRow):
        return dfIx + TimeOut
    else:
        return ExitRow

df['ExitRow'] = list(map(ATestFcn, df.index, df.Price1))
I also tested this with a loop; it was about 25% slower, but it was essentially the same idea.
I'm thinking there must be a smarter or faster way to do this. A mask could have been useful, except you can't fill down with this data, since the matching price2 for one row might be thousands of rows after the price2 for another row, and I can't find a way to turn a merge into a cross apply like one might in T-SQL.
To find the index of the first row which meets your criterion, you could use
cur_row_idx = 100 # we want the row after 100
next_row_idx = (df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax()
Then, you want to set a cutoff, say the maximum value you can get is the TimeOut - so it could be
next_row_idx = np.min(((df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax(), cur_row_idx + TimeOut))
I did not check the performance on large datasets, but I hope it helps.
If you wish, you also can wrap it into a function:
def ATestFcn(dfIx, df, TimeOut):
return np.min(((df[dfIx:].Price2 >= df[dfIx:].Price1 + 4).argmax() , dfIx + TimeOut))
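For example, you could then fill an ExitRow column only for the action rows; a usage sketch (the column name and TimeOut value are taken from the question, the rest is my assumption):

# call the helper above only for rows where Action(y/n) == 1
TimeOut = 100
action_idx = df.index[df['Action(y/n)'] == 1]
df.loc[action_idx, 'ExitRow'] = [ATestFcn(i, df, TimeOut) for i in action_idx]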
Edit: Just tested it, it is quite fast, see the results below:
df = pd.DataFrame()
Price1 = np.random.randint(100, size=int(10e6))
Price2 = np.random.randint(100, size=int(10e6))
df["Price1"] = Price1
df["Price2"] = Price2
timeit ATestFcn(np.random.randint(int(1e6)), df, 100)
Out[62]: 1 loops, best of 3: 289 ms per loop
I have some data imported from a CSV; to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains 2 variables (group# and height). (It was set up this way for running repeated measures anova in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
low high
split sex group
0 0 95 265
0 0 1 123 54
1 0 120 220
1 1 98 111
1 0 0 150 190
0 1 211 300
1 0 139 86
1 1 132 250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
mean
split sex group level
0 0 group0 High 3
Low 2
group1 High 5
Low 4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
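For example, one possible sketch of a more robust split is a regex extract instead of fixed string positions (the pattern assumes labels like group0Low / group1High):

# capture "groupN" and "Low"/"High" from the stacked label in one pass
parts = stacked['group_level'].str.extract(r'(group\d+)(Low|High)')
stacked['group'] = parts[0]
stacked['level'] = parts[1]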
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266
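If you then want a flat frame to group on further, a short follow-up sketch (continuing from the df above):

# flatten the stacked result, then aggregate low/high per split/sex/group
tidy = df.stack(level='group').reset_index()
tidy.groupby(['split', 'sex', 'group'])[['low', 'high']].mean()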