I want to find the starting index and ending index of every chunk of data in the dataset.
The data looks like this:
index     A  wanted_column1  wanted_column2
2000/1/1  0                  0
2000/1/2  1  2000/1/2        1
2000/1/3  1                  1
2000/1/4  1                  1
2000/1/5  0                  0
2000/1/6  1  2000/1/6        2
2000/1/7  1                  2
2000/1/8  1                  2
2000/1/9  0                  0
As shown in the data, index and A are the given columns, and wanted_column1 and wanted_column2 are what I want to get.
The idea is that the data consists of separate continuous chunks. I want to retrieve the starting index of every chunk and keep a running count of how many chunks there are.
I tried to use shift(-1), but with it I could not distinguish a chunk's starting index from its ending index.
Is that what you need?
df['change'] = df['A'].diff().eq(1)  # True on the first row of each chunk
df['wanted_column1'] = df[['index','change']].apply(lambda x: x[0] if x[1] else None, axis=1)  # keep the index only on chunk starts
df['wanted_column2'] = df['change'].cumsum()  # running chunk counter
df['wanted_column2'] = df[['wanted_column2','A']].apply(lambda x: 0 if x[1]==0 else x[0], axis=1)  # zero the counter where A == 0
df.drop('change', axis=1, inplace=True)
That yields:
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
Edit: performance comparison
gehbiszumeis's solution: 19.9 ms
my solution: 4.07 ms
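As a side note (my addition, not part of either answer), the shift idea from the question can also be made to work by comparing A against its shifted copies; a minimal sketch, assuming the data above is in a DataFrame df with columns 'index' and 'A':
# start_mask is True on the first row of each chunk, end_mask on the last row
start_mask = (df['A'] == 1) & (df['A'].shift(1, fill_value=0) == 0)
end_mask = (df['A'] == 1) & (df['A'].shift(-1, fill_value=0) == 0)
df['wanted_column1'] = df['index'].where(start_mask, None)         # starting index of each chunk
df['wanted_column2'] = start_mask.cumsum().where(df['A'] == 1, 0)  # running chunk count, 0 outside chunks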
Assuming your dataframe is df, you can find the indices where df['A'] != 1. The indices just before those are the last indices of a chunk, and the ones just after are the first indices of a chunk. Afterwards you count the number of found indices to calculate the number of data chunks.
import pandas as pd
# Read your data
df = pd.read_csv('my_txt.txt', sep=',')
df['wanted_column1'] = None # creating already dummy columns
df['wanted_column2'] = None
# Find the indices right after each index where 'A' is not 1, unless it is the last value
# of the dataframe
first = [x + 1 for x in df[df['A'] != 1].index.values if x != len(df)-1]
# Find the indices right before each index where 'A' is not 1, unless it is the first value
# of the dataframe
last = [x - 1 for x in df[df['A'] != 1].index.values if x != 0]
# Set the first indices of each chunk at its corresponding position in your dataframe
df.loc[first, 'wanted_column1'] = df.loc[first, 'index']
# You can also set the last indices of each chunk (you only mentioned this in the text,
# not in your expected result). Uncomment for last indices.
# df.loc[last, 'wanted_column1'] = df.loc[last, 'index']
# Count the number of chunks and fill it to wanted_column2
for i in df.index:
    df.loc[i, 'wanted_column2'] = sum(df.loc[:i, 'wanted_column1'].notna())
# Some polishing of the df after to match your expected result
df.loc[df['A'] != 1, 'wanted_column2'] = 0
This gives
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
and works for any length of df and any number of chunks in your data.
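As a follow-up note (my addition, not part of the answer above), the counting loop can also be written as a cumulative sum of the non-null markers, which gives the same wanted_column2:
# equivalent to the loop over df.index above
df['wanted_column2'] = df['wanted_column1'].notna().cumsum()
df.loc[df['A'] != 1, 'wanted_column2'] = 0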
I have a pandas data frame that looks like:
Index Activity
0 0
1 0
2 1
3 1
4 1
5 0
...
1167 1
1168 0
1169 0
I want to count how many times it changes from 0 to 1 and how many times it changes from 1 to 0, but I do not want to count how many 1's or 0's there are.
For example, if I only wanted to count index 0 to 5, the count for 0 to 1 would be one.
How would I go about this? I have tried using some_value
This is a simple approach that can also tell you the index value when the change happens. Just add the index to a list.
c_1to0 = 0
c_0to1 = 0
for i in range(0, df.shape[0]-1):
    if df.iloc[i]['Activity'] == 0 and df.iloc[i+1]['Activity'] == 1:
        c_0to1 += 1
    elif df.iloc[i]['Activity'] == 1 and df.iloc[i+1]['Activity'] == 0:
        c_1to0 += 1
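If the frame is large, a vectorized alternative (my addition, not part of the answer above) based on diff gives the same counts and also the index labels where each change happens:
# same counts, plus the index labels where each change happens
diffs = df['Activity'].diff()
c_0to1 = int((diffs == 1).sum())      # 0 -> 1 transitions
c_1to0 = int((diffs == -1).sum())     # 1 -> 0 transitions
idx_0to1 = df.index[diffs == 1]       # index labels right after a 0 -> 1 change
idx_1to0 = df.index[diffs == -1]      # index labels right after a 1 -> 0 change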
Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater than or equal to the value in col 1 (4).
On the first 4 in col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4; then on the second 4, the next slice values on col 0 would be 1 and 26, so the output would be 1 because only 26 is greater than 4, not 1. I can't use a rolling window, since iterating through rolling-window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.
I have done this using lists instead of doing it in the data frame. Check the code below:
list1, list2 = df[0].values.tolist(), df[1].values.tolist()
outList = []
for ix in range(len(list1)):
    if ix < len(list1) - 2:
        if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
            outList.append(2)
        elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
            outList.append(1)
        else:
            outList.append(0)
    else:
        outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
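A vectorized sketch of the same forward-looking comparison (my addition; it mirrors the loop above, including the zeros for the last two rows, and assumes the integer column labels from the question's example frame, here called data):
nxt1 = data[0].shift(-1)                      # value of column 0 one row ahead
nxt2 = data[0].shift(-2)                      # value of column 0 two rows ahead
counts = (nxt1 > data[1]).astype(int) + (nxt2 > data[1]).astype(int)
data['2_rows_forward_moving_tag'] = counts.where(nxt2.notna(), 0).astype(int)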
My data frame contains 10,000,000 rows! After group by, ~ 9,000,000 sub-frames remain to loop through.
The code is:
data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):   # look at mini df and do some analysis
    pass                                   # some code for each of the small data frames
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame()  # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2,3), drop=True).reset_index()
    x = x.set_index(['level_0','level_1', x.groupby(['level_0','level_1']).cumcount()])
    d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most ids will only have 1 date. This indicates only 1 visit. For ids with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output shape is (id, visits, features).
Here is one way to speed that up. It builds the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code takes > 100 seconds on only 10,000 rows of the sample data. That is a couple of orders of magnitude of improvement.
Code:
def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
            prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Test Code:
import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)
df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
Results:
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach for feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a Dictionary is way faster than iterating over a DataFrame
Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index()  # the original index is (id, date)
# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))
# iterate over the Dictionary and process the values
for key, value in d.items():
    pass  # each value is a Dataframe
# concat the values and get the original (processed) Dataframe back
df2 = pd.concat(d.values(), ignore_index=True)
Modified @Stephen's code
def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0                       # sets all depth values from (1 to n) to 0
        depth = 1                               # initiate from 1, so that the first loop is correct
        prev = None
        accum = []                              # accumulates blocks of data belonging to a given user
        for row in a_df.values.tolist():        # for each row in our dataset
            row[0] = 0                          # NOT SURE
            key = row[1]                        # this is the id of the row
            if key == prev:                     # if this row's id matches the previous row's id, append together
                depth += 1
                accum.append(row)
            else:                               # else if this id is new, the previous block is completed -> process it
                if depth == 0:                  # previous id appeared only once -> get that row from accum
                    yield accum[0]              # also remember that depth = 0
                else:                           # process the block and emit each row
                    depth = 0
                    to_emit = []                # prepare the list to emit
                    for i in range(len(accum)): # for each unique day in the accumulated list
                        date = accum[i][2]      # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j      # define the depth
                            to_emit[-1][2] = date   # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
            prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Testing:
t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,3,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304], 'feature':[10,20,45,1,14,15,20,20,13,11],'result':[1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print(t)
id date feature result
0 1 20180311 10 1
1 1 20180310 20 1
2 1 20180210 45 0
3 1 20170505 1 0
4 2 20180312 14 0
5 2 20180311 15 0
6 3 20180312 20 1
7 3 20180311 20 0
8 4 20170501 13 1
9 5 20180304 11 1
Output
depth feature result
id date
1 20180311 0 10 1
20180311 1 20 1
20180311 2 45 0
20180311 3 1 0
20180310 0 20 1
20180310 1 45 0
20180310 2 1 0
20180210 0 45 0
20180210 1 1 0
20170505 0 1 0
2 20180312 0 14 0
20180312 1 15 0
20180311 0 15 0
3 20180312 0 20 1
20180312 1 20 0
20180311 0 20 0
4 20170501 0 13 1
I am iterating over a pandas dataframe and finding it to be extremely slow. I understand that in pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal is whatever it was previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here involves dividing your series into groups based on the inequality between A and B, and every group's value is determined by A.
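To make the grouping concrete, here is what the group ids look like on the example input (my illustration, assuming df holds the example data; I use transform('first') only to show the per-group value, while the answer itself uses transform('head', 1)):
# group ids on the example input, and the first A value within each group
groups = (df.A != df.B).cumsum()
print(groups.tolist())
# [1, 2, 2, 2, 3, 3, 4, 4, 5, 6]
print(df.A.groupby(groups).transform('first').tolist())
# [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]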
You don't need to iterate at all; you can do some Boolean indexing:
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata.signal.fillna(method='ffill',inplace=True)
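For reference, a roughly equivalent sketch with numpy (my addition, not part of this answer), which also makes explicit that the state defaults to 0 before the first signal:
import numpy as np

# start from NaN, set 1 where A fires, 0 where B fires, then carry the last state forward
state = np.select([indata.A == 1, indata.B == 1], [1.0, 0.0], default=np.nan)
indata['signal'] = pd.Series(state, index=indata.index).ffill().fillna(0).astype(int)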
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
import numpy as np

def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series even though that is not optimal.
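For completeness, the series-based variant mentioned above might look roughly like this (my reconstruction; the original code for it was not shown):
import numpy as np
import pandas as pd

def signals4(indata):
    a = indata['A']            # pull the columns out as Series once
    b = indata['B']
    numrows = len(indata)
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if a.iloc[i] == 1:
            out_signal[i] = 1
        elif b.iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i - 1]
    return pd.DataFrame({'A': a, 'B': b, 'signal': out_signal})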
I have a dataframe that has over 200 columns of dummy variables:
Row1 Feature1 Feature2 Feature3 Feature4 Feature5
A 0 1 1 1 0
B 0 0 1 1 1
C 1 0 1 0 1
D 0 1 0 1 0
I want to iterate to separate each feature and create three extra dataframes: df1 keeps only the first feature that equals 1 as 1 and changes all later columns to 0, df2 keeps only the second feature that equals 1 as 1 and changes all previous and later columns to 0, and so on.
I have written code that does it, but I figure there must be a better way. Please help me with a more efficient approach. Thank you!
Below is my code:
for index, row in hcit1.iterrows():
    for i in range(1, 261):
        title = "feature" + str(i)
        if int(row[title]) == 1:
            for j in range(i+1, 261):
                title2 = "feature" + str(j)
                hcit1.loc[index, title2] = 0
        else:
            pass

for index, row in hcit2.iterrows():
    for i in range(1, 261):
        title = "feature" + str(i)
        if int(row[title]) == 1:
            for j in range(i+1, 261):
                title2 = "feature" + str(j)
                if row[title2] == 1:
                    for k in range(j+1, 261):
                        title3 = "feature" + str(k)
                        hcit1.loc[index, title3] = 0
            hcit1.loc[index, title] = 0
        else:
            pass

for index, row in hcit3.iterrows():
    for i in range(1, 261):
        title = "feature" + str(i)
        if int(row[title]) == 1:
            for j in range(i+1, 261):
                title2 = "feature" + str(j)
                if row[title2] == 1:
                    for k in range(j+1, 261):
                        title3 = "feature" + str(k)
                        if row[title3] == 1:
                            for l in range(k+1, 261):
                                title4 = "feature" + str(l)
                                hcit1.loc[index, title4] = 0
                    hcit1.loc[index, title2] = 0
            hcit1.loc[index, title] = 0
        else:
            pass

for index, row in hcit4.iterrows():
    for i in range(1, 261):
        title = "feature" + str(i)
        if int(row[title]) == 1:
            for j in range(i+1, 261):
                title2 = "feature" + str(j)
                if row[title2] == 1:
                    for k in range(j+1, 261):
                        title3 = "feature" + str(k)
                        if row[title3] == 1:
                            for l in range(k+1, 261):
                                title4 = "feature" + str(l)
                                if row[title4] == 1:
                                    for m in range(l+1, 261):
                                        title5 = "feature" + str(m)
                                        hcit1.loc[index, title5] = 0
                            hcit1.loc[index, title3] = 0
                    hcit1.loc[index, title2] = 0
            hcit1.loc[index, title] = 0
        else:
            pass
Here:
df1 = df[df['Feature1'] == 1].copy()
df1.iloc[:, :] = 0
df1.loc[:, 'Feature1'] = 1

df2 = df[df['Feature2'] == 1].copy()
df2.iloc[:, :] = 0
df2.loc[:, 'Feature2'] = 1

df3 = df[df['Feature3'] == 1].copy()
df3.iloc[:, :] = 0
df3.loc[:, 'Feature3'] = 1
That should be what you are looking for.
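As a side note (my addition, not part of the answer above), the "keep only the n-th 1 per row" operation can also be done without iterating, using a row-wise cumulative sum; a sketch, assuming the dataframe is called df and its feature columns are named as in the sample table (the real data apparently uses lowercase 'feature1' ... 'feature260', so adjust the prefix accordingly):
import pandas as pd

def keep_nth_one(frame, n, cols):
    """Return a copy of frame where, in cols, only the n-th 1 of each row is kept."""
    ranks = frame[cols].cumsum(axis=1)                 # running count of 1s per row
    out = frame.copy()
    out[cols] = ((frame[cols] == 1) & (ranks == n)).astype(int)
    return out

feature_cols = [c for c in df.columns if c.startswith('Feature')]
df1 = keep_nth_one(df, 1, feature_cols)   # only the first 1 per row survives
df2 = keep_nth_one(df, 2, feature_cols)   # only the second 1 per row survives
df3 = keep_nth_one(df, 3, feature_cols)   # only the third 1 per row survives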