I have a DataFrame that looks like this:
class  passed  failed  extra_teaching
A11    1       2       0.5
A12    2       1       0.7
I want to 'unravel' the DataFrame, losing the information about the class but keeping the information on extra_teaching, so I end up with a row for each individual pupil and a flag for whether they passed.
So the DataFrame should end up looking like this:
pass  extra_teaching
1     0.5
0     0.5
0     0.5
1     0.7
1     0.7
0     0.7
I have no idea how to do this in pandas, except perhaps by using iterrows() and manually appending rows to a new DataFrame - has anyone got a neater way?
UPDATE:
I tried this; it seems to work, though it's not very elegant:
temp = []
df = df.set_index('class')
for idx in df.index:
    row = df.loc[idx]
    base = {'class': idx, 'extra_teaching': row['extra_teaching']}
    for i in range(int(row['passed'])):
        # append a fresh dict per pupil (re-appending the same dict would give every row the last 'pass' value)
        temp.append({**base, 'pass': 1})
    for i in range(int(row['failed'])):
        temp.append({**base, 'pass': 0})
df_exploded = pd.DataFrame(temp)
Try:
def teaching_results(x):
    num_rows = x.passed.iloc[0] + x.failed.iloc[0]
    passed = x.passed.iloc[0] * [1] + x.failed.iloc[0] * [0]
    extra_teaching = num_rows * [x.extra_teaching.iloc[0]]
    class_code = x['class'].iloc[0]
    return pd.DataFrame({'pass': passed, 'extra_teaching': extra_teaching, 'class': class_code})

df.groupby('class', as_index=False).apply(lambda x: teaching_results(x))
to get:
     class  extra_teaching  pass
0 0    A11             0.5     1
  1    A11             0.5     0
  2    A11             0.5     0
1 0    A12             0.7     1
  1    A12             0.7     1
  2    A12             0.7     0
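For completeness, here is a hedged, vectorized sketch of the same reshape using melt plus Index.repeat instead of a per-group apply. It is not from the original answer; the column names and data are taken from the question.

import pandas as pd

df = pd.DataFrame({'class': ['A11', 'A12'],
                   'passed': [1, 2],
                   'failed': [2, 1],
                   'extra_teaching': [0.5, 0.7]})

# melt passed/failed into one 'count' column, then repeat each row by its count
long = df.melt(id_vars=['class', 'extra_teaching'],
               value_vars=['passed', 'failed'],
               var_name='outcome', value_name='count')
long['pass'] = (long['outcome'] == 'passed').astype(int)
exploded = (long.loc[long.index.repeat(long['count']),
                     ['class', 'extra_teaching', 'pass']]
            .sort_values('class', kind='stable')
            .reset_index(drop=True))
print(exploded)

The rows come out grouped by outcome before the sort, so the stable sort on 'class' is only needed if you care about keeping each class's passes before its fails.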
I have two files: file1 is the sequence file, and file2 holds the binding regions with prediction scores. file1 looks like:
seq A  seq B
Q      Q
V      V
Q      Q
A      A
B      C
C      S
A      A
B      C
C      S
and file2 looks like:
id     region  score
seq A  QABCA   0.6
seq B  CSACS   0.4
Now I want to match the prediction score to the sequences (there are more than 100 sequences). The result I want is:
seq A  seq A score  seq B  seq B score
Q      0            Q      0
V      0            V      0
Q      0.6          Q      0
A      0.6          A      0
B      0.6          C      0.4
C      0.6          S      0.4
A      0.6          A      0.4
B      0            C      0.4
C      0            S      0.4
How could I get this result? Thanks!
I tried pd.str.match(), but it can't match across multiple rows at the same time.
You can try something like this:
file1 = file1.reset_index(drop=True)   # 0..n-1 integer index, so .loc slicing by position works
file2 = file2.set_index("id")

for seq_id in file1.columns:
    region = file2.loc[seq_id, "region"]
    score = file2.loc[seq_id, "score"]
    seq = "".join(file1[seq_id].tolist())
    seq_start_index = seq.find(region)
    seq_stop_index = seq_start_index + len(region) - 1   # .loc slicing is inclusive
    file1[f"{seq_id} score"] = 0
    file1.loc[seq_start_index:seq_stop_index, f"{seq_id} score"] = score
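To make the snippet above runnable end to end, here are the two input frames rebuilt by hand from the tables in the question (a sketch; in practice you would load them with pd.read_csv):

import pandas as pd

# file1: one column of residues per sequence
file1 = pd.DataFrame({
    'seq A': list('QVQABCABC'),
    'seq B': list('QVQACSACS'),
})

# file2: one row per sequence with its binding region and score
file2 = pd.DataFrame({
    'id': ['seq A', 'seq B'],
    'region': ['QABCA', 'CSACS'],
    'score': [0.6, 0.4],
})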
I am new to Python and am trying to write code that creates a new dataframe based on conditions in an old dataframe, combined with the result in the cell above in the new dataframe.
Here is an example of what I am trying to do with the raw data:
I need to create a new dataframe where, if the corresponding position in the raw data is 0, the result is 0; if it is greater than 0, the result is 1 plus the value in the row above.
I then need to remove any instance where the consecutive number of intervals doesn't reach at least 3.
The way I think about the code is as follows, but being new to Python I am struggling.
From raw data to Dataframe 2:
if (1,1) = 0 then (1a,1a) = 0    # line 1
else (1a,1a) = 1
if (2,1) = 0 then (2a,1a) = 0    # line 2
else (2a,1a) = (1a,1a) + 1 = 2
if (3,1) = 0 then (3a,1a) = 0    # line 3
From Dataframe 2 to Dataframe 3:
if any of the last 3 rows is greater than 3, return that cell's value, else return 0
I am not sure how to make any of these work. If there is an easier way to do or think about this than what I am doing, please let me know. Any help is appreciated!
Based on your question, the output I was able to generate was:
Earlier, the DataFrame looked like so:
A B C
0.05 5 0 0
0.10 7 0 1
0.15 0 0 12
0.20 0 4 3
0.25 1 0 5
0.30 21 5 0
0.35 6 0 9
0.40 15 0 0
Now, the DataFrame looks like so:
A B C
0.05 0 0 0
0.10 0 0 1
0.15 0 0 2
0.20 0 0 3
0.25 1 0 4
0.30 2 0 0
0.35 3 0 0
0.40 4 0 0
The code I used for this is given below; just copy it into a new file, say code.py, and run it:
import re
import pandas as pd


def get_continous_runs(ext_list, threshold):
    mylist = list(ext_list)
    # convert every non-zero entry to 1 so runs can be found with a regex
    for i in range(len(mylist)):
        if mylist[i] != 0:
            mylist[i] = 1
    samp = "".join(map(str, mylist))
    # find runs of at least `threshold` consecutive 1s
    finder = re.finditer(r"1{%s,}" % threshold, samp)
    ranges = [x.span() for x in finder]
    return ranges


def build_column(ranges, max_len):
    answer = [0] * max_len
    for r in ranges:
        start = r[0]
        run_len = r[1] - start
        # number the positions inside each kept run 1, 2, 3, ...
        for i in range(run_len):
            answer[start + i] = i + 1
    return answer


def main(df):
    print("Earlier, the DataFrame looked like so:")
    print(df)
    ndf = df.copy()
    for col_name, col_data in df.items():   # .iteritems() was removed in pandas 2.0
        ranges = get_continous_runs(col_data.values, 4)
        column_len = len(col_data.values)
        new_column = build_column(ranges, column_len)
        ndf[col_name] = new_column
    print("\nNow, the DataFrame looks like so:")
    print(ndf)
    return


if __name__ == '__main__':
    raw_data = [
        (5, 0, 0), (7, 0, 1), (0, 0, 12), (0, 4, 3),
        (1, 0, 5), (21, 5, 0), (6, 0, 9), (15, 0, 0),
    ]
    df = pd.DataFrame(
        raw_data,
        columns=list("ABC"),
        index=[0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40],
    )
    main(df)
You can adjust the threshold (the second argument passed to get_continous_runs() inside main()) to require a consecutive run length other than 4 (i.e. more than 3).
As always, start by reading the main() function to understand how everything works. I have tried to use good variable names to aid understanding. My method might seem a little contrived because I am using a regex, but I didn't want to overwhelm a complete beginner with a custom run-length counter.
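For reference, here is a hedged, vectorized sketch of the same run-counting idea using only pandas (cumsum-based run labelling instead of a regex). It is an alternative of my own, not the answer's method, and assumes the same df as above:

import pandas as pd

def run_counter(col, threshold=4):
    """Number positions within runs of non-zero values, zeroing runs shorter than threshold."""
    nonzero = col.ne(0)
    # label each consecutive block of equal True/False values
    block = (nonzero != nonzero.shift()).cumsum()
    # running position (1, 2, 3, ...) within each block
    position = nonzero.astype(int).groupby(block).cumsum()
    # total length of the block each row belongs to
    block_len = nonzero.groupby(block).transform('sum')
    return position.where(nonzero & block_len.ge(threshold), 0).astype(int)

# ndf = df.apply(run_counter)   # same result as the regex version above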
Starting from another question of mine from yesterday, Pandas set value if all columns are equal in a dataframe, and from @anky_91's solution there, I'm working on something similar.
Instead of putting 1 or -1 when all columns are equal, I want something more flexible.
I want 1 if (for example) more than 70% of the columns are 1, -1 for the inverse condition, and 0 otherwise.
So this is what I've written:
# Instead of using .all, I use .sum to count the occurrences of 1 and 0 in each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)

# Debug print; it works
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
Here is what I want, in pseudocode:
tot = m1 + m2
if m1 > m2:
    if (m1 * 100) / tot > 0.7:   # simple percentage calculation
        df['enseamble'] = 1
elif m2 > m1:
    if (m2 * 100) / tot > 0.7:   # simple percentage calculation
        df['enseamble'] = -1
else:
    df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
            NET_0  NET_1  NET_2  NET_3  NET_4  NET_5  NET_6
date
2009-08-02      0      1      1      1      0      1
2009-08-03      1      0      0      0      1      0
2009-08-04      1      1      1      0      0      0

date        enseamble
2009-08-02          1    # because 1 is more than 70%
2009-08-03         -1    # because 0 is more than 70%
2009-08-04          0    # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try with (m1, m2 and tot the same as you already have):
cond1 = (m1 > m2) & ((m1 * 100 / tot).gt(0.7))
cond2 = (m2 > m1) & ((m2 * 100 / tot).gt(0.7))
df['enseamble'] = np.select([cond1, cond2], [1, -1], 0)
m = df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)

            enseamble
date
2009-08-02          1
2009-08-03         -1
2009-08-04          0
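For what it's worth, here is a hedged sketch that states the 70% cut-off as a fraction explicitly, assuming the frame holds only 0/1 values. This is a variation of my own on the answers above, not the original code:

import numpy as np
import pandas as pd

thr = 0.7
frac_ones = df.eq(1).mean(axis=1)    # fraction of columns equal to 1 per row
frac_zeros = df.eq(0).mean(axis=1)   # fraction of columns equal to 0 per row

df['enseamble'] = np.select([frac_ones.gt(thr), frac_zeros.gt(thr)], [1, -1], default=0)

Keeping the threshold as a fraction (0.7) and comparing fractions, rather than mixing a percentage (m1 * 100 / tot) with a 0.7 cut-off, avoids any ambiguity about what the condition actually tests.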
For each "acat" unique value, I want to count how many occurrences there are of each "data" category (call this "bins"), and then calc the mean and skew of "bins"
possible values of data = 1,2,3,4,5
df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})
df
Out[45]:
acat data
0 1 1
1 1 1
2 2 2
3 3 1
4 1 3
5 3 1
for acat = 1:
bins = (2 + 0 + 1 + 0 + 0)
average = bins / 5 = 0.6
for acat = 2:
bins = (0 + 1 + 0 + 0 + 0)
average = bins / 5 = 0.2
for acat = 3:
bins = (2 + 0 + 0 + 0 + 0)
average = bins / 5 = 0.4
bin_average_col
0.6
0.6
0.2
0.4
0.6
0.4
I would also like a bin_skew_col.
I have a solution that uses crosstab, but it blows up my PC's memory when the number of acat values is large.
I have tried extensively with groupby and transform, but this is beyond me!
Many thanks in advance.
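Since no answer is included here, a hedged groupby-based sketch of my own (not from the thread) that computes the bin mean and skew per acat and maps them back onto each row, avoiding a full crosstab:

import pandas as pd

df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})

categories = [1, 2, 3, 4, 5]   # possible values of data

def bin_counts(s):
    # counts of each possible data value within one acat group
    return s.value_counts().reindex(categories, fill_value=0)

g = df.groupby('acat')['data']
stats = pd.DataFrame({
    'bin_average': g.apply(lambda s: bin_counts(s).mean()),
    'bin_skew': g.apply(lambda s: bin_counts(s).skew()),
})
df = df.join(stats, on='acat')
print(df)

The bin_average values come out as 0.6, 0.2 and 0.4 for acat 1, 2 and 3, matching the worked example above (the mean of the bins is just the group size divided by 5). Note that Series.skew() is the bias-corrected sample skew; swap in scipy.stats.skew if you need the uncorrected version.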
My data frame contains 10,000,000 rows! After group by, ~ 9,000,000 sub-frames remain to loop through.
The code is:
data = pd.read_csv('big.csv')

for id, new_df in data.groupby(level=0):   # look at each mini df and do some analysis
    # some code for each of the small data frames
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame()   # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most id's have only one date, which indicates a single visit. For id's with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd of 3 dimensions. The output shape is (id, visits, features).
Here is one way to speed that up. It adds the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code with only 10,000 rows of the sample data takes more than 100 seconds. That represents a couple of orders of magnitude improvement.
Code:
import pandas as pd

def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []   # rows accumulated for the current id
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    # emit every (start date, later visit) pair for the finished id
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
                prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Test Code:
import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
Results:
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach to feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a dictionary is much faster than iterating over a DataFrame.
Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the DataFrame index to get the index levels back as columns in your dataset
df = data.reset_index()   # the existing index is (id, date)

# split the DataFrame based on id
# and store the splits as DataFrames in a dictionary, using id as the key
d = dict(tuple(df.groupby('id')))

# iterate over the dictionary and process the values
for key, value in d.items():
    pass   # each value is a DataFrame

# concat the values to get the original (processed) DataFrame back
df2 = pd.concat(d.values(), ignore_index=True)
Modified @Stephen's code:
def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0   # sets all depth values to 0
        depth = 1           # initiate from 1, so that the first loop is correct
        prev = None
        accum = []          # accumulates the block of rows belonging to a given user
        for row in a_df.values.tolist():   # for each row in our dataset
            row[0] = 0                     # NOT SURE
            key = row[1]                   # this is the id of the row
            if key == prev:                # if this row's id matches the previous row's id, append together
                depth += 1
                accum.append(row)
            else:                          # else this id is new, so the previous block is complete -> process it
                if depth == 0:             # previous id appeared only once -> get that row from accum
                    yield accum[0]         # also remember that depth = 0
                else:                      # process the block and emit each row
                    depth = 0
                    to_emit = []           # prepare the list to emit
                    for i in range(len(accum)):   # for each unique day in the accumulated list
                        date = accum[i][2]        # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j    # define the depth
                            to_emit[-1][2] = date # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
                prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]

    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Testing:
t = pd.DataFrame(data={'id': [1, 1, 1, 1, 2, 2, 3, 3, 4, 5],
                       'date': [20180311, 20180310, 20180210, 20170505, 20180312, 20180311, 20180312, 20180311, 20170501, 20180304],
                       'feature': [10, 20, 45, 1, 14, 15, 20, 20, 13, 11],
                       'result': [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]})
t = t.reindex(columns=['id', 'date', 'feature', 'result'])
print(t)
id date feature result
0 1 20180311 10 1
1 1 20180310 20 1
2 1 20180210 45 0
3 1 20170505 1 0
4 2 20180312 14 0
5 2 20180311 15 0
6 3 20180312 20 1
7 3 20180311 20 0
8 4 20170501 13 1
9 5 20180304 11 1
Output
depth feature result
id date
1 20180311 0 10 1
20180311 1 20 1
20180311 2 45 0
20180311 3 1 0
20180310 0 20 1
20180310 1 45 0
20180310 2 1 0
20180210 0 45 0
20180210 1 1 0
20170505 0 1 0
2 20180312 0 14 0
20180312 1 15 0
20180311 0 15 0
3 20180312 0 20 1
20180312 1 20 0
20180311 0 20 0
4 20170501 0 13 1