I have a dataframe data and want to append another one at the end. The new dataframe is similar to the existing one, except that the entries are swapped. I have the following code, which works and illustrates what I am doing:
listL = data.shape[0]
length = data.shape[1]
mid = (length-1) / 2.0
for j in range(0, 5):
    data.loc[listL+j] = data.iloc[j]
for j in range(0, 5):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[j][right]
    data.iloc[listL+j][0] = data.iloc[j][0] + 10
In this example I am appending only the first 5 rows at the end and swapping the columns. This does not scale well at all, and it is very inefficient.
Can you help make this more efficient, eliminate the loops, and make it scale well? (I would like to work with dataframes that have tens of thousands of entries.)
In particular, how can I make the swapping more efficient?
Update:
Using one of the answers, I can now do:
tmpdf = data
data = pandas.concat([data, tmpdf])
for j in range(0, listL-1):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[listL+j][right]
    data.iloc[listL+j][0] = data.iloc[listL+j][0] + 10
where listL is the number of rows in the original df data. I need to optimise the second part:
listL = data.shape[0]
length = data.shape[1]
mid = (length-1) / 2.0
for j in range(0, listL-1):
    for i in range(start, end):
        left = int(ceil(mid+i)) + 1
        right = int(ceil(mid-i))
        data.iloc[listL+j][left] = data.iloc[listL+j][right]
    data.iloc[listL+j][0] = data.iloc[listL+j][0] + 10
If you have df1 and df2, you can simply use pd.concat to append the first five rows of df2, independently of how the columns are ordered:
pd.concat([df1, df2.iloc[:5]])
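Note that pd.concat keeps the original row labels by default, so the combined frame ends up with duplicate index values; if you later address rows by label, pass ignore_index=True. A minimal sketch with made-up frames:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5, 6]})

# without ignore_index the labels would be 0,1,2,0,1; with it they are 0..4
combined = pd.concat([df1, df2.iloc[:2]], ignore_index=True)
print(combined)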
This is what I ended up doing, thanks to the answers and comments received:
from math import ceil, floor
import pandas

length = data.shape[1]
mid = (length-1) / 2.0
start = -int(floor(mid))
end = int(floor(mid))
#for j in range(0, 5):
#    data.loc[listL+j] = data.iloc[j]
tmpdf = data.copy(deep=True)
for i in range(start, end):
    left = int(ceil(mid+i)) + 1
    right = int(ceil(mid-i))
    tmpdf[data.columns[left]] = data[data.columns[right]]
data = pandas.concat([data, tmpdf])
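For completeness, the remaining Python-level loop over columns can also be removed by building the swapped column order once and assigning it in a single step. This is only a sketch of the idea on a made-up frame, assuming the first column stays in place and the remaining columns are mirrored:

import pandas

# hypothetical frame: an id column plus data columns to mirror
data = pandas.DataFrame({'id': [1, 2], 'c1': [10, 20], 'c2': [30, 40], 'c3': [50, 60]})

tmpdf = data.copy(deep=True)
mirrored = data.columns[1:][::-1]
# assign all mirrored columns at once; .values sidesteps label alignment
tmpdf[data.columns[1:].tolist()] = data[mirrored].values
tmpdf['id'] = data['id'] + 10
data = pandas.concat([data, tmpdf], ignore_index=True)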
So I had to write code to find all the diagonal strings in a grid called mystery:
mystery = [["r","a","w","b","i","t"],
           ["x","a","y","z","c","h"],
           ["p","q","b","e","i","e"],
           ["t","r","s","b","o","g"],
           ["u","w","x","v","i","t"],
           ["n","m","r","w","o","t"]]
And here is what I have so far, with the help of a few experts because I'm new to this. The expert who helped me is https://stackoverflow.com/users/5237560/alain-t
def diagsDownRight(M):
    diags, pad = [], []
    while any(M):
        # take the left column bottom-to-top, then the rest of the top row
        edge = [*next(zip(*reversed(M))), *M[0][1:]]
        # strip the first row and first column and repeat on the smaller grid
        M = [r[1:] for r in M[1:]]
        diags.append(pad + edge + pad)
        pad.append("")
    return [*map("".join, zip(*diags))]
While this does work, I find it hard to grasp, and I do not want to just write down code that I do not understand. So, can anyone please help make the code as basic as possible?
By "as basic as possible" I mean: picture yourself as someone who has only been coding for a couple of months, and simplify the code as much as you can.
The easiest I could think of: pad rows so that diagonals become columns. The code:
def diagsDownRight(M):
    n = len(M)
    m = [[''] * (n-i-1) + row + [''] * i for i, row in enumerate(M)]  # pad rows
    return [''.join(col) for col in zip(*m)]
The result is the same, and IMO the approach is more intuitive
Consider a square matrix:
[
  [ 1, 2, 3 ],
  [ 4, 5, 6 ],
  [ 7, 8, 9 ]
]
The indexes into the diagonals are as follows:
d1 = [[0,0],[1,1],[2,2]]
d2 = [[0,1],[1,2]]
d3 = [[1,0],[2,1]]
d4 = [[2,0]]
d5 = [[0,2]]
To get the middle diagonal you can simply start with the indexes:
for i in range(3):
    index = [i, i]
For the next diagonal we simply do the same, but offset the column by 1, until we go out of bounds:
for i in range(3):
    if i + 1 > 2:
        break
    index = [i, i+1]
For the next diagonal it's the same, except we offset on the other axis:
for i in range(3):
    if i + 1 > 2:
        break
    index = [i + 1, i]
For the top-right-most (in this case) it's the same, but we add 2:
for i in range(3):
    if i + 2 > 2:
        break
    index = [i, i+2]
Same for the bottom-most, but using the other index:
for i in range(3):
    if i + 2 > 2:
        break
    index = [i + 2, i]
I will leave it to you to extrapolate this into a working solution.
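For reference, one possible extrapolation into a working solution, kept deliberately close to the loops above (a sketch assuming a square matrix like the mystery grid):

def diagsDownRight(M):
    n = len(M)  # assumes a square matrix
    diagonals = []
    # diagonals that start on the left edge, bottom-most first
    for offset in range(n - 1, 0, -1):
        diagonals.append("".join(M[i + offset][i] for i in range(n - offset)))
    # diagonals that start on the top edge, including the main diagonal
    for offset in range(n):
        diagonals.append("".join(M[i][i + offset] for i in range(n - offset)))
    return diagonals

print(diagsDownRight(mystery))  # same output as the other answers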
Here's a simpler version:
def diagsDownRight(M):
    rows = len(M)                               # number of rows
    cols = len(M[0])                            # number of columns
    result = []                                 # result will be a list of strings
    leftSide = [(r, 0) for r in range(rows)]    # first column
    topSide = [(0, c) for c in range(1, cols)]  # first row
    for r, c in leftSide[::-1] + topSide:       # all start positions
        s = ""                                  # string on the diagonal
        while r < rows and c < cols:
            s += M[r][c]                        # accumulate characters
            r += 1                              # move down
            c += 1                              # and to the right
        result.append(s)                        # add diagonal string to result
    return result

print(diagsDownRight(mystery))
['n', 'um', 'twr', 'prxw', 'xqsvo',
'rabbit', 'ayeot', 'wzig', 'bce', 'ih', 't']
It works by starting at the coordinates of the left and top edge positions and accumulating characters, moving one place down and to the right until going out of the matrix.
I would suggest you go with Marat's solution though. It is simple and elegant. If you print the m matrix, I'm sure you'll understand what's going on.
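For instance, printing m for a hypothetical 3x3 grid makes the padding visible; each down-right diagonal lines up as a column:

M = [["1", "2", "3"],
     ["4", "5", "6"],
     ["7", "8", "9"]]

n = len(M)
m = [[''] * (n - i - 1) + row + [''] * i for i, row in enumerate(M)]
for padded_row in m:
    print(padded_row)
# ['', '', '1', '2', '3']
# ['', '4', '5', '6', '']
# ['7', '8', '9', '', '']
# joining down each column gives: '7', '48', '159', '26', '3'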
I created sample data using the function below:
import random
import time

import pandas as pd
import matplotlib.pyplot as plt

def create_sample(num_of_rows=1000):
    data = {
        'var1'  : [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
        'other' : [random.uniform(0.0, 1.0) for x in range(num_of_rows)]
    }
    df = pd.DataFrame(data)
    print("Shape : {}".format(df.shape))
    print("Type : \n{}".format(df.dtypes))
    return df
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using double square brackets
    ####################################################
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    ####################################################
    end = time.time()
    times.append(end - start)
plt.plot(times)
print(sum(times))
The graph is linear
[plot of times: roughly linear growth per iteration]
I then used pd.concat to select the columns, and the graph shows peaks at every 100 iterations. Why is this so?
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using pd.concat
    ####################################################
    temp = pd.concat([df['var' + str(i + 1)], df['var' + str(i)]], axis=1)
    ####################################################
    end = time.time()
    times.append(end - start)
plt.plot(times)
print(sum(times))
From the above we can see that the time taken to select columns using [[]] increases linearly as the dataframe grows. However, using pd.concat the time does not increase materially; it only spikes at every 100 iterations, and why that happens is not obvious.
I know that Python loops themselves are relatively slow compared to other languages, but when the correct functions are used the code becomes much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)

                  timestamp            c0  rowIndex
0  2016-01-01T00:00:12.000Z  13931.500000   8158791
1  2016-01-01T00:00:30.000Z  14084.099609   8158792
2  2016-01-01T00:00:48.000Z  13603.400391   8158793
3  2016-01-01T00:01:06.000Z  13977.299805   8158794
4  2016-01-01T00:01:24.000Z  13611.000000   8158795
5  2016-01-01T00:02:18.000Z  13695.000000   8158796
6  2016-01-01T00:02:36.000Z  13809.400391   8158797
7  2016-01-01T00:02:54.000Z  13756.000000   8158798
and here is the code I wrote:
import numpy as np
import pandas as pd

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time; how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as with your code:
print(deltaAc)

                  timestamp  summation
5  2016-01-01T00:02:18.000Z -41.799986
6  2016-01-01T00:02:36.000Z  51.418728
7  2016-01-01T00:02:54.000Z  -3.111184
First optimization: weights[c]/sumWeights can be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you could extract your columns as 1-D numpy arrays, it would be great for you. It might look something like:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values          # 1-D numpy array
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:] - c0_column[5-c:len(acoustics)-c])
for i in range(len(acoustics)-5):
    delta_ac.append([time_column[5+i], sum[i]])
Dataframes have a great method rolling for constructing and applying windowing transformations, so you don't need loops at all:
import numpy as np

# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])  # pd.np was removed in modern pandas
weights /= weights.sum()
# raw=True hands the window to the lambda as a plain ndarray;
# the window arrives oldest-first, so the weights are reversed to pair
# weights[c] with (current value - value c steps back), matching the original loop
df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(
    lambda x: ((x[-1] - x) * weights[::-1]).sum(), raw=True)
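As a side note, the same weighted sum can be written with plain numpy via np.convolve, which avoids the per-window Python lambda entirely. A sketch, assuming acoustics and the weights from the question:

import numpy as np

norm_w = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
norm_w /= norm_w.sum()

c0 = acoustics['c0'].to_numpy()
# np.convolve(..., 'valid') yields sum_c norm_w[c] * c0[i-c] for each full window
lagged_avg = np.convolve(c0, norm_w, mode='valid')
# because norm_w sums to 1, the weighted delta reduces to a simple difference
delta = c0[len(norm_w) - 1:] - lagged_avg
# the original loop starts at i=5 rather than i=4, so drop delta[0] for an exact match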
I have a range like this:
1323000-1555999
It is necessary to create "masks" covering the entire range. So the list of masks for the range above should look like this:
1323***
1324***
1325***
...
14*****
...
153****
154****
1550***
1551***
And so on.
Does anyone have ideas about how to solve this problem using Python?
The idea is to cover the whole range using the minimum number of masks. So for 1000-1999 the algorithm should output 1***, not 101*, 102*, ... or 10**, 11**, ....
With a bit of looping:
Code:
def wild_card_range(start, end):
    while start <= end:
        shift = 0
        multiple = 1
        done = False
        over = False
        still_fits = True
        while not (done or over) and still_fits:
            multiple *= 10
            shift += 1
            next_value = int(start / multiple) * multiple + multiple
            done = next_value == end + 1
            over = next_value > end + 1
            still_fits = int(start / multiple) == \
                         int((start + multiple - 1) / multiple)
        if over or not still_fits:
            multiple = int(multiple / 10)
            shift -= 1
        yield str(int(start / multiple)) + '*' * shift
        start += multiple

for mask in wild_card_range(1323000, 1555999):
    print(mask)
Results:
1323***
1324***
1325***
1326***
1327***
1328***
1329***
133****
134****
135****
136****
137****
138****
139****
14*****
150****
151****
152****
153****
154****
1550***
1551***
1552***
1553***
1554***
1555***
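A quick sanity check is to expand every mask back into the integer range it covers and confirm the masks tile the original range exactly, with no gaps or overlaps. A sketch using a small hypothetical helper:

def mask_to_range(mask):
    # e.g. '1324***' covers 1324000 through 1324999
    digits = mask.rstrip('*')
    stars = len(mask) - len(digits)
    low = int(digits) * 10 ** stars
    return low, low + 10 ** stars - 1

covered = sorted(mask_to_range(m) for m in wild_card_range(1323000, 1555999))
assert covered[0][0] == 1323000 and covered[-1][1] == 1555999
# each mask must begin right after the previous one ends
assert all(prev[1] + 1 == cur[0] for prev, cur in zip(covered, covered[1:]))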
I am very new to "for" statements in Python, and I can't get something that I think should be simple to work. The code I have is:
import pandas as pd
df1 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
df2 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
df3 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF1 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF2 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
DF3 = pd.DataFrame({'Column1' : pd.Series([1,2,3,4,5,6])})
Then:
A1 = len(df1.loc[df1['Column1'] <= DF1['Column1'].iloc[2]])
Z1 = len(df1.loc[df1['Column1'] >= DF1['Column1'].iloc[3]])
A2 = len(df2.loc[df2['Column1'] <= DF2['Column1'].iloc[2]])
Z2 = len(df2.loc[df2['Column1'] >= DF2['Column1'].iloc[3]])
A3 = len(df3.loc[df3['Column1'] <= DF3['Column1'].iloc[2]])
Z3 = len(df3.loc[df3['Column1'] >= DF3['Column1'].iloc[3]])
As you can see, it is a lot of repeated code with just the identifying numbers being different. So my first attempt at a "for" statement was:
Numbers = [1,2,3]
for i in Numbers:
    "A" + str(i) = len("df" + str(i).loc["df" + str(i)['Column1'] <= "DF" + str(i)['Column1'].iloc[2]])
    "Z" + str(i) = len("df" + str(i).loc["df" + str(i)['Column1'] >= "DF" + str(i)['Column1'].iloc[3]])
This yielded the SyntaxError: "can't assign to operator". So I tried:
Numbers = [1,2,3]
for i in Numbers:
    A = "A" + str(i)
    Z = "Z" + str(i)
    A = len("df" + str(i).loc["df" + str(i)['Column1'] <= "DF" + str(i)['Column1'].iloc[2]])
    Z = len("df" + str(i).loc["df" + str(i)['Column1'] >= "DF" + str(i)['Column1'].iloc[3]])
This yielded the AttributeError: 'str' object has no attribute 'loc'. I tried a few other things like:
Numbers = [1,2,3]
for i in Numbers:
    A = "A" + str(i)
    Z = "Z" + str(i)
    df = "df" + str(i)
    DF = "DF" + str(i)
    A = len(df.loc[df['Column1'] <= DF['Column1'].iloc[2]])
    Z = len(df.loc[df['Column1'] <= DF['Column1'].iloc[3]])
But that just gives me the same errors. Ultimately what I would want is something like:
Numbers = [1,2,3]
for i in Numbers:
    Ai = len(dfi.loc[dfi['Column1'] <= DFi['Column1'].iloc[2]])
    Zi = len(dfi.loc[dfi['Column1'] <= DFi['Column1'].iloc[3]])
Where the output would be equivalent if I typed:
A1 = len(df1.loc[df1['Column1'] <= DF1['Column1'].iloc[2]])
Z1 = len(df1.loc[df1['Column1'] >= DF1['Column1'].iloc[3]])
A2 = len(df2.loc[df2['Column1'] <= DF2['Column1'].iloc[2]])
Z2 = len(df2.loc[df2['Column1'] >= DF2['Column1'].iloc[3]])
A3 = len(df3.loc[df3['Column1'] <= DF3['Column1'].iloc[2]])
Z3 = len(df3.loc[df3['Column1'] >= DF3['Column1'].iloc[3]])
It is "restricted" to generate variables in for loop (you can do that, but it's better to avoid. See other posts: post_1, post_2).
Instead use this code to achieve your goal without generating as many variables as your needs (actually generate only the values in the for loop):
# Lists of your dataframes
Hanimals = [H26, H45, H46, H47, H51, H58, H64, H65]
Ianimals = [I26, I45, I46, I47, I51, I58, I64, I65]
# Generate your series using for loops iterating through the lists above
BPM = pd.DataFrame({'BPM_Base': pd.Series([i_a for i_a in [len(i_h.loc[i_h['EKG-evt'] <=
                        i_i[0].iloc[0]]) / 10 for i_h, i_i in zip(Hanimals, Ianimals)]]),
                    'BPM_Test': pd.Series([i_z for i_z in [len(i_h.loc[i_h['EKG-evt'] >=
                        i_i[0].iloc[-1]]) / 30 for i_h, i_i in zip(Hanimals, Ianimals)]])})
UPDATE
A more efficient way (iterate over "animals" lists only once):
# Lists of your dataframes
Hanimals = [H26, H45, H46, H47, H51, H58, H64, H65]
Ianimals = [I26, I45, I46, I47, I51, I58, I64, I65]
# You don't need pd.Series();
# just create a list of tuples [(A26, Z26), (A45, Z45), ...] and iterate over it
BPM = pd.DataFrame({'BPM_Base': i[0], 'BPM_Test': i[1]} for i in
                   [(len(i_h.loc[i_h['EKG-evt'] <= i_i[0].iloc[0]]) / 10,
                     len(i_h.loc[i_h['EKG-evt'] >= i_i[0].iloc[-1]]) / 30)
                    for i_h, i_i in zip(Hanimals, Ianimals)])
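Applied to the toy df1/df2/df3 example from the question, the same pattern might look like this sketch, with plain lists standing in for the numbered variable names:

import pandas as pd

dfs = [pd.DataFrame({'Column1': pd.Series([1, 2, 3, 4, 5, 6])}) for _ in range(3)]
DFs = [pd.DataFrame({'Column1': pd.Series([1, 2, 3, 4, 5, 6])}) for _ in range(3)]

# A[i] and Z[i] play the role of A1..A3 and Z1..Z3
A = [len(df.loc[df['Column1'] <= DF['Column1'].iloc[2]]) for df, DF in zip(dfs, DFs)]
Z = [len(df.loc[df['Column1'] >= DF['Column1'].iloc[3]]) for df, DF in zip(dfs, DFs)]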
I figured out a better way to do this that fits my needs. I am posting it mainly so that I will be able to find my method later.
# Change/Add animals and conditions here; make sure they match up directly
Animal = ['26','45','46','47','51','58','64','65','69','72','84']
Cond = ['Stomach','Intestine','Stomach','Stomach','Intestine','Intestine','Intestine','Stomach','Cut','Cut','Cut']

d = []
def CuSO4():
    for i in Animal:
        # load in Spike data
        A = pd.read_csv('TXT/INJ/' + i + '.txt', delimiter=r"\s+", skiprows=15, header=None, usecols=range(1))
        B = pd.read_csv('TXT/EKG/' + i + '.txt', skiprows=3)
        C = pd.read_csv('TXT/ESO/' + i + '.txt', skiprows=3)
        D = pd.read_csv('TXT/TRACH/' + i + '.txt', skiprows=3)
        E = pd.read_csv('TXT/BP/' + i + '.txt', delimiter=r"\s+").rename(columns={"4 BP": "BP"})
        # Count number of beats before/after injection, divide by 10/30 minutes for average BPM
        F = len(B.loc[B['EKG-evt'] <= A[0].iloc[0]]) / 10
        G = len(B.loc[B['EKG-evt'] >= A[0].iloc[-1]]) / 30
        # Count number of esophageal events before/after injection
        H = len(C.loc[C['Eso-evt'] <= A[0].iloc[0]])
        I = len(C.loc[C['Eso-evt'] >= A[0].iloc[-1]])
        # Find Trach events after injection
        J = D.loc[D['Trach-evt'] >= A[0].iloc[-1]]
        # Count number of breaths before/after injection, divide by 10/30 min for average breaths/min
        K = len(D.loc[D['Trach-evt'] <= A[0].iloc[0]]) / 10
        L = len(J) / 30
        # Use Trach events from J to find the number of EE
        M = pd.DataFrame(pybursts.kleinberg(J['Trach-evt'], s=4, gamma=0.1))
        N = M.last_valid_index()
        # Use N and M to determine the latency; set value to MaxTime (1800 s) if EE = 0
        O = 1800 if N == 0 else M.iloc[1][1] - A[0].iloc[-1]
        # Find BP values before/after injection, then determine the mean value
        P = E.loc[E['Time'] <= A[0].iloc[0]]
        Q = E.loc[E['Time'] >= A[0].iloc[-1]]
        R = P["BP"].mean()
        S = Q["BP"].mean()
        # Combine all factors into one DF
        d.append({'EE': N, 'EE-lat': O,
                  'BPM_Base': F, 'BPM_Test': G,
                  'Eso_Base': H, 'Eso_Test': I,
                  'Trach_Base': K, 'Trach_Test': L,
                  'BP_Base': R, 'BP_Test': S})

CuSO4()
# Create shell DF with animal numbers and their conditions
DF = pd.DataFrame({'Animal': pd.Series(Animal), 'Cond': pd.Series(Cond)})
# Pull the list appended by CuSO4 and make it a pd.DataFrame
Df = pd.DataFrame(d)
# Combine the two DFs
df = pd.concat([DF, Df], axis=1)
df