Pandas: while adding new rows, it's replacing my existing dataframe values? [duplicate] - python

This question already has answers here:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
(4 answers)
Closed 2 years ago.
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
pay term
0 22 2
1 30 7
2 50 10
3 60 11
4 70 13
df.loc[2] = [49,9]
print(df)
pay term
0 22 2
1 30 7
2 49 9
3 60 11
4 70 13
Expected output :
pay term
0 22 2
1 30 7
2 49 9
3 50 10
4 60 11
5 70 13
Running the code above replaces the values at index 2. I want to add a new row with the desired values, as shown above, to my existing dataframe without replacing the existing values. Please suggest a way to do this.

You cannot insert a new row by assigning values to df.loc[2], because that overwrites the existing row. Instead, slice the dataframe into two parts and concatenate them with the new row in between.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
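The slice-and-concat idea can be wrapped in a small reusable function. This is only a sketch: the name insert_row and its signature are hypothetical, not part of pandas. It slices by position with iloc, so it does not depend on the index labels being integers.

```python
import pandas as pd

def insert_row(df, pos, row):
    """Return a new frame with `row` (a dict) inserted at position `pos`.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Build a one-row frame with the same column order as the original
    new_row = pd.DataFrame([row], columns=df.columns)
    # iloc slices by position and excludes the end point, so no row is duplicated
    return pd.concat([df.iloc[:pos], new_row, df.iloc[pos:]], ignore_index=True)

df = pd.DataFrame({"term": [2, 7, 10, 11, 13], "pay": [22, 30, 50, 60, 70]})
result = insert_row(df, 2, {"term": 9, "pay": 49})
print(result)
```

The original df is left untouched; the function returns a new dataframe with a fresh 0..n index.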

A possible way is to open an empty slot in the index, add the row, and then sort by the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))
df.loc[2] = [9, 49]  # values follow the column order: term, pay
It gives:
term pay
0 2 22
1 7 30
3 10 50
4 11 60
5 13 70
2 9 49
Time to sort it:
df = df.sort_index()
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
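A variation on the same idea, sketched here under the assumption that the index is the default 0..n-1 range: instead of shifting the later labels, add the row under a fractional label that sorts between its neighbours, then sort and renumber.

```python
import pandas as pd

df = pd.DataFrame({"term": [2, 7, 10, 11, 13], "pay": [22, 30, 50, 60, 70]})

# 1.5 sorts between the existing labels 1 and 2, so no existing row is overwritten
df.loc[1.5] = [9, 49]  # values follow the column order: term, pay

# Sort by label, then replace the labels with consecutive integers
df = df.sort_index().reset_index(drop=True)
print(df)
```

Depending on the pandas version, the enlargement step may change the column dtypes, so check df.dtypes if integer columns matter.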

That happens because loc and iloc address rows that already exist in the dataframe, so assigning to one of them overwrites it. New rows are normally added at the end.
To insert in the middle, split the dataframe at the target position, concatenate the first part, the new row and the second part, and finally reset the index (in case you want to keep using consecutive integers).
#position where the new row should go
i = 2
#data to insert
data_to_insert = pd.DataFrame({'term':9, 'pay':49}, index = [i])
#split, add the data to insert, add the rest of the original
#(pd.concat is used because DataFrame.append was removed in pandas 2.0)
df = pd.concat([df.iloc[:i], data_to_insert, df.iloc[i:]]).reset_index(drop=True)
Keep in mind that iloc slices by position and excludes the end point, whereas df.loc[:i] slices by label and includes row i, which would duplicate that row here.

Related

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the columns, calling squeeze on each to anonymise the data so that concat doesn't try to align on columns, and then call concat on this list. Passing ignore_index=True creates a new index; otherwise the original index values are repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
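For reference, here is a self-contained sketch of the unstack approach using the sample data from the question. Note that df.unstack() on a frame with a plain index returns a Series keyed by (column name, row label), so the sketch converts it back to a one-column dataframe; the name stacked is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# Series with a two-level index: (column name, original row label)
stacked = df.unstack()

# Drop the two-level index and turn the Series into a one-column frame
out = stacked.reset_index(drop=True).to_frame("New Name")
print(out)
```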
A solution with masking.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Masking columns to ravel
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
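The interleaved order above comes from NumPy's default row-major flattening. If the columns should instead follow one another, a minimal sketch is to ask for column-major order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})
values = df[["Left", "Right"]].to_numpy()

# Default order="C" walks row by row: Left0, Right0, Left1, Right1, ...
row_major = np.ravel(values)

# order="F" walks column by column: all of Left, then all of Right
col_major = np.ravel(values, order="F")

df2 = pd.DataFrame({"New Name": col_major})
print(df2)
```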
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Column-wise subtraction calculations in python 3

It seems simple, but I can't seem to find an efficient way to solve this in Python 3: is there a loop I can use on my dataframe that takes every column after the current column (starting with the 1st column) and subtracts it from the current column, so that I can add the resulting column to a new dataframe?
This is what my data looks like:
This is what I have so far, but when running run_analysis my "result" equation is bringing up an error, and I do not know how to store the results in a new dataframe. I'm a beginner at all of this so any help would be much appreciated.
storage = [] #container that will store the results of the subtracted columns
def subtract (a,b): #function to call to do the column-wise subtractions
    return a-b
def run_analysis (frame, store):
    for first_col_index in range(len(frame)): #finding the first column to use
        temp=[] #temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)): #finding the second column to subtract from the first column
            if (sec_col_index <= first_col_index): #if the column is below the current column or is equal to
                #the current column, then skip to next column
                continue
            else:
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                #if column above our current column, the subtract values in the column and keep the result in temp
                temp.append(result)
        store.append(temp) #save the complete analysis in the store
Something like this?
import pandas as pd
#dummy dataframe
df = pd.DataFrame({'a':list(range(10)), 'b':list(range(10,20)), 'c':list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over pairs of adjacent columns and subtract them, assigning each result to a new column of the dataframe:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
    df[f'{c2}-{c1}'] = df[c2]-df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
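The loop above only pairs each column with its immediate neighbour. If every later column should be compared with every earlier one, as the question describes, one possible sketch uses itertools.combinations and collects the results in a separate dataframe. The name diffs is hypothetical, and the "later minus earlier" direction is an assumption taken from the code in the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"a": list(range(10)), "b": list(range(10, 20)), "c": list(range(10))})

# combinations() yields each pair of columns once, in column order:
# (a, b), (a, c), (b, c)
diffs = pd.DataFrame({
    f"{later}-{earlier}": df[later] - df[earlier]
    for earlier, later in combinations(df.columns, 2)
})
print(diffs)
```

Because the differences are stored in a new dataframe, the original columns stay unchanged and the loop never iterates over columns it has just created.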

Is there a way to reference a previous value in Pandas column without a loop?

For example if I wanted
df = pd.DataFrame(index=range(5000)
df[‘A’]= 0
df[‘A’][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = df['A'][i-1] * 3
Is there a way to do this without a loop?
Your code sample has missing closing brackets and invalid quote characters; I fixed these.
If I understand what you are trying to achieve, each value is the previous value multiplied by 3, and the zeroth value is 1.
Initialize the series to 3, then set the zeroth item to 1.
Then it is a simple use of cumprod.
I shortened your series because this calculation rapidly results in numeric overflow.
df = pd.DataFrame(index=range(20))
df["A"]= 3
df.loc[0,"A"] = 1
df["A"] = df["A"].cumprod()
A
0 1
1 3
2 9
3 27
4 81
5 243
6 729
7 2187
8 6561
9 19683
10 59049
11 177147
12 531441
13 1.59432e+06
14 4.78297e+06
15 1.43489e+07
16 4.30467e+07
17 1.2914e+08
18 3.8742e+08
19 1.16226e+09
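Since each value is just a power of 3, the same column can also be written in closed form. The sketch below compares it with the cumprod version; the names n and check are only for illustration.

```python
import numpy as np
import pandas as pd

n = 20
df = pd.DataFrame(index=range(n))

# Row i holds 3**i, which is what the recurrence A[i] = 3 * A[i-1], A[0] = 1 produces
df["A"] = 3 ** np.arange(n, dtype=np.int64)

# The cumprod formulation, for comparison
check = pd.Series(3, index=range(n), dtype="int64")
check.iloc[0] = 1
check = check.cumprod()

print((df["A"] == check).all())
```

Either way, int64 can only hold the result up to row 39, because 3**40 exceeds the int64 maximum. For the 5000 rows in the question a different representation would be needed.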

Retrieving Unknown Column Names from DataFrame.apply

How can I retrieve column names from a call to DataFrame.apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can build on @MayankPorwal's answer by applying .add_suffix('2') to the output of your apply call:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name+'2' has no effect (the function effectively just returns col * 2), because apply assigns the returned values back under the original column names.
Anyway, it's possible to take MayankPorwal's approach: pd.concat plus handling of the duplicated column names (making them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output and additionally handles further duplicated columns.
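When the functions themselves decide how many columns to return and what to call them, one way to avoid knowing the names in advance is to call each function on its column directly and let pd.concat collect whatever comes back. This is only a sketch: the funcs mapping and the generated column names (A_double, A_sq, B_plus1) are hypothetical examples.

```python
import pandas as pd

df = pd.DataFrame({"A": range(0, 10), "B": range(10, 0, -1)})

# Hypothetical mapping from column name to a function that returns a frame
# with any number of columns; the names are chosen inside the functions
funcs = {
    "A": lambda s: pd.DataFrame({s.name + "_double": s * 2, s.name + "_sq": s ** 2}),
    "B": lambda s: (s + 1).rename(s.name + "_plus1").to_frame(),
}

# Apply each function to its own column and keep whatever columns it produced
pieces = [func(df[name]) for name, func in funcs.items()]

# Original columns first, then every generated column, aligned on the row index
out = pd.concat([df, *pieces], axis=1)
print(out.columns.tolist())
```

The caller never names the new columns; they are discovered from the returned frames.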
Answer on the behalf of OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row.iloc[0]
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)

Fill pandas dataframe with values in between

I am new to using pandas but want to learn it better. I am currently facing a problem. I have a DataFrame looking like this:
0 1 2
0 chr2L 1 4
1 chr2L 9 12
2 chr2L 17 20
3 chr2L 23 23
4 chr2L 26 27
5 chr2L 30 40
6 chr2L 45 47
7 chr2L 52 53
8 chr2L 56 56
9 chr2L 61 62
10 chr2L 66 80
I want to get something like this:
0 1 2 3
0 chr2L 0 1 0
1 chr2L 1 2 1
2 chr2L 2 3 1
3 chr2L 3 4 1
4 chr2L 4 5 0
5 chr2L 5 6 0
6 chr2L 6 7 0
7 chr2L 7 8 0
8 chr2L 8 9 0
9 chr2L 9 10 1
10 chr2L 10 11 1
11 chr2L 11 12 1
12 chr2L 12 13 0
And so on...
So, fill in the missing intervals with zeros, and save the present intervals as ones (if there is an easy way to save "boundary" positions (the borders of the intervals in the initial data) as 0.5 at the same time it might also be helpful) while splitting all data into 1-length intervals.
In the data there are multiple string values in the column 0, and this should be done for each of them separately. They require different length of the final data (the last value that should get a 0 or a 1 is different). Would appreciate your help with dealing with this in pandas.
This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):
import pandas as pd
# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n':['a']*4 + ['b']*2, 'a':[1,6,8,5,1,5],'b':[4,7,10,5,3,7]})
# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    a = dfrow[1].a; b = dfrow[1].b; n = dfrow[1].n
    count = b - a
    if count >= 2:
        interior = [0.5]+[1]*(count-2)+[0.5]
    elif count == 1:
        interior = [0.5]
    elif count == 0:
        interior = []
    return {'n':[n]*count, 'a':range(a, a + count),
            'b':range(a + 1, a + count + 1),
            'insideness':interior}
Edited to use pd.concat() to combine the intermediate results:
# Into a new dataframe:
intermediate = []
for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))
df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop in adding rows to a final dataframe:
# for times in the overall range describing 'a'
for i in range(int(df_onebyone[df_onebyone.n=='a'].a.min()),int(df_onebyone[df_onebyone.n=='a'].a.max())):
    # if a time isn't in an existing 0.5-1-0.5 range:
    if i not in df_onebyone[df_onebyone.n=='a'].a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0'%(i, i+1))
Or, if you know the a column will be sorted for each n, you could keep track of the last end-value handled by onebyone() and insert some extra rows to catch up to the next start value you're going to pass to onebyone().
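An alternative sketch that fills the zero rows in the same pass: for each label, mark the covered unit intervals in a NumPy array and build the output from that array. The column names chrom, start, end and covered, and the function name to_unit_intervals, are assumptions made for illustration. The output for each label runs from 0 to its largest end value, and the 0.5 boundary marking is left out.

```python
import numpy as np
import pandas as pd

def to_unit_intervals(df):
    """Split [start, end) ranges into unit intervals flagged 1 (covered) or 0."""
    pieces = []
    for name, grp in df.groupby("chrom", sort=False):
        last = int(grp["end"].max())
        covered = np.zeros(last, dtype=int)
        # Unit interval [s, s + 1) is covered when start <= s < end
        for start, end in zip(grp["start"], grp["end"]):
            covered[start:end] = 1
        pieces.append(pd.DataFrame({
            "chrom": name,
            "start": np.arange(last),
            "end": np.arange(1, last + 1),
            "covered": covered,
        }))
    return pd.concat(pieces, ignore_index=True)

# First three rows of the data in the question
df = pd.DataFrame({"chrom": ["chr2L"] * 3, "start": [1, 9, 17], "end": [4, 12, 20]})
out = to_unit_intervals(df)
print(out.head(13))
```

Zero-length ranges such as 23 to 23 mark nothing, because the slice covered[23:23] is empty.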
