I am new to pandas but want to learn it better. I am currently facing a problem: I have a DataFrame that looks like this:
0 1 2
0 chr2L 1 4
1 chr2L 9 12
2 chr2L 17 20
3 chr2L 23 23
4 chr2L 26 27
5 chr2L 30 40
6 chr2L 45 47
7 chr2L 52 53
8 chr2L 56 56
9 chr2L 61 62
10 chr2L 66 80
I want to get something like this:
0 1 2 3
0 chr2L 0 1 0
1 chr2L 1 2 1
2 chr2L 2 3 1
3 chr2L 3 4 1
4 chr2L 4 5 0
5 chr2L 5 6 0
6 chr2L 6 7 0
7 chr2L 7 8 0
8 chr2L 8 9 0
9 chr2L 9 10 1
10 chr2L 10 11 1
11 chr2L 11 12 1
12 chr2L 12 13 0
And so on...
So: split all the data into length-1 intervals, fill in the missing intervals with zeros, and mark the present intervals with ones (if there is an easy way to also mark the "boundary" positions, i.e. the borders of the intervals in the initial data, as 0.5 at the same time, that would be helpful too).
The data contains multiple string values in column 0, and this should be done for each of them separately; each requires a different length of final data (the last position that should get a 0 or a 1 differs). I would appreciate your help with dealing with this in pandas.
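For reference, the example DataFrame above can be reconstructed like this (my own reconstruction; the integer column labels match the display):

import pandas as pd

# reconstruction of the example data shown in the question
df = pd.DataFrame({
    0: ['chr2L'] * 11,
    1: [1, 9, 17, 23, 26, 30, 45, 52, 56, 61, 66],
    2: [4, 12, 20, 23, 27, 40, 47, 53, 56, 62, 80],
})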
This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):
import pandas as pd

# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n': ['a']*4 + ['b']*2, 'a': [1, 6, 8, 5, 1, 5], 'b': [4, 7, 10, 5, 3, 7]})

# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    a = dfrow[1].a; b = dfrow[1].b; n = dfrow[1].n
    count = b - a
    if count >= 2:
        interior = [0.5] + [1]*(count - 2) + [0.5]
    elif count == 1:
        interior = [0.5]
    elif count == 0:
        interior = []
    return {'n': [n]*count, 'a': range(a, a + count),
            'b': range(a + 1, a + count + 1),
            'insideness': interior}
Edited to use pd.concat(), new in pandas 0.15, to combine the intermediate results:
# Into a new dataframe:
intermediate = []
for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))
df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop in adding rows to a final dataframe:
# for times in the overall range describing 'a'
for i in range(int(df_onebyone[df_onebyone.n == 'a'].a.min()),
               int(df_onebyone[df_onebyone.n == 'a'].a.max())):
    # if a time isn't in an existing 0.5-1-0.5 range:
    if i not in df_onebyone[df_onebyone.n == 'a'].a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0' % (i, i + 1))
Or, if you know the a column will be sorted for each n, you could keep track of the last end-value handled by onebyone() and insert some extra rows to catch up to the next start value you're going to pass to onebyone().
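A minimal sketch of that idea (my own addition, assuming the intervals within each value of n are sorted and non-overlapping): generate the insideness=0 rows for the gaps and combine them with df_onebyone:

# fill the gaps between intervals with insideness=0 rows
zero_rows = []
for label in set(df.n):
    sub = df[df.n == label].sort_values('a')
    prev_end = int(sub.a.min())          # nothing to fill before the first interval
    for _, r in sub.iterrows():
        for i in range(prev_end, r.a):   # positions between the previous end and this start
            zero_rows.append({'n': label, 'a': i, 'b': i + 1, 'insideness': 0})
        prev_end = r.b

df_filled = pd.concat([df_onebyone, pd.DataFrame(zero_rows)])
df_filled = df_filled.sort_values(['n', 'a']).reset_index(drop=True)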
It seems simple, but I can't find an efficient way to solve this in Python 3: is there a loop I can use on my DataFrame that takes every column after the current column (starting with the first column) and subtracts it from the current column, so that I can add each resulting column to a new DataFrame?
This is what my data looks like:
This is what I have so far, but when running run_analysis my "result" line raises an error, and I do not know how to store the results in a new DataFrame. I'm a beginner at all of this, so any help would be much appreciated.
storage = []  # container that will store the results of the subtracted columns

def subtract(a, b):  # function to call to do the column-wise subtractions
    return a - b

def run_analysis(frame, store):
    for first_col_index in range(len(frame)):  # finding the first column to use
        temp = []  # temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)):  # finding the second column to subtract from the first
            if sec_col_index <= first_col_index:
                # if the column is below or equal to the current column, skip to the next column
                continue
            else:
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                # if the column is above our current column, subtract the values and keep the result in temp
                temp.append(result)
        store.append(temp)  # save the complete analysis in the store
Something like this?
import pandas as pd

# dummy dataframe
df = pd.DataFrame({'a': list(range(10)), 'b': list(range(10, 20)), 'c': list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over adjacent pairs of columns and subtract them, assigning the result to a new column of the DataFrame:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
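If you need the difference of every later column against each earlier column (as the question describes), not just adjacent pairs, here is a small variation using itertools.combinations (my own sketch; it lists the original columns explicitly because df now also holds the derived columns):

from itertools import combinations

# one new column per (earlier, later) pair of the original columns
result = pd.DataFrame(index=df.index)
for c1, c2 in combinations(['a', 'b', 'c'], 2):
    result[f'{c2}-{c1}'] = df[c2] - df[c1]
print(result)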
For example, if I wanted to do something like this:
df = pd.DataFrame(index=range(5000)
df[‘A’]= 0
df[‘A’][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = df['A'][i-1] * 3
Is there a way to do this without a loop?
Your code sample has missing close brackets, and the quotes are not valid; I've fixed these below.
If I understand what you are trying to achieve, you want to multiply the previous value by 3, with the zeroth value being 1.
Initialize the series to 3, then set the zeroth item to 1, then it's a simple use of cumprod().
I've shortened your series, as this calculation rapidly results in numeric overflow:
df = pd.DataFrame(index=range(20))
df["A"]= 3
df.loc[0,"A"] = 1
df["A"] = df["A"].cumprod()
              A
0             1
1             3
2             9
3            27
4            81
5           243
6           729
7          2187
8          6561
9         19683
10        59049
11       177147
12       531441
13  1.59432e+06
14  4.78297e+06
15  1.43489e+07
16  4.30467e+07
17   1.2914e+08
18   3.8742e+08
19  1.16226e+09
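If exact integer values matter for a longer series, one more observation (my own suggestion, not part of the answer above): the recurrence x[i] = 3 * x[i-1] with x[0] = 1 is simply 3**i, which can be computed with Python's arbitrary-precision integers so the values stay exact instead of switching to float notation:

import pandas as pd

n = 20
df = pd.DataFrame(index=range(n))
# 3**i as a Python int stays exact, no matter how large it gets
df["A"] = [3**i for i in range(n)]
print(df)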
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
pay term
0 22 2
1 30 7
2 50 10
3 60 11
4 70 13
df.loc[2] = [49,9]
print(df)
pay term
0 22 2
1 30 7
2 49 9
3 60 11
4 70 13
Expected output :
pay term
0 22 2
1 30 7
2 49 9
3 50 10
4 60 11
5 70 13
If we run the above code, it replaces the values at index 2. I want to add a new row with the desired values to my existing DataFrame without replacing the existing values. Please suggest how.
You will not be able to insert a new row directly by assigning values to df.loc[2], as that will overwrite the existing values. But you can slice the DataFrame into two parts and then concat them together with a third, one-row DataFrame holding the row to insert.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
A possible way is to prepare an empty slot in the index, add the row and sort according to the index:
df.index = list(range(2)) + list(range(3, len(df) +1))
df.loc[2] = [49,9]
It gives:
term pay
0 2 22
1 7 30
3 10 50
4 11 60
5 13 70
2 49 9
Time to sort it:
df = df.sort_index()
term pay
0 2 22
1 7 30
2 49 9
3 10 50
4 11 60
5 13 70
That is because the loc and iloc methods refer to rows that already exist in the DataFrame; what you would normally do is insert by appending a value as the last row.
To address this situation, first split the DataFrame, append the row you want, concatenate with the second part, and finally reset the index (in case you want to keep using integers).
# location where you want to insert
i = 2
# data to insert
data_to_insert = pd.DataFrame({'term': 49, 'pay': 9}, index=[i])
# split, append the data to insert, then append the rest of the original
# (slice up to i-1 so the existing row at position i is not duplicated)
df = df.loc[:i-1].append(data_to_insert).append(df.loc[i:]).reset_index(drop=True)
Keep in mind that the slice operator will work because the index is integers.
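Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same split-and-insert can be written with pd.concat (equivalent sketch, reusing the names above):

# equivalent for pandas >= 2.0, where DataFrame.append no longer exists
df = pd.concat([df.loc[:i-1], data_to_insert, df.loc[i:]]).reset_index(drop=True)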
I have a large time series df (2.5 million rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated, continuous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and keep the remaining single 0s, giving a new df of [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of a zero run that triggers deletion should be a settable parameter; in this case, runs longer than 2.
Is there a clever way to do this in pandas?
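For reference, a minimal DataFrame matching this example (the column name ColA follows the first answer below; the second answer refers to the same column as A):

import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})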
It looks like you want to remove a row if its value is 0 and either the previous or next value in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following the example in the link below, add a new column that tracks the length of each consecutive run, and then use it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
# keep a row unless it is a 0 belonging to a run longer than 2 (the threshold from the question)
df[~((df.consecutive > 2) & (df.ColA == 0))]
We need to build a new parameter (a run identifier) here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :],
               df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) finds the values equal to 0
# .diff().ne(0).cumsum() starts a new group number each time the eq(0) flag changes, so consecutive zeros (and consecutive non-zeros) end up in the same group
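Combining the run-id trick with the length threshold from the question, a small parameterized sketch (my own helper, not part of either answer):

import pandas as pd

def drop_zero_runs(s, max_run=2):
    # drop zeros that occur in runs longer than max_run; keep everything else
    run_id = s.ne(s.shift()).cumsum()               # new id each time the value changes
    run_len = s.groupby(run_id).transform('size')   # length of the run each value belongs to
    return s[~(s.eq(0) & (run_len > max_run))]

s = pd.Series([1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9])
print(drop_zero_runs(s, max_run=2).tolist())
# [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]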
Given the following DataFrame:
import pandas as pd
import numpy as np
d=pd.DataFrame({0:[10,20,30,40],1:[20,45,10,35],2:[34,24,54,22],
'0 to 1':[1,1,1,0],'0 to 2':[1,0,1,1],'1 to 2':[1,1,1,1],
})
d=d[[0,1,2,'0 to 1','0 to 2','1 to 2']]
d
0 1 2 0 to 1 0 to 2 1 to 2
0 10 20 34 1 1 1
1 20 45 24 1 0 1
2 30 10 54 1 1 1
3 40 35 22 0 1 1
I'd like to produce 3 new columns; one for each of the 3 columns on the left with the following criteria:
Include the original value.
If the original value is greater than the value being compared, and if there is a 1 in the comparison column (the columns with 'to'), list the other column(s) separated by commas.
For example:
Column 1, row 0 has a value of 20, which is greater than its corresponding value in column 0 (10). The comparison column between columns 0 and 1 is '0 to 1'. In this column, there is a value of 1 for row 0. There is another column comparing column 1 to column 2, but the value for column 2, row 0 is 34, so since it is greater than 20, ignore the 1 in '1 to 2'.
So the final value would be '20 (0)'.
Here is the desired resulting data frame:
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (0,1)
1 20 45 24 1 0 1 20 45 (0,2) 24
2 30 10 54 1 1 1 30 (1) 10 54 (0,1)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
Thanks in advance!
Note: Because my real data will have varying numbers of columns on the left (i.e. 0,1,2,3,4) and resulting comparisons, I need an approach that finds all conditions that apply. So, for a particular value, find all cases where the comparison column value is 1 and the value is higher than that being compared.
Update
To clarify:
'0 to 1' compares column 0 to column 1. If there is a significant difference between them, the value is 1, else 0. So for '0 Final', if 0 is larger than 1 and '0 to 1' is 1, there would be a (1) after the value to signify that 0 is significantly larger than 1 for that row.
Here's what I have so far:
d['0 Final']=d[0].astype(str)
d['1 Final']=d[1].astype(str)
d['2 Final']=d[2].astype(str)
d.loc[((d[0]>d[1])&(d['0 to 1']==1))|((d['0 to 2']==1)&(d[0]>d[2])),'0 Final']=d['0 Final']+' '
d.loc[((d[1]>d[0])&(d['0 to 1']==1))|((d['1 to 2']==1)&(d[1]>d[2])),'1 Final']=d['1 Final']+' '
d.loc[((d[2]>d[0])&(d['0 to 2']==1))|((d['1 to 2']==1)&(d[2]>d[1])),'2 Final']=d['2 Final']+' '
d.loc[(d['0 to 1']==1)&(d[0]>d[1]),'0 Final']=d['0 Final']+'1'
d.loc[(d['0 to 2']==1)&(d[0]>d[2]),'0 Final']=d['0 Final']+'2'
d.loc[(d['0 to 1']==1)&(d[1]>d[0]),'1 Final']=d['1 Final']+'0'
d.loc[(d['1 to 2']==1)&(d[1]>d[2]),'1 Final']=d['1 Final']+'2'
d.loc[(d['0 to 2']==1)&(d[2]>d[0]),'2 Final']=d['2 Final']+'0'
d.loc[(d['1 to 2']==1)&(d[2]>d[1]),'2 Final']=d['2 Final']+'1'
d.loc[d['0 Final'].str.contains(' '),'0 Final']=d[0].astype(str)+' ('+d['0 Final'].str.split(' ').str[1]+')'
d.loc[d['1 Final'].str.contains(' '),'1 Final']=d[1].astype(str)+' ('+d['1 Final'].str.split(' ').str[1]+')'
d.loc[d['2 Final'].str.contains(' '),'2 Final']=d[2].astype(str)+' ('+d['2 Final'].str.split(' ').str[1]+')'
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (01)
1 20 45 24 1 0 1 20 45 (02) 24
2 30 10 54 1 1 1 30 (1) 10 54 (01)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
It has 2 shortcomings:
I cannot predict how many columns I will have to compare, so the first 3 .loc lines will need to somehow account for this, assuming it can and it's the best approach.
I still need to figure out how to get a comma and space between each number in parentheses if there is more than 1.
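Since that part is still open, here is one possible generalization (my own sketch, not a tested answer): it derives the column pairs from the integer-labelled value columns and the 'X to Y' comparison columns, so the logic adapts to any number of columns, and it joins the matches with ', ':

# value columns are the integer-labelled ones; comparison columns look like 'X to Y'
value_cols = [c for c in d.columns if isinstance(c, int)]

for col in value_cols:
    finals = []
    for _, row in d.iterrows():
        others = []
        for other in value_cols:
            if other == col:
                continue
            pair = f'{min(col, other)} to {max(col, other)}'
            # flag the other column if the comparison fired and this value is larger
            if pair in d.columns and row[pair] == 1 and row[col] > row[other]:
                others.append(str(other))
        label = str(row[col]) + (f" ({', '.join(others)})" if others else '')
        finals.append(label)
    d[f'{col} Final'] = finals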