Is this possible to do with a pandas DataFrame?
I want to keep only the first row's value in a column and replace the value in the second row onwards with 0.
Input
Name    Date        Amount    Labor
A       1/1/1972    5         0.3
A       1/1/1972    5         0.1
A       1/1/1972    5         0.7
A       1/1/1972    1         0.3
B       7/2/1980    1         0.6
B       7/2/1980    1         0.3
B       7/2/1980    1         0.7
C       6/9/1965    4         0.2
C       6/9/1965    4         0.3
C       6/9/1965    4         0.4
Output
Name    Date        Amount    Labor
A       1/1/1972    5         0.3
A       1/1/1972    0         0.1
A       1/1/1972    0         0.7
A       1/1/1972    0         0.3
B       7/2/1980    1         0.6
B       7/2/1980    0         0.3
B       7/2/1980    0         0.7
C       6/9/1965    4         0.2
C       6/9/1965    0         0.3
C       6/9/1965    0         0.4
As simple as multiplying by a boolean mask.
df['Amount'] *= df['Amount'].ne(df['Amount'].shift())
Yes, that is possible. You can use .duplicated(...) to construct a Series that marks all duplicates with True, and then assign values with that mask:
df.loc[df['Amount'].duplicated(), 'Amount'] = 0
Or, if you only want to zero values that are duplicated consecutively, you can work with .diff().eq(0):
df.loc[df['Amount'].diff().eq(0), 'Amount'] = 0
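Note that in the sample output even A's fourth row (Amount 1) becomes 0 and B's first row keeps its 1, which none of the masks above reproduce exactly; the rule appears to be "keep only the first Amount per Name". A minimal group-aware sketch (columns trimmed to the relevant ones; the per-Name rule is inferred from the sample output):
import pandas as pd

df = pd.DataFrame({'Name':   list('AAAABBBCCC'),
                   'Amount': [5, 5, 5, 1, 1, 1, 1, 4, 4, 4]})
# Zero every Amount except the first row of each Name group;
# this reproduces the sample output exactly.
df.loc[df['Name'].duplicated(), 'Amount'] = 0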
I am trying to plot 2 columns from a CSV file. Each column contains a list of values.
How can I plot those columns? The columns are read in as strings, and the plot call cannot handle the "[" character.
Here's a snippet of my CSV. I want to plot Column1 versus Column2 for all Names (A, B, C).
Name  Column1        Column2        Column3
A     [0.1,0.2,0.3]  [0.1,0.2,0.3]  0.2
B     [0.9,0.7,0.3]  [0.1,0.8,0.3]  0.2
C     [0.1,0.2,0.6]  [0.1,0.2,0.3]  0.2
I tried to use the following code
r = pd.read_csv('L.csv')
plt.plot(r.loc[i]['Column1'].astype(float),
         r.loc[i]['Column2'].astype(float),
         linestyle="--",
         label="{}, Led={:.3f}".format(i, r.loc[i]['Column3']))
You can use the ast.literal_eval() function to convert each string to the Python object it represents:
import ast

plt.plot(ast.literal_eval(r.loc[i]['Column1']),
         ast.literal_eval(r.loc[i]['Column2']),
         linestyle="--",
         label="{}, Led={:.3f}".format(i, r.loc[i]['Column3']))
I saved your CSV data with a semicolon separator and read it in:
Name;Column1;Column2;Column3
A;[0.1,0.2,0.3];[0.1,0.2,0.3];0.2
B;[0.9,0.7,0.3];[0.1,0.8,0.3];0.2
C;[0.1,0.2,0.6];[0.1,0.2,0.3];0.2
import ast
import pandas as pd

df = pd.read_csv('input_csv.csv', sep=';')
# change the strings to actual lists
df[['Column1', 'Column2']] = df[['Column1', 'Column2']].applymap(ast.literal_eval)
# explode the lists to separate rows (multi-column explode needs pandas >= 1.3)
out = df.explode(column=['Column1', 'Column2'])
print(out)
for grp, val in out.groupby('Name'):
    plt.plot(val['Column1'],
             val['Column2'],
             linestyle='--',
             label="{}, Led={:.3f}".format(grp, val['Column3'].unique()[0]))
plt.legend()
Output of out:
Name Column1 Column2 Column3
0 A 0.1 0.1 0.2
0 A 0.2 0.2 0.2
0 A 0.3 0.3 0.2
1 B 0.9 0.1 0.2
1 B 0.7 0.8 0.2
1 B 0.3 0.3 0.2
2 C 0.1 0.1 0.2
2 C 0.2 0.2 0.2
2 C 0.6 0.3 0.2
Plot:
I have the following reference DataFrame:
corr = pd.DataFrame({'i':['a','b'],'a':[.1,.2],'b':[.2,.1]}).set_index('i')
I also have some vector values. The length will always change, so I'm only using 6 here to show what I am trying to achieve.
vectors = pd.DataFrame({'val':['a','b','a','a','a','b']})
I would like to use these values to generate a 6x6 matrix 'X' such that:
6x6 because len(vectors) = 6
The closest I can think of is a map function, but that generates only one column.
DataFrame.reindex twice, once for columns and once for rows:
out = corr.reindex(vectors['val']).reindex(vectors['val'], axis=1)
Note that duplicate column names, while supported, are not recommended. For example, out['a'] would return a DataFrame, whereas column selection normally returns a Series.
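To illustrate the caveat (a quick sketch using the out from above):
out['a']   # a DataFrame with four columns, because the label 'a' occurs four times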
We can use DataFrame.reindex with both the index and columns arguments set to vectors["val"]:
idx = vectors["val"] # vectors["val"].tolist() to avoid naming axes `val`.
corr.reindex(index=idx, columns=idx)
val a b a a a b
val
a 0.1 0.2 0.1 0.1 0.1 0.2
b 0.2 0.1 0.2 0.2 0.2 0.1
a 0.1 0.2 0.1 0.1 0.1 0.2
a 0.1 0.2 0.1 0.1 0.1 0.2
a 0.1 0.2 0.1 0.1 0.1 0.2
b 0.2 0.1 0.2 0.2 0.2 0.1
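Equivalently (a small sketch, reusing corr and idx from above), label-based selection with .loc repeats rows and columns the same way; the difference is that .loc raises a KeyError for missing labels, while reindex fills them with NaN:
X = corr.loc[idx, idx]   # same 6x6 result as the reindex calls above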
I have a matrix as follows
0 1 2 3 ...
A 0.1 0.2 0.3 0.1
C 0.5 0.4 0.2 0.1
G 0.6 0.4 0.8 0.3
T 0.1 0.1 0.4 0.2
The data is in a dataframe as shown
Genes string
Gene1 ATGC
Gene2 GCTA
Gene3 ATCG
I need to write code to find the score of each sequence. The score for the sequence ATGC is 0.1 + 0.1 + 0.8 + 0.1 = 1.1 (A scores 0.1 because A is in the first position and the value for A at that position is 0.1; the same lookup is accumulated along the length of the sequence, 450 letters).
The output should be as follows:
Genes Score
Gene1 1.1
Gene2 1.5
Gene3 0.7
I tried using Biopython but could not get it right. Can anyone please help?
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DataFrame:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function extracts the scores indexed by tuples of the form (position, letter) from the tall DataFrame and sums them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()
genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
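If the sequences are long (the question mentions 450 letters), a vectorized variant may be noticeably faster. A sketch under the same assumptions (df is the letter-by-position score matrix with columns ordered 0..N-1, genes is the gene table):
import numpy as np

scores = df.to_numpy()                                     # shape: (4, seq_len)
row_of = {letter: i for i, letter in enumerate(df.index)}  # e.g. {'A': 0, 'C': 1, ...}

def gene2score_fast(gene):
    # Pick one score per position via NumPy fancy indexing and sum them.
    rows = [row_of[ch] for ch in gene]
    return scores[rows, np.arange(len(gene))].sum()

genes['Score'] = genes['string'].apply(gene2score_fast)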
I have a pandas DataFrame with columns pos, k, y. For example:
pos k y
123 0.7 0.5
124 0.4 0.1
125 0.3 0.2
126 0.4 0.1
128 0.3 0.6
130 0.4 0.9
131 0.3 0.2
I would like to sum the information in k and y like this:
123 1.1 0.6
125 0.7 0.3
128 0.3 0.6
130 0.7 1.1
so the output keeps only the first position of each pair, with the sum of the values of that row and of the row whose position immediately follows it.
I tried grouping with pandas:
for k, g in df.groupby(df['pos'] - np.arange(df.shape[0])):
    u = g.iloc[0:, 2:].sum()
but it groups all the consecutive numbers together, which I don't want.
Also, I need something fast, as I have 2,611,774 rows in my data file.
Hope this solves your problem:
import pandas as pd

df = pd.DataFrame({'pos': [123, 124, 125, 126, 128, 130, 131],
                   'k':   [0.7, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3],
                   'y':   [0.5, 0.1, 0.2, 0.1, 0.6, 0.9, 0.2]})
cf = pd.DataFrame(columns=['pos', 'k', 'y'])

row = 0
while row < len(df):
    # If the next position is consecutive, sum the pair and keep the first pos.
    if row + 1 < len(df) and df.loc[row, 'pos'] + 1 == df.loc[row + 1, 'pos']:
        cf.loc[row] = df.loc[row] + df.loc[row + 1]
        cf.loc[row, 'pos'] = df.loc[row, 'pos']
        row += 2
    else:
        cf.loc[row] = df.loc[row]
        row += 1
print(cf)
Maybe this is faster than a loop, but it won't sum positions 123 and 124 and then 130 and 131 as I think you expect, because it always pairs an odd position with the even position that follows it (129 with 130, 131 with 132, ...), regardless of which rows are actually present:
df = df.set_index('pos')
df_odd = df.loc[df.index.values % 2 == 1]
df_even = df.loc[df.index.values % 2 == 0]
df_even = df_even.set_index(df_even.index.values - 1)
df_odd.add(df_even, fill_value = 0)
Result:
pos k y
123 1.1 0.6
125 0.7 0.3
127 0.3 0.6
129 0.4 0.9
131 0.3 0.2
I have not used pandas before, but if you get the chance to use the data as a list then this should work.
def SumNext(L):
    # Sum each element with its immediate successor.
    return [L[i] + L[i + 1] for i in range(len(L) - 1)]
This function will give you a summation of consecutive elements if you input a list.
A = [1, 1, 2, 3, 5, 8, 13]
SumNext(A)  # => [2, 3, 5, 8, 13, 21]
Then you just have to read the values out to wherever you like; working with lists is much faster than an explicit while loop once you have many elements. You will then just need to figure out how to pass the output back to your DataFrame.
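Given the speed requirement (2,611,774 rows), a fully vectorized sketch: identify runs of consecutive positions, then pair rows two at a time within each run, which reproduces the greedy pairing in the expected output (assuming the df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'pos': [123, 124, 125, 126, 128, 130, 131],
                   'k':   [0.7, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3],
                   'y':   [0.5, 0.1, 0.2, 0.1, 0.6, 0.9, 0.2]})

run = (df['pos'] - np.arange(len(df))).rename('run')  # constant within a consecutive run
pair = df.groupby(run).cumcount() // 2                # 0,0,1,1,... pairs rows inside each run
out = (df.groupby([run, pair], sort=False)
         .agg({'pos': 'first', 'k': 'sum', 'y': 'sum'})
         .reset_index(drop=True))
print(out)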
I have the following dataframe:
actual_credit min_required_credit
0 0.3 0.4
1 0.5 0.2
2 0.4 0.4
3 0.2 0.3
I need to add a column indicating where actual_credit >= min_required_credit. The result would be:
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
I am doing the following:
df['result'] = abs(df['actual_credit']) >= abs(df['min_required_credit'])
However, the 3rd row (0.4 and 0.4) constantly results in False. After researching this issue in various places, including What is the best way to compare floats for almost-equality in Python?, I still can't get this to work. Whenever the two columns hold what should be an identical value, the result is False, which is not correct.
I am using Python 3.3.
Due to imprecise float comparison, you can OR your comparison with np.isclose. isclose takes relative and absolute tolerance parameters, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works without NumPy.
df = pd.DataFrame( # adding a small difference at the thousandths place to reproduce the issue
data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
actual_credit min_required_credit result
0 0.3 0.400 False
1 0.5 0.200 True
2 0.4 0.401 True
3 0.2 0.300 False
You might consider using round() to edit your DataFrame more permanently, depending on whether you need that precision. In this example, the OP suggests the extra digits are probably just noise causing confusion.
df = pd.DataFrame( # adding a small difference at the thousandths place to reproduce the issue
data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
In general, NumPy comparison functions work well with pd.Series and allow for element-wise comparisons:
isclose, allclose, greater, greater_equal, less, less_equal, etc.
In your case, greater_equal would do:
df['result'] = np.greater_equal(df['actual_credit'], df['min_required_credit'])
or alternatively, as proposed, using pandas' .ge() (alternatively .le(), .gt(), etc.):
df['result'] = df['actual_credit'].ge(df['min_required_credit'])
The risk of OR-ing with .ge() (as mentioned above) is that comparing, e.g., 3.999999999999 and 4.0 might return True, which might not necessarily be what you want.
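For reference (a small sketch), np.isclose defaults to rtol=1e-05 and atol=1e-08, so the tolerances can be tightened if such near-misses must not count as equal:
import numpy as np

np.isclose(3.999999999999, 4.0)                      # True with the default tolerances
np.isclose(3.999999999999, 4.0, rtol=0, atol=1e-15)  # False with a tighter atol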
Use pandas.DataFrame.abs() instead of the built-in abs():
df['result'] = df['actual_credit'].abs() >= df['min_required_credit'].abs()