I am trying to plot two columns from a CSV file. Each column contains a list of values.
How can I plot those columns? They are read as strings, and the conversion to float fails on the "[" character.
Here's a snippet of my CSV. I want to plot Column1 versus Column2 for all names (A, B, C):
Name Column1 Column2 Column3
A [0.1,0.2,0.3] [0.1,0.2,0.3] 0.2
B [0.9,0.7,0.3] [0.1,0.8,0.3] 0.2
C [0.1,0.2,0.6] [0.1,0.2,0.3] 0.2
I tried to use the following code
r = pd.read_csv('L.csv')
plt.plot(r.loc[i]['Column1'].astype(float),
r.loc[i]['Column2'].astype(float),
linestyle="--",
label="{}, Led={:.3f}".format(i, r.loc[i]['Column3']))
You can use the ast.literal_eval() function to convert a string into the Python object it represents (here, a list of floats):
import ast
plt.plot(ast.literal_eval(r.loc[i]['Column1']),
ast.literal_eval(r.loc[i]['Column2']),
linestyle="--",
label="{}, Led={:.3f}".format(i, r.loc[i]['Column3']))
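To see what literal_eval does to a single column before plotting, here is a minimal sketch. The sample strings are copied from the question; the names `col` and `parsed` are only for illustration.

```python
import ast

import pandas as pd

# Cells as pandas reads them from the CSV: plain strings, not lists
col = pd.Series(["[0.1,0.2,0.3]", "[0.9,0.7,0.3]"], name="Column1")

# literal_eval safely parses each string into the Python literal it spells out
parsed = col.map(ast.literal_eval)

print(type(col.iloc[0]).__name__)     # type of a cell before parsing
print(type(parsed.iloc[0]).__name__)  # type of a cell after parsing
print(parsed.iloc[0])
```

Each parsed cell is an ordinary list, so it can be passed straight to plt.plot.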
I used your csv data to read the data in:
Name;Column1;Column2;Column3
A;[0.1,0.2,0.3];[0.1,0.2,0.3];0.2
B;[0.9,0.7,0.3];[0.1,0.8,0.3];0.2
C;[0.1,0.2,0.6];[0.1,0.2,0.3];0.2
import ast
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('input_csv.csv', sep=';')
# change the strings to actual lists
# (applymap is deprecated since pandas 2.1; DataFrame.map replaces it there)
df[['Column1', 'Column2']] = df[['Column1', 'Column2']].applymap(ast.literal_eval)
# explode the lists into separate rows
out = df.explode(column=['Column1', 'Column2'])
print(out)
# one line per Name, labelled with its Column3 value
for grp, val in out.groupby('Name'):
    plt.plot(val['Column1'],
             val['Column2'],
             linestyle='--',
             label="{}, Led={:.3f}".format(grp, val['Column3'].unique()[0]))
plt.legend()
Output of out:
Name Column1 Column2 Column3
0 A 0.1 0.1 0.2
0 A 0.2 0.2 0.2
0 A 0.3 0.3 0.2
1 B 0.9 0.1 0.2
1 B 0.7 0.8 0.2
1 B 0.3 0.3 0.2
2 C 0.1 0.1 0.2
2 C 0.2 0.2 0.2
2 C 0.6 0.3 0.2
With a dataframe like this:
index col_1 col_2 ... col_n
0 0.2 0.1 0.3
1 0.2 0.1 0.3
2 0.2 0.1 0.3
...
n 0.4 0.7 0.1
How can one get the norm of each column, where the norm is the square root of the sum of the squares?
I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new Series like this:
norms
col_1 0.111
col_2 0.202
col_3 0.55
...
col_n 0.100
You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
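A small sketch to check this against the definition in the question (square root of the sum of squares). The frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the norms are easy to verify by hand
df = pd.DataFrame({"col_1": [3.0, 4.0], "col_2": [1.0, 0.0], "col_3": [6.0, 8.0]})

# axis=0 collapses the rows, giving one norm per column
norms = pd.Series(np.linalg.norm(df, axis=0), index=df.columns, name="norms")

# The same quantity spelled out: sqrt of the column-wise sum of squares
manual = np.sqrt((df ** 2).sum())

print(norms)
```

Both `norms` and `manual` should agree, since the default norm for a vector is the 2-norm.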
I have the following reference
corr = pd.DataFrame({'i':['a','b'],'a':[.1,.2],'b':[.2,.1]}).set_index('i')
I also have some vector values. The length will always change, so I'm only using 5 here to show what I am trying to achieve.
vectors = pd.DataFrame({'val':['a','b','a','a','b']})
I would like to use these values to generate a 5x5 matrix 'X' (5x5 because len(vectors) = 5), where each entry is the corr value for the corresponding pair of labels.
The closest I can think of is a map function, but that generates only one column.
Use DataFrame.reindex twice, once for rows and once for columns:
out = corr.reindex(vectors['val']).reindex(vectors['val'], axis=1)
Note that duplicate column names, while supported, are not recommended. For example, out['a'] would return a DataFrame here, while in most cases it returns a Series.
We can use DataFrame.reindex with both the index and columns arguments set to vectors["val"]. The output below is for a six-element vector (a, b, a, a, a, b):
idx = vectors["val"] # vectors["val"].tolist() to avoid naming axes `val`.
corr.reindex(index=idx, columns=idx)
val a b a a a b
val
a 0.1 0.2 0.1 0.1 0.1 0.2
b 0.2 0.1 0.2 0.2 0.2 0.1
a 0.1 0.2 0.1 0.1 0.1 0.2
a 0.1 0.2 0.1 0.1 0.1 0.2
a 0.1 0.2 0.1 0.1 0.1 0.2
b 0.2 0.1 0.2 0.2 0.2 0.1
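For the five-element vector from the question, a minimal sketch of the same reindex call looks like this. The names `labels` and `X` are only for illustration.

```python
import pandas as pd

corr = pd.DataFrame({'i': ['a', 'b'], 'a': [.1, .2], 'b': [.2, .1]}).set_index('i')
vectors = pd.DataFrame({'val': ['a', 'b', 'a', 'a', 'b']})

# A plain list avoids naming both axes `val`
labels = vectors['val'].tolist()

# Repeat rows and columns of corr in the order given by labels
X = corr.reindex(index=labels, columns=labels)

print(X.shape)
print(X)
```

This works because the labels in corr itself are unique; only the requested labels repeat.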
Is this possible to do in a pandas DataFrame?
I want to keep only the first value in the Amount column and replace the values in the second and following rows with 0.
Input
Name  Date      Amount  Labor
A     1/1/1972  5       0.3
A     1/1/1972  5       0.1
A     1/1/1972  5       0.7
A     1/1/1972  1       0.3
B     7/2/1980  1       0.6
B     7/2/1980  1       0.3
B     7/2/1980  1       0.7
C     6/9/1965  4       0.2
C     6/9/1965  4       0.3
C     6/9/1965  4       0.4
Output
Name  Date      Amount  Labor
A     1/1/1972  5       0.3
A     1/1/1972  0       0.1
A     1/1/1972  0       0.7
A     1/1/1972  0       0.3
B     7/2/1980  1       0.6
B     7/2/1980  0       0.3
B     7/2/1980  0       0.7
C     6/9/1965  4       0.2
C     6/9/1965  0       0.3
C     6/9/1965  0       0.4
It is as simple as multiplying by a boolean mask that is True wherever Amount differs from the previous row:
df['Amount'] *= df['Amount'].ne(df['Amount'].shift())
Yes, that is possible. You can use .duplicated(..) to construct a Series that marks all duplicates with True, and then assign values through that mask:
df.loc[df['Amount'].duplicated(), 'Amount'] = 0
Or if you only want to set values that are duplicates in a "sequence", we can work with .diff().eq(0):
df.loc[df['Amount'].diff().eq(0), 'Amount'] = 0
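The masks above look only at the Amount values. In the expected output the fourth A row becomes 0 even though its Amount (1) differs from the rows before it, which suggests "first row" means the first row per Name. Under that assumption, a sketch that builds the mask from Name instead:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Name": list("AAAABBBCCC"),
    "Date": ["1/1/1972"] * 4 + ["7/2/1980"] * 3 + ["6/9/1965"] * 3,
    "Amount": [5, 5, 5, 1, 1, 1, 1, 4, 4, 4],
    "Labor": [0.3, 0.1, 0.7, 0.3, 0.6, 0.3, 0.7, 0.2, 0.3, 0.4],
})

# True for every row whose Name has already appeared, i.e. all but the first per Name
repeat_of_name = df["Name"].duplicated()

# Zero out Amount on those rows only
df.loc[repeat_of_name, "Amount"] = 0

print(df)
```

If the first row should be decided per Name and Date together, df.duplicated(["Name", "Date"]) gives the corresponding mask.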
I have a matrix as follows
0 1 2 3 ...
A 0.1 0.2 0.3 0.1
C 0.5 0.4 0.2 0.1
G 0.6 0.4 0.8 0.3
T 0.1 0.1 0.4 0.2
The data is in a dataframe as shown
Genes string
Gene1 ATGC
Gene2 GCTA
Gene3 ATCG
I need to write code to find the score of each sequence. The score for the sequence ATGC is 0.1+0.1+0.8+0.1 = 1.1: A scores 0.1 because A is in the first position and the value for A at that position is 0.1. The rest is calculated the same way along the length of the sequence (450 letters).
The output should be as follows:
Genes Score
Gene1 1.1
Gene2 1.5
Gene3 0.7
I tried using biopython but could not get it right. Can anyone please help!
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DF:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function extracts the scores indexed by tuples of the form (position, "letter") from the tall DF and sums them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()
genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
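An alternative sketch that skips the reshaping and indexes the matrix directly with NumPy. It assumes the matrix rows are labelled A, C, G, T and the columns are in position order, as in the question; the names `matrix`, `weights` and `score` are only for illustration.

```python
import numpy as np
import pandas as pd

# Position weight matrix from the question: rows are letters, columns are positions
matrix = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
)
genes = pd.DataFrame(
    {"string": ["ATGC", "GCTA", "ATCG"]},
    index=pd.Index(["Gene1", "Gene2", "Gene3"], name="Genes"),
)
weights = matrix.to_numpy()

def score(seq):
    # Row number of each letter; get_indexer returns -1 for letters not in the
    # matrix, so unexpected characters should be validated before this step
    rows = matrix.index.get_indexer(list(seq))
    cols = np.arange(len(seq))  # position i pairs with column i
    return weights[rows, cols].sum()

genes["Score"] = genes["string"].map(score)
print(genes)
```

Each sequence costs one fancy-indexing lookup, so this should scale reasonably to 450-letter sequences.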
I have a pandas DataFrame with columns pos (position), k and y. For example:
pos k y
123 0.7 0.5
124 0.4 0.1
125 0.3 0.2
126 0.4 0.1
128 0.3 0.6
130 0.4 0.9
131 0.3 0.2
I would like to sum the values of k and y like this:
123 1.1 0.6
125 0.7 0.3
128 0.3 0.6
130 0.7 1.1
so the output has only the first position of each pair, together with the sum of the values of that row and of the row for the immediately consecutive position that follows it.
I tried grouping with pandas:
for k, g in df.groupby(df['pos'] - np.arange(df.shape[0])):
    u = g.ix[0:, 2:].sum()  # .ix was removed in pandas 1.0; .iloc is the positional replacement
but it groups all the consecutive numbers together, which I don't want.
I also need something fast, as I have 2,611,774 rows in my data file.
Hope this will solve your problem
import pandas as pd

df = pd.DataFrame({'pos': [123, 124, 125, 126, 128, 130, 131],
                   'k': [.7, .4, .3, .4, .3, .4, .3],
                   'y': [.5, .1, .2, .1, .6, .9, .2]})
cf = pd.DataFrame(columns=['pos', 'k', 'y'])

row = 0
while row < len(df):
    # pair this row with the next one only if their positions are consecutive
    if row + 1 < len(df) and df.loc[row, 'pos'] + 1 == df.loc[row + 1, 'pos']:
        cf.loc[row] = df.loc[row] + df.loc[row + 1]
        cf.loc[row, 'pos'] = df.loc[row, 'pos']  # keep the first position
        row += 2
    else:
        # no consecutive partner: copy the row unchanged
        cf.loc[row] = df.loc[row]
        row += 1
print(cf)
This may be faster than a loop, but it will not sum positions 130 and 131 as I think you expect: it pairs each odd position with the even position that follows it (123 with 124, 129 with 130, 131 with 132, ...).
df = df.set_index('pos')
df_odd = df.loc[df.index.values % 2 == 1]
df_even = df.loc[df.index.values % 2 == 0]
df_even = df_even.set_index(df_even.index.values - 1)
df_odd.add(df_even, fill_value = 0)
Result:
pos k y
123 1.1 0.6
125 0.7 0.3
127 0.3 0.6
129 0.4 0.9
131 0.3 0.2
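A vectorized sketch that pairs rows the way the question describes, rather than by odd and even positions. It assumes pos is sorted, and that within a run of consecutive positions rows are paired from the start of the run (1st with 2nd, 3rd with 4th, and so on). The names `run_id` and `pair_id` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"pos": [123, 124, 125, 126, 128, 130, 131],
                   "k": [0.7, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3],
                   "y": [0.5, 0.1, 0.2, 0.1, 0.6, 0.9, 0.2]})

# A new run starts wherever pos does not follow the previous pos by exactly 1
run_id = df["pos"].diff().ne(1).cumsum().rename("run")

# Inside each run, rows 0 and 1 form pair 0, rows 2 and 3 form pair 1, ...
pair_id = (df.groupby(run_id).cumcount() // 2).rename("pair")

# Keep the first position of each pair and sum k and y
out = (df.groupby([run_id, pair_id], sort=False)
         .agg(pos=("pos", "first"), k=("k", "sum"), y=("y", "sum"))
         .reset_index(drop=True))

print(out)
```

There is no Python-level loop over rows here, so it should cope with a few million rows, though I have not timed it on data of that size.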
I have not used pandas before, but if you can get the data as a list then this should work.
def SumNext(L):
    # sum each element with the one that follows it
    return [L[i] + L[i + 1] for i in range(len(L) - 1)]
This function gives you the sums of consecutive elements of the list you pass in.
A=[1,1,2,3,5,8,13]
SumNext(A) => [2,3,5,8,13,21]
Then you just have to write the values out to wherever you like. With lots of elements, it is much faster to do this with lists than with a while loop.
You will then need to figure out how to pass the output back to your data frame.