I am now getting used to Python lists, but I have encountered a complicated list and I am having trouble parsing it.
prediction = [('__label__inflation_today', 0.8), ('__label__economic_outlook', 0.2)]
I am trying to present this prediction in a better way, something like Excel:
predicted label probability
Inflation_today 0.8
Economic_outlook 0.2
You can try:
for x in prediction:
    string = x[0].replace('__label__', '')
    print(string, ":", x[1])
which prints:
inflation_today : 0.8
economic_outlook : 0.2
If you want to access the values by those names, you can also create a dictionary:
d = {}
for x in prediction:
    string = x[0].replace('__label__', '')
    d[string] = x[1]

>>> d
{'economic_outlook': 0.2, 'inflation_today': 0.8}
>>> d['economic_outlook']
0.2
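The same dictionary can also be built in one line with a dict comprehension:
d = {label.replace('__label__', ''): prob for label, prob in prediction}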
One possible solution is a pandas DataFrame; then use Series.str.replace:
import pandas as pd

prediction = [('__label__inflation_today', 0.8), ('__label__economic_outlook', 0.2)]
df = pd.DataFrame(prediction, columns=['predicted label', 'probability'])
df['predicted label'] = df['predicted label'].str.replace('__label__', '')
print(df)
predicted label probability
0 inflation_today 0.8
1 economic_outlook 0.2
If you need only the data, use DataFrame.to_string:
print(df.to_string(index=False, header=False))
inflation_today 0.8
economic_outlook 0.2
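Since the goal is an Excel-like view, you can also export the frame directly; a minimal sketch (the file names are just placeholders):
# write a CSV that opens cleanly in Excel
df.to_csv('prediction.csv', index=False)
# or write a real Excel file (requires an engine such as openpyxl)
df.to_excel('prediction.xlsx', index=False)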
Hi, I'm trying to encode a genome, stored as a string inside a dataframe read from a CSV.
Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of its base pairs, i.e. from ('acgt...') to ('a','c','g','t'...), then convert each base pair into a float (0.25, 0.50, 0.75, 1.00) respectively.
I thought I was looking for a split function to split each string into characters, but none seem to work on the data in the dataframe, even when it is transformed to a string using .tostring.
Here's my most recent code:
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a', 'g', 'c', 't', 'z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded

dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)
I've tried making my own function, but I don't have any clue how these work. Everything I try points to not being able to process the data while it's in a numpy array, and nothing is working to transform the data to another type.
Thanks for the tips!
Edit: here is the print of the dataframe:
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
There are 5 columns, 'Genome' being the 5th in the list. I don't know 1. why .head() will not work, and 2. why print() doesn't give me all columns...
I don't think LabelEncoder is what you want. This is a simple transformation, so I recommend doing it directly. Start with a lookup table for your base pair mapping:
lookup = {
    'a': 0.25,
    'c': 0.50,
    'g': 0.75,
    't': 1.00,
    # anything else ('z'): 0.00
}
Then apply the lookup to each value of the "Genome" column. The values attribute returns the result as an ndarray.
dataframe['Genome'].apply(lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])).values
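For example, applied to a small made-up frame, this is roughly what comes back:
import pandas as pd

sample = pd.DataFrame({'Genome': ['ACGT', 'acgg']})  # toy data, not the real CSV
encoded = sample['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])).values
print(encoded)
# [[0.25 0.5  0.75 1.  ]
#  [0.25 0.5  0.75 0.75]]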
I'm calculating the frequency of words across many text files (140 docs). The end goal is to create a CSV file where I can order the frequency of every word by single doc and across all docs.
Let's say I have:
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
...
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
So, what I need is a CSV file to export to Excel that looks like this:
WORD, ABS_FREQ, DOC_1_FREQ, DOC_2_FREQ, ..., DOC_140_FREQ
hello, 0.001, 0.8, 0.2, 0.1
world, 0.002, 0.9, 0.3, 0.5
baby, 0.005, 0.7, 0.6, 0.9
How can I do it with Python?
You could also convert it to a pandas DataFrame and save it as a CSV file, or continue the analysis in a clean format.
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
all_dicts = [absolut_freq, doc_1, doc_2, doc_140]
# if you have a bunch of docs, you could use enumerate and then format the column name as you iterate over and create the dataframe
colnames = ['AbsoluteFreq', 'Doc1', 'Doc2', 'Doc140']

import pandas as pd

masterdf = pd.DataFrame()
for i in all_dicts:
    df = pd.DataFrame([i]).T
    masterdf = pd.concat([masterdf, df], axis=1)

# assign the column names
masterdf.columns = colnames
# get a glimpse of what the data frame looks like
masterdf.head()
# save to csv
masterdf.to_csv('docmatrix.csv', index=True)
# and to sort the dataframe by frequency
masterdf.sort_values('AbsoluteFreq')
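As a side note, pandas can also build the same frame in a single call from a dict of dicts (the outer keys become the columns, the words become the index); a minimal sketch using the variables above:
import pandas as pd

masterdf = pd.DataFrame({'AbsoluteFreq': absolut_freq,
                         'Doc1': doc_1,
                         'Doc2': doc_2,
                         'Doc140': doc_140})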
You can make it a mostly data-driven process—given only the variable names of all the dictionary variables—by first creating a table with all the data listed in it, and then using the csv module to write a transposed version of it (columns and rows swapped) to the output file.
import csv

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 = {u'hello': 0.1, u'world': 0.5, u'baby': 0.9}

dic_names = ('absolut_freq', 'doc_1', 'doc_2', 'doc_140')  # dict variable names
namespace = globals()

words = namespace[dic_names[0]].keys()  # assume dicts all contain the same words
table = [['WORD'] + list(words)]  # header row (becomes first column of output)
for dic_name in dic_names:  # add values from each dictionary given its name
    table.append([dic_name.upper() + '_FREQ'] + list(namespace[dic_name].values()))

# Use open('merged_dicts.csv', 'wb') for Python 2.
with open('merged_dicts.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(zip(*table))

print('done')
CSV file produced:
WORD,ABSOLUT_FREQ_FREQ,DOC_1_FREQ,DOC_2_FREQ,DOC_140_FREQ
world,0.002,0.9,0.3,0.5
baby,0.005,0.7,0.6,0.9
hello,0.001,0.8,0.2,0.1
No matter how you want to write this data, first you need an ordered data structure, for example a 2D list:
docs = []
docs.append({u'hello': 0.001, u'world': 0.002, u'baby': 0.005})
docs.append({u'hello': 0.8, u'world': 0.9, u'baby': 0.7})
docs.append({u'hello': 0.2, u'world': 0.3, u'baby': 0.6})
docs.append({u'hello': 0.1, u'world': 0.5, u'baby': 0.9})

words = docs[0].keys()
result = [[word] + [doc[word] for doc in docs] for word in words]
then you can use the built-in csv module: https://docs.python.org/2/library/csv.html
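For completeness, a minimal sketch of that csv step, assuming the four docs from the example above (the file name is just a placeholder):
import csv

with open('frequencies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['WORD', 'ABS_FREQ', 'DOC_1_FREQ', 'DOC_2_FREQ', 'DOC_140_FREQ'])
    writer.writerows(result)  # one row per word, as built above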
My output for the following code only saves the last row of result instead of putting every single row of values into the CSV file. I have limited knowledge of Python; I think my looping part is incorrect. Can anyone help me?
Code
import numpy as np
from numpy import genfromtxt

with open('binary.csv') as actg:
    actg = actg.readlines()
with open('single.csv') as single:
    single = single.readlines()
with open('division.csv') as division:
    division = division.readlines()

for line in actg:
    for line2 in single:
        for line1 in division:
            myarray = np.fromstring(line, dtype=float, sep=',')
            myarray = myarray.reshape((-1, 3, 4))
            a = np.asmatrix(myarray)
            a = np.array(a)
            single1 = np.fromstring(line2, dtype=float, sep=',')
            single1 = single1.reshape((-1, 4))
            s = np.asmatrix(single1)
            s = np.array(s)
            division1 = np.fromstring(line1, dtype=float, sep=',')
            m = np.asmatrix(division1)
            m = np.array(m)
            res2 = (s[np.newaxis, :, :] / m[:, np.newaxis, :] * a).sum(axis=-1)

np.savetxt("output.csv", res2, delimiter=",")
binary.csv
0,1,0,0,1,0,0,0,0,0,0,1
0,0,1,0,1,0,0,0,1,0,0,0
single.csv:
0.28,0.22,0.23,0.27,0.12,0.29,0.34,0.21,0.44,0.56,0.51,0.65
division.csv
0.4,0.5,0.7,0.1
0.2,0.8,0.9,0.3
Expected output
0.44,0.3,6.5
0.26,0.6,2.2
Actual output
0.26,0.6,2.2
It is not your Python understanding; it is your loop logic.
If you look closely at your loops, you (and not Python) always keep only the last elements: on every pass you overwrite the variables you need, and you never keep the information from the earlier iterations anywhere.
Because you only write out the results after the loops have finished, you are only seeing the last result. Also, I would expect 4 results from that data, not 2.
You need to write the output on each iteration, or store each iteration's result and then write out the stored values.
Try printing res2 each time you calculate it and you will see something like this:
('Final Results', array([[ 0.44, 0.3 , 6.5 ]]))
('Final Results', array([[ 0.275 , 0.6 , 2.16666667]]))
('Final Results', array([[ 0.32857143, 0.3 , 1.1 ]]))
('Final Results', array([[ 0.25555556, 0.6 , 2.2 ]]))
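One possible fix, as a minimal sketch: collect every res2 and write the file once after the loops. Here compute_res2 is a hypothetical helper standing in for the arithmetic in your loop body, and np.vstack assumes every res2 has the same number of columns:
results = []
for line in actg:
    for line2 in single:
        for line1 in division:
            res2 = compute_res2(line, line2, line1)  # hypothetical helper for the math above
            results.append(res2)

np.savetxt("output.csv", np.vstack(results), delimiter=",")  # written once, with all rows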
My function outputs a list, for instance when I type:
My_function('TV', 'TV_Screen')
it outputs the following:
['TV', 1, 'TV_Screen', 0.04, 'True']
Now, my TV is made of several parts, such as the speaker, the transformer, etc. I can keep running my function for each part, for instance changing 'TV_Screen' to 'TV_Speaker', 'TV_transformer', etc.
The alternative is to create a list with all the parts, such as:
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
What I am trying to get is a pandas data frame with 5 columns (because my function outputs 5 variables, see the section "it outputs the following:" above) and in this case 3 rows (one each for 'TV_Screen', 'TV_Speaker', and 'TV_transformer'). Basically, I want the following to be in a data frame:
['TV', 1, 'TV_Screen', 0.04, 'True']
['TV', 9, 'TV_Speaker', 0.56, 'True']
['TV', 3, 'TV_transformer', 0.80, 'False']
I know I need a for loop somewhere, but I am not sure how to create this data frame. Could you please help? (I can change the output of my function to be a pd.Series or something else that would work better).
Thanks!
If you have many arrays, it may be worth converting them into a numpy matrix first and then converting that into a dataframe.
import pandas as pd
import numpy as np

a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']

matrix = np.matrix([a, b, c])  # note: numpy coerces these mixed types to a common dtype (strings)
df = pd.DataFrame(data=matrix)
You can do it like this:
import pandas as pd

def My_function(part):
    # prepare result
    result = ['TV', 1, part, 0.04, 'True']  # for testing
    return result

TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
df = pd.DataFrame([My_function(part) for part in TV_parts])
>>> df
0 1 2 3 4
0 TV 1 TV_Screen 0.04 True
1 TV 1 TV_Speaker 0.04 True
2 TV 1 TV_transformer 0.04 True
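If you prefer named columns over the default 0-4, you can pass them explicitly; the names here are only examples, not anything your function requires:
df = pd.DataFrame([My_function(part) for part in TV_parts],
                  columns=['product', 'count', 'part', 'value', 'flag'])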
I made N experiments and for each experiment I have a list of results with dates, i.e. I have N lists of the type [[float1, date1], [float2, date2], ...]
I want to make a matrix (NxM) of the results of all the experiments for the common dates.
What is the most efficient way to do it?
For example,
Given three experiments (N = 3) with values:
[[float1a, date1],
[float2a, date2],
[float3a, date3]]
[[float1b, date1],
[float2b, date2],
[float3b, date3]]
[[float1c, date1],
[float2c, date2],
[float3c, date3],
[float4c, date4]]
I would like to produce something like:
date1 - float1a float1b float1c
date2 - float2a float2b float2c
date3 - float3a float3b float3c
I'd look at using pandas for something like this:
import pandas as pd
from datetime import date
expr1 = [[1.2,date(2012,1,1)], [1.3,date(2012,1,2)], [1.4,date(2012,1,3)]]
expr2 = [[1.2,date(2012,1,1)], [1.3,date(2012,1,2)], [1.4,date(2012,1,3)], [1.5,date(2012,1,4)]]
expr3 = [[1.2,date(2012,1,1)], [1.3,date(2012,1,2)], [1.4,date(2012,1,3)]]
exper_df1 = pd.DataFrame(expr1).set_index(1).rename(columns={0: "Result_1"})
exper_df2 = pd.DataFrame(expr2).set_index(1).rename(columns={0: "Result_2"})
exper_df3 = pd.DataFrame(expr3).set_index(1).rename(columns={0: "Result_3"})
experiments = [exper_df2, exper_df3]
exper_df = exper_df1.join(experiments, how='inner')
This produces a single DataFrame labelled by the dates you seek:
Result_1 Result_2 Result_3
1
2012-01-01 1.2 1.2 1.2
2012-01-02 1.3 1.3 1.3
2012-01-03 1.4 1.4 1.4
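As a side note, pd.concat gives an equivalent one-liner (it performs the same inner alignment on the date index):
exper_df = pd.concat([exper_df1, exper_df2, exper_df3], axis=1, join='inner')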
I'm not sure I understood you correctly, but if by common dates you mean shared dates, you can create a dictionary where each key is a date and the value is a list of results from that date.
{'date1': ['float1', 'float11', etc..], 'date2': [...], ... }
This will also allow easy access to results from a specific date.
It can be done the following way:
my_results_list = [[float1, date1], [float2, date2], ...]
results_by_date = {}
for res_couple in my_results_list:
    result, date = res_couple
    if date not in results_by_date:
        results_by_date[date] = []
    results_by_date[date].append(result)
I'm certain there are better ways to do this performance-wise if that is an issue, but you get the idea.
Hope this helps.
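The same idea reads a bit more cleanly with collections.defaultdict; a minimal sketch:
from collections import defaultdict

results_by_date = defaultdict(list)
for result, date in my_results_list:
    results_by_date[date].append(result)  # missing keys get an empty list automatically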
You can use the numpy.asmatrix(data, dtype=None) function; it is an efficient way to create a matrix:
import numpy as np
x = np.array([[float1, date1], [float2, date2], ...])
matrix = np.asmatrix(x)