Flatten object within an object in Python

I'm trying to transform a class I created:
class LabeledSourceFeatures:
    label = ''
    features = FeatureSet()

    def __init__(self, label, features):
        self.label = label
        self.features = features

    # added this as part of the workaround
    def flat_features(self):
        return self.features.__dict__
It is easier to create the FeatureSet first and then attach it to this class (I mention that only to make clear that I don't want to move the members of FeatureSet into LabeledSourceFeatures). The end result, though, I want to put into a pandas DataFrame. The problem is that I was getting a DataFrame with two columns: one for the label string and one for the FeatureSet object. What I really want is for every key and value of my FeatureSet to become its own column.
This is what I've tried so far:
intermediate_data = [(t.__dict__, x.__dict__ for x in t.features) for t in labeledFeatures]
# This fails with a syntax error.
A workaround is this:
intermediate_data = [(t.label, t.flat_features()) for t in labeledFeatures]
final_data = []
for row in intermediate_data:
    new_row = row[1]
    new_row['label'] = row[0]
    final_data.append(new_row)
but this looks very inefficient.
Edit:
a FeatureSet looks like this:
class FeatureSet:
    """
    Adapted from the CSFS presented in De-anonymizing Programmers via Code Stylometry
    by:
    Aylin Caliskan-Islam, Drexel University; Richard Harang, U.S. Army Research Laboratory;
    Andrew Liu, University of Maryland; Arvind Narayanan, Princeton University;
    Clare Voss, U.S. Army Research Laboratory; Fabian Yamaguchi, University of Goettingen;
    Rachel Greenstadt, Drexel University
    """
    # LEXICAL FEATURES
    ln_keyword_length = 0
    ln_unique_keyword_length = 0
    ln_comments_length = 0
    ln_token_length = 0
    avg_line_length = 0
    # LAYOUT FEATURES
    ln_tabs_length = 0
    ln_space_length = 0
    ln_empty_length = 0
    white_space_ratio = 0
    is_brace_on_new_line = False
    do_tabs_lead_lines = False
    comment_text = ''
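One way to avoid the intermediate loop is to build each row as a single dict, merging the FeatureSet's attributes with the label in one comprehension. A minimal sketch, using stripped-down, hypothetical stand-ins for the two classes (only two of the feature fields shown):

```python
import pandas as pd

# Hypothetical, stripped-down versions of the question's two classes.
class FeatureSet:
    def __init__(self, avg_line_length=0, white_space_ratio=0):
        self.avg_line_length = avg_line_length
        self.white_space_ratio = white_space_ratio

class LabeledSourceFeatures:
    def __init__(self, label, features):
        self.label = label
        self.features = features

labeledFeatures = [
    LabeledSourceFeatures('alice', FeatureSet(12.5, 0.30)),
    LabeledSourceFeatures('bob', FeatureSet(9.0, 0.25)),
]

# One dict per row: the FeatureSet's attributes merged with the label column.
rows = [{**t.features.__dict__, 'label': t.label} for t in labeledFeatures]
df = pd.DataFrame(rows)
```

Each feature becomes its own column and `label` rides along as one more key, so no second pass over the rows is needed.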

Implement do_sample for a custom GPT-NEO model

import numpy as np
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
import coremltools as ct

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
sentence_fragment = "The Oceans are"

class NEO(torch.nn.Module):
    def __init__(self, model):
        super(NEO, self).__init__()
        self.next_token_predictor = model

    def forward(self, x):
        sentence = x
        predictions, _ = self.next_token_predictor(sentence)
        token = torch.argmax(predictions[-1, :], dim=0, keepdim=True)
        sentence = torch.cat((sentence, token), 0)
        return sentence

token_predictor = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", torchscript=True).eval()
context = torch.tensor(tokenizer.encode(sentence_fragment))
random_tokens = torch.randint(10000, (5,))
traced_token_predictor = torch.jit.trace(token_predictor, random_tokens)
model = NEO(model=traced_token_predictor)
scripted_model = torch.jit.script(model)
# Custom model
sentence_fragment = "The Oceans are"
for i in range(10):
    context = torch.tensor(tokenizer.encode(sentence_fragment))
    torch_out = scripted_model(context)
    sentence_fragment = tokenizer.decode(torch_out)
print("Custom model: {}".format(sentence_fragment))

# Stock model
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", torchscript=True).eval()
sentence_fragment = "The Oceans are"
input_ids = tokenizer(sentence_fragment, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, max_length=20)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print("Stock model: " + gen_text)
RUN 1
Output:
Custom model: The Oceans are the most important source of water for the entire world
Stock model: The Oceans are on the rise. The American Southwest is thriving, but the southern United States still
RUN 2
Output:
Custom model: The Oceans are the most important source of water for the entire world.
Stock model: The Oceans are the land of man
This is a short video of the Australian government
The custom model always returns the same output. With do_sample=True, however, the stock model.generate returns different results on each call. I've spent a lot of time figuring out how do_sample works in transformers, so I need some help from you guys; I appreciate it.
How to code a custom model to have different results on each call?
Thanks!
So, the answer would be to implement sampling :D
class NEO(torch.nn.Module):
    def __init__(self, model):
        super(NEO, self).__init__()
        self.next_token_predictor = model

    def forward(self, x):
        sentence = x
        predictions, _ = self.next_token_predictor(sentence)
        # get the top-K (k=2) indices of the highest-probability tokens
        # 2 indices are enough; you still get 2**N variations over N steps
        _, topK = torch.topk(predictions[-1, :], 2, dim=0)
        # pick one of those two indices at random, and concat the sentence
        perm = torch.randperm(topK.size(0))
        idx = perm[:1]
        token = topK[idx.long()]
        sentence = torch.cat((sentence, token), 0)
        return sentence
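An alternative to the top-k/randperm trick above is temperature sampling, which is closer to what `generate(do_sample=True)` does internally: softmax the final-position logits into a probability distribution and draw from it with `torch.multinomial`. A sketch; the 4-element logits tensor is invented for illustration:

```python
import torch

def sample_next_token(logits, temperature=1.0):
    # logits: 1-D tensor of vocabulary scores for the last position.
    # Scale by temperature, convert to probabilities, then draw one index.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
next_token = sample_next_token(logits)
```

Lower temperatures sharpen the distribution toward argmax; higher ones flatten it, so repeated calls give different continuations.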

How can I get the information from a dataframe containing dictionaries or lists in every column?

I have this information and I can't get the values of the columns serviceTypes and crowding:
id name modeName disruptions lineStatuses serviceTypes crowding
0 piccadilly Piccadilly tube [] [] [{'$type': 'Tfl.Api.Presentation.Entities.Line... {'$type': 'Tfl.Api.Presentation.Entities.Crowd...
1 victoria Victoria tube [] [] [{'$type': 'Tfl.Api.Presentation.Entities.Line... {'$type': 'Tfl.Api.Presentation.Entities.Crowd...
2 bakerloo Bakerloo tube [] [] [{'$type': 'Tfl.Api.Presentation.Entities.Line... {'$type': 'Tfl.Api.Presentation.Entities.Crowd...
3 central Central tube [] [] [{'$type': 'Tfl.Api.Presentation.Entities.Line... {'$type': 'Tfl.Api.Presentation.Entities.Crowd.
I tried this code:
def split(x, index):
    try:
        return x[index]
    except:
        return None

dflines['serviceTypes'] = dflines.serviceTypes.apply(lambda x: split(x, 0))
dflines['crowding'] = dflines.crowding.apply(lambda x: split(x, 1))

def values(x):
    try:
        return ';'.join('{}'.format(val) for val in x.values())
    except:
        return None

m = dflines['serviceTypes'].apply(lambda x: values(x))
dflines1 = m.str.split(';', expand=True)
dflines1.columns = dflines['serviceTypes'][0].keys()
dflines2 = dflines1[['name']]
dflines2
But I got this error:
AttributeError Traceback (most recent call last)
<ipython-input-108-8f4bb6ac731a> in <module>
14 m = dflines['serviceTypes'].apply(lambda x:values(x))
15 dflines1 = m.str.split(';', expand=True)
---> 16 dflines1.columns = dflines['serviceTypes'][0].keys()
17 dflines2 = dflines1[['name']]
18 dflines2
AttributeError: 'str' object has no attribute 'keys'
Can anyone help me?
You can pull a pandas column out into its own variable like so:
service_types = dflines['serviceTypes']
The first value is now the first element of service_types:
first_value = service_types[0]
Pandas works differently than a dictionary. I think you might be trying to treat the data frame as a dictionary. Apologies if I misunderstood or oversimplified.
Edit:
OK, it looks like service_types (above) is a column of lists of dictionaries. To build a column that contains only the type, you need to index into the list and then into the dictionary:
service_types = dflines['serviceTypes']
types_alone = []
for i in service_types:
    types_alone.append(i[0]['$type'])
dflines['new_column'] = types_alone
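For the general case, pandas has built-in tools for exactly this shape of data: `DataFrame.explode` unpacks a column of lists into one row per element, and `pd.json_normalize` spreads a column of dicts into columns. A sketch on toy rows shaped like the question's `serviceTypes` column (the dict contents are invented):

```python
import pandas as pd

# Toy data shaped like the question's serviceTypes column (values invented).
df = pd.DataFrame({
    'id': ['piccadilly', 'victoria'],
    'serviceTypes': [
        [{'$type': 'LineServiceTypeInfo', 'name': 'Regular'}],
        [{'$type': 'LineServiceTypeInfo', 'name': 'Night'}],
    ],
})

# explode() gives each list element its own row; json_normalize()
# then spreads the dict keys into columns.
exploded = df.explode('serviceTypes').reset_index(drop=True)
expanded = pd.json_normalize(exploded['serviceTypes'].tolist())
result = exploded[['id']].join(expanded)
```

This avoids hand-written try/except indexing entirely and works no matter how many dicts each list holds.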

How to make the adding process automatic based on keyword count

I am trying to make a form where, if I input medicines' names, it will show each medicine's solution (category) in turn. But the way I'm building it is limited: the more lines I code, the more input slots the feedback gets. It would be great if you could help me make something short that handles any number of inputs with a loop.
df = pd.DataFrame({'FEVER': ['NAPA_PLUS', 'JERIN', 'PARASITAMOL'],
                   'GASTRIC': ['SECLO40', 'SECLO20', 'ANTACID'],
                   'WATERINESS': ['ORSALINE', 'TESTY_SALINE', 'HOME_MADE_SALINE']})

def word_list(text):
    return list(filter(None, re.split('\W+', text)))

session = raw_input("INPUT THE NAME OF THE MEDICINES ONE BY ONE BY KEEPING SPACE:")
feedback = session
print(word_list(feedback))

dff = pd.DataFrame({'itemlist': [feedback]})
dff['1'] = dff['itemlist'].astype(str).str.split().str[0]
dff['2'] = dff['itemlist'].astype(str).str.split().str[1]
dff['3'] = dff['itemlist'].astype(str).str.split().str[2]
dff['4'] = dff['itemlist'].astype(str).str.split().str[3]
dff['5'] = dff['itemlist'].astype(str).str.split().str[4]

for pts1 in dff['1']:
    pts1 = df.columns[df.isin([pts1]).any()]
for pts2 in dff['2']:
    pts2 = df.columns[df.isin([pts2]).any()]
for pts3 in dff['3']:
    pts3 = df.columns[df.isin([pts3]).any()]
for pts4 in dff['4']:
    pts4 = df.columns[df.isin([pts4]).any()]
for pts5 in dff['5']:
    pts5 = df.columns[df.isin([pts5]).any()]
This wraps your repeated code into two loops:
...
dff = pd.DataFrame({'itemlist': [feedback]})
limit = 5
for i in xrange(limit):
    name = str(i + 1)
    dff[name] = dff['itemlist'].astype(str).str.split().str[i]
    for pts in dff[name]:
        pts = df.columns[df.isin([pts]).any()]
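If you are on Python 3 and can drop the numbered-column detour entirely, a single comprehension over the split input maps each medicine straight to the column (category) that contains it, with no fixed limit on how many medicines were typed. A sketch reusing the question's DataFrame; the sample input line is invented and would come from `input()` in the real script:

```python
import pandas as pd

# The question's lookup table.
df = pd.DataFrame({'FEVER': ['NAPA_PLUS', 'JERIN', 'PARASITAMOL'],
                   'GASTRIC': ['SECLO40', 'SECLO20', 'ANTACID'],
                   'WATERINESS': ['ORSALINE', 'TESTY_SALINE', 'HOME_MADE_SALINE']})

# Invented sample input; in the real script: feedback = input("...")
feedback = "JERIN ANTACID ORSALINE"

# df.isin([med]).any() marks the columns containing the medicine;
# indexing df.columns with that boolean mask recovers the category name.
categories = [df.columns[df.isin([med]).any()][0] for med in feedback.split()]
```

Because the list comprehension iterates over `feedback.split()`, it grows with the input automatically instead of needing one hard-coded column per word.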

Independent arrays interfered with each other?

There's a given data set of two columns: EmployeeCode and Surname.
The format is like:
EmployeeCode[1] = "L001"
Surname[1] = "Pollard"
EmployeeCode[2] = "L002"
Surname[2] = "Wills"
...
What I was trying to do is to sort according to lexicographic order for each column so as to facilitate implementation of binary search later on.
This is my code:
#data set
EmployeeCode, Surname = [0]*33, [0]*33
EmployeeCode[1] = "L001"
Surname[1] = "Pollard"
EmployeeCode[2] = "L002"
Surname[2] = "Wills"
EmployeeCode[3] = "L007"
Surname[3] = "Singh"
EmployeeCode[4] = "L008"
Surname[4] = "Yallop"
EmployeeCode[5] = "L009"
Surname[5] = "Adams"
EmployeeCode[6] = "L013"
Surname[6] = "Davies"
EmployeeCode[7] = "L014"
Surname[7] = "Patel"
EmployeeCode[8] = "L021"
Surname[8] = "Kelly"
EmployeeCode[9] = "S001"
Surname[9] = "Ong"
EmployeeCode[10] = "S002"
Surname[10] = "Goh"
EmployeeCode[11] = "S003"
Surname[11] = "Ong"
EmployeeCode[12] = "S004"
Surname[12] = "Ang"
EmployeeCode[13] = "S005"
Surname[13] = "Wong"
EmployeeCode[14] = "S006"
Surname[14] = "Teo"
EmployeeCode[15] = "S007"
Surname[15] = "Ho"
EmployeeCode[16] = "S008"
Surname[16] = "Chong"
EmployeeCode[17] = "S009"
Surname[17] = "Low"
EmployeeCode[18] = "S010"
Surname[18] = "Sim"
EmployeeCode[19] = "S011"
Surname[19] = "Tay"
EmployeeCode[20] = "S012"
Surname[20] = "Tay"
EmployeeCode[21] = "S013"
Surname[21] = "Chia"
EmployeeCode[22] = "S014"
Surname[22] = "Tan"
EmployeeCode[23] = "S015"
Surname[23] = "Yeo"
EmployeeCode[24] = "S016"
Surname[24] = "Lim"
EmployeeCode[25] = "S017"
Surname[25] = "Tan"
EmployeeCode[26] = "S018"
Surname[26] = "Ng"
EmployeeCode[27] = "S018"
Surname[27] = "Lim"
EmployeeCode[28] = "S019"
Surname[28] = "Toh"
EmployeeCode[29] = "N011"
Surname[29] = "Morris"
EmployeeCode[30] = "N013"
Surname[30] = "Williams"
EmployeeCode[31] = "N016"
Surname[31] = "Chua"
EmployeeCode[32] = "N023"
Surname[32] = "Wong"
#sort based on value of main array
def bubble_sort(main, second):
    sort = True
    passed = len(main) - 1
    while sort:
        sort = False
        i = 2
        while i <= passed:
            #print(main[i], main[i-1], i)
            if main[i] < main[i-1]:
                main[i], main[i-1] = main[i-1], main[i]
                second[i], second[i-1] = second[i-1], second[i]
                sort = True
            i += 1
        passed -= 1
    return main, second
#main
#prepare sorted arrays for binary search
#for search by surname, sort according to surname
sName, sCode = bubble_sort(Surname, EmployeeCode)
print("**BEFORE******")
for k in range(0, 33):
    print(sName[k], sCode[k])
print("*BEFORE*******")
#for search by ECode, sort according to ECode
cCode, cName = bubble_sort(EmployeeCode, Surname)
print("**AFTER******")
for k in range(0, 33):
    print(sName[k], sCode[k])
print("**AFTER******")
However, after the second sort, the results of the first sort in sName and sCode changed by themselves; I never touched them manually.
BEFORE(1st sorting)
**BEFORE******
0 0
Adams L009
Ang S004
Chia S013
Chong S008
Chua N016
Davies L013
Goh S002
Ho S007
Kelly L021
Lim S016
Lim S018
Low S009
Morris N011
Ng S018
Ong S001
Ong S003
Patel L014
Pollard L001
Sim S010
Singh L007
Tan S014
Tan S017
Tay S011
Tay S012
Teo S006
Toh S019
Williams N013
Wills L002
Wong S005
Wong N023
Yallop L008
Yeo S015
*BEFORE*******
AFTER(2nd sorting, see last 4 items)
**AFTER******
0 0
Pollard L001
Wills L002
Singh L007
Yallop L008
Adams L009
Davies L013
Patel L014
Kelly L021
Morris N011
Williams N013
Chua N016
Wong N023
Ong S001
Goh S002
Ong S003
Ang S004
Wong S005
Teo S006
Ho S007
Chong S008
Low S009
Sim S010
Tay S011
Tay S012
Chia S013
Tan S014
Yeo S015
Lim S016
Tan S017
Lim S018
Ng S018
Toh S019
Can anyone tell me how this could have happened?
Assignments and argument passing in Python won't ever create copies of objects. When you bubblesort your lists, the lists you pass into the bubble sort are the same exact list objects in memory as the lists the bubble sort passes back out. Surname, sName, and cName are the exact same objects, and when you do the second bubble sort, you modify sName and sCode instead of creating independent, sorted lists.
If you want to copy a list, you have to do so explicitly. This is a shallow copy:
new_list = original[:]
new_list will be a new list containing the same objects original contained.
This is a deep copy:
import copy
new_list = copy.deepcopy(original)
new_list will be a new list containing deep copies of the objects original contained. (This is sometimes too deep; for example, if you have a list of lists, you sometimes don't want to copy the objects inside the inner lists.)
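The difference between aliasing, a shallow copy, and a deep copy is easy to see in a few lines:

```python
import copy

a = [[1, 2], [3, 4]]
alias = a                # same list object, just another name for it
shallow = a[:]           # new outer list, but the SAME inner lists
deep = copy.deepcopy(a)  # new outer list and new inner lists

a.sort(reverse=True)     # reorders a (and therefore alias), not the copies
a[0].append(5)           # mutates an inner list still shared with shallow
```

After these two statements, `alias` shows both changes, `shallow` keeps its own order but sees the inner-list mutation, and `deep` is untouched. That is exactly what happened to sName/sCode in the question: they were aliases, not copies.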
Finally, I'd like to note that your initialization code is painfully verbose. You can use a list literal instead of creating a list full of zeros and assigning each element separately:
Surname = [
    'Pollard',
    'Wills',
    ...
]

python csv into dictionary

I am pretty new to Python. I need to create a class that loads CSV data into a dictionary.
I want to be able to control the keys and values.
So, say with the following code, I can pull out worker1.name or worker1.age any time I want.
class ageName(object):
    '''class to represent a person'''
    def __init__(self, name, age):
        self.name = name
        self.age = age

worker1 = ageName('jon', 40)
worker2 = ageName('lise', 22)

# Now if we print this you see that it's stored in a dictionary
print worker1.__dict__
print worker2.__dict__
'''
{'age': 40, 'name': 'jon'}
{'age': 22, 'name': 'lise'}
'''

# when we call (key) worker1.name we are getting the (value)
print worker1.name
'''
jon
'''
But I am stuck at loading my CSV data into keys and values.
[1] I want to create my own keys:
worker1 = ageName([name],[age],[id],[gender])
[2] each [name], [age], [id] and [gender] comes from a specific column in a CSV data file.
I really do not know how to work on this. I tried many methods but failed. I need some help getting started.
---- Edit
This is my original code
import csv

# let us first make student an object
class Student():
    def __init__(self):
        self.fname = []
        self.lname = []
        self.ID = []
        self.sport = []
        # let us read this file
        for row in list(csv.reader(open("copy-john.csv", "rb")))[1:]:
            self.fname.append(row[0])
            self.lname.append(row[1])
            self.ID.append(row[2])
            self.sport.append(row[3])

    def Tableformat(self):
        print "%-14s|%-10s|%-5s|%-11s" % ('First Name', 'Last Name', 'ID', 'Favorite Sport')
        print "-" * 45
        for (i, fname) in enumerate(self.fname):
            print "%-14s|%-10s|%-5s|%3s" % (fname, self.lname[i], self.ID[i], self.sport[i])

    def Table(self):
        print self.lname

class Database(Student):
    def __init__(self):
        g = 0
        choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
        data = student.sport
        k = len(student.fname)
        print k
        freq = {}
        for i in data:
            freq[i] = freq.get(i, 0) + 1
        for i in choice:
            if i not in freq:
                freq[i] = 0
            print i, freq[i]

student = Student()
database = Database()
This is my current code (incomplete)
import csv

class Student(object):
    '''class to represent a person'''
    def __init__(self, lname, fname, ID, sport):
        self.lname = lname
        self.fname = fname
        self.ID = ID
        self.sport = sport

reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
student = [Student(row[0], row[1], row[2], row[3]) for row in reader][1::]

print "%-14s|%-10s|%-5s|%-11s" % ('First Name', 'Last Name', 'ID', 'Favorite Sport')
print "-" * 45
for i in range(len(student)):
    print "%-14s|%-10s|%-5s|%3s" % (student[i].lname, student[i].fname, student[i].ID, student[i].sport)

choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
lst = []
h = 0
k = len(student)  # 23
for i in range(len(student)):
    lst.append(student[i].sport)  # merge together
for a in set(lst):
    print a, lst.count(a)
for i in set(choice):
    if i not in set(lst):
        lst.append(i)
        lst.count(i) = 0
        print lst.count(i)
import csv

reader = csv.reader(open('workers.csv', newline=''), delimiter=',', quotechar='"')
workers = [ageName(row[0], row[1]) for row in reader]
workers now holds a list of all the workers:
>>> workers[0].name
'jon'
added edit after question was altered
Is there any reason you're using old style classes? I'm using new style here.
class Student:
    sports = []

    def __init__(self, row):
        self.lname, self.fname, self.ID, self.sport = row
        self.sports.append(self.sport)

    def get(self):
        return (self.lname, self.fname, self.ID, self.sport)

reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
print "%-14s|%-10s|%-5s|%-11s" % tuple(reader.next())  # read header line from csv
print "-" * 45

students = list(map(Student, reader))  # read all remaining lines
for student in students:
    print "%-14s|%-10s|%-5s|%3s" % student.get()

# Printing all sports that are specified by students
for s in set(Student.sports):  # class attribute
    print s, Student.sports.count(s)

# Printing sports that are not picked
allsports = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
for s in set(allsports) - set(Student.sports):
    print s, 0
Hope this gives you some ideas of the power of python sequences. ;)
edit 2, shortened as much as possible... just to show off :P
Ladies and gentlemen, 7(.5) lines.
allsports = ['Basketball','Football','Other','Baseball','Handball',
             'Soccer','Volleyball','I do not like sport']
sports = []
reader = csv.reader(open('copy-john.csv'))
for s in reader:
    if reader.line_num == 1:
        print "%-14s|%-10s|%-5s|%-11s" % tuple(s)  # header row
    else:
        sports.append(s[3])
for s in allsports: print s, sports.count(s)
I know this is a pretty old question, but it's impossible to read this and not think of the amazing new(ish) Python library, pandas. Its main unit of analysis is a thing called a DataFrame, which is modelled after the way R handles data.
Let's say you have a (very silly) csv file called example.csv which looks like this:
day,fruit,sales
Monday,Banana,10
Monday,Orange,20
Tuesday,Banana,12
Tuesday,Orange,22
If you want to read in a csv in double-quick time, and do 'stuff' with it, you'd be hard pressed to beat the following code for either brevity or ease of use:
>>> import pandas as pd
>>> csv = pd.read_csv('example.csv')
>>> csv
day fruit sales
0 Monday Banana 10
1 Monday Orange 20
2 Tuesday Banana 12
3 Tuesday Orange 22
>>> csv[csv.fruit=='Banana']
day fruit sales
0 Monday Banana 10
2 Tuesday Banana 12
>>> csv[(csv.fruit=='Banana') & (csv.day=='Monday')]
day fruit sales
0 Monday Banana 10
In my opinion, this is really fantastic stuff. Never iterate over a csv.reader object again!
I second Mark's suggestion. In particular, look at DictReader from csv module that allows reading a comma separated (or delimited in general) file as a dictionary.
Look at PyMotW's coverage of csv module for a quick reference and examples of usage of DictReader, DictWriter
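In modern Python 3, `DictReader` looks like this; the in-memory CSV here is invented so the snippet is self-contained, but in the real script you would pass an open file handle for `copy-john.csv`:

```python
import csv
import io

# An in-memory CSV standing in for copy-john.csv (contents invented).
data = io.StringIO(
    "fname,lname,ID,sport\n"
    "John,Smith,1,Basketball\n"
    "Ana,Lopez,2,Soccer\n"
)

# DictReader keys each row by the header line, so columns are
# accessed by name rather than by numeric index.
students = list(csv.DictReader(data))
```

Each element of `students` is a dict like `{'fname': 'John', ...}`, which maps naturally onto the keyword-controlled keys the question asks for.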
Have you looked at the csv module?
import csv
