I want to write a single file per function, but I got stuck inside the nested loop (the "loop in the loop").
Everything works except the storage/save part:
t2() currently writes one file for each iteration of the outer loop.
I would also like the next pool function, t'x'(), to work directly with the current dict or list, so that I don't have to reopen the CSV along the way.
What's the lesson here? :p
This is my first data scrape, and I'm new to Python!
import time  # the scraping imports were omitted in the original post

def t0(url):  # url
    soup ('http://www.foo.net')
    return soup

def t1():  # 1st_pool
    soup = t0()
    dic = {}
    with open('dic.csv', 'w') as f:
        for x in range(15):
            try:
                collect
                dic[name] = link
                f.write('{0};{1}\n'.format(name, link))
            except:
                pass
    return dic

def t2():  # 2nd_pool
    dic = t1()
    dic2 = {}
    for k,v in dic.items():
        time.sleep(3)
        with open(k+'_dic.csv', 'w') as f:
            for x in range(13):
                try:
                    collect
                    dic2[name] = link
                    f.write('{0};{1}\n'.format(name, link))
                except:
                    pass
    return ###############

def t3(): ...  # 3rd_pool
def t4(): ...  # 4th_pool
def t5(): ...  # 5th_pool
def t6(): ...  # full_path /to /details
As I mentioned earlier, the only "problem" was that I was creating an individual *.csv for each iteration of the loop (so as not to overwrite the previous one). I have now figured out how to create a single CSV file per function:
def t2():  # 2nd_pool
    dic = t1()
    dic2 = {}
    for k,v in dic.items():
        time.sleep(3)
        ##
        for x in range(13):
            try:
                collect
                dic2[name] = link
                ##
            except:
                pass
    ##
    with open('dic2.csv', 'w') as f:
        for n,j in dic2.items():
            f.write('{0};{1}\n'.format(n, j))
    ##
    return dic2
I simply moved the "*.csv operation" (the ## marks show the changes) to the end of the function, outside the double loop. The dictionary is also available to the next function, t3(), and so on.
I was trying to achieve this without writing the extra loop, so if someone can provide a better alternative, I would like to learn!
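One possible way to avoid the extra loop is to open the file once, before the outer loop, and let csv.writer write each page's rows in a single call. The following is only a minimal sketch: collect_links, fetch and fake_fetch are hypothetical names standing in for the "collect" step, which is not shown in the post, and an in-memory buffer replaces the real file.

```python
import csv
import io


def collect_links(pages, fetch, out):
    """Scrape every page in `pages` and stream the rows to `out` as they arrive.

    `fetch` is a hypothetical stand-in for the scraping step ("collect"):
    it takes a link and returns an iterable of (name, link) pairs.
    """
    found = {}
    writer = csv.writer(out, delimiter=';')
    for page_link in pages.values():
        try:
            pairs = list(fetch(page_link))
        except Exception:
            # Skip pages that fail, much as the bare `except: pass` did.
            continue
        found.update(pairs)
        writer.writerows(pairs)  # one call per page, no second loop
    return found


def fake_fetch(link):
    # Hypothetical scraper, used only to make the sketch runnable.
    return [(link + '/a', link + '/a.html'), (link + '/b', link + '/b.html')]


buffer = io.StringIO()  # stands in for open('dic2.csv', 'w', newline='')
dic2 = collect_links({'first': 'foo', 'second': 'bar'}, fake_fetch, buffer)
print(buffer.getvalue())
```

The returned dictionary can still be passed straight to the next function, so the CSV never has to be reopened.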
I have a dataframe df_full that I am trying to rewrite as a dict() while also doing some stuff over it.
   agent    locations                        modal_choices
0  agent_1  'loc1', 'loc2', 'loc3', 'loc2'   'mode_1', 'mode_1', 'mode_2', 'mode_3'
1  agent_2  'loc1', 'loc4', 'loc2', 'loc6'   'mode_2', 'mode_3', 'mode_2', 'mode_3'
I am currently facing a problem while trying to multiprocess the following function format_dict() knowing that I only want to iterate over the agent argument, the three others are supposed to be the same for each iterations. So I added the partial() parameter to "freeze" df, dict_ and list_ but the code returns me an empty dict and an empty list by the end and I don't understand why.
I suppose I haven't written the executor.map() properly. I tried following the methods shown here but it still doesn't return anything.
What could be wrong with my code?
I also printed the time taken by the following script to run with time.perf_counter() and compared it with what is given with tqdm() but the two values don't match. The iteration part is done in 7 seconds (tqdm) while the print of time.perf_counter() shows up after 2.3 minutes.
What would explain the delay for the ending of the with concurrent.futures.ProcessPoolExecutor() as executor:?
I am, unfortunately, still not an expert in Python, and this is the first time I'm trying to multiprocess something (the agent list I am working with is massive and would take days to process...). Any help would be greatly appreciated! And please do tell me if information is missing or if something is not explained properly; I'll edit the post right away.
def format_dict(agent, df, dict_, list_):
    try:
        dict_[agent] = dict()
        toto_ = df.loc[df.agent_ID == agent]
        toto_mod = toto_['modal_choices'].apply(lambda x: pd.Series(x.split(',')))
        toto_loc = toto_['locations'].apply(lambda x: pd.Series(x.split(',')))
        for i in toto_mod:
            dict_[agent]['step_{}'.format(i)] = dict()
            dict_[agent]['step_{}'.format(i)]['mode'] = toto_mod[i].iloc[0]
            dict_[agent]['step_{}'.format(i)]['start'] = toto_loc[counter + 1].iloc[0]
            dict_[agent]['step_{}'.format(i)]['name'] = dict_agent_edt[agent]['step_0']['name']
    except ValueError:
        list_.append(agent)
    return dict_, list_

dict_name = dict()
list_name = list()
start = time.perf_counter()
agent = df_full['agent'][:1000]
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(partial(format_dict, df=df_full, dict_=dict_name, list_=list_name),
                 tqdm(agent), chunksize=50)
end = time.perf_counter()
print(f'It took {(end-start)/60} minutes.')
Following @Louis Lac's answer, I modified my script to avoid any concurrency issues, but it still returns an empty dict.
def format_dict(agent, df, dict_):
    try:
        dict_[agent] = dict()
        toto_ = df.loc[df.agent_ID == agent]
        (same stuff here)
    except ValueError:
        pass
    return dict_

start = time.perf_counter()
agents = df_full['agent'][:1000]
dict_name = {}
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(partial(format_dict, df=df_full, dict_=dict_name),
                 tqdm(agents), chunksize=50)
end = time.perf_counter()
print(f'It took {(end-start)/60} minutes.')
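When using concurrency such as multithreading or multiprocessing, functions that are executed concurrently, such as format_dict, should not mutate shared state; otherwise you get data races unless the mutations are synchronized. With multiprocessing in particular, each worker process receives its own pickled copy of the arguments, so anything a worker writes into dict_ or list_ never reaches the objects in the parent process, which is why they stay empty.
You could, for instance, compute everything concurrently first, then sequentially reduce the results into the outputs (dict_ and list_):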
def format_dict(agent, df):
    list_ = None
    try:
        dict_ = dict()
        toto_ = df.loc[df.agent_ID == agent]
        toto_mod = toto_['modal_choices'].apply(lambda x: pd.Series(x.split(',')))
        toto_loc = toto_['locations'].apply(lambda x: pd.Series(x.split(',')))
        for i in toto_mod:
            dict_['step_{}'.format(i)] = dict()
            dict_['step_{}'.format(i)]['mode'] = toto_mod[i].iloc[0]
            dict_['step_{}'.format(i)]['start'] = toto_loc[counter + 1].iloc[0]
            dict_['step_{}'.format(i)]['name'] = dict_agent_edt[agent]['step_0']['name']
    except ValueError:
        list_ = agent
    return dict_, list_

start = time.perf_counter()
agents = df_full['agent'][:1000]
with concurrent.futures.ProcessPoolExecutor() as executor:
    elements = executor.map(partial(format_dict, df=df_full),
                            tqdm(agents), chunksize=50)

dict_ = {}
list_ = []
for agent, (d, l) in zip(agents, elements):
    if l is not None:
        list_.append(l)
    dict_[agent] = d
end = time.perf_counter()
print(f'It took {(end-start)/60} minutes.')
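The same "compute concurrently, then reduce sequentially" idea can be tried without pandas. The sketch below is an illustration under assumptions, not the original script: build_steps, reduce_results and the records tuples are hypothetical stand-ins for format_dict and the rows of df_full. The worker only returns values, and the parent process alone fills the final dict and list.

```python
from concurrent.futures import ProcessPoolExecutor


def build_steps(record):
    """Pure worker: builds and returns a result, mutates nothing shared.

    `record` is a hypothetical (agent, locations, modes) tuple standing in
    for one row of df_full.
    """
    agent, locations, modes = record
    locs = [s.strip() for s in locations.split(',')]
    mods = [s.strip() for s in modes.split(',')]
    if len(locs) != len(mods):
        return agent, None  # signal a failure instead of appending to a list
    steps = {
        'step_{}'.format(i): {'mode': m, 'start': l}
        for i, (m, l) in enumerate(zip(mods, locs))
    }
    return agent, steps


def reduce_results(results):
    """Sequential reduce step, run in the parent process only."""
    dict_, failed = {}, []
    for agent, steps in results:
        if steps is None:
            failed.append(agent)
        else:
            dict_[agent] = steps
    return dict_, failed


records = [
    ('agent_1', 'loc1, loc2', 'mode_1, mode_2'),
    ('agent_2', 'loc1', 'mode_2, mode_3'),  # malformed on purpose
]

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        dict_, failed = reduce_results(executor.map(build_steps, records))
    print(dict_, failed)
```

Because reduce_results accepts any iterable of results, the same code can be checked with the built-in map before a process pool is involved at all.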
I have a dictionary that is in the main function and I want to use it in another function to present the data in a tabular format.
I had created the dictionary in a function as follows:
def file_reader():
    config_dict = {}
    newDict = {}
    configParser = configparser.ConfigParser()
    configParser.read('config.ini')
    for section in configParser.sections():
        for k,v in configParser.items(section):
            config_dict[k] = v
    config_dict = dict(configParser.items('SectionTwo'))
    rev_dict = dict(map(reversed, configParser.items('SectionOne')))
    for v in rev_dict:
        newDict[k] = rev_dict[v]
    list_vals = list(config_dict.values())
    list_keys = list(config_dict.keys())
    return rev_dict, newDict
I then used the dictionary created in the above function in main function as follows:
def main():
    rev_dict, newDict = file_reader()
    parser = ap.ArgumentParser()
    parser.add_argument('-s', '--start', help='start script', action='store_true')
    args = parser.parse_args()
    if args.start:
        for k,v in rev_dict.items():
            print("\nTestcase:" + v + "\n")
            print(v, "=", k)
            print("\n")
            time.sleep(5)
            proc = sp.call([k], shell=True)
            time.sleep(5)
            print('Process ID is:', os.getpid())
            if proc != 0:
                if proc < 0:
                    print("\nKilled by signal!\n", -proc)
                else:
                    print("\nFailed with return code: ", proc)
                newDict[v] = 'Fail'
                print(json.dumps(dic, indent=4, sort_keys=True))
            else:
                print("\nOK\n")
                newDict[v] = 'Pass'
                print(json.dumps(dic, indent=4, sort_keys=True))
        sipResponse(args.ip)
I had then created a function called read_file() where I want to generate a report and use the updated dictionary named newDict from main function.
def read_file():
    rev_dict = file_reader()
    shutil.copy("logfile.log", "file.txt")
    f = open("file.txt", "r+")
    headers = ['Testcase', 'Path']
    data = sorted([(k,v) for k,v in rev_dict.items()])
    f.write(tabulate(data, headers=headers, tablefmt="grid"))
    f.close()
    sys.exit(0)
Can someone please guide?
You need to introduce an argument to the read_file function:
def read_file(new_dict):
Then in main when you want to call the new function, call read_file(newDict). Notice how the argument name in the function is new_dict, thus showing that the two names are actually different.
Thus main might look something like this, with other code removed for simplicity: just add a call to the other method.
def main():
    rev_dict, newDict = file_reader()
    read_file(newDict)
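To illustrate the hand-over end to end, here is a small self-contained sketch. It is only an illustration: format_report is a hypothetical replacement for read_file that uses plain string formatting instead of tabulate, and the results in main are made up.

```python
def format_report(results):
    """Return the results dict as a plain-text table, sorted by test case.

    `results` maps a test-case name to 'Pass' or 'Fail', like newDict.
    """
    headers = ('Testcase', 'Result')
    rows = sorted(results.items())
    # Pad the first column to the longest name (or the header, if longer).
    width = max(len(name) for name, _ in rows + [headers])
    lines = [
        '{:<{w}}  {}'.format(name, status, w=width)
        for name, status in [headers] + rows
    ]
    return '\n'.join(lines)


def main():
    new_dict = {'ping': 'Pass', 'login': 'Fail'}  # made-up results
    return format_report(new_dict)  # the dict is handed over as an argument


report = main()
print(report)
```

The function never looks up a global: whatever dictionary main has built by that point is exactly what the report sees.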
Below is my code. I want to use the finalList in some other function, so I am trying to return finalList, but I am getting an error 'return' outside function.
If I use print finalList, it is printing the result fine.
Any idea what to do?
import csv
from featureVector import getStopWordList
from preprocess import processTweet
from featureVector import getFeatureVector

inpTweets = csv.reader(open('sampleTweets.csv', 'rb'), delimiter=',', quotechar='|')
stopWords = getStopWordList('stopwords.txt')
featureList = []
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet)
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment));
finalList = list(set(featureList))
I want to use the finalList in some other function
You can already do this.
# other code...
finalList = list(set(featureList))  # this is global

def foo():
    print(finalList)

foo()
'return' outside function
If you want to return finalList, then make a function for it.
def getFinalList():
    # other code...
    return list(set(featureList))

Option 1

def foo():
    final_list = getFinalList()
    print(final_list)

foo()

Option 2

def foo(final_list):
    print(final_list)

foo(getFinalList())
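For a version that can be run as-is, the sketch below moves the whole loop into a function so that return becomes legal. It rests on assumptions: splitting on whitespace stands in for processTweet and getFeatureVector, which are not shown in the question, and the input is an in-memory sample rather than sampleTweets.csv.

```python
import csv
import io


def get_final_list(rows):
    """Collect the unique features from (sentiment, tweet) rows.

    Splitting on whitespace is a stand-in for the original
    processTweet/getFeatureVector helpers, which are not shown here.
    """
    feature_list = []
    for sentiment, tweet in rows:
        feature_list.extend(tweet.lower().split())
    # `return` is legal here because we are inside a function.
    return sorted(set(feature_list))


sample = io.StringIO("positive,good good day\nnegative,bad day\n")
final_list = get_final_list(csv.reader(sample))
print(final_list)
```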
So I wanted to make an arff reader (similar to csv file format).
And I wanted to use yield to make an iterator but also to add attributes to this iterator.
eg:
data = arff.reader(my_fname)
print data.relation
for row in data:
    print row
but in the reader definition:
def reader(fname):
    reader.relation = fname  # this is assigned to the function, not the generator
    yield 1
    yield 2
Is there a way to do this using yield or am I stuck with the iterator api?
You can make it a class.
class Reader(object):  # Assuming Python <= 2.7
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        yield 1
        yield 2

r = Reader("some file")
print r.fname  ## 'some file'
for line in r:
    print line  ## 1 then 2
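The same pattern in Python 3, applied to something closer to ARFF, might look like the sketch below. It is deliberately simplified and makes assumptions: the Reader here takes a list of lines instead of a file name, and it only understands the @relation and @data markers, nothing else from the format.

```python
class Reader:
    """Iterable with attributes; __iter__ is itself a generator function."""

    def __init__(self, lines):
        # `lines` is any iterable of strings (an open file would also work).
        self.lines = list(lines)
        self.relation = None
        for line in self.lines:
            if line.lower().startswith('@relation'):
                self.relation = line.split(None, 1)[1].strip()
                break

    def __iter__(self):
        in_data = False
        for line in self.lines:
            line = line.strip()
            if in_data and line:
                yield line.split(',')
            elif line.lower() == '@data':
                in_data = True


data = Reader([
    '@relation weather',
    '@attribute temp numeric',
    '@data',
    '20,sunny',
    '15,rainy',
])
print(data.relation)
for row in data:
    print(row)
```

Since __iter__ creates a fresh generator on every call, the object can be iterated more than once, which a bare generator cannot do.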
Apologies for the really long, drawn-out question.
I am trying to read in a config file and get a list of rules out.
I have tried to use ConfigParser to do this, but it is not a standard config file.
The file contains no section headers and no tokens.
For example:
config section a
    set something to something else
    config subsection a
        set this to that
    next
end
config firewall policy
    edit 76
        set srcintf "There"
        set dstintf "Here"
        set srcaddr "all"
        set dstaddr "all"
        set action accept
        set schedule "always"
        set service "TCP_5600"
    next
    edit 77
        set srcintf "here"
        set dstintf "there"
        set srcaddr "all"
        set dstaddr "all"
        set action accept
        set schedule "always"
        set service "PING"
    next
end
As I couldn't work out how to get ConfigParser to work I thought I would try to iterate through the file, unfortunately I don't have much programming skill so I have got stuck.
I really think I am making this more complicated than it should be.
Here's the code I have written;
class Parser(object):

    def __init__(self):
        self.config_section = ""
        self.config_header = ""
        self.section_list = []
        self.header_list = []

    def parse_config(self, fields):  # Create a new section
        new_list = []
        self.config_section = " ".join(fields)
        new_list.append(self.config_section)
        if self.section_list:  # Create a sub section
            self.section_list[-1].append(new_list)
        else: self.section_list.append(new_list)

    def parse_edit(self, line):  # Create a new header
        self.config_header = line[0]
        self.header_list.append(self.config_header)
        self.section_list[-1].append(self.header_list)

    def parse_set(self, line):  # Key and values
        key_value = {}
        key = line[0]
        values = line[1:]
        key_value[key] = values
        if self.header_list:
            self.header_list.append(key_value)
        else: self.section_list[-1].append(key_value)

    def parse_next(self, line):  # Close the header
        self.config_header = []

    def parse_end(self, line):  # Close the section
        self.config_section = []

    def parse_file(self, path):
        with open(path) as f:
            for line in f:
                # Clean up the fields and remove unused lines.
                fields = line.replace('"', '').strip().split(" ")
                if fields[0] == "set":
                    pass
                elif fields[0] == "end":
                    pass
                elif fields[0] == "edit":
                    pass
                elif fields[0] == "config":
                    pass
                elif fields[0] == "next":
                    pass
                else: continue
                # fetch and call method.
                method = fields[0]
                parse_method = "parse_" + method
                getattr(Parser, parse_method)(self, fields[1:])
            return self.section_list

config = Parser().parse_file('test_config.txt')
print config
The output I am looking for is something like the following;
[['section a', {'something': 'to something else'}, ['subsection a', {'this': 'to that'}]],['firewall policy',['76',{'srcintf':'There'}, {'dstintf':'Here'}{etc.}{etc.}]]]
and this is what I get
[['section a']]
EDIT
I have changed the above to reflect where I am currently at.
I am still having issues getting the output I expect. I just can't seem to get the list right.
class Parser(object):

    def __init__(self):
        self.my_section = 0
        self.flag_section = False
        # ...

    def parse_config(self, fields):
        self.my_section += 1
        # go on with fields
        # ...
        self.flag_section = True

    def parse_edit(self, line):
        ...

    def parse_set(self, line):
        ...

    def parse_end(self, line):
        ...

    def parse_file(self, path):
        with open(path) as f:
            for line in f:
                fields = line.strip().split(" ")  # strip the line, not the file object
                method = fields[0]
                # fetch and call method
                getattr(Parser, "parse_" + method)(self, fields[1:])
I am posting my answer for people who land here from Google while trying to parse a Fortigate configuration file!
I rewrote what I found here to fit my own needs, and it works great.
from collections import defaultdict
from pprint import pprint
import sys

f = lambda: defaultdict(f)

def getFromDict(dataDict, mapList):
    return reduce(lambda d, k: d[k], mapList, dataDict)

def setInDict(dataDict, mapList, value):
    getFromDict(dataDict, mapList[:-1])[mapList[-1]] = value

class Parser(object):

    def __init__(self):
        self.config_header = []
        self.section_dict = defaultdict(f)

    def parse_config(self, fields):  # Create a new section
        self.config_header.append(" ".join(fields))

    def parse_edit(self, line):  # Create a new header
        self.config_header.append(line[0])

    def parse_set(self, line):  # Key and values
        key = line[0]
        values = " ".join(line[1:])
        headers = self.config_header + [key]
        setInDict(self.section_dict, headers, values)

    def parse_next(self, line):  # Close the header
        self.config_header.pop()

    def parse_end(self, line):  # Close the section
        self.config_header.pop()

    def parse_file(self, path):
        with open(path) as f:
            gen_lines = (line.rstrip() for line in f if line.strip())
            for line in gen_lines:
                # pprint(dict(self.section_dict))
                # Clean up the fields and remove unused lines.
                fields = line.replace('"', '').strip().split(" ")
                valid_fields = ["set", "end", "edit", "config", "next"]
                if fields[0] in valid_fields:
                    method = fields[0]
                    # fetch and call method
                    getattr(Parser, "parse_" + method)(self, fields[1:])
        return self.section_dict

config = Parser().parse_file('FGT02_20130308.conf')
print config["system admin"]["admin"]["dashboard-tabs"]["1"]["name"]
print config["firewall address"]["ftp.fr.debian.org"]["type"]
I do not know if this can help you too, but it did for me : http://wiki.python.org/moin/ConfigParserExamples
Have fun !
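The stack idea behind this answer can also be written as a single function in Python 3. What follows is a rough sketch under assumptions, not a full Fortigate parser: parse_config is a hypothetical name, the input is a list of lines rather than a file, and, as in the class above, only the first word after edit is used as the key.

```python
def parse_config(lines):
    """Parse 'config/edit/set/next/end' lines into nested dicts.

    A sketch of the stack idea: `path` holds the currently open
    sections, and every `set` is stored under that path.
    """
    root = {}
    path = []
    for raw in lines:
        fields = raw.replace('"', '').split()
        if not fields:
            continue  # blank line
        keyword, args = fields[0], fields[1:]
        if keyword == 'config':
            path.append(' '.join(args))
        elif keyword == 'edit':
            path.append(args[0])
        elif keyword in ('next', 'end'):
            if path:
                path.pop()  # close the innermost open block
        elif keyword == 'set':
            node = root
            for key in path:
                node = node.setdefault(key, {})
            node[args[0]] = ' '.join(args[1:])
    return root


sample = '''
config firewall policy
    edit 76
        set srcintf "There"
        set action accept
    next
    edit 77
        set service "PING"
    next
end
'''.splitlines()

config = parse_config(sample)
print(config['firewall policy']['76']['srcintf'])
```

Using setdefault on plain dicts gives the same auto-nesting as the recursive defaultdict, while a lookup of a missing key still raises KeyError instead of silently creating an empty entry.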
I would do it in a simpler way:
flagSection = False
flagSub = False
mySection = 0
mySubsection = 0
myItem = 0
with open('d:/config.txt', 'r') as f:
    gen_lines = (line.rstrip() for line in f if line.strip())
    for line in gen_lines:
        if line[0:7] == 'config ':
            mySection = mySection + 1
            newLine = line[7:]
            # Create a new section
            # Mark section as open
            flagSection = True
        elif line[0:5] == 'edit ':
            mySubsection = mySubsection + 1
            newLine = line[5:]
            # Create a new sub-section
            # Mark subsection as open
            flagSub = True
        elif line[0:4] == 'set ':
            myItem = myItem + 1
            name, value = line.split(' ', 2)[1:]
            # Add to whatever is open
        elif line == 'end':
            # If subsection = open then close and goto end
            if flagSub:
                flagSub = False
            # Or if section = open then close and goto end
            elif flagSection:
                flagSection = False
            # :End
            continue
The instruction gen_lines = (line.rstrip() for line in f if line.strip()) creates a generator of the non-empty lines (thanks to the test if line.strip()), with the newline and any trailing whitespace removed (thanks to line.rstrip()).
If I knew more about the operations you want to perform with name, value and in the section opened with if line=='end', I could propose code using regexes.
Edit
from time import clock

n = 1000000

print 'Measuring times with clock()'
te = clock()
for i in xrange(n):
    x = ('abcdfafdf'[:3] == 'end')
print clock()-te,
print "\tx = ('abcdfafdf'[:3] == 'end')"

te = clock()
for i in xrange(n):
    x = 'abcdfafdf'.startswith('end')
print clock()-te,
print "\tx = 'abcdfafdf'.startswith('end')"

print '\nMeasuring times with timeit module'
import timeit

ti = timeit.repeat("x = ('abcdfafdf'[:3] == 'end')",repeat=10,number = n)
print min(ti),
print "\tx = ('abcdfafdf'[:3] == 'end')"

to = timeit.repeat("x = 'abcdfafdf'.startswith('end')",repeat=10,number = n)
print min(to),
print "\tx = 'abcdfafdf'.startswith('end')"
result:
Measuring times with clock()
0.543445605517 x = ('abcdfafdf'[:3] == 'end')
1.08590449345 x = 'abcdfafdf'.startswith('end')
Measuring times with timeit module
0.294152748464 x = ('abcdfafdf'[:3] == 'end')
0.901923289133 x = 'abcdfafdf'.startswith('end')
Are the times smaller with timeit than with clock() because the GC is turned off while timeit runs? Anyway, with either clock() or the timeit module, executing startswith() takes more time than slicing.
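The script above is Python 2 (print statements, xrange), and time.clock() was removed in Python 3.8. A rough Python 3 equivalent of the timeit half is sketched below, with a smaller iteration count chosen here only to keep it quick. The names candidates and timings are illustrative, and the absolute numbers will differ from the ones quoted above depending on the interpreter version and the machine.

```python
import timeit

n = 100_000  # fewer iterations than the original run, to keep the sketch quick

candidates = {
    'slice': "x = ('abcdfafdf'[:3] == 'end')",
    'startswith': "x = 'abcdfafdf'.startswith('end')",
}

# timeit turns garbage collection off while timing by default;
# taking min() over several repeats is the usual way to reduce noise.
timings = {
    name: min(timeit.repeat(stmt, repeat=5, number=n))
    for name, stmt in candidates.items()
}

for name, seconds in timings.items():
    print('{:<12}{:.6f} s'.format(name, seconds))
```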