Manipulating two values of a key in a dictionary at the same time - Python

I am reading a file in the format below:
0.012281001 00:1c:c4:c2:1f:fe 1 30
0.012285001 00:1c:c4:c2:1f:fe 3 40
0.012288001 00:1c:c4:c2:1f:fe 2 50
0.012292001 00:1c:c4:c2:1f:fe 4 60
0.012295001 24:1c:c4:c2:2f:ce 5 70
I intend to make the column 2 entries keys, and columns 3 and 4 two separate values. For each line I encounter, the respective values must be added up for that key (value 1 and value 2 aggregated separately). For the example above, I need output like this:
'00:1c:c4:c2:1f:fe': 10 : 180, '24:1c:c4:c2:2f:ce': 5 : 70
The program I have written for the simple one-key, one-value case is below:
#!/usr/bin/python
import collections

result = collections.defaultdict(int)
clienthash = dict()

with open("luawrite", "r") as f:
    for line in f:
        hashes = line.split()
        ckey = hashes[1]
        val1 = float(hashes[2])
        result[ckey] += val1
print result
How can I extend this to two values, and how can I print them in the format shown above? I am not getting any ideas. Please help! BTW, I am using Python 2.6.

You can store all of the values in a single dictionary, using a tuple as the stored value. Note that the defaultdict must default to a (0, 0) tuple rather than an int, so the tuple unpacking works the first time a key is seen:
import collections

result = collections.defaultdict(lambda: (0, 0))

with open("luawrite", "r") as f:
    for line in f:
        hashes = line.split()
        ckey = hashes[1]
        val1 = int(hashes[2])
        val2 = int(hashes[3])
        a, b = result[ckey]
        result[ckey] = (a + val1, b + val2)
print result
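To print in the key: value1 : value2 format shown in the question, a minimal sketch (Python 2, iterating the dictionary built above):
for ckey, (a, b) in result.iteritems():
    print "'%s': %s : %s" % (ckey, a, b)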

Related

Not able to print only the date from a date/time column

Hello experts, I have a program that reads a CSV file containing several columns. The main purpose of the program is to convert each string into a sequence number, with duplicated strings getting the same number. That part works, but I also want the Date/Time column to print only the date. I applied a slicing method that works in the console, but I am not able to write the result to my output CSV file. Please tell me what to do.
This is the program I have written:
import pandas as pd
import csv
import os
# from io import StringIO
# tempFile = "input1.csv"

with open("input1.csv", 'r', encoding="utf8") as csvfile:
    # creating a csv reader object
    reader = csv.DictReader(csvfile, delimiter=',')
    # next(reader, None)
    '''We then restructure the data to be a set of keys with list of values {key_1: [], key_2: []}:'''
    data = {}
    for row in reader:
        # print(row)
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

'''Next we want to give each value in each list a unique identifier.'''
# Loop through all keys
for key in data.keys():
    values = data[key]
    things = list(sorted(set(values), key=values.index))
    for i, x in enumerate(data[key]):
        if key == "Date/Time":
            var = data[key]
            iter_obj1 = iter(var)
            while True:
                try:
                    element1 = next(iter_obj1)
                    date = element1[0:10]
                    print("date-", date)
                except StopIteration:
                    break
            break
        else:
            # if key == "Date/Time":
            #     print(x[0:10])
            #     continue
            data[key][i] = things.index(x) + 1
            print('data[key]-', data[key])
    print('data.keys()-', data.keys())
    print('values-', values)
    print('key-', key)
    print('x-', x)
    print('i-', i)
    # print("FullName-", FullName)

"""Since csv.writerows() takes a list but treats it as a row, we need to restructure our
data so that each row is one value from each list. This can be accomplished using zip():"""
with open("ram3.csv", "w") as outfile:
    writer = csv.writer(outfile)
    # Write headers
    writer.writerow(data.keys())
    # Make one row equal to one value from each list
    rows = zip(*data.values())
    # Write rows
    writer.writerows(rows)
Note: I can't use a pandas DataFrame; that's why I have written the code like this. Please tell me where in the code I need to change to print only the date from the Date/Time column... thanks.
Input:
job_Id Name Address Email Date/Time
1 snehil singh marathalli ss#gmail.com 12/10/2011:02:03:20
2 salman marathalli ss#gmail.com 12/11/2011:03:10:20
3 Amir HSR ar#gmail.com 11/02/2009:09:03:20
4 Rakhesh HSR rakesh#gmail.com 09/12/2010:02:03:55
5 Ram marathalli r#gmail.com 01/10/2014:12:03:20
6 Shyam BTM ss#gmail.com 12/11/2012:01:03:20
7 salman HSR ss#gmail.com 11/08/2016:15:03:20
8 Amir BTM ar#gmail.com 07/10/2013:04:02:30
9 snehil singh Majestic sne#gmail.com 03/03/2018:02:03:20
Output CSV file:
job_Id Name Address Email Date/Time
1 1 1 1 12/10/2011:02:03:20
2 2 1 1 12/11/2011:03:10:20
3 3 2 2 11/02/2009:09:03:20
4 4 2 3 09/12/2010:02:03:55
5 5 1 4 01/10/2014:12:03:20
6 6 3 1 12/11/2012:01:03:20
7 2 2 1 11/08/2016:15:03:20
8 3 3 2 07/10/2013:04:02:30
9 1 4 5 03/03/2018:02:03:20
In this output everything is correct except the Date/Time column. I want to print the date only, not the time.
if key=="Date/Time":
var="12/10/2011"
print(var)
var = data[key]
iter_obj1 = iter(var)
while True:
try:
element1 = next(iter_obj1)
date =element1[0:10]
print("date-",date)
except StopIteration:
break
I got it. I should not use all of that; adding one line in the for loop prints the desired output:
if key=="Date/Time":
data[key][i] = data[key][i][0:10]
That's it, it's done; everything else stays the same.
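For clarity, a minimal sketch of the enumerate loop with that one-line fix in place (same variable names as the code above; the rest of the program is unchanged):
for key in data.keys():
    values = data[key]
    things = list(sorted(set(values), key=values.index))
    for i, x in enumerate(data[key]):
        if key == "Date/Time":
            # keep only the first 10 characters, e.g. "12/10/2011"
            data[key][i] = data[key][i][0:10]
        else:
            data[key][i] = things.index(x) + 1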

enumeration of elements for lists within lists

I have a collection of files (kind of like CSV, but no commas) with data arranged like the following:
RMS ResNum Scores Rank
30 1 44 5
12 1 99 2
2 1 60 1
1.5 1 63 3
12 2 91 4
2 2 77 3
I'm trying to write a script that does the counting for me and outputs an integer. I want it to count how many times we get an RMS value below 3 AND a score above 51; only if both criteria are met should it add 1 to the count.
HOWEVER, the tricky part is that for any given "ResNum" it cannot add 1 multiple times. In other words, I want to sub-group the data by ResNum then decide 1 or 0 on whether or not those two criteria are met within that group.
Right now my script would output 3, whereas I want it to display 2, since ResNum 1 is being counted twice (two of its rows meet the criteria).
import glob

file_list = glob.glob("*")
file_list = sorted(file_list)
for input_file in file_list:
    masterlist = []
    opened_file = open(input_file, 'r')
    count = 0
    for line in opened_file:
        data = line.split()
        templist = []
        templist.append(float(data[0]))  # RMS
        templist.append(int(data[1]))    # ResNum
        templist.append(float(data[2]))  # Scores
        templist.append(float(data[3]))  # Rank
        masterlist.append(templist)
Then here comes the part that needs modification (I think):
    for placement in masterlist:
        if placement[0] < 3 and placement[2] > 51.0:
            count += 1
    print input_file
    print count
    count = 0
Choose your data structures carefully to make your life easier.
import glob

file_list = sorted(glob.glob("*"))
grouper = {}
for input_file in file_list:
    with open(input_file) as f:
        grouper[input_file] = set()
        for line in f:
            rms, resnum, scores, rank = line.split()
            # RMS values like 1.5 are not valid ints, so compare as floats;
            # the question's thresholds are RMS < 3 and score > 51
            if float(rms) < 3 and float(scores) > 51:
                grouper[input_file].add(int(resnum))

for input_file, group in grouper.iteritems():
    print input_file
    print len(group)
This creates a dictionary of sets, keyed by filename. The values are sets of ResNums, added only when your condition holds. Since sets don't have repeated elements, the size of each set (len) gives you the right count: the condition is counted at most once per ResNum, per file.
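A slightly tidier variant of the same idea, as a sketch under the same assumptions about the file layout, uses collections.defaultdict(set) so the per-file set does not have to be created explicitly:
import glob
from collections import defaultdict

grouper = defaultdict(set)
for input_file in sorted(glob.glob("*")):
    with open(input_file) as f:
        for line in f:
            rms, resnum, scores, rank = line.split()
            if float(rms) < 3 and float(scores) > 51:
                grouper[input_file].add(int(resnum))

for input_file, group in grouper.iteritems():
    print input_file, len(group)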

How to skip more than one line of header in an RDD in Spark

Data in my first RDD is like
1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2
Now the first 3 integers are some counters that I need to broadcast.
After that all the lines have the same format like
1 2 1
1 43 2
I will map all the values after the 3 counters to a new RDD, after doing some computation with them in a function.
But I'm not able to understand how to separate those first 3 values and map the rest normally.
My Python code is like this
documents = sc.textFile("file.txt").map(lambda line: line.split(" "))
final_doc = documents.map(lambda x: (int(x[0]), function1(int(x[1]), int(x[2])))).reduceByKey(lambda x, y: x + " " + y)
It works only when the first 3 values are not in the text file; with them, it gives an error.
I don't want to skip those first 3 values; I want to store them in 3 broadcast variables and then pass the remaining dataset to the map function.
And yes, the text file has to be in that format only; I cannot remove those 3 values/counters.
Function1 just does some computation and returns the values.
Imports for Python 2
from __future__ import print_function
Prepare dummy data:
s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
with open("file.txt", "w") as fw: fw.write(s)
Read raw input:
raw = sc.textFile("file.txt")
Extract header:
header = raw.take(3)
print(header)
### [u'1253', u'545553', u'12344896']
Filter lines:
using zipWithIndex
content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
print(content.first())
## 1 2 1
using mapPartitionsWithIndex
from itertools import islice
content = raw.mapPartitionsWithIndex(
    lambda i, iter: islice(iter, 3, None) if i == 0 else iter)
print(content.first())
## 1 2 1
NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).
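The question also asks to store the three counters in broadcast variables. A minimal sketch, assuming the counters are integers (it reuses the header list extracted with take(3) above):
counter1, counter2, counter3 = [int(x) for x in header]
bc1 = sc.broadcast(counter1)
bc2 = sc.broadcast(counter2)
bc3 = sc.broadcast(counter3)

print(bc1.value)
### 1253
Inside a map function, read them with bc1.value and so on.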
In my case I have a csv file like below
----- HEADER START -----
We love to generate headers
#who needs comment char?
----- HEADER END -----
colName1,colName2,...,colNameN
val__1.1,val__1.2,...,val__1.N
Took me a day to figure out
val rdd = spark.read.textFile(pathToFile).rdd
  .zipWithIndex()                               // get tuples (line, index)
  .filter({ case (line, index) => index > numberOfLinesToSkip })
  .map({ case (line, index) => line })          // get rid of the index
val ds = spark.createDataset(rdd)               // convert RDD to Dataset
val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds)  // parse csv
Sorry, the code is in Scala; however, it can easily be converted to Python.
First take the values using the take() method, as zero323 suggested:
raw = sc.textFile("file.txt")
headers = raw.take(3)
Then filter out the header lines by membership (note: each RDD element is a single line, so compare with 'not in', not '!='):
final_raw = raw.filter(lambda x: x not in headers)
and done.

Need to create a median function that draws from a dictionary

I need to find the median of all the integers associated with each key (AA, BB). The basic format my code leads to:
AA - 21
AA - 52
BB - 3
BB - 2
My code:
from statistics import median

def scoreData(filename):
    d = dict()
    fin = open(filename)
    contents = fin.readlines()
    for line in contents:
        parts = line.split(" - ")      # lines look like "AA - 21"
        parts[1] = int(parts[1])
        if parts[0] not in d:
            d[parts[0]] = [parts[1]]   # start a new list for this key
        else:
            d[parts[0]].append(parts[1])
    names = list(d.keys())
    names.sort()  # alphabetizes the names
    print("Name\tMax\tMin\tMedian")
    for name in names:  # makes the table
        print(name, "\t", max(d[name]), "\t", min(d[name]), "\t", median(d[name]))
I'm afraid that following the same approach as the "names" and "names.sort" lines will completely restructure the data. I've thought about "from statistics import median", but once again I do not know how to select only the values associated with each key.
Thanks in advance
You can do it easily with pandas and numpy:
import pandas
import numpy as np
and aggregating by the first column:
score = pandas.read_csv(filename, delimiter=' - ', header=None)
print score.groupby(0).agg([np.median, np.min, np.max])
which returns:
        1
   median amin amax
0
AA   36.5   21   52
BB    2.5    2    3
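Note that a separator longer than one character, such as ' - ', is treated as a regular expression and needs the Python parsing engine in current pandas; a sketch with a hypothetical filename:
import pandas as pd

score = pd.read_csv("data.txt", sep=" - ", header=None, engine="python")
print(score.groupby(0).agg(["median", "min", "max"]))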
There are many, many ways you can go about this. But here's a 'naive' implementation that will get the job done.
Assuming your data looks like:
AA 1
BB 5
AA 2
CC 7
BB 1
You can do the following:
import numpy as np
from collections import defaultdict

def find_medians(input_file):
    result_dict = defaultdict(list)
    for line in input_file.readlines():
        key, value = line.split()
        result_dict[key].append(int(value))
    # the question asks for the median, so use np.median rather than np.mean
    return [(key, np.median(values)) for key, values in result_dict.iteritems()]
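If pandas and numpy are unavailable, a minimal sketch using only the standard library (Python 3.4+ for statistics.median, assuming the "AA - 21" line format from the question; the filename is hypothetical):
from collections import defaultdict
from statistics import median

d = defaultdict(list)
with open("scores.txt") as fin:  # hypothetical input file
    for line in fin:
        key, value = line.split(" - ")
        d[key].append(int(value))

print("Name\tMax\tMin\tMedian")
for name in sorted(d):
    print(name, max(d[name]), min(d[name]), median(d[name]), sep="\t")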

Formatting list data by table

I am trying to analyse some data, but my data contains letters which require standardising. What I would like to do is, for every datatable in the data (this CSV contains 3 datatables), replace the letter T, or any other letter for that matter, with the next highest integer for that table. The first table contains no errors, the second contains one T, and the third contains two anomalous entries.
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T
I expected this to be a relatively easy thing to code; however, whilst I know how to replace all T's with a number within a particular column or row, I do not know how to replace each T with a different number depending on which datatable it is in. Essentially I am looking to produce the following from the above:
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,7
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,6
DatatableC,5
DatatableC,6
Here nothing happened in DatatableA. In DatatableB the only T was replaced with the next highest integer for that table, in this case 7. In DatatableC there were two anomalous data points, both replaced with the next highest integer, which was 6.
If anyone can point me in the right direction or provide a snippet, it would be greatly appreciated. As always, constructive comments are also appreciated.
Edit in reply to elyase
I attempted to run the code:
import pandas as pd

df = pd.read_csv('test.csv', sep=',', header=None, names=['datatable', 'col'])

def replace_letter(group):
    letters = group.isin(['T', 'Q'])                 # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group

df['col'] = df.groupby('datatable').transform(replace_letter)
print df
and I received this traceback:
Traceback (most recent call last):
File "C:/test.py", line 11, in <module>
df['col'] = df.groupby('datatable').transform(replace_letter)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1981, in transform
res = path(group)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2006, in <lambda>
slow_path = lambda group: group.apply(lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
ValueError: ("invalid literal for int() with base 10: 'col'", u'occurred at index col')
Is there something I have used incorrectly? I could use AEA's answer, but I have been meaning to use pandas more, as the library seems so useful for data manipulation.
Pandas is ideal for this kind of task:
Read your csv:
>>> import pandas as pd
>>> df = pd.read_csv('data.csv', sep=',', header=None, names=['datatable', 'col'])
>>> df.head()
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
Group, select and replace max:
def replace_letter(group):
    letters = group.isin(['T', 'Q'])                 # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group
>>> df['col'] = df.groupby('datatable').transform(replace_letter)
>>> df
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
5 DatatableB 1
6 DatatableB 6
7 DatatableB 7
8 DatatableB 3
9 DatatableB 4
10 DatatableB 5
11 DatatableB 2
12 DatatableC 3
13 DatatableC 4
14 DatatableC 2
15 DatatableC 1
16 DatatableC 6
17 DatatableC 5
18 DatatableC 6
Write to csv:
df.to_csv('result.csv', index=None, header=None)
I suppose I have to answer the question asked by my own alter-ego. Seriously, does StackExchange not sanitize usernames?
Here's a solution; I'm not guaranteeing that it's efficient or simple, but the logic is straightforward. First iterate over your dataset, check for anything that's not an integer string, and record the largest integer per table. Then iterate again and replace the non-integer strings.
I am using StringIO as a replacement for a file, just for convenience's sake.
import csv
import string
from StringIO import StringIO

raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T"""

fp = StringIO()
fp.write(raw)
fp.seek(0)

reader = csv.reader(fp)
data = []
mapping = {}
# First pass: record the largest integer seen so far for each datatable
for row in reader:
    if row[0] not in mapping:
        mapping[row[0]] = float("-inf")
    if row[1] in string.digits:
        x = int(row[1])
        if x > mapping[row[0]]:
            mapping[row[0]] = x
    data.append(row)

# Second pass: replace each non-integer entry with the next highest integer
for i, row in enumerate(data):
    if row[1] not in string.digits:
        mapping[row[0]] += 1
        row[1] = str(mapping[row[0]])

fp.close()
fp = StringIO()
writer = csv.writer(fp)
writer.writerows(data)
print fp.getvalue()
