Shuffling text from file by group of data - python

I am looking for an approach in Python or a Unix command to shuffle a large text data set while keeping lines grouped by the value of their first field, like below:
Input Text:
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
So it would be randomly shuffled but keep the group together like below
Output Sample-
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
I found a solution for normal shuffling, but I can't figure out how to keep the groups together while shuffling.

This is possible with collections.defaultdict. By keying each line on its first field you can collect the lines into groups and then shuffle only the dictionary's keys, like so:
import random
from collections import defaultdict

# Read all the lines from the file, grouped by their first field
lines = defaultdict(list)
with open("/path/to/file", "r") as in_file:
    for line in in_file:
        s_line = line.split(",")
        # strip() so that '"QQQ" ' and '"QQQ"' end up in the same group
        lines[s_line[0].strip()].append(line)

# Randomize the order of the groups
# (random.sample needs a sequence, so convert the keys to a list first)
rnd_keys = random.sample(list(lines), len(lines))

# Write back to the file?
with open("/path/to/file", "w") as out_file:
    for k in rnd_keys:
        for line in lines[k]:
            out_file.write(line)
Hope this helps in your endeavor.

You could also store each line from the file into a nested list:
lines = []
with open('input_text.txt') as in_file:
    for line in in_file.readlines():
        line = [x.strip() for x in line.strip().split(',')]
        lines.append(line)
Which gives:
[['"ABC"', '21', '15', '45'], ['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5'], ['"QQQ"', '43', '54', '35'], ['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]
Then you could group these lists by the first item with itertools.groupby() (this relies on rows with the same first item being adjacent, as they are in your input):
import itertools
from operator import itemgetter
grouped = [list(g) for _, g in itertools.groupby(lines, key = itemgetter(0))]
Which gives a list of your grouped items:
[[['"ABC"', '21', '15', '45']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']], [['"QQQ"', '43', '54', '35']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]]
Then you could shuffle this with random.shuffle():
import random
random.shuffle(grouped)
Which gives a randomized list of your grouped items intact:
[[['"QQQ"', '43', '54', '35']], [['"ABC"', '21', '15', '45']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']]]
And now all you have to do is flatten the final list and write it to a new file, which you can do with itertools.chain.from_iterable():
with open('output_text.txt', 'w') as out_file:
    for line in itertools.chain.from_iterable(grouped):
        out_file.write(', '.join(line) + '\n')
print(open('output_text.txt').read())
Which gives a new shuffled version of your file:
"QQQ", 43, 54, 35
"ABC", 21, 15, 45
"XZZ", 43, 35, 32
"XZZ", 45, 35, 32
"DEF", 35, 3, 35
"DEF", 124, 33, 5
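One caveat with the itertools.groupby() approach is that it only merges adjacent rows, so it relies on the file already being grouped by its first field. Here is a minimal sketch that does not depend on that, assuming the whole data set fits in memory. The function name shuffle_by_group and its optional rng parameter are made up for this illustration:

```python
import random


def shuffle_by_group(lines, rng=random):
    """Return lines reordered so that whole groups appear in random order.

    Lines sharing the same first comma-separated field stay together,
    in their original relative order, even if they were not adjacent.
    """
    groups = {}
    for line in lines:
        key = line.split(",")[0].strip()
        groups.setdefault(key, []).append(line)

    keys = list(groups)   # dicts keep insertion order (Python 3.7+)
    rng.shuffle(keys)     # shuffle the order of the groups only
    return [line for key in keys for line in groups[key]]


sample = [
    '"ABC", 21, 15, 45',
    '"DEF", 35, 3, 35',
    '"QQQ" , 43, 54, 35',
    '"DEF", 124, 33, 5',  # not adjacent to the other "DEF" row
]
shuffled = shuffle_by_group(sample, random.Random(0))
```

Passing a seeded random.Random instance makes the result reproducible, which can be handy for testing; leaving rng out uses the module-level generator.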


set add() method sorts the set?

Using the Python set add() method, I have noticed that it appears to sort the contents of the set by value.
Based on the docstring the following method description is found:
Why is this happening? And is there a way to prevent it?
I am using Python 3.6.
Please don't count on this behavior:
>>> x = set()
>>> for i in range(10):
...     x.add(i)
...
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> for i in range(1000, 1020):
...     x.add(i)
...
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019}
>>> x.remove(2)
>>> x
{0, 1, 3, 4, 5, 6, 7, 8, 9, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019}
>>> x.add(2)
>>> x
{0, 1, 3, 4, 5, 6, 7, 8, 9, 2, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019}
Even if you see this "ordered" behavior once, that does not mean it always holds.
Trivial example:
w = set()
for i in range(100):
    w.add(i)
    w.add(str(i))
print(w)
Output:
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, '20', 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 32, 33, 34, 35, 36, 37, '9', 38, 31, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, '52', 53, 54, 55,
56, 57, 58, 59, 60, 61, '61', 62, 63, 64, '26', 65, 66,
67, '58', '36', 68, '6', '68', 69, '18', 71, 72, '4', 74,
75, 76, 77, '77', 79, 80, 81, 82, '12', '46', 85, 86, 87,
'33', 89, 90, 91, 92, 93, 94, 95, '23', '24', 98, 99, '49',
'92', '30', '44', '7', '21', '93', '86', '2', '67', '57',
'13', '79', '80', '96', '38', '32', '15', '45', '64', '83',
'65', '54', '88', '48', '75', '99', '71', '5', '0', '28',
'87', '43', '94', '90', '72', '42', '37', '59', '35', '8',
'17', '10', 70, 73, '98', '22', '19', '11', '27', '34', '14',
'56', '55', '69', '66', 78, '3', '1', '53', '84', '16', '25',
'76', 83, '82', '29', 84, '95', '31', '70', 88, '97', '40',
'47', '51', '85', '91', '60', '81', '89', 96, '78', '62',
'73', '74', 97, '41', '39', '50', '63'}
If it really sorted anything, it should either
alternate the int and the string value (insertion order),
show all ints sorted first, then all strings sorted,
or follow some other kind of "detectable" pattern.
Using a very small sample set (range(10)) or very restricted values (all ints) can, depending on the set's internal bucketing strategy, lead to "ordered" output.
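As for getting a predictable order: a set will not give you one, so the usual options are to use a different container or to sort explicitly when you need it. A small sketch of both, with made-up sample values; note that dict insertion order is only a language guarantee from Python 3.7 on (in CPython 3.6 it is an implementation detail):

```python
items = [3, "b", 1, "a", 3, 1]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance
unique_in_insertion_order = list(dict.fromkeys(items))

# For a sorted view of a set, sort explicitly at the point of use
numbers = {10, 2, 33, 4}
sorted_numbers = sorted(numbers)
```

Here unique_in_insertion_order is [3, 'b', 1, 'a'] and sorted_numbers is [2, 4, 10, 33], regardless of how the set happens to store its elements.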

Values of dictionary = sum of 2dlist at i

I have the following 2d list and dictionary:
List2d = [['1', '55', '32', '667' ],
['43', '76', '55', '100'],
['23', '70', '15', '300']]
dictionary = {'New York':0, "London": 0, "Tokyo": 0, "Toronto": 0 }
How do I replace all the values of the dictionary with sums of the columns in List2d? So dictionary will look like this:
dictionary= {'New York' : 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
#67 comes from adding up first column (1+43+23) in 'List2d'
#201 comes from adding up second column (55+76+70) in 'List2d'
#102 comes from adding up third column (32+55+15) in 'List2d'
#1067 comes from adding up fourth column (667+100+300) in 'List2d'
Since Python 3.7, dicts preserve insertion order, so the keys come back in the order in which they were defined.
You can use enumerate in order to keep track of the position of the element in the dict while iterating over it. Then, you use the i as an index on each row of the 2d list, convert each value to int and do a sum of the result.
List2d = [['1', '55', '32', '667' ],
          ['43', '76', '55', '100'],
          ['23', '70', '15', '300']]
dictionary = {'New York':0, "London": 0, "Tokyo": 0, "Toronto": 0 }
for i, city in enumerate(dictionary.keys()):
    dictionary[city] = sum(int(row[i]) for row in List2d)
print(dictionary)
# {'New York': 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
Use pandas
#!pip install pandas
import pandas as pd
pd.DataFrame(List2d, columns=dictionary.keys()).astype(int).sum(axis=0).to_dict()
output:
{'New York': 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
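If you would rather avoid both the index bookkeeping and the pandas dependency, the columns can also be obtained by transposing the list with zip. This is just a sketch of that alternative; the name totals is made up here, and it assumes the dictionary has exactly one key per column, in column order:

```python
List2d = [['1', '55', '32', '667'],
          ['43', '76', '55', '100'],
          ['23', '70', '15', '300']]
dictionary = {'New York': 0, 'London': 0, 'Tokyo': 0, 'Toronto': 0}

# zip(*List2d) yields the columns of the 2d list;
# the outer zip pairs each dictionary key with its column
totals = {city: sum(int(value) for value in column)
          for city, column in zip(dictionary, zip(*List2d))}
```

This builds a new dictionary instead of updating the existing one in place; use dictionary.update(totals) if you need the original object changed.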

python - converting a list of 2 digit string numbers to a list of 2 digit integers

I have a list of 2 character strings of numbers,
I'm trying to write a function to convert this to a list of 2 digit integers without using int() or knowing the length of the list, this is my code so far:
intslist = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
numslist = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23',
'24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
'35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45',
'46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56',
'57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67',
'68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78',
'79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89',
'90', '91', '92', '93', '94', '95', '96', '97', '98', '99']
def convert_num(numlist,list1,list2):
    returnlist = []
    templist = []
    convertdict = {k:v for k,v in zip(list1,list2)}
    p = 0
    num = ''.join(numlist)
    for c in num:
        templist.append(convertdict[num[p]])
        p += 2
    for i in templist:
        if templist[i] % 2 == 0:
            returnlist.append()
    return returnlist
this works but only returns a list of the individual digits, not the 2 digits i want.
I'm only a beginner and don't really know how to proceed.
Any help appreciated!!
An integer is an integer. "Two digit integers" don't exist as a concept.
To get an integer from a string without using int or len, you can reverse the string, use ord instead of int, multiply each digit by 10**k (where k is its position counted from the right) and sum:
x = '84'
res = sum((ord(val)-48)*10**idx for idx, val in enumerate(reversed(x))) # 84
You can use map to apply the logic to every string in a list:
def str_to_int(x):
    return sum((ord(val)-48)*10**idx for idx, val in enumerate(reversed(x)))
res = list(map(str_to_int, numslist))
print(res)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
...
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
The core of your solution will be taking the string and converting it to an integer:
def str_to_int(number):
    return sum((ord(c) - 48) * (10 ** i) for i, c in enumerate(number[::-1]))
This function takes your number as a string and enumerates over it from the end. It converts each character to its numeric value by subtracting 48 (the ASCII code of '0') from its ASCII code, then multiplies that value by the power of ten matching its position so that it occupies the proper digit in the overall number.
From there, you can use map to convert the entire list:
intsList = list(map(str_to_int, numslist))
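To make the arithmetic inside str_to_int visible, here is a small sketch that returns the individual per-digit terms instead of their sum. The helper name digit_terms is made up for this illustration, and it assumes the input contains only the characters '0' to '9':

```python
def digit_terms(number):
    """Return the per-digit contributions that str_to_int adds up."""
    return [(ord(c) - ord('0')) * 10 ** i
            for i, c in enumerate(reversed(number))]


terms = digit_terms('84')  # '4' is in the ones place, '8' in the tens place
total = sum(terms)
```

For '84' the terms are [4, 80], which sum to 84. Writing ord('0') instead of the literal 48 is only a readability choice; both are the same value.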
The very simple solution:
dd={ str(i):i for i in range(10) } # {"0":0,"1":1,..."9":9}
rslt=[]
for ns in numslist:
    n=0
    for i in range(len(ns)):
        n=10*n+dd[ns[i]]
    rslt.append(n)

How to put a text file into a 2d list in tuples?

I am having a problem with passing a text file into a 2 dimensional array of tuples. Here is what my input file looks like, it's really big so this is just part of it.
26 54 94 25 53 93 24 52 92 25 53 93 25 53 93 25 53 93 25 53 93 27 55 95 28 55 98 26 53 96 25 52 95 26 53 96 27 54 97 28 55 98 27 54 97 26 53 96 26 55 97 26 55 97 26 55 97 26 55 97 25 54 96 25 54 96 25 54 96 26 55 97 26 55 99 27 56 100 28 57 101 26 55 99 25 54 98 26 55 99 26 55 99 26 55 99 25 54 98 26 55 99 27 56 100 27 56 100 26 55 99 26 55 99 26 55 99 27 56 100 28 57 101 29 58 102 29 58 102
Here is the function that is reading in the file and putting it in the 2d array
def load_image_data(infile):
    '''
    Accepts an input file object
    Reads the data in from the input PPM image into a 2-dimensional list of RGB tuples
    Returns the 2-dimensional list
    '''
    print("Loading image data...\n")
    for line in infile.readlines():
        line = line.strip()
        values = line.split(" ")
        new_line = []
        for j in range(int(len(new_line) / 3) + 1):
            for i in range(len(new_line) // 3):
                r = new_line[0]
                g = new_line[1]
                b = new_line[2]
                t = (r, g, b)
                t_list.append(t)
                del new_line[0]
                del new_line[0]
                del new_line[0]
            new_line.append(t)
    print(new_line)
    print("done")
    return new_line
And here is main:
def main():
    '''
    Runs Program
    '''
    mods = ["vertical_flip", "horizontal_flip", "horizontal_blur", "negative",
            "high_contrast", "random_noise", "gray_scale", "remove_color"]
    # ** finish adding string modifications to this list
    for mod in mods:
        # get infile name
        #file = (input("Please enter the input file name: "))
        # get outfile name
        #out = (input("Please enter the output file name: "))
        infile = open("ny.ppm", "r") # ** get the filename from the user
        outfile = open("ny_negative.ppm", "w") # ** change to use mod and user-spec filename
        process_header(infile, outfile)
        load_image_data(infile)
        process_body(infile, outfile, mod)
        outfile.close()
        infile.close()
Read the data from your file into inf and split it to get the list of values. Then iterate over range(number_of_items // 3), append each group of three values to your list, and return it:
def load_image_data(infile):
    print("Loading image data...\n")
    # The sample data sits on a single line; split it on whitespace
    inf = infile.readlines()
    inf = inf[0].split()
    new_line = []
    for i in range(len(inf) // 3):
        # Each consecutive group of three values is one r, g, b pixel
        new_line.append(inf[i*3:i*3+3])
    return new_line
And in your main()
infile = open("ny.ppm", "r") # ** get the filename from the user
print(load_image_data(infile))
infile.close()
Sample Output:
[['26', '54', '94'], ['25', '53', '93'], ['24', '52', '92'], ['25', '53', '93'], ['25', '53', '93'], ['25', '53', '93'], ['25', '53', '93'], ['27', '55', '95'], ['28', '55', '98'], ['26', '53', '96'], ['25', '52', '95'], ['26', '53', '96'], ['27', '54', '97'], ['28', '55', '98'], ['27', '54', '97'], ['26', '53', '96'], ['26', '55', '97'], ['26', '55', '97'], ['26', '55', '97'], ['26', '55', '97'], ['25', '54', '96'], ['25', '54', '96'], ['25', '54', '96'], ['26', '55', '97'], ['26', '55', '99'], ['27', '56', '100'], ['28', '57', '101'], ['26', '55', '99'], ['25', '54', '98'], ['26', '55', '99'], ['26', '55', '99'], ['26', '55', '99'], ['25', '54', '98'], ['26', '55', '99'], ['27', '56', '100'], ['27', '56', '100'], ['26', '55', '99'], ['26', '55', '99'], ['26', '55', '99'], ['27', '56', '100'], ['28', '57', '101'], ['29', '58', '102'], ['29', '58', '102']]
Hope it helps!
Here is how you could chunk into tuples:
In [8]: from itertools import islice
In [9]: with open("yourfile.DATA") as f:
   ...:     data = f.read().split()
   ...:     size = len(data)
   ...:     it = map(int, data)
   ...:     data = [tuple(islice(it,0,3)) for _ in range(0, size, 3)]
   ...:
The output:
In [10]: data
Out[10]:
[(26, 54, 94),
(25, 53, 93),
(24, 52, 92),
(25, 53, 93),
(25, 53, 93),
(25, 53, 93),
(25, 53, 93),
(27, 55, 95),
(28, 55, 98),
(26, 53, 96),
(25, 52, 95),
(26, 53, 96),
(27, 54, 97),
(28, 55, 98),
(27, 54, 97),
(26, 53, 96),
(26, 55, 97),
(26, 55, 97),
(26, 55, 97),
(26, 55, 97),
(25, 54, 96),
(25, 54, 96),
(25, 54, 96),
(26, 55, 97),
(26, 55, 99),
(27, 56, 100),
(28, 57, 101),
(26, 55, 99),
(25, 54, 98),
(26, 55, 99),
(26, 55, 99),
(26, 55, 99),
(25, 54, 98),
(26, 55, 99),
(27, 56, 100),
(27, 56, 100),
(26, 55, 99),
(26, 55, 99),
(26, 55, 99),
(27, 56, 100),
(28, 57, 101),
(29, 58, 102),
(29, 58, 102)]
That list comprehension could be written a little more verbosely as:
In [11]: with open('yourfile.DATA') as f:
    ...:     data = f.read().split()
    ...:     size = len(data)
    ...:     it = map(int, data)
    ...:     data = []
    ...:     for _ in range(0, size, 3):
    ...:         data.append(tuple(islice(it, 0, 3)))
    ...:
Note that I used a with block, which is advisable when dealing with files: it closes the file for you, and it does so even if an exception is raised inside the block.
One piece of advice: be careful passing file handles around. When you do stuff like this:
infile = open("ny.ppm", "r") # ** get the filename from the user
outfile = open("ny_negative.ppm", "w") # ** change to use mod and user-spec filename
process_header(infile, outfile)
load_image_data(infile)
process_body(infile, outfile, mod)
outfile.close()
infile.close()
Be aware that file handles like infile act sort of like one-pass iterators: a call such as .readlines() consumes the file up to its end. So if you use infile.readlines() in process_header and then pass that same infile to process_body, subsequent calls to infile.readlines() will return an empty list rather than the data, unless you reset the file cursor explicitly using infile.seek(0), which is why I say they are "sort of" like one-pass iterators. But I suggest not dealing with that and instead passing around a string of the path to the file, and using a with-block to open your files.
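The behaviour of an exhausted file handle can be seen in a few lines. In this sketch io.StringIO stands in for an opened text file so that it runs without any file on disk, and the PPM-like content is made up for the example:

```python
import io

# io.StringIO behaves like an opened text file for this purpose
infile = io.StringIO("P3\n2 1\n255\n26 54 94 25 53 93\n")

first = infile.readlines()   # consumes everything up to the end of the file
second = infile.readlines()  # nothing left to read: an empty list, no exception

infile.seek(0)               # rewind to the beginning
third = infile.readlines()   # the full content is available again
```

After this runs, first holds all four lines, second is an empty list, and third equals first again because of the seek(0).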
Something like this would read (and return) the image data as a list-of-lists-of-tuples:
try:
    from itertools import izip
except ImportError:  # Python 3
    izip = zip

def load_image_data(infile):
    rows = []
    for line in infile:
        values = [int(v) for v in line.split()]
        tuples = [t for t in izip(*[iter(values)]*3)]
        rows.append(tuples)
    return rows

def main():
    with open("ny.ppm", "r") as infile, open("ny_negative.ppm", "w") as outfile:
        process_header(infile, outfile)
        image_data = load_image_data(infile)
        print(image_data)
        # etc ...

main()
Sample of output format:
[[(255, 0, 0), (0, 255, 0), (0, 0, 255), ...],
[(255, 255, 0), (255, 255, 255), (0, 0, 0), ...],
...
]
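The izip(*[iter(values)]*3) expression is compact but a bit cryptic, so here is the same idea spelled out for Python 3. The function name chunk_triples is made up for this sketch:

```python
def chunk_triples(values):
    """Group a flat sequence into consecutive 3-tuples."""
    it = iter(values)
    # The same iterator is passed to zip three times, so each tuple
    # pulls three successive items from it. Leftover items (fewer
    # than three at the end) are dropped by zip.
    return list(zip(it, it, it))


pixels = chunk_triples([26, 54, 94, 25, 53, 93, 24, 52])
```

Here pixels is [(26, 54, 94), (25, 53, 93)]; the trailing 24 and 52 do not form a complete triple and are discarded. If incomplete pixels should be treated as an error instead, check that the number of values is divisible by three before chunking.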

parsing this csv file in python(pylab) and converting it into a dictionary

I have this code:
data = np.genfromtxt('csv_data.csv', dtype=None, names=True)
print data
It results in the following output
[('westin,390,291,70,43,19,215,27,813',)
('ramada,136,67,53,30,24,149,49,310',)
('sutton,489,293,106,39,20,299,24,947',)
('loden,681,134,17,5,0,199,4,837',) ('hampton,241,166,26,5,1,159,21,439',)
('shangrila,332,45,20,8,2,325,8,407',) ('mariott,22,15,5,0,0,179,35,42',)
('pan_pacific,475,262,86,29,16,249,15,868',)
('sheraton,277,346,150,80,26,249,45,879',)
('westin_bayshore,390,291,70,43,19,199,27,813',)]
It didn't copy the column headers:
Hotel,excellent,verygood,average,poor,terrible,cheapest,rank,reviews
from the file. What I'm trying to do is save the output to a dictionary data structure in Python. Is there a way to convert this output into a dictionary?
I can write a function to parse this but I was wondering if there is a built in function in Python.
Thanks
You didn't give a value to the delimiter parameter. Therefore, np.genfromtxt uses the default, None, and tries to separate the fields on whitespace.
You need to use
np.genfromtxt(your_file, dtype=None, delimiter=',', names=True)
Process the file yourself using the csv module.
The following takes the file and creates a dictionary called by_hotel whose keys are the hotel names and whose values are dictionaries of fieldname->value for the original row (note that each row dictionary also includes the hotel name, but anyway...)
import csv
with open('csv_data.csv') as fin:
    csvin = csv.DictReader(fin)
    headers = csvin.fieldnames
    by_hotel = {row['Hotel']: row for row in csvin}
print(by_hotel['sutton']['excellent'])
# 489
If you wanted a list back in the original order, then you could do:
print([by_hotel['sutton'][fname] for fname in headers])
NB: You may want to convert your values to integers for computation purposes though.
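A sketch of that conversion, assuming every column except Hotel holds whole numbers. The two data rows are copied from the question and inlined through io.StringIO so the example runs without the file:

```python
import csv
import io

# A two-row excerpt of the data from the question, inlined for the example
text = """Hotel,excellent,verygood,average,poor,terrible,cheapest,rank,reviews
westin,390,291,70,43,19,215,27,813
sutton,489,293,106,39,20,299,24,947
"""

by_hotel = {}
for row in csv.DictReader(io.StringIO(text)):
    name = row.pop('Hotel')  # use the hotel name as the key only
    by_hotel[name] = {field: int(value) for field, value in row.items()}
```

With this, by_hotel['sutton']['excellent'] is the integer 489 rather than the string '489'. For the real file, replace io.StringIO(text) with the file object from open('csv_data.csv').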
Simple version :
d = { item[0].split(',')[0] : item[0].split(',')[1:] for item in data }
returns:
{'sutton': ['489', '293', '106', '39', '20', '299', '24', '947'], 'hampton': ['241', '166', '26', '5', '1', '159', '21', '439'], 'westin_bayshore': ['390', '291', '70', '43', '19', '199', '27', '813'], 'sheraton': ['277', '346', '150', '80', '26', '249', '45', '879'], 'ramada': ['136', '67', '53', '30', '24', '149', '49', '310'], 'mariott': ['22', '15', '5', '0', '0', '179', '35', '42'], 'loden': ['681', '134', '17', '5', '0', '199', '4', "837'"], 'shangrila': ['332', '45', '20', '8', '2', '325', '8', '407'], 'pan_pacific': ['475', '262', '86', '29', '16', '249', '15', '868']}
and more complicated (dict of dict) :
d = { item[0].split(',')[0] : { headers[i] : int( item[0].split(',')[i+1].strip("'") ) for i in range(len( item[0].split(',')[1:] ) ) } for item in data }
returns:
{'sutton': {'poor': 39, 'cheapest': 299, 'average': 106, 'terrible': 20, 'rank': 24, 'reviews': 947, 'excellent': 489, 'verygood': 293}, 'hampton': {'poor': 5, 'cheapest': 159, 'average': 26, 'terrible': 1, 'rank': 21, 'reviews': 439, 'excellent': 241, 'verygood': 166}, 'westin_bayshore': {'poor': 43, 'cheapest': 199, 'average': 70, 'terrible': 19, 'rank': 27, 'reviews': 813, 'excellent': 390, 'verygood': 291}, 'sheraton': {'poor': 80, 'cheapest': 249, 'average': 150, 'terrible': 26, 'rank': 45, 'reviews': 879, 'excellent': 277, 'verygood': 346}, 'ramada': {'poor': 30, 'cheapest': 149, 'average': 53, 'terrible': 24, 'rank': 49, 'reviews': 310, 'excellent': 136, 'verygood': 67}, 'mariott': {'poor': 0, 'cheapest': 179, 'average': 5, 'terrible': 0, 'rank': 35, 'reviews': 42, 'excellent': 22, 'verygood': 15}, 'loden': {'poor': 5, 'cheapest': 199, 'average': 17, 'terrible': 0, 'rank': 4, 'reviews': 837, 'excellent': 681, 'verygood': 134}, 'shangrila': {'poor': 8, 'cheapest': 325, 'average': 20, 'terrible': 2, 'rank': 8, 'reviews': 407, 'excellent': 332, 'verygood': 45}, 'pan_pacific': {'poor': 29, 'cheapest': 249, 'average': 86, 'terrible': 16, 'rank': 15, 'reviews': 868, 'excellent': 475, 'verygood': 262}}
import csv

data_dict = {}
headers = []
with open("csv_data.csv", "r") as f:
    holder = csv.reader(f, delimiter=',')
    first_row = True
    for row in holder:
        if first_row:
            # The first row holds the column headers
            first_row = False
            for header in row:
                colname = str(header)
                headers.append(colname)
                data_dict[colname] = []
        else:
            colnum = 0
            for datapoint in row:
                # The first column (hotel name) is text; the rest are integers
                value = datapoint if colnum == 0 else int(datapoint)
                data_dict[headers[colnum]].append(value)
                colnum += 1
This gives you a dictionary whose keys are the column headers (the first row of the CSV file) and whose values are lists holding the remaining data of each column.
Moreover, headers is a list of all the column headers.
