Grouping, sorting lines and removing redundancies - python

I have an input file having 15 columns,
con13 tr|M0VCZ1| 91.39 267 23 0 131 211 1 267 1 480 239 267 33.4 99.6
con13 tr|M8B287| 97.12 590 17 0 344 211 1 267 0 104 239 590 74.0 99.8
con15 tr|M0WV77| 92.57 148 11 0 73 516 1 148 2 248 256 148 17.3 99.3
con15 tr|C5WNQ0| 85.14 148 22 0 73 516 1 178 4 233 256 148 17.3 99.3
con15 tr|B8AQC2| 83.78 148 24 0 73 516 1 148 6 233 256 148 17.3 99.3
con18 tr|G9HXG9| 99.66 293 1 0 144 102 1 293 7 527 139 301 63.1 97.0
con18 tr|M0XCZ0| 98.29 293 5 0 144 102 1 293 2 519 139 301 63.1 97.0
I need to:
1) group and iterate inside each con (using groupby),
2) sort line[2] from lowest to highest value,
3) check inside each group whether line[0], line[8] and line[9] are the same,
4) if they are, remove the repetitive rows and print the results to a new .txt file, keeping the row with the highest value in line[2],
so that my output file looks like this:
con13 tr|M8B287| 97.12 590 17 0 344 211 1 267 0 104 239 590 74.0 99.8
con15 tr|M0WV77| 92.57 148 11 0 73 516 1 148 2 248 256 148 17.3 99.3
con15 tr|C5WNQ0| 85.14 148 22 0 73 516 1 178 4 233 256 148 17.3 99.3
con18 tr|G9HXG9| 99.66 293 1 0 144 102 1 293 7 527 139 301 63.1 97.0
My attempted script prints only a single con and does not sort:
from itertools import groupby

f1 = open('example.txt', 'r')
f2 = open('result1', 'w')
f3 = open('result2.txt', 'w')

for k, g in groupby(f1, key=lambda x: x.split()[0]):
    seen = set()
    for line in g:
        hsp = tuple(line.rsplit())
        if hsp[8] and hsp[9] not in seen:
            seen.add(hsp)
            f2.write(line.rstrip() + '\n')
        else:
            f3.write(line.rstrip() + '\n')

Use the csv module to pre-split your lines for you and write out formatted data again, and use a tuple in seen (of just the 9th and 10th columns) to track similar rows:
import csv
from itertools import groupby
from operator import itemgetter

with open('example.txt', 'rb') as f1:
    with open('result1', 'wb') as f2, open('result2.txt', 'wb') as f3:
        reader = csv.reader(f1, delimiter='\t')
        writer1 = csv.writer(f2, delimiter='\t')
        writer2 = csv.writer(f3, delimiter='\t')

        for group, rows in groupby(reader, itemgetter(0)):
            rows = sorted(rows, key=itemgetter(8, 9, 2))
            for k, rows in groupby(rows, itemgetter(8, 9)):
                # now we are grouping on columns 8 and 9,
                # *and* these are sorted on column 2
                # everything but the *last* row is written to writer2
                rows = list(rows)
                writer1.writerow(rows[-1])
                writer2.writerows(rows[:-1])
The sorted(rows, key=itemgetter(8, 9, 2)) call sorts the grouped rows (so all rows with the same row[0] value) on columns 8, 9 and 2, in that order.
Because you then want to write only the row with the highest value in column 2 (per group of rows with columns 8 and 9 equal) to the first result file, we group again, this time on just columns 8 and 9; thanks to the sort, each of those sub-groups is in ascending order on column 2. The last row of each sub-group is then written to result1, the rest to result2.txt.
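One caveat, since the csv module hands you strings: itemgetter(2) compares column 2 lexicographically, which only matches numeric order when all the values have the same width. A minimal sketch of a numeric sort key (assuming column 2 always holds a parsable float, which the original answer does not guarantee):
def sort_key(row):
    # group on columns 8 and 9 as text, but compare column 2 numerically
    return (row[8], row[9], float(row[2]))

# inside the outer groupby loop, replacing the itemgetter(8, 9, 2) sort:
# rows = sorted(rows, key=sort_key)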

Related

How could I count the rating for each item_id?

From the u.item file, which is divided into [100000 rows x 4 columns],
I have to find out which are the best movies.
For each unique item_id (there are 1682 of them), I try to find the overall rating separately:
import pandas as pd
import csv

ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", delim_whitespace=True,
                      names=["user_id", "item_id", "rating", "timestamp"])
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output:
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used best_m = ratings.groupby("item_id")["rating"].sum()
followed by best_m = best_m.sort_values(ascending=False).
And the output looks like:
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
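Note that .sum() adds the rating values up; if, as the title says, you want to count how many ratings each item_id received, count (or size) is the aggregation to use. A minimal sketch, assuming the ratings frame loaded above:
# number of ratings per item_id, highest first
rating_counts = ratings.groupby("item_id")["rating"].count()
rating_counts = rating_counts.sort_values(ascending=False)
print(rating_counts.head())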

Is there a way to get [0:30], [1:31], [2:32], etc of a very long dataframe and turn those into rows of a different one?

I have an extremely long dataframe (1 column, 1,000,000 rows), and I want to collate rows 1-30, 2-31, 3-32, all the way to the end, into rows of a new dataframe. I already attempted to do so with a for loop:
hlist = longdataframe.tolist()
i = 0
df = pd.DataFrame()
while i < len(hlist)-31:
    x = hlist[i:i+30]
    df.append([x])
    i += 1
However, it is very clunky and takes hours to complete. Is there a quicker way to achieve my desired outcome? Many thanks.
You can accomplish this with a rolling window with pandas.DataFrame.rolling.
My guess is whatever you are trying to accomplish with individual data frames is best accomplished using a rolling window on one big dataframe.
Sequential appends are very slow; it's highly encouraged to use a single concat on a list of frames instead.
Since Series.rolling is iterable (I use Series.rolling here because DataFrame has no tolist method, which implies longdataframe is actually a Series), we can use a list comprehension to build out a new frame:
import numpy as np
import pandas as pd

# Sample Series
input_series = pd.Series(np.arange(1, 301))

# Comprehension to build DataFrame
output_df = pd.DataFrame([subset_df.values
                          for subset_df in input_series.rolling(window=30)
                          if len(subset_df) == 30])
*Note:
min_periods will not work when iterating over the rolling window.
DataFrame.rolling is also iterable and could work similarly.
output_df:
0 1 2 3 4 5 6 ... 23 24 25 26 27 28 29
0 1 2 3 4 5 6 7 ... 24 25 26 27 28 29 30
1 2 3 4 5 6 7 8 ... 25 26 27 28 29 30 31
2 3 4 5 6 7 8 9 ... 26 27 28 29 30 31 32
3 4 5 6 7 8 9 10 ... 27 28 29 30 31 32 33
4 5 6 7 8 9 10 11 ... 28 29 30 31 32 33 34
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
266 267 268 269 270 271 272 273 ... 290 291 292 293 294 295 296
267 268 269 270 271 272 273 274 ... 291 292 293 294 295 296 297
268 269 270 271 272 273 274 275 ... 292 293 294 295 296 297 298
269 270 271 272 273 274 275 276 ... 293 294 295 296 297 298 299
270 271 272 273 274 275 276 277 ... 294 295 296 297 298 299 300
Let's look at some timing information via %timeit:
import numpy as np
import pandas as pd

def append_approach(s):
    s = s.tolist()
    df = pd.DataFrame()
    i = 0
    while i < len(s) - 31:
        x = s[i:i + 30]
        df = df.append([x])
        i += 1
    return df

def concat_approach(s):
    return pd.DataFrame([subset_df.values
                         for subset_df in s.rolling(window=30)
                         if len(subset_df) == 30])
# Sample Series
input_series = pd.Series(np.arange(1, 10_001))
%timeit append_approach(input_series)
13.5 s ± 556 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit concat_approach(input_series)
304 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Notice the concat approach takes about 2% of the time that the append approach takes on a Series of 10,000 values.
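As an aside (not part of the original answer), NumPy 1.20+ can build the same overlapping windows without iterating a rolling object at all; a minimal sketch:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

input_series = pd.Series(np.arange(1, 301))

# every length-30 window becomes a row; resulting shape is (len(input_series) - 29, 30)
windows = sliding_window_view(input_series.to_numpy(), 30)
output_df = pd.DataFrame(windows)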

pad rows on a pandas dataframe with zeros till N count

I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like so:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200 odd rows and I am looking for something generic which pads this dataframe with 0's till 256 rows.
I am quite new to pandas and could not find any function to do this.
reindex with fill_value
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
We can do
new_df = df.reindex(range(257)).fillna(0, downcast='infer')
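If you need this for files of varying length, a small wrapper keeps it generic; a sketch (pad_to is just an illustrative name, and it assumes you want the same 257-position index the answers above produce):
import pandas as pd

def pad_to(df, n_rows, fill=0):
    # reset_index guards against frames that don't carry the default
    # 0..len-1 RangeIndex; reindex then adds the missing rows filled with `fill`
    return df.reset_index(drop=True).reindex(range(n_rows), fill_value=fill)

padded = pad_to(data, 257)  # same result as data.reindex(range(257), fill_value=0)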

Comparing CSVs with rows as multiples of 6

I have two CSV files whose row counts are multiples of 6 and I want to compare them. If a row in CSV1 has the same values in columns 1 to 3 as any row in CSV2, but their column 4 values differ, replace column 4 in CSV1 with column 4 from CSV2. So far I have written the code below, which reads both CSVs and groups the rows in sixes, but I don't know what to do next, as it results in a list of lists of lists which I cannot handle. N.B. One CSV has more rows than the other.
My Code:
import csv

def datareader(datafile):
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)
        List1 = [lines for lines in reader]
    return [List1[pos:pos + 6] for pos in xrange(0, len(List1), 6)]

list1 = datareader('CSV1.csv')

def datareader1(datafile):
    # Read the csv
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)
        List2 = [lines for lines in reader]
    return [List2[pos:pos + 6] for pos in xrange(0, len(List2), 6)]

list2 = datareader1('CSV2.csv')
CSV1
frm to grp dur
192 177 1 999999
192 177 2 749
192 177 3 895
192 177 4 749
192 177 5 749
192 177 6 222222
192 178 1 222222
192 178 2 222222
192 178 3 222222
192 178 4 222222
192 178 5 1511
192 178 6 999999
192 179 1 999999
192 179 2 387
192 179 3 969
192 179 4 387
192 179 5 387
192 179 6 999999
CSV2
from_BAKCode to_BAKCode interval duration
192 177 1 999999
192 177 2 749
192 177 3 749
192 177 4 749
192 177 5 749
192 177 6 999999
192 178 1 999999
192 178 2 999999
192 178 3 999999
192 178 4 999999
192 178 5 1511
192 178 6 999999
You can use the pandas module for this kind of data manipulation; it will be much easier.
import pandas as pd

def add_new_dur(row):
    if row['dur'] == row['duration']:
        return row['dur']
    else:
        return row['duration']

fileNameCSV1 = 'csv1.csv'
fileNameCSV2 = 'csv2.csv'

df = dict()
for f in [fileNameCSV1, fileNameCSV2]:
    df[f.split('.')[0]] = pd.read_csv(f)

result = df['csv1'].merge(df['csv2'],
                          left_on=['frm', 'to', 'grp'],
                          right_on=['from_BAKCode', 'to_BAKCode', 'interval'])
result['new_dur'] = result.apply(add_new_dur, axis=1)
result = result[['frm', 'to', 'grp', 'new_dur']]
result = result.rename(columns={'new_dur': 'dur'})
The result will look like this.
frm to grp dur
0 192 177 1 999999
1 192 177 2 749
2 192 177 3 749
3 192 177 4 749
4 192 177 5 749
5 192 177 6 999999
6 192 178 1 999999
7 192 178 2 999999
8 192 178 3 999999
9 192 178 4 999999
10 192 178 5 1511
11 192 178 6 999999
In case one CSV file has a greater number of rows than the other, the extra rows will be omitted (the merge above is an inner join).
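If you would rather keep the CSV1 rows that have no match in CSV2 (falling back to their original dur), a left merge is one option; a sketch reusing the names above, not part of the original answer:
# keep every CSV1 row; unmatched rows get NaN in 'duration'
result = df['csv1'].merge(df['csv2'],
                          how='left',
                          left_on=['frm', 'to', 'grp'],
                          right_on=['from_BAKCode', 'to_BAKCode', 'interval'])

# where CSV2 had no matching row, fall back to the original dur
result['dur'] = result['duration'].fillna(result['dur'])
result = result[['frm', 'to', 'grp', 'dur']]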
Hope it helps.

2d list in python - accessing through column names

I'm parsing two files which have data as shown below.
File1:
UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67
file2:
UID X Y A Z B
------ ---------- ---------- ---------- ---------- ---------
456 536 1 148 304 234
1071 908 1 128 243 12
1118 4 8 52 162 123
249 4 8 68 154 987
1072 296 416 68 114 45
118 180 528 68 67 6
I will be comparing two such files; however, the number of columns and the column names might vary. For every unique UID, I need to match the column names, compare, and find the difference.
Questions
1. Is there a way to access columns by column names instead of index?
2. Can column names be assigned dynamically based on the file data?
I'm able to load the file into a list and compare using indexes, but that's not a proper solution.
Thanks in advance.
You might consider using csv.DictReader. It allows you both to address columns by name and to handle a variable list of columns for each file opened. Consider removing the ------ row separating the header from the actual data, as it might otherwise be read as a data row.
Example:
import csv

with open('File1', 'r', newline='') as f:
    # If you don't pass field names
    # they are taken from the first row.
    reader = csv.DictReader(f)
    for line in reader:
        # `line` is a dict {'UID': val, 'A': val, ... }
        print(line)
If your input format has no clear delimiter (multiple whitespaces), you can wrap the file with a generator that compresses continuous runs of whitespace into e.g. a comma:
import csv
import re

r = re.compile(r'[ ]+')

def trim_whitespaces(f):
    for line in f:
        yield r.sub(',', line)

with open('test.txt', 'r', newline='') as f:
    reader = csv.DictReader(trim_whitespaces(f))
    for line in reader:
        print(line)
This is a good use case for pandas; loading the data is as simple as:
import pandas as pd
from io import StringIO
data = """ UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
df = pd.read_csv(StringIO(data), skiprows=[1], delimiter=r'\s+')
Let's inspect results:
>>> df
UID A B C D
0 456 536 1 148 304
1 1071 908 1 128 243
2 1118 4 8 52 162
3 249 4 8 68 154
4 1072 296 416 68 114
5 118 180 528 68 67
After obtaining df2 by similar means, we can merge the results:
>>> df.merge(df2, on=['UID'])
UID A_x B_x C D X Y A_y Z B_y
0 456 536 1 148 304 536 1 148 304 234
1 1071 908 1 128 243 908 1 128 243 12
2 1118 4 8 52 162 4 8 52 162 123
3 249 4 8 68 154 4 8 68 154 987
4 1072 296 416 68 114 296 416 68 114 45
5 118 180 528 68 67 180 528 68 67 6
The resulting pandas.DataFrame has a very rich API, and all SQL-like analysis operations such as joining, filtering, grouping and aggregating are easy to perform. Look for examples on this site or in the documentation.
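Since the question asks for the differences per UID, one way to finish the job (a sketch; the _f1/_f2 suffixes are just illustrative, not part of the original answer) is to merge with suffixes and compare only the columns the two frames share:
# merge with a suffix per file, then compare the shared columns
merged = df.merge(df2, on='UID', suffixes=('_f1', '_f2'))
shared = (set(df.columns) & set(df2.columns)) - {'UID'}

for col in sorted(shared):
    mismatched = merged[merged[col + '_f1'] != merged[col + '_f2']]
    print(col)
    print(mismatched[['UID', col + '_f1', col + '_f2']])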
my_text = """UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
lines = my_text.splitlines()                  # split your text into lines
keys = lines[0].split()                       # the headers are in the first line
table = [line.split() for line in lines[2:]]  # the data is the rest (skip the ------ row)
columns = zip(*table)                         # transpose the rows array to a columns array
my_dict = dict(zip(keys, columns))            # match the keys from earlier with the columns
print(my_dict['A'])                           # access by column name
Obviously you would need to change this if you had to read from a file instead.
Alternatively, this is what packages like pandas were made for:
import pandas
table = pandas.read_csv('foo.csv', index_col=0)
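For the whitespace-aligned files shown in the question, the read_csv defaults would not split the columns; a sketch of a call that would (the file name is illustrative, the options mirror those used in the earlier answer):
import pandas as pd

# whitespace-separated, skip the ------ separator row, use UID as the index
table = pd.read_csv('file1.txt', sep=r'\s+', skiprows=[1], index_col=0)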
