Extract data from table column and make variables in Python - python

I have a dataset where I want to make a new variable every time the 'Recording' number changes. I want the new variable to include the 'Duration' data for that 'Recording' and for all previous recordings. So for the table below it would be:
Var1 = (3, 3, 3)
Var2 = (3, 3, 3, 4, 6)
Var3 = (3, 3, 3, 4, 6, 4, 3, 1, 4)
And so on. I have several datasets that can have a different number of recordings (but always starting from 1) and a different number of durations for each recording. Any help is greatly appreciated.
Recording  Duration
1          3
1          3
1          3
2          4
2          6
3          4
3          3
3          1
3          4

You can aggregate each group to a list, take the cumulative sum of the lists, then convert to tuples and a dictionary:
d = df.groupby('Recording')['Duration'].agg(list).cumsum().apply(tuple).to_dict()
print (d)
{1: (3, 3, 3), 2: (3, 3, 3, 4, 6), 3: (3, 3, 3, 4, 6, 4, 3, 1, 4)}
print (d[1])
print (d[2])
print (d[3])
Your output is possible, but not recommended:
s = df.groupby('Recording')['Duration'].agg(list).cumsum().apply(tuple)
for k, v in s.items():
    globals()[f'Var{k}'] = v

@jezrael's answer is beautiful and definitely better :). But if you really want to do this as a loop (perhaps because you might want to modify the logic further in future), then you might:
import pandas as pd
df = pd.DataFrame({
"Recording": [1,1,1,2,2,3,3,3,3],
"Duration": [3,3,3,4,6,4,3,1,4]
}) # your example data
records = {}
record = []
last_recording = None # flag to track change in recording
for r, d in zip(df.Recording, df.Duration):
    if record and r != last_recording:
        records[last_recording] = tuple(record)
    record.append(d)
    last_recording = r
records[last_recording] = tuple(record) # capture final group
print(records)
Modified to provide a dict (which seems sensible). This will be slow for large datasets.
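For comparison, the same running accumulation can be sketched with only the standard library. This is a minimal sketch, assuming the rows are already ordered so that each recording's rows are contiguous (as in the example data); the names `recording`, `duration`, `running`, and `cumulative` are illustrative and do not come from the answers above.

```python
from itertools import groupby

# Example data from the question, as two plain lists
recording = [1, 1, 1, 2, 2, 3, 3, 3, 3]
duration = [3, 3, 3, 4, 6, 4, 3, 1, 4]

cumulative = {}
running = []  # all durations seen so far

# groupby only merges *adjacent* equal keys, hence the contiguity assumption
for rec, group in groupby(zip(recording, duration), key=lambda pair: pair[0]):
    running.extend(d for _, d in group)
    cumulative[rec] = tuple(running)  # snapshot up to and including this recording

print(cumulative)
```

Keeping the results in a dict keyed by recording number avoids creating variables dynamically, which is the same trade-off the first answer recommends.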

Related

Two way removal of duplicate start and end nodes

I am just a beginner in python and this may seem like an easy fix but I have been stuck at it given my limited knowledge of python.
I have two lists that are paired together:
s = [0,1,2,3,4,5,6,7,3,5,7]
t = [2,4,6,2,1,6,3,1,7,4,1]
This can be interpreted as start nodes and end nodes of lines, so 0 is connected to 2 and 1 is connected to 4 and so on.
I would like to remove all duplicate "lines" or pairs of nodes, in this example 7 -> 1 is repeated twice and 1 -> 4 is duplicated in the other direction 4 -> 1. I want to remove both types of duplicates and get the results:
S = [0,1,2,3,5,6,7,3,5]
T = [2,4,6,2,6,3,1,7,4]
Preserving the order and pairs of start and end is required.
I hope this makes sense, any help is greatly appreciated!
You can use a set of already-seen pairs and deduplicate the lists in order, such as:
s = [0,1,2,3,4,5,6,7,3,5,7]
t = [2,4,6,2,1,6,3,1,7,4,1]
seen = set()
li = []
for pair in zip(s, t):  # 'pair' avoids shadowing the list t
    key = frozenset(pair)  # same key for (1, 4) and (4, 1)
    if key not in seen:
        li.append(pair)
        seen.add(key)
S, T = map(list, zip(*li))
Result:
>>> S
[0, 1, 2, 3, 5, 6, 7, 3, 5]
>>> T
[2, 4, 6, 2, 6, 3, 1, 7, 4]
Note: This can be reduced to:
seen=set()
S,T=zip(*[t for t in zip(s,t) if frozenset(t) not in seen and not seen.add(frozenset(t))])
But some will object to the use of a side effect in a list comprehension. I personally think it is OK in this use, but the loop form is considered by many to be better because it is far easier to read.
You can zip these lists together and use a set comprehension:
u = {tuple({a,b}) for (a,b) in (zip(s,t))}
# u: {(0, 2), (1, 4), (1, 7), (2, 3), (2, 6), (3, 6), (3, 7), (4, 5), (5, 6)}
first, sec = zip(*u)
# first: (6, 6, 5, 4, 3, 6, 7, 7, 2)
# sec : (2, 5, 4, 1, 2, 3, 1, 3, 0)
We use tuple to make the objects hashable.
Just notice that sets are unordered, so if order is important please highlight that in your question.
To preserve order, check @Dawg's answer. My solution for this case was very similar to his after he undeleted ;)
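The two ideas in this thread (normalising each pair so direction does not matter, and keeping the original order) can also be combined with an insertion-ordered dict. This is a hedged sketch rather than either answerer's method; it assumes Python 3.7+ (where dicts preserve insertion order), and `first_seen` is an illustrative name.

```python
s = [0, 1, 2, 3, 4, 5, 6, 7, 3, 5, 7]
t = [2, 4, 6, 2, 1, 6, 3, 1, 7, 4, 1]

first_seen = {}
for pair in zip(s, t):
    # (4, 1) and (1, 4) share the key (1, 4);
    # setdefault keeps only the first pair stored under each key
    first_seen.setdefault(tuple(sorted(pair)), pair)

# Values come back in first-appearance order with their original direction
S, T = (list(col) for col in zip(*first_seen.values()))
print(S)
print(T)
```

Sorting each pair gives a direction-independent key, while the stored value keeps the pair exactly as it first appeared.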

Convert Pandas Dataframe to Dictionary with Tuple Keys for Ternary plot

I am plotting ternary diagrams with python-ternary
My data is in a pandas dataframe. I need to convert it to a dictionary mapping (i, j) to a float as input for the heatmap function in ternary.
My dataframe (df) looks like this:
   i  j  value
0  1  2      7
1  3  4      8
2  5  6      9
I need to make a dictionary like this:
{(1, 2): 7, (5, 6): 9, (3, 4): 8}
My current workaround is a brute force loop that is very slow:
import pandas as pd
df = pd.DataFrame({'i': [1, 3, 5], 'j': [2, 4, 6], 'value': [7, 8, 9]})
data = dict()
for k in range(0, len(df)):
    data[(df.iloc[k]['i'], df.iloc[k]['j'])] = \
        df.iloc[k]['value']
Please, could someone help me with a faster or more pythonic way of doing this?
Use set_index with to_dict:
d = df.set_index(['i','j'])['value'].to_dict()
Alternative with zip and dict comprehension:
d = {(a,b):c for a,b,c in zip(df['i'], df['j'], df['value'])}
print (d)
{(1, 2): 7, (3, 4): 8, (5, 6): 9}

How to generate a result list from pandas DataFrame

I need to generate a list from a pandas DataFrame. I am able to print the result, but I don't know how to convert it to list format. The code I used to print the result (without converting to a list) is:
df=pandas.DataFrame(processed_data_format, columns=["file_name", "innings", "over","ball", "individual ball", "runs","batsman", "wicket_status","bowler_name","fielder_name"])
# processed_data_format is the list passed to the DataFrame
t = df.groupby(['batsman','over'])[['runs','ball']].sum()
print(t)
I am getting a result like:
Sangakara  1  10  5
           2   0  2
           3   3  1
sewag      1   2  1
           2   1  1
I would like to convert this data into a list format like:
[[sangakara,1,10,5], [sangakara,2,0,2], [sangakara,3,3,1], [sewag,1,2,1], [sewag,2,1,1]]
You can use to_records:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5]})
>>> list(df.to_records())
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
This will make a list of tuple-like records. Converting this to a list of lists (if you really need this at all) is easy.
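For the grouped result in the question, one possible route is to move the group keys back into columns and then convert the rows. The sketch below assumes a small made-up ball-by-ball table (the values are hypothetical and do not reproduce the asker's data) and uses only the columns involved in the grouping.

```python
import pandas as pd

# Hypothetical ball-by-ball data; only the columns used in the grouping are included
df = pd.DataFrame({
    "batsman": ["Sangakara", "Sangakara", "Sangakara", "Sewag", "Sewag"],
    "over": [1, 1, 2, 1, 1],
    "runs": [4, 6, 0, 2, 0],
    "ball": [1, 2, 1, 1, 2],
})

t = df.groupby(["batsman", "over"])[["runs", "ball"]].sum()

# reset_index() turns the (batsman, over) index levels back into ordinary columns,
# so every output row carries the batsman name and over number explicitly
rows = t.reset_index().values.tolist()
print(rows)
```

Without `reset_index()`, the batsman and over would stay in the index and would be missing from the row values.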

Compare consecutive columns of a file and return the number of non-matching elements

I have a text file which looks like this:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
I want to compare the consecutive columns and return the number of non-matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have huge data to process. I believe Python would be a faster solution to this. But I am very new to Python, and so far I have something like this:
for line in open("phased.txt"):
    columns = line.split("\t")
    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j
which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong, and I guess I could use difflib, etc. But I have never coded proficiently in Python before, so I am skeptical about proceeding.)
I want to compare and return the number of non-matching elements in each column (starting from the third) to every other column in the file. I have 828 columns in total. Hence I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) My desired output in the case of the above snippet would be:
3 4: 1
3 5: 3
3 6: 3
......
4 6: 3
..etc
Any help on this would be appreciated. Thanks.
A pure Python way of solving this, using only the standard library. Let us know how it compares with Bash; 828 x 828 should be a walk in the park.
Element Column counts:
I purposely wrote this with a separate step that flips (transposes) the sequences, for simplicity and illustrative purposes. You can improve it with changed logic, or by using class objects, function decorators, etc.
Code Python 2.7:
shiftcol = 2 # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]
# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])
# Step 2: counts store in temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp: # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])
        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count
sample output
>>> 3 4 : {'match': 0, 'notma': 3}
>>> 3 5 : {'match': 0, 'notma': 3}
>>> 3 6 : {'match': 2, 'notma': 1}
>>> 3 7 : {'match': 2, 'notma': 1}
>>> 3 3 : {'match': 1, 'notma': 2}
>>> 3 4 : {'match': 1, 'notma': 2}
>>> 3 5 : {'match': 1, 'notma': 2}
I highly recommend you use pandas for this rather than writing your own code:
import numpy as np
import pandas as pd
df = pd.read_csv("phased.txt")
match_counts = {(i,j):
                np.sum(df[df.columns[i]] != df[df.columns[j]])
                for i in range(3,len(df.columns))
                for j in range(3,len(df.columns))}
match_counts
{(6, 4): 3,
(4, 7): 2,
(4, 4): 0,
(4, 3): 3,
(6, 6): 0,
(4, 5): 3,
(5, 4): 3,
(3, 5): 3,
(7, 7): 0,
(7, 5): 3,
(3, 7): 2,
(6, 5): 3,
(5, 5): 0,
(7, 4): 2,
(5, 3): 3,
(6, 7): 2,
(4, 6): 3,
(7, 6): 2,
(5, 7): 3,
(6, 3): 2,
(5, 6): 3,
(3, 6): 2,
(3, 3): 0,
(7, 3): 2,
(3, 4): 3}
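To check the counting logic against the desired output in the question, the pairwise comparison can be sketched on the three sample rows held in memory. This is a minimal standard-library sketch, not either answerer's code; it assumes the two leading fields have already been stripped from each row, and `rows`, `first_col`, and `mismatches` are illustrative names. Reading and splitting the real file is left out because its exact delimiter is not shown.

```python
from itertools import combinations

# The six sample values of each data row from the question
# (the leading "M" and "rs..." fields are assumed to be removed already)
rows = [
    ["0", "0", "A", "G", "0", "A"],
    ["0", "A", "C", "T", "0", "0"],
    ["0", "0", "G", "A", "T", "0"],
]

columns = list(zip(*rows))  # transpose: one tuple per column
first_col = 3               # 1-based number of the first sample column, as in the question

# For each unordered pair of columns, count the rows where the two values differ
mismatches = {
    (i + first_col, j + first_col): sum(a != b for a, b in zip(columns[i], columns[j]))
    for i, j in combinations(range(len(columns)), 2)
}

for (i, j), count in sorted(mismatches.items()):
    print(f"{i} {j}: {count}")
```

Using `combinations` computes each pair once; the full n*n matrix is symmetric with zeros on the diagonal, so the remaining entries follow directly.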

Sort by column 1 and tuple(column2,column3) then expand again

I am new to Python and this seems to be a bit tricky for me:
I have 3 columns:
column1 : id
column2 : size
column3 : rank
I now want to re-order the id column while keeping size and rank together as a pair (size,rank),
so it would look like:
id:1 (size,rank):4,5
id:2 (size,rank):5,8
So the id column has to be reordered from 1 to 1000 without messing up the (size,rank) tuple.
I tried to do:
combined = zip(size,rank)
id, combined = zip(*sorted(zip(id, combined)))
Is this correct? And if yes, how can I separate the tuples into 2 arrays, size and rank, again?
I heard about zip(*combined)?
unziped = zip(*combined)
then size equals unziped[0] and rank equals unziped[1] ?
Thank you for help!
ADDED from Numpy genfromtxt function
size= [x[2] for x in mydata]
rank= [x[1] for x in mydata]
Your main problem is you are using the id column as both an identifier and as the value by which the data is ordered. This is wrong. Leave the id column alone; and then sort the data by only size; once you have it sorted use enumerate to list the "order".
Here is an example, which sorts the data by the second column (size), then prints the data along with their "rank" or "order":
>>> data = ((1, 4, 5), (3, 6, 7), (4, 3, 3), (19, 32, 0))
>>> data_sorted = sorted(data, key=lambda x: x[1])
>>> for k,v in enumerate(data_sorted):
...     print('{}: {}'.format(k+1, v))
...
1: (4, 3, 3)
2: (1, 4, 5)
3: (3, 6, 7)
4: (19, 32, 0)
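On the narrower question of sorting by id and then splitting the pairs apart again, the zip approach from the question can be sketched as below. This is a minimal sketch with made-up sample values; `ids` is used instead of `id` to avoid shadowing the built-in. In Python 3, `zip` returns an iterator, so the result has to be unpacked or turned into a list before it can be indexed.

```python
# Hypothetical sample data in unsorted id order
ids = [3, 1, 2]
size = [6, 4, 5]
rank = [7, 5, 8]

# Sort whole rows by id; each (size, rank) pair travels with its id
ordered = sorted(zip(ids, size, rank))

# zip(*ordered) transposes the rows back into three columns
ids_sorted, size_sorted, rank_sorted = (list(col) for col in zip(*ordered))

print(ids_sorted)   # ids in ascending order
print(size_sorted)  # sizes re-ordered to match
print(rank_sorted)  # ranks re-ordered to match
```

Because tuples compare element by element, rows with equal ids would fall back to comparing size and then rank; passing `key=lambda row: row[0]` to `sorted` restricts the comparison to the id.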
