Convert Pandas Dataframe to Dictionary with Tuple Keys for Ternary plot - python

I am plotting ternary diagrams with python-ternary.
My data is in a pandas dataframe. I need to convert it to a dictionary mapping (i, j) to a float as input for the heatmap function in ternary.
My dataframe (df) looks like this:
   i  j  value
0  1  2      7
1  3  4      8
2  5  6      9
I need to make a dictionary like this:
{(1, 2): 7, (5, 6): 9, (3, 4): 8}
My current workaround is a brute force loop that is very slow:
import pandas as pd
df = pd.DataFrame({'i': [1, 3, 5], 'j': [2, 4, 6], 'value': [7, 8, 9]})
data = dict()
for k in range(0, len(df)):
    data[(df.iloc[k]['i'], df.iloc[k]['j'])] = \
        df.iloc[k]['value']
Please, could someone help me with a faster or more pythonic way of doing this?

Use set_index with to_dict:
d = df.set_index(['i','j'])['value'].to_dict()
Alternative with zip and dict comprehension:
d = {(a,b):c for a,b,c in zip(df['i'], df['j'], df['value'])}
print (d)
{(1, 2): 7, (3, 4): 8, (5, 6): 9}
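As a quick sanity check, the two approaches above can be compared on the question's sample frame. The third variant, based on itertuples, is an extra illustration that is not part of the original answer, and the names d_index, d_zip and d_tuples are only chosen for this sketch:

```python
import pandas as pd

df = pd.DataFrame({'i': [1, 3, 5], 'j': [2, 4, 6], 'value': [7, 8, 9]})

# MultiIndex route: (i, j) becomes the index, so to_dict() yields tuple keys
d_index = df.set_index(['i', 'j'])['value'].to_dict()

# zip route: walk the three columns in parallel
d_zip = {(a, b): c for a, b, c in zip(df['i'], df['j'], df['value'])}

# itertuples route (extra illustration): one namedtuple per row
d_tuples = {(row.i, row.j): row.value for row in df.itertuples(index=False)}

print(d_index == d_zip == d_tuples)
```

All three avoid the repeated df.iloc[k] lookups of the original loop, which is where most of the time went.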

Related

Print every two pairs in a new line from array of elements in Python

How can I print from an array of elements in Python every second pair of elements one below another, without commas and brackets?
My array looks like this:
m=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
And, I want to print in one of the cases:
1 2
5 6
9 10
or in another case:
3 4
7 8
11 12
I didn't know how to do that, so I created two separate arrays, but when I try to print the elements in separate rows, each pair has brackets and a comma. Is there an easier way to solve this and make the output look as I wrote above?
What I've tried:
a=[m[j:j+2] for j in range(0,len(m),2)]
a1=m[::2]
a2=m[1::2]
if s1>s2:
    print("\n".join(map(str,a1)))
elif s1<s2:
    print("\n".join(map(str,a2)))
My current output:
[3, 4]
[7, 8]
[11, 12]
You could use a while loop:
m = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
idx = 0
while idx + 1 < len(m):  # stop before running past the end of the list
    print(m[idx], m[idx + 1])
    idx += 4  # jump over the pair in between
To print the other case, start with idx = 2.
Another way to do it:
m=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
pairs = [m[i:i+2] for i in range(0, len(m), 2)]  # slice by index, so it works for any values
results = [], []
for i, e in enumerate(pairs):
    results[i%2].append(e)
for i in results:
    for p in i:
        print(*p)
    print("-----")
Output:
1 2
5 6
9 10
-----
3 4
7 8
11 12
-----
What you are trying to achieve is to split the list into pairs and save alternating pairs in separate lists. One way of achieving this, step by step, is the code below.
m=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
make_pair = [(m[i], m[i+1]) for i in range(0, len(m), 2)]
res1 = []
res2 = []
print(make_pair)
# [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
for i in range(len(make_pair)):
    if i%2:
        res2.append(make_pair[i])
    else:
        res1.append(make_pair[i])
print(res1)
# [(1, 2), (5, 6), (9, 10)]
print(res2)
# [(3, 4), (7, 8), (11, 12)]
If you want to do it in one pass, i.e. without creating the list of pairs, you can achieve the same result using a temporary stack.
I don't understand why all the solutions are so complex, this is as simple as follows (case 1):
len_m = len(m)
for i in range(0, len_m - 1 if len_m & 1 else len_m, 4):
    print(f"{m[i]} {m[i + 1]}")
For case 2, just start the range with 2.
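Both cases can be wrapped in one small helper. This is only a sketch: the function name alternate_pairs and its start parameter are made up for illustration, and it assumes that a trailing element without a partner should be dropped.

```python
def alternate_pairs(items, start=0):
    """Return every second pair of neighbours, beginning at index `start`."""
    # Step by 4: two elements for the pair we keep, two for the pair we skip.
    # Stopping at len(items) - 1 guarantees items[i + 1] always exists.
    return [(items[i], items[i + 1]) for i in range(start, len(items) - 1, 4)]


m = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

for a, b in alternate_pairs(m, start=0):  # case 1
    print(a, b)

print("-----")

for a, b in alternate_pairs(m, start=2):  # case 2
    print(a, b)
```

Unpacking each pair into print(a, b) is what removes the brackets and commas from the output.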

Changing list to dataframe in dictionary

I am building a dictionary that has to separate a dataframe into multiple small dataframes, based on the items that are repeated in the list calvo_massflow. If an item has not been seen before, a new list is created for it in the dictionary. In the inner for loop, the row of the dataframe df at the current index is added to one of the dictionary's lists if the key (l) and e are the same.
This is what I currently got:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import linregress
from scipy.optimize import curve_fit
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a":[1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b":[5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic[e] = []
    for l in dic:
        if e == l:
            dic[e].append(pd.DataFrame([df.iloc[i]]))
The problem with the output is that each row ends up as a separate dataframe in the dictionary. I would like to have all the dataframes combined. I tried doing something with pd.concat, but I couldn't figure it out. Moreover, the entries in the dictionary (if that's what you call them) are lists, and I would prefer them to be dataframes. However, if I change my list to a dataframe like this:
dic3 = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic3[e] = pd.DataFrame([])
    for l in dic3:
        if e == l:
            dic3[e].append(df.iloc[i])
I can't seem to add dataframes to the dataframes stored in the dictionary.
My ideal scenario would be a dictionary with two dataframes, one with the key '1' and one with the key '2'. Together, those dataframes should include all the matching rows from the dataframe df, rather than a separate dataframe for each index as it is right now. Preferably the dataframes aren't wrapped in lists like they are now, but it won't be a disaster if they are.
Let me know if you guys can help me out or need more context!
IIUC you want to select the rows of df up to the length of calvo_massflow, group by calvo_massflow and convert to dict. This might look like this:
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a":[1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b":[5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = dict(iter(df.iloc[:len(calvo_massflow)]
                  .groupby(calvo_massflow)))
print(dic)
resulting in a dictionary with keys 1 and 2 containing two filtered DataFrames:
{1:    a   b
0  1   5
2  3   7
5  2  44
6  4  23,
 2:     a   b
1   2   6
3   4   8
4  11  10}
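For comparison, the loop from the question can also be made to work with pd.concat, which the question mentions trying. This is a sketch rather than a replacement for the groupby answer; rows_by_key is a name introduced only for this example. The dic3 attempt could not work because DataFrame.append returned a new frame instead of modifying the existing one, and the method was removed in pandas 2.0.

```python
import pandas as pd

calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a": [1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b": [5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})

# Collect one-row frames per key first ...
rows_by_key = {}
for i, key in enumerate(calvo_massflow):
    # df.iloc[[i]] (double brackets) keeps the row as a one-row DataFrame
    rows_by_key.setdefault(key, []).append(df.iloc[[i]])

# ... then glue each list together once, which is cheaper than growing a frame row by row
dic = {key: pd.concat(frames) for key, frames in rows_by_key.items()}

print(dic[1])
print(dic[2])
```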

Extract data from table column and make variables in Python

I have a dataset where I want to make a new variable every time the 'Recording' number changes. I want the new variable to include the 'Duration' data for the specific 'Recording' and for all previous recordings. So for the table below it would be:
Var1 = (3, 3, 3)
Var2 = (3, 3, 3, 4, 6)
Var3 = (3, 3, 3, 4, 6, 4, 3, 1, 4)
And so on. I have several datasets that can have a different number of recordings (but always starting from 1) and a different number of durations for each recording. Any help is greatly appreciated.
Recording  Duration
1          3
1          3
1          3
2          4
2          6
3          4
3          3
3          1
3          4
You can aggregate to lists, take the cumulative sum of the lists, then convert to tuples and to a dictionary:
d = df.groupby('Recording')['Duration'].agg(list).cumsum().apply(tuple).to_dict()
print (d)
{1: (3, 3, 3), 2: (3, 3, 3, 4, 6), 3: (3, 3, 3, 4, 6, 4, 3, 1, 4)}
print (d[1])
print (d[2])
print (d[3])
Your output is possible, but not recommended:
s = df.groupby('Recording')['Duration'].agg(list).cumsum().apply(tuple)
for k, v in s.items():
    globals()[f'Var{k}'] = v
@jezrael's answer is beautiful and definitely better :). But if you really wanted to do this as a loop (perhaps because you might want to modify the logic further in future), then you might:
import pandas as pd
df = pd.DataFrame({
    "Recording": [1,1,1,2,2,3,3,3,3],
    "Duration": [3,3,3,4,6,4,3,1,4]
}) # your example data
records = {}
record = []
last_recording = None # flag to track change in recording
for r, d in zip(df.Recording, df.Duration):
    if record and not r == last_recording:
        records[last_recording] = (tuple(record))
    record.append(d)
    last_recording = r
records[last_recording] = (tuple(record)) # capture final group
print(records)
Modified to provide a dict (which seems sensible). Note that this will be slow for large datasets.
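The running concatenation can also be written with itertools.accumulate from the standard library, which does not depend on how a given pandas version handles cumsum on a column of lists. This is a sketch on the example data; the names grouped and running are chosen only for illustration.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({
    "Recording": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Duration": [3, 3, 3, 4, 6, 4, 3, 1, 4],
})

# One list of durations per recording, in recording order
grouped = df.groupby("Recording")["Duration"].agg(list)

# accumulate() carries the growing list forward; acc + cur builds a new list each step
running = accumulate(grouped, lambda acc, cur: acc + cur)

d = {int(rec): tuple(vals) for rec, vals in zip(grouped.index, running)}
print(d)
```

Looking values up as d[1], d[2] and so on keeps the data in one place, which is why a dictionary is usually preferred over creating Var1, Var2, ... through globals().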

How to generate a result list from pandas DataFrame

I need to generate a list from a pandas DataFrame. I am able to print the result, but I don't know how to convert it to list format. The code I used to print the result (without converting to a list) is:
df = pandas.DataFrame(processed_data_format, columns=["file_name", "innings", "over", "ball", "individual ball", "runs", "batsman", "wicket_status", "bowler_name", "fielder_name"])
# processed_data_format is the list passed to the DataFrame
t = df.groupby(['batsman', 'over'])[['runs', 'ball']].sum()
print(t)
I am getting a result like this:
Sangakara  1  10  5
           2   0  2
           3   3  1
sewag      1   2  1
           2   1  1
I would like to convert this data into list format like this:
[ [sangakara,1,10,5],[sangakara,2,0,2],[sangakara,3,3,1],[sewag,1,2,1],[sewag,2,1,1] ]
You can use to_records:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 3, 4, 5]})
>>> list(df.to_records())
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
This will make a list of tuples. Converting this to a list of lists (if you really need that at all) is easy.
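Applied to the grouped result from the question, one way to get a list of lists is to move the group keys back into ordinary columns with reset_index and then call tolist on the underlying array. This is a sketch under assumptions: the ball-by-ball rows below are invented so that the sums match the output shown in the question, and only the four relevant columns are included.

```python
import pandas as pd

# Invented sample rows; only the columns used by the groupby are included
df = pd.DataFrame(
    [
        ("Sangakara", 1, 4, 2), ("Sangakara", 1, 6, 3),
        ("Sangakara", 2, 0, 2),
        ("Sangakara", 3, 3, 1),
        ("sewag", 1, 2, 1),
        ("sewag", 2, 1, 1),
    ],
    columns=["batsman", "over", "runs", "ball"],
)

t = df.groupby(["batsman", "over"])[["runs", "ball"]].sum()

# reset_index() turns the (batsman, over) index levels into columns again,
# so every row carries its own keys; tolist() then yields one list per row
result = t.reset_index().values.tolist()
print(result)
```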

Compare consecutive columns of a file and return the number of non-matching elements

I have a text file which looks like this:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
I want to compare the consecutive columns and return the number of non-matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have a huge amount of data to process. I believe Python would be a faster solution. But I am very new to Python, and so far I only have something like this:
for line in open("phased.txt"):
    columns = line.split("\t")
    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j
which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong and I guess I could use difflib, etc. But I have never coded proficiently in Python before, so I am hesitant to proceed.)
I want to compare each column (starting from the third) to every other column in the file and return the number of non-matching elements. I have 828 columns in total, hence I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) My desired output for the above snippet would be:
3 4: 1
3 5: 3
3 6: 3
......
4 6: 3
..etc
Any help on this would be appreciated. Thanks.
A pure Python way of solving this, using only built-ins. Let us know how it compares with Bash; 828 x 828 should be a walk in the park.
Element column counts:
I purposely wrote this with a separate step that flips the rows into columns, for simplicity and illustrative purposes. You can improve it with different logic, or perhaps with classes, function decorators, etc.
Code (Python 2.7):
shiftcol = 2 # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]
# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])
# Step 2: counts store in temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp: # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])
        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count
sample output
>>> 3 4 : {'match': 0, 'notma': 3}
>>> 3 5 : {'match': 0, 'notma': 3}
>>> 3 6 : {'match': 2, 'notma': 1}
>>> 3 7 : {'match': 2, 'notma': 1}
>>> 3 3 : {'match': 1, 'notma': 2}
>>> 3 4 : {'match': 1, 'notma': 2}
>>> 3 5 : {'match': 1, 'notma': 2}
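I highly recommend you use pandas for this rather than writing your own code:
import numpy as np
import pandas as pd
# sep=r"\s+" splits on tabs or spaces; without it read_csv expects commas
df = pd.read_csv("phased.txt", sep=r"\s+")
# despite the name, this counts the rows in which columns i and j differ
match_counts = {(i,j):
                np.sum(df[df.columns[i]] != df[df.columns[j]])
                for i in range(3,len(df.columns))
                for j in range(3,len(df.columns))}
match_counts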
{(6, 4): 3,
(4, 7): 2,
(4, 4): 0,
(4, 3): 3,
(6, 6): 0,
(4, 5): 3,
(5, 4): 3,
(3, 5): 3,
(7, 7): 0,
(7, 5): 3,
(3, 7): 2,
(6, 5): 3,
(5, 5): 0,
(7, 4): 2,
(5, 3): 3,
(6, 7): 2,
(4, 6): 3,
(7, 6): 2,
(5, 7): 3,
(6, 3): 2,
(5, 6): 3,
(3, 6): 2,
(3, 3): 0,
(7, 3): 2,
(3, 4): 3}
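The same idea can be checked against the desired output in the question with a self-contained sketch. It makes a few assumptions that are not in the original answer: the three data rows are read from an inline string instead of phased.txt, the header line is left out, columns are reported with the question's 1-based numbering, and each unordered pair is computed only once (the count is symmetric, so (j, i) equals (i, j)).

```python
import io
from itertools import combinations

import pandas as pd

# The three sample rows from the question, without the header line
raw = """M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
"""

# dtype=str keeps "0" as text; otherwise an all-zero column would be parsed
# as integers and never compare equal to the "0" strings in other columns
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, dtype=str)

samples = df.iloc[:, 2:]  # skip the two leading identifier columns

# Key: 1-based column numbers as used in the question; value: number of differing rows
mismatches = {
    (int(i) + 1, int(j) + 1): int((df[i] != df[j]).sum())
    for i, j in combinations(samples.columns, 2)
}

for (i, j), count in mismatches.items():
    print(f"{i} {j}: {count}")
```

For 828 columns this produces 828 * 827 / 2 pairs instead of 828 * 828, roughly halving the work.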
