pandas create new columns from tuple values in one column - python

I have a dataframe that looks like:
   RMSE       SELECTED DATA         information
0   100    [12, 15, 19, 13]   (arr1, str1, fl1)
1   200          [7, 12, 3]   (arr2, str2, fl2)
2   300  [5, 9, 3, 3, 3, 3]   (arr3, str3, fl3)
Here, I want to break up the information column into three distinct columns: the first containing the arrays, the second containing the strings, and the last containing the floats. The new dataframe would then look like:
   RMSE       SELECTED DATA  ARRAYS  STRING  FLOAT
0   100    [12, 15, 19, 13]    arr1    str1    fl1
1   200          [7, 12, 3]    arr2    str2    fl2
2   300  [5, 9, 3, 3, 3, 3]    arr3    str3    fl3
I thought one way would be to isolate the information column and then slice it using .apply like so:
df['arrays'] = df['information'].apply(lambda row : row[0])
and do this for each entry. But I was curious whether there is a better way, since with many more entries this could become tedious or slow with a for loop.

Let us recreate the dataframe:
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
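A self-contained sketch of the same idea, with placeholder tuples standing in for the arrays, strings, and floats from the question:
import pandas as pd

df = pd.DataFrame({
    'RMSE': [100, 200, 300],
    'SELECTED DATA': [[12, 15, 19, 13], [7, 12, 3], [5, 9, 3, 3, 3, 3]],
    'information': [([1, 2], 'str1', 1.0),   # hypothetical (array, string, float) tuples
                    ([3, 4], 'str2', 2.0),
                    ([5, 6], 'str3', 3.0)],
})

# pop the tuple column, expand each tuple into three columns,
# and join the result back onto the original frame
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
print(df)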


Merge lists in a dataframe column if they share a common value

What I need:
I have a dataframe where the elements of a column are lists, with no duplicate elements within a list. For example, a dataframe like the following:
import pandas as pd
>>> d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>> df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least one number contained in the list at row i is also contained in the list at row j, the two lists are merged (without duplicates). A value could also be shared by more than two lists; in that case I want all lists that share at least one value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
Neither the order of the rows in the output dataframe nor the order of the values inside a list matters.
What I tried:
I have found this answer, which shows how to tell whether at least one item of a list is contained in another list, e.g.
>>> not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
This returns True, since 2 is contained in both lists.
I have also found this useful answer, which shows how to compare each row of a dataframe with every other row. The answer applies a custom function to each row of the dataframe using a lambda:
df.apply(lambda row: func(row['col1']), axis=1)
However, I'm not sure how to put these two things together or how to write the func method. I also don't know whether this approach is even feasible, since the result will probably have fewer rows than the original dataframe.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This basically treats every number as a node; whenever two numbers are in the same list, you connect them. Finally, you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
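If you want lists rather than sets (an assumption about the desired output shape), you can sort each component:
result = pd.Series(nx.connected_components(G)).map(sorted)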
This is not straightforward, as merging lists has many pitfalls. One solid approach is to use a specialized library, for example networkx, and take a graph approach: generate successive edges and find the connected components.
Here is your graph (rendering omitted).
You can thus:
- generate successive edges with add_edges_from
- find the connected_components
- craft a dictionary and map the first item of each list
- groupby and merge the lists (you could use the connected components directly, but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx

G = nx.Graph()
for l in df['col1']:
    G.add_edges_from(zip(l, l[1:]))

groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}

out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x)))
      )
Output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
This seems more like a Python problem than a pandas one, so here's one attempt that checks each list against every list after it, merging (and removing) the other whenever they intersect:
vals = d["col1"]

# while there is at least 1 more list after the current one to process...
i = 0
while i < len(vals) - 1:
    current = set(vals[i])
    # for the next lists...
    j = i + 1
    while j < len(vals):
        # any intersection?
        # then update the current and delete the other
        other = vals[j]
        if current.intersection(other):
            current.update(other)
            del vals[j]
        else:
            # no intersection, so keep going to the next list
            j += 1
    # put the updated current back, and move on
    vals[i] = current
    i += 1
At the end, vals is:
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
If you don't want vals modified, you can chain .copy() onto it.
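One caveat, shown with a hypothetical input: because each current set is compared only against the lists after it, a merge that becomes possible only after a later union is missed:
vals = [[1, 2], [3, 4], [2, 3]]
# after the pass above: [{1, 2, 3}, {3, 4}] -- both still contain 3.
# Wrapping the outer loop in a "repeat until a full sweep makes no merge"
# loop would make the result independent of the input order.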
To add to mozway's answer: it wasn't clear from the question, but I also had rows with single-valued lists. These values aren't added to the graph when calling add_edges_from(zip(l, l[1:])), since l[1:] is empty. I solved it by adding a single node to the graph whenever l[1:] is empty. I leave the solution here in case anyone needs it.
import networkx as nx
import pandas as pd

d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df = pd.DataFrame(data=d)

G = nx.Graph()
for l in df['col1']:
    if len(l[1:]) == 0:
        G.add_node(l[0])
    else:
        G.add_edges_from(zip(l, l[1:]))

groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]

Changing list to dataframe in dictionary

I am writing a dictionary that has to separate a dataframe into multiple small dataframes based on the values repeated in the list calvo_massflow. If a value hasn't been seen yet, a new list is created in the dictionary. In the second for loop, the row of df at index i is appended to one of the dictionary lists whenever the key (l) and e are the same.
This is what I currently got:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import linregress
from scipy.optimize import curve_fit

calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a": [1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b": [5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic[e] = []
    for l in dic:
        if e == l:
            dic[e].append(pd.DataFrame([df.iloc[i]]))
The problem with the output is that each row ends up as a separate dataframe in the dictionary. I would like to have all those dataframes combined. I tried doing something with pd.concat, but I didn't figure it out. Moreover, the values in the dictionary are lists, and I'd prefer them to be dataframes. However, if I change the lists to dataframes like I've done here:
dic3 = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic3[e] = pd.DataFrame([])
    for l in dic3:
        if e == l:
            dic3[e].append(df.iloc[i])
I can't seem to add dataframes to the dataframes made by the dictionary.
My ideal scenario would be a dictionary with two dataframes, one with key 1 and one with key 2, together containing all the rows of df, rather than a separate dataframe for each index as it is now. Preferably the dataframes wouldn't be wrapped in lists like they are now, but that wouldn't be a disaster.
Let me know if you guys can help me out or need more context!
IIUC you want to select the rows of df up to the length of calvo_massflow, group by calvo_massflow, and convert to dict. That might look like this:
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a": [1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b": [5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})

dic = dict(iter(df.iloc[:len(calvo_massflow)]
                  .groupby(calvo_massflow)))
print(dic)
resulting in a dictionary with keys 1 and 2 containing two filtered DataFrames:
{1: a b
0 1 5
2 3 7
5 2 44
6 4 23,
2: a b
1 2 6
3 4 8
4 11 10}
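Each value in dic is an ordinary DataFrame keyed by the massflow value, so it can be used directly, for example:
print(dic[1])             # the rows of df where calvo_massflow == 1
print(dic[2]['b'].sum())  # aggregate a single group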

Group rows based on +- threshold on high dimensional object

I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df, I want to assign each row to a class using a threshold of ±2 across all coordinates, so that the df gains a unique group name for each row. The output of this threshold function would be:
x   y   z  group
1  10   7  -
2  14   6  -
3   5   2  G1
4  14  43  -
5   3   1  G1
6  12  40  -
It is similar to clustering, but I want to use my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold is applied to similar coordinates. All rows within a ±threshold of each other across all coordinates should be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not very clear from your statement whether you need all pairwise differences between the coordinates or just the neighbouring differences (x-y and y-z); row 5 has a difference of 4 between the x and z coordinates but is still assigned to class G1. That's why I wrote it for both possibilities, so you can choose whichever you need:
import pandas as pd
import numpy as np

def your_specific_function(row):
    '''
    For all pairwise differences use this instead:
    diffs = np.array([abs(row.x - row.y), abs(row.y - row.z), abs(row.x - row.z)])
    '''
    # for only x - y, y - z use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [10, 14, 5, 14, 3, 12],
                   'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis=1)
print(df.head())
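Note that np.diff(row) keeps the sign of each difference, so all(diffs <= 2) only bounds the differences from above. A hypothetical variant that bounds the magnitude of each neighbouring difference instead would be:
diffs = np.abs(np.diff(row))  # |y - x| and |z - y| must both be <= 2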

Performing calculations on a numpy array and adding them to a pandas dataframe

Let's say I have an array such as this:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
and a dataframe such as this:
num letter
0 1 a
1 2 b
2 3 c
What I would then like to do is to calculate the difference between the first and last number in each sequence in the array and ultimately add this difference to a new column in the df.
Currently I am able to calculate the desired difference in each sequence in this manner:
for i in a:
    print(i[-1] - i[0])
Giving me the following results:
6
30
12
What I expected to be able to do is replace the print with df['new_col'], like so:
df['new_col'] = (i[-1] - i[0])
And for my df to then look like this:
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
However, I end up getting this:
num letter new_col
0 1 a 12
1 2 b 12
2 3 c 12
I would also really appreciate it if anyone could tell me what the equivalents of .diff() and .shift() are in numpy, as I tried using them the same way as on a pandas dataframe and just got error messages. This would be useful if I want to calculate the difference not just between the first and last numbers but somewhere in between.
Any help would be really appreciated, cheers.
Currently you are only performing the difference calculation for the very last sequence, since the assignment runs after the loop has finished. Use a list comprehension instead:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
b = [i[-1] - i[0] for i in a]
If the lengths mismatch, you need to pad the list with NaNs:
b = b + [np.nan] * (len(df) - len(b))
df['new_col'] = b
Might be better off doing this in a DataFrame if your array grows in size.
df1 = pd.DataFrame(a.T)
df['new_col'] = df1.iloc[-1] - df1.iloc[0]
print(df)
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
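Regarding the numpy equivalents asked about above: np.diff computes successive differences along an axis, and np.roll is the closest analogue of pandas' .shift(), with the caveat that it wraps values around instead of inserting NaN. A quick sketch:
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7],
              [20, 25, 30, 35, 40, 45, 50],
              [2, 4, 6, 8, 10, 12, 14]])

print(np.diff(a, axis=1))          # successive differences within each row
print(a - np.roll(a, 1, axis=1))   # like pandas .diff(), but wraps at the edge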

Chained Lookups in Pandas Dataframe

I have a very large pandas dataframe with two columns that I'd like to recursively look up.
Given input of the following dataframe:
NewID, OldID
1, 0
2, 1
3, 2
5, 4
7, 6
8, 7
9, 5
I'd like to generate the series OriginalId:
NewID, OldId, OriginalId
1, 0, 0
2, 1, 0
3, 2, 0
5, 4, 4
7, 6, 6
8, 7, 6
9, 5, 4
This can be trivially solved by iterating over the sorted data and, for each row, checking whether OldId points to an existing NewId; if so, OriginalId is set to that row's OriginalId.
This can be solved by iteratively merging and updating columns, using the following algorithm:
1. Merge OldId to NewId.
2. For any row that did not match, set OriginalId to OldId.
3. If a row did match, set OldId to the OldId of the matched row.
4. Repeat until all OriginalIds are filled in.
It feels like there should be a pandas-friendly way to do this via cumulative sums or similar.
Easy:
df.set_index('NewID', inplace=True)
# look each OldID up in the NewID index; rows whose OldID is not a NewID keep it
df['OriginalId'] = df['OldID'].map(df['OldID']).fillna(df['OldID']).astype(int)
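Note that a single lookup only follows one link, so chains longer than two (e.g. 3 -> 2 -> 1 -> 0) are not fully resolved. A minimal sketch that repeats the lookup until the values stop changing, assuming the NewID/OldID columns from the question:
import pandas as pd

df = pd.DataFrame({'NewID': [1, 2, 3, 5, 7, 8, 9],
                   'OldID': [0, 1, 2, 4, 6, 7, 5]}).set_index('NewID')

original = df['OldID'].copy()
while True:
    # follow one link: look each value up in the NewID index,
    # keeping values with no matching NewID unchanged
    step = original.map(df['OldID']).fillna(original).astype(int)
    if step.equals(original):
        break
    original = step
df['OriginalId'] = original
print(df)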
