I am writing a dictionary that has to seperate a dataframe into multiple small dataframes based on a certain item that is repeated in the list calvo_massflows. If the items isn't repeated, it'll make a list in the dictionary. In the second for loop, the dictionary will add the index item from the df dataframe to one of the dictionary lists, if the key (l) and e are the same.
This is what I currently got:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import linregress
from scipy.optimize import curve_fit
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a":[1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
"b":[5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = {}
massflows = []
for i, e in enumerate(calvo_massflow):
if e not in massflows:
massflows.append(e)
dic[e] = []
for l in dic:
if e == l:
dic[e].append(pd.DataFrame([df.iloc[i]]))
The problem with the output is the fact each index is a seperate dataframe in thte dictionary. I would like to have all the dataframes combined. I tried doing something with pd.concat. But I didn't figure it out. Moreover, the chapters in the dictionary (if that's how you call them), are lists and I prefer them being dataframes. However, if I change my list to a dataframe like I done here:
dic3 = {}
massflows = []
for i, e in enumerate(calvo_massflow):
if e not in massflows:
massflows.append(e)
dic3[e] = pd.DataFrame([])
for l in dic3:
if e == l:
dic3[e].append(df.iloc[i])
I can't seem to add dataframes to the dataframes made by the dictionary.
My ideal scenario would be a dictionary with two dataframes. One having the key '1' and one being '2'. Both those dataframes, include all the information from the data frame df. And not how it is right now with separate dataframes for each index. Preferably the dataframes aren't in lists like they are now but it won't be a disaster.
Let me know if you guys can help me out or need more context!
IIUC you want to select the rows of df up to the length of calvo_massflow, group by calvo_massflow and convert to dict. This might look like this:
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a":[1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
"b":[5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = dict(iter(df.iloc[:len(calvo_massflow)]
.groupby(calvo_massflow)))
print(dic)
resulting in a dictionary with keys 1 and 2 containing two filtered DataFrames:
{1: a b
0 1 5
2 3 7
5 2 44
6 4 23,
2: a b
1 2 6
3 4 8
4 11 10}
Related
What I need:
I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:
import pandas as pd
>>d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least a number contained in a list at row i is also contained in a list at row j, then the two list are merged (without duplication). But the values could also be shared by more than two lists, in that case I want all lists that share at least a value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
The order of the rows of the output dataframe, nor the values inside a list is important.
What I tried:
I have found this answer, that shows how to tell if at least one item in list is contained in another list, e.g.
>>not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
Returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However I'm not sure how to put this two things together, how to create the func method. Also I don't know if this approach is even feasible since the resulting rows will probably be less than the ones of the original dataframe.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This is basically treating every number as a node, and whenever two number are in the same list then you connect them. Finally you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
This is not straightforward. Merging lists has many pitfalls.
One solid approach is to use a specialized library, for example networkx to use a graph approach. You can generate successive edges and find the connected components.
Here is your graph:
You can thus:
generate successive edges with add_edges_from
find the connected_components
craft a dictionary and map the first item of each list
groupby and merge the lists (you could use the connected components directly but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx
G = nx.Graph()
for l in df['col1']:
G.add_edges_from(zip(l, l[1:]))
groups = {k:v for v,l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x)))
)
output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
Seems more like a Python problem than pandas one, so here's one attempt that checks every after list, merges (and removes) if intersecting:
vals = d["col1"]
# while there are at least 1 more list after to process...
i = 0
while i < len(vals) - 1:
current = set(vals[i])
# for the next lists...
j = i + 1
while j < len(vals):
# any intersection?
# then update the current and delete the other
other = vals[j]
if current.intersection(other):
current.update(other)
del vals[j]
else:
# no intersection, so keep going for next lists
j += 1
# put back the updated current back, and move on
vals[i] = current
i += 1
at the end, vals is
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
if you don't want vals modified, can chain .copy() for it.
To add on mozway's answer. It wasn't clear from the question, but I also had rows with single-valued lists. This values aren't clearly added to the graph when calling add_edges_from(zip(l, l[1:]), since l[1:] is empty. I solved it adding a singular node to the graph when encountering emtpy l[1:] lists. I leave the solution in case anyone needs it.
import networkx as nx
import pandas as pd
d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df= pd.DataFrame(data=d)
G = nx.Graph()
for l in df['col1']:
if len(l[1:]) == 0:
G.add_node(l[0])
else:
G.add_edges_from(zip(l, l[1:]))
groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out= (df.groupby(df['col1'].str[0].map(groups), as_index=False)
.agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]
I have a dataframe that has duplicated time indices and I would like to get the mean across all for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked pandas documentation and read previous posts on Stackoverflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of how my data frame look like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t
v2
1
-
2
-
3
4.167
4
5
5
6.667
A rough proposal to concatenate 2 copies of the input frame in which values in 't' are replaced respectively by values of 't+1' and 't+2'. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
len = df.shape[0]
incr = pd.DataFrame({'id': [0]*len, 't': [1]*len, 'v1':[0]*len}) # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1] # Drop the days that have no full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling (2 Day), and then drop duplicates (keep the last observation).
I have a dataframe below:
import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')
I want to keep ALL the rows
if there is the same cluster but different id AND keep every row from that cluster
even if it is the same id since there was a different id AT LEAST once within that cluster.
The code I have been using to achieve this is the following below, BUT, the only problem
with this is it drops too many rows for what I am looking for.
df = (df.assign(counts=df.count(axis=1))
.sort_values(['id', 'counts'])
.drop_duplicates(['id','cluster'], keep='last')
.drop('counts', axis=1))
The output dataframe I am expecting that the code above does not do
would drop rows at
dataframe index 1, 5, 0, and 6 but leave dataframe indexes 2, 3, 4, 7, and 8. Essentially
resulting in what the code below produces:
df = df.loc[[2, 3, 4, 7, 8]]
I have looked at many deduplication pandas posts on stack overflow but have yet to find this
scenario. Any help would be greatly appreciated.
I think we can do this with a single boolean. using .groupby().nunique()
con1 = df.groupby('cluster')['id'].nunique() > 1
#of these we only want the True indexes.
cluster
2 False
3 True
6 False
7 False
8 True
df.loc[(df['cluster'].isin(con1[con1].index))]
id cluster
2 3 3
3 4 3
4 4 3
7 8 8
8 9 8
name_region
bahia [10, 11, 12, 1, 2, 3, 4]
distrito_federal [9, 10, 11, 12, 1, 2, 3, 4]
goias [9, 10, 11, 12, 1, 2, 3, 4]
maranhao [10, 11, 12, 1, 2, 3, 4]
mato_grosso [9, 10, 11, 12, 1, 2, 3, 4]
mato_grosso_do_sul [8, 9, 10, 11, 12, 1, 2, 3]
I have a pandas series above, obtained from a groupby operation. The 2nd column represents months of the year. How do I construct a superset of months i.e. [8, 9, 10, 11, 12, 1, 2, 3, 4] since that represents all possible months present in
the dataset
--NOTE:
I do want to preserve order
You can use the itertools recipe unique_everseen (which preserves order) like so:
>>> [i for i in unique_everseen([z for z in y['months'] for x,y in df.iterrows()])]
[9, 10, 11, 12, 1, 2, 3, 4]
Definition of unique_everseen:
import itertools as it
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.add
if key is None:
for element in it.ifilterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
I seem to have misinterpreted the data structure in the question, but as it might be useful for similar cases, I will keep this answer here for future reference.
You can use numpy's unique function.
import pandas as pd
import numpy as np
df = pd.DataFrame({"x": [1,3,5], "y": [3,4,5]})
print np.unique(df) # prints [1 3 4 5]
I don't know if there is a way to do this more cleanly in Pandas so if anyone else knows please answer... Looking at the types this seems like a time for folding over that column.
I didn't see a fold operation in pandas, so maybe just a for loop that accumulates.. i.e.
all_months = []
for row in df.iterrows():
months = row['months']
all_months += [e for e in months if not e in all_months]
on second thought.. would use set instead of complicated for comprehension
all_months = set()
for row in df.iterrows():
months = set(row['months'])
all_months = all_months.union(months)
hmm just saw the other guys answer, haven't tested it.. but it looks better! choose that one :). Posting this just in case it helps someone...
I have a very large pandas dataframe with two columns that I'd like to recursively lookup.
Given input of the following dataframe:
NewID, OldID
1, 0
2, 1
3, 2
5, 4
7, 6
8, 7
9, 5
I'd like to generate the series OriginalId:
NewID, OldId, OriginalId
1, 0, 0
2, 1, 0
3, 2, 0
5, 4, 4
7, 6, 6
8, 7, 6
9, 5, 4
This can be trivially solved by iterating over the sorted data and for each row, checking if OldId points to an existing NewId and if so, setting OriginalId to OriginalId for that row.
This can be solved by iteratively merging and updating columns, by the following algorithm:
Merge OldId to NewId.
For any one that did not match, set OriginalId to OldId.
If they did match, set OldId to OldId for the matched column.
Repeat until OriginalIds are all filled in.
Feels like there should be a pandas friendly way to do this via cumulative sums or similar.
Easy:
df.set_index('NewID', inplace=True)
df.loc[:, 'OriginalId'] = df.loc[df['OldId'], 'OldID'].fillna(df['OldId'])