Write to file from dictionary instead of pandas - python

I would like to write dictionaries to file in a different way.
Right now, I use pandas to convert the dictionaries to DataFrames, combine several DataFrames, and then write them to file (see the code below).
However, the pandas operations take a very long time, and I would like to do this more efficiently.
Is it possible to do the below more efficiently while retaining the structure of the output files (e.g. by writing from the dictionaries directly)?
import pandas as pd
labels = ["A", "B", "C"]
periods = [0, 1, 2]
header = ['key', 'scenario', 'metric', 'labels']
metrics_names = ["metric_balances", "metric_record"]
key = "key_x"
scenario = "base"
# The metrics are structured as dicts where the keys are `periods` and the values
# are arrays (where each array entry corresponds to one of the `labels`)
metric_balances = {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]}
metric_record = {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]}
# Combine all metrics into one output structure for key "x"
output_x = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
key = "key_y"
scenario = "base_2"
metric_balances = {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]}
metric_record = {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]}
# Combine all metrics into one output structure for key "y"
output_y = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
# Concatenate all output dataframes
output = pd.concat([output_x, output_y], names=header)
# Print results to a csv file
output.to_csv("test.csv", index=False)
Below are the respective outputs:
OUTPUT X
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
-----------------------------------
OUTPUT Y
0 1 2
key scenario metric labels
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
------------------------------
OUTPUT COMBINED
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
I was looking into writing the dictionaries row-wise, but I had difficulty merging the labels with the relevant arrays.
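One way to skip pandas entirely is a rough sketch like the one below: iterate over the metric dicts and write rows with the standard csv module. It reuses labels, periods and header from the code above; the write_metrics helper and the exact column layout are illustrative assumptions, not part of the original code (drop the first four entries of each row if you want the index=False layout instead).
import csv

def write_metrics(path, blocks):
    """Write (key, scenario, metric_name, metric_dict) blocks straight to csv."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # header row: the index-like columns followed by the periods
        writer.writerow(header + periods)
        for key, scenario, metric_name, metric in blocks:
            ordered_periods = sorted(metric)
            for i, label in enumerate(labels):
                # one row per label: the i-th entry of every period's array
                row = [key, scenario, metric_name, label]
                row += [metric[p][i] for p in ordered_periods]
                writer.writerow(row)

# collect one (key, scenario, name, dict) tuple per metric block, for every key
write_metrics("test.csv", [
    ("key_x", "base", "metric_balances", metric_balances),
    ("key_x", "base", "metric_record", metric_record),
])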

Related

Compare two dataframe and conditionally capture random data in Python

The main logic of my question involves comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).
Let's create two dummy dataframes.
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the full set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places, whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Those reading the question may ask how much random data should be added.
For this example, adding 3 non-visited places per user is enough.
This might not be the best solution, but it works. You have to take each user and then pick the checkinids that are not assigned to them:
# get all users
users = df1.user.unique()

for user in users:
    checkins = df1.loc[df1['user'] == user]
    # outer-merge against all checkins and keep the ones this user never visited
    df = (checkins.merge(df2, how='outer', indicator=True)
                  .loc[lambda x: x['_merge'] == 'right_only']
                  .sample(n=3))
    df['user'] = user
    df['count'] = 0
    df.pop('_merge')
    # DataFrame.append is gone in pandas 2.x, so use pd.concat instead
    df1 = pd.concat([df1, df], ignore_index=True)

# sort the dataframe by user
df1 = df1.sort_values(by=['user'])

# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]

print(df1)
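If the per-user loop becomes a bottleneck, a loop-free variant is possible. The sketch below cross-joins the users with every checkinid, keeps the pairs that were never visited, and samples 3 of them per user. It assumes pandas >= 1.2 (for how='cross') and >= 1.1 (for GroupBy.sample); the names pairs, flagged and negatives are illustrative, not from the answer above.
import pandas as pd

# every (user, checkinid) pair
pairs = df1[['user']].drop_duplicates().merge(df2, how='cross')

# mark which pairs were actually visited
flagged = pairs.merge(df1, on=['user', 'checkinid'], how='left', indicator=True)

# sample 3 unvisited checkins per user and label them count = 0
negatives = (flagged[flagged['_merge'] == 'left_only']
             .groupby('user', group_keys=False)
             .sample(n=3, random_state=0)
             .assign(count=0)[['user', 'checkinid', 'count']])

result = (pd.concat([df1, negatives], ignore_index=True)
          .sort_values(['user', 'count'], ascending=[True, False]))
print(result)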

Is there a way to dynamically perform this loop?

Do you know if there is a better way to perform this task without using a for loop?
Starting with the following dataset:
import pandas as pd
df = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                   'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
df['C'] = 0
df.loc[0, 'C'] = df.loc[0, 'B']
df['D'] = 0
df.loc[0, 'D'] = df.loc[0, 'C'] * 0.95
df['E'] = 0
df.loc[0, 'E'] = df.loc[0, 'C'] * 0.80
Now,
if the value in row 1, column A is greater than the value in row 0, column D:
    the value in row 1, column C will be equal to the value in row 1, column A * 2
    the value in row 1, column D will be equal to the value in row 1, column C * 0.95
    the value in row 1, column E will be equal to the value in row 1, column D * 0.8
elif the value in row 1, column A is less than the value in row 0, column E:
    the value in row 1, column C will be equal to the value in row 1, column A
    the value in row 1, column D will be equal to the value in row 1, column C * 0.95
    the value in row 1, column E will be equal to the value in row 1, column D * 0.8
else:
    the value in row 1, column C will be equal to the value in row 0, column C
    the value in row 1, column D will be equal to the value in row 1, column C * 0.95
    the value in row 1, column E will be equal to the value in row 1, column D * 0.8
As output, I would like to create a df like this:
df_out = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                       'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
                       'C': [100, 100, 100, 100, 200, 200, 150, 150, 150, 150],
                       'D': [95, 95, 95, 95, 190, 190, 190, 143, 143, 143],
                       'E': [80, 80, 80, 80, 160, 160, 160, 120, 120, 120]})
Considering that I have to iterate over more than 5000 rows for around 3000 possible scenarios, I'm looking for the fastest way to perform this task, and I've noticed that the for loop is extremely slow.
Thanks in advance, and apologies for the trivial question; I'm new to Python and trying to learn as much as possible.
Per our discussion in the comments, if you do the loop this way it's reasonably quick:
import pandas as pd

alist = [90, 85, 85, 85, 100, 170, 150, 130, 125, 125] * 500

# seed values for the first row
c = 100
d = 95
e = 80
clist = [c]
dlist = [d]
elist = [e]

for a in alist[1:]:
    if a > d:
        c_new = round(a * 1.5)
    elif a < e:
        c_new = a
    else:
        c_new = c
    c = c_new
    d = round(c_new * 0.95)
    e = round(d * 0.8)
    clist.append(c_new)
    dlist.append(d)
    elist.append(e)

df_out = pd.DataFrame({'A': alist, 'C': clist, 'D': dlist, 'E': elist})
print(df_out.head(10))
A C D E
0 90 100 95 80
1 85 100 95 76
2 85 100 95 76
3 85 100 95 76
4 100 150 142 114
5 170 255 242 194
6 150 150 142 114
7 130 150 142 114
8 125 150 142 114
9 125 150 142 114

Finding minimum variance based on combinations of binning in python

I am looking to use a loop to iterate through all combinations of binning a variable before doing a group by. Example data:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'age': [23, 54, 47, 38, 37, 21, 27, 72, 25, 36],
                   'score': [28, 38, 47, 27, 37, 26, 28, 48, 27, 47]})
df.head()
id age score
0 1 23 28
1 2 54 38
2 3 47 47
3 4 38 27
4 5 37 37
And then manually creating bins like so:
bins = [20,50,70,80]
labels = ['-'.join(map(str,(x,y))) for x, y in zip(bins[:-1], bins[1:])]
df["age_bin"] = pd.cut(df["age"], bins=bins, labels=labels)
Finally calculating the average variance for that bin combination:
df.groupby("age_bin").agg({'score':'var'}).mean()
How can I loop through all combinations of bins, with a minimum bin size of 10, but with no restriction on the number of bins, and assuming the bins do not have to be the same size?
e.g.
bins mean
0 [20, 50, 70, 80] 82.553571
1 [20, 70, 80] 74.611111
2 [20, 30, 60, 80] 35.058333
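One possible brute-force approach is sketched below: generate every subset of interior cut points with itertools.combinations and evaluate the mean within-bin score variance for each. It assumes the bin edges lie on a 10-year grid between 20 and 80 (which automatically keeps every bin at least 10 wide); lo, hi, step and the enumeration itself are assumptions, not from the question.
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'age': [23, 54, 47, 38, 37, 21, 27, 72, 25, 36],
                   'score': [28, 38, 47, 27, 37, 26, 28, 48, 27, 47]})

lo, hi, step = 20, 80, 10                  # assumed edge grid covering all ages
inner = list(range(lo + step, hi, step))   # candidate interior cut points: 30..70

results = []
for r in range(len(inner) + 1):
    for combo in combinations(inner, r):
        bins = [lo, *combo, hi]
        labels = ['-'.join(map(str, (x, y))) for x, y in zip(bins[:-1], bins[1:])]
        age_bin = pd.cut(df['age'], bins=bins, labels=labels)
        # mean of the per-bin score variances for this bin combination
        mean_var = df.groupby(age_bin)['score'].var().mean()
        results.append({'bins': bins, 'mean': mean_var})

out = pd.DataFrame(results).sort_values('mean').reset_index(drop=True)
print(out)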

Split on train and test separating by group

I have a sample data as follows:
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
                   "id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   "label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"]})
So my data look like this
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
I would like to split this data into two groups (train, test) based on the label distribution, given the number of test samples (e.g. 6 samples). My setting prefers to define the size of the test set as an integer representing the number of test samples rather than as a percentage. However, in my specific domain, any id MUST be allocated to ONLY one group. For example, if id 1 was assigned to the training set, other samples with id 1 cannot be assigned to the test set. So the expected output is 2 dataframes as follows:
Training set
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
Test set
x id label
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
Both the training set and the test set have the same class distribution (a:b is 4:2), and ids 1, 2 were assigned only to the training set while ids 3, 4, 5 were assigned only to the test set. I usually do this with sklearn's train_test_split, but I could not figure out how to apply it with such a condition. May I have your suggestions on how to handle this?
sklearn.model_selection has several options other than train_test_split, and one of them aims at exactly what you're after. In this case you could use GroupShuffleSplit, which, as mentioned in the docs, provides randomized train/test indices to split data according to a third-party provided group. You also have GroupKFold for these cases, which is very useful.
from sklearn.model_selection import GroupShuffleSplit

X = df.drop(columns='label')
y = df.label
You can now instantiate GroupShuffleSplit and proceed as you would with train_test_split, with the only difference being that you specify a group column, which will be used to split X and y so that the split respects the group values:
gs = GroupShuffleSplit(n_splits=2, test_size=.6, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X.id))
Now you can index the dataframe to create the train and test sets:
X_train = X.loc[train_ix]
y_train = y.loc[train_ix]
X_test = X.loc[test_ix]
y_test = y.loc[test_ix]
Giving:
print(X_train)
x id
4 50 2
5 60 2
8 90 4
9 100 4
10 110 5
11 120 5
And for the test set:
print(X_test)
x id
0 10 1
1 20 1
2 30 1
3 40 1
6 70 3
7 80 3
Adding to Yatu's brilliant answer, you can split your data using only pandas if you like, although it's better to use what was proposed in his answer.
import pandas as pd

df = pd.DataFrame(
    {
        "x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
        "id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"],
    }
)

TRAIN_TEST_SPLIT_PERC = 0.75
uniques = df["id"].unique()
sep = int(len(uniques) * TRAIN_TEST_SPLIT_PERC)
df = df.sample(frac=1).reset_index(drop=True)  # shuffle your data
train_ids, test_ids = uniques[:sep], uniques[sep:]
train_df, test_df = df[df.id.isin(train_ids)], df[df.id.isin(test_ids)]
print("\nTRAIN DATAFRAME\n", train_df)
print("\nTEST DATAFRAME\n", test_df)

Merge two dataframes based on interval overlap

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add the Info column to dataframe A based on whether the (Start-End) interval of A overlaps with an interval of B. In case an A interval overlaps with more than one B interval, the info corresponding to the shorter B interval should be added.
I have been looking around for how to manage this issue and I have found somewhat similar questions, but most of their answers use iterrows(), which is not viable in my case since I am dealing with huge dataframes.
I would like something like:
A.merge(B,on="overlapping_interval", how="left")
And then drop duplicates keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I found this question really interesting, as it suggests the possibility of solving this using the pandas Interval object, but after many attempts I have not managed to solve it.
Any ideas?
I would suggest writing a function and then applying it to the rows.
First, compute the delta (End - Start) in B for sorting purposes:
B['delta'] = B.End - B.Start
Then a function to get information:
def get_info(x):
    # fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # starts lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # start included, ends higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    # filter with the conditions and sort by delta
    _B = B[c0 | c1 | c2].sort_values('delta', ascending=True)
    # None if no B interval overlaps
    return None if len(_B) == 0 else _B.iloc[0].Info
Then you can apply this function to A:
A['info'] = A.apply(lambda x : get_info(x), axis='columns')
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note: instead of using pd.Interval, make your own conditions. The cx conditions are your interval definitions; change them to get the exact expected behaviour.
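For completeness, since the question mentions pd.Interval: a sketch of that route using IntervalIndex.overlaps is below. It still goes row by row over A (like the apply above), the closed='both' choice is an assumption made to match the expected output, and shortest_overlap_info is an illustrative helper, not part of the answer above.
import numpy as np
import pandas as pd

# intervals of B, closed on both ends so touching endpoints count as overlaps
b_intervals = pd.IntervalIndex.from_arrays(B['Start'], B['End'], closed='both')
b_lengths = (B['End'] - B['Start']).to_numpy()

def shortest_overlap_info(start, end):
    # which B intervals overlap [start, end]?
    mask = b_intervals.overlaps(pd.Interval(start, end, closed='both'))
    if not mask.any():
        return np.nan
    # among the overlapping B rows, keep the Info of the shortest interval
    idx = np.flatnonzero(mask)
    return B['Info'].iloc[idx[np.argmin(b_lengths[idx])]]

A['Info'] = [shortest_overlap_info(s, e) for s, e in zip(A['Start'], A['End'])]
print(A)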
