I need to convert some code from dictionaries to dataframes. How do I duplicate the functionality of a dictionary's .get()? What I have mostly works, but I can't figure out how to get the default value working. For example, in the below code, the output of the dataframe should include an index1 value for 234. (For various reasons, I cannot change the format of the incoming data.)
import pandas
def build_dataframe(data2):
    tuple_list = []
    data_dict = {}
    for source in sorted(data2.keys()):
        tuple_list.extend([(source, target) for target in sorted(data2[source])])
        data_dict.update({(source, target): data2[source][target] for target in sorted(data2[source])})
    multi_index = pandas.MultiIndex.from_tuples(tuple_list, names=["index1", "index2"])
    df = pandas.DataFrame(index=multi_index, columns=[0], data={0: data_dict})
    return df
def dataframe_get(df, index2, default_value=0):
    return df.loc(axis=0)[:, index2]
def dict_get(input_dict, key, default_value=0):
    return {index1: dictionary.get(key, default_value) for index1, dictionary in input_dict.items()}
data = {123: {6544: 44, 23423: 66, 12: 65}, 234: {725: 42, 7245: 62}}
df_data = build_dataframe(data)
print(df_data)
print(dict_get(data, 12, 999))
print(dataframe_get(df_data, 12, 999))
Result:
                 0
index1 index2
123    12       65
       6544     44
       23423    66
234    725      42
       7245     62
{234: 999, 123: 65}
                 0
index1 index2
123    12       65
EDIT: I got something:
def dataframe_get(df, index2, default_value=0):
    levels = df.index.levels[:-1] + [[index2]]
    new_index = pandas.MultiIndex.from_product(levels, names=["index1", "index2"])
    data = df.reindex(new_index)
    return data.loc(axis=0)[:, index2].fillna(default_value)
This is almost correct, but I need to drop the index2; I'm still working on this.
index1 index2
123    12       65.0
234    12      999.0
OK, got this working:
def dataframe_get(df, index2, default_value=0):
    index_length = len(df.index.levels)
    levels = df.index.levels[:-1] + [[index2]]
    new_index = pandas.MultiIndex.from_product(levels, names=["index1", "index2"])
    data = df.reindex(new_index)
    data.index = data.index.droplevel(index_length - 1)
    return data.fillna(default_value)
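For comparison, here is a minimal sketch of the same dict-style lookup on a Series with a two-level index. It is not the poster's method: the helper name multiindex_get and the use of xs() followed by reindex(fill_value=...) are assumptions made for illustration, and the data is the sample dictionary from the question.

```python
import pandas as pd

def multiindex_get(series, key, default=0):
    """Return one value per first-level label, like dict.get() on each inner dict."""
    outer = series.index.get_level_values(0).unique()
    if key not in series.index.get_level_values(1):
        # The key is absent everywhere: every outer label gets the default.
        return pd.Series(default, index=outer)
    # xs() selects the rows whose second level equals `key` and drops that level;
    # reindex() then adds the missing outer labels, filled with the default.
    return series.xs(key, level=1).reindex(outer, fill_value=default)

data = {123: {6544: 44, 23423: 66, 12: 65}, 234: {725: 42, 7245: 62}}
flat = {(outer, inner): value
        for outer, inner_dict in data.items()
        for inner, value in inner_dict.items()}
series = pd.Series(flat).sort_index()
series.index.names = ["index1", "index2"]

result = multiindex_get(series, 12, 999)
print(result)
```

Because the second level is dropped by xs(), the result is indexed by index1 only, which is the shape the dictionary version returns.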
I have a dataframe with sorted values:
import numpy as np
import pandas as pd
sub_run = pd.DataFrame({'Runoff':[45,10,5,26,30,23,35], 'ind':[3, 10, 25,43,53,60,93]})
I would like to start from the highest value in Runoff (45), drop all rows whose "ind" differs from its "ind" by less than 30 (the rows with Runoff 10 and 5), update the DataFrame, then go to the second highest value (35) and drop the rows whose "ind" differs by less than 30, then go to the third highest value (30) and drop 26 and 23, and so on.
I wrote the following code :
pre_ind = []
for (idx1, row1) in sub_run.iterrows():
    var = row1.ind
    pre_ind.append(np.array(var))
    for (idx2,row2) in sub_run.iterrows():
        if (row2.ind != var) and (row2.ind not in pre_ind):
            test = abs(row2.ind - var)
            print("test" , test)
            if test <= 30:
                sub_run = sub_run.drop(sub_run[sub_run.ind == row2.ind].index)
I expect to find as an output the values [45,35,30]. However I only find the first one.
Many thanks
Try this:
list_pre_max = []
while True:
    try:
        # Next-highest Runoff that has not been processed yet
        max_val = sub_run.Runoff.sort_values(ascending=False).iloc[len(list_pre_max)]
    except IndexError:
        # No rows left to process
        break
    max_ind = sub_run.loc[sub_run['Runoff'] == max_val, 'ind'].item()
    list_pre_max.append(max_val)
    dropped_indices = sub_run.loc[(abs(sub_run['ind']-max_ind) <= 30) & (sub_run['ind'] != max_ind) & (~sub_run.Runoff.isin(list_pre_max))].index
    sub_run.drop(index=dropped_indices, inplace=True)
Output:
>>> sub_run
   Runoff  ind
0      45    3
4      30   53
6      35   93
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
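A small sketch of what that warning means in practice. The two-row frame and its column names are made up for illustration; with mixed column dtypes each row handed out by iterrows() is a newly built Series, so assigning to it leaves the DataFrame untouched, while writing through .loc on the DataFrame does take effect.

```python
import pandas as pd

frame = pd.DataFrame({'Runoff': [45, 10], 'label': ['a', 'b']})

# Mixed dtypes: each row from iterrows() is a freshly built Series,
# so assigning to it does not touch `frame`.
for _, row in frame.iterrows():
    row['Runoff'] = 0
unchanged = frame['Runoff'].tolist()

# Writing through .loc on the DataFrame itself does take effect.
frame.loc[frame['Runoff'] < 20, 'Runoff'] = 0
updated = frame['Runoff'].tolist()

print(unchanged, updated)
```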
In your case, modifying sub_run has no immediate effect on the iteration.
Therefore, in the outer loop, after the iteration on (45, 3), the next row iterated is (35, 93), followed by (30, 53), (26, 43), (23, 60), (10, 10) and (5, 25). In the inner loop your modification does take effect, because every pass of the outer loop starts a new inner iteration over the current sub_run.
Here is my suggested code, inspired by bubble sort.
import pandas as pd
sub_run = pd.DataFrame({'Runoff': [45,10,5,26,30,23,35],
                        'ind': [3,10,25,43,53,60,93]})
sub_run = sub_run.sort_values(by=['Runoff'], ascending=False)
highestRow = 0
while highestRow < len(sub_run) - 1:
    cur_run = sub_run
    highestRunoffInd = cur_run.iloc[highestRow].ind
    for i in range(highestRow + 1, len(cur_run)):
        ind = cur_run.iloc[i].ind
        if abs(ind - highestRunoffInd) <= 30:
            sub_run = sub_run.drop(sub_run[sub_run.ind == ind].index)
    highestRow += 1
print(sub_run)
Output:
   Runoff  ind
0      45    3
6      35   93
4      30   53
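The same selection can also be sketched without dropping rows during iteration: walk the rows from highest to lowest Runoff and keep a row only if its "ind" is more than 30 away from every row already kept. The function name keep_separated_peaks and the min_gap parameter are illustrative assumptions, and the gap test uses the same "<= 30 is dropped" rule as the answers above.

```python
import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45, 10, 5, 26, 30, 23, 35],
                        'ind': [3, 10, 25, 43, 53, 60, 93]})

def keep_separated_peaks(df, min_gap=30):
    """Greedy selection: highest Runoff first, skipping rows too close to a kept one."""
    kept_labels = []
    kept_inds = []
    for label, row in df.sort_values('Runoff', ascending=False).iterrows():
        # Keep the row only if it is farther than min_gap from every kept row.
        if all(abs(row['ind'] - other) > min_gap for other in kept_inds):
            kept_labels.append(label)
            kept_inds.append(row['ind'])
    return df.loc[kept_labels]

peaks = keep_separated_peaks(sub_run)
print(peaks)
```

The original DataFrame is only read here, so there is no interaction between the loop and the data it iterates over.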
I have multiple dataframes with the same columns. Here is the list of all of them:
dfs = [df_14, df_15, df_16, df_17]
Every dataframe looks like this. For example, df_14:
id   Days
001  0
004  56
013  95
015  33
Next, df_15:
Id   Days
001  0
023  18
459  19
811  35
df_16:
Id   Days
111  93
114  56
232  0
df_17:
Id   Days
532  120
113  31
065  58
015  2
My code:
rows = [['532', 120],['113', 31], ['065', 58],['025', 2]]
for row in rows:
    df_14.loc[len(df_14)] = row
# and so on
The task is to build, for each month, one list with the ids of clients that have 30-60 days and a separate list with the ids of clients that have 60-100 days.
#The result should be like this:
14_1: ['004', '015']
14_2: ['013']
15_1: ['811']
I tried to use f-strings for this. Something like:
abrreviations = ['14', '15','16', '17']
c = ['_1', '_2']
# I wrote initializing loops like this
m_list=[]
for a in abrreviations:
    for cp in c:
        m_list.append(a+cp)
The idea is to use the abbreviations in the loops with an f-string or format, but I don't know how to do it. Can you suggest how, or offer another idea?
This can help you
import pandas as pd
data = {'df_jan' : [['001', 0],['004', 56], ['013', 95],['015', 33]],
        'df_feb' : [['001', 0],['023', 18], ['459', 19],['811', 35]],
        'df_mar' : [['111', 93],['114', 56], ['232', 0]],
        'df_apr' : [['532', 120],['113', 31], ['065', 58],['025', 2]]}
dfs = {}
for df in data:
    dfs[df] = pd.DataFrame(data[df], columns=['id', 'days'])
months = {}
for df in dfs:
    months[df.replace('df_', '') + '_30'] = dfs[df][(dfs[df].days >= 30) & (dfs[df].days <= 60)].id.to_list()
    months[df.replace('df_', '') + '_90'] = dfs[df][(dfs[df].days >= 90) & (dfs[df].days <= 120)].id.to_list()
months
{'jan_30': ['004', '015'],
'jan_90': ['013'],
'feb_30': ['811'],
'feb_90': [],
'mar_30': ['114'],
'mar_90': ['111'],
'apr_30': ['113', '065'],
'apr_90': ['532']}
In response to your comment:
I created the df inside the dictionary to simplify the creation of test data.
Your code can create the df in its own way ...
df_jan = ...
df_feb = ...
df_mar = ...
df_apr = ...
and to process them you create the dictionary ...
dfs = {
    'df_jan' : df_jan,
    'df_feb' : df_feb,
    'df_mar' : df_mar,
    'df_apr' : df_apr
}
run the loop
and you can assign results to your variables
and delete dictionaries
jan_30 = months['jan_30']
jan_90 = months['jan_90']
feb_30 = months['feb_30']
feb_90 = months['feb_90']
mar_30 = months['mar_30']
mar_90 = months['mar_90']
apr_30 = months['apr_30']
apr_90 = months['apr_90']
del dfs, months
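A sketch of the same idea using the key format from the question ('14_1', '14_2', ...), built with an f-string. The dictionary names frames and buckets are assumptions for illustration, only two of the four dataframes are reproduced, and the overlapping boundary at 60 days is resolved by assuming whole-day values with the second range starting at 61.

```python
import pandas as pd

# Dataframes keyed by their abbreviation instead of separate variables (assumed layout).
frames = {
    '14': pd.DataFrame({'id': ['001', '004', '013', '015'], 'Days': [0, 56, 95, 33]}),
    '15': pd.DataFrame({'id': ['001', '023', '459', '811'], 'Days': [0, 18, 19, 35]}),
}
# Suffix -> inclusive (low, high) range of days; 61 avoids counting 60 twice.
buckets = {'1': (30, 60), '2': (61, 100)}

id_lists = {}
for abbreviation, frame in frames.items():
    for suffix, (low, high) in buckets.items():
        in_range = (frame['Days'] >= low) & (frame['Days'] <= high)
        # The f-string builds keys such as '14_1' and '14_2'.
        id_lists[f'{abbreviation}_{suffix}'] = frame.loc[in_range, 'id'].tolist()

print(id_lists)
```

Keeping the lists in one dictionary avoids creating variable names dynamically; a single list is then read as id_lists['14_1'].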
# First, create a list containing all the dataframes
all_df = [df_jan, df_feb, df_mar, df_apr, df_may, df_jun, df_jul, df_aug, df_sep, df_oct, df_nov, df_dec]
# Create two lists for storing the id values in the 30-60 range and the 90-120 range
list_30, list_90 = [], []
# One nested for loop handles all the dataframes
for cur_df in all_df:
    # client_id avoids shadowing the built-in id()
    for client_id, days in zip(cur_df['Id'], cur_df['Days']):
        if 30 <= days <= 60:
            list_30.append(client_id)
        elif 90 <= days <= 120:
            list_90.append(client_id)
# Now list_30 and list_90 contain the corresponding id values in each range
Hope the answer helps :)
Since you didn't provide data I made a basic example and it worked for me so here is a single for-loop as you described:
import numpy as np
import pandas as pd
dfs = [df_jan, df_feb, df_mar, df_apr, df_may, df_jun, df_jul, df_aug, df_sep, df_oct, df_nov, df_dec]
df30 = []
df90 = []
dfsChained30 = []
dfsChained90 = []
for rowsForMonths, xForMonths in enumerate(dfs):
    # If January [don't consider chain];
    if rowsForMonths == 0:
        for dayN in range(dfs[rowsForMonths]):
            if dfs[rowsForMonths][dayN] in range(30, 61):
                df30.append(dfs[rowsForMonths][dayN])
            elif dfs[rowsForMonths][dayN] in range(90, 121):
                df90.append(dfs[rowsForMonths][dayN])
            else:
                pass
        dfsChained30.append(df30)
        dfsChained90.append(df90)
    # If not January [consider chain];
    else:
        for dayN in range(dfs[rowsForMonths]):
            if dfs[rowsForMonths][dayN] in range(30, 61) and dfs[rowsForMonths][dayN] not in set(dfsChained30):
                df30.append(dfs[rowsForMonths][dayN])
            elif dfs[rowsForMonths][dayN] in range(90, 121) and dfs[rowsForMonths][dayN] not in set(dfsChained90):
                df90.append(dfs[rowsForMonths][dayN])
            else:
                pass
        dfsChained30.append(df30)
        dfsChained90.append(df90)
I am trying to assign a value to a dataframe column based on which interval, defined by two columns of another dataframe, a point falls into:
intervals = pd.DataFrame(columns = ['From','To','Value'], data = [[0,100,'A'],[100,200,'B'],[200,500,'C']])
print('intervals\n',intervals,'\n')
points = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'X'],[125,'X'],[145,'X'],[345,'X']])
print('points\n',points,'\n')
DesiredResult = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'A'],[125,'B'],[145,'B'],[345,'C']])
print('DesiredResult\n',DesiredResult,'\n')
Many thanks
Let's use map. First, create a series indexed by a pd.IntervalIndex built with the from_arrays method:
intervals = intervals.set_index(pd.IntervalIndex.from_arrays(intervals['From'],
                                                             intervals['To']))['Value']
points['Value'] = points['Point'].map(intervals)
Output:
   Point Value
0     45     A
1    125     B
2    145     B
3    345     C
Another approach:
def calculate_value(x):
    return intervals.loc[(x >= intervals['From']) & (x < intervals['To']), 'Value'].squeeze()
desired_result = points.copy()
desired_result['Value'] = desired_result['Point'].apply(calculate_value)
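When the intervals are sorted and contiguous, as in the sample data, pd.cut is another option. This is a sketch under that assumption, not one of the approaches above; the name edges is made up for illustration. Note the boundary handling: right=False makes each bin include its left edge, like the apply version, whereas pd.IntervalIndex.from_arrays closes intervals on the right by default.

```python
import pandas as pd

intervals = pd.DataFrame(columns=['From', 'To', 'Value'],
                         data=[[0, 100, 'A'], [100, 200, 'B'], [200, 500, 'C']])
points = pd.DataFrame(columns=['Point', 'Value'],
                      data=[[45, 'X'], [125, 'X'], [145, 'X'], [345, 'X']])

# Bin edges: every 'From' plus the last 'To' (valid only for contiguous intervals).
edges = intervals['From'].tolist() + [intervals['To'].iloc[-1]]

# right=False gives the bins [0, 100), [100, 200), [200, 500).
points['Value'] = pd.cut(points['Point'], bins=edges,
                         labels=intervals['Value'].tolist(),
                         right=False).astype(str)
print(points)
```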
I have multiple dataframes similar to the one below:
df:
Name  Value1  Value2
A     98      57
B     267     962
C     43      423
D     612     34
I need to apply a function to the above dataframe that performs some calculations and outputs some variables.
def my_func():
    c001=[]
    for _, value in df.iterrows():
        var1 = value['Value1']
        var2 = value['Value1%']
        seg1 = value['Name']
        flag1 = 'over' if var1>0 else 'under'
        kpi = 'YYT'
        c001.append(f"{seg1} {kpi} {flag1} Plan by {human(var1)}({abs(var2)}%) ")
    c001[1]
How do I use this function on the input dataframe to print the value in variable c001[1]?
I hope I understood you correctly:
def my_func():
    c001=[]
    for _, value in df.iterrows():
        var1 = value['Value1']
        var2 = value['Value1%']
        seg1 = value['Name']
        flag1 = 'over' if var1>0 else 'under'
        kpi = 'YYT'
        c001.append(f"{seg1} {kpi} {flag1} Plan by {human(var1)}({abs(var2)}%) ")
    return c001[1]
print (my_func())
You can try to create "c001" as a column and then print it.
def my_func(value):
    var1 = value['Value1']
    var2 = value['Value1%']
    seg1 = value['Name']
    flag1 = 'over' if var1 > 0 else 'under'
    kpi = 'YYT'
    return f"{seg1} {kpi} {flag1} Plan by {human(var1)}({abs(var2)}%) "
df["c001"] = df.apply(my_func, axis=1)
print(df["c001"])
The result will look like:
0 A YYT over Plan by 98(57%)
1 B YYT over Plan by 267(962%)
2 C YYT over Plan by 43(423%)
3 D YYT over Plan by 612(34%)
Name: c001, dtype: object
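Since the question mentions several dataframes, here is a self-contained sketch in which the dataframe is passed in as a parameter instead of being read from a global. Two things are assumptions: human is a hypothetical stand-in, because the original helper is not shown, and the percentage is read from the Value2 column of the sample data, which is what the output above suggests.

```python
import pandas as pd

def human(number):
    # Hypothetical stand-in for the question's undefined `human` helper.
    return f"{int(number):,}"

def build_messages(frame, kpi='YYT'):
    """Return one formatted message per row of `frame`."""
    def describe(row):
        flag = 'over' if row['Value1'] > 0 else 'under'
        return (f"{row['Name']} {kpi} {flag} Plan by "
                f"{human(row['Value1'])}({abs(int(row['Value2']))}%)")
    return frame.apply(describe, axis=1).tolist()

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Value1': [98, 267, 43, 612],
                   'Value2': [57, 962, 423, 34]})

c001 = build_messages(df)
print(c001[1])
```

The same function can then be called once per dataframe, for example in a list comprehension over a list of frames.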
I have training data like the example below, with all the information in a single column. The data set has more than 300,000 rows.
id      features                                                          label
1       name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;     1
2       name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;    1
3       name=David;age=12;1:=High School;2:=Cricketer;native=america;     2
4       name=George;age=11;1:=High School;2:=Carpenter;married=yes        2
.
.
300000  name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No              3
Now I need to convert this training data into the following form:
id  name          age  1               2                Interest     married  Smoker
1   John Matthew  25   Post Graduate   Football Player  Nan          Nan      Nan
2   Mark clark    21   Under Graduate  Nan              Video Games  Nan      Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete.
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1,len(array)):
        var_split = array[i].split(";")
        try :
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k , v in count.most_common(5):
        key = k.replace(":","").replace(" ","_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns = np.append( train.columns.values, columns_add))
print (train.columns)
print (train.shape)
#Adding the values for the newly created problem
for row in train.itertuples():
    dummy_dic = {}
    new_dict={}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try :
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":","").replace(" ","_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k,v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there a function I can apply here to extract the features efficiently?
Create two DataFrames (in the first one all the features are the same for every data point and the second one is a modification of the first one introducing different features for some data points) meeting your criteria:
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
# Build the "features" column directly from a list of generated strings
df = pd.DataFrame({"features": ["name={0};age={1};feature1={2};feature2={3}"
                                .format(NAMES[np.random.randint(0, len(NAMES))],
                                        AGES[np.random.randint(0, len(AGES))],
                                        FEATURES1[np.random.randint(0, len(FEATURES1))],
                                        FEATURES2[np.random.randint(0, len(FEATURES2))])
                                for i in range(num)]})
df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]
print(df.head(20))
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
mod_df = df.copy()  # copy, so that df itself keeps the same keys in every row
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}"
    .format(NAMES[np.random.randint(0, len(NAMES))],
            AGES[np.random.randint(0, len(AGES))],
            INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in range(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}"
    .format(NAMES[np.random.randint(0, len(NAMES))],
            AGES[np.random.randint(0, len(AGES))],
            SMOKING[np.random.randint(0, len(SMOKING))]) for i in range(len(random_positions2))]
print(mod_df.head(20))
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    lista = y.split('=')
    value = lista[1]
    return value
def function(x):
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
# from_records is a constructor on the DataFrame class, so call it on pd.DataFrame
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print('Total time:', end - start)
Total time: 1.80923295021
Edit: The only thing you need to do is adjust the columns list to match your data.
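For the case where every row has the same keys in the same order, the split can also be sketched with the vectorized string methods instead of a Python-level function per row. This is an alternative technique, not the code above; the two sample rows and the variable names parts and values are made up for illustration.

```python
import pandas as pd

features = pd.Series([
    "name=John;age=25;feature1=Post Graduate;feature2=Driver",
    "name=Mark;age=21;feature1=High School;feature2=Cricketer",
])

# One column per "key=value" pair.
parts = features.str.split(';', expand=True)

# Keep only the text after the first '=' in every cell.
values = parts.apply(lambda col: col.str.split('=', n=1).str[1])
values.columns = ['name', 'age', '1', '2']
print(values)
```

As with Solution 1, the columns list has to match the number and order of the keys in the data.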
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]
def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)
Suppose your data is like this :
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
           'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
           "name=David;age=12;1:=High School;2:=Cricketer;",
           "name=George;age=11;1:=High School;2:=Carpenter;",
           'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start from this answer and replace every separator (name=, ;age=, ;1:=, ;2:=) with ; using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, every value in the features column will be a list of features, from which you can get each one by index.
Then I use four custom functions to get each attribute: name, age, grade and job.
Note: there may be a better way to do this using only one function.
def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
And finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
               age = df.apply(get_age, axis=1),
               grade = df.apply(get_grade, axis=1),
               job = df.apply(get_job, axis=1))
Hope this will be quick and fast
As far as I understand your code, the poor performance comes from the fact that you fill the dataframe element by element. It is better to create the whole dataframe at once from a list of dictionaries.
Let's recreate your input dataframe:
from io import StringIO
import pandas as pd
# Columns are separated by three or more spaces, which is what sep expects below
data = StringIO("""id   features   label
1   name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;   1
2   name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;   1
3   name=David;age=12;1:=High School;2:=Cricketer;   2
4   name=George;age=11;1:=High School;2:=Carpenter;   2""")
df = pd.read_table(data, sep=r'\s{3,}', engine='python')
we can check :
print(df)
   id                                           features  label
0   1  name=John Matthew;age=25;1.=Post Graduate;2.=F...      1
1   2  name=Mark clark;age=21;1.=Under Graduate;2.=Fo...      1
2   3     name=David;age=12;1:=High School;2:=Cricketer;      2
3   4    name=George;age=11;1:=High School;2:=Carpenter;      2
Now we can create the needed list of dictionaries with the following code:
feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe :
print(pd.DataFrame(feat))
               1.               2. age          name
0   Post Graduate  Football Player  25  John Matthew
1  Under Graduate  Football Player  21    Mark clark
2     High School        Cricketer  12         David
3     High School        Carpenter  11        George
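The list-of-dictionaries approach can be sketched so that it also copes with rows that have different keys or no trailing semicolon, as in the question's sample. The helper name parse_features is an assumption for illustration, and the key clean-up (dropping a trailing ':' or '.', lowercasing, replacing spaces) mirrors the normalisation in the question's own code rather than anything required by pandas.

```python
import pandas as pd

def parse_features(text):
    """Turn 'key=value;key=value;...' into a dict with normalised keys."""
    record = {}
    for part in text.split(';'):
        key, sep, value = part.partition('=')
        if not sep:
            # Empty piece (for example after a trailing ';') or no '=' at all.
            continue
        key = key.strip().rstrip(':.').replace(' ', '_').lower()
        record[key] = value.strip()
    return record

features = [
    "name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;",
    "name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;",
    "name=George;age=11;1:=High School;2:=Carpenter;married=yes",
]

# Building the frame once from all records avoids per-cell assignment;
# keys missing from a row become NaN in that row.
parsed = pd.DataFrame([parse_features(text) for text in features])
print(parsed)
```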