Transforming pandas Dataframe into dictionary via function taking column inputs - python

I have the following pandas Dataframe:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N random draws between front and back, whereby N is equal to the value in amount. Parse this into a dictionary.
To do this on an row-by-row basis is easy for me to understand:
e.g. row 1
import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)
e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
return np.random.choice(np.arange(front, back+1), amount)
df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
"filename3": 212143, 239725, 240959, 197359, 276948, 3199],
"filename4": [100, 83, 15]}

We can pass axis=1 to operate over rows when using apply.
We then need to tell what columns to use and we return a list.
We then either perform some form of groupby or we could use defaultdict as shown below:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
import numpy as np
import pandas as pd
def func(x):
return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()
df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
Alt2 would be to use (and by some small timings I ran it looks like it runs as fast):
from collections import defaultdict
d = defaultdict(list)
for k,v in df1.set_index('file')['new_column'].items():
d[k].extend(v)

Related

count how many times a record appears in a pandas dataframe and create a new feature with this counter

I have this two dataframes dt_t and dt_u. I want to be able to count how many times a record in the text feature appears and I want to create a new feature in df_u where I associate to each id the counter. So id_u = 1 and id_u = 2 both will have counter = 3 since hello appears 3 times in df_t and both published a post with "hello" in the text.
import pandas as pd
import numpy as np
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3, 4], 'id_u': [1, 1, 3, 2, 2], 'text': ["hello", "hello", "friend", "hello", "my"]})
print(df_t)
df_u = pd.DataFrame({'id_u': [1, 2, 3]})
print()
print(df_u)
df_u_new = pd.DataFrame({'id_u': [1, 2, 3], 'counter': [3, 3, 1]})
print()
print(df_u_new)
The code I wrote for the moment is this, but this is very slow and also I have a very huge dataset so it is impossible to do.
user_counter_dict = {}
tmp = dict(df_t["text"].value_counts())
# to speedup the process we set as index the text column
df_t.set_index(["text"], inplace=True)
for i, (k, v) in enumerate(tmp.items()):
row = (k, v)
text = row[0]
counter = row[1]
#this is slow and take much of the time
uniques_id = df_.loc[tweet]["id_u"].unique()
for elem in uniques_id:
value = user_counter_dict.setdefault(str(elem), counter)
if value < counter:
user_counter_dict[str(elem)] = counter
# and now I will put the date on the dict on a new column in df_u
Is there a very fast way to compute this?
You can do:
df_u_new = df_t.assign(counter=df_t["text"].map(df_t["text"].value_counts()))[
["id_u", "counter"]
].groupby("id_u", as_index=False).max()
Get the value_counts of text and groupby id_u and get the maximum value which is what you were trying to get IIUC.
print(df_u_new)
id_u counter
0 1 3
1 2 3
2 3 1

How to speed up nested loop and add condition?

I am trying to speed up my nested loop it currently takes 15 mins for 100k customers.
I am also having trouble adding an additional condition that only multiplies states (A,B,C) by lookup2 val, else multiplies by 1.
customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
"state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
"cust_amt": [1000,300, 500, 200, 400, 600, 200, 300],
"year":[3, 3, 4, 3, 4, 2, 2, 4],
"group":[10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A','B','C','D','E']
# All lookups should be dataframes with the year and/or group and the value like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group':g, 'lookup_val': 0.1, 'year':range(1, 11)}
for g in customer_data['group'].unique())]).explode('year')
multi_data = np.arange(250).reshape(10,5,5)
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
categories=state_list,
ordered=True).codes
# Set index on lookups.
for i in range(len(lookups)):
if 'group' in lookups[i].columns:
lookups[i] = lookups[i].set_index(['year', 'group'])
else:
lookups[i] = lookups[i].set_index(['year'])
calculation:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
for year in range(start, len(multi_data)+1):
if year == start:
results[customer] = [[amount * multi_data[year-1, state, :]]]
else:
results[customer].append([results[customer][-1][-1] # multi_data[year-1]])
for lookup in lookups:
if isinstance(lookup.index, pd.MultiIndex):
value = lookup.loc[(year, group)].iat[0]
else:
value = lookup.loc[year].iat[0]
results[customer][-1].append(value * results[customer][-1][-1])
example of expected output:
{1: [[array([55000, 56000, 57000, 58000, 59000]),
array([5500., 5600., 5700., 5800., 5900.]),
array([550., 560., 570., 5800., 5900.])],...
You could use multiprocessing if you have more than one CPU.
from multiprocessing import Pool
def get_customer_data(data_tuple) -> dict:
results = {}
customer, state, amount, start, group = data_tuple
for year in range(start, len(multi_data)+1):
if year == start:
results[customer] = [[amount * multi_data[year-1, state, :]]]
else:
results[customer].append([results[customer][-1][-1] # multi_data[year-1]])
for lookup in lookups:
if isinstance(lookup.index, pd.MultiIndex):
value = lookup.loc[(year, group)].iat[0]
else:
value = lookup.loc[year].iat[0]
results[customer][-1].append(value * results[customer][-1][-1])
return results
p = Pool(mp.cpu_count())
# Pool.map() takes a function and an iterable like a list or generator
results_list = p.map(get_customer_data, [data_tuple for data_tuple in customer_data.itertuples(name=None, index=False)] )
# results is a list of dict()
results_dict = {k:v for x in results_list for k,v in x.items()}
p.close()
Glad to see you posting this! As promised, my thoughts:
With Pandas works with columns very well. What you need to look to do is remove the need for loops as much as possible (In your case I would say get rid of the main loop you have then keep the year and lookups loop).
To do this, forget about the results{} variable for now. You want to do the calculations directly on the DataFrame. For example your first calculation would become something like:
customer_data['meaningful_column_name'] = [[amount * multi_data[customer_data['year']-1, customer_data['state'], :]]]
For your lookups loop you just have to be aware that the if statement will be looking at entire columns.
Finally, as it seems you want to have your data in a list of arrays you will need to do some formatting to extract the data from a DataFrame structure.
I hope that makes some sense

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values for the marks column are string. I need to convert them to a list of numbers so that I can iterate over them and use them as index number for another dataframe. How can I convert them from string to a list? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column" but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
I presume that you want to create NEW dataframe, since the number of items is differnet from number of rows. I suggest the following:
#source data
df = pd.DataFrame({'id':[5155, 2156, 7886],
'marks':['1,2,3,,,,,,,,','8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']
# create dictionary from df:
dd = {row[0]:np.fromstring(row[1], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])

How to merge Dynamic Dataframes

I'm looking for help to add two dynamically generated dataframes.
Both DataFrames have a column computed on input from an intslider ipywidget.
the third Dataframe should update dynamically on changes of any of above Dataframes
import pandas as pd
from ipywidgets import interact
#interact(x=(0,1000,10))
def df_draw_one(x):
data = {"A":[1,2,3,4,5]}
df_one = pd.DataFrame(data)
df_one['B'] = df_one['A']*x
print(df_one)
#interact(x=(0,1000,10))
def df_draw_two(x):
data = {"A":[6,7,8,9,10]}
df_two = pd.DataFrame(data)
df_two['B'] = df_two['A']*x
print(df_two)
df_res = df_one+df_two
I understand with the current code, df_one and two are local and hence result in:
NameError: name 'df_one' is not defined
but I'm at loss on how to make them accessible.
Any pointer would be appreciated
You can have your functions return the two dataframe adding a return statement.
import pandas as pd
from ipywidgets import interact
#interact(x=(0, 1000, 10))
def df_draw_one(x):
data = {"A": [1, 2, 3, 4, 5]}
df_one = pd.DataFrame(data)
df_one['B'] = df_one['A'] * x
print(df_one)
return df_one
#interact(x=(0, 1000, 10))
def df_draw_two(x):
data = {"A": [6, 7, 8, 9, 10]}
df_two = pd.DataFrame(data)
df_two['B'] = df_two['A'] * x
print(df_two)
return df_two
df_one = df_draw_one(1)
df_two = df_draw_two(1)
df_res = df_one + df_two
print(df_res)
Another way is to have df_one and df_two as global variables, but it's dirty and not really necessary.
Update
One idea could be to have both widget generated in the same function, then everything becomes more easy to handle.
import pandas as pd
from ipywidgets import interact
#interact()
def df_draw_one(x=(0, 1000, 10), y=(0, 1000, 10)):
data = {"A": [1, 2, 3, 4, 5]}
df_one = pd.DataFrame(data)
df_one['B'] = df_one['A'] * x
data2 = {"A": [6, 7, 8, 9, 10]}
df_two = pd.DataFrame(data2)
df_two['B'] = df_two['A'] * y
display(df_one)
display(df_two)
df_res = df_one + df_two
display(df_res)
Here my result:

How to apply a function using two Series as input with the output being a DataFrame of the function result of each combination of arguments?

I have two Series, each containing variables that I want to use in a function. I want to apply the function for each combination of variables with the resulting output being a DataFrame of the calculated values, the index will be the index from one Series and the columns will be the index of the other Series.
I have tried searching for an answer to a similar problem - I'm sure there's one out there but I'm not sure how to describe it for search engines.
I've solved the problem by creating a function using for loops, so you can understand the logic. I want to know if there is a more efficient operation to do this without using for loops.
From what I've read, I'm imagining some combination of a list comprehension with zipped columns to calculate the values, which is then reshaped into the DataFrame but I can't solve it this way.
Here's the code to reproduce the problem and current solution.
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
# Here is an unused function as an example
myfunc = lambda x, y: x * (1 + 1/y)
def func1(values, bands):
# Initialise empty DataFrame
df = pd.DataFrame(index=bands.index,
columns=values.index)
for month, month_val in values.iteritems():
for band, band_val in bands.iteritems():
df.at[band, month] = band_val * (1/month_val - 1)
return df
outcome = func1(values, bands)
You could use numpy.outer for this:
import numpy as np
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
outcome = pd.DataFrame(np.outer(bands, ((1 / values) - 1)),
index=bands.index,
columns=values.index)
[out]
Jan Feb Mar Apr
A 0.0 -0.098039 -0.238095 -0.535714
B 0.0 -0.333333 -0.809524 -1.821429
C 0.0 -0.176471 -0.428571 -0.964286
D 0.0 -0.666667 -1.619048 -3.642857
As a function:
def myFunc(ser1, ser2):
result = pd.DataFrame(np.outer(ser1, ((1 / ser2) - 1)),
index=ser1.index,
columns=ser2.index)
return result
myFunc(bands, values)

Categories