ngroups in groupby object not matching nunique() in same column - python

I have a DataFrame consisting of Ids and Serial Numbers. I want to create a new DataFrame with the Ids as index and the serial numbers as column values, with zero padding where the lengths are not equal.
My problem is that when I try to group by Id, the number of groups in my groupby("Id") object does not match the number of nunique() values for the same column, which is counterintuitive. For every smaller example DataFrame I tried, the numbers do match. Any suggestions why?
import pandas as pd
import numpy as np
from io import StringIO
from csv import writer

# data example (the real df has shape (188225, 2))
df = pd.DataFrame({'Id': ['1', '12', '123', '1234', '12345'],
                   'Serial': ['A', 'AB', 'ABC', 'ABC', 'ABC']},
                  dtype='category')

max_len = df.groupby('Id')['Serial'].size().max()  # find the max length
grouped = df.groupby('Id')

output = StringIO()
csv_writer = writer(output)
for key, vals in grouped.groups.items():
    # vector of serials padded with zeros up to max_len: [a, b, c, 0, 0, ...]
    csv_writer.writerow(np.append(np.append(key, vals.values),
                                  np.array([0] * (max_len - len(vals)))))
output.seek(0)  # go back to the start of the in-memory file
dfdiscrete = pd.read_csv(output,
                         header=None,
                         index_col=0,
                         dtype=str)
print("\nDiscrete Serials:", len(grouped.groups), "nunique ids", df['Id'].nunique())
I expect these two counts to be equal:
Shape discrete devices: (29840, 50) nunique citizen ids 29840,
but the actual output is
Shape discrete devices: (56674, 50) nunique citizen ids 29840
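In case it helps pin down the cause: a minimal sketch of one way this mismatch can arise with a categorical Id column, namely unobserved categories being counted as groups (this is an assumption about your real data, not something visible in the small example above; the observed keyword and its default depend on your pandas version):
import pandas as pd

# hypothetical illustration: 'Id' has three categories but only two observed values
s = pd.Series(['1', '12'], dtype=pd.CategoricalDtype(categories=['1', '12', '123']))
demo = pd.DataFrame({'Id': s, 'Serial': ['A', 'AB']})

print(demo['Id'].nunique())                           # 2 observed values
print(len(demo.groupby('Id').groups))                 # 3 groups (includes the unobserved '123')
print(len(demo.groupby('Id', observed=True).groups))  # 2 groups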

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two DataFrames, both containing numeric values only:
1) db_df - which comes from the DB
2) data - which is the user-uploaded DataFrame
I have to compare each column from db_df with the columns of data, find the most similar columns in data, and suggest them to the user as matches for the db column.
Both DataFrames have dimensions of 100 rows by 239 columns.
import time
import pandas as pd
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test and keep the p-value
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for all the columns in db_df against data, I need to select the top 3 columns from data for each column in db_df.
**Overall, the time taken for this is 14 seconds, which is very long. Is there any way to reduce it to less than 5 seconds?**
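One sketch of a direction, assuming the roughly 57,000 KS calls dominate the 14 seconds: run the two-sample tests (scipy.stats.ks_2samp) in a process pool and pick the top-3 suggestions with nlargest. The names db_df and data mirror the question; the parallel layout itself is an assumption, not a drop-in replacement.
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import pandas as pd
from scipy.stats import ks_2samp

def pvalues_for(db_col, db_df, data):
    # p-values of the two-sample KS test between one db_df column and every data column
    return [ks_2samp(db_df[db_col], data[j]).pvalue for j in data.columns]

def suggest(db_df, data):
    # call this under `if __name__ == "__main__":` on platforms that spawn worker processes
    with ProcessPoolExecutor() as pool:
        rows = list(pool.map(partial(pvalues_for, db_df=db_df, data=data), db_df.columns))
    pvals = pd.DataFrame(rows, index=db_df.columns, columns=data.columns).T
    # top 3 data columns with p-value above 0.05 for each db_df column
    return {col: pvals[col][pvals[col] > 0.05].nlargest(3).to_dict()
            for col in pvals.columns}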

Extracting only the percent value in a column in pandas

I have a column of strings, each containing several percentages, e.g. XX: (+2, 30%); (-5, 20%); (+17, 50%).
I need to extract the highest % value from each such string and do this for the whole column.
Any advice will be highly appreciated!
Thanks
In my understanding, each cell in column XX is a string that contains some percentages. I have included the small test DataFrame I used:
import pandas as pd
import re

df = pd.DataFrame({"XX": ["(+2, 30%), (-5, 20%), (+17, 50%)",
                          "(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
pattern = re.compile(r"([0-9\.]+)%")
# compare matches numerically; -1 is returned for cells with no percentage
df["XX"].apply(lambda x: max(pattern.findall(x), key=float, default=-1))
OUTPUT
0 50
1 70
This code returns the largest percentage value in a column of percent strings:
import pandas as pd
import numpy as np

data = [['2.3%', 1], ['5.3%', 3]]
data = pd.DataFrame(data)
first_column = data.iloc[:, 0]
percent_list = []
for val in first_column:
    percent_list.append(float(val[:-1]))
print(percent_list[np.argmax(percent_list)])
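For completeness, a vectorized sketch of the same idea with pandas string methods, using the test DataFrame from the first answer above (str.extractall pulls every number that precedes a % and the per-row maximum is kept):
import pandas as pd

df = pd.DataFrame({"XX": ["(+2, 30%), (-5, 20%), (+17, 50%)",
                          "(+2, 70%), (-5, 20%), (+17, 50%)", ""]})

# one row per match, indexed by (original row, match number)
matches = df["XX"].str.extractall(r"(\d+(?:\.\d+)?)%")[0].astype(float)

# maximum percentage per original row; rows with no percentage simply drop out
print(matches.groupby(level=0).max())
# 0    50.0
# 1    70.0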

Python most efficient way to dictionary mapping in pandas dataframe

I have a dictionary of dictionaries and each contains a mapping for each column of my dataframe.
My goal is to find the most efficient way to perform mapping for my dataframe with 1 row and 300 columns.
My dataframe is randomly sampled from range(mapping_size), and my dictionaries map values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).
I can see from the answer provided by jpp that map is possibly the most efficient way to go, but I am looking for something even faster than map. Can you think of anything? I am happy if the data structure of the input is something other than a pandas dataframe.
Here is the code for setting up the question and results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300          # number of columns
nrow = 1            # number of rows
mapping_size = 10   # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
    # create empty dictionary
    mapping_dict['col_' + str(idx1)] = {}
    for inx2 in range(mapping_size):
        # map each value in range(mapping_size) to random.randint(mapping_size+1, mapping_size*2)
        mapping_dict['col_' + str(idx1)][inx2 + 1] = random.randint(mapping_size + 1, mapping_size * 2)

# create a dataframe with values sampled from range(mapping_size)
d = {}
random.seed(123)
for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
df = pd.DataFrame(data=d)
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict)  # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])  # 181 ms
Just use pandas, without a Python-level loop for the iteration.
# runtime ~ 1 s (1000 rows)
# create a mapping Series with a MultiIndex
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()
# obj_dict
# col_0  1    10
#        2    14
#        3    11
# Length: 3000, dtype: int64

# convert df to the mapping Series' index; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict[idx]

# handle null values
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# transform to the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns
df_result
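For completeness, a NumPy lookup-table sketch under the same setup as the question (it relies on the assumption that the values are small non-negative integers, so each column's mapping fits in one row of an integer array; the identity fallback plays the role of the fillna(df[key]) variant above):
import numpy as np
import pandas as pd

# one lookup row per column; start from the identity so unmapped values pass through unchanged
lut = np.tile(np.arange(mapping_size + 1), (ncol, 1))
for j in range(ncol):
    for key, val in mapping_dict['col_' + str(j)].items():
        lut[j, key] = val

vals = df.to_numpy()                  # shape (nrow, ncol), values in range(mapping_size)
mapped = lut[np.arange(ncol), vals]   # broadcasted fancy indexing: column j is looked up in lut[j]
df_mapped = pd.DataFrame(mapped, columns=df.columns)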

Why does performance decrease with the size of the data frame?

I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60 month rolling window for each firm. I noticed that the performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of run time (24/7) for the first version, compared to 8-9 hours tops for the second.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime

def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''

    # configuration
    np.random.seed(22)
    num_groups = 50000          # number of distinct groups in the DF
    mean_group_length = 200     # how many records per group?
    cov_group_length = 0.10     # throw in some variability in the num records per group

    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length

    # final length of DF
    total_length = sum(group_lengths)

    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)

    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })

    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)

    return result_df

def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return

def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return

def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()

df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
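To test the sorting hypothesis mentioned above directly, one small extra run with the same helpers (a sketch; it just re-times both methods on a copy sorted by the group key) could be:
# sketch: re-time the same two methods on a copy that is sorted by the group key
df_sorted = df.sort_values(by='firm_id', ignore_index=True)
time_method(df_sorted, process_direct)
time_method(df_sorted, process_indirect)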

Associating units with Pandas DataFrame

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from io import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

# Create a Pandas DataFrame
obs = pd.read_csv(StringIO(x.strip()), sep=r",\s*")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data screws up the dtype for the columns. Instead of a float dtype, the presence of strings makes the dtype of the columns object, and the underlying objects, even the numbers, are strings. This breaks all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from io import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

content = StringIO(x.strip())

def read_csv(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_table(content, sep=r",\s*", header=None)
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs

obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
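Another option (a sketch, not necessarily the best approach): skip the units row at parse time so the dtypes stay numeric, and keep the units in a separate dict keyed by column name for later use in plot labels:
import pandas as pd
from io import StringIO

x = '''longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114'''

# units live in a plain dict keyed by column name
header, unit_row = x.splitlines()[:2]
units = dict(zip(header.split(','), unit_row.split(',')))

# skip the units row while parsing so the numeric dtypes survive
obs = pd.read_csv(StringIO(x), skiprows=[1])

print(obs.dtypes)            # longitude and latitude are both float64
print(units['longitude'])    # 'degrees_east', e.g. for plt.xlabel('longitude (degrees_east)')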
