I have a dataset of Google Play Store data. It has twelve features (one float, the rest objects), and I would like to manipulate one of them a bit so that I can convert it to numeric form. The feature I'm talking about is the Size column.
It is in string form, consisting of a number with the scale appended to it (e.g. '5M'). Checking through the rest of the feature, I discovered that aside from megabytes (M), there are also some entries in kilobytes (K) and some entries where the size is the string "Varies according to device".
So my ultimate plan to deal with this is to:
1. Strip the last character from all the entries under Size.
2. Convert the convertible entries to floats.
3. Rescale the K entries by dividing them by 1000, so that they are on the same megabyte scale as the M entries.
4. Replace the "Varies according to device" entries with the mean of the feature.
I know how to do 1, 2 and 4, but 3 is giving me trouble because I'm not sure how to differentiate the K entries from the M ones and divide only those specific entries by 1000. If all of them were M or all were K, there'd be no issue, as I've dealt with that before, but having to discriminate makes it trickier and I'm not sure what form the syntax should take (my attempts keep throwing errors).
By the way if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise if anything!
Any help would be greatly appreciated. Thank you!!
------------------------Edit------------------------
A minimal reproducible example of my attempt would be:
import pandas as pd

data = pd.read_csv("playstore-edited.csv",
                   index_col="App",
                   parse_dates=True,
                   infer_datetime_format=True)
x = data
var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i] == "k":
        ls.append(i)
x.Size.loc[ls,"Size"]=x.Size.loc[ls,"Size"]/1000
This throws the following error:
IndexingError: Too many indexers
I know the last part of the code is off, but I'm not sure how to express what I want.
As written in the comment: if you strip the final letter into a new column, you can then condition on that column for the division.
df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M','6K']})
df['Scale'] = df['Size'].str[-1]
df['Size'] = df['Size'].str[:-1].astype(int)
df.loc[df['Scale'] == 'K', 'Size'] = df.loc[df['Scale'] == 'K', 'Size'] / 1000
df = df.drop('Scale', axis=1)
df
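The same pattern extends to the "Varies according to device" rows as well; here is a minimal sketch (the third sample value is made up to stand in for the real column):
import pandas as pd

df = pd.DataFrame({'APP': ['A', 'B', 'C'],
                   'Size': ['5M', '6K', 'Varies according to device']})
df['Scale'] = df['Size'].str[-1]
# Keep only the numeric part; the placeholder rows become NaN
df['Size'] = pd.to_numeric(df['Size'].str[:-1], errors='coerce')
# Rescale the K rows to megabytes
df.loc[df['Scale'] == 'K', 'Size'] /= 1000
# Fill the placeholder rows with the mean of the known sizes
df['Size'] = df['Size'].fillna(df['Size'].mean())
df = df.drop('Scale', axis=1)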
Process the Size column using regex and then do your conversions (the numeric part has to be cast to float before dividing, and the mean is taken over the already-converted sizes):
import numpy as np
import pandas as pd

df = (
    df
    # extract the numeric part and cast it to float (non-numeric entries become NaN)
    .assign(New_Size = lambda x: pd.to_numeric(
        x['Size'].str.replace('[A-Za-z]+', '', regex=True), errors='coerce'))
    # extract the Scale part ('M', 'K', or the first word of the placeholder text)
    .assign(Scale = lambda x: x['Size'].str.extract('([A-Za-z]+)', expand=False))
    # convert KB to MB
    .assign(Size = lambda x: np.where(x['Scale'] == 'K', x['New_Size'] / 1000, x['New_Size']))
    # update converted rows to MB
    .assign(Scale = lambda x: np.where(x['Scale'] == 'K', 'M', x['Scale']))
    # replace rows that do not have a numeric value with the mean of the Size column
    .assign(Size = lambda x: np.where(x['Scale'] != 'M', x['Size'].mean(), x['Size']))
)
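For reference, here is what the two regex steps produce on their own for a few made-up values:
import pandas as pd

s = pd.Series(['19M', '201K', 'Varies according to device'])
print(s.str.replace('[A-Za-z]+', '', regex=True))    # numeric parts as strings; the placeholder row becomes only spaces
print(s.str.extract('([A-Za-z]+)', expand=False))    # first run of letters: 'M', 'K', 'Varies'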
I have created a function that parses through each column of a dataframe, shifts the data in that column up to the first real observation (shifting past the '-' placeholders), and stores that column in a dictionary. I then convert the dictionary back to a dataframe to get the appropriately shifted columns. The function works and takes about 10 seconds on a 12x3000 dataframe. However, when applied to a 12x25000 dataframe it is extremely slow. I feel like there is a much better way to approach this to increase the speed - perhaps even an argument of the shift function that I am missing. I appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in df_seasoned.columns:
        # Shift each column up by the number of '-' placeholders it contains
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
Try using this code with apply instead:
def create_seasoned_df(df_orig):
    # Shift every column up past its '-' placeholders in a single vectorized apply
    df_seasoned = df_orig.reset_index().apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)
    return df_seasoned
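A quick check with a made-up frame, padded at the top with '-' as in the question:
import pandas as pd

df = pd.DataFrame({
    'loan_a': ['-', '-', 1, 2, 3],
    'loan_b': ['-', 4, 5, 6, 7],
})
print(create_seasoned_df(df))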
I have a 10 GB csv file with 170,000,000 rows and 23 columns that I read into a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype={'tax_num': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_num'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax numbers are unique, I would recommend setting tax_num to the index and then indexing on that. As it stands, you call isin which is a linear operation. However fast your machine is, it can't do a linear search on 170 million records in a reasonable amount of time.
df.set_index('tax_num', inplace=True) # df = df.set_index('tax_num')
df['class'] = 'B'
df.loc[h, 'class'] = 'A'
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
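For reference, a minimal dask sketch of the same assignment (the file name is hypothetical; the column and dtype match the question):
import dask.dataframe as dd

# Read the csv lazily in partitions rather than as one in-memory frame
d = dd.read_csv('big_file.csv', dtype={'tax_num': str})

# Same vectorized logic as the pandas version, evaluated partition by partition
d['class'] = 'B'
d['class'] = d['class'].mask(d['tax_num'].isin(h), 'A')   # h is the question's list of ids

# Trigger the computation, e.g. by writing the result back out
d.to_csv('with_class-*.csv', index=False)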
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to be using it for membership testing. list objects have linear time membership testing, set objects have very optimized constant-time performance for membership testing. That is the lowest hanging fruit here. So use
h = set(h) # convert list to set
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
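For what it's worth, the whole thing can also be done in one vectorized pass with numpy.where, skipping the pre-fill and the loc overwrite (toy stand-ins below for d and h):
import numpy as np
import pandas as pd

# Toy stand-ins for the question's d and h
d = pd.DataFrame({'tax_num': ['1123787', '999', '3345634442']})
h = {'1123787', '3345634442'}

# One pass: 'A' where the id is in h, 'B' otherwise
d['class'] = np.where(d['tax_num'].isin(h), 'A', 'B')
print(d)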
I have the below data, which I store in a csv (df_seg_sample.csv). I have the column names in a list called cols_list.
df_data_sample:
df_data_sample = pd.DataFrame({
    'new_video': ['BASE', 'SHIVER', 'PREFER', 'BASE+', 'BASE+', 'EVAL', 'EVAL', 'PREFER', 'ECON', 'EVAL'],
    'ord_m1': [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    'rev_m1': [0, 0, 25.26, 0, 0, 9.91, 'NA', 0, 0, 0],
    'equip_m1': [0, 0, 0, 'NA', 24.9, 20, 76.71, 57.21, 0, 12.86],
    'oev_m1': [3.75, 8.81, 9.95, 9.8, 0, 0, 'NA', 10, 56.79, 30],
    'irev_m1': ['NA', 19.95, 0, 0, 4.95, 0, 0, 29.95, 'NA', 13.95]
})

attribute_dict = {
    'new_video': 'CAT',
    'ord_m1': 'NUM',
    'rev_m1': 'NUM',
    'equip_m1': 'NUM',
    'oev_m1': 'NUM',
    'irev_m1': 'NUM'
}
Then I read each column and do some data processing as below:
cols_list = df_data_sample.columns
# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv", index=False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")

# Create empty dataframe to hold final processed data for each income level.
df_final = pd.DataFrame()

# Read in each column, process, and write to a csv - using csv module
for column in cols_list:
    df_column = pd.read_csv('df_seg_sample.csv', usecols=[column], delimiter=',')
    if (attribute_dict[column] == 'CAT') & (df_column[column].unique().size <= 100):
        df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
        # Check and remove duplicate columns if any:
        df_target_attribute = df_target_attribute.loc[:, ~df_target_attribute.columns.duplicated()]
        for target_column in list(df_target_attribute.columns):
            # If variance of the dummy created is zero: append it to a list and print to log file.
            if np.var(df_target_attribute[[target_column]])[0] != 0:
                df_final[target_column] = df_target_attribute[[target_column]]
    elif attribute_dict[column] == 'NUM':
        # Let's impute with 0 for numeric variables:
        df_target_attribute = df_column
        df_target_attribute.fillna(value=0, inplace=True)
        df_final[column] = df_target_attribute
attribute_dict, defined above, is a dictionary mapping each variable name to its type ('CAT' or 'NUM').
However, this column-by-column operation takes a long time to run on a dataset of size (5 million rows x 3400 columns); currently the run time is approximately 12+ hours.
I want to reduce this as much as possible, and one of the ways I can think of is to do the processing for all NUM columns at once and then go column by column for the CAT variables.
However, I am neither sure how to write the Python code to achieve this, nor whether it will really speed up the process.
Can someone kindly help me out?
For numeric columns it is simple:
num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print (num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']
df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)
But the first part of the code is the performance problem, especially get_dummies called for 5 million rows:
df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
Unfortunately, processing get_dummies in chunks is a problem.
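One possible workaround, just a sketch and not part of the answer above: fix the category set up front, so that get_dummies on each chunk produces the same columns and the pieces can simply be concatenated:
import pandas as pd

# First pass: collect the full category set for the CAT column
categories = pd.read_csv('df_seg_sample.csv', usecols=['new_video'])['new_video'].unique()

# Second pass: dummy-encode chunk by chunk with a fixed category set
dummy_chunks = []
for chunk in pd.read_csv('df_seg_sample.csv', usecols=['new_video'], chunksize=100000):
    col = pd.Categorical(chunk['new_video'], categories=categories)
    dummy_chunks.append(pd.get_dummies(col, dummy_na=True, prefix='new_video'))

dummies = pd.concat(dummy_chunks, ignore_index=True)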
There are three things I would advise to speed up your computations:
1. Take a look at the pandas HDF5 capabilities. HDF is a binary file format for fast reading and writing of data to disk.
2. Read bigger chunks (several columns) of your csv file at once, depending on how much memory you have.
3. There are many pandas operations you can apply to every column at once. For example, nunique() gives the number of unique values per column, so you don't need unique().size. With these column-wise operations you can easily filter columns by selecting with a boolean vector, e.g. the line below keeps only the columns with more than 100 unique values:
df = df.loc[:, df.nunique() > 100]
Also this answer from the author of pandas on large data workflow might be interesting for you.
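A rough sketch of suggestion 2, reading a block of columns per pass instead of one column per read (the block size is a made-up number to tune to your memory; attribute_dict and the csv are the ones from the question):
import pandas as pd

all_cols = list(attribute_dict.keys())
block_size = 500   # hypothetical number of columns per read

processed = []
for start in range(0, len(all_cols), block_size):
    block = all_cols[start:start + block_size]
    # One read per block of columns instead of one read per column
    df_block = pd.read_csv('df_seg_sample.csv', usecols=block)
    # ... process the whole block here, e.g. fillna(0) for the NUM columns ...
    processed.append(df_block.fillna(0))

df_blocks = pd.concat(processed, axis=1)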
EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype=str, delimiter=';', error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x: dt.datetime.strptime(x, '%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function:
def format_nombre(df):
    length, width = df.shape
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            # Only touch entries whose type differs from the date column's entries
            if type(element) != type(df.iloc[1, 0]):
                a = df.iloc[i, j].replace(".", "")    # drop the thousands separator
                b = float(a.replace(",", "."))        # decimal comma -> decimal point
                df.iloc[i, j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    # Leave the datetime column untouched
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False so '.' is treated literally, not as a regex wildcard
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric (for example with errors='coerce') instead of astype(float).
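Depending on how clean the file is, it may also be possible to skip the conversion step entirely and let read_csv parse the numbers at load time; a sketch reusing the delimiter and date column from the EDIT above (the file name is a placeholder):
import pandas as pd

# thousands='.' and decimal=',' turn '1.700,35' into 1700.35 while reading
df = pd.read_csv('data.csv', sep=';', thousands='.', decimal=',',
                 parse_dates=['CALDAY'], dayfirst=True)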
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer/100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer + 1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the row's values to it, re-sort, and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """
    # Construct skeleton of DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0
    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert values into quantile_df at the non-NaN positions of this row
        quantile_df.iloc[i, np.flatnonzero(~row_is_nan)] = quantile_row
    return quantile_df
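For a quick check, the function can be exercised on a random frame like the one in the question (just a sketch, not the data used for the timing below):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])

quantile_df = quantiles_by_row(df)
print(quantile_df.tail())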
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
It's not quite clear yet, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
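Or, more compactly, both columns at once (a sketch on the question's 500x2 random frame):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])

# Cumulative sum of each column as a percentage of that column's total
cum_pct = 100.0 * df.cumsum() / df.sum()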
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
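The two Series can then be put back into a frame shaped like the desired output (using the a_percentile and b_percentile built above):
import pandas as pd

percentiles = pd.DataFrame({'a': a_percentile, 'b': b_percentile})
print(percentiles.tail())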