Select random data from a Python DataFrame based on column data distributions & conditions - python

I have 12 rows in my input data and need to select 4 random rows while keeping the column value distributions in focus at the time of random selection.
This is sample data; the original data contains millions of rows.
Input data Sample -
input_data = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
                           'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
                           'City': ['California','California','Chicago','Michigan','New York','Ohio','Michigan',
                                    'Michigan','Ohio','Florida','New York','Washington']})
Output Data Expectation -
output_data = pd.DataFrame({'Id': ['A','A','B','C'],
                            'Fruit': ['Apple','Mango','Apple','Orange'],
                            'City': ['California','Ohio','Michigan','New York']})
My random selection should consider the three parameters below -
The Id distribution: out of the 4 rows, 2 should be selected from A, 1 from B and 1 from C
The Fruit distribution: 2 rows for Apple, 1 for Mango and 1 for Orange
The selection should prioritize the higher-frequency Cities
I am aware of sampling the data using the pandas sample function and tried it, but it gives me an unbalanced selection -
input_data.sample(n = 4)
Any leads on how to approach the problem are really appreciated!

You are prescribing probabilities on single random variables three times: once on the Id, once on the Fruit, and once on the City, whereas you actually need to select an ordered 3-tuple (Id, Fruit, City), and you have restrictions on the possible combinations too. In general, this is not possible. I'll explain why not, so that you can modify your question to match your needs.
Forget for a moment how pandas helps you make random choices and let's understand the problem mathematically first. Let's simplify the problem to 2D: keep only the fruits (Apple, Mango, Orange) and two cities (Ohio, Florida). First, suppose you have every possible combination:
unique ID  Fruit   City
0          Apple   Ohio
1          Apple   Florida
2          Mango   Ohio
3          Mango   Florida
4          Orange  Ohio
5          Orange  Florida
Then you define the probability for the different categories independently via their frequency:
   Fruit   frequency  probability
0  Apple   5          0.5
1  Mango   2          0.2
2  Orange  3          0.3
   City     frequency  probability
0  Ohio     2          0.2
1  Florida  8          0.8
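In pandas terms, these per-category probabilities are simply normalized value counts; a minimal sketch, applied here to the original 12-row input_data rather than the simplified 2D example:
# relative frequency of each category = its prescribed 1D probability
fruit_probs = input_data['Fruit'].value_counts(normalize=True)
city_probs = input_data['City'].value_counts(normalize=True)
print(fruit_probs)
print(city_probs)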
Now you can represent your possible choices:
Each line in your list of possible choices is represented in the figure (its ID is written in the center of its cell, together with its probability). Selecting a line from the table corresponds to generating a point in this 2D (discrete) space. If the area of each cell determines the probability of choosing that pair, you get back the desired 1D probability distributions, hence this representation. It is of course an intuitive choice to generate a random point in 2D by generating one random number per dimension, but it is not a must. In this example the individual properties are independent: if your line's fruit is Apple, it has a 20% or 80% probability of being from Ohio or Florida, respectively, which is exactly the 1D distribution you prescribed for cities (e.g. P(Apple, Florida) = 0.5 · 0.8 = 0.4).
Now consider if you have an extra entry with unique ID 6 for (Orange, Florida). When you generate a point in the 2D space and it falls into the cell shared by 5 and 6, you have the freedom to choose either the 5th or the 6th line. This case occurs when you have repeated tuples. (If your full table with all 3 properties is considered, then you don't have repeated tuples.)
Now consider what happens if you keep the prescribed 1D probabilities but don't represent all the possibilities, e.g. by removing the entry (Apple, Florida) with ID 1. You can no longer generate points in cell 1, and this affects the 1D probabilities you prescribed. If you can resolve this by redistributing the removed 40% so that the individual category probabilities stay the ones you desire, then you can select lines with the property probabilities you want. This is the case in your table, because not every possibility is listed.
If you simply redistribute the probabilities, e.g. by scaling everything up by 100%/(100% - 40%), then the variables will no longer all be independent. E.g. if the fruit is Apple, then the city must be Ohio (instead of the 20%/80% probability share with Florida).
You mentioned that you have millions of rows. Maybe your complete table contains all possible combinations and you don't need to deal with this problem. You can also extend your table so that it contains all the possible combinations, and decide later how to interpret the result when a selected row was not contained in your original table.
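As an illustration of that last suggestion (a sketch under the stated assumptions, not part of the original answer), the full table of (Fruit, City) combinations can be built and weighted by the product of the 1D probabilities, then sampled:
import pandas as pd

# 1D probabilities taken from the original input_data
fruit_probs = input_data['Fruit'].value_counts(normalize=True)
city_probs = input_data['City'].value_counts(normalize=True)

# every possible (Fruit, City) pair, weighted by the product of the marginals
full = pd.MultiIndex.from_product([fruit_probs.index, city_probs.index],
                                  names=['Fruit', 'City']).to_frame(index=False)
full['weight'] = full['Fruit'].map(fruit_probs) * full['City'].map(city_probs)

# draw 4 combinations; pairs that never occur in input_data may be selected,
# which is exactly the interpretation issue discussed above
sample = full.sample(n=4, weights='weight')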

This doesn't include the 'City' column, but it's a start:
# the usual:
import pandas as pd
import numpy as np
from random import sample
# our df:
df = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
                   'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
                   'City': ['California','California','Chicago','Michigan','New York','Ohio','Michigan',
                            'Michigan','Ohio','Florida','New York','Washington']})
# the fun part:
def lets_check_it(df):
    # take 2 samples with 'A' Id, one sample with 'B', one sample with 'C', and put them all in a df:
    result = pd.concat([df[df['Id']=='A'].sample(1), df[df['Id']=='A'].sample(1),
                        df[df['Id']=='B'].sample(1), df[df['Id']=='C'].sample(1)])
    # if Apple or Orange are not in the result, keep on sampling:
    while ('Apple' not in result['Fruit'].value_counts().index.tolist()) | ('Orange' not in result['Fruit'].value_counts().index.tolist()):
        result = pd.concat([df[df['Id']=='A'].sample(1), df[df['Id']=='A'].sample(1),
                            df[df['Id']=='B'].sample(1), df[df['Id']=='C'].sample(1)])
    else:
        # Apple and Orange are both in the result; check that it's exactly 2 'Apple' and 1 'Orange', the result we want
        while (result['Fruit'].value_counts()['Apple'] != 2) | (result['Fruit'].value_counts()['Orange'] != 1):
            # if it's not the desired result, run the whole function again:
            return lets_check_it(df)
        else:
            # if it's the desired result, return it:
            return result
Not sure how this is going to play out time-wise with millions of rows.
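As a rough alternative sketch (not part of the answer above, and only a partial one since it does not enforce the Fruit counts), the Id quota from the question can be sampled per group while weighting each row by how common its City is, which addresses the "prioritize higher-frequency cities" requirement:
# how common each row's city is in the whole frame
city_freq = df['City'].map(df['City'].value_counts())

# rows required per Id, from the question: 2 from A, 1 from B, 1 from C
quota = {'A': 2, 'B': 1, 'C': 1}

result = pd.concat(
    df[df['Id'] == i].sample(n=k, weights=city_freq[df['Id'] == i])
    for i, k in quota.items()
)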

Related

how to automatically classify a list of numbers

Well, the context is: I have a list of wind speeds, say 100 measurements from 0 to 50 km/h, and I want to automate, after loading the CSV, the creation of lists grouped every 5 km/h, that is, the values from 0 to 5, the ones from 5 to 10... etc.
Let's go to the code:
wind = pd.read_csv("wind.csv")
df = pd.DataFrame(wind)
x = df["Value"]
d = sorted(pd.Series(x))
lst = [[] for i in range(0,(int(x.max())+1),5)]
This gives me a list of empty lists, i.e. if the winds go from 0 to 54 km/h it will create 11 empty lists.
Now, to classify I did this:
for i in range(0,len(lst),1):
    for e in range(0,55,5):
        for n in d:
            if n > e and n < (e+5):
                lst[i].append(n)
            else:
                continue
My objective is that when it reaches a number greater than 5, it jumps to the next level, that is, it adds 5 to the limits of the interval (e) and moves on to the next i to fill the second empty list in lst. I tried several variants because I imagine the loops must go in a specific order to give a good result. This code is just one example of several that I tried, but they all gave similar results: either all the lists were filled with all the numbers, or only the first list was filled with all the numbers.
Your title mentions classifying the numbers -- are you looking for a categorical output like calm | gentle breeze | strong breeze | moderate gale | etc.? If so, take a look at the second example on the pd.qcut docs.
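If a categorical label per measurement is indeed the goal, a tiny sketch along those lines with pd.cut (the bin edges and label names below are made up for illustration):
import pandas as pd

wind = pd.Series([3, 12, 25, 7, 41, 18])  # hypothetical wind speeds in km/h
labels = ['calm', 'gentle breeze', 'strong breeze', 'gale']
categories = pd.cut(wind, bins=[0, 10, 20, 30, 55], labels=labels, right=False)
print(categories)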
Since you're already using pandas, use pd.cut with an IntervalIndex (constructed with the pd.interval_range function) to get a Series of bins, and then groupby on that.
import pandas as pd
from math import ceil

BIN_WIDTH = 5

wind_velocity = (pd.read_csv("wind.csv")["Value"]).sort_values()
upper_bin_lim = BIN_WIDTH * ceil(wind_velocity.max() / BIN_WIDTH)

bins = pd.interval_range(
    start=0,
    end=upper_bin_lim,
    periods=upper_bin_lim // BIN_WIDTH,
    closed='left')

velocity_bins = pd.cut(wind_velocity, bins)
groups = wind_velocity.groupby(velocity_bins)

for name, group in groups:
    # TODO: use `name` (the bin) and `group` (the values in it) to do stuff
    pass
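If what is needed afterwards is simply the contents or size of each 5 km/h bin, the grouped object can be summarized directly (a small follow-up sketch):
bin_counts = groups.size()       # number of measurements per interval
bin_lists = groups.apply(list)   # the actual wind speeds in each interval, as lists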

How to visualize frequency of category values along time per IDs in Pandas, Python?

I have a Pandas DataFrame with IDs and categorical values (A, B, C) like this:
ID CAT
1 A
2 C
2 B
3 A
2 A
1 B
1 A
3 B
3 B
Actually, the rows represent a time sequence with records of categorical events by ID, so there is a temporal dimension, but the actual datetimes don't matter, only the relative sequence of events. Each ID has an identical number of sequential events in the whole DF.
I'd like to visualize the category value (event) sequences per user in a 2D matrix (like a heatmap) where rows represent IDs, columns represent time steps, and cells are colored by category value, like this:
ABA
CBA
ABB
This is supposed to be a 3*3 matrix with colored tiles instead of letters. The first row is ID 1 with its three consecutive events, and so on. How is this feasible in Python?
I guess what you want is a simple groupby with a list aggregation, which displays each unique ID with a list of its events in the given order. So just:
df.groupby('ID')['CAT'].agg(list)
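To get from there to the requested 2D matrix of colored tiles, one option (a sketch assuming matplotlib is available and using the column names from the question) is to number the events per ID, pivot to an ID-by-step matrix of category codes, and plot it:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 2, 1, 1, 3, 3],
                   'CAT': ['A', 'C', 'B', 'A', 'A', 'B', 'A', 'B', 'B']})

df['step'] = df.groupby('ID').cumcount()              # event order within each ID
df['code'] = df['CAT'].astype('category').cat.codes   # A=0, B=1, C=2

matrix = df.pivot(index='ID', columns='step', values='code')

plt.imshow(matrix, aspect='auto')                      # rows = IDs, columns = time steps
plt.xlabel('time step')
plt.ylabel('ID')
plt.show()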

How to check dataset for typos and replace them?

I have a question.
Is there a way to check whether there are typos in a specific column?
I have an Excel sheet which is read using pandas.
First, I need to make a unique list in Python based on the column name;
second, I need to replace the wrong values with the new values.
Working in a Jupyter notebook and doing this semi-manually might be the best way. One option could be to start by creating a list of correct spellings:
correct= ['terms','that','are','spelt','correctly']
and create a subset from your data frame which does not contain the values in that list.
df[~df['columnname'].str.startswith(tuple(correct))]
You will then know how many rows are affected. You can then count the number of different variations:
df['columnname'].value_counts()
and if reasonable, you could look at the unique values, and make them into a list:
listoftypos = list(df['columnname'].unique())
print(listoftypos)
and then create a dictionary again in a semi-manual way as:
typodict= {'terma':'term','thaaat':'that','arree':'are','speelt':'spelt','cooorrectly':'correct'}
then iterate over your original data frame, and if a row's value in the column is one of your known typos, replace it with the corrected value from the dictionary, something like this:
for index, row in df.iterrows():
    # if the cell value is a known typo, look up its corrected spelling
    if row['columnname'] in typodict:
        correctspelling = typodict[row['columnname']]
        df.at[index, 'columnname'] = correctspelling
A strong caveat here though: of course, this is an iterative, row-by-row approach, which will be slow if the dataframe is extremely large.
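For a very large frame, the same replacement can also be done without an explicit Python loop by letting pandas map the typo dictionary over the column (a small sketch reusing the typodict defined above):
# every key found in the column is swapped for its corrected value; other rows are untouched
df['columnname'] = df['columnname'].replace(typodict)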
Keep in mind that a generic spell check is a fairly tall order, but I believe this solution will fit your needs with the lowest chance of false matches:
Setup:
import difflib
import re
from itertools import permutations

import pandas as pd

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']
p_lst = list(permutations(cardinal_directions + regions, 2))
area = [''.join(i) for i in p_lst] + cardinal_directions + regions

df = pd.DataFrame({"ID": list(range(0, 9)),
                   "region": ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast',
                              'Central', 'South', 'International', 'Centrall']})
Initial DF:
ID  region
0   Midwest
1   Northwest
2   West
3   Northeast
4   East coast
5   Central
6   South
7   International
8   Centrall
Function:
def spell_check(my_str, name_bank):
    prcnt = []
    for y in name_bank:
        prcnt.append(difflib.SequenceMatcher(None, y, my_str.lower().strip()).ratio())
    return name_bank[prcnt.index(max(prcnt))]
Apply Function to DF:
df.region=df.region.apply(lambda x: spell_check(x, area))
Resultant DF:
ID  region
0   midwest
1   northwest
2   west
3   northeast
4   eastcoast
5   central
6   south
7   international
8   central
I hope this answers your question and good luck.

What's the best way to use fuzzywuzzy to compare each value of a column with all the values of a separate dataframe's column?

Having a really tough time with this one. Say I have two dataframes, one that has fruits and another one that has types of fruit candy. There's lots of other data in each of the dataframes. So it looks something like this:
fruit:
fruitId fruitName
0 1 banana
1 2 orange
2 3 apple
3 4 pear
4 5 lemon
candy:
candyId candyName fruitId
0 1 Orange Julius null
1 2 Bananarama null
2 3 Sour Lemon Drops null
3 4 Chocolate Bar null
4 5 Applicious null
I need to match the candyName with the proper fruit, and then put the corresponding fruitId into the fruitId column of the candy dataframe. Let's assume for my purposes that .contains doesn't work at all; there are too many creative spellings and outright misspellings in the candyName column.
I have tried to define a function that uses fuzzywuzzy, and then use that in .map, but I can't get the function to work. It needs to check each value of the first df to see if it's in the second, and then move onto the next value, etc. The functions I end up building keep wanting to do comparisons where they're either (a) in the same dataframe, or (b) in the same row.
I did find a solution to this, but it's ugly because it uses iterrows() which you're not supposed to use. Here it is:
import pandas as pd
from fuzzywuzzy import fuzz

candy_file = 'candy.csv'
fruit_file = 'fruits.csv'

candy = pd.read_csv(candy_file)
fruit = pd.read_csv(fruit_file)

dict = {}
for i, row1 in candy.iterrows():
    for j, row2 in fruit.iterrows():
        if fuzz.partial_ratio(row1['candyName'], row2['fruitName']) >= 80:
            dict[row1['candyName']] = row2['fruitId']

candy['fruitId'] = candy['candyName'].map(dict)
This takes forever. Like, 10 minutes to get through 500 rows. Is there a better way to do this? I've written like a hundred different code snippets out for faster functions without getting anywhere.
Thanks!
It's slow because you're currently working in O(N^2).
Rather than using iterrows, use dictionaries to iterate instead. This can be done with the following:
# convert each frame to a dictionary keyed by row index
candydict = candy.to_dict('index')
fruitdict = fruit.to_dict('index')

for k, v in candydict.items():
    for k2, v2 in fruitdict.items():
        # do the rest of your comparisons here, e.g. on v['candyName'] and v2['fruitName']
        pass
This should speed it up significantly.
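For completeness, here is one way the body of that comparison could look; note this sketch swaps in fuzzywuzzy's process.extractOne helper instead of writing out the inner loop, and assumes the column names from the question:
from fuzzywuzzy import fuzz, process

# map each known fruit name to its id
fruit_lookup = dict(zip(fruit['fruitName'], fruit['fruitId']))

def best_fruit_id(candy_name, threshold=80):
    match = process.extractOne(candy_name, fruit_lookup.keys(), scorer=fuzz.partial_ratio)
    if match is not None and match[1] >= threshold:
        return fruit_lookup[match[0]]
    return None

candy['fruitId'] = candy['candyName'].apply(best_fruit_id)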

Satisfying Cross tab constraints in Python by filling in Random Numbers

I need to modify a dataframe (actual data) to satisfy cross-tab constraints and generate a new dataframe, as described below:
In cross-tab 1 (attached pic and code), we have 2 tasks for John in Area A, 1 task for John in Area B, and so on. However, my desired distribution is as shown in cross-tab 2, i.e. John has 1 task in Area A, 4 tasks in Area B, etc. Thus I need to modify the original data depicted by cross-tab 1 to satisfy the row and column total constraints required in cross-tab 2, while the grand total should remain 18 as in both cross-tabs. The number filling may be random.
Another constraint is the average time, which should be, for example, 11 minutes for John (average of his 3 tasks), 7 minutes for William and 5 minutes for Richard (3 tasks).
Thus, the task is to modify the original dataframe so that it satisfies the row and column totals of cross-tab 2 and the average-time requirement. The final dataframe will have three columns (Person, AreaOfWork, Time) and will generate a crosstab similar to cross-tab 2, with the numbers filled in randomly.
(Cross-tab 2 - Required and Cross-tab 1 - Actual Data were attached as images; the equivalent data is given as code below.)
Actual Data:
import numpy as np
import pandas as pd

df = pd.DataFrame([['John','A',2,8],['John','B',1,9],['John','C',0,12],
                   ['William','A',1,14],['William','B',2,10],['William','C',2,9],
                   ['Richard','A',3,8],['Richard','B',4,7],['Richard','C',3,5]],
                  columns=['Person', 'AreaOfWork', 'Task', 'Time'])
1.1 Actual Cross-Tab:
pd.crosstab(df.AreaOfWork, df.Person, values=df.Task, aggfunc=np.sum, margins=True)
Required-Dataframe
df1 = pd.DataFrame([['John','A',1,10],['John','B',4,11],['John','C',3,12],
                    ['William','A',0,9],['William','B',1,7],['William','C',3,5],
                    ['Richard','A',2,5],['Richard','B',1,3],['Richard','C',3,8]],
                   columns=['Person', 'AreaOfWork', 'Task', 'Time'])
2.1 Required crosstab
pd.crosstab(df1.AreaOfWork, df1.Person, values=df1.Task, aggfunc=np.sum, margins=True)
