Create random.randint with condition in a group by? - python

I have a column called cars and want to create another called persons using random.randint(), which I have:
dat['persons'] = np.random.randint(1, 5, len(dat))
This is so I can store the number of persons who use each car, but I'd like to know how to add a condition so that, for example, only numbers from 4 to 9 are generated for the suv category.
cars | persons
suv 4
sedan 2
truck 2
suv 1
suv 5

You can create a boolean mask for your series, where matching rows are True and everything else is False. You can then assign to the matching rows by using loc[] to select them, generating just the number of values needed for those selected rows (note that np.random.randint's upper bound is exclusive, so use 10 as the upper bound if 9 should be a possible value):
m = dat['cars'] == 'suv'
dat.loc[m, 'persons'] = np.random.randint(4, 9, m.sum())
You could also use apply on the cars series to create the new column, creating a new random value in each call:
dat['persons'] = dat.cars.apply(
    lambda c: random.randint(4, 9) if c == 'suv' else random.randint(1, 5))
But this has to make a separate function call for each row. Using a mask will be more efficient.

Option 1
So, you're generating random numbers between 1 and 5, whereas numbers in the SUV category should be between 4 and 9. That means you can generate the random numbers as before and then add 4 to those belonging to the SUV category (keep in mind that np.random.randint excludes the upper bound, so adjust the bounds and offset if you need the full 4-9 range).
df = df.assign(persons=np.random.randint(1,5, len(df)))
df.loc[df.cars == 'suv', 'persons'] += 4
df
cars persons
0 suv 7
1 sedan 3
2 truck 1
3 suv 8
4 suv 8
Option 2
Another alternative would be using np.where -
df.persons = np.where(df.cars == 'suv',
                      np.random.randint(5, 9, len(df)),
                      np.random.randint(1, 5, len(df)))
df
cars persons
0 suv 8
1 sedan 1
2 truck 2
3 suv 5
4 suv 6

There may be a way to do this with something like a groupby that's more clever than I am, but my approach would be to build a function and apply it to your cars column. This is pretty flexible - it will be easy to build in more complicated logic if you want something different for each car:
def get_persons(car):
    if car == 'suv':
        return np.random.randint(4, 9)
    else:
        return np.random.randint(1, 5)

dat['persons'] = dat['cars'].apply(get_persons)
or, in a slicker but less flexible way:
dat['persons'] = dat['cars'].apply(lambda car: np.random.randint(4, 9) if car == 'suv' else np.random.randint(1, 5))

I had a similar problem. I'll describe my approach in general terms because the application may vary. For smaller frames it won't matter much, so the methods above should work, but for larger frames like mine (i.e. hundreds of thousands to millions of rows) I would do this:
1. Sort dat by 'cars'.
2. Get the unique list of cars.
3. Create a temporary list for the random numbers.
4. Loop over the list of cars, populating the temporary list of random numbers and extending a new list with each temp list.
5. Assign the new list to the 'persons' column.
6. If order matters, keep the original index and re-sort by it afterwards.
A rough sketch of these steps is below.
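This is a minimal sketch of those steps, assuming (as in the question) that 'suv' draws from 4-9 and every other car from the default 1-4 range; the range mapping is purely illustrative:
import numpy as np

ranges = {'suv': (4, 10)}   # np.random.randint excludes the upper bound
default_range = (1, 5)

dat = dat.sort_values('cars')                       # 1. sort by car type
persons = []
for car in dat['cars'].unique():                    # 2. + 4. loop over the unique car types
    n = (dat['cars'] == car).sum()
    lo, hi = ranges.get(car, default_range)
    persons.extend(np.random.randint(lo, hi, n))    # 3. temporary random numbers per car
dat['persons'] = persons                            # 5. assign the new column (positional)
dat = dat.sort_index()                              # 6. restore the original row order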

Related

How to create a DataFrame from one Series, when it is not as simple as transposing the object?

I've seen many similar questions here, but none of them applies to the case I need to solve. I have a products Series in which the names of the "future" columns end with the string [edit] and are mixed in with the values that are going to be joined under them. Something like this:
Index Values
0 Soda [edit]
1 Coke
2 Sprite
3 Ice Cream [edit]
4 Nestle
5 Snacks [edit]
6 Lays
7 Act II
8 Nachos
I need to turn this into a DataFrame, to get something like:
Soda Ice Cream Snacks
0 Coke Nestle Lays
1 Sprite NaN Act II
2 NaN NaN Nachos
I made a Series called cols_index, which stores the rows that will become columns, keeping their index from the first series:
Index Values
0 Soda [edit]
3 Ice Cream [edit]
5 Snacks [edit]
However, from here I don't know how to pass the values to the columns. As I'm new to pandas, I thought of iterating with a for loop, generating ranges that refer to the elements' indexes ([1,2], [4], [6:8]), but that wouldn't be a pandorable way to do things.
How can I do this? Thanks in advance.
=========================================================
EDIT: I solved it; here's how I did it.
After reviewing the problem with a colleague, we concluded that there's no pandorable way to do it, so I had to work with the data as a list and use for and if loops:
products = pd.read_csv("products_file.txt", delimiter='\n', header=None, squeeze=True)
product_list = products.values.tolist()
cols = products[products.str.contains(r'\[edit\]', case=False)].values.tolist()  # List of elements to be columns
df = []
category = product_list[0]
for item in product_list:
    if item in cols:
        category = item[:-6]  # Removes '[edit]'
    else:
        df.append((category, item))
df = pd.DataFrame(df, columns=['Category', 'Product'])
We use isin to find the column-name rows, then create the pivot keys with cumsum and cumcount, and finally use crosstab (here df1 is the original series as a DataFrame and df2 holds only the [edit] rows):
s = df1.Values.isin(df2.Values)
df = pd.crosstab(index=s.cumsum(),
                 columns=s.groupby(s.cumsum()).cumcount(),
                 values=df1.Values,
                 aggfunc='first').set_index(0).T
0 Soda IceCream Snacks
col_0
1 Coke Nestle Lays
2 Sprite NaN ActII
3 NaN NaN Nachos

Remake dataframe based on fuzzywuzzy matches

I have a dataframe that currently has 5 rows (in the future it will have more). In the names column there are 5 values; if those 5 names are the same (their fuzz.ratio scores are close to each other), then everything is fine and no changes are needed.
But there are cases where:
4 values are good (their fuzz.ratio scores are close) and 1 value is different (bad);
3 values are good, 2 are bad;
3 values are good, 1 is bad and 1 is bad;
2 values are the same, another 2 are the same, and 1 is different (bad);
2 values are the same, and the remaining 1, 1 and 1 values are bad.
So I need dataframes where at least 2 rows are the same; 3 is better, 4 is good, 5 is best.
Here is a simple example. Of course the series will have a row index; based on that it will be easier to select the needed rows.
from fuzzywuzzy import fuzz, process

fruits_4_1 = ['banana', 'bananas', 'bananos', 'banandos', 'cherry']
fruits_3_2 = ['tomato', 'tamato', 'tomatos', 'apple', 'apples']
fruits_3_1_1 = ['orange', 'orangad', 'orandges', 'ham', 'beef']
fruits_2_2_1 = ['kiwi', 'kiwiss', 'mango', 'mangas', 'grapes']
fruits_2_1_1_1 = ['kiwi', 'kiwiss', 'mango', 'apples', 'beefs']

for f in fruits_4_1:
    score_1 = process.extract(f, fruits_2_1_1_1, limit=10, scorer=fuzz.ratio)
    print(score_1)
I need to implement logic that will check a dataframe's series and determine which type it is (4+1, 3+2, etc.), and based on that create new dataframes with only the similar rows. How do I do that?
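A minimal sketch of one way to bucket the names by pairwise fuzz.ratio; the threshold of 70 and the helper name group_names are illustrative assumptions rather than a tested solution:
from fuzzywuzzy import fuzz

def group_names(names, threshold=70):
    # Put each name into the first bucket whose representative scores above
    # the (assumed) threshold; otherwise start a new bucket.
    buckets = []
    for name in names:
        for bucket in buckets:
            if fuzz.ratio(name, bucket[0]) >= threshold:
                bucket.append(name)
                break
        else:
            buckets.append([name])
    # Largest bucket first: 5 similar names is best, 2 is the minimum wanted.
    return sorted(buckets, key=len, reverse=True)

print(group_names(['banana', 'bananas', 'bananos', 'banandos', 'cherry']))
# roughly: [['banana', 'bananas', 'bananos', 'banandos'], ['cherry']]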

pandas aggregate by functions

I have data like below:
id movie details value
5 cane1 good 6
5 wind2 ok 30.3
5 wind1 ok 18
5 cane1 good 2
5 cane22 ok 4
5 cane34 good 7
5 wind2 ok 2
I want the output based on the criteria below:
If movie name starts with 'cane' - sum the value
If movie name starts with 'wind' - count the occurrence.
So - the final output will be:
id movie value
5 cane1 8
5 cane22 4
5 cane34 7
5 wind1 1
5 wind2 2
I tried to use:
movie_df.groupby(['id']).apply(aggr)
def aggr(x):
    if x['movie'].str.startswith('cane'):
        y = x.groupby(['value']).sum()
    else:
        y = x.groupby(['movie']).count()
    return y
But it's not working. Can anyone please help?
You should aim for vectorised operations where possible.
You can calculate 2 results and then concatenate them.
mask = df['movie'].str.startswith('cane')
df1 = df[mask].groupby('movie')['value'].sum()
df2 = df[~mask].groupby('movie').size()
res = pd.concat([df1, df2], ignore_index=0)\
        .rename('value').reset_index()
print(res)
movie value
0 cane1 8.0
1 cane22 4.0
2 cane34 7.0
3 wind1 1.0
4 wind2 2.0
There might be multiple ways of doing this. One way would be to filter by the start of the movie name first, then aggregate and concatenate afterwards.
cane = movie_df[movie_df['movie'].str.startswith('cane')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]
cane_sum = cane.groupby(['id']).agg({'movie':'first', 'value':'sum'}).reset_index()
wind_count = wind.groupby(['id']).agg({'movie':'first', 'value':'count'}).reset_index()
pd.concat([cane_sum, wind_count])
First of all, you need to perform a string operation. I guess in your case you don't want the digits in the movie name. Use the solution discussed in pandas applying regex to replace values, then call groupby() on the new series.
FYI: some movie names have digits only; in that case, you need to use the update function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
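A minimal sketch of that suggestion, using the question's movie_df (the prefix column name and the chosen aggregations are illustrative assumptions):
# Strip the digits to get the movie prefix, then group on the new series.
movie_df['prefix'] = movie_df['movie'].str.replace(r'\d+', '', regex=True)
print(movie_df.groupby('prefix')['value'].agg(['sum', 'count']))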
I would start by creating a column which defines the required groups. For the example at hand this can be done with
df['group'] = df.movie.transform(lambda x : x[:4])
The next step would be to group by this column
df.groupby('group').apply(agg_fun)
using the following aggregation function
def agg_fun(grp):
    if grp.name == "cane":
        value = grp.value.sum()
    else:
        value = grp.value.count()
    return value
The output of this code is
group
cane 19.0
wind 3.0

Iteration order with pandas groupby on a pre-sorted DataFrame

The Situation
I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one new column or another depending on certain conditions. The code, as it stands looks something like this:
df = pd.DataFrame({'A': [list with classifier ids],      # Only 3 ids, one-word strings
                   'B': [List of text to be classified], # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})      # Hundreds of possible classes, four-digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)
    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)
    for tup in zip(preds, scores, group.C.values):
        if tup[2] == tup[0]:
            new_col1.append(np.nan)
            new_col2.append(tup[2])
        else:
            new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
            new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2
The Issue
I am concerned that groupby will not iterate in a top-down, order-of-appearance manner as I expect. Iteration order when sort=False is not covered in the docs.
My Expectations
All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
Here is the code I used to test my theory on sort=False iteration order:
from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers
df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})

print(df.A.unique())  # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
    print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but there is one function groupby.ngroup which fully answers this question, as it directly tells you the order in which iteration occurs.
def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.

    This is the enumerative complement of cumcount. Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """
Data from @coldspeed's answer:
df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()
Output:
col sort=False sort=True
0 16 0 7
1 1 1 0
2 10 2 5
3 20 3 8
4 3 4 2
5 13 5 6
6 2 6 1
7 5 7 3
8 7 8 4
When sort=False you iterate based on order of first appearance; when sort=True the groups are sorted first and then iterated over.
Let's do a little empirical test. You can iterate over groupby and see the order in which groups are iterated over.
df
col
0 16
1 1
2 10
3 20
4 3
5 13
6 2
7 5
8 7
for c, g in df.groupby('col', sort=False):
    print(c)
16
1
10
20
3
13
2
5
7
It appears that the order is preserved.

Fast python algorithm (in numpy or pandas?) to find indices of array elements that match elements in another array

I am looking for a fast method to determine the cross-matching indices of two arrays, defined as follows.
I have two very large (>1e7 elements) structured arrays, one called members, and another called groups. Both arrays have a groupID column. The groupID entries of the groups array are unique, the groupID entries of the members array are not.
The groups array has a column called mass. The members array has a (currently empty) column called groupmass. I want to assign the correct groupmass to those elements of members with a groupID that matches one of the groups. This would be accomplished via:
members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]
So what I need is a fast routine to compute the two index arrays idx_matched_members and idx_matched_groups. This sort of task seems so common that it seems very likely that a package like numpy or pandas would have an optimized solution. Does anyone know of a solution, professionally developed, homebrewed, or otherwise?
This can be done with pandas using map to map the data from one column using the data of another. Here's an example with sample data:
members = pandas.DataFrame({
    'id': np.arange(10),
    'groupID': np.arange(10) % 3,
    'groupmass': np.zeros(10)
})
groups = pandas.DataFrame({
    'groupID': np.arange(3),
    'mass': np.random.randint(1, 10, 3)
})
This gives you this data:
>>> members
groupID groupmass id
0 0 0 0
1 1 0 1
2 2 0 2
3 0 0 3
4 1 0 4
5 2 0 5
6 0 0 6
7 1 0 7
8 2 0 8
9 0 0 9
>>> groups
groupID mass
0 0 3
1 1 7
2 2 4
Then:
>>> members['groupmass'] = members.groupID.map(groups.set_index('groupID').mass)
>>> members
groupID groupmass id
0 0 3 0
1 1 7 1
2 2 4 2
3 0 3 3
4 1 7 4
5 2 4 5
6 0 3 6
7 1 7 7
8 2 4 8
9 0 3 9
If you will often want to use the groupID as the index into groups, you can set it that way permanently so you won't have to use set_index every time you do this.
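For example, a small sketch of that, using the same sample frames as above:
# Set groupID as the index once; later lookups can reuse it directly.
groups = groups.set_index('groupID')
members['groupmass'] = members['groupID'].map(groups['mass'])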
Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.
For just 10 rows, this is much faster than the pandas equivalent. But as the data set becomes larger (e.g. M=10000), pandas is much better. The setup time for pandas is larger, but the per-row iteration time is much lower.
Generate test arrays:
dt_members = np.dtype({'names':['groupID','groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names':['groupID', 'mass'], 'formats': [int, float]})
N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101+N, M)
groups['groupID'] = np.arange(101, 101+N)
groups['mass'] = np.arange(1,N+1)
def getgroup(id):
    idx = id == groups['groupID']
    return groups[idx]
members['groupmass'][:] = [getgroup(id)['mass'] for id in members['groupID']]
In Python 2 the iteration could use map:
members['groupmass'] = map(lambda x: getgroup(x)['mass'], members['groupID'])
I can improve the speed by about 2x by minimizing the repeated subscripting, e.g.:
def setmass(members, groups):
    gmass = groups['mass']
    gid = groups['groupID']
    mass = [gmass[id == gid] for id in members['groupID']]
    members['groupmass'][:] = mass
But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.
In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.
def setmass1(members, groups):
    members['groupmass'][:] = groups['mass'][members['groupID'] - groups['groupID'].min()]
This is 300x faster than my earlier code, and 8x better than the pandas solution (for 10000,500 arrays).
I suspect pandas does something like this. pgroups.set_index('groupID').mass is the mass Series, with an added .index attribute. (I could test this with a more general array)
In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
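A sketch of that more general case using np.searchsorted on the sorted group IDs (this assumes every members['groupID'] actually occurs in groups['groupID']):
def setmass_sorted(members, groups):
    # Sort the groups by ID once, then locate each member's group by binary search.
    order = np.argsort(groups['groupID'])
    sorted_ids = groups['groupID'][order]
    pos = np.searchsorted(sorted_ids, members['groupID'])
    members['groupmass'][:] = groups['mass'][order][pos]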
Here's a 'vectorized' solution - no iteration. But it has to calculate a very large matrix (length of groups by length of members), so does not gain much speed (np.where is the slowest step).
def setmass2(members, groups):
    idx = np.where(members['groupID'] == groups['groupID'][:, None])
    members['groupmass'][idx[1]] = groups['mass'][idx[0]]
