Create a column and assign values randomly - python

I have a dataframe containing customer IDs.
I want to create a new column named group_user which would take only three values: 0, 1, 2
I want these values to be assigned randomly to customers in balanced proportions.
The output would be :
ID group_user
341 1
127 0
389 2
Thanks !

You could try this:
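>>> import numpy as np
>>> lst = [0, 1, 2]
>>> # Tile the values to cover every row, then shuffle; a plain array avoids index re-alignment undoing the shuffle
>>> df['group_user'] = np.random.permutation(np.tile(lst, len(df) // len(lst) + 1)[:len(df)])
>>> df
This works for any DataFrame length and any list of values.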
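
A self-contained sketch of the same tile-and-shuffle idea, assuming a small example frame (the IDs and the `assign_balanced_groups` helper are made up for illustration, not part of the answer above). It shuffles a plain NumPy array rather than a Series, because assigning a Series to a column re-aligns it on the index and would put the values back in their unshuffled order.

```python
import numpy as np
import pandas as pd


def assign_balanced_groups(df, values, seed=None):
    """Return a shuffled array of `values`, repeated to cover every row of df.

    Hypothetical helper for illustration; group sizes differ by at most one.
    """
    rng = np.random.default_rng(seed)
    reps = -(-len(df) // len(values))  # ceiling division
    tiled = np.tile(values, reps)[:len(df)]
    return rng.permutation(tiled)


# Made-up customer IDs, only for the demonstration
df = pd.DataFrame({"ID": [341, 127, 389, 502, 618, 733, 845]})
df["group_user"] = assign_balanced_groups(df, [0, 1, 2], seed=42)
counts = df["group_user"].value_counts()
print(counts)
```

When the number of rows is not a multiple of the number of values, the values listed first receive the extra rows.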

I think this may work for you:
import pandas as pd
import numpy as np
randints = [0, 1, 2]
N = 100
# Generate a dataframe with N entries, where ID is a random three-digit integer
# and group_usr is drawn at random (with replacement) from randints.
df = pd.DataFrame({'ID': np.random.randint(low=100, high=1000, size=N),
                   'group_usr': np.random.choice(randints, size=N, replace=True)})
If the dataframe is long enough, you should get more or less equal proportions, though not exactly balanced ones, because each value is drawn independently. For example, with 100 entries in your dataframe, df['group_usr'].value_counts() shows the distribution of the group_usr column.

You can try this:
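import random
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': random.sample(range(100, 1000), 25), 'col2': np.nan})
# Repeating a value in the population weights it: 0, 1, 2 are drawn with weights 3, 5, 5
groups = random.choices(([0]*3) + ([1]*5) + ([2]*5), k=len(df.ID))
df['groups'] = groups
The expected proportions are 3:5:5.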

Related

Get index of elements in first Series within the second series

I want to get the index of all values in the smaller series for the larger series. The answer is in the code snippet below stored in the ans variable.
import pandas as pd
smaller = pd.Series(["a","g","b","k"])
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
# ans to be generated by some unknown combination of functions
ans = [0,6,1,10]
print(larger.iloc[ans,])
print(smaller)
assert(smaller.tolist() == larger.iloc[ans,].tolist())
Context: Series larger serves as an index for the columns in a numpy matrix, and series smaller serves as an index for the columns in a numpy vector. I need indexes for the matrix and vector to match.
You can invert your larger series (swap its index and values), then index the result with smaller:
larger_rev = pd.Series(larger.index, larger.values)
res = larger_rev[smaller].values
print(res)
array([ 0, 6, 1, 10], dtype=int64)
for i in list(smaller):
    if i in list(larger):
        print(list(larger).index(i))
This will get you the desired output
Using Series.get:
pd.Series(larger.index, larger.values).get(smaller)
Out[8]:
a 0
g 6
b 1
k 10
dtype: int64
try this :)
import pandas as pd
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
smaller = pd.Series(["a","g","b","k"])
res = pd.Series(larger.index, larger.values).reindex(smaller.values, copy=True)
print(res)
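
Another way to sketch the same lookup, assuming the values of larger are unique, is `Index.get_indexer`, which returns integer positions directly. This is an alternative to the answers above, not a restatement of them.

```python
import pandas as pd

smaller = pd.Series(["a", "g", "b", "k"])
larger = pd.Series(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m"])

# get_indexer returns, for each value of `smaller`, its position in `larger`
# (or -1 if the value is absent). It requires the values of `larger` to be unique.
ans = pd.Index(larger).get_indexer(smaller)
print(ans)

assert smaller.tolist() == larger.iloc[ans].tolist()
```

Because the result is a plain integer array, it can be used as-is to index the columns of the NumPy matrix mentioned in the question.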

Cumulative custom function

I am trying to add a column to a pandas dataframe
import pandas as pd
df = pd.DataFrame([['a',1],['b',0],['c',1],['d',1],['e',0],['f',1]])
such that it contains the result of a cumulative custom function
a --> (total + a) * a
that is, it takes the value a, adds it to the running total, and multiplies the result by a. In my example I would like to have as output:
pd.DataFrame([['a',1,1],['b',0,0],['c',1,1],['d',1,2],['e',0,0],['f',1,1]])
I understand that this could be done using
df.expanding().apply(some_lambda_function)
but I have some difficulty understanding how to code it.
Do you have any idea?
many thanks.
I would recommend a for loop:
start = 0
total = []
for _, row in df.iterrows():
    # row[1] is the numeric column (column label 1)
    start = (row[1] + start) * row[1]
    total.append(start)
total
Out[201]: [1, 0, 1, 2, 0, 1]
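
The same running computation can also be sketched with `itertools.accumulate` from the standard library, which carries the total from one element to the next. This assumes Python 3.8+ for the `initial` argument, and the new column label 2 is chosen only to match the expected output in the question.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 1]])

# accumulate carries the running total; initial=0 seeds it (Python 3.8+)
running = accumulate(df[1], lambda total, a: (total + a) * a, initial=0)
next(running)  # drop the seed value itself
df[2] = list(running)
print(df)
```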

How to group records with Pandas cut()?

My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
find max() value of each group of 4 based on one column among other columns, and create a new dataset or csv file. The max() operation would be performed on one column, while the other columns remain as they are.
Based on the research I have made here (Stackoverflow), I have tried to customize and apply the following solution on this site on my dataset, but it wasn't giving me my expectations:
# Group every 4 rows, up to len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0,len(dataset),3)))
needataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I would appreciate any guidance on tackling the problem.
This example should help you. Here I create a DataFrame of 5 random integers between 0 and 99 and bin those values into intervals of width 5, labelled "0 - 4", "5 - 9", and so on (sort_values is really important, it will make your life easier):
df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for max values
max_vals = np.zeros((len(groups)))
for i, group in enumerate(groups):
    max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})
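
If the goal is to group consecutive rows in blocks of 4 (rows 0-3, 4-7, 8-11, ...) rather than to bin values, one possible sketch is to integer-divide the row position by 4 and group on that. The column names score and name and the data below are made up for illustration; the row holding the maximum of score in each block is kept with its other columns unchanged.

```python
import numpy as np
import pandas as pd

# Made-up data: "score" is the column to maximise, "name" stands for the other columns
dataset = pd.DataFrame({
    "score": [3, 9, 1, 7, 4, 8, 2, 6, 5, 0],
    "name": list("abcdefghij"),
})

# Row positions 0-3 map to block 0, 4-7 to block 1, and so on
block = np.arange(len(dataset)) // 4

# idxmax gives the row label of the largest score in each block;
# .loc then pulls those whole rows, leaving the other columns as they are
idx = dataset.groupby(block)["score"].idxmax()
result = dataset.loc[idx].reset_index(drop=True)
print(result)
```

The result can then be written out with result.to_csv(...) if a CSV file is needed.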

Python - Subtract a number of samples from a given in a dictionary structure

I have a dict structure with length 5, called "mat_contents". The data is stored under "traindata" and the corresponding labels under "trainlabels". I want to extract a given number of samples with a given label value, for instance 60 samples (out of 80) from "traindata" whose "trainlabels" value equals 1. I have seen some examples here, but they differ from what I need.
Assuming this as an example of Input
traindata trainlabels
a 1
b 2
c 2
d 1
e 1
f 2
The result if I want to extract two random samples of traindata with trainlabels value of 2 could be:
b
f
labels = [k for k, v in mat_contents.items() if v == 1]
result = np.random.choice(labels, 2, replace=False)
The first line extracts the relevant labels from your dictionary, and the second line chooses a random subset of 2 elements from these labels (without replacement), if numpy is imported as np.
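
If mat_contents instead holds two parallel arrays under the keys traindata and trainlabels, as the question describes, a sketch along these lines may be closer to what is needed. The array contents are the example input from the question; the assumption that both entries are one-dimensional is mine (arrays loaded from a .mat file may need `.ravel()` first).

```python
import numpy as np

# Assumed layout: two parallel 1-D arrays stored in the dict
mat_contents = {
    "traindata": np.array(["a", "b", "c", "d", "e", "f"]),
    "trainlabels": np.array([1, 2, 2, 1, 1, 2]),
}

rng = np.random.default_rng(0)

# Positions of every sample whose label equals 2
candidates = np.flatnonzero(mat_contents["trainlabels"] == 2)

# Draw 2 of those positions without replacement, then look up the samples
picked = rng.choice(candidates, size=2, replace=False)
samples = mat_contents["traindata"][picked]
print(samples)
```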
Can you not use a pandas data frame to do this? See DataFrame.sample in the pandas documentation. This is an example that I have used in the past:
import pandas as pd
keeping = 0.8  # fraction of rows to keep per label
source = "/path/to/some/file"
# Assumption: the file is a CSV with traindata and trainlabels columns
df = pd.read_csv(source)
ones = df[df.trainlabels == 1].sample(frac=keeping)
twos = df[df.trainlabels == 2].sample(frac=keeping)

Python Pandas: How to make a column row dependent on its previous rows, possibly with a function?

I am trying to calculate column B from previous values of columns A and B. A simple example function would be
B(n) = A(n-1) + B(n-1),
where n is the index of the Pandas dataframe. I do not necessarily need to use the dataframe index.
In this example, I start with B(1) = 0 and add the A rows in consecutive fashion.
n A(n) B(n)
----------------
1 1 0
2 0 1
3 2 1
4 9 3
An example of this data structure would be defined in Pandas as
d = {'A' : pd.Series([1, 0, 2, 9],),
'B' : pd.Series([0, float("nan"), float("nan"), float("nan")])}
df = pd.DataFrame(d)
Update
Both Henry Cutchers' and Jakob's answers work well.
As your example problem can be reduced to depend only on B[0] and the preceding values of A, a possible simple solution could look like:
import pandas as pd
import numpy as np
d = {'A' : pd.Series([1, 0, 2, 9],),
'B' : pd.Series([0, float("nan"), float("nan"), float("nan")])}
df = pd.DataFrame(d)
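for i in range(1, len(df.A)):
    # B[i] is B[0] plus the sum of A[0..i-1]; .loc avoids chained assignment
    df.loc[i, 'B'] = df.B[0] + np.sum(df.A[:i])
df
which results in a data frame whose B column is 0, 1, 1, 3, matching the table in the question.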
If you face a similar iterative dependency you should be able to construct a similar approach suiting your needs.
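
For this particular recurrence, B(n) = A(n-1) + B(n-1) unrolls to B(1) plus the sum of all earlier A values, so it can be sketched without a loop using shift and cumsum. The name b0 for the starting value is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 2, 9]})
b0 = 0  # starting value B(1), as in the question

# shift(1) moves A down one row so each row sees the previous A;
# cumsum then adds up all earlier A values
df["B"] = b0 + df["A"].shift(1, fill_value=0).cumsum()
print(df)
```

This only applies when the recurrence is a plain running sum; a rule that feeds B back in non-linearly still needs an explicit loop.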
Have you thought about using Cython http://www.cython.org ? It will interoperate with pandas -- same data structures, etc (as pandas is written in cython). It looks to me like you'll need the ability to iterate across your dataframe in arbitrary ways (not knowing more about your problem, that's all I can say), and yet need speed. Cython compiles to C.
I could foresee a loop of the form:
import numpy
import pandas
dates = pandas.date_range('20130101', periods=6)
myDataFrame = pandas.DataFrame(numpy.arange(12).reshape((6, 2)), index=dates, columns=list('ab'))
a = myDataFrame["a"]
b = myDataFrame["b"]
print(a)
print(b)
out = numpy.empty_like(a.values)
out[0] = 0
# this loop will work but be slow...
for i in range(1, a.shape[0]):
    # .iloc makes the positional lookup explicit on a date-indexed Series
    out[i] = a.iloc[i-1] + b.iloc[i-1]
myDataFrame['c'] = pandas.Series(out, index=myDataFrame.index)
print(myDataFrame)
But that's going to be slow.
