Get the index of elements of the first Series within the second Series - Python

I want to get the index of each value of the smaller Series within the larger Series. The desired answer is stored in the ans variable in the code snippet below.
import pandas as pd
smaller = pd.Series(["a","g","b","k"])
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
# ans to be generated by some unknown combination of functions
ans = [0,6,1,10]
print(larger.iloc[ans])
print(smaller)
assert smaller.tolist() == larger.iloc[ans].tolist()
Context: Series larger serves as an index for the columns in a numpy matrix, and series smaller serves as an index for the columns in a numpy vector. I need indexes for the matrix and vector to match.

You can reverse your larger series, then index this with smaller:
larger_rev = pd.Series(larger.index, larger.values)
res = larger_rev[smaller].values
print(res)
array([ 0, 6, 1, 10], dtype=int64)

for i in list(smaller):
    if i in list(larger):
        print(list(larger).index(i))
This will get you the desired output.
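Note that this rescans larger for every element of smaller (list(larger).index is a linear search), so it is O(len(smaller) * len(larger)). A minimal sketch of the same lookup done in one pass, using a plain dict:
# build a value -> position map once, then each lookup is O(1)
pos = {v: i for i, v in enumerate(larger)}
print([pos[v] for v in smaller if v in pos])  # [0, 6, 1, 10]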

Using Series.get:
pd.Series(larger.index, larger.values).get(smaller)
Out[8]:
a 0
g 6
b 1
k 10
dtype: int64

try this :)
import pandas as pd
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
smaller = pd.Series(["a","g","b","k"])
res = pd.Series(larger.index, larger.values).reindex(smaller.values, copy=True)
print(res)
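For completeness, pandas Index objects expose this lookup directly via get_indexer, which returns -1 for values missing from larger instead of raising. A minimal sketch using the series defined in the question:
# positions of each element of smaller within larger (-1 if absent)
ans = pd.Index(larger).get_indexer(smaller)
print(ans)  # [ 0  6  1 10]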

Related

Create a column and assign values randomly

I have a dataframe containing customer IDs.
I want to create a new column named group_user which takes only three values: 0, 1, 2.
I want these values to be assigned randomly to customers in balanced proportions.
The output would be:
 ID  group_user
341           1
127           0
389           2
Thanks!
You could try this:
>>> lst = [0, 1, 2]
>>> df['group_user'] = pd.Series(np.tile(lst, len(df) // len(lst) + 1)[:len(df)]).sample(frac=1).values
>>> df
This works for any column and list length. Note the trailing .values: without it, pandas would realign the shuffled Series back onto the frame's index on assignment and undo the shuffle.
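Because the values are tiled before shuffling, the counts stay as balanced as the length allows no matter how the shuffle falls out. A quick check on a hypothetical 10-row frame:
>>> df['group_user'].value_counts()
0    4
1    3
2    3
Name: group_user, dtype: int64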
I think this may work for you:
import pandas as pd
import numpy as np
randints = [0, 1, 2]
N = 100
# Generate a dataframe with N entries, where ID is a random three-digit integer
# and group_usr is drawn uniformly at random from randints.
df = pd.DataFrame({'ID': np.random.randint(low=100, high=999, size=N),
                   'group_usr': np.random.choice(randints, size=N, replace=True)})
If the dataframe is long enough you should get more or less equal proportions; with 100 entries in your dataframe, each of the three values in the group_usr column appears roughly a third of the time.
You can try this:
import random
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': random.sample(range(100, 1000), 25), 'col2': [np.nan]*25})
groups = random.choices([0]*3 + [1]*5 + [2]*5, k=len(df.ID))
df['groups'] = groups
The expected proportions are 3 : 5 : 5 (random.choices samples with replacement, so the actual counts will vary around that).
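If you need the groups exactly balanced rather than balanced only in expectation, one alternative sketch is to tile 0, 1, 2 cyclically and shuffle with np.random.permutation:
import numpy as np
# counts of 0, 1 and 2 differ by at most one, assigned in random order
df['group_user'] = np.random.permutation(np.arange(len(df)) % 3)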

Pandas: extract number from calculation within loop

I'm trying to do calculations within a loop from multiple columns within a pandas dataframe. I want the output to be just a number, but it comes out in the form [index number dtype: int64]. It seems like it should be easy to get just that number, but I can't figure it out. Here is a simple example of some data and a basic calculation:
import pandas as pd
# create a little dataframe
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})
# create a list to hold results
l1 = []
# run a loop to do a simple example calculation
for i, _ in enumerate(df.A):
    val = df.A[[i]] + df.B[[i]]
    l1.append(val)
This is what I get for l1:
[0 4
dtype: int64,
1 6
dtype: int64]
My desired output is
[4, 6]
I can take the second value from each element in the list, but I want to do something faster, because my dataset is large, and it seems like I should be able to return a calculation without the index and dtype. Thank you in advance.
Change the last line within the for loop; the original one returns a Series, which causes the 'issue' you mentioned:
l1 = []
# run a loop to do a simple example calculation
for i, _ in enumerate(df.A):
    val = df.A[[i]] + df.B[[i]]
    l1.append(val.iloc[0])
l1
Out[154]: [4, 6]
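Since the per-row work here is plain column arithmetic, the loop can be dropped entirely; adding the columns once is far faster on a large frame:
# vectorised equivalent: add the columns in one operation, then convert to a list
l1 = (df.A + df.B).tolist()
print(l1)  # [4, 6]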

Numpy: Finding correspondencies in one array by uniques of other array, arbitrary length

I have a problem where I have two arrays, one with identifiers which can occur multiple times, let's just say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one of the same length which contains some info, in this case boolean, for each of the elements identified by ind. They are sorted correspondingly.
dat = np.random.randint(0,2,(100,)).astype(bool)
I'm looking for a (faster?) way to do the following: compute np.any() over dat for each distinct identifier in ind. The number of occurrences per identifier is, as in the example, random. What I'm doing now is
result = np.empty(len(np.unique(ind)), dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind == uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the ones required to be checked, get the binned counts with np.bincount and see which bins have at least one occurrence -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
Approach #2
One more with np.unique (this assumes the identifiers are exactly the integers 0..n-1) -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n, dtype=bool)
result[unq] = True
We can use pandas to get n:
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max() + 1, dtype=bool)
result[ar] = True
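All of the above assume the identifiers are small non-negative integers. If they are arbitrary labels, a hedged generalisation (reusing ind and dat from the question) first maps them to 0..n-1 with np.unique(..., return_inverse=True) and then reuses the bincount idea:
# inv relabels ind with positions into the sorted unique values,
# so a weighted bincount sums the True entries per identifier
uniq, inv = np.unique(ind, return_inverse=True)
result = np.bincount(inv, weights=dat) > 0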

Efficiently taking time slices of variable length in a dataframe

I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but where the desired time slices have different lengths.
This is relatively easy to do by looping (see code below), but with large timeseries the multiple slices quickly become slow. Any suggestions on vectorising this / improving speed?
import pandas as pd, datetime as dt, numpy as np
#Example DataFrame with a DatetimeIndex
idx = pd.date_range(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index = idx, data = np.random.rand(len(idx)))
#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]
#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index, 'total'] = df[(df.index >= row.start_window) &
                                           (df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (i.e. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...
You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply(lambda row: df[(df.index >= row.start_window) &
                                        (df.index <= row.end_window)].sum(), axis=1)
Out[11]:
1 36.381155
2 111.521803
dtype: float64
You can vectorize this with searchsorted (assuming the datetime index is sorted; otherwise sort it first):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum() # only sum once!
In [13]: pd.Series([s.iloc[end] - s.iloc[start-1] if start else s.iloc[end] for start, end in inds], slicer_df.index)
Out[13]:
1 36.381155
2 111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23, 96],
[119, 336]])
In [24]: x = s.values[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 , 47.84080472],
[ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: when the start date is before the first date we want to avoid the start index rolling back from 0 to -1 (which would wrap around to the end of the array, i.e. underflow).
I have come up with a vectorised method which relies on the varying-length "windows" always being adjacent to one another, i.e. the start of each window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window = slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
#Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, i.e. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!
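For the adjacent-window case, one more hedged sketch that avoids the join entirely is to build a single list of bin edges and cut the index once. Note pd.cut's intervals are right-closed by default, so boundary handling differs slightly from the >=/<= loop above:
# bin edges are all the window starts plus the final window end
edges = list(slicer_df['start_window']) + [slicer_df['end_window'].iloc[-1]]
# label each timestamp with its event number, then sum per event in one pass
totals = df.groupby(pd.cut(df.index, bins=edges, labels=slicer_df.index)).sum()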

Matrix is printing wrong dimensions

I'm reading in a column named 'OneHot' from a dataframe. Each row of this column has a value of either [1,0] or [0,1]. I am trying to store these values into a variable so I can use them in a neural network.
Problem:
When I read the values into a variable, its shape is (792824, 1) instead of (792824, 2). 792824 is the number of rows in the dataframe. I have tried reshape and that did not work.
Here is the code I have:
input_matrix = np.matrix(df['VectorTweet'].values.tolist())
In [157]:
input_matrix = np.transpose(input_matrix)
x_inputs = input_matrix.shape
print x_inputs
(792824, 1)
In [160]:
output_matrix = np.matrix(df['OneHot'].values.tolist())
y_outputs = np.transpose(output_matrix)
print y_outputs.shape
(792824, 1)
print y_outputs[1]
[['[1, 0]']]
Attached is a snippet of my dataframe.
It looks like each entry in OneHot is a string representation of a list. That's why you're only getting one column in your transpose: each row holds a single string rather than a two-element list. You can convert strings of lists to actual lists with ast.literal_eval():
# OneHot as string of list of ints
strOneHot = pd.Series(['[0,1]','[1,0]'])
print(strOneHot.values)
# ['[0,1]' '[1,0]']
import ast
print(strOneHot.apply(ast.literal_eval).values)
# [[0, 1] [1, 0]]
FWIW, you can take the transpose of a Pandas series with .T, if that's useful here:
strOneHot.apply(ast.literal_eval).T
Output:
0 [0, 1]
1 [1, 0]
dtype: object
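Applied to the original problem, a minimal sketch (assuming the column and variable names from the question) that recovers the (792824, 2) shape:
import ast
import numpy as np
# parse each '[1, 0]' string into a real list, then stack into an (n, 2) array
output_matrix = np.array(df['OneHot'].apply(ast.literal_eval).tolist())
print(output_matrix.shape)  # (792824, 2)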
