I have two dataframes. In the first I have the customers and a column with a list of every restaurant they visited.
In [1]: df_customers
Out[1]:
Document Restaurants
0 '000000984 [20504916171, 20504916171, 20499859164]
1 '000010076 [20505918674, 20505918674, 20505918674]
2 '000010319 [20253346711, 20524403863, 20508246677]
3 '000018468 [20253346711, 20538456226, 20505918674]
4 '000024409 [20553255881, 20553596441, 20553255881]
5 '000025944 [20492255719, 20600654226]
6 '000031162 [20600351398, 20408462399, 20499859164]
7 '000055177 [20524403863, 20524403863]
8 '000058303 [20600997239, 20524403863, 20600997239]
9 '000074791 [20517920178, 20517920178, 20517920178]
In my other dataframe I have a column with the restaurants and another with a value given to each:
In [2]: df_rest
Out[2]:
Restaurant Points
0 10026575473 1
1 10037003331 1
2 10072208299 1
3 10179698400 2
4 10214262750 1
I need to create a column in my customers dataframe with the sum of the points given to each restaurant he/she visited.
I tried something like this:
df_customers["Sum"]=df_rest.loc[df_rest["Restaurant"].isin(df_customers["Restaurants"]),"Points"].sum()
But I'm getting this error:
TypeError: unhashable type: 'list'
I'm trying to avoid iterating over my customers dataframe, since that takes too long. Any help?
Avoid using lists within pandas series; storing lists removes the possibility of vectorised operations. It is more efficient to expand your jagged array of restaurant lists into a single dataframe, then map restaurants to points via a dictionary and sum.
Here's a minimal example:
import pandas as pd

df1 = pd.DataFrame({'Document': [1, 2],
                    'Restaurants': [[20504916171, 20504916171, 20499859164],
                                    [20505918674, 20505918674]]})
df2 = pd.DataFrame({'Restaurant': [20504916171, 20504916171, 20499859164,
                                   20505918674, 20505918674],
                    'Points': [1, 2, 1, 3, 2]})

# Build a restaurant -> points dict (later duplicates overwrite earlier ones)
ratmap = df2.set_index('Restaurant')['Points'].to_dict()

# Expand the lists into a rectangular frame, map ids to points, then sum per row
df1['score'] = pd.DataFrame(df1['Restaurants'].values.tolist())\
                 .applymap(ratmap.get).fillna(0).sum(1).astype(int)
print(df1)
Document Restaurants score
0 1 [20504916171, 20504916171, 20499859164] 5
1 2 [20505918674, 20505918674] 4
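On pandas 0.25 or later, Series.explode gives the same result without padding out a rectangular frame. A minimal sketch, reusing df1 and ratmap from above:

# Sketch assuming pandas >= 0.25, where Series.explode is available
s = df1['Restaurants'].explode().map(ratmap).fillna(0)  # one row per visit
df1['score'] = s.groupby(level=0).sum().astype(int)     # sum back per customer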
I would first expand the df into:
# Repeat each customer's scalar values once per restaurant visited
d = {c: df_customers[c].values.repeat(df_customers.Restaurants.str.len(), axis=0)
     for c in df_customers.columns}
# Replace the repeated lists with a flat list of restaurant ids
d['Restaurants'] = [i for sub in df_customers.Restaurants for i in sub]
df3 = pd.DataFrame(d)
Document Restaurants
0 000000984 20504916171
1 000000984 20504916171
2 000000984 20499859164
3 000010076 20505918674
4 000010076 20505918674
5 000010076 20505918674
6 000010319 20253346711
7 000010319 20524403863
Then map
df3['Point'] = df3.Restaurants.map(df_rest.set_index('Restaurant').Points).fillna(0)
Document Restaurants Point
0 000000984 20504916171 1
1 000000984 20504916171 1
2 000000984 20499859164 0
3 000010076 20505918674 0
4 000010076 20505918674 0
5 000010076 20505918674 0
Then group by Document and sum:
df3.groupby('Document').sum()
Restaurants Point
Document
000000984 61509691506 2.0
000010076 61517756022 0.0
000010319 61285997251 0.0
000018468 61297721611 0.0
Values are mocked, because no restaurant id from your df_customers is present in your df_rest in the example you provided.
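To attach the result back onto df_customers, one more map over the grouped sums is enough. A short sketch, reusing df3 from above:

# Per-document point totals, aligned back onto the original frame
sums = df3.groupby('Document')['Point'].sum()
df_customers['Sum'] = df_customers['Document'].map(sums)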
I have a pandas dataframe with an "id" column and a "frame num" column.
I want to count how many rows there are for each id and print the result. The problem is that I want to count ONLY runs of consecutive numbers in "frame num".
For example: if "frame num" is [1,2,3,45,47,122,123,124,125] and id is [1,1,1,1,1,1,1,1,1], it should print: 3 1 1 4 (and do that for EACH id).
Is there any way to do that? I got crazy trying to figure it out! To count rows for each id it should be enough to use a GROUP BY, but with this new condition it's difficult.
You can use pandas.DataFrame.shift() to flag consecutive numbers, then itertools.groupby to build the list of run counts.
import pandas as pd
from itertools import chain
from itertools import groupby
# Example input dataframe
df = pd.DataFrame({
'num' : [1,2,3,45,47,122,123,124,125,1,2,3,45,47,122,123,124,125],
'id' : [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
})
# s is True where the row is part of a run of consecutive numbers
df['s'] = (df['num']-1 == df['num'].shift()) | (df['num']+1 == df['num'].shift(-1))

res = df.groupby('id')['s'].apply(
    lambda g: list(chain.from_iterable(
        [len(list(group))] if key else [1]*len(list(group))
        for key, group in groupby(g))))
print(res)
Output:
id
1 [3, 1, 1, 4]
2 [3, 1, 1, 4]
Name: s, dtype: object
Update: Get the output as a dataframe:
>>> res.to_frame().explode('s').reset_index()
id s
0 1 3
1 1 1
2 1 1
3 1 4
4 2 3
5 2 1
6 2 1
7 2 4
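A vectorised alternative that stays entirely in pandas is to label each run of consecutive numbers with a diff/cumsum trick and then size the runs. A sketch against the same example df:

# A new run starts wherever the gap to the previous number within an id is not 1
runs = df.groupby('id')['num'].diff().ne(1).cumsum()
res = df.groupby(['id', runs]).size().groupby(level='id').apply(list)
print(res)  # id 1 -> [3, 1, 1, 4], id 2 -> [3, 1, 1, 4]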
Say I have the following DataFrame df:
time person attributes
----------------------------
1 1 a
2 2 b
3 1 c
4 3 d
5 2 e
6 1 f
7 3 g
... ... ...
I want to write a function get_latest() that, given a request_time and a list of person ids, returns a DataFrame containing the latest entry (row) for each of those persons, up to the request_time.
So for instance, if I called get_latest(request_time = 4.5, ids = [1, 2]), then I want it to return
time person attributes
----------------------------
2 2 b
3 1 c
since those are the latest entries for persons 1 and 2 up to the time 4.5.
I've thought about truncating the DataFrame and then searching upward from there, but that's still O(n), and I was wondering if there are functions or logic that would make this computation faster.
EDIT: I made this example DataFrame on the fly but it is perhaps important that I point out that the times are Python datetimes.
How about pd.DataFrame.query
def latest_entries(request_time: float, ids: list) -> pd.DataFrame:
    return (
        df
        .query("time <= @request_time & person in @ids")
        .sort_values(["time"], ascending=False)
        .drop_duplicates(subset=["person"], keep="first")
        .reset_index(drop=True)
    )
print(latest_entries(4.5, [1, 2]))
time person attributes
0 3 1 c
1 2 2 b
def get_latest(tme, ids):
    # Keep rows up to the requested time for the requested persons,
    # then drop all but the last (latest) row per person
    df2 = df[(df['time'] <= tme) & (df['person'].isin(ids))]
    return df2[~df2.duplicated(subset=['person'], keep='last')]
get_latest(4.5, [1,2])
time person attributes
1 2 2 b
2 3 1 c
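If df is already sorted by time, as in the example, a groupby tail reads even more directly, and it works unchanged when the time column holds Python datetimes (per the asker's edit). A sketch:

def get_latest(request_time, ids):
    # Filter to the requested persons up to request_time,
    # then keep the last (latest) remaining row per person
    subset = df[(df['time'] <= request_time) & (df['person'].isin(ids))]
    return subset.groupby('person').tail(1)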
I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that for each row I can see what position its number takes in the sorted list. What am I doing wrong?
import pandas as pd

df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
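For instance, a minimal sketch with a hypothetical tie (two 3s) shows how method changes the result:

import pandas as pd

s = pd.Series([4, 3, 6, 3, 5, 9])
print(s.rank(method='min').tolist())    # [3.0, 1.0, 5.0, 1.0, 4.0, 6.0] - ties share the lowest rank
print(s.rank(method='dense').tolist())  # [2.0, 1.0, 4.0, 1.0, 3.0, 5.0] - ranks stay gap-free after ties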
I'm trying to get a random sample from two pandas frames. If random rows 2,5,8 are selected in frame A, then the same rows 2,5,8 must be selected from frame B. I started by generating a random sample and now want to use it to index rows in both frames. How can I do it? The code should look like
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error.
Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
An alternative consistent sampling methodology would be to set a random seed, and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
Provided A and B have the same length and share the same index, the sampled DataFrames will line up on the same rows.
Hope this helps!
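If you prefer to stay position-based, the same idea works with NumPy-drawn positions and iloc. A sketch for the asker's X_train/y_train case, assuming both frames are row-aligned:

import numpy as np

rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=5000, replace=False)  # row positions, not labels
lgstc_reg[i].fit(X_train.iloc[idx], y_train.iloc[idx])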
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two data frames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create selection_index, the index of a random sample of rows of df1, using pandas' built-in function sample. Then I use this index to locate those rows in df1 and df2 with the help of another built-in function, loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
In your case, this would become somewhat like
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])
I have two CSV files. Depending upon the value of a cell in CSV file 1, I should be able to search for that value in a column of CSV file 2 and get the corresponding value from another column in CSV file 2.
I am sorry if this is confusing. It will probably become clear with an illustration:
CSV file 1
Car Mileage
A 8
B 6
C 10
CSV file 2
Score Mileage(Min) Mileage(Max)
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
And my desired output CSV file is something like this
Car Mileage Score
A 8 3
B 6 2
C 10 4
Car A is given a score of 3 because its mileage of 8 falls in the 7-9 range in CSV file 2, and the score for that range is 3.
Any help will be appreciated
Thanks in advance
As of writing this, the current stable release is v0.21.
To read your files, use pd.read_csv -
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
df0
Car Mileage
0 A 8
1 B 6
2 C 10
df1
Score Mileage(Min) Mileage(Max)
0 1 1 3
1 2 4 6
2 3 7 9
3 4 10 12
4 5 13 15
To find the Score, use pd.IntervalIndex by calling IntervalIndex.from_tuples. This should be really fast -
v = df1.loc[:, 'Mileage(Min)':'Mileage(Max)'].apply(tuple, axis=1).tolist()
idx = pd.IntervalIndex.from_tuples(v, closed='both') # you can also use `from_arrays`
# get_indexer returns the position of the interval containing each mileage
df0['Score'] = df1['Score'].iloc[idx.get_indexer(df0.Mileage.values)].values
df0
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Other methods of creating an IntervalIndex are outlined here.
To write your result, use pd.DataFrame.to_csv -
df0.to_csv('file3.csv')
Here's a high level outline of what I've done here.
First, read in your CSV files
Use pd.IntervalIndex to build an interval index tree. So, searching is now logarithmic in complexity.
Use idx.get_indexer to find the index of each value in the tree
Use the index to locate the Score value in df1, and assign this back to df0. Note that I call .values, otherwise, the values will be misaligned when assigning back.
Write your result back to CSV
For more information on Intervalindex, take a look at this SO Q/A - Finding matching interval(s) in pandas Intervalindex
Note that IntervalIndex is new in v0.20, so if you have an older version, make sure you update your version with
pip install --upgrade pandas
You can use IntervalIndex, new in version 0.20.0+:
First create DataFrames by read_csv:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
Create IntervalIndex by from_arrays:
s = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], closed='both')
print (s)
IntervalIndex([[1, 3], [4, 6], [7, 9], [10, 12], [13, 15]],
              closed='both',
              dtype='interval[int64]')
Select the Score values by the IntervalIndex and set them on the new column as a plain array via values, because otherwise the indices are not aligned and you get:
TypeError: incompatible index of inserted column with frame index
df1['Score'] = df2.set_index(s).loc[df1['Mileage'], 'Score'].values
print (df1)
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
And last write to file by to_csv:
df1.to_csv('file3.csv', index=False)
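pd.cut (pandas 0.23+) also accepts an IntervalIndex as bins, which gives another route to the same positions. A sketch, assuming every mileage falls inside one of the ranges (unmatched rows would get code -1 and pick the wrong row):

bins = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], closed='both')
df1['Score'] = df2['Score'].values[pd.cut(df1['Mileage'], bins).cat.codes]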
Setup
import pandas as pd

data = [(1,1,3), (2,4,6), (3,7,9), (4,10,12), (5,13,15)]
df = pd.DataFrame(data, columns=['Score','MMin','MMax'])

car_data = [('A', 8), ('B', 6), ('C', 10)]
car = pd.DataFrame(car_data, columns=['Car','Mileage'])

def find_score(x, df):
    # Brute-force scan: return the Score of the range containing x, or -99
    result = -99
    for idx, row in df.iterrows():
        if row.MMin <= x <= row.MMax:
            result = row.Score
    return result

car['Score'] = car.Mileage.apply(lambda x: find_score(x, df))
Which yields
In [58]: car
Out[58]:
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
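For larger inputs, a vectorised alternative to the row-by-row scan is pd.merge_asof, which matches each mileage to the last range start at or below it. A sketch against the same car/df setup, assuming the ranges are sorted and non-overlapping:

# Both frames must be sorted on their merge keys; df already is by MMin
merged = pd.merge_asof(car.sort_values('Mileage'), df,
                       left_on='Mileage', right_on='MMin')
# Blank out matches where the mileage overshoots the range's upper bound
merged['Score'] = merged['Score'].where(merged['Mileage'] <= merged['MMax'])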