I have two CSV files. Depending upon the value of a cell in CSV file 1, I should be able to search for that value in a column of CSV file 2 and get the corresponding value from another column of CSV file 2.
I am sorry if this is confusing. It will probably become clearer with an illustration:
CSV file 1
Car Mileage
A 8
B 6
C 10
CSV file 2
Score Mileage(Min) Mileage(Max)
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
And my desired output CSV file is something like this
Car Mileage Score
A 8 3
B 6 2
C 10 4
Car A is given a score of 3 based on its mileage of 8: the mileage is looked up in CSV file 2 to find the range it falls in, and the corresponding score for that range is taken.
Any help will be appreciated
Thanks in advance
As of writing this, the current stable release is v0.21.
To read your files, use pd.read_csv -
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
df0
Car Mileage
0 A 8
1 B 6
2 C 10
df1
Score Mileage(Min) Mileage(Max)
0 1 1 3
1 2 4 6
2 3 7 9
3 4 10 12
4 5 13 15
To find the Score, use pd.IntervalIndex by calling IntervalIndex.from_tuples. This should be really fast -
v = df1.loc[:, 'Mileage(Min)':'Mileage(Max)'].apply(tuple, axis=1).tolist()
idx = pd.IntervalIndex.from_tuples(v, closed='both')  # you can also use `from_arrays`
# get_indexer returns the position of the matching interval for each mileage
df0['Score'] = df1['Score'].values[idx.get_indexer(df0['Mileage'].values)]
df0
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Other methods of creating an IntervalIndex are outlined here.
To write your result, use pd.DataFrame.to_csv -
df0.to_csv('file3.csv')
Here's a high level outline of what I've done here.
First, read in your CSV files
Use pd.IntervalIndex to build an interval tree, so searching is logarithmic in complexity.
Use idx.get_indexer to find the index of each value in the tree
Use the index to locate the Score value in df1, and assign this back to df0. Note that I call .values, otherwise, the values will be misaligned when assigning back.
Write your result back to CSV
For more information on IntervalIndex, take a look at this SO Q/A - Finding matching interval(s) in pandas IntervalIndex
Note that IntervalIndex is new in v0.20, so if you have an older version, make sure you update your version with
pip install --upgrade pandas
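If your ranges are contiguous, a hedged alternative sketch (not from the original answer) is pd.merge_asof, which matches each mileage against the nearest Mileage(Min) at or below it and then filters on the upper bound -
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
# both sides must be sorted on their join keys for merge_asof
out = pd.merge_asof(df0.sort_values('Mileage'),
                    df1.sort_values('Mileage(Min)'),
                    left_on='Mileage', right_on='Mileage(Min)')
# keep only matches that also respect the interval's upper bound
out = out[out['Mileage'] <= out['Mileage(Max)']][['Car', 'Mileage', 'Score']]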
You can use IntervalIndex, new in version 0.20.0+:
First create DataFrames by read_csv:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
Create IntervalIndex by from_arrays:
s = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], closed='both')
print (s)
IntervalIndex([[1, 3], [4, 6], [7, 9], [10, 12], [13, 15]],
              closed='both',
              dtype='interval[int64]')
Select the Score values by the IntervalIndex and set them on the new column via the array created by values, because otherwise the indices are not aligned and you get:
TypeError: incompatible index of inserted column with frame index
df1['Score'] = df2.set_index(s).loc[df1['Mileage'], 'Score'].values
print (df1)
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
And finally write to a file with to_csv:
df1.to_csv('file3.csv', index=False)
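One caveat, as a hedged aside: if some Mileage value falls outside every interval, set_index(s).loc raises a KeyError. IntervalIndex.get_indexer returns -1 for such misses instead, which you can turn into NaN:
pos = s.get_indexer(df1['Mileage'])
# a position of -1 means "no interval matched"; mask those out
df1['Score'] = pd.Series(df2['Score'].values[pos], index=df1.index).where(pos != -1)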
Setup
data = [(1,1,3), (2,4,6), (3,7,9), (4,10,12), (5,13,15)]
df = pd.DataFrame(data, columns=['Score','MMin','MMax'])
car_data = [('A', 8), ('B', 6), ('C', 10)]
car = pd.DataFrame(car_data, columns=['Car','Mileage'])
def find_score(x, df):
    # linear scan over the score ranges; -99 is a sentinel for "no match"
    result = -99
    for idx, row in df.iterrows():
        if row.MMin <= x <= row.MMax:
            result = row.Score
    return result
car['Score'] = car.Mileage.apply(lambda x: find_score(x, df))
Which yields
In [58]: car
Out[58]:
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
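The row-wise apply above scans every range for every car, so it is quadratic. As a hedged sketch, the same lookup can be vectorised with numpy broadcasting, assuming each mileage falls in exactly one range:
import numpy as np

m = car['Mileage'].values
# boolean matrix: one row per car, one column per score range
in_range = (m[:, None] >= df['MMin'].values) & (m[:, None] <= df['MMax'].values)
# argmax picks the first (and only) matching range per row
car['Score'] = df['Score'].values[in_range.argmax(axis=1)]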
Related
I have the below df built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick is to create a new ordering column:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
0 9
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==53], df[~(df["Year week"]==53)]])
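Since the question mentions trying reindex, here is a hedged sketch of that route applied to the original weeks frame, assuming the index labels are unique:
# put week 53 first and keep the rest in their existing order
# (if weeks are stored as strings, use '53' instead)
new_order = [53] + [w for w in df.index if w != 53]
df = df.reindex(new_order)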
I'm trying to get a random sample from two pandas frames. If rows 2, 5, 8 are randomly selected in frame A, then the same rows 2, 5, 8 must be selected from frame B. I did it by first generating a random sample, and now I want to use this sample as row indices for both frames. How can I do it? The code should look like
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error.
Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
An alternative consistent sampling methodology would be to set a random seed, and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
The sampled DataFrames will have the same index, provided A and B have the same length and the same index to begin with.
Hope this helps!
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two data frames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create a selection_index, which is a list of random rows of df1, using pandas' built-in function sample. Then I use this index to locate those rows in df1 and df2 with the help of another built-in function, loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
In your case, this would become something like
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])
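A hedged numpy-based variant, assuming X_train and y_train are DataFrames aligned row-for-row: draw positional indices once and reuse them on both frames with iloc.
import numpy as np

np.random.seed(42)  # seed for reproducibility
pos = np.random.choice(len(X_train), size=5000, replace=False)
lgstc_reg[i].fit(X_train.iloc[pos], y_train.iloc[pos])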
I have two dataframes. In the first one I have the customers and a column with a list of every restaurant he/she visited.
In [1]: df_customers
Out[1]:
Document Restaurants
0 '000000984 [20504916171, 20504916171, 20499859164]
1 '000010076 [20505918674, 20505918674, 20505918674]
2 '000010319 [20253346711, 20524403863, 20508246677]
3 '000018468 [20253346711, 20538456226, 20505918674]
4 '000024409 [20553255881, 20553596441, 20553255881]
5 '000025944 [20492255719, 20600654226]
6 '000031162 [20600351398, 20408462399, 20499859164]
7 '000055177 [20524403863, 20524403863]
8 '000058303 [20600997239, 20524403863, 20600997239]
9 '000074791 [20517920178, 20517920178, 20517920178]
In my other dataframe I have a column with the restaurants and another with a given value for each
In [2]: df_rest
Out [2]:
Restaurant Points
0 10026575473 1
1 10037003331 1
2 10072208299 1
3 10179698400 2
4 10214262750 1
I need to create a column in my customers dataframe with the sum of the points given to each restaurant he/she visited.
I tried something like this:
df_customers["Sum"]=df_rest.loc[df_rest["Restaurant"].isin(df_customers["Restaurants"]),"Points"].sum()
But I'm getting this error:
TypeError: unhashable type: 'list'
I'm trying not to iterate over my customers dataframe; it takes too long. Any help?
Avoid using lists inside pandas series; lists rule out vectorised operations. It is more efficient to expand your jagged array of restaurant lists into a single dataframe, then map to points via a dictionary and sum.
Here's a minimal example:
df1 = pd.DataFrame({'Document': [1, 2],
'Restaurants': [[20504916171, 20504916171, 20499859164],
[20505918674, 20505918674]]})
df2 = pd.DataFrame({'Restaurant': [20504916171, 20504916171, 20499859164,
20505918674, 20505918674],
'Points': [1, 2, 1, 3, 2]})
ratmap = df2.set_index('Restaurant')['Points'].to_dict()
df1['score'] = pd.DataFrame(df1['Restaurants'].values.tolist())\
.applymap(ratmap.get).fillna(0).sum(1).astype(int)
print(df1)
Document Restaurants score
0 1 [20504916171, 20504916171, 20499859164] 5
1 2 [20505918674, 20505918674] 4
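As a hedged aside: in pandas 0.25+, Series.explode does the expansion in one step and avoids building the intermediate wide frame -
# explode repeats the original index once per list element,
# so grouping by level=0 sums the points back per customer
df1['score'] = (df1['Restaurants']
                .explode()
                .map(ratmap)
                .fillna(0)
                .groupby(level=0)
                .sum()
                .astype(int))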
I would first expand the df into:
d = {c: df_customers[c].values.repeat(df_customers.Restaurants.str.len(), axis=0) for c in df_customers.columns}
d['Restaurants'] = [i for sub in df_customers.Restaurants for i in sub]
df3 = pd.DataFrame(d)
Document Restaurants
0 000000984 20504916171
1 000000984 20504916171
2 000000984 20499859164
3 000010076 20505918674
4 000010076 20505918674
5 000010076 20505918674
6 000010319 20253346711
7 000010319 20524403863
Then map
df3['Point'] = df3.Restaurants.map(df_rest.set_index('Restaurant').Points).fillna(0)
Document Restaurants Point
0 000000984 20504916171 1
1 000000984 20504916171 1
2 000000984 20499859164 0
3 000010076 20505918674 0
4 000010076 20505918674 0
5 000010076 20505918674 0
Then groupby document and sum
df3.groupby('Document').sum()
Restaurants Point
Document
000000984 61509691506 2.0
000010076 61517756022 0.0
000010319 61285997251 0.0
000018468 61297721611 0.0
Values are mocked, because no restaurant id from your df_customers is present in your df_rest in the example you provided.
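To finish, a hedged final step that writes the sums back onto df_customers, assuming Document uniquely identifies each customer -
# sum points per customer, then align back on Document
sums = df3.groupby('Document')['Point'].sum()
df_customers['Sum'] = df_customers['Document'].map(sums).fillna(0)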
I have 2 data frames; sample output is in the screenshot.
My code for reading them and formatting the date column is below.
First df:
csv_data_df = pd.read_csv(os.path.join(path_to_data+'\\Data\\',appendedfile))
csv_data_df['Date_Formatted'] = pd.to_datetime(csv_data_df['DATE']).dt.strftime('%Y-%m-%d')
csv_data_df.head(3)
Second df:
new_Data_df = pd.read_csv(io.StringIO(response.decode('utf-8')))
new_Data_df['Date_Formatted'] = pd.to_datetime(new_Data_df['DATE']).dt.strftime('%Y-%m-%d')
new_Data_df.head(3)
I want to construct a third dataframe where only the rows with non-matching dates from the second dataframe go into the third one.
Is there any method to do that? The date-formatted column can be seen in the screenshot.
You could set the index of both dataframes to your desired join column, then
use df1.combine_first(df2). For your specific example here, that could look like the line below.
csv_data_df.set_index('Date_Formatted').combine_first(new_Data_df.set_index('Date_Formatted')).reset_index()
Ex:
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'), index=list(range(1, 6)))
df2 = pd.DataFrame(np.random.randn(8, 3), columns=list('abc'))
df
Out[10]:
a b c
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
df2
Out[11]:
a b c
0 1.732251 -1.977803 0.720292
1 0.048229 1.125277 1.016083
2 -1.684013 2.136061 0.553824
3 -0.022957 1.237249 0.236923
4 -0.998079 1.714126 1.291391
5 0.955464 -0.049673 1.629146
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
df.combine_first(df2)
Out[13]:
a b c
0 1.732251 -1.977803 0.720292
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
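If you literally want a third dataframe holding only the rows of the second whose dates never appear in the first, a hedged isin-based sketch -
# keep rows of new_Data_df whose Date_Formatted is absent from csv_data_df
mask = ~new_Data_df['Date_Formatted'].isin(csv_data_df['Date_Formatted'])
third_df = new_Data_df[mask]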
I have written code to iterate through a dataset that has a demarcation column. This column consists of a value shared by all equally demarked rows. The code iterates through each demarcated section, with a nested loop over each line, finding the nearest neighbor for each row within its respective demarcated block.
import pandas as pd
import numpy as np
Create a df with XYZ and Section demark
p=5
df = pd.DataFrame(np.random.randn(100, 3), columns=list('XYZ'))
df2 = df.sort_values('Z')
df2 = df2.reset_index(drop=True)
df2['Section_demark'] = (df2.index/p).astype('int')
df2.head(15)
X Y Z Section_demark
0 -1.125526 -0.249091 -2.505444 0
1 0.710114 1.357477 -2.195904 0
2 -0.580319 -0.997311 -2.031280 0
3 1.311526 -0.268590 -1.741079 0
4 0.481450 0.448904 -1.546278 0
5 -1.820224 -0.846628 -1.392700 1
6 0.528618 0.418862 -1.388170 1
7 0.360560 -0.309429 -1.319548 1
8 -0.369107 -1.290528 -1.233815 1
9 0.139063 0.045076 -1.209820 1
10 0.049387 1.087300 -1.188375 2
11 0.678247 -1.191882 -1.172214 2
12 -0.976294 -0.752081 -1.092286 2
13 0.875952 0.319304 -1.079185 2
14 0.469730 -0.329548 -1.044178 2
Function for euclidean distance
def eucl_d(item_id):
    # squared euclidean distance from row `item_id` to every row of df3
    a = df3.sub(df3.iloc[item_id], axis=1)
    b = np.sum(np.square(a), axis=1)
    return b
Iterate through the section demarks, iterate through the lines in each Section_demark, and find the nearest neighbor.
Isolate the row nearest to the top row and create a series, take the index label from that series, and compile a list from it.
Read the list back into df2, creating a new column with the nearest-neighbor index number as the value:
s = 0
elements = []
while s < (len(df2) / p):
    df3 = df2[df2['Section_demark'] == s]
    r = 0
    while r < p:
        df4 = df3.copy()
        df4['dist'] = eucl_d(r)
        df4 = df4.sort_values('dist')
        ser = df4.iloc[1]  # iloc[0] is the row itself, at distance 0
        elements.append(ser.name)
        r = r + 1
    s = s + 1
df2["NNIX"] = elements
df2.head(10)
X1 Y1 Z1 NNIX
0 0.002299 1.284195 -1.604009 1
1 -0.444305 0.346856 -2.396538 0
2 -0.490741 -1.416682 -1.423573 3
3 0.203635 -0.676841 -1.596332 2
4 0.002299 1.284195 -1.604009 1
5 -0.314330 0.036554 -1.153127 6
6 -0.387839 0.129000 -1.235331 5
7 -0.314330 0.036554 -1.153127 6
8 -0.059477 -0.205260 -1.136376 7
9 0.717980 0.130665 -1.040372 8
I would like to exchange the last section of iteration with a groupby command and use aggregate or apply to run the eucl_d function, but it eludes me
I can get df2 grouped by running this:
grouped = df3.groupby('Section_demark')
It's the second step that is giving me trouble.
I was thinking:
grouped.agg(eucl_d(item_id))
But I don't know how to specify the item_id for eucl_d(item_id).
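A hedged sketch of that groupby route, assuming the goal is the index of each row's nearest neighbour (excluding itself) within its Section_demark, computed on the X, Y, Z columns only:
import numpy as np
import pandas as pd

def nearest_idx(group):
    # pairwise squared euclidean distances within one group
    coords = group[['X', 'Y', 'Z']].values
    d = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # stop each row matching itself
    return pd.Series(group.index[d.argmin(axis=1)], index=group.index)

df2['NNIX'] = df2.groupby('Section_demark', group_keys=False).apply(nearest_idx)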