I am unable to combine/merge/cross-join a DataFrame and a nested list with a where-condition (if the nearest zip from the nested list equals the actual zip, do not show it in the nearest-zip field) to get to the desired output.
The code I have so far:
print(test_df)
print(type(test_df))
# `search` is assumed to be a uszipcode SearchEngine instance
for x in range(5):
    nearest_result = search.by_coordinates(test_df.iloc[x, 1], test_df.iloc[x, 2], radius=30, returns=3)
    n_zip = [res.zipcode for res in nearest_result]
    print(n_zip)
    print(type(n_zip))
The dataframe and nested list:
Desired Output:
Maybe a simpler approach can be proposed, but as a first shot, initially drop 'NEAREST_ZIP':
>>> print(test_df)  # /!\ dropped 'NEAREST_ZIP'
ID BEGIN_LAT BEGIN_LON ZIP_CODE
0 0 30.9958 -87.2388 36441
1 1 42.5589 -92.5000 50613
2 2 42.6800 -91.9000 50662
3 3 37.0800 -97.8800 67018
4 4 37.8200 -96.8200 67042
>>> # used nzip:
>>> nzip = [[36441, 32535, 36426],
...         [50613, 50624, 50613],  # I guess there was a typo in your code here
...         [50662, 50641, 50671],
...         [67018, 67003, 67049],
...         [67042, 67144, 67074]]
>>> # build a `closest` dataframe:
>>> closest = pd.DataFrame(data={k: (v1, v2) for k, v1, v2 in nzip}).T.stack().reset_index().drop(columns=['level_1'])
>>> closest.columns = ['ZIP_CODE', 'NEAREST_ZIP']
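>>> # for clarity, the intermediate `closest` frame pairs each ZIP_CODE
>>> # with its candidate nearest zips:
>>> closest
   ZIP_CODE  NEAREST_ZIP
0     36441        32535
1     36441        36426
2     50613        50624
3     50613        50613
4     50662        50641
5     50662        50671
6     67018        67003
7     67018        67049
8     67042        67144
9     67042        67074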
>>> # merging
>>> test_df.merge(closest)
ID BEGIN_LAT BEGIN_LON ZIP_CODE NEAREST_ZIP
0 0 30.9958 -87.2388 36441 32535
1 0 30.9958 -87.2388 36441 36426
2 1 42.5589 -92.5000 50613 50624
3 1 42.5589 -92.5000 50613 50613
4 2 42.6800 -91.9000 50662 50641
5 2 42.6800 -91.9000 50662 50671
6 3 37.0800 -97.8800 67018 67003
7 3 37.0800 -97.8800 67018 67049
8 4 37.8200 -96.8200 67042 67144
9 4 37.8200 -96.8200 67042 67074
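To also apply the where-condition from the question (do not show a nearest zip that equals the actual zip), a boolean filter on the merged frame should do; this sketch assumes such rows are dropped entirely rather than blanked:
>>> merged = test_df.merge(closest)
>>> merged[merged['ZIP_CODE'] != merged['NEAREST_ZIP']].reset_index(drop=True)
   ID  BEGIN_LAT  BEGIN_LON  ZIP_CODE  NEAREST_ZIP
0   0    30.9958   -87.2388     36441        32535
1   0    30.9958   -87.2388     36441        36426
2   1    42.5589   -92.5000     50613        50624
3   2    42.6800   -91.9000     50662        50641
4   2    42.6800   -91.9000     50662        50671
5   3    37.0800   -97.8800     67018        67003
6   3    37.0800   -97.8800     67018        67049
7   4    37.8200   -96.8200     67042        67144
8   4    37.8200   -96.8200     67042        67074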
I am new to programming and Python. Lately, I have been learning to use pandas.
What I would like to know
I am wondering what would be the best approach to work only on the numbers related to Group II (in the attached DataFrame). For example, summing all the grades in the 'Project' column for Group II. Sure, it doesn't make much sense to sum grades, but the data is just for illustration purposes.
I'd be grateful for any advice and suggestions.
My DataFrame
The code attached will generate random numbers (except for the 'Group' column), but the DataFrame will always look like this:
Name Album Group Colloquium_1 Colloquium_2 Project
# 0 B 61738 I 5 4 5
# 1 Z 44071 I 5 5 2
# 2 M 87060 I 5 5 5
# 3 L 67974 I 3 5 3
# 4 Z 15617 I 3 2 3
# 5 Z 91872 II 2 4 5
# 6 H 84685 II 4 2 5
# 7 T 17943 II 2 5 2
# 8 L 54302 II 2 5 3
# 9 O 53433 II 5 4 5
Code to generate my DataFrame:
import pandas as pd
import random as rd
def gen_num():
    num = ""
    for i in range(5):
        num += str(rd.randint(0, 9))
    return num
names = ['A','B','C','D','E','F','G','H','I','J','K', 'L','M','N','O', \
'P','R','S','T','W','Z']
list_names = []
list_album = []
list_group = []
list_coll_1 = []
list_coll_2 = []
list_project = []
num_of_students = 10
for i in range(num_of_students):
    list_names.append(rd.choice(names))
    list_album.append(gen_num())
    list_coll_1.append(rd.randint(2, 5))
    list_coll_2.append(rd.randint(2, 5))
    list_project.append(rd.randint(2, 5))
    if i < (num_of_students / 2):
        list_group.append('I')
    else:
        list_group.append('II')
group = pd.DataFrame(list_names)
group = group.set_axis(['Name'], axis=1)  # inplace= was removed from set_axis in pandas 2.0
group['Album'] = list_album
group['Group'] = list_group
group['Colloquium_1'] = list_coll_1
group['Colloquium_2'] = list_coll_2
group['Project'] = list_project
One solution to this is to filter the DataFrame first:
group[group["Group"] == "II"]["Project"].sum()
#Out: 18
Breaking this up into parts:
First, this part returns a Series of booleans (True/False), one per row, indicating whether the value in "Group" equals "II":
group["Group"] == "II"
#0 False
#1 False
#2 False
#3 False
#4 False
#5 True
#6 True
#7 True
#8 True
#9 True
#Name: Group, dtype: bool
Next, passing this Series into group[...] returns a filtered DataFrame containing only the rows that are True:
group[group["Group"] == "II"]
# Name Album Group Colloquium_1 Colloquium_2 Project
#5 E 77371 II 4 5 3
#6 N 90525 II 4 3 3
#7 H 89889 II 3 4 5
#8 T 88154 II 3 4 5
#9 E 56176 II 3 2 2
Using ["Project"] on the end returns a pandas Series of the values in the column:
group[group["Group"] == "II"]["Project"]
#5 3
#6 3
#7 5
#8 5
#9 2
#Name: Project, dtype: int64
And lastly .sum() returns the sum of the series (18).
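As an aside, the same filter-then-select can be written in a single step with .loc, which avoids chained indexing:
group.loc[group["Group"] == "II", "Project"].sum()
#Out: 18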
You can use the DataFrame.groupby function to analyze data from one or many columns based on "groups" defined in other columns.
For example, something like
group.groupby('Group')['Project'].sum()
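With the sample DataFrame shown in the question, this returns one sum per group (the exact values change on every run since the grades are random):
# Group
# I     18
# II    20
# Name: Project, dtype: int64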
Or, you could use masking if you only want the result for one group:
group[group['Group']=='II']['Project'].sum()
If you're only working with the Group II data, it may be best to assign the filtered frame to a new variable:
df_ii = group[group.Group == 'II']
Next you could sum the 'Project' grades:
df_ii.Project.sum()
Or, if summing doesn't make sense, you could take the average:
df_ii.Project.mean()
I have a parking lot with cars of different models (nr), and the cars are so closely packed that for one to get out, some others may need to be moved first. It is a little like a 15-puzzle, except that I can take one or more cars out of the parking lot. Ordered_car_List contains the cars that will be picked up today; they need to be taken out of the lot while moving as few non-ordered cars as possible. There are more columns in this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the pandas way :-)
I have this:
import pandas as pd

cars = pd.DataFrame({'x': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                     'y': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                     'order_number': [6, 6, 7, 6, 7, 9, 9, 10, 12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6, 9, 9, 10, 28]

i = 0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i += 1
If I use cars.apply(lambda ...), how can I change Ordered_car_List in each iteration?
Is there another approach that I can take?
I found this page, and it made me want to go faster. The lambda approach is in the middle of the pack when it comes to speed, but it is still so much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
from collections import Counter

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
# order_number cumcount maxcount
# 0 6 1 1
# 1 6 2 1
# 2 7 1 0
# 3 6 3 1
# 4 7 2 0
# 5 9 1 2
# 6 9 2 2
# 7 10 1 1
# 8 12 1 0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
# x y order_number nodup
# 0 1 1 6 6.0
# 1 1 2 6 NaN
# 2 1 3 7 NaN
# 3 1 4 6 NaN
# 4 1 5 7 NaN
# 5 2 1 9 9.0
# 6 2 2 9 9.0
# 7 2 3 10 10.0
# 8 2 4 12 NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Timings
Note that your loop is still very fast with small data, but the vectorized counter approach scales much better.
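As a rough sketch of how such a comparison could be timed (the data sizes and the timeit setup here are my own assumptions, not the original benchmark):
import timeit
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical benchmark data; sizes are assumptions for illustration.
big = pd.DataFrame({'order_number': np.random.randint(0, 1000, size=100_000)})
orders = list(np.random.randint(0, 1000, size=5_000))

def vectorized():
    cumcount = big.groupby('order_number').cumcount().add(1)
    maxcount = big['order_number'].map(Counter(orders))
    return big['order_number'].where(cumcount <= maxcount)

print(timeit.timeit(vectorized, number=10))  # total seconds for 10 runs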
I have a triple for loop: the two outer loops print a pair of numbers starting at 0 0 and going up to 2 2, and the innermost loop counts from 0 to 8. The code looks as follows:
for N in range(0, 3):
    for K in range(0, 3):
        print(N, K)
        for P in range(0, 9):
            print(P)
If you run this code you get the obvious output:
0 0
0
1
2
3
4
5
6
7
8
0 1
0
1
2
3
4
5
6
7
8
0 2
0
1
2
3
4
5
6
7
8
...
And so on. Instead of the full 0-to-8 printout after each N K line, I want something that looks like:
0 0
0
0 1
1
0 2
2
1 0
3
1 1
4
1 2
5
2 0
6
2 1
7
2 2
8
My first guess was an if statement that said:
if P == Q:
    break
where Q was various sums, or even the N, K pair. However, I couldn't figure out the best way to get my wanted output. I do think an if statement is the best way to achieve my wanted result, but I'm not quite sure how to approach it. P is necessary for the rest of my code, as it will be used in some subplots.
As this is just an increment by one at each print, you can simply compute the index as N * 3 + K:
for N in range(0, 3):
    for K in range(0, 3):
        print(N, K)
        print(N * 3 + K)
You can use zip to traverse two iterables in parallel. In this case, one of the iterables is the sequence of pairs produced by the nested loops, which can be generated with itertools.product, as follows:
import itertools

for (N, K), P in zip(itertools.product(range(3), range(3)), range(9)):
    print(N, K)
    print(P)
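As a side note, since P simply numbers the pairs in order, enumerate can produce it directly instead of zipping with a separate range(9); a small sketch:
import itertools

# enumerate numbers the (N, K) pairs 0..8 as they are produced
for P, (N, K) in enumerate(itertools.product(range(3), repeat=2)):
    print(N, K)
    print(P)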
I have a dataframe with two columns. I want to know how many characters they have in common. The number of common characters should be a new column. Here's a minimal reproducible example.
What I have:
import pandas as pd
from string import ascii_lowercase
import numpy as np
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T
Out[17]:
col_1 col_2
0 ollcgfmy daeubsrx
1 jtvtqoux xbgtrzno
2 irwmoqqa mdblczfa
3 jyebzpyd xwlynkhw
4 ifuqojvs lxotbsju
5 fybsqbku xwbluaek
6 oylztnpf gelonsay
7 zdkibutk ujlcwhfu
8 uhrcjbsk nhxhpoii
9 eocxreqz muvfwusi
What I need (the numbers are random):
Out[19]:
col_1 col_2 common_letters
0 ollcgfmy daeubsrx 1
1 jtvtqoux xbgtrzno 1
2 irwmoqqa mdblczfa 0
3 jyebzpyd xwlynkhw 3
4 ifuqojvs lxotbsju 3
5 fybsqbku xwbluaek 3
6 oylztnpf gelonsay 3
7 zdkibutk ujlcwhfu 3
8 uhrcjbsk nhxhpoii 1
9 eocxreqz muvfwusi 3
EDIT: to anyone reading this who is trying to compute similarity between two strings, don't use this approach. Other similarity measures exist, such as Levenshtein or Jaccard.
Using df.apply with set operations is one way to solve the problem:
df["common_letters"] = df.apply(
    lambda x: len(set(x["col_1"]).intersection(set(x["col_2"]))),
    axis=1)
output:
col_1 col_2 common_letters
0 cgeabfem amnwfsde 4
1 vozgpmgs slfwvjnv 2
2 xyvktrfr jtzijmud 1
3 piexmmgh ydaxbmyo 2
4 iydpnwcu hhdxyptd 3
If you like sets, you can use inclusion-exclusion, since |A ∩ B| = |A| + |B| - |A ∪ B|:
df['common_letters'] = (df.col_1.apply(set).apply(len)
                        + df.col_2.apply(set).apply(len)
                        - (df.col_1 + df.col_2).apply(set).apply(len))
You can use numpy's bitwise_and, which applies the & operator elementwise; on sets, & computes the intersection:
df["noCommonChars"] = np.bitwise_and(df["col_1"].map(set), df["col_2"].map(set)).str.len()
Output:
col_1 col_2 noCommonChars
0 smuxucyw hywtedvz 2
1 bniuqhkh axcuukjg 2
2 ttzehrtl nbmsmwsc 0
3 ndwyjusu dssmdnvb 3
4 zqvsvych wguthcwu 2
5 jlnpjqgn xgedmodm 1
6 ocjbtnpy lywjqkjf 2
7 tolrpshi hslxxmgo 4
8 ehatmryw fhpvluvq 1
9 icciebte joyiwooi 1
Edit
In order to include repeating characters, you can do:
from collections import Counter
df["common_letters_full"] = np.bitwise_and(df["col_1"].map(Counter), df["col_2"].map(Counter))
df["common_letters"] = df["common_letters_full"].map(dict.values).apply(sum)
# alternatively:
df["common_letters"] = df["common_letters_full"].apply(pd.Series).sum(axis=1)
I already see better answers :D, but here goes a reasonably good one. I probably could have used more from pandas:
I took some code from here
import pandas as pd
from string import ascii_lowercase
import numpy as np
def countPairs(s1, s2):
    n1 = len(s1)
    n2 = len(s2)
    # To store the frequencies of characters
    # of string s1 and s2
    freq1 = [0] * 26
    freq2 = [0] * 26
    # To store the count of valid pairs
    count = 0
    # Update the frequencies of
    # the characters of string s1
    for i in range(n1):
        freq1[ord(s1[i]) - ord('a')] += 1
    # Update the frequencies of
    # the characters of string s2
    for i in range(n2):
        freq2[ord(s2[i]) - ord('a')] += 1
    # Find the count of valid pairs
    for i in range(26):
        count += min(freq1[i], freq2[i])
    return count
# This code is contributed by Ryuga
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T

counts = []
for i in range(0, df.shape[0]):
    counts.append(countPairs(df.iloc[i].col_1, df.iloc[i].col_2))
df["counts"] = counts
col_1 col_2 counts
0 ploatffk dwenjpmc 1
1 gjjupyqg smqtlmzc 1
2 cgtxexho hvwhpyfh 1
3 mifsbfhc ufalhlbi 4
4 qnjesfdn lyhrrnkf 2
5 omnumzmf dagttzqo 2
6 gsygkrrb aocfoqxk 1
7 wrgvruuw ydnlzvyf 1
8 ivkdxoft zmgcnrjr 0
9 vvthbzjj mmirlcvx 1
For the dataframe below, how do I return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below: (1) the sum of all rows is 0, and (2) since there are three "1"s and two "-1"s in the original data, the output includes two "1"s and two "-1"s.
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.
Well, I thought this would take fewer lines (and it probably can), but this does work. First, create a couple of new columns to simplify the later syntax:
>>> import numpy as np
>>> df1['abs_a'] = np.abs(df1['a'])
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
         ones
abs_a a
1     -1    2
       1    3
2     -2    1
       2    2
>>> df3 = df2.groupby(level=0).min()
       ones
abs_a
1         2
2         1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2
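As a side note, here is a sketch of an alternative (my own variation, not part of the answer above) that keeps the asker's original rows and index by matching each value with an available occurrence of its negation:
import pandas as pd

df1 = pd.DataFrame([1, 2, -2, 2, -1, -1, 1, 1], columns=['a'])

# Rank each value's occurrences, then keep a row only while its opposite
# value still has an unmatched occurrence left.
rank = df1.groupby('a').cumcount()
counts = df1['a'].value_counts()
keep = rank < df1['a'].map(lambda v: counts.get(-v, 0))
print(df1[keep])
#    a
# 0  1
# 1  2
# 2 -2
# 4 -1
# 5 -1
# 6  1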