Iterating through multiple dataframes pandas - python

I have two dataframes:
1) Contains a list of suppliers and their lat/long coordinates:
sup_essential = pd.DataFrame({'supplier': ['A', 'B', 'C'],
                              'coords': [(51.1235, -0.3453), (52.1245, -0.3423), (53.1235, -1.4553)]})
2) A list of stores and their lat/long coordinates:
stores_essential = pd.DataFrame({'storekey': [1, 2, 3],
                                 'coords': [(54.1235, -0.6553), (49.1245, -1.3423), (50.1235, -1.8553)]})
I want to create an output table that has: store, store_coordinates, supplier, supplier_coordinates, distance for every combination of store and supplier.
I currently have:
test = []
for row in sup_essential.iterrows():
    for row in stores_essential.iterrows():
        r = sup_essential['supplier'], stores_essential['storeKey']
        test.append(r)
But this just gives me repeats of all the values.
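That happens because both loops bind the same name row (the inner loop overwrites the outer one), and the loop body never uses row at all: it appends the entire supplier and storekey columns on every iteration. A minimal sketch of a corrected loop (note the storeKey/storekey capitalization fix; itertuples is used for convenience):

test = []
for sup_row in sup_essential.itertuples():
    for store_row in stores_essential.itertuples():
        # one (supplier, storekey) pair per combination
        test.append((sup_row.supplier, store_row.storekey))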

Source DFs
In [105]: sup
Out[105]:
coords supplier
0 (51.1235, -0.3453) A
1 (52.1245, -0.3423) B
2 (53.1235, -1.4553) C
In [106]: stores
Out[106]:
coords storekey
0 (54.1235, -0.6553) 1
1 (49.1245, -1.3423) 2
2 (50.1235, -1.8553) 3
Solutions:
import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

# cross join: give both frames a constant key, merge on it, then drop it
m = pd.merge(sup.assign(x=0), stores.assign(x=0), on='x', suffixes=['1', '2']).drop(columns='x')

# split the (lat, lon) tuples into separate numeric columns
d1 = sup[['coords']].assign(lat=sup.coords.str[0], lon=sup.coords.str[1]).drop(columns='coords')
d2 = stores[['coords']].assign(lat=stores.coords.str[0], lon=stores.coords.str[1]).drop(columns='coords')

# haversine expects radians and returns unit-sphere distances,
# so multiply by the Earth's radius (~6367 km) to get kilometres
m['dist_km'] = np.ravel(dist.pairwise(np.radians(d1), np.radians(d2)) * 6367)
Result:
In [135]: m
Out[135]:
coords1 supplier coords2 storekey dist_km
0 (51.1235, -0.3453) A (54.1235, -0.6553) 1 334.029670
1 (51.1235, -0.3453) A (49.1245, -1.3423) 2 233.213416
2 (51.1235, -0.3453) A (50.1235, -1.8553) 3 153.880680
3 (52.1245, -0.3423) B (54.1235, -0.6553) 1 223.116901
4 (52.1245, -0.3423) B (49.1245, -1.3423) 2 340.738587
5 (52.1245, -0.3423) B (50.1235, -1.8553) 3 246.116984
6 (53.1235, -1.4553) C (54.1235, -0.6553) 1 122.997130
7 (53.1235, -1.4553) C (49.1245, -1.3423) 2 444.459052
8 (53.1235, -1.4553) C (50.1235, -1.8553) 3 334.514028
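As an aside, on pandas 1.2 or newer the dummy x column is unnecessary, since merge supports a cross join directly:

m = pd.merge(sup, stores, how='cross', suffixes=['1', '2'])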

Related

Creating matrix of 0 and 1 from a string vector in R or python

I want to create a matrix of 0 and 1 from a vector where each string contains the two names I want to map to the matrix. For example, if I have the following vector
vector_matrix <- c("A_B", "A_C", "B_C", "B_D", "C_D")
I would like to transform it into the following matrix
  A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
I am open to any suggestion, but it is better if there is some built-in function that can deal with it. I am trying to do a very similar thing, but at a scale where I will generate a matrix of 25 million cells.
I'd prefer the code to be in R, but a pythonic solution is fine too :)
Edit:
So when I say "A_B", I want a "1" in row A column B. It doesn't matter if it is the contrary (column A row B).
Edit:
I would like to have a matrix where its rownames and colnames are the letters.
Create a two-column data frame d from the data, compute the levels, convert each column of d to a factor with those levels, and finally run table. The second line sorts within each row; it isn't actually needed for the input shown, so it could be omitted, but you might need it for other data if B_A is to be regarded as the same as A_B.
d <- read.table(text = vector_matrix, sep = "_")
d[] <- t(apply(d, 1, sort))
tab <- table( lapply(d, factor, levels = levels(factor(unlist(d)))) )
tab
giving this table:
V2
V1 A B C D
A 0 1 1 0
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0
# quick visual check of the result
heatmap(tab[nrow(tab):1, ], NA, NA, col = 2:3, symm = TRUE)

# or view it as an undirected graph
library(igraph)
g <- graph_from_adjacency_matrix(tab, mode = "undirected")
plot(g)
The following should work in Python. It splits the input data into two lists, converts the characters to indices, and sets those positions of a matrix to 1.
import numpy as np
vector_matrix = ("A_B", "A_C", "B_C", "B_D", "C_D")
# Split data in two lists
rows, cols = zip(*(s.split("_") for s in vector_matrix))
print(rows, cols)
>>> ('A', 'A', 'B', 'B', 'C') ('B', 'C', 'C', 'D', 'D')
# With inspiration from: https://stackoverflow.com/a/5706787/10603874
row_idxs = np.array([ord(char) - 65 for char in rows])  # 65 == ord('A')
col_idxs = np.array([ord(char) - 65 for char in cols])
print(row_idxs, col_idxs)
>>> [0 0 1 1 2] [1 2 2 3 3]
n_rows = row_idxs.max() + 1
n_cols = col_idxs.max() + 1
print(n_rows, n_cols)
>>> 3 4
mat = np.zeros((n_rows, n_cols), dtype=int)
mat[row_idxs, col_idxs] = 1
print(mat)
>>>
[[0 1 1 0]
[0 0 1 1]
[0 0 0 1]]
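Note that the result is 3x4 because no pair starts with D, while the question also asks for letter row and column names. A sketch of one way to get the square, labelled matrix (assuming pandas is acceptable here):

import numpy as np
import pandas as pd

vector_matrix = ("A_B", "A_C", "B_C", "B_D", "C_D")
rows, cols = zip(*(s.split("_") for s in vector_matrix))

# collect every letter that appears on either side
labels = sorted(set(rows) | set(cols))  # ['A', 'B', 'C', 'D']
idx = {c: i for i, c in enumerate(labels)}

mat = np.zeros((len(labels), len(labels)), dtype=int)
mat[[idx[r] for r in rows], [idx[c] for c in cols]] = 1

# wrap in a DataFrame to get the labelled rows/columns
print(pd.DataFrame(mat, index=labels, columns=labels))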

Merging python Data frame and Nested List

I am unable to combine/merge/cross-join a DataFrame and a nested list with a where-condition (if the nearest zip from the nested list is equal to the actual zip, do not show it in the nearest-zip field) to get to the desired output.
The code I have so far:
x = 0
print(test_df)
print(type(test_df))
for x in range(5):
    nearest_result = search.by_coordinates(test_df.iloc[x, 1], test_df.iloc[x, 2], radius=30, returns=3)
    n_zip = [res.zipcode for res in nearest_result]
    print(n_zip)
    print(type(n_zip))
The dataframe and nested list:
Desired Output:
Maybe a simpler approach can be proposed, but as a first shot, initially dropping 'NEAREST_ZIP':
>>> print(test_df)  # /!\ dropped 'NEAREST_ZIP'
ID BEGIN_LAT BEGIN_LON ZIP_CODE
0 0 30.9958 -87.2388 36441
1 1 42.5589 -92.5000 50613
2 2 42.6800 -91.9000 50662
3 3 37.0800 -97.8800 67018
4 4 37.8200 -96.8200 67042
>>> # used nzip:
>>> nzip = [[36441, 32535, 36426],
[50613, 50624, 50613], # I guess there was a typo in your code here
[50662, 50641, 50671],
[67018, 67003, 67049],
[67042, 67144, 67074]]
>>> # build a `closest` dataframe:
>>> closest = pd.DataFrame(data={k: (v1, v2) for k, v1, v2 in nzip}).T.stack().reset_index().drop(columns=['level_1'])
>>> closest.columns = ['ZIP_CODE', 'NEAREST_ZIP']
>>> # merging
>>> test_df.merge(closest)
ID BEGIN_LAT BEGIN_LON ZIP_CODE NEAREST_ZIP
0 0 30.9958 -87.2388 36441 32535
1 0 30.9958 -87.2388 36441 36426
2 1 42.5589 -92.5000 50613 50624
3 1 42.5589 -92.5000 50613 50613
4 2 42.6800 -91.9000 50662 50641
5 2 42.6800 -91.9000 50662 50671
6 3 37.0800 -97.8800 67018 67003
7 3 37.0800 -97.8800 67018 67049
8 4 37.8200 -96.8200 67042 67144
9 4 37.8200 -96.8200 67042 67074
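Note that the output above still contains the 50613 -> 50613 row; to apply the question's where-condition (hide a nearest zip equal to the actual zip), one final filter should do:

>>> test_df.merge(closest).query('ZIP_CODE != NEAREST_ZIP')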

Python: dynamic column sum for each row

I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3) and a column titled 'input' (6 columns in total) and n rows. For each row, I want the index n of the last column for which INPUT - (X1 + X2 + ... + Xn) >= 0 is still true (0 if it never is).
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5) {
  data[, i] <- tmp$input - rowSums(tmp[, 3:i])
}
output <- apply(data[, 3:5], 1, function(x) max(which(x > 0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1  ID2  X1  X2  X3  INPUT  OUTPUT  (explanation)
a    b    1   2   3   3      2       (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6 ... after 2 sums, input < sums)
a1   a2   5   2   1   4      0       (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8 ... even for 1 sum, input < sums)
a2   b2   0   4   5   100    3       (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9 ... even after 3 sums, input > sums)
You can use the pandas module, which handles this very effectively in Python.
import pandas as pd

# Taking the sample data from the question
df = pd.DataFrame([
    ['a', 'b', 1, 2, 3, 3],
    ['a1', 'a2', 5, 2, 1, 4],
    ['a2', 'b2', 0, 4, 5, 100]],
    columns=['ID1', 'ID2', 'X1', 'X2', 'X3', 'INPUT'])

# Running sums across the X columns; counting how many are still <= INPUT
# gives the last such column (this assumes the X columns are non-negative,
# so the running sum never decreases)
cums = df[['X1', 'X2', 'X3']].cumsum(axis=1)
df['OUTPUT'] = cums.le(df['INPUT'], axis=0).sum(axis=1)
# df['OUTPUT'] -> 2, 0, 3 as in the question

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
Sample input:
user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0
Desired output:
user  longest_run
A     2
B     1
mydata <- mydata[order(mydata$user, mydata$day), ]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i])  # run length encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0  # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage), ]
This works in R, but I want to do the same thing in Python, and I'm totally stumped.
First group by the user column, by usage, and by a helper Series that labels runs of consecutive values, then take the group sizes:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
longest_run
user val usage
A 0 1 2
1 2 1
B 0 3 1
5 1
1 4 1
C 1 6 1
Then keep only the zero rows, take the per-user max, and reindex to bring back users without any zero runs:
df2 = (df1.query('val == 0')
.max(level=0)
.reindex(df['user'].unique(), fill_value=0)
.reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a series:
def max0(sr):
    # (sr != 0).cumsum() stays constant within each run of zeros, so the most
    # frequent value marks the longest zero run; subtract 1 for the nonzero
    # element that started the run, unless the run is at the very start of
    # the series (cumsum value 0)
    return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
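On Python 3.4+, the empty-list check in consecutive_zero can be folded into max with a default value:

def consecutive_zero(data):
    # max() with default avoids materializing the list and the explicit check
    return max((len_iter(run) for val, run in groupby(data) if val == 0), default=0)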
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23

pandas convert text feature to numeric value

I can convert all text features in a pandas DataFrame by casting to 'category' using the df.astype() method, as below. However, I find category hard to work with (e.g. for plotting data) and would prefer to create a new column of integers.
# convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')
I can convert the text to integers using this hack:
# convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])
new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col] = new_list
for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val
But is there a better (or existing) way to do the same?
I would use the factorize method, which is designed for this particular task:
In [90]: x
Out[90]:
A B
9 c z
10 c z
4 b x
5 b y
1 a w
7 b z
In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
A B
9 2 3
10 2 3
4 1 1
5 1 2
1 0 0
7 1 3
or:
In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
A B
9 0 0
10 0 0
4 1 1
5 1 2
1 2 3
7 1 0
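If you also need to map the codes back to the original labels, note that pd.factorize returns the uniques alongside the codes:

codes, uniques = pd.factorize(x['A'], sort=True)
# uniques[codes] reconstructs the original column
x['A_int'] = codes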
Consider df:
df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))
df
You can convert to integers like this:
import numpy as np

def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))

or with a shorter version:
def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})

df.apply(intify)
Or in a single line
df.apply(lambda s: s.map({k:i for i,k in enumerate(s.unique())}))
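Finally, note that the category dtype the question starts from already stores integer codes internally, so another option is to pull them out directly via .cat.codes:

df['A_int'] = df['A'].astype('category').cat.codes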
