order a matrix with numpy.triu and non-NaN values - python

I have a dataset where I need to do a transformation to get an upper triangular matrix. My matrix has this format:
| | 1 | 2 | 3 |
01/01/1999 | nan | 582.96 | nan |
02/01/1999 | nan | 589.78 | 78.47 |
03/01/1999 | nan | 588.74 | 79.41 |
… | | |
01/01/2022 | 752.14 | 1005.78 | 193.47 |
02/01/2022 | 754.14 | 997.57 | 192.99 |
I use dataframe.T to get my dates as columns, but I also need my rows to be ordered by their non-NaN values, so the series with the fewest leading NaNs comes first.
| | 01/01/1999 | 02/01/1999 | 03/01/1999 | … | 01/01/2022 | 02/01/2022 |
2 | 582.96 | 589.78 | 588.74 |… | 1005.78 | 997.57 |
3 | nan | 78.47 | 79.41 | … | 193.47 | 192.99 |
1 | nan | nan | nan | … | 752.14 | 754.14 |
I tried different combinations of numpy.triu, sort_by and dataframe.T, but without success.
My main goal is to get this format, but getting it with good performance would be nice too, because my data is big.
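No accepted answer is shown here, but a vectorized sketch of one way to do this (labels invented to mirror the example above; not from the original thread) is to transpose and then sort rows by the position of their first non-NaN value:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {1: [np.nan, np.nan, np.nan, 752.14, 754.14],
     2: [582.96, 589.78, 588.74, 1005.78, 997.57],
     3: [np.nan, 78.47, 79.41, 193.47, 192.99]},
    index=["01/01/1999", "02/01/1999", "03/01/1999", "01/01/2022", "02/01/2022"],
)

t = df.T  # dates become columns, series ids become rows
# Position of each row's first non-NaN value; sorting by it puts the
# earliest-starting series first, giving the upper-triangular layout.
# (An all-NaN row would also get position 0; handle it separately if needed.)
first_valid = t.notna().to_numpy().argmax(axis=1)
result = t.iloc[np.argsort(first_valid, kind="stable")]
print(result)

Because this is fully vectorized, it should also scale reasonably to a big frame.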

Related

sumif and countif in Python for multiple columns, on row level and not column level

I'm trying to figure out a way to do:
COUNTIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
SUMIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
My attempt:
import pandas as pd
df=pd.read_excel(r'C:\\Users\\Downloads\\Prepped.xls') #Please use: https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float) #changing dtype to float
#unconditional sum
df['sum']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum(axis=1)
Whatever goes below won't work:
#sum if
df['greater-than-0.05']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum([c for c in col if c >= 0.05])
| | # | word | B64684807 | B64684807Measure | B649845471 | B649845471Measure | B83344143 | B83344143Measure | B67400624 | B67400624Measure | B85229235 | B85229235Measure | B85630406 | B85630406Measure | B82615898 | B82615898Measure | B87558236 | B87558236Measure | B00000009 | B00000009Measure | 有效竞品数 | 关键词抓取时间 | 搜索量排名 | 月搜索量 | 在售商品数 | 竞争度 |
|---:|----:|:--------|------------:|:-------------------|-------------:|:-------------------------|------------:|:-------------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|-------------------:|------------:|:-------------------|-------------:|:--------------------|-------------:|-----------:|-------------:|---------:|
| 0 | 1 | word 1 | 0.055639 | [主要流量词] | 0.049416 | nan | 0.072298 | [精准流量词, 主要流量词] | 0.00211 | nan | 0.004251 | nan | 0.007254 | nan | 0.074409 | [主要流量词] | 0.033597 | nan | 0.000892 | nan | 9 | 2022-10-06 00:53:56 | 5726 | 326188 | 3810 | 0.01 |
| 1 | 2 | word 2 | 0.045098 | nan | 0.005472 | nan | 0.010791 | nan | 0.072859 | [主要流量词] | 0.003423 | nan | 0.012464 | nan | 0.027396 | nan | 0.002825 | nan | 0.060989 | [主要流量词] | 9 | 2022-10-07 01:16:21 | 9280 | 213477 | 40187 | 0.19 |
| 2 | 3 | word 3 | 0.02186 | nan | 0.05039 | [主要流量词] | 0.007842 | nan | 0.028832 | nan | 0.044385 | [精准流量词] | 0.001135 | nan | 0.003866 | nan | 0.021035 | nan | 0.017202 | nan | 9 | 2022-10-07 00:28:31 | 24024 | 81991 | 2275 | 0.03 |
| 3 | 4 | word 4 | 0.000699 | nan | 0.01038 | nan | 0.001536 | nan | 0.021512 | nan | 0.007658 | nan | 5e-05 | nan | 0.048682 | nan | 0.001524 | nan | 0.000118 | nan | 9 | 2022-10-07 00:52:12 | 34975 | 53291 | 30970 | 0.58 |
| 4 | 5 | word 5 | 0.00984 | nan | 0.030248 | nan | 0.003006 | nan | 0.014027 | nan | 0.00904 | [精准流量词] | 0.000348 | nan | 0.000414 | nan | 0.006721 | nan | 0.00153 | nan | 9 | 2022-10-07 02:36:05 | 43075 | 41336 | 2230 | 0.05 |
| 5 | 6 | word 6 | 0.010029 | [精准流量词] | 0.120739 | [精准流量词, 主要流量词] | 0.014359 | nan | 0.002796 | nan | 0.002883 | nan | 0.028747 | [精准流量词] | 0.007022 | nan | 0.017803 | nan | 0.001998 | nan | 9 | 2022-10-07 00:44:51 | 49361 | 34791 | 517 | 0.01 |
| 6 | 7 | word 7 | 0.002735 | nan | 0.002005 | nan | 0.005355 | nan | 6.3e-05 | nan | 0.000772 | nan | 0.000237 | nan | 0.015149 | nan | 2.1e-05 | nan | 2.3e-05 | nan | 9 | 2022-10-07 09:48:20 | 53703 | 31188 | 511 | 0.02 |
| 7 | 8 | word 8 | 0.003286 | [精准流量词] | 0.058161 | [主要流量词] | 0.013681 | [精准流量词] | 0.000748 | [精准流量词] | 0.002684 | [精准流量词] | 0.013916 | [精准流量词] | 0.029376 | nan | 0.019792 | nan | 0.005602 | nan | 9 | 2022-10-06 01:51:53 | 58664 | 27751 | 625 | 0.02 |
| 8 | 9 | word 9 | 0.004273 | [精准流量词] | 0.025581 | [精准流量词] | 0.014784 | [精准流量词] | 0.00321 | [精准流量词] | 0.000892 | nan | 0.00223 | nan | 0.005315 | nan | 0.02211 | nan | 0.027008 | [精准流量词] | 9 | 2022-10-07 01:34:28 | 73640 | 20326 | 279 | 0.01 |
| 9 | 10 | word 10 | 0.002341 | [精准流量词] | 0.029604 | nan | 0.007817 | [精准流量词] | 0.000515 | [精准流量词] | 0.001865 | [精准流量词] | 0.010128 | [精准流量词] | 0.015378 | nan | 0.019677 | nan | 0.003673 | nan | 9 | 2022-10-07 01:17:44 | 80919 | 17779 | 207 | 0.01 |
So my question is: how can I do the sumif and countif on this exact table? (It should use col2, col4, ... etc., because every file will have the same format but different headers, so using df['B64684807'] isn't helpful.)
Sample file can be found at:
https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
IIUC, you can use a boolean mask:
df2 = df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float)
m = df2.ge(0.05)
df['countif'] = m.sum(axis=1)
df['sumif'] = df2.where(m).sum(axis=1)
output (last 3 columns only):
sum countif sumif
0 0.299866 3 0.202346
1 0.241317 2 0.133848
2 0.196547 1 0.050390
3 0.092159 0 0.000000
4 0.075174 0 0.000000
5 0.206376 1 0.120739
6 0.026360 0 0.000000
7 0.147246 1 0.058161
8 0.105403 0 0.000000
9 0.090998 0 0.000000

combine dataframes of different sizes and replace values

I have 2 dataframes of different sizes. I am looking to join the dataframes and want to replace the NaN values after combining, filling them with the values from the smaller dataframe.
dataframe1:-
| symbol| value1 | value2 | Occurance |
|=======|========|========|===========|
2020-07-31 | A | 193.5 | 186.05 | 3 |
2020-07-17 | A | 372.5 | 359.55 | 2 |
2020-07-21 | A | 387.8 | 382.00 | 1 |
dataframe2:-
| x | y | z | symbol|
|=====|=====|=====|=======|
2020-10-01 |448.5|453.0|443.8| A |
I tried concatenating and replacing the NaN values with dataframe2's values.
I tried df1 = pd.concat([dataframe2, dataframe1], axis=1). The result is given below, but I am looking for the result shown in "Result desired". How can I achieve that?
Result given:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
|====== | ====|=====|=======|======|=======| =======| =========|
2020-07-31|NaN |NaN | NaN | NaN | A |193.5 | 186.05 | 3 |
2020-07-17| NaN | NaN | NaN | NaN | A |372.5 | 359.55 | 2 |
2020-07-21| NaN | NaN | NaN | NaN | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0|443.8| A |NaN | NaN | NaN | NaN |
Result Desired:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
| ===== | ======| ====| ======| =====|=======|========|==========|
2020-10-01| 448.5 |453.0 |443.8| A | A |193.5 | 186.05 | 3 |
2020-10-01| 448.5 |453.0 |443.8| A | A |372.5 | 359.55 | 2 |
2020-10-01| 448.5 |453.0 |443.8| A | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0 |443.8| A |NaN | NaN | NaN | NaN |
Please note the datetime needs to be the same in the desired result. In short, I want to replicate the single line of dataframe2 across the NaN cells of dataframe1. A solution avoiding a for loop would be great.
Could you try sorting your dataframe by the index to check what the output would be?
df1.sort_index()
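For the desired output itself, a possible sketch (not from the original thread; it assumes pandas >= 1.2 for how="cross") is to broadcast dataframe2's single row across dataframe1 with a cross join and then append the bare dataframe2 row:

import pandas as pd

df1 = pd.DataFrame(
    {"symbol": ["A", "A", "A"],
     "value1": [193.5, 372.5, 387.8],
     "value2": [186.05, 359.55, 382.00],
     "Occurance": [3, 2, 1]},
    index=pd.to_datetime(["2020-07-31", "2020-07-17", "2020-07-21"]),
)
df2 = pd.DataFrame(
    {"x": [448.5], "y": [453.0], "z": [443.8], "symbol": ["A"]},
    index=pd.to_datetime(["2020-10-01"]),
)

# Pair df2's single row with every df1 row; keep df2's date as the index.
filled = df2.reset_index().merge(
    df1.reset_index(drop=True), how="cross", suffixes=("", "_df1")
).set_index("index")
# Append the bare df2 row so the last line keeps NaNs for df1's columns.
result = pd.concat([filled, df2])
print(result)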

Modify a column according to another dataframe's column in python

I have two dataframes. One is the master dataframe and the other df is used to fill my master dataframe.
What I want is to fill one column according to another column, without altering the other columns.
This is example of master df
| id | Purch. order | cost | size | code |
| 1 | G918282 | 8283 | large| hchs |
| 2 | EE18282 | 1283 | small| ueus |
| 3 | DD08282 | 5583 | large| kdks |
| 4 | GU88912 | 8232 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
This is example of the another df
| id | Purch. order | cost |
| 1 | G918282 | 7728 |
| 2 | EE18282 | 2211 |
| 3 | DD08282 | 5321 |
| 4 | GU88912 | 4778 |
| 5 | NaN | 4283 |
| 6 | Nan | 9993 |
| 7 | Nan | 3442 |
This is the result I'd like
| id | Purch. order | cost | size | code |
| 1 | G918282 | 7728 | large| hchs |
| 2 | EE18282 | 2211 | small| ueus |
| 3 | DD08282 | 5321 | large| kdks |
| 4 | GU88912 | 4778 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
Only the cost column should be modified, and only where the secondary df coincides on Purch. order and it is not NaN.
I hope you can help me, and I'm sorry if my English is basic; it is not my mother language. Thanks a lot.
Let's try update, which works along indexes. By default overwrite is set to True, which will overwrite overlapping values in your target dataframe; use overwrite=False if you only want to change NA values.
master_df = master_df.set_index(['id','Purch. order'])
another_df = another_df.dropna(subset=['Purch. order']).set_index(['id','Purch. order'])
master_df.update(another_df)
print(master_df)
cost size code
id Purch. order
1 G918282 7728.0 large hchs
2 EE18282 2211.0 small ueus
3 DD08282 5321.0 large kdks
4 GU88912 4778.0 large jdhd
5 NaN 1283.0 large jdjd
6 Nan 5583.0 large qqas
7 Nan 8232.0 large djjs
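If you need id and Purch. order back as ordinary columns after the update, a reset_index() restores the original shape (a small usage note, not part of the original answer):

master_df = master_df.reset_index()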
You can do it with a merge followed by updating the cost column based on where the NaN values are:
final_df = df1.merge(df2[~df2["Purch. order"].isna()], on = 'Purch. order', how="left")
final_df.loc[~final_df['Purch. order'].isnull(), "cost"] = final_df['cost_y'] # not nan
final_df.loc[final_df['Purch. order'].isnull(), "cost"] = final_df['cost_x'] # nan
final_df = final_df.drop(['id_y','cost_x','cost_y'],axis=1)
Output:
id_x Purch. order size code cost
0 1 G918282 large hchs 7728.0
1 2 EE18282 small ueus 2211.0
2 3 DD08282 large kdks 5321.0
3 4 GU88912 large jdhd 4778.0
4 5 NaN large jdjd 1283.0
5 6 NaN large qqas 5583.0
6 7 NaN large djjs 8232.0

How to fill missing GPS Data in pandas?

I have a data frame that looks something like this
+-----+------------+-------------+-------------------------+----+----------+----------+
| | Actual_Lat | Actual_Long | Time | ID | Cal_long | Cal_lat |
+-----+------------+-------------+-------------------------+----+----------+----------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 |
| ... | ... | ... | ... | ...| ... | ... |
| 70 | NaN | NaN | NaT | NaN| 10.35826 | 63.43149 |
| 71 | NaN | NaN | NaT | NaN| 10.35809 | 63.43155 |
| 72 | NaN | NaN | NaT | NaN| 10.35772 | 63.43163 |
| 73 | NaN | NaN | NaT | NaN| 10.35646 | 63.43182 |
| 74 | NaN | NaN | NaT | NaN| 10.35536 | 63.43196 |
+-----+------------+-------------+-------------------------+----+----------+----------+
Actual_Lat and Actual_Long contain GPS coordinates obtained from a GPS device. Cal_long and Cal_lat are GPS coordinates obtained from OSRM's API. As you can see, a lot of data is missing in the actual coordinates. I am looking to get a dataset such that when I take the difference of Actual_Lat vs Cal_lat it is zero, or at least close to zero. I tried to fill these missing values with the destination lat and long, but that resulted in a huge difference. My question is: how can I fill these values using Python/pandas so that, when the vehicle followed the OSRM-estimated path, the difference between the actual lat/long and the estimated lat/long is zero or close to zero? I am new to GIS datasets and have no idea how to deal with them.
EDIT: I am looking for something like this.
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| | Actual_Lat | Actual_Long | Time | Tour ID | Cal_long | Cal_lat | coordinates_diff_Lat | coordinates_diff_Lon |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 | -0.000 | -0.000 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 | 0.000 | -0.001 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 | 0.000 | -0.001 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 | -0.001 | -0.001 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 | -0.001 | -0.001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70 | 63.43000 | 10.35800 | NaT | 115268.0 | 10.35826 | 63.43149 | 0.000 | -0.003 |
| 71 | 63.43025 | 10.35888 | NaT | 115268.0 | 10.35809 | 63.43155 | 0.000 | -0.003 |
| 72 | 63.43052 | 10.35713 | NaT | 115268.0 | 10.35772 | 63.43163 | 0.000 | -0.002 |
| 73 | 63.43159 | 10.35633 | NaT | 115268.0 | 10.35646 | 63.43182 | 0.000 | -0.001 |
| 74 | 63.43197 | 10.35537 | NaT | 115268.0 | 10.35536 | 63.43196 | 0.000 | 0.000 |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
Note that 63.43197, 10.35537 is the destination and 63.433376, 10.397068 is the starting position. All these points represent road coordinates.
IIUC, you need something like this:
I am taking the columns out of df as lists.
div = float(len(cal_lat)) / float(len(actual_lat))  # calculated points per actual point
new_l = []
for i in range(len(cal_lat)):
    # map each calculated index back to an actual point by the ratio
    new_l.append(actual_lat[int(i / div)])
print(new_l)
len(new_l)
Do the same with the longitude columns.
Since these are GPS points, you can round to an accuracy of up to 3 decimal places when taking the difference. Keeping this in mind, starting from Actual_Lat and Actual_Long, if your next value is the same as the first, the difference won't be much greater.
Hopefully this makes sense and you have your solution.
You need pandas.DataFrame.where.
Let's say your dataframe is df, then you can do:
df.Actual_Lat = df.Actual_Lat.where(~df.Actual_Lat.isna(), df.Cal_lat)
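Equivalently, fillna with an aligned Series does the same thing a little more directly (a side note, assuming the column names shown above):

df["Actual_Lat"] = df["Actual_Lat"].fillna(df["Cal_lat"])
df["Actual_Long"] = df["Actual_Long"].fillna(df["Cal_long"])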

Manipulate pandas columns with datetime

Please see this SO post Manipulating pandas columns
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same - to get the time difference but in hours, corresponding to the Stop Time of the first row and Start Time of the last row
for each Key.
Based on the answer, I was trying something like this:
df['Time']=df.groupby(['Location','Event']).Date.\
transform(lambda x : (x.iloc[-1]-x.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')]
df['Time_h']=df.groupby(['Location','Event'])['Start Time (hh:mm:ss)','Stop Time (hh:mm:ss)'].\
transform(lambda x,y : (x.iloc[-1]-y.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')] # This gives an error on transform
to get the difference in days and hours separately and then combine. Is there a better way?
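No answer was posted for this follow-up, but note that transform feeds the lambda one column at a time, which is why the two-argument version fails. One possible sketch (column names follow the question; sample values are invented) is to build full timestamps first and then aggregate per Key:

import pandas as pd

df = pd.DataFrame({
    "Location": ["i2", "i2", "i2", "i2"],
    "Date": pd.to_datetime(["2019-03-02", "2019-03-02", "2019-03-02", "2019-03-04"]),
    "Event": [1, 1, 1, 1],
    "Key": ["a", "a", "a", "a"],
    "Start Time (hh:mm:ss)": ["13:24:38", "03:48:36", "04:55:05", "08:44:34"],
    "Stop Time (hh:mm:ss)": ["14:17:39", "04:17:20", "05:23:48", "09:13:15"],
})

# Combine Date with the clock columns into full timestamps.
df["start_dt"] = pd.to_datetime(df["Date"].dt.strftime("%Y-%m-%d") + " " + df["Start Time (hh:mm:ss)"])
df["stop_dt"] = pd.to_datetime(df["Date"].dt.strftime("%Y-%m-%d") + " " + df["Stop Time (hh:mm:ss)"])

# Hours between the last row's Start and the first row's Stop, per Key,
# written only on the last row of each Key as in the earlier answer.
per_key = df.groupby("Key")[["start_dt", "stop_dt"]].apply(
    lambda g: (g["start_dt"].iloc[-1] - g["stop_dt"].iloc[0]) / pd.Timedelta(hours=1)
)
df["Time_h"] = df["Key"].map(per_key).where(~df.duplicated("Key", keep="last"))
print(df)

This gives the difference directly in hours, avoiding the separate days-plus-hours combination.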
