Losing negative sign when extracting data from a dataframe - python

I extract temperature from a website into a dataframe. It looks like this:
Temp Prec
0 3 / -4 °C -
1 1 / -17 °C -
2 -7 / -18 °C -
3 6 / -8 °C -
4 8 / 1 °C -
5 8 / 0 °C 1.3 mm
6 8 / 0 °C 7.0 mm
7 6 / -1 °C -
8 4 / 0 °C 4.0 mm
9 5 / 2 °C 23.8 mm
10 6 / 1 °C -
11 5 / -1 °C -
12 4 / -1 °C -
13 7 / 0 °C 10.6 mm
14 7 / 1 °C 29.7 mm
Then I use this code to extract the temperature in the format I want:
df2['Temp'] = df2['Temp'].str.extract('(\d+)') + 'C'
and I get this result:
Temp Prec
0 3C -
1 1C -
2 7C -
3 6C -
4 8C -
5 8C 1.3 mm
6 8C 7.0 mm
7 6C -
8 4C 4.0 mm
9 5C 23.8 mm
10 6C -
11 5C -
12 4C -
13 7C 10.6 mm
14 7C 29.7 mm
I have lost the negative sign (like on row 2) when it's a temperature below zero. How can I keep the negative sign?

Without regex, you can use rsplit and slice with str (with strip to drop the leading space):
df["Temp"] = df["Temp"].str.rsplit("/", n=1).str[-1].str.strip()
And regarding the regex approach, you can include °C in the captured group:
df["Temp"] = df["Temp"].str.extract(r"(-?\d+\s*°C)", expand=False)
Output:
print(df)
Temp Prec
0 -4 °C -
1 -17 °C -
2 -18 °C -
3 -8 °C -
4 1 °C -
5 0 °C 1.3 mm
6 0 °C 7.0 mm
7 -1 °C -
8 0 °C 4.0 mm
9 2 °C 23.8 mm
10 1 °C -
11 -1 °C -
12 -1 °C -
13 0 °C 10.6 mm
14 1 °C 29.7 mm
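Note that if you want to keep your original format (the first number plus a literal C), the minimal fix is to allow an optional minus sign in the captured group (a sketch against the same df2):
# capture an optional leading minus together with the digits
df2['Temp'] = df2['Temp'].str.extract(r'(-?\d+)', expand=False) + 'C'
This keeps the leading temperature with its sign, e.g. row 2 becomes -7C.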

Related

Pandas: Find closest group from another dataframe

Below, I have two dataframes. I need to update df_mapped using df_original.
In df_mapped, for each x_time I need to find the 3 closest rows (closest defined by the difference from x_price) and add those to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
Intermediate output: from this, I need to select the 3 rows with the smallest x_price_delta for each x_time:
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
Final step: keeping x_time fixed, I need to flatten the dataframe so that the 3 closest rows end up in one row:
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and final steps. I can't think of a way out; what could be done to massage the dataframe?
This is not a complete solution, but it might help you get unstuck. At the end we get the correct data.
df = (df_int_out.groupby("x_time")
                .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
                .set_index(["x_time", "expiration_x"])
                .drop(["x_price_delta", "x_price_x"], axis=1))
df1 = df.iloc[1:-1]
df1.groupby(df1.index).apply(lambda x: pd.concat([pd.DataFrame(d) for d in x.values], axis=1).unstack())
Output:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
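For a fuller end-to-end sketch (assuming df_original and df_mapped as built above, and a recent pandas that accepts a list for pivot's index; the rank column is a name introduced here), you could keep the 3 smallest deltas per x_time with groupby().head(3) and then pivot the ranked matches into columns:
merged = df_mapped.merge(df_original, on='x_time', how='left')
merged['x_price_delta'] = (merged['x_price_x'] - merged['x_price_y']).abs()
# keep the 3 closest matches per x_time (unmatched x_times survive as single NaN rows)
closest = merged.sort_values('x_price_delta').groupby('x_time').head(3).copy()
# number the matches 1..3 within each x_time, then spread them into columns
closest['rank'] = closest.groupby('x_time').cumcount() + 1
wide = closest.pivot(index=['x_time', 'expiration_x', 'x_price_x'],
                     columns='rank',
                     values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{col}_{r}' for col, r in wide.columns]
wide = wide.reset_index()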

Replace values above the 80th percentile with the 80th percentile in pandas

I have a df as shown below
df:
Id gender age salary
1 m 27 100
2 m 26 100000
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 2000
From the above, I would like to replace every value greater than the 80th percentile with the 80th percentile value.
Expected output:
Id gender age salary
1 m 27 100
2 m 26 560
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 560
Let's try:
quantiles = df.salary.quantile(0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output (though I can't quite get 200 as the 0.8 quantile):
Id gender age salary
0 1 m 27 100.0
1 2 m 26 560.0
2 3 m 57 180.0
3 4 f 27 150.0
4 5 m 57 200.0
5 6 f 29 100.0
6 7 m 47 130.0
7 8 f 27 140.0
8 9 m 37 100.0
9 10 f 43 560.0
In case you want to fill within gender, compute the quantiles per group with transform and reuse the same assignment:
quantiles = df.groupby('gender')['salary'].transform('quantile', q=0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output:
Id gender age salary
0 1 m 27 100
1 2 m 26 200
2 3 m 57 180
3 4 f 27 150
4 5 m 57 200
5 6 f 29 100
6 7 m 47 130
7 8 f 27 140
8 9 m 37 100
9 10 f 43 890
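As a side note, Series.clip can do the capping in one step (a sketch equivalent to the loc assignment above):
df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.8))
# or per gender, clipping against the group-wise quantile
df['salary'] = df['salary'].clip(upper=df.groupby('gender')['salary'].transform('quantile', q=0.8))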

Reading from a MD file to pandas dataframe

I'm doing a project with this data repository, and for each football season, on top of the CSV file with the game data, there is an extra README with the final results. This is in some sort of table-like structure, and I want to read it into a pandas dataframe. I've tried both read_csv and read_table, but I'm not sure what delimiter is being used and whether it possibly uses a multi-index... The MD file is as follows:
- Home - - Away - - Total -
Pld W D L F:A W D L F:A F:A +/- Pts
1. Club Brugge 34 14 2 1 45:12 11 4 2 38:18 83:30 +53 81
2. RSC Anderlecht 34 14 1 2 44:14 8 4 5 39:23 83:37 +46 71
3. Germinal Beerschot 34 12 3 2 34:11 3 5 9 19:26 53:37 +16 53
4. RWD Molenbeek 34 9 6 2 24:9 4 8 5 15:20 39:29 +10 53
5. K Lierse SK 34 8 4 5 30:21 6 6 5 24:24 54:45 +9 52
6. Standard Liège 34 9 7 1 28:15 4 5 8 23:31 51:46 +5 51
7. Sporting Charleroi 34 8 7 2 37:21 5 4 8 22:32 59:53 +6 50
8. Cercle Brugge 34 7 5 5 27:23 6 5 6 24:24 51:47 +4 49
9. KFC Lommel SK 34 7 6 4 20:15 7 0 10 20:30 40:45 -5 48
10. SC Eendracht Aalst 34 8 6 3 37:21 4 4 9 18:29 55:50 +5 46
11. KV Mechelen 34 8 4 5 20:16 4 4 9 20:30 40:46 -6 44
12. KRC Harelbeke 34 8 1 8 26:26 5 3 9 14:22 40:48 -8 43
13. Royal Antwerp FC 34 7 4 6 26:23 4 5 8 12:23 38:46 -8 42
14. KAA Gent 34 8 2 7 21:22 2 9 6 18:27 39:49 -10 41
15. Sint-Truidense VV 34 7 4 6 29:28 4 3 10 13:32 42:60 -18 40
16. RFC Seraing 34 5 4 8 18:24 3 1 13 17:51 35:75 -40 29
17. KSK Beveren 34 4 7 6 24:25 2 2 13 14:32 38:57 -19 27
18. SV Zulte Waregem 34 3 4 10 18:36 1 5 11 12:34 30:70 -40 21
Pld = Matches; W = Matches won; D = Matches drawn; L = Matches lost; F = Goals for; A = Goals against; +/- = Goal difference; Pts = Points
How can I best read this file? Cheers!
This file is not formatted as a CSV (which can use both ',' and ';' delimiters).
In your case you only have spaces to work with, so the approach would be to split every line on the space character, drop the empty entries, and fetch the fields by index.
with open("your.csv", 'r') as f:
    for line in f:
        vals = [v for v in line.split(' ') if v]  # split on spaces, drop empty entries
        index = vals[0]
        name = vals[1]  # caution: multi-word club names ("Club Brugge") span several entries
        goals_fa = tuple(vals[6].split(':'))  # field positions shift for multi-word names
        ...
        # fill dataframe
This file isn't in a particular format - it's meant to be human-readable rather than machine readable. So you'll probably need to do some conversion on your own first.
One simple way:
import re
lines = text.split('\n')
df = pd.DataFrame([re.split(r'\s+', line[34:]) for line in lines])
You can name the columns directly:
df.columns = ['pld', 'home_w', 'home_d', 'home_l', 'home_fa',
              'away_w', 'away_d', 'away_l', 'away_fa',
              'total_fa', 'total_plusminus', 'total_points']
And add the club name:
df['club'] = [line[4:34].strip() for line in lines]
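Since the club name occupies a fixed-width column, pandas.read_fwf may also work; it can infer column boundaries from the whitespace (a sketch, where skiprows and the column names are assumptions to adjust against the real README; pass explicit colspecs if the inference merges fields):
df = pd.read_fwf('README', skiprows=2, header=None)  # skip the two header rows
df.columns = ['pos', 'club', 'pld',
              'home_w', 'home_d', 'home_l', 'home_fa',
              'away_w', 'away_d', 'away_l', 'away_fa',
              'total_fa', 'plus_minus', 'pts']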
A quick way could be:
df = pd.read_table('data.txt', sep=r'\s{2,}', header=1, engine='python')
Pld W D L F:A W.1 D.1 L.1 F:A.1 F:A.2 +/- Pts
1. Club Brugge 34 14 2 1 45:12 11 4 2 38:18 83:30 53 81
2. RSC Anderlecht 34 14 1 2 44:14 8 4 5 39:23 83:37 46 71
3. Germinal Beerschot 34 12 3 2 34:11 3 5 9 19:26 53:37 16 53
...
Explanation
sep=r'\s{2,}' will split the data on 2 or more spaces
engine='python' is required for this sep to work
To clean up the column names you could set the prefix for each column manually:
import re
# Manually set a column prefix for each column
k = ['Pld'] + ['Home']*4 + ['Away']*4 + ['Total']*3
# Drop the duplicate column name suffix (e.g. D.1 --> D)
cols = [re.sub(r'\.[0-9]*$', '', c) for c in df.columns]
# Update the column names
df.columns = [f'{p}_{c}' for p, c in zip(k, cols)]

Sum a column's values for each group defined by another column and divide by the total

Today I'm struggling once again with Python and data-analytics.
I got a dataframe which looks like this:
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0
...
This is a dataframe where each row holds the total damage a champion dealt in one game.
Now I want to group this information so I can see which champion has dealt the most damage overall.
I tried groupby('name'), but it didn't work at all.
I already went through some threads about groupby and summing values, but I couldn't solve my specific problem.
The dealt damage of each champion should also be shown as percentage of the total.
I'm looking for something like this as an output:
name totdmgdealt percentage
0 Warwick 2378798098 2.1 %
1 Nami 2837491074 2.3 %
2 Draven 1231451224 ..
3 Fiora 1287301724 ..
4 Viktor 1239808504 ..
5 Skarner 1487911234 ..
6 Galio 1306921234 ..
We can groupby on name and get the sum, then divide each value by the total with .div, multiply by 100 with .mul, and finally round to one decimal with .round:
total = df['totdmgdealt'].sum()
summed = df.groupby('name', sort=False)['totdmgdealt'].sum().reset_index()
summed['percentage'] = summed['totdmgdealt'].div(total).mul(100).round(1)
name totdmgdealt percentage
0 Warwick 343926.0 12.2
1 Nami 25995.0 0.9
2 Draven 246447.0 8.7
3 Fiora 113721.0 4.0
4 Viktor 185302.0 6.6
5 Skarner 148791.0 5.3
6 Galio 130692.0 4.6
7 Ahri 239065.0 8.5
8 Jinx 182680.0 6.5
9 VelKoz 85785.0 3.0
10 Ziggs 46790.0 1.7
11 Cassiopeia 62444.0 2.2
12 Yasuo 117896.0 4.2
13 Evelynn 179252.0 6.4
14 Caitlyn 163342.0 5.8
15 Wukong 122919.0 4.4
16 Syndra 146754.0 5.2
17 Karma 35766.0 1.3
18 Janna 11242.0 0.4
19 Lux 66424.0 2.4
20 Amumu 87826.0 3.1
21 Vayne 76085.0 2.7
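If you also want the literal ' %' suffix from the expected output, you can cast and concatenate (a small sketch on top of the summed frame above):
summed['percentage'] = summed['percentage'].astype(str) + ' %'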
You can use sum() to get the total damage, and apply to calculate the percentage for each row, like this:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0"""), sep=r"\s+")
summed_df = df.groupby('name')['totdmgdealt'].agg(['sum']).rename(columns={"sum": "totdmgdealt"}).reset_index()
summed_df['percentage'] = summed_df.apply(
lambda x: "{:.2f}%".format(x['totdmgdealt'] / summed_df['totdmgdealt'].sum() * 100), axis=1)
print(summed_df)
Output:
name totdmgdealt percentage
0 Ahri 239065.0 8.48%
1 Amumu 87826.0 3.12%
2 Caitlyn 163342.0 5.79%
3 Cassiopeia 62444.0 2.21%
4 Draven 246447.0 8.74%
5 Evelynn 179252.0 6.36%
6 Fiora 113721.0 4.03%
7 Galio 130692.0 4.64%
8 Janna 11242.0 0.40%
9 Jinx 182680.0 6.48%
10 Karma 35766.0 1.27%
11 Lux 66424.0 2.36%
12 Nami 25995.0 0.92%
13 Skarner 148791.0 5.28%
14 Syndra 146754.0 5.21%
15 Vayne 76085.0 2.70%
16 VelKoz 85785.0 3.04%
17 Viktor 185302.0 6.57%
18 Warwick 343926.0 12.20%
19 Wukong 122919.0 4.36%
20 Yasuo 117896.0 4.18%
21 Ziggs 46790.0 1.66%
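A note on the apply above: it recomputes summed_df['totdmgdealt'].sum() once per row. Precomputing the total first (a small, equivalent tweak) avoids that:
total = summed_df['totdmgdealt'].sum()
summed_df['percentage'] = (summed_df['totdmgdealt'] / total * 100).map('{:.2f}%'.format)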
Maybe you can try this. I achieved the same with my own sample data; try running the code below in your Jupyter notebook:
import pandas as pd
name = ['abhit', 'mawa', 'vaibhav', 'dharam', 'sid', 'abhit', 'vaibhav', 'sid', 'mawa', 'lakshya']
totdmgdealt = [24, 45, 80, 22, 89, 55, 89, 51, 93, 85]
name = pd.Series(name, name='name')  # converting into a Series
totdmgdealt = pd.Series(totdmgdealt, name='totdmgdealt')  # converting into a Series
data = pd.concat([name, totdmgdealt], axis=1)  # concat already yields a DataFrame
final = data.pivot_table(values="totdmgdealt", columns="name", aggfunc="sum").transpose()  # actual aggregating step
total = final['totdmgdealt'].sum()  # grand total for calculating the percentage
def calPer(row, total):  # the percentage function
    return ((row / total) * 100).round(2)
final['Percentage'] = calPer(final['totdmgdealt'], total)  # assigning the result to a new column
final
Sample Data :
name totdmgdealt
0 abhit 24
1 mawa 45
2 vaibhav 80
3 dharam 22
4 sid 89
5 abhit 55
6 vaibhav 89
7 sid 51
8 mawa 93
9 lakshya 85
Output:
totdmgdealt Percentage
name
abhit 79 12.48
dharam 22 3.48
lakshya 85 13.43
mawa 138 21.80
sid 140 22.12
vaibhav 169 26.70
Understand and run the code, then just replace the dataset with yours. Maybe this helps.

Pandas: vectorize local range operations (max & sum for [i:i+3] rows)

I'm looking to do calculations in a local range for each row in a dataframe while avoiding a slow for loop. For example, for each row in the data below I want to find the maximum temperature within the next 3 days (including the current day) and the total amount of rain within the next 3 days:
Day Temperature Rain
0 30 4
1 31 14
2 31 0
3 30 0
4 33 5
5 34 0
6 32 0
7 33 2
8 31 5
9 29 9
The ideal output would then be the new columns as in the table below. TempMax on Day 0 shows the highest temperature between Day 0 and Day 2; RainTotal shows the sum of rain between Day 0 and Day 2:
Day Temperature Rain TempMax RainTotal
0 30 4 31 18
1 31 14 31 14
2 31 0 33 5
3 30 0 34 5
4 33 5 34 5
5 34 0 34 2
6 32 0 33 7
7 33 2 33 16
8 31 5 31 14
9 29 9 29 9
Currently I'm using a for loop:
# Make empty arrays to store each row's max & sum values
temp_max = np.zeros(len(df))
rain_total = np.zeros(len(df))
# Loop through the df and do operations in the local range [i:i+3]
for i in range(len(df)):
    temp_max[i] = df['Temperature'].iloc[i:i+3].max()
    rain_total[i] = df['Rain'].iloc[i:i+3].sum()
# Insert the arrays into df
df['TempMax'] = temp_max
df['RainTotal'] = rain_total
The for loop gets the job done but takes 50 minutes with my dataframe. Any chance this can be vectorized or made faster some other way?
Thanks a bunch!
Use Series.rolling on the Series reversed by indexing, together with max and sum; assigning back to df restores the original order through index alignment:
df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
print (df)
Day Temperature Rain TempMax RainTotal
0 0 30 4 31.0 18.0
1 1 31 14 31.0 14.0
2 2 31 0 33.0 5.0
3 3 30 0 34.0 5.0
4 4 33 5 34.0 5.0
5 5 34 0 34.0 2.0
6 6 32 0 33.0 7.0
7 7 33 2 33.0 16.0
8 8 31 5 31.0 14.0
9 9 29 9 29.0 9.0
Another, faster solution: build a strided 2d view of the array in NumPy, then use numpy.nanmax and numpy.nansum:
n = 2
t = np.concatenate([df['Temperature'].values, [np.nan] * (n)])
r = np.concatenate([df['Rain'].values, [np.nan] * (n)])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
print (df)
Day Temperature Rain TempMax RainTotal
0 0 30 4 31.0 18.0
1 1 31 14 31.0 14.0
2 2 31 0 33.0 5.0
3 3 30 0 34.0 5.0
4 4 33 5 34.0 5.0
5 5 34 0 34.0 2.0
6 6 32 0 33.0 7.0
7 7 33 2 33.0 16.0
8 8 31 5 31.0 14.0
9 9 29 9 29.0 9.0
Performance:
#[100000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [23]: %%timeit
...: df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
...: df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
...:
8.36 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]: %%timeit
...: df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
...: df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
...:
20.4 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the case when Day has data for all successive days, we can employ fast NumPy and SciPy tools to our rescue -
from scipy.ndimage.filters import maximum_filter1d
N = 2 # window length
temp = df['Temperature'].to_numpy()
rain = df['Rain'].to_numpy()
df['TempMax'] = maximum_filter1d(temp,N+1,origin=-1,mode='nearest')
df['RainTotal'] = np.convolve(rain,np.ones(N+1,dtype=int))[N:]
Sample output -
In [27]: df
Out[27]:
Day Temperature Rain TempMax RainTotal
0 0 30 4 31 18
1 1 31 14 31 14
2 2 31 0 33 5
3 3 30 0 34 5
4 4 33 5 34 5
5 5 34 0 34 2
6 6 32 0 33 7
7 7 33 2 33 16
8 8 31 5 31 14
9 9 29 9 29 9
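On newer pandas (1.1+), a forward-looking window can also be expressed directly with FixedForwardWindowIndexer, which avoids the reversal trick (a sketch on the same df):
# window covers the current row plus the next 2 rows
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df['TempMax'] = df['Temperature'].rolling(indexer, min_periods=1).max()
df['RainTotal'] = df['Rain'].rolling(indexer, min_periods=1).sum()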
