Add zero in between values in column (dataframe) - python

I have a dataframe:
ID Date Volume Sales
1 20191 3.33 1.33
1 20192 3.33 1.33
1 20193 3.33 1.33
1 20194 2.66 2
1 20195 2.66 2
1 20196 2.66 2
1 20197 2 2.66
1 20198 2 2.66
1 20199 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33
I would like to add a 0 right after the year 2019 in this case, so that the date looks like 201901 etc., while 201910 and above remain the same.
My initial thought process is to use:
np.where(df['Date'].str.len() == 5,
where, if the string's length is 5, we add the zero; otherwise the data stays the same.
Expected output:
ID Date Volume Sales
1 201901 3.33 1.33
1 201902 3.33 1.33
1 201903 3.33 1.33
1 201904 2.66 2
1 201905 2.66 2
1 201906 2.66 2
1 201907 2 2.66
1 201908 2 2.66
1 201909 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33

Assuming these are strings:
df.Date = df.Date.apply(lambda x: x if len(x) == 6 else f"{x[:4]}0{x[-1]}")
But I agree that you should convert these to proper dates, as suggested in the comments.
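A vectorized sketch completing the np.where idea from the question (assuming the Date column holds strings), followed by the suggested conversion to real dates:

```python
import numpy as np
import pandas as pd

# Hypothetical frame matching the question (Date stored as strings)
df = pd.DataFrame({"Date": ["20191", "20199", "201910", "201912"]})

# Complete the np.where idea: pad a zero when the string has 5 characters
df["Date"] = np.where(df["Date"].str.len() == 5,
                      df["Date"].str[:4] + "0" + df["Date"].str[4:],
                      df["Date"])
print(df["Date"].tolist())   # ['201901', '201909', '201910', '201912']

# Or, as suggested, convert to proper dates (year-month format)
dates = pd.to_datetime(df["Date"], format="%Y%m")
```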


How to read webpage dataset in pandas?

I am trying to read this table
on the webpage: https://datahub.io/sports-data/german-bundesliga
I am using this code:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga"
pd.read_html(url)[2]
It reads other tables on the page, but not tables of this type.
Also there is a link to this specific table:
https://datahub.io/sports-data/german-bundesliga/r/0.html
I also tried this:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga/r/0.html"
pd.read_html(url)
But it says that there are no tables to read.
There is no need to use the HTML form of the table, because the data is also available in CSV format:
pd.read_csv('https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv').head()
output:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR ... BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
0 D1 24/08/2018 Bayern Munich Hoffenheim 3 1 H 1 0 H ... 3.55 22 -2.00 1.92 1.87 2.05 1.99 1.23 7.15 14.10
1 D1 25/08/2018 Fortuna Dusseldorf Augsburg 1 2 A 1 0 H ... 1.76 20 0.00 1.80 1.76 2.17 2.11 2.74 3.33 2.78
2 D1 25/08/2018 Freiburg Ein Frankfurt 0 2 A 0 1 A ... 1.69 20 -0.25 2.02 1.99 1.92 1.88 2.52 3.30 3.07
3 D1 25/08/2018 Hertha Nurnberg 1 0 H 1 0 H ... 1.76 20 -0.25 1.78 1.74 2.21 2.14 1.79 3.61 5.21
4 D1 25/08/2018 M'gladbach Leverkusen 2 0 H 0 0 D ... 2.32 20 0.00 2.13 2.07 1.84 1.78 2.63 3.70 2.69
5 rows × 61 columns
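As a network-free illustration of the same step, read_csv parses the raw CSV directly, so no HTML parsing is needed; the small inline sample below (a few columns of the real season-1819.csv) stands in for the remote file:

```python
import io
import pandas as pd

# Inline sample standing in for the real season-1819.csv download
sample = io.StringIO(
    "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n"
    "D1,24/08/2018,Bayern Munich,Hoffenheim,3,1,H\n"
    "D1,25/08/2018,Fortuna Dusseldorf,Augsburg,1,2,A\n"
)

# dayfirst=True because the Date column uses DD/MM/YYYY
df = pd.read_csv(sample, parse_dates=["Date"], dayfirst=True)
print(df.loc[0, "Date"])   # 2018-08-24 00:00:00
```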

export dataframe to csv stacking the columns with header and date index

I have a dataframe that I'd like to export to a csv file where each column is stacked on top of one another. I want to use each header as a label with the date in this format, Allu_1_2013.
date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015
The final csv text file should look like:
Allu_1_2013 2.00
Allu_1_2014 3.09
Allu_1_2015 4.79
Allu_2_2013 1.45
Allu_2_2014 2.35
Allu_2_2015 4.89
Allu_3_2013 3.54
Allu_3_2014 9.01
Allu_3_2015 10.04
You can use melt:
new_df = df.melt(id_vars=["date", "year"],
var_name="Date",
value_name="Value").drop(columns=['date'])
new_df['idx'] = new_df['Date'] + '_' + new_df['year'].astype(str)
new_df = new_df.drop(columns=['year', 'Date'])
   Value           idx
0   2.00   Allu_1_2013
1   3.09   Allu_1_2014
2   4.79   Allu_1_2015
3   1.45   Allu_2_2013
4   2.35   Allu_2_2014
5   4.89   Allu_2_2015
6   3.54  Alluv_3_2013
7   9.01  Alluv_3_2014
8  10.04  Alluv_3_2015
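An alternative sketch that skips melt entirely: index by year, unstack the value columns into one Series keyed by (column, year), then build the labels from that MultiIndex. Column names (including the original "Alluv_3" spelling) are taken from the question:

```python
import io
import pandas as pd

# Hypothetical frame matching the question
df = pd.DataFrame({
    "date": ["2013-01-01", "2014-01-01", "2015-01-01"],
    "Allu_1": [2.00, 3.09, 4.79],
    "Allu_2": [1.45, 2.35, 4.89],
    "Alluv_3": [3.54, 9.01, 10.04],
    "year": [2013, 2014, 2015],
})

# unstack yields a Series indexed by (column name, year),
# already in the desired column-then-year order
s = df.set_index("year")[["Allu_1", "Allu_2", "Alluv_3"]].unstack()

# Build the "Allu_1_2013"-style labels from the MultiIndex
s.index = [f"{col}_{year}" for col, year in s.index]

# Write without a header; a StringIO stands in for the real file here
buf = io.StringIO()
s.to_csv(buf, header=False)
print(buf.getvalue())
```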

Unable to print dataframe from the extracted Boolean values

import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the object data types
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where works along the row axis, so a mask over columns does not align. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
From your post (badly formatted) I recreated the diamonds DataFrame and got the result below:
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a boolean vector containing the answer to the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
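Both idioms can be sketched on a small stand-in frame (hypothetical values, just a few of the diamonds columns):

```python
import pandas as pd

# Small stand-in for the diamonds DataFrame
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "color": ["E", "E"],
    "price": [326, 326],
})

# Built-in: keep only object-dtype (string) columns
obj_only = diamonds.select_dtypes(include="object")

# Equivalent boolean-mask form, along the column axis
obj_only2 = diamonds.loc[:, diamonds.dtypes == object]

print(list(obj_only.columns))   # ['cut', 'color']
```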

Adding day of the week column according to multiple columns in python

How to add the day_of_week column(the day of the week, eg. 1 = Mon, 2 = Tue) to df1 according to the year,month,day values as shown below:
year month day A B C D day_of_week
0 2019 1 1 26.2 20.2 0.0 32.4 2
1 2019 1 2 22.9 20.3 0.0 10.0 3
2 2019 1 3 24.8 18.4 0.0 28.8 4
3 2019 1 4 26.6 18.3 0.0 33.5 5
4 2019 1 5 28.3 20.9 0.0 33.4 6
Use pd.to_datetime with the .dt.dayofweek attribute, then add 1 (dayofweek counts Monday as 0):
df['day_of_week'] = pd.to_datetime(df[['year', 'month', 'day']],
errors='coerce').dt.dayofweek + 1
Out[410]:
year month day A B C D day_of_week
0 2019 1 1 26.2 20.2 0.0 32.4 2
1 2019 1 2 22.9 20.3 0.0 10.0 3
2 2019 1 3 24.8 18.4 0.0 28.8 4
3 2019 1 4 26.6 18.3 0.0 33.5 5
4 2019 1 5 28.3 20.9 0.0 33.4 6
If you also want the day name, you can make use of calendar.
import datetime
import calendar
def findDay(date):
    day_number = datetime.datetime.strptime(date, '%d %m %Y').weekday() + 1
    day_name = calendar.day_name[day_number - 1]
    return (day_number, day_name)
You can generate a date in str format using something like this date = f'{day} {month} {year}' from the year, month, day columns.
Then all you have to do is apply the function above on the new date column. The function will return a tuple with the day number of the week as well as the day name.
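If the day name is wanted per row, pandas can also produce it vectorized, without strptime or calendar; a sketch using the year/month/day columns from the question:

```python
import pandas as pd

# Hypothetical frame mirroring the question's first rows
df = pd.DataFrame({"year": [2019, 2019], "month": [1, 1], "day": [1, 2]})

dates = pd.to_datetime(df[["year", "month", "day"]])

# Monday=0 in pandas, so +1 gives Mon=1 ... Sun=7
df["day_of_week"] = dates.dt.dayofweek + 1

# Vectorized day name
df["day_name"] = dates.dt.day_name()

print(df)
```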

How to split string of text by conjunction in python?

I have a dataframe which is a transcript of a 2 person conversation. In the df are words, their timestamps, and the label of the speaker. It looks like this.
word start stop speaker
0 but 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay 9.19 10.01 2
7 sure 10.02 11.01 2
8 what? 11.02 12.00 1
9 i 12.01 13.00 2
10 agree 13.01 14.00 2
11 but 14.01 15.00 2
12 i 15.01 16.00 2
13 disagree 16.01 17.00 2
14 thats 17.01 18.00 1
15 fine 18.01 19.00 1
16 however 19.01 20.00 1
17 you 20.01 21.00 1
18 are 21.01 22.00 1
19 like 22.01 23.00 1
20 this 23.01 24.00 1
21 and 24.01 25.00 1
I have code to combine all words per speaker turn into one utterance while preserving the timestamp and speaker label. Using this code:
df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
'word': ' '.join,
'start': 'min',
'stop': 'max'
})
I get this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
However, I want to split these combined utterances into sub-utterances based on the presence of a conjunctive adverb ('however', 'and', 'but', etc.). As a result, I want this:
word start stop speaker
0 but that's alright 2.72 3.47 2
1 we'll have to 8.43 9.07 1
2 okay sure 9.19 11.01 2
3 what? 11.02 12.00 1
4 I agree 12.01 14.00 2
5 but i disagree 14.01 17.00 2
6 thats fine 17.01 19.00 1
7 however you are 19.01 22.00 1
8 like this 22.01 24.00 1
9 and 24.01 25.00 1
Any recommendations on accomplishing this task would be appreciated.
You can add an OR (|) condition that also starts a new group whenever the word is in a specific list (e.g. with df['word'].isin(['however', 'and', 'but'])):
df.groupby([((df['speaker'] != df['speaker'].shift()) | (df['word'].isin(['however', 'and', 'but'])) ).cumsum(), df['speaker']], as_index=False).agg({
'word': ' '.join,
'start': 'min',
'stop': 'max'
})
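A runnable sketch of this grouping on a trimmed stand-in for the transcript (rows 9-13 of the question); the "turn" key name is an addition for readability:

```python
import pandas as pd

# Trimmed stand-in for the transcript
df = pd.DataFrame({
    "word":    ["i", "agree", "but", "i", "disagree"],
    "start":   [12.01, 13.01, 14.01, 15.01, 16.01],
    "stop":    [13.00, 14.00, 15.00, 16.00, 17.00],
    "speaker": [2, 2, 2, 2, 2],
})

conjunctions = ["however", "and", "but"]

# New group whenever the speaker changes OR the word is a conjunction
breaks = (df["speaker"] != df["speaker"].shift()) | df["word"].isin(conjunctions)

out = df.groupby([breaks.cumsum().rename("turn"), df["speaker"]],
                 as_index=False).agg({
    "word": " ".join,
    "start": "min",
    "stop": "max",
})
print(out)
```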
