Wide to long in python, with year repeating [duplicate] - python

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 1 year ago.
I want to convert this wide-format table in pandas:
      Jan  Feb  Mar  Apr  may  jun  jul  aug  sep  oct  nov  dec
2019    0    0    0    0    0    0    0    0    0    0    0    0
2020    0    0    0    0    0    0    0    0    0    0    0    0
2021    0    0    0    0    0    0    0    0    0    0    0    0
in this format:
YEAR MON dd
2019 DEC 0
2019 NOV 0
2019 OCT 0
2019 SEP 0
2019 AUG 0
2019 JUL 0
2019 JUN 0
2019 MAY 0
2019 APR 0
2019 MAR 0
2019 FEB 0
2019 JAN 0
2018 DEC 0
How can this be done?

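df.transpose() swaps rows and columns, but on its own it does not give one row per year-month pair. For a long format, reshape with DataFrame.melt or DataFrame.stack instead.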
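A minimal sketch of the wide-to-long reshape with DataFrame.melt, assuming the years are the index and the months are the columns as in the question. The names `YEAR`, `MON` and `dd` come from the desired output; the sort order (years ascending, months from DEC down to JAN) is an assumption about what the asker wants and is easy to change.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Wide table from the question: one row per year, one column per month.
df = pd.DataFrame(0, index=[2019, 2020, 2021], columns=months)
df.index.name = "YEAR"

# melt turns every (year, month) cell into its own row.
long = df.reset_index().melt(id_vars="YEAR", var_name="MON", value_name="dd")

# Upper-case the month labels and give them calendar order so they sort correctly.
month_order = [m.upper() for m in months]
long["MON"] = pd.Categorical(long["MON"].str.upper(),
                             categories=month_order, ordered=True)

# Assumed ordering: years ascending, months descending (DEC first).
long = long.sort_values(["YEAR", "MON"], ascending=[True, False],
                        ignore_index=True)
print(long.head())
```

The ordered categorical is only there for sorting; without it, the months would sort alphabetically (APR, AUG, DEC, ...).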

Related

Get specific data from txt file to pandas dataframe

I have such data in a txt file:
Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
I want to get a pandas dataframe looking like this:
timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1
That means the string in the first line of each paragraph should be used for the first column, timestamp. The other columns should be filled with the numbers, according to the string that follows each number. A new row begins after each paragraph.
How can I do this with pandas?
First, process the txt file into a list of lists. Each inner list holds the lines of one hunk, and the outer list holds the hunks:
import pandas as pd

# Read the whole file, then split it into blank-line-separated hunks.
with open('data.txt', 'r') as f:
    res = f.read()
records = [list(map(str.strip, line.strip().split('\n'))) for line in res.split('\n\n')]
print(records)
[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
Then turn the list of lists into a list of dictionaries by defining each key and value manually:
l = []
for record in records:
    d = {}
    d['timestamp'] = record[0]  # first line of each hunk is the timestamp
    for r in record[1:]:
        # each remaining line looks like "<count> <name>"
        value, key = r.split(' ', 1)
        d[key] = value
    l.append(d)
print(l)
[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]
Finally, feed this list of dictionaries into a DataFrame and fill the NaN cells:
df = pd.DataFrame(l).fillna(0)
print(df)
                      timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0  Wed Mar 23 16:59:25 GMT 2022     1     1          0        0
1  Wed Mar 23 16:59:26 GMT 2022     1     1          1        0
2  Wed Mar 23 16:59:27 GMT 2022     1     1          0       10
3  Wed Mar 23 16:59:28 GMT 2022     1   102          1        0
Try:
#read text file to a DataFrame
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)
#Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")
#Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)","",regex=True)
df = df.dropna(how="all").fillna("timestamp")
df["Index"] = df["Column"].eq("timestamp").cumsum()
#Pivot the data to match expected output structure (keyword arguments are required in pandas 2.0+)
output = df.pivot(index="Index", columns="Column", values=0)
#Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()
>>> output
Column timestamp CLOSE-WAIT ESTAB FIN-WAIT State
0 Wed Mar 23 16:59:25 GMT 2022 0 1 0 1
1 Wed Mar 23 16:59:26 GMT 2022 1 1 0 1
2 Wed Mar 23 16:59:27 GMT 2022 0 1 10 1
3 Wed Mar 23 16:59:28 GMT 2022 1 102 0 1
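The sample in the question is shown without blank lines between the hunks, so here is a hedged alternative sketch that does not rely on them. It assumes that every count line starts with an integer and that any other non-empty line is a timestamp. The inline `text` string stands in for the contents of the file.

```python
import pandas as pd

text = """Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB"""

records = []
for line in text.splitlines():
    line = line.strip()
    if not line:
        continue  # blank separator lines are optional
    first, _, rest = line.partition(" ")
    if first.isdigit():
        # "<count> <name>" line: attach it to the most recent timestamp
        records[-1][rest] = int(first)
    else:
        # anything else is assumed to start a new record
        records.append({"timestamp": line})

df = pd.DataFrame(records).fillna(0)
# fillna leaves float columns where values were missing; cast counts back to int
df = df.astype({col: int for col in df.columns.drop("timestamp")})
print(df)
```

Because the counts are converted with `int()` while parsing, the resulting columns are numeric rather than strings.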

Pandas: How to draw bar graph on month over counts

I have a dataframe df as below:
Student_id Date_of_visit(d/m/y)
1 1/4/2020
1 30/12/2019
1 26/12/2019
2 3/1/2021
2 10/1/2021
3 4/5/2020
3 22/8/2020
How can I get a bar graph with month-year on the x-axis (e.g. x-ticks: Dec 2019, Jan 2020, Feb 2020) and, on the y-axis, the total number of student visits (count) in each month?
Convert the values to datetimes, then use DataFrame.resample with Resampler.size to get the counts, and build the month-year labels with DatetimeIndex.strftime:
df['Date_of_visit'] = pd.to_datetime(df['Date_of_visit'], dayfirst=True)
s = df.resample('M', on='Date_of_visit')['Student_id'].size()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 2
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 2
Name: Student_id, dtype: int64
If you need to count only unique Student_id values, use Resampler.nunique:
s = df.resample('M', on='Date_of_visit')['Student_id'].nunique()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 1
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 1
Name: Student_id, dtype: int64
Finally, plot with Series.plot.bar:
s.plot.bar()
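For reference, a self-contained sketch of the same monthly count, written with monthly periods instead of resample. This is a swapped-in variant, not the method above; the data is the sample from the question, and the column is named `Date_of_visit` as in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_id": [1, 1, 1, 2, 2, 3, 3],
    "Date_of_visit": ["1/4/2020", "30/12/2019", "26/12/2019",
                      "3/1/2021", "10/1/2021", "4/5/2020", "22/8/2020"],
})
df["Date_of_visit"] = pd.to_datetime(df["Date_of_visit"], dayfirst=True)

# Count visits per calendar month.
months = df["Date_of_visit"].dt.to_period("M")
counts = df.groupby(months).size()

# Add the months without visits so the x-axis has no gaps.
full_range = pd.period_range(counts.index.min(), counts.index.max(), freq="M")
counts = counts.reindex(full_range, fill_value=0)

# Month-year labels such as "Dec 2019".
counts.index = counts.index.strftime("%b %Y")
print(counts)
```

With matplotlib installed, `counts.plot.bar()` then draws the chart in the same way as `s.plot.bar()`.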

Pandas MultiIndex: iterate rows and add specific values to create a new variable

I have a pandas data frame with a MultiIndex (id and datetime) and one column named X1.
X1
id datetime
a1ssjdldf 2019 Jul 10 2
2019 Jul 11 22
2019 Jul 12 21
r2dffs 2019 Jul 10 14
2019 Jul 11 13
2019 Jul 12 11
I want to create a new variable X2 where the corresponding value is the difference between the X1 value of the same row and the X1 value of the previous row. But every time it sees a new id the corresponding value has to be restarted from zero.
For example:
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
Use DataFrameGroupBy.diff on the first index level, replace the missing values with Series.fillna, and cast back to integers:
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0).astype(int)
print (df)
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
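A runnable sketch of the per-group difference, rebuilding the sample frame from the question. The dates are kept as plain strings here for brevity, which is an assumption; the real index may hold datetimes.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a1ssjdldf", "r2dffs"], ["2019 Jul 10", "2019 Jul 11", "2019 Jul 12"]],
    names=["id", "datetime"],
)
df = pd.DataFrame({"X1": [2, 22, 21, 14, 13, 11]}, index=index)

# diff() runs inside each id, so the first row of every id is NaN;
# fillna(0) restarts each id at zero and astype(int) restores integers.
df["X2"] = df.groupby(level="id")["X1"].diff().fillna(0).astype(int)
print(df)
```

Grouping by `level="id"` is equivalent to `level=0` for this index and reads a little more clearly.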

How to split one row into multiple and apply datetime on dataframe column?

I have one dataframe which looks like below:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 21 Dec 2017 18 Dec 2017 21 Dec 2017
4 22 Dec 2017 22 Dec 2017
Conditions to be checked:
Check whether any row contains two dates, like row 3. If so, split it into two separate rows.
Convert both columns to datetime.
I tried the conversion like this:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But I get the error below:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017
After extracting the dates with a regex and findall, the problem becomes an unnesting problem (unnesting below is a user-defined helper, not a pandas function):
s=df.apply(lambda x : x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
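Since the `unnesting` helper is not defined in the answer, here is a hedged sketch that gets a similar result with the built-in DataFrame.explode. It assumes pandas 1.3 or later (needed to explode several columns at once), dates in the exact `5 Dec 2017` style, and the same number of dates in both columns of any given row. The simplified regex is an assumption that fits the sample data only.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["5 Dec 2017", "14 Dec 2017", "18 Dec 2017 21 Dec 2017"],
    "Date_2": ["5 Dec 2017", "14 Dec 2017", "18 Dec 2017 21 Dec 2017"],
})

# Day, abbreviated month name, four-digit year.
pattern = r"\d{1,2} [A-Za-z]{3} \d{4}"

# Each cell becomes a list of the date strings it contains.
lists = df.apply(lambda col: col.str.findall(pattern))

# One row per list element, both columns exploded in step.
long = lists.explode(["Date_1", "Date_2"], ignore_index=True)

result = long.apply(pd.to_datetime, format="%d %b %Y")
print(result)
```

Unlike the output above, `ignore_index=True` renumbers the rows 0 to 3 instead of repeating the original index for the split row.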

Python dataframe group by column and create new column with percentage [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 5 years ago.
I have a dataframe that looks something like this:
Month Amount
1 Jan 260
2 Feb 179
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
I'm trying to add a new column with each row's percentage of the total, using groupby and a lambda function as below:
df = pd.DataFrame(mylistofdict)
df = df.groupby('Month')["Amount"].apply(lambda x: x / x.sum()*100)
But I'm not getting the expected result below, which has only 2 columns:
Month Percentage
1 Jan 22%
2 Feb 15%
3 Mar 13%
4 Apr 12%
5 May 11%
6 Jun 10%
7 Jul 6%
8 Aug 5%
9 Sep 4%
10 Oct 1%
11 Nov 0
12 Dec 0
How do I modify my code, or is there a better approach than using a dataframe?
If the values of Month are unique, use:
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
1 Jan 260 22.203245
2 Feb 179 15.286080
3 Mar 153 13.065756
4 Apr 142 12.126388
5 May 128 10.930828
6 Jun 116 9.906063
7 Jul 71 6.063194
8 Aug 56 4.782237
9 Sep 49 4.184458
10 Oct 17 1.451751
11 Nov 0 0.000000
12 Dec 0 0.000000
If the values of Month are duplicated, I believe you can aggregate first:
print (df)
Month Amount
1 Jan 260
1 Jan 100
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
df = df.groupby('Month', as_index=False, sort=False)["Amount"].sum()
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
0 Jan 360 32.967033
1 Mar 153 14.010989
2 Apr 142 13.003663
3 May 128 11.721612
4 Jun 116 10.622711
5 Jul 71 6.501832
6 Aug 56 5.128205
7 Sep 49 4.487179
8 Oct 17 1.556777
9 Nov 0 0.000000
10 Dec 0 0.000000
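The attempt in the question groups by Month, so each group holds a single row and `x / x.sum()` compares every amount with itself rather than with the overall total. A short sketch of the percent-of-total column formatted as in the expected output follows; the rounding to whole percentages and the `%` suffix are assumptions read off that output.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "Amount": [260, 179, 153, 142, 128, 116, 71, 56, 49, 17, 0, 0],
})

# Share of the overall total, in percent.
share = df["Amount"] / df["Amount"].sum() * 100

# Round to whole numbers and format as strings such as "22%".
df["Percentage"] = share.round().astype(int).astype(str) + "%"

result = df[["Month", "Percentage"]]
print(result)
```

Note that the zero rows come out as `0%` here, whereas the expected output in the question shows a bare `0`.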
