Get specific data from txt file to pandas dataframe

I have such data in a txt file:
Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
I want to get a pandas dataframe looking like this:
timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1
That means the string in the first line of each paragraph should be used for the first column, timestamp. The other columns should be filled with the numbers, according to the string following each number. A new row begins with each paragraph.
How can I do this with pandas?

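First, parse the txt file into a list of lists: each inner list holds the lines of one block, and the outer list holds all the blocks. This assumes the blocks in the file are separated by blank lines:
import pandas as pd

with open('data.txt', 'r') as f:
    res = f.read()

# Split on blank lines into blocks, then split each block into stripped lines
records = [list(map(str.strip, block.strip().split('\n'))) for block in res.split('\n\n')]
print(records)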
[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
Then turn the list of lists into a list of dictionaries by defining each key and value manually:
l = []
for record in records:
    d = {}
    # The first line of each block is the timestamp
    d['timestamp'] = record[0]
    for r in record[1:]:
        # Every other line looks like '<count> <name>'
        value, key = r.split(' ', 1)
        d[key] = value
    l.append(d)
print(l)
[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]
Finally, pass this list of dictionaries to the DataFrame constructor and fill the NaN cells with 0:
df = pd.DataFrame(l).fillna(0)
print(df)
timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0 Wed Mar 23 16:59:25 GMT 2022 1 1 0 0
1 Wed Mar 23 16:59:26 GMT 2022 1 1 1 0
2 Wed Mar 23 16:59:27 GMT 2022 1 1 0 10
3 Wed Mar 23 16:59:28 GMT 2022 1 102 1 0
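
The steps above can be combined into one pass. The following is a minimal sketch, not the answer's original code: it assumes every count line has the form "<number> <name>", that any other non-empty line is a timestamp, and that the file starts with a timestamp. It does not rely on blank lines between blocks, and it stores the counts as integers rather than strings. The inline string raw stands in for the file contents.

```python
import io
import re

import pandas as pd

# Stand-in for the contents of data.txt (assumption: same layout as in the question)
raw = """Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
"""

# A count line is '<digits><whitespace><name>'; anything else is treated as a timestamp
count_line = re.compile(r"^(\d+)\s+(\S+)$")

rows = []
for line in io.StringIO(raw):
    line = line.strip()
    if not line:
        continue  # blank separator lines are optional
    match = count_line.match(line)
    if match:
        # Attach the count to the most recent timestamp row
        rows[-1][match.group(2)] = int(match.group(1))
    else:
        rows.append({"timestamp": line})

df = pd.DataFrame(rows).fillna(0)
# Columns that had missing values became float; convert all counts back to int
count_cols = df.columns.drop("timestamp")
df[count_cols] = df[count_cols].astype(int)
print(df)
```

To read from a real file, replace io.StringIO(raw) with an open file handle.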

Try:
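import pandas as pd

# Read the text file into a one-column DataFrame, keeping blank lines as NaN rows
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)
# Extract the possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")
# Remove the column names from the data, leaving only the counts
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)", "", regex=True)
# Drop the blank rows; rows without a column name are the timestamps
df = df.dropna(how="all").fillna("timestamp")
# Number each block by counting the timestamp rows seen so far
df["Index"] = df["Column"].eq("timestamp").cumsum()
# Pivot the data to match the expected output structure
# (pandas 2.0+ requires keyword arguments here)
output = df.pivot(index="Index", columns="Column", values=0)
# Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()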
>>> output
Column timestamp CLOSE-WAIT ESTAB FIN-WAIT State
0 Wed Mar 23 16:59:25 GMT 2022 0 1 0 1
1 Wed Mar 23 16:59:26 GMT 2022 1 1 0 1
2 Wed Mar 23 16:59:27 GMT 2022 0 1 10 1
3 Wed Mar 23 16:59:28 GMT 2022 1 102 0 1

Related

Wide to long in python, with year repeating [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 1 year ago.
I want to convert this wide-format table in pandas:
Jan Feb Mar Apr may jun jul aug sep oct nov dec
2019 0 0 0 0 0 0 0 0 0 0 0 0
2020 0 0 0 0 0 0 0 0 0 0 0 0
2021 0 0 0 0 0 0 0 0 0 0 0 0
into this long format:
YEAR MON dd
2019 DEC 0
2019 NOV 0
2019 OCT 0
2019 SEP 0
2019 AUG 0
2019 JUL 0
2019 JUN 0
2019 MAY 0
2019 APR 0
2019 MAR 0
2019 FEB 0
2019 JAN 0
2018 DEC 0
How can this be done ?

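df.transpose() swaps the axes, making your columns the rows and your rows the columns. On its own it does not produce the long YEAR / MON / dd layout; for that reshape, see the linked duplicate, which uses melt or stack.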
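
As a hedged illustration of the wide-to-long reshape asked for here, the sketch below uses DataFrame.melt instead of transpose. It assumes the years are the index of the wide table and names that index YEAR; the frame itself is rebuilt from the question's sample, and the variable names are hypothetical.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumption: years are the index, months are the columns, all values are 0
wide = pd.DataFrame(0, index=pd.Index([2019, 2020, 2021], name="YEAR"), columns=months)

# Move the year index into a column, then melt the month columns into rows
long_df = wide.reset_index().melt(id_vars="YEAR", var_name="MON", value_name="dd")

# Upper-case the month labels to match the requested output
long_df["MON"] = long_df["MON"].str.upper()

# Group the rows by year while keeping the months in calendar order
long_df = long_df.sort_values("YEAR", kind="stable", ignore_index=True)
print(long_df.head(12))
```

The question's expected output lists the months from DEC down to JAN; reversing the column order before melting would give that ordering.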

Pandas: How to draw bar graph on month over counts

I have a dataframe df as below:
Student_id Date_of_visit(d/m/y)
1 1/4/2020
1 30/12/2019
1 26/12/2019
2 3/1/2021
2 10/1/2021
3 4/5/2020
3 22/8/2020
How can I get a bar graph with month-year on the x-axis (e.g. x-ticks: Dec 2019, Jan 2020, Feb 2020) and, on the y-axis, the total number of student visits (count) in each month?

Convert the values to datetimes, then use DataFrame.resample with Resampler.size to get the counts, and format the resulting datetimes as strings with DatetimeIndex.strftime:
df['Date_of_visit'] = pd.to_datetime(df['Date_of_visit'], dayfirst=True)
s = df.resample('M', on='Date_of_visit')['Student_id'].size()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 2
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 2
Name: Student_id, dtype: int64
If you need to count only unique Student_id values per month, use Resampler.nunique:
s = df.resample('M', on='Date_of_visit')['Student_id'].nunique()
s.index = s.index.strftime('%b %Y')
print (s)
Date_of_visit
Dec 2019 1
Jan 2020 0
Feb 2020 0
Mar 2020 0
Apr 2020 1
May 2020 1
Jun 2020 0
Jul 2020 0
Aug 2020 1
Sep 2020 0
Oct 2020 0
Nov 2020 0
Dec 2020 0
Jan 2021 1
Name: Student_id, dtype: int64
Finally, plot with Series.plot.bar:
s.plot.bar()
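
For reference, here is a self-contained sketch of the same monthly count using the question's sample data. It swaps in a different technique, grouping by monthly periods (Series.dt.to_period) rather than resample, and then reindexes over the full month range so that empty months show up as 0. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_id": [1, 1, 1, 2, 2, 3, 3],
    "Date_of_visit": ["1/4/2020", "30/12/2019", "26/12/2019",
                      "3/1/2021", "10/1/2021", "4/5/2020", "22/8/2020"],
})

# The dates are day-first (d/m/y)
df["Date_of_visit"] = pd.to_datetime(df["Date_of_visit"], dayfirst=True)

# Count visits per calendar month
visit_month = df["Date_of_visit"].dt.to_period("M")
counts = df.groupby(visit_month)["Student_id"].size()

# Fill in months without any visits so the x-axis is continuous
all_months = pd.period_range(visit_month.min(), visit_month.max(), freq="M")
counts = counts.reindex(all_months, fill_value=0)

# Label each month like 'Dec 2019'
counts.index = counts.index.strftime("%b %Y")
print(counts)

# To draw the chart (requires matplotlib):
# counts.plot.bar()
```

Replacing .size() with .nunique() would count distinct students per month instead of visits.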

How to split one row into multiple and apply datetime on dataframe column?

I have one dataframe which looks like below:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 21 Dec 2017 18 Dec 2017 21 Dec 2017
4 22 Dec 2017 22 Dec 2017
Conditions to be checked:
Check whether any row contains two dates, like the row at index 3. If it does, split it into two separate rows.
Convert both columns to datetime.
I am trying to do this as follows:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But getting below error:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017

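After extracting the dates with a regex and findall, your problem becomes an unnesting problem (unnesting below is a user-defined helper that expands list cells into rows; it is not a pandas function):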
s=df.apply(lambda x : x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
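
Since the unnesting helper is not defined in the answer, here is a hedged alternative sketch that uses the built-in DataFrame.explode instead. It assumes pandas 1.3 or later (needed for exploding several columns at once), that every date has the form "5 Dec 2017", and that both columns hold the same number of dates in each row. The simplified pattern is an assumption, not the answer's regex.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_1": ["5 Dec 2017", "14 Dec 2017", "18 Dec 2017 21 Dec 2017", "22 Dec 2017"],
    "Date_2": ["5 Dec 2017", "14 Dec 2017", "18 Dec 2017 21 Dec 2017", "22 Dec 2017"],
})

# Simplified pattern (assumption): day, three-letter month, four-digit year
pattern = r"\d{1,2} [A-Za-z]{3} \d{4}"

# Turn every cell into a list of the dates it contains
date_lists = df.apply(lambda col: col.str.findall(pattern))

# Expand the lists into rows; both columns are exploded together
long_df = date_lists.explode(["Date_1", "Date_2"], ignore_index=True)

# Convert both columns to datetime
result = long_df.apply(pd.to_datetime, format="%d %b %Y")
print(result)
```

With ignore_index=True the rows are renumbered 0 to 4; drop that argument to keep the repeated original index, as in the output above.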

pandas DataFrame pivot with sort

I'm having a problem presenting data in the required way. My dataframe is formatted and then sorted by 'Site ID'. I need to present the data by Site ID with all date instances grouped alongside.
I'm 90% there in terms of how I want it to look using pivot_table
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
however the date column is not sorted.
(The tiny example output appears sorted however the ****Thu Jan 11 2018 10:43:20 entry**** illustrates my issue on large data sets)
I cannot figure out how to present it like below but also with the dates sorted per site ID
Any help is gratefully accepted
df = pd.DataFrame.from_dict([{'Site Ref': '1234567', 'Site Name': 'Building A', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Tue Jan 09 2018 10:43:20', 'Duration': 70}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Wed Jan 10 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1212345', 'Site Name':'Building C', 'Date': 'Fri Jan 12 2018 10:43:20', 'Duration': 100}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Thu Jan 11 2018 10:43:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Fri Jan 12 2018 12:22:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Mon Jan 15 2018 11:43:20', 'Duration': 90}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Wed Jan 17 2018 10:43:20', 'Duration': 220}])
df = pd.DataFrame(df, columns=['Site Ref', 'Site Name', 'Date', 'Duration'])
df = df.sort_values(by=['Site Ref'])
df
Site Ref Site Name Date Duration
5 1123456 Building D Thu Jan 11 2018 10:43:20 80
6 1123456 Building D Fri Jan 12 2018 12:22:20 80
7 1123456 Building D Mon Jan 15 2018 11:43:20 90
8 1123456 Building D Wed Jan 17 2018 10:43:20 220
4 1212345 Building C Fri Jan 12 2018 10:43:20 100
0 1234567 Building A Mon Jan 08 2018 10:43:20 120
1 1245678 Building B Mon Jan 08 2018 10:43:20 120
2 1245678 Building B Tue Jan 09 2018 10:43:20 70
3 1245678 Building B Wed Jan 10 2018 10:43:20 120
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
df_pivot
Site Ref Site Name Date
1123456 Building D Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
****Thu Jan 11 2018 10:43:20 80****
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120

It's sorted lexicographically because Date has object (string) dtype.
Workaround: add a new column of datetime dtype, place it before Date in the pivot_table index, and drop it afterwards:
In [74]: (df.assign(x=pd.to_datetime(df['Date']))
            .pivot_table(index=['Site Ref','Site Name', 'x', 'Date'])
            .reset_index(level='x', drop=True))
Out[74]:
Duration
Site Ref Site Name Date
1123456 Building D Thu Jan 11 2018 10:43:20 80
Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120

You need to convert your dates to datetime values rather than strings. Something like the following would work on your current pivot table:
df_pivot.reset_index(inplace=True)
df_pivot['Date'] = pd.to_datetime(df_pivot['Date'])
df_pivot.sort_values(by=['Site Ref', 'Date'], inplace=True)
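
A self-contained sketch of the same idea, sorting on a parsed datetime while keeping the original date strings for display, could look like the following. The sample rows are a subset of the question's data, the helper column name _parsed is hypothetical, and the explicit format string assumes every date looks like "Thu Jan 11 2018 10:43:20". Unlike pivot_table, this does not aggregate; it only orders and indexes the rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Site Ref": ["1123456", "1123456", "1123456", "1212345"],
    "Site Name": ["Building D", "Building D", "Building D", "Building C"],
    "Date": ["Fri Jan 12 2018 12:22:20", "Mon Jan 15 2018 11:43:20",
             "Thu Jan 11 2018 10:43:20", "Fri Jan 12 2018 10:43:20"],
    "Duration": [80, 90, 80, 100],
})

# Parse the strings so that sorting is chronological rather than alphabetical
parsed = pd.to_datetime(df["Date"], format="%a %b %d %Y %H:%M:%S")

ordered = (
    df.assign(_parsed=parsed)                  # temporary sort key
      .sort_values(["Site Ref", "_parsed"])    # by site, then by real date
      .drop(columns="_parsed")                 # keep only the original columns
      .set_index(["Site Ref", "Site Name", "Date"])
)
print(ordered)
```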

Sort the values by Site Ref, then take the groupby mean with sort=False so that the groups keep the order in which they appear, i.e.:
df.sort_values('Site Ref').groupby(['Site Ref','Site Name','Date'],sort=False).mean()
Duration
Site Ref Site Name Date
1123456 Building D Thu Jan 11 2018 10:43:20 80
Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120

Change date format in pandas dataframe

I have this dataframe:
date value
1 Thu 17th Nov 2016 385.943800
2 Fri 18th Nov 2016 1074.160340
3 Sat 19th Nov 2016 2980.857860
4 Sun 20th Nov 2016 1919.723960
5 Mon 21st Nov 2016 884.279340
6 Tue 22nd Nov 2016 869.071070
7 Wed 23rd Nov 2016 760.289260
8 Thu 24th Nov 2016 2481.689270
9 Fri 25th Nov 2016 2745.990070
10 Sat 26th Nov 2016 2273.413250
11 Sun 27th Nov 2016 2630.414900
12 Mon 28th Nov 2016 817.322310
13 Tue 29th Nov 2016 1766.876030
14 Wed 30th Nov 2016 469.388420
I would like to change the format of the date column to this format YYYY-MM-DD. The dataframe consists of more than 200 rows, and every day new rows will be added, so I need to find a way to do this automatically.
This link is not helping because it sets the dates like this dates = ['30th November 2009', '31st March 2010', '30th September 2010'] and I can't do it for every row. Anyone knows a way to solve this?

Dateutil will do this job.
from dateutil import parser
print(df)
df2 = df.copy()
df2.date = df2.date.apply(lambda x: parser.parse(x))
df2
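
For the YYYY-MM-DD strings the question asks for, a pandas-only sketch is shown below. It is an alternative to the dateutil approach, not a replacement for it: it assumes every date looks like "Thu 17th Nov 2016", strips the ordinal suffix with a regex, and parses the rest with an explicit format. The sample frame holds a few rows copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["Thu 17th Nov 2016", "Mon 21st Nov 2016",
             "Tue 22nd Nov 2016", "Wed 23rd Nov 2016"],
    "value": [385.943800, 884.279340, 869.071070, 760.289260],
})

# Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Parse with an explicit format, then render as YYYY-MM-DD strings
df["date"] = pd.to_datetime(cleaned, format="%a %d %b %Y").dt.strftime("%Y-%m-%d")
print(df)
```

Leaving out the final .dt.strftime call keeps the column as datetime64, which is usually more convenient for sorting and resampling as new rows arrive.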
