Problems extracting data from pandas data frame as a result of Grouping - python

In order to plot the frequency of tornados every 10 days, I have grouped the data into 10-day bins using
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
However, the column DATE no longer exists afterwards, as shown when I run:
>>> df_grouped.shape
(1041, 1)
despite the fact that I am able to view and plot the dates in the Jupyter notebook GUI.
This is an issue, as I wish to access this data later for other purposes and I am unable to do so using:
year = pd.to_datetime(df_grouped['DATE'], dayfirst=True, errors='coerce').dt.year.values
df_grouped['year'] = year
This raises an invalid indexing error, since the column no longer exists. Does anyone know how I can access the data?
Minimal reproducible example
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
expected output
|DATE      |COUNT |
|1994-01-01|10    |
|1994-01-11|10    |
|1994-01-21|1     |
actual output (DATE has become the index, not a column)
|          |COUNT |
|DATE      |      |
|1994-01-01|10    |
|1994-01-11|10    |
|1994-01-21|1     |

The grouping key ends up in the index after groupby, which is why DATE is no longer a column; reset_index() moves it back out:
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df = (df.assign(COUNT=lambda x: 1)
        .groupby(pd.Grouper(key='DATE', freq='10D')).count()
        .reset_index())
print(df)
#         DATE  COUNT
# 0 1994-01-01     10
# 1 1994-01-11     10
# 2 1994-01-21      1
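The same idea works without the dummy COUNT column: count the DATE series directly, give the result a name, and reset the index. A minimal sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])

# count rows per 10-day bin, then move the DATE index back into a column
df_grouped = (df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE']
                .count()
                .rename('COUNT')
                .reset_index())
print(df_grouped)
```

Because DATE is a real column again, the later `df_grouped['DATE'].dt.year` access from the question works as intended.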


merging rows and replacing NaN values with pandas

I am trying to merge rows with each other to get one row containing all the values that are present. Currently the df looks like this:
(screenshot of the dataframe)
What I want is something like:
| index | scan .. | snel. | kool .. | note .. |
| ----- | ------- | ----- | ------- | ------- |
| 0 | 7,8 | 4,0 | 20.0 | Fiasp, ..|
I can get that output with the code example below, but it just seems really messy.
I tried to use groupby, agg, sum, and max, and all they do is remove columns, looking like this:
df2.groupby('Tijdstempel apparaat').max().reset_index()
I tried filling the rows with the values of the previous rows and then dropping the rows that don't contain every value, but this seems like a long workaround and really messy.
df2 = df2.loc[df['Tijdstempel apparaat'] == '20-01-2023 13:24']
df2 = df2.reset_index()
del df2['index']
df2['Snelwerkende insuline (eenheden)'].fillna(method='pad', inplace=True)
df2['Koolhydraten (gram)'].fillna(method='pad', inplace=True)
df2['Notities'].fillna(method='pad', inplace=True)
df2['Scan Glucose mmol/l'].fillna(method='pad', inplace=True)
print(df2)
# df2.loc[df2[0,'Snelwerkende insuline (eenheden)']] = df2.loc[df2[1, 'Snelwerkende insuline (eenheden)']]
df2.drop([0, 1, 2])
Output:
When I have to do this for the entire data.csv (whenever a timestamp like "20-01-2023 13:24" occurs multiple times), I am worried it will be really slow and time-consuming.
Sample data like yours:
df = pd.DataFrame(data={
    "times": ["date1", "date1", "date1", "date1", "date1"],
    "type": [1, 2, 3, 4, 5],
    "key1": [1, None, None, None, None],
    "key2": [None, "2", None, None, None],
    "key3": [None, None, 3, None, None],
    "key4": [None, None, None, "val", None],
    "key5": [None, None, None, None, 5],
})
Solution:
melt = df.melt(id_vars="times",
               value_vars=df.columns[1:])
melt = melt.dropna()
pivot = melt.pivot_table(values="value", index="times", columns="variable", aggfunc=lambda x: x)
# move the "type" column back to the front
index = list(pivot.columns).index("type")
pivot = pd.concat([pivot.iloc[:, index:], pivot.iloc[:, :index]], axis=1)
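A simpler alternative worth knowing: groupby().first() returns the first non-null value per column within each group, which collapses the scattered values into one row per timestamp in a single call. A sketch with the same sample data (caveat: if a column has several non-null values within one group, only the first survives):

```python
import pandas as pd

df = pd.DataFrame(data={
    "times": ["date1", "date1", "date1", "date1", "date1"],
    "type": [1, 2, 3, 4, 5],
    "key1": [1, None, None, None, None],
    "key2": [None, "2", None, None, None],
    "key3": [None, None, 3, None, None],
    "key4": [None, None, None, "val", None],
    "key5": [None, None, None, None, 5],
})

# first() skips NaN, so each key column contributes its single value
merged = df.groupby("times", as_index=False).first()
print(merged)
```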

How do I round decimals when writing a dataframe in streamlit

import streamlit as st
import pandas as pd
data = {'Days': ["Sunday", "Wednesday", "Friday"],
'Predictions': [433.11, 97.9, 153.65]}
df = pd.DataFrame(data)
st.write(df)
streamlit writes the dataframe with four decimals by default, but I expected two. With print() it produces the expected output, but st.write() produces the following:
Days | Predictions
-------------|-------------
0 Sunday | 433.1100 |
1 Wednesday| 97.9000 |
2 Friday | 153.6500 |
I tried:
df.round(2)
but it didn't help.
Desired output format:
Days | Predictions
-------------|-------------
0 Sunday | 433.11 |
1 Wednesday| 97.90 |
2 Friday | 153.65 |
With the df.style.format() method, you can specify the desired number of decimal places to override pandas' default display in streamlit.
Note: this approach only changes the displayed format; for further computation you should still use the original dataframe, df.
formatted_df = df.style.format({"Predictions": "{:.2f}".format})
st.write(formatted_df)
You could also choose to write the dataframe without having to initialize a variable, like so:
st.write(df.style.format({"Predictions": "{:.2f}"}))

multiple columns to single datetime dataframe column

I have a data frame that contains (among others) columns for the time of day (00:00-23:59:59), day (1-7), month (1-12), and year (2000-2019). How can I combine the values of these columns row by row into a new DateTime object and store the results in a new column? I've read other posts pertaining to such a task, but they all seem to involve converting one date column into one DateTime column, whereas I have 4 columns that need to be transformed. Any help is appreciated!
e.g.
| 4:30:59 | 1 | 1 | 2000 | TO 2000/1/1 4:30:59
this is the only code I have so far, which probably doesn't do anything useful:
#creating datetime object (MISC)
data = pd.read_csv('road_accidents_data_clean.csv',delimiter=',')
df = pd.DataFrame(data)
format = '%Y-%m-%d %H:%M:%S'
n = 0
df['datetime'] = data.loc[n,'Crash_Day'],data.loc[n,'Crash_Month'],data.loc[n,'Year']
My DataFrame is laid out as follows:
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
1 42 2000 1 1 7:45:10 DRIVER MALE
2 25 2000 1 1 10:15:30 PEDESTRIAN FEMALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
OVERTAKING 1 YARRA MELBOURNE NaN
ADJACENT DIR 0 MONASH MELBOURNE NaN
NOTE: the dataframe is 13 columns wide; I just couldn't fit them all on one line, so Crash_Type starts to the right of Gender.
Below is the code I've been suggested to use, and my adaptation of it:
df = pd.DataFrame(dict(
Crash_Time=['4:30:59','4:20:00'],
Crash_Day=[1,20],
Crash_Month=[1,4],
Year=[2000,2020],
))
data['Datetime'] = df['Datetime']=pd.to_datetime(
np.sum([
df['Year'].astype(str),
'-',
df['Crash_Month'].astype(str),
'-',
df['Crash_Day'].astype(str),
' ',
df['Crash_Time'],
]),
format = '%Y-%m-%d %H:%M:%S',
)
I've adapted this code in order to combine the values for the datetime column with my original dataframe.
Combine the columns into a single series of strings using + (converting to str where needed with the pandas.Series.astype method), then pass that new series into pd.to_datetime before assigning it to a new column in your df:
import pandas as pd
df = pd.DataFrame(dict(time=['4:30:59'],date=[1],month=[1],year=[2000]))
df['datetime'] = pd.to_datetime(
df['year'].astype(str)+'-'+df['month'].astype(str)+'-'+df['date'].astype(str)+' '+df['time'],
format = '%Y-%m-%d %H:%M:%S',
)
print(df)
example in python tutor
edit: You can also use numpy.sum to make that one long line of added columns easier on the eyes:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(
time=['4:30:59','4:20:00'],
date=[1,20],
month=[1,4],
year=[2000,2020],
))
df['datetime']=pd.to_datetime(
np.sum([
df['year'].astype(str),
'-',
df['month'].astype(str),
'-',
df['date'].astype(str),
' ',
df['time'],
]),
format = '%Y-%m-%d %H:%M:%S',
)
sum example in python tutor
edit 2: Using your actual column names, it should be something like this:
import pandas as pd
import numpy as np
'''
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
'''
df = pd.DataFrame(dict(
Crash_Time=['4:30:59','4:20:00'],
Crash_Day=[1,20],
Crash_Month=[1,4],
Year=[2000,2020],
))
df['Datetime']=pd.to_datetime(
np.sum([
df['Year'].astype(str),
'-',
df['Crash_Month'].astype(str),
'-',
df['Crash_Day'].astype(str),
' ',
df['Crash_Time'],
]),
format = '%Y-%m-%d %H:%M:%S',
)
print(df)
another python tutor link
One thing to note: you might want to double-check whether your csv file is separated by just a comma, or by a comma and a space. You may need to load the data with df = pd.read_csv('road_accidents_data_clean.csv', sep=', ') if there is an extra space separating the data in addition to the comma; you don't want that extra space in your data.
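As a side note, the string concatenation can be avoided entirely: pd.to_datetime accepts a DataFrame whose columns are named year/month/day, and pd.to_timedelta parses the time-of-day strings. A sketch of that variant on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame(dict(
    Crash_Time=['4:30:59', '4:20:00'],
    Crash_Day=[1, 20],
    Crash_Month=[1, 4],
    Year=[2000, 2020],
))

# pd.to_datetime assembles dates from columns named year, month, day
dates = pd.to_datetime(
    df.rename(columns={'Year': 'year', 'Crash_Month': 'month', 'Crash_Day': 'day'})
      [['year', 'month', 'day']]
)
# add the time of day as a timedelta (strings like '4:30:59' parse as H:M:S)
df['Datetime'] = dates + pd.to_timedelta(df['Crash_Time'])
print(df)
```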

pandas merge header rows if one is not NaN

I'm importing into a dataframe an excel sheet which has its headers split into two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use a list comprehension to flatten the MultiIndex column header (this assumes the sheet was read with header=[0, 1]):
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
This should work for you:
df.columns = list(df.columns.get_level_values(0))
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
>>> <class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe into a list and then putting that back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas:
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])
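For reference, the round-trip through a list isn't strictly needed: the merged Series can be assigned to df.columns directly. A sketch on a toy frame that assumes the same two-row header layout:

```python
import pandas as pd

# toy frame mimicking the sheet: two partial header rows, then data
df = pd.DataFrame([
    ['Colour', None, 'Shape', 'Mass', None],
    [None, 'width', None, None, 'Torque'],
    ['green', 33, 'round', 2, 6],
])

# combine_first fills the NaNs in row 0 from row 1, merging the headers
df.columns = df.iloc[0].combine_first(df.iloc[1])
# drop the two old header rows and renumber
df = df.iloc[2:].reset_index(drop=True)
print(df)
```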

How do I create a heatmap from two columns plus the value of those two in Python

Hello, and thank you for helping!
I would like to generate a heatmap in python, from the data df.
(i am using pandas, seaborn, numpy, and matplotlib in my project)
The dataframe df looks like:
index | a | b | c | year | month
0 | | | | 2013 | 1
1 | | | | 2015 | 4
2 | | | | 2016 | 10
3 | | | | 2017 | 1
In the dataset, each row is a ticket.
The dataset is big (51 columns and 100k+ rows), so a, b, c are just some placeholder columns. (For month: 1 = Jan, 2 = Feb, ...)
For the heatmap:
x-axis = year,
y-axis = month,
value: the count of rows in which a ticket was given in that year and month.
The result I imagine should look something like the example from the seaborn documentation:
https://seaborn.pydata.org/_images/seaborn-heatmap-4.png
I am new to coding and have tried a lot of random things I found on the internet, but I have not been able to make it work.
Thank you for helping!
This should do (with generated data):
import pandas as pd
import seaborn as sns
import random
y = [random.randint(2013,2017) for n in range(2000)]
m = [random.randint(1,12) for n in range(2000)]
df = pd.DataFrame([y,m]).T
df.columns=['y','m']
df['count'] = 1
df2 = df.groupby(['y','m'], as_index=False).count()
df_p = pd.pivot_table(df2,'count','m','y')
sns.heatmap(df_p)
You probably won't need the column count but I added it because I needed an extra column for the groupby to work.
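If you'd rather skip the helper count column and the groupby entirely, pd.crosstab builds the same month-by-year count matrix in one call. A sketch with the same generated data (plotting is unchanged):

```python
import random
import pandas as pd

random.seed(0)
y = [random.randint(2013, 2017) for n in range(2000)]
m = [random.randint(1, 12) for n in range(2000)]
df = pd.DataFrame({'y': y, 'm': m})

# crosstab counts (month, year) co-occurrences directly
counts = pd.crosstab(df['m'], df['y'])
# sns.heatmap(counts)  # then plot exactly as before
```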
