How to take the average for certain rows (Python)

As you can see in the outcome picture, there is a column called 'dt': the land/sea temperature is measured for every month of the year. But I want to make the graph shorter, so I want to take the average of the 12 monthly values and make the measurements yearly instead of monthly.
This is the current outcome; the second picture shows what I want.

You can use:
import pandas as pd
import matplotlib.pyplot as plt

# Parse the 'dt' column as datetimes so we can group by year
df = pd.read_csv('GlobalTemperatures.csv', parse_dates=['dt'])

# Average the 12 monthly measurements for each year
out = df.groupby(df['dt'].dt.year).mean()

out.plot()
plt.show()
Output:
>>> out
LandAverageTemperature LandAverageTemperatureUncertainty LandMaxTemperature ... LandMinTemperatureUncertainty LandAndOceanAverageTemperature LandAndOceanAverageTemperatureUncertainty
dt ...
1750 8.719364 2.637818 NaN ... NaN NaN NaN
1751 7.976143 2.781143 NaN ... NaN NaN NaN
1752 5.779833 2.977000 NaN ... NaN NaN NaN
1753 8.388083 3.176000 NaN ... NaN NaN NaN
1754 8.469333 3.494250 NaN ... NaN NaN NaN
... ... ... ... ... ... ... ...
2011 9.516000 0.082000 15.284833 ... 0.136583 15.769500 0.059000
2012 9.507333 0.083417 15.332833 ... 0.145333 15.802333 0.061500
2013 9.606500 0.097667 15.373833 ... 0.149833 15.854417 0.064667
2014 9.570667 0.090167 15.313583 ... 0.139000 15.913000 0.063167
2015 9.831000 0.092167 15.572667 ... 0.141750 16.058583 0.060833
[266 rows x 8 columns]
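An alternative sketch, if you prefer working with a DatetimeIndex: resample by year instead of grouping by dt.year. This assumes the same GlobalTemperatures.csv layout; 'YS' is pandas' year-start frequency.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('GlobalTemperatures.csv', parse_dates=['dt'])
yearly = df.set_index('dt').resample('YS').mean()  # one row per year
yearly.plot()
plt.show()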

Related

How can I correctly write a function? (Python)

Here is my definition:
def fill(df_name):
    """
    Function to fill rows and dates.
    """
    # Fill Down
    for row in df_name[0]:
        if 'Unnamed' in row:
            df_name[0] = df_name[0].replace(row, np.nan)
    df_name[0] = df_name[0].ffill(limit=2)
    df_name[1] = df_name[1].ffill(limit=2)
    # Fill in Dates
    for col in df_name.columns:
        if col >= 3:
            old_dt = datetime(1998, 11, 15)
            add_dt = old_dt + relativedelta(months=col - 3)
            new_dt = add_dt.strftime('%#m/%d/%Y')
            df_name = df_name.rename(columns={col: new_dt})
and then I call:
fill(df_cars)
The first half of the function works (columns 0 and 1 have filled in correctly). However, as you can see, the columns are still labeled 0-288. When I delete this function and simply run the code (changing df_name to df_cars) it runs correctly and the column names are the dates specified in the second half of the function.
What could be causing the # Fill in Dates portion not to take effect when it is defined in a function? Does it have to do with local variables?
0 1 2 3 4 5 ... 287 288 289 290 291 292
0 France NaN Market 3330 7478 2273 ... NaN NaN NaN NaN NaN NaT
1 France NaN World 362 798 306 ... NaN NaN NaN NaN NaN NaT
2 France NaN % 0.108709 0.106713 0.134624 ... NaN NaN NaN NaN NaN NaT
3 Germany NaN Market 1452 2025 1314 ... NaN NaN NaN NaN NaN NaT
4 Germany NaN World 209 246 182 ... NaN NaN NaN NaN NaN NaT
.. ... ... ... ... ... ... ... ... ... ... ... ... ..
349 Slovakia 0 World 1 1 0 ... NaN NaN NaN NaN NaN NaT
350 Slovakia 0 % 0.5 0.5 0 ... NaN NaN NaN NaN NaN NaT
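A hedged note on the likely cause: DataFrame.rename returns a new DataFrame and only rebinds the local name df_name inside the function, so the caller's df_cars never sees the new column labels, whereas the ffill assignments mutate the original object in place, which is why only the first half "sticks". A minimal, self-contained demo (relabel and df_demo are illustrative names, not from the original code):
import pandas as pd

def relabel(df):
    # rename returns a NEW DataFrame; only the local name 'df' is rebound
    df = df.rename(columns={0: 'first'})
    return df

df_demo = pd.DataFrame({0: [1, 2], 1: [3, 4]})
relabel(df_demo)                 # return value discarded -> df_demo unchanged
df_demo = relabel(df_demo)       # capture the return value -> column renamed
print(df_demo.columns.tolist())  # ['first', 1]
Applying the same idea to fill: return df_name at the end of the function and call df_cars = fill(df_cars).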

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and I want a column for every value of variables, filled in for every bucket. Each bucket takes the very first timestamp that falls inside it as its label.
In order to solve this I have tried a couple of different solutions, but I can't seem to find anything that runs without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index level (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
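For completeness, a pivot_table sketch does the reshape and the 5-minute bucketing in one call (assuming df has the timestamp, variables and value columns from the question):
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
out = pd.pivot_table(df, index=pd.Grouper(key='timestamp', freq='5min'),
                     columns='variables', values='value', aggfunc='mean')
out = out.reset_index()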

How to drop subdataframe if it contains more than 40% NaN Pandas

Hello everyone, I have the following problem:
I have panel data for 400,000 objects, and I want to drop an object if it contains more than 40% NaNs.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
... ... ... ... ... ... ...
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
... ... ... ... ...
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
I have such a dataframe, and each sub-dataframe grouped by inn (all the time_reg rows for one inn) needs to be dropped if the total share of NaNs in the columns (revenue1, balans1, equity1, opprofit1, grossprofit1, netprofit1, currentassets1, stliabilities1) is more than 40%.
I have an idea of how to do it in a loop, but that takes a lot of time.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
4081809 9909403758 2002 6078000.0 2270000.0 2195000.0 -32000.0
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
4081809 1324000.0 NaN 234000.0 75000.0
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
This sub-dataframe should be dropped, because it contains more than 40% NaNs.
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
5 0101000021 2011 3480000.0 610000.0 71000.0 NaN
6 0101000021 2012 4820000.0 710000.0 139000.0 149000.0
7 0101000021 2013 5200000.0 790000.0 148000.0 170000.0
8 0101000021 2014 5450000.0 830000.0 155000.0 180000.0
9 0101000021 2015 5620000.0 860000.0 164000.0 189000.0
10 0101000021 2016 5860000.0 885000.0 175000.0 200000.0
11 0101000021 2017 15112000.0 1275000.0 298000.0 323000.0
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
5 NaN 61000.0 NaN NaN
6 869000.0 129000.0 700000.0 571000.0
7 1040000.0 138000.0 780000.0 642000.0
8 1090000.0 145000.0 820000.0 675000.0
9 1124000.0 154000.0 850000.0 696000.0
10 1172000.0 165000.0 875000.0 710000.0
11 3023000.0 288000.0 1265000.0 977000.0
This sub-dataframe contains less than 40% NaNs and must remain in the final dataframe.
Would a loop really be too slow if you used a numpy/pandas function for the counting? You could use someDataFrame.isnull().sum().sum().
That is probably a lot faster than writing your own loop over all the values in a dataframe, since those libraries tend to have very efficient implementations of these kinds of functions.
You can use the filter method of pd.DataFrame.groupby.
It lets you pass a function that indicates whether a sub-frame should be kept or not (in this case, kept only if it contains no more than 40% NaNs in the relevant columns). To get that information, you can use numpy to count the NaNs, as in getNanFraction:
import numpy as np

def getNanFraction(df):
    # Share of NaN cells among the value columns (drop the identifier columns first)
    values = df.drop(columns=["inn", "time_reg"]).to_numpy()
    return np.isnan(values).sum() / values.size

df.groupby("inn").filter(lambda x: getNanFraction(x) < 0.4)

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
I'm relatively new to Python and data analysis, and this is my first time posting a question on Stack Overflow, but I've been hunting for an answer for a long time (I used to code regularly) without any success.
I have a dataframe import from an Excel file that doesn't have named/indexed columns. I am trying to successfully extract data from nearly 2000 of these files which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. So I'm trying to write some code that will pull out a series of columns that I define by looking for words such as "Date" or "UCVA" in the header cells. Then I plan to stitch them back together into a single dataframe with a patient identifier as an extra column, and cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful things with (like put into an Access database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a small text.xlsx file with a two-row header (Date, RIGHT UCVA, RIGHT Sph, as in the output below), and I can read it as follows:
import pandas as pd

path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this.
How to merge dataframes? There is a very good guide in the pandas docs.
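On the column-selection part specifically, here is a hedged sketch of pulling the columns whose header text contains a given word, assuming (as in your example) that the real labels sit in the second row of the raw sheet; example.xlsx and the wanted pattern are illustrative, not from your files:
import pandas as pd

raw = pd.read_excel('example.xlsx', header=None)   # read without assuming headers

labels = raw.iloc[1].astype(str)                   # row 1 holds 'Date', 'UCVA', 'Sph', ...
wanted = 'Date|UCVA'                               # regex of the words to look for

mask = labels.str.contains(wanted, na=False)       # True for matching columns
subset = raw.loc[2:, mask].copy()                  # data rows only, matching columns
subset.columns = labels[mask]                      # promote the labels to headers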

Pandas Dataframe multiindex

I'm new to Python, Pandas, Dash, etc. I'm trying to structure a dataframe so I can create some dash components for graphing that will allow the user to see and filter data.
At the top are aggregation characteristics, the first 3 are required and the remaining are sparse based on whether or not the data was aggregated for that characteristic. After the first ellipses, there are some summary characteristics for the day, and after the second ellipses is the time series data for the aggregation. There are about 3800 pre-calculated aggregate groupings in this example.
Should I try to make the aggregate characteristics into a MultiIndex?
The RUNID is the identifier of the analysis run that created the output (the same number for all 3818 columns), while the UID field should be unique for each column of a single run; multiple runs will have the same UID with different RUNIDs. The UID is the unique combination of CHAR1 through CHAR20 for that RUNID and AGGLEVEL. The AGGLEVEL is the analysis grouping, which may have one or more columns of output. CHAR3_CHAR6_UNADJ is the unique combination of CHAR3 and CHAR6, so those two rows are populated while the remaining CHAR rows are null (well, NaN). My current example is just one run, but there are tens of thousands of runs, although I usually focus on one at a time and probably won't deal with more than 10-20 at a time, each for a subset of the data. CHAR1 through CHAR20 are only populated if that column has data aggregated by that characteristic.
My dataframe example:
print(dft)
0 ... 3818
UID 32 ... 19980
RUNID 1234 ... 1234
AGGLEVEL CHAR12_ADJ ... CHAR3_CHAR6_UNADJ
CHAR1 NaN ... NaN
CHAR2 NaN ... NaN
CHAR3 NaN ... 1234
CHAR4 NaN ... NaN
CHAR5 NaN ... NaN
CHAR6 NaN ... ABCD
CHAR7 NaN ... NaN
CHAR8 NaN ... NaN
CHAR9 NaN ... NaN
CHAR10 NaN ... NaN
CHAR11 NaN ... NaN
CHAR12 IJKL ... NaN
CHAR13 NaN ... NaN
CHAR14 NaN ... NaN
CHAR15 NaN ... NaN
CHAR16 NaN ... NaN
CHAR17 NaN ... NaN
CHAR18 NaN ... NaN
CHAR19 NaN ... NaN
CHAR20 NaN ... NaN
...
STARTTIME 2018-08-22 00:00:00 ... 2018-08-22 00:00:00
MAXIMUM 2.676 ... 0.654993
MINIMUM 0.8868 ... 0.258181
...
00:00 1.2288 ... 0.335217
01:00 1.2828 ... 0.337848
02:00 1.2876 ... 0.324639
03:00 1.194 ... 0.314569
04:00 1.2876 ... 0.258181
05:00 1.1256 ... 0.284699
06:00 1.4016 ... 0.364655
07:00 1.122 ... 0.388968
08:00 1.0188 ... 0.452711
09:00 1.008 ... 0.507032
10:00 1.0272 ... 0.546807
11:00 0.972 ... 0.605359
12:00 1.062 ... 0.641152
13:00 0.8868 ... 0.625082
14:00 1.1076 ... 0.623865
15:00 0.9528 ... 0.654993
16:00 1.014 ... 0.645511
17:00 2.676 ... 0.62638
18:00 0.9888 ... 0.551629
19:00 1.038 ... 0.518322
20:00 1.2528 ... 0.50793
21:00 1.08 ... 0.456993
22:00 1.1724 ... 0.387063
23:00 1.1736 ... 0.345045
[62 rows x 3819 columns]
You should try to transpose it with dft.T. You will then have the sample number, from 0 to 3818, as the index, and it will be easier to select your columns, e.g. with dft['STARTTIME'].
For the NaN values, you should do dft = dft.replace('NaN', np.nan) so that pandas understands they are really NaN and not strings (don't forget to write import numpy as np first). You will then be able to use pd.isna(dft) to check whether you have NaN in your dataframe, or dft.dropna() to keep only fully completed rows.
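A minimal sketch of those steps on a tiny stand-in for dft (the real frame in the question is 62 rows x 3819 columns; the values here are just a few taken from its printout):
import numpy as np
import pandas as pd

dft = pd.DataFrame({0: ['CHAR12_ADJ', 'NaN', 1.2288],
                    3818: ['CHAR3_CHAR6_UNADJ', 1234, 0.335217]},
                   index=['AGGLEVEL', 'CHAR3', '00:00'])

dft = dft.T                          # samples become rows, characteristics become columns
dft = dft.replace('NaN', np.nan)     # literal 'NaN' strings -> real missing values
print(dft['AGGLEVEL'])               # selecting a characteristic is now straightforward
print(dft.dropna())                  # keep only fully populated samples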
