I have a DataFrame on which I'd like to perform the same operations (i.e., correlations, graphing) on grouped data. The grouping is based on location (referred to as STA in the DataFrame).
A sample of the DataFrame is below:
Index STA Date Var1 Var2 Var3
0 RE25 1973-04-09 1.0 10.5 6.3
1 RE30 1973-04-09 1.0 10.0 7.6
2 RE25 1973-04-09 5.0 10.6 NaN
3 RE30 1973-04-09 5.0 10.0 NaN
4 RE25 1973-04-09 10.0 10.6 NaN
5 RE30 1973-04-09 10.0 10.2 NaN
6 RE25 1973-04-09 15.0 10.7 NaN
7 RE30 1973-04-09 15.0 10.1 NaN
8 RE25 1973-04-09 20.0 10.7 NaN
9 RE30 1973-04-09 20.0 10.1 NaN
10 RE30 1973-04-09 23.0 10.0 7.6
To generate the list of unique sampling STA (which will be different for each DataFrame), I used
Stations = np.sort(Resdat.STA.unique()).tolist()
which works in creating the unique list of STA that I'm after. However, when I try to call this list I get the following error:
TypeError: 'list' object is not callable.
With my limited knowledge, I'm only making progress with the following code:
RE01 = Resdat.groupby('STA').get_group('RE01')
RE01 = RE01.dropna(axis = 1, how = 'all')
repeated over and over for each unique STA.
I'm sure there is a better way but I'm struggling to find other posted answers that I can use.
You can use a for loop:
names = []
l = []
for name, data in df.groupby('STA'):
    names.append(name)
    l.append(data.dropna(axis=1, how='all'))
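If you would rather keep each station's cleaned frame keyed by its name, a dict comprehension does the same thing. A sketch using the Resdat frame from the question ('RE25' is just one of the stations from the sample):
groups = {sta: data.dropna(axis=1, how='all')
          for sta, data in Resdat.groupby('STA')}
groups['RE25'].corr()  # e.g. correlations for a single station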
I have a local SQLite database that I read into my program as a pandas DataFrame using
""" Seperating hitters and pitchers """
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)
But when I do this, some of the numbers are not numeric. Here is a head of one of the dataframes:
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
To make the DataFrames numeric, I used these lines of code:
hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')
But when I did that, the new heads of the DataFrames were full of NaNs; it seems it got rid of all of the string values, but I want to keep those.
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
Is there a better way to make the number values numeric and keep all my string columns? Maybe there is an SQLite function that can do it better? I am not sure; any help is appreciated.
Maybe you can use combine_first: it keeps the coerced numeric values where they are not NaN and falls back to the original (string) values everywhere else:
hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)
You can try using astype (which accepts a mapping of column names to dtypes) or convert_dtypes. If you already know which columns are numeric and which are strings, that can work. Otherwise, take a look at this thread to do it automatically.
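If you do know the numeric columns up front, a minimal sketch is to coerce only those and leave the string columns (Tm_x, Lg_x, etc.) untouched. The column list below is illustrative, taken from the head shown above, and uses to_numeric with coerce rather than astype because of placeholders like "--":
num_cols = ['Age_x', 'AgeDif_x', 'G_x', 'PA_x', 'BF_y', 'WHIP_y']
hitters[num_cols] = hitters[num_cols].apply(pd.to_numeric, errors='coerce')
pitchers[num_cols] = pitchers[num_cols].apply(pd.to_numeric, errors='coerce')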
Usually, to avoid SettingWithCopyWarning, I replace values using .loc or .iloc, but this does not work when I want to forward fill my column (from the first to the last non-NaN value).
Do you know why it does that and how to bypass it?
My test DataFrame:
df3 = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
and the code that raises the warning:
df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1] = df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1].fillna(method="ffill")
I would like something like this in the end:
Use first_valid_index and last_valid_index to determine the range you want to ffill, and then select that range of your DataFrame:
df = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
first=df['test'].first_valid_index()
last=df['test'].last_valid_index()+1
df['test']=df['test'][first:last].ffill()
print(df)
Timestamp test
0 11.1 NaN
1 11.2 NaN
2 11.3 NaN
3 11.4 2.0
4 11.5 22.0
5 11.6 8.0
6 11.7 8.0
7 11.8 4.0
8 11.9 5.0
9 12.0 4.0
10 12.1 5.0
11 12.2 5.0
12 12.3 -3.0
13 12.4 -54.0
14 12.5 -23.0
15 12.6 -23.0
16 12.7 89.0
17 12.8 NaN
18 12.9 NaN
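Regarding the warning itself: df3['test'].iloc[...] = ... is chained indexing (the frame is indexed, producing an intermediate Series, and the assignment goes into that), so pandas cannot tell whether the write will reach df3. A single .loc assignment on the frame avoids it. A sketch, assuming the default RangeIndex so the labels from first_valid_index/last_valid_index can be used directly (.loc slicing is inclusive, hence no +1):
first = df3['test'].first_valid_index()
last = df3['test'].last_valid_index()
df3.loc[first:last, 'test'] = df3.loc[first:last, 'test'].ffill()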
I have data collected from a lineage of instruments with some overlap. I want to merge them into a single pandas data structure such that, for each column, the newest available data take precedence when not NaN; otherwise the older data are retained.
The following code produces the intended output, but involves a lot of code for such a simple task. Additionally, the final step involves identifying duplicated index values, and I am nervous about whether I can rely on the "last" part because df.combine_first(other) reorders the data. Is there a more compact, efficient and/or predictable way to do this?
# set up the data
df0 = pd.DataFrame({"x": [0.,1.,2.,3.,4,],"y":[0.,1.,2.,3.,np.nan],"t" :[0,1,2,3,4]}) # oldest/lowest priority
df1 = pd.DataFrame({"x" : [np.nan,4.1,5.1,6.1],"y":[3.1,4.1,5.1,6.1],"t": [3,4,5,6]})
df2 = pd.DataFrame({"x" : [8.2,10.2],"t":[8,10]})
df0.set_index("t",inplace=True)
df1.set_index("t",inplace=True)
df2.set_index("t",inplace=True)
# this concatenates, leaving redundant indices in df0, df1, df2
dfmerge = pd.concat((df0,df1,df2),sort=True)
print("dfmerge, with duplicate rows and interlaced NaN data")
print(dfmerge)
# Now apply, in priority order, each of the original dataframes to fill the original
dfmerge2 = dfmerge.copy()
for ddf in (df2, df1, df0):
    dfmerge2 = dfmerge2.combine_first(ddf)
print("\ndfmerge2, fillable NaNs filled but duplicate indices now reordered")
print(dfmerge2) # row order has changed unpredictably
# finally, drop duplicate indices
dfmerge3 = dfmerge2.copy()
dfmerge3 = dfmerge3.loc[~dfmerge3.index.duplicated(keep='last')]
print ("dfmerge3, final")
print (dfmerge3)
The output of which is this:
dfmerge, with duplicate rows and interlaced NaN data
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 NaN
3 NaN 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge2, fillable NaNs filled but duplicate indices now reordered
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
3 3.0 3.1
4 4.0 4.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge3, final
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
In your case:
s=pd.concat([df0,df1,df2],sort=False)
s[:]=np.sort(s,axis=0)
s=s.dropna(thresh=1)
s
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 3.1
3 4.1 4.1
4 5.1 5.1
5 6.1 6.1
6 8.2 NaN
8 10.2 NaN
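A possibly more compact route for the sample frames above (a sketch, only checked against this example): concatenate in priority order, oldest first, and keep the last non-null value per index and column. On this sample it reproduces the dfmerge3 result without the combine_first loop or the duplicate-index cleanup.
merged = pd.concat((df0, df1, df2), sort=True)
merged = merged.groupby(level=0).last()  # newest non-NaN value wins for each t and column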
I'm trying to assign values to some columns from another DataFrame, mapping them by a single key. The problem is that I don't think the mapping is being used correctly, because it is assigning NaN to the columns.
They should be mapped by 'SampleID'.
Here is the DataFrame I want to assign values to:
>>> df.ix[new_df['SampleID'].isin(pooled['SampleID']), cols]
Volume_Received Quantity massug
88280 2.0 15.0 1.0
88282 3.0 55.0 5.0
88284 2.5 46.2 3.0
88286 2.0 98.0 5.0
229365 2.0 8.4 3.0
229366 3.0 15.9 3.0
229367 1.5 7.7 2.0
233666 1.5 50.8 3.0
233667 4.0 60.2 5.0
These are the new values I have for them:
>>> numerical
Volume_Received Quantity massug
SampleID
sample8 10.0 75.0 5.0
sample70 15.0 275.0 25.0
sample72 12.5 231.0 15.0
sample89 6.0 294.0 15.0
sample90 4.0 16.8 6.0
sample96 6.0 31.8 6.0
sample97 3.0 15.4 4.0
sample99 3.0 101.6 6.0
sample100 8.0 120.4 10.0
I'm using this command to assign the values:
df.ix[df['SampleID'].isin(pooled['SampleID']), cols] = numerical[cols]
Here pooled is basically pooled = df[df['type'] == 'Pooled'] and cols is a list of the three columns shown above. After I run the code above, I get NaN in all the values. I think I'm telling pandas to get values where they do not exist because of the mapping, and it's returning something null which is being converted to NaN (my assumption).
The indexes do not match, so the assignment aligns on the index and fills NaN. You can use
df.ix[df['SampleID'].isin(pooled['SampleID']), cols] = numerical[cols].values
but only if the sizes are exactly the same!
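If the sizes are not guaranteed to match, one way is to align explicitly on SampleID by mapping each column through the numerical frame's index. A sketch, assuming df, numerical and cols are as shown above (with SampleID as a column of df and as the index of numerical):
mask = df['SampleID'].isin(numerical.index)
for col in cols:
    df.loc[mask, col] = df.loc[mask, 'SampleID'].map(numerical[col])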
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is not null, but I am not able to do that. I am probably making some very basic mistakes (I am not a programmer and not very talented at this, so please be patient with me).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V) / np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here or had I better calculated it manually by finite differences? I'm not finding numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda to each column that divides it by the difference of column x.
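The same division can be written without the lambda by broadcasting along the index with div, and if you prefer the per-column np.diff approach from the question, dropping the NaN rows of each column first keeps x and the values the same length. A sketch, assuming df and vals are as defined in the question:
derivatives = dVal.div(dX, axis=0)  # equivalent to the apply/lambda above

for i in vals:
    sub = df[['x', i]].dropna(subset=[i])            # only rows where this column has data
    derivative = np.diff(sub[i].to_numpy()) / np.diff(sub['x'].to_numpy())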