Pandas Dataframe '[nan nan nan ... nan nan nan] not found in axis' - python

Receiving this error
'[nan nan nan ... nan nan nan] not found in axis'
When trying to drop columns from a dataframe if the value is zero
train_df.head()
external_company_id company_name email_domain mx_record ...
NaN Expresstext expresstext.net unknown expresstext.net ... 0.0 0.0 0.0 0.0
NaN Jobox jobox.ai unknown www.jobox.ai ... 17.0 -31.0 9.0 30.0
NaN Relola relola.com unknown home.relola.com ... 5.0 -25.0 5.0
train_df.drop(train_df[train_df['total_funding'] == float(0)].index, inplace = True, axis=0)
'[nan nan nan ... nan nan nan] not found in axis'
What would be causing this error?

I learned that pandas automatically uses the first column as the index for read_csv.
Because my first column was empty every index somehow ended up being NaN as seen in the question above.
I ran these two lines of code to create a new index and fill it.
train_df.index.name = 'id'
train_df.index = [x for x in range(1, len(train_df.values)+1)]
Then the former error disapeared

Instead of :
train_df.drop(...)
try:
train_df = train_df[train_df[train_df['total_funding'] != float(0)]

Related

Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique

Here is my data if anyone wants to try to reproduce the problem:
https://github.com/LunaPrau/personal/blob/main/O_paired.csv
I have a pd.DataFrame (called O) of 1402 rows × 1402 columns with columns and index both as ['XXX-icsd', 'YYY-icsd', ...] and cell values as some np.float64, some np.nan and problematically, some as pandas.core.series.Series.
202324-icsd
644068-icsd
27121-icsd
93847-icsd
154319-icsd
202324-icsd
0.000000
0.029729
NaN
0.098480
0.097867
644068-icsd
NaN
0.000000
NaN
0.091311
0.091049
27121-icsd
0.144897
0.137473
0.0
0.081610
0.080442
93847-icsd
NaN
NaN
NaN
0.000000
0.005083
154319-icsd
NaN
NaN
NaN
NaN
0.000000
The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of form:
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
I'm struggling to make this cell contain only a np.float64.
I tried:
O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
and other various known forms of assignnign a new value to a cell in pd.DataFrame, but they only assign a new element to the same series in this cell, e.g. if I do
O.loc["192693-icsd", "192401-icsd"] = 5
then when calling O.loc["192693-icsd", "192401-icsd"] I get:
192693-icsd 5.0
192693-icsd 5.0
Name: 192401-icsd, dtype: float64
How to modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?
It's not that df.loc["192693-icsd", "192401-icsd"] contain a Series, your index just isn't unique. This is especially obvious looking at these outputs:
>>> df.loc["192693-icsd"]
202324-icsd 644068-icsd 27121-icsd 93847-icsd 154319-icsd 28918-icsd 28917-icsd ... 108768-icsd 194195-icsd 174188-icsd 159632-icsd 89111-icsd 23308-icsd 253341-icsd
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
[2 rows x 1402 columns]
# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:
df = df.groupby(level=0).max()
Now it'll work as expected:
>>> df.loc["192693-icsd", "192401-icsd"])
0.129562120551387
Your non-unique values are:
>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')
IIUC, you can try DataFrame.applymap to check each cell type and get the first row if it is Series
df = df.applymap(lambda x: x.iloc[0] if type(x) == pd.Series else x)
It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing

Create dataframe pandas from dict where values are list of tuples and each column name is unique

I have two lists that I use to create a dictionary, where list1 has text data and list2 is a list of tuples (text, float). I use these 2 lists to create a dictionary and the goal is to create a dataframe where each row of the first column will contain the elements of list1, each column will have a column name based on each unique text term from the first tuple element and for each row there will be the float values that connect them.
For example here's the dictionary with keys : {be, associate, induce, represent} and values : {('prove', 0.583171546459198), ('serve', 0.4951282739639282)} etc.
{'be': [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ('render', 0.4826732873916626), ('represent', 0.47748714685440063), ('lead', 0.47725602984428406), ('replace', 0.4695377051830292), ('contribute', 0.4529820680618286)],
'associate': [('interact', 0.8237789273262024), ('colocalize', 0.6831706762313843)],
'induce': [('suppress', 0.8159114718437195), ('provoke', 0.7866303324699402), ('elicit', 0.7509980201721191), ('inhibit', 0.7498961687088013), ('potentiate', 0.742023229598999), ('produce', 0.7384929656982422), ('attenuate', 0.7352016568183899), ('abrogate', 0.7260081768035889), ('trigger', 0.717864990234375), ('stimulate', 0.7136563658714294)],
'represent': [('prove', 0.6612186431884766), ('evoke', 0.6591314673423767), ('up-regulate', 0.6582908034324646), ('synergize', 0.6541063785552979), ('activate', 0.6512928009033203), ('mediate', 0.6494284272193909)]}
Desired Output
prove serve render represent
be 0.58 0.49 0.48 0.47
associate 0 0 0 0
induce 0.45 0.58 0.9 0.7
represent 0.66 0 0 1
So what tricks me is that the verb prove can be found in more than one keys (i.e. for the key be, the score is 0.58 and for the key represent the score is 0.66).
If I use df = pd.DataFrame.from_dict(d,orient='index'), then the verb prove will appear twice as a column name, whereas I want each term to appear once in each column.
Can someone help?
With the dictionary that you provided (as d), you can't use from_dict directly.
You either need to rework the dictionary to have elements as dictionaries:
pd.DataFrame.from_dict({k: dict(v) for k,v in d.items()}, orient='index')
Or you need to read it as a Series and to reshape:
(pd.Series(d).explode()
.apply(pd.Series)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
)
output:
prove serve render represent lead replace \
be 0.583172 0.495128 0.482673 0.477487 0.477256 0.469538
represent 0.661219 NaN NaN NaN NaN NaN
associate NaN NaN NaN NaN NaN NaN
induce NaN NaN NaN NaN NaN NaN
contribute interact colocalize suppress ... produce \
be 0.452982 NaN NaN NaN ... NaN
represent NaN NaN NaN NaN ... NaN
associate NaN 0.823779 0.683171 NaN ... NaN
induce NaN NaN NaN 0.815911 ... 0.738493
attenuate abrogate trigger stimulate evoke up-regulate \
be NaN NaN NaN NaN NaN NaN
represent NaN NaN NaN NaN 0.659131 0.658291
associate NaN NaN NaN NaN NaN NaN
induce 0.735202 0.726008 0.717865 0.713656 NaN NaN
synergize activate mediate
be NaN NaN NaN
represent 0.654106 0.651293 0.649428
associate NaN NaN NaN
induce NaN NaN NaN
[4 rows x 24 columns]

Remove group of empty or nan in pandas groupby

In a dataframe, with some empty(NaN) values in some rows - Example below
s = pd.DataFrame([[39877380,158232151,20], [39877380,332086469,], [39877380,39877381,14], [39877380,39877383,8], [73516838,6439138,1], [73516838,6500551,], [735571896,203559638,], [735571896,282186552,], [736453090,6126187,], [673117474,12196071,], [673117474,12209800,], [673117474,618058747,6]], columns=['start','end','total'])
When I groupby start and end columns
s.groupby(['start', 'end']).total.sum()
the output I get is
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
735571896 203559638 nan
282186552 nan
736453090 6126187 nan
I want to exclude all the groups of start where all values with end is 'nan' - Expected output -
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
I tried with dropna(), but it is removing all the nan values and not nan groups.
I am newbie in python and pandas. Can someone help me in this? thank you
In newer pandas versions is necessary use min_count=1 for missing values if use sum:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
#oldier pandas version solution
#s1 = s.groupby(['start', 'end']).total.sum()
Then is possible filter if at least one non missing value per first level by Series.notna with GroupBy.transform and GroupBy.any, filtering is by boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
#oldier pandas version solution
#s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
Or is possible get unique values of first level index values by MultiIndex.get_level_values and filtering by DataFrame.loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
#oldier pandas version solution
#s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression for a 250 day window for each column over the whole sample period and save the coefficient in another dataframe
Iterating over the colums and rows using two for-loops isn't very efficient, so I tried this but getting the following error message
def regress(start, end):
y = df_returns.iloc[start:end].values
if np.isnan(y).any() == False:
X = np.arange(len(y))
X = sm.add_constant(X, has_constant="add")
model = sm.OLS(y,X).fit()
return model.params[1]
else:
return np.nan
regression_window = 250
for t in (regression_window, len(df_returns.index)):
df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for don't hesitate to comment
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2
df_returns =pd.DataFrame({'0':[30,30,31,32,32],'1':[60,60,60,60,60],'2':[np.NaN,np.NaN,np.NaN,np.NaN,np.NaN]})
def regress(X,Z):
if np.isnan(X).any() == False:
model = sm.OLS(X,Z).fit()
return model.params[1]
else:
return np.NaN
regression_window = 3
Z = np.arange(regression_window)
Z= sm2.add_constant(Z, has_constant="add")
df_coef=pd.DataFrame()
for col in df_returns.columns:
df_coef[col]=df_returns[col].rolling(window=regression_window).apply(lambda col : regress(col, Z))
df_coef

I'm trying modify a pandas data frame so that I will have 2 columns. A frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the amount of tickets issued on that day). Right now I can achieve that, however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science
Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop (unnecessary_cols, 1)
df1 =
(df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
df1['frequency'] =
(df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
print (df1)
df1 = (df1.iloc[121:274])
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science
Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop(unnecessary_cols, 1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary rows.
Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
"foo":np.random.random(4)}
df = pd.DataFrame(data)
df
date_of_infraction foo
0 20120101 0.681286
1 20120101 0.826723
2 20120202 0.669367
3 20120202 0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
.foo.count()
.rename(columns={"foo":"frequency"})
)
date_of_infraction frequency
0 20120101 2
1 20120202 2

Categories