Based on the responses I received in "Pandas SettingWithCopyWarning: I'm thoroughly confused" and the clear explanation I found at "Pandas - Get first row value of a given column", I thought I had all my SettingWithCopyWarning errors solved. Unfortunately, they are back (Python 3.8.5), and I'd appreciate your assistance. My dataframe df has a column 'SBPi_min_time', which I refer to as t_min:
t_min = 'SBPi_min_time'
df.head()
SBPi_max_time SBPi_max SBPi_min_time SBPi_min delta_p
0 52.257 119.626 55.903 111.256 8.370
1 59.513 118.580 60.562 114.395 4.185
2 62.632 119.626 63.650 112.999 6.627
3 65.721 121.021 67.279 114.395 6.626
4 69.344 120.672 72.414 113.348 7.324
If I now try to copy a value from one row of df to the previous row, I get the infamous SettingWithCopyWarning. I have tried 5 distinct approaches and get the warning in every single case. It's worth noting that the first approach is the one recommended in the posts I linked to above:
df.iloc[i, df.columns.get_loc('SBPi_min_time')] = df.iloc[i+1, df.columns.get_loc('SBPi_min_time')]
df.iloc[i, df.columns.get_loc(t_min)] = df.iloc[i+1, df.columns.get_loc(t_min)]
df.iloc[i, 3] = df.iloc[i+1, 3]
df[t_min].iloc[i] = df[t_min].iloc[i+1]
df[t_min][i] = df[t_min][i+1]
If there were a way to create a new object (in this case a float) from df[t_min][i+1], I could do so and then set df[t_min][i] to it, but there doesn't seem to be a way to do it:
df_copy = df.copy(deep = True)
df[t_min][i] = df_copy[t_min][i+1]
gives me the same error. What on earth am I doing wrong, and what's the fix?
Many thanks in advance
Thomas Philips
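A minimal sketch of one likely fix, assuming df was itself produced by slicing or filtering another DataFrame (the usual trigger for this warning): take an explicit copy first, after which the recommended .iloc assignment targets real data rather than a possible view. The shift-based line at the end is a hypothetical vectorized alternative for the case where every row should take its value from the row below.

import pandas as pd

t_min = 'SBPi_min_time'

# Toy frame mirroring the first two rows of df.head() above.
df = pd.DataFrame({
    'SBPi_max_time': [52.257, 59.513],
    'SBPi_max': [119.626, 118.580],
    t_min: [55.903, 60.562],
    'SBPi_min': [111.256, 114.395],
    'delta_p': [8.370, 4.185],
})

df = df.copy()  # explicit copy: df no longer aliases any parent frame

i = 0
df.iloc[i, df.columns.get_loc(t_min)] = df.iloc[i + 1, df.columns.get_loc(t_min)]

# Hypothetical vectorized alternative: every row takes the next row's value.
# df[t_min] = df[t_min].shift(-1)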
Related
Hey, I am using a Jupyter Notebook and doing machine learning.
I wrote this code but am getting an error, and I don't know what the error is.
This is my code for reference:
import pandas as pd  # needed for pd.Timestamp below
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(random_state=42)
date = pd.Timestamp('2200-01-01')

for col in combi:
    if combi[col].dtype == "object":
        combi[col].fillna("not listed", inplace=True)
    if combi[col].dtype == "int":
        # X[col].fillna(X[col].mode()[0], inplace=True)
        combi[col].fillna(combi[col].mean(), inplace=True)
        # combi[col] = combi[col].astype.int()
    if combi[col].dtype == 'float':
        # X[col].fillna(X[col].mean(), inplace=True)
        combi[col] = imp.fit_transform(combi[col].values.reshape(-1, 1))
    if combi[col].dtype == "datetime64[ns]":
        combi[col].fillna(date, inplace=True)

combi
Solution to the problem
This is not how for loops work in Python.
for col in combi:
if combi[col].dtype=="object":
# ...
col isn't an index into the collection you're iterating over (combi); it is the element itself. Change all instances of combi[col] inside your for loop to col. Example:
for col in combi:
if col.dtype=="object":
You didn't post all of your code, so it's unclear whether this will resolve the problem you're seeing, but it is certainly a step in the right direction.
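One caveat worth checking, since the type of combi was never posted: if combi is a pandas DataFrame, iterating over it yields column labels (strings), in which case the original combi[col] is already the correct access pattern and col.dtype would fail. A quick sketch to see which case applies:

import pandas as pd

combi = pd.DataFrame({'a': [1.0, None], 'b': ['x', None]})

for col in combi:
    # For a DataFrame, col is the column *label*, not the column itself.
    print(col, combi[col].dtype)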
I was able to solve the problem described below, but as I am a newbie, I am not sure if my solution is good. I'd be grateful for any tips on how to do it in a more efficient and/or more elegant manner.
What I have: a table with a Respondent column and a LanguageWorkedWith column holding semicolon-separated values (e.g. 'C;C++;C#;Python;SQL'), and so on (the table's quite big).
What I need: the same data reshaped to one row per (Respondent, language) pair.
How I solved it:
Load the file
df = pd.read_csv("survey_data_cleaned_ver2.csv")
Define a function
def transform_df(df, list_2, column_2, list_1, column_1='Respondent'):
    for ind in df.index:
        elements = df[column_2][ind].split(';')
        num_of_elements = len(elements)
        for num in range(num_of_elements):
            list_1.append(df[column_1][ind])  # use the column_1 parameter instead of a hardcoded 'Respondent'
        for el in elements:
            list_2.append(el)
Dropna, because NaNs are floats and that was causing errors later on. (Taking the two-column slice with an explicit .copy() avoids a SettingWithCopyWarning from the inplace dropna.)
df_LanguageWorkedWith = df[['Respondent', 'LanguageWorkedWith']].copy()
df_LanguageWorkedWith.dropna(subset=['LanguageWorkedWith'], inplace=True)
Create empty lists
Respondent_As_List = []
LanguageWorkedWith_As_List = []
Call the function
transform_df(df_LanguageWorkedWith, LanguageWorkedWith_As_List, 'LanguageWorkedWith', Respondent_As_List)
Transform the lists into dataframes
df_Respondent = pd.DataFrame(Respondent_As_List, columns=["Respondent"])
df_LanguageWorked = pd.DataFrame(LanguageWorkedWith_As_List, columns=["LanguageWorkedWith"])
Concatenate those dataframes
df_LanguageWorkedWith_final = pd.concat([df_Respondent, df_LanguageWorked], axis=1)
And that's it.
The code and input file can be found on my GitHub: https://github.com/jarsonX/Temp_files
Thanks in advance!
You can try it like this. I haven't tested it, but it should work:
df['LanguageWorkedWith'] = df['LanguageWorkedWith'].str.replace(';',',')
df = df.assign(LanguageWorkedWith=df['LanguageWorkedWith'].str.split(',')).explode('LanguageWorkedWith')
#Tested
LanguageWorkedWith Respondent
0 C 4
0 C++ 4
0 C# 4
0 Python 4
0 SQL 4
... ... ...
10319 Go 25142
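For what it's worth, the intermediate replace step shouldn't be necessary; splitting directly on the semicolon gives the same result (a sketch under the same assumptions as the answer above):

# Split on ';' directly, then explode to one row per (Respondent, language) pair.
df = df.assign(LanguageWorkedWith=df['LanguageWorkedWith'].str.split(';')).explode('LanguageWorkedWith')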
I'm trying to add empty columns to my dataset on Colab, but it gives me this error; when I run it on my local machine, it works perfectly fine. Does anybody know a possible solution for this?
My code:
dataframe["Comp"] = ''
dataframe["Negative"] = ''
dataframe["Neutral"] = ''
dataframe["Positive"] = ''
dataframe
Error message:
TypeError: Expected unicode, got pandas._libs.properties.CachedProperty
I ran into a similar issue today:
"Expected unicode, got pandas._libs.properties.CachedProperty"
My dataframe (called df) has a time index. When I added a new column and filled it with numpy.array data, it raised this error. I tried setting it with df.index or df.index.values; it always raised this error.
Finally, I solved it in 3 steps:
df = df.reset_index()
df['new_column'] = new_column_data # it is np.array format
df = df.set_index('original_index_name')
WY
This question is the same as https://stackoverflow.com/a/67997139/16240186, and there's a simple way to solve it: df = df.asfreq('H') # freq can be 'min', 'D', 'M', 'S', '5min', etc.
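A minimal sketch of the two workarounds side by side, assuming df has a DatetimeIndex named 'original_index_name' and new_column_data is a NumPy array of matching length (the underlying error is environment-specific, so this only illustrates the mechanics):

import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=3, freq='H', name='original_index_name')
df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=idx)
new_column_data = np.array([10, 20, 30])

# Workaround 1 (WY's steps): drop the index, assign, restore the index.
df = df.reset_index()
df['new_column'] = new_column_data
df = df.set_index('original_index_name')

# Workaround 2: give the index an explicit frequency, then assign directly.
df = df.asfreq('H')
df['another_column'] = new_column_data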
sencap.csv is a file that has a lot of columns I don't need. I want to keep just some columns in order to filter and analyze the information and draw some graphs; in this case, a pie chart that aggregates energy quantities by energy source. Everything works fine except the condition that asks to sum() only those rows which are less than 9.0 MW.
import pandas as pd
import matplotlib.pyplot as plt
aux = pd.read_csv('sencap.csv')
keep_col = ['subsistema','propietario','razon_social', 'estado',
'fecha_servicio_central', 'region_nombre', 'comuna_nombre',
'tipo_final', 'clasificacion', 'tipo_energia', 'potencia_neta_mw',
'ley_ernc', 'medio_generacion', 'distribuidora', 'punto_conexion',
]
c1 = aux['medio_generacion'] == 'PMGD'
c2 = aux['medio_generacion'] == 'PMG'
aux2 = aux[keep_col]
aux3 = aux2[c1 | c2]
for col in ['potencia_neta_mw']:
    aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
c3 = aux3['potencia_neta_mw'] <= 9.0
aux4 = aux3[c3]
df = aux4.groupby(['tipo_final']).sum()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
This line is the reason you are getting the warning.
Accessing col via chained indexing may result in unpredictable behavior, since it may return either a view or a copy of the original data. Which one you get depends on the memory layout of the array, about which pandas makes no guarantees. The pandas documentation advises users to use .loc instead.
Example:
In: dfmi
Out:
one two
first second first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
dfmi.loc[:,('one','second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
In the second case, __getitem__ is unpredictable: it may return either a view or a copy of the data, and modifying a view works differently from modifying a copy. A change made to a copy is not reflected in the original data, whereas a change made to a view is.
Note: the warning exists to warn users; even if you get the expected output, there is a chance the code causes some unpredictable behavior.
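Applied to the code in the question, the warning should go away if aux3 is made an explicit copy before the assignment (a sketch continuing from the aux2, c1 and c2 defined above; .loc with a boolean row indexer works equally well):

aux3 = aux2[c1 | c2].copy()  # explicit copy: aux3 no longer aliases aux2

for col in ['potencia_neta_mw']:
    aux3[col] = pd.to_numeric(aux3[col].str.replace(',', '.'))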
I have a dataframe in Python using pandas. It has 2 columns called 'dropoff_latitude' and 'dropoff_longitude'. I want to make a function that will create a 3rd column based on these 2 variables (it runs them through an API).
So I wrote a function:
import json
import requests

def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
then I would run it as
df['newcolumn'] = dropoff_info(df)
However, it doesn't work.
Upon troubleshooting, I found that when I print dropoff_latitude, it looks like this:
0 40.773345947265625
1 40.762149810791016
2 40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the actual content part.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
It just gave me the values from the 1st row, so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case, the need to ping a remote server for data pretty much requires it. So I'd do something like the following:
l = []
for i in df.index:
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    l.append(dropoff_results2["Block"]["FIPS"][2:11])

df['new'] = l
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
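For what it's worth, the same per-row pattern can also be written with DataFrame.apply, which calls the question's dropoff_info function once per row. apply is still a loop under the hood, so the network cost is identical; it just reads more compactly:

# Each row is passed to dropoff_info as a Series, matching its row['...'] lookups.
df['new'] = df.apply(dropoff_info, axis=1)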