I have a CSV file with multiple columns containing empty strings. Upon reading the CSV into a pandas DataFrame, the empty strings get converted to NaN.
Now I want to prepend a string tag- to the values already present in the columns, but only to cells that have a value, not to those with NaN.
This is what I was trying to do:
with open('file1.csv','r') as file:
    for chunk in pd.read_csv(file, chunksize=1000, header=0, names=['A','B','C','D']):
        if len(chunk) >= 1:
            if chunk['A'].notna:
                chunk['A'] = "tag-" + chunk['A'].astype(str)
            if chunk['B'].notna:
                chunk['B'] = "tag-" + chunk['B'].astype(str)
            if chunk['C'].notna:
                chunk['C'] = "tag-" + chunk['C'].astype(str)
            if chunk['D'].notna:
                chunk['D'] = "tag-" + chunk['D'].astype(str)
And this is the error I'm getting:
AttributeError: 'Series' object has no attribute 'notna'
The final output that I want should be something like this:
A,B,C,D
tag-a,tag-b,tag-c,
tag-a,tag-b,,
tag-a,,,
,,tag-c,
,,,tag-d
,tag-b,,tag-d
I believe you need mask to add tag- to all columns together:
for chunk in pd.read_csv('file1.csv', chunksize=2, header=0, names=['A','B','C','D']):
    if len(chunk) >= 1:
        m1 = chunk.notna()
        chunk = chunk.mask(m1, "tag-" + chunk.astype(str))
You need to upgrade to the latest version of pandas, 0.21.0.
You can check the docs:
In order to promote more consistency among the pandas API, we have added additional top-level functions isna() and notna() that are aliases for isnull() and notnull(). The naming scheme is now more consistent with methods like .dropna() and .fillna(). Furthermore in all cases where .isnull() and .notnull() methods are defined, these have additional methods named .isna() and .notna(), these are included for classes Categorical, Index, Series, and DataFrame. (GH15001).
The configuration option pd.options.mode.use_inf_as_null is deprecated, and pd.options.mode.use_inf_as_na is added as a replacement.
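For completeness, here is a minimal end-to-end sketch of the chunked workflow (it assumes pandas >= 0.21, the input file1.csv from the question, and a hypothetical output.csv):

import pandas as pd

chunks = pd.read_csv('file1.csv', chunksize=1000, header=0, names=['A', 'B', 'C', 'D'])
for i, chunk in enumerate(chunks):
    # mask() replaces values where the condition is True, so non-NaN cells
    # get the tag- prefix while NaN cells are left untouched
    chunk = chunk.mask(chunk.notna(), "tag-" + chunk.astype(str))
    # write the header only for the first chunk, then append
    chunk.to_csv('output.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)

NaN cells are written back as empty fields by to_csv, which matches the desired output above.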
I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function on a column in a dataframe
def reform(column, dataframe):
    if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
        enc.fit(dataframe[['column']])
        enc.categories_
        onehot = enc.transform(dataframe[[column]]).toarray()
        dataframe[enc.categories_] = onehot
    elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object:
        le.fit_transform(dataframe[['column']])
    else:
        print('Column cannot be reformed')
    return dataframe
Try changing:
dataframe.column to dataframe.loc[:,column]
dataframe[['column']] to dataframe.loc[:,[column]]
For more help, please provide more information, such as: what is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you just use dataframe.column, pandas will try to find a column actually named 'column', but if you ask for it with dataframe.loc[:,column], it will use the string held by the input parameter named column.
With dataframe.loc[:,column], you get a Pandas Series, and with dataframe.loc[:,[column]] you get a Pandas DataFrame.
The pandas attribute 'columns', used as dataframe.columns (note the 's' at the end), just returns an Index of the names of all columns in your dataframe, probably not what you want here.
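A tiny sketch of the difference, using a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue'], 'size': ['S', 'M']})
column = 'color'

df.loc[:, column]    # Series: the column whose name is stored in `column`
df.loc[:, [column]]  # DataFrame containing just that column
# df.column          # AttributeError: there is no column literally named 'column'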
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the input to something static, and iterate the code until you get desired output. E.g.
input_df = my_df
column_name = 'some_test_column'

if input_df.loc[:,column_name].nunique() > 2 and input_df.loc[:,column_name].dtypes == object:
    enc.fit(input_df.loc[:,[column_name]])
    onehot = enc.transform(input_df.loc[:,[column_name]]).toarray()
    input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:,column_name].nunique() == 2 and input_df.loc[:,column_name].dtypes == object:
    le.fit_transform(input_df.loc[:,[column_name]])
else:
    print('Column cannot be transformed')
Look up on how to use SciKit Learn Pipelines, with ColumnTransformer. It will help make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).
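For example, a minimal ColumnTransformer sketch (the column names here are hypothetical, and the one-hot encoding is delegated to the transformer):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical list; replace with the object columns you want to encode.
categorical_cols = ['city', 'season']

preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')  # leave all other columns as they are

model = Pipeline([('preprocess', preprocess), ('clf', LogisticRegression())])
# model.fit(X_train, y_train)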
I have this dataframe:
01100MS,02200MS,02500MS,03100MS,22
626323,616720,616288,611860,622375
5188431,5181393,5173583,5165895,5152605
1915,1499,1310,1235,1907
1,4.1,4.41,4.441,4.4441
2,4.2,4.42,4.442,4.4442
3,4.3,4.43,4.443,4.4443
4,4.4,4.44,4.444,4.4444
5,4.5,4.45,4.445,4.4445
6,4.6,4.46,4.446,4.4446
7,4.7,4.47,4.447,4.4447
8,4.8,4.48,4.448,4.4448
9,4.9,4.49,4.449,4.4449
10,5,4.5,4.45,4.445
11,5.1,4.51,4.451,4.4451
I would like to have multiple headers. According to this post, I have done this:
dfr = pd.read_csv(file_input,sep=',',header=None,skiprows=0)
cols = tuple(zip(dfr.iloc[0], (dfr.iloc[1]).apply(lambda x: x[1:-1])))
However, I get an error:
TypeError: 'float' object is not subscriptable
The problem, I suppose, is due to the fact that 22 in the header is an integer. Indeed if I substitute 22 with A22 it works.
Due to the fact that I have to work with multiple large dataframes, I cannot do it by hand. As a consequence, I have tried this solution:
dfr.iloc[0] = dfr.iloc[0].apply(str)
but it does not seem to work.
Do you have some suggestions?
apply(lambda x: x[1:-1]) removes the first and last character. That was needed in the other post you quote, where the format was [col1], but in your case you want the same value as in the file.
The problem is that 22 has only 2 characters. So just remove the apply call, and then you can build the MultiIndex.
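A minimal sketch of that (file_input is your CSV path, as in the question; reading with dtype=str keeps numeric-looking headers like 22 as strings):

import pandas as pd

dfr = pd.read_csv(file_input, sep=',', header=None, skiprows=0, dtype=str)

# Build the MultiIndex from the first two rows, then drop them from the data.
cols = pd.MultiIndex.from_tuples(tuple(zip(dfr.iloc[0], dfr.iloc[1])))
dfr = dfr.iloc[2:].reset_index(drop=True)
dfr.columns = cols

Alternatively, pd.read_csv(file_input, header=[0, 1]) builds the two-level header directly.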
How can I filter a query and then do a group by?
df.query("'result_margin' > 100").groupby(['city','season','toss_winner','toss_decision','winner'])['winner'].size()
I am getting this error:
TypeError: '>' not supported between instances of 'str' and 'int'
I am trying to filter where result_margin is greater than 100, then group by the columns specified and print the records.
Quoting 'result_margin' inside the query treats it as a string literal, and does not refer to the column.
You would need to remove the quotes:
df.query("result_margin > 100").groupby(['city','season','toss_winner','toss_decision','winner'])['winner'].size()
Or, if you might have columns that contain spaces, then add backticks:
df.query("`result_margin` > 100").groupby(['city','season','toss_winner','toss_decision','winner'])['winner'].size()
You need to convert 'result_margin' to int. Try:
df['result_margin'] = df['result_margin'].astype(int)
For the filter, I always create a new dataframe.
df_new = df[df['result_margin'] > 100].groupby(['city','season','toss_winner','toss_decision','winner']).agg(WinnerCount=pd.NamedAgg(column='winner', aggfunc='count'))
I don't use the size method, but instead opt for the agg method to create a named column. You can also try replacing the
agg(WinnerCount=pd.NamedAgg(column='winner', aggfunc='count'))
with
['winner'].size()
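A tiny runnable comparison of the two idioms, on made-up data:

import pandas as pd

# Made-up example data, just to show the two approaches side by side.
df = pd.DataFrame({
    'city': ['A', 'A', 'B'],
    'season': [2020, 2020, 2021],
    'toss_winner': ['X', 'X', 'Y'],
    'toss_decision': ['bat', 'bat', 'field'],
    'winner': ['X', 'X', 'Y'],
    'result_margin': [150, 120, 90],
})

keys = ['city', 'season', 'toss_winner', 'toss_decision', 'winner']
filtered = df[df['result_margin'] > 100]

print(filtered.groupby(keys)['winner'].size())  # Series of group sizes
print(filtered.groupby(keys).agg(
    WinnerCount=pd.NamedAgg(column='winner', aggfunc='count')))  # named column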
I have a function that I'm using to log-transform individual values of a dataframe, based on a source dataframe and a list of columns that are passed in.
def split(columns, start_df):
    train_num = start_df[columns].copy()
    numeric_features = train_num.select_dtypes(exclude=["object"]).columns
    for cols in numeric_features:
        for rows in range(0, train_num.shape[0]):
            # Offending row
            train_num[cols][rows] = np.log(train_num[cols][rows])
Since the df and the list of columns will be unknown, and the columns may come from another df as .columns.tolist(), is there a way to work around this SettingWithCopyWarning without relying on the column index (because it may not match)?
It's the only thing I can think of that's messing up the model I'm making.
I have tried the below, but I'm still getting the warning, and I'm out of ideas.
train_num.loc[cols][rows] = np.log(train_num.loc[cols][rows])
This gives me an error: 'numpy.float64' object has no attribute 'where'
train_num[cols][rows].where(train_num[cols][rows] > 0,
                            np.log(train_num[cols][rows],
                                   train_num[cols][rows]))
What's strange is that this section in the same function is throwing the same warning as well; hopefully it's the same fix!
X_train.loc[:, numeric_features] = scaler.fit_transform(X_train.loc[:, numeric_features])
X_val.loc[:, numeric_features] = scaler.transform(X_val.loc[:, numeric_features])
Any help is much appreciated!
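For reference, a vectorized rewrite of the same transform (a sketch, assuming the numeric values are all positive so np.log is defined) sidesteps the cell-by-cell chained indexing that triggers the warning:

import numpy as np

def split(columns, start_df):
    # An explicit copy, so assignments cannot alias start_df.
    train_num = start_df[columns].copy()
    numeric_features = train_num.select_dtypes(exclude=["object"]).columns
    # One .loc assignment instead of chained item-by-item writes.
    train_num.loc[:, numeric_features] = np.log(train_num[numeric_features])
    return train_num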
I'm importing data from a CSV file which has text, date and numeric columns. I'm using pandas.read_csv() to read it in, but I'm not specifying what each column's dtype should be. Here's a cut of that csv file (apologies for the shoddy formatting).
Now these two columns (total_imp_pma, char_value_aa503) are imported very differently. I import all the number fields and create a new dataframe called base_varlist4, which only contains the number columns.
When I run base_varlist4.dtypes, I get:
total_imp_pma object
char_value_aa503 float64
So as you can see, total_imp_pma was imported as an object. The problem is that if I run this:
#calculate max, and group by obs_date
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
Where varlist4 is just my list of columns, I get the wrong max value for total_imp_pma but the correct max value for char_value_aa503.
Logically, this means I should change the object total_imp_pma to either a float or an integer. However, when I run:
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
And then proceed to do the max value, I still get an incorrect result.
What's going on here? Why does pandas.read_csv() import some columns as an object dtype, and others as an int64 or float64 dtype? Why does conversion not work?
I have a theory but I'm not sure how to work around it. The only difference I see between the two columns in my source data is that total_imp_pma has mixed-type cells all the way down. For example, 66979 is a General cell, while there's a cell a little further down with a value of 1,760.60 as a number.
I think the mixed cell types in certain columns is causing pandas.read_csv() to be confused and just say "whelp, dunno what this is, import it as an object".
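A quick sketch that reproduces this theory (StringIO stands in for the file):

import pandas as pd
from io import StringIO

csv = StringIO('x\n66979\n"1,760.60"\n')
print(pd.read_csv(csv).dtypes)  # x comes back as object: '1,760.60' doesn't parse as a number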
... how do I fix this?
EDIT: Here's an MCVE as per the request below.
Data in CSV is:
Char_Value_AA503  Total_IMP_PMA
1293              19.9
1831              0.9
                  1.2
243               2,666.50
Code is:
import pandas as pd
loc = r"xxxxxxxxxxxxxx"
source_data_name = 'import_problem_example.csv'
reporting_date = '01Feb2018'
source_data = pd.read_csv(loc + source_data_name)
source_data.columns = source_data.columns.str.lower()
varlist4 = ["char_value_aa503","total_imp_pma"]
base_varlist4 = source_data[varlist4]
base_varlist4['obs_date'] = reporting_date
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
""" Test some stuff"""
source_data.dtypes
output_max
source_data.dtypes
As you can see, the max value of total_imp_pma comes out as 19.9, when it should be 2666.50.
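For what it's worth, the thousands separator is the likely culprit: pd.to_numeric(..., errors='coerce') turns '2,666.50' into NaN, so it never competes for the max. A sketch of two common workarounds, using the file from the MCVE:

import pandas as pd

# Option 1: tell read_csv about the thousands separator up front.
source_data = pd.read_csv('import_problem_example.csv', thousands=',')
source_data.columns = source_data.columns.str.lower()

# Option 2 (alternative): strip the commas from an already-loaded object
# column, then convert.
source_data['total_imp_pma'] = pd.to_numeric(
    source_data['total_imp_pma'].astype(str).str.replace(',', ''),
    errors='coerce')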