pandas: Unable to update an empty dataframe with a single string - python

I created an empty dataframe and tried adding a single string value to a column, but the dataframe stays empty.
import pandas as pd
df = pd.DataFrame()
df['Man'] = "manish"
print(df)
When I run the above code I get this output:
Empty DataFrame
Columns: [Man]
Index: []
However, when I run the code below
df['Man'] = ['manish']
print(df)
I get the output I expected:
      Man
0  manish
Can anyone explain why this is happening?

Looking at the code of __setitem__, it seems to expect a list-like value, as written in the function _ensure_valid_index:
def _ensure_valid_index(self, value):
    """
    Ensure that if we don't have an index, that we can create one from the
    passed value.
    """
    # GH5632, make sure that we are a Series convertible
    if not len(self.index) and is_list_like(value):
        try:
            value = Series(value)
        except (ValueError, NotImplementedError, TypeError):
            raise ValueError(
                "Cannot set a frame with no defined index "
                "and a value that cannot be converted to a "
                "Series"
            )
        self._data = self._data.reindex_axis(
            value.index.copy(), axis=1, fill_value=np.nan
        )
So if the length of the index is zero (as in your case), it expects a list-like value, which a string is not, so that it can convert the value to a Series and take the index from there. The function _ensure_valid_index is called inside _set_item.
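A short sketch tying this together (the 'Country' column is just an illustration, not from the question):

import pandas as pd

df = pd.DataFrame()
df['Man'] = "manish"     # scalar: there is no index to broadcast over, so nothing is set
print(df)                # Empty DataFrame, Columns: [Man], Index: []

df['Man'] = ['manish']   # list-like: pandas can build the index [0] from the value
print(df)

df['Country'] = "india"  # once an index exists, a scalar broadcasts normally
print(df)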

Related

A function for onehotencoding and labelencoding in a dataframe

I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function on a column of a dataframe.
def reform(column, dataframe):
    if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
        enc.fit(dataframe[['column']])
        enc.categories_
        onehot = enc.transform(dataframe[[column]]).toarray()
        dataframe[enc.categories_] = onehot
    elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object:
        le.fit_transform(dataframe[['column']])
    else:
        print('Column cannot be reformed')
    return dataframe
Try changing:
dataframe.column to dataframe.loc[:, column]
dataframe[['column']] to dataframe.loc[:, [column]]
For more help, please provide more information, such as: what is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input parameter (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you write dataframe.column, pandas tries to find a column literally named 'column'; if you instead write dataframe.loc[:, column], it uses the string held by the input parameter named column.
With dataframe.loc[:, column] you get a pandas Series, and with dataframe.loc[:, [column]] you get a pandas DataFrame.
The pandas attribute columns, used as dataframe.columns (note the 's' at the end), just returns the names of all columns in your dataframe, which is probably not what you want here.
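A tiny sketch of the difference, on a made-up frame (the column name 'age' is just for illustration):

import pandas as pd

df = pd.DataFrame({'age': [1, 2, 2]})
col = 'age'

# df.col would raise AttributeError: there is no column literally named 'col'
print(df.loc[:, col].nunique())   # Series -> 2 unique values
print(df.loc[:, [col]].shape)     # DataFrame -> (3, 1)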
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the inputs to something static, and iterate on the code until you get the desired output. E.g.
input_df = my_df
column_name = 'some_test_column'
if input_df.loc[:, column_name].nunique() > 2 and input_df.loc[:, column_name].dtypes == object:
    enc.fit(input_df.loc[:, [column_name]])
    onehot = enc.transform(input_df.loc[:, [column_name]]).toarray()
    input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:, column_name].nunique() == 2 and input_df.loc[:, column_name].dtypes == object:
    le.fit_transform(input_df.loc[:, [column_name]])
else:
    print('Column cannot be transformed')
Look up how to use scikit-learn Pipelines with ColumnTransformer; it will make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).
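As a rough, hedged sketch of that approach (the frame and column names here are made up for illustration):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical example data
df = pd.DataFrame({'city': ['Oslo', 'Rome', 'Oslo'], 'flag': ['y', 'n', 'y']})

# one-hot encode the object columns, pass any other columns through untouched
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['city', 'flag'])],
    remainder='passthrough',
    sparse_threshold=0,  # force a dense array so the result prints plainly
)
print(ct.fit_transform(df))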

Finding an integer in a list of integers if condition fulfilled

So I have a df with three columns: the first contains a name, the second an ID, and the third a list of IDs (delimited by commas). For guys with an identical name in the first column, I'd like to check whether the ID in the second column of the one guy appears in the list of IDs in the third column of the other guy.
name   id   id2
Gabor  665  123
Hoak   667  100,111,112
Sherr  668  1,2,3
Hoak   669  667,500,600
Rine   670  73331,999
Rine   671  670,15
So basically I'd like Python to note that there are two guys called "Hoak" and check whether the id 667 of Hoak No. 1 appears in the other Hoak's id2 list (which it does). I've started with a cheap approach that does it manually for whatever name I specify, let's say "Hoak" (i=1):
import pandas as pd
df = pd.read_excel(...)
for i in range(0, len(df)):
    if df['name'][i] == df['name'][1]:
        if df['id'][1] in df['id2'][i]:
            print(i)
However, I'm getting
TypeError: argument of type 'float' is not iterable
I've tried all sorts of variations, like .string or str(), or things like if (df['id2'][i]).str.contains("667"), but I can't work it out; I get errors like
AttributeError: 'float' object has no attribute 'string'
Thanks for your help.
You need to set dtype in read_excel to avoid the float problem. From the docs: "Data type to force. Only a single dtype is allowed. If None, infer."
import pandas as pd

# read every column as a string so the `in` check works on id2
df = pd.read_excel(io="test.xls", header=0, dtype={'name': str, 'id': str, 'id2': str})
for i in range(0, len(df)):
    if df['name'][i] == df['name'][1]:
        if df['id'][1] in df['id2'][i]:
            print(i)
Next you need to correct the search algorithm.
A more pandas-style approach is to group the rows by name and see if the set of all IDs in each group intersects with the set of all ID2s in the same group:
df['id2'] = df['id2'].astype(str).str.split(',').apply(set)
df['id'] = df['id'].astype(str)  # if needed

df.groupby('name') \
  .apply(lambda x: set(x['id']) & set.union(*x['id2']))
#name
#Gabor {}
#Hoak {667}
#Rine {670}
#Sherr {}
Try changing this condition
if df['id'][1] in df['id2'][i]:
with this:
if isinstance(df['id2'][i], list) and df['id'][1] in df['id2'][i]:
    ...
elif df['id'][1] == df['id2'][i]:
    ...
The problem may be that rows whose id2 holds only one value are read as a float rather than a list, so you can't use in on them.
Some of the values from pd.read_excel are showing up as floats, per your error messages. Have you tried just printing i in your first loop? Work your way down through the nested conditions once that bug is gone.
To solve the first bug, you need to set dtype in read_excel to avoid the float problem.

Find if a column in dataframe has neither nan nor none

I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them contain None or NaN values. I need help writing the if-else condition.
If the value in the dataframe is neither null nor NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is:
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
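A hedged sketch of how is_valid could plug into the loop from the question (assuming the df_with_job_name frame described above):

import math
import re
from datetime import datetime

for index, row in df_with_job_name.iterrows():
    start_time = row['startTime']
    if is_valid(start_time):  # skips None and NaN values
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))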
Also, note that isna is not invoked with any argument: it returns a dataframe of boolean values (see the docs). You can iterate through both dataframes to determine whether a value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where the value is missing. You tried to pass the individual value you're checking as a second input argument; isna doesn't work that way, and takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics above. The other is to build the null map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
Please check my use of row and column indices. Also, you should be able to apply an any() operation to the entire row at once.
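A further hedged note: pandas also exposes a module-level pd.isna that does accept a single scalar, which matches the one-value-at-a-time check the question attempted:

import pandas as pd

# pd.isna works on a single scalar, unlike the DataFrame.isna() method
if not pd.isna(df_with_job_name.loc[index, 'startTime']):
    start_time = df_with_job_name.loc[index, 'startTime']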

Lookup value in the same dataframe based on label and add to a new column (Vlookup)

I have a table which contains laboratory results, including 'blind duplicate samples'. These are basically a sample taken twice, where the second sample was given a non-descript label. The corresponding original sample is indicated in a separate column.
import pandas as pd

Labels = ['A1-1', 'A1-2', 'A1-3', 'A1-4', 'B1-2', 'B1-3', 'B1-4', 'B1-5', 'Blank1', 'Blank2', 'Blank3']
Values = [8356532, 7616084, 5272477, 5076012, 411851, 415258, 8285777, 9700884, 9192185, 4466890, 830516]
Duplicate_of = ['', '', '', '', '', '', '', '', 'A1-1', 'A1-4', 'B1-3']
d = {'Labels': Labels, 'Values': Values, 'Duplicate_of': Duplicate_of}
df = pd.DataFrame(data=d)
df = df[['Labels', 'Values', 'Duplicate_of']]
I would like to add a column to the dataframe which holds the value from the original sample for the duplicates. So a new column ('Original_value'), where for 'Blank1' the value of 'A1-1' is entered, for 'Blank2' the value of 'A1-4', and so on. For rows where the 'Duplicate_of' field is empty, this new column should also be empty.
In Excel this is very easy with VLOOKUP, but I haven't found an easy way in pandas (other than joining the entire table with itself?).
Here is the easiest way to do this, in one line:
df["Original_value"] = df["Duplicate_of"].apply(lambda x: "" if x == "" else df.loc[df["Labels"] == x, "Values"].values[0])
Explanation:
This simply applies a lambda function to each element of the column "Duplicate_of".
First we check whether the item is an empty string, and return an empty string if so:
"" if x == ""
is equivalent to:
if x == "" return ""
If it is not an empty string the following command is executed:
df.loc[df["Labels"] == x, "Values"].values[0]
This simply returns the value in the column "Values" where the condition df["Labels"] == x is true. If you are wondering about the .values[0] part, it is there because .loc returns a Series; our Series in this case holds just a single value, so we extract it with .values[0].
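As the question itself hints, a self-join is another option; a hedged sketch using merge with the column names above:

# left-join the frame against its own (Labels, Values) pairs
lookup = df[['Labels', 'Values']].rename(
    columns={'Labels': 'Duplicate_of', 'Values': 'Original_value'})
df = df.merge(lookup, on='Duplicate_of', how='left')

Rows with an empty Duplicate_of find no match and get NaN in Original_value.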
Not a memory-efficient answer, but this works:
import numpy as np
dictionary = dict(zip(Labels, Values))
df["Original_value"] = df["Duplicate_of"].map(lambda x: np.nan if x not in dictionary else dictionary[x])
For the rest of the values, Original_value is NaN; you can decide what you want in place of that.
The dtype of the new column will not be integer; that can also be changed if needed.
Following jezrael's comment, the same thing can be done more simply as:
dictionary = dict(zip(Labels, Values))
df["Original_value"] = df["Duplicate_of"].map(dictionary)

spark cannot create LabeledPoint

I always get this error:
AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
Quite confused, because l[0] is a string and should match argument 1.
The dataframe has only one column, named 'value', which is a comma-separated string.
I want to convert this original dataframe into a dataframe of LabeledPoint objects, with the first element as the 'label' and the rest as the 'features'.
from pyspark.sql.functions import split, col, udf
from pyspark.mllib.regression import LabeledPoint

def parse_points(dataframe):
    df1 = dataframe.select(split(dataframe.value, ',').alias('l'))
    u_label_point = udf(LabeledPoint)
    df2 = df1.select(u_label_point(col('l')[0], col('l')[1:-1]))
    return df2
parsed_points_df = parse_points(raw_data_df)
I think you want to create the LabeledPoint objects in a dataframe. You can do:
def parse_points(df):
    df1 = df.select(split(df.value, ',').alias('l'))
    # map applies the lambda to each Row; seq[0] is the split list
    df2 = df1.map(lambda seq: LabeledPoint(float(seq[0][0]), seq[0][1:]))
    return df2.toDF()  # convert the resulting PipelinedRDD back to a dataframe
parsed_points_df = parse_points(raw_data_df)
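One caveat, assuming a newer Spark version: in Spark 2.x a DataFrame no longer has a .map method, so a hedged variant would go through the underlying RDD:

from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint

def parse_points(df):
    df1 = df.select(split(df.value, ',').alias('l'))
    # drop to the RDD before mapping; row['l'] is the split list of strings
    return df1.rdd.map(lambda row: LabeledPoint(float(row['l'][0]), row['l'][1:]))

parsed_points = parse_points(raw_data_df)  # an RDD of LabeledPoint, as pyspark.mllib expects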
