Extract string values from DataFrame column - python

I have the following DataFrame:
Student      food
      1  R0100000
      2  R0200000
      3  R0300000
      4  R0400000
I need to extract, as a string, the value of the "food" column of the df DataFrame after filtering the data.
For example, when I filter by Student=1, I need "R0100000" returned as a plain string, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
data = {'Student': [1, 2, 3, 4], 'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)
I tried to select the DataFrame column and apply str(), but it does not return the desired result:
df_new = df.loc[df['Student'] == 1]
df_new = df_new.food
df_str = str(df_new)
del df_new

This works for me:
s = df[df.Student == 1]['food'][0]
s = s.strip()  # remove any surrounding whitespace
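If the filtered row's index label happens not to be 0, the [0] lookup above will raise a KeyError. A positional lookup is a bit more robust; a minimal sketch using the sample data from the question:

import pandas as pd

data = {'Student': [1, 2, 3, 4], 'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)

# .loc selects the matching rows by condition; .iloc[0] then takes the first
# match by position, so this does not depend on the index label being 0.
s = df.loc[df['Student'] == 1, 'food'].iloc[0]
print(s)  # R0100000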

It's pretty simple: first get the column, e.g. col = df["food"], and then use col[index] to get the respective value.
So your answer would be df["food"][0].
You can also use iloc and loc for this:
df.iloc[rows, columns], e.g. df.iloc[0, 1]
df.loc[rows, column_names], e.g. df.loc[0, "food"]

Related

Selecting specific columns in where condition using Pandas

I have the below DataFrame with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
In the code below, I get Processdate based on Processid == 204 (without column names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I want the same result for two columns at once, without column names, something like the code below:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get the above result, but I don't want the column names, index, or data type.
Can someone help?
I think you are looking for the header argument among the to_string parameters. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
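A quick check with toy data (the column values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Processid': [204, 305],
                   'Processdate': ['2021-01-01', '2021-02-01'],
                   'ISofficial': ['Yes', 'No']})

# index=False drops the row index, header=False drops the column names
result = df[df.Processid == 204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
print(result)  # 2021-01-01 Yes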

Indexing column in Pandas Dataframe returns NaN

I am running into a problem trying to index my dataframe. As shown in the attached picture, I have a column in the dataframe called 'identifiers' that contains a lot of redundant information ({'print_isbn_canonical': '); I only want the ISBN that comes after.
#Option 1 I tried
testdf2 = testdf2[testdf2['identifiers'].str[26:39]]
#Option 2 I tried
testdf2['identifiers_test'] = testdf2['identifiers'].str.replace("{'print_isbn_canonical': '","")
Unfortunately, both of these options turn the dataframe column into a column containing only NaN values.
Please help out! I cannot seem to find the solution and have tried several things. Thank you all in advance!
[Example image of the dataframe]
If the content of your identifiers column is a real dict/JSON type, you can use the string accessor str[] to access the dict value by key, as follows:
testdf2['identifiers_test'] = testdf2['identifiers'].str['print_isbn_canonical']
Demo
data = {'identifiers': [{'print_isbn_canonical': '9780721682167', 'eis': '1234'}]}
df = pd.DataFrame(data)
df['isbn'] = df['identifiers'].str['print_isbn_canonical']
print(df)
                                                 identifiers           isbn
0  {'print_isbn_canonical': '9780721682167', 'eis': '1234'}  9780721682167
Try this out:
testdf2['new_column'] = testdf2.apply(lambda r: r.identifiers[26:39], axis=1)
Here I assume that the identifiers column is of string type.
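If the column really does hold strings rather than dicts, a pattern-based extraction may be less brittle than fixed character positions. A minimal sketch, assuming the ISBN is a 13-digit run (the sample value is made up):

import pandas as pd

testdf2 = pd.DataFrame({'identifiers': ["{'print_isbn_canonical': '9780721682167'}"]})

# str.extract pulls out the first group matching the pattern, here a 13-digit
# run, so the result does not depend on the ISBN's exact position in the string.
testdf2['identifiers_test'] = testdf2['identifiers'].str.extract(r'(\d{13})', expand=False)
print(testdf2['identifiers_test'][0])  # 9780721682167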

Drop/edit rows in dataframe where entry doesn't meet condition

I know this has been asked before but I cannot find an answer that is working for me. I have a dataframe df that contains a column age, but the values are not all integers, some are strings like 35-59. I want to drop those entries. I have tried these two solutions as suggested by kite but they both give me AttributeError: 'Series' object has no attribute 'isnumeric'
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally is there a simple way to edit the value of an entry if it matches a certain condition? For example instead of deleting rows that have age as a range of values, I could replace it with a random value within that range.
Try with:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
If you check the documentation, isnumeric is a method of Series.str, not of Series; that's why you get that error.
You also need the == False because the column has mixed types: .str.isnumeric() returns NaN for non-string entries, so the comparison keeps only the rows that are explicitly False.
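A toy illustration of the mixed-type behaviour (the values here are made up):

import pandas as pd

df = pd.DataFrame({'age': [25, '35-59', '40']})

# .str.isnumeric() returns NaN for the non-string entry (25), False for
# '35-59', and True for '40'; '== False' therefore selects only '35-59'.
print(df.age.str.isnumeric())
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
print(df)  # rows 25 and '40' remain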
I'm posting this in case it also helps with your last question. You can use pandas.DataFrame.at with pandas.DataFrame.itertuples to iterate over the rows of the dataframe and replace values:
for row in df.itertuples():
    # iterate over every row and change the value of that column
    if row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
    # row.age is a plain string here, so use str.isnumeric directly (no .str accessor)
    if not row.age.isnumeric() or row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
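For the follow-up about replacing a range with a random value inside it, a minimal sketch, assuming ranges always look like 'low-high':

import random
import pandas as pd

df = pd.DataFrame({'age': ['25', '35-59', '40']})

for row in df.itertuples():
    if not row.age.isnumeric():                      # e.g. a range like '35-59'
        low, high = map(int, row.age.split('-'))
        df.at[row.Index, 'age'] = str(random.randint(low, high))

print(df)  # '35-59' replaced by a random age between 35 and 59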

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have, mixed into some columns, dates in this ASP.NET format: "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates matching a regex, using pd.DataFrame.replace.
something like:
def pretty_dates(df):
    # Messy logic here
    ...

df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the replacement value used by df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To add some clarity: I have many columns in the dataframe, over a hundred, that contain this date format, and I would rather not list out every single column that has a date. Is there a way to apply the function that cleans my dates across all the columns in my dataset? I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use a regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)
print(df)
0    1239018869048
1    1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
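Putting the two ideas together for the original date problem, a minimal sketch, assuming every cell to convert is a literal '/Date(milliseconds)/' string (pretty_date and DATE_RE are illustrative names, not part of any library):

import re
import pandas as pd

DATE_RE = re.compile(r'/Date\((\d+)\)/')

def pretty_date(cell):
    # Convert an ASP.NET '/Date(ms)/' string to a Timestamp; pass other values through.
    if isinstance(cell, str):
        m = DATE_RE.fullmatch(cell)
        if m:
            return pd.to_datetime(int(m.group(1)), unit='ms')
    return cell

df = pd.DataFrame({'a': ['/Date(1239018869048)/', 'x'], 'b': [1, 2]})
df = df.applymap(pretty_date)   # DataFrame.map in pandas >= 2.1
print(df)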

Using filter in pandas to get an exact match and partial match at the same time

I have a dataframe that looks like this:
Y2000  Y2001  Y2002  Item  Item Code
   34     43     65    12       Test
I want to extract the columns Y2000, Y2001, Y2002 and Item. I do not want to extract the 'Item Code' column. How do I do this without explicitly specifying column names, as I have tons of columns in the full dataframe? Right now, I am using the filter command but it is not working for me:
df.filter(like='Y|Item')
It just returns an empty dataframe
IIUC then you can use a regex pattern:
In [2]:
df = pd.DataFrame(columns=['Y2000','Y2001','Y2002','Item','Item Code'])
df
Out[2]:
Empty DataFrame
Columns: [Y2000, Y2001, Y2002, Item, Item Code]
Index: []
In [8]:
df.filter(regex='^Y\d{4}$|^Item$')
Out[8]:
Empty DataFrame
Columns: [Y2000, Y2001, Y2002, Item]
Index: []
So ^Y\d{4}$|^Item$ matches either a name that starts with 'Y' followed by exactly four digits and then ends ($), or the name 'Item' anchored at both the start (^) and the end ($).
According to the documentation for filter, you need the regex parameter:
df.filter(regex='Y|Item$')
where the columns satisfying re.search(regex, col) == True will be kept. The like version performs a sub-string search on the column names, which is why it does not work when a regex-like input such as 'Y|Item' is supplied.
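A quick check that the second pattern also keeps 'Item' while dropping 'Item Code':

import pandas as pd

df = pd.DataFrame(columns=['Y2000', 'Y2001', 'Y2002', 'Item', 'Item Code'])

# re.search('Y|Item$', 'Item Code') finds no match: the name contains no 'Y'
# and 'Item' is not at the end, so that column is dropped.
print(df.filter(regex='Y|Item$').columns.tolist())
# ['Y2000', 'Y2001', 'Y2002', 'Item']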
