Indexing column in Pandas Dataframe returns NaN - python

I am running into a problem with trying to index my dataframe. As shown in the attached picture, I have a column in the dataframe called 'Identifiers' that contains a lot of redundant information ({'print_isbn_canonical': '). I only want the ISBN that comes after.
#Option 1 I tried
testdf2 = testdf2[testdf2['identifiers'].str[26:39]]
#Option 2 I tried
testdf2['identifiers_test'] = testdf2['identifiers'].str.replace("{'print_isbn_canonical': '","")
Unfortunately, both of these options turn the dataframe column into a column containing only NaN values.
Please help out! I cannot seem to find the solution and have tried several things. Thank you all in advance!
Example image of the dataframe

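The string methods under .str return NaN for elements that are not strings, which would explain why both slicing and str.replace give NaN if the cells hold dicts. If the identifiers column contains real dict objects (for example, parsed JSON), you can use the string accessor str[] to look up the dict value by key, as follows: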
testdf2['identifiers_test'] = testdf2['identifiers'].str['print_isbn_canonical']
Demo
import pandas as pd
data = {'identifiers': [{'print_isbn_canonical': '9780721682167', 'eis': '1234'}]}
df = pd.DataFrame(data)
df['isbn'] = df['identifiers'].str['print_isbn_canonical']
print(df)
identifiers isbn
0 {'print_isbn_canonical': '9780721682167', 'eis': '1234'} 9780721682167

Try this:
testdf2['new_column'] = testdf2.apply(lambda r: r.identifiers[26:39], axis=1)
This assumes that the identifiers column holds strings.
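The two answers above differ only in whether each cell is a dict or a string. A minimal sketch that handles both cases, assuming string cells are Python-literal dicts that ast.literal_eval can parse (the helper name extract_isbn and the sample values are hypothetical):

```python
import ast

import pandas as pd

# Sample data: one real dict and one dict stored as a string (assumed shapes).
df = pd.DataFrame({
    "identifiers": [
        {"print_isbn_canonical": "9780721682167", "eis": "1234"},
        "{'print_isbn_canonical': '9780721682168'}",
    ]
})


def extract_isbn(value):
    """Return the ISBN whether the cell is a dict or its string form."""
    if isinstance(value, str):
        # Parse the string form back into a dict.
        value = ast.literal_eval(value)
    return value.get("print_isbn_canonical")


df["isbn"] = df["identifiers"].apply(extract_isbn)
print(df["isbn"].tolist())
```

Checking type(testdf2['identifiers'].iloc[0]) first tells you which of the two cases applies to your data.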

Related

Extract strings values from DataFrame column

I have the following DataFrame:
Student  food
1        R0100000
2        R0200000
3        R0300000
4        R0400000
I need to extract as a string the values of the "food" column of the df DataFrame when I filter the data.
For example, when I filter by the Student=1, I need the return value of "R0100000" as a string value, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
data={'Student':[1,2,3,4],'food':['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df=pd.DataFrame(data)
I tried selecting the DataFrame column and applying str(), but it does not return the desired result:
df_new=df.loc[df['Student'] == 1]
df_new=df_new.food
df_str=str(df_new)
del df_new
This works for me:
s = df.loc[df.Student == 1, 'food'].iloc[0]  # .iloc[0] takes the first match by position, whatever its index label
s.strip()
It's pretty simple. First get the column, e.g. col = df["food"], then use col[index] to get the value at that index label.
So your answer would be df["food"][0].
You can also use iloc and loc for this:
df.iloc[rows, columns] selects by position, so the answer is df.iloc[0, 1].
df.loc[rows, column_names] selects by label, for example df.loc[0, "food"].
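Putting the filter and the scalar lookup together, here is a small sketch that returns the value as a plain Python string for any Student id. The helper name food_for is hypothetical, and it assumes every id passed in matches at least one row:

```python
import pandas as pd

data = {'Student': [1, 2, 3, 4],
        'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)


def food_for(student_id):
    """Return the food code of the first row matching student_id."""
    matches = df.loc[df['Student'] == student_id, 'food']
    # .iloc[0] is positional, so it works regardless of the row's index label.
    return matches.iloc[0]


value = food_for(2)
print(value)
```

Using .iloc[0] rather than [0] matters here: after filtering, the surviving row keeps its original index label, which is 0 only for the first student.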

Remove a dtype data from pandas dataframe column

I have a dataframe in which date and datetime values were added to a column that was expected to hold strings. What would be the best way to find all date and datetime values in a pandas dataframe column and replace them with blanks?
Thank you!
In general, a minimal working example of your problem would make it possible to help more specifically, but assuming you have the following column:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros(shape=(10,1)), columns = ["Mixed"])
df["Mixed"] = "foobar"
df.loc[2,"Mixed"] = pd.to_datetime("2022-08-22")
df.loc[7,"Mixed"] = pd.to_datetime("2022-08-21")
#print("Before Fix", df)
You can use apply(type) on the column to obtain the data type of each cell, and then use the list comprehension [x!=str for x in types] to flag every cell whose type is not str. After that, just replace the flagged values with a value of your choosing.
types = df["Mixed"].apply(type).values
mask = [x!=str for x in types]
df.loc[mask,"Mixed"] = "" #Or None, or whatever you want to overwrite it with
#print("After Fix", df)
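The same idea can be written with isinstance and Series.where, which keeps the values where the mask is True and substitutes the replacement elsewhere. This is only a sketch on made-up sample data, not the asker's actual column:

```python
import datetime

import pandas as pd

# Hypothetical mixed column: strings plus a Timestamp and a date.
df = pd.DataFrame({
    "Mixed": [
        "foobar",
        pd.Timestamp("2022-08-22"),
        "baz",
        datetime.date(2022, 8, 21),
    ]
})

# True where the cell is a string, False for anything else.
is_str = df["Mixed"].map(lambda v: isinstance(v, str))

# Keep strings, replace every non-string cell with an empty string.
df["Mixed"] = df["Mixed"].where(is_str, "")
print(df["Mixed"].tolist())
```

isinstance also accepts subclasses of str, whereas comparing type(x) against str matches only exact str objects.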

Selecting specific columns in where condition using Pandas

I have the following DataFrame with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
With the code below, I get Processdate for Processid==204 (without column names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I want the same result for two columns at once, without column names. Something like the code below:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get the result above, but I don't want the column names, index, or data type in the output.
Can someone help?
I think you are looking for the header argument of to_string. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
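A self-contained sketch of that call on made-up data (the sample rows are assumptions, since the original query is not shown):

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built from the query.
df = pd.DataFrame({
    "Processid": [203, 204, 205],
    "Processdate": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "ISofficial": [True, False, True],
})

# Filter rows and pick both columns in a single .loc call.
rows = df.loc[df["Processid"] == 204, ["Processdate", "ISofficial"]]

# index=False drops the row labels, header=False drops the column names.
result = rows.to_string(index=False, header=False)
print(result)
```

Selecting rows and columns in one .loc call also avoids the chained indexing used in df[...][[...]].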

How to create columns from a column that contains several pieces of information?

I created a new dataframe by performing operations on two others, and I would like to remove the False values from it. However, all of my information is in a single column and I do not know how to do it.
code
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
success_members_exp = pd.merge(members,
                               expeditions[['expedition_id','termination_reason']],
                               on='expedition_id', how='inner')
success_members_exp_pourcent["pourcent"] = success_members_exp.groupby('expedition_id')['success'].value_counts(normalize=True) * 100
success_members_exp_pourcent.to_frame()
Change:
success_members_exp_pourcent["pourcent"] = success_members_exp.groupby('expedition_id')['success'].value_counts(normalize=True) * 100
to:
s = (success_members_exp.groupby('expedition_id')['success']
                        .value_counts(normalize=True)
                        .rename('pourcent')
                        .mul(100))
success_members_exp_pourcent = success_members_exp.join(s, on=['expedition_id','success'])
Explanation:
s is a Series with a MultiIndex (expedition_id, success), so to add it as a new column, rename it, multiply it by 100, and attach it with DataFrame.join on both keys.
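To see the join at work without downloading the CSV files, here is a sketch on a tiny made-up table (the expedition ids and success values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the merged members/expeditions data.
members = pd.DataFrame({
    "expedition_id": ["A", "A", "A", "A", "B"],
    "success": [True, True, True, False, False],
})

# Share of each success value per expedition, as a percentage.
# The result is a Series indexed by (expedition_id, success).
s = (members.groupby("expedition_id")["success"]
            .value_counts(normalize=True)
            .rename("pourcent")
            .mul(100))

# Join on both keys so every row picks up its group's percentage.
result = members.join(s, on=["expedition_id", "success"])
print(result)
```

Each row receives the percentage for its own (expedition_id, success) pair, so the three successful members of expedition A all get 75.0.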
Which column is your information stored in? What type of column is it? We are missing a lot of context, a print out of the column would be really nice, with a more specific explanation of what exactly you are trying to extract.

How to keep leading zeros in a column when reading CSV with Pandas?

I am importing study data into a Pandas data frame using read_csv.
My subject codes are six-digit numbers that encode, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, a simple solution is to use the converters option of the read_csv function for the column in question.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
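The behaviour can be reproduced without a file on disk by reading the same CSV text from memory. In this sketch, io.StringIO stands in for projects.csv:

```python
import io

import pandas as pd

csv_text = (
    "project_name,project_id\n"
    "Some Project,000245\n"
    "Another Project,000478\n"
)

# Default parsing infers an integer column and drops the zeros.
as_int = pd.read_csv(io.StringIO(csv_text))

# Either option keeps the column as text, zeros included.
via_converter = pd.read_csv(io.StringIO(csv_text),
                            converters={"project_id": str})
via_dtype = pd.read_csv(io.StringIO(csv_text),
                        dtype={"project_id": str})

print(as_int["project_id"].tolist())
print(via_converter["project_id"].tolist())
print(via_dtype["project_id"].tolist())
```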
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust and fully working solution.
Simply define a mapping (dictionary) between variable names and desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
Et voilà!
If you have a lot of columns and don't know which ones contain leading zeros, or you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the complete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this, which works on all versions of Pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
If you know the width, you can use converters to pad the values to a fixed width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: str(x).zfill(5)})
This will do the trick. read_csv passes each cell to the converter as text, so str.zfill is used to left-pad it with zeros. It works for pandas==0.23.0 and also read_excel.
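If the file has already been read and the zeros are gone, the column can be padded back afterwards, provided the intended width is known. A sketch assuming a width of 6 and made-up values:

```python
import pandas as pd

# Column that was parsed as integers, so the leading zeros are lost.
df = pd.DataFrame({"project_id": [245, 478]})

# Convert to text, then left-pad with zeros up to the known width.
df["project_id"] = df["project_id"].astype(str).str.zfill(6)
print(df["project_id"].tolist())
```

This only restores the original codes when every value is supposed to have the same width; reading the column as a string in the first place avoids that assumption.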
I don't think you can specify a column type the way you want (unless there have been changes recently, and unless the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that there is a new parser coming in pandas 0.10 in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6
