Using Python Pandas to fill a new table with NaN values - python

I've imported data from a csv file which has columns NAME, ADDRESS, PHONE_NUMBER.
Sometimes at least one of the columns has a missing value for that row, e.g.
0 - Bill, Flat 2, 555123
1 - Katie, NaN, NaN
2 - Ruth, Flat 1, ?
I'm trying to get the rows with NaN values into a new table, which I can do if a filler value has been put in, such as:
newDetails = details[details['PHONE_NUMBER'] == "?"]
which gives me:
2 - Ruth, Flat 1, ?
I tried to use fillna but I couldn't find the syntax that would work.

Pandas fillna (pandas.DataFrame.fillna) is quite simple. Suppose your data frame is df. Here's how you can do it.
df.fillna('_missing_value_', inplace=True)
It looks like you have different fields with missing values. Maybe try this:
df = df.where(pd.notnull(df), '_missing_value_')
Edit 1: to fill in a single column
If you only want to fill one column, say ADDRESS, here's how:
col_flat = df[['ADDRESS']].fillna('?')
df['ADDRESS'] = col_flat['ADDRESS']
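For reference, if the blanks are real NaN values rather than a "?" filler, you can also select those rows directly with isna() instead of filling them first; a minimal sketch using the question's column names:
import numpy as np
import pandas as pd
# Sample frame with the question's columns; NaN marks the missing values
details = pd.DataFrame({
    "NAME": ["Bill", "Katie", "Ruth"],
    "ADDRESS": ["Flat 2", np.nan, "Flat 1"],
    "PHONE_NUMBER": ["555123", np.nan, np.nan],
})
# Rows where PHONE_NUMBER is missing, without filling anything in first
newDetails = details[details["PHONE_NUMBER"].isna()]
print(newDetails)  # Katie and Ruth rows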

Related

Extract string values from DataFrame column

I have the following DataFrame:
Student    food
1          R0100000
2          R0200000
3          R0300000
4          R0400000
I need to extract as a string the values of the "food" column of the df DataFrame when I filter the data.
For example, when I filter by the Student=1, I need the return value of "R0100000" as a string value, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
data={'Student':[1,2,3,4],'food':['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df=pd.DataFrame(data)
I tried to select the DataFrame column and apply str(), but it does not return the desired result:
df_new=df.loc[df['Student'] == 1]
df_new=df_new.food
df_str=str(df_new)
del df_new
This works for me:
s = df[df.Student == 1]['food'][0]
s = s.strip()
It's pretty simple: first get the column, like col = df["food"], and then use col[index] to get the respective value.
So your answer would be df["food"][0].
You can also use iloc and loc (search for these):
df.iloc[rows, columns], so here you can get the answer as df.iloc[0, 1]
df.loc[rows, column_names], for example df.loc[0, "food"]
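For completeness, a minimal sketch using the question's own data; .iloc[0] grabs the first matching row by position, so it also works when the index labels aren't 0, 1, 2, ...:
import pandas as pd
data = {'Student': [1, 2, 3, 4], 'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)
# Filter by Student, keep only the 'food' column, take the first match as a plain string
s = df.loc[df['Student'] == 1, 'food'].iloc[0]
print(s)  # R0100000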

Subtract dataframes with completely different row names and column names

My dataframe 1 looks like this:
windcodes    name            yield     perp
163197.SH    shangguo comp   2.9248    NO
154563.SH    guosheng comp   2.886     Yes
789645.IB    guoyou comp     3.418     NO
My dataframe 2 looks like this
windcodes     CALC
1202203.IB    2.5517
1202203.IB    2.48457
1202203.IB    2.62296
and I want my result dataframe 3 to have one new column compared with dataframe 1, which is the value in column 'yield' in dataframe 1 minus the value in column 'CALC' in dataframe 2.
The result dataframe 3 should look like this:
windcodes    name            yield     perp   yield-CALC
163197.SH    shangguo comp   2.9248    NO     0.3731
154563.SH    guosheng comp   2.886     Yes    0.40413
789645.IB    guoyou comp     3.418     NO     0.79504
It would be really helpful if anyone can tell me how to do it in python.
Just in case you have completely different indexes, use df2's underlying numpy array:
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values
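A minimal runnable sketch of this approach, using the question's sample numbers; .values drops df2's index, so the subtraction is purely positional (both frames must have the same number of rows):
import pandas as pd
df1 = pd.DataFrame({'windcodes': ['163197.SH', '154563.SH', '789645.IB'], 'yield': [2.9248, 2.886, 3.418]})
df2 = pd.DataFrame({'windcodes': ['1202203.IB'] * 3, 'CALC': [2.5517, 2.48457, 2.62296]})
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values  # row-by-row, ignoring index labels
print(df1)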
You can try something like this:
df1['yield-CALC'] = df1['yield'] - df2['CALC']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.
Do we need to join the 2 dataframes on the windcodes column? The windcodes are all the same in the sample data you have given for dataframe 2. Can you explain this?
If we are going to join on the windcodes field, the code below will work.
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
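A small sketch of the merge approach with made-up matching windcodes (note that the sample windcodes in the question don't overlap, so an inner join on the real data would come back empty):
import pandas as pd
df1 = pd.DataFrame({'windcodes': ['163197.SH', '154563.SH'], 'yield': [2.9248, 2.886]})
df2 = pd.DataFrame({'windcodes': ['163197.SH', '154563.SH'], 'CALC': [2.5517, 2.48457]})
df = pd.merge(left=df1, right=df2, how='inner', on='windcodes')  # align rows by windcode
df['yield-CALC'] = df['yield'] - df['CALC']
print(df)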
I will try to keep it as detailed as possible:
The environment I have used for coding is Jupyter Notebook.
Importing our required pandas library:
import pandas as pd
Getting your first table data in the form of lists of lists (you can also use CSV, Excel, etc. here):
data_1 = [["163197.SH", "shangguo comp", 2.9248, "NO"],
          ["154563.SH", "guosheng comp", 2.886, "Yes"],
          ["789645.IB", "guoyou comp", 3.418, "NO"]]
Creating dataframe one:
df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1
Output:
Getting your second table data in the form of lists of lists (you can also use CSV, Excel, etc. here):
data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]
Creating dataframe two:
df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2
Output:
Now creating the third dataframe:
df_3 = df_1.copy()  # because the first 4 columns are the same as our first dataframe (copy so df_1 isn't modified)
df_3
Output:
Now calculating the fourth column, i.e. "yield-CALC":
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"]  # element-wise subtraction, aligned on the row index (0, 1, 2 in both frames)
df_3
Output:

Indexing column in Pandas Dataframe returns NaN

I am running into a problem with trying to index my dataframe. As shown in the attached picture, I have a column in the dataframe called 'Identifiers' that contains a lot of redundant information ({'print_isbn_canonical': '). I only want the ISBN that comes after.
#Option 1 I tried
testdf2 = testdf2[testdf2['identifiers'].str[26:39]]
#Option 2 I tried
testdf2['identifiers_test'] = testdf2['identifiers'].str.replace("{'print_isbn_canonical': '","")
Unfortunately both of these options turn the dataframe column into a column only containing NaN values.
Please help out! I cannot seem to find the solution and have tried several things. Thank you all in advance!
Example image of the dataframe
If the contents of your column identifiers is a real dict / json type, you can use the string accessor str[] to access the dict value by key, as follows:
testdf2['identifiers_test'] = testdf2['identifiers'].str['print_isbn_canonical']
Demo
data = {'identifiers': [{'print_isbn_canonical': '9780721682167', 'eis': '1234'}]}
df = pd.DataFrame(data)
df['isbn'] = df['identifiers'].str['print_isbn_canonical']
print(df)
identifiers isbn
0 {'print_isbn_canonical': '9780721682167', 'eis': '1234'} 9780721682167
Try this out:
testdf2['new_column'] = testdf2.apply(lambda r: r.identifiers[26:39], axis=1)
Here I assume that the identifiers column is of string type.
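A minimal runnable sketch of this answer under that assumption; the sample string below is made up to match the prefix shown in the question, and characters 26:39 are exactly the 13 ISBN digits:
import pandas as pd
testdf2 = pd.DataFrame({'identifiers': ["{'print_isbn_canonical': '9780721682167'}"]})
testdf2['new_column'] = testdf2.apply(lambda r: r.identifiers[26:39], axis=1)  # slice out the ISBN
print(testdf2['new_column'])  # 9780721682167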

Dataframe sum(axis=1) is returning Nan Values

I'm trying to compute the sum of the second column ('ALL_PPA'), grouping by Numéro_département.
Here's my code :
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers, I don't have any NaN values, but when I apply the function df.sum(axis=1), some rows appear to have a NaN value.
Here's how my table looks before sum():
Here's after sum():
My question is: How am I supposed to do this? I've tried to use the numpy library, but it doesn't work as I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the rest to int. Right now, it is an object because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
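A minimal sketch of these two steps on made-up numbers, assuming the stray header row sits at position 0 as in the screenshot; in real data you may prefer to convert only ALL_PPA if the department codes aren't numeric:
import pandas as pd
df = pd.DataFrame({'Numéro_département': ['Numéro_département', '01', '01', '02'],
                   'ALL_PPA': ['ALL_PPA', '10', '20', '5']})
df2 = df.iloc[1:].astype(int).copy()  # drop the stray header row, convert the strings to int
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
print(df3)  # department 1 -> 30, department 2 -> 5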
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis= you choose) with NaN values. According to the screenshot provided, please drop the first line of the DF as it is a string.

Dynamic - Automated multiplication - Pandas dataframes

After spending quite a while searching and reading on Stack Overflow and around the web, I am desperate...
I have a Pandas DataFrame with some imported data (spectra). The first column is the wavelength while the others are the various spectra (the data). The names of the columns are imported from a list that reads the filenames from a path and keeps just the names.
What I would like to achieve and I can't quite seem to get how is to multiply each of the columns with the wavelength column and either overwrite the existing ones or create a new dataframe (doesn't matter that much).
This is the code I have so far (even if it's not the most elegant, it gets the job done):
path = r'thePathToData\PL_calc\Data_NIR'
idx = 0
#Create the DataFrame with all the data from the path above, use the filenames as column names
all_files = glob.glob(os.path.join(path, "*.asc"))
df = pd.concat((pd.read_csv(f, usecols=[1], sep='\t') for f in all_files), axis=1) #usecol=1 for the spectrum only
fileNames = [] # create a list for the filenames
for i in range(0, len(all_files)):
    fileNames.append(all_files[i][71:-4])
df.columns = fileNames # assign the filenames as columns
wavelengths = pd.read_csv(all_files[0], usecols=[0], sep='\t') # add the wavelength column as first column of the dataframe
df.insert(loc=idx, column='Wavelength', value=wavelengths)
If I print just the head of the DF it looks like this:
Wavelength F8BT_Pure_Batch1_px1_spectra_4V \ ...
0 478.0708 -3.384101
1 478.3917 -1.580399
2 478.7126 -0.323580
3 479.0334 -1.131425
4 479.3542 1.202728
The complete DF is:
1599 rows × 46 columns
Question 1:
I can't quite find an automated (dynamic) way of multiplying each col with the first one, essentially this:
for i in range(1, len(df.columns)):
    df[[i]] = df[[0]] * df[[i]]
Question 2:
Why does this work:
df['F8BT_Pure_Batch1_px1_spectra_4V'] = df['Wavelength']*df['F8BT_Pure_Batch1_px1_spectra_4V']
while this doesn't and gives me an "IndexError: indices are out-of-bounds"
df[[1]] = df[[0]]*df[[1]]
But when I print(df[['Wavelength']]) (Name: Wavelength, dtype: float64) and print(df[[0]]) ([1599 rows x 1 columns]), I get the same numbers.
Question 3:
Why does this df[fileNames] = df[fileNames].multiply(df.Wavelength) give me a ValueError: Columns must be same length as key? All the columns are of the same length (1599 rows long, 0-1598 and a total of 46 columns in this case). fileNames contains the names of the imported files and the names of the columns of the dataframe.
Many many thanks in advance for your help...
Alex
Question 1
To multiply your wavelength column by every other column in your DataFrame, you can use:
df.iloc[:, 1:] = df.iloc[:, 1:].mul(df['Wavelength'], axis=0)
This assumes your wavelength column is the first column.
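A small self-contained sketch of that one-liner; the spectrum columns here are invented placeholders for the 45 real ones:
import pandas as pd
df = pd.DataFrame({'Wavelength': [478.0708, 478.3917, 478.7126],
                   'spec_A': [-3.384101, -1.580399, -0.323580],
                   'spec_B': [1.0, 2.0, 3.0]})
df.iloc[:, 1:] = df.iloc[:, 1:].mul(df['Wavelength'], axis=0)  # multiply every spectrum column by the wavelength column
print(df.head())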
Question 2
Selecting columns like that using an integer is asking for columns of your DataFrame that are named 0, 1, etc., as ints. There are none in your DataFrame. To select columns by index number look into the documentation for pandas' iloc method.
Question 3
When you call df[fileNames], you are getting a DataFrame with the same number of columns as the length of your list fileNames. Your code df[fileNames].multiply(df.Wavelength) is not giving you a DataFrame with the same number of columns as df[fileNames], hence you cannot assign the values. Using the axis=0 parameter in the multiply function works for me.
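For Question 3, a minimal sketch of the axis=0 fix with a stand-in fileNames list (the real list comes from the glob loop in the question):
import pandas as pd
df = pd.DataFrame({'Wavelength': [478.0708, 478.3917],
                   'spec_A': [-3.384101, -1.580399],
                   'spec_B': [1.0, 2.0]})
fileNames = ['spec_A', 'spec_B']  # stand-in for the filename-derived column names
df[fileNames] = df[fileNames].multiply(df['Wavelength'], axis=0)  # align on rows, not columns
print(df)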
