Inserting column with specifics - python

I have a specific question: I need to create a column name called "Plane type" for a column that contains the first 4 characters of the "TAIL_NUM" column.
How can I do this? I already imported the data and I can see it.

Creating new columns with pandas (assuming that's what you're talking about) is very simple, and pandas also provides common string methods through the .str accessor (see the pandas docs and similar SO questions).
You will use 'string slicing', which is worth reading about.
df['new_col'] = 'X'
or in your case:
df['Plane type'] = df['TAIL_NUM'].str[:4]

After viewing your code, and assuming that the "TAIL_NUM" column holds string values, you can do it like this:
df['Plane type'] = df["TAIL_NUM"].str[0:4]
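Putting the pieces together, here is a minimal, self-contained sketch; the tail numbers below are invented for illustration:

```python
import pandas as pd

# Invented sample data standing in for the imported file.
df = pd.DataFrame({"TAIL_NUM": ["N102UW", "N103US", "N104UW"]})

# .str[:4] slices the first four characters of each string in the column.
df["Plane type"] = df["TAIL_NUM"].str[:4]
print(df["Plane type"].tolist())  # ['N102', 'N103', 'N104']
```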


Why does vaex change column names that contain a period?

When using vaex I came across an unexpected error NameError: name 'column_2_0' is not defined.
After some investigation I found that in my data source (HDF5 file) the column name causing problems is actually called column_2.0 and that vaex renames it to column_2_0 but when performing operations using column names I run into the error. Here is a simple example that reproduces this error:
import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]
dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1 which is already taken so instead it ends up using abc_1_1.
I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.
I could also enforce all my column names from source data to never use a period but this seems like a stretch given that pandas and other sources where data might come from in general don't have this restriction.
What are some ideas to deal with this problem other than the two I mentioned above?
In vaex the columns are in fact "Expressions". Expressions allow you to build a sort of computational graph behind the scenes as you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.
So column names like '2' or '2.5' are not allowed, since the expression system could interpret them as numbers rather than column names. Similarly, a column name like 'first-name' can be interpreted by the expression system as df['first'] - df['name'].
To avoid this, vaex will smartly rename columns so that they can be used in the expression system. This is actually quite complicated, and in your example above you've found a case that had not been covered yet (isna/notna).
Btw, you can always access the original names via df.get_column_names(alias=True).
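A third option, beyond the two mentioned in the question, is to sanitize the column names on the pandas side before calling vaex.from_pandas, so that you control the conflict resolution rather than vaex. A minimal sketch; the replacement rule here is illustrative, not vaex's actual algorithm:

```python
import re
import pandas as pd

def sanitize_columns(df):
    """Replace non-identifier characters and deduplicate the resulting names."""
    seen = {}
    new_cols = []
    for col in df.columns:
        name = re.sub(r"\W", "_", col)  # '.', '-', spaces, etc. -> '_'
        if name in seen:                # explicit handling of name clashes
            seen[name] += 1
            name = f"{name}_{seen[name]}"
        seen.setdefault(name, 0)
        new_cols.append(name)
    return df.rename(columns=dict(zip(df.columns, new_cols)))

df = pd.DataFrame([[0, 1, 2]], columns=["abc_1", "abc1", "abc.1"])
print(list(sanitize_columns(df).columns))  # ['abc_1', 'abc1', 'abc_1_1']
```

The names that come out are predictable to you, so expressions can be written against them without guessing what vaex will pick.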

how do I create a new column out of a dictionary's sub string on a pandas dataframe

I have the following repo for the files: https://github.com/Glarez/learning.git
dataframe
I need to create a column with the bold part of that string under the params column: "ufield_18":"ONLY". I don't see how I can get that, since I'm learning to code from scratch. The solution to this would be nice, but what I would really appreciate is you pointing me in the right direction so I can get the answer myself. Thanks!
Since you do not want the exact answer, I will provide you one of the ways to achieve this:
1) filter the params column into a dictionary variable
2) create a loop to access the keys of the dictionary
3) append each key to the pandas df (df[key] = np.nan); make sure you add some values while appending the column if your df already has rows, or just use np.nan
Note: np is the numpy library, which needs to be imported.
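To make the steps above concrete, here is one hedged sketch. It assumes params holds JSON strings and that ufield_18 is the key you want; both assumptions are taken from the question, and the sample rows are invented:

```python
import json
import numpy as np
import pandas as pd

# Hypothetical frame: 'params' holds JSON strings, as the question's data suggests.
df = pd.DataFrame({"params": ['{"ufield_18": "ONLY", "other": 1}',
                              '{"ufield_18": "ALSO", "other": 2}']})

parsed = df["params"].apply(json.loads)  # each row -> dict
# Pull the key of interest into its own column; missing keys become NaN.
df["ufield_18"] = parsed.apply(lambda d: d.get("ufield_18", np.nan))
print(df["ufield_18"].tolist())  # ['ONLY', 'ALSO']
```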

How to search in a pandas dataframe column with the space in the column name

If I need to search whether a value exists in a pandas data frame column that has a name without any spaces, then I simply do something like this:
if value in df.Timestamp.values
This will work if the column name is Timestamp. However, I have plenty of data with column names like 'Date Time'. How do I use the if ... in statement in that case?
If there is no easy way to check for this using the if ... in statement, can I search for the existence of the value in some other way? Note that I just need to check for the existence of the value, and this is not an index column.
Thank you for any input.
It's better practice to use the square bracket notation:
df["Date Time"].values
which does exactly the same thing.
There are 2 ways of indexing columns in pandas. One is using the dot notation which you are using and the other is using square brackets. Both work the same way.
if value in df["Date Time"].values
In the case where you want to work with a column whose header name contains spaces, but you don't want it changed permanently because you may have to forward the file, one way is to just rename it, do whatever you want with the new no-spaces name, then rename it back. For example, to drop the rows with the value "DUMMY" in the column 'Recipient Fullname':
df.rename(columns={'Recipient Fullname':'Recipient_Fullname'}, inplace=True)
df = df[(df.Recipient_Fullname != "DUMMY")]
df.rename(columns={'Recipient_Fullname':'Recipient Fullname'}, inplace=True)
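For comparison, a small sketch (data invented) showing that square-bracket indexing handles spaced names directly, both for the membership test and for filtering, so the rename round-trip above is optional:

```python
import pandas as pd

df = pd.DataFrame({"Recipient Fullname": ["Alice", "DUMMY", "Bob"],
                   "Date Time": ["t1", "t2", "t3"]})

# Membership test on a column whose name contains a space:
print("t2" in df["Date Time"].values)  # True

# Filtering on a spaced column name, no renaming needed:
df = df[df["Recipient Fullname"] != "DUMMY"]
print(df["Recipient Fullname"].tolist())  # ['Alice', 'Bob']
```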

Replicating Excel's VLOOKUP in Python Pandas

Would really appreciate some help with the following problem. I'm intending on using Pandas library to solve this problem, so would appreciate if you could explain how this can be done using Pandas if possible.
I want to take the following excel file:
Before
and:
1) convert the 'before' file into a pandas data frame
2) look for the text in the 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in the 'Owner' column under 'Output'.
3) the result should look like the 'After' file. I would like to convert this back into CSV format.
After
So essentially this is similar to an Excel VLOOKUP exercise, except it's not an exact match we're looking for between the 'Site' and 'Domain' columns.
I have already attempted this in Excel, but I'm looking at over 100,000 rows, compared against over 1,000 sites, which crashes Excel.
I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this, e.g. storing the lookup list in a separate data frame altogether, then that's fine.
Thanks in advance for any help, I really appreciate it.
Colin
I think the OP's question differs somewhat from the solutions linked in the comments which either deal with exact lookups (map) or lookups between dataframes. Here there is a single dataframe and a partial match to find.
import pandas as pd
import numpy as np
df = pd.ExcelFile('data.xlsx').parse(0)
df = df.astype(str)
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'],axis=1)
df['Output'] = np.where(df['Test']==True, df['Owner'], '')
df
The lambda applies the in test row by row (axis=1), returning a boolean in Test. This then acts as the rule for looking up Owner and placing it in Output.
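Since the original reads from data.xlsx, here is the same logic on a small invented frame so it can be run stand-alone:

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the Excel data.
df = pd.DataFrame({
    "Domain": ["shop.acme.com", "blog.other.org", "acme.co.uk"],
    "Site":   ["acme",          "nomatch",        "acme.co"],
    "Owner":  ["Alice",         "Bob",            "Carol"],
})

df = df.astype(str)
# Row-wise partial match: is this row's Site a substring of its Domain?
df["Test"] = df.apply(lambda x: x["Site"] in x["Domain"], axis=1)
df["Output"] = np.where(df["Test"], df["Owner"], "")
print(df["Output"].tolist())  # ['Alice', '', 'Carol']
```

The final frame can then be written back out with df.to_csv(...) as the question asks.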

changing column types of a pandas data frame -- finding offending rows that prevent casting

My pandas data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But PANDAS gives me an error saying that an object can't be recast as float.
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there.
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of Python, but I feel that if I'm going to take Python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How can I search object columns for strings in pandas? Any advice?
Pandas: change data type of columns
Of course it does, because you're only looking in the column list!
'17_d' in pdos
checks whether '17_d' is in pdos.columns.
So what you want to do is pdos[cols] == '17_d', which will give you a truth table. If you want to find which rows it is in, you can do (pdos[cols] == '17_d').any(axis=1)
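A runnable sketch of both points, on an invented frame containing the offending value:

```python
import pandas as pd

# Invented frame where one cell prevents casting to float.
pdos = pd.DataFrame({"a": ["1.5", "2.0", "17_d"], "b": ["3", "4", "5"]})
cols = ["a", "b"]

# 'in' on a DataFrame only checks the column labels, hence the surprise above:
print("17_d" in pdos)  # False

# Cell-wise search instead:
mask = (pdos[cols] == "17_d").any(axis=1)
print(pdos.index[mask].tolist())  # [2]
```

A related trick, not mentioned in the answer, is pd.to_numeric(pdos['a'], errors='coerce'), which turns every non-castable cell into NaN so the offending rows can be located with .isna().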
