Extracting data from a column in a data frame in Python

I want to extract the "A" values from this column. After doing that, I want to print the data from the other columns in the rows where the value is "A".
However, my code printed this instead:
UniqueCarrier NaN
CancellationCode NaN
Name: CancellationCode, dtype: object
None
The column CancellationCode looks like this:
CancellationCode:
NaN
A
NaN
B
NaN
I want to get it to print in a data frame format with the filtered rows and columns.
Here is my code below:
cancellation_reason = (flight_data_finalcopy["CancellationCode"] == "A")
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
print(display(cancellation_reasons_filtered))

Your comparison only builds a boolean mask (a Series of True/False values), and your code then tried to select columns from that mask instead of from the dataframe. Use the mask to index the dataframe itself. Also, display() renders the frame and returns None, which is why your output ended with None; don't wrap it in print(). Try this:
cancellation_reason = flight_data_finalcopy[flight_data_finalcopy["CancellationCode"] == "A"]
cancellation_reasons_filtered = cancellation_reason[["UniqueCarrier", "AirlineID", "Origin"]]
display(cancellation_reasons_filtered)
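For reference, here is a minimal, self-contained sketch of the same boolean-mask pattern; the data below is made up purely for illustration:

import pandas as pd

df = pd.DataFrame({
    'CancellationCode': [None, 'A', None, 'B', None],
    'UniqueCarrier': ['AA', 'UA', 'DL', 'WN', 'AA'],
})

mask = df['CancellationCode'] == 'A'  # boolean Series: True where the code is "A"
print(df[mask])                       # only the rows where CancellationCode is "A"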

Related

Python - looping through rows and concatenating rows until a certain value is encountered

I am getting very confused over a problem with a short Python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd
#this function will take a raw AGS file (saved as a CSV) and convert to a
#dataframe.
#it will take the AGS CSV and print the top 5 header lines
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    #print(raw_df.head())
    return raw_df
import_df = AGS_raw('test.csv')
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells at the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: raw text of the CSV
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The reason that "nothing happens" is likely because of the conditions you're using in if and elif.
Neither "**PROJ" == True nor "**ABBR" == True will ever be True, because neither the string "**PROJ" nor "**ABBR" is equal to True. Your code is equivalent to:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df
Which is the same as:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
You also always return from inside the loop and df_new_row isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd
df = pd.read_csv('ags.csv', header=None)
# Drop rows which consist of all NaN (Not a Number) / missing values.
# Reset index order from 0 to the end of dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)
# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**', na=False)].append(pd.Index([len(df)]))  # na=False guards against non-string cells
# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k], 0]
    dfs[table_name] = df.iloc[idx[k]+1:idx[k+1]].reset_index(drop=True)
# Print the titles and tables.
for k, v in dfs.items():
    print(k)
    print(v)
# **PROJ
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# **ABBR
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# You can rename the columns, for example (set_axis's inplace argument was
# removed in pandas 2.0, so assign the result back):
dfs['**PROJ'] = dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns')
print(dfs['**PROJ'])
# data1 data2
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0

Remove NaN from table data in Python?

I'm using BS4 to pull a table from an HTML webpage and trying to add it to a pandas dataframe, but what I pull is very sloppy and I can't get it to print properly. Can anyone help?
There is only one table on the webpage. This is the code I'm using, and what it pulls:
soup = BeautifulSoup(driver.page_source,'html.parser')
df = pd.read_html(str(soup))
print (df)
results:
[ Unnamed: 0 Student Number Student Name Placement Date
0 NaN 20808456 Sandy Gurlow 01/13/2023
1 NaN NaN NaN NaN]
But I've tried to use:
df.dropna(inplace=True)
And I get this error:
AttributeError: 'list' object has no attribute 'dropna'
pandas.read_html returns a list of dataframes, with as many dataframes as it found tables in the input.
You need to use:
df = pd.read_html(driver.page_source)[0]
Or, to avoid an IndexError when no table is found:
l = pd.read_html(driver.page_source)
if l:
    df = l[0]
else:
    print('no table found')
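Once you have a single dataframe, dropna works as expected. As a sketch of cleaning up the output above (assuming the table really has an all-NaN 'Unnamed: 0' column and an all-NaN trailing row, as shown in the question):

df = pd.read_html(driver.page_source)[0]
df = df.dropna(axis='columns', how='all')  # drop columns that are entirely NaN
df = df.dropna(axis='rows', how='all')     # drop rows that are entirely NaN
print(df)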

Filtering function for pandas - Viewing NaN values within a column

Function I have created:
#Create a function that identifies blank values
def GPID_blank(df, variable):
    df = df.loc[df['GPID'] == variable]
    return df
Test:
variable = ''
test = GPID_blank(df, variable)
test
Goal: Create a function that can filter any dataframe column 'GPID' to see all of the rows where GPID has missing data.
I have tried variable = 'NaN' and still no luck. I know the function itself works: if I use a real value such as "OH82CD85", it filters my dataset correctly.
So why doesn't it find the blank cells with variable = 'NaN'? I know that in my dataset there are 5 rows where GPID is missing.
Example df:
df = pd.DataFrame({'Client': ['A','B','C'], 'GPID':['BRUNS2','OH82CD85','']})
Client GPID
0 A BRUNS2
1 B OH82CD85
2 C
Sample of GPID column:
0 OH82CD85
1 BW07TI20
2 OW36HW81
3 PE56TA73
4 CT46SX81
5 OD79AU80
6 GF46DB60
7 OL07ST01
8 VP38SM57
9 AH90AE61
10 PG86KO78
11 NaN
12 NaN
13 SO21GR72
14 DY85IN90
15 KW80CV02
16 CM15QP83
17 VC38FP82
18 DA36RX05
19 DD74HD38
You can't use == with NaN. NaN != NaN.
Instead, you can modify your function a little to check if the parameter is NaN using pd.isna() (or np.isnan()):
def GPID_blank(df, variable):
    if pd.isna(variable):
        return df.loc[df['GPID'].isna()]
    else:
        return df.loc[df['GPID'] == variable]
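A quick interactive check illustrates why an equality test can never match a missing value:

import numpy as np

print(np.nan == np.nan)  # False: NaN compares unequal to everything, even itself
print(np.isnan(np.nan))  # True: use a dedicated missing-value test instead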
You can't match NaN values with an ordinary equality expression. Note that in your example dataframe, '' is not NaN but a str, and it can be matched with an equality expression. Instead, you need to detect when the caller wants to filter for NaN, and filter differently:
def GPID_blank(df, variable):
    if pd.isna(variable):
        df = df.loc[df['GPID'].isna()]
    else:
        df = df.loc[df['GPID'] == variable]
    return df
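Note that in the question's example dataframe, the blank GPID is an empty string '' rather than a real NaN, so neither branch above treats it as missing. A small sketch (assuming blanks are always empty strings) that normalizes them first:

import numpy as np

# Turn empty strings into real missing values so .isna() can find them.
df['GPID'] = df['GPID'].replace('', np.nan)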
It's not working because with variable = 'NaN' you're searching for a string whose content is 'NaN', not for missing values.
You can try:
import pandas as pd
def GPID_blank(df):
    # filtered dataframe with NaN values in GPID column
    blanks = df[df['GPID'].isnull()].copy()
    return blanks
filtered_df = GPID_blank(df)

How to edit a dataframe row by row while iterating?

So I am using a script that reads a CSV into a dataframe and then scrapes price data using the tickers from that dataframe. The original dataframe has the following columns; note there is NO 'Price' column.
df.columns = ['Ticker TV', 'Ticker YF', 'TV Name', 'Sector', 'Industry', 'URLTV']
I've printed below the first couple of rows from my "updated" dataframe:
Ticker TV Ticker YF ... URLTV Price
1 100D 100D.L ... URL NaN
2 1GIS 1GIS.L ... URL NaN
3 1MCS 1MCS.L ... URL NaN
... ... ... ... ... ...
2442 ZYT ZYT.L ...URL NaN
100D.L NaN NaN .. NaN 9272.50
1GIS.L NaN NaN ...NaN 8838.50
1MCS.L NaN NaN ...NaN 5364.00
As you can see, it's not working as intended. I would like to create a new column named Price and attach each price to the correct ticker, so 100D.L should get 9272.50; then, when the script iterates to the next ticker, it should add the next price value to 1GIS, and so forth.
tickerList = df['Ticker YF']
for tick in tickerList:
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of '+tick+' is '+str(currentPriceData))
    df.at[tick,'Price'] = currentPriceData
Assign the price using the apply method:
df['Price'] = df['Ticker YF'].apply(lambda x: str(priceData(getSummary(x))))
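For context: the root problem in the original loop is that df.at[tick, 'Price'] uses the ticker string as a row label, so pandas appends new rows instead of updating existing ones. Here is a sketch of a loop that keeps the row labels aligned (it assumes getSummary and priceData work as in the question):

for idx, tick in df['Ticker YF'].items():
    # idx is the existing row label, so the price lands on the correct row
    df.at[idx, 'Price'] = priceData(getSummary(tick))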
tick is just the value from your 'Ticker YF' column, so you can use enumerate to also get a counter. Since the dataframe index in your output starts at 1, row idx+1 is the current row and row idx is the previous one, which lets you add the previous price to the current one:
tickerList = df['Ticker YF']
for idx, tick in enumerate(tickerList):
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of '+tick+' is '+str(currentPriceData))
    if idx != 0:
        df.at[idx+1,'Price'] = float(currentPriceData) + float(df.at[idx,'Price'])
    else:
        df.at[idx+1,'Price'] = float(currentPriceData)
A more "elegant" idea could be something like:
df["Single_Price"]=df["Ticker YF"].apply(lambda x: priceData(getSummary(x)))
to get the single prices. Then create the next column with the accumulated prices:
df["Price"] = df["Ticker YF"].apply(lambda x: df["Single_Price"][df["Ticker YF"] < x].sum())
This will add up every Single_Price (df["Single_Price"]) from every row whose ticker comes before the current row's ticker x (df["Ticker YF"] < x), and it creates a new column Price in your dataframe.
After that, you can simply delete the single prices if you don't need them with:
del df["Single_Price"]
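If the goal is simply a running total, pandas' built-in cumulative sum does the same thing more directly (a sketch, assuming Single_Price holds numeric values):

df['Price'] = df['Single_Price'].astype(float).cumsum()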

Python pandas split column with NaN values

Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a CSV file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2 or 3 values separated by commas (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
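As a side note, the question's original one-liner also works once the comma separator is passed explicitly, since str.split() with no arguments splits on whitespace. This sketch assumes at least one row contains all three values, so the split yields exactly three columns:

products_combined[['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']] = (
    products_combined.pop('CATEGORY').str.split(',', expand=True)
)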
I think this could be accomplished by first splitting the column into lists, then creating three new columns, assigning each via a lambda function applied to the 'CATEGORY' column. Like so:
products_combined['CATEGORY'] = products_combined['CATEGORY'].str.split(',')
products_combined['SUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
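A tiny illustration of that behavior on toy data:

s = pd.Series(['1,2', '3'])
print(s.apply(lambda v: v.split(',')))
# 0    [1, 2]
# 1       [3]
# dtype: object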
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
    .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
