Split data into a 3-column dataframe - Python

I'm having trouble parsing a data file into a data frame. When I read the data using pandas, I get a one-column data frame with all the information.
Server
7.14.182.917 - - [20/Dec/2018:08:30:21 -0500] "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583
7.18.134.196 - - [20/Dec/2018:07:40:13 -0500] "HEAD / HTTP/1.0" 502 -
...
I want to parse the data into three columns. I tried df[['Server', 'Date', 'Address']] = pd.DataFrame([x.split() for x in df['Server'].tolist()]), but I'm getting the error ValueError: Columns must be same length as key.
Is there a way to parse the data into 3 columns as follows?
Server Date Address
7.14.182.917 20/Dec/2018:08:30:21 -0500 "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583

Multiple approaches can be taken here, depending on the input file type and format. If the file is a valid string path, try one of these:
import pandas as pd
# approach 1
df = pd.read_fwf('inputfile.txt')
# approach 2
df = pd.read_csv("inputfile.txt", sep = "\t") # check the delimiter
# then select the columns you want
df_subset = df[['Server', 'Date', 'Address']]
Full solution:
import pandas as pd
# read in text file
df = pd.read_csv("test_input.txt", sep=" ", error_bad_lines=False)
# convert df to string
df = df.astype(str)
# get num rows
num_rows = df.shape[0]
# get IP from index, then reset index
df['IP'] = df.index
# reset index to proper index
new_index = pd.Series(list(range(num_rows)))
df = df.set_index([new_index])
# rename columns and drop old cols
df = df.rename(columns={'Server': 'Date', 'IP': "Server"})
# create Date col, drop old col
df['Date'] = df.Date.str.cat(df['Unnamed: 1'])
df = df.drop(["Unnamed: 1"], axis=1)
# Create address col, drop old col
df['Address'] = df['Unnamed: 2'] + df['Unnamed: 3'] + df['Unnamed: 4']
df = df.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"], axis=1)
# Strip brackets, other chars
df['Date'] = df['Date'].str.strip("[]")
df['Server'] = df["Server"].astype(str)
df['Server'] = df['Server'].str.strip("()-'', '-',")
This returns a dataframe with the three requested columns: Server, Date, and Address.
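As an alternative sketch (not part of the original answer): since every log line has the same shape, a single regex with three capture groups can do the split in one pass with str.extract. The pattern below is an assumption based on the two sample lines from the question.
import pandas as pd

# Sample rows copied from the question's one-column frame
df = pd.DataFrame({'Server': [
    '7.14.182.917 - - [20/Dec/2018:08:30:21 -0500] "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583',
    '7.18.134.196 - - [20/Dec/2018:07:40:13 -0500] "HEAD / HTTP/1.0" 502 -',
]})
# Three capture groups: the IP, the bracketed timestamp, and everything after it
pattern = r'^(\S+) - - \[(.*?)\] (.*)$'
df[['Server', 'Date', 'Address']] = df['Server'].str.extract(pattern)
print(df)
Because str.extract always returns exactly three columns here (NaN for lines that don't match), the three-key assignment no longer raises Columns must be same length as key.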

Related

Having trouble removing the first column in a data frame

I am trying to remove the unnamed column in my data frame.
Here is a snip of the dataframe (screenshot not shown).
How do I get rid of the unnamed column highlighted in pink?
Here is my code:
df = pd.read_csv("/nalt_labels_DATA/nalt_altlabels.csv")
df['randomint'] = np.random.randint(100, 500, size=len(df))
df['Nalt_URI_suffix'] = df.loc[:,'NALT_URI'].astype(str) + '_' + df.loc[:,'randomint'].astype(str)
df.drop(['randomint', 'NALT_URI'], axis=1)
df1 = df[['Nalt_URI_suffix', 'Label']]
df1.to_csv("/nalt_labels_DATA/nalt_altlabels_suffix.csv")
csv.writer(open("/nalt_labels_DATA/nalt_altlabels_suffix.tsv", 'w+'), delimiter='\t').writerows(csv.reader(open("/nalt_labels_DATA/nalt_altlabels_suffix.csv")))
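A hedged note, since the screenshot is not available here: an Unnamed: 0 column usually comes from reading back a CSV that was written with its index. If that is the cause, a minimal sketch (reusing df and df1 from the code above):
# Option 1: don't write the index out in the first place
df1.to_csv("/nalt_labels_DATA/nalt_altlabels_suffix.csv", index=False)

# Option 2: drop any already-present 'Unnamed' columns after reading
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]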

Python pandas dataframe column and index formatting issues

I am trying to output a pandas dataframe and I am only getting one column (PADD 5) instead of all five (PADD 1 through PADD 5). In addition, I cannot get the index to format as YYYY-MM-DD. I would appreciate it if anyone knew how to fix these two things. Thanks much!
import requests
import pandas as pd

# API Key from EIA
api_key = 'xxxxxxxxxxx'
# api_key = os.getenv("EIA_API_KEY")
# PADD Names to Label Columns
# Change to whatever column labels you want to use.
PADD_NAMES = ['PADD 1','PADD 2','PADD 3','PADD 4','PADD 5']
# Enter all your Series IDs here separated by commas
PADD_KEY = ['PET.MCRRIP12.M',
'PET.MCRRIP22.M',
'PET.MCRRIP32.M',
'PET.MCRRIP42.M',
'PET.MCRRIP52.M']
# Initialize list - this is the final list where you will store all the data from the json pull. Then you will use this list to concat into a pandas dataframe.
final_data = []
# Choose start and end dates
startDate = '2009-01-01'
endDate = '2023-01-01'
for i in range(len(PADD_KEY)):
    url = 'https://api.eia.gov/series/?api_key=' + api_key + '&series_id=' + PADD_KEY[i]
    r = requests.get(url)
    json_data = r.json()
    if r.status_code == 200:
        print('Success!')
    else:
        print('Error')
    print(json_data)
    df = pd.DataFrame(json_data.get('series')[0].get('data'),
                      columns=['Date', PADD_NAMES[i]])
    df.set_index('Date', drop=True, inplace=True)
    final_data.append(df)
# Combine all the data into one dataframe
crude = pd.concat(final_data, axis=1)
# Create date as datetype datatype
crude['Year'] = crude.index.astype(str).str[:4]
crude['Month'] = crude.index.astype(str).str[4:]
crude['Day'] = 1
crude['Date'] = pd.to_datetime(crude[['Year','Month','Day']])
crude.set_index('Date',drop=True,inplace=True)
crude.sort_index(inplace=True)
crude = crude[startDate:endDate]
crude = crude.iloc[:,:5]
df.head()
        PADD 5
Date
202201    1996
202112    2071
202111    2125
202110    2128
202109    2232
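One hedged observation on this output: df.head() at the end prints the last per-series frame built inside the loop, which only holds PADD 5 and the raw YYYYMM dates. The combined frame crude already has all five PADD columns and the datetime index built above, so inspecting it instead should show both fixes. A minimal sketch:
# Inspect the combined, date-indexed frame rather than the loop's last df
print(crude.head())

# Alternative (assumes the raw EIA dates are YYYYMM strings such as '202201'):
# the index could be parsed in one step instead of building Year/Month/Day columns
# crude.index = pd.to_datetime(crude.index, format='%Y%m')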

Pandas reorder row content

I have the following Excel file (screenshot not shown), which I converted to a DataFrame and from which I dropped 2 columns using the code below:
df = pd.read_excel(self.file)
df.drop(['Name', 'Scopus ID'], axis=1, inplace=True)
Now, my goal is to switch the order of every name within the df.
For example, the first name is Adedokun, Babatunde Olubayo, which I would like to convert to Babatunde Olubayo Adedokun.
How can I do that for the entire df, whatever the name is?
Split the name and concatenate the parts back together:
import pandas as pd
data = {'Name': ['Adedokun, Babatunde Olubayo', "Uwizeye, Dieudonné"]}
df = pd.DataFrame(data)
def swap_name(name):
    name = name.split(', ')
    return name[1] + ' ' + name[0]
df['Name'] = df['Name'].apply(swap_name)
df
Output:
                         Name
0  Babatunde Olubayo Adedokun
1           Dieudonné Uwizeye
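If some cells might not contain a comma-separated pair, a slightly more defensive variant of swap_name (my own guard, not part of the original answer) avoids an IndexError:
def swap_name(name):
    parts = name.split(', ')
    if len(parts) != 2:
        return name  # leave names without a ", " separator untouched
    return parts[1] + ' ' + parts[0]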
Let's assume you want to do the operation on the "Other Names1" column:
df.loc[:, "Other Names1"] = df["Other Names1"].str.split(", ").apply(lambda parts: " ".join(parts[::-1]))
You can use the str accessor:
df['Name'] = df['Name'].str.split(', ').str[::-1].str.join(' ')
print(df)
# Output
                         Name
0  Babatunde Olubayo Adedokun
1           Dieudonné Uwizeye

Python pandas df - Columns must be same length as key

I have a dataframe I created by scraping this PDF with tabula. I'm trying to create a point column using geocoder - but I keep getting a Columns must be same length as key error. My code, as well as a link to the PDF is below:
PDF: https://drive.google.com/file/d/1m-KCmEIFlmyVcfYKTTwMaBpH6V5voreH/view?usp=sharing
import tabula
import pandas as pd
import numpy as np
import re
### Scrape and clean
dsf = tabula.read_pdf('/content/drive/MyDrive/Topcondoimage 11-22-2021.pdf', pages='all',lattice=True)
df = dsf[0]
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.iloc[: , 1:]
df = df.replace(np.nan, 'Not Available', regex=True)
df['geo_Address'] = df['Building / Address / City']
df['geo_Address'] = df['geo_Address'].map(lambda x: re.sub(r'\r', ' ', x))
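# geolocator is assumed to be a geopy geocoder instance (e.g. Nominatim) created earlier in the notebook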
df['loc'] = df['geo_Address'].apply(geolocator.geocode, timeout=10)
df['point'] = df['loc'].apply(lambda loc: tuple(loc.point) if loc else None)
df = df.rename(columns={'Building / Address / City': 'building_address_city','Days on\rMarket':'days_on_market','Price /\rSq. Ft.':'price_per_sqft'})
df.reset_index(drop=True, inplace=True)
df[['lat','lon','altitude']] = pd.DataFrame(df['point'].to_list(),index=df.index)
That last line is what triggers the error.
I've tried removing special characters and resetting the index.
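One hedged guess at the cause: df['point'] holds None wherever a geocode fails, so pd.DataFrame(df['point'].to_list()) can end up with fewer than three columns (a single all-None column if every lookup failed), and assigning that to the three keys ['lat', 'lon', 'altitude'] raises Columns must be same length as key. A defensive sketch that guarantees three columns (the (None, None, None) placeholder is my own choice):
# Pad failed geocodes so every row expands to exactly three values
points = df['point'].apply(lambda p: p if p is not None else (None, None, None))
df[['lat', 'lon', 'altitude']] = pd.DataFrame(points.to_list(), index=df.index)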

Python pandas says columns can't be found but they exist within a csv file

So I have this script
import pandas as pd
import numpy as np
PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'
def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads in the csv file from the given path, parses it based on '\t' delimiters, and casts the data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # analyzes the 'Hugo_Symbol' heading within the data and makes a new dataframe where any row that contains 'Hugo_Symbol' is dropped
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends '\'' to all the data remaining in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips away whitespace from the data within this heading
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # creates a new dataframe where the data within the column 'Variant_Classification' is not equal to 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops all the rows that are missing at least one element
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
    gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
This is the csv file that the script reads in:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
The script gives this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-60-3f9c00f320bc> in <module>
----> 1 prep_data('test.csv')
<ipython-input-59-2a67d5c44e5a> in prep_data(mutation_path)
21 display(gene_mutation_df)
22 gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
---> 23 gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
24 gene_mutation_df = gene_mutation_df.reset_index()
25 gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
e:\Anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
4546
4547 if missing:
-> 4548 raise KeyError(f"None of {missing} are in the columns")
4549
4550 if inplace:
KeyError: "None of ['Hugo_Symbol', 'patient'] are in the columns"
Previously, I had this as that line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
But that also gave an error that set_names expects one argument but got two.
Any help would be much appreciated.
I would really prefer it if the csv data were changed instead of the script, and the script could somehow work with set_names instead of set_index.
The issue is:
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
'Hugo_Symbol' is used for a groupby, so now it's in the index, not a column.
In the case of the sample data, an empty dataframe with no columns has been created.
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
print(gene_mutation_df) # print the dataframe to see what it looks like
print(gene_mutation_df.info()) # print the information for the dataframe
gene_mutation_df.columns = gene_mutation_df.columns.str.strip()
gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
# output
Empty DataFrame
Columns: [identifier, Hugo_Symbol, Tumor_Sample_Barcode, Variant_Classification, patient]
Index: []
Empty DataFrame
Columns: []
Index: []
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame
None
Reset the index
Resetting the index will make Hugo_Symbol a column again.
As long as the dataframe is not empty, the KeyError should be resolved.
gene_mutation_df = gene_mutation_df.reset_index() # try adding this line
gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)
Additional Notes
There are a number of lines of code that may result in an empty dataframe:
non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
Test if the dataframe is empty
Use .empty to determine if a dataframe is empty
def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads in the csv file from the given path, parses it based on '\t' delimiters, and casts the data to str
    df.columns = df.columns.str.strip()  # clean the column names here if there is leading or trailing whitespace.
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # analyzes the 'Hugo_Symbol' heading within the data and makes a new dataframe where any row that contains 'Hugo_Symbol' is dropped
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends '\'' to all the data remaining in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips away whitespace from the data within this heading
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # creates a new dataframe where the data within the column 'Variant_Classification' is not equal to 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops all the rows that are missing at least one element
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df = gene_mutation_df.reset_index()  # reset the index here
    print(gene_mutation_df)
    if gene_mutation_df.empty:  # check if the dataframe is empty
        print('The dataframe is empty')
    else:
        # gene_mutation_df.set_index(['Hugo_Symbol', 'patient'], inplace=True)  # this is not needed, pivot won't work if you do this
        # gene_mutation_df = gene_mutation_df.reset_index()  # this is not needed, the dataframe was reset already
        gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')  # values needs to be a column in the dataframe
        return gene_patient_mutations.transpose().fillna(0)
