I am writing code to identify the columns in a dataframe that have the word "date" in their names. I am using RegEx to compare the sub-strings generated from the original names with the re.split() function.
This is the entire code:
import pandas as pd
import numpy as np
import re
df = pd.read_excel(r'C:\Users\rishi\Desktop\PGDBDA\Dataset\Dataset for Date Operations.xlsx')
#print(df)
# Dataset is loaded into Pandas dataframe
column_name = [names for names in df.columns]
#print(column_name)
# The column names are extracted into a list called column_name.
# We plan to use a mechanism to identify the sub-string 'date' in the elements of column_name.
name_split = []
for index in column_name:
    name_split.append(re.split(' |-|_',index))
#print(name_split)
# Using RegEx we are able to split the elements in the column names based on a set of delimiters.
# We are grouping them in a list of lists named name_split.
column_index = []
column_count = 0
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
for index in name_split:
    for elements in index:
        if re.search(regex_pattern, elements) != None:
            column_index.append(column_count)
            exit()
    column_count+=1
print(column_index)
# Will tell us all the columns with 'date' in their names, by stating the index no of the column.
The issue is that every time I run this portion of the code:
column_index = []
column_count = 0
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
for index in name_split:
    for elements in index:
        if re.search(regex_pattern, elements) != None:
            column_index.append(column_count)
            exit()
    column_count+=1
print(column_index)
# Will tell us all the columns with 'date' in their names, by stating the index no of the column.
The console keeps crashing and reloading. Any insight on this issue will be highly appreciated.
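For context, here is a minimal self-contained sketch of the same technique, run against a made-up DataFrame (the column names below are invented purely for illustration and are not from the original dataset):
import re
import pandas as pd
# hypothetical columns standing in for the Excel sheet
df = pd.DataFrame(columns=['Invoice Date', 'Customer_Name', 'due-date', 'Amount'])
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
column_index = []
for count, name in enumerate(df.columns):
    # split on space, hyphen or underscore, then look for the word 'date'
    if any(regex_pattern.search(part) for part in re.split(' |-|_', name)):
        column_index.append(count)
print(column_index)  # [0, 2] for the made-up columns above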
Related
I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code for, at least, parsing the first column and transposing the id and the full name of each user. Could you help with this?
The way that I would tackle it, and I am assuming there are likely to be more efficient ways, is to import the excel file into a dataframe, and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line into a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT:
I want to know how to extract a particular table column from a PDF file in Python.
My code so far:
import tabula.io as tb
from tabula.io import read_pdf
dfs = tb.read_pdf(pdf_path, pages='all')
print(len(dfs))  # It displays 73
I am able to access an individual table column by doing print(dfs[2]['Section ID']).
I want to know how I can search for a particular column in all the data frames using a for loop.
I want to do something like this:
for i in range(len(dfs)):
    if (dfs[i][2]) == 'Section ID ' //(This gives invalid syntax)
        print dfs[i]
If you have only one dataframe with a Section ID column (or are interested only in the first dataframe with this column), you can iterate over the list returned by read_pdf, check for the column's presence with 'Section ID' in df.columns, and break when a match is found.
import tabula.io as tb
from tabula.io import read_pdf
df_list = tb.read_pdf(pdf_path, pages='all')
for df in df_list:
    if 'Section ID' in df.columns:
        break
print(df)
If you may have multiple dataframes with the Section ID column, you can use a list comprehension to filter them and get a list of the dataframes with that column name.
dfs_section_id = [df for df in df_list if 'Section ID' in df.columns]
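For example, reusing the access pattern from the question, you could then print the Section ID column of every matching dataframe:
for df in dfs_section_id:
    print(df['Section ID'])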
So the code below creates three data frames, one per year. Each data frame is essentially the same, except that each year has different stats for how players did. However, the header at the top of the data frame gets repeated every 20 rows or so, and I'm trying to figure out how to get rid of it. I figured that if I searched the "Player" column for every instance where "Player" is repeated within the column, I could find the occurrences and delete the rows they occur in. At the end of my code, I ran a print to see how many times the header row occurs within the data, and it comes out to 20 times. I just can't figure out how to delete those rows.
import pandas as pd
year = ["2018", "2019", "2020"]
str = "https://www.pro-football-reference.com/years/{}/fantasy.htm"
url = str.format(year)
urlList = []
for season in year:
    url = str.format(season)
    urlList.append(url)
df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)
print(df2020)
print(sum(df2020[0]["Player"] == "Player"))
P.S. I thought there was a way to reference a data frame variable by using the form of: dataframe.variable ??
This should work:
import pandas as pd
year = ["2018", "2019", "2020"]
str = "https://www.pro-football-reference.com/years/{}/fantasy.htm"
url = str.format(year)
urlList = []
for season in year:
    url = str.format(season)
    urlList.append(url)
df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)
df2020 = df2020[0]
df2020 = df2020[df2020['Rk'] != 'Rk']
print(df2020.head(50))
It filters the Rk column for the value "Rk" and excludes those rows when creating the new dataframe. I only ran the code for 2020, but you can repeat it for the other dataframes (a looped version is sketched below).
As a note, pd.read_html() returns a list of dataframes rather than a single dataframe, because an HTML page or file can contain multiple tables. That's why I included this line of code: df2020 = df2020[0], which selects the first dataframe from the list.
If you need to reset the index, add this code to the end:
df2020 = df2020.reset_index(drop=True)
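Building on that, here is a sketch of the same clean-up applied to every season in one loop, reusing the year and urlList variables from above (the cleaned dictionary name is just an illustration):
cleaned = {}
for season, url in zip(year, urlList):
    tables = pd.read_html(url, header=1)   # list of tables on the page
    df = tables[0]                         # the fantasy stats table
    df = df[df['Rk'] != 'Rk']              # drop the repeated header rows
    cleaned[season] = df.reset_index(drop=True)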
I have a text file formatted like:
item(1) description="Tofu" Group="Foods" Quantity=5
item(2) description="Apples" Group="Foods" Quantity=10
What's the best way to read this style of format in Python?
Here's one way you could do this in pandas to get a DataFrame of your items.
(I copy-pasted your text file into "test.txt" for testing purposes.)
This method automatically assigns column names and sets the item(...) column as the index. You could also assign the column names manually, which would change the script a bit.
import pandas as pd
# read in the data
df = pd.read_csv("test.txt", delimiter=" ", header=None)
# set the index as the first column
df = df.set_index(0)
# capture our column names, to rename columns
column_names = []
# for each column...
for col in df.columns:
    # extract the column name
    col_name = df[col].str.split("=").str[0].unique()[0]
    column_names.append(col_name)
    # extract the data
    col_data = df[col].str.split("=").str[1]
    # optional: remove the double quotes
    try:
        col_data = col_data.str.replace('"', "")
    except AttributeError:
        pass
    # store just the data back in the column
    df[col] = col_data
# store our new column names
df.columns = column_names
There are probably a lot of ways to do this based on what you're trying to accomplish and how much variation you expect in the data.
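As one alternative sketch, assuming every line follows the item(n) key="value" pattern shown above (and reusing the "test.txt" file name assumed earlier), you could parse each line with a regular expression and build the DataFrame from a list of dicts:
import re
import pandas as pd

rows = []
with open("test.txt") as fh:
    for line in fh:
        # grab every key=value or key="value" pair on the line
        pairs = dict(re.findall(r'(\w+)="?([^"\s]+)"?', line))
        if pairs:
            rows.append(pairs)

df = pd.DataFrame(rows)
print(df)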
Data frame with 2 columns: old_path and new_path. Data frame can contain hundreds of rows.
The script iterates over a list of files.
For each file in the list, check if any part of its folder path matches a value in the old_path column. If there is a match, replace the file's matched old_path with the corresponding new_path value.
I achieved this with for index, row in df.iterrows(): or for row in df.itertuples():, but I'm thinking there should be a more efficient way to do it without having to use the second for loop.
Any help is appreciated. The sample below uses df.iterrows():
import pandas as pd
import os
df = pd.read_csv('path_lookup.csv')
# df:
# old_path new_path
# 0 F:\Business\Budget & Forecasting M:\Business\Finance\Forecast
# 1 F:\Business\Treasury Shared M:\Business\Finance\Treasury
# 2 C:\Temp C:\NewTemp
excel_link_analysis_list = [
    {'excel_filename': 'C:\\Temp\\12345\\Distribution Adjusted Claim.xlsx',
     'file_read': 'OK'},
    {'excel_filename': 'C:\\Temp\\SubFolder\\cost estimates.xlsx',
     'file_read': 'OK'}
]
for i in excel_link_analysis_list:
    for index, row in df.iterrows():
        if row['old_path'].lower() in i['excel_filename'].lower():
            dest_path_and_file = i['excel_filename'].lower().replace(row['old_path'].lower(),
                                                                     row['new_path'].lower())
            print(dest_path_and_file)
prints:
c:\newtemp\12345\distribution adjusted claim.xlsx
c:\newtemp\subfolder\cost estimates.xlsx
Yes, pandas has nice built-in string comparison functions; see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html#pandas.Series.str.contains
This is how you could use Series.str.contains to get the index of the matching value (i.e. from the old_path column). You could then use that index to go back and get the value of new_path.
Edit: updated to handle the case where new_path_matches has one value.
import pandas as pd

old_path = df['old_path']
new_path = df['new_path']
# 'filenames' is assumed to hold the file names to check, e.g.
# [d['excel_filename'] for d in excel_link_analysis_list]
for filename in filenames:
    b = old_path.str.contains(filename, regex=False)
    # Get the index of matches from the `old_path` column
    indices_of_matches = b[b].index.values
    # use the index of matches to get the corresponding `new_path` values
    new_path_matches = new_path.loc[indices_of_matches]
    if new_path_matches.values.shape[0] > 0:
        print(new_path_matches.values[0])  # print the new_path value
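One possible alternative, sketched under the assumption that the goal is the direction described in the question (finding which old_path value appears inside each file name) while avoiding the inner iterrows() loop: hold the file names in a Series and apply a literal, lower-cased replacement per old_path/new_path pair. This reuses the df and excel_link_analysis_list from the question; the files variable name is made up.
files = pd.Series([d['excel_filename'].lower() for d in excel_link_analysis_list])
# apply every old_path -> new_path substitution across the whole Series at once
for old, new in zip(df['old_path'].str.lower(), df['new_path'].str.lower()):
    files = files.str.replace(old, new, regex=False)
print(files.tolist())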