I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to write Pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way I would tackle it (and I assume there are likely more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each completed line to a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your Excel file is named 'data.xlsx' and is in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the Excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # header row for a person: "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # strip the trailing closing bracket
        p_index = counter
        counter += 1
    else:
        # detail row: a date in col1 and an amount in col2
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
OUTPUT:
I have a directory containing several excel files. I want to create a DataFrame with a list of the filenames, a count of the number of rows in each file, and a min and max column.
Example file 1:
Example file 2:
Desired result:
This is as far as I've gotten:
fileslist = os.listdir(folder)
for file in fileslist:
    str = file
    if not str.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
From here, I can't figure out how to create the final DataFrame as shown in the above "Desired Result." I'm very new at this and would appreciate any nudge in the right direction.
You're using str wrong: it's a built-in type in Python, and assigning to it shadows the built-in. You don't need it at all here; you just mean to write file.startswith. Now, to store the data, at each iteration you'll want to append to a list. What you can do is use dictionaries to create the data:
import os
import pandas as pd

fileslist = os.listdir(folder)
data = []  # store the intermediate data in the loop
for file in fileslist:
    # no need to assign file to str
    if not file.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        data.append(
            {  # the dict keys will become pandas column names
                'Filename': file,  # you probably want to remove the extension here
                'Count': NameCount,
                'MinNumber': NumMin,
                'MaxNumber': NumMax
            })

df = pd.DataFrame(data)
From here, you just need to write the data frame to your excel file.
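For example, a minimal sketch (the output filename 'summary.xlsx' is made up here):

# write the summary DataFrame out to Excel; 'summary.xlsx' is a
# hypothetical output name, and index=False drops the row numbers
df.to_excel('summary.xlsx', index=False)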
First of all, I would just like to point out that you shouldn't name any variable "str", as you did here:
str = file
This can cause issues in the future if you ever try to convert some object to a string using str(object), as you are shadowing the built-in type. Also, this reassignment of "file" is unnecessary, so you can just take that out. You did something similar with "file": it was a built-in name in Python 2, so it's best not to reuse it either. A name like "file_name" would be better.
As for how to create the final dataframe, it is somewhat simple. I would recommend you use a list and dictionaries and add all the data to that, then create the dataframe. Like this:
import os
import pandas as pd

fileslist = os.listdir(folder)
# temporary list to store data
data = []
for file_name in fileslist:
    if not file_name.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file_name), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        # appending row of data with appropriate column names
        data.append({'Filename': file_name, 'Count': NameCount,
                     'MinNumber': NumMin, 'MaxNumber': NumMax})

# creating actual dataframe
df = pd.DataFrame(data)
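As a side note, pandas Series have their own aggregation methods, which are generally preferred over the Python built-ins max() and min() used inside the loop:

# equivalent, more idiomatic pandas aggregations
NumMax = df['Number'].max()
NumMin = df['Number'].min()
NameCount = df['Name'].count()  # counts non-null values only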
So the code below creates three data frames based on a year. Each data frame is essentially the same, except each year has different stats for how players did. However, the header at the top of the data frame gets repeated within every 20 rows or so, and I'm trying to figure out how to get rid of it. I figured that if I search the "Player" column for every instance where "Player" is repeated, I could find the occurrences and delete the rows they occur in. At the end of my code, I ran a print function to see how many times the header row occurs within the data, and it comes out to 20 times. I just can't figure out a way to delete those rows.
import pandas as pd

year = ["2018", "2019", "2020"]
str = "https://www.pro-football-reference.com/years/{}/fantasy.htm"
url = str.format(year)
urlList = []
for season in year:
    url = str.format(season)
    urlList.append(url)

df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)

print(df2020)
print(sum(df2020[0]["Player"] == "Player"))
P.S. I thought there was a way to reference a data frame variable using the form dataframe.variable?
This should work:
import pandas as pd

year = ["2018", "2019", "2020"]
# use a name other than str, to avoid shadowing the built-in type
url_template = "https://www.pro-football-reference.com/years/{}/fantasy.htm"
urlList = []
for season in year:
    url = url_template.format(season)
    urlList.append(url)

df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)

df2020 = df2020[0]
# keep only the rows where the Rk column is not the repeated header value
df2020 = df2020[df2020['Rk'] != 'Rk']
print(df2020.head(50))
It filters the Rk column for the value "Rk", and excludes it when creating the new dataframe. I only ran the code for 2020, but you can repeat it for the other dataframes.
As a note, pd.read_html() returns a list of dataframes rather than a single dataframe, because an HTML page or file can contain multiple tables. That's why I included the line df2020 = df2020[0]: it selects the first dataframe from the list.
If you need to reset the index, add this code to the end:
df2020 = df2020.reset_index(drop=True)
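If you want to apply the same cleanup to all three years without repeating yourself, a sketch along these lines should work (the url_template and dfs names are my own choices, not from the original code):

import pandas as pd

year = ["2018", "2019", "2020"]
url_template = "https://www.pro-football-reference.com/years/{}/fantasy.htm"

dfs = {}  # maps each season to its cleaned dataframe
for season in year:
    df = pd.read_html(url_template.format(season), header=1)[0]  # first table on the page
    df = df[df['Rk'] != 'Rk'].reset_index(drop=True)  # drop repeated header rows
    dfs[season] = df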
I have a text file formatted like:
item(1) description="Tofu" Group="Foods" Quantity=5
item(2) description="Apples" Group="Foods" Quantity=10
What's the best way to read this style of format in Python?
Here's one way you could do this in pandas to get a DataFrame of your items.
(I copy-pasted your text file into "test.txt" for testing purposes.)
This method automatically assigns column names and sets the item(...) column as the index. You could also assign the column names manually, which would change the script a bit.
import pandas as pd

# read in the data
df = pd.read_csv("test.txt", delimiter=" ", header=None)

# set the index as the first column
df = df.set_index(0)

# capture our column names, to rename columns
column_names = []

# for each column...
for col in df.columns:
    # extract the column name (the part before the "=")
    col_name = df[col].str.split("=").str[0].unique()[0]
    column_names.append(col_name)
    # extract the data (the part after the "=")
    col_data = df[col].str.split("=").str[1]
    # optional: remove the double quotes (note: .str.replace, since
    # we're replacing substrings within each value, not whole values)
    col_data = col_data.str.replace('"', "", regex=False)
    # store just the data back in the column
    df[col] = col_data

# store our new column names
df.columns = column_names
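With the two sample lines from the question, printing the result should give roughly:

print(df)
#         description  Group Quantity
# item(1)        Tofu  Foods        5
# item(2)      Apples  Foods       10

Note that Quantity is still a string at this point; if you need it as a number, something like df['Quantity'] = df['Quantity'].astype(int) would convert it.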
There are probably a lot of ways to do this based on what you're trying to accomplish and how much variation you expect in the data.
I am preparing code to identify the columns in a dataframe with the word "date" in their names. I am using RegEx to compare the sub-strings generated from the original names with the re.split() function.
This is the entire code:
import pandas as pd
import numpy as np
import re

df = pd.read_excel(r'C:\Users\rishi\Desktop\PGDBDA\Dataset\Dataset for Date Operations.xlsx')
#print(df)
# Dataset is loaded into Pandas dataframe

column_name = [names for names in df.columns]
#print(column_name)
# The column names are extracted into a list called column_name.
# We plan to use a mechanism to identify the sub-string 'date' from the elements in column_name.

name_split = []
for index in column_name:
    name_split.append(re.split(' |-|_', index))
#print(name_split)
# Using RegEx we are able to split the elements in the column names based on a set of delimiters.
# We are grouping them in a list of lists named name_split.

column_index = []
column_count = 0
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
for index in name_split:
    for elements in index:
        if re.search(regex_pattern, elements) != None:
            column_index.append(column_count)
            exit()
    column_count += 1
print(column_index)
# Will tell us all the columns with 'date' in their names, by stating the index no of the column.
The issue is that every time I run this portion of the code:
column_index = []
column_count = 0
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
for index in name_split:
    for elements in index:
        if re.search(regex_pattern, elements) != None:
            column_index.append(column_count)
            exit()
    column_count += 1
print(column_index)
# Will tell us all the columns with 'date' in their names, by stating the index no of the column.
The console keeps crashing and reloading. Any insight into this issue would be highly appreciated.
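For what it's worth, the likely culprit is the exit() call: it raises SystemExit and shuts down the whole Python process, which is why the console dies and reloads. If the intent is just to stop scanning a column's sub-strings after the first match, break is probably what was meant, e.g.:

column_index = []
regex_pattern = re.compile(r"\bdate\b", re.IGNORECASE)
# enumerate() tracks the column position, replacing the manual counter
for column_count, index in enumerate(name_split):
    for elements in index:
        if regex_pattern.search(elements) is not None:
            column_index.append(column_count)
            break  # leaves the inner loop only; exit() kills the interpreter
print(column_index)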
I have an Excel workbook with multiple sheets containing some sales data. I am trying to sort them so that each customer has a separate sheet (in a different workbook) with the item details. I have created a dictionary with all customer names.
for name in cust_dict.keys():
    cust_dict[name] = pd.DataFrame(columns=cols)

for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # This is the item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)
    count = 0
    for index, row in df.iterrows():
        print('rotation ' + str(count))
        count += 1
        if row['Particulars'] != 0 and row['Particulars'] not in no_cust:
            cust_name = row['Particulars']
            # try:
            cust_dict[cust_name] = cust_dict[cust_name].append(df.loc[df['Particulars'] == cust_name], ignore_index=False)
            cust_dict[cust_name] = cust_dict[cust_name].drop_duplicates()
            cust_dict[cust_name]['Particulars'] = code
Right now I have to drop duplicates because Particulars has the client name more than once, and hence the code appends the data, say, x number of times.
I would like to avoid this but I can't seem to figure out a good way to do it.
The second problem is that the item code for all rows gets changed to the code from the last sheet, but I want it to remain the code of the particular sheet each row was pulled from.
I can't seem to figure out a way around both the above problems.
Thanks
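For what it's worth, a sketch of one way around both problems (sheets, sales_wb, cols, no_cust, and cust_dict come from the original code; the rest is illustrative): append each sheet's slice once per unique customer rather than once per row, and stamp the item code on the slice before appending so later sheets never overwrite it. pd.concat is used in place of the now-deprecated DataFrame.append:

for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # item code for this sheet
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)

    # one pass per unique customer, so nothing is appended twice
    customers = [c for c in df['Particulars'].unique()
                 if c != 0 and c not in no_cust]
    for cust_name in customers:
        chunk = df.loc[df['Particulars'] == cust_name].copy()
        chunk['Particulars'] = code  # stamp this sheet's code on its own rows only
        cust_dict[cust_name] = pd.concat([cust_dict[cust_name], chunk],
                                         ignore_index=True)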