Data Frame, how to save data extracted with webscraping? - python

I have a spreadsheet where columns "A", "B" and "C" already have data, and I need to save the webscraping data in column "D".
Currently, my code deletes the contents of all columns and saves the information in column "B". This is my problem: it should save to column "D" and keep the data from the other columns.
I tried the command below without success; it just created a column named "c".
pd.DataFrame(data, columns=["c"])
Just below is the command I use to save all my webscraping data:
data = {'Video': links}
df = pd.DataFrame(data)
df.to_excel(r"C:\__Imagens e Planilhas Python\Facebook\Youtube.xlsx", engine='xlsxwriter')
print(df)

You should have included what the contents of "Youtube.xlsx" and of data look like. The answer below assumes they're the same length, and that "Youtube.xlsx" has no index column and exactly 3 columns, so any added column will be the 4th by default.
I don't know what's in "Youtube.xlsx" or in data, but as coded, df will have only one column (Video), and .to_excel uses write mode by default, so
Currently, it deletes the contents of all columns and saves the information in column "B"
[I expect it writes the default index as the first column, so Video ends up as the second column.] If you don't want to write over the previous contents, the usual approach is to append with mode='a'; but that appends rows, and [afaik] there is no way to append columns directly to a file.
You could read the file into a DataFrame, then add the column and save.
filepath = r"C:\__Imagens e Planilhas Python\Facebook\Youtube.xlsx"
df = pd.read_excel(filepath)  # add header=None if there is no header row
df['Video'] = links
df.to_excel(filepath, engine='xlsxwriter')  # add index=False if you don't want the index written as a column
[Use a different column name if the file already had a Video column that you want to preserve.]

Related

How can I create an empty dataframe with combined column names?

I am trying to create a dataframe from an .xlsx file, transforming a string that sits in a single cell into a number of strings that each go into their own cell.
For example, I have a dataframe as follows:
column_name1    column_name2
A;B;C           D;E
F;G;H           I;J
My intention is that 5 columns are created: "column_name1_1", "column_name1_2", "column_name1_3", "column_name2_1", "column_name2_2". Can the column names be generated automatically?
After the dataframe is created, my intention is to enter the data "A" in the first column, "B" in the second column, and so on. "F" would also go in the first column, but under "A", and "G" would go in the second column, but under "B".
Is there any way to achieve this result? It would also work for me not to create the column names at all, as long as the information is distributed in the way I stated above.
I have created this simple code that separates the letters into lists:
for headers in df.columns:
    for cells in df[headers]:
        cells = str(cells)
        sublist = cells.split(character)
        print(sublist)
I am using pandas for the first time and this is my first post. Any advice is welcome. Thank you all very much!
You can achieve this using Pandas.
Here you go!
import pandas as pd
# Load the .xlsx file into a Pandas dataframe
df = pd.read_excel("file.xlsx")
# Collect the split values, keyed by the new column names
split_data = {}
# Loop through the columns
for headers in df.columns:
    # Loop through the cells in each column
    for cells in df[headers]:
        cells = str(cells)
        sublist = cells.split(";")
        # Append each element to the column named after its position in the list
        for i, value in enumerate(sublist):
            column_name = headers + "_" + str(i + 1)
            split_data.setdefault(column_name, []).append(value)
# Build the new dataframe from the collected values
split_df = pd.DataFrame(split_data)
# Save the split_df dataframe to a new .xlsx file
split_df.to_excel("split_file.xlsx", index=False)
This code will split the values in a .xlsx file into a new dataframe, with each value separated into its own column. The new columns will be named based on the original column names and the position of the value in the list. The new dataframe will then be saved to a new .xlsx file named "split_file.xlsx".
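If you prefer a more vectorized version, here is a sketch along the same lines (it assumes the same "file.xlsx" input and ";" separator as above) that uses str.split(expand=True) to split each column in one go:
import pandas as pd

df = pd.read_excel("file.xlsx")

parts = []
for header in df.columns:
    # Split every cell on ";" and expand the pieces into separate columns
    expanded = df[header].astype(str).str.split(";", expand=True)
    # Name the new columns after the original column plus a 1-based position
    expanded.columns = [f"{header}_{i + 1}" for i in range(expanded.shape[1])]
    parts.append(expanded)

split_df = pd.concat(parts, axis=1)
split_df.to_excel("split_file.xlsx", index=False)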

Append data in New Column in excel using python

I have to run this code every day and store the dataframe in a new column.
How can I store the dataframe in the next column automatically, without specifying the column number in the Excel file? Here dx stores the day of the month, i.e. for "09-11-2022" dx=9, so the data gets stored in column 9; but there will be gaps if I run it again later, e.g. if dx=22 the columns between 9 and 22 will be empty. So how do I store the data in the next new column without specifying startcol?
df1 = pd.DataFrame({today: mob})
df2 = pd.DataFrame({today: dtop})
writer = pd.ExcelWriter('Pagespeed.xlsx', mode='a', engine="openpyxl",
                        if_sheet_exists='overlay')
df1.to_excel(writer, index=False, sheet_name='Mobile', startcol=dx)
df2.to_excel(writer, index=False, sheet_name='Desktop', startcol=dx)
writer.save()
Try this answer to get the last populated column in the Excel file.
Then use startcol = max_column + 1 to write to the next empty column.
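If the linked answer isn't handy, a rough sketch of the same idea with openpyxl (reusing Pagespeed.xlsx, the sheet names, and df1/df2 from the question) could look like this. Note that openpyxl's max_column is 1-based while startcol is 0-based, so passing max_column as startcol writes into the column right after the last populated one (max_column can overcount if trailing cells are formatted but empty):
import pandas as pd
from openpyxl import load_workbook

# Find the last populated column on each sheet
book = load_workbook('Pagespeed.xlsx')
start_mobile = book['Mobile'].max_column
start_desktop = book['Desktop'].max_column

# Append each dataframe in the first empty column of its sheet
with pd.ExcelWriter('Pagespeed.xlsx', mode='a', engine='openpyxl',
                    if_sheet_exists='overlay') as writer:
    df1.to_excel(writer, index=False, sheet_name='Mobile', startcol=start_mobile)
    df2.to_excel(writer, index=False, sheet_name='Desktop', startcol=start_desktop)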

Use Python to export Excel column data when column isn't in first row

I need to pull data from a column based on the column header. My only problem is that the input files aren't consistent: they have the column in different locations, and the data doesn't start on row one.
Above is an example Excel file. I want to pull the data for Market. I've got this to work using pandas if the data starts at A1, but I can't get it to pull the data if it doesn't start in the first position.
How about you use this just after your pd.read_excel() statement?
df = df.dropna(how='all', axis='columns').dropna(how='all', axis='rows')
You can then set the first row as header:
df.columns = df.iloc[0]
df = df[1:]
df
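Putting it together, a minimal sketch (the "input.xlsx" path and the Market column name are placeholders based on the question):
import pandas as pd

# Read without assuming a header row, since the data may not start at A1
df = pd.read_excel("input.xlsx", header=None)

# Drop fully empty rows and columns, then promote the first remaining row to the header
df = df.dropna(how='all', axis='columns').dropna(how='all', axis='rows')
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)

# Pull the column of interest
market = df['Market']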

reading multiple excel sheets and dropping the last row of each sheet

My program reads in an Excel file that contains multiple sheets and concatenates them together. The issue is that the last row of each sheet is a Totals row and I don't want that row. Is there an argument that will drop the last row when I read the sheets in? And will I need to read the sheets in and remove this last row before I run the concat function, to avoid deleting the wrong rows? I've tried using skipfooter = 0 and skipfooter = 1 but this threw an error message.
I assume you are using pandas to read the xlsx file, that it has multiple sheets with different lengths of data, and that you want to drop the last row from each sheet, so you can use [:-1] like this:
df = pd.ExcelFile('report.xlsx',engine = 'openpyxl')
data = [df.parse(name)[:-1] for name in df.sheet_names]
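To then combine the sheets into one dataframe, a short sketch building on the answer's data list ('combined.xlsx' is just a placeholder output name):
import pandas as pd

# Stack the per-sheet dataframes (each already trimmed of its Totals row)
combined = pd.concat(data, ignore_index=True)
combined.to_excel('combined.xlsx', index=False)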

Problem with combining multiple excel files in python pandas

I am quite new to Python programming. I need to combine 1000+ files into one file. Each file has 3 sheets in it, and I need to get data only from sheet2 and make a final Excel file. I am having a problem picking a value from a specific cell of sheet2 in each Excel file and creating a column from it: Python picks the value from the first file only and creates the column from that.
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsm'):
        df = pd.read_excel(file, sheet_name=1, header=None)
        df['REPORT_NO'] = df.iloc[1][4]    # Report Number
        df['SUPPLIER'] = df.iloc[2][4]     # Supplier
        df['REPORT_DATE'] = df.iloc[0][4]  # Report Date
        df2 = df2.dropna(thresh=15)
        df2 = df.append(df, ignore_index=True)
df = df.reset_index()
del df['index']
df2.to_excel('FINAL_FILES.xlsx')
How can I solve this issue so Python takes the values from each Excel file and puts the information on the right rows?
I. df.iloc[2][4] refers to the row at position 2 and the column at position 4 (zero-based), i.e. the 3rd row and 5th column of the sheet you read. You import with sheet_name=1, which is the second sheet, and never switch to a different one; check that this is really the sheet2 you want, since you mention each .xlsm has 3 sheets.
II. Your scoping could be wrong. Why define df outside of the loop? It will change per file, so there is no need for an external one. All the info from the loop should be put into your df2 before the next iteration of the loop.
III. Have you checked whether append is adding a row or a column?
Even though
df['REPORT_NO'] = df.iloc[1][4] #Report Number
df['SUPPLIER'] = df.iloc[2][4] #Supplier
df['REPORT_DATE'] = df.iloc[0][4] #Report Number
are written as column assignments, each of them assigns a single value (Report Number/Supplier/Report Date) that is repeated for every row in that column.
When you use df2 = df.append(df, ignore_index=True) check the output. It might not be appending in the way you intend.
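As a minimal sketch of how the loop could be restructured (the cell positions are copied from your code and assumed correct, and files is your existing list of paths): collect one row per file and only build df2 at the end.
import pandas as pd

rows = []
for file in files:
    if file.endswith('.xlsm'):
        # Read the second sheet (sheet_name=1) of this workbook
        sheet = pd.read_excel(file, sheet_name=1, header=None)
        # One row per file, holding the three cells of interest
        rows.append({
            'REPORT_NO': sheet.iloc[1, 4],
            'SUPPLIER': sheet.iloc[2, 4],
            'REPORT_DATE': sheet.iloc[0, 4],
        })

df2 = pd.DataFrame(rows)
df2.to_excel('FINAL_FILES.xlsx', index=False)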
