I am trying to create a dataframe from an .xlsx file that transforms a string that is in a cell into a number of strings that are arranged in a single cell.
For example, I have a dataframe as follows:
column_name1 column_name2
[[[A;B;C], [D;E]]],
[[F;G;H], [I;J]]]]]
My intention is that 5 columns are created: "column_name1_1", "column_name1_2", "column_name1_3", "column_name2_1", "column_name2_2". Can the column name be automatized?
After the dataframe is created, my intention is to enter the data "A" in the first column, "B" in the second column, and so on. "F" would also go in the first column, but under "A" and "G" would go in the second column, but under "B".
Is there any way to achieve this result? It would also be useful for me not to create the name of the columns, but to distribute the information in the way I stated above.
I have created this simple code that separates the letters into lists:
for headers in df.columns:
    for cells in df[headers]:
        cells = str(cells)
        sublist = cells.split(character)
        print(sublist)
I am using pandas for the first time and this is my first post. Any advice is welcome. Thank you all very much!
You can achieve this using Pandas.
Here you go!
import pandas as pd
# Load the .xlsx file into a Pandas dataframe
df = pd.read_excel("file.xlsx")
# Collect the split columns produced from each original column
parts = []
# Loop through the columns
for headers in df.columns:
    # Split every cell on ";" into separate columns, one row per original row
    pieces = df[headers].astype(str).str.split(";", expand=True)
    # Name the new columns "<header>_1", "<header>_2", ...
    pieces.columns = [headers + "_" + str(i + 1) for i in range(pieces.shape[1])]
    parts.append(pieces)
# Put all the split columns side by side in a new dataframe
split_df = pd.concat(parts, axis=1)
# Save the split_df dataframe to a new .xlsx file
split_df.to_excel("split_file.xlsx", index=False)
This code will split the values in a .xlsx file into a new dataframe, with each value separated into its own column. The new columns will be named based on the original column names and the position of the value in the list. The new dataframe will then be saved to a new .xlsx file named "split_file.xlsx".
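To try the split without an Excel file, here is a minimal sketch on an in-memory dataframe. The sample values and the helper name split_columns are made up for illustration, and it assumes every cell is a plain string with ";" as the separator.

```python
import pandas as pd


def split_columns(df, sep=";"):
    """Split every cell on `sep` and spread the pieces over numbered columns."""
    parts = []
    for header in df.columns:
        # expand=True returns one column per piece and keeps one row per original row
        pieces = df[header].astype(str).str.split(sep, expand=True)
        pieces.columns = [f"{header}_{i + 1}" for i in range(pieces.shape[1])]
        parts.append(pieces)
    return pd.concat(parts, axis=1)


# Hypothetical data mirroring the example in the question
sample = pd.DataFrame(
    {"column_name1": ["A;B;C", "F;G;H"], "column_name2": ["D;E", "I;J"]}
)
result = split_columns(sample)
print(result)
```

Here "A" and "F" end up under column_name1_1, "B" and "G" under column_name1_2, and so on, which is the layout asked for in the question.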
Related
I am new to programming and I am trying to learn. I am comparing 2 documents that have very similar data. I want to find out whether the data in column "concatenate" of one document is also found in column "concatenate" of the other document, because I want to know what changes were made during the last update of the file.
If the value cannot be found this whole row should be copied to a new document. Then I know that this dataset has been changed.
Here is the code I have:
import pandas as pd
# load the data from the two files into Pandas dataframes
df1 = pd.read_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/MergedKeepa_2023-02-05.xlsx')
df2 = pd.read_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/MergedKeepa_2023-02-04.xlsx')
# extract the values from column Concatenate in both dataframes
col_a_df1 = df1['concatenate']
col_a_df2 = df2['concatenate']
# find the intersection of the values in column A of both dataframes
intersection = col_a_df1.isin(col_a_df2)
# filter the rows of df1 where the value in column A is not found in df2
df1 = df1[intersection]
# write the filtered data to a new Excel file
df1.to_excel('/Users/bjoern/Desktop/PythonProjects/Comparison/filtered_data.xlsx', index=False)
I just duplicated the 2 input files, which means I should receive a blank document, but data is still being copied to the new sheet.
What did I do wrong?
Many thanks for your support!
If the value cannot be found, this whole row should be copied to a new
document.
IIUC, you need the NOT operator (~) to negate your boolean mask:
df1 = df1[~intersection]
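A small sketch of why the mask has to be negated, using made-up values in place of the two Excel files (the column "price" and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the two files
df1 = pd.DataFrame({"concatenate": ["a1", "b2", "c3"], "price": [10, 20, 30]})
df2 = pd.DataFrame({"concatenate": ["a1", "b2"], "price": [10, 20]})

# True where the df1 value also appears in df2
found = df1["concatenate"].isin(df2["concatenate"])

# ~found keeps only the rows whose value is missing from df2
changed = df1[~found]
print(changed)
```

Without the ~, the filter keeps the rows that do match, which is why two identical input files produced a full copy instead of an empty sheet.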
I'm trying to clean an excel file that has some random formatting. The file has blank rows at the top, with the actual column headings at row 8. I've gotten rid of the blank rows, and now want to use the row 8 string as the true column headings in the dataframe.
I use this code to get the position of the column headings by searching for the string 'Destination' in the whole dataframe, and then take the location of the True value in the Boolean mask to get the list for renaming the column headers:
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[7]
print(hdrstr)
df2=df.rename(columns=hdrstr)
However when I try to use hdrindex as a variable, I get errors when the second dataframe is created (ie when I try to use hdrstr to replace column headings.)
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[hdrindex]
print(hdrstr)
df2=df.rename(columns=hdrstr)
How do I use a variable to specify an index, so that the resulting list can be used as column headings?
I assume the indicator of the actual header row in your dataframe is the string "Destination" (the comparison is exact and case-sensitive). Let's find where it is:
start_tag = df.eq("Destination").any(axis=1)
We'll keep the index of the first occurrence of the word "Destination" for further use:
start_row = df.loc[start_tag].index.min()
Using that index we get the list of values in the "header" row:
new_col_names = df.iloc[start_row].values.tolist()
And here we assign the new column names to the dataframe:
df.columns = new_col_names
From here you can work with a new dataframe that has the actual column names and proper indexing:
df2 = df.iloc[start_row+1:].reset_index(drop=True)
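Putting the steps together on a small made-up frame (the values are hypothetical; the "Destination" marker is taken from the question). This sketch uses label-based loc throughout, on the assumption that the index may not have been reset after the blank rows were dropped:

```python
import pandas as pd

# Hypothetical messy sheet: junk rows first, real header further down
raw = pd.DataFrame(
    [
        ["report", None, None],
        [None, None, None],
        ["Origin", "Destination", "Cost"],
        ["Paris", "Rome", 120],
        ["Oslo", "Lima", 950],
    ]
)

# Rows that contain the exact marker string
is_header = raw.eq("Destination").any(axis=1)

# Index label of the first matching row
header_label = is_header[is_header].index.min()

# Use that row as column names, then keep only the rows below it
clean = raw.loc[header_label + 1:].copy()
clean.columns = raw.loc[header_label].tolist()
clean = clean.reset_index(drop=True)
print(clean)
```

Because header_label is a single label rather than a list, raw.loc[header_label] returns one row as a Series, which is what makes it usable as a list of column names.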
I have been trying to search a dataframe for a list of numbers, every time a number matches in a column I would like to return the whole row and save it to a new dataframe, and then to an excel.
millreflist is the list of numbers - can be of random length.
TUCABCP is the dataframe I am searching.
PO is the column I am searching in for the numbers.
I have tried the code below using .loc, but when opening the new excel file I am just getting the header and no rows or data.
millreflistlength = len(millreflist)
for i in range(millreflistlength): TUCABCP = TUCABCP.loc[TUCABCP['PO'] == millreflist[i]]
TUCABCP.to_excel("NEWBCP.xlsx", header=True, index=False)
I have used the following question for reference, but it does not cover when you would like to search with a list of numbers: Selecting rows from a Dataframe based on values in multiple columns in pandas
Try something like this:
## Get the index labels of the rows which you want to keep
indexes = TUCABCP[TUCABCP['PO'].isin(millreflist)].index
## Filter the original df to get just the rows whose index is among those labels
df = TUCABCP[TUCABCP.index.isin(indexes)]
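The boolean mask from isin can also index the dataframe directly, which avoids the intermediate index step. A minimal sketch with made-up data (the PO values and the "Item" column are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the TUCABCP dataframe
TUCABCP = pd.DataFrame({"PO": [101, 102, 103, 104], "Item": ["w", "x", "y", "z"]})
millreflist = [102, 104, 999]  # 999 has no match and is simply ignored

# Keep every row whose PO appears anywhere in the list
matches = TUCABCP[TUCABCP["PO"].isin(millreflist)]
print(matches)
```

Calling matches.to_excel("NEWBCP.xlsx", index=False) would then write the rows out. The loop in the question ends up empty because each pass filters the already filtered frame, so once the rows for the first number are kept, none of them can equal the second number.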
I have a csv file with repeated groups of columns and I want to convert each repeated group of columns into a single column.
I know that for this kind of problem we can use the function melt in Python, but only when the repeated columns belong to a single variable.
I already found a simple solution for my problem, but I don't think it's the best. I put the repeated columns of every variable into a list, then all the repeated variables into a bigger list.
Then, when iterating over the list, I use melt on every variable (the list of repeated columns of the same group).
Finally I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd
file_name = 'file.xlsx'
df_final = pd.DataFrame()
# Create lists to hold headers & other variables
HEADERS = []
A = []
B = []
C = []
# Read the Excel file
df = pd.read_excel(file_name, sheet_name='Sheet1')
# Create a list of all the columns
columns = list(df)
# Split columns list into headers and other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)
# For headers take into account only the first 17 variables
HEADERS = HEADERS[:17]
# Group column variables
All_cols = []
All_cols.append(A)
All_cols.append(B)
All_cols.append(C)
# Create a final DF; the loop variable is named "group" to avoid shadowing the built-in list
for group in All_cols:
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=group,
                   var_name=group[0],
                   value_name=group[0] + '_Val')
    # Concatenate the melted frame onto the running result
    df_final = pd.concat([df_final, df_x], axis=1)
    # Delete duplicate columns
    df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a more maintainable solution for my problem, and I want to have a dataframe for every group of columns (same variable) as a result.
As a beginner in Python, I can't find a way of doing this.
I'm attaching an image that explains what I want in case I didn't make it clear enough.
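One possible way to get a dataframe for every group without hand-written lists is to derive each group from its column prefix and keep the melted frames in a dictionary. This is only a sketch: the column names (ID, A1, A2, B1, B2), the single-letter prefixes, and the helper melt_groups are all assumptions made for illustration.

```python
import pandas as pd


def melt_groups(df, prefixes, id_vars):
    """Return {prefix: melted dataframe} for each group of repeated columns."""
    melted = {}
    for prefix in prefixes:
        # All non-id columns that belong to this group
        value_vars = [
            c for c in df.columns if c not in id_vars and c.startswith(prefix)
        ]
        melted[prefix] = pd.melt(
            df,
            id_vars=id_vars,
            value_vars=value_vars,
            var_name=prefix,
            value_name=f"{prefix}_Val",
        )
    return melted


# Hypothetical wide table with two repeated groups
wide = pd.DataFrame(
    {"ID": [1, 2], "A1": [10, 20], "A2": [11, 21], "B1": ["x", "y"], "B2": ["p", "q"]}
)
frames = melt_groups(wide, prefixes=["A", "B"], id_vars=["ID"])
print(frames["A"])
```

Adding a new group then only means adding its prefix to the list, and each result stays available separately as frames["A"], frames["B"], and so on.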
I have created a data frame from an excel file. I would like to create new columns from each unique value in the column 'animals'. Can anyone help with this? I am somewhat new to Python and Pandas. Thanks.
In:
import pandas as pd
#INPUT FILE INFORMATION
# Raw string, so the backslashes in the Windows path are not read as escape sequences
path = r'C:\Users\MY_COMPUTER\Desktop\Stack_Example.xlsx'
sheet = "Sheet1"
#READ FILE
dataframe = pd.read_excel(path, sheet_name=sheet)
#SET DATE AS INDEX
dataframe = dataframe.set_index('date')
You said you want to create new columns from each unique value in the column "animals". As you did not specify what you want the new columns to have as values, I assume you want None values.
So, here is the code:
for value in dataframe['animals']:
    if value not in dataframe:
        dataframe[value]=None
The first line loops through each value of the column 'animals'.
The second line checks to make sure the value is not already in one of the columns so that your condition of having only unique values is satisfied.
The third line creates new columns named under each unique value of column 'animals'.
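To see the effect, here is the same idea on a small made-up frame (the dates and animal names are hypothetical). Iterating over unique() instead of the raw column gives the same columns while visiting each name only once:

```python
import pandas as pd

# Hypothetical data standing in for the Excel sheet
dataframe = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
        "animals": ["cat", "dog", "cat"],
    }
).set_index("date")

# One new, empty column per distinct animal
for value in dataframe["animals"].unique():
    if value not in dataframe:
        dataframe[value] = None

print(dataframe.columns.tolist())
```

The check `value not in dataframe` tests against the column labels, so an animal whose name already exists as a column is left untouched.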