I have an Excel file with repeated groups of columns, and I want to collapse each repeated group into a single column.
I know this kind of problem can be handled with pandas' melt function, but only when the repeated columns belong to a single variable.
I already found a simple solution to my problem, but I don't think it's the best: I put the repeated columns of every variable into a list, then all of those lists into one bigger list.
While iterating over the bigger list, I apply melt to every variable (the list of repeated columns of the same group).
Finally, I concatenate the new dataframes into a single dataframe.
Here is my code:
import pandas as pd

file_name = 'file.xlsx'
df_final = pd.DataFrame()

# Create lists to hold the headers & the other variables
HEADERS = []
A = []
B = []
C = []

# Read the Excel file
df = pd.read_excel(file_name, sheet_name='Sheet1')

# Create a list of all the columns
columns = list(df)

# Split the columns list into headers and the other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)

# For the headers, take into account only the first 17 variables
HEADERS = HEADERS[:17]

# Group the column variables
All_cols = [A, B, C]

# Create the final dataframe
for group in All_cols:
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=group,
                   var_name=group[0],
                   value_name=group[0] + '_Val')
    # Concatenate onto the final dataframe
    df_final = pd.concat([df_final, df_x], axis=1)

# Delete duplicate columns
df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a more maintainable solution to my problem, and I want to end up with one dataframe for every group of columns (the same variable) as a result.
As a beginner in Python, I can't find a way of doing this.
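To be concrete, here is a minimal sketch of the shape of result I am after, reusing the A, B, C and HEADERS lists from my code above: one melted dataframe per variable group, held together in a dict.
# One melted dataframe per variable group, keyed by the group's first column
groups = {'A': A, 'B': B, 'C': C}
melted = {
    name: pd.melt(df, id_vars=HEADERS, value_vars=cols,
                  var_name=name, value_name=name + '_Val')
    for name, cols in groups.items()
}
# melted['A'], melted['B'] and melted['C'] are the per-group dataframes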
I'm attaching an image that explains what I want, in case I didn't make it clear enough.
I am trying to create a dataframe from an .xlsx file in which each cell contains several values packed into a single string, and I want to spread those values out into separate columns.
For example, I have a dataframe as follows:
column_name1    column_name2
A;B;C           D;E
F;G;H           I;J
My intention is that 5 columns are created: "column_name1_1", "column_name1_2", "column_name1_3", "column_name2_1", "column_name2_2". Can the column names be automated?
After the dataframe is created, my intention is to enter the data "A" in the first column, "B" in the second column, and so on. "F" would also go in the first column, but under "A" and "G" would go in the second column, but under "B".
Is there any way to achieve this result? It would also work for me to skip creating the column names, as long as the information is distributed the way I stated above.
I have created this simple code that separates the letters into lists:
character = ";"  # the delimiter between the values in a cell
for headers in df.columns:
    for cells in df[headers]:
        cells = str(cells)
        sublist = cells.split(character)
        print(sublist)
I am using pandas for the first time and this is my first post. Any advice is welcome. Thank you all very much!
You can achieve this using Pandas.
Here you go!
import pandas as pd

# Load the .xlsx file into a pandas dataframe
df = pd.read_excel("file.xlsx")

# Collect the split values, keyed by the new column names
split_data = {}

# Loop through the columns
for header in df.columns:
    # Loop through the cells in each column
    for row, cell in enumerate(df[header]):
        sublist = str(cell).split(";")
        # Record one new column per element in the sublist
        for i, value in enumerate(sublist):
            column_name = header + "_" + str(i + 1)
            split_data.setdefault(column_name, {})[row] = value

# Build the new dataframe; rows with fewer values get NaN in the extra columns
split_df = pd.DataFrame(split_data)

# Save the split_df dataframe to a new .xlsx file
split_df.to_excel("split_file.xlsx", index=False)
This code will split the values in a .xlsx file into a new dataframe, with each value separated into its own column. The new columns will be named based on the original column names and the position of the value in the list. The new dataframe will then be saved to a new .xlsx file named "split_file.xlsx".
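As a quick sanity check, a hypothetical in-memory frame matching the sample data from the question comes out like this:
# Hypothetical frame reproducing the question's sample data
df = pd.DataFrame({"column_name1": ["A;B;C", "F;G;H"],
                   "column_name2": ["D;E", "I;J"]})
# After running the loop above, split_df is:
#   column_name1_1 column_name1_2 column_name1_3 column_name2_1 column_name2_2
# 0              A              B              C              D              E
# 1              F              G              H              I              J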
I have two dataframes:
The first one has n rows of names.
The second one has n rows of names.
For each name in the first dataframe, I want to see how many times it appears in the second dataframe.
The code looks something like this:
df5 = pd.read_excel(item1, usecols="B",skiprows=6)
df10 = pd.read_excel('SMR4xx_Change_situation.xlsm', sheet_name='LoPN',usecols='D', skiprows=4)
How do I count the number of times a name appears in the second dataframe and output it beside the name in the first dataframe?
Ex: the first name in the dataframe is John, and John appears in the second dataframe 4 times => output: John 4.
Either print it in the console or write the first dataframe to a separate Excel file with the number of appearances in the second column.
Anything would help.
Well, you can create a dataframe for the records you are seeking.
You can first get the list of unique names in the first dataframe like this:
uniqueNames = df5['B'].unique() # Assuming column B contains the names
dfCount = pd.DataFrame(columns=['name', 'count'])
Now you can iterate through each of the unique names in the first dataframe and compare against the second dataframe like this:
for eachName in uniqueNames:
    row = pd.DataFrame([{'name': eachName,
                         'count': (df10['D'] == eachName).sum()}])  # Assuming you need to compare with column D
    dfCount = pd.concat([dfCount, row], ignore_index=True)
Or
If you want the counts to sit next to each name in the first dataframe, counting with value_counts and looking the names up with map (rather than a positional np.select comparison, which only works when both frames line up row by row) should work:
# Count every name in df10 once, then look each df5 name up in that table
counts = df10['D'].value_counts()
df5['counts'] = df5['B'].map(counts).fillna(0).astype(int)
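A tiny worked example with made-up data, mirroring the John example from the question:
# Hypothetical data: John appears 4 times in the second dataframe
df5 = pd.DataFrame({'B': ['John', 'Mary']})
df10 = pd.DataFrame({'D': ['John', 'John', 'Anna', 'John', 'John']})
df5['counts'] = df5['B'].map(df10['D'].value_counts()).fillna(0).astype(int)
# df5 now reads: John 4, Mary 0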
I have been trying to search a dataframe for a list of numbers; every time a number matches in a column, I would like to return the whole row, save it to a new dataframe, and then export it to Excel.
millreflist is the list of numbers - can be of random length.
TUCABCP is the dataframe I am searching.
PO is the column I am searching in for the numbers.
I have tried the code below using .loc, but when opening the new excel file I am just getting the header and no rows or data.
millreflistlength = len(millreflist)
for i in range(millreflistlength):
    TUCABCP = TUCABCP.loc[TUCABCP['PO'] == millreflist[i]]
TUCABCP.to_excel("NEWBCP.xlsx", header=True, index=False)
I have used the following question for reference, but it does not cover when you would like to search with a list of numbers: Selecting rows from a Dataframe based on values in multiple columns in pandas
Try something like this. The loop above reassigns TUCABCP on every iteration, so each pass filters the already-filtered frame down to a single PO value, which is why only the header survives. isin does the whole filter in one step:
# Get the indices of the rows whose PO value appears in millreflist
indexes = TUCABCP[TUCABCP['PO'].isin(millreflist)].index
# Filter the original df to get just the rows with indexes in that list
df = TUCABCP[TUCABCP.index.isin(indexes)]
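Then the filtered frame can be written out the same way as the question's last line:
# Export the matching rows to a new Excel file
df.to_excel("NEWBCP.xlsx", header=True, index=False)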
I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer, but it only applies if all sub-columns are going to be named the same way, which is not necessarily the case here.
Any recommendations would be appreciated.
df = pd.read_excel(
"./Data.xlsx",
sheet_name="Customer Care",
header=[0,1,2]
)
This will tell pandas to read the first three rows of the excel file as multiindex column labels.
If you have already loaded the data and want to fix it up afterwards, set the first three rows as the columns and then drop them:
# Set the first three rows as a MultiIndex of column labels
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# Drop the first three rows (they are now the column labels)
df = df.iloc[3:]
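Either way, an individual column is then addressed with a tuple covering all three header levels; the level values below are hypothetical stand-ins for whatever your sheet actually contains:
# Select one column by its full three-level label (hypothetical labels)
series = df[("Region", "Team", "Metric")]
# Or take every column under one top-level label
subframe = df["Region"]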
I have the following code:
import glob
import pandas as pd

# Collect every .csv file in the folder and merge them side by side
files = glob.glob(path + "/*.csv")
df = pd.concat([pd.read_csv(f) for f in files], axis=1, ignore_index=False)
df1 = df.loc[:, ~df.columns.duplicated()]
df1.to_csv("userpath.csv")
The purpose of this code is to take multiple csv files, all exported from the same database, and merge them next to each other. The files all have the same rows but different column names, and they share the same code row at the top. For example, the first csv file will have J1_01, J1_02, J2_01, J2_02, ... and then the pattern repeats in the next merged csv file with J1_01, J1_02, J2_01, J2_02, J3_01, .... The csv files all have varying columns, and the second row provides the title description of each column's value.

Each csv file also has three columns that give the name of the row and its ID numbers, for example: Id, Id2, Label Name. I want to keep the first instance of these three and delete the remaining duplicates. I used df.loc[:, ~df.columns.duplicated()]; however, since codes like J1_01, J1_02, J2_01, J2_02, J3_01, ... also eventually duplicate as new csv files are merged in, I lose some columns. Is there any way to restrict df.loc[:, ~df.columns.duplicated()] to just drop the duplicated Id, Id2, Label Name columns after keeping the first three? Thanks!

As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";" or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!
Edit: Here's a screenshot of the merged csv file.
I want to keep the first instance of 'GEO.id', 'GEO.id2' and 'Geo.displ' and delete these three columns any time they are repeated.
From your image it seems that the columns you want to keep all begin with GEO. To do this, you can use a regex to match the names, get the indices of the matching columns, and then splice the dataframe based on the column index.
import re

pattern = r'GEO'  # or just "id" or whatever pattern best matches your data
# Get the indices of the columns whose names match the pattern
match_idx = [i for i, e in enumerate(df.columns) if re.search(pattern, e)]
# Drop every match after the first three (the first GEO.id, GEO.id2, GEO.displ stay)
drop_cols = match_idx[3:]
# Keep every column whose index is not in the drop list
usecols = [idx for idx, e in enumerate(df.columns) if idx not in drop_cols]
# Then select your data
df1 = df.iloc[:, usecols]
Note: if you try to select a single column like df['GEO.id'], it will return all the columns called GEO.id, which is why we have to drop the columns by index and not their name.
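For the follow-up about replacing ":", ";" or spaces with an underscore, a short sketch, assuming the characters sit in the column names (use replace on the frame itself if they are in the cell values):
# Replace ":", ";" and spaces in the column names with underscores
df1.columns = df1.columns.str.replace(r"[:; ]", "_", regex=True)
# If the characters are in the cell values instead (string columns only):
df1 = df1.replace(r"[:; ]", "_", regex=True)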