Populate dataframe by unnesting list of the first column


I have the following issue with a CSV in pandas. The data looks as follows:
Column A, row 1: [« a », « b »; « c »
Row 2: [« d »; « e », « f »
Etc.
Note the different delimiters.
I would like to populate the following columns with the values from the list in each cell, like this:
ColA row 1: [a] ColB: [b] ColC: [c]
Row 2: [d] ColB: [e] ColC: [f]
And so on: however many values a cell holds, I would like them spread across the columns of its row.
I hope my explanation is clear and that you can give me some insight.
Thanks
I'm struggling so far.
I can't share the data, but basically every row in column A contains a CSV-like list with separators, and for the n values in that cell I would like to populate n cells in the following columns of the same row. I think I would need to split the data on the multiple delimiters, treating them as one (as you would in Excel), and then write a function that, for each row, appends each value of the first cell's list? But I'm not sure how to write this.
Each value of the list in the cell should go into the next column along the same row (horizontally), and this for every row in the data set. In short, I would like to un-nest these strings.

I'm not sure I understand your I/O, but you can try this (test.txt is the sample file I used):
import pandas as pd

df = (
    pd.read_csv("test.txt", sep="[;,]", engine="python",
                header=None, skiprows=1)
    .astype(str).apply(lambda x: x.str.strip("« »"))
)
# convert the numeric column labels (0, 1, 2, ...) to letters (ColA, ColB, ColC, ...)
df.columns = (
    df.columns.astype(str)
    .str.replace(r"(\d+)",
                 lambda m: "Col" + chr(ord("A") + int(m.group(0))),
                 regex=True)
)
# Output:
print(df)
  ColA ColB ColC
0    a    b    c
1    d    e    f
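If the strings are already loaded into a DataFrame column rather than read from a file, the same idea can be sketched with Series.str.split. This is only a minimal sketch, not the answerer's method: the two sample strings are copied from the question, while the column name A and the ColA/ColB/ColC labels are assumptions made for illustration.

```python
import pandas as pd

# Sample cells copied from the question; the column name "A" is an assumption.
df = pd.DataFrame({"A": ["[« a », « b »; « c »", "[« d »; « e », « f »"]})

parts = (
    df["A"]
    .str.strip("[] ")                        # drop the surrounding brackets and spaces
    .str.split(r"\s*[;,]\s*", expand=True)   # treat ';' and ',' as one delimiter
    .apply(lambda col: col.str.strip("«» "))  # remove the quote marks around each value
)

# Label the new columns ColA, ColB, ColC, ... (works for up to 26 columns).
parts.columns = [f"Col{chr(ord('A') + i)}" for i in range(parts.shape[1])]
print(parts)
```

Rows with fewer values than the widest row are padded with missing values, so no per-row function is needed.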

Related

How to compare two dataframes in pandas

I have two dataframes:
The first one has n rows of names.
The second one has n rows of names.
For each name in the first dataframe, I want to see how many times it appears in the second dataframe.
The code looks something like this:
df5 = pd.read_excel(item1, usecols="B", skiprows=6)
df10 = pd.read_excel('SMR4xx_Change_situation.xlsm', sheet_name='LoPN', usecols='D', skiprows=4)
How do I count the number of times a name appears in the second dataframe and output it beside the name from the first dataframe?
Ex: The first name in the first dataframe is John. John appears in the second dataframe 4 times => output John 4
Either print it in the console, or write the first dataframe to a separate Excel file with the number of appearances in the second column.
Anything could help.

Well, you can create a dataframe for the counts you are seeking.
You can first get the list of unique names in the first dataframe like this:
uniqueNames = df5['B'].unique()  # Assuming column B contains the names
Now you can iterate through each of the unique names in the first dataframe and compare it against the second dataframe like this (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first):
rows = []
for eachName in uniqueNames:
    rows.append({'name': eachName,
                 'count': (df10['D'] == eachName).sum()})  # Assuming you need to compare with column D
dfCount = pd.DataFrame(rows, columns=['name', 'count'])
Or
If you want the counts to be present in the first dataframe, something like this should work:
df5['counts'] = df5['B'].map(dfCount.set_index('name')['count'])
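The loop can also be avoided altogether. The sketch below is an alternative, not the answerer's code: it counts every name in the second dataframe once with value_counts and looks the counts up with map. The frames first and second and the column name are made-up stand-ins for df5['B'] and df10['D'].

```python
import pandas as pd

# Hypothetical stand-ins for df5 (column B) and df10 (column D).
first = pd.DataFrame({"name": ["John", "Mary", "Zed"]})
second = pd.DataFrame({"name": ["John", "Mary", "John", "John", "John"]})

# Count each distinct name in the second dataframe once.
counts = second["name"].value_counts()

# Look up each name of the first dataframe; names that never appear get 0.
first["count"] = first["name"].map(counts).fillna(0).astype(int)
print(first)
```

The result can then be written out with first.to_excel(...) if a separate Excel file is wanted.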

Subsetting a dataframe in pandas according to column name values

I have a dataframe in pandas that I need to split up. It is much larger than this, but here is an example:
ID  A  B
a   0  0
b   1  1
c   2  2
and I have a list: keep_list = ['ID','A']
and another list: recode_list = ['ID','B']
I'd like to split the dataframe up by the column headers into two dataframes: one with the columns and values whose headers match keep_list, and one with the columns and values whose headers match recode_list. Everything I have tried so far has not worked, because it compares the values to the list rather than the column names.
Thank you so much in advance for your help!

Assuming your DataFrame's name is df, you can simply do df[keep_list] and df[recode_list] to get what you want.

You can do this with Index.intersection:
df1 = df[df.columns.intersection(keep_list)]
df2 = df[df.columns.intersection(recode_list)]
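The two answers differ when a list names a column that does not exist. The following sketch rebuilds the example frame from the question to show this; the list wanted and its entry "missing" are hypothetical, added only to illustrate the difference.

```python
import pandas as pd

# Example frame and lists from the question.
df = pd.DataFrame({"ID": ["a", "b", "c"], "A": [0, 1, 2], "B": [0, 1, 2]})
keep_list = ["ID", "A"]
recode_list = ["ID", "B"]

# Plain list indexing selects columns by name.
df_keep = df[keep_list]
df_recode = df[recode_list]

# Hypothetical list with a name that is not a column of df.
wanted = ["ID", "A", "missing"]

# Plain indexing raises KeyError for the unknown name...
try:
    df[wanted]
    strict_failed = False
except KeyError:
    strict_failed = True

# ...while intersection silently keeps only the names that exist.
df_safe = df[df.columns.intersection(wanted)]
print(df_safe)
```

Plain indexing is the stricter choice when a missing column should be treated as an error; intersection is more forgiving.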

python - append only select columns as rows

The original file has multiple columns, but there are lots of blanks, and I want to rearrange it so that there is one nice column with the info. Starting with 910 rows, 51 cols (newFile df) -> want 910+x rows, 3 cols (final df). final df has 910 rows.
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
I have this piece of code to go through newFile and, if column 3+j is not null, copy columns 0, 1 and 3+j to a new row. I tried append(), but it adds not only rows but also a bunch of columns with NaNs again (like the original file).
Any suggestions?

Your problem is that you are using a DataFrame and keeping the column names, so appending a row under a new column name fills that column with NaN for the rest of the dataframe.
Plus, your code is really inefficient given the double for loop.
Here is my solution using melt():
import numpy as np
import pandas as pd

# creating example df
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)), columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))
# reconstructing df in long format, keeping the first two columns (index 0 and 1) as identifiers
df = df.melt(id_vars=list(df.columns[0:2]))
# dropping the values that are null
df.dropna(subset=['value'], inplace=True)
# if you want to keep the information about which column each value came from, stop here; otherwise do
df.drop(['variable'], axis=1, inplace=True)
print(df)
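Because the random example above contains no blanks, the effect of dropna is easier to see on a tiny frame with gaps. This is a sketch under assumptions: the frame new_file and its column names are invented, and column index 2 is left out of value_vars to mirror the question's selection of columns 0, 1 and 3+j.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of newFile: two id columns, one unused column, sparse values.
new_file = pd.DataFrame({
    "id": [1, 2],
    "name": ["x", "y"],
    "skip": ["-", "-"],
    "v1": [10.0, np.nan],
    "v2": [np.nan, 20.0],
    "v3": [30.0, np.nan],
})

final = (
    new_file
    .melt(id_vars=list(new_file.columns[:2]),     # columns 0 and 1 are repeated on every row
          value_vars=list(new_file.columns[3:]))  # columns 3 onward become rows
    .dropna(subset=["value"])                     # discard the blank cells
    .drop(columns="variable")                     # forget which column a value came from
    .reset_index(drop=True)
)
print(final)
```

Each non-blank cell becomes one row of three columns, which is the shape the question asks for.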

convert group of repeated columns to one column each using python

I have a CSV file with repeated groups of columns, and I want to convert each repeated group into a single column.
I know that pandas' melt function can be used for this kind of problem, but only when the repeated columns belong to a single variable.
I already found a simple solution, but I don't think it's the best. I put the repeated columns of each variable into a list, then put all of those lists into a bigger list.
Then, iterating over the bigger list, I use melt on each variable (the list of repeated columns of the same group).
Finally, I concatenate the new dataframes into one dataframe.
Here is my code:
import pandas as pd

file_name = 'file.xlsx'
df_final = pd.DataFrame()
# create lists to hold headers & other variables
HEADERS = []
A = []
B = []
C = []
# Read Excel file
df = pd.read_excel(file_name, sheet_name='Sheet1')
# create a list of all the columns
columns = list(df)
# split columns list into headers and other variables
for col in columns:
    if col.startswith('A'):
        A.append(col)
    elif col.startswith('B'):
        B.append(col)
    elif col.startswith('C'):
        C.append(col)
    else:
        HEADERS.append(col)
# For headers take into account only the first 17 variables
HEADERS = HEADERS[:17]
# group column variables
All_cols = [A, B, C]
# Create a final DF
for group in All_cols:
    df_x = pd.melt(df,
                   id_vars=HEADERS,
                   value_vars=group,
                   var_name=group[0],
                   value_name=group[0] + '_Val')
    # Concatenate DataFrames
    df_final = pd.concat([df_final, df_x], axis=1)
# Delete duplicate columns
df_final = df_final.loc[:, ~df_final.columns.duplicated()]
I want to find a more maintainable solution to my problem, and I want a dataframe for every group of columns (same variable) as a result.
As a beginner in Python, I can't find a way of doing this.
I'm attaching an image that explains what I want in case I didn't make it clear enough.
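One possible way to get a separate dataframe per group, sketched under assumptions rather than offered as a definitive answer, is a dictionary comprehension keyed by the group prefix. The column names id, A1, A2, B1, B2 and the prefix list are hypothetical; the real file's headers are not shown in the question.

```python
import pandas as pd

# Hypothetical wide frame: one header column and two repeated groups, A* and B*.
df = pd.DataFrame({
    "id": [1, 2],
    "A1": [1, 2], "A2": [3, 4],
    "B1": [5, 6], "B2": [7, 8],
})
headers = ["id"]        # identifier columns repeated in every result
prefixes = ["A", "B"]   # one entry per repeated group

# One long-format dataframe per group, keyed by the group's prefix.
long_frames = {
    prefix: df.melt(
        id_vars=headers,
        value_vars=[c for c in df.columns if c.startswith(prefix)],
        var_name=prefix,
        value_name=f"{prefix}_Val",
    )
    for prefix in prefixes
}
print(long_frames["A"])
```

Adding a new group then only means adding its prefix to the list, which may be easier to maintain than one list variable per group.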

Deleting specific columns pandas

I have the following code:
dfs = glob.glob(path + "/*.csv")
df = pd.concat([pd.read_csv(df) for df in dfs], axis=1, ignore_index=False)
df1 = df.loc[:,~df.columns.duplicated()]
df1.to_csv("userpath.csv")
The purpose of this code is to take multiple csv files, all taken from the same database, and merge them side by side. These files all have the same rows with different column names, but have the same codes on the first row. For example, csv file one will have J1_01,J1_02,J2_01,J2_02..... and then the codes repeat in the next merged csv file: J1_01,J1_02,J2_01,J2_02,J3_01.... All the csv files have varying columns. The second row provides the title description of the column's value.
Each csv file has three columns that give the name and the ID number of the row, for example: Id, Id2, Label Name. I want to keep the first instance of these three and delete the remaining duplicates. I used the code df.loc[:,~df.columns.duplicated()]; however, since J1_01,J1_02,J2_01,J2_02,J3_01.... will eventually be duplicated as each new csv file is merged, I lose some columns. Is there any way to make the df.loc[:,~df.columns.duplicated()] code drop only the duplicates of the three specific columns Id, Id2, Label Name, after keeping the first three? Thanks!
As a follow-up question, if anyone is willing to help: if I want to replace specific characters present in each column (":", ";" or spaces) with, say, an underscore, is there any way to do this with pandas? Thanks again!
Edit: Here's a screenshot of the merged csv file.
I want to keep the first instance of 'GEO.id', 'GEO.id2' and 'Geo.displ' and delete these three columns any time they are repeated.

From your image it seems that the columns you want to keep are the columns that begin with GEO. To do this, you can use a regex to match the names, get the positions of these columns, and then slice the dataframe based on the column positions.
import re

pattern = r'GEO'  # or just "id" or whatever pattern best matches your data
# Returns the list of column positions whose names match your pattern (ignoring case)
match_idx = [i for i, e in enumerate(df.columns) if re.search(pattern, e, flags=re.IGNORECASE)]
# Mark every match after the first three for dropping (the first three are the ones you want to keep)
drop_cols = match_idx[3:]
# Now choose all column positions that are not among the columns you're dropping
usecols = [idx for idx, e in enumerate(df.columns) if idx not in drop_cols]
# Then select your data
df1 = df.iloc[:, usecols]
Note: if you try to select a single column like df['GEO.id'], it will return all the columns called GEO.id, which is why we have to drop the columns by position and not by name.
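An alternative that stays close to the asker's duplicated() call is to combine it with isin, so that only repeats of the named id columns are dropped. This is a sketch on invented data: the two small frames and the id_cols list are assumptions. The last line also touches the follow-up question, assuming it refers to characters in the column names rather than in the cell values.

```python
import pandas as pd

# Hypothetical stand-ins for two csv files sharing id columns and a data column name.
a = pd.DataFrame({"GEO.id": [1], "GEO.id2": [2], "J1_01": [3]})
b = pd.DataFrame({"GEO.id": [1], "GEO.id2": [2], "J1_01": [4], "J2 01": [5]})
df = pd.concat([a, b], axis=1)

id_cols = ["GEO.id", "GEO.id2"]  # columns whose repeats should be removed

# True only for columns that are a repeat AND one of the id columns.
is_repeat_id = df.columns.duplicated() & df.columns.isin(id_cols)
df1 = df.loc[:, ~is_repeat_id].copy()

# Replace ':', ';' and spaces in the column names with underscores.
df1.columns = df1.columns.str.replace(r"[:; ]", "_", regex=True)
print(df1.columns.tolist())
```

Duplicated data columns such as J1_01 survive because they are not listed in id_cols.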

We Keep Coding
