How to combine multiple columns into one single block - python

I'm trying to achieve a dataframe transformation (kind of complicated for me) with Pandas, see the image below.
The original dataframe source is an Excel sheet (here is an example) that looks exactly like the input here:
INPUT → OUTPUT (made by draw.io)
Basically, I need to do these transformations in order:
Select (in each block) the first four lines + the last two lines
Stack all the blocks together
Drop the last three unnamed columns
Select columns A and E
Fill down the column A
Create a new column N1 that holds a sequence of values (ID-01 to ID-06)
Create a new column N2 that concatenates the first value of the block and its number
For that, I wrote the code below, which unfortunately returns a [0 rows × 56 columns] dataframe:
import pandas as pd
myFile = r"C:\Users\wasp_96b\Desktop\ExcelSheet.xlsx"
df1 = pd.read_excel(myFile, sheet_name = 'Sheet1')
df2 = (pd.wide_to_long(df1.reset_index(), 'A' ,i='index',j='value').reset_index(drop=True))
df2.ffill(axis = 0)
df2.insert(2, 'N1', 'ID-' + str(range(1, 1 + len(df2))))
df2.insert(3, 'N2', len(df2)//5)
display(df2)
Do you have any idea or explanation for this behavior?
Is there any other way I can obtain the result I'm looking for?

The column names in your code do not match those in the data. However, from the data and the output you want, I think I can solve your query. The code is very specific to the data you provided, so you may need to adapt it later.
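As a quick illustration of why the original call comes back empty: pd.wide_to_long only picks up columns named as the stub plus a numeric suffix, so with stubname 'A' and no columns named A1, A2, ... it returns an empty frame. A minimal sketch with made-up column names:
import pandas as pd

# columns A1 and A2 match the stub 'A'; anything else is left alone
wide = pd.DataFrame({'id': [0, 1], 'A1': ['x', 'y'], 'A2': ['z', 'w']})
long = pd.wide_to_long(wide, stubnames='A', i='id', j='num')
print(long)
# with no columns matching the stub (e.g. only 'Housing', 'Elements', ...),
# the same call returns a [0 rows x N columns] frame, as in the question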
CODE
import pandas as pd
myFile = "ExcelSheet.xlsx"
df = pd.read_excel(myFile, sheet_name='Sheet1')
# Forward-fill the column
df["Housing"] = df["Housing"].ffill()
# Select the first 4 lines and last two lines
df = pd.concat([df.head(4), df.tail(2)]).reset_index(drop=True)
# Drop the unnecessary columns
df = df.drop(columns=[col for col in df.columns if not (col.startswith("Elements") or col == "Housing")])
df.rename(columns={"Elements": "Elements.0"}, inplace=True)
# Stack all columns
df = pd.wide_to_long(df.reset_index(), stubnames=["Elements."], i="index", j="N2").reset_index("N2")
df.rename(columns={"Elements.": "Elements"}, inplace=True)
# Adding N1 and N2
df["N1"] = "ID_" + (df.index + 1).astype("str")
df["N2"] = df["Housing"] + "-" + (df["N2"] + 1).astype("str")
# Finishing up
df = df[["Housing", "Elements", "N1", "N2"]].reset_index(drop=True)
print(df.head(12))
OUTPUT (only the first 12 rows)
Housing Elements N1 N2
0 OID1 1 ID_1 OID1-1
1 OID1 M-0368 ID_2 OID1-1
2 OID1 JUM ID_3 OID1-1
3 OID1 NODE-1 ID_4 OID1-1
4 OID4 BTM-B ID_5 OID4-1
5 OID4 1 ID_6 OID4-1
6 OID1 1 ID_1 OID1-2
7 OID1 M-0379 ID_2 OID1-2
8 OID1 JUM ID_3 OID1-2
9 OID1 NODE-2 ID_4 OID1-2
10 OID4 BTM-B ID_5 OID4-2
11 OID4 2 ID_6 OID4-2

Related

Converting working code into a method that takes dataframe column values as input and returns 0 and 1 as values of a new column

I am new to core Python. I have working code that I need to convert into a method.
I have around 50k rows of data with 30 columns. Of those 30 columns, three are important for this requirement: Id, Code, and bill_id. I need to populate a new column, "multiple_instance", with 0s and 1s, so the final dataframe will contain the 50k rows with 31 columns. The 'Code' column contains any number of codes, hence I filter down to the codes of interest and apply the logic to the rest.
I need to pass these 3 columns to a method() that returns the 0s and 1s.
Note: multiple_instance_codes is a variable that can be changed later.
import numpy as np
import pandas as pd

multiple_instance_codes = ['A', 'B', 'C', 'D']
filt = df['Code'].str.contains('|'.join(multiple_instance_codes), na=False, case=False)
df_mul = df[filt]
df_temp = df_mul.groupby(['Id'])[['Code']].size().reset_index(name='count')
df_mul = df_mul.merge(df_temp, on='Id', how='left')
df_mul['Cumulative_Sum'] = df_mul.groupby(['bill_id'])['count'].cumsum()
df_mul['multiple_instance'] = np.where(df_mul['Cumulative_Sum'] > 1, 1, 0)
Sample data:
bill_id Id Code Cumulative_Sum multiple_instance
10 1 B 1 0
10 2 A 2 1
10 3 C 3 1
10 4 A 4 1
Never mind, I completed it and it is working fine.
def multiple_instance(df):
    df_scored = df.copy()
    filt = df_scored['Code'].str.contains('|'.join(multiple_instance_codes), na=False, case=False)
    df1 = df_scored[filt]
    df_temp = df1.groupby(['Id'])[['Code']].size().reset_index(name='count')
    df1 = df1.merge(df_temp, on='Id', how='left')
    df1['Cumulative_Sum'] = df1.groupby(['bill_id'])['count'].cumsum()
    # left merge keeps the rows whose Code was filtered out (their Cumulative_Sum stays NaN)
    df_scored = df_scored.merge(df1, how='left')
    df_scored['multiple_instance'] = np.where(df_scored['Cumulative_Sum'] > 1, 1, 0)
    return df_scored
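A quick way to exercise the method on hypothetical toy data (assuming numpy/pandas are imported and multiple_instance_codes is defined as above):
toy = pd.DataFrame({'bill_id': [10, 10, 10, 10],
                    'Id': [1, 2, 3, 4],
                    'Code': ['B', 'A', 'C', 'A']})
print(multiple_instance(toy)[['bill_id', 'Id', 'Code', 'multiple_instance']])
# reproduces the sample data above: the first row gets 0, the rest get 1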

How to compare 2 data frames in python and highlight the differences?

I am trying to compare 2 files; one is in xlsx and the other in csv format.
File1.xlsx (not actual data)
Title Flag Price total ...more columns
0 A Y 12 300
1 B N 15 700
2 C N 18 1000
..
..
more rows
File2.csv (not actual data)
Title Flag Price total ...more columns
0 E Y 7 234
1 B N 16 600
2 A Y 12 300
3 C N 17 1000
..
..
more rows
I used Pandas and loaded those files into dataframes. There are no unique columns (to serve as an id) in the files, and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried a few things but I am not getting the differences as expected.
If I use the merge function as below, I get output with values only from File 1.
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
Output I am getting:
Title Attention_Needed Price total
1 B N 15 700
2 C N 18 1000
This output does not show the full diff, as the record with Title 'E' is missing.
I also tried using pandas merge:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
and the output for the above was:
Title Flag Price total Exist
0 A Y 12 300 both
1 B N 15 700 left_only
2 C N 18 1000 left_only
3 E Y 7 234 right_only
4 B N 16 600 right_only
5 C N 17 1000 right_only
The problem with the above output is that it shows records from both dataframes, which becomes very difficult to work with when there are thousands of records in each dataframe.
The output I am looking for (for the differences) adds an extra column ("Comments") with a message such as matching, the exact difference, new, etc., or something along those lines:
Title Flag Price total Comments
0 A Y 12 300 matching
1 B N 15 700 Price, total different
2 C N 18 1000 Price different
3 E Y 7 234 New record
If the above output is not possible, please suggest another way to solve this.
PS: This is my first question here, so please let me know if you need more details.
Rows in DF1 Which Are Not Available in DF2
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
Rows in DF2 Which Are Not Available in DF1
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
If you're differentiating by row, not column:
pd.concat([df1,df2]).drop_duplicates(keep=False)
If each df has the same columns and each column should be compared individually:
for col in df1.columns:
    print(col, set(df1[col]).symmetric_difference(df2[col]))
# WARNING: this way of getting column diffs won't keep row order;
# sets are unordered, so you learn which values differ, not where they sit
Let's assume df1 (left) is our "source of truth" for what's considered an original record.
After running
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
take the output and split it into two dfs:
df1 = diff_df[diff_df["Exist"].isin(["both", "left_only"])]
df2 = diff_df[diff_df["Exist"] == "right_only"]
Right now, if you drop the "Exist" column from df1, you'll have the records whose comment would be "matching".
Let's assume you add the 'comments' column to df1.
You could say that everything in df2 is a new record, but that would disregard the "price/total different" case.
If you really want the difference comment, now comes a tricky bit: the 'how' depends on which columns matter most (title > flag > ...) and how much they matter (a weighting system).
Once you have a weighting system determined, you need a 'scoring' method that compares two rows to see how similar they are, based on the column ranking you chose.
# distributes weight so the first column is heaviest and the last is lightest,
# with a total weight of 100
# (if I were good I'd do this with numpy, not manually)
def getWeights(l):
    weights = [0 for col in l]
    total = 100
    while total > 0:
        for i, e in enumerate(l):
            for j in range(i + 1):
                weights[j] += 1
                total -= 1
                if total == 0:  # stop exactly at 100
                    return weights
    return weights

def scoreRows(row1, row2):
    s = 0
    for i, colName in enumerate(colRank):
        if row1[colName] == row2[colName]:
            s += weights[i]
    return s

colRank = ['Title', 'Flag']
weights = getWeights(colRank)
Let's say only these two columns matter and the rest are considered 'modifications' to an original row.
That is to say, if a row in df2 doesn't have a matching Title OR Flag for ANY row in df1, that row is a new record.
What makes a row a new record is completely up to you.
Another way of thinking about it is that you need to determine what makes some row in df2 'differ' from some particular row in df1 rather than from a different row in df1.
If you have two rows in df1:
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]
and you want to compare this row against that df:
[1, 6, 5, 4]
This row has the same first element as both, the same second element as row2, and the same fourth element as row1.
So which row does it 'differ' from?
If this is a question you aren't sure how to answer, consider cutting your losses and just keeping df1 as "good" records and df2 as "new" records.
If you're sticking with the 'differs' comment, the next step is to separate truly new records from records with slight differences by building a score table.
# To recap:
# df1 has "both" and "left_only" records (the "matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = []
# list of lists: each inner list index correlates to the index of df2;
# inner lists are made up of tuples whose first element is the actual row
# from df1 that was matched and whose second element is the matching score
# (out of 100)
for i, row1 in df2.iterrows():
    thisRowsScores = []
    # df2 rows come first because they are what we are scoring
    for j, row2 in df1.iterrows():
        s = scoreRows(row1, row2)
        if s > 0:  # only save rows and scores that matter
            thisRowsScores.append((row2, s))
    # at this point, you can either keep the whole score table and have the
    # comments relate differences back to several rows, or just keep the best
    # score like I'll be doing
    thisRowsScores.sort(key=lambda x: x[1], reverse=True)  # sort by score
    # append an empty list if no good match was found in df1
    # (alternatively, remove reverse=True above and index at -1)
    rowScores.append(thisRowsScores[0] if thisRowsScores else [])
The reason we save the row itself is so that it can be looked up in df1 in order to add the "differs" comments.
At this point, let's just say that df1 already has the comment "matching" added to it.
Now that each row in df2 has a score and a reference to the row it matched best in df1, we can edit the comment on that row in df1 to list the columns with different values.
But at that point, I feel the df also needs a reference back to df2 so that the records and values those differences refer to are actually retrievable.
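If Title can serve as an approximate key (an assumption; the OP said there is no unique column), a much shorter route to the desired "Comments" column is an outer merge of the original File 1 / File 2 frames on Title, followed by a column-wise comparison. A minimal sketch:
import pandas as pd

value_cols = ['Flag', 'Price', 'total']  # columns to compare, per the sample data
merged = df1.merge(df2, on='Title', how='outer',
                   suffixes=('_old', '_new'), indicator=True)

def comment(row):
    if row['_merge'] == 'right_only':
        return 'New record'
    if row['_merge'] == 'left_only':
        return 'Missing from File 2'
    diffs = [c for c in value_cols if row[c + '_old'] != row[c + '_new']]
    return ', '.join(diffs) + ' different' if diffs else 'matching'

merged['Comments'] = merged.apply(comment, axis=1)
With 700K rows this stays far cheaper than the O(n*m) row-scoring loop above, at the cost of trusting Title as the join key.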

Pandas: How to use one value from a column (the value repeats itself) as a header for another column, multiple times, using wildcards

I have data with multiple inputs from a semi-structured csv, and I am trying to use one single (first) value from a set of columns (more than 500) as the header for another set of columns containing similar headers (another 500).
After reading it in, I got something like this:
import pandas as pd, numpy as np
df = pd.DataFrame({'Service': np.arange(8),
                   'Ticket': np.random.rand(8),
                   'Var_1': np.random.rand(8),  # values column
                   'Var_1_View': 'temp temp temp temp temp temp temp temp'.split(),  # header for the values column
                   'Var_2': np.arange(8),
                   'Var_2_View': 'pres pres pres pres pres pres pres pres'.split(),
                   'Var_3': np.arange(8) * 2,
                   'Var_3_View': 'shift shift shift shift shift shift shift shift'.split(),
                   'D': np.arange(8) * 5,
                   'Mess_3': np.random.rand(8),
                   'Mess_3_View': 'id id id id id id id id'.split(),
                   'E': np.arange(8)})
Headers of columns containing values end with a number, from _# up to a three-digit _### (more than 500, to be precise).
Headers with a description of the values end with the text _View.
I've created two dfs, one containing and the other not containing the expression _View:
df_headers = df.iloc[:, df.columns.str.contains('View')]   # wanted headers, on columns containing values
df_values = df.iloc[:, ~df.columns.str.contains('View')]   # headers should be replaced here
My idea was to extract the first values from df_headers as a list and, using df.replace or df.rename, change the headers on df_values, which contains the values.
I could do it manually, but I have a huge df with different prefixes and suffixes, though _View is always the description for the nearest column containing values.
As a result I would have df_values with the new headers, plus the columns where this rule doesn't apply (Ticket, D, E, etc.).
Since it's my first question, it would be great to have feedback about clarity and explanation; any other comments are welcome too.
It's not completely clear to me what you want to achieve so this might be off:
view_cols = {col for col in df.columns if col.endswith("_View")}
rename_dict = {
    col.replace("_View", ""): df[col].iat[0] for col in view_cols
}
new_cols = [col for col in df.columns if col not in view_cols]
df_new = df[new_cols].rename(columns=rename_dict)
Result:
Service Ticket temp pres shift D id E
0 0 0.623941 0.934402 0 0 0 0.644999 0
1 1 0.122866 0.918892 1 2 5 0.675976 1
2 2 0.472081 0.790443 2 4 10 0.825020 2
3 3 0.914086 0.849609 3 6 15 0.357074 3
4 4 0.684477 0.729126 4 8 20 0.010928 4
5 5 0.132002 0.673680 5 10 25 0.884599 5
6 6 0.841921 0.224638 6 12 30 0.197387 6
7 7 0.721800 0.412439 7 14 35 0.875199 7

How to modify data after replicating rows in Pandas?

I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since the rows are duplicated, the edit will affect both copies.
Is there a method to first create duplicates and then modify the data only in the duplicates created?
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only in the duplicates created, in this case the bottom two rows. But I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd

df = pd.read_excel('so.xlsx')
a = df['code'] == 3
b = df[a].copy()          # copy so the edit below doesn't touch the original rows
b.loc[2, 'region'] = 'N'  # modify the duplicates before appending
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
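Note that DataFrame.append was removed in pandas 2.0; the same replicate-then-modify pattern with pd.concat would look like this (a sketch; the 'code' filter value and the 9999 replacement are just placeholders):
import pandas as pd

df = pd.read_excel('so.xlsx')
b = df[df['code'] == 1234].copy()   # copy so the edits stay local to the duplicates
b['code'] = 9999                    # modify only the duplicated rows
df = pd.concat([df, b], ignore_index=True)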

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n×k, and in the process I will need to add columns whose dimensions are m×1, where m ∈ [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns with a different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example, and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat but don't ignore_index (the default value of ignore_index is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas

# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
#    Age Gender      Name
# 0   10      M    Nate A
# 1   12      F  Jessie A
# 2   13      F  Daniel H
# 3  NaN    NaN    John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes as DataFrame columns.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of all the lists:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all lists to the determined max length (padding with empty strings):
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length, so create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
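The same padding can be written more compactly with itertools.zip_longest, which pads the shorter iterables while zipping (a sketch, not from the original answer):
import pandas as pd
from itertools import zip_longest

a = [0, 1, 2, 3]
b = list(range(10))
c = [0, 1]

# zip_longest fills missing entries with fillvalue as it zips row by row
rows = zip_longest(a, b, c, fillvalue='')
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])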
I had the same issue: two different dataframes without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, merge does not work, even after adding a temporary column to both dfs and then dropping it, because this method forces both dfs to the same length and repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
frames = [df1, df2]
df_final = pd.concat(frames, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
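The reset_index calls matter because concat(axis=1) aligns rows on the index, not on position. A small illustration with hypothetical frames:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]}, index=[5, 6])  # non-default index
df2 = pd.DataFrame({'b': [3, 4]})                # default 0..1 index

print(pd.concat([df1, df2], axis=1))             # rows don't line up; NaNs appear
print(pd.concat([df1.reset_index(drop=True), df2], axis=1))  # side by side as intended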
In case somebody would like to replace a specific column with one of a different size instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create an enumerated dict from the list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    # build a new DataFrame from the dict and return it
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
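A quick usage sketch (hypothetical frame, assuming pandas is imported as pd):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df = fill_column(df, [10, 20], 'B')  # 'B' is shorter, so its last entry becomes NaN
print(df)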
