I am trying to split a column, but I noticed the split changing other values. For example, some values of row 10 get exchanged with row 8. Why is that?
Actual data for ID 10:
| vat_number | email | foi_mail | website
| 10 | abc#test.com;example#test.com;example#test.com | xyz#test.com | example.com
After executing this line of code:
base_data[['email', 'email_1', 'email_2']] = pd.DataFrame(
    base_data.email.str.split(';').tolist(),
    columns=['email', 'email_1', 'email_2'])
base_data becomes:
| vat_number | email | foi_mail | website | email_1 | email_2
| 10 | some other row value | some other row value | example.com | ------ | -----
The data contains thousands of rows, but I showed only one.
Try a table in a table (a list of lists):
def test():
    base_data = []
    base_data.append(['12', '32'])
    base_data.append(['352', '335'])
    base_data.append(['232', '32'])
    print(base_data)
    a = base_data[0]
    print(a)
    print(a[0])
    print(a[1])
    input("Enter to continue. . .")
and use a loop to add rows, as sketched below:
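A rough sketch of that idea applied to the question's data (the sample strings are my own): keep each record as a list, split the email field, and append the pieces in a loop.

records = ['abc#test.com;example#test.com', 'xyz#test.com;foo#test.com']
base_data = []
for email_field in records:
    base_data.append(email_field.split(';'))  # each row becomes a list of addresses
print(base_data)  # [['abc#test.com', 'example#test.com'], ['xyz#test.com', 'foo#test.com']]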
If I understand the case correctly, I believe you need something like this:
base_data = base_data.merge(
    base_data['email'].str.split(';', expand=True)
                      .rename(columns={0: 'email', 1: 'email_1', 2: 'email_2'}),
    left_index=True, right_index=True)
Here is the logic explanation:
import pandas as pd

a1 = list('abcdef')
b1 = list('fedcba')
c1 = [f'{x[0]};{x[1]}' for x in zip(a1, b1)]
df1 = pd.DataFrame({'c1':c1})
df1
Out[1]:
c1
0 a;f
1 b;e
2 c;d
3 d;c
4 e;b
5 f;a
df1 = df1.merge(df1['c1'].str.split(';', expand=True)
                         .rename(columns={0: 'c2', 1: 'c3'}),
                left_index=True, right_index=True)
df1
Out[2]:
c1 c2 c3
0 a;f a f
1 b;e b e
2 c;d c d
3 d;c d c
4 e;b e b
5 f;a f a
Use the expand parameter of .str.split:
import pandas as pd
# your dataframe
vat_number email foi_mail website
NaN abc#test.com;example#test.com;example#test.com xyz#test.com example.com
# split and expand
df[['email_1', 'email_2', 'email_3']] = df['email'].str.split(';', expand=True)
# drop `email` col
df.drop(columns='email', inplace=True)
# result
vat_number foi_mail website email_1 email_2 email_3
NaN xyz#test.com example.com abc#test.com example#test.com example#test.com
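As for the why: a likely cause (an assumption based on the code shown, not something stated in the question) is index alignment. pd.DataFrame(...tolist()) builds a frame with a fresh 0..n-1 RangeIndex, so assigning it back aligns rows by index label rather than by position. A minimal sketch:

import pandas as pd

# base_data with a non-default index order, e.g. after filtering or sorting
base_data = pd.DataFrame({'email': ['a#test.com;b#test.com',
                                    'c#test.com;d#test.com']},
                         index=[1, 0])
parts = pd.DataFrame(base_data['email'].str.split(';').tolist(),
                     columns=['email', 'email_1'])  # fresh RangeIndex [0, 1]
base_data[['email', 'email_1']] = parts  # aligns on labels, not positions
print(base_data)  # label 1 now holds the split of label 0's emails, and vice versa

str.split(';', expand=True) keeps the original index, which is why it avoids the shuffle.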
I have two lists (l1 and l2) consisting of required columns of two dataframes (df1 and df2) that I would like to apply operation upon.
One list has all required columns ending with an _x and the other with an _y.
I would like to subtract the values of these columns by index, for instance,
df_final['first_col_sub'] = first element of l2 - first element of l1
df_final['second_col_sub'] = second element of l2 - second element of l1 and so on
Both dataframes actually have the same column headers, so I cannot use the headers directly to perform the operation; that is why I appended _x and _y to them, hoping it would help with the subtraction.
For example,
df1,
1_x | 2_x | 3_x | 4_x | 5_x ...
1 | 2 | 3 | 4 | 5
df2,
1_y | 2_y | 3_y | 4_y | 5_y ...
5 | 4 | 3 | 2 | 1
df_final,
first_col_sub | second_col_sub | third_col_sub | fourth_col_sub | fifth_col_sub ...
4 | 2 | 0 | -2 | -4
How may I achieve this? Any help is greatly appreciated.
(If there is anything unclear please let me know.)
If both dataframes have the same number of rows and the same number of columns, with matching values before the _, use:
l1 = ['1_x','2_x'...]
l2 = ['1_y','2_y'...]
df2[l2].sub(df1[l1].to_numpy())
Or, better, remove _x and _y so that the values before _ are matched in the subtraction:
f = lambda x: x[:-2]
df2[l2].rename(columns=f).sub(df1[l1].rename(columns=f))
It is also possible to filter the columns ending in y and x with DataFrame.filter:
f = lambda x: x[:-2]
df2.filter(regex='y$').rename(columns=f).sub(df1.filter(regex='x$').rename(columns=f))
Or:
df22 = df2.filter(regex='y$')
df11 = df1.filter(regex='x$')
df22.columns = df22.columns.str[:-2]
df11.columns = df11.columns.str[:-2]
df22.sub(df11)
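A quick, self-contained check of the rename-then-subtract idea on the question's sample values (the frame construction here is my own):

import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['1_x', '2_x', '3_x', '4_x', '5_x'])
df2 = pd.DataFrame([[5, 4, 3, 2, 1]], columns=['1_y', '2_y', '3_y', '4_y', '5_y'])

f = lambda x: x[:-2]  # strip the _x / _y suffix so the columns line up
print(df2.rename(columns=f).sub(df1.rename(columns=f)))
#    1  2  3  4  5
# 0  4  2  0 -2 -4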
You can subtract pandas dataframes directly, provided the column labels already match:
df_final = df2 - df1
Or, matching columns by their base names:
for x in df_x.columns:
    base = x[0:-2]
    if base + '_y' in df_y.columns:
        df_f[base + '_final'] = df_x[x] - df_y[base + '_y']
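A runnable version of that loop, with assumed two-column inputs; note it computes df_x minus df_y, so swap the operands if you want y - x as in the question:

import pandas as pd

df_x = pd.DataFrame([[1, 2]], columns=['1_x', '2_x'])
df_y = pd.DataFrame([[5, 4]], columns=['1_y', '2_y'])
df_f = pd.DataFrame(index=df_x.index)  # result frame, same rows as the inputs

for x in df_x.columns:
    base = x[0:-2]  # column name without the _x suffix
    if base + '_y' in df_y.columns:
        df_f[base + '_final'] = df_x[x] - df_y[base + '_y']

print(df_f)  # 1_final: -4, 2_final: -2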
I have 2 dataframes; both have an identical Emails column, and each has a Unique ID column. The code I used to create them looks like this:
import pandas as pd
df = pd.read_excel(r'C:\Users\file.xlsx')
df['healthAssessment'] = df['ltv']*.01*df['Employment.Weight']*df['Income_Per_Year']/df['Debits_Per_Year'].astype(int)
df0 = df.loc[df['receivedHealthEmail'].str.contains('No Email Sent')]
df2 = df0.loc[df['healthAssessment'] > 2.5]
df3 = df2.loc[df['Emails'].str.contains('#')]
print (df)
df4 = df
df1 = df3
receiver = df1['Emails'].astype(str)
receivers = receiver
df1['receivedHealthEmail'] = receiver
print (df1)
The first dataframe it produces looks roughly like this:
Unique ID | Emails | receivedHealthEmail| healthAssessment
0 | aaaaaaaaaa#aaaaaa | No Email Sent| 2.443849
1 | bbbbbbbbbbbbb#bbb | No Email Sent| 3.809817
2 | ccccccccccccc#ccc | No Email Sent| 2.952871
3 | ddddddddddddd#ddd | No Email Sent| 2.564398
4 | eeeeeeeeeee#eeeee | No Email Sent| 3.315868
... | ... | ... ...
3294 | no email provided | No Email Sent| 7.674677
The second dataframe looks like this:
Unique ID | Emails | receivedHealthEmail | healthAssessment
1 | bbbbbbbbbbbbb#bbb| bbbbbbbbbbbbb#bbb| 3.809817
2 | cccccccccccccc#cc| cccccccccccccc#cc| 2.952871
3 | ddddddddddddd#ddd| ddddddddddddd#ddd| 2.564398
4 | eeeeeeeeeee#eeeee| eeeeeeeeeee#eeeee| 3.315868
I need a way to overwrite the receivedHealthEmail column in the first dataframe using the values from the second dataframe. Any help is appreciated.
You can merge the 2 dataframes based on UniqueID:
df = df1.merge(df2, on='UniqueID')
df.drop(columns=['receivedHealthEmail_x', 'healthAssessment_x', 'Emails_x'], inplace=True)
print(df)
UniqueID Emails_y receivedHealthEmail_y healthAssessment_y
0 1 bbbbbbbbbbbbb#bbb bbbbbbbbbbbbb#bbb 3.809817
1 2 cccccccccccccc#cc cccccccccccccc#cc 2.952871
2 3 ddddddddddddd#ddd ddddddddddddd#ddd 2.564398
3 4 eeeeeeeeeee#eeeee eeeeeeeeeee#eeeee 3.315868
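Note that merge defaults to an inner join, so IDs present only in df1 drop out. A sketch of an alternative that overwrites the column in place and keeps every row of df1 (assuming both frames share the same UniqueID values where they overlap):

# Align both frames on UniqueID, then let update() overwrite matching rows
df1 = df1.set_index('UniqueID')
df2 = df2.set_index('UniqueID')
df1.update(df2[['receivedHealthEmail']])  # in-place overwrite where IDs match
df1 = df1.reset_index()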
I have two dataframes with similar formats. Both have multi-level indexes and headers. Most of the headers are the same, but df2 has a few additional ones. When I add them up, the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df Format
HeaderA             | Header2             | Header3
xxx1|xxx2|xxx3|xxx4 | xxx1|xxx2|xxx3|xxx4 | xxx1|xxx2|xxx3|xxx4
ColX|ColY | ColA|ColB|ColC|ColD | ColD|ColE|ColF|ColG | ColH|ColI|ColJ|ColDK
1   | ds  | 1  |    | +1 | -1   | ...
2   | dh  | ...
3   | ge  | ...
4   | ew  | ...
5   | er  | ...
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
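A minimal flat-column demonstration of the reordering trick (toy frames of my own; the same pattern applies to the MultiIndex columns above):

import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4, 5]], columns=['a', 'c', 'b'])

total = df1.add(df2, fill_value=0)  # the addition alphabetizes the column order
ordered = total[list(df1.columns) + [c for c in df2.columns if c not in df1.columns]]
print(list(ordered.columns))  # ['a', 'b', 'c']: df1's order first, extras appended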
We have 2 Excel files, one with 7.5k records and the other with 7k. We need to compare the data, keeping one specific column from one sheet fixed as the key while comparing against the other sheet.
For Example sheet1:
**Emp_ID| Name| Phone| Address**
-------------------------------------
1 | A | 123 | ABC
-------------------------------------
2 | B | 456 | CBD
-------------------------------------
3 | C | 789 | S
For Example sheet2:
**Emp_ID| Name| Phone| Address**
-------------------------------------
1 | A | 123 | ABC
-------------------------------------
3 | C | 789 | S
The Python comparison should be on the basis of Emp_ID: Emp_ID=2 should be reported as missing when Emp_ID is passed as the argument while executing the Python script.
I am trying this with the xlrd module, but it compares only cell by cell instead of fixing one column and then comparing whole rows against the other Excel file.
import xlrd
from itertools import zip_longest

def compareexcel(oldSheet, newSheet):
    rowb2 = xlrd.open_workbook(oldSheet)
    rowb1 = xlrd.open_workbook(newSheet)
    sheet1 = rowb1.sheet_by_index(0)
    sheet2 = rowb2.sheet_by_index(0)
    for rownum in range(max(sheet1.nrows, sheet2.nrows)):
        if rownum < sheet1.nrows:
            row_rb1 = sheet1.row_values(rownum)
            row_rb2 = sheet2.row_values(rownum)
            for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                if c1 != c2:
                    print("Row {} Col {} - {} != {}".format(rownum + 1, colnum + 1, c1, c2))
I have written a function that searches for a column value in the other sheet; the comparison in the compare function is then based on it.
def search(sheet2, s):
    for row in range(sheet2.nrows):
        if s == sheet2.cell(row, 0).value:
            return (row, 0)
    return (9, 9)  # sentinel meaning "not found"
def compare(oldPerPaxSheet, newPerPaxSheet):
    rb1 = xlrd.open_workbook(oldPerPaxSheet)
    rb2 = xlrd.open_workbook(newPerPaxSheet)
    sheet1 = rb1.sheet_by_index(0)
    sheet2 = rb2.sheet_by_index(0)
    for rownum in range(max(sheet1.nrows, sheet2.nrows)):
        if rownum < sheet1.nrows:
            row_rb1 = sheet1.row_values(rownum)
            print("row_rb1 :", row_rb1)
            search_str = sheet1.cell(rownum, 0).value
            r, c = search(sheet2, search_str)
            if c != 9:
                row_rb2 = sheet2.row_values(r)
                for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                    if c1 != c2:
                        print("Row {} Col {} - {} != {}".format(rownum + 1, colnum + 1, c1, c2))
            else:
                print("Row does not exist in the other sheet")
        else:
            print("Row {} missing".format(rownum + 1))
You can easily use pandas.read_excel for this.
I will make 2 DataFrames with Emp_ID as the index:
import pandas as pd
sheets = pd.read_excel(excel_filename, sheet_name=[old_sheet, new_sheet], index_col=0)
sheet1 = sheets[old_sheet]
sheet2 = sheets[new_sheet]
I added some rows to have clearer differences
sheet1
Name Phone Address
Emp_ID
1 A 123 ABC
2 B 456 CBD
3 C 789 S
5 A 123 ABC
sheet2
Name Phone Address
Emp_ID
1 A 123 ABC
3 C 789 S
4 D 12 A
5 E 123 ABC
Calculating the missing Emp_IDs then becomes very simple:
missing_in_1 = set(sheet2.index) - set(sheet1.index)
missing_in_2 = set(sheet1.index) - set(sheet2.index)
missing_in_1, missing_in_2
({4}, {2})
So sheet1 has no Emp_ID 4, which is in sheet2, and sheet2 lacks a 2, as expected.
Then, to look for the differences, we do an inner join on the 2 sheets:
combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))
combined
Name_1 Phone_1 Address_1 Name_2 Phone_2 Address_2
Emp_ID
1 A 123 ABC A 123 ABC
3 C 789 S C 789 S
5 A 123 ABC E 123 ABC
and loop over the columns of sheet1 to look for differences and save these in a dict
differences = {}
for column in sheet1.columns:
diff = combined[column+'_1'] != combined[column+'_2']
if diff.any():
differences[column] = list(combined[diff].index)
differences
{'Name': [5]}
If you want the full differing rows instead, change the last line to differences[column] = combined[diff]:
differences
{'Name':
Name_1 Phone_1 Address_1 Name_2 Phone_2 Address_2
Emp_ID
5 A 123 ABC E 123 ABC}
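The whole walkthrough condenses into one function; here is a sketch under the same assumptions (Emp_ID in the first column, identical headers in both sheets; the function name is mine):

import pandas as pd

def compare_sheets(excel_filename, old_sheet, new_sheet):
    sheets = pd.read_excel(excel_filename, sheet_name=[old_sheet, new_sheet],
                           index_col=0)
    sheet1, sheet2 = sheets[old_sheet], sheets[new_sheet]
    missing_in_1 = set(sheet2.index) - set(sheet1.index)
    missing_in_2 = set(sheet1.index) - set(sheet2.index)
    combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True,
                        suffixes=('_1', '_2'))
    differences = {}
    for column in sheet1.columns:
        diff = combined[column + '_1'] != combined[column + '_2']
        if diff.any():
            differences[column] = list(combined[diff].index)
    return missing_in_1, missing_in_2, differences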
I have a DataFrame, df, that looks like:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
3 | 2002-10 | HISTR
ID is a student ID, TERM is an academic term, and DISC_1 is the discipline of their major. For each student, I’d like to identify the TERM when (and if) they changed DISC_1, and then create a new DataFrame that reports when. Zero indicates they did not change. The output looks like:
ID | Change
1 | 0
2 | 2004-01
3 | 0
My code below works, but it is very slow. I tried to do this using groupby but was unable to. Could someone explain how I might accomplish this task more efficiently?
import numpy as np

df = df.sort_values(by=['PIDM', 'TERM'])
c = 0
last_PIDM = 0
last_DISC_1 = 0
change = []
for index, row in df.iterrows():
    c = c + 1
    if c > 1:
        row['change'] = np.where((row['PIDM'] == last_PIDM) & (row['DISC_1'] != last_DISC_1),
                                 row['TERM'], 0)
        last_PIDM = row['PIDM']
        last_DISC_1 = row['DISC_1']
    else:
        row['change'] = 0
    change.append(row['change'])
df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()
Here's a start:
df = df.sort_values(['ID', 'TERM'])
gb = df.groupby('ID').DISC_1
df['Change'] = df.TERM[gb.apply(lambda x: x != x.shift().bfill())]
df.Change = df.Change.fillna(0)
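To then collapse to one row per ID as in the desired output, a hedged sketch of the same idea (it flags the first term in the new discipline; using the string '0' keeps the column a single dtype for max()):

import pandas as pd

# sample data from the question; swap in PIDM for the real frame
df = pd.DataFrame({'ID':     [1, 1, 1, 2, 2, 2, 2],
                   'TERM':   ['2003-10', '2002-01', '2002-10',
                              '2003-10', '2004-01', '2004-10', '2005-01'],
                   'DISC_1': ['ECON', 'ECON', 'ECON',
                              'CHEM', 'CHEM', 'ENGN', 'ENGN']})

df = df.sort_values(['ID', 'TERM'])
prev = df.groupby('ID')['DISC_1'].shift()
changed = prev.notna() & (df['DISC_1'] != prev)  # True on the term of a switch
result = (df['TERM'].where(changed, '0')         # keep TERM where changed, else '0'
            .groupby(df['ID']).max()
            .reset_index(name='Change'))
print(result)  # ID 1 -> '0', ID 2 -> '2004-10'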
I've never been a big pandas user, so my solution would involve spitting that df out as a csv, and iterating over each row, while retaining the previous row. If it is properly sorted (first by ID, then by Term date) I might write something like this...
import csv

with open('inputDF.csv', 'r', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        previousline = next(reader)  # grab the first row to compare to the second
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means print and move on to next person
                writer.writerow([previousline[0], termChange])  # write ID, termChange date
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term-changed date
                # termChange = previousline[1]  # in case you want to retain the last date they were in the old discipline
            previousline = line  # store current line as previous and continue loop
        writer.writerow([previousline[0], termChange])  # flush the final person