Python Pandas compare CSV KeyError

I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for reference when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.

Most probably you have leading or trailing whitespace in the reference column name after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
You can "fix" it by adding the sep=r'\s*,\s*' parameter to your pd.read_csv() calls (note that a regex separator makes pandas fall back to the slower Python parsing engine).
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
fixing df2:
data = """\
reference
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
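An alternative fix (a minimal sketch, not from the answer above) is to strip the whitespace from the column names right after reading, which avoids the slower regex separator:
import pandas as pd

df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')

# strip stray whitespace from the headers, so 'reference ' becomes 'reference'
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()

# if the cell values carry stray spaces too, strip them before merging
df2['reference'] = df2['reference'].str.strip()

df3 = pd.merge(df1, df2, on='reference', how='inner')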

Related

How to change values in a string column of a Pandas dataframe?

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
births = [968,155,77,578,973]
BabyDataSet = list(zip(names,births))
df = pd.DataFrame(data = BabyDataSet, columns=['Names','Births'])
df.at[3,'Names'].str.replace(df.at[3,'Names'],'!!!')
I want to change 'John' to '!!!' without directly referring to 'John'.
When I do it this way, it gives me "AttributeError: 'str' object has no attribute 'str'".
import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
births = [968,155,77,578,973]
BabyDataSet = list(zip(names,births))
df = pd.DataFrame(data = BabyDataSet, columns=['Names','Births'])
df.loc[3,'Names'] = '!!!'
print(df)
Output:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 !!! 578
4 Mel 973
You should replace using the Series, not a single value; setting a single value is called assignment:
df['Names'] = df['Names'].str.replace(df.at[3,'Names'],'!!!')
df
Out[329]:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 !!! 578
4 Mel 973
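Another option (a sketch, not from the answers above) is a boolean mask with .loc, which also avoids writing 'John' directly and replaces every matching row:
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
df = pd.DataFrame(list(zip(names, births)), columns=['Names', 'Births'])

# select all rows whose Names equals the value in row 3, then assign
df.loc[df['Names'] == df.at[3, 'Names'], 'Names'] = '!!!'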

Python: how to get unique ID and remove duplicates from column 1 (ID) and column 3 (Description), then get the median for column 2 (Value) in Pandas

ID      Value  Description
123456  116    xx
123456  117    xx
123456  113    xx
123456  109    xz
123456  108    xz
123456  98     xz
121214  115    abc
121214  110    abc
121214  103    abc
121214  117    abz
121214  120    abz
121214  125    abz
151416  114    zxc
151416  135    zxc
151416  127    zxc
151416  145    zxm
151416  125    zxm
151416  121    zxm
The processed table should look like:
ID      xx   xz   abc  abz  zxc  zxm
123456  110  151  0    0    0    0
121214  0    0    132  113  0    0
151416  0    0    0    0    124  115
I went for the approach of the mean, but your "expected output" example doesn't give a mean. Is that me misunderstanding what you mean?
pd.pivot_table(DF, 'Value', index='ID', columns='Description')
should do the trick; the default aggregation function is the mean, so that's ideal. More info can be found in the pivot_table documentation (mind you, DF is the imported dataframe).
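Since the question asks for the median and shows zeros for the missing ID/Description combinations, a minimal sketch along the same lines (with a few made-up rows in the question's shape):
import pandas as pd

DF = pd.DataFrame({'ID': [123456, 123456, 121214],
                   'Value': [116, 117, 115],
                   'Description': ['xx', 'xx', 'abc']})

# median per (ID, Description); fill_value=0 puts zeros where a
# description never occurs for an ID, as in the expected output
out = pd.pivot_table(DF, values='Value', index='ID',
                     columns='Description', aggfunc='median', fill_value=0)
print(out)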
Maybe this approach will work for you?
d = {'ID': [1,1,2,3,3,4,4,4,4,5,5], 'Value': [5,6,7,8,9,7,8,5,1,2,4]}
df = pd.DataFrame(data=d)
unique = set(df['ID'])
value_mean = []
for i in unique:
    a = df[df['ID'] == i]['Value']
    a = a.mean()
    value_mean.append(a)
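The same result can be had in one step with groupby (a sketch, not part of the original answer):
import pandas as pd

d = {'ID': [1,1,2,3,3,4,4,4,4,5,5], 'Value': [5,6,7,8,9,7,8,5,1,2,4]}
df = pd.DataFrame(data=d)

# mean of Value per unique ID; swap in .median() for the median
value_mean = df.groupby('ID')['Value'].mean()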
Well, you have e.g. six 'ID' rows with the value '123456'. If you only want unique 'ID' values, you need to remove five of those rows, and by doing so you will no longer have duplicate 'Description' values either. The question is: do you want unique ID values, unique Description values, or a unique combination of both?
There are probably more options to solve this. What you could do is combine the ID and Description into a new column and remove the duplicates from the DataFrame. Hopefully this helps.
import pandas as pd
a = {'ID': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
     'Value': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6],
     'Description': ['a','a','b','b','c','d','d','a','c','d','e','e','e','a','b']}
df = pd.DataFrame(data=a)
unique_combined = []
for i in range(len(df)):
    unique_combined.append(str(df.iloc[i]['ID']) + df.iloc[i]['Description'])
df['un'] = unique_combined
# assign the result back: drop_duplicates returns a new DataFrame
df = df.drop_duplicates(subset=['un'])
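For what it's worth, the helper column isn't strictly needed: drop_duplicates accepts several columns in subset, so the same deduplication can be written directly (a sketch):
df = df.drop_duplicates(subset=['ID', 'Description'])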

Merge and create multiple columns based on the number of columns present in the dataframe - Pandas

I have a column with numbers separated by commas; the values should be split into new columns.
Site UserId
ABC '456,567,67,96'
DEF '67,987'
The new Dataframe should look like:
Site UserID UserId1 UserId2 UserId3 UserId4
ABC '456,567,67,96' 456 567 67 96
DEF '67,987' 67 987
POC '4321,96,912 4321 87 912
Also, next to each column there should be columns mapping the numbers to the name and phone number from the user table:
user
UserId UserName PhoneNo
4321 EB_Meter 9980688666
987 EB_Meter987 9255488721
912 DG_Meter912 8897634219
567 Ups_Meter567 7263193155
456 Ups_Meter456 8987222112
96 DG_Meter96
67 DGB_Meter
So the final DataFrame is:
Values Value1 Name1 Phone1 Value2 Name2 Phone2 Value3 Name3 Phone3 Value4 Name4 Phone 4
'456,567,67,96' 456 Ups_Meter456 8987222112 567 Ups_Meter567 7263193155 67 DGB_Meter 96 DG_Meter96
'67,987' 67 DGB_Meter 987 EB_Meter987 9255488721
'4321,96,912 4321 EB_Meter 9980688666 96 DG_Meter96 912 DG_Meter912 8897634219
Here multiple columns are added per UserId, so instead of map, melt is used with a left join in merge; the reshaping is done by DataFrame.pivot:
df2['UserId'] = df2['UserId'].astype(str)
df3 = df1['UserId'].str.strip("'").str.split(',',expand=True)
df3 = (df3.reset_index()
.melt('index', value_name='UserId')
.merge(df2, on='UserId', how='left')
.pivot(index='index', columns='variable')
.sort_index(axis=1, level=1, sort_remaining=False)
)
df3.columns = df3.columns.map(lambda x: f'{x[0]}_{x[1] + 1}')
df = df1.join(df3)
print (df)
Site UserId UserId_1 UserName_1 PhoneNo_1 UserId_2 \
0 ABC 456,567,67,96 456 Ups_Meter456 8987222112 567
1 DEF 67,987 67 DGB_Meter NaN 987
UserName_2 PhoneNo_2 UserId_3 UserName_3 PhoneNo_3 UserId_4 \
0 Ups_Meter567 7263193155 67 DGB_Meter NaN 96
1 EB_Meter987 9255488721 None NaN NaN None
UserName_4 PhoneNo_4
0 DG_Meter96 NaN
1 NaN NaN
You can use:
df[[ 'UserId1', 'UserId2', 'UserId3', 'UserId4']] = df['UserId'].str.split(",", expand=True)
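If the number of values per row varies, a sketch that names the new columns automatically instead of hard-coding four of them (the sample frame here is made up to match the question):
import pandas as pd

df = pd.DataFrame({'Site': ['ABC', 'DEF'],
                   'UserId': ["'456,567,67,96'", "'67,987'"]})

# split into as many columns as the longest row needs, label them UserId1..n
split_cols = df['UserId'].str.strip("'").str.split(',', expand=True)
split_cols.columns = [f'UserId{i + 1}' for i in range(split_cols.shape[1])]
df = df.join(split_cols)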

Convert a list of lists to a structured pandas dataframe

I am trying to convert the following data structure to the format below, in Python 3.
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
you can do this.
Step 1: use regular expressions to parse your data, because it is a string (see the re module documentation for more about regular expressions).
import re

raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want? I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
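As an aside (not part of the original answer), the same parse can be written as one dict per record, skipping the numpy step entirely; a minimal sketch:
import re
import pandas as pd

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

# one dict per record, e.g. {'PIN': '123', 'COD': '222', ...}
records = [dict(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', ' '.join(row)))
           for row in array]
df = pd.DataFrame(records)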
UPD: another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if you don't have whitespace between the info and the '\n' in each row; if you don't, I will change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='|S4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the last symbol from each column name):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
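A small aside (not from the original answer): numpy can infer the row count itself, so the reshape also works without computing it:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(-1, 4, 2)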
If you have many log files, you can do that for each one in a for-loop, save each dataframe in a list (for example, a list called DF_array) and then use pd.concat to make one dataframe from the list of dataframes:
pd.concat(DF_array)
If you need, I can add an example.
UPD:
I have created a dir with log files and then built a list of all files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do a for-loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # integer division: reshape needs an int in Python 3
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:,:,0][0]
    raws = np.array(raws)[:,:,1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses

Using Python to combine selective rows from data frames into 1

There are several Excel files in a folder. Their structures are the same and their contents are different. I want to combine them into one Excel file, read in this sequence: 55.xlsx, 44.xlsx, 33.xlsx, 22.xlsx, 11.xlsx.
These lines are doing a good job:
import os
import pandas as pd
working_folder = "C:\\temp\\"
files = os.listdir(working_folder)
files_xls = []
for f in files:
    if f.endswith(".xlsx"):
        fff = working_folder + f
        files_xls.append(fff)
df = pd.DataFrame()
for f in reversed(files_xls):
    data = pd.read_excel(f)  #, sheet_name = "")
    df = df.append(data)
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
The picture shows what the original sheets looked like, and also the result.
But in the sequential reading, I want only the unique rows to be appended, in addition to what’s in the data frame.
In this case:
the code reads the file 55.xlsx first, then 44.xlsx, then 33.xlsx…
when it reads 44.xlsx, the row 444 Kate should not be appended, as there is already a Kate from a previous data frame.
when it reads 33.xlsx, the row 333 Kate should not be appended, as there is already a Kate from a previous data frame.
when it reads 22.xlsx, the row 222 Jack should not be appended, as there is already a Jack from a previous data frame.
By the way, here are the data frames (instead of Excel files) for your convenience.
d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}
df.drop_duplicates(subset=['Name'], keep='first')
I think you need drop_duplicates:
import glob
import pandas as pd

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
df = df.drop_duplicates('Name')
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
Solution with your data and inverse sorting of the files:
import glob
working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
print (files)
['C:\\temp\\11.xlsx', 'C:\\temp\\22.xlsx', 'C:\\temp\\33.xlsx',
'C:\\temp\\44.xlsx', 'C:\\temp\\55.xlsx']
files = sorted(files, key=lambda x: int(x.split('\\')[-1][:-5]), reverse=True)
print (files)
['C:\\temp\\55.xlsx', 'C:\\temp\\44.xlsx', 'C:\\temp\\33.xlsx',
'C:\\temp\\22.xlsx', 'C:\\temp\\11.xlsx']
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
print (df)
Code Name
0 555 Jack
1 555 Kate
0 444 David
1 444 Kate
0 333 Paul
1 333 Kate
0 222 Jordan
1 222 Jack
0 111 Leslie
1 111 River
df = df.drop_duplicates('Name')
print (df)
Code Name
0 555 Jack
1 555 Kate
0 444 David
0 333 Paul
0 222 Jordan
0 111 Leslie
1 111 River
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
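For reference, the same pipeline can be checked without any files, using the dictionaries given in the question (a sketch):
import pandas as pd

d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}

# read in the required sequence 55 -> 11, keep the first occurrence of each Name
dfs = [pd.DataFrame(d) for d in (d5, d4, d3, d2, d1)]
df = pd.concat(dfs).drop_duplicates('Name')
print(df)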
