There are several Excel files in a folder. Their structures are the same and their contents are different. I want to combine them into one Excel file, reading in this sequence: 55.xlsx, 44.xlsx, 33.xlsx, 22.xlsx, 11.xlsx.
These lines are doing a good job:
import os
import pandas as pd

working_folder = "C:\\temp\\"
files = os.listdir(working_folder)
files_xls = []
for f in files:
    if f.endswith(".xlsx"):
        fff = working_folder + f
        files_xls.append(fff)

df = pd.DataFrame()
for f in reversed(files_xls):
    data = pd.read_excel(f)  # , sheet_name=""
    df = df.append(data)

df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
The picture shows what the original sheets looked like, and also the result.
But during the sequential reading, I want only the unique rows to be appended to what's already in the data frame.
In this case:
the code reads the file 55.xlsx first, then 44.xlsx, then 33.xlsx…
when it reads 44.xlsx, the row 444 Kate should not be appended, as there is already a Kate from a previous data frame.
when it reads 33.xlsx, the row 333 Kate should not be appended, as there is already a Kate from a previous data frame.
when it reads 22.xlsx, the row 222 Jack should not be appended, as there is already a Jack from a previous data frame.
By the way, here are the data frames (instead of Excel files) for your convenience.
d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}
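For reference, a quick sketch turning those dicts into DataFrames in the read order (55 → 11):
import pandas as pd

dfs = [pd.DataFrame(d) for d in (d5, d4, d3, d2, d1)]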
I think you need drop_duplicates with keep='first':
df.drop_duplicates(subset=['Name'], keep='first')
Note that drop_duplicates keeps the first occurrence it sees, so the files must already be read in your priority order:
import glob
import pandas as pd

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '*.xlsx')
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
df = df.drop_duplicates('Name')
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
Solution with the sample data and the files sorted in reverse order:
import glob
import pandas as pd

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '*.xlsx')
print(files)
['C:\\temp\\11.xlsx', 'C:\\temp\\22.xlsx', 'C:\\temp\\33.xlsx',
'C:\\temp\\44.xlsx', 'C:\\temp\\55.xlsx']
files = sorted(files, key=lambda x: int(x.split('\\')[-1][:-5]), reverse=True)
print (files)
['C:\\temp\\55.xlsx', 'C:\\temp\\44.xlsx', 'C:\\temp\\33.xlsx',
'C:\\temp\\22.xlsx', 'C:\\temp\\11.xlsx']
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
print (df)
Code Name
0 555 Jack
1 555 Kate
0 444 David
1 444 Kate
0 333 Paul
1 333 Kate
0 222 Jordan
1 222 Jack
0 111 Leslie
1 111 River
df = df.drop_duplicates('Name')
print (df)
Code Name
0 555 Jack
1 555 Kate
0 444 David
0 333 Paul
0 222 Jordan
0 111 Leslie
1 111 River
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
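If you'd rather filter while reading sequentially, as the question describes, instead of dropping duplicates at the end, here is a minimal sketch (assuming the same folder layout and a Name column):
import glob
import os
import pandas as pd

working_folder = "C:\\temp\\"
files = sorted(glob.glob(working_folder + '*.xlsx'),
               key=lambda p: int(os.path.basename(p)[:-5]), reverse=True)

seen = set()      # names collected so far
frames = []
for f in files:
    data = pd.read_excel(f)
    data = data[~data['Name'].isin(seen)]   # keep only rows with new names
    seen.update(data['Name'])
    frames.append(data)

df = pd.concat(frames, ignore_index=True)
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
The end result is the same as drop_duplicates; this version just never holds the duplicate rows in memory.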
Related
import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
births = [968,155,77,578,973]
BabyDataSet = list(zip(names,births))
df = pd.DataFrame(data = BabyDataSet, columns=['Names','Births'])
df.at[3,'Names'].str.replace(df.at[3,'Names'],'!!!')
I want to change 'John' to '!!!' without directly referring to 'John'.
This way, it gives me "AttributeError: 'str' object has no attribute 'str'".
import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
births = [968,155,77,578,973]
BabyDataSet = list(zip(names,births))
df = pd.DataFrame(data = BabyDataSet, columns=['Names','Births'])
df.loc[3,'Names'] = '!!!'
print(df)
Output:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 !!! 578
4 Mel 973
You should call str.replace on the Series, not on a single value; replacing a single value is just an assignment:
df['Names'] = df['Names'].str.replace(df.at[3,'Names'],'!!!')
df
Out[329]:
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 !!! 578
4 Mel 973
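If you want to match every row holding that value rather than one position, a boolean mask also works (a sketch, reusing the value at row 3 so 'John' is never spelled out):
df.loc[df['Names'] == df.at[3, 'Names'], 'Names'] = '!!!'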
I have data in a file and I don't know whether it is delimited by spaces or tabs.
Data In:
id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
I read the data with read_csv, using the tab delimiter:
df = pd.read_csv('data.txt', sep='\t')
out:
id Name year Age Score
0 123456 ALEX BROWNNIS VND ... 0 19 115
1 123457 MARIA BROWNNIS VND ... 0 57 170
2 123458 jORDAN BROWNNIS VND ... 0 27 191
There are a lot of white spaces between the columns. Am I using the delimiter correctly? And when I try to access a column by name, I get a KeyError, so I basically think the fault is the use of \t.
What are the possible ways to fix this problem?
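A quick way to see which delimiter the file really uses is to inspect the raw first line before parsing (a sketch, assuming the same data.txt):
with open('data.txt') as f:
    print(repr(f.readline()))  # tabs show up as \t in the repr, spaces as plain spaces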
Since the Name field can contain a variable number of words, you need to read it as a regular file and then join the middle words back together.
import pandas as pd

id = []      # note: shadows the built-in id()
Name = []
year = []
Age = []
Score = []

with open('data.txt') as f:
    text = f.read()

lines = text.split('\n')
for line in lines:
    if len(line) < 3:
        continue
    words = line.split()
    id.append(words[0])
    Name.append(' '.join(words[1:-3]))
    year.append(words[-3])
    Age.append(words[-2])
    Score.append(words[-1])

df = pd.DataFrame.from_dict({'id': id, 'Name': Name,
                             'year': year, 'Age': Age, 'Score': Score})
Edit: you've posted the full data, so I'll change my answer to fit it.
You can use the skipinitialspace parameter, as in the following example (note that sep and delimiter are aliases, so only one of them may be passed):
df2 = pd.read_csv('data.txt', sep='\t', encoding="utf-8", skipinitialspace=True)
Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Problem solved:
df = pd.read_csv('data.txt', sep='\t', engine="python")
I added this line of code to remove the spaces around the column names, and it works:
df.columns = df.columns.str.strip()
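To confirm the header parsed cleanly, you can compare the column names before and after the strip (the "before" values shown are hypothetical, depending on the file's padding):
print(df.columns.tolist())  # before, e.g. ['id ', ' Name', ' year', ...]
df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['id', 'Name', 'year', 'Age', 'Score']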
I'm using Vote History data from the Secretary of State, however the .txt file they gave me is 7 million rows, where each row is a string with 27 characters. The first 3 characters are a code for the county. The next 8 characters are the registration ID, the next 8 characters are the date voted, etc. I can't do text to columns in excel because the file is too big. Is there a way to separate this file into columns in python pandas?
Example
Currently I have:
0010000413707312012026R
0010000413708212012027R
0010000413711062012029
0010004535307312012026D
I want to have columns:
001 00004137 07312012 026 R
001 00004137 08212012 027 R
001 00004137 11062012 029
001 00045353 07312012 026 D
Where each space separates a new column. Any suggestions? Thanks.
Simplest I can make it:
import pandas as pd

sample_lines = ['0010000413707312012026R', '0010000413708212012027R',
                '0010000413711062012029', '0010004535307312012026D']
COLUMN_NAMES = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(columns=COLUMN_NAMES)
for line in sample_lines:
    row = [line[0:3], line[3:11], line[11:19], line[19:22], line[22:23]]
    df.loc[len(df)] = row

print(df)
Outputs:
A B C D E
0 001 00004137 07312012 026 R
1 001 00004137 08212012 027 R
2 001 00004137 11062012 029
3 001 00045353 07312012 026 D
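Appending row by row with df.loc[len(df)] gets slow at 7 million rows; collecting the slices in a list and building the frame once is usually much faster (same layout, a sketch):
rows = [[line[0:3], line[3:11], line[11:19], line[19:22], line[22:23]]
        for line in sample_lines]
df = pd.DataFrame(rows, columns=COLUMN_NAMES)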
Try this. I think you don't have an issue reading from the txt file; a simplified case would look like this:
import pandas as pd

a = ['0010000413707312012026R', '0010000413708212012027R',
     '0010000413711062012029', '0010004535307312012026D']
area = []
date = []
e1 = []
e2 = []
e3 = []
# 001 00004137 07312012 026 R
for i in range(0, len(a)):
    area.append(a[i][0:3])
    date.append(a[i][3:11])
    e1.append(a[i][11:19])
    e2.append(a[i][19:22])
    e3.append(a[i][22:23])
all_list = pd.DataFrame(
{'area': area,
'date': date,
'e1': e1,
'e2': e2,
'e3': e3
})
print(all_list)
#save as CSV file
all_list.to_csv('all.csv')
Since the file is too big, it's better to read it and save it into a different file line by line, instead of reading the entire file into memory:
with open('temp.csv') as f, open('modified.csv', 'w') as f2:
    for line in f:
        code = line[0:3]
        registration = line[3:11]
        date = line[11:19]
        second_code = line[19:22]
        letter = line[22:]
        # the trailing newline travels with `letter`, so each output row ends a line
        f2.write(' '.join([code, registration, date, second_code, letter]))
You can also read the content from the txt file and use str.extract to divide the single column into the dataframe columns:
df = pd.read_csv('temp.csv', header=None)
df
# 0
# 0 0010000413707312012026R
# 1 0010000413708212012027R
# 2 0010000413711062012029
# 3 0010004535307312012026D
df = df[df.columns[0]].str.extract('(.{3})(.{8})(.{8})(.{3})(.*)')
df
# 0 1 2 3 4
# 0 001 00004137 07312012 026 R
# 1 001 00004137 08212012 027 R
# 2 001 00004137 11062012 029
# 3 001 00045353 07312012 026 D
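Pandas also ships a reader made for fixed-width layouts, pd.read_fwf; a sketch assuming the 3/8/8/3/1 layout from the question (the column names are made up for illustration):
import pandas as pd

widths = [3, 8, 8, 3, 1]   # county, registration ID, date voted, code, party
names = ['county', 'reg_id', 'date_voted', 'code', 'party']
df = pd.read_fwf('temp.csv', widths=widths, names=names, dtype=str)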
I am trying to convert the following data structure to the format below, in Python 3.
if your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: use regular expressions to parse your data, because it is a string (see the re module documentation for more on regular expressions).
import re
import numpy as np
import pandas as pd

raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
UPD: another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if there is no whitespace between the info and the '\n' in each row. If you don't have it, I will change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='|S4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the last symbol from each column name):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do that for each one in a for-loop, save each dataframe in a list (say, DF_array) and then use pd.concat to make one dataframe from the list of dataframes.
pd.concat(DF_array)
If you need, I can add an example.
UPD:
I have created a dir with log files and then made an array with all files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do a for-loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # integer division: reshape needs an int row count in Python 3
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:,:,0][0]
    raws = np.array(raws)[:,:,1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
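Note the index restarts at 0 for each file; pd.concat(dfs, ignore_index=True) would renumber the combined rows if you need a continuous index.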
I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for 'reference' when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.
Most probably you have leading or trailing whitespace in the reference column name after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
fixing df2 (note that io.StringIO needs import io):
import io

data = """\
reference
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
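Alternatively, you can keep the plain read_csv calls and strip the whitespace afterwards (a sketch, normalizing both the column names and the key column before merging):
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
for df in (df1, df2):
    df.columns = df.columns.str.strip()
    df['reference'] = df['reference'].str.strip()
df3 = df1.merge(df2, on='reference', how='inner')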