I have a list of Excel files (.xlsx, .xls) and I'm trying to get the headers of each of these files after loading them.
Here I have taken one Excel file and loaded it into pandas as:
pd.read_excel("sample.xlsx")
output is:
Here we would like to get the header information as per our requirement; in the attached image the required headers are at index 8, highlighted in red.
Since we now know the correct header row (index 8 in the preview above), I can go back and tell read_excel to skip the rows before it so that the headers appear correctly:
pd.read_excel('sample.xlsx', skiprows=9)
How can this type of case be handled programmatically across a list of Excel files where we don't know at which row the header exists? In this case we know the header is at index 8, but what if we don't know this for other files?
A sample file can be downloaded for reference:
https://github.com/myamullaciencia/pg_diploma_ai_ml_uohyd/blob/main/sample_file.xlsx
Use:
df = pd.read_excel('sample_file.xlsx')
#test all rows if previous row is only NaNs
m1 = df.shift(fill_value=0).isna().all(axis=1)
#test all rows if no NaNs
m2 = df.notna().all(axis=1)
#chain together and filter all next rows after first match
df = df[(m1 & m2).cummax()]
#set first row to columns names
df = df.set_axis(df.iloc[0].rename(None), axis=1).iloc[1:].reset_index(drop=True)
print (df)
LN FN SSN DOB DOH Gender Comp_2011 Comp_2010 \
0 Ax Bx 000-00-0000 8/3/1800 1/1/1800 Male 384025.56 396317
1 Er Ds 000-00-0000 5/7/1800 7/1/1800 Male 382263.86 392474
2 Po Ch 000-00-0000 9/9/1800 1/1/1800 Male 406799.34 395677
3 Rt Da 000-00-0000 6/24/1800 7/1/1800 Male 395767.12 424093
4 Yh St 000-00-0000 3/15/1800 7/1/1800 Male 376936.58 373754
5 Ws Ra 000-00-0000 6/12/1800 7/10/1800 Male 425720.06 420927
Comp_2009 Allocation Group NRD
0 360000 0.05 2022-09-01 00:00:00
1 360000 0.05 2015-06-01 00:00:00
2 360000 0.05 2013-01-01 00:00:00
3 360000 0.05 2020-07-01 00:00:00
4 360000 0 2013-01-01 00:00:00
5 306960 0 2034-07-01 00:00:00
Use:
df = pd.read_excel('sample_file.xlsx', header=None)
# locate the row whose column 5 holds the 'Gender' header
d = df[df[5] == 'Gender'].index[0]
# keep only the rows below the detected header row
ndf = df[d+1:]
# promote the detected row to column names
ndf.columns = df.loc[d].values
ndf.reset_index(drop=True)
Output:
Please note that the idea is that 'Gender' always appears in the header row.
Based on your comment, you can use the following condition:
df[df.notna().sum(axis = 1)==11].index[0]
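To handle a whole list of such Excel files programmatically, here is a minimal sketch of my own (assuming the header is the first row with exactly 11 non-empty cells, per the condition above; adjust n_cols or swap in the 'Gender' check as needed):
import glob
import pandas as pd

def read_with_detected_header(path, n_cols=11):
    # read without a header so every row is treated as data
    raw = pd.read_excel(path, header=None)
    # assume the header is the first row with n_cols non-NaN cells
    hdr = raw[raw.notna().sum(axis=1) == n_cols].index[0]
    out = raw.iloc[hdr + 1:].reset_index(drop=True)
    out.columns = raw.loc[hdr].values
    return out

dfs = [read_with_detected_header(f) for f in glob.glob('*.xlsx')]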
I used the code below to combine a bunch of CSV files. There is a column [UPC] that starts with 000000. Pandas detects the UPC as a numeric value, so all leading zeroes are dropped.
import glob
import os
from functools import reduce

import pandas as pd

file_ptn = os.path.join('nielsen_sku_fact*.csv')
files = glob.glob(file_ptn)
sch_inx = [
    '[All Markets]',
    '[All Periods]',
    '[UPC]'
]
df = reduce(lambda left, right: pd.DataFrame.combine_first(left, right),
            [pd.read_csv(f, index_col=sch_inx) for f in files])
The challenge is that [UPC] needs to be set as an index in order to combine all the files into the same schema. I prefer to use the combine_first method for code elegance, so there's no need to suggest a different merge/combine method.
Perhaps the problem is with the index_col parameter; why not set the index after reading the CSV? i.e.
li = [pd.read_csv(f, dtype={d:object for d in sch_inx }).set_index(sch_inx) for f in files]
main_df = reduce(lambda left,right: pd.DataFrame.combine_first(left,right),li)
Let's take an example of preserving the leading zeroes, i.e.
amount donorID recipientID year
0 0100 101 11 2014
1 0200 101 11 2014
2 0500 101 21 2014
3 0200 102 21 2014
# Copy the above dataframe
sch_ind = ['amount','donorID']
df = pd.read_clipboard(dtype={d:object for d in sch_ind}).set_index(sch_ind)
print(df)
recipientID year
amount donorID
0100 101 11 2014
0200 101 11 2014
0500 101 21 2014
0200 102 21 2014
If it works with read_clipboard, it works with read_csv too.
I think you need to change the combine_first call and add the dtype parameter to read_csv as a dictionary mapping each column name to str.
Also, for the index, numpy.intersect1d is used to intersect the column names with sch_inx and select the intersected column(s):
import numpy as np
import pandas as pd
from functools import reduce

dfs = []
di = {d: str for d in sch_inx}
for fp in files:
    df = pd.read_csv(fp, dtype=di)
    # if you want only the first intersected column, use:
    # col = np.intersect1d(df.columns, sch_inx)[0]
    col = list(np.intersect1d(df.columns, sch_inx))
    dfs.append(df.set_index(col))

df = reduce(lambda left, right: left.combine_first(right), dfs)
You cannot use dtype together with index_col in pandas 0.22.0 because of a bug.
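As a minimal illustration (a sketch reusing the files, di and sch_inx defined above):
# on pandas 0.22.0 this combination hits the bug mentioned above, so avoid:
# pd.read_csv(files[0], dtype=di, index_col=sch_inx)

# instead, read first and set the index afterwards, as in the loop above:
df0 = pd.read_csv(files[0], dtype=di).set_index(sch_inx)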
Point 1
There are several ways to preserve the string-ness of the '[UPC]' column.
Use dtype as mentioned in other posts
Use converters
Perform the conversion afterwards with pd.Series.str.zfill
Setup
Let's begin by setting up some files. I'm using Jupyter Notebook and I can use the handy %%writefile magic.
%%writefile nielson_sku_fact01.csv
[All Markets],[All Periods],[UPC],A,B
1,2,0001,3,4
1,3,2000,7,8
%%writefile nielson_sku_fact02.csv
[All Markets],[All Periods],[UPC],C,D
1,4,0001,3,4
1,3,3000,7,8
%%writefile nielson_sku_fact03.csv
[All Markets],[All Periods],[UPC],B,D
1,4,0002,10,11
1,2,2000,8,8
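If you're not in a notebook, a plain-Python equivalent of the %%writefile cells above (same file names and contents) would be:
# write the three sample CSVs without Jupyter's %%writefile magic
samples = {
    'nielson_sku_fact01.csv': '[All Markets],[All Periods],[UPC],A,B\n1,2,0001,3,4\n1,3,2000,7,8\n',
    'nielson_sku_fact02.csv': '[All Markets],[All Periods],[UPC],C,D\n1,4,0001,3,4\n1,3,3000,7,8\n',
    'nielson_sku_fact03.csv': '[All Markets],[All Periods],[UPC],B,D\n1,4,0002,10,11\n1,2,2000,8,8\n',
}
for name, content in samples.items():
    with open(name, 'w') as fh:
        fh.write(content)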
Let's use OP's code to get some vars
import glob
import os
import pandas as pd
from functools import reduce
files = glob.glob('nielson_sku_fact*.csv')
sch_inx = [
'[All Markets]',
'[All Periods]',
'[UPC]'
]
Now let's show how the three conversions work:
pd.read_csv('nielson_sku_fact01.csv', dtype={'[UPC]': str})
[All Markets] [All Periods] [UPC] A B
0 1 2 0001 3 4
1 1 3 2000 7 8
pd.read_csv('nielson_sku_fact01.csv', converters={'[UPC]': str})
[All Markets] [All Periods] [UPC] A B
0 1 2 0001 3 4
1 1 3 2000 7 8
Using pd.Series.str.zfill
df = pd.read_csv('nielson_sku_fact01.csv')
df['[UPC]'] = df['[UPC]'].astype(str).pipe(
    lambda s: s.str.zfill(s.str.len().max()))
df
[All Markets] [All Periods] [UPC] A B
0 1 2 0001 3 4
1 1 3 2000 7 8
Point 2
If you want elegance, there is no need to use a lambda that takes two arguments when pd.DataFrame.combine_first is already a function that takes two arguments. In addition, you can use map with a prepared reading function to make it nice and clean:
def read(filename):
    return pd.read_csv(
        filename,
        converters={'[UPC]': str}
    ).set_index(sch_inx)
reduce(pd.DataFrame.combine_first, map(read, files))
A B C D
[All Markets] [All Periods] [UPC]
1 2 0001 3.0 4.0 NaN NaN
2000 NaN 8.0 NaN 8.0
3 2000 7.0 8.0 NaN NaN
3000 NaN NaN 7.0 8.0
4 0001 NaN NaN 3.0 4.0
0002 NaN 10.0 NaN 11.0
Point 3
I think you should reconsider using pd.DataFrame.combine_first, because with glob it doesn't look like you can control the order of your files very easily, and you might get unpredictable outcomes depending on the order in which glob returns those files. Unless you don't care, then... good luck.
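If the ordering matters, a simple way to make it deterministic (a small sketch of my own, assuming lexicographic file-name order is what you want) is to sort the glob result before reducing:
# sort the matched files so combine_first always sees them in the same order
files = sorted(glob.glob('nielson_sku_fact*.csv'))
reduce(pd.DataFrame.combine_first, map(read, files))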
When using read_csv, you can set the type of a column by passing a dtype argument. Eg:
pd.read_csv(f, index_col=sch_inx, dtype={'[UPC]': 'str'})
See: docs
So I have a .csv file where each row looks like this:
,11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01,
,11:00:15,4,5.,94.7,0.04,0.5,7,20,0.005,10,49.5,0.04,
when it should look like this:
11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01
11:00:15,4,5.,94.7,0.04,0.5,7,20,0.005,10,49.5,0.04
I think that this is the reason why pandas is not creating data frames properly. What can I do to remove these commas?
The code generating the original csv file is
import csv

def tsv2csv():
    # read tab-delimited file
    with open(file_location + tsv_file, 'r') as fin:
        cr = csv.reader(fin, delimiter='\t')
        filecontents = [line for line in cr]
    # write comma-delimited file (comma is the default delimiter)
    # give the exact location of the file
    # "newline=''" stops blank lines appearing between rows
    with open(new_csv_file, 'w', newline='') as fou:
        cw = csv.writer(fou, quotechar='', quoting=csv.QUOTE_NONE)
        cw.writerows(filecontents)
You can use usecols to specify the columns you want to import as follows:
import pandas as pd
csv_df = pd.read_csv('temp.csv', header=None, usecols=range(1,13))
This will skip the first and last empty columns.
The leading and trailing commas correspond to missing data. When your dataframe is loaded, they come in as NaN columns, so all you need to do is get rid of them, either using dropna or by slicing them out:
df = pd.read_csv('file.csv', header=None).dropna(how='all', axis=1)
Or,
df = pd.read_csv('file.csv', header=None).iloc[:, 1:-1]
df
1 2 3 4 5 6 7 8 9 10 11 12
0 11:00:14 4 5.0 93.7 0.01 0.0 7 20 0.001 10 49.3 0.01
1 11:00:15 4 5.0 94.7 0.04 0.5 7 20 0.005 10 49.5 0.04
You can strip any character at the beginning and end of a string by using strip and passing a string with the characters you want to remove as an argument.
x = ',11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01,'
print x.strip(',')
>11:00:14,4,5.,93.7,0.01,0.0,7,20,0.001,10,49.3,0.01
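To use this with pandas, here is a small sketch of my own (assuming the file from the question is saved as 'temp.csv', the name used in the usecols answer above) that strips each line before handing the cleaned text to read_csv:
import io
import pandas as pd

# strip the stray leading/trailing commas from every line, then parse
with open('temp.csv') as fh:
    cleaned = '\n'.join(line.strip().strip(',') for line in fh)

df = pd.read_csv(io.StringIO(cleaned), header=None)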
Not sure if it works in your case, but have you tried importing with:
df = pd.read_csv('filename', sep=';')
Let's say I have a text file that looks like this:
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:
import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)
I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
Update
There were three primary solutions offered: 1) give the data frame a long range of column names so that all the data can be read in, then post-process it; 2) find values in square brackets and put quotes around them; or 3) replace the first n commas with semicolons.
Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data), because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds compared to 3.02 seconds for solution 2. The tests were run on a text file containing 18 columns and more than 208,000 rows.
We can use a simple trick: wrap balanced square brackets in double quotes:
import re
import six
import pandas as pd
data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""
print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))
pd.set_option('display.expand_frame_repr', False)
print(df)
Output:
----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
Item Date Time Location junk
0 1 01/01/2016 13:41 [45.2344:-78.25453] [aaaa,bbb]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242] [0,1,2,3]
2 3 01/10/2016 01:27 [51.2344:-86.24432] [12,13]
3 4 01/30/2016 05:55 [51.2344:-86.24432,41.2342:-81242,55.5555:-81242] [45,55,65]
UPDATE: if you are sure that all square brackets are balanced, we don't have to use regexes:
import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)
    df = pd.read_csv(fo)

print(df)
I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step:
import pandas as pd
import io
import re

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')

with open('path/to/file.txt', 'r') as fi:
    # replace brackets with quotes, pipe into file-like object
    fo = io.StringIO()
    fo.writelines(unicode(re.sub(location_regex, r'"\1"', line)) for line in fi)
    # rewind file to the beginning
    fo.seek(0)
    # read transformed CSV into data frame
    df = pd.read_csv(fo)

print df
This gives you a result like
Date_Time Item Location
0 2016-01-01 13:41:00 1 [45.2344:-78.25453]
1 2016-01-03 19:11:00 2 [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00 3 [51.2344:-86.24432]
Edit: If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)

with open('path/to/file.csv', 'r') as fi:
    data = unicode(re.sub(location_regex, r'"\1"', fi.read()))

df = pd.read_csv(io.StringIO(data))
If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):
with open('/path/to/file.csv', 'r') as fi:
    data = unicode(fi.read().replace('[', '"').replace(']', '"'))

df = pd.read_csv(io.StringIO(data))
Below are the timing results I got with a 200k-row by 3-column dataset. Each time is averaged over 10 trials.
data frame post-processing (jezrael's solution): 2.19s
line by line regex: 1.36s
bulk regex: 0.39s
bulk string replace: 0.14s
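For reference, a rough timing harness along these lines (my own sketch; '/path/to/file.csv' is a placeholder for the test file) reproduces the bulk string-replace measurement:
import io
import timeit
import pandas as pd

def load_bulk_replace(path='/path/to/file.csv'):
    # the bulk string-replace variant timed above
    with open(path, 'r') as fi:
        data = fi.read().replace('[', '"').replace(']', '"')
    return pd.read_csv(io.StringIO(data))

# average over 10 trials, as described above
print(timeit.timeit(load_bulk_replace, number=10) / 10)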
I think you can replace the first 3 occurrences of ',' in each line of the file with ';' and then use the parameter sep=';' in read_csv:
import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()

fo = io.StringIO()
fo.writelines(u"" + line.replace(',', ';', 3) for line in lines)
fo.seek(0)

df = pd.read_csv(fo, sep=';')
print df
Item Date Time Location
0 1 01/01/2016 13:41 [45.2344:-78.25453]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242]
2 3 01/10/2016 01:27 [51.2344:-86.24432]
Or you can try this more complicated approach. The main problem is that the ',' separator between values inside the lists is the same as the separator between the other column values.
So you need post-processing:
import pandas as pd
import io
temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
0 1 2 3 4 \
0 Item Date Time Location NaN
1 1 01/01/2016 13:41 [45.2344:-78.25453] NaN
2 2 01/03/2016 19:11 [43.3423:-79.23423 41.2342:-81242
3 3 01/10/2016 01:27 [51.2344:-86.24432] NaN
5 6 7 8 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 41.2342:-81242] NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None
#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
Location NaN NaN
1 [45.2344:-78.25453] NaN NaN
2 [43.3423:-79.23423 41.2342:-81242 41.2342:-81242]
3 [51.2344:-86.24432] NaN NaN
#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)
#subset of desired columns
print df[['Item','Date','Time','Location']]
Item Date Time Location
1 1 01/01/2016 13:41 [45.2344:-78.25453]
2 2 01/03/2016 19:11 [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3 3 01/10/2016 01:27 [51.2344:-86.24432]
Is there a way to read in a file like this and skip the column index prefixes (1: to 5:), as in this example? I'm using read_csv.
24.0 1:0.00632 2:18.00 3:2.310 4:0 5:0.5380
21.6 1:0.02731 2:0.00 3:7.070 4:0 5:0.4690
Expected table read:
24.0 0.00632 18.00 2.310 0 0.5380
read_csv won't handle this the way you want because it's not a CSV.
You can do e.g.
with open('data.txt') as f:
    df = pd.DataFrame([[chunk.split(':')[-1] for chunk in line.split()]
                       for line in f])
Your data is oddly structured. Given the colon index separator, you can read the file mostly as text via the usual read_csv. Then, for each column in the dataframe except the first, split the string on ':', take the second element (which represents your desired value), and convert that value to a float (all done with applymap and a lambda).
df = pd.read_csv('data.txt', sep=' ', header=None)
>>> df
0 1 2 3 4 5
0 24.0 1:0.00632 2:18.00 3:2.310 4:0 5:0.5380
1 21.6 1:0.02731 2:0.00 3:7.070 4:0 5:0.4690
df.iloc[:, 1:] = df.iloc[:, 1:].applymap(lambda s: float(s.split(':')[1]))
>>> df
0 1 2 3 4 5
0 24.0 0.00632 18 2.31 0 0.538
1 21.6 0.02731 0 7.07 0 0.469
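Alternatively (a sketch of my own, not part of the answers above), the 'n:' prefixes can be stripped at read time with converters keyed by column position:
import pandas as pd

# columns 1-5 carry the 'n:' prefix; strip it while parsing
conv = {i: (lambda s: float(s.split(':')[-1])) for i in range(1, 6)}
df = pd.read_csv('data.txt', sep=' ', header=None, converters=conv)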