pandas read_excel select rows - python

Thanks to StackOverflow (so, basically, all of you) I've managed to solve almost all my issues regarding reading Excel data into a DataFrame, except one... My code goes like this:
df = pd.read_excel(
    fileName,
    sheetname=sheetName,
    header=None,
    skiprows=3,
    index_col=None,
    skip_footer=0,
    parse_cols='A:J,AB:CC,CE:DJ',
    na_values='')
The thing is that in the Excel files I'm parsing, the last row of data I want to load is in a different position every time. The only way I can identify the last row of interest is to look for the word "SUMA" in the first column of each sheet; the last row I want to load into df is the row just above the one containing "SUMA". The rows below "SUMA" also contain information that is irrelevant to me, and there can be quite a lot of them, so I want to avoid loading them.

If you do it with generators, you could do something like this. It loads the complete DataFrame, but afterwards filters out the 'SUMA' row and everything after it, using the trick that True == 1 (the cumulative sum of the boolean match is 0 before 'SUMA' and at least 1 from 'SUMA' on), so you only keep the relevant rows. You might need some work afterwards to get the dtypes correct.
def read_files(files):
    sheetname = 'my_sheet'
    for file in files:
        yield pd.read_excel(
            file,
            sheetname=sheetname,
            header=None,
            skiprows=3,
            index_col=None,
            skip_footer=0,
            parse_cols='A:J,AB:CC,CE:DJ',
            na_values='')

def clean_files(dataframes):
    summary_text = 'SUMA'
    for df in dataframes:
        # cumulative sum of the boolean match: 0 before 'SUMA', >= 1 from 'SUMA' onwards
        after_suma = df.iloc[:, 0].astype(str).str.startswith(summary_text).cumsum()
        yield df.loc[after_suma == 0, :]
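For completeness, a minimal sketch of how the two generators could be chained; the file names and the dtype re-inference step are assumptions, not part of the original answer:
import pandas as pd

files = ['report_2017.xlsx', 'report_2018.xlsx']  # hypothetical file names

# Chain the generators and concatenate the cleaned frames into one DataFrame
result = pd.concat(clean_files(read_files(files)), ignore_index=True)

# Optionally let pandas re-infer dtypes after the string-based filtering
result = result.infer_objects()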


Changing Headers in .csv files

Right now I am trying to read in data that is provided in a messy, hard-to-read format. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in the data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders, and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The CSV files are in a folder one level below the file that is supposed to read them (i.e. folder 1 contains the code below, folder 2 contains the CSV files, and folder 1 contains folder 2).
Here is what I have right now:
import glob
import pandas as pd

#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader(sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName + sets[i] + ".csv"), sep=r'\s*,', skiprows=skip, engine='python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]], errors='coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
   x  y
0  1  2
1  3  4
2  5  6
Expanding on Kraigolas's answer, to do this with multiple files you can use a list comprehension. Note that glob.glob returns a list, so the per-pattern results need to be flattened into a single list of paths:
files = [f for set_num in sets for f in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows argument to skip the heading rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"])
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of headings change.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names are taken from the first "good" row, and the types come from a dtype dictionary, applied only to the columns that actually exist.
This is certainly not fast; it's a last-resort solution. If I had a better one I'd use it:
# Hypothetical wrapper: the original snippet lives inside a helper that receives
# the raw df, the expected first column name (first_col) and a dtype dict (dtype).
def skip_to_header(df, first_col, dtype):
    data = df
    if first_col not in df.columns:
        # Skip rows until we find the first col header
        for i, row in df.iterrows():
            if row[0] == first_col:
                data = df.iloc[(i + 1):].reset_index(drop=True)
                # Read the column names
                series = df.iloc[i]
                series = series.str.strip()
                data.columns = list(series)
                # Use only existing column types
                types = {k: v for k, v in dtype.items() if k in data.columns}
                # Apply the column types again
                data = data.astype(dtype=types)
                break
    return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adapted to use different conditions, e.g. looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x": "float64", "y": "float64"}

# Hypothetical wrapper, analogous to the one above.
def skip_to_numeric(df, columns, dtypes):
    data = df
    # Skip until we find the first numeric value
    for i, row in df.iterrows():
        if row[0].isnumeric():
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Apply names and types
            data.columns = columns
            data = data.astype(dtype=dtypes)
            break
    return data
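A minimal usage sketch for the numeric-skip variant, assuming the raw file is read with header=None so the junk heading rows land in the frame; the file name is hypothetical, and skipfooter is only there to drop the trailing #[END_OF_FILE] line:
import pandas as pd

# Read everything as strings so the heading rows don't break parsing;
# skipfooter drops the last (non-data) line of the file.
df_raw = pd.read_csv("messy_export.csv", header=None, dtype=str,
                     skipfooter=1, engine="python")  # hypothetical file name
data = skip_to_numeric(df_raw, columns, dtypes)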

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write code that treats the first row as the header, extracts the data for two specific cities (Kish and Qeshm), and saves it to a new CSV file. Something like this one:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd
path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path, header=[0])
df.head()
You don't need to use header=... because the default is to treat the first row as the header, so
df = pd.read_csv(path)
Then, to keep rows based on a condition:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with
df2.to_csv(another_path)
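Since the file has over 5,000,000 rows, the same filter can also be applied chunk by chunk to keep memory use low; a minimal sketch (the output file name is an assumption):
import pandas as pd

path = '/Users/Desktop/sample.csv'
filtered_chunks = []
# Read the big file in pieces and keep only the two cities from each piece
for chunk in pd.read_csv(path, chunksize=100_000):
    filtered_chunks.append(chunk[chunk['City'].isin(['Kish', 'Qeshm'])])
pd.concat(filtered_chunks).to_csv('kish_qeshm.csv', index=False)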

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something seems to be going wrong at line 8, but I can hardly find an 8th line since the file clearly has only one row.
I do it like this:
import codecs
import pandas as pd

with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the CSV in Notepad and noticed that the separator is a tab, not a comma, and then tried the combination below:
df = pd.read_csv('C:\\myfile.csv', sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False) (note that error_bad_lines is deprecated since pandas 1.3; the equivalent there is on_bad_lines='skip').
The existing answers will not include those extra-wide lines in your dataframe. If you'd like your dataframe to be as wide as its widest row, you can use the following:
delimiter = ','
# number of columns = delimiter count of the widest line + 1
with open(path_name, 'r') as f:
    max_columns = max(line.count(delimiter) for line in f) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 if there's actually a header; you can always retrieve the header column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
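A minimal sketch of both ideas, assuming the widened frame built above (the helper variables are mine, not part of the original answer):
# Recover the original header names that were skipped with skiprows=1
with open(path_name, 'r') as f:
    original_header = f.readline().rstrip('\n').split(delimiter)

# Rows that spill past the original header width have non-null values
# in the extra columns of the widened frame
extra_cols = df.columns[len(original_header):]
overflow_rows = df[df[extra_cols].notna().any(axis=1)]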

Creating a script in Python 3 to compare two CSVs and output the similarities and differences between the two into another set of CSVs

I have two CSVs. Ideally they contain the same data, but in reality the content sometimes differs here and there. Instead of manually browsing the two CSVs to find out what's the same and what's different, I am trying to create a Python script, which I will run weekly, that tells me what's the same and what's not.
Here's the logic.
1. Given 2 CSVs
2. Compare them row by row.
3. Any rows that are different between the two CSVs should be recorded in another CSV (the entire row/s).
4. Any rows that are the same between the CSVs should be recorded in another CSV (the entire row/s).
This will help me visually see what the differences are and action them accordingly.
Below is an example of what I am looking for.
The code below is what I have so far
with open('Excel 1.csv', 'r') as csvOne, open('Excel 2.csv', 'r') as csvTwo:
    csvONE = csvOne.readlines()
    csvTWO = csvTwo.readlines()
    with open('resultsSame.csv', 'w') as resultFileSame:
        for row in csvTWO:
            if row not in csvONE:
                resultFileSame.write(row)
    with open('resultsDifference.csv', 'w') as resultFileDifference:
        for row in csvTWO:
            if row in csvONE:
                resultFileDifference.write(row)
I want the script to compare rows and, only where rows are the same or different, output them into another set of CSVs. The above code works, but it removes columns that are in one CSV and not the other, rather than rows. I want to keep the columns even if they are not in the other CSV, and only show me which rows are in one or the other, in separate CSVs.
Please see below the results I get when I run the first piece of code you've given on your example dataset.
If you look at the above, I can't quite figure out how you're getting the output that you are, as that is exactly what I want! To be honest, I don't need to print out the headers, as I am comparing those as well; they can sometimes end up different due to user error.
Here is the modified version of your code.
with open('excel1.csv', 'r') as csvOne, open('excel2.csv', 'r') as csvTwo:
    csvONE = csvOne.readlines()
    csvTWO = csvTwo.readlines()
    with open('resultsDifference.csv', 'w') as resultFileDifference:
        # Write the header to the difference file.
        # Because the headers are the same in both input CSVs, the header row will naturally end up in resultsSame.csv
        resultFileDifference.write(csvONE[0])
        for row in csvTWO:
            if row not in csvONE:
                resultFileDifference.write(row)
    with open('resultsSame.csv', 'w') as resultFileSame:
        for row in csvTWO:
            if row in csvONE:
                resultFileSame.write(row)
Using pandas will make your work easier. Here is the snippet, which is self-explanatory:
import pandas as pd
df1 = pd.read_csv('excel1.csv')
df2 = pd.read_csv('excel2.csv')
merged = df1.merge(df2, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
similar_df = merged[merged['_merge'] == 'both'].drop('_merge', axis=1)
print(diff_df)
print(similar_df)
diff_df.to_csv('resultsDifference.csv', index=False)
similar_df.to_csv('resultsSame.csv', index=False)
Documentation of the pandas merge function: pandas.DataFrame.merge.
I've created the script based on the example you've given in your question. Here is a snapshot of the inputs and outputs of the example (screenshots of Excel1, Excel2, resultsSame.csv and resultsDifference.csv were attached here).
I'm sure the script produces the results you've quoted in your question, except for the index. If you are interested in row indices as in your question, then below is the updated script. Let me know whether it meets your needs.
import pandas as pd
df1 = pd.read_csv('excel1.csv')
df2 = pd.read_csv('excel2.csv')
merged = df1.merge(df2, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
similar_df = merged[merged['_merge'] == 'both'].drop('_merge', axis=1)
diff_df.index = range(1,len(diff_df)+1)
similar_df.index = range(1,len(similar_df)+1)
diff_df.to_csv('resultsDifference.csv')
similar_df.to_csv('resultsSame.csv')
Ah! I'm wondering!!! These are the CSV file contents I have:
excel1.csv
A,B,C,D
A,A,A,A
B,B,B,B
C,C,C,A
D,,,
excel2.csv
A,B,C,D
A,A,A,A
B,B,B,B
C,C,C,C
D,D,,

read a huge csv and create a dataframe

I have a CSV with about 40,000,000 rows and 3 columns. I want to read it into Python and create a DataFrame from the data, but I always get a memory error.
df = pd.concat([chunk for chunk in pd.read_csv('cmct_0430x.csv', chunksize=1000)])
I also tried creating the pandas DataFrame from a generator, but it still gives a memory error:
def read_lines():
    # generator that yields the raw lines of the file
    for line in open("cmct_0430x.csv"):
        yield line
My computer runs 64-bit Windows with 8 GB of RAM.
How can I solve this problem? Thank you very much.
df = pd.read_csv('cmct_0430x.csv')
40 million rows shouldn't be a problem.
Please post your error message if this doesn't work.
You actually read the CSV file in chunked mode, but then merged the chunks back into one DataFrame in RAM, so the problem still exists. Instead, you can divide your data into multiple frames and work on them separately.
reader = pd.read_csv(file_name, chunksize=chunk_size, iterator=True)
while True:
    try:
        df = reader.get_chunk(chunk_size)
        # work on df
    except StopIteration:
        break
del df
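A more idiomatic variant of the same chunked approach, simply iterating over the reader; the row count is only an illustration of per-chunk work:
import pandas as pd

total_rows = 0
# read_csv with chunksize yields one DataFrame per chunk
for chunk in pd.read_csv('cmct_0430x.csv', chunksize=100_000):
    # work on each chunk here instead of keeping all 40 million rows in memory
    total_rows += len(chunk)
print(total_rows)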
