Python comparing a file to load to a base file - python

I want to have a base file that I can use to compare against the files they give me to load, e.g. the base file has the columns name, surname, amount, etc. The incoming file can arrive as sue,200,anderson,etc. without column names.
How can I ensure that the given file has the same columns as the base file?
Anyone with a better approach to this, please help.

If your files contain more than one line, you should look at the pandas framework, and especially the DataFrame class, to read the two files. After reading them, you can simply count the columns in each and compare the counts.
If each file only contains one line, that is (in my opinion) too much overhead; instead you should just read each line, count the commas, and compare the counts, by doing something similar to this:
if len(lineOfBaseFile.split(',')) == len(lineOfValueFile.split(',')):
    pass  # everything is alright, process the data
else:
    pass  # columns do not match
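For the multi-line case, a minimal pandas sketch of the column-count comparison could look like this (the file names are placeholders and it assumes both files are comma-separated):

import pandas as pd

# the base file has column names; the incoming file does not
base_df = pd.read_csv('base_file.csv')
value_df = pd.read_csv('value_file.csv', header=None)

if base_df.shape[1] == value_df.shape[1]:
    print('Column counts match, process the data')
else:
    print('Column counts do not match')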

I am using pandas; here is what I have done.
import datetime
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
from IPython.display import display, HTML

filepath = "C:/WORKBOOK.XLSM"
filepath2 = "C:/slsSep20170904.csv"

df = pd.read_excel(filepath, sheetname='Table',
                   skiprows=[0,1,2,3,4,5,6,7,8,9,10,11,12],
                   usecols=[6,7,8,9,10,11,12,13,14,15,16,17])
df2 = pd.read_csv(filepath2,
                  skiprows=[0,1,2,3,4,5,6,7,8,9,10,11,12],
                  usecols=[6,7,8,9,10,11,12,13,14,15,16,17])

if all(df['Unnamed: 11'] == df2['Unnamed: 11']):
    difference = 'it is equals to'
    print(df['Unnamed: 11'])
    print(df2.columns)

This worked for me, but I would welcome any better way to do this.
if (df.iloc[0][0] in df2.iloc[0][0] and
        df.iloc[0][1] in df2.iloc[0][1] and
        str(df.iloc[0][2]) in str(df2.iloc[0][2]) and
        df.iloc[0][3] in df2.iloc[0][3] and
        str(df.iloc[0][4]) in str(df2.iloc[0][4]) and
        df.iloc[0][5] in df2.iloc[0][5] and
        df.iloc[0][6] in df2.iloc[0][6] and
        str(df.iloc[0][7]) in str(df2.iloc[0][7]) and
        df.iloc[0][8] in df2.iloc[0][8] and
        df.iloc[0][9] in df2.iloc[0][9] and
        df.iloc[0][10] in df2.iloc[0][10] and
        df.iloc[0][11] in df2.iloc[0][11]):
    print(difference)
    df.to_csv('C:/Test.csv', index=False, header=False)
else:
    df = pd.DataFrame({'A': []})
    df.to_csv('C:/Test.csv', index=False, header=False)
df.head()

Related

Using pandas, how do I turn one csv file column into list and then filter a different csv with the created list?

Basically I have one CSV file called 'Leads.csv' that contains all the sales leads we already have. I want to turn this CSV's 'Leads' column into a list, then check a 'Report' CSV to see whether any of those leads are already in there, and filter them out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads'])-(set(df_leads['Leads']))
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are significantly faster than lists for this operation.
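If you then want to write the filtered result back out, as in the original attempt, something like this should work (the output file name is the one from the question):

df_filtered.to_csv('Filtered Report.csv', index=False)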

Adding new column with the header containing date at the beginning of CSV file

I was looking on Stack Overflow for this but didn't find exactly what I wanted. I would like to open a CSV file in Python and add a new column with the header "Date", filled with today's date down to the end of the file. How can I do it? I was trying to do it with pandas, but I only know how to append a column at the end.
I was trying to do it this way with the csv package:
import csv

x = open(outfile_name1)
y = csv.reader(x)
z = []
for row in y:
    z.append(['0'] + row)
Instead of ['0'] I wanted to put today's date. Can I then convert this list to csv with pandas or something? Thanks in advance for help!
Try this:
import pandas as pd
import datetime
df = pd.read_csv("my.csv")
df.insert(0, 'Date', datetime.datetime.today().strftime('%Y-%m-%d'))
df.to_csv("my_withDate.csv", index=False)
PS: Read the docs
Is this what you are looking for?
import pandas as pd
import datetime
df = pd.read_csv("file.csv")
df['Date'] = datetime.datetime.today().strftime('%Y-%m-%d')
df.to_csv("new_file.csv", index=False)
As far as I understand, the ultimate goal is to write the data to CSV. One option is to open the first file for reading and a second file for writing, write the header row into the new file prepended with the column name 'Date,', and then iterate over the data rows, prepending each with the date (requires Python 3.6+ as it uses f-strings):
import datetime

with open('columns.csv', 'r') as out, open('out.csv', 'w') as into:
    headers = 'Date,' + next(out)
    print(headers, end='', file=into)
    for row in out:
        print(f'{datetime.datetime.today().date()}, {row}', end='', file=into)

Read only specific fields from large JSON and import into a Pandas Dataframe

I have a folder with roughly 10 JSON files, each between 500 and 1000 MB.
Each file contains about 1,000,000 lines like the following:
{
    "dateTime": "2019-01-10 01:01:000.0000",
    "cat": 2,
    "description": "This description",
    "mail": "mail#mail.com",
    "decision": [{"first": "01", "second": "02", "third": "03"}, {"first": "04", "second": "05", "third": "06"}],
    "Field001": "data001",
    "Field002": "data002",
    "Field003": "data003",
    ...
    "Field999": "data999"
}
My target is to analyze it with pandas, so I would like to save the data coming from all the files into a DataFrame.
If I loop over all the files, Python crashes because I don't have enough free resources to manage the data.
For my purposes I only need a DataFrame with two columns, cat and dateTime, from all the files, which I suppose is lighter than a whole DataFrame with all the columns. I have tried to read only these two columns with the following snippet:
Note: at the moment I am working with only one file; when I get fast reading code I will loop over all the other files (A.json, B.json, ...).
import pandas as pd
import json
import os.path
from glob import glob

cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        data = json.loads(line)
        lst_dict = {'cat': data['cat'], 'dateTime': data['dateTime']}
        df = df.append(lst_dict, ignore_index=True)
The code works, but it is very, very slow: it takes more than one hour for one file, whereas reading the whole file and storing it into a DataFrame usually takes me 8-10 minutes.
Is there a way to read only two specific columns and append to a Dataframe in a faster way?
I have tried reading the whole JSON file into a DataFrame and then dropping all columns except 'cat' and 'dateTime', but it seems to be too heavy for my MacBook.
I had the same problem. I found out that appending a dict to a DataFrame is very very slow. Extract the values as a list instead. In my case it took 14 s instead of 2 h.
import json
import pandas as pd

cols = ['cat', 'dateTime']
data = []
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)
df = pd.DataFrame(data=data, columns=cols)
Will this help?
Step 1: read your JSON file with pandas.read_json().
Step 2: filter out your 2 columns from the DataFrame, as in the sketch below.
Let me know if you still face any issue.
Thanks
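A minimal sketch of that approach, assuming the files are line-delimited JSON as in the question (path and file name taken from the question):

import pandas as pd

# read the line-delimited JSON, then keep only the two needed columns;
# note this still parses every field, so for very large files you may want
# to pass chunksize=... together with lines=True and concatenate the pieces
df = pd.read_json('this_is_my_path/File_A.json', lines=True)
df = df[['cat', 'dateTime']]
print(df.head())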

Comparing two Microsoft Excel files in Python

I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
fileA.xlsx looks like this:
fileB.xlsx looks like this:
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?
You can use pandas' merge to first get the rows that are common to both files,
then use them as a filter.
import pandas as pd
df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)
df_new = df_A.merge(df_B, on = 'ID',how='outer',indicator=True)
df_common = df_new[df_new['_merge'] == 'both']
df_A = df_A[(~df_A.ID.isin(df_common.ID))]
df_B = df_B[(~df_B.ID.isin(df_common.ID))]
df_A and df_B now contain the rows from fileA and fileB respectively, without the rows common to both.
Hope this helps.
Here I am using pandas (you also have to install xlrd to open xlsx files).
It takes the values from the second file that are not in the first file, then writes them to an Excel file with the second file's name, overwriting the second file:
import pandas as pd
a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')
diff = b[b!=a].dropna()
diff.to_excel("b.xlsx",sheet_name='Sheet1',index=False)

Fixing dates in pandas dataframe

Scenario: I am using a python code to extract data from excel files. Currently my code reads each file into a single data frame and joins them in a list of data frames.
Issue: The original excel source files are organized by columns (dates) and identifiers (rows). Some of these files have a date in a string format, such as 20170611 or 11062015.
What I tried so far: From previous research here in SO, I found some questions and answers about this topic, but they all referred to a single conversion, for example with:
>>> datetime.datetime.strptime('24052010', "%d%m%Y").date()
datetime.date(2010, 5, 24)
This is the kind of operation I need, but I would like to perform it for all column headers of the affected files in a loop.
Question: Is it possible to do this? How can it be done?
Obs: I thought about looping through the excel files with some code to select the ones that are affected, but since I don't know how to do that, I will select the files by hand and have them fixed individually. So my objective is just to loop the columns and fix the dates of those files.
Current code that gets data from excel:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl

directory = os.path.join("C:\\", "Users\\DGMS\\Desktop\\final 2")
list_of_dfs = []
for root, dirs, files in os.walk(directory):
    for file in files:
        f = os.path.join(root, file)
        print(f)
        list_of_dfs.append(pd.read_excel(f))
You can use pandas.to_datetime. It makes a reasonable guess at inferring the datetime format. If all formats with the year at the end have the day (and not the month) first, you can use the dayfirst=True argument.
I also prefer pathlib.Path.glob over os.walk.
I would do something like this:
from pathlib import Path
import pandas as pd

start_dir = Path('.')
excel_files = start_dir.glob('*/*.xlsx')
list_of_dfs = [(filename, pd.read_excel(filename, header=0)) for filename in excel_files]
for filename, df in list_of_dfs:
    try:
        # dayfirst applies to to_datetime (read_excel does not take it)
        datetimes = pd.to_datetime(df.columns, dayfirst=True)
        df.columns = datetimes
    except ValueError:
        print('failed to parse column in %s' % filename)
You can try this. It might solve your problem, as it can interpret several ways of writing dates.
from dateutil.parser import parse  # assuming parse is dateutil's flexible date parser

columns = df.columns
rename_cols = {}
for col in columns:
    rename_cols[col] = parse(str(col))
df = df.rename(columns=rename_cols)
