Merge misaligned pandas dataframes - python

I have around 100 csv files. Each of them is read into its own pandas dataframe; the dataframes are later merged and finally written to a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned - the data has been shifted to the left, but the column itself has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.

A generic solution for this type of problem is most likely overkill. Note that the only possible mistake is a value written into a column to the left of where it belongs.
If your problem is more complex than the two-column example you gave, keep a list of the expected column types around for convenience:
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column:
If it's a number, ignore it and move on (keep the NaN in the second value).
If it's a string, move it one column to the right.
In your trivial example, that would be:
import numpy as np

def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
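Here is one possible sketch of such an iterated check. It assumes that types lists the expected type for every column in order and that values only ever shift to the left; the column names and the realign_row helper are illustrative, not from the question.
import numpy as np
import pandas as pd

types = ['string', 'int', 'string']   # e.g. Name, Age, City

def realign_row(row, columns, types):
    # Walk the columns right to left; when a value does not match the expected
    # type of its column, push it one column to the right.
    for i in range(len(columns) - 2, -1, -1):
        col, expected = columns[i], types[i]
        val = row[col]
        if pd.isna(val):
            continue
        if expected == 'int':
            try:
                int(val)
            except (TypeError, ValueError):
                # A string landed in a numeric column: shift it right.
                row[columns[i + 1]] = val
                row[col] = np.nan
    return row

df = df.apply(lambda r: realign_row(r, list(df.columns), types), axis=1)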
The script cannot know the error with certainty
For example, if two adjacent columns both hold string values, the script cannot tell which column a given value belongs to. In that case, you're screwed. Use a second marker to flag these rows and fix them manually. You could of course do more advanced checks (the value should be a city name, so check whether it actually is one), but that is probably overkill, and doing it manually is faster.

Related

Change values in dataframe column based on another dataframe

I've seen a large number of similar questions but nothing quite answers what I am looking to do.
I have two dataframes:
Conn_df, which contains manually entered names and company details (e.g. Conn_df["Name", "Company_name", "Company_Address"])
Cleanse_df, which contains cleaned-up company names (e.g. Cleanse_df["Original_Company_Name", "Cleanse_Company_Name"])
The data for both is held in csv files that are imported into the script.
I want to update the company names in Conn_df.Company_Name using the values in Cleanse_df: where Conn_df.Company_Name equals Cleanse_df.Original_Company_Name, it should be replaced by Cleanse_df.Cleanse_Company_Name.
I have tried:
Conn_df["Company"] = Conn_df["Company"].replace(Conn_df["Company"], Cleanse_df["Cleansed"]) but got
replace() takes no keyword arguments
I also tried:
Conn_df["Company"] = Conn_df["Company"].map(Cleanse_df.set_index("Original")["Cleansed"]) but got
Reindexing only valid with uniquely valued Index objects
Any suggestions on how to get the values replaced? I would note that both dataframes run to many tens of thousands of rows, so creating a manual list is not possible.
I think you want something along the lines of this:
conn_df = pd.DataFrame({'Name': ['Mac', 'K', 'Hutt'],
                        'Company_name': ['McD', 'KFC', 'PH'],
                        'Company_adress': ['street1', 'street2', 'street4']})
cleanse_df = pd.DataFrame({'Original_Company_Name': ['McD'],
                           'Cleanse_Company_Name': ['MacDonalds']})

cleanse_df = cleanse_df.rename(columns={'Original_Company_Name': 'Company_name'})
merged_df = conn_df.merge(cleanse_df, on='Company_name', how='left')
merged_df['Cleanse_Company_Name'].fillna(merged_df['Company_name'], inplace=True)
final_df = merged_df[['Name', 'Company_adress', 'Cleanse_Company_Name']]\
    .rename(columns={'Cleanse_Company_Name': 'Company_name'})
This would return:
Name Company_adress Company_name
0 Mac street1 MacDonalds
1 K street2 KFC
2 Hutt street4 PH
You merge the two dataframes and keep the replacement value; if there is no replacement for a name, the original name simply stays, which is what the fillna call takes care of.
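As a side note, the map approach from the question can also work if the lookup table is deduplicated first. A rough sketch, written against the un-renamed cleanse_df columns from the question, assuming the "uniquely valued Index" error came from duplicate Original_Company_Name entries:
lookup = (cleanse_df.drop_duplicates(subset='Original_Company_Name')
                    .set_index('Original_Company_Name')['Cleanse_Company_Name'])
conn_df['Company_name'] = conn_df['Company_name'].map(lookup).fillna(conn_df['Company_name'])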

How do I systematically compare all rows in two Pandas dataframes using specific columns and return the differences?

I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable, but not all. The rows represent individuals and the columns contain data about those individuals. I want to check certain column values of one dataframe, row by row, against the 'master' dataframe and then return the omissions so they can be added to it.
I have been using the df.query() method to check individual cases using my own inputs because I can search the master dataframe using multiple columns - so, something like df.query('surname == "JONES" and initials == "D V" and town == "LONDON"'). I want to do something like this but by creating a query of each row of the other dataframe using data from specific columns.
I think I can work out how to do this using for loops and if statements, but that's going to be wildly inefficient and obviously not ideal. A list comprehension might be more efficient, but I can't work out the dataframe comparison part unless I create a new column whose data is built from the values I want to compare (e.g. JONES-DV-LONDON), which seems wrong.
There is an answer in here I think but it relies on the dataframes being more or less identical (which mine aren't - hence my wish to compare only certain columns).
I have been unable to find an example of someone doing the same, which might be my failure again. I am a novice and I have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...
EDIT - some sample data (not exactly what I'm using but hopefully helps show what I am trying to do)
df1 (my master list)
surname initials town
JONES D V LONDON
DAVIES H G BIRMINGHAM
df2
surname initials town
DAVIES H G BIRMINGHAM
HARRIS P J SOUTHAMPTON
JONES D V LONDON
I would like to identify the columns to use in the comparison (surname, initials, town here - assume there are more with data that cannot be matched) and then return the unique results from df2 - so in this case a dataframe containing:
surname initials town
HARRIS P J SOUTHAMPTON
Define the columns to join on:
cols = ['surname', 'initials', 'town']
Then you can use merge with the parameter indicator=True, which shows where each row appears (left_only, right_only or both):
df_res = df1.merge(df2, how='outer', on=cols, indicator=True)
and exclude the rows that appear in both dataframes:
df_res = df_res[df_res['_merge'] != 'both']
print(df_res)
surname initials town _merge
2 HARRIS P J SOUTHAMPTON right_only
You can filter by left_only or right_only.
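For the result asked for in the question (rows that only exist in df2), one small sketch is to keep the right_only rows and drop the indicator column:
df2_only = df_res[df_res['_merge'] == 'right_only'].drop(columns='_merge')
print(df2_only)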

How to perform an Excel INDEX MATCH equivalent in Python

I have a question regarding how to perform what would be the equivalent of returning a value using the INDEX MATCH functions in Excel and applying it in Python.
As an Excel user performing data analytics and manipulation on large datasets, I have moved to Python for efficiency. What I am attempting to do is populate a column within a pandas dataframe based on the value returned from a dictionary lookup.
In an attempt to do this I have used the following code:
# imported csv DataFrames
crew_data = pd.read_csv(r'C:\file_path\crew_data.csv')
export_template = pd.read_csv(r'C:\file_path\export_template.csv')

# contract number dictionary
contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}

# function that attempts to perform an INDEX MATCH equivalent
def contract_num():
    for x, y in enumerate(crew_data.loc[:, 'Region']):
        if y in contract.keys():
            num = contract[y]
        else:
            print('ERROR')
    return num

# for loop which prepares then exports the load data
for i, r in enumerate(export_template):
    export_template.loc[:, 'Contract'] = contract_num()

export_template.to_csv(r'C:\file_path\export_files\UPLOADER.csv')
print(export_template)
To summarise what the code is intended to do is as follows:
The for loop contained in the contract_num function begins by iterating over the Region column in the crew_data DataFrame
if the value y from the DataFrame matches a key in the contract dictionary (note: the Region column only contains two values, 'Southern' and 'Northern'), it will return the corresponding value from the contract dictionary
The for loop which prepares then exports the load data calls on the contract_num() function to populate the Contract column in the export_template DataFrame
Please note that there are 116 additional columns which are populated in this loop which have been excluded from the code above to save space.
When the code is executed it runs and produces output; however, the issue is that when the function is called in the second for loop it only ever returns the single value 048-2015D instead of the value that corresponds to the correct Region.
As mentioned previously this would have typically been carried out in Excel using INDEX MATCH, however doing so is not as efficient as using a script such as that above.
Being a beginner, I suspect the example code may appear convoluted and unnecessary and could be replaced with a more concise method.
If anyone could provide a solution or guidance that would be greatly appreciated.
df = pd.DataFrame({'Region': ['Northern', 'Northern', 'Northern',
                              'Northern', 'Southern', 'Southern',
                              'Northern', 'Eastern']})

contract = {'Northern': '046-2019',
            'Southern': '048-2015D'}

# similar to INDEX MATCH
df['Contract'] = df.Region.map(contract)
out:
Region Contract
0 Northern 046-2019
1 Northern 046-2019
2 Northern 046-2019
3 Northern 046-2019
4 Southern 048-2015D
5 Southern 048-2015D
6 Northern 046-2019
7 Eastern NaN
You can print an error if a Contract has not been matched:
if df.Contract.isna().any():
    print("ERROR")
or make an assertion:
assert not df.Contract.isna().any(), "found empty contract field"
and the output in this case:
AssertionError: found empty contract field

pandas dataframe throwing an empty list

I have a table where the column names are not really organized: different years of data sit under different column numbers.
So I have to access the data through specific column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but that returns the column name along with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner who did not even know what I wanted to ask.
The OP's question comes down to "getting the row as a list", since the post ends by asking
how to get the numbers (though it says "number", probably by mistake) of each row.
The answer is that the mistake was using double square brackets in the example, and that caused the problems.
The solution is to use df = df["2018/12"] instead of df = df[["2018/12"]].
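To make the difference concrete, a small sketch (assuming a column literally named "2018/12"):
sub_frame = df[["2018/12"]]           # double brackets: a DataFrame with one column
sub_series = df["2018/12"]            # single brackets: a Series
first_value = df["2018/12"].iloc[0]   # the first number in that column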
As for the things I (me at the time of writing this) mentioned, I will answer them one by one:
Let's say the table looks like this
Unnamed: 0 2018/12 country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JAP 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1> df = df[["2018/12"]]
: it will output a dataframe which only has the column "2018/12", plus the index column on the left side.
2> df.iloc[0,0]
Now, since from 1> we have a new dataframe with only one column (apart from the index column holding the index values), this will output the first element of that column.
In the example above, the outcome will be 809, since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
-> this doesn't make sense if you want to extract numbers. It will just output the single element
809 from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
Maybe you are confused about the outcome. (Maybe in this case "df" is the one from before your subset assignment df = df[["2018/12"]]?) Since df = df[["2018/12"]] outputs a dataframe, this will work fine on it.
4.
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes, df.loc[0] on df = df[["2018/12"]] will return the column name together with the first element of that column.
5.
How can I extract just the number of each row?
You mean "numbers" of each row, right?
Use this:
print(df["2018/12"].values.tolist())
In terms of finding columns or rows whose names vary, and then accessing those rows and columns, you should think of using a regex.
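A rough sketch of that idea, assuming the year columns follow a "YYYY/MM" pattern; DataFrame.filter accepts a regex for the column labels:
year_cols = df.filter(regex=r'^\d{4}/\d{2}$')
print(year_cols.columns.tolist())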

Getting errors whenever dealing with null or NaN types when working on csv files with pandas

I am trying to replace all the country ISO codes with full country names to keep everything consistent as part of cleaning some data. I managed to find the pycountry package, which helps a ton! There are some fields in the CSV file that are empty, which I believe is causing issues when running my code below.
Also, an additional question: I'm not sure if it's just me, but sometimes reading a CSV turns empty fields into null/NaN and sometimes into simply empty values. I don't really know what goes wrong there, but if possible I would like to change all those empty cells into one "thing" or type, for ease of filtering or dropping them.
df = pd.read_csv("file.csv")
#use pycountry to match the Nationalities as actual country names
import pycountry
list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]
def country_flag(df):
if (len(df['Nationality'])==2 and df['Nationality'] in list_alpha_2):
return pycountry.countries.get(alpha_2=df['Nationality']).name
elif (len(df['Nationality'])==3 and df['Nationality'] in list_alpha_3):
return pycountry.countries.get(alpha_3=df['Nationality']).name
elif (len(df['Nationality'])>3):
return df['Nationality']
else:
return '#N/A'
df['Nationality']=df.apply(country_flag,axis =1)
df
I was expecting the result to be something like:
0 AF 100 Afghanistan
1 #N/A
2 AUS 140 Australia
3 Germany 400 Germany
The error message I am getting is
TypeError: ("object of type 'float' has no len()", 'occurred at index 0')
Yet, there shouldn't be any float type values in the 'Nationality' column I am working on. I am guessing this is simply the empty/null/NaN values being considered a float type?
One idea is to remove the missing values first with Series.dropna and then use Series.apply:
print (df)
Nationality
0 AF
1 NaN
2 AUS
3 Germany
import numpy as np
import pycountry

list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]

def country_flag(x):
    if len(x) == 2 and x in list_alpha_2:
        return pycountry.countries.get(alpha_2=x).name
    elif len(x) == 3 and x in list_alpha_3:
        return pycountry.countries.get(alpha_3=x).name
    elif len(x) >= 3:
        return x
    else:
        return np.nan

df['Nationality'] = df['Nationality'].dropna().astype(str).apply(country_flag)
print (df)
Nationality
0 Afghanistan
1 NaN
2 Australia
3 Germany
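As for the second part of the question (turning all the empty cells into one consistent "thing"), one possible sketch, not taken from the answer above, is to read every column as text and fill the gaps with a single placeholder:
df = pd.read_csv("file.csv", dtype=str).fillna("")
empty_nationality = df[df['Nationality'] == ""]   # now easy to filter or drop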
One thing to watch out for: when pandas reads from a data source and tries to automatically assign a data type to each column, it will sometimes assign a different data type than you would expect, depending on whether there are empty values in the data source.
A classic example is integer values being converted to floats.
If you have a CSV file with this exact content (note missing value in row 2 of column A):
ColA,ColB
0,2
,1
5,4
then reading the file with
res_df=pandas.read_csv(filename)
will create a dataframe with floats in column A and integers in column B.
This is due to the fact that there is no canonical way to assign an "empty" value to an integer, whereas a float can simply be set to NaN (not a number).
But if that value were present, you would get two columns of integers.
Just something to be aware of: it is easily forgotten, and then suddenly you are getting floats instead of integers in your code and are confused about it.
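A small sketch demonstrating this behaviour with an in-memory copy of the CSV above; the nullable "Int64" dtype is one way to keep integers despite the missing value, if that matters for your use case:
import io
import pandas as pd

csv_text = "ColA,ColB\n0,2\n,1\n5,4\n"

res_df = pd.read_csv(io.StringIO(csv_text))
print(res_df.dtypes)    # ColA becomes float64 because of the missing value, ColB stays int64

res_int = pd.read_csv(io.StringIO(csv_text), dtype={"ColA": "Int64"})
print(res_int.dtypes)   # ColA is now the nullable Int64 dtype, the gap shows up as <NA>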
