Merge CSVs with some common columns and fill in NaNs - python

I have several CSV files (all in one folder) that have columns in common but also have distinct columns. They all contain the IP column. The data looks like:
File_1.csv
a,IP,b,c,d,e
info,192.168.0.1,info1,info2,info3,info4
File_2.csv
a,b,IP,d,f,g
info,,192.168.0.1,info2,info5,info6
As you can see, File 1 and File 2 disagree on what belongs in column d, but I do not mind which file the information is kept from. I have tried pandas.merge, but it returns two separate rows for 192.168.0.1, with NaN in the columns that are present in File 1 but not in File 2 and vice versa. Does anyone know of a way to do this?
Edit 1:
The desired output should look like:
output
a,IP,b,c,d,e,f,g
info,192.168.0.1,info1,info2,info3,info4,info5,info6
I would like the output to look like this for all rows; note that not every item in file 1 is in file 2 and vice versa.
Edit 2:
Any IP address present in file 1 but not in file 2 should have a blank or Not Available value in any unique columns of the output file. For example, in the output file, columns f and g would be blank for IP addresses that were present in file 1 but not in file 2. Similarly, for an IP in file 2 and not in file 1, columns c and e would be blank in the output file.

This case:
Set the IP address as the index column and then use combine_first() to fill in the holes in a DataFrame that is the union of all IP addresses and columns.
import pandas as pd

# read in the files using the IP address as the index column
df_1 = pd.read_csv('file1.csv', header=0, index_col='IP')
df_2 = pd.read_csv('file2.csv', header=0, index_col='IP')

# fill in the NaNs: df_1 values win, holes are filled from df_2
combined_df = df_1.combine_first(df_2)
combined_df.to_csv('combined.csv')  # write the merged table back out
EDIT: combine_first() takes the union of the indices, so we should put the IP address in the index column to ensure that IP addresses from both files end up in the result.
combine_first() for other cases:
As the documentation states, you'll only have to be careful if the same IP address in both files has conflicting non-empty information for a column (such as column d in your example). In df_1.combine_first(df_2), df_1 is prioritized and column d will be set to the value from df_1. Since you said it doesn't matter which file the information is drawn from in this case, this isn't a concern for this problem.
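As a toy illustration of that priority (made-up one-row frames, not your real files):
import pandas as pd

df_1 = pd.DataFrame({'d': ['info3']}, index=pd.Index(['192.168.0.1'], name='IP'))
df_2 = pd.DataFrame({'d': ['conflict'], 'f': ['info5']}, index=pd.Index(['192.168.0.1'], name='IP'))

# column d keeps 'info3' from df_1; column f is filled in from df_2
print(df_1.combine_first(df_2))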

I think a simple dictionary should do the job. Assume you've read the contents of each file into lists file1 and file2, so that:
file1[0] = [a,IP,b,c,d,e]
file1[1] = [info,192.168.0.1,info1,info2,info3,info4]
file2[0] = [a,b,IP,d,f,g]
file2[1] = [info,,192.168.0.1,info2,info5,info6]
(with quotes around each entry). The following should do what you want:
new_dict = {}
for i in range(len(file2[0])):
    new_dict[file2[0][i]] = file2[1][i]
for i in range(len(file1[0])):
    new_dict[file1[0][i]] = file1[1][i]
output = [[], []]
output[0] = [key for key in new_dict]
output[1] = [new_dict[key] for key in output[0]]
Then you should get:
output[0] = [a,IP,b,c,d,e,f,g]
output[1] = [info,192.168.0.1,info1,info2,info3,info4,info5,info6]
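In case it helps, one possible way to build those file1/file2 lists is with the standard csv module (file names taken from the question):
import csv

with open('File_1.csv', newline='') as f1, open('File_2.csv', newline='') as f2:
    file1 = list(csv.reader(f1))  # file1[0] is the header row, file1[1] the first data row
    file2 = list(csv.reader(f2))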

Related

Changing Headers in .csv files

Right now I am trying to read in data that is provided in a messy, hard-to-read format. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders, and I am trying to figure out the best way to read in the files and change the header of each one from ['DATA'] to ['x', 'y'].
The files are in a folder one level below the script that is supposed to read them (i.e. folder 1 contains the code below and also contains folder 2, which holds the files).
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader(sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName + sets[i] + ".csv"), sep=r'\s*,', skiprows=skip, engine='python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]], errors='coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
import glob

# glob.glob returns a list, so flatten with a nested comprehension
files = [f for set_num in sets for f in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows and skipfooter arguments to skip header and footer rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"], engine="python")  # skipfooter is only supported by the python engine
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately, I also have files where the number of heading rows changes.
In that case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The column names are taken from that first "good" row and the types from the first data row.
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if first_col not in df.columns:
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)
            # Use only existing column types
            types = {k: v for k, v in dtype.items() if k in data.columns}
            # Apply the column types again
            data = data.astype(dtype=types)
            break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adopted to use different conditions, eg looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
if row[0].isnumeric():
data = df.iloc[(i + 1):].reset_index(drop=True)
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
break
return data
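Both snippets assume df already holds the raw file (read without a header) and that they live inside a helper that ends with return data. A minimal sketch of such a wrapper, with an assumed name and signature and assuming every row has the same number of fields, might be:
import pandas as pd

def clean_raw_csv(path, first_col, dtype):
    # hypothetical wrapper name/signature, not from the original post
    # read everything as strings so the header-hunting comparison works
    df = pd.read_csv(path, header=None, dtype=str)
    data = df
    if first_col not in df.columns:
        for i, row in df.iterrows():
            if row[0] == first_col:  # found the real header row
                data = df.iloc[(i + 1):].reset_index(drop=True)
                data.columns = list(df.iloc[i].str.strip())
                types = {k: v for k, v in dtype.items() if k in data.columns}
                data = data.astype(dtype=types)
                break
    return data

df = clean_raw_csv("messy.csv", first_col="x", dtype={"x": "float64", "y": "float64"})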

How to prevent pandas.dataframe.to_csv from creating new columns when appending?

I am following Freddy's example in appending my csv file with unique values. Here is the code I am using:
header = ['user.username', 'user.id']
user_filename = f"{something}_users.csv"
if os.path.isfile(user_filename):  # checks if file exists
    # Read in old data
    oldFrame = pd.read_csv(user_filename, header=0)
    # Concat and drop dups
    df_diff = pd.concat([oldFrame, df[['user.username', 'user.id']]], ignore_index=True).drop_duplicates()
    # Write new rows to csv file
    df_diff.to_csv(user_filename, header=False, index=False)
else:  # file doesn't exist yet, so create it
    df.to_csv(user_filename, columns=header, header=['username', 'user_id'], index=False, mode='a')
Running this code for the first time returns the desired result: A csv file with two named columns (username and user_id) and the respective values. If I run it a second time, something odd happens: I still keep the old values and also the new values. But the new values appear below the old ones in two new (unnamed) columns like so:
username  user_id
user1     123
user2     456
user3     789
                    user4    124
The output I'm looking for is this:
username user_id
user1 123
user2 456
user3 789
user4 124
The main issue with the code is the naming convention. Try this piece of code:
header = ['user.username', 'user.user_id']
user_filename = "users.csv"
if os.path.isfile(user_filename):  # checks if file exists
    # Read in old data
    oldFrame = pd.read_csv(user_filename, header=0)
    # Concat and drop dups
    concat = pd.concat([oldFrame, df[['user.username', 'user.user_id']]], ignore_index=True)
    df_diff = concat.drop_duplicates()
    # Write new rows to csv file
    df_diff.to_csv(user_filename, header=['user.username', 'user.user_id'], index=False)
else:  # file doesn't exist yet, so create it
    df.to_csv(user_filename, columns=header, header=['user.username', 'user.user_id'], index=False, mode='a')
What this code does differently is that the header names read back from the file are the same as the header names of the data you concatenate with. You can use an interim dictionary to achieve this if you don't want to change your column names.
The problem is caused by concatenating two dataframes with different column names. The imported dataframe already has the new column names ('username' and 'user_id'), while the dataframe df still uses 'user.username' and 'user.id'.
To avoid the error, I changed this line
df_diff = pd.concat([oldFrame, df[['user.username', 'user.id']]],ignore_index=True).drop_duplicates()
to
df_diff = pd.concat([oldFrame, df[['user.username', 'user.id']].rename(columns={"user.username": "username", "user.id": "user_id"})],ignore_index=True).drop_duplicates()

Comparing two CSV files using lists and dictionaries

I have two CSV files: the first with 3 columns and numerous rows, and the second with 4 columns and numerous rows. I am trying to retrieve data from the first file based on the RemovedDes list (in the code below); RemovedDes is a filtered version of File 2, with the rows whose Destination starts with 'E' removed. Not all data from File 1 is going to be used, only the data that corresponds to RemovedDes, hence why I need to compare the two.
How can I print out only the relevant data from file 1?
I know it's probably very easy to do, but I am new to this; any assistance is much appreciated, cheers.
(For further clarification: I'm after the Eastings and Northings in File 1, but need to use RemovedDes (which filtered out unnecessary information in File 2) to match the data in the two files.)
File 1 Sample Data (many more rows):
Destination Easting Northing
D4 . 102019 . 1018347
D2 . 102385 . 2048908
File 2 Sample Data (many more rows):
Legend Destination Distance Width
10 D4 . 67 . 87
18 E2 . 32 . 44
Note that E2 is filtered out as it starts with E. See the code below for clarification.
first_file = open('file1.csv', 'r')
FILE1 = first_file.readlines()
print(FILE1)

list_dictionary = []
second_file = open('file2.csv', 'r')
FILE2 = second_file.readlines()
print(FILE2)
for line in FILE2:
    values = line.split(',')
    the_dictionary = {
        'LEG': values[0],   # Legend
        'DEST': values[1],  # Destination
        'DIST': values[2],  # Distance
        'WID': values[3],   # Width
    }
    list_dictionary.append(the_dictionary)
RemovedDes = []
for line_dict in list_dictionary:
    if not line_dict['DEST'].startswith('E'):  # filters out rows of File 2 whose Destination starts with the letter E
        RemovedDes.append(line_dict)
print(RemovedDes)
Based on the clarification in the comments, I suggest the following approach:
use a pandas.DataFrame as your data structure of choice
perform a join of your lists
The following code will create a pandas DataFrame named data with all entries of file2, extended by their respective entries in the Easting and Northing columns of file1:
import pandas as pd
file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
data = pd.merge(file2, file1, how = 'left', on = 'Destination')
Note: this assumes that Destination has unique values across the board and that both .csv-Files come with a header line.
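If you also need the filtering step from the question (dropping destinations that start with 'E'), a possible follow-up on that merged frame, assuming the column headers match the samples above, would be:
# keep only the rows whose Destination does not start with 'E'
filtered = data[~data['Destination'].str.startswith('E')]
print(filtered[['Destination', 'Easting', 'Northing']])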

Merging csv columns while checking ID of the first column

I have 4 CSV files exported from an e-shop database that I need to merge by columns, which I could maybe manage to do alone. But the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and likewise rows with the same "ph_ID" and "pe_photoID". I don't really know where to start; hopefully I wrote this understandably.
Update:
I am using :
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f, low_memory=False)
    dfs.append(df)
first_and_third = pd.merge(dfs[0], dfs[1], left_on="ep_ID", right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
OK, this code works, but it does two things differently from what I need:
When there is a row in file one with ID = 1, for example, and in file two there are 5 rows with bID = 1, it creates 5 rows in the final file. I would like to have one row that holds the values from every row with bID = 1 in file two. Is that possible?
And it seems to be deleting some rows... not sure until I get rid of the "duplicates"...
You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd

files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)

first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
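Regarding the update in the question (several rows in the second file sharing one ID, producing several output rows): one way to collapse the matches into a single row per ID after the merge is to group and aggregate the extra columns into lists, for example:
# one row per ep_ID, with the matching values gathered into lists
one_row_per_id = first_and_third.groupby('ep_ID', as_index=False).agg(list)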

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the column names:
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.); there are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here is where the values of each variable start.
What I need to do is create a DataFrame from this .csv and use those names as the column names. I'm new to Python and I'm not very sure how to do it.
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)

data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x-1, x-2, x-3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data; I'm dropping all the other columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names shown in the first piece of the file; as I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: is it possible for it to work within the other "open" loop that I have?
Assume Column Names from Row 2 up to 6, Data from Row 7 up to EOF.
For instance (untested code):
import pandas as pd

rows = []
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            # header rows: keep the channel id and the signal name
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
        elif row > 6:
            # data rows
            rows.append(tuple(line.strip().split(',')))

data = pd.DataFrame(rows, columns=columns)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
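One possible way to finish that loop for this particular file, assuming the name rows are exactly the lines that start with CH followed by a digit (as in the sample) and stripping the surrounding quotes:
names = []
with open(path) as fh:
    for line in fh:
        # channel rows look like: CH1, "Tin_MIX_Air",TEMP,...
        if line.startswith('CH') and line[2:3].isdigit():
            ch, name = line.split(',')[:2]
            names.append(name.strip().strip('"'))
# names can then be assigned with data.columns = names, provided the counts match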
