Merging csv columns while checking ID of the first column - python

I have 4 csv files exported from an e-shop database. I need to merge them by columns, which I could maybe manage on my own, but the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and rows with the same "ph_ID" and "pe_photoID" too. I don't really know where to start; hopefully I wrote it understandably.
Update:
I am using:
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f, low_memory=False)
    dfs.append(df)
first_and_third = pd.merge(dfs[0], dfs[1], left_on="ep_ID", right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
OK, this code works, but it does two things differently from what I need:
When there is a row in file one with ID = 1, for example, and in file two there are 5 rows with bID = 1, it creates 5 rows in the final file. I would like to have one row that holds the multiple values from every row with bID = 1 in file number two. Is that possible?
And it seems to be deleting some rows... I'm not sure until I get rid of the "duplicates"...

You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd
files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)
first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
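As for the follow-up in the update (several matching rows in file two collapsing into a single row), one possible approach is to aggregate the photo side into lists per product before merging it onto the products. A hedged sketch, assuming the column names from the sample files and reusing second_and_fourth from above:
photos_per_product = (
    second_and_fourth
    .groupby('pe_productID')['ph_titleCS']
    .agg(list)       # one row per product, photo file names collected into a list
    .reset_index()
)
products_with_photos = first_and_third.merge(
    photos_per_product, left_on='ep_ID', right_on='pe_productID', how='left'
)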

Related

Load csv files with multiple columns into several dataframe

I am trying to load some large csv files which appear to have multiple columns and I am struggling with it.
I don't know who designed these csv files, but they appear to have event data as well as log data in each csv. At the start of each csv file there are some initial status lines as well.
Everything is in separate rows.
The Event data uses 2 columns (Date and Event comment).
The Log data has multiple columns (Date and 20+ columns).
I give an example of the type of data setup below:
Initial; [Status] The Zoo is Closed;
Initial; [Status] The Sun is Down;
Initial; [Status] Monkeys are sleeping;
Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
09:00; 5;5;0;10
08:30; [Event] Monkey Goes out
09:00; 5;4;1;10
08:30; [Event] Monkey Eats Banana
09:00; 5;4;1;9
08:30; [Event] Monkey Goes out
09:00; 5;5;2;9
Now what I want to do is to put the Log data into one data frame and the Initial and Event data into another.
Now I can read the csv files with csv_reader and go row by row, but this is proving very slow, especially when going through multiple files, each containing about 40k rows.
Below is the code I am using:
csv_files = [f for f in os.listdir('.') if f.endswith('.log')]
for file in csv_files:
    # Open the CSV file in read mode
    with open(file, 'r') as csv_file:
        # Use the csv module to parse the file
        csv_reader = csv.reader(csv_file, delimiter=';')
        # Loop through the rows of the file
        for row in csv_reader:
            # If the row has event data
            if len(row) == 2:
                # Add the row to the EventLog
                EventLog = EventLog.append(pd.Series(row), ignore_index=True)
            # If the row is separated by a single separator
            elif len(row) > 2:
                # First row entered into the data log will be the column headers
                if DataLog.empty:
                    DataLog = pd.DataFrame(columns=row)
                else:
                    # Add the row to the DataLog DataFrame
                    DataLog = DataLog.append(pd.Series(row), ignore_index=True)
Is there a better way to do this... preferably a faster one?
If I use pandas read_csv it seems to only load the Initial data, i.e. the first 3 lines of my data above.
I can use skiprows to skip down to where the data is and then it will load the rest, but I can't seem to figure out how to separate out the event and log data from there.
So I'm looking for ideas before I lose what little hair I have left.
If I understood your data format correctly, I would do something like this:
# simply read data as one column data without headers and indexes
df = pd.read_csv("your_file_name.log", header=None, sep=',')
# split values in this column by ; (in each row will be list of values)
tmp_df = df[0].str.split(";")
# delete empty values in the first 3 rows (because we have ; in the end of these rows)
tmp_df = tmp_df.map(lambda x: [y for y in x if y != ''])
# those rows which have 2 values we insert in one dataframe
EventLog = pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list())
# the other ones we insert into another dataframe (the first row will hold the column names)
data_log_tmp = tmp_df[tmp_df.str.len() != 2].to_list()
DataLog = pd.DataFrame(data_log_tmp[1:], columns=data_log_tmp[0])
Here is an example of loading the CSV file, assuming that the Monkeys_inside field is always NaN in event data and populated in log data, since I used it as the condition to retrieve the event data:
df = pd.read_csv('huge_data.csv', skiprows=3, sep=';')
log_df = df.dropna().reset_index(drop=True)
event_df = df[df['Monkeys_inside'].isnull()].reset_index(drop=True)
This also assumes that all your CSV files contain those 3 Status lines.
Keep in mind that the dataframe will hold duplicated rows if your csv files contain some; to remove them, you just need to call the drop_duplicates function and you're good:
event_df = event_df.drop_duplicates()

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a format that is messy to read in. Here is an example:
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the [Data] header to ['x', 'y'] and am able to read in the data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders, and I am trying to figure out the best way to read in the files and change the header of each file from [Data] to ['x', 'y'].
The csv files are in a folder one level below the script that is supposed to read them (i.e. folder 1 contains the code below, folder 2 contains the csv files, and folder 1 contains folder 2).
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
    for i in range(len(sets)):
        df_temp = pd.read_csv(glob.glob(pathName + sets[i] + ".csv"), sep=r'\s*,', skiprows=skip, engine='python')[:-1]
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
            df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]], errors='coerce')
        df.append(df_temp)
    return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
files = [f for set_num in sets for f in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows argument to skip header rows:
df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"], engine="python")
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of headings change.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if first_col not in df.columns:
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)
            # Use only existing column types
            types = {k: v for k, v in dtype.items() if k in data.columns}
            # Apply the column types again
            data = data.astype(dtype=types)
            break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adopted to use different conditions, eg looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x": "float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
    if row[0].isnumeric():
        data = df.iloc[(i + 1):].reset_index(drop=True)
        # Apply names and types
        data.columns = columns
        data = data.astype(dtype=dtypes)
        break
return data
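For readers who want to run those fragments outside their original context, here is a hedged sketch of how the second one might be wrapped into a callable helper. The function name, the header=None read and the file name are assumptions, not part of the answer above, and unlike the fragment it keeps the first numeric row itself as data:
import pandas as pd

def skip_to_numeric_rows(df, columns, dtypes):
    # Scan rows until the first cell looks numeric, then treat everything
    # from that row on as data, applying the given names and types.
    data = df
    for i, row in df.iterrows():
        if str(row[0]).strip().isnumeric():
            data = df.iloc[i:].reset_index(drop=True)
            data.columns = columns
            data = data.astype(dtype=dtypes)
            break
    return data

# Hypothetical usage: read with no header so the junk rows come in as data,
# and (as mentioned above) drop the two trailing total rows with skipfooter.
raw = pd.read_csv("messy_headers.csv", header=None, skipfooter=2, engine="python")
df_clean = skip_to_numeric_rows(raw, ["x", "y"], {"x": "float64", "y": "float64"})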

Creating a script in Python 3 to compare two CSVs and output the similarities and differences between the two into another set of CSVs

I have two CSVs; ideally they will contain the same data, but in reality the content may sometimes differ here and there. Instead of manually browsing the two CSVs to find out what's the same and what's different, I am trying to create a Python script, which I will run weekly, that will tell me what's the same and what's not.
Here's the logic.
1. Given 2 CSVs
2. Compare them row by row.
3. Any rows that are different between the two CSVs should be recorded into another CSV (the entire row/s)
4. Any rows that are same between the CSVs should be recorded into another CSV (the entire row/s).
This will help me visually see what the differences are and action them accordingly.
Below is an example of what I am looking for.
The code below is what I have so far:
with open('Excel 1.csv', 'r') as csvOne, open('Excel 2.csv', 'r') as csvTwo:
    csvONE = csvOne.readlines()
    csvTWO = csvTwo.readlines()
    with open('resultsSame.csv', 'w') as resultFileSame:
        for row in csvTWO:
            if row not in csvONE:
                resultFileSame.write(row)
    with open('resultsDifference.csv', 'w') as resultFileDifference:
        for row in csvTWO:
            if row in csvONE:
                resultFileDifference.write(row)
I want the script to compare rows and, only if there is a similarity or difference between rows, output that into another set of CSVs. The above code works, but it removes the columns which are in one CSV and not the other, rather than rows. I want to keep the columns even if they are not in the other CSV, and only show me which rows are in one or the other, in separate CSVs.
Please see below the results I get when I run the first code you've given on your dataset example.
If you look at the above, I can't quite figure out how you're getting the output that you are, as that is exactly what I want! To be honest, I don't need to print out the headers, as I am comparing those as well; they can sometimes end up different due to user error.
Here is the modified version of your code.
with open('excel1.csv', 'r') as csvOne, open('excel2.csv', 'r') as csvTwo:
    csvONE = csvOne.readlines()
    csvTWO = csvTwo.readlines()
    with open('resultsDifference.csv', 'w') as resultFileDifference:
        # Write the header to the difference file.
        # Because the headers are the same in the 2 input CSVs, the header row will obviously end up in resultsSame.csv
        resultFileDifference.write(csvONE[0])
        for row in csvTWO:
            if row not in csvONE:
                resultFileDifference.write(row)
    with open('resultsSame.csv', 'w') as resultFileSame:
        for row in csvTWO:
            if row in csvONE:
                resultFileSame.write(row)
Using pandas will make your work easier. Here is the snippet; it is fairly self-explanatory.
import pandas as pd
df1 = pd.read_csv('excel1.csv')
df2 = pd.read_csv('excel2.csv')
merged = df1.merge(df2, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
similar_df = merged[merged['_merge'] == 'both'].drop('_merge', axis=1)
print(diff_df)
print(similar_df)
diff_df.to_csv('resultsDifference.csv', index=False)
similar_df.to_csv('resultsSame.csv', index=False)
Documentation of the pandas merge function: pandas.DataFrame.merge
I've created the script based on the example you've given in your question. Here is a snapshot of the inputs and outputs of the example.
(Screenshots of Excel1, Excel2, resultsSame.csv and resultsDifference.csv omitted.)
I'm sure the script produces the results you've quoted in your question, except for the index. If you are interested in the row indices as in your question, then below is the updated script. Let me know whether it meets your needs.
import pandas as pd
df1 = pd.read_csv('excel1.csv')
df2 = pd.read_csv('excel2.csv')
merged = df1.merge(df2, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
similar_df = merged[merged['_merge'] == 'both'].drop('_merge', axis=1)
diff_df.index = range(1,len(diff_df)+1)
similar_df.index = range(1,len(similar_df)+1)
diff_df.to_csv('resultsDifference.csv')
similar_df.to_csv('resultsSame.csv')
Ah, now I'm wondering! These are the CSV file contents I have:
excel1.csv
A,B,C,D
A,A,A,A
B,B,B,B
C,C,C,A
D,,,
excel2.csv
A,B,C,D
A,A,A,A
B,B,B,B
C,C,C,C
D,D,,

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the column names.
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes ("Tin_MIX_Air", "Tout_Fan2b", etc.); there are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and use these names as the column names. I'm new to Python and not very sure how to do it.
import pandas as pd
path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x - 1, x - 2, x - 3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data: I'm dropping all the other columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and transform the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of code; as I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: Is it possible for it to work within the other "open" loop that I have?
Assume the column names are in rows 3 to 6 and the data runs from row 7 to EOF.
For instance (untested code)
data = None
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            ch, name = line.split(',')[:2]
            columns.append(name)
        elif row > 6:
            row_data = [tuple(line.strip().split(','))]
            if data is None:
                data = pd.DataFrame(row_data, columns=columns)
            else:
                data = pd.concat([data, pd.DataFrame(row_data, columns=columns)], ignore_index=True)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
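A possible continuation of that starting point (an untested sketch, like the code above): collect the quoted names and assign them to the DataFrame built in the question. The 2 < row <= 18 bound is an assumption based on the sample layout and the remark that there are exactly 16 rows with names:
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 18:
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
# ...then, after building `data` as in the question:
data.columns = columns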

Join two csv files with pandas/python without duplicates

I would like to concatenate 2 csv files. Each CSV file has the following structure:
File 1
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
File 2
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
I got a final csv that looks like:
Final file
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
So I have done this:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
full_df = pd.concat([df1, df2])
full_df = full_df.groupby(['id', 'category_id', 'lat', 'lng']).count()
full_df2 = full_df[['id', 'category_id']].groupby('id').agg('count')
full_df2.to_csv("final.csv", index=False)
I tried to group by id, category_id, lat and lng; the name could change.
After the first groupby I want to group again, now by id and category_id, because as shown in my example the first row changed in lng, but that is probably because file2 is an update of file1.
I don't fully understand groupby, because when I tried to print the result I only got the count values.
One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.
full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')
Check the documentation for drop_duplicates if you need further help.
I could resolve this problem with the following code:
import pandas as pd
df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")
df_final=pd.concat([df1,df2]).drop_duplicates(subset=['id','category_id','lat','lng']).reset_index(drop=True)
print(df_final.shape)
df_final2=df_final.drop_duplicates(subset=['id','category_id']).reset_index(drop=True)
df_final2.to_csv('final', index=False)
