How to create a new file from two other csv files? - python

I have two .csv files.
First:
col. names: 'student_id' and 'mark'
Second:
col. names: 'student_id','name','surname'
and I want to create a third .csv file with 'student_id', 'name', 'surname' for rows where row['mark'] is 'five' or 'four'
good_student = []
for index, row in first_file.iterrows():
    if row['mark'] == 'five':
        good_student.append(row['student_id'])
    elif row['mark'] == 'four':
        good_student.append(row['student_id'])
for index, row in second_file.iterrows():
    for i in good_student:
        if row['student_id'] == i:

As the other user suggested, a dataframe is a robust way of handling your CSV problem. First, read the two CSV files into dataframes using the read_csv function. Then join the two based on student_id. The result is a dataframe with columns student_id, mark, name, and surname. Any missing values will be NaN (which dataframe the join is called on matters for how missing values are handled). The joined dataframe is then filtered by the value in the mark column.
import pandas as pd
df1 = pd.read_csv('one.csv') # student_id, mark
df2 = pd.read_csv('two.csv') # student_id, name, surname
df1 = df1.join(df2.set_index('student_id'), on='student_id')
df1 = df1.loc[(df1['mark'] == 'five') | (df1['mark'] == 'four')]
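To actually produce the third .csv the question asks for, the same filter can be written more compactly with isin and the result written out with to_csv; a small sketch, with 'three.csv' as an assumed output file name:
df1 = df1[df1['mark'].isin(['five', 'four'])]  # equivalent to the .loc filter above
df1[['student_id', 'name', 'surname']].to_csv('three.csv', index=False)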

You can just read both CSVs as dataframes and join them.
import pandas as pd
df_1 = pd.read_csv("csv_1")
df_2 = pd.read_csv("csv_2")
df_1 = df_1.join(df_2)
df_1.to_csv("new_csv")
The result is a CSV file with the columns appended side by side. Note that join here aligns rows by position (the default integer index), so this only works if line 1 of csv_1 and line 1 of csv_2 refer to the same thing (person, object, ad_id...).
Edit:
If both CSVs index their rows by student_id, the easiest way is to include that when loading the dataframes:
import pandas as pd
df_1 = pd.read_csv("csv_1", index_col = "student_id")
df_2 = pd.read_csv("csv_2", index_col = "student_id")
df_1 = df_1.join(df_2)
df_1.to_csv("new_csv")
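If you would rather keep student_id as a regular column instead of an index, a merge on that column does the same alignment; a minimal sketch using the same assumed file names:
import pandas as pd
df_1 = pd.read_csv("csv_1")
df_2 = pd.read_csv("csv_2")
df_1.merge(df_2, on="student_id").to_csv("new_csv", index=False)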

Related

Automatically transposing Excel user data in a Pandas DataFrame

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it, assuming there are likely more efficient ways, is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
- Your Excel file is named 'data.xlsx' and is in the current working directory
- The index next to each person increments by one EVERY time
- All people have a position described in brackets next to the name
- I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # header row: "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # drop the trailing bracket
        p_index = counter
        counter += 1
    else:
        # detail row: date in col1, amount in col2
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)

Pandas compare rows and columns from different excel files, update value or append value

Currently I have two CSV files: one is a temporary file (df1), which has 10+ columns, and the other is the Master file (df2), which has only 2 columns. I would like to iterate over the rows and compare the values of a column that appears in both files (UserName). If a UserName is already present in the Master file, add the value of the other shared column (Score) to that user's Score in the Master file (df2). If, on the other hand, a UserName from the temporary file is not present in the Master file, just append that row to the Master table as a new row.
Example Master file (df2):
Example temp file (df1):
New Master file I would like to obtain after the comparison:
I have the following code, but currently it appends all rows every time a comparison is made between the 2 files. I could use some help to determine whether it's even a good approach for the described problem:
import os
import win32com.client
import pandas as pd
import numpy as np

path = os.path.expanduser("Attachments")
MasterFile = os.path.expanduser("MasterFile.csv")
fields = ['UserName', 'Score']
df1 = pd.read_csv(zipfilepath, skipinitialspace=True, usecols=fields)
df2 = pd.read_csv(MasterFile, skipinitialspace=True, usecols=fields)
comparison_values = df1['UserName'].values == df2['UserName'].values
print(comparison_values)
rows = np.where(comparison_values == False)
for item in comparison_values:
    if item == True:
        df2['Score'] = df2['Score'] + df1['Score']
    else:
        df2 = df2.append(df1[{'UserName', 'Score'}], ignore_index=True)
df2.to_csv(MasterFile, mode='a', index=False, header=False)
EDIT:
What about a mix of integers and strings in the 2 files? Example:
Example Master file (df2):
Example temp file (df1):
New Master file I would like to obtain after the comparison:
IIUC, you can use
df = pd.concat([df1, df2]).groupby('UserName', as_index=False).sum()
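A quick check of what that one-liner does, on made-up toy data: scores of users present in both frames are summed, while users present in only one frame are carried over unchanged.
import pandas as pd
df1 = pd.DataFrame({'UserName': ['anna', 'bob'], 'Score': [3, 5]})
df2 = pd.DataFrame({'UserName': ['bob', 'carl'], 'Score': [2, 7]})
df = pd.concat([df1, df2]).groupby('UserName', as_index=False).sum()
print(df)  # anna 3, bob 7, carl 7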

Filtering a CSV File using two columns

I am a newbie to Python. I am working on a CSV file with over a million records. In the data, every Location has a unique ID (SiteID). I want to filter for and remove any records where there is a missing value or a mismatch between SiteID and Location in my CSV file. (Note: the script should print the line numbers and mismatched field values for each such record.)
I have the following code. Please help me out:
import pandas as pd
pd = pandas.read_csv ('car-auction-data-from-ghana', delimiter = ";")
pd.head()
date_time = (pd['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
comparison_column = pd.where(pd['SiteID'] == pd['Location'], True, False)
comparison_column
This should be your solution:
import pandas as pd

df = pd.read_csv('car-auction-data-from-ghana', delimiter=';')
print(df.head())
date_time = (df['Date Time'] >= '2010-01-01T00:00:00+00:00')  # to filter from a specific date
df = df[df['SiteID'] == df['Location']]  # keep only rows where SiteID matches Location
print(df)
You need to call read_csv as a member of pd, because pd is the alias for the imported pandas package, and use df as the variable for your data frame rather than overwriting pd. The comparison line keeps only the rows where SiteID and Location are equal, dropping the mismatched ones.
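The question also asks to print the line numbers and the mismatching field values. A minimal sketch of that, assuming the same column names and run before the filtering step (the != comparison also catches missing values, since NaN compares unequal to everything):
mismatches = df[df['SiteID'] != df['Location']]
for idx, row in mismatches.iterrows():
    print(idx, row['SiteID'], row['Location'])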

How to skip rows while importing csv?

How to skip rows based on a certain value in the first column of the dataset? For example: if the first column has some unwanted stuff in the first few rows, I want to skip those rows up to a trigger value. Please help me with importing the CSV in Python.
You can achieve this by using the argument skiprows.
Here is sample code below to start with:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=<the row you want to skip>)
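If the number of junk rows varies from file to file, one way (a sketch, assuming the trigger value sits in the first column; 'TRIGGER' is a hypothetical marker) is to scan the file once to locate the trigger row and pass that count to skiprows:
import pandas as pd

trigger = 'TRIGGER'  # hypothetical marker value in the first column
with open('users.csv') as f:
    n = next(i for i, line in enumerate(f) if line.split(',')[0].strip() == trigger)
df = pd.read_csv('users.csv', skiprows=n)  # the trigger row becomes the header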
For a series of CSV files in a folder, you could use a for loop: read each CSV file, remove the rows containing the string from the df, and lastly concatenate it to df_overall.
Example:
import glob
from pandas import DataFrame, concat, read_csv

df_overall = DataFrame()
dir_path = 'Insert your directory path'
for file_name in glob.glob(dir_path + '*.csv'):
    df = read_csv(file_name, header=None)
    df = df[~df[<column_name>].str.contains("<your_string>")]
    df_overall = concat([df_overall, df])
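Note that concatenating inside the loop copies df_overall on every iteration; collecting the per-file frames in a list and concatenating once at the end is usually faster when there are many files:
frames = []
for file_name in glob.glob(dir_path + '*.csv'):
    df = read_csv(file_name, header=None)
    frames.append(df[~df[<column_name>].str.contains("<your_string>")])
df_overall = concat(frames, ignore_index=True)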

apply result is a dataframe, how can I store it?

I have a dataframe containing a path to an excel file, a sheet name and an id, each in one column:
df = pd.DataFrame([['00100', 'one.xlsx', 'sheet1'],
                   ['00100', 'two.xlsx', 'sheet2'],
                   ['05300', 'thr.xlsx', 'sheet3'],
                   ['95687', 'fou.xlsx', 'sheet4'],
                   ['05300', 'fiv.xlsx', 'sheet5']],
                  columns=['id', 'file', 'sheet'])
This dataframe looks like:
      id      file   sheet
0  00100  one.xlsx  sheet1
1  00100  two.xlsx  sheet2
2  05300  thr.xlsx  sheet3
3  95687  fou.xlsx  sheet4
4  05300  fiv.xlsx  sheet5
I made a function to use with apply, which will read each file and return a dataframe.
def getdata(row):
    file = row['file']
    sheet = row['sheet']
    id = row['id']
    tempdf = pd.ExcelFile(file)   # Used on purpose
    tempdf = tempdf.parse(sheet)  # Used on purpose
    tempdf['ID'] = id
    return tempdf
I then use apply over the initial dataframe so it will return a dataset for each row. The problem is, I don't know how to store the dataframes created in this way.
I tried to store the dataframes in a new column, but the column stores None:
df['data'] = df.apply(getdata, axis=1)
I tried to create a dictionary but the ways that came to my mind were plain useless:
results = {df.apply(getdata, axis=1)} # for this one, in the function I tried to return id, tempdf
In the end, I ended up converting the 'id' column to an index and iterating over it in the following way:
for id in df.index:
    df[id] = getdata(df.loc[id], id)
But I want to know if there was a way to store the resulting dataframes without using an iterator.
Thanks for your feedback.
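One way to collect the results without the manual index bookkeeping (a sketch, not from the original thread, reusing the getdata function above): build the list of per-row dataframes with a comprehension and stack them with pd.concat; since getdata already tags each frame with its ID, the combined frame stays traceable.
results = [getdata(row) for _, row in df.iterrows()]  # one dataframe per input row
combined = pd.concat(results, ignore_index=True)      # single frame; the ID column tags each source row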
