I have imported two .csv files as pandas DataFrames. The first DataFrame, df1, looks something like this:
projName projOwner Data
proj0 projOwnder0 5
proj1 projOwnder1 7
proj2 projOwnder2 8
proj3 projOwnder3 3
The second DataFrame, df2, looks like this:
projName projOwner projEmail projFirstName projLastName
proj0 projOwnder0 email0 firstName0 lastName0
proj1 projOwnder1 email1 firstName1 lastName4
proj2 projOwnder2 email2 firstName2 lastName5
proj3 projOwnder3 email3 firstName3 lastName6
Basically, I have set the index of df2 to projName. Now I am iterating through the rows of df1 and want to look up data in df2 based on each row of df1:
df2 = df2.set_index("projName")
for index, row in df1.iterrows():
    project_name = str(row['projName'])
    firstName = df2.loc[project_name, 'projFirstName']
    lastName = df2.loc[project_name, 'projLastName']
I have done this and it works on some of the rows, but for others it gives me a string of different values from that column. I have tried using .at, .iloc, and .loc without success. Can someone help me see what I am doing wrong?
A much easier way to do this is to use the pandas merge function to merge the DataFrames first; then you don't have to reference the data in one DataFrame by the data in another - it's all in one place. For example:
import pandas as pd

df1 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'Data': [5, 7]})
df2 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'projEmail': ['email0', 'email1']})

df = df1.merge(df2, on=['projName', 'projOwner'])
print(df)

for index, row in df.iterrows():
    print(row['projName'])
    print(row['projOwner'])
    print(row['projEmail'])
    print(row['Data'])
df now looks like this:
  projName   projOwner  Data projEmail
0    proj0  projOwner0     5    email0
1    proj1  projOwner1     7    email1
And looping through the rows, printing the project name, project owner, email, and data gives this:
proj0
projOwner0
email0
5
proj1
projOwner1
email1
7
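As a side note on the original symptom (getting "a string of different values" for some rows): that typically happens when the projName index in df2 contains duplicate values, so .loc returns a whole Series of matches instead of a scalar. A minimal sketch with made-up data showing the check and one possible fix:

```python
import pandas as pd

# Hypothetical frame with a duplicated projName key
df2 = pd.DataFrame({'projName': ['proj0', 'proj0', 'proj1'],
                    'projFirstName': ['firstName0', 'firstNameX', 'firstName1']})
df2 = df2.set_index('projName')

print(df2.index.is_unique)  # False: 'proj0' appears twice
# With a duplicated key, .loc returns a Series of all matches, not a scalar
print(df2.loc['proj0', 'projFirstName'])

# Keep only the first row per key so lookups return scalars again
df2 = df2[~df2.index.duplicated(keep='first')]
print(df2.loc['proj0', 'projFirstName'])
```

Checking df2.index.is_unique right after set_index is a quick way to tell whether scalar .loc lookups will behave.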
I need to compile grades from 10 files named quiz2, quiz3 [...], quiz11.
I apply the following transformations:
Import the xls to df with pandas
Get only the 4 renamed columns
Keep only the highest grade if there are multiple values for the same ID
The code for one dataframe is the following:
quiz2 = pd.read_excel(r'C:\Users\llarbodiere\Desktop\Perso\grade compil\quiz\quiz2.xls')
quiz2 = quiz2.rename({'Nom d’utilisateur': 'ID', 'Note totale': 'quiz2'}, axis='columns')
quiz2 = quiz2[['Nom', 'Prénom', 'ID', 'quiz2']]
quiz2 = quiz2.groupby("ID").max().sort_values("Nom").fillna(0)
I want to apply the same transformations to all the quizzes from quiz2 to quiz11. I have tried a for loop but it did not work.
Thanks in advance!
You could generate each file name dynamically by looping over the numbers 2 to 11 and concatenating the number with the file name prefix and suffix.
# create an empty DataFrame for collecting loop results
cumulative_df = pd.DataFrame()

# loop through quiz numbers 2 to 11
for x in range(2, 12):
    # generate the file name
    file = 'quiz' + str(x) + '.xls'
    df = pd.read_excel('C:/Users/llarbodiere/Desktop/Perso/grade compil/quiz/' + file)
    df = df.rename({'Nom d’utilisateur': 'ID', 'Note totale': 'quiz' + str(x)}, axis='columns')
    df = df[['Nom', 'Prénom', 'ID', 'quiz' + str(x)]]
    df = df.groupby("ID").max().sort_values("Nom").fillna(0)
    # append this quiz's frame to the cumulative DataFrame
    cumulative_df = pd.concat([cumulative_df, df])

print(cumulative_df)
EDIT: the example above is for the specific file names you mentioned. This could be generalized further to work for all files in a given directory, for example.
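To sketch that generalization: pathlib's glob can pick up every quiz file in the directory, so the numbers never need to be hard-coded. The snippet below uses throwaway CSV files in a temp directory as stand-ins for the real .xls files (swap in pd.read_excel and your own path):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Dummy files standing in for the real quiz spreadsheets
quiz_dir = Path(tempfile.mkdtemp())
for n in (2, 3):
    (quiz_dir / f'quiz{n}.csv').write_text(
        "Nom d’utilisateur,Note totale\nid1,5\nid2,7\n", encoding='utf-8')

frames = []
# with real data: quiz_dir.glob('quiz*.xls') and pd.read_excel(path)
for path in sorted(quiz_dir.glob('quiz*.csv')):
    df = pd.read_csv(path)
    # name the grade column after the file (quiz2, quiz3, ...)
    df = df.rename({'Nom d’utilisateur': 'ID', 'Note totale': path.stem}, axis='columns')
    frames.append(df.set_index('ID'))

# one column per quiz, aligned on student ID
cumulative_df = pd.concat(frames, axis=1)
print(cumulative_df)
```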
I have 2 data frames representing CSV files as such:
# 1.csv
id,email
1,someone#email.com
2,someoneelse#email.com
...
# 2.csv
id,email
3,someone#otheremail.com
4,someone#email.com
...
What I'm trying to do is to merge both tables into one DataFrame using Pandas based on whether a particular column (in this case column 2, email) is identical in both DataFrames.
I need the merged DataFrame to choose the id from 2.csv.
For example, using the sample data above, since the email column value someone#email.com exists in both CSVs, I need the merged DataFrame to output the following:
# 3.csv
id,email
4,someone#email.com
2,someoneelse#email.com
3,someone#otheremail.com
What I have so far is as follows:
df1 = pd.read_csv('/path/to/1.csv')
print("df1 has {} rows".format(len(df1.index)))
# "df1 has 14072 rows"
df2 = pd.read_csv('/path/to/2.csv')
print("df2 has {} rows".format(len(df2.index)))
# "df2 has 56766 rows"
join = pd.merge(df1, df2, on="email", how="inner")
print("join has {} rows".format(len(join.index)))
# "join has 321 rows"
The problem is that the join DataFrame contains only the rows where the email field exists in both DataFrames. What I expect is an output DataFrame containing 56766 + 14072 - 321 = 70517 rows, with the id values taken from 2.csv whenever the email field is identical in both source DataFrames. I tried changing to merge(how="left") and merge(how="right"), but the results are identical.
One approach with the datatable library: stack both frames, then take the last row within each email group (df2 is bound last, so its id wins on duplicates):
from datatable import dt, f, by
df1 = dt.Frame("""
id,email
1,someone#email.com
2,someoneelse#email.com
""")
df1['csv'] = 1
df2 = dt.Frame("""
id,email
3,someone#otheremail.com
4,someone#email.com
""")
df2['csv'] = 2
df_all = dt.rbind(df1, df2)
df_result = df_all[-1, ['id'], by('email')]
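A plain-pandas sketch of the same idea, for comparison: concatenate both frames and keep the last occurrence of each email, so the id from 2.csv wins whenever the email appears in both (values below are the question's sample data):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2],
                    'email': ['someone#email.com', 'someoneelse#email.com']})
df2 = pd.DataFrame({'id': [3, 4],
                    'email': ['someone#otheremail.com', 'someone#email.com']})

# df2 is listed last, so keep='last' prefers its id on duplicate emails
merged = pd.concat([df1, df2]).drop_duplicates('email', keep='last')
print(merged)
```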
Resolved it by uploading the files to Google Spreadsheet and using VLOOKUP.
Hi, I have 4 pandas DataFrames: df1, df2, df3, df4.
What I'd like to do is iterate (using a for loop) over saving these DataFrames with to_pickle.
What I did is this:
out = 'mypath\\myfolder\\'
r = [orders, adobe, mails, sells]
for i in r:
    i.to_pickle(out + '\\i.pkl')
The command runs fine, but it does not save each DataFrame under its own name; it keeps overwriting the same file, i.pkl (I think because my code is not correct).
It seems it can't name each file after its DataFrame (e.g. inside the loop, orders should be saved as orders.pkl, and so on for the other DataFrames involved).
What I expect is to have 4 DataFrames saved with the names from the list r (so: orders.pkl, adobe.pkl, mails.pkl, sells.pkl).
How can i do this?
You can't stringify the variable name (this is not something you generally do), but you can do something simple:
import os

out = 'mypath\\myfolder\\'
df_list = [df1, df2, df3, df4]
for i, df in enumerate(df_list, 1):
    df.to_pickle(os.path.join(out, f'df{i}.pkl'))
If you want to provide custom names for your files, here is my suggestion: use a dictionary.
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))
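For completeness, the frames come back later with pd.read_pickle; a runnable sketch using dummy frames and a temp directory in place of your real data and path:

```python
import os
import tempfile
import pandas as pd

out = tempfile.mkdtemp()  # stand-in for 'mypath\\myfolder\\'
df_map = {'orders': pd.DataFrame({'Data': [5]}),
          'adobe': pd.DataFrame({'Data': [7]})}

# write one pickle per name...
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))

# ...and load one back by name
orders = pd.read_pickle(os.path.join(out, 'orders.pkl'))
print(orders)
```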
I have two dataframes(f1_df and f2_df):
f1_df looks like:
ID,Name,Gender
1,Smith,M
2,John,M
f2_df looks like:
name,gender,city,id
Problem:
I want the code to compare the headers of f1_df and f2_df by itself and copy the data of the matching columns using pandas.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw f1_df and f2_df
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I am new to pandas and not sure how to handle this problem. I have tried an inner join on the matching columns, but that did not work.
Here is what I have so far:
import pandas as pd

f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")
for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)
print(joined)
Any idea how to solve this?
Try this if you want to merge/join your DFs on common columns.
First, let's convert all columns to lower case:
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
Now we can join on the common columns:
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()
Output:
In [259]: joined
Out[259]:
id name gender city
0 1 Smith M NaN
1 2 John M NaN
export to CSV:
In [262]: joined.to_csv('c:/temp/joined.csv', index=False)
c:/temp/joined.csv:
id,name,gender,city
1,Smith,M,
2,John,M,
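The same result can also be had with pd.merge on the common columns; a sketch with made-up sample data (the city value is invented for illustration):

```python
import pandas as pd

f1_df = pd.DataFrame({'ID': [1, 2], 'Name': ['Smith', 'John'], 'Gender': ['M', 'M']})
f2_df = pd.DataFrame({'name': ['Smith'], 'gender': ['M'], 'city': ['NYC'], 'id': [1]})

# normalize the headers, then outer-merge on whatever columns both share
f1_df.columns = f1_df.columns.str.lower()
common_cols = f2_df.columns.intersection(f1_df.columns).tolist()
joined = f1_df.merge(f2_df, on=common_cols, how='outer')
print(joined)
```

how='outer' keeps rows from either side, filling the missing city values with NaN, which matches the blank fields in the desired CSV output.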
I have a DataFrame that consists of one column ('Vals') which is a dictionary. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another with MeasureValues?
Try this (note: Series.iteritems was removed in pandas 2.0, so use items):
l = []
for idx, row in df['Vals'].items():
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using your sample row twice to illustrate the iteration); note that .ix has been removed from pandas, so use positional indexing instead:
new_df = pd.DataFrame()
for id, data in fff.iterrows():
    d = {'TradeId': data.iloc[0]['TradeId']}
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
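On recent pandas versions, pd.json_normalize can do this flattening declaratively: record_path walks into the nested lists and meta carries top-level keys alongside. A sketch using a trimmed-down record shaped like the question's dictionaries:

```python
import pandas as pd

# trimmed-down record shaped like the question's dictionaries
rec = {'TradeId': '17304M',
       'Measures': [{'AssetName': 'Ie0',
                     'MeasureValues': [{'Amount': -18.64}]}]}
fff = pd.DataFrame({'Vals': [rec, rec]})

# one row per MeasureValues entry, with TradeId carried along
flat = pd.json_normalize(fff['Vals'].tolist(),
                         record_path=['Measures', 'MeasureValues'],
                         meta=['TradeId'])
print(flat)
```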