Substituting column value if particular column exists in two DataFrames with Pandas - python

I have 2 data frames representing CSV files as such:
# 1.csv
id,email
1,someone#email.com
2,someoneelse#email.com
...
# 2.csv
id,email
3,someone#otheremail.com
4,someone#email.com
...
What I'm trying to do is to merge both tables into one DataFrame using Pandas based on whether a particular column (in this case column 2, email) is identical in both DataFrames.
I need the merged DataFrame to choose the id from 2.csv.
For example, using the sample data above, since the email column value someone#email.com exists in both CSVs, I need the merged DataFrame to output the following:
# 3.csv
id,email
4,someone#email.com
2,someoneelse#email.com
3,someone#otheremail.com
What I have so far is as follows:
df1 = pd.read_csv('/path/to/1.csv')
print("df1 has {} rows".format(len(df1.index)))
# "df1 has 14072 rows"
df2 = pd.read_csv('/path/to/2.csv')
print("df2 has {} rows".format(len(df2.index)))
# "df2 has 56766 rows"
join = pd.merge(df1, df2, on="email", how="inner")
print("join has {} rows".format(len(join.index)))
# "join has 321 rows"
The problem is that the join DataFrame contains only the rows where the email field exists in both DataFrames. What I expect is an output DataFrame with 56766 + 14072 - 321 = 70517 rows, with the id values taken from 2.csv when the email field is identical in both source DataFrames. I tried changing the merge to how="left" and how="right", but the results are identical.

from datatable import dt, f, by
df1 = dt.Frame("""
id,email
1,someone#email.com
2,someoneelse#email.com
""")
df1['csv'] = 1
df2 = dt.Frame("""
id,email
3,someone#otheremail.com
4,someone#email.com
""")
df2['csv'] = 2
df_all = dt.rbind(df1, df2)
df_result = df_all[-1, ['id'], by('email')]
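
The same "stack both tables, then keep the last row per email" idea can be sketched in pandas as well. This is a minimal sketch, assuming each email appears at most once per file; the in-memory CSV strings stand in for 1.csv and 2.csv and are only there to make the example self-contained.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the example runs on its own
csv1 = "id,email\n1,someone#email.com\n2,someoneelse#email.com\n"
csv2 = "id,email\n3,someone#otheremail.com\n4,someone#email.com\n"
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))

# Stack df1 first and df2 last, so keep="last" prefers the id from 2.csv
combined = pd.concat([df1, df2], ignore_index=True)
result = combined.drop_duplicates(subset="email", keep="last").reset_index(drop=True)
print(result)
```

Under that assumption, the row count of result is len(df1) + len(df2) minus the number of shared emails, which is the 70517 the question expects for the full files.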

Resolved it by uploading the files to Google Spreadsheet and using VLOOKUP.

Related

Using Pandas, Update column values based on a list of ID and new Values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all the examples I have seen, the new value is always the same or comes from a column. In my case, the value is dynamic.
This is what I would like:
file = ('something.csv') # Has 300 rows
IDList = ['453164259','453106168','453163869','453164463'] # [ID]
SellList=[120,270,350,410] # Sells values
csv = path_pattern = os.path.join(os.getcwd(), file)
df = pd.read_csv(file)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x] # Update the rows with the corresponding Sell value of the ID.
df.to_csv(file)
Any ideas?
Thanks in advance
Assuming 'id' is a string (as in IDList) and is not the index of your df:
IDList = ['453164259', '453106168', '453163869', '453164463']  # [ID]
SellList = [120, 270, 350, 410]
# Map each ID to its new Sell value
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in id_dict:
        df.loc[index, 'Sell'] = id_dict[row['id']]
If id is the index:
IDList = ['453164259', '453106168', '453163869', '453164463']  # [ID]
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in id_dict:
        df.loc[index, 'Sell'] = id_dict[index]
What I did was create a dictionary from IDList and SellList and then loop over the df using iterrows().
df = pd.read_csv('something.csv', dtype={'id': str})  # read id as a string so it matches IDList
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
This will work efficiently, especially for large files:
df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList  # every ID in IDList must exist in the index
df = df.reset_index()  # not mandatory, just in case you need 'id' back as a column
df.to_csv('something.csv', index=False)
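
A loop-free variant that also tolerates IDs missing from the frame can be sketched with Series.map. This is only a sketch: the three-row frame is made-up sample data, and the snake_case names are chosen for the example. Label-based .loc assignment with a list raises a KeyError when a listed ID is absent, whereas map simply leaves unmatched rows alone.

```python
import pandas as pd

# Made-up sample frame; '111' is not in the update list,
# and two of the listed IDs are not in the frame
df = pd.DataFrame({"id": ["453164259", "111", "453163869"],
                   "Sell": [1, 2, 3]})

id_list = ["453164259", "453106168", "453163869", "453164463"]
sell_list = [120, 270, 350, 410]
new_sell = dict(zip(id_list, sell_list))

# map() yields NaN where an id has no new value; fillna() restores the old Sell
df["Sell"] = df["id"].map(new_sell).fillna(df["Sell"]).astype(int)
print(df)
```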

Pyspark dataframe join based on key,group by and max

I have two parquet files, which I load with spark.read. The two dataframes share a column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df columns are ["key","Duration","Distance"] and df2 columns are ["key","department id"]. In the end I want to print Duration, max(Distance) and department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
But I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df2:
369367789289,1
369367789290,2
Columns are separated by ",". The first column in both files is my key; the first file then has timestamps, longitudes and latitudes. The second file has only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
and then as i showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be
Distance Duration department id
grouped by department id, keeping only the row with max(Distance).

Trying to access one cell in a pandas dataframe

I have imported two .csv files as pandas DataFrames. The first, df1, looks something like this:
projName projOwner Data
proj0 projOwnder0 5
proj1 projOwnder1 7
proj2 projOwnder2 8
proj3 projOwnder3 3
The second DataFrame, df2, looks like this:
projName projOwner projEmail projFirstName projLastName
proj0 projOwnder0 email0 firstName0 lastName0
proj1 projOwnder1 email1 firstName1 lastName4
proj2 projOwnder2 email2 firstName2 lastName5
proj3 projOwnder3 email3 firstName3 lastName6
Basically, what I have done is set the index of df2 to projName. Now I am iterating through the rows of df1 and want to look up data in df2 based on df1.
df2 = df2.set_index("projName")
for index, row in df1.iterrows():
    project_name = str(row['projName'])
    firstName = df2.loc[project_name, 'projFirstName']
    lastName = df2.loc[project_name, 'projLastName']
I have done this and it works on some of the rows, but for others it gives me a string of different values in that column. I have tried using .at, .iloc and .loc without success. Can someone help me see what I am doing wrong?
One way to do this that would be much easier would be to use the pandas merge function to merge the dataframes first, then you don't have to reference the data in one dataframe by the data in another - it's all in one place. For example:
import pandas as pd
df1 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'Data': [5, 7]})
df2 = pd.DataFrame({'projName': ['proj0', 'proj1'],
                    'projOwner': ['projOwner0', 'projOwner1'],
                    'projEmail': ['email0', 'email1']})
# Merge on the columns the two frames share
df = df1.merge(df2, on=['projName', 'projOwner'])
print(df)
for index, row in df.iterrows():
    print(row['projName'])
    print(row['projOwner'])
    print(row['projEmail'])
    print(row['Data'])
df now looks like this:
Data projName projOwner projEmail
0 5 proj0 projOwner0 email0
1 7 proj1 projOwner1 email1
And looping through the rows and printing the project, project owner, and email, and data gives this:
proj0
projOwner0
email0
5
proj1
projOwner1
email1
7
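
As for why the original lookup returns several values for some rows: one likely cause, which the question does not confirm, is that some projName values occur more than once in df2. With a non-unique index, .loc returns a Series for the repeated labels and a single value for the others. A small sketch with invented data shows the effect and one way to check for it:

```python
import pandas as pd

# Invented data: 'proj1' appears twice, 'proj0' once
df2 = pd.DataFrame({
    "projName": ["proj0", "proj1", "proj1"],
    "projFirstName": ["firstName0", "firstName1", "firstName1b"],
}).set_index("projName")

single = df2.loc["proj0", "projFirstName"]  # one match: a single value
multi = df2.loc["proj1", "projFirstName"]   # two matches: a Series

# List the index labels that occur more than once
dupes = df2.index[df2.index.duplicated()].unique()

# One option: keep only the first row for each label
deduped = df2[~df2.index.duplicated(keep="first")]

print(type(multi).__name__, list(dupes))
```

If duplicates are the cause, either deduplicating df2 first or merging as above (and accepting one output row per match) resolves the inconsistency.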

How to Copy the Matching Columns between CSV Files Using Pandas?

I have two dataframes(f1_df and f2_df):
f1_df looks like:
ID,Name,Gender
1,Smith,M
2,John,M
f2_df looks like:
name,gender,city,id
Problem:
I want the code to compare the header of f1_df with that of f2_df on its own and copy the data of the matching columns using pandas.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw f1_df and f2_df
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I am new to Pandas and not sure how to handle the problem. I have tried to do an inner join to the matching columns, but that did not work.
Here is what I have so far:
import pandas as pd
f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")
for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)
            print(joined)
Any idea how to solve this?
Try this if you want to merge/join your DFs on common columns.
First, let's convert all column names to lower case:
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
now we can join on common columns
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()
Output:
In [259]: joined
Out[259]:
id name gender city
0 1 Smith M NaN
1 2 John M NaN
export to CSV:
In [262]: joined.to_csv('c:/temp/joined.csv', index=False)
c:/temp/joined.csv:
id,name,gender,city
1,Smith,M,
2,John,M,
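
If the output should keep the column order of f2_df (name,gender,city,id, as in the question), reindexing on f2_df's header is a possible alternative to the join. This is a minimal sketch; the in-memory CSV strings stand in for file1.csv and file2.csv.

```python
import io
import pandas as pd

# Stand-ins for file1.csv and file2.csv (file2 has a header only)
f1_df = pd.read_csv(io.StringIO("ID,Name,Gender\n1,Smith,M\n2,John,M\n"))
f2_df = pd.read_csv(io.StringIO("name,gender,city,id\n"))

# Lower-case f1's header so it can be compared with f2's
f1_lower = f1_df.rename(columns=str.lower)

# Take f2's columns in f2's order; columns missing from f1 become NaN
result = f1_lower.reindex(columns=f2_df.columns)

csv_text = result.to_csv(index=False)
print(csv_text)
```

Columns that exist only in f1_df are dropped by the reindex, and columns that exist only in f2_df (here city) are written as empty fields.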

Python: Merging/joining two dataframes

I'm trying to merge/join two dataframes, each with three keys (Age, Gender and Signed_In). Both dataframes have the same parent and were created by groupby, but have unique value columns.
It seems like the merge/join should be painless given the unique combined keys are shared across both dataframes. Thinking there must be some simple error with my attempt at 'merge' and 'join' but can't for the life of me resolve it.
times = pd.read_csv('nytimes.csv')
# Produces times_mean table consisting of two value columns, avg_impressions and avg_clicks
times_mean = times.groupby(['Age','Gender','Signed_In']).mean()
times_mean.columns = ['avg_impressions', 'avg_clicks']
# Produces times_max table consisting of two value columns, max_impressions and max_clicks
times_max = times.groupby(['Age','Gender','Signed_In']).max()
times_max.columns = ['max_impressions', 'max_clicks']
# Following intended to produce combined table with four value columns
times_join = times_mean.join(times_max, on = ['Age', 'Gender', 'Signed_In'])
times_join2 = pd.merge(times_mean, times_max, on=['Age', 'Gender', 'Signed_In'])
You don't need the on kwarg when joining on equivalently structured MultiIndexes.
Here's an example demonstrating this:
import numpy as np
import pandas
a = np.random.normal(size=10)
b = a + 10
index = pandas.MultiIndex.from_product([['A', 'B'], list('abcde')])
df_a = pandas.DataFrame(a, index=index, columns=['colA'])
df_b = pandas.DataFrame(b, index=index, columns=['colB'])
df_a.join(df_b)
Which gives me:
colA colB
A a -1.525376 8.474624
b 0.778333 10.778333
c 1.153172 11.153172
d 0.966560 10.966560
e 0.089765 10.089765
B a 0.717717 10.717717
b 0.305545 10.305545
c 0.123548 10.123548
d -1.018660 8.981340
e -0.635103 9.364897
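
Applied to the groupby results from the question, the same index-based join might look like the sketch below. The small times frame and its Impressions and Clicks columns are assumptions made for illustration, since the contents of nytimes.csv are not shown.

```python
import pandas as pd

# Invented stand-in for nytimes.csv; the value column names are assumed
times = pd.DataFrame({
    "Age": [25, 25, 40, 40],
    "Gender": [0, 0, 1, 1],
    "Signed_In": [1, 1, 1, 1],
    "Impressions": [3, 5, 2, 8],
    "Clicks": [0, 1, 0, 2],
})
keys = ["Age", "Gender", "Signed_In"]

times_mean = times.groupby(keys).mean()
times_mean.columns = ["avg_impressions", "avg_clicks"]

times_max = times.groupby(keys).max()
times_max.columns = ["max_impressions", "max_clicks"]

# Both frames share the same (Age, Gender, Signed_In) MultiIndex,
# so a plain join aligns them without an `on` argument
times_join = times_mean.join(times_max)
print(times_join)
```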
