Join/Merge 2 Comma-Separated Text Files - python

I would really appreciate assistance in developing Python/pandas code to left-join 2 separate CSV files. I'm very new to Python, so I'm unclear how to begin. I have used Excel plugins to achieve the same, but they would normally hang or take hours to complete due to the huge number of records being processed.
Scenario: joining on the first column.
CSV1
AN DNM OCD TRI
1 343 5656 90
2 344 5657 91
3 345 5658 92
4 346 5659 93
CSV2
AN2 STATE PLAN
4 3 19
3 2 35
7 3 19
8 3 19
Desired result, inclusive of a match status if possible:
AN DNM OCD TRI STATE PLAN Join Status
1 343 5656 90 No_match
2 344 5657 91 No_match
3 345 5658 92 2 35 Match
4 346 5659 93 3 19 Match
All help appreciated.

You can use .merge with the indicator= parameter:
out = df1.merge(
    df2, left_on="AN", right_on="AN2", indicator="Join Status", how="left"
)
out = out.drop(columns="AN2")
out["Join Status"] = out["Join Status"].map(
    {"left_only": "No_match", "both": "Match"}
)
print(out)
Prints:
AN DNM OCD TRI STATE PLAN Join Status
0 1 343 5656 90 NaN NaN No_match
1 2 344 5657 91 NaN NaN No_match
2 3 345 5658 92 2.0 35.0 Match
3 4 346 5659 93 3.0 19.0 Match
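Since the question starts from CSV files rather than in-memory dataframes, here is a minimal end-to-end sketch of the same approach; the file names csv1.csv, csv2.csv, and result.csv are assumptions:
import pandas as pd

df1 = pd.read_csv("csv1.csv")   # columns: AN, DNM, OCD, TRI
df2 = pd.read_csv("csv2.csv")   # columns: AN2, STATE, PLAN

out = df1.merge(df2, left_on="AN", right_on="AN2", indicator="Join Status", how="left")
out = out.drop(columns="AN2")
out["Join Status"] = out["Join Status"].map({"left_only": "No_match", "both": "Match"})

out.to_csv("result.csv", index=False)   # write the joined result back to disk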

Let's assume you have df1 and df2 and you want to merge the two dataframes:
df = df1.merge(df2, how='left', left_on='AN', right_on='AN2')
I hope this will help you

Related

Subtract/Add existing values if contents of one dataframe is present in another using pandas

Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack 53
I have tried using pandas isin, but it only tells me whether a Number value is present or not as a boolean. Is there any way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print(df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
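For reference, a self-contained reproduction of the snippet above, with df1 and df2 built from the sample data in the question:
import pandas as pd

df1 = pd.DataFrame({'Number': [123, 124, 312, 314],
                    'Name': ['John', 'Alle', 'Amy', 'Holly'],
                    'Amount': [31, 33, 33, 35]})
df2 = pd.DataFrame({'Number': [312, 124, 317],
                    'Name': ['Amy', 'Alle', 'Jack'],
                    'Amount': [13, 35, 53]})

df = df1.merge(df2, on=['Number', 'Name'], how='outer', suffixes=('', '_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print(df)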

Python Pandas copy a set of cells within a dataframe based on a matching key

My knowledge of Pandas is relatively limited, and I've accomplished a lot with a small foundation + all the help in SO. This is the first time I've found myself at a dead end.
I'm trying to find the most efficient way to do the following:
I have a single df of ~150000 rows, with ~40 columns.
Here is a sample dataframe to work with for investigating a solution:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 123 12 1113
1 413-45365 1 889 75 6748
2 413-21165 8 554 13 4536
3 413-24354 1 387 35 7649
4 413-34658 2 121 88 2468
5 413-36889 4 105 76 3336
6 413-23457 5 355 42 7894
7 413-30089 5 146 10 9112
8 413-41158 5 453 91 4545
9 413-51015 9 654 66 2232
One of the columns is a unique ID; the remaining columns contain data corresponding to the object with that ID.
I've determined a merge-style (parent/child) relationship between the objects outside of the DF, and now need to paste data where that relationship exists, from a 'parent' ID to all of its 'child' IDs. For example:
If I've determined that 413-23457 is the parent of 413-20012 and 413-21165, I then need to copy the values from the parent only in columns WEIGHT, VOLUME, and PRODUCTIVITY (but not UniqueID or CST) to the child objects. I also determine that 413-41158 is the parent of 413-45365 and 413-51015.
I have to do this for many sets of these types of associations across the dataframe.
I've attempted to adapt a lot of sample code for pasting between dataframes, but several of my requirements appear to be making it difficult to find a useful enough sample. I can also envision a way where I create objects of everything using .iterrows(), and then match and paste accordingly in a loop. But, having resorted to .iterrows() for past solutions, and noting how long it can take, I don't think I can apply that here and sustain it for larger datasets.
Any help would be greatly appreciated.
Edit with additional content per suggested solution
If I rearrange the input dataframe so the rows are in a more random order, the suggested answers do not really do the trick (my fault for not better reflecting the actual dataset in this test sample).
Starting Dataframe is:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 123 12 1113
1 413-45365 1 889 75 6748
2 413-21165 8 554 13 4536
3 413-24354 1 387 35 7649
4 413-34658 2 121 88 2468
5 413-36889 4 105 76 3336
6 413-23457 5 355 42 7894
7 413-30089 5 146 10 9112
8 413-41158 5 453 91 4545
9 413-51015 9 654 66 2232
Current suggested solution is:
parent_child_dict = {
    '413-51015': '413-41158',
    '413-21165': '413-23457',
    '413-45365': '413-41158',
    '413-20012': '413-23457'
}
(df.merge(df.UniqueID.replace(parent_child_dict),
          on='UniqueID',
          how='right')
   .set_index(df.index)
   .assign(UniqueID=df.UniqueID,
           CST=df.CST)
)
Resulting Dataframe is:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 387 35 7649
1 413-45365 1 121 88 2468
2 413-21165 8 105 76 3336
3 413-24354 1 355 42 7894
4 413-34658 2 355 42 7894
5 413-36889 4 355 42 7894
6 413-23457 5 146 10 9112
7 413-30089 5 453 91 4545
8 413-41158 5 453 91 4545
9 413-51015 9 453 91 4545
The results are not what was expected now that the rows are in a random order, and I don't understand some of what has happened. The row with UniqueID 413-45365 was intended to mirror the data for 413-41158, but instead picked up the values (121, 88, 2468), which come from a completely unrelated row in the starting DF.
The first thing I would do is get your parent-child relationships into a dictionary, and then we can use replace and merge:
# create a dictionary of parent-child relationships
# (parent_objects, get_merge and get_object_info stand in for however you
#  discover the relationships outside the dataframe)
parent_child_dict = {}
for parent_id in parent_objects:
    children = get_merge(parent_id)
    for child in children:
        child_id = get_object_info(child)
        # update dict: map each child ID to its parent ID
        parent_child_dict[child_id] = parent_id
# parent_child_dict = {
#     '413-20012': '413-23457',
#     '413-21165': '413-23457',
#     '413-45365': '413-41158',
#     '413-51015': '413-41158'
# }
# merge and copy data back
(df.merge(df.UniqueID.replace(parent_child_dict),
          on='UniqueID',
          how='right')
   .set_index(df.index)
   .assign(UniqueID=df.UniqueID,
           CST=df.CST)
)
Output:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
1 413-23457 5 355 42 7894
2 413-20012 3 355 42 7894
3 413-21165 8 355 42 7894
4 413-24354 1 387 35 7649
5 413-34658 2 121 88 2468
6 413-36889 4 105 76 3336
7 413-30089 5 146 10 9112
9 413-41158 5 453 91 4545
10 413-45365 1 453 91 4545
11 413-51015 9 453 91 4545
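As the edit to the question points out, the merge-based approach above depends on row order. Below is an order-independent sketch (an alternative, not the merge approach) that aligns parent rows to child rows by UniqueID, reusing the df and parent_child_dict defined above:
value_cols = ['WEIGHT', 'VOLUME', 'PRODUCTIVITY']

parent_values = df.set_index('UniqueID')[value_cols]   # lookup table keyed by UniqueID
parent_of = df['UniqueID'].map(parent_child_dict)       # parent ID per row, NaN if not a child
is_child = parent_of.notna()

# overwrite the value columns of every child row with its parent's values
df.loc[is_child, value_cols] = parent_values.reindex(parent_of[is_child]).to_numpy()
Because the lookup is keyed by UniqueID, shuffling the rows of df does not change the result.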

dataframe concatenating with indexing

I have a pandas dataframe that is read from a file.
The next step I do is to break the dataset into 2 datasets, df_LastYear and df_ThisYear.
Note that the index is not continuous; it is missing 2 and 6.
ID AdmissionAge
0 14 68
1 22 86
3 78 40
4 124 45
5 128 35
7 148 92
8 183 71
9 185 98
10 219 79
After applying some predictive models, I get the predicted values y_ThisYear:
Prediction
0 2.400000e+01
1 1.400000e+01
2 1.000000e+00
3 2.096032e+09
4 2.000000e+00
5 -7.395179e+11
6 6.159412e+06
7 5.592327e+07
8 5.303477e+08
9 5.500000e+00
10 6.500000e+00
I am trying to concat both datasets, df_ThisYear and y_ThisYear, into one dataset,
but I always get these results:
ID AdmissionAge Prediction
0 14.0 68.0 2.400000e+01
1 22.0 86.0 1.400000e+01
2 NaN NaN 1.000000e+00
3 78.0 40.0 2.096032e+09
4 124.0 45.0 2.000000e+00
5 128.0 35.0 -7.395179e+11
6 NaN NaN 6.159412e+06
7 148.0 92.0 5.592327e+07
8 183.0 71.0 5.303477e+08
9 185.0 98.0 5.500000e+00
10 219.0 79.0 6.500000e+00
There are NaNs which did not exist before.
I found that these NaNs belong to the indices which were not included in df_ThisYear.
Therefore I tried to reset the index so I get continuous indices.
I used
df_ThisYear.reset_index(drop=True)
but I am still getting the same indices.
How do I fix this problem so I can concatenate df_ThisYear with y_ThisYear correctly?
Then you just need join (here df is df_ThisYear and Y is y_ThisYear):
df.join(Y)
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
If you are really excited about using concat, you can pass 'inner' to the join argument:
pd.concat([df_ThisYear, y_ThisYear], axis=1, join='inner')
This returns
Out[6]:
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
This happens because y_ThisYear has a different index than df_ThisYear.
When I joined both using
df_ThisYear.join(y_ThisYear)
it matched each value to its matching index.
That is right if the indices actually represent the same record, i.e. index 7 in df_ThisYear matches index 7 in y_ThisYear too.
In my case I just want to match the first record in y_ThisYear to the first in df_ThisYear, regardless of their index numbers.
I found this code that does that.
df_ThisYear = pd.concat([df_ThisYear.reset_index(drop=True), pd.DataFrame(y_ThisYear)], axis=1)
Thanks to everyone who helped with the answer.
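For completeness: reset_index returns a new DataFrame rather than modifying in place, which is why calling it without assigning the result back appeared to do nothing. The same fix written out step by step (a sketch using the names from the question):
# reset_index returns a new object; assign it back (or pass inplace=True)
df_ThisYear = df_ThisYear.reset_index(drop=True)

# give the predictions a matching 0..n-1 index, then concatenate side by side
y_ThisYear = pd.DataFrame(y_ThisYear).reset_index(drop=True)
df_ThisYear = pd.concat([df_ThisYear, y_ThisYear], axis=1)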

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and have them as column values.
Original data input (Excel):
As read into a dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame:
I have tried stack() and unstack() and also swaplevel, but I couldn't get the dates column to 'drop' down into rows. It looks like the merged cells in Excel produce NaN in the dataframe, and if it's the columns that are merged, I get unnamed columns. How do I work around it? Am I missing something really simple here?
Using stack:
df.stack(level=0).reset_index(level=1)
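The one-liner above assumes the columns already form a two-level MultiIndex, with the dates on the outer level and the product names on the inner level. A sketch of how that might be obtained when reading the Excel file; the file name and index columns are assumptions:
import pandas as pd

# read the two header rows as a column MultiIndex: level 0 = dates, level 1 = products
df = pd.read_excel('report.xlsx', header=[0, 1], index_col=[0, 1])

# move the date level of the columns down into the rows
out = df.stack(level=0).reset_index()
print(out)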

How to select and replace similar occurrences in a column

I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I basically have a column (identified as dtype object) that holds ratings about a certain aspect of a hotel. When I checked what the values of this column were and how frequently they appeared, I noticed that there are some wrong values in it (as you can see below: instead of ratings, some rows have a date as the value).
rating value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I wanted to do was to replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each different date one by one and replacing it with a NaN? Is there a way to select similar values (in this case all the dates that start in the same way, 2018) and replace them all?
Thank you for taking the time to read this!!
There are multiple options to clean this data.
Option 1: The rating column is of object type, so search the strings for the presence of '-' and replace with np.nan:
df.loc[df['rating'].str.contains('-', na=False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates to NaN:
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
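A small self-contained check of option 2, using a few values taken from the question's value_counts output:
import numpy as np
import pandas as pd

df = pd.DataFrame({'rating': ['100', '98', '2018-11-11', '97', np.nan]})
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
print(df['rating'].value_counts(dropna=False))  # the date now shows up as NaN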
