Common values between multiple dataframes with different lengths - python

I have 3 huge dataframes that have different lengths of values.
Example:
A     B     C
2981  2952  1287
2759  2295  2952
1284  2235  1284
1295  1928  0887
2295  1284  1966
      1567  2295
      1287  2374
            2846
            2578
I want to find the common values between the three columns, like this:
A     B     C     Common
2981  2952  1287  1284
2759  2295  2952  2295
1284  2235  1284
1295  1928  0887
2295  1284  1966
      1567  2295
      1287  2374
            2846
            2578
I tried (from here):
df1['Common'] = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
but I get this error:
ValueError: Length of values does not match length of index

The idea is to create a Series whose index is the frame's index truncated to the length of the intersection array:
a = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
If the columns are all in the same DataFrame:
a = np.intersect1d(df1.A, np.intersect1d(df1.B, df1.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
print(df1)
A B C Common
0 2981.0 2952.0 1287 1284.0
1 2759.0 2295.0 2952 2295.0
2 1284.0 2235.0 1284 NaN
3 1295.0 1928.0 887 NaN
4 2295.0 1284.0 1966 NaN
5 NaN 1567.0 2295 NaN
6 NaN 1287.0 2374 NaN
7 NaN NaN 2846 NaN
8 NaN NaN 2578 NaN
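For reference, a minimal runnable sketch of the single-frame case, with the example data rebuilt from unequal-length Series (pandas pads the shorter columns with NaN):
import numpy as np
import pandas as pd

# Unequal-length columns: pandas aligns the Series and pads with NaN.
df1 = pd.DataFrame({
    'A': pd.Series([2981, 2759, 1284, 1295, 2295]),
    'B': pd.Series([2952, 2295, 2235, 1928, 1284, 1567, 1287]),
    'C': pd.Series([1287, 2952, 1284, 887, 1966, 2295, 2374, 2846, 2578]),
})

# Intersect the three columns (NaN never equals NaN, so the padding
# drops out), then park the result in the first len(a) rows.
a = np.intersect1d(df1.A, np.intersect1d(df1.B, df1.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
print(df1)  # Common holds 1284.0, 2295.0, then NaN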


How to calculate sumproduct in pandas by column?

I have a dataframe:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
1 2847 2861 2875 2890 2904 94717
2 1338 1343 1348 1353 1358 70105
3 3301 3311 3321 3331 3341 60307
4 1425 1422 1419 1416 1413 79888
I want to add a new row to the table with the sumproduct formula (Excel's =SUMPRODUCT(array val, array 2000-xx)). The first value in the new row is computed as 2847×94717 + 1338×70105 + 3301×60307 + 1425×79888 = 676373596 (in Excel terms, B2*G2 + B3*G3 + B4*G4 + B5*G5).
Output:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
1 2847 2861 2875 2890 2904 94717
2 1338 1343 1348 1353 1358 70105
3 3301 3311 3321 3331 3341 60307
4 1425 1422 1419 1416 1413 79888
5 676373596 678413565 680453534 682588220 684628189
How do I go about this?
You can do this, assuming ID is not in the index:
df.loc[5, :] = df.iloc[:,1:-1].mul(df['val'], axis=0).sum()
Output:
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
0 1.0 2847.0 2861.0 2875.0 2890.0 2904.0 94717.0
1 2.0 1338.0 1343.0 1348.0 1353.0 1358.0 70105.0
2 3.0 3301.0 3311.0 3321.0 3331.0 3341.0 60307.0
3 4.0 1425.0 1422.0 1419.0 1416.0 1413.0 79888.0
5 NaN 676373596.0 678413565.0 680453534.0 682588220.0 684628189.0 NaN
Use pandas.DataFrame.mul with axis=0, then sum, and let pandas' intrinsic data alignment put each value in the correct column based on the index.
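A reproducible version of the above, rebuilding the question's frame by hand (column values copied from the example):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    '2000-01': [2847, 1338, 3301, 1425],
    '2000-02': [2861, 1343, 3311, 1422],
    '2000-03': [2875, 1348, 3321, 1419],
    '2000-04': [2890, 1353, 3331, 1416],
    '2000-05': [2904, 1358, 3341, 1413],
    'val': [94717, 70105, 60307, 79888],
})

# Multiply each month column by val row-wise, sum each column, and let
# index alignment place the sums: ID and val get NaN in the new row.
df.loc[5, :] = df.iloc[:, 1:-1].mul(df['val'], axis=0).sum()
print(df)  # last row: 676373596.0, 678413565.0, ...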
You could compute the dot product with the @ operator and merge it back into the original dataframe:
df.merge(pd.DataFrame(df.iloc[:,1:-1].T @ df['val']).T, how='outer')
ID 2000-01 2000-02 2000-03 2000-04 2000-05 val
0 1.0 2847 2861 2875 2890 2904 94717.0
1 2.0 1338 1343 1348 1353 1358 70105.0
2 3.0 3301 3311 3321 3331 3341 60307.0
3 4.0 1425 1422 1419 1416 1413 79888.0
4 NaN 676373596 678413565 680453534 682588220 684628189 NaN
Other options for the same result:
import numpy as np

columns_to_multiply = df.columns.drop(['ID', 'val'])
df1 = df.copy()
for x in columns_to_multiply:
    df1[x] *= df1['val']

# New row: ID, then the column sums, then a placeholder for val.
prod_sum_list = [len(df) + 1] + df1[columns_to_multiply].sum().tolist() + [np.nan]
df.loc[len(df.index)] = prod_sum_list
df
You can do:
# Column-wise sumproducts, skipping ID and val.
row = [sum(df[col] * df['val']) for col in df.columns.drop(['ID', 'val'])]
row.insert(0, len(df) + 1)        # new ID
row.insert(len(row), 0)           # placeholder for val
df.loc[len(df)] = row
df.loc[len(df) - 1, 'val'] = ''   # blank out val in the new row

Replace missing NaN values based on the values of another column (conditions)

Hi, I would like to fill in the NaN values in area based on the value of source.
I have tried np.select, but this method also overwrites the other, correct values.
landline_area1['area'] = np.select(area_conditions, values)
Table overview:
   source  codes  area
4  1304    1304   Dover
5  1768    1768   Penrith
6  2077    NaN    NaN
7  1225    1225   Bath
8  1142    NaN    NaN
Conditions:
area_conditions = [
(landline_area1['source'].str.startswith('20')),
(landline_area1['source'].str.startswith('23')),
(landline_area1['source'].str.startswith('24'))]
Values:
values = [
'London',
'Southampton / Portsmouth',
'Coventry']
Expected result:
   source  codes  area
4  1304    1304   Dover
5  1768    1768   Penrith
6  2077    NaN    London
7  1225    1225   Bath
8  1142    NaN    Sheffield
Let us try np.select, adding astype(str) so str.startswith works even when source is numeric:
# build the conditions as e.g. landline_area1['source'].astype(str).str.startswith('20')
s = np.select(area_conditions, values)
landline_area1['area'].fillna(pd.Series(s, index=landline_area1.index), inplace=True)
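A self-contained sketch of that flow, on hypothetical sample rows; passing default=None to np.select is a deliberate tweak so rows matching no condition stay NaN instead of being filled with 0:
import numpy as np
import pandas as pd

# Hypothetical sample rows mirroring the question ('source' as strings).
landline_area1 = pd.DataFrame({
    'source': ['1304', '1768', '2077', '1225', '2380'],
    'area':   ['Dover', 'Penrith', np.nan, 'Bath', np.nan],
})

area_conditions = [
    landline_area1['source'].astype(str).str.startswith('20'),
    landline_area1['source'].astype(str).str.startswith('23'),
    landline_area1['source'].astype(str).str.startswith('24'),
]
values = ['London', 'Southampton / Portsmouth', 'Coventry']

# np.select builds an array aligned with the rows; wrapping it in a
# Series keyed by the same index lets fillna touch only the NaN rows.
s = np.select(area_conditions, values, default=None)
landline_area1['area'] = landline_area1['area'].fillna(
    pd.Series(s, index=landline_area1.index))
print(landline_area1)  # 2077 -> London, 2380 -> Southampton / Portsmouth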

df.to_sql not working [HY004] - Invalid SQL data type (0) (SQLBindParameter)

I am fairly new to Python, at least to these kinds of problems, and have been struggling with this for days!
I have used df.to_sql for quite a while and on a lot of different dataframes, but have now run into the following error:
DBAPIError: (pyodbc.Error) ('HY004', '[HY004] [Microsoft][ODBC SQL Server Driver]Invalid SQL data type (0) (SQLBindParameter)')
My code looks as follows:
import urllib.parse
import pyodbc
from sqlalchemy import create_engine

pyodbc.pooling = False
params = urllib.parse.quote_plus("Driver={SQL Server};Server=[ServerName];Database=DatabaseName")
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

df.to_sql('Table_Name',
          con=engine,
          if_exists='replace')
I have used this code on a lot of different dataframes and have no clue what has changed for this particular data. It comes from a JSON file, which I have flattened.
Any ideas as to what might cause this?
I am not using Linux or anything.
My Dataframe looks as follows:
id  playerId  teamId  matchId  matchPeriod  eventSec  eventId
0 97592224 3361 1609 2086192 1H 0.491250 8
1 97592225 49876 1609 2086192 1H 3.649639 8
2 97592168 14853 1631 2086192 1H 5.356103 8
3 97592169 8730 1631 2086192 1H 8.653246 8
4 97592170 14763 1631 2086192 1H 10.219569 8
... ... ... ... ... ... ... ...
1667 586886958 14870 1612 2086209 2H 2915.347568 2
1668 586886947 8292 1624 2086209 2H 2924.483590 3
1669 586886948 269676 1624 2086209 2H 2926.881486 1
1670 586886959 25393 1612 2086209 2H 2927.766165 1
1671 586886949 13484 1624 2086209 2H 2929.858276 8
eventName subEventId subEventName positions.0.x positions.0.y
0 Pass 85 Simple pass 51 51
1 Pass 83 High pass 35 48
2 Pass 85 Simple pass 39 12
3 Pass 83 High pass 34 23
4 Pass 82 Head pass 65 24
... ... ... ... ... ...
1667 Foul 24 Protest 37 90
1668 Free Kick 36 Throw in 64 0
1669 Duel 10 Air duel 77 23
1670 Duel 10 Air duel 23 77
1671 Pass 83 High pass 74 27
positions.1.x positions.1.y tags.0.id tags.1.id tags.2.id tags
0 35.0 48.0 1801.0 NaN NaN NaN
1 61.0 88.0 1802.0 NaN NaN NaN
2 34.0 23.0 1801.0 NaN NaN NaN
3 65.0 24.0 1801.0 NaN NaN NaN
4 85.0 22.0 1801.0 NaN NaN NaN
... ... ... ... ... ... ...
1667 NaN NaN 1702.0 NaN NaN NaN
1668 77.0 23.0 1801.0 NaN NaN NaN
1669 74.0 27.0 701.0 1802.0 NaN NaN
1670 26.0 73.0 703.0 1801.0 NaN NaN
1671 NaN NaN 1802.0 NaN NaN NaN
Edit:
The DataFrame can be downloaded using this link: https://drive.google.com/file/d/1r_zpj-24Unb9XXyYlVNwBkMU_YaKC2ll/view?usp=sharing
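One way to narrow this down, as a diagnostic sketch rather than a confirmed fix: HY004 usually points at a column whose dtype pyodbc cannot map to a SQL type, so listing the object columns and the Python types they actually hold can expose the culprit:
# Print, for each object-dtype column, the Python types it contains.
for col in df.columns[df.dtypes == object]:
    print(col, df[col].map(type).value_counts().to_dict())

# If a column still holds lists/dicts from the flattened JSON (an
# assumption about this data, not a known property of the frame),
# stringify it before calling to_sql:
# df['tags'] = df['tags'].astype(str)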

compare column values only with identical datetime index

I have a long df from 07:00:00 to 20:00:00 (df1) and a short df containing only fractions of the long one (df2), with identical datetime index values.
I would like to compare the groupsize (gs) values of the two data frames.
The datetime index, id, x, and y values should be identical.
How can I do this?
df1:
Out[180]:
date id gs x y
2019-10-09 07:38:22.139 3166 nan 248 233
2019-10-09 07:38:25.259 3166 nan 252 235
2019-10-09 07:38:27.419 3166 nan 253 231
2019-10-09 07:38:30.299 3166 nan 251 232
2019-10-09 07:38:32.379 3166 nan 251 233
2019-10-09 07:38:37.179 3166 nan 228 245
2019-10-09 07:39:49.498 3167 nan 289 253
2019-10-09 07:40:19.099 3168 nan 288 217
2019-10-09 07:40:38.779 3169 nan 278 139
2019-10-09 07:40:39.899 3169 nan 279 183
...
2019-10-09 19:52:53.959 5725 nan 190 180
2019-10-09 19:52:56.439 5725 nan 193 185
2019-10-09 19:52:58.919 5725 nan 204 220
2019-10-09 19:53:06.440 5804 nan 190 198
2019-10-09 19:53:08.919 5804 nan 200 170
2019-10-09 19:53:11.419 5804 nan 265 209
2019-10-09 19:53:16.460 5789 nan 292 218
2019-10-09 19:53:36.460 5806 nan 284 190
2019-10-09 19:54:08.939 5807 nan 404 226
2019-10-09 19:54:23.979 5808 nan 395 131
df2:
Out[181]:
date id gs x y
2019-10-09 11:20:01.418 3479 2.0 353 118.0
2019-10-09 11:20:01.418 3477 2.0 315 92.0
2019-10-09 11:20:01.418 3473 2.0 351 176.0
2019-10-09 11:20:01.418 3476 2.0 318 176.0
2019-10-09 11:20:01.418 3386 0.0 148 255.0
2019-10-09 11:20:01.418 3390 0.0 146 118.0
2019-10-09 11:20:01.418 3447 0.0 469 167.0
2019-10-09 11:20:03.898 3447 0.0 466 169.0
2019-10-09 11:20:03.898 3390 0.0 139 119.0
2019-10-09 11:20:03.898 3477 2.0 316 93.0
Expected output should be a dataframe with columns "date", "id", "x", "y", "gs(df1)", "gs(df2)"
Do a merge where everything is equal, but make sure to reset the index so the date becomes part of the merge condition:
df1_t = df1.reset_index()
df2_t = df2.reset_index()  # df2 here, not df1
results = df1_t.merge(df2_t,
                      on=['date', 'id', 'x', 'y'],
                      suffixes=('_df1', '_df2'),
                      indicator=True)
print(results)
results will contain the rows of df1 that are also in df2; the shared gs column comes out as gs_df1 and gs_df2 thanks to the suffixes.
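A miniature demo of that merge on hypothetical two-row frames, showing how the suffixes produce the gs columns from both sides:
import pandas as pd

# Hypothetical two-row frames sharing a datetime index.
idx = pd.to_datetime(['2019-10-09 11:20:01.418', '2019-10-09 11:20:03.898'])
df1 = pd.DataFrame({'id': [3479, 3447], 'gs': [float('nan')] * 2,
                    'x': [353, 466], 'y': [118, 169]}, index=idx)
df2 = pd.DataFrame({'id': [3479, 3447], 'gs': [2.0, 0.0],
                    'x': [353, 466], 'y': [118, 169]}, index=idx)
df1.index.name = df2.index.name = 'date'

out = df1.reset_index().merge(df2.reset_index(),
                              on=['date', 'id', 'x', 'y'],
                              suffixes=('_df1', '_df2'))
print(out)  # the shared gs column appears as gs_df1 and gs_df2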

Dataframe Merge in Pandas

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
When trying to join the above, on the rsparid column, to this DataFrame...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks:
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the "datalines" column can have thousands more rows than the rspars, thus the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaNs mean the rsparid columns have no values in common. This can be tricky when merging frames whose values look the same in the repr but have different dtypes.
The repr of a small DataFrame with strings (of integers) looks identical to one with actual integers, and no dtype information is printed when frames are small. You can get this information (and more) by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':list(map(str,[1400,634,1508,96,161,1011,1007,518,1955,678]))})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of the padding of numeric DataFrames. The point is, no one would be able to spot the dtype difference interactively unless they were specifically looking for it.
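A minimal sketch of the implied fix, on hypothetical miniature frames (the f1mult values beyond the first row are made up): give both rsparid columns the same dtype before merging.
import pandas as pd

# Strings on one side, integers on the other: the merge matches nothing.
datalines = pd.DataFrame({'rsparid': ['1400', '634', '1508']})
rspars = pd.DataFrame({'rsparid': [1400, 634, 1508],
                       'f1mult': [0.216, 0.318, 0.348]})

# Convert one side so both key columns share a dtype, then merge.
datalines['rsparid'] = datalines['rsparid'].astype('int64')
print(datalines.merge(rspars, how='left', on='rsparid'))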
