I would like to speed up a loop over a Python Pandas DataFrame. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: the data is private, but I will see if I can fabricate something and add it in an edit if it helps.
The code uses three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieve the prodcode from each row, and then use it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
import numpy as np
import pandas as pd

dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
    # predefine the dataframe where we will store the results, with the same length as the main data dataframe.
    tempDrugUse = pd.DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])
    # go through each row of the main data dataframe.
    for ind in range(len(tempDF)):
        # retrieve the prodcode from the ind-th row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]
        # get the corresponding value from the use1 column matching the prodcode column
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
        # update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr
    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.
When trying to combine data from two dataframes you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows - use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
If I understand you correctly, you want to map the prodcode between the two tables. You can do this via pd.merge (please note that the example in the code below differs from your data):
import pandas as pd

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
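To mirror the structure of the original code (including the check that drugUseDF actually has a use1 column), a minimal sketch, assuming the example frames above, might look like this:
if "use1" in drugUseDF.columns:
    merged = tempDF.merge(drugUseDF[["prodcode", "use1"]], on="prodcode", how="left")
    tempDrugUse = merged[["use1"]]  # same length and row order as tempDF
    # A left merge preserves tempDF's row order, so chronological order is kept;
    # if you ever need to re-sort, e.g. by an eventdate column, use merged.sort_values("eventdate").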
Related
I have a data frame (in a csv file) with two columns, each containing lists (of variable length) in string format. I am providing the link to the Google Drive where I have stored the csv file for reference: https://drive.google.com/file/d/1Hdu04JdGpPqG9_k6Mjx_1XNLBvogXfnN/view?usp=sharing
The dataframe looks like this
Opp1 Opp2
0 ['KingdomofPoland','GrandDuchyofLithuania'] ['Georgia']
1 ['NorthernYuanDynasty'] ['Georgia']
2 ['SpanishEmpire','CaptaincyGeneralofChile'] ['ChechenRepublic']
... ... ...
3409 ['Turkey','SyrianOpposition'] ['CatholicLeague','SpanishEmpire']
3410 ['Egypt','UnitedArabEmirates'] ['SpanishEmpire']
3411 ['Turkey','SyrianOpposition'] ['SpanishEmpire']
3412 ['UnitedStates','UnitedKingdom','SaudiArabia'] ['SpanishEmpire']
3413 ['Turkey'] ['Russia']
3414 rows × 2 columns
The column values are strings, which I figured out when I ran:
Input - df['Opp1'][0][0]
Out - '['
The output is '['. Instead, the output should be the first element of the list in the first row, i.e. 'KingdomofPoland'.
After solving this issue, I want to create a new column by combining elements of lists from each row of Opp1 and Opp2 columns. The elements of each row in Opp1 column are the name of countries and empires that were involved in a war with the corresponding country/empire of the same row in Opp2 column.
So basically a new column with row entries as
new_col
0 ['KingdomofPoland', 'Georgia']
0 ['GrandDuchyofLithuania', 'Georgia']
1 ['NorthernYuanDynasty', 'Georgia']
2 ['SpanishEmpire', 'ChechenRepublic']
2 ['CaptaincyGeneralofChile', 'ChechenRepublic']
... ... ...
3409 ['Turkey', 'CatholicLeague']
3409 ['Turkey', 'SpanishEmpire']
3409 ['SyrianOpposition', 'CatholicLeague']
3409 ['SyrianOpposition', 'SpanishEmpire']
3410 ['Egypt','SpanishEmpire']
3410 ['UnitedArabEmirates','SpanishEmpire']
3411 ['Turkey', 'SpanishEmpire']
3411 ['SyrianOpposition', 'SpanishEmpire']
.................
This will essentially introduce new rows, as we are in effect exploding the Opp1 and Opp2 columns simultaneously, iterating over their row elements.
The end goal is to get an edge list of countries that were involved in a specific war, represented by the original Opp1 (opposition 1) and Opp2 (opposition 2) columns. Each entity (country) from the Opp1 row list should be attached to each entity (country) of the Opp2 row list. The final dataset will be used in Gephi as an edge list.
I am a beginner in data analysis with Python. Till now I have been cleaning my dataset manually, which has consumed umpteen precious hours. Can anyone help me with this?
Note - There are multiple similar entries in each row of the Opp1 and Opp2 columns, as the same countries fought wars many times in different years.
I am attaching the pic for df_types of my dataframe as requested.
IIUC, first do some data clean-up by removing the intra-string single quotes.
Then use the yaml library to convert the string in each pandas dataframe cell into an actual list with applymap. Lastly, apply explode to your dataframe twice, once for each column you want to expand.
import yaml
import pandas as pd
df = pd.read_csv('Downloads/nodes_list.csv', index_col=[0])
df['Opp1'] = df['Opp1'].str.replace("[\'\"]s",'s', regex=True)
df['Opp2'] = df['Opp2'].str.replace("[\'\"]s",'s', regex=True)
df = df.applymap(yaml.safe_load)
df_new = df.explode('Opp1').explode('Opp2').apply(list, axis=1)
df_new
Output:
0 [KingdomofPoland, Georgia]
0 [GrandDuchyofLithuania, Georgia]
1 [NorthernYuanDynasty, Georgia]
2 [SpanishEmpire, ChechenRepublic]
2 [CaptaincyGeneralofChile, ChechenRepublic]
...
3411 [SyrianOpposition, SpanishEmpire]
3412 [UnitedStates, SpanishEmpire]
3412 [UnitedKingdom, SpanishEmpire]
3412 [SaudiArabia, SpanishEmpire]
3413 [Turkey, Russia]
Length: 31170, dtype: object
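Since the end goal is an edge list for Gephi, a possible follow-up (a sketch; the 'Source'/'Target' column names and the output file name are my assumptions, not from the original post) is to turn df_new into a two-column frame and save it:
# Hypothetical column/file names; Gephi's CSV edge import expects Source/Target columns.
edges = pd.DataFrame(df_new.tolist(), columns=['Source', 'Target'], index=df_new.index)
edges.to_csv('edge_list.csv', index=False)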
I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the 2 data frames together, and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine and the length of df is the sum of the two lengths, but then I hit the part where I got stuck.
The concat did its job and concatenated based on the index, so now I have tons of NaN.
And when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values, because the df structure now looks like this:
Identitynumber ID
5555 NaN
So I guess that drop_duplicates is looking for exactly the same values, but as they don't exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv, it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the id as NaN.
In fact I tried to add a new column to csv2, passing the id from csv1 as the value, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then affects everything else.
Can anyone help to understand how I can solve this issue?
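For what it's worth, the trailing comma after each id means every data row has one more field than the header, so pandas uses the numbers as the index and leaves an empty id column full of NaN. A minimal sketch of one way to read the file around that (an assumption about your file, not a confirmed fix):
import pandas as pd

# index_col=False stops pandas from treating the first column as the index,
# which is what happens when every data line ends with an extra delimiter;
# dtype=str keeps leading zeros such as 007559 intact.
unique_users = pd.read_csv('./csv1.csv', index_col=False, dtype=str)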
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018, 7559, 910475, 915104, 600393, 907525, 903079, 1910, 909735, 914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# using isin() method
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicate values across your two dataframes and keep one copy of each record in a single dataframe df (use drop_duplicates(keep=False) if you want to drop the values that appear in both entirely).
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["ID"]]).drop_duplicates().reset_index(drop=True)
I have two dataframes which I would like to merge. Both dataframes were obtained from measurements, and read in from stored CSV files. Below is the minimal code:
import pandas as pd
df1 = pd.read_csv('file_charge.csv') #battery charging file
print(df1.head())
soc volt
0 0.000052 46.489062
1 0.000246 46.600472
2 0.000537 46.714063
3 0.000833 46.823437
4 0.001125 46.919929
print(len(df1))
3052
df2 = pd.read_csv('file_discharge.csv') #battery discharging file
print(df2.head())
volt soc
0 56.064844 0.999797
1 55.608464 0.999458
2 55.236909 0.999117
3 54.908865 0.998753
4 54.639002 0.998398
print(len(df2))
2962
With timeseries data, I have found that resampling and using the datetime index to merge/join/concat on works great. What I want to do is create an overall file that contains the following:
soc | volt_charge | volt_discharge | volt_average
The issue I see at the moment is that the dataframes are of different lengths, which I don't know how to easily address. (How) Is it possible to downsample (or even upsample) a dataframe with a numeric index?
So far, my attempts at combining/merging the dataframes have failed. Using pd.merge results in an empty dataframe, whereas pd.merge_ordered gives a df with only 2 columns (soc and volt), rather than the desired 3 (soc, volt_1, volt_2), from which the additional fourth column (volt_avg = mean(volt_1, volt_2)) could be made.
In graphical terms: it is possible to graph both df1 and df2 on the same x- and y-axes, yet I don't know how the "visual" average of df1 and df2 could be created.
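A sketch of one possible way to do this (not from the original post; it assumes soc is roughly monotonic in each file and uses made-up sample values in place of the real CSV data): interpolate both voltage curves onto a common soc grid with numpy, then average them.
import numpy as np
import pandas as pd

# Made-up stand-ins for the charge and discharge measurements.
df1 = pd.DataFrame({'soc': [0.00, 0.25, 0.50, 0.75, 1.00],
                    'volt': [46.5, 48.0, 50.0, 52.5, 55.0]})   # charging
df2 = pd.DataFrame({'soc': [1.00, 0.60, 0.30, 0.00],
                    'volt': [56.1, 53.0, 49.5, 45.8]})         # discharging

# Interpolate both curves onto a common soc grid (np.interp needs ascending x).
soc_grid = np.linspace(0, 1, 101)
charge = df1.sort_values('soc')
discharge = df2.sort_values('soc')

combined = pd.DataFrame({
    'soc': soc_grid,
    'volt_charge': np.interp(soc_grid, charge['soc'], charge['volt']),
    'volt_discharge': np.interp(soc_grid, discharge['soc'], discharge['volt']),
})
combined['volt_average'] = combined[['volt_charge', 'volt_discharge']].mean(axis=1)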
I am trying to assign values from the column df2['values'] to the column df1['values']. However, values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great, but since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What essential aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assigns them wrongly, so I need to do a clean-up afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
import numpy as np

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per OP's comment, we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on = "date", how= "left") # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches a given date range, you'll have to drop some values afterwards, otherwise duplicate rows will remain.
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1['date'] in the merged dataframe that do not fall within df2['date_range']. Unfortunately I need more information about the content of df1['date'] and df2['date_range'] to write code here that would do exactly that.
I am using GeoPandas and Pandas.
I have a DataFrame, df, of roughly 300,000 rows, with 4 columns plus the index column.
id lat lon geometry
0 2009 40.711174 -73.99682 0
1 536 40.741444 -73.97536 0
2 228 40.754601 -73.97187 0
However, there are only a handful of unique ids (~200).
I want to generate a shapely.geometry.point.Point object for each (lat, lon) combination, similarly to what is shown here: http://nbviewer.ipython.org/gist/kjordahl/7129098
(see cell#5),
where it loops through all rows of the dataframe; but for such a big dataset, I wanted to limit the loop to the much smaller number of unique ids.
Therefore, for a given id value, idvalue (e.g., 2009 from the first row), I want to create the GeoSeries once and assign it directly to ALL rows that have id == idvalue.
My code looks like:
for count, iunique in enumerate(df['id'].unique()):
    sc_start = GeoSeries([Point(np.array(df[df['id'] == iunique].lon)[0],
                                np.array(df[df['id'] == iunique].lat)[0])])
    df.loc[iunique, ['geometry']] = sc_start
However, things don't work - the geometry field does not change - and I think it is because the indexes of sc_start don't match the indexes of df.
How can I solve this? Should I just stick to looping through the whole df?
I would take the following approach:
First find the unique ids and create a GeoSeries of Points for them:
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y) for x, y in zip(unique_ids['lon'], unique_ids['lat'])])
Then merge these geometries with the original dataframe on matching ids:
df.merge(unique_ids[['id', 'geometry']], how='left', on='id')
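One small caveat, assuming df really does carry the placeholder geometry column of zeros shown in the question: you may want to drop it before merging so the result does not end up with geometry_x/geometry_y suffixes, e.g.:
# Drop the placeholder column first, then attach the real Point geometries by id.
df = df.drop(columns='geometry').merge(unique_ids[['id', 'geometry']], how='left', on='id')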