drop_duplicates in pandas for a large data set - python

I am new to pandas, so sorry for the naiveté.
I have two dataframes.
One is out.hdf:
999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074
999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292
999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030
999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340
and another one is out.res (the first column is station name):
061Z 56.72 0.0 P 603879074
061Z 29.92 0.0 P 603879074
0614 46.24 0.0 P 603879292
109C 87.51 0.0 P 603947030
113A 66.93 0.0 P 603947030
113A 26.93 0.0 P 603947030
121A 31.49 0.0 P 603947340
The last column in both dataframes is the ID.
I want to create a new dataframe that groups rows with the same ID from the two dataframes together, like this: first a line from hdf, then beneath it the lines from res with the same ID (without keeping the ID column from res).
The new dataframe:
"999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074"
061Z 56.72 0.0 P
061Z 29.92 0.0 P
"999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292"
0614 46.24 0.0 P
"999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030"
109C 87.51 0.0 P
113A 66.93 0.0 P
113A 26.93 0.0 P
"999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340"
121A 31.49 0.0 P
My code to do this is:
import csv
import pandas as pd
import numpy as np

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input to the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    i = 0
    with open('./out.hdf', 'r') as a_file:
        for line in a_file:
            liney = line.strip()
            writer.writerow(np.array([liney]))
            print(liney)
            j = 0
            with open('./out.res', 'r') as res_file:
                for res_line in res_file:
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
The goal is to keep just the unique stations in the 3rd dataframe. I used these commands on res to keep unique stations before creating the 3rd dataframe:
res.drop_duplicates([0], keep = 'last', inplace = True)
and
res.groupby([0], as_index = False).last()
and it works fine. The problem is that for a large data set with thousands of lines, using these commands causes some lines of the res file to be omitted from the 3rd dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I am going crazy; thanks for your time and help in advance.

I found the problem and hope this is helpful for others in the future.
In the large data set, the duplicated stations repeated many times, but not consecutively, and drop_duplicates() was keeping just one of each.
However, I wanted to drop only consecutive duplicate stations, not all of them. I did this using shift:
unique_stations = res.loc[res[0].shift() != res[0]]
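For example, a minimal sketch with made-up station rows, showing that only consecutive repeats are dropped:
import pandas as pd

# Hypothetical data: column 0 is the station code.
res = pd.DataFrame({0: ['113A', '113A', '121A', '113A'],
                    1: [66.93, 26.93, 31.49, 12.00]})

# Keep a row only if its station differs from the previous row's station.
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations)
# Row 1 (the consecutive duplicate '113A') is dropped; the later '113A' at row 3 is kept.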

Related

Concatenate data in CSV files with overlapping data in columns

I have a couple CSV files that have vaccine data, such as this:
File 1
Entity,Code,Date,people_vaccinated
Wisconsin,,2021-01-12,125895
Wisconsin,,2021-01-13,125895
Wisconsin,,2021-01-14,135841
Wisconsin,,2021-01-15,151387
Wisconsin,,2021-01-19,188144
Wisconsin,,2021-01-20,193461
Wisconsin,,2021-01-21,204746
Wisconsin,,2021-01-22,221067
Wisconsin,,2021-01-23,241512
Wisconsin,,2021-01-24,260664
Wyoming,,2021-01-12,13577
Wyoming,,2021-01-13,14406
Wyoming,,2021-01-14,17310
Wyoming,,2021-01-15,19931
Wyoming,,2021-01-19,24788
Wyoming,,2021-01-20,25841
Wyoming,,2021-01-21,25841
Wyoming,,2021-01-22,29993
Wyoming,,2021-01-23,32746
Wyoming,,2021-01-24,35868
File 2
Entity,Code,Date,people_fully_vaccinated
Wisconsin,,2021-01-12,11343
Wisconsin,,2021-01-13,11343
Wisconsin,,2021-01-15,17108
Wisconsin,,2021-01-19,23641
Wisconsin,,2021-01-20,27312
Wisconsin,,2021-01-21,32268
Wisconsin,,2021-01-22,37901
Wisconsin,,2021-01-23,42229
Wisconsin,,2021-01-24,45641
Wyoming,,2021-01-12,2116
Wyoming,,2021-01-13,2559
Wyoming,,2021-01-15,2803
Wyoming,,2021-01-19,3242
Wyoming,,2021-01-20,3441
Wyoming,,2021-01-21,3441
Wyoming,,2021-01-22,4515
Wyoming,,2021-01-23,4773
Wyoming,,2021-01-24,4895
Not all the data (specifically the dates that go with each location) overlaps, but for the rows that do, how would I combine the last column? I'm guessing pandas would be best, but I don't want to end up messing with a bunch of nested loops.
If you are trying to merge file1 with file2 only for the records in file1, then the solution is:
import pandas as pd
## suppose file1_df and file2_df are related Dataframe object for file1 and file2 respectively.
merged_df = pd.merge(file1_df, file2_df, how='left', on=['Entity', 'Code', 'Date'])
Note: if you are familiar with set operations, you can do a right outer join, left join, inner join, or full outer join by changing the how parameter in the above function call.
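If instead you want to keep both value columns side by side for every date that appears in either file, here is a minimal sketch using an outer merge (assuming the two files are read into file1_df and file2_df; the blank Code column is dropped from the right side so it is not used as a key):
import pandas as pd

file1_df = pd.read_csv('file1.csv')   # Entity, Code, Date, people_vaccinated
file2_df = pd.read_csv('file2.csv')   # Entity, Code, Date, people_fully_vaccinated

# An outer join keeps dates that exist in only one of the files;
# the value column from the other file is NaN for those rows.
merged_df = pd.merge(file1_df, file2_df.drop(columns=['Code']),
                     how='outer', on=['Entity', 'Date'])
print(merged_df.head())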
import pandas as pd
data1 = pd.read_csv('file1.csv') # path of file1
data2 = pd.read_csv('file2.csv') # path of file2
data1['Code'] = data1['Code'].fillna(0) # replace NaN with 0
data2['Code'] = data2['Code'].fillna(0) # replace NaN with 0
data2 = data2.rename(columns={'people_fully_vaccinated': 'people_vaccinated'}) # align the value column names so both counts end up in one column
combined_data = pd.concat([data1, data2], ignore_index=True) # both files now have the same columns, so stack one on top of the other
result = combined_data.groupby(['Entity','Code','Date'], as_index=False)['people_vaccinated'].sum() # group by location, code and date, and add up the vaccination counts
print(result)
      Entity  Code        Date  people_vaccinated
0 Wisconsin 0.0 12-01-2021 137238
1 Wisconsin 0.0 13-01-2021 137238
2 Wisconsin 0.0 14-01-2021 135841
3 Wisconsin 0.0 15-01-2021 168495
4 Wisconsin 0.0 19-01-2021 211785
5 Wisconsin 0.0 20-01-2021 220773
6 Wisconsin 0.0 21-01-2021 237014
7 Wisconsin 0.0 22-01-2021 258968
8 Wisconsin 0.0 23-01-2021 283741
9 Wisconsin 0.0 24-01-2021 306305
10 Wyoming 0.0 12-01-2021 15693
11 Wyoming 0.0 13-01-2021 16965
12 Wyoming 0.0 14-01-2021 17310
13 Wyoming 0.0 15-01-2021 22734
14 Wyoming 0.0 19-01-2021 28030
15 Wyoming 0.0 20-01-2021 29282
16 Wyoming 0.0 21-01-2021 29282
17 Wyoming 0.0 22-01-2021 34508
18 Wyoming 0.0 23-01-2021 37519
19 Wyoming 0.0 24-01-2021 40763

k-means returns nan values?

I recently came across a k-means tutorial that looks a bit different than what I remember the algorithm to be, but it should still do the same thing; after all, it's k-means. So I went and gave it a try with some data. Here's how the code looks:
# Assignment Stage:
def assignment(data, centroids):
    for i in centroids.keys():
        # sqrt((x1-x2)^2 + (y1-y2)^2 + etc)
        data['distance_from_{}'.format(i)] = (
            np.sqrt((data['soloRatio'] - centroids[i][0])**2
                    + (data['secStatus'] - centroids[i][1])**2
                    + (data['shipsDestroyed'] - centroids[i][2])**2
                    + (data['combatShipsLost'] - centroids[i][3])**2
                    + (data['miningShipsLost'] - centroids[i][4])**2
                    + (data['exploShipsLost'] - centroids[i][5])**2
                    + (data['otherShipsLost'] - centroids[i][6])**2
                    ))
        print(data['distance_from_{}'.format(i)])
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    data['closest'] = data.loc[:, centroid_distance_cols].idxmin(axis=1)
    data['closest'] = data['closest'].astype(str).str.replace('\D+', '')
    return data

data = assignment(data, centroids)
and:
# Update stage:
import copy

old_centroids = copy.deepcopy(centroids)

def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(data[data['closest'] == i]['soloRatio'])
        centroids[i][1] = np.mean(data[data['closest'] == i]['secStatus'])
        centroids[i][2] = np.mean(data[data['closest'] == i]['shipsDestroyed'])
        centroids[i][3] = np.mean(data[data['closest'] == i]['combatShipsLost'])
        centroids[i][4] = np.mean(data[data['closest'] == i]['miningShipsLost'])
        centroids[i][5] = np.mean(data[data['closest'] == i]['exploShipsLost'])
        centroids[i][6] = np.mean(data[data['closest'] == i]['otherShipsLost'])
    return k

# TODO: add graphical representation?
while True:
    closest_centroids = data['closest'].copy(deep=True)
    centroids = update(centroids)
    data = assignment(data, centroids)
    if closest_centroids.equals(data['closest']):
        break
When I run the initial assignment stage, it returns the distances; however, when I run the update stage, all distance values become NaN, and I just don't know why or at which point exactly this happens... Maybe I made a mistake I can't spot?
Here's an excerpt of the data I'm working with:
Unnamed: 0 characterID combatShipsLost exploShipsLost miningShipsLost \
0 0 90000654.0 8.0 4.0 5.0
1 1 90001581.0 97.0 5.0 1.0
2 2 90001595.0 61.0 0.0 0.0
3 3 90002023.0 22.0 1.0 0.0
4 4 90002030.0 74.0 0.0 1.0
otherShipsLost secStatus shipsDestroyed soloRatio
0 0.0 5.003100 1.0 10.0
1 0.0 2.817807 6251.0 6.0
2 0.0 -2.015310 752.0 0.0
3 4.0 5.002769 43.0 5.0
4 1.0 3.090204 301.0 7.0
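A hedged guess at the cause (not part of the original post): after .str.replace('\D+', ''), the closest column holds strings such as '0' and '1', while centroids.keys() are presumably integers, so data['closest'] == i never matches, np.mean of the resulting empty selection is NaN, and the NaN centroids then make every distance NaN on the next assignment pass. A minimal sketch of a cast inside assignment() that would make the comparison match, assuming integer centroid keys:
# Hypothetical fix: keep 'closest' as integers so that
# data['closest'] == i (with integer centroid keys) can actually match.
data['closest'] = (data['closest'].astype(str)
                                  .str.replace(r'\D+', '', regex=True)
                                  .astype(int))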

Having difficulty getting multiple columns in HDF5 Table Data

I am new to hdf5 and was trying to store a DataFrame row in the hdf5 format. I want to append a row at different locations within the file; however, every time I append, it shows up as an array in a single column rather than as single values in multiple columns.
I have tried both h5py and pandas and it seems like pandas is the better option for appending. Additionally, I have really been trying a lot of different methods. Truly, any help would be greatly appreciated.
Here is my attempt at sending an array into the hdf5 file multiple times:
import pandas as pd
import numpy as np
data = np.zeros((1, 48), dtype=float)
columnName = ['Hello' + str(y) for (x, y), item in np.ndenumerate(data)]
df = pd.DataFrame(data=data, columns=columnName)
file = pd.HDFStore('file.hdf5', mode='a', complevel=9, comlib='blosc')
for x in range(0, 11):
    file.put('/data', df, column_data=columnName, append=True, format='table')
In [243]: store = pd.HDFStore('test.h5')
This seems to work fine:
In [247]: store.put('foo',df,append=True,format='table')
In [248]: store.put('foo',df,append=True,format='table')
In [249]: store.put('foo',df,append=True,format='table')
In [250]: store['foo']
Out[250]:
Hello0 Hello1 Hello2 Hello3 Hello4 ... Hello43 Hello44 Hello45 Hello46 Hello47
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
[3 rows x 48 columns]
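For completeness (not from the original answer), a minimal sketch of the same pattern using store.append and a context manager; note that the 'table' format is what allows repeated appends to the same key, while the default 'fixed' format does not:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1, 48)), columns=['Hello' + str(i) for i in range(48)])

with pd.HDFStore('file.hdf5', mode='a', complevel=9, complib='blosc') as store:
    for _ in range(11):
        store.append('data', df)   # append() always writes in 'table' format

print(pd.read_hdf('file.hdf5', 'data').shape)   # (11, 48)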

How to perform functions that reference previous row on subset of data in a dataframe using groupby

I have some log data that represents an item (id) and a timestamp of when an action was started, and I want to determine the time between actions on each item.
For example, I have some data that looks like this:
data = [{"timestamp":"2019-05-21T14:17:29.265Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T14:21:49.722Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.695Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.696Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-22T07:51:17.49Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T08:11:13.948Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:52:59.897Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.406Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.481Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-21T14:23:08.147Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:29:18.228Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.831Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.834Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:02:19.072Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:02:34.867Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:12:28.877Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.567Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.582Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T09:58:02.185Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"},{"timestamp":"2019-05-21T10:07:24.044Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"}]
stack = pd.DataFrame(data)
stack.head()
I have tried getting all the unique ids to split the data frame, getting the time taken along with the index, and then recombining with the original set as below, but the function is extremely slow on large data sets and messes up both the index and the timestamp order, so the results end up mismatched.
import ciso8601 as time

records = []
for i in list(stack.id.unique()):
    dff = stack[stack.id == i]
    time_taken = []
    times = []
    i = 0
    for _, row in dff.iterrows():
        if bool(times):
            print(_)
            current_time = time.parse_datetime(row.timestamp)
            prev_time = times[i]
            time_taken = current_time - prev_time
            times.append(current_time)
            i += 1
            records.append(dict(index=_, time_taken=time_taken.seconds))
        else:
            records.append(dict(index=_, time_taken=0))
            times.append(time.parse_datetime(row.timestamp))

x = pd.DataFrame(records).set_index('index')
stack.merge(x, left_index=True, right_index=True, how='inner')
Is there a neat pandas groupby and apply way of doing this so that I don't have to split the frame and store it in memory so that can reference the previous row in the subset?
Thanks
You can use GroupBy.diff:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = (stack.sort_values(['id', 'timestamp'])
                            .groupby('id')
                            .diff()['timestamp']
                            .dt.total_seconds()
                            .round()
                            .fillna(0))
print(stack['time_taken'])
0 0.0
1 260.0
2 3276.0
3 0.0
4 0.0
5 1196.0
6 13306.0
7 4.0
8 0.0
9 0.0
10 370.0
11 2872.0
...
If you want the resulting dataframe to be ordered by date, instead do:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id','timestamp'])
stack['time_taken'] = (stack.groupby('id')
                            .diff()['timestamp']
                            .dt.total_seconds()
                            .round()
                            .fillna(0))
If you don't need to replace timestamp with datetimes, create a Series of datetimes with to_datetime and pass it to GroupBy.diff, then convert to seconds with Series.dt.total_seconds, round with Series.round if necessary, and replace missing values with 0:
t = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = t.groupby(stack['id']).diff().dt.total_seconds().round().fillna(0)
print (stack)
id timestamp time_taken
0 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:17:29.265Z 0.0
1 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:21:49.722Z 260.0
2 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.695Z 3276.0
3 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.696Z 0.0
4 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T07:51:17.49Z 0.0
5 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T08:11:13.948Z 1196.0
6 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:52:59.897Z 13306.0
7 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.406Z 4.0
8 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.481Z 0.0
9 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:23:08.147Z 0.0
10 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:29:18.228Z 370.0
11 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.831Z 2872.0
12 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.834Z 0.0
13 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:19.072Z 0.0
14 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:34.867Z 16.0
15 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:12:28.877Z 594.0
16 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.567Z 4011.0
17 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.582Z 0.0
18 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T09:58:02.185Z 0.0
19 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T10:07:24.044Z 562.0
Or, if you do need to replace timestamp with datetimes, use @yatu's answer.
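A related alternative, not from either answer above: since the question is about referencing the previous row within each group, groupby(...).shift() returns that previous timestamp directly, and subtracting it gives the same result as diff:
t = pd.to_datetime(stack['timestamp'])
prev = t.groupby(stack['id']).shift()   # previous timestamp within each id (NaT for the first row)
stack['time_taken'] = (t - prev).dt.total_seconds().round().fillna(0)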

turning a list from a loop iteration into a df in python

I have some files which I am looping over, doing some calculations on each. I would like to get a new df with the names of the files on the row side and the calculated value for each file in the corresponding row.
The code is:
results = []
file_name = '{}'
for file in folder:
    df = pd.read_csv(file_name.format(file))
    print("reading file ", file)
    results.append(df['old_calc'])  # this is the data I want to save to the new df, and I need its .sum()
The above code doesn't work as expected, as it is giving me:
old calc old calc old calc old calc old calc old calc old calc
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 59.0 0.0 0.0
6 0.0 0.0 58.4 0.0 0.0 0.0
7 0.0 0.0 8.4 -79.1 0.0 0.0
8 0.0 0.0 120.9 0.0 0.0 0.0
The expected result will be a new df named results:
file1 0
file2 0
file3 187.7
file4 20.1
file5 0
Thanks for the help.
This is one way you can extract the data you need:
dfs = {file: pd.read_csv(file) for file in folder}
result_dict = {k: v['old_calc'].sum() for k, v in dfs.items()}
result_df = pd.DataFrame.from_dict(result_dict, orient='index')
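If you want the output to look exactly like the expected result (file names as the index next to a single value each), a small variation of the same idea using a Series; the column name 'old_calc_sum' is made up here:
import pandas as pd

sums = {file: pd.read_csv(file)['old_calc'].sum() for file in folder}
results = pd.Series(sums, name='old_calc_sum')   # index = file names, values = per-file sums
print(results)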
