I am trying to open a CSV file, read its fields, fix some other fields based on those values, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of
ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR
I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.
Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it is crucial that I find a good, fast solution.
My code goes as follows:
df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed
d = 31
m = 12
y = 100
s = (y,m,d)
list_dates = np.zeros(s)  # 3-dimensional array
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')
I tried printing a message for every thousand rows processed. At first, my script needs 3 seconds per 1000 rows, but the longer it runs, the slower it gets.
At row 43000 it needs 29 seconds, and so on...
Thanks for all future help :)
EDIT:
I am adding additional information about my CSV and the expected output
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011
I have to replace the __- placeholder in the SPECIAL_ID field with a proper number.
For example, for the row with
ID = 2 the SPECIAL_ID will be
13012016505001 (__- got replaced by 001). If another row in the CSV shares the same DAY, MONTH, YEAR, its __- will be replaced by 002, and so on...
So the expected output for the above rows would be
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
EDIT:
I changed my code to something like this: I fill a list of dicts with data, then convert that list to a DataFrame and save it as CSV. This takes around 30 minutes to complete.
list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for _, date_row in df_dates.iterrows():
    # select all rows that share this day/month/year
    df_candidates = df.loc[(df['dan_roj'] == date_row['dan_roj']) & (df['mesec_roj'] == date_row['mesec_roj']) & (df['leto_roj'] == date_row['leto_roj'])]
    for index, row in df_candidates.iterrows():
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(index).zfill(2)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)
pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
I think this gives what you're looking for and avoids looping. Potentially it could be more efficient (I wasn't able to find a way to avoid creating counts). However, it should be much faster than your current approach.
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
I added a fake record at the end to confirm it does increment properly:
SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003
If you want to get rid of counts, you just need:
df.drop('counts', inplace=True, axis=1)
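As a self-contained illustration of the cumcount approach above, here is a minimal sketch on three of the sample rows. The small DataFrame and the name suffix are made up for this example; it assumes SPECIAL_ID always ends in a three-character placeholder.

```python
import pandas as pd

# Three of the sample rows from the question (rows 7 and 8 share a date).
df = pd.DataFrame({
    "ID": [6, 7, 8],
    "SPECIAL_ID": ["21082011505__-", "16082011505__-", "21102011505__-"],
    "day": [21, 16, 16],
    "month": [8, 8, 8],
    "year": [2011, 2011, 2011],
})

# 1-based running count of rows sharing the same date, zero-padded to 3 digits.
suffix = (df.groupby(["year", "month", "day"]).cumcount() + 1).astype(str).str.zfill(3)

# Drop the 3-character placeholder and append the counter.
df["SPECIAL_ID"] = df["SPECIAL_ID"].str[:-3] + suffix
print(df["SPECIAL_ID"].tolist())
```

Because the counter lives in a temporary Series rather than a column, there is nothing to drop afterwards.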
I am trying to randomly select records from a 17-million-row dataframe using np.random.choice, as it runs faster than other methods, but I am getting incorrect values in the output for each record. Example below:
data = {
    "calories":[420,380,390,500,200,100],
    "Duration":[50,40,45,600,450,210],
    "Id":[1,1,2,3,2,3],
    "Yr":[2003,2003,2009,2003,2012,2003],
    "Mth":[3,6,9,12,3,6],
}
df = pd.DataFrame(data)
df2=df.groupby(['Id','Yr'],as_index=False).agg(np.random.choice)
Output:
Id yr calories duration mth
1 2003 420 50 6
2 2009 390 45 9
2 2012 200 450 3
3 2003 500 210 6
The problem in the output is for Id 3: for calories 500, duration and mth should be 600 and 12 instead of 210 and 6. Can anyone please explain why it is choosing values from a different row?
Expected output:
The values from the same row should be kept together after random selection.
This doesn't work because Pandas applies aggregates to each column independently. Try putting a print statement in, e.g.:
def fn(x):
    print(x)
    return np.random.choice(x)
df.groupby(['id','yr'],as_index=False).agg(fn)
would let you see when the function was called and what it was called with.
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"calories":[420,380,390,500,200,100],
"duration":[50,40,45,600,450,210],
"id":[1,1,2,3,2,3],
"yr":[2003,2003,2009,2003,2012,2003],
"mth":[3,6,9,12,3,6],
})
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
calories duration id yr mth
0 1 380 40 1 2003 6
1 2 390 45 2 2009 9
2 4 200 450 2 2012 3
3 5 100 210 3 2003 6
The two numbers at the beginning of each row are there because you end up with a multi-index. If you want to know where the rows were selected from, this contains useful information; otherwise you can discard the index.
Note that there are warnings in the docs that this might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!
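A small runnable sketch of the GroupBy.sample approach, reusing the sample data from above. The random_state argument is only there to make the example repeatable, and the name picked is made up for illustration; GroupBy.sample needs a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})

# Pick one whole row per (id, yr) group; the original index is kept.
picked = df.groupby(["id", "yr"]).sample(n=1, random_state=0)

# Each picked row is identical to the original row with the same index label,
# so values from different rows are never mixed.
print(picked.equals(df.loc[picked.index]))
```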
I have a text file called Orbit 1 and I need help opening it and then creating three separate arrays. I'm new to Python and have been having difficulty with this aspect. Here are the first few rows of my text file. There are 1112 rows including the header.
Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude
2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898
2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197
2019 3 17 5 56 5 0 109.1078294 50.8478274 412.850563
2019 3 17 5 56 10 0 109.2280101 51.15904424 412.9640113
2019 3 17 5 56 15 0 109.3500969 51.47006828 413.0772319
2019 3 17 5 56 20 0 109.4741362 51.78089533 413.1901358
2019 3 17 5 56 25 0 109.6001758 52.09152105 413.3025291
2019 3 17 5 56 30 0 109.728265 52.40194099 413.414457
2019 3 17 5 56 35 0 109.8584548 52.71215052 413.5259984
2019 3 17 5 56 40 0 109.9907976 53.02214489 413.6371791
I desire to open this text file to create three arrays called lat[N], long[N], and time[N] where N is the number of rows in the file. I ultimately want to be able to determine what the latitude, longitude, and time is at any point. For example, lat[0] should return 50.22483151 if working properly. In addition, for the time, I would need to convert to decimal hours and then create the array.
Essentially I need help with opening this text file I have and then creating the three arrays.
I've tried this method for opening the file, but I get stuck when trying to write the array and I think I may not be opening the file correctly.
import numpy as np
file_name = 'C:\\Users\\Saman\\OneDrive\\Documents\\Orbit 1.txt'
data = []
with open(file_name) as file:
    next(file)
    for line in file:
        row = line.split()
        row = [float(x) for x in row]
        data.append(row)
The most effortless way to solve your problem is to use Pandas:
import pandas as pd
df = pd.read_table('Orbit 1.txt', sep=r'\s+')
df['Longitude']
#0 108.873007
#1 108.989510
#2 109.107829
#3 109.228010
#4 109.350097
#5 109.474136
#6 109.600176
#7 109.728265
#8 109.858455
#9 109.990798
Once you get a Pandas DataFrame, you may want to use it for the rest of the data processing, too.
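To get from the DataFrame to the three arrays the question asks for, something along these lines might work. It is only a sketch: the two data rows are copied from the question and read from an in-memory string so the example is self-contained, and the names lat, lon and time_hours are my own (long and time are avoided because they clash with common Python names).

```python
import io
import pandas as pd

# Stand-in for the real file: the header and first two rows from the question.
sample = io.StringIO(
    "Year Month Day Hour Minute Second Millisecond Longitude Latitude Altitude\n"
    "2019 3 17 5 55 55 0 108.8730074 50.22483151 412.6226898\n"
    "2019 3 17 5 56 0 0 108.9895097 50.53642185 412.7368197\n"
)
df = pd.read_csv(sample, sep=r"\s+")

lat = df["Latitude"].to_numpy()
lon = df["Longitude"].to_numpy()

# Time of day in decimal hours, including the millisecond column.
seconds = df["Second"] + df["Millisecond"] / 1000
time_hours = (df["Hour"] + df["Minute"] / 60 + seconds / 3600).to_numpy()

print(lat[0], lon[0], time_hours[0])
```

For the real file, replacing sample with the path to the text file should be the only change needed.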
file_name = 'info.txt'
Lat=[]
Long=[]
Time=[]
# pad a string with spaces to a width of 19 characters
left_justified=lambda x: x+" "*(19-len(x))
right_justified=lambda x: " "*(19-len(x))+x
with open(file_name) as file:
    next(file)  # skip the header line
    for line in file:
        data=line.split()
        Lat.append(float(data[8]))
        Long.append(float(data[7]))
        hrs=int(data[3])
        minutes=int(data[4])
        secs=int(data[5])
        total_secs=secs+minutes*60+hrs*3600
        Time.append(total_secs/3600)  # time in decimal hours
print(left_justified("Time"),left_justified("Lat"),left_justified("Long"))
for i in range(len(Lat)):
    print(left_justified(str(Time[i])),left_justified(str(Lat[i])),left_justified(str(Long[i])))
Try this
I am a beginner with Pandas and I have a large dataset in an archaic format which I would like to wrangle into Pandas format. The data looks like this:
0 1 2 3 4 5 ...
0 ì 8=xx 9=00 35=8 49=YY 56=073 ...
1 8=xx 9=00 35=8 49=YY 56=073 34=10715 ...
2 8=xx 9=00 35=8 49=YY 56=073 34=10716 ...
...
Each cell contains a column header and a value separated by "=", with the header on the left and the value on the right. Hence the data should look like this:
8 9 35 49 56 34 ...
0 xx 00 8 YY 073 107 ...
1 xx 00 8 YY 073 107 ...
2 xx 00 8 YY 073 107 ...
...
Each row has a different number of columns and there may be some repetition per row; for example, 8=xx may occur multiple times per row. I would like to create a new column (e.g. 8_x, 8_y, ...) each time this happens. I have tried to formulate a for/iterrows() loop to iterate through each row, but I am not sure how I can split a string and set the header in one go.
I've tried to look for a similar issue on the site but have had no success so far. Any help is much appreciated!
Edit: Adding in the code I used to parse the initial raw data into the format in the first table.
import pandas as pd
df = pd.read_csv('File.dat', sep='\n',nrows = 2, header=None, encoding = "ANSI")
df = df[0].str.split('<SPECIAL CHAR.>', expand=True)
As mentioned above in one of the comments on the original post, the 'right' way to deal with this would be to parse the data before it's in a dataframe. That being said, once the data is in a dataframe you can use the following code:
rows = []
def parse_row(row):
    d = {}
    for item in row[1]:  # row is an (index, Series) tuple from iterrows()
        if not isinstance(item, str) or "=" not in item:
            continue  # ignore this item
        col_name, val = item.split("=", 1)
        if col_name in d:
            # repeated header: find the first free suffixed name
            inx = 0
            while f"{col_name}_{inx}" in d:
                inx += 1
            col_name = f"{col_name}_{inx}"
            print(f"new colname is {col_name}")
        d[col_name] = val
    return d
for row in df.iterrows():
    rows.append(parse_row(row))
pd.DataFrame(rows)
I tested it with the following input:
0 1 2 3 4 5
0 ì 8=xx 9=00 35=8 49=YY 56=073
1 8=xx 9=00 35=8 49=YY 56=073 34=10715
2 8=xx 9=00 35=8 49=YY 8=zz 34=10716
This is the output:
8 9 35 49 56 34 8_0
0 xx 00 8 YY 073 NaN NaN
1 xx 00 8 YY 073 10715 NaN
2 xx 00 8 YY NaN 10716 zz
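For comparison, parsing the raw lines before they reach a dataframe could look roughly like the sketch below. The "|" delimiter and the two sample lines are stand-ins I made up, since the real separator character is not shown in the question; parse_line is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def parse_line(line, delimiter):
    """Turn one raw 'key=value' line into a dict, suffixing repeated keys."""
    record = {}
    for field in line.split(delimiter):
        if "=" not in field:
            continue  # skip junk such as a stray leading character
        key, value = field.split("=", 1)
        name, n = key, 0
        while name in record:  # repeated header: 8 -> 8_0 -> 8_1 ...
            name = f"{key}_{n}"
            n += 1
        record[name] = value
    return record

# Assumed sample input; "|" stands in for the real separator.
raw_lines = [
    "8=xx|9=00|35=8|49=YY|56=073",
    "8=xx|9=00|35=8|49=YY|8=zz|34=10716",
]
df = pd.DataFrame([parse_line(line, "|") for line in raw_lines])
print(df)
```

Building the list of dicts first and constructing the DataFrame once keeps the per-row work in plain Python and lets pandas fill missing columns with NaN.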
If the original .dat file is in a plain-text format, as one of the comments says, it can easily be transformed into CSV format:
Open the .dat file in your favorite text editor that supports regular expressions.
Copy the first line and remove all occurrences of '=[^,]+' to create the header with column names.
From the 2nd line onward, remove all occurrences of '[^,]+=' to preserve the cell values.
Save the CSV file and open it in Python with pd.read_csv(...).
This way, every time you load the CSV, chances are Pandas will guess the data format of each column correctly.
I am working with pandas on a dataset in which maintenance work is done at various sites. The maintenance is done at random intervals: sometimes after a year, and sometimes never. I want to find the years since the last maintenance action at each site, if an action has been made on that site. There can be more than one action for a site and the occurrences of actions are random. For the years prior to the first action, it is not possible to know the years since action because that information is not in the dataset.
I give only two sites in the following example but in the original dataset, I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year.
Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
Given this dataset; I want to have a dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 1 100 1
A 2015 0 150 2
A 2016 0 300 3
A 2017 0 80 4
B 2015 1 250 1
B 2016 1 60 1
B 2017 0 110 2
Please observe that the year 2014 is filtered out for Site B because that year is prior to the first action for that site.
Many thanks in advance!
I wrote the code myself. It is messy but does the job for me. :)
The solution assumes that df_select has an integer index.
df_select = (df_select[df_select['Site'].map((df_select.groupby('Site')['Action'].max() == 1))])
pieces = []  # one Years_since_action Series per site
gbo = df_select.groupby('Site')
for (key,group) in gbo:
    group = group.copy()  # work on a copy of the group
    indices_with_ones = group[group['Action']==1].index
    indices = group.index
    group['Years_since_action'] = 0
    group.loc[indices_with_ones,'Years_since_action'] = 1
    for idx_with_ones in indices_with_ones.sort_values(ascending=False):
        for idx in indices:
            if group.loc[idx,'Years_since_action']==0:
                if idx>idx_with_ones:
                    group.loc[idx,'Years_since_action'] = idx - idx_with_ones + 1
    pieces.append(group['Years_since_action'])
# Series.append was removed in pandas 2.0, so concatenate once at the end
years_since_action = pd.concat(pieces)
df_final = pd.merge(df_select,pd.DataFrame(years_since_action),how='left',left_index=True,right_index=True)
Here is how I would approach it:
import pandas as pd
from io import StringIO
import numpy as np
s = '''Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
'''
ss = StringIO(s)
df = pd.read_csv(ss, sep=r"\s+")
df_maintain = df[df.Action==1][['Site', 'Year']]
df_maintain.reset_index(drop=True, inplace=True)
df_maintain
def find_last_maintenance(x):
    df_temp = df_maintain[x.Site == df_maintain.Site]
    gap = [0]
    for ind, row in df_temp.iterrows():
        if (x.Year >= row['Year']):
            gap.append(x.Year - row['Year'] + 1)
    return gap[-1]
df['Gap'] = df.apply(find_last_maintenance, axis=1)
df = df[df.Gap !=0]
This generates the desired output.
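With thousands of sites, a row-wise apply may become slow. A possible vectorised alternative is sketched below; it is not the code from either answer, it assumes one row per site and year, and the names last_action_year and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Site": list("AAAABBBB"),
    "Year": [2014, 2015, 2016, 2017] * 2,
    "Action": [1, 0, 0, 0, 0, 1, 1, 0],
    "Measurement": [100, 150, 300, 80, 200, 250, 60, 110],
})
df = df.sort_values(["Site", "Year"])

# Keep the year only on action rows, then carry it forward within each site.
# Rows before a site's first action stay NaN.
last_action_year = df["Year"].where(df["Action"] == 1).groupby(df["Site"]).ffill()

df["Years_Since_Last_Action"] = df["Year"] - last_action_year + 1

# Drop the years that precede the first action and restore an integer dtype.
result = (
    df.dropna(subset=["Years_Since_Last_Action"])
      .astype({"Years_Since_Last_Action": int})
)
print(result)
```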
I'm quite new to pandas and python, and I'm coming from a background in biochemistry and drug discovery. One frequent task that I'd like to automate is the conversion of a list of combination of drug treatments and proteins to a format that contains all such combinations.
For instance, if I have a DataFrame containing a given set of combinations:
https://github.com/colinhiggins/dillydally/blob/master/input.csv, I'd like to turn it into https://github.com/colinhiggins/dillydally/blob/master/output.csv such that each protein (1, 2, and 3) is copied n times to an output DataFrame, where the number of rows, n, is the number of drugs and drug concentrations plus one for a no-drug row for each protein.
Ideally, the degree of combination would be dictated by some other table that indicates relationships, for example if proteins 1 and 2 are to be treated with drugs 1, 2, and 3 but that protein 2 isn't treated with any drugs.
I'm thinking some kind of nested for loop is going to be required, but I can't quite wrap my head around how to start it.
Consider the following solution
from itertools import product
import pandas
protein = ['protein1' , 'protein2' , 'protein3' ]
drug = ['drug1' , 'drug2', 'drug3']
drug_concentration = [100,30,10]
df = pandas.DataFrame.from_records( list( i for i in product(protein, drug, drug_concentration ) ) , columns=['protein' , 'drug' , 'drug_concentration'] )
>>> df
protein drug drug_concentration
0 protein1 drug1 100
1 protein1 drug1 30
2 protein1 drug1 10
3 protein1 drug2 100
4 protein1 drug2 30
5 protein1 drug2 10
6 protein1 drug3 100
7 protein1 drug3 30
8 protein1 drug3 10
9 protein2 drug1 100
10 protein2 drug1 30
11 protein2 drug1 10
12 protein2 drug2 100
13 protein2 drug2 30
14 protein2 drug2 10
15 protein2 drug3 100
16 protein2 drug3 30
17 protein2 drug3 10
18 protein3 drug1 100
19 protein3 drug1 30
20 protein3 drug1 10
21 protein3 drug2 100
22 protein3 drug2 30
23 protein3 drug2 10
24 protein3 drug3 100
25 protein3 drug3 30
26 protein3 drug3 10
This is basically a Cartesian product you're after, which is the functionality of the product function in the itertools module. I'm admittedly confused why you want the empty rows that just list the proteins with NaNs in the other columns. Not sure if that was intentional or accidental. If the datatypes were uniform and numeric, this would be similar to what's known as a meshgrid.
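If the no-drug rows are intentional, one possible way to add them on top of the Cartesian product is to concatenate a protein-only frame. This is a sketch under the assumption that each protein needs exactly one such row; the names combos, no_drug and full are illustrative.

```python
from itertools import product
import pandas as pd

proteins = ["protein1", "protein2", "protein3"]
drugs = ["drug1", "drug2", "drug3"]
concentrations = [100, 30, 10]

# All protein/drug/concentration combinations (27 rows).
combos = pd.DataFrame(
    list(product(proteins, drugs, concentrations)),
    columns=["protein", "drug", "drug_concentration"],
)

# One row per protein with no drug; the missing columns become NaN on concat.
no_drug = pd.DataFrame({"protein": proteins})

# A stable sort keeps each protein's no-drug row ahead of its drug rows.
full = (
    pd.concat([no_drug, combos], ignore_index=True)
      .sort_values("protein", kind="mergesort")
      .reset_index(drop=True)
)
print(full.head(4))
```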
I've worked through part of this with the help of "add one row in a pandas.DataFrame", using the method recommended by ShikharDua of creating a list of dicts, each dict corresponding to a row in the eventual DataFrame.
The code is:
data = pandas.read_csv('input.csv')
dict1 = {"protein":"","drug":"","drug_concentration":""} #should be able to get this automatically using the dataframe columns, I think
rows_list = []
for unique_protein in data.protein.unique():
    dict1 = {"protein":unique_protein,"drug":"","drug_concentration":""}
    rows_list.append(dict1)
    for unique_drug in data.drug.unique():
        for unique_drug_conc in data.drug_concentration.unique():
            dict1 = {"protein":unique_protein,"drug":unique_drug,"drug_concentration":unique_drug_conc}
            rows_list.append(dict1)
df = pandas.DataFrame(rows_list)
df
It isn't as flexible as I was hoping, since the extra row for each protein with no drugs is hard-coded into the nested for loops, but at least it's a start. I guess I can add some if statements within each for loop.
I've improved upon the earlier version:
- Enclosed it in a function.
- Added a check for proteins that won't be treated with drugs, read from another input CSV file that contains the same proteins in column A and either true or false in column B, labeled "treat with drugs".
- Skips null values. I noticed that my example input.csv had equal-length columns, and the function started going a little nuts with NaN rows if they had unequal lengths.
- Initial dictionary keys are set from the columns of the initial input CSV instead of hard-coding them.
I tested this with some real data (hence the change from input.csv to realinput.csv), and it works quite nicely.
Code for a fully functional python file follows:
import pandas
import os
os.chdir("path_to_directory_containing_realinput_and_boolean_file")
realinput = pandas.read_csv('realinput.csv')
rows_list = []
dict1 = dict.fromkeys(realinput.columns,"")
prot_drug_bool = pandas.read_csv('protein_drug_bool.csv')
prot_drug_bool.index = prot_drug_bool.protein
prot_drug_bool = prot_drug_bool.drop("protein",axis=1)
def null_check(value):
    return pandas.isnull(value)
def combinator(input_table):
    for unique_protein in input_table.protein.unique():
        # one row for the protein on its own (no drug)
        dict1 = dict.fromkeys(realinput.columns,"")
        dict1['protein']=unique_protein
        rows_list.append(dict1)
        # .ix has been removed from pandas; .loc does the same label lookup
        if prot_drug_bool.loc[unique_protein, 'treat with drugs']:
            for unique_drug in input_table.drug.unique():
                if not null_check(unique_drug):
                    for unique_drug_conc in input_table.drug_concentration.unique():
                        if not null_check(unique_drug_conc):
                            dict1 = dict.fromkeys(realinput.columns,"")
                            dict1['protein']=unique_protein
                            dict1['drug']=unique_drug
                            dict1['drug_concentration']=unique_drug_conc
                            rows_list.append(dict1)
    df = pandas.DataFrame(rows_list)
    return df
df2 = combinator(realinput)
df2.to_csv('realoutput.csv')
I'd still like to make it more versatile by getting away from hard-coding any dictionary keys and letting the user-defined input.csv column headers dictate the output. Also, I'd like to move away from the defined three-column setup to handle any number of columns.
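One possible direction for handling any number of columns is to take the product over whatever columns the input has, skipping nulls. The sketch below is only a starting point: combine_columns is a hypothetical helper, the small table is invented, and it does not yet add the no-drug rows or the per-protein treatment check.

```python
from itertools import product
import pandas as pd

def combine_columns(table):
    """Return every combination of the unique non-null values in each column."""
    levels = [table[col].dropna().unique() for col in table.columns]
    return pd.DataFrame(list(product(*levels)), columns=table.columns)

# Invented input with unequal-length columns (padded with missing values).
table = pd.DataFrame({
    "protein": ["p1", "p2", "p3"],
    "drug": ["d1", "d2", None],
    "drug_concentration": [100, None, None],
})
out = combine_columns(table)
print(out)
```

The output column names come straight from the input headers, so nothing about the three-column layout is hard-coded.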