There is a huge CSV file that is being read with pd.read_table('file.csv', chunksize=50000). Currently, on each loop iteration, I collect the value counts relevant to the current chunk using the df.col.value_counts() method. I got it working through loops and tricks with numpy, but I'm wondering if there is a cleaner way to do this using pandas?
Code:
prev = None
# LOOP CHUNK DATA
for imdb_basics in pd.read_table(
        'data/imdb.title.basics.tsv',
        dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
        usecols=['tconst', 'originalTitle', 'startYear'],
        chunksize=50000,
        sep='\t'
):
    # REMOVE NULL DATA & CONVERT TO NUMBER
    imdb_basics.startYear = imdb_basics.startYear.replace("\\N", 0)
    imdb_basics.startYear = pd.to_numeric(imdb_basics.startYear)
    # --- loops and tricks --- !
    tmp = imdb_basics.startYear.value_counts(sort=False)
    current = {
        'year': list(tmp.keys()),
        'count': list(tmp.values)
    }
    if prev is None:
        prev = current
    else:
        # add counts for years already seen in earlier chunks
        for i in range(len(prev['year'])):
            for j in range(len(current['year'])):
                if prev['year'][i] == current['year'][j]:
                    prev['count'][i] += current['count'][j]
        # append years not seen before
        for i in range(len(current['year'])):
            if not (current['year'][i] in prev['year']):
                prev['year'].append(current['year'][i])
                prev['count'].append(current['count'][i])
EDIT:
I'm working with a large data file, plus the remote machine I'm currently using has a very limited amount of memory, so removing chunking in pandas is not an option.
Like I said in my comments, you don't need to worry about the key management; pandas can do all of that for you. Consider this trivial example with some mock data that has a year column and some other column:
from io import StringIO
import numpy
import pandas

numpy.random.seed(0)

# years to choose from
years = numpy.arange(2000, 2017)

# relative probabilities of a year being selected (2000 should be absent)
weights = numpy.linspace(0.0, 0.7, num=len(years))
weights /= weights.sum()

# fake dataframe turned into a fake CSV
x = numpy.random.choice(years, size=200, p=weights)
text = pandas.DataFrame({
    'year': x,
    'value': True
}).to_csv()
Since this is a small file, we can read it all at once to get the "correct" answer:
pandas.read_csv(StringIO(text))['year'].value_counts().sort_index()
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
OK, so now let's try a chunking approach, using pandas methods:
result = None
for chunk in pandas.read_csv(StringIO(text), chunksize=25):
    tmp = chunk['year'].value_counts()
    if result is None:  # first chunk
        result = tmp.copy()
    else:  # all other chunks
        result = result.add(tmp, fill_value=0).astype(int)

final = result.sort_index()
final
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
So it works. Pandas will align and fill the index during basic operations.
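That alignment is doing all the work here: Series.add(..., fill_value=0) matches the two result indexes label by label and treats a label that is missing on one side as 0 instead of producing NaN. A minimal sketch of just that behaviour, with made-up counts:
import pandas

a = pandas.Series({2001: 3, 2002: 5})
b = pandas.Series({2002: 1, 2003: 7})

# labels are aligned; a label missing from one side counts as 0 rather than NaN
print(a.add(b, fill_value=0))
# 2001    3.0
# 2002    6.0
# 2003    7.0
# dtype: float64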
You could try dask.dataframe. It's arguably underused because it only offers a subset of pandas functionality. But if the problem is the awkward syntax that chunking forces on you, you could try this:
import dask.dataframe as dd
df = dd.read_csv('my_big_file.csv')
counts = df['col'].value_counts()
counts.compute()
Internally, dask deals with chunking, aggregation, etc.
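Applied to the TSV from the question, the same read options should carry over. A rough sketch (the dtype/usecols choices below simply mirror the question and may need adjusting):
import dask.dataframe as dd

# mirror the read_table options from the question (tab-separated, selected columns)
df = dd.read_csv(
    'data/imdb.title.basics.tsv',
    sep='\t',
    dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
    usecols=['tconst', 'originalTitle', 'startYear'],
)

# nothing is actually read until .compute() is called
counts = df['startYear'].value_counts().compute()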
Related
How do I exclude rows from the data by condition?
1- Using .loc I selected the part to be removed.
2- The problem is that there are empty rows in "year"; I want to keep all the empty ones and anything < 2020.
I would use !> but that doesn't work; Python only accepts !=.
# dataframe
cw =
   year  name
   2022  as
   2020  ad
         sd
         sd
   1988  wwe
   1999  we
cw = cw.loc[cw['year']!>'2020']
The problem is the empty fields, it's tricky... I need to keep everything that is NOT > 2020, so I want to keep the empty values and the smaller ones.
Isn't not greater than n the same as less than or equal to n?
cw = cw.loc[cw['year']!>'2020']
simply becomes
cw = cw.loc[cw['year'] <= '2020']
Negating the query will also work, but it's important that your "year" column be either an int or a timestamp if you want to make sure the > operator works correctly.
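Note that with empty/NaN years the two forms are not strictly equivalent: comparisons against NaN always return False, so year <= 2020 drops the empty rows while ~(year > 2020) keeps them. A tiny sketch with made-up values:
import pandas as pd

cw = pd.DataFrame({"year": [2022, 2020, None, 1988]})

print(cw.loc[cw["year"] <= 2020])    # drops the empty row: NaN <= 2020 is False
print(cw.loc[~(cw["year"] > 2020)])  # keeps the empty row: NaN > 2020 is False, so its negation is True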
Try something more like this:
import pandas as pd
cw = pd.DataFrame({"year": [2022, 2020, None, None, 1988, 1999],
                   "name": ["as", "ad", "sd", "sd", "wwe", "we"]})
"""
     year name
0  2022.0   as
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
cw = cw.loc[~(cw["year"] > 2020)]
"""
     year name
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
I am working with sports stats data and want to extract stats from the past 3 years. If I have a dataframe with player and year, how can I extract the rows from another dataframe that have a matching player and the same year or either of the two previous years?
df1 = pd.DataFrame([['ABC', 2018, 5, 2, 3],
                    ['ABC', 2017, 52, 21, 31],
                    ['ABC', 2016, 15, 12, 13],
                    ['ABC', 2015, 25, 22, 3]],
                   columns=['Player', 'Year', 'GS', 'G', 'MP'])
df1=
Player Year GS G MP
ABC 2018 5 2 3
ABC 2017 52 21 31
ABC 2016 15 12 13
ABC 2015 25 22 3
df2 = pd.DataFrame([["ABC",2017]], columns=['Player','Year'])
df2=
Player Year
ABC 2017
This should result in:
Player Year GS G MP
ABC 2017 52 21 31
ABC 2016 15 12 13
ABC 2015 25 22 3
Eventually I want to do summations, but extracting this makes that much easier. Is there a pythonic way to do this using merge or filter?
Merge on 'Player', then filter the year range afterwards:
res = df1.merge(df2, on='Player', suffixes=['', '_r'])
res = res.loc[res.Year.between(res.Year_r-2, res.Year_r)].drop(columns='Year_r')
print(res)
# Player Year GS G MP
#1 ABC 2017 52 21 31
#2 ABC 2016 15 12 13
#3 ABC 2015 25 22 3
Or if 'Player' is not duplicated in df2, map to a Series and then mask with a Boolean Series:
s = df1.Player.map(df2.set_index('Player').Year)
df1[df1.Year.between(s-2, s)]
# Player Year GS G MP
#1 ABC 2017 52 21 31
#2 ABC 2016 15 12 13
#3 ABC 2015 25 22 3
A common pattern is to specify what values to filter on with the form df1[df1.Column == value]. You can combine multiple conditions as follows:
years = [(df2.Year.values[0] - j) for j in range(3)]
player = df2.Player.values[0]
result = df1[(df1.Player == player) & (df1.Year.isin(years))]
The other answers are good! But also this should work :)
# to be safe, sort the DataFrames first
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)

# prepare the Boolean masks
check_1 = df1["Player"] == df2["Player"].to_list()[0]

# to be safe, use int() so the candidate years are plain integers
years_list = [int(df2["Year"].tolist()[0]) - i for i in range(0, 3)]
check_2 = df1.Year.map(int).isin(years_list)

# apply the masks
print(df1[check_1 & check_2])
Anyway, you don't necessarily need a DataFrame to store the matching "Player" and the matching "Year".
Two lists or even plain variables would work even better, since it seems you have not set real columns in your df2, as Erfan noticed in the comment under your question.
I am working on a dataset with pandas in which maintenance work is done at a location. The maintenance is done at random intervals, sometimes once a year, and sometimes never. I want to find the years since the last maintenance action at each site, provided an action has been performed on that site. There can be more than one action per site, and the occurrences of actions are random. For the years prior to the first action, it is not possible to know the years since the last action, because that information is not in the dataset.
I give only two sites in the following example but in the original dataset, I have thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year.
Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
Given this dataset, I want to have a dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 1 100 1
A 2015 0 150 2
A 2016 0 300 3
A 2017 0 80 4
B 2015 1 250 1
B 2016 1 60 1
B 2017 0 110 2
Please observe that the year 2014 is filtered out for Site B because it is prior to the first action for that site.
Many thanks in advance!
I wrote the code myself. It is messy but does the job for me. :)
The solution assumes that df_select has an integer index.
df_select = df_select[df_select['Site'].map(df_select.groupby('Site')['Action'].max() == 1)]

years_since_action = pd.Series(dtype='int64')
gbo = df_select.groupby('Site')

for (key, group) in gbo:
    indices_with_ones = group[group['Action'] == 1].index
    indices = group.index
    group['Years_since_action'] = 0
    group.loc[indices_with_ones, 'Years_since_action'] = 1
    for idx_with_ones in indices_with_ones.sort_values(ascending=False):
        for idx in indices:
            if group.loc[idx, 'Years_since_action'] == 0:
                if idx > idx_with_ones:
                    group.loc[idx, 'Years_since_action'] = idx - idx_with_ones + 1
    # Series.append was removed in pandas 2.0, so collect the results with pd.concat
    years_since_action = pd.concat([years_since_action, group['Years_since_action']])

df_final = pd.merge(df_select, pd.DataFrame(years_since_action),
                    how='left', left_index=True, right_index=True)
Here is how I would approach it:
import pandas as pd
from io import StringIO
import numpy as np

s = '''Site Year Action Measurement
A 2014 1 100
A 2015 0 150
A 2016 0 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 1 60
B 2017 0 110
'''
ss = StringIO(s)

df = pd.read_csv(ss, sep=r"\s+")

df_maintain = df[df.Action == 1][['Site', 'Year']]
df_maintain.reset_index(drop=True, inplace=True)
df_maintain

def find_last_maintenance(x):
    df_temp = df_maintain[x.Site == df_maintain.Site]
    gap = [0]
    for ind, row in df_temp.iterrows():
        if x.Year >= row['Year']:
            gap.append(x.Year - row['Year'] + 1)
    return gap[-1]

df['Gap'] = df.apply(find_last_maintenance, axis=1)
df = df[df.Gap != 0]
This generates the desired output.
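For larger data, the same result can also be reached without apply/iterrows by forward-filling the year of the most recent action within each site. A rough sketch along the same lines, starting from the freshly read df above (before the Gap column is added):
# year of the most recent action, carried forward within each site
last_action = df['Year'].where(df['Action'] == 1)
df['Last_Action_Year'] = last_action.groupby(df['Site']).ffill()

# rows before a site's first action have no last-action year yet and are dropped
out = df.dropna(subset=['Last_Action_Year']).copy()
out['Years_Since_Last_Action'] = (out['Year'] - out['Last_Action_Year'] + 1).astype(int)
out = out.drop(columns='Last_Action_Year')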
So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of
ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR
I am counting how often a row with the same date appears on my record and then update SPECIAL_ID based on that data.
Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB) - this one is around 119 MB, so it is crucial that I find a good, fast solution.
My code goes as follows:
df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed

d = 31
m = 12
y = 100
s = (y, m, d)
list_dates = np.zeros(s)  # 3-dimensional array

for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)

df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')
I added a print every thousand rows processed. At first my script needs about 3 seconds per 1000 rows, but the longer it runs, the slower it gets: at row 43,000 it needs 29 seconds, and so on...
Thanks for all future help :)
EDIT:
I am adding additional information about my CSV and the expected output:
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011
I have to replace the "__-" placeholder at the end of the SPECIAL_ID field with a proper number.
For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 ("__-" got replaced by 001). If someone else in the CSV shares the same DAY, MONTH and YEAR, their "__-" will be replaced by 002, and so on...
So the expected output for the above rows would be:
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
EDIT:
I changed my code to something like this: I fill a list of dicts with data, then convert that list to a dataframe and save it as CSV. This takes around 30 minutes to complete.
list_popravljeni = []

df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()

for index, row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == row['dan_roj']) &
                           (df['mesec_roj'] == row['mesec_roj']) &
                           (df['leto_roj'] == row['leto_roj'])]
    for index, row in df_candidates.iterrows():
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(index).zfill(2)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)

pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating the counts column), but it should be much faster than your current approach.
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
I added a fake record at the end to confirm it does increment properly:
SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003
If you want to get rid of counts, you just need:
df.drop('counts', inplace=True, axis=1)
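For context, wired into the read/write step from the question it might look roughly like this (the file names are placeholders, and SPECIAL_ID is read as a string so the slicing works on text rather than numbers):
import pandas as pd

# placeholder file names; adjust to your paths
df = pd.read_csv('input.csv', sep=';', dtype={'SPECIAL_ID': str})

# 1-based running count of each (year, month, day) combination, zero-padded to 3 digits
counts = (df.groupby(['year', 'month', 'day']).cumcount() + 1).astype(str).str.zfill(3)

# swap the trailing "__-" placeholder for the counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(counts)

df.to_csv('output.csv', sep=';', index=False, encoding='utf-8')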
I have a df with around 100,000 rows and 1,000 columns and need to make some adjustments based on the existing data. How do I best approach this? Most of the changes will follow this basic formula:
search a column (or two or three) to see if a condition is met
if met, change the values of dozens or hundreds of columns in that row
This is my best attempt. I created a list of the columns and checked whether the first column contained the value 1; where it did, I wanted to add some number. That part worked, but it only worked on the FIRST row, not on all the 1s in the column. To fix that, I think I need a loop with a second [i] that goes through all the rows, but I wasn't sure if I was approaching the entire problem incorrectly. FWIW, test_cols is the list of columns and testing_2 is my df.
def try_this(test_cols):
    for i in range(len(test_cols)):
        if i == 0 and testing_2[test_cols[i]][i] == 1:
            testing_2[test_cols[i]][i] = testing_2[test_cols[i]][i] + 78787
        i += 1
    return test_cols
Edit/example:
Year Month Mean_Temp
City
Madrid 1999 Jan 7--this value should appear twice
Bilbao 1999 Jan 9--appear twice
Madrid 1999 Feb 9
Bilbao 1999 Feb 10
. . . .
. . . .
. . . .
Madrid 2000 Jan 6.8--this value should go away
Bilbao 2000 Jan 9.2--gone
So I would need to do something like (using your answer):
def alter(row):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        row['Mean_Temp'] = row['Mean_Temp']  # from year 1999!
        return row['Mean_Temp']
    else:
        return row['Mean_Temp']
One way you could do this is by creating a function and applying it. Suppose you want to set column 'c' to ten times the value of 'b' whenever the corresponding row in 'a' or 'b' contains an even number (and to 'b' itself otherwise).
import pandas as pd

data = {'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]}
df = pd.DataFrame(data)

def alter(row):
    if row['a'] % 2 == 0 or row['b'] % 2 == 0:
        return row['b'] * 10
    else:
        return row['b']

df['c'] = df.apply(alter, axis=1)
would create a df that looks like,
a b c
0 1 3 3
1 2 6 60
2 3 8 80
3 4 12 120
Edit to add:
If you want to apply values from other parts of the df you could put those in a dict and then pass that into your apply function.
import pandas as pd

data = {'Cities': ['Madrid', 'Balbao'] * 3,
        'Year': [1999] * 4 + [2000] * 2,
        'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
        'Mean_Temp': [7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)
df = df[['Cities', 'Year', 'Month', 'Mean_Temp']]

# create a dictionary with the values from 1999
edf = df[df.Year == 1999]
keys = zip(edf.Cities, edf.Month)
values = edf.Mean_Temp
dictionary = dict(zip(keys, values))

def alter(row, dictionary):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        return dictionary[(row.Cities, row.Month)]
    else:
        return row['Mean_Temp']

df['Mean_Temp'] = df.apply(alter, args=(dictionary,), axis=1)
Which gives you a df that looks like,
Cities Year Month Mean_Temp
0 Madrid 1999 Jan 7
1 Balbao 1999 Jan 9
2 Madrid 1999 Feb 9
3 Balbao 1999 Feb 10
4 Madrid 2000 Jan 7
5 Balbao 2000 Jan 9
Of course you can change the parameters however you like. Hope this helps.
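For the original pattern in the question (check one or two columns, then change dozens or hundreds of columns in the matching rows), a boolean mask with .loc can also work and avoids calling a Python function per row, since it assigns many columns in a single statement. A small sketch that reproduces the first example above:
import pandas as pd

data = {'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# rows where 'a' or 'b' is even
mask = (df['a'] % 2 == 0) | (df['b'] % 2 == 0)

# c defaults to b, and becomes b * 10 only in the masked rows
df['c'] = df['b']
df.loc[mask, 'c'] = df['b'] * 10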