I have a short script. For example, I have a dataset that I first group by id, keeping the first 3 rows per group, and then group again, this time merging the name and house values into a single row per id.
Example input and output
Input CSV:
id,name,house
1,a,house1
1,aa,house2
1,aaa,house3
2,b,house4
2,bb,house5
2,bbb,house6
3,c,house7
3,cc,house8
3,ccc,house9
4,d,house10
4,dd,house11
4,ddd,house12
4,dddd,house13
The output CSV:
1,a,house1,aa,house2,aaa,house3
2,b,house4,bb,house5,bbb,house6
3,c,house7,cc,house8,ccc,house9
4,d,house10,dd,house11,ddd,house12
Script:
import pandas as pd

df = pd.read_csv('test.csv')  # the sample file is comma-separated, so no delimiter override is needed
df = df.sort_values(by=['id'])  # sort_values returns a new frame, so assign it back
# keep the first 3 rows per id, then join the name and house values per id
df = df.groupby('id').head(3).groupby('id').agg({
    'name': lambda l: ','.join(l),
    'house': lambda l: ','.join(l)
})
df[['name_first', 'name_second', 'name_third']] = df.name.str.split(',', expand=True)
df[['house_first', 'house_second', 'house_third']] = df.house.str.split(',', expand=True)
df = df.reset_index().drop(['name', 'house'], axis=1)
df.to_csv('output.csv')
I want to add a progress bar, but I couldn't. If I could switch the agg function to an apply function, I think I would be able to switch it to progress_apply, but I couldn't work out how to change it. How can I do that? I need a progress bar because I have a really huge CSV file (over 10 million lines), so it is going to take time and I want to track progress.
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4', '4'],
                   'name': ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc', 'd', 'dd', 'ddd', 'dddd'],
                   'house': ['house1', 'house2', 'house3', 'house4', 'house5', 'house6', 'house7', 'house8', 'house9', 'house10', 'house11', 'house12', 'house13']
                   })
This approach creates a pivot table
outcome = df.groupby('id').head(3)\
            .assign(count=df.groupby('id').cumcount())\
            .set_index(['id', 'count']).unstack()\
            .sort_index(axis=1, level=1)
and then we can save it after renaming the columns
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
But this does not come with a progress bar because I did not use apply.
To use a progress bar anyway:
from tqdm.notebook import tqdm
tqdm.pandas()
outcome = df.groupby('id').progress_apply(
    lambda x: x.head(3).reset_index(drop=True).set_index('id', append=True).unstack(0),
).droplevel(0).sort_index(axis=1, level=1)
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
Please try both approaches and see which is faster.
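If you would rather keep the joined-string columns from your original script, here is a minimal sketch of the same swap (agg replaced by a groupby progress_apply, assuming tqdm is installed):
import pandas as pd
from tqdm import tqdm

tqdm.pandas()

df = pd.read_csv('test.csv')
# progress_apply replaces agg, so tqdm can report one tick per id group
out = df.groupby('id').progress_apply(
    lambda g: pd.Series({
        'name': ','.join(g['name'].head(3)),
        'house': ','.join(g['house'].head(3)),
    })
).reset_index()
out.to_csv('output.csv', index=False)
The name/house columns can then be split into first/second/third exactly as in your original script.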
Related
I have 2 DataFrames with the same columns but different lengths.
In [1]: df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
In [2]: df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
The result I want to get:
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
Each df has a length like this:
len(df_g): 1000
len(df_n): 5000
Each DataFrame has the columns 'NAME', 'ID', 'TYPE', 'LEN'.
I need to compare the two DataFrames on the NAME, TYPE and LEN columns and check whether the 'ID' column has the same value or not.
So I tried this:
for i in df_g.index:
    for j in df_n.index:
        # take one row from each frame as an ndarray
        g_row = df_g.iloc[i].values
        n_row = df_n.iloc[j].values
        # build comparison keys from NAME, TYPE and LEN
        g_str = g_row[0] + str(g_row[2]) + str(g_row[3])
        n_str = n_row[0] + str(n_row[2]) + str(n_row[3])
        # same NAME/TYPE/LEN but a different ID
        if g_str == n_str and g_row[1] != n_row[1]:
            print(i, j)
            print(g_row[0])
I have the above code for two DataFrames of different lengths.
First I tried iterrows() to compare the two DataFrames, but it took too much time (very slow).
I looked for other ways to get better performance.
Possible ways I found:
Option 1: transform the df to a dict with to_dict() and compare them in a nested for loop.
Option 2: transform the df Series to ndarrays and compare them in a nested for loop.
Is there any other better option? Or any option that avoids the nested for loop?
Thanks.
You can try merge.
If you are looking for records where the IDs mismatch, the following is one way of achieving it:
r1 = (df_g.merge(df_n, on=['NAME', 'TYPE', 'LEN'], how='inner')
          .query('ID_x != ID_y')
          .rename(columns={'ID_x': 'ID'})
          .drop(columns='ID_y'))
I have used how="inner" here, but based on your needs you can use any of the following joins:
{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
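As a small illustration of the how= parameter (a sketch, assuming df_g and df_n are defined as above), an outer join with indicator=True also shows which frame each NAME/TYPE/LEN combination came from:
# the outer join keeps rows present in only one frame;
# the _merge column reports both / left_only / right_only
r_outer = df_g.merge(df_n, on=['NAME', 'TYPE', 'LEN'], how='outer', indicator=True)
print(r_outer['_merge'].value_counts())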
I have two data frames, and for each line in one data frame I want to locate the matching line in the other data frame by a certain column (containing some id). I thought to go over the lines in df1 and use the loc function to find the matching line in df2.
The problem is that some of the ids in df2 have some extra information besides the id itself.
For example:
df1 has the id: 1234,
df2 has the id: 1234-KF
How can I locate this id, for example with loc? Can loc somehow match only by prefix?
Extra information can be removed using e.g. a regular expression (or a substring):
import pandas as pd
import re
df1 = pd.DataFrame({
    'id': ['123', '124', '125'],
    'data': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'id': ['123-AA', '124-AA', '125-AA'],
    'data': ['1', '2', '3']
})
# strip the non-digit suffix from df2's ids and compare against df1's ids
# (this relies on df1 and df2 sharing the same row order / index)
df2.loc[df2.id.apply(lambda s: re.sub("[^0-9]", "", s)) == df1.id]
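If the two frames are not aligned row by row, one possible alternative (a sketch, not part of the original answer) is to add the cleaned id as a column and merge on it:
# add a cleaned id column to df2, then match it against df1's bare id
df2_clean = df2.assign(id_clean=df2['id'].str.replace(r'[^0-9]', '', regex=True))
matched = df1.merge(df2_clean, left_on='id', right_on='id_clean', suffixes=('_df1', '_df2'))
print(matched)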
I'm trying to modify a few columns so that all the yesses and nos become 1 and 0:
df['Outbreak Associated', 'FSA'] = df['Outbreak Associated', 'FSA'].map({'yes': '1', 'no': '0'})
Doing them one at a time works, but two or more is giving me an error. I imagine there's something simple I'm missing but I can't think of what it is.
KeyError: ('Outbreak Associated', 'FSA')
Any thoughts?
You can use replace, BUT if there is no match it keeps the original value, not NaN like map:
cols = ['Outbreak Associated', 'FSA']
df[cols] = df[cols].replace({'yes': '1', 'no': '0'})
Solutions for Series.map - you can loop over each column with DataFrame.apply and a lambda function:
df[cols] = df[cols].apply(lambda x: x.map({'yes': '1', 'no': '0'}))
Or use DataFrame.stack and Series.unstack:
df[cols] = df[cols].stack().map({'yes': '1', 'no': '0'}).unstack()
Or using map for each column separately ;):
d = {'yes': '1', 'no': '0'}
df['Outbreak Associated'] = df['Outbreak Associated'].map(d)
df['FSA'] = df['FSA'].map(d)
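To see the difference mentioned above (replace keeps values that have no match, while map turns them into NaN), here is a small mock frame (the 'maybe' value is made up for illustration):
import pandas as pd

df = pd.DataFrame({'Outbreak Associated': ['yes', 'no', 'maybe'],
                   'FSA': ['no', 'yes', 'yes']})
cols = ['Outbreak Associated', 'FSA']

print(df[cols].replace({'yes': '1', 'no': '0'}))                 # 'maybe' is kept as-is
print(df[cols].apply(lambda x: x.map({'yes': '1', 'no': '0'})))  # 'maybe' becomes NaN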
While trying to make some updates to a local database I'm having issues: the loop is just getting the last row's values.
The variables id_value and cost_value get only the last row's values.
How can I get all the values, so I can update the old research records?
Data:
df = pd.DataFrame({
    'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
    'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
My code:
for index, row in df.iterrows():
    id_value = [row['id']]
    cost_value = [row['Cost']]
    # this row updates the data, however the variables are getting only the last value
    # table.update().where(table.c.id == id_value).values(Cost=cost_value)

print(id_value)
print(cost_value)
Out[12]:
['0075210657']
['65.5495495']
Desired output:
['09999900795', '00009991136', '000094801299', '000099900300', '0075210657']
['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
Python for loops overwrite the value of id_value and cost_value on every iteration. This is why you are only seeing the last row's values.
If what you want is a Python list of every value in that column, you can do that more efficiently than looping by using df['id'].tolist().
Timing the difference of these with your (small) example dataset:
import timeit
setup_string = '''
import pandas as pd
df = pd.DataFrame({
'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
'''
code_string1 = '''
id_values = []
cost_values = []
for _, row in df.iterrows():
    id_values.append(row['id'])
    cost_values.append(row['Cost'])
'''
code_string2 = '''
id_values = df['id'].tolist()
cost_values = df['Cost'].tolist()
'''
timeit.timeit(code_string1, setup_string, number=10000)
timeit.timeit(code_string2, setup_string, number=10000)
The first, iterative example gives 5.5589933570008725 seconds on my machine, while the second example gives 0.2375467009987915 seconds on my machine.
You need to append values to each list; you are simply defining a new list in each iteration.
id_values = []
cost_values = []
for _, row in df.iterrows():
    id_values.append(row['id'])
    cost_values.append(row['Cost'])
df = pd.DataFrame({
    'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
    'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
id_value = df['id'].tolist()
cost_value = df['Cost'].tolist()
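If the end goal is the database update from the commented-out line in the question, one possible sketch (assuming table is a SQLAlchemy Table and conn is an open connection, as that line implies) is to zip the two lists and issue one update per row:
# issue one UPDATE per (id, Cost) pair; `table` and `conn` are assumed to exist
for id_value, cost_value in zip(df['id'].tolist(), df['Cost'].tolist()):
    conn.execute(
        table.update().where(table.c.id == id_value).values(Cost=cost_value)
    )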
I have panel data (repeated observations per ID at different points in time). The data is unbalanced (there are gaps). I need to check, and possibly adjust for, a change in a variable per person over the years.
I tried two versions. First, a for-loop setting, to first access each person and then each of its years. Second, a one-line combination with groupby. Groupby looks more elegant to me. The main issue here is to identify the "next element". I assume in a loop I could solve this with a counter.
Here is my MWE panel data:
import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009', '2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df
Here is what a time series per person looks like:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

fig, ax = plt.subplots()
for i in df.id.unique():
    # the MWE column is 'money', so plot that for each id
    df[df['id'] == i].plot.line(x='year', y='money', ax=ax, label='id = %s' % i)
    df[df['id'] == i].plot.scatter(x='year', y='money', ax=ax)
plt.xticks(np.unique(df.year), rotation=45)
Here is what I want to achieve: for each person, compare the time series of values and flag every successor that differs from its precursor value (the red circles). Then I will try different strategies to handle it:
Drop (very iffy): if successor differs, drop it
Smooth (absolute value): if successor differs by (say) 1 unit, assign it its precursor value
Smooth (relative value): if successor differs by (say) 1 percent, assign it its precursor value
Solution to drop
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs()>0].index)
Idea to smooth
# keep track of change of variable by person and time
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
# first element has no precursor, it will be NaN, replace this by 0
df = df.fillna(0)
# now: whenever change_of_variable exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference'])>=1, df['money'].shift(1), df['money'])
To get the next event in your data you can use a combination of groupby and shift, and then do the subtraction from the previous event:
df['money_difference'] = df.groupby('id')['money'].shift(-1) - df['money']
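Building on that, here is a minimal sketch of the "smooth (absolute value)" idea from the question (assuming the MWE df defined above; it compares each value with its original precursor within an id, not with an already-smoothed value):
import numpy as np

# previous money value within each id
prev = df.groupby('id')['money'].shift(1)
# where the value jumps by 1 or more relative to its precursor, fall back to the precursor;
# the first row of each id has no precursor (NaN), so the condition is False and the value is kept
df['money_smoothed'] = np.where((df['money'] - prev).abs() >= 1, prev, df['money'])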