Updating a pandas dataframe column does not work the first time - Python

I have a dataframe concatenated from some other dataframes. I then need to update some values in one column, and found that I have to apply the same update twice. To find out what happened, I saved the dataframe to disk and reloaded it, then did the update; now it works on the first try.
Is this a bug in pandas, or did I do something wrong?
I am using pandas 0.22.0 from conda 4.5.0.
import re
import pandas as pd
sum_trade = pd.read_csv('somefile.csv')
df = pd.concat(
    [
        sum_trade.loc[sum_trade.mon == 201806].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon == 201706].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon > 201800].groupby(['trade'])['cnt'].sum(),
        sum_trade.loc[sum_trade.mon < 201800].groupby(['trade'])['cnt'].sum()
    ],
    axis=1
).reset_index()
df.columns = ['trade_code', 'cnt201806', 'cnt201706', 'cnt20181-6', 'cnt20171-6']
# substitute ["1.blabla", "(1)foofoo", "其中:barbar"] with ["blabla", "foofoo", "barbar"]
pattern = re.compile(r'^\(?\d?\.?\)?(其中:)?')
df.to_csv('temp.csv')
# The following line does not succeed on the first run
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
# doing the same update again seems to work
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
# if the data is loaded back from the file, the first update succeeds
df = pd.read_csv('temp.csv')
display(df[df.trade_code.map(lambda x: '1' in x)])
df.trade_code = df.trade_code.map(lambda x: pattern.sub('', x.strip()))
display(df[df.trade_code.map(lambda x: '1' in x)])
Here is some sample data from somefile.csv, which has about 2500 lines; the concatenated df has about 200 lines (names and numbers are fake):
city mon trade cnt
0 达纳苏斯 201701 1.农业 23458.0
1 达纳苏斯 201701 1.农副食品加工业 12345684.0
2 达纳苏斯 201701 1.房屋建筑业 22109.0
3 达纳苏斯 201701 1.电信、广播电视和卫星传输服务 338.0
4 达纳苏斯 201701 1.电力、热力生产和供应业 133333.0
Below are the two outputs of the code above, which show that some substitutions succeeded while others did not. I ran the code several times, and it was always the following 4 lines that were not updated on the first pass. But if the data or the pattern were the problem, the second update should not work either.
trade cnt201806 cnt201706 cnt20181-6 cnt20171-6
33 1.化学纤维制造业 0.0 123451.0 0.0 5432185.0
34 1.印刷和记录媒介复制业 5678913.0 7890153.0 5555504.0 112233185.0
63 1.金属制品业 98765804.0 4321563.0 34567919.0 22222256.0
82 1.金属制品、机械和设备修理业 8765493.0 3214929.0 3322113331.0 556677155.0
====================================================================
trade cnt201806 cnt201706 cnt20181-6 cnt20171-6

I checked the data and found that some trades are:
11.化学纤维制造业
11.印刷和记录媒介复制业
...
After the first substitution, they become:
1.化学纤维制造业
1.印刷和记录媒介复制业
...
That's why I had to run the substitution twice. I changed my pattern from '^\(?\d?\.?\)?(其中:)?' to '^\(?\d*\.?\)?(其中:)?' and everything works now.
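A standalone sketch makes the behaviour easy to reproduce (using one of the affected trade names):
import re
s = '11.化学纤维制造业'
old = re.compile(r'^\(?\d?\.?\)?(其中:)?')   # \d? consumes at most one leading digit
new = re.compile(r'^\(?\d*\.?\)?(其中:)?')   # \d* consumes all leading digits
print(old.sub('', s))   # 1.化学纤维制造业 -- one pass is not enough
print(new.sub('', s))   # 化学纤维制造业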
Thanks to all replies and comments.

Related

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Having done padding in the past, my thought process was to combine:
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However, I ran into a few problems:
The fill character in '{0:20>4}' has to be a single character, so '20' is not allowed.
Trying something like the line below just results in df['id'] taking on the result of the last lambda, and every other way I tried to combine multiple apply/lambda calls didn't work either. I started going down the pad-left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but hate, because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line, so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines, I just don't know how. Everything I've seen on SO and in the pandas documentation so far only covers single manipulations.
One option is to split, then use str.zfill to pad with '0'. Also prepend '20' before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
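For reference, radd('20') is the reflected add, i.e. '20' + value for each element, which for object-dtype string columns is plain concatenation; the first line above is roughly equivalent to:
tmp = ('20' + df['ID']).str.split('-')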
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

Run functions over many dataframes, add results to another dataframe, and dynamically name the resulting column with the name of the original df

I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The code below works, but instead of copying and pasting final_report["outcome"] = over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add "glucose_result", "insulin_result" and "leptin_result" to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
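If you would rather avoid the explicit loop, the same dictionary can feed a single assign call via a dict comprehension; this is only a stylistic variant of the loop above (the df=df default just pins each dataframe to its own lambda):
final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df),
        axis=1)
    for name, df in input_dfs.items()
})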

Adding Leading Zeros to a field with MM:SS time data

I have the following data:
The data shows a race finish time and pace.
As you can see, the data doesn't include the hour for people who finished before the one-hour mark. In order to do some analysis I need to convert it into a time format, but pandas doesn't recognize the plain MM:SS format. How can I pad '0:' in front of the rows where the hour is missing?
I'm sorry, this is my first time posting.
Assuming your data is in CSV format:
# reading in the data file
df = pd.read_csv('data_file.csv')
# replacing spaces with '_' in column names
df.columns = [c.replace(' ', '_') for c in df.columns]
for i, row in df.iterrows():
    val_initial = str(row.Gun_time)
    val_stripped = val_initial.replace(':', '')
    # fewer than 5 digits means there is no hour part, e.g. '28:48' -> '2848'
    if len(val_stripped) < 5:
        df.at[i, 'Gun_time'] = "0:" + val_initial
# saving newly edited csv file
df.to_csv('new_data_file.csv')
Before:
Gun time
0 28:48
1 29:11
2 1:01:51
3 55:01
4 2:08:11
After:
Gun_time
0 0:28:48
1 0:29:11
2 1:01:51
3 0:55:01
4 2:08:11
You can try applying the following function to the columns you want to change, and then convert the result to timedelta:
df['Gun time'] = df['Gun time'].apply(lambda x: '0:' + x if len(x) == 5
                                      else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun time'])

Am I using groupby.sum() correctly?

I have the following code, and a problem with the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace("\.\."," ", regex =True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True) #Here's the error, even though I can't see where
df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply((lambda x: ','.join(x.astype(str)))).reset_index(name="SEQ")
To give some context: it grabs every line with the same ID, separates the numbers with a "," in between, does some math with those numbers (that's where the "delta" line gets involved; I know it's not really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so that I keep the same number of rows.
And when I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
ID SUM SEQ
0 C9JLR9 353 1 100,182 250,329 417,490 583
1 O95391 244 1 100,206 254,493 586
2 P05114 101 1 100
3 P14866 196 1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is within that line).
This is the output that I receive:
ID SUM SEQ
0 C9JLR9 39 1 100,182 250,329 417,490 583
1 O95391 20 1 100,206 254,493 586
2 P05114 33 1 100
4 P98177 21 1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I haven't been able to figure out where those numbers come from either; it's really weird.
If anyone is interested, the solution was provided in the comments: I had to replace that line with the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

Explode a cell populated with multiple values into unique rows

I want to "explode" each cell that has multiple words in it into distinct rows while retaining its rating and synset value. I attempted to import someone's pandas_explode library, but VS Code just does not want to recognize it. Is there anything in the pandas documentation, or some nifty for loop, that will extract and redistribute these words? An example CSV is in the image link.
import json
import pandas as pd # version 1.01
df = pd.read_json('result.json')
df.to_csv('jsonToCSV.csv', index=False)
df = pd.read_csv('jsonToCSV.csv')
df = df.explode('words')
print(df)
df = df.to_csv(r'C:\Users\alant\Desktop\test.csv', index = None, header=True)
Output when running the above:
synset rating words
0 1034312 0.0 ['discourse', 'talk about', 'discuss']
1 146856 0.0 ['merging', 'meeting', 'coming together']
2 829378 0.0 ['care', 'charge', 'tutelage', 'guardianship']
3 8164585 0.0 ['administration', 'governance', 'governing bo...
4 1204318 0.0 ['nonhierarchical', 'nonhierarchic']
... ... ... ...
8605 7324673 1.0 ['emergence', 'outgrowth', 'growth']
If you have columns that need to be kept from exploding, I suggest setting them as the index first and then exploding.
For your example, try whether this works for you:
df = df.set_index(['synset','rating']).apply(pd.Series.explode) # this would work for exploding multiple columns as well
# then reset the index
df = df.reset_index()
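One caveat worth checking in the original code: after the read_json -> to_csv -> read_csv round trip, the words column will contain string representations of lists (e.g. "['discourse', 'talk about', ...]") rather than actual lists, and explode leaves non-list scalars unchanged. If that is what is happening, converting those strings back to real lists first should make the explode work:
import ast
df['words'] = df['words'].apply(ast.literal_eval)  # turn "['a', 'b']" strings back into lists
df = df.set_index(['synset', 'rating']).apply(pd.Series.explode).reset_index()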
