Change number format in dataframe index - python

I would like to change the number format of the index of a dataframe.
From the screenshot below, Paper ID is all in e+07 format (scientific notation, I didn't know what to call this btw) and I would like to change them into normal numbers such as 1147687 instead of 1.147687e+07.
Here's my dataframe:

You can convert your index values to int, but you must do it carefully, because you can lose some ids:
df = pd.DataFrame({'PaperId': [1000000000.0, 2.0, 3.0, 4.0],
                   'memberNum': [1, 2, 3, 4]})
df = df.set_index('PaperId')
df
memberNum
PaperId
1.000000e+09 1
2.000000e+00 2
3.000000e+00 3
4.000000e+00 4
df['PaperId'] = df.index
df['PaperId'] = df['PaperId'].astype('int')
df = df.set_index('PaperId')
df
memberNum
PaperId
1000000000 1
2 2
3 3
4 4
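A shorter route, assuming the ids are all whole numbers with no NaNs, is to cast the index itself; and if you only need to change how the numbers are displayed, you can set pandas' float format instead of touching the dtype. A minimal sketch:

# Cast the float index to integers directly
# (non-integer floats would be silently truncated, so check for NaNs first)
df.index = df.index.astype('int64')

# Or keep the float dtype and only change the display format
pd.set_option('display.float_format', '{:.0f}'.format)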

pandas groupby and return a series of one column

I have a dataframe like below
df = pd.DataFrame({'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
                   'time_1': ['2017-04-03 12:35:00','2017-04-03 12:50:00','2018-04-05 12:59:00','2018-05-04 13:14:00','2017-05-05 13:37:00','2018-07-06 13:39:00','2018-07-08 11:30:00','2017-04-08 16:00:00','2019-04-09 22:00:00','2019-04-11 04:00:00','2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val': [5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id': ['A','B','C','A','E','Q','G','F','G','H','J','A']})
df['time_1'] = pd.to_datetime(df['time_1'])
I would like to do the below
a) groupby subject_id and time_1 using freq='3M'
b) return only the aggregated values of Prod_id column (and drop index)
So, I tried the below
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique()
The above works, but it returned the groupby columns in the output as well.
So, I tried the below using as_index=False
df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3M')], as_index=False)['Prod_id'].nunique()
But it still didn't give the expected output.
I expect my output to be as shown below
uniq_prod_cnt
2
1
1
3
2
1
2
You are in one of those cases in which you need to get rid of the index afterwards.
To get the exact shown output:
(df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3M')])
   .agg(uniq_prod_cnt=('Prod_id', 'nunique'))
   .reset_index(drop=True)
)
output:
uniq_prod_cnt
0 2
1 1
2 1
3 3
4 2
5 1
6 2
If you want to get an array without the index, use the values attribute:
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique().values
output:
array([2, 1, 1, 3, 2, 1, 2], dtype=int64)
If you want to get a series with a range index, use reset_index(drop=True):
df.groupby(['subject_id',pd.Grouper(key='time_1', freq='3M')])['Prod_id'].nunique().reset_index(drop=True)
output:
0 2
1 1
2 1
3 3
4 2
5 1
6 2
Name: Prod_id, dtype: int64
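One caveat, assuming you are on a recent pandas (2.2 or later): the month-end frequency alias 'M' is deprecated there in favour of 'ME', so the grouper would be written as:

# 'ME' is the month-end alias on pandas 2.2+
df.groupby(['subject_id', pd.Grouper(key='time_1', freq='3ME')])['Prod_id'].nunique()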

Append results of DataFrame apply lambda to DataFrame or new Series

I am using the apply method with a lambda to compute on each row of a DataFrame, returning a Series.
statsSeries = matchData.apply(lambda row: mytest(row), axis=1)
where mytest(row) is a function that returns timestamp, float, float.
def mytest(row):
    timestamp = row['timestamp']
    wicketsPerOver = row['wickets'] / row['overs']
    runsPerWicket = row['runs'] / row['wickets']
    return timestamp, wicketsPerOver, runsPerWicket
As I have written it, statsSeries contains two columns: the index, and a column of tuples of (timestamp, wicketsPerOver, runsPerWicket).
How can I return a Series with three columns [timestamp, wicketsPerOver, runsPerWicket]?
It appears you need to apply pd.Series to the result, i.e. .apply(pd.Series).
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({0: [1, 2, 3, 4]})
def add_some(row):
    return row[0] + 1, row[0] + 2, row[0] + 3
df[[1, 2, 3]] = df.apply(add_some, axis=1).apply(pd.Series)
print(df)
0 1 2 3
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
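Alternatively, on pandas 0.23 or later you can skip the second apply and let df.apply expand the returned tuples itself via result_type='expand'; a sketch on the same toy data:

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 4]})

def add_some(row):
    return row[0] + 1, row[0] + 2, row[0] + 3

# result_type='expand' turns each returned tuple into its own set of columns
df[[1, 2, 3]] = df.apply(add_some, axis=1, result_type='expand')
print(df)

For the original question that would be matchData.apply(mytest, axis=1, result_type='expand'), which returns a DataFrame with one column per returned value.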

How to parse out array from column inside a dataframe?

I have a data frame that looks like this:
Index Values Digits
1 [1.0,0.13,0.52...] 3
2 [1.0,0.13,0.32...] 3
3 [1.0,0.31,0.12...] 1
4 [1.0,0.30,0.20...] 2
5 [1.0,0.30,0.20...] 3
My output should be:
Index Values Digits
1 [0.33,0.04,0.17...] 3
2 [0.33,0.04,0.11...] 3
3 [0.33,0.10,0.40...] 1
4 [0.33,0.10,0.07...] 2
5 [0.33,0.10,0.07...] 3
I believe the Values column holds a np.array within each cell? Is this technically an array?
I wish to parse out the Values column and divide all values within the array by 3.
My attempts have stopped at the parsing out of the values:
a = df(df['Values'].values.tolist())
IIUC, apply the list calculation
df.Values.apply(lambda x : [y/3 for y in x])
Out[1095]:
0 [0.3333333333333333, 0.043333333333333335, 0.1...
1 [0.3333333333333333, 0.043333333333333335, 0.1...
Name: Values, dtype: object
#df.Values=df.Values.apply(lambda x : [y/3 for y in x])
Created dataframe:
import pandas as pd
d = {'col1': [[1,10], [2,20]], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
created function:
def divide_by_3(lst):
    output = []
    for i in lst:
        output.append(i / 3.0)
    return output
apply function:
df.col1.apply(divide_by_3)
result:
0 [0.333333333333, 3.33333333333]
1 [0.666666666667, 6.66666666667]
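If all the lists have the same length, a vectorized alternative is to stack them into one 2-D NumPy array, divide once, and convert back to lists; a sketch assuming equal-length lists:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Values': [[1.0, 0.13, 0.52], [1.0, 0.13, 0.32]]})

# Stack the per-row lists into a single 2-D array, divide in one shot,
# then turn the rows back into lists
arr = np.vstack(df['Values'].tolist()) / 3
df['Values'] = arr.tolist()
print(df)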

Replace a list of timestamps in a pandas dataframe with a list of hours for every timestamp in every row

I have a column in a pandas dataframe that consists of lists containing timestamps. I want to replace this list of timestamps with a list of the hour of each timestamp for every row. Below is an example:
df = pd.DataFrame({'id': [1, 2],
                   'time': [['2017-09-05 03:34:51', '2016-03-07 05:24:55'],
                            ['2016-02-06 03:14:21', '2014-08-09 09:12:44', '2011-05-02 07:43:21']]})
I would like a new column named 'hour' where
df['hour'] = [ [3,5], [3,9,7] ]
I tried different functionalities using map() and apply(), but nothing produced the desired outcome; any help is very much appreciated.
Use apply + to_datetime.
s = df.time.apply(lambda x: pd.to_datetime(x, errors='coerce').hour.tolist())
s
0 [3, 5]
1 [3, 9, 7]
Name: time, dtype: object
df['hour'] = s
df
id time hour
0 1 [2017-09-05 03:34:51, 2016-03-07 05:24:55] [3, 5]
1 2 [2016-02-06 03:14:21, 2014-08-09 09:12:44, 201... [3, 9, 7]
Statutory warning, this is inefficient in general, because you have a column of lists.
If you want to know how I'd store this data, it'd be something like:
df
id time
0 1 2017-09-05 03:34:51
1 1 2016-03-07 05:24:55
2 2 2016-02-06 03:14:21
3 2 2014-08-09 09:12:44
4 2 2011-05-02 07:43:21
Now, getting the hour is as easy as:
h = pd.to_datetime(df.time).dt.hour
h
0 3
1 5
2 3
3 9
4 7
Name: time, dtype: int64
df['hour'] = h
If you want to perform group-wise computation, you can always use df.groupby.
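If your data already arrives in the list-of-timestamps shape, one way to get to this long format (assuming pandas 0.25 or later, where explode was added) is:

# One row per (id, timestamp) pair instead of one list per row
long_df = df.explode('time').reset_index(drop=True)
long_df['hour'] = pd.to_datetime(long_df['time']).dt.hour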

How to get non-matching data from 2 dataframes based on one column (pandas)

I have 2 dataframes; sample output is shown below.
My code for getting those and formatting the date column is here.
First df:
csv_data_df = pd.read_csv(os.path.join(path_to_data+'\\Data\\',appendedfile))
csv_data_df['Date_Formatted'] = pd.to_datetime(csv_data_df['DATE']).dt.strftime('%Y-%m-%d')
csv_data_df.head(3)
second df:
new_Data_df = pd.read_csv(io.StringIO(response.decode('utf-8')))
new_Data_df['Date_Formatted'] = pd.to_datetime(new_Data_df['DATE']).dt.strftime('%Y-%m-%d')
new_Data_df.head(3)
I want to construct a third dataframe where only the rows with non-matching dates from the second dataframe go into the third one.
Is there any method to do that? The formatted date column can be seen in the screenshot.
You could set the index of both dataframes to your desired join column, then
use df1.combine_first(df2). For your specific example, that could look like the line below.
csv_data_df.set_index('Date_Formatted').combine_first(new_Data_df.set_index('Date_Formatted')).reset_index()
Ex:
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'), index=list(range(1, 6)))
df2 = pd.DataFrame(np.random.randn(8, 3), columns=list('abc'))
df
Out[10]:
a b c
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
df2
Out[11]:
a b c
0 1.732251 -1.977803 0.720292
1 0.048229 1.125277 1.016083
2 -1.684013 2.136061 0.553824
3 -0.022957 1.237249 0.236923
4 -0.998079 1.714126 1.291391
5 0.955464 -0.049673 1.629146
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
df.combine_first(df2)
Out[13]:
a b c
0 1.732251 -1.977803 0.720292
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
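If what you literally want is only the rows of the second dataframe whose date does not appear in the first (an anti-join rather than a combination), a boolean isin mask may be closer; a sketch using the column names from the question:

# Keep only the rows of new_Data_df whose Date_Formatted is absent from csv_data_df
mask = ~new_Data_df['Date_Formatted'].isin(csv_data_df['Date_Formatted'])
third_df = new_Data_df[mask].reset_index(drop=True)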
