I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you're intent on working with a time series data. I would recommend converting it to a time series, and then forward filling the blanks
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df.Time = pd.to_datetime(df.Time, errors = 'coerce')
df.fillna(method='ffill')
However, if you are getting this data from a csv file or something where you use pandas.read_* function you should use the na_values argument in those functions to specify unknown as a NA value
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
you can also pass a list instead of the string, and it adds the words passed to already existing list of NA values
However, if you want to keep the column a string, I would recommend just doing a find and replace
df.Time = np.where(df.Time == 'unknown', df.Time.shift(),df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()
Related
I'm currently trying to write a program that takes a chemical compound's identifier (something called the CID number) and then gives back the compound's properties by using the pubchempy documentation.
However, I keep getting an error when I try to merge the data values that I get from pubchempy to the initial database.
This is the code I've written for now:
import pandas as pd
import pubchempy
import numpy as np
df = pd.read_csv("Data.tsv.txt", sep="\t")
from pubchempy import get_properties
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))
df = df.drop(df[df.CID=='nan'].index)
df = df.drop(labels='reference', axis=1)
df = df.drop(labels='group', axis=1)
df = df.drop(labels='comments', axis=1)
df = df.drop(labels='compound_name', axis=1)
props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight', 'HBondAcceptorCount']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
df = df.merge(df2)
print(df)
However, I get an error message that says,
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
Does anyone know how to fix this?
Few lines of text file (datafile):
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
Few lines of df2 output:
CID MolecularWeight HBondDonorCount HBondAcceptorCount RotatableBondCount
0 5339 398.4 3 9 6
1 3889 520.5 4 13 9
2 2788 305.50 1 2 0
3 1422517 440.5 0 8 4
4 18595497 461.5 5 10 3
It seems you want to merge two dataframes on CID column. CID column type of df2 is int, you need change it to object to match the type of CID in df
df = df.merge(df2.astype({'CID': str}), on='CID')
I have the the below df build from a pivot of a larger df. In this table 'week' is the the index (dtype = object) and I need to show week 53 as the first row instead of the last
Can someone advice please? I tried reindex and custom sorting but can't find the way
Thanks!
here is the table
Since you can't insert the row and push others back directly, a clever trick you can use is create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
Ok so you want to select Mon-Friday. Do that by converting your column to datetime and check if the dt.dayofweek is lower than 6 (Mon-Friday --> 0-4)
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
'date': [
'20180101',
'20180102',
'20180103',
'20180104',
'20180105',
'20180106',
'20180107'
],
'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
Given this DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Date':['20/03/17 10:30:34','20/03/17 10:31:24','20/03/17 10:34:34'],
'Value':[4,7,5]})
df['Date'] = pd.to_datetime(df.Date)
df
Out[53]:
Date Value
0 2017-03-20 10:30:34 4
1 2017-03-20 10:31:24 7
2 2017-03-20 10:34:34 5
Im am trying to extract the max value and its index. I can get the max value by df.Value.max() but when I use df.idxmax() to get the Index fo the value I get a TypeError:
TypeError: float() argument must be a string or a number
Is there any other way to get the Index of the Max value of a Dataframe? (Or any way to correct this one?)
Because it should be:
df.Value.idxmax()
It then returns 1.
If you only care about the Value column, you can use:
df.Value.idxmax()
>>> 1
However, it is strange that it fails on both columns with df.idxmax() as the following works, too:
df.Date.idxmax()
>>> 2
df.idxmax() also works for some other dummy data:
dummy = pd.DataFrame(np.random.random(size=(5,2)))
print(dummy)
0 1
0 0.944017 0.365198
1 0.541003 0.447632
2 0.583375 0.081192
3 0.492935 0.570310
4 0.832320 0.542983
print(dummy.idxmax())
0 0
1 3
dtype: int64
You have to specify from which column you want to get maximum value-idx.
To get the idx of maxumum value use:
df.Value.idxmax()
if you want to get the idx of maximum Date use:
df.Date.idxmax()
I have a problem with adding columns in pandas.
I have DataFrame, dimensional is nxk. And in process I wiil need add columns with dimensional mx1, where m = [1,n], but I don't know m.
When I try do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different length?
If you use accepted answer, you'll lose your column names, as shown in the accepted answer example, and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To save column names, use pandas.concat, but don't ignore_index (default value of ignore_index is false; so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add the different size of list values to DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the Length of all list
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all according to the determined max length (not in this example
if not max_len == la:
a.extend(['']*(max_len-la))
if not max_len == lb:
b.extend(['']*(max_len-lb))
if not max_len == lc:
c.extend(['']*(max_len-lc))
Now the all list is same length and create dataframe
pd.DataFrame({'A':a,'B':b,'C':c})
Final Output is
A B C
0 1 0 1
1 2 1
2 3 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
I had the same issue, two different dataframes and without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
If somebody like to replace a specific column of a different size instead of adding it.
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, list: list, column: str):
dict_from_list = dict(enumerate(list)) # create enumertable object from list and create dict
dataFrame_asDict = dataframe.to_dict() # Get DataFrame as Dict
dataFrame_asDict[column] = dict_from_list # Assign specific column
return pd.DataFrame.from_dict(dataFrame_asDict, orient='index').T # Create new DataSheet from Dict and return it