I think I am missing something very basic about the pandas DataFrame:
I have the following program:
array = {'A':[1, 2, 3, 4], 'B':[5, 6, 7, 8]}
index = pd.DatetimeIndex(
[ '09:30',
'09:31',
'09:32',
'09:33' ])
data = pd.DataFrame(array, index=index)
data.index = data.index.strftime('%H:%M')
print(data)
print(data.loc('09:33'))
I get:
A B
09:30 1 5
09:31 2 6
09:32 3 7
09:33 4 8
Which is great, but I cannot access a row using its index '09:33'; instead I get:
ValueError: No axis named 09:33 for object type <class 'pandas.core.frame.DataFrame'>
What am I missing?
Thank you,
Ehood
You need to use brackets [] instead: data.loc['09:33']
import pandas as pd

array = {'A':[1, 2, 3, 4], 'B':[5, 6, 7, 8]}
index = pd.DatetimeIndex(
[ '09:30',
'09:31',
'09:32',
'09:33' ])
data = pd.DataFrame(array, index=index)
data.index = data.index.strftime('%H:%M')
print(data)
print(data.loc['09:33']) # HERE!
Output:
A B
09:30 1 5
09:31 2 6
09:32 3 7
09:33 4 8
A 4
B 8
Name: 09:33, dtype: int64
You're close. When using loc or iloc, it's followed by square brackets, like this:
df.loc['viper']
In your case that ends up like this:
data.loc['09:33']
You can find the documentation for that here.
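For completeness, here is a small sketch contrasting label-based and position-based lookup (the 'viper' row mirrors the docs example; the values are made up for illustration):
import pandas as pd
df = pd.DataFrame({'max_speed': [4, 389]}, index=['cobra', 'viper'])
print(df.loc['viper'])   # label-based lookup with square brackets
print(df.iloc[1])        # position-based lookup of the same row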
Related
I have the following Dataframe:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
So for example the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index                                            # rows where column 'f' equals 7
df = pd.DataFrame(np.insert(df.values, ind + 1, values=[33], axis=0))  # insert sentinel rows after each match
df.rename(columns=ren_dict, inplace=True)                               # restore the original column names
ind_empt = df['a'] == 33                                                # locate the sentinel rows
df[ind_empt] = ''                                                       # blank them out
print(df)
Output
a b f
0 1 1 1
1 2 2 7
2
3 3 3 3
4 4 4 4
5 5 5 7
6
Here the dataframe is rebuilt in one go rather than appended to row by row, since repeated appends are resource intensive. The sentinel value 33 is inserted because np.insert cannot place empty strings into the numeric array, and ind + 1 is used because np.insert puts new rows before the given positions, so the blank row lands after each match. The columns are then renamed back to their original names with df.rename, and finally the rows where df['a'] == 33 are set to empty strings.
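As a small standalone illustration of the offset (np.insert places new values before the given positions; the toy array below is just for demonstration):
import numpy as np
arr = np.array([[1], [2], [3]])
print(np.insert(arr, [1], 99, axis=0))      # 99 ends up before row 1
print(np.insert(arr, [1 + 1], 99, axis=0))  # 99 ends up after row 1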
I want to make a dataframe with defined labels, but I don't know how to tell pandas to take the labels from the list. Hope someone can help.
import numpy as np
import pandas as pd
df = []
thislist = []
thislist = ["A","D"]
thisdict = {
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9],
"D": [7, 8, 9]
}
df = pd.DataFrame(data= thisdict[thislist]) # <- here is my problem
I want to get this:
df = A D
1 7
2 8
3 9
Use:
df = pd.DataFrame(thisdict)[thislist]
print(df)
A D
0 1 7
1 2 8
2 3 9
We could also use DataFrame.drop
df = pd.DataFrame(thisdict).drop(columns = ['B','C'])
or DataFrame.reindex
df = pd.DataFrame(thisdict).reindex(columns = thislist)
or DataFrame.filter
df = pd.DataFrame(thisdict).filter(items=thislist)
We can also use Python's built-in filter on thisdict.items():
df = pd.DataFrame(dict(filter(lambda item: item[0] in thislist, thisdict.items())))
print(df)
A D
0 1 7
1 2 8
2 3 9
I think this answer is complemented by the solution of #anky_91.
Finally, I recommend reading up on how to index a DataFrame.
IIUC, use .loc[] with the dataframe constructor:
df = pd.DataFrame(thisdict).loc[:,thislist]
print(df)
A D
0 1 7
1 2 8
2 3 9
Use a dict comprehension to create a new dictionary that is a subset of your original so you only construct the DataFrame you care about.
pd.DataFrame({x: thisdict[x] for x in thislist})
A D
0 1 7
1 2 8
2 3 9
If you want to deal with the possibility of missing keys, add some logic so it behaves like reindex:
pd.DataFrame({x: thisdict[x] if x in thisdict else np.nan for x in thislist})
df = pd.DataFrame(thisdict)
df[['A', 'D']]
Another alternative for your input:
thislist = ["A","D"]
thisdict = {
"A": [1, 2, 3],
"B": [4, 5, 6],
"C": [7, 8, 9],
"D": [7, 8, 9]
}
df = pd.DataFrame(thisdict)
and then simply remove the columns not in thislist (you can do it directly on the df or collect them first):
remove_columns = []
for c in df.columns:
    if c not in thislist:
        remove_columns.append(c)
and remove it:
df.drop(columns=remove_columns, inplace=True)
I have a csv data file with a header indicating the column names.
xy wz hi kq
0 10 5 6
1 2 4 7
2 5 2 6
I run:
X = np.array(pd.read_csv('gbk_X_1.csv').values)
I want to get the column names:
['xy', 'wz', 'hi', 'kq']
I read this post but the solution provides me with None.
Use the following code:
import re

with open('f.csv', 'r') as f:
    alllines = f.readlines()

columns = re.sub(' +', ' ', alllines[0])  # collapse repeated spaces into single spaces
columns = columns.strip().split(' ')      # split on spaces to get the column names
print(columns)
Assume the CSV file is like this:
xy wz hi kq
0 10 5 6
1 2 4 7
2 5 2 6
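If the file really is whitespace-separated like this, pandas can also parse it directly and expose the header row (a sketch; the filename and the sep=r'\s+' option are assumptions about your file):
import pandas as pd
df = pd.read_csv('f.csv', sep=r'\s+')  # treat runs of whitespace as the delimiter
print(df.columns.tolist())             # ['xy', 'wz', 'hi', 'kq']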
Let's assume your csv file looks like
xy,wz,hi,kq
0,10,5,6
1,2,4,7
2,5,2,6
Then use pd.read_csv to dump the file into a dataframe
df = pd.read_csv('gbk_X_1.csv')
The dataframe now looks like
df
xy wz hi kq
0 0 10 5 6
1 1 2 4 7
2 2 5 2 6
Its three main components are the
data which you can access via the values attribute
df.values
array([[ 0, 10, 5, 6],
[ 1, 2, 4, 7],
[ 2, 5, 2, 6]])
index which you can access via the index attribute
df.index
RangeIndex(start=0, stop=3, step=1)
columns which you can access via the columns attribute
df.columns
Index(['xy', 'wz', 'hi', 'kq'], dtype='object')
If you want the columns as a list, use the tolist method
df.columns.tolist()
['xy', 'wz', 'hi', 'kq']
I have 2 dataframes:
import numpy as np
import pandas as pd

# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
'Id' :[ 10 , 9 ,np.nan , 14 , 3 ,np.nan, 7 ,np.nan]}
df1 = pd.DataFrame(data)
and
# dataframe 2
convert_table = {'XXX': ['ALLO','BELO','CACO','CUCO','DADO','FIGO','FIGO','ONGF','PALO','PALO','PINO','TNCO','TNCO','TNCO','TNTO']}
df2 = pd.DataFrame(convert_table)
My goal is to identify the indexes of the elements of df2['XXX'] which follow these conditions:
Are present in df1['Name']
Have the corresponding df1['Id'] equal to NaN
I was able to achieve my goal by using the following lines of code:
nan_names = df1['Name'][df1['Id'].isnull()]
df3 = pd.DataFrame()
for name in nan_names:
    index = df2[df2['XXX']==name].index.tolist()
    if index:
        dic = {'name':[name] , 'index':[index]}
        df3 = pd.concat([df3,pd.DataFrame(dic)], ignore_index=True)
However I would like to know if there is a more efficient and elegant way to achieve my goal.
The result should look like this:
index name
0 [11, 12, 13] TNCO
1 [5, 6] FIGO
Note: if a name is not found, there is no need to store any information for it.
You're looking for the method isin:
df = df2[df2['XXX'].isin(nan_names)]
This will return:
XXX
5 FIGO
6 FIGO
11 TNCO
12 TNCO
13 TNCO
From there it's just a matter of formatting:
df.reset_index().groupby('XXX')['index'].apply(list)
This will return:
XXX
FIGO [5, 6]
TNCO [11, 12, 13]
The idea is to reset the index so that it becomes a column (named index). Grouping by name and applying the list function will return the list of original indices for each name.
Calling reset_index once more will return the result you were looking for.
Edit
Combining everything into a one-liner, this is the output:
In [21]: df2[df2['XXX'].isin(nan_names)].reset_index().groupby('XXX')['index'].apply(list).reset_index()
Out[21]:
XXX index
0 FIGO [5, 6]
1 TNCO [11, 12, 13]
I think you can use merge with groupby and apply list:
nan_names = df1.loc[df1['Id'].isnull(), ['Name']]
print (nan_names)
Name
2 TNCO
5 FIGO
7 LABO
df = pd.merge(df2.reset_index().rename(columns={'XXX': 'Name'}), nan_names, on='Name')
print (df)
index Name
0 5 FIGO
1 6 FIGO
2 11 TNCO
3 12 TNCO
4 13 TNCO
print (df.groupby('Name')['index'].apply(list).reset_index())
Name index
0 FIGO [5, 6]
1 TNCO [11, 12, 13]
I'm trying to make a table, and the way Pandas formats its indices is exactly what I'm looking for. That said, I don't want the actual data, and I can't figure out how to get Pandas to print out just the indices without the corresponding data.
You can access the index attribute of a df using .index:
In [277]:
df = pd.DataFrame({'a':np.arange(10), 'b':np.random.randn(10)})
df
Out[277]:
a b
0 0 0.293422
1 1 -1.631018
2 2 0.065344
3 3 -0.417926
4 4 1.925325
5 5 0.167545
6 6 -0.988941
7 7 -0.277446
8 8 1.426912
9 9 -0.114189
In [278]:
df.index
Out[278]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
.index.tolist() is another way to get the index as a list:
In [1391]: datasheet.head(20).index.tolist()
Out[1391]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
You can access the index attribute of a df using df.index[i]
>> import pandas as pd
>> import numpy as np
>> df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)})
a b
0 0 1.088998
1 1 -1.381735
2 2 0.035058
3 3 -2.273023
4 4 1.345342
>> df.index[1] ## Second index
>> df.index[-1] ## Last index
>> for i in range(len(df)): print(df.index[i])  ## Using a loop
...
0
1
2
3
4
You can also collect the index values with a list comprehension:
index = [x for x in df.index]
print(index)
You can always try df.index; this attribute will show you the range index.
Or you can set your own index. Let's say you had a weather.csv file with the headers
'date', 'temperature' and 'event', and you want to set 'date' as your index.
import pandas as pd
df = pd.read_csv('weather.csv')
df.set_index('date', inplace=True)
df