I am looking through a DataFrame with different kinds of data whose usefulness I'm trying to evaluate. So I am going through each column and checking what kind of data it contains, e.g.
print(extract_df['Auslagenersatz'])
For some I get responses like this:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
263 NaN
264 NaN
265 NaN
266 NaN
267 NaN
I would like to check whether that column contains any information at all, so what I am looking for is something like
s = extract_df['Auslagenersatz']
print(s.loc[s == True])
where I am assuming that NaN is interpreted as False, in the same way an empty set is. I would like it to return only those elements of the series that satisfy this condition (i.e. that are not empty). However, the code above does not work: I get an empty result even for columns that I know have non-NaN entries.
I oriented myself with this post: How to select rows from a DataFrame based on column values, but I can't figure out where I'm going wrong or what to do instead. The problem comes up often, so any help is much appreciated.
import pandas as pd
df = pd.DataFrame({'A':[2,3,None, 4,None], 'B':[2,13,None, None,None], 'C':[None,3,None, 4,None]})
If you want to see the non-NA values of column A, then:
df[~df['A'].isna()]
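If the goal is only to check whether a column contains any information at all, a minimal sketch using the df defined above could be:

s = df['A']
print(s.notna().any())   # True if the column has at least one non-NaN entry
print(s.dropna())        # only the non-NaN elements of the series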
I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage is repeated across rows (which is OK), but is duplicated in some, but not all, columns. I want to remove the second entry if it is a duplicate, but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries of two columns side by side and delete the second if there is a duplicate?
The duration for this is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output (starting data frame example, de-duplicated data frame example).
This is a hack, similar to #Params answer, but it should be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat an operation as many times as needed to compare all of the columns, then transpose it back to get the result shown in the OP.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Monday':   ['exp_A','exp_A','exp_A','exp_A','exp_B',np.nan,np.nan,np.nan,'exp_D','exp_D'],
    'Tuesday':  ['exp_A','exp_A','exp_A','exp_A','exp_B','exp_B','exp_C','exp_C','exp_D','exp_D'],
    'Wednesday':['exp_A','exp_D',np.nan,np.nan,np.nan,'exp_B','exp_C','exp_C','exp_C',np.nan],
    'Thursday': ['exp_A','exp_D',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'exp_C',np.nan]
})

# Transpose so the days become rows; shifting then compares consecutive days.
df = df.T

# Blank a cell that repeats the previous day unless it also matches the day before
# that (an event lasts at most two days, so a third identical day starts a new event).
for i in range(int(np.ceil(df.shape[0] / 2))):
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan

# Transpose back to the original layout.
df = df.T

This gives:
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN
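Since the original goal was to count the number of experiments run, and after de-duplication each remaining non-NaN cell marks the start of one event, a quick tally could look like this (a sketch built on the de-duplicated df above):

# stack() drops the NaN cells, so this counts event starts per experiment name
print(df.stack().value_counts())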
I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use this to fill every NaN with a different value.
I created a lookup table:
srch_destination_id | Value
2 0.0110
3 0.0000
5 0.0207
7 NaN
8 NaN
9 NaN
10 0.1500
12 0.0114
This table contains, per srch_destination_id, the corresponding value to replace NaN with in the dataset.
# Iterate over the dataset row by row. If a value is missing (NaN), fill in the
# min. value found in the lookup table for that row's srch_destination_id.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        df.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?
pd.Series.fillna does accept a series or a dictionary, as well as scalar replacement values.
Therefore, you can create a series mapping from lookup:
s = lookup.set_index('srch_destination')['Value']
Then use this to fill in NaN values in dataset:
dataset['prop_loc'] = dataset['prop_loc'].fillna(dataset['srch_destination'].map(s.get))
Notice that the input to fillna is built by mapping the destination identifier from dataset through s, using pd.Series.map to perform the lookup.
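A self-contained sketch of the same idea, written with the full column names from the question (srch_destination_id and prop_location_score2) and made-up toy values:

import pandas as pd
import numpy as np

# Toy lookup table and dataset; the column names follow the question,
# the values are invented for illustration.
lookuptable = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 10],
    'Value': [0.0110, 0.0000, 0.0207, 0.1500],
})
dataset = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 10, 2],
    'prop_location_score2': [np.nan, 0.5, np.nan, 0.2, 0.3],
})

# Build the id -> replacement value mapping once
s = lookuptable.set_index('srch_destination_id')['Value']

# Fill every NaN in one vectorised call instead of a Python-level row loop
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s)
)
print(dataset)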
I have a data series which looks like this:
print mys
id_L1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
I would like to check if all the values are NaN.
My attempt:
pd.isnull(mys).all()
Output:
True
Is this the correct way to do it?
Yes, that's correct, but I think a more idiomatic way would be:
mys.isnull().all()
This will check all columns:
mys.isnull().values.all(axis=0)
if df['col'].count() > 0:
    ...  # the column contains at least one non-NaN value
This works well, but it can be quite a slow approach. I made the mistake of embedding this check in a loop that ran 6000 times to test four columns, and it was brutal; the blame clearly lies with the programmer :)
Obviously, don't be like me. Test your columns for all-null once, store the empty/not-empty result in a variable, and then loop.
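A minimal sketch of that advice (the DataFrame here is made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, np.nan], 'B': [1.0, np.nan]})

# Test each column for all-null once, keep the result, then reuse it inside any loop
is_empty = {col: df[col].isnull().all() for col in df.columns}
print(is_empty)   # {'A': True, 'B': False}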
I have a Pandas series. Basically one specific row of a pandas data frame.
Name: NY.GDP.PCAP.KD.ZG, dtype: int64
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff 0.522725
NY.GDP.MKTP.CN_logdiff 0.884601
NY.GDP.MKTP.KD_logdiff 0.990679
NY.GDP.MKTP.KD.ZG 0.992603
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff 0.856861
NY.GDP.MKTP.PP.KD_logdiff 0.990679
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff 0.523433
NY.GDP.PCAP.KD.ZG 1.000000
NY.GDP.PCAP.KN_logdiff 0.999456
NY.GDP.PCAP.PP.CD_logdff 0.857321
NY.GDP.PCAP.PP.KD_logdiff 0.999456
The first column is the index, as you would find in a series. Now I want to get all these index names in a list, keeping only those whose absolute value in the right-hand column is less than 0.5. For context, this series is the row corresponding to the variable NY.GDP.PCAP.KD.ZG in a correlation matrix, and I want to retain this variable along with those variables whose correlation with it is less than 0.5; the remaining variables I will drop from the dataframe.
Currently I do something like this, but it also keeps the NaN entries:
print(tourism[columns].corr().ix[14].where(np.absolute(tourism[columns].corr().ix[14]<0.5)))
where tourism is the data frame, columns is the set of columns on which I did the correlation analysis, and 14 is the row in the correlation matrix corresponding to the column mentioned above.
gives:
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff NaN
NY.GDP.MKTP.CN_logdiff NaN
NY.GDP.MKTP.KD_logdiff NaN
NY.GDP.MKTP.KD.ZG NaN
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff NaN
NY.GDP.MKTP.PP.KD_logdiff NaN
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff NaN
NY.GDP.PCAP.KD.ZG NaN
NY.GDP.PCAP.KN_logdiff NaN
NY.GDP.PCAP.PP.CD_logdff NaN
NY.GDP.PCAP.PP.KD_logdiff NaN
Name: NY.GDP.PCAP.KD.ZG, dtype: float64
If x is your series, then:
x[x.abs() < 0.5].index
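A small self-contained sketch of why this also takes care of the NaN entries (the series and its values here are made up): comparisons against NaN evaluate to False, so NaN rows are excluded automatically.

import pandas as pd
import numpy as np

x = pd.Series({'VAR_A': 0.34, 'VAR_B': np.nan, 'VAR_C': 0.99, 'VAR_D': -0.08})

# NaN < 0.5 is False, so the NaN entry is dropped along with the large correlations
keep = x[x.abs() < 0.5].index.tolist()
print(keep)   # ['VAR_A', 'VAR_D']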
I'm attempting to read a flat file into a DataFrame using pandas but can't seem to get the format right. My file has a variable number of fields per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
The field separator is |; I've pulled a list of all unique keys into keylist and am trying to use the following to read in the data:
keylist = ['TIME',
           'CHAN',
           # [truncated]
           'DURS',
           'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This builds the DataFrame incorrectly, since I'm not specifying any way to recognize the key label in each line. I'm a little stuck and not sure which direction to research; should I be using .read_json(), for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you may have to convert columns from strings to numeric dtypes after the fact.
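A sketch of that post-processing step, using the numeric-looking column names from the output above; pd.to_numeric with errors='coerce' turns anything unparsable into NaN:

# Convert numeric-looking string columns back to numbers
for col in ['DURS', 'INPT', 'LMAX', 'LUSED', 'OMAX', 'TXSZ', 'UCPU', 'SCPU']:
    df[col] = pd.to_numeric(df[col], errors='coerce')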