Python Pandas - findall() not returning all columns - python

I have the following data in my DataFrame
I want to create another DataFrame by applying a filter (get all rows that belong to the current month).
To achieve this I am using the following code :
search = '2021-07'
month_dataframe=all_dataframe["date"].str.findall(str(search))
I think the filter is running properly, because when I print month_dataframe I can see the data, but it only returns the date column. I want all of the columns from all_dataframe.

Input df
Name date address
0 A 2021-07-03 X
1 B 2021-07-03 Y
2 C 2021-07-03 Z
3 D 2021-08-01 M
4 E 2021-08-01 N
5 F 2021-08-01 O
If date col is object/string type and we want to keep it as is
search = '2021-07'
month_dataframe = df[df["date"].str.contains(str(search))]
month_dataframe
If the date col is object/str type and we are okay with converting it to datetime
The benefit in this case is that we don't need to define the variable search = '2021-07', and this solution works for whichever month is current.
df['date'] = pd.to_datetime(df['date'])
today = pd.Timestamp('today')
# compare the year as well, otherwise the same month of every year matches
month_dataframe = df[(df['date'].dt.year == today.year) & (df['date'].dt.month == today.month)]
month_dataframe
If date col is datetime type
(Just an option, not the best way)
search = '2021-07'
month_dataframe = df[df["date"].astype(str).str.contains(str(search))]
month_dataframe
Output
Name date address
0 A 2021-07-03 X
1 B 2021-07-03 Y
2 C 2021-07-03 Z
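For context on why the original findall() call seemed to lose columns: str.findall() returns a Series of match lists built from the date column alone, whereas indexing the DataFrame with a boolean mask keeps every column. A minimal sketch, assuming the sample data above (str.startswith is used here as a stricter stand-in for str.contains):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("ABCDEF"),
    "date": ["2021-07-03"] * 3 + ["2021-08-01"] * 3,
    "address": list("XYZMNO"),
})
search = "2021-07"

# findall gives one list of regex matches per row, not a filtered frame
matches = df["date"].str.findall(search)

# a boolean mask selects whole rows, so all columns survive
mask = df["date"].str.startswith(search)
month_dataframe = df[mask]
print(month_dataframe)
```

The mask can be any boolean Series aligned with df, so the same pattern works with str.contains or a datetime comparison.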

Try converting the date column to a datetime type and then filtering on it.
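One way this suggestion might look in practice is to compare monthly periods, which checks year and month in a single step. This is only a sketch on the sample data from the question; the target variable is a hypothetical name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("ABCDEF"),
    "date": ["2021-07-03"] * 3 + ["2021-08-01"] * 3,
    "address": list("XYZMNO"),
})
df["date"] = pd.to_datetime(df["date"])

# Hypothetical target month; for the current month one could use
# pd.Timestamp("today").to_period("M") instead.
target = pd.Period("2021-07", freq="M")

# to_period("M") reduces each date to its year-month
month_dataframe = df[df["date"].dt.to_period("M") == target]
print(month_dataframe)
```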

Related

How can I output a row of dataframe in python, numpy array? [duplicate]

I have a Pandas DataFrame with a mix of screen names, tweets, favs etc. I want to find the max value of 'favcount' (which I have already done) and also return the screen name of that tweet.
df = pd.DataFrame()
df['timestamp'] = timestamp
df['sn'] = sn
df['text'] = text
df['favcount'] = fav_count
print(df)
print('------')
print(df['favcount'].max())
I can't seem to find anything on this. Can anyone help guide me in the right direction?
Use idxmax() to get the index of the max value. Then you can use loc to pick the screen name from that row:
df.loc[df['favcount'].idxmax(), 'sn']
Edit: argmax() is now deprecated, switching for idxmax()
I think you need idxmax - get index of max value of favcount and then select value in column sn by loc:
df = pd.DataFrame({'favcount':[1,2,3], 'sn':['a','b','c']})
print (df)
favcount sn
0 1 a
1 2 b
2 3 c
print (df.favcount.idxmax())
2
print (df.loc[df.favcount.idxmax()])
favcount 3
sn c
Name: 2, dtype: object
print (df.loc[df.favcount.idxmax(), 'sn'])
c
Using the same df as above:
df = pd.DataFrame({'favcount':[1,2,3], 'sn':['a','b','c']})
print (df)
favcount sn
0 1 a
1 2 b
2 3 c
## You can use max()
print(df[df.favcount.max() == df['favcount']])
favcount sn
2 3 c
## If you need specific column you can select it
print(df[df.favcount.max() == df['favcount']].sn)
2 c
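The two approaches above differ when the maximum is tied: idxmax() returns the label of the first occurrence, while the boolean mask returns every matching row. A small sketch with made-up data to illustrate the difference:

```python
import pandas as pd

# Made-up data with a tie for the highest favcount
df = pd.DataFrame({"favcount": [1, 3, 3], "sn": ["a", "b", "c"]})

# idxmax: label of the first row holding the maximum
first_top = df.loc[df["favcount"].idxmax(), "sn"]

# mask: every row whose favcount equals the maximum
all_top = df.loc[df["favcount"] == df["favcount"].max(), "sn"].tolist()

print(first_top)
print(all_top)
```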

Distinguish repeating column names by adding an integer using pandas

I have some columns that have the same names. I would like to add a 1 to the repeating column names
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can utilize mask, but not sure how to efficiently achieve this without calling out the actual column. I would like to call the full dataset and allow the algorithm to update the dupe.
Any suggestion is appreciated
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
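This is what pandas uses to deduplicate column headers from read_csv(). Note that it is a private method (leading underscore), so it may change or disappear between pandas versions.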
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
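You need to apply the duplicated operation to the column names, then map the duplication flags to a string suffix that is added to the original column names. Note that this appends the same '1' to every repeat, so a name that occurs three or more times is still not unique.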
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
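If relying on a private parser method is a concern, the same renaming can be sketched in plain Python. The helper name dedup_columns and its sep parameter are hypothetical, introduced only for this illustration; the sketch does not guard against a generated name colliding with an existing column.

```python
import pandas as pd

def dedup_columns(columns, sep="."):
    """Append a running counter to repeated names: hi, hi.1, hi.2, ..."""
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # the first occurrence keeps its name, later ones get a suffix
        result.append(name if count == 0 else f"{name}{sep}{count}")
        seen[name] = count + 1
    return result

df = pd.DataFrame(
    [["1/1/2022", "a", 0, 0, 1, 1, 0]],
    columns=["Date", "Type", "hi", "hello", "stat", "hi", "hello"],
)
# sep="" gives the hi1 / hello1 style asked for in the question
df.columns = dedup_columns(df.columns, sep="")
print(df.columns.tolist())
```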

How to check the dates in different columns?

I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture)
Do you have any idea what I should do (a loop, a new dataframe...?)
Code
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
I would start by converting the date columns to datetime to be extra sure:
date_cols = ['basecamp_date', 'highpoint_date', 'termination_date']
for x in date_cols:  # convert only the date columns, not every column in df
    df[x] = pd.to_datetime(df[x])
And then follow it by the comparison using np.where()
df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']),True,False)
EDIT: Based on OPs follow up question I propose the following solution:
filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]
Example:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaN 2008-05-03 2008-06-03
2 2008-01-04 2008-01-01 2009-01-01
Rows 0 and 1 are kept: row 0 satisfies the date conditions and row 1 has a null value, which the filter lets through. Row 2 is dropped because its basecamp date is after its highpoint date. Using the proposed code, the output is:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaT 2008-05-03 2008-06-03
You should convert your data to datetime format:
df['date_col'] = pd.to_datetime(df['date_col'])
and then do like this:
df[df['date_col1'] < df['date_col2']]
In your case, replace date_col, date_col1 and date_col2 with your own column names.
All the other answers are working solutions but I find this much easier:
df.query("basecamp_date < highpoint_date and basecamp_date < termination_date")
Use Series.ge to flag rows where basecamp_date is not before the other dates, chain those masks with | for bitwise OR, and combine with & so that only rows with all three dates present are flagged:
m = ((df['basecamp_date'].ge(df['highpoint_date']) |
      df['basecamp_date'].ge(df['termination_date'])) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(axis=1))
To select the flagged rows:
df1 = df[m]
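To see how these masks behave together, here is a small sketch on the three-row example shown earlier in this thread. The names in_order, has_missing, valid and suspect are hypothetical, chosen for this illustration. Comparisons involving NaT evaluate to False, so rows with missing dates need their own mask.

```python
import pandas as pd

df = pd.DataFrame({
    "basecamp_date": ["2008-01-04", None, "2008-01-04"],
    "highpoint_date": ["2008-05-01", "2008-05-03", "2008-01-01"],
    "termination_date": ["2008-04-05", "2008-06-03", "2009-01-01"],
})
date_cols = ["basecamp_date", "highpoint_date", "termination_date"]
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# True where basecamp is strictly before both other dates
in_order = (df["basecamp_date"].lt(df["highpoint_date"])
            & df["basecamp_date"].lt(df["termination_date"]))
# True where any of the three dates is missing
has_missing = df[date_cols].isna().any(axis=1)

valid = df[in_order]                    # dates present and in order
suspect = df[~in_order & ~has_missing]  # dates present but out of order
print(valid.index.tolist(), suspect.index.tolist())
```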

Changing format of date in pandas dataframe

I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
Ok so you want to select Mon-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is lower than 5 (Mon-Friday --> 0-4)
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': [
        '20180101',
        '20180102',
        '20180103',
        '20180104',
        '20180105',
        '20180106',
        '20180107'
    ],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
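The question also asks how to turn the yyyymmdd strings into real dates. A hedged sketch, using made-up rows, that stores the parsed dates in the column with an explicit format string and then filters on working days:

```python
import pandas as pd

# Made-up rows: a Friday, a Saturday and a Monday
df = pd.DataFrame({
    "date": ["20180105", "20180106", "20180108"],
    "value": [1, 2, 3],
})

# An explicit format avoids relying on format inference
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

# dayofweek: Monday = 0 ... Sunday = 6, so working days are 0-4
workdays = df[df["date"].dt.dayofweek < 5]
print(workdays.assign(day=workdays["date"].dt.day_name()))
```

Once the column is datetime, the .dt accessor covers most date handling, so no extra library is strictly needed for this task.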

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far, managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assume df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
The documentation of groupby() and its application is here. It's similar to group-by function in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day to week
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only columns of 'Group' and 'Week', leaving the 'value' out. Assume now your initialized new dataframe is result, you can now do a join
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaN values in the 'value' column with the nearby element. The NaN values are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you.
Here is how you can change d2w_map to convert a date string to an integer day of the week
from datetime import datetime
def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
A returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse
def d2w_map(day_str):
    return parse(day_str).weekday()
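The same lookups can also be done on a whole column at once with the pandas .dt accessor, which avoids calling a Python function per row. A brief sketch using three dates from the question's panel (the ISO week number is shown as one possible definition of "week", not necessarily the one the question intends):

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["1/1/2000", "1/17/2000", "1/9/2000"]), format="%m/%d/%Y"
)

weekday = dates.dt.dayofweek            # 0 = Monday ... 6 = Sunday
iso_week = dates.dt.isocalendar().week  # ISO 8601 week number

print(pd.DataFrame({"date": dates, "weekday": weekday, "iso_week": iso_week}))
```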
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
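# resample(how=..., fill_method=...) was removed from pandas; chain the calls instead
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W')['value']
             .sum(min_count=1)         # weeks without observations stay NaN
             .groupby(level='Group')
             .ffill())                 # carry the last weekly total forward
ts = df_temp.groupby(level='date').sum()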
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate datafile as follows. Don't have time now. Just play around with it to get it right.
import pandas as pd
import datetime
time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
dates_right_format = [datetime.datetime.strptime(s, time_format) for s in dates]
values = Y['value']
X = pd.DataFrame(values)
X.index = dates_right_format
print(X)
X = X.sort_index()  # DataFrame.sort() was removed from pandas
print(X)
# how=sum is no longer accepted; min_count=1 keeps empty weeks as NaN
print(X.resample('W', closed='right', label='right').sum(min_count=1))
Last print
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37
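For comparison, here is one possible way to reproduce the final result from the question without writing the intermediate table by hand. It is a sketch under two assumptions that the question only implies: weeks are 7-day bins counted from the earliest date in the panel, and "interpolated" means carrying the last value forward between a group's first and last observation. The names wide, inside_range, filled and ts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "B", "B", "C"],
    "Date": ["1/1/2000", "1/17/2000", "1/9/2000", "1/23/2000", "1/22/2000"],
    "value": [5, 10, 3, 7, 20],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Assumption: weeks are 7-day bins counted from the earliest date.
start = df["Date"].min()
df["week"] = start + pd.to_timedelta((df["Date"] - start).dt.days // 7 * 7, unit="D")

# One column per group, one row per week (NaN where nothing was observed).
wide = df.pivot_table(index="week", columns="Group", values="value", aggfunc="sum")
wide = wide.reindex(pd.date_range(start, wide.index.max(), freq="7D"))

# Carry values forward only up to each group's last observation,
# then treat every remaining gap as zero.
inside_range = wide.bfill().notna()
filled = wide.ffill().where(inside_range).fillna(0)

ts = filled.sum(axis=1)
print(ts)
```

On the sample panel this gives 5, 8, 13 and 27 for the weeks starting 1/1, 1/8, 1/15 and 1/22, matching the table in the question.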
