I have a column in a dataframe as follows:
Data
[special_request=nowhiterice, waiter=Janice]
[allegic=no, waiter=Janice, tip=20]
[allergic=no, tip=20]
[special_request=nogreens]
May I know how I could make it so that each key becomes one column?
special_request allegic waiter tip
You can build a dictionary from each row by splitting its elements, and construct your DataFrame from those dictionaries (s being your column here):
import pandas as pd
s = pd.Series([['special_request=nowhiterice', 'waiter=Janice'],
               ['allegic=no', 'waiter=Janice', 'tip=20'],
               ['allergic=no', 'tip=20'],
               ['special_request=nogreens']])
df = pd.DataFrame([dict(e.split('=') for e in row) for row in s])
print(df)
Output:
special_request waiter allegic tip allergic
0 nowhiterice Janice NaN NaN NaN
1 NaN Janice no 20 NaN
2 NaN NaN NaN 20 no
3 nogreens NaN NaN NaN NaN
Edit: if the column values are actual strings, you should first split each string (also stripping the brackets [, ] and whitespace):
s = pd.Series(['[special_request=nowhiterice, waiter=Janice]',
               '[allegic=no, waiter=Janice, tip=20]',
               '[allergic=no, tip=20]',
               '[special_request=nogreens]'])
df = pd.DataFrame([dict(map(str.strip, e.split('=')) for e in row.strip('[]').split(',')) for row in s])
print(df)
You can split each string value of the column into a dict, then use pd.json_normalize to convert the dicts to columns.
df_ = pd.json_normalize(df['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.strip("[]").split(',')])))
print(df_)
special_request waiter allegic tip allergic
0 nowhiterice Janice NaN NaN NaN
1 NaN Janice no 20 NaN
2 NaN NaN NaN 20 no
3 nogreens NaN NaN NaN NaN
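Note that every value produced by this splitting is a string; if you need tip as a number, a small follow-up sketch (using the df_ from above) could be:

df_['tip'] = pd.to_numeric(df_['tip'], errors='coerce')  # non-numeric entries become NaN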
I'm working on a dataframe whose column values are key-value pairs. Is there a way to make the keys the column names while keeping only the values in the columns?
Currently I have something like this:
0 1 2
{'1536235175000': 26307.9} {'1536235176000': 0} {'1536236701000': 2630}
{'1536239919000': 1028127} {'1536239921000': 0} NaN
{'1536242709000': 2629.6} {'1536242711000': 0} NaN
If you want to keep the row index, you can aggregate every row into a list and explode it; stacking and unstacking afterwards collapses the duplicated index back into one row per original row.
obj = df.apply(lambda x: list(x), axis=1).explode().dropna()   # one single-key dict per row
dfn = pd.DataFrame(obj.tolist(), index=obj.index)              # expand dicts into columns, keep the original index
dfn.stack().unstack()                                          # merge rows that share the same index
# 1536235175000 1536235176000 1536236701000 1536239919000 \
# 0 26307.9 0.0 2630.0 NaN
# 1 NaN NaN NaN 1028127.0
# 2 NaN NaN NaN NaN
# 1536239921000 1536242709000 1536242711000
# 0 NaN NaN NaN
# 1 0.0 NaN NaN
# 2 NaN 2629.6 0.0
Check with concat
pd.concat([pd.Series(df[x].tolist()) for x in df.columns], keys=df.columns, axis=1)
I'm new to scraping websites.
url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
df = pd.read_html(url, parse_dates=[0])[0]
print (df.head())
This is my code, and I want to extract all the data from this page, but the result is always just the first table:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 Collection Period Beginning: NaN NaN 08/01/2020 NaN
2 Collection Period Ending: NaN NaN 08/31/2020 NaN
3 Previous Payment/Close Date: NaN NaN 08/17/2020 NaN
4 Payment Date NaN NaN 09/15/2020 NaN
How can I get all the rest?
pd.read_html returns a list of all the tables on the page. You are only taking the first element of that list, so it gives you a single df.
Try:
df = pd.read_html(url, parse_dates=[0])
df1 = df[0]
df2 = df[1]
... and so on to read each df from the list. df holds the list, and you can access each DataFrame by its index in that list.
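For instance, a minimal sketch that collects and prints every table from the page (using the url from the question):

import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
tables = pd.read_html(url, parse_dates=[0])  # one DataFrame per <table> on the page

print(len(tables))  # how many tables were found
for i, table in enumerate(tables):
    print(f'--- table {i} ---')
    print(table.head())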
I'm trying to do the following:
Given a row in df1, if str(row['code']) appears in any row of df2['code'], then I would like all those rows' df2['lamer_url_1'] and df2['shopee_url_1'] to take the corresponding values from df1.
Then carry on with the next row of df1['code']...
'''
==============
Initial Tables:
df1
code lamer_url_1 shopee_url_1
0 L61B18H089 b a
1 L61S19H014 e d
2 L61S19H015 z y
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 NaN NaN NaN NaN
1 L61S19H014-S1500 NaN NaN NaN NaN
2 L61B18H089-F1424 NaN NaN NaN NaN
==============
Expected output:
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 b a NaN NaN
1 L61S19H014-S1500 e d NaN NaN
2 L61B18H089-F1424 b a NaN NaN
'''
I assumed that the common part of "code" in "df2" is the characters before "-". I also assumed that from "df1" we want 'lamer_url_1' and 'shopee_url_1', and from "df2" we want 'lamer_url_2' and 'shopee_url_2' (correct me in a comment if I am wrong so I can polish the code):
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']
df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
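If you would rather write the URLs back into df2 itself (as in the expected output) instead of building a merged df3, a map-based sketch could look like this, still assuming the shared key is the part of code before "-", that codes in df1 are unique, and starting from the original (un-indexed) frames:

key = df2['code'].str.split('-').str[0]   # 'L61B18H089-F1424' -> 'L61B18H089'
lookup = df1.set_index('code')
df2['lamer_url_1'] = key.map(lookup['lamer_url_1'])
df2['shopee_url_1'] = key.map(lookup['shopee_url_1'])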
I have a 5x500k pandas dataframe and want to locate outlier indexes where the content is an abnormally long string of characters.
for col in df.columns:
    print(df[col].apply(str).map(len).max())  # finds the max length of a string in column col
    print(df[col].apply(str).map(len))        # gives the length of every string in column col
What I would like to do is find the longest string in each column and set it to NaN if there are no other strings with the same length (i.e. there are not multiple longest strings), and also save the index of this value. I want to repeat this for each column until no column has any "uniquely long" strings.
Example input:
a b c d e
0 NaN 54674054 6613722414 2330536 NaN
1 NaN 1234 asdf 2339933 NaN
2 14242 423124 gsdgsgdfgaadfg sdaasda NaN NaN
3 342543 214124 NaN 1231 978ad6f7d8yv 6767969
4 4123 512353 SDFAGdssd 12 87612378y8q7ssdy
5 4473 32325 as asfsda NaN NaN
Should Output:
a b c d e
0 NaN NaN 6613722414 2330536 NaN
1 NaN 1234 asdf 2339933 NaN
2 NaN 423124 NaN NaN NaN
3 NaN 214124 NaN 1231 NaN
4 4123 512353 SDFAGdssd 12 NaN
5 4473 32325 as asfsda NaN NaN
I would like to clear obvious long-string anomalies from my big dataset. Is it possible to do such an operation easily with pandas?
Maybe a more general version of the question would be: how can I find the index and value of all the longest strings in a pandas dataframe column, and not just the first occurrence of the longest string?
Thank you very much,
Karl
I've not tried it but I think you are looking for something like this:
for col in df.columns:
    # find the indices; keep="all" means keep all occurrences of the maximum length
    idxs = df[col].astype(str).str.len().nlargest(1, keep="all").index
    # get the values at those indices
    values = df.loc[idxs, col]
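Building on that, a sketch of the full step you describe (blank out the value only when it is uniquely long, and remember its index), for a single pass over the columns; the removed dict is just an illustrative name:

import numpy as np

removed = {}  # column -> (index, value) of the uniquely long string
for col in df.columns:
    lengths = df[col].astype(str).str.len()
    idxs = lengths.nlargest(1, keep="all").index  # all rows sharing the maximum length
    if len(idxs) == 1:  # only a uniquely long string counts as an outlier
        idx = idxs[0]
        removed[col] = (idx, df.loc[idx, col])
        df.loc[idx, col] = np.nan

Repeating this pass until removed comes back empty would match the "until no column has any uniquely long strings" requirement.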
I have a dataframe with information about people. However, sometimes the same person is repeated, and some rows have more info about that person than others. Is there a way to drop the duplicates using the 'Name' column as reference, but keep only the most filled rows?
If you have a dataframe like
df = pd.DataFrame([['a', np.nan, np.nan, 'M'],
                   ['a', 12, np.nan, 'M'],
                   ['c', np.nan, np.nan, 'M'],
                   ['d', np.nan, np.nan, 'M']],
                  columns=['Name', 'Age', 'Region', 'Gender'])
Sorting the rows by NaN count and then dropping duplicates on 'Name', keeping the first, might help, i.e.:
df['count'] = pd.isnull(df).sum(axis=1)
df = df.sort_values('count').drop_duplicates(subset=['Name'], keep='first').drop(columns='count')
Output:
Before:
Name Age Region Gender
0 a NaN NaN M
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
After:
Name Age Region Gender
1 a 12.0 NaN M
2 c NaN NaN M
3 d NaN NaN M
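If you would rather combine the information spread across the duplicate rows instead of keeping just the most complete one, a groupby sketch (assuming 'Name' is the key) is another option:

# GroupBy.first() takes the first non-null value per column within each Name group
df_combined = df.groupby('Name', as_index=False).first()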