Splitting a URL and getting values from it into columns - python

Hi, say I have a column in a dataframe.
The column, named submission, contains values like mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
I want a column, say Zest, holding the value DDA1610095 from that URL,
and a new column, say type, holding .zip. How do I do that using pandas?

You can use str.split to extract the file name from the url:
df
                                                                             url
0  mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip

df['zip'] = df.url.str.split('/').str[-1]
df.T
Out[46]:
                                                                                 0
url   mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
zip                                                                 DDA1610095.zip

Try using str.split and add another str so you can index each row.
data = [{'ID': '1',
         'URL': 'https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip'}]
df = pd.DataFrame(data)
print(df)
  ID                                                URL
0  1  https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d...

# Get the file name and strip the '.zip' extension (probably a more elegant way to do this)
df['Zest'] = df.URL.str.split('/').str[-1].str.replace('.zip', '', regex=False)
# Assign the type into the next column.
df['Type'] = df.URL.str.split('.').str[-1]
print(df)
  ID                                                URL        Zest Type
0  1  https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d...  DDA1610095  zip
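Alternatively, both columns can be pulled out in one pass with str.extract and a regular expression. A minimal sketch, reusing the question's column names Zest and Type (the pattern assumes the file name is the last path segment of the URL):

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip"]
})

# Capture the file stem and the extension of the last path segment.
extracted = df["URL"].str.extract(r"/([^/.]+)\.([^/.]+)$")
df["Zest"] = extracted[0]
df["Type"] = extracted[1]
print(df[["Zest", "Type"]])
```

This avoids the double split and keeps both captures aligned in a single DataFrame.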

Related

How can I unnest a long column(map) to multiple columns with pandas?

I have a dataframe like this:
dataframe name: df_test

ID        Data
test-001  {"B":{"1":{"_seconds":1663207410,"_nanoseconds":466000000}},"C":{"1":{"_seconds":1663207409,"_nanoseconds":978000000}},"D":{"1":{"_seconds":1663207417,"_nanoseconds":231000000}}}
test-002  {"B":{"1":{"_seconds":1663202431,"_nanoseconds":134000000}},"C":{"1":{"_seconds":1663208245,"_nanoseconds":412000000}},"D":{"1":{"_seconds":1663203482,"_nanoseconds":682000000}}}
I want it to be unnested like this:

ID        B_1_seconds  B_1_nanoseconds  C_1_seconds  C_1_nanoseconds  D_1_seconds  D_1_nanoseconds
test-001  1663207410   466000000        1663207409   978000000        1663207417   231000000
test-002  1663202431   134000000        1663208245   412000000        1663203482   682000000
I tried df_test.explode, but it doesn't work for this.
I used Dataiku to unnest the data and it worked perfectly; now I want to unnest the data within my Python notebook. What should I do?
Edit:
I tried
df_list = df_test["Data"].tolist()
then
pd.json_normalize(df_list)
It returned an empty dataframe with only an index but no values in it.
Since pd.json_normalize returns an empty dataframe I'd guess that df["Data"] contains strings? If that's the case you could try
import json
df_data = pd.json_normalize(json.loads("[" + ",".join(df["Data"]) + "]"), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
or
df_data = pd.json_normalize(df["Data"].map(eval), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
Result for both alternatives is:
ID B_1_seconds B_1_nanoseconds C_1_seconds C_1_nanoseconds \
0 test-001 1663207410 466000000 1663207409 978000000
1 test-002 1663202431 134000000 1663208245 412000000
D_1_seconds D_1_nanoseconds
0 1663207417 231000000
1 1663203482 682000000
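Since eval executes arbitrary code, a safer drop-in for the second alternative is ast.literal_eval, which only parses Python literals. A minimal sketch on a shortened version of the question's data (it assumes the strings contain no JSON-only tokens such as true or null, which literal_eval cannot parse):

```python
import ast
import pandas as pd

df = pd.DataFrame({
    "ID": ["test-001", "test-002"],
    "Data": [
        '{"B":{"1":{"_seconds":1663207410,"_nanoseconds":466000000}}}',
        '{"B":{"1":{"_seconds":1663202431,"_nanoseconds":134000000}}}',
    ],
})

# literal_eval parses each string into a dict without executing code.
df_data = pd.json_normalize(df["Data"].map(ast.literal_eval), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(
    lambda c: c.replace("__", "_"), axis=1
)
print(res)
```

The rename is needed because the leaf keys already start with an underscore, so sep="_" produces names like B_1__seconds.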

How to add prefix to the selected records in python pandas df

I have a df where some of the records in a column contain a prefix and some do not. I would like to update the records without the prefix. Unfortunately, my script adds the desired prefix to every record in the df:
new_list = []
prefix = 'x'
for ids in df['ids']:
    if ids.find(prefix) < 1:
        new_list.append(prefix + ids)
How can I omit the records that already have the prefix?
I've tried df[df['ids'].str.contains(prefix)], but I'm getting an error.
Use Series.str.startswith for a mask and prepend values with numpy.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids': ['aaa', 'ssx', 'xwe']})
prefix = 'x'
df['ids'] = np.where(df['ids'].str.startswith(prefix), '', prefix) + df['ids']
print(df)
    ids
0  xaaa
1  xssx
2   xwe
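The same update can also be written without numpy, using the startswith mask with .loc; a minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({"ids": ["aaa", "ssx", "xwe"]})
prefix = "x"

# Select only the rows that do not already start with the prefix...
mask = ~df["ids"].str.startswith(prefix)
# ...and prepend the prefix to just those rows.
df.loc[mask, "ids"] = prefix + df.loc[mask, "ids"]
print(df)
```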

Using Panda, Update column values based on a list of ID and new Values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all the examples I have seen, the value is always the same or comes from another column. In my case, I have dynamic values.
This is what I would like:
file = 'something.csv'  # Has 300 rows
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]  # Sell values
csv = path_pattern = os.path.join(os.getcwd(), file)
df = pd.read_csv(file)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x]  # Update the rows with the corresponding Sell value of the ID.
df.to_csv(file)
Any ideas?
Thanks in advance.
Assuming 'id' is a string (as in IDList) and is not the index of your df:
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in IDList:
        df.loc[index, 'Sell'] = id_dict[row['id']]
If id is the index:
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in IDList:
        df.loc[index, 'Sell'] = id_dict[index]
What I did is create a dictionary from IDList and SellList and then loop over the df using iterrows().
df = pd.read_csv('something.csv')
IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
This will work efficiently, especially for large files:
df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList
df.reset_index(inplace=True)  # not mandatory, just in case you need 'id' back as a column
df.to_csv('something.csv')
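Another loop-free option, which also avoids touching the index, is to map the IDs to their new values with Series.map and fall back to the old column for unmatched rows. A small sketch on toy data (column names follow the question; the final astype is needed because fillna promotes the column to float):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["453164259", "453106168", "999"],
    "Sell": [10, 20, 30],
})
IDList = ["453164259", "453106168"]
SellList = [120, 270]

# Map each id to its new Sell value; unmatched rows keep their old value.
mapping = dict(zip(IDList, SellList))
df["Sell"] = df["id"].map(mapping).fillna(df["Sell"]).astype(int)
print(df)
```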

Column in DataFrame isn't recognised. Keyword Error: 'Date'

I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new dataframe, and one of the columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (and removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx', index=False)
df = df.rename(columns=dict(zip(df.columns, ['Date', 'Amount'])))
df.index = df['Date']
df = df[['Amount']]

# creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0, len(df)), columns=['Date', 'Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column, "Amount".
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and removed, and while the index has the name "Date", you can't use 'Date' as a key to get the index - you have to use data.index[i] instead of data['Date'][i].
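A minimal sketch of a fix along those lines: bring Date back as a real column with reset_index before copying values (the toy data here is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Amount": [10, 20]},
                  index=pd.Index(["2020-01-01", "2020-01-02"], name="Date"))

# reset_index turns the named index back into a regular 'Date' column,
# so data['Date'] no longer raises a KeyError.
data = df.sort_index(ascending=True, axis=0).reset_index()
new_data = data[["Date", "Amount"]].copy()
print(new_data)
```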
It seems that you have an error in the formatting of your Date column.
To check that you don't have an error on the name of the columns you can print the columns names:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to call "Fruit" instead of "Fruit ", you get the same kind of error:
KeyError: 'Fruit'
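If whitespace in the headers turns out to be the culprit, it can be removed from all column names at once; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"Fruit ": ["Apple", "Orange"], "Price": [1.5, 3.24]})

# Strip leading/trailing whitespace from every column name.
df.columns = df.columns.str.strip()
print(df["Fruit"])  # no KeyError any more
```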

Remove a SPECIFIC url from a string in a pandas dataframe

I have a dataframe:
Name url
A 'https://foo.com, https://www.bar.org, https://goo.com'
B 'https://foo.com, https://www.bar.org, https://www.goo.com'
C 'https://foo.com, https://www.bar.org, https://goo.com'
and then a keyword list:
keyword_list = ['foo','bar']
I'm trying to remove the urls that contain the keywords while keeping the ones that don't. So far this is the only thing that has worked for me; however, it only removes that instance of the word:
df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ')
I've tried converting the elements in the string to a list, but I get an indexing error when combining it back with the larger dataframe it's part of. Has anyone run into this before?
Desired output:
Name url
A 'https://goo.com'
B 'https://www.goo.com'
C 'https://goo.com'
I'm pretty sure you can do this with some regex, but you can also do:
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()
(new_df[~new_df.str.contains('|'.join(keyword_list))]
.reset_index(level=1, drop=True)
.to_frame(name='url')
.reset_index()
)
Output:
Name url
0 A https://goo.com
1 B https://www.goo.com
2 C https://goo.com
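If the urls should stay as one comma-separated string per row rather than being exploded into separate rows, the split/filter/join can also be done with apply; a minimal sketch on a subset of the question's data:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B"],
    "url": ["https://foo.com, https://www.bar.org, https://goo.com",
            "https://foo.com, https://www.bar.org, https://www.goo.com"],
})
keyword_list = ["foo", "bar"]

def drop_keyword_urls(urls: str) -> str:
    # Split on commas, keep urls containing none of the keywords, rejoin.
    parts = re.split(r",\s*", urls)
    kept = [u for u in parts if not any(k in u for k in keyword_list)]
    return ", ".join(kept)

df["url"] = df["url"].apply(drop_keyword_urls)
print(df)
```

Unlike the plain str.replace approach, this removes the whole url, not just the matched keyword.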
