I have a dataframe named df with a column named "text", where each row is a string in the MARC data format, like this:
d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$
Here I want to extract the information contained in zone ¶7146 after $u, or in zone ¶0441 after $c.
The result table should look like this:
¶7146$u       ¶0441$c
wwww.abc.org  2609-2565
Agency XYZ    2609-2565
Here is the code I wrote:
import os
import pandas as pd
import numpy as np
import requests

df = pd.read_csv('dataset.csv')

def extract(text, start_pattern, sc):
    ist = text.find(start_pattern)
    if ist < 0:
        return ""
    ist = text.find(sc, ist)
    if ist < 0:
        return ""
    im = text.find("$", ist + len(sc))
    iz = text.find("¶", ist + len(sc))
    if im >= 0:
        if iz >= 0:
            ie = min(im, iz)
        else:
            ie = im
    else:
        ie = iz
    if ie < 0:
        return ""
    return text[ist + len(sc): ie]

def extract_text(row, list_in_zones):
    text = row["text"]
    if pd.isna(text):
        return [""] * len(list_in_zones)
    patterns = [("¶" + p, "$" + c) for p, c in [zone.split("$") for zone in list_in_zones]]
    return [extract(text, pattern, sc) for pattern, sc in patterns]

list_in_zones = ["7146$u", "0441$u", "200$y"]
df[list_in_zones] = df.apply(lambda row: extract_text(row, list_in_zones),
                             axis=1,
                             result_type="expand")
df.to_excel("extract.xlsx", index=False)
For zone ¶7146 after $u, my code only extracted "wwww.abc.org"; it cannot extract the duplicate occurrence with the value "Agency XYZ". What's wrong here?
Additional logical structure: each zone starts with the character ¶ (e.g. ¶7146, ¶0441, ...), and the fields inside a zone start with $ (e.g. $u, $c); a field ends at either the next $ or the next ¶. I want to extract the information in these $ fields.
You could try splitting and then cleaning up the strings as follows:
import pandas as pd
text = ('d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$')
u = text.split('$u')[1:3]  # taking just the second and third elements in the array because they match your desired output
c = text.split('$c')[1:3]
pd.DataFrame([u,c]).T
OUTPUT
                           0                                                1
0  wwww.abc.org$ma1¶7146 $q9  2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965
1        Agency XYZ¶8799 $q1  4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$a...
From here you can try to clean up the strings until they match the desired output.
It would be easier to give a more helpful answer if we could understand the logic behind this data structure - when do certain fields start and end?
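Edit: with the structure now described in the question (each zone starts with ¶, each field starts with $ and ends at the next $ or ¶), a regex can grab every occurrence. The root cause of the original problem is that str.find only ever returns the first match, so extract() never visits the second ¶7146 zone. Here is a minimal sketch; extract_all is an illustrative name of mine, and it assumes each field code appears at most once per zone block:
import re

def extract_all(text, zone_field):
    # zone_field is e.g. "7146$u": zone "7146", field code "u"
    zone, field = zone_field.split("$")
    # match "¶<zone> ... $<field><value>", where <value> runs until the next $ or ¶;
    # [^¶]*? keeps the search inside a single zone block
    pattern = rf"¶{re.escape(zone)}[^¶]*?\${re.escape(field)}([^$¶]*)"
    return re.findall(pattern, text)

extract_all(text, "7146$u")  # ['wwww.abc.org', 'Agency XYZ']
extract_all(text, "0441$c")  # ['2609-2565']
Since re.findall returns a list, you would need to decide how to store duplicates per row, e.g. join them with a separator before writing to Excel.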
I have a data frame which contains a text column, i.e. df["input"].
I would like to create new dummy variables that check whether the df["input"] column contains any of the words in a given list, assigning a value of 1 only if no previous dummy variable is already 1 (the logic is: 1) create a dummy variable that equals zero; 2) set it to one if the text contains any word in the given list and was not matched by any of the previous lists).
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
final dataset should look like:
input                       listings  scripting  medical
amazon listing subtitle            1          0        0
medical                            0          0        1
film biotechnology dentist         0          1        0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np

d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
But in this case, it might be more efficient to use a nested for-loop:
import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
                        input  listings  medical  scripting
0     amazon listing subtitle         1        0          0
1                     medical         0        1          0
2  film biotechnology dentist         0        0          1
Here is a simpler, more pandas-vector-style solution:
patterns = {}  # <-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1, for each column create a regular expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)

# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the rows to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values

# subtract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:], 0, 1)

print(df)
Result
                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
I am trying to clean a spreadsheet of user-inputted data that includes a "birth_date" column. The issue I am having is that the date formatting varies widely between users, including inputs without markers between the day, month, and year. I am having a hard time developing a formula intelligent enough to interpret such a wide range of inputs. Here is a sample:
1/6/46
7/28/99
11272000
11/28/78
Here is where I started:
df['birth_date'] = pd.to_datetime(df.birth_date)
This does not seem to make it past the first example, as it looks for a two-digit month. Can anyone help with this?
Your best bet is to check each input and produce a consistent output. Assuming month-day-year formats, you can use this function:
import pandas as pd
import re
def fix_dates(dates):
    new = []
    for date in dates:
        chunks = re.split(r"[\/\.\-]", date)
        if len(chunks) == 3:
            m, d, y = map(lambda x: x.zfill(2), chunks)
            y = y[2:] if len(y) == 4 else y
            new.append(f"{m}/{d}/{y}")
        else:
            m = date[:2]
            d = date[2:4]
            y = date[4:]
            y = y[2:] if len(y) == 4 else y
            new.append(f"{m}/{d}/{y}")
    return new
inconsistent_dates = '1/6/46 7/28/99 11272000 11/28/78'.split(' ')
pd.to_datetime(pd.Series(fix_dates(inconsistent_dates)))
0 2046-01-06
1 1999-07-28
2 2000-11-27
3 1978-11-28
dtype: datetime64[ns]
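One caveat: two-digit years go through strptime's century pivot (00-68 become 20xx, 69-99 become 19xx), which is why 1/6/46 comes out as 2046-01-06 above; for birth dates you probably want to push future years back a century. Below is an alternative sketch that tries a few explicit formats per value; the format list and the century correction are my assumptions, not part of the question:
import pandas as pd

def parse_birth_date(value):
    # try delimiter-separated month/day/year first, then the 8-digit run-on form
    for fmt in ("%m/%d/%y", "%m/%d/%Y", "%m%d%Y"):
        try:
            parsed = pd.to_datetime(value, format=fmt)
            # assumption: birth dates are never in the future, so shift 20xx back
            if parsed.year > pd.Timestamp.now().year:
                parsed = parsed.replace(year=parsed.year - 100)
            return parsed
        except ValueError:
            continue
    return pd.NaT  # unparseable input

pd.Series([parse_birth_date(d) for d in '1/6/46 7/28/99 11272000 11/28/78'.split()])
# 0   1946-01-06
# 1   1999-07-28
# 2   2000-11-27
# 3   1978-11-28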
I am trying to get data from a website and write it to an Excel file to be worked on. I have a main url scheme, and I have to change the "year" and the "reference number" accordingly:
http://calcio-seriea.net/presenze/"year"/"reference number"/
I have already tried to write part of the code, but I have one issue. First of all, the year should stay the same while the reference number takes every value in an interval of 18 numbers. Then the year increases by 1, and the reference number again takes every value in the next interval of 18. Let me give an example:
Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …
Then from year 2004 the interval becomes 20, so
Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];
This continues up to and including year 2019.
This is the code I have managed to write so far:
import pandas as pd

year = str(1998)
all_items = []
for i in range(1142, 1159):
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"
    df = pd.read_html(pattern)[6]
    all_items.append(df)

pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index=False, header=False)
print("Done!")
Thanks to all in advance
All that's missing is a pd.concat of your results; however, as you're calling the same method over and over, let's write a function so you can keep your code DRY.
def create_html_df(base_url, year, range_nums=()):
    """
    Returns a dataframe from a url/html table.
    base_url : the url to target
    year : the target year
    range_nums : the range of numbers, i.e. (1, 50)
    """
    start, stop = range_nums
    url_pat = [f"{base_url}/{year}/{i}" for i in range(start, stop)]
    dfs = []
    for each_url in url_pat:
        df = pd.read_html(each_url)[6]
        dfs.append(df)
    return pd.concat(dfs)
final_df = create_html_df(base_url="http://calcio-seriea.net/presenze/",
                          year=1998,
                          range_nums=(1142, 1159))
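To cover the whole 1998-2019 span with the interval change in 2004, the (year, reference-number range) pairs can be generated programmatically and fed to create_html_df. Here is a sketch following the intervals stated in the question; note those intervals look inclusive, so the stop value is the last number plus one:
# 18 reference numbers per year from 1998, 20 per year from 2004, through 2019
ranges = {}
start = 1142
for year in range(1998, 2020):
    step = 18 if year < 2004 else 20
    ranges[year] = (start, start + step)  # stop is exclusive, ready for range()
    start += step

# sanity check against the question: ranges[1998] == (1142, 1160)
# and ranges[2004] == (1250, 1270)
all_dfs = [create_html_df("http://calcio-seriea.net/presenze/", year, rn)
           for year, rn in ranges.items()]
final_df = pd.concat(all_dfs)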
I have data in the following csv format:
Date,State,City,Station Code,Minimum temperature (C),Maximum temperature (C),Rainfall (mm),Evaporation (mm),Sunshine (hours),Direction of maximum wind gust,Speed of maximum wind gust (km/h),9am Temperature (C),9am relative humidity (%),3pm Temperature (C),3pm relative humidity (%)
2017-12-25,VIC,Melbourne,086338,15.1,21.4,0,8.2,10.4,S,44,17.2,57,20.7,54
2017-12-25,VIC,Bendigo,081123,11.3,26.3,0,,,ESE,46,17.2,53,25.5,25
2017-12-25,QLD,Gold Coast,040764,22.3,35.7,0,,,SE,59,29.2,53,27.7,67
2017-12-25,SA,Adelaide,023034,13.9,29.5,0,10.8,12.4,N,43,18.6,42,27.7,17
The output for VIC should be
S : 1
ESE : 1
SE : 0
N : 0
however I am getting the output as
S : 1
ESE : 1
I would therefore like to know how a unique function can be used to include the other 2 missing results. Below is the program, which reads a csv file:
import pandas as pd

# read file
df = pd.read_csv('climate_data_Dec2017.csv')

# marker
value = df['Date']
date = value == "2017-12-26"
marker = df[date]

# group data
directionwise_data = marker.groupby('Direction of maximum wind gust')
count = directionwise_data.size()
numbers = count.to_dict()
for key in numbers:
    print(key, ":", numbers[key])
To begin with, I'm not sure what you're trying to get from this:
Your data sample has no "2017-12-26" records, yet you're using that date in your code. For the sample I'll change the code to "2017-12-25" just to see what it produces, and it produces exactly what you're expecting. So I guess that in your full data there are no "2017-12-26" records for SE and N, and therefore they never appear in the groupby. I suggest you create a unique set of the four directions in your df, then count their occurrences in a slice of your dataframe for the needed date, as sketched below.
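For example, here is a minimal sketch of that suggestion, using unique to fix the set of directions and reindex to fill in the zero counts (filtered on State here, to match the VIC example in the question):
import pandas as pd

df = pd.read_csv('climate_data_Dec2017.csv')

# the full set of directions present anywhere in the data
directions = df['Direction of maximum wind gust'].dropna().unique()

# count occurrences in the wanted slice, filling in missing directions with 0
subset = df[df['State'] == 'VIC']
counts = (subset.groupby('Direction of maximum wind gust').size()
                .reindex(directions, fill_value=0))
for direction, count in counts.items():
    print(direction, ':', count)
# S : 1
# ESE : 1
# SE : 0
# N : 0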
Or if all you want is how many records for each direction you have by date, why not just pivot it like below:
output = df.pivot_table(index='Date', columns = 'Direction of maximum wind gust', aggfunc={'Direction of maximum wind gust':'count'}, fill_value=0)
EDIT:
OK, so I wrote this real quick; it should get you what you want, but you need to feed it the date you want:
import pandas as pd

# read csv
df = pd.read_csv('climate_data_Dec2017.csv')

# specify date
neededDate = '2017-12-25'

# slice dataframe to keep needed records based on the date
subFrame = df.loc[df['Date'] == neededDate].reset_index(drop=True)

# set counts to zero
d1 = 0  # 'S'
d2 = 0  # 'SE'
d3 = 0  # 'N'
d4 = 0  # 'ESE'

# loop over slice and count directions
for i, row in subFrame.iterrows():
    direction = subFrame.at[i, 'Direction of maximum wind gust']
    if direction == 'S':
        d1 = d1 + 1
    elif direction == 'SE':
        d2 = d2 + 1
    elif direction == 'N':
        d3 = d3 + 1
    elif direction == 'ESE':
        d4 = d4 + 1

# print direction counts
print('S = ' + str(d1))
print('SE = ' + str(d2))
print('N = ' + str(d3))
print('ESE = ' + str(d4))
S = 1
SE = 1
N = 1
ESE = 1
I have data like the SampleDf below, and I'm trying to write code that picks off the first 'Avg', 'Sum', or 'Count' it runs into in each string and puts that in a new column 'Agg'. The code I have below almost does it, but it has a hierarchy: if Count comes before Sum, it still puts Sum in the 'Agg' column. The OutputDf below shows what I'm hoping to get.
Sample Data:
SampleDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],['bob',"isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)"]],columns=['ReportField','OtherField'])
Sample Output:
OutputDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)",'Avg'],['bob',"isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)",'Sum']],columns=['ReportField','OtherField','Agg'])
Code:
import numpy as np
SampleDf['Agg'] = np.where(SampleDf.OtherField.str.contains("Sum"), "Sum",
                  np.where(SampleDf.OtherField.str.contains("Count"), "Count",
                  np.where(SampleDf.OtherField.str.contains("Avg"), "Avg", "Nothing")))
A quick and dirty attempt at this problem would be writing a function that returns:
- whichever term of interest, i.e. ['Avg','Sum','Count'], occurs first, if one is present in the string
- or None, if there is none:
import re

terms = ['Avg', 'Sum', 'Count']

def extractTerms(s, t=terms):
    s_clean = re.sub(r"[^\w]|[\d]", " ", s).split()
    s_array = [w for w in s_clean if w in t]
    try:
        return s_array[0]
    except IndexError:
        return None
Proof when terms are in the string:
SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf
ReportField OtherField Agg
0 tom Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end) Avg
1 bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0) Sum
Proof if terms are not in the string:
SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf
ReportField OtherField Agg
0 tom foo None
1 bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0) Sum
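As an aside, pandas' own Series.str.extract returns only the first match of a pattern, so the same "first term wins" behaviour can be had in one line (unmatched rows come back as NaN rather than None):
SampleDf['Agg'] = SampleDf['OtherField'].str.extract(r'\b(Avg|Sum|Count)\b', expand=False)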