I have a huge list of website names in my dataframe,
e.g. array(['google', 'facebook', 'yahoo', 'youtube', ...]) plus many other small websites.
The dataframe has around 40 more websites.
I want to group the other website names as 'other'.
My input table is something like
|Website |
|-------------|
|google.com |
|youtube.com |
|yahoo.com |
|nyu.com |
|something.com|
My desired output will be something like
|Website |
|-----------|
|google.com |
|youtube.com|
|yahoo.com |
|others |
|others |
I tried a few things, but they didn't work. Should I rename them manually? Or is there a way I can create a new column and mark the rest as 'others', with a few exceptions as above?
Thanks in advance.
Try:
m = df['Website'].isin(['google.com', 'youtube.com', 'yahoo.com'])
# finally, replace everything else with 'others'
df.loc[~m, 'Website'] = 'others'
OR
m = df['Website'].str.contains('google|youtube|yahoo')
# finally, replace everything else with 'others'
df.loc[~m, 'Website'] = 'others'
Try using str.contains (negated, so that everything not matching the sites you keep becomes 'others'):
df.loc[~df['Website'].str.contains('google|youtube|yahoo|facebook'), 'Website'] = 'others'
Maybe...
# maintain a list of sites you wish to keep
sitesToKeep = ['google.com', 'youtube.com', 'yahoo.com']
# for all rows where the value in 'Website' is not present in sitesToKeep, change the value to 'others'
df.loc[~df.Website.isin(sitesToKeep), 'Website'] = 'others'
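A minimal runnable sketch pulling these approaches together; the sample frame and the np.where variant for keeping the original column are illustrative assumptions, not part of the original post:
import numpy as np
import pandas as pd
# stand-in sample data
df = pd.DataFrame({'Website': ['google.com', 'youtube.com', 'yahoo.com', 'nyu.com', 'something.com']})
keep = ['google.com', 'youtube.com', 'yahoo.com']
mask = df['Website'].isin(keep)
# either add a grouped column and keep the original ...
df['Website_grouped'] = np.where(mask, df['Website'], 'others')
# ... or overwrite the column in place
df.loc[~mask, 'Website'] = 'others'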
I have a DataFrame df with text as below:
| File_name | Content                        |
|-----------|--------------------------------|
| BI1.txt   | I am writing this letter ...   |
| BI2.txt   | Yes! I would like to pursue... |
I would like to create an additional column which provides the syllable count with:
df['syllable_count'] = textstat.syllable_count(df['content'])
The error:
Series objects are mutable, thus they cannot be hashed
How can I change the Content column to something hashable? How can I fix this error?
Thanks for your help!
Try doing it this way:
df['syllable_count'] = df.content.apply(lambda x: textstat.syllable_count(x))
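The lambda isn't strictly necessary; applying the function directly should also work (assuming the column really is named content rather than Content, so adjust to whatever your frame actually has):
df['syllable_count'] = df['content'].apply(textstat.syllable_count)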
I have a script that collates sets of tags from other dataframes, converts them into a comma-separated string, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() line reads a .csv file that contains only the following headers and no data:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid having to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the DataFrame generated by df_empty.
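A minimal sketch of that fix, assuming the collated rows arrive as a list of dicts whose first element is the header row (rows_to_insert is a hypothetical name for that list):
# skip the header entry, then append each remaining row in column order
for row in rows_to_insert[1:]:
    resultData.loc[len(resultData)] = [row[c] for c in resultData.columns]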
I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show(), as shown below, I can see that the imdbRating field has mixed data: random strings, movie titles, movie URLs, and actual ratings. The dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF as:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type, and hence None is returned for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that there was some corrupt data with not all fields present. First, I tried using pandas, reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than expected. I then tried to load the above pandas dataframe, pd_frame, into spark using:
imdb_data = spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the spark csv reader has something similar which drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
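If dropping whole rows isn't desirable, another common approach (a sketch assuming only the imdbRating column is dirty, not the poster's actual solution) is to cast the column to a numeric type; strings that are not valid numbers become null and can then be filtered out:
from pyspark.sql.functions import col
# cast turns non-numeric strings into null
ratings = (imdb_data
           .withColumn('imdbRating', col('imdbRating').cast('double'))
           .filter(col('imdbRating').isNotNull()))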
I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
| ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate   |
|-----------|------------|---------|---------|---------|------------|
| a1032     | c1576      | 36      | 12      | 19      | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*), which will produce this output:
| ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate   |
|-----------|------------|--------|--------|--------|------------|
| a1032     | c1576      | 36     | 12     | 19     | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer), I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of it mentioned pattern matching to rename a column, e.g.:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; in total, my data set has 120 columns, only 12 of which need to be renamed, which is why I thought that regex might be the simplest way to go.
import re
# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")
# get headers as list
headers = list(df)
# apply regex
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1
# set new list as column headers
df.columns = headers
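Since the real data set has 120 columns, it may be worth anchoring the pattern so that only exact MM/yyyy names are renamed; this tweak is an assumption about what the other column names look like, not something required by the sample shown:
pattern = re.compile(r"^[0-9]{2}/[0-9]{4}$")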
If you have set names that you want to convert to, then rather than using rename, it might be easier to just pass a new list to the df.columns attribute:
df.columns = ['ProductID', 'CustomerID'] + ['Month{}'.format(i) for i in range(1, 13)] + ['FileDate']
If you want to use rename and you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with:
df.rename(columns={old_name: find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be
df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})
You can also do
for new_name in new_names:
    old_name = find_old_name(new_name)
    df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.
Since rename can take a function as a mapper, we can define a customized function which returns a column name in the new format if the old column name matches the regex, and otherwise returns the same column name. For example,
import re
def mapper(old_name):
    match = re.match(r'([0-9]*)/([0-9]*)', old_name)
    if match:
        return 'Month{}'.format(int(match.group(1)))
    return old_name
df = df.rename(columns=mapper)
Hi, I have a rather simple task, but it seems like none of the online help is working.
I have a data set like this:
| ID    | Px_1       | Px_2               |
|-------|------------|--------------------|
| theta | 106.013676 | 102.8024788702673  |
| Rho   | 100.002818 | 102.62640389123405 |
| gamma | 105.360589 | 107.21999706084836 |
| Beta  | 106.133046 | 115.40449479551263 |
| alpha | 106.821119 | 110.54312246081719 |
I want to find the min of each row in a fourth column, so the output for theta, for example, is 102.802 because it is the min of Px_1 and Px_2.
I tried this, but it doesn't work; I constantly get the max value:
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min(axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select the columns needed, here ["Px_1", "Px_2"], and perform the min operation across them with axis=1.
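If the result still looks like the max, one possible (unconfirmed) cause is that the columns were read as strings, which makes the comparison lexicographic; converting them to numbers first is a reasonable thing to try:
df[["Px_1", "Px_2"]] = df[["Px_1", "Px_2"]].apply(pd.to_numeric, errors="coerce")
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)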