I am new to Python, so please excuse my question. In my line of work I have to deal with tabular data stored in text files. The values are separated by either a comma or a semicolon. A simplified example of such a file might look as follows:
City;Car model;Color;Registration number
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
Kiev;Toyota;Blue;3423
London;Fiat;Red;4545
My goal is to have a script which can tell me how many Mercedes are in Moscow (in our case there are two) and save a new text file Moscow.txt with the following content:
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
I will be very thankful for your help.
I would recommend looking into the pandas library. You can do all sorts of neat manipulations of tabular data. First read it in:
>>> import pandas as pd
>>> df = pd.read_csv("cars.ssv", sep=";")
>>> df
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
2 Kiev Toyota Blue 3423
3 London Fiat Red 4545
Index it in different ways:
>>> moscmerc = df[(df["City"] == "Moscow") & (df["Car model"] == "Mercedes")]
>>> moscmerc
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
>>> len(moscmerc)
2
Write it out:
>>> moscmerc.to_csv("moscmerc.ssv", sep=";", header=False, index=False)
>>> !cat moscmerc.ssv
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
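If you only need the count and not the filtered rows themselves, summing the boolean mask is enough; a minimal sketch with the same data inlined:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Moscow", "Moscow", "Kiev", "London"],
    "Car model": ["Mercedes", "Mercedes", "Toyota", "Fiat"],
})
# True counts as 1 when summed, so this is the number of matching rows.
count = int(((df["City"] == "Moscow") & (df["Car model"] == "Mercedes")).sum())
print(count)  # 2
```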
You can also work on multiple groups at once:
>>> df.groupby(["City", "Car model"]).size()
City Car model
Kiev Toyota 1
London Fiat 1
Moscow Mercedes 2
dtype: int64
Update: @Anthon pointed out that the above only handles the case of a semicolon separator. If a file uses a comma throughout, you can just pass "," instead of ";", so that case is trivial. The more interesting case is a delimiter that is inconsistent within the file, but that's easily handled too:
>>> !cat cars_with_both.txt
City;Car model,Color;Registration number
Moscow,Mercedes;Red;1234
Moscow;Mercedes;Red;2345
Kiev,Toyota;Blue,3423
London;Fiat,Red;4545
>>> df = pd.read_csv("cars_with_both.txt", sep="[;,]", engine="python")
>>> df
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
2 Kiev Toyota Blue 3423
3 London Fiat Red 4545
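One caveat worth knowing: a multi-character or regex separator like [;,] is only supported by the Python parsing engine, so passing engine="python" avoids the fallback warning. A self-contained sketch with the mixed-delimiter data inline:

```python
import io
import pandas as pd

data = "City;Car model,Color\nMoscow,Mercedes;Red\n"
# The default C engine only supports single-character separators;
# a regex separator requires the python engine.
df = pd.read_csv(io.StringIO(data), sep="[;,]", engine="python")
print(df.loc[0, "Color"])  # Red
```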
Update #2: and now the text is in Russian -- of course it is. :^) Still, if everything is correctly encoded, and your terminal is properly configured, that should work too:
>>> df = pd.read_csv("russian_cars.csv", sep="[;,]")
>>> df
City Car model Color Registration number
0 Москва Mercedes красный 1234
1 Москва Mercedes красный 2345
2 Киев Toyota синий 3423
3 Лондон Fiat красный 4545
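If the file is not UTF-8 (legacy Russian files are often cp1251, which is only an assumption here), read_csv's encoding parameter handles it; a sketch using an in-memory file:

```python
import io
import pandas as pd

# Simulate a file saved with a legacy Cyrillic encoding, then read it
# back by passing the matching `encoding` argument to read_csv.
raw = "City;Color\nМосква;красный\n".encode("cp1251")
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp1251")
print(df.loc[0, "City"])  # Москва
```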
I have a data frame tweets_df that looks like this:
sentiment id date text
0 0 1502071360117424136 2022-03-10 23:58:14+00:00 AngelaRaeBoon1 Same Alabama Republicans charge...
1 0 1502070916318121994 2022-03-10 23:56:28+00:00 This ’ w/the sentencing JussieSmollett But mad...
2 0 1502057466267377665 2022-03-10 23:03:01+00:00 DannyClayton Not hard find takes smallest amou...
3 0 1502053718711316512 2022-03-10 22:48:08+00:00 I make fake scenarios getting fights protectin...
4 0 1502045714486022146 2022-03-10 22:16:19+00:00 WipeHomophobia Well people lands wildest thing...
.. ... ... ... ...
94 0 1501702542899691525 2022-03-09 23:32:41+00:00 There 's reason deep look things kill bad peop...
95 0 1501700281729433606 2022-03-09 23:23:42+00:00 Shame UN United Dictators Shame NATO Repeat We...
96 0 1501699859803516934 2022-03-09 23:22:01+00:00 GayleKing The difference Ukrainian refugees IL...
97 0 1501697172441550848 2022-03-09 23:11:20+00:00 hrkbenowen And includes new United States I un...
98 0 1501696149853511687 2022-03-09 23:07:16+00:00 JLaw_OTD A world women minorities POC LGBTQ÷ d...
And the second dataFrame globe_df that looks like this:
Country Region
0 Andorra Europe
1 United Arab Emirates Middle east
2 Afghanistan Asia & Pacific
3 Antigua and Barbuda South/Latin America
4 Anguilla South/Latin America
.. ... ...
243 Guernsey Europe
244 Isle of Man Europe
245 Jersey Europe
246 Saint Barthelemy South/Latin America
247 Saint Martin South/Latin America
I want to delete all rows of the dataframe tweets_df which have 'text' that does not contain a 'Country' or 'Region'.
This was my attempt:
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
for entry in globe_df['Country']:
    tweet_index = tweets_df[entry in tweets_df['text']].index  # if tweets that *contain*, not equal... entry in tweets_df['text'] ... (in) or (not in)?
    tweets_df.drop(tweet_index, inplace=True)
print(tweets_df)
Edit: Also, fuzzy, case-insensitive matching with stemming would be preferred when searching the 'text' for countries and regions.
Ex) If the text contained 'Ukrainian', 'british', 'engliSH', etc... then it would not be deleted
Convert the country and region values to a list and use str.contains to filter out the rows that do not contain these values.
# case-insensitive
vals = globe_df.stack().to_list()
tweets_df = tweets_df[tweets_df['text'].str.contains('|'.join(vals), regex=True, case=False)]
or (also case-insensitive):
vals = "({})".format('|'.join(globe_df.stack().str.lower().to_list()))  # make all letters lowercase
tweets_df['matched'] = tweets_df.text.str.lower().str.extract(vals, expand=False)
tweets_df = tweets_df.dropna()
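For the fuzzy/stemming part of the question, a real stemmer (e.g. from nltk) would be more robust, but a rough prefix-based sketch can already get 'Ukrainian' or 'briTISH' matched; the data and names below are hypothetical stand-ins for the real frames:

```python
import pandas as pd

tweets = pd.DataFrame({"text": [
    "Ukrainian refugees welcomed",
    "nothing relevant here",
    "the briTISH press reports",
]})
names = ["Ukraine", "Britain"]  # stand-ins for the globe_df values

# Crude stemming: keep a 4-character prefix and allow any word
# characters after it, so 'Ukraine' also matches 'Ukrainian'.
pattern = "|".join(rf"{n[:4]}\w*" for n in names)
kept = tweets[tweets["text"].str.contains(pattern, case=False, regex=True)]
print(len(kept))  # 2
```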
# Import data
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
# Get country and region column as list
globe_df_country = globe_df['Country'].values.tolist()
globe_df_region = globe_df['Region'].values.tolist()
# Merge the lists, since you want to check with an OR condition
merged_list = globe_df_country + globe_df_region
# If you want to update a df while iterating over it, the best way is to work on a copy
df_tweets2 = tweets_df.copy()
for index, row in tweets_df.iterrows():
    # Check whether the split text values intersect with merged_list
    if [i for i in merged_list if i in row['text'].split()] == []:
        df_tweets2 = df_tweets2.drop(index)
tweets_df_new = df_tweets2.copy()
print(tweets_df_new)
You can try using pandas.Series.str.contains to find the values.
tweets_df[tweets_df['text'].str.contains('{}|{}'.format(entry['Country'], entry['Region']))]
And after creating a new column with boolean values, you can remove the rows where the value is True.
I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].
Let's say it's something like this:
RestaurantName City Restaurant ID Cuisines
Restaurant A Milan 31333 French, Spanish, Italian
Restaurant B Shanghai 63551 Pizza, Burgers
Restaurant C Dubai 7991 Burgers, Ice Cream
Here's a copy-able code as a sample:
rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})
I used string split to expand them into 8 different columns and added it to the dataframe.
csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]
Which leaves me with this:
https://i.stack.imgur.com/AUSDY.png
Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns, for example if I want to see the top cuisine by city.
I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This leaves me with duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge the like columns and keep only the ones with a count of "1", but I have no idea how to do that.
dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())
Which leaves me with all of these columns:
https://i.stack.imgur.com/84spI.png
A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.
AllCsn=np.concatenate((rst.Cuisine1.unique(),
rst.Cuisine2.unique(),
rst.Cuisine3.unique(),
rst.Cuisine4.unique(),
rst.Cuisine5.unique(),
rst.Cuisine6.unique(),
rst.Cuisine7.unique(),
rst.Cuisine8.unique()
))
AllCsn=np.unique(AllCsn.astype(str))
AllCsn
Which leaves me with this:
https://i.stack.imgur.com/O9OpW.png
I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.
I am new to this, so please bear with me and let me know if I need to provide any more info.
It sounds like you're looking for str.split without expanding, then explode:
rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')
Creates a frame like:
RestaurantName City RestaurantID Cuisines
0 Rest A Milan 31333 French
0 Rest A Milan 31333 Spanish
0 Rest A Milan 31333 Italian
1 Rest B Shanghai 63551 Pizza
1 Rest B Shanghai 63551 Burgers
2 Rest C Dubai 7991 Burgers
2 Rest C Dubai 7991 Ice Cream
Then it sounds like either crosstab:
pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines Burgers French Ice Cream Italian Pizza Spanish
City
Dubai 1 0 1 0 0 0
Milan 0 1 0 1 0 1
Shanghai 1 0 0 0 1 0
Or value_counts
rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
City Cuisines counts
0 Dubai Burgers 1
1 Dubai Ice Cream 1
2 Milan French 1
3 Milan Italian 1
4 Milan Spanish 1
5 Shanghai Burgers 1
6 Shanghai Pizza 1
Max value_count per City via groupby head:
max_counts = (
rst[['City', 'Cuisines']].value_counts()
.groupby(level=0).head(1)
.reset_index(name='counts')
)
max_counts:
City Cuisines counts
0 Dubai Burgers 1
1 Milan French 1
2 Shanghai Burgers 1
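If you want the single top cuisine per city without the groupby/head step, crosstab plus idxmax is an alternative sketch (ties are broken by column order, so treat the result as one of possibly several maxima):

```python
import pandas as pd

rst = pd.DataFrame({
    "City": ["Milan", "Milan", "Milan", "Shanghai", "Shanghai", "Dubai", "Dubai"],
    "Cuisines": ["French", "Spanish", "Italian", "Pizza", "Burgers", "Burgers", "Ice Cream"],
})
# idxmax(axis=1) returns the first column holding each row's maximum,
# i.e. one top cuisine per city.
top = pd.crosstab(rst["City"], rst["Cuisines"]).idxmax(axis=1)
print(top["Milan"])  # French
```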
My data looks something like this, with household members of three different origin (Dutch, American, French):
Household members nationality:
Dutch American Dutch French
Dutch Dutch French
American American
American Dutch
French American
Dutch Dutch
I want to convert them into three categories:
Dutch only households
Households with 1 Dutch and at least 1 French or American
Non-Dutch households
Category 1 was captured by the following code:
~df['households'].str.contains("French", "American")
I was looking for a solution for category 2 and 3. I had the following in mind:
Mixed households
df['households'].str.contains("Dutch" and ("French" or "American"))
But this solution did not work because it also captured rows containing only French members.
How do I implement this 'and' statement correctly in this context?
Let us try str.get_dummies to create a dataframe of dummy indicator variables for the Household column, then create boolean masks m1, m2 and m3 per the specified conditions, and finally use these masks to filter the rows:
c = df['Household'].str.get_dummies(sep=' ')
m1 = c['Dutch'].eq(1) & c[['American', 'French']].eq(0).all(1)
m2 = c['Dutch'].eq(1) & c[['American', 'French']].eq(1).any(1)
m3 = c['Dutch'].eq(0)
Details:
>>> c
American Dutch French
0 1 1 1
1 0 1 1
2 1 0 0
3 1 1 0
4 1 0 1
5 0 1 0
>>> df[m1] # category 1
Household
5 Dutch Dutch
>>> df[m2] # category 2
Household
0 Dutch American Dutch French
1 Dutch Dutch French
3 American Dutch
>>> df[m3] # category 3
Household
2 American American
4 French American
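If you want one labeled column instead of three filtered frames, the same masks feed np.select; a sketch reusing m1 and m3, with the mixed case (category 2) as the default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Household": [
    "Dutch American Dutch French", "Dutch Dutch French",
    "American American", "American Dutch",
    "French American", "Dutch Dutch",
]})
c = df["Household"].str.get_dummies(sep=" ")
m1 = c["Dutch"].eq(1) & c[["American", "French"]].eq(0).all(1)
m3 = c["Dutch"].eq(0)
# Rows that are neither purely Dutch (m1) nor Dutch-free (m3) are mixed.
df["category"] = np.select([m1, m3], [1, 3], default=2)
print(df["category"].tolist())  # [2, 2, 3, 2, 3, 1]
```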
Hi, I have two dataframes, and I have to use one dataframe to replace values in the other. I can normally create a dictionary to replace values in a whole dataframe, but the other dataframe has slightly different values, so I need a condition: if part of the string matches, it should map through the dictionary.
The first dataframe is like this:
The second dataframe is like this:
id cars1 cars2
1 $ {hQOpelText.r1.val} BMW
2 $ {hQOpelText.r2.val} $ {hQOpelText.r2.val}
3 $ {hQOpelText.r3.val} $ {hQOpelText.r5.val}
4 $ {hQOpelText.r4.val} Audi
5 $ {hQOpelText.r5.val} Audi
And i want resulted df like this:
id cars1 cars2
1 Opel Adam BMW
2 Opel Astra Estate Opel Astra Estate
3 Opel Astra Hatchback Opel Grandland x
4 Opel Astra Saloon Audi
5 Opel Grandland x Audi
The idea is to replace '.' with empty strings, then extract values matching the keys of the dictionary, map them, and keep the original value where there is no match:
c = df.select_dtypes(object).columns
func = lambda x: (x.str.replace('.', '', regex=False)
                   .str.extract(f'({"|".join(d.keys())})', expand=False)
                   .map(d)
                   .fillna(x))
df[c] = df[c].apply(func)
print (df)
id cars1 cars2
0 1 Opel Adam BMW
1 2 Opel Astra Estate Opel Astra Estate
2 3 Opel Astra Hatchback Opel Grandland X
3 4 Opel Astra Saloon Audi
4 5 Opel Grandland X Audi
We can first change all the column values of the form $ {hQOpelText.r*.val} in df2 to adhere to the convention used in the Variable column of df1, i.e. hQOpelTextr*, and then replace those values with the corresponding values from df1:
cols = df2.select_dtypes(object).columns
df2[cols] = df2[cols].transform(
    lambda s: (
        s.str.replace(r'\$\s*\{([^\.]+).*?([^\.]+).*?\}', r'\g<1>\g<2>')
         .replace(df1.set_index('Variable')['AUS'])
    )
)
# print(df2)
id cars1 cars2
0 1 Opel Adam BMW
1 2 Opel Astra Estate Opel Astra Estate
2 3 Opel Astra Hatchback Opel Grandland X
3 4 Opel Astra Saloon Audi
4 5 Opel Grandland X Audi
I am trying to generate dummy and categorical variables from a text column in a dataframe, using Python. Imagine a text column 'Cars_notes' in a dataframe named 'Cars_listing':
- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."
How to make new variables:
- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0)
I have started by trying re.search() (not re.match()), such as:
sporty = re.search("convertible",'Cars_notes')
I am just starting to learn Python text manipulation and NLP. I have searched for information here as well as other sources (Data Camp, Udemy, Google searching) but I have not yet found something to explain how to manipulate text to create such categorical or dummy variables. Help will be appreciated. Thanks!
Here's my take on this.
Since you're dealing with text, pandas.Series.str.contains should be plenty (no need for re.search).
np.where and np.select are useful when it comes to assigning new variables based on conditions.
import pandas as pd
import numpy as np
Cars_listing = pd.DataFrame({
    'Cars_notes': [
        '"This Audi has ABS braking, leather interior and bucket seats..."',
        '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
        '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
        '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
        '"The Renault Le Car has been sitting in the garage, a little rust..."',
        '"The Kia Sorento for sale has a CD player, new tires..."',
        '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."',
    ]
})
# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[  # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3],  # dummy variables
    default=0  # you could set it to `np.nan` etc.
)
# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(  # where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)
# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)
# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)
Cars_notes car_type ABS_brakes imperfection sporty
0 "This Audi has ... 2 1 0 0
1 "The Ford F150 ... 1 0 0 0
2 "Our Nissan Sen... 0 1 0 0
3 "This Toyota Co... 3 0 1 0
4 "The Renault Le... 2 0 1 0
5 "The Kia Sorent... 3 0 0 0
6 "Red Dodge Vipe... 0 0 0 1
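If you later want readable labels instead of the integer codes, a plain dict and Series.map will do; the label names here are just illustrative:

```python
import pandas as pd

# car_type codes as produced by np.select above.
car_type = pd.Series([2, 1, 0, 3, 2, 3, 0])
labels = {0: "Unknown", 1: "American", 2: "European", 3: "Asian"}
print(car_type.map(labels).tolist())
```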