I am working on a Python program that aims to take Excel data that is vertical and make it horizontal.
For example, the data is shaped something like this:
County | State | Number | Date
Oakland | MI | 19 | 1/12/10
Oakland | MI | 32 | 1/19/10
Wayne | MI | 9 | 1/12/10
Wayne | MI | 6 | 1/19/10
But I want it like this (purposefully excluding the state):
County | 1/12/10 | 1/19/10
Oakland | 19 | 32
Wayne | 9 | 6
(And for the actual data, it’s quite long).
My logic so far:
Read in the Excel File
Loop through the counties
If county name is the same, place # in Row 1?
Make a new Excel File?
Any ideas of how to write this out? I think I am a little stuck on the syntax here.
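One way to approach this is pandas' pivot, which replaces the manual "loop through the counties" step. Below is a minimal sketch, assuming the column names from the sample above. The DataFrame is built inline so the snippet runs on its own, and the file names in the comments are hypothetical:

```python
import pandas as pd

# Sample rows copied from the question. With a real workbook you would
# load them instead, e.g. df = pd.read_excel("input.xlsx")  (hypothetical name).
df = pd.DataFrame({
    "County": ["Oakland", "Oakland", "Wayne", "Wayne"],
    "State": ["MI", "MI", "MI", "MI"],
    "Number": [19, 32, 9, 6],
    "Date": ["1/12/10", "1/19/10", "1/12/10", "1/19/10"],
})

# One row per county, one column per date; State is simply not selected.
wide = df.pivot(index="County", columns="Date", values="Number").reset_index()
wide.columns.name = None  # drop the leftover "Date" label on the column axis

print(wide)

# To save the result as a new workbook (hypothetical name):
# wide.to_excel("output.xlsx", index=False)
```

If the same county/date pair can appear more than once, pivot raises an error; in that case pivot_table with an aggregation function such as "sum" may be a better fit.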
How do I extract specific text from one column in pandas when the format is inconsistent? For example, the data looks like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the desired output would look like this:
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$', expand=False)
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, none of the numeric variables have missing values. However, there are thousands to hundreds of thousands of missing values in the remaining 13 char/string variables.
Request
Similar to the Towards Data Science blog post shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically its Feature Engineering section, how can I write equivalent SAS code that uses regex on the description column to fill in missing values of the other string/char columns, such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
        database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
    set have;
    *temporary arrays need an explicit dimension, so (*) cannot be used here;
    array _fuel(4) $ 8 _temporary_ ("gas", "hybrid", "diesel", "electric");
    do i=1 to dim(_fuel);
        if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
        *does not deal with multiple finds so the last one found will be kept;
    end;
run;
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
I have data that looks like this:
service | company
--------------------
sequencing| Fischer
RNA tests | Fischer
Cell tests| 23andMe
consulting| UCLA
DNA tests | UCLA
mouse test| UCLA
and I want to combine the services into one list per company, like this:
service_list | company
-------------------------------------------------
['sequencing','RNA tests'] | Fischer
['Cell tests'] | 23andMe
['consulting','DNA tests','mouse test']| UCLA
Not sure how to begin doing this.
Let's try groupby() and aggregate to a list:
df.groupby('company').service.agg(list).reset_index()
   company                               service
0  23andMe                          [Cell tests]
1  Fischer               [sequencing, RNA tests]
2     UCLA  [consulting, DNA tests, mouse test]
Hi, assume I have two lists:
names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']
and a dataframe with some sentences, for example:
DataframeOrig
Index | Sent
0 | Mandy went to France on the Eiffel Tower
1 | Daniele was dancing on top of the box
2 | I am eating on top of the table
3 | Maria went to the valley of the kings
I would like to use a similarity measure such as difflib to scan the sentences and compare phrases against the lists, allowing for a given tolerance. Ideally the result would be:
Index | Sent | Result
0 | Mandy went to France on the Eiffel Tower | Mandy
1 | Daniele was dancing on top of the box | Daniel
2 | I am eating on top of the table | on top of the table
3 | Maria went to the valley of the kings | Mario, valley of the kings
How would you go about it without using loads of loops to get phrase matches?
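One possible approach is to slide a window of words over each sentence and let difflib.get_close_matches score every window against each list entry. The sketch below assumes the lists and sentences from the question. The helper names (word_windows, fuzzy_find, annotate) and both cutoff values are made up for illustration and would need tuning on real data:

```python
import difflib
import pandas as pd

names = ['Daniel', 'Mario', 'Mandy', 'Jolene', 'Fabio']
places = ['on top of the table', 'France', 'valley of the kings']

df = pd.DataFrame({"Sent": [
    "Mandy went to France on the Eiffel Tower",
    "Daniele was dancing on top of the box",
    "I am eating on top of the table",
    "Maria went to the valley of the kings",
]})

def word_windows(words, size):
    """All runs of `size` consecutive words, joined back into phrases."""
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def fuzzy_find(sentence, candidates, cutoff):
    """Return the candidates that approximately occur in the sentence."""
    words = sentence.split()
    found = []
    for candidate in candidates:
        # Compare against windows with the same word count as the candidate.
        windows = word_windows(words, len(candidate.split()))
        if difflib.get_close_matches(candidate, windows, n=1, cutoff=cutoff):
            found.append(candidate)
    return found

def annotate(sentence):
    # Assumed cutoffs: looser for short names, stricter for long phrases.
    hits = fuzzy_find(sentence, names, 0.75) + fuzzy_find(sentence, places, 0.9)
    return ", ".join(hits)

df["Result"] = df["Sent"].apply(annotate)
print(df)
```

The loops are still there, but they are confined to one helper and applied per row. With a single shared cutoff, "on top of the box" scores close enough to "on top of the table" to be reported, which is why the sketch uses a stricter threshold for the longer phrases. Note that France in the first sentence is matched as well, since it is in places.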
So I was trying to make a function that downloads a CSV file from its download link and then prints it line by line, but I'm having problems when I have to save it:
def download_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r'data.csv'
    fx = open(dest_url, 'r')
    for line in lines:
        fx.write(line + '/n')
    fx.close()
When I give it the CSV link, it tells me it can't find the file/directory "data.csv", even though I should have downloaded it.
Running macOS.
You're opening the file in read mode, which fails because data.csv doesn't exist yet. Change the 'r' in fx = open(dest_url, 'r') to 'w'.
fx = open(dest_url, 'w')
As a side note, you really should be using a with statement. with closes the file object once the code leaves the with block's scope, so you don't have to worry about closing it yourself.
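from urllib import request

def download_data(csv_url):
    response = request.urlopen(csv_url)
    with open('data.csv', 'w') as f:
        # decode the bytes (UTF-8 assumed); str() on bytes would write a "b'...'" literal
        f.write(response.read().decode('utf-8'))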
Though really there isn't any need to save the file at all if you're just going to read it and display the contents on the screen. Just have download_data return csv_str.
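For illustration, a version that returns the text instead of saving it might look like the sketch below. It assumes the response is UTF-8 encoded and returns a list of lines rather than one string, so the caller can print them directly:

```python
from urllib import request

def download_data(csv_url):
    """Fetch the CSV at csv_url and return its lines, without saving a file."""
    with request.urlopen(csv_url) as response:
        # Assumption: the server sends UTF-8 text.
        csv_str = response.read().decode("utf-8")
    # splitlines() handles both \n and \r\n line endings.
    return csv_str.splitlines()
```

Calling it is then just a loop such as for line in download_data(url): print(line).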
Finally take a look at the builtin csv module. It makes life easy.
import csv
from io import StringIO
import requests
def download_data(csv_url):
    return csv.reader(
        StringIO(
            requests.get(csv_url)
            .text
        ), delimiter=','
    )

for row in download_data('https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv'):
    print("| {} |".format(str(' | '.join(row))))
# Prints:
#
# | John | Doe | 120 jefferson st. | Riverside | NJ | 08075 |
# | Jack | McGinnis | 220 hobo Av. | Phila | PA | 09119 |
# | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 08075 |
# | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
# | | Blankman | | SomeTown | SD | 00298 |
# | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 00123 |