filling in columns with info from other file based on condition - python

So there are 2 CSV files I'm working with:
file 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
file2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
Look at the 'City' in file 1, take the top 3 links for that city in file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2, KWR3 columns for all rows with the same 'City' value.
So it gets the top 3 and then just copies them to the right of all the same 'City' values.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the files in with pandas and all that, I just can't code this exact situation...

It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the first three values you actually need.
df = df.sort_values(by='reviews',ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the first three from every city.
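Then you need to label your data. There might be better ways to do this, but here is one: assign a new column that numbers the rows within each city, and create a user-defined function.
df['nums'] = df.groupby('location').cumcount()
Now you have a column that counts 0, 1, 2 within each city (kind of like line numbers per city). A plain running number over the whole frame would not work here, because after sorting by reviews the rows of different cities can be interleaved.
Next, create the function that will label your data: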
def my_func(index):
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
Then you can do:
# values='link' because the links (not the review counts) should end up in KWR1..KWR3
my_df = pd.pivot_table(df, values='link', index=['location'], columns='labels', aggfunc='first').reset_index()
Which literally pulls out the labels (pivots) and puts the values in to the right places.
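Putting the three steps together and copying the result back into file 1 might look like the sketch below. It is only a sketch under a few assumptions: the two small DataFrames stand in for the CSV files from the question, "top 3" is taken to mean the highest review counts, and the final merge step is my own addition (the answer above stops at the pivot).

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files in the question.
cities = pd.DataFrame({"City": ["Killeen", "Killeen", "Houston", "Whatever"]})
links = pd.DataFrame({
    "location": ["Killeen"] * 4,
    "link": ["www.example.com", "www.differentexample.com",
             "www.example3.com", "www.extraexample.com"],
    "reviews": [300, 200, 100, 20],
})

# Step 1: keep the three most-reviewed links per location.
top3 = (links.sort_values("reviews", ascending=False)
             .groupby("location").head(3).copy())

# Step 2: label them KWR1..KWR3 by their rank within each location.
top3["label"] = "KWR" + (top3.groupby("location").cumcount() + 1).astype(str)

# Step 3: one row per location, one column per label.
wide = top3.pivot(index="location", columns="label", values="link").reset_index()

# Step 4: copy the links next to every matching City (cities without links get NaN).
result = (cities.merge(wide, how="left", left_on="City", right_on="location")
                .drop(columns="location"))
print(result)
```

Every 'Killeen' row gets the same three links, and cities that do not appear in file 2 are left empty.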

Related

Python Panda's row selection

I've tried doing some searching, but I'm having troubles finding what I specifically need. I currently have this.
location = 'Location'
data = pd.read_csv('testbook.csv')
df = pd.DataFrame(data)
search = 'OR' # This will be replaced with an input
row = (df[df.eq(search).any(1)])
print(row)
Location = row.at[0, location]
print(Location)
This outputs this
row print out
Location City Price Etc
0 FL OR 50 123
Location print out
FL
This is the CSV information that it's pulling the data from.
My main question is about this specific line of code:
Location = row.at[0, location]
What I'm trying to find out is whether the values in the brackets [0, location] can be filled in automatically.
I want to automate this, since in the future I might need to look up 'OR1' instead of 'OR'. The issue is that the [0] is the row number, as you can see here (this is the entire df).
Location City Price Etc
0 FL OR 50 123
1 FL1 OR1 501 1231
2 FL2 OR2 502 1232
I would have to manually change the code every single time which of course is unfeasible with what I'm trying to accomplish.
My main question is, how do I pull specific row numbers all the way on the left and take that output and make it a variable that I can input anywhere?
I'm having a bit of trouble figuring out what you are looking for but this is my best guess
import pandas as pd
data = {'Location':['FL', 'FL1', 'FL2'],
'City': ['OR', 'OR1', 'OR2'],
'Price':[50, 501, 502],
'Etc': [123,1231,1232]}
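df = pd.DataFrame(data)
# Given search term -> find location
search = 'OR'
# The boolean mask selects the matching rows; .iloc[0] takes the first match by position,
# so the row label never has to be hard-coded.
# Outputs 'FL'
df.loc[df['City'] == search, 'Location'].iloc[0]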
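If what you actually want is the row number itself as a variable, something like the following might work. It is a minimal sketch: find_row_label is a name I made up, and it assumes the search term can sit in any column, as in the question's df.eq(search) approach.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["FL", "FL1", "FL2"],
    "City": ["OR", "OR1", "OR2"],
    "Price": [50, 501, 502],
    "Etc": [123, 1231, 1232],
})

def find_row_label(frame, search):
    """Return the index label of the first row that contains `search` in any column."""
    mask = frame.eq(search).any(axis=1)
    matches = frame.index[mask]
    if len(matches) == 0:
        raise KeyError(search)
    return matches[0]

row_label = find_row_label(df, "OR1")    # the number shown on the far left
location = df.at[row_label, "Location"]  # reuse it anywhere you need it
print(row_label, location)
```

The label comes straight from the index, so it keeps working if the index is not a simple 0, 1, 2 range.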

I need help formatting this data

I have data like this
id,phonenumbers,firstname,lastname,email,birthday,gender,locale,hometown,location,link
The problem is that some rows are not in this format, like this one:
000000,000000,name1,name2,email#email,1 1 1990,female,en_En,new york,USA ,new yourk,https://www.example.com
As you can see, in the "locale,hometown" part there are 3 commas. I want to delete one of them so the data becomes like this:
000000,000000,name1,name2,email#email,1 1 1990,female,en_En ,new york USA, new yourk,https://www.example.com
This is just an example of the problem; in my data there could be more than 3 commas and different addresses.
Essentially I want to load the data into Excel and have it show up clean, each column with the right data.
The problem is that a value is split into multiple columns when it should be in one column. If this can only happen in one column, and there is a fixed number of columns before and after it, then it's possible to fix it:
testdata = "000000,000000,name1,name2,email#email,1 1 1990,female,en_En,new york,USA ,new yourk,https://www.example.com"
def split(data, cols_before_addr=8, cols_after_addr=1):
    raw_cols = data.split(',')
    return raw_cols[:cols_before_addr] \
        + ["\n".join(raw_cols[cols_before_addr:-cols_after_addr])] \
        + raw_cols[-cols_after_addr:]
print(split(testdata))
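To get from there to something Excel can open, one option is to write the repaired rows back out with the csv module. This is a sketch with a couple of assumptions of mine: repair_row is a hypothetical variant of the function above, the pieces are joined with a space instead of a newline, and everything between the fixed columns (hometown and location together) ends up in a single cell.

```python
import csv
import io

def repair_row(line, cols_before=8, cols_after=1, joiner=" "):
    """Merge every field between the fixed leading and trailing columns into one."""
    parts = [p.strip() for p in line.split(",")]
    end = len(parts) - cols_after
    middle = joiner.join(parts[cols_before:end])
    return parts[:cols_before] + [middle] + parts[end:]

raw = ("000000,000000,name1,name2,email#email,1 1 1990,female,en_En,"
       "new york,USA ,new yourk,https://www.example.com")

fixed = repair_row(raw)

# Write to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerow(fixed)
print(buffer.getvalue())
```

Slicing up to an explicit end position instead of a negative index also keeps the function working when cols_after is 0.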

Identifying elements in a dataframe

I have a dictionary of dataframes called names_and_places in pandas that looks like the below.
names_and_places:
Alfred,,,
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
Brett,,,
Date,F_1,F_2,Key
4/1/2020,202,404,NAN
4/2/2020,101,401,NAN
4/3/2020,102,403,"[USA,CT, Fairfield, Stamford] "
Claire,,,
Date,F_1,F_2,Key
4/1/2020,NAN,12,NAN
4/2/2020,NAN,45,NAN
4/3/2020,7,78,"[USA,CT, Fairfield, Darian] "
Dane,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, New Haven] "
Edward,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, Milford] "
The key column is either going to be NAN or of the form [Country, State, County, City], but can be of length 3 or 4 elements (sometimes County is absent). I need to find all the elements with a given element that is contained in a key. For instance if the element = "CT", the script returns Edward, Brett, Dane and Claire (order is not important). If the element = "Stamford" then only Brett is returned. However I am going about the identification process in a way that seems very inefficient. I basically have variables that iterate through each possible combination of State, County, City (all of which I am currently manually inputting into variables) to identify which names to extract like below:
country = 'USA' #this never needs to change
element = 'CT'
#These next two are actually in .txt files that I create once I am asked for
#a given breakdown but I would like to not have to manually input these
middle_node = ['Fairfield','Bridgeport']
terminal_nodes = ['Stamford','Darian','New Haven','Milford']
names = []
for a in middle_node:
    for b in terminal_nodes:
        my_key = [country, element, a, b]
        for s in names_and_places:
            for z in names_and_places[s]['Key']:
                if my_key == z:
                    names.append(s)
                # Note: having "if my_key in names_and_places[s]['Key']" was causing
                # sporadic failures for some reason
display(names)
Output:
Edward, Brett, Dane, Claire
What I would like is to be able to input only the variable element and this can either be a level 2 (State), 3 (County), or 4 (City) node. However short of adding additional for loops and going into the Key column, I don't know how to do this. The one benefit (for a novice like myself) is that the double for loops allow me to keep bucketing intact and makes it easier for people to see where names are coming from when that is also needed.
But is there a better way? For bonus points: is there a way to handle the case where element is 'NY' and the values in the Key column can be like [USA, NY, NY, NY] or [USA, NY, NY, Queens]?
Edit: names_and_places is a dictionary with names as the index, so
display(names_and_places['Alfred'])
would be
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
I do have the raw dataframe that has columns:
Date, Field name, Value, Names,
Where Field Name is either F_1, F_2 or Key and Value is the associated value of that field. I then pivot the data on Name with columns of Field Name to make my extraction easier.
Here's a way to do that in a somewhat more effective way. You start by building a single dataframe out of the dictionary, and then do the actual work on that dataframe.
import numpy as np
import pandas as pd
single_df = pd.concat([df.assign(name=k) for k, df in names_and_places.items()])
single_df["Key"] = single_df.Key.replace("NAN", np.nan)
single_df.dropna(subset=["Key"], inplace=True)
# Since the location is a string, we have to parse it.
location_df = single_df.Key.str.replace(r"[\[\]]", "", regex=True).str.split(",", expand=True)
# Strip the stray spaces around each part (" Fairfield" -> "Fairfield").
location_df = location_df.apply(lambda col: col.str.strip())
location_df.columns = ["Country", "State", "County", "City"]
single_df = pd.concat([single_df, location_df], axis=1)
# this is where the actual query goes.
single_df[(single_df.Country == "USA") & (single_df.State == "CT")].name
The output is:
2 Brett
2 Claire
2 Dane
2 Edward
Name: name, dtype: object
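For the "input only element" part of the question, a plain-Python sketch like the one below might be enough. It assumes the Key values are strings such as "[USA,CT, Fairfield, Stamford] " (as in the sample), and parse_key and names_with_element are names I made up. Because it checks every level after the country, it does not care whether a key has 3 or 4 parts, and a key like [USA, NY, NY, NY] simply matches 'NY' once.

```python
import pandas as pd

def parse_key(key):
    """Turn "[USA,CT, Fairfield, Stamford] " into ["USA", "CT", "Fairfield", "Stamford"]."""
    if not isinstance(key, str) or key.strip().upper() == "NAN":
        return []
    return [part.strip() for part in key.strip().strip("[]").split(",")]

def names_with_element(frames, element):
    """Names whose Key column mentions `element` at the state, county or city level."""
    matches = []
    for name, frame in frames.items():
        for key in frame["Key"]:
            if element in parse_key(key)[1:]:  # skip the country
                matches.append(name)
                break                          # one hit per name is enough
    return matches

# Small stand-in for names_and_places, built from the sample in the question.
frames = {
    "Alfred": pd.DataFrame({"Key": ["NAN", "NAN", "[USA,NY,NY, NY]"]}),
    "Brett": pd.DataFrame({"Key": ["NAN", "NAN", "[USA,CT, Fairfield, Stamford] "]}),
    "Claire": pd.DataFrame({"Key": ["NAN", "NAN", "[USA,CT, Fairfield, Darian] "]}),
}

print(names_with_element(frames, "CT"))
print(names_with_element(frames, "Stamford"))
```

If you also need to know at which level the match happened, parse_key(key).index(element) gives the position of the first hit.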

Save each Excel-spreadsheet-row with header in separate .txt-file (saved as a parameter-sample to be read by simulation programs)

I'm a building energy simulation modeller with an Excel-question to enable automated large-scale simulations using parameter samples (samples generated using Monte Carlo). Now I have the following question in saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this or is it better to use some VBA-code? Anybody some idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question in Excel. Solutions in Python are also welcome if that's easier for you.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to make this as simple as possible, though I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install the openpyxl package, in addition to pandas, to read .xlsx files)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive one would be to loop over all items. Use string formatting to put the strings together the way you need them:
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open function. I'll just name the files by the index of the row, but this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique, like the date and time or the name of the file you read, or increment the file name further across multiple runs of the script.
All together we get:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    # One file per row: test_000.txt, test_001.txt, ...
    with open('test_{:03}.txt'.format(file_count), "w") as file:
        file.write(s)
    file_count += 1
Note that it's probably not the most elegant way and that there are one liners out there, but since you are not a programmer I thought you might prefer a more intuitive way, that you can tweak yourself easily.
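The idea of adding the date and time so that earlier runs are not overwritten could be sketched like this. The function name write_samples and the samples_... folder naming are my own assumptions, and the files are numbered 1.txt, 2.txt, ... as in the question.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

def write_samples(df, base_dir="."):
    """Write one 'header=value;' file per row into a fresh, timestamped folder."""
    run_dir = Path(base_dir) / datetime.now().strftime("samples_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, record in enumerate(df.to_dict("records"), start=1):
        text = "".join("{}={};\n".format(col, record[col]) for col in df.columns)
        path = run_dir / "{}.txt".format(number)
        path.write_text(text)
        paths.append(path)
    return paths

# Demo with the first two rows from the question, written to a temporary folder.
demo = pd.DataFrame({"a": [2, 6], "b": [3, 7], "c": [5, 9], "d": [7, 1]})
written = write_samples(demo, tempfile.mkdtemp())
first_text = written[0].read_text()
print(first_text)
```

Each run of the script then gets its own folder, so the simulation program can be pointed at one run at a time.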
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long
    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x
    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file then this python script will do what you want.
with open('data.csv') as file:
    data_list = [l.rstrip('\n').split(',') for l in file]
header = data_list[0]
# Row 1 goes to 1.txt, row 2 to 2.txt, and so on
for counter, row in enumerate(data_list[1:], start=1):
    output_file_name = str(counter) + '.txt'
    with open(output_file_name, 'w') as out_file:
        # Pair every header with the value underneath it
        for name, value in zip(header, row):
            out_file.write(name + '=' + value + ';\n')

How to iterate over a data frame

I have a dataset of users, books and ratings. I want to find the users who rated a particular book highly, and then find what other books those users liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7 and then, for those users, find the other books they liked too from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe 'User-ID' is the name of the index, you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError. It is not a column.
Moreover, ix is deprecated (it was removed in pandas 1.0), so it is better to use loc in your case. Pass the label with the same type as the values in the ISBN column: your sample shows leading zeros, which suggests the ISBNs are strings, so the quotes should stay. If the column had been read as integers, you would pass an integer instead.
Try it like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc['0452264464'] # used another ISBN from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the IDs you want.
Using your small sample above and the ISBN I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back in the columns.
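For the second half of the question (what else did those users like?), one possible approach is to skip the pivot and filter the long table directly. This is only a sketch: also_liked is a made-up name, the ISBNs "A", "B", "C" are placeholders, and "liked" is assumed to mean a rating above 7, as in the question.

```python
import pandas as pd

# Tiny made-up ratings table with the same columns as in the question.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "A", "B"],
    "Book-Rating": [9, 8, 8, 10, 5, 9],
})

def also_liked(df, isbn, threshold=7):
    """Count, per other book, how many fans of `isbn` also rated it above `threshold`."""
    liked = df[df["Book-Rating"] > threshold]
    fans = liked.loc[liked["ISBN"] == isbn, "User-ID"]
    others = liked[liked["User-ID"].isin(fans) & (liked["ISBN"] != isbn)]
    return others.groupby("ISBN")["User-ID"].nunique().sort_values(ascending=False)

result = also_liked(ratings, "A")
print(result)
```

User 3 rated "A" only 5, so their high rating for "B" is not counted.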
