---
title: PostGIS and the Hard Rock Cafe collection
date: 2024-05-27T18:00:00+02:00
categories:
- postgresql
---
In 2007, I went to the west coast of the United States, where I visited a Hard
Rock Cafe store in Hollywood, CA and bought the first T-Shirt of my collection.
Now, I'm an open-source DBA and my number one database is
[PostgreSQL](https://www.postgresql.org/). So I've decided to use the
[PostGIS](https://postgis.net/) extension to plan my next vacations or
conferences around buying more T-Shirts.
# The collection
## Inventory
Let's do the inventory of my collected T-Shirts by looking at my closet.
![](/hrc/t-shirts.jpg)
Each title represents a location, generally a city, but sometimes a well-known
place like a stadium. This is enough to find the related shop in a database.
Titles can be written into a file:
```
Dublin
Los Angeles
Prague
London
```
([collection.csv](/hrc/collection.csv))
## Coordinates
The next step is to add coordinates. This information can be found by querying
the [Nominatim](https://nominatim.org/) API based on the
[OpenStreetMap](https://www.openstreetmap.org) community-driven project.
```python
#!/usr/bin/env python3
import csv
import time

import requests

if __name__ == "__main__":
    headers = {"User-Agent": "Hard Rock Cafe Blog Post From Julien Riou"}
    session = requests.Session()
    session.headers.update(headers)
    with open("collection_with_coordinates.csv", "w") as dest:
        writer = csv.writer(dest)
        with open("collection.csv", "r") as source:
            for row in csv.reader(source):
                name = row[0]
                r = session.get(
                    "https://nominatim.openstreetmap.org/search",
                    params={"q": name, "limit": 1, "format": "json"},
                )
                time.sleep(1)  # respect the Nominatim usage policy
                r.raise_for_status()
                data = r.json()
                if len(data) == 1:
                    data = data[0]
                    writer.writerow([name, data["lat"], data["lon"]])
                else:
                    print(f"Location not found for {name}, skipping")
```
([collection.py](/hrc/collection.py))
The Python script iterates over the `collection.csv` file, queries the
Nominatim API to find the most relevant OpenStreetMap node, then writes the
name and coordinates to the `collection_with_coordinates.csv` file using the
comma-separated values (CSV) format.
```
Dublin,53.3493795,-6.2605593
Los Angeles,34.0536909,-118.242766
Prague,50.0596288,14.446459273258009
London,51.4893335,-0.14405508452768728
```
([collection_with_coordinates.csv](/hrc/collection_with_coordinates.csv))
# The shops (or "nodes")
Now we need a complete list of Hard Rock Cafe locations with their coordinates
to match the collection.
## OpenStreetMap
My first idea was to use OpenStreetMap, which should provide the needed
dataset. I tried the Nominatim API, but queries are [limited to 40
results](https://nominatim.org/release-docs/latest/api/Search/#parameters). I
could have [downloaded the entire
dataset](https://wiki.openstreetmap.org/wiki/Downloading_data) and imported it
into a local [PostgreSQL instance](https://osm2pgsql.org/), but that would have
been space- and time-consuming. So I used the [Overpass
API](https://wiki.openstreetmap.org/wiki/Overpass_API) with [this
query](https://overpass-turbo.eu/s/1LwQ) (thanks
[Andreas](https://andreas.scherbaum.la/)). In the end, the quality of the data
was not satisfying. The amenity could be restaurant, bar, pub, cafe, nightclub
or shop. The name was sometimes written with an accent ("é") and sometimes
without ("e"). Sometimes the brand was not reported at all. Even with all those
filters, there was a [node](https://www.openstreetmap.org/node/6260098710) that
was not even a Hard Rock Cafe. The more the query grew, the more I wanted to
use another method.
## Website
I decided to parse the official website instead. Should I use a well-known
library like [Selenium](https://selenium-python.readthedocs.io/) or
[ferret](https://www.montferret.dev/)? Given the personal time I had for this
project, I chose the quick and dirty path. Let me present the ugly but
functional one-liner that parses the official Hard Rock Cafe website:
```
curl -sL https://cafe.hardrock.com/locations.aspx | \
grep 'var currentMapPoint=' | \
sed "s/.*{'title':/{'title':/g;s/,'description.*/}/g;s/'/\"/g" | \
sed 's/{"title"://g;s/"lat":/"/g;s/,"lng":/","/g;s/}/"/g' \
> nodes.csv
```
Very ugly, not future-proof, but it did the job.
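If the one-liner ever needs to grow, the same extraction can be sketched in Python with a regular expression. This is a hedged sketch: the `currentMapPoint` object layout (`'title'`, `'lat'`, `'lng'` keys) is assumed from the sed pipeline above, and may break whenever the page changes.

```python
#!/usr/bin/env python3
# Hedged sketch: a slightly more robust alternative to the sed pipeline above.
# The JS object layout ({'title':...,'lat':...,'lng':...}) is assumed from the
# one-liner; the real page markup may differ.
import csv
import io
import re

POINT_RE = re.compile(
    r"var currentMapPoint=\{'title':'(?P<title>[^']*)'"
    r",'lat':(?P<lat>-?[0-9.]+),'lng':(?P<lng>-?[0-9.]+)"
)

def parse_nodes(html):
    """Extract (title, lat, lng) tuples from the locations page source."""
    return [(m["title"], m["lat"], m["lng"]) for m in POINT_RE.finditer(html)]

def to_csv(rows):
    """Render rows in the same quoted CSV shape as nodes.csv."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
    return buf.getvalue()
```

Feeding `parse_nodes` the page source (for example `requests.get("https://cafe.hardrock.com/locations.aspx").text`) should yield the same rows as `nodes.csv`.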
```
"Hard Rock Cafe Amsterdam","52.36211","4.88298"
"Hard Rock Cafe Andorra","42.507707","1.531977"
"Hard Rock Cafe Angkor","13.35314","103.85676"
"Hard Rock Cafe Asuncion","-25.2896910","-57.5737599"
```
([nodes.csv](/hrc/nodes.csv))
# Data exploration
The tool of choice to import and analyze this data is PostgreSQL and its
PostGIS extension. I've used [Docker](https://www.docker.com/) to have a
disposable local instance to perform quick analysis.
```
docker run -d --name postgres -v "$(pwd):/mnt:ro" \
-e POSTGRES_USER=hrc -e POSTGRES_PASSWORD=insecurepassword \
-e POSTGRES_DB=hrc \
postgis/postgis:16-3.4
docker exec -ti postgres psql -U hrc -W
```
The instance is now started and we are connected.
## Import
The [COPY](https://www.postgresql.org/docs/current/sql-copy.html) command in
PostgreSQL imports CSV lines easily. We'll use the psql meta-command (`\copy`)
to send the data directly through the client.
```
create table collection (
name text primary key,
lat numeric,
lon numeric
);
\copy collection (name, lat, lon) from '/mnt/collection_with_coordinates.csv' csv;
create table nodes (
name text primary key,
lat numeric,
lon numeric
);
\copy nodes (name, lat, lon) from '/mnt/nodes.csv' delimiter ',' csv;
```
## Correlation
The SQL query takes all the rows from the `collection` table and, for each one,
tries to find a row in the `nodes` table within 50 km based on coordinates.
```
select c.name as tshirt, n.name as restaurant,
round((ST_Distance(ST_Point(c.lon, c.lat), ST_Point(n.lon, n.lat), true)/1000)::numeric, 2)
as distance_km
from collection c
left join nodes n
on ST_DWithin(ST_Point(c.lon, c.lat), ST_Point(n.lon, n.lat), 50000, true)
order by c.name, distance_km;
```
The PostGIS functions used are:
* [ST_Point](https://postgis.net/docs/ST_Point.html) to create a point in space with coordinates
* [ST_Distance](https://postgis.net/docs/ST_Distance.html) to compute the distance between two points
* [ST_DWithin](https://postgis.net/docs/ST_DWithin.html) to keep only rows where the distance is less than or equal to the provided value
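As a quick sanity check on those geodesic distances, here is a hedged haversine sketch in Python. It assumes a spherical Earth (radius 6371 km), so values differ slightly from the spheroid computation PostGIS performs when the last argument of `ST_Distance` is `true`:

```python
# Hedged sanity check: great-circle distance on a spherical Earth.
# PostGIS uses a spheroid when use_spheroid is true, so expect small
# differences from the distance_km column in the query result.
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Distance in kilometres between two (lon, lat) points, haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))
```

For the Dublin and London rows of `collection_with_coordinates.csv`, this returns roughly 463 km, in the same ballpark as the spheroid distance.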
Result:
```
tshirt | restaurant | distance_km
------------------+--------------------------------------------+-------------
Amsterdam | Hard Rock Cafe Amsterdam | 1.38
Angkor | Hard Rock Cafe Phnom Penh | 0.75
Antwerp | Hard Rock Cafe Brussels | 41.84
Barcelona | Hard Rock Cafe Barcelona | 0.64
Berlin | Hard Rock Cafe Berlin | 4.32
Boston | (null) | (null)
Brussels | Hard Rock Cafe Brussels | 0.10
Detroit | (null) | (null)
Dublin | Hard Rock Cafe Dublin | 0.40
Hamburg | Hard Rock Cafe Hamburg | 2.36
Ho Chi Minh City | (null) | (null)
Hollywood | Hard Rock Cafe Hollywood on Hollywood Blvd | 1.07
Lisbon | Hard Rock Cafe Lisbon | 1.07
London | Hard Rock Cafe London | 1.66
London | Hard Rock Cafe London Piccadilly Circus | 2.40
Los Angeles | Hard Rock Cafe Hollywood on Hollywood Blvd | 10.46
Miami | Hard Rock Cafe Miami | 0.97
Miami | Hard Rock Cafe Hollywood FL | 30.83
Miami Gardens | Hard Rock Cafe Hollywood FL | 12.62
Miami Gardens | Hard Rock Cafe Miami | 19.26
New York | Hard Rock Cafe New York Times Square | 5.18
New York | Hard Rock Cafe Yankee Stadium | 14.53
Orlando | Hard Rock Cafe Orlando | 11.50
Oslo | (null) | (null)
Paris | Hard Rock Cafe Paris | 2.14
Prague | Hard Rock Cafe Prague | 3.59
San Francisco | Hard Rock Cafe San Francisco | 3.36
Singapore | Hard Rock Cafe Singapore | 5.74
Singapore | Hard Rock Cafe Changi Airport Singapore | 18.88
Singapore | Hard Rock Cafe Puteri Harbour | 19.33
Yankee Stadium | Hard Rock Cafe Yankee Stadium | 0.14
Yankee Stadium | Hard Rock Cafe New York Times Square | 9.53
```
We can identify multiple patterns here:
* exact match
* closed restaurants (Boston, Detroit, Ho Chi Minh City, Oslo)
* multiple restaurants (Miami, Miami Gardens, Singapore, New York, Yankee Stadium)
* multiple T-Shirts (Miami, Los Angeles)
* wrong match (Angkor, Antwerp)
* missed opportunities (Hollywood FL, Piccadilly Circus)
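For a larger collection, this triage can be scripted. A hedged sketch, where the row tuples only mirror a few lines of the query result above and `None` stands for the `(null)` values of the left join:

```python
# Hedged sketch: bucket each t-shirt from the correlation result into
# match patterns. Rows are (tshirt, restaurant, distance_km) tuples as
# returned by the query above.
def classify(rows):
    matches = {}
    for tshirt, restaurant, _distance in rows:
        matches.setdefault(tshirt, [])
        if restaurant is not None:
            matches[tshirt].append(restaurant)
    return {
        tshirt: "no match (closed?)" if not found
        else "single match" if len(found) == 1
        else "multiple matches"
        for tshirt, found in matches.items()
    }
```

Wrong matches like Angkor still show up as a "single match" here, so the distance column remains worth a manual look.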
I've created a [script](/hrc/update.sql) that updates the names in the
`collection` table to match the names in the `nodes` table, so they can be
joined by name instead of by location.
# Next locations
The last step of the exploration is to find Hard Rock Cafe locations within a
reasonable distance from home (1,000 km). As I don't want to disclose my exact
position, we'll search for "Belgium". The country is quite small, so that
should not be an issue.
```
$ curl -A "Hard Rock Cafe Blog Post From Julien Riou" \
-sL "https://nominatim.openstreetmap.org/search?limit=1&format=json&q=Belgium" | \
jq -r '.[0].name,.[0].lon,.[0].lat'
België / Belgique / Belgien
4.6667145
50.6402809
```
The query to search for shops that I haven't already visited looks like this:
```
select n.name,
round((ST_Distance(ST_Point(n.lon, n.lat), ST_Point(4.6667145, 50.6402809), true)/1000)::numeric, 2)
as distance_km
from nodes n
left join collection c
on n.name = c.name
where c.name is null
and ST_Distance(ST_Point(n.lon, n.lat), ST_Point(4.6667145, 50.6402809), true)/1000 < 1000
order by distance_km;
```
The final result:
```
name | distance_km
-----------------------------------------+-------------
Hard Rock Cafe Cologne | 164.87
Hard Rock Cafe London Piccadilly Circus | 349.99
Hard Rock Cafe Manchester | 569.37
Hard Rock Cafe Munich | 573.42
Hard Rock Cafe Innsbruck | 618.90
Hard Rock Cafe Newcastle | 640.64
Hard Rock Cafe Milan | 666.41
Hard Rock Cafe Copenhagen | 769.51
Hard Rock Cafe Edinburgh | 789.31
Hard Rock Cafe Venice | 812.96
Hard Rock Cafe Wroclaw | 870.89
Hard Rock Cafe Vienna | 890.23
Hard Rock Cafe Florence | 911.37
Hard Rock Cafe Gothenburg | 918.32
Hard Rock Cafe Andorra | 935.20
```
# Conclusion
According to this study, my next vacations or conferences should take place in
Germany or the UK. A perfect opportunity to go to
[PGConf.DE](https://2024.pgconf.de/) and [PGDay UK](https://pgday.uk/)!