Yelp: Reverse geocoding businesses to extract detailed location information
I’ve been playing around with the Yelp Open Dataset and wanted to extract more detailed location information for each business.
This is an example of the JSON representation of one business:
    $ cat dataset/business.json | head -n1 | jq
    {
      "business_id": "FYWN1wneV18bWNgQjJ2GNg",
      "name": "Dental by Design",
      "neighborhood": "",
      "address": "4855 E Warner Rd, Ste B9",
      "city": "Ahwatukee",
      "state": "AZ",
      "postal_code": "85044",
      "latitude": 33.3306902,
      "longitude": -111.9785992,
      "stars": 4,
      "review_count": 22,
      "is_open": 1,
      "attributes": {
        "AcceptsInsurance": true,
        "ByAppointmentOnly": true,
        "BusinessAcceptsCreditCards": true
      },
      "categories": [
        "Dentists",
        "General Dentistry",
        "Health & Medical",
        "Oral Surgeons",
        "Cosmetic Dentists",
        "Orthodontists"
      ],
      "hours": {
        "Friday": "7:30-17:00",
        "Tuesday": "7:30-17:00",
        "Thursday": "7:30-17:00",
        "Wednesday": "7:30-17:00",
        "Monday": "7:30-17:00"
      }
    }
The businesses reside in different countries so I wanted to extract the area/county/state and the country for each of them. I found the reverse-geocoder library which is perfect for this problem.
You give the library a lat/long or a list of lat/longs, and it returns a list containing, for each point, the lat/long of the nearest place it knows about along with that place's name, admin regions, and country code. Passing in a list of lat/longs is way quicker than calling the function once per lat/long, so that's what we'll do.
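To make the "nearest place" idea concrete, here is a minimal sketch of that kind of lookup. It is not the library's implementation: it does a brute-force haversine comparison against a hypothetical three-entry place list, whose coordinates are the matched points shown in the output later in this post. The names `PLACES`, `haversine_km`, and `nearest_place` are made up for illustration.

```python
import math

# Hypothetical mini gazetteer. The coordinates are the matched points that
# appear in the reverse-geocoder output later in this post.
PLACES = [
    {"name": "Guadalupe", "lat": 33.37088, "lon": -111.96292, "cc": "US"},
    {"name": "Thompsonville", "lat": 40.2909, "lon": -80.10811, "cc": "US"},
    {"name": "Glendale", "lat": 33.53865, "lon": -112.18599, "cc": "US"},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_place(lat, lon, places=PLACES):
    """Return the entry in `places` closest to the given point (brute force)."""
    return min(places, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))


if __name__ == "__main__":
    # The dentist from the JSON example above.
    print(nearest_place(33.3306902, -111.9785992)["name"])  # Guadalupe
```

A linear scan like this is fine for three places but would be slow against a full place database, which is one reason to lean on a dedicated library and hand it all the points in one call.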
We can write the following code to extract location information for a list of lat/longs:
    import reverse_geocoder as rg

    lat_longs = {
        "FYWN1wneV18bWNgQjJ2GNg": (33.3306902, -111.9785992),
        "He-G7vWjzVUysIKrfNbPUQ": (40.2916853, -80.1048999),
        "KQPW8lFf1y5BT2MxiSZ3QA": (33.5249025, -112.1153098)
    }

    business_ids = list(lat_longs.keys())
    locations = rg.search(list(lat_longs.values()))

    for business_id, location in zip(business_ids, locations):
        print(business_id, lat_longs[business_id], location)
This is the output we get from running the script:
    $ python blog.py
    Loading formatted geocoded file...
    FYWN1wneV18bWNgQjJ2GNg (33.3306902, -111.9785992) OrderedDict([('lat', '33.37088'), ('lon', '-111.96292'), ('name', 'Guadalupe'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
    He-G7vWjzVUysIKrfNbPUQ (40.2916853, -80.1048999) OrderedDict([('lat', '40.2909'), ('lon', '-80.10811'), ('name', 'Thompsonville'), ('admin1', 'Pennsylvania'), ('admin2', 'Washington County'), ('cc', 'US')])
    KQPW8lFf1y5BT2MxiSZ3QA (33.5249025, -112.1153098) OrderedDict([('lat', '33.53865'), ('lon', '-112.18599'), ('name', 'Glendale'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
It seems to work fairly well! Now we just need to tweak our script to read in the values from the Yelp JSON file and generate a new JSON file containing the locations:
    import json
    import reverse_geocoder as rg

    # Collect the coordinates and city of every business that has both values.
    lat_longs = {}
    with open("dataset/business.json") as business_json:
        for line in business_json:
            item = json.loads(line)
            # Compare against None so that a coordinate of 0.0 is not skipped.
            if item["latitude"] is not None and item["longitude"] is not None:
                lat_longs[item["business_id"]] = {
                    "lat_long": (item["latitude"], item["longitude"]),
                    "city": item["city"]
                }

    # Reverse geocode all coordinates in one batch call.
    business_ids = list(lat_longs.keys())
    locations = rg.search([value["lat_long"] for value in lat_longs.values()])

    # Combine the geocoded fields with the city from the Yelp record.
    result = {}
    for business_id, location in zip(business_ids, locations):
        result[business_id] = {
            "country": location["cc"],
            "name": location["name"],
            "admin1": location["admin1"],
            "admin2": location["admin2"],
            "city": lat_longs[business_id]["city"]
        }

    with open("dataset/businessLocations.json", "w") as business_locations_json:
        json.dump(result, business_locations_json, indent=4, sort_keys=True)
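As a quick sanity check on the generated file, you could count how many businesses landed in each country. The sketch below assumes the output structure produced by the script above (business id mapped to a dict with a `country` key). The helper names `load_locations` and `count_by_country` are hypothetical, and the second sample entry is invented purely to show a second country.

```python
import json
from collections import Counter


def load_locations(path="dataset/businessLocations.json"):
    """Load the business id -> location mapping written by the script above."""
    with open(path) as locations_json:
        return json.load(locations_json)


def count_by_country(locations):
    """Count businesses per country code."""
    return Counter(entry["country"] for entry in locations.values())


# In-memory sample with the same shape as the generated file.
sample = {
    # Values taken from the examples earlier in this post.
    "FYWN1wneV18bWNgQjJ2GNg": {
        "country": "US",
        "name": "Guadalupe",
        "admin1": "Arizona",
        "admin2": "Maricopa County",
        "city": "Ahwatukee",
    },
    # Hypothetical entry, invented for illustration only.
    "hypothetical-business-id": {
        "country": "CA",
        "name": "Example Town",
        "admin1": "Example Province",
        "admin2": "",
        "city": "Example City",
    },
}

if __name__ == "__main__":
    for country, count in count_by_country(sample).most_common():
        print(country, count)
```

Against the real file you would call `count_by_country(load_locations())` instead of passing in the sample.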
And that’s it!
Published on Web Code Geeks with permission by Mark Needham, partner at our WCG program. See the original article here: Yelp: Reverse geocoding businesses to extract detailed location information