[enh]: make last_x_days generic
add mls_only
make radius generic
cullenwatson committed Oct 4, 2023
1 parent 51bde20 commit c487067
Showing 9 changed files with 218 additions and 199 deletions.
95 changes: 55 additions & 40 deletions README.md
@@ -36,13 +36,13 @@ pip install homeharvest
### CLI

```
usage: homeharvest [-h] [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] location
usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] location
Home Harvest Property Scraper
positional arguments:
location Location to scrape (e.g., San Francisco, CA)
options:
-l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
Listing type to scrape
@@ -54,7 +54,8 @@ options:
Proxy to use for scraping
-d DAYS, --days DAYS Sold in last _ days filter.
-r RADIUS, --radius RADIUS
Get comparable properties within _ (e.g. 0.0) miles. Only applicable for individual addresses.
-m, --mls_only If set, fetches only MLS listings.
```
```bash
> homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
@@ -73,9 +74,14 @@ filename = f"output/{current_timestamp}.csv"
properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # for_sale, for_rent
last_x_days=30, # sold/listed in last 30 days
mls_only=True, # only fetch MLS listings
)
print(f"Number of properties: {len(properties)}")

# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())
```


@@ -94,12 +100,23 @@ properties.to_csv(filename, index=False)
### Parameters for `scrape_property()`
```
Required
├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
└── listing_type (enum): for_rent, for_sale, sold
├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
└── listing_type (option): Choose the type of listing.
- 'for_rent'
- 'for_sale'
- 'sold'
Optional
├── radius_for_comps (float): Radius in miles to find comparable properties based on individual addresses.
├── sold_last_x_days (int): Number of past days to filter sold properties.
├── proxy (str): in format 'http://user:pass@host:port'
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
├── last_x_days (integer): Number of past days to filter properties. Utilizes 'COEDate' for 'sold' listing types, and 'Lst Date' for others (for_rent, for_sale).
│ Example: 30 (fetches properties listed/sold in the last 30 days)
├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
└── proxy (string): In format 'http://user:pass@host:port'
```
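The `listing_type` option is validated before any request is made. A minimal, self-contained sketch of that check (mirroring the `validate_input` helper referenced in the diff further down; the enum is reproduced locally and `ValueError` stands in for the library's `InvalidListingType`):

```python
from enum import Enum

class ListingType(Enum):
    FOR_SALE = "for_sale"
    FOR_RENT = "for_rent"
    SOLD = "sold"

def validate_input(listing_type: str) -> None:
    # Reject anything that is not a member of the enum (case-insensitive)
    if listing_type.upper() not in ListingType.__members__:
        raise ValueError(f"Provided listing type, '{listing_type}', does not exist.")

validate_input("sold")  # passes silently
```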
### Property Schema
```plaintext
@@ -111,59 +128,57 @@ Property
│ └── status (str)
├── Address Details:
│ ├── street (str)
│ ├── unit (str)
│ ├── city (str)
│ ├── state (str)
│ └── zip (str)
│ ├── street
│ ├── unit
│ ├── city
│ ├── state
│ └── zip
├── Property Description:
│ ├── style (str)
│ ├── beds (int)
│ ├── baths_full (int)
│ ├── baths_half (int)
│ ├── sqft (int)
│ ├── lot_sqft (int)
│ ├── sold_price (int)
│ ├── year_built (int)
│ ├── garage (float)
│ └── stories (int)
│ ├── style
│ ├── beds
│ ├── baths_full
│ ├── baths_half
│ ├── sqft
│ ├── lot_sqft
│ ├── sold_price
│ ├── year_built
│ ├── garage
│ └── stories
├── Property Listing Details:
│ ├── list_price (int)
│ ├── list_date (str)
│ ├── last_sold_date (str)
│ ├── prc_sqft (int)
│ └── hoa_fee (int)
│ ├── list_price
│ ├── list_date
│ ├── last_sold_date
│ ├── prc_sqft
│ └── hoa_fee
├── Location Details:
│ ├── latitude (float)
│ ├── longitude (float)
│ └── neighborhoods (str)
│ ├── latitude
│ ├── longitude
│ └── neighborhoods
```
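Each scraped row carries the fields above as DataFrame columns, so post-processing is plain pandas. The rows below are made up to keep the snippet self-contained:

```python
import pandas as pd

# Hypothetical rows shaped like the schema above
df = pd.DataFrame([
    {"street": "1 Main St", "beds": 3, "sqft": 1400, "sold_price": 550000},
    {"street": "2 Oak Ave", "beds": 2, "sqft": 900,  "sold_price": 410000},
])

# e.g. keep 3+ bed homes and derive price per square foot
comps = df[df["beds"] >= 3].assign(prc_sqft=lambda d: d["sold_price"] / d["sqft"])
print(comps[["street", "prc_sqft"]])
```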
## Supported Countries for Property Scraping

* **Realtor.com**: mainly from the **US** but also has international listings

### Exceptions
The following exceptions may be raised when using HomeHarvest:

- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
- `NoResultsFound` - no properties found from your input

- `NoResultsFound` - no properties found from your search
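A defensive caller might treat `NoResultsFound` as "empty result" rather than a crash. The exception classes are stubbed here so the sketch runs standalone; in real use they come from `homeharvest.exceptions` as shown in the diff below:

```python
# Stand-ins for homeharvest.exceptions
class InvalidListingType(Exception): pass
class NoResultsFound(Exception): pass

def run_search(scrape_fn, location, listing_type):
    """Return scrape_fn's result, or None when the query matches nothing."""
    try:
        return scrape_fn(location=location, listing_type=listing_type)
    except NoResultsFound:
        return None

def empty_scraper(**kwargs):
    raise NoResultsFound("no results found for the query")

result = run_search(empty_scraper, "90210", "sold")
print(result)  # → None
```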


## Frequently Asked Questions
---

**Q: Encountering issues with your searches?**
**A:** Try to broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
**A:** Try to broaden the parameters you're using. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).

---

**Q: Received a Forbidden 403 response code?**
**A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:

- Waiting a few seconds between requests.
- Trying a VPN to change your IP address.
- Trying a VPN, or passing a proxy as a parameter to `scrape_property()`, to change your IP address.

---

5 changes: 1 addition & 4 deletions examples/HomeHarvest_Demo.ipynb
@@ -31,7 +31,7 @@
"metadata": {},
"outputs": [],
"source": [
"# scrapes all 3 sites by default\n",
"# check for sale properties\n",
"scrape_property(\n",
" location=\"dallas\",\n",
" listing_type=\"for_sale\"\n",
@@ -53,7 +53,6 @@
"# search a specific address\n",
"scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n",
" site_name=\"zillow\",\n",
" listing_type=\"for_sale\"\n",
")"
]
@@ -68,7 +67,6 @@
"# check rentals\n",
"scrape_property(\n",
" location=\"chicago, illinois\",\n",
" site_name=[\"redfin\", \"zillow\"],\n",
" listing_type=\"for_rent\"\n",
")"
]
@@ -88,7 +86,6 @@
"# check sold properties\n",
"scrape_property(\n",
" location=\"90210\",\n",
" site_name=[\"redfin\"],\n",
" listing_type=\"sold\"\n",
")"
]
18 changes: 18 additions & 0 deletions examples/HomeHarvest_Demo.py
@@ -0,0 +1,18 @@
from homeharvest import scrape_property
from datetime import datetime

# Generate filename based on current timestamp
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"output/{current_timestamp}.csv"

properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # for_sale, for_rent
last_x_days=30, # sold/listed in last 30 days
mls_only=True, # only fetch MLS listings
)
print(f"Number of properties: {len(properties)}")

# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())
104 changes: 21 additions & 83 deletions homeharvest/__init__.py
@@ -1,103 +1,41 @@
import warnings
import pandas as pd
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

from .core.scrapers import ScraperInput
from .utils import process_result, ordered_properties
from .utils import process_result, ordered_properties, validate_input
from .core.scrapers.realtor import RealtorScraper
from .core.scrapers.models import ListingType, Property, SiteName
from .exceptions import InvalidListingType


_scrapers = {
"realtor.com": RealtorScraper,
}


def _validate_input(listing_type: str) -> None:
if listing_type.upper() not in ListingType.__members__:
raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
from .core.scrapers.models import ListingType
from .exceptions import InvalidListingType, NoResultsFound


def _scrape_single_site(location: str, site_name: str, listing_type: str, radius: float, proxy: str = None, sold_last_x_days: int = None) -> pd.DataFrame:
def scrape_property(
location: str,
listing_type: str = "for_sale",
radius: float = None,
mls_only: bool = False,
last_x_days: int = None,
proxy: str = None,
) -> pd.DataFrame:
"""
Helper function to scrape a single site.
Scrape properties from Realtor.com based on a given location and listing type.
"""
_validate_input(listing_type)
validate_input(listing_type)

scraper_input = ScraperInput(
location=location,
listing_type=ListingType[listing_type.upper()],
site_name=SiteName.get_by_value(site_name.lower()),
proxy=proxy,
radius=radius,
sold_last_x_days=sold_last_x_days
mls_only=mls_only,
last_x_days=last_x_days,
)

site = _scrapers[site_name.lower()](scraper_input)
site = RealtorScraper(scraper_input)
results = site.search()
print(f"found {len(results)}")

properties_dfs = [process_result(result) for result in results]
if not properties_dfs:
return pd.DataFrame()

return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]


def scrape_property(
location: str,
listing_type: str = "for_sale",
radius: float = None,
sold_last_x_days: int = None,
proxy: str = None,
) -> pd.DataFrame:
"""
Scrape properties from Realtor.com based on a given location and listing type.
:param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way')
:param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold'). Default is 'for_sale'.
:param radius: Radius in miles to find comparable properties on individual addresses. Optional.
:param sold_last_x_days: Number of past days to filter sold properties. Optional.
:param proxy: Proxy IP address to be used for scraping. Optional.
:returns: pd.DataFrame containing properties
"""
site_name = "realtor.com"

if site_name is None:
site_name = list(_scrapers.keys())

if not isinstance(site_name, list):
site_name = [site_name]

results = []

if len(site_name) == 1:
final_df = _scrape_single_site(location, site_name[0], listing_type, radius, proxy, sold_last_x_days)
results.append(final_df)
else:
with ThreadPoolExecutor() as executor:
futures = {
executor.submit(_scrape_single_site, location, s_name, listing_type, radius, proxy, sold_last_x_days): s_name
for s_name in site_name
}

for future in concurrent.futures.as_completed(futures):
result = future.result()
results.append(result)

results = [df for df in results if not df.empty and not df.isna().all().all()]

if not results:
return pd.DataFrame()

final_df = pd.concat(results, ignore_index=True)

columns_to_track = ["Street", "Unit", "Zip"]

#: validate they exist, otherwise create them
for col in columns_to_track:
if col not in final_df.columns:
final_df[col] = None
raise NoResultsFound("no results found for the query")

return final_df
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=FutureWarning)
return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]
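The refactored tail of `scrape_property` concatenates the per-result frames in a fixed column order while silencing pandas' `FutureWarning` about empty or all-NA entries. A self-contained sketch of that pattern (column list abbreviated, data invented):

```python
import warnings
import pandas as pd

ordered_properties = ["street", "beds", "list_price"]  # abbreviated column order

properties_dfs = [
    pd.DataFrame({"street": ["1 Main St"], "beds": [3], "list_price": [550000]}),
    pd.DataFrame({"street": ["2 Oak Ave"], "beds": [2], "list_price": [410000]}),
]

with warnings.catch_warnings():
    # pd.concat warns about empty/all-NA frames on newer pandas; suppress it
    warnings.simplefilter("ignore", category=FutureWarning)
    final_df = pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]

print(len(final_df))  # → 2
```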
36 changes: 29 additions & 7 deletions homeharvest/cli.py
@@ -5,7 +5,9 @@

def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
parser.add_argument(
"location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
)

parser.add_argument(
"-l",
@@ -33,21 +35,41 @@ def main():
help="Name of the output file (without extension)",
)

parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
parser.add_argument("-d", "--days", type=int, default=None, help="Sold in last _ days filter.")
parser.add_argument(
"-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
)
parser.add_argument(
"-d",
"--days",
type=int,
default=None,
help="Sold/listed in last _ days filter.",
)

parser.add_argument(
"-r",
"--sold-properties-radius",
dest="sold_properties_radius", # This makes sure the parsed argument is stored as radius_for_comps in args
"--radius",
type=float,
default=None,
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses."
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.",
)
parser.add_argument(
"-m",
"--mls_only",
action="store_true",
help="If set, fetches only MLS listings.",
)

args = parser.parse_args()

result = scrape_property(args.location, args.listing_type, radius_for_comps=args.radius_for_comps, proxy=args.proxy)
result = scrape_property(
args.location,
args.listing_type,
radius=args.radius,
proxy=args.proxy,
mls_only=args.mls_only,
last_x_days=args.days,
)

if not args.filename:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
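The flag wiring this commit adds to `cli.py` can be exercised without touching the network by handing `parse_args` an explicit argv list. A minimal sketch with the same flag names as the diff above (behavior illustrative only):

```python
import argparse

parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str)
parser.add_argument("-d", "--days", type=int, default=None)
parser.add_argument("-r", "--radius", type=float, default=None)
parser.add_argument("-m", "--mls_only", action="store_true")

# Simulated invocation: homeharvest 90210 -d 30 -r 1.5 -m
args = parser.parse_args(["90210", "-d", "30", "-r", "1.5", "-m"])
print(args.mls_only, args.days, args.radius)  # → True 30 1.5
```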