[enh]: make last_x_days generic
add mls_only
make radius generic
cullenwatson committed Oct 4, 2023
1 parent 51bde20 commit c487067
Showing 9 changed files with 218 additions and 199 deletions.
95 changes: 55 additions & 40 deletions README.md
@@ -36,13 +36,13 @@ pip install homeharvest
### CLI

```
usage: homeharvest [-h] [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] location
usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] location
Home Harvest Property Scraper
positional arguments:
location Location to scrape (e.g., San Francisco, CA)
options:
-l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
Listing type to scrape
@@ -54,7 +54,8 @@ options:
Proxy to use for scraping
-d DAYS, --days DAYS Sold in last _ days filter.
-r RADIUS, --radius RADIUS
Get comparable properties within _ (e.g. 0.0) miles. Only applicable for individual addresses.
-m, --mls_only If set, fetches only MLS listings.
```
```bash
> homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
@@ -73,9 +74,14 @@ filename = f"output/{current_timestamp}.csv"
properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # for_sale, for_rent
last_x_days=30, # sold/listed in last 30 days
mls_only=True, # only fetch MLS listings
)
print(f"Number of properties: {len(properties)}")

# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())
```


@@ -94,12 +100,23 @@ properties.to_csv(filename, index=False)
### Parameters for `scrape_property()`
```
Required
├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
└── listing_type (enum): for_rent, for_sale, sold
├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
└── listing_type (option): Choose the type of listing.
- 'for_rent'
- 'for_sale'
- 'sold'
Optional
├── radius_for_comps (float): Radius in miles to find comparable properties based on individual addresses.
├── sold_last_x_days (int): Number of past days to filter sold properties.
├── proxy (str): in format 'http://user:pass@host:port'
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
├── last_x_days (integer): Number of past days to filter properties. Utilizes 'COEDate' for 'sold' listing types, and 'Lst Date' for others (for_rent, for_sale).
│ Example: 30 (fetches properties listed/sold in the last 30 days)
├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
└── proxy (string): In format 'http://user:pass@host:port'
```
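The `listing_type` option is validated before any request is made. A minimal, self-contained sketch of that check (mirroring the `validate_input` helper referenced in the diff further down; the enum is reproduced locally and `ValueError` stands in for the library's `InvalidListingType`):

```python
from enum import Enum

class ListingType(Enum):
    FOR_SALE = "for_sale"
    FOR_RENT = "for_rent"
    SOLD = "sold"

def validate_input(listing_type: str) -> None:
    # Reject anything that is not a member of the enum (case-insensitive)
    if listing_type.upper() not in ListingType.__members__:
        raise ValueError(f"Provided listing type, '{listing_type}', does not exist.")

validate_input("sold")  # passes silently
```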
### Property Schema
```plaintext
@@ -111,59 +128,57 @@ Property
│ └── status (str)
├── Address Details:
│ ├── street (str)
│ ├── unit (str)
│ ├── city (str)
│ ├── state (str)
│ └── zip (str)
│ ├── street
│ ├── unit
│ ├── city
│ ├── state
│ └── zip
├── Property Description:
│ ├── style (str)
│ ├── beds (int)
│ ├── baths_full (int)
│ ├── baths_half (int)
│ ├── sqft (int)
│ ├── lot_sqft (int)
│ ├── sold_price (int)
│ ├── year_built (int)
│ ├── garage (float)
│ └── stories (int)
│ ├── style
│ ├── beds
│ ├── baths_full
│ ├── baths_half
│ ├── sqft
│ ├── lot_sqft
│ ├── sold_price
│ ├── year_built
│ ├── garage
│ └── stories
├── Property Listing Details:
│ ├── list_price (int)
│ ├── list_date (str)
│ ├── last_sold_date (str)
│ ├── prc_sqft (int)
│ └── hoa_fee (int)
│ ├── list_price
│ ├── list_date
│ ├── last_sold_date
│ ├── prc_sqft
│ └── hoa_fee
├── Location Details:
│ ├── latitude (float)
│ ├── longitude (float)
│ └── neighborhoods (str)
│ ├── latitude
│ ├── longitude
│ └── neighborhoods
```
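Each scraped row carries the fields above as DataFrame columns, so post-processing is plain pandas. The rows below are made up to keep the snippet self-contained:

```python
import pandas as pd

# Hypothetical rows shaped like the schema above
df = pd.DataFrame([
    {"street": "1 Main St", "beds": 3, "sqft": 1400, "sold_price": 550000},
    {"street": "2 Oak Ave", "beds": 2, "sqft": 900,  "sold_price": 410000},
])

# e.g. keep 3+ bed homes and derive price per square foot
comps = df[df["beds"] >= 3].assign(prc_sqft=lambda d: d["sold_price"] / d["sqft"])
print(comps[["street", "prc_sqft"]])
```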
## Supported Countries for Property Scraping

* **Realtor.com**: mainly from the **US** but also has international listings

### Exceptions
The following exceptions may be raised when using HomeHarvest:

- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
- `NoResultsFound` - no properties found from your input

- `NoResultsFound` - no properties found from your search
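A defensive caller might treat `NoResultsFound` as "empty result" rather than a crash. The exception classes are stubbed here so the sketch runs standalone; in real use they come from `homeharvest.exceptions` as shown in the diff below:

```python
# Stand-ins for homeharvest.exceptions
class InvalidListingType(Exception): pass
class NoResultsFound(Exception): pass

def run_search(scrape_fn, location, listing_type):
    """Return scrape_fn's result, or None when the query matches nothing."""
    try:
        return scrape_fn(location=location, listing_type=listing_type)
    except NoResultsFound:
        return None

def empty_scraper(**kwargs):
    raise NoResultsFound("no results found for the query")

result = run_search(empty_scraper, "90210", "sold")
print(result)  # → None
```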


## Frequently Asked Questions
---

**Q: Encountering issues with your searches?**
**A:** Try to broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
**A:** Try to broaden the parameters you're using. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).

---

**Q: Received a Forbidden 403 response code?**
**A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:

- Waiting a few seconds between requests.
- Trying a VPN to change your IP address.
- Trying a VPN, or passing a proxy as a parameter to `scrape_property()`, to change your IP address.

---

5 changes: 1 addition & 4 deletions examples/HomeHarvest_Demo.ipynb
@@ -31,7 +31,7 @@
"metadata": {},
"outputs": [],
"source": [
"# scrapes all 3 sites by default\n",
"# check for sale properties\n",
"scrape_property(\n",
" location=\"dallas\",\n",
" listing_type=\"for_sale\"\n",
@@ -53,7 +53,6 @@
"# search a specific address\n",
"scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n",
" site_name=\"zillow\",\n",
" listing_type=\"for_sale\"\n",
")"
]
@@ -68,7 +67,6 @@
"# check rentals\n",
"scrape_property(\n",
" location=\"chicago, illinois\",\n",
" site_name=[\"redfin\", \"zillow\"],\n",
" listing_type=\"for_rent\"\n",
")"
]
@@ -88,7 +86,6 @@
"# check sold properties\n",
"scrape_property(\n",
" location=\"90210\",\n",
" site_name=[\"redfin\"],\n",
" listing_type=\"sold\"\n",
")"
]
18 changes: 18 additions & 0 deletions examples/HomeHarvest_Demo.py
@@ -0,0 +1,18 @@
from homeharvest import scrape_property
from datetime import datetime

# Generate filename based on current timestamp
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"output/{current_timestamp}.csv"

properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # for_sale, for_rent
last_x_days=30, # sold/listed in last 30 days
mls_only=True, # only fetch MLS listings
)
print(f"Number of properties: {len(properties)}")

# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())
104 changes: 21 additions & 83 deletions homeharvest/__init__.py
@@ -1,103 +1,41 @@
import warnings
import pandas as pd
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

from .core.scrapers import ScraperInput
from .utils import process_result, ordered_properties
from .utils import process_result, ordered_properties, validate_input
from .core.scrapers.realtor import RealtorScraper
from .core.scrapers.models import ListingType, Property, SiteName
from .exceptions import InvalidListingType


_scrapers = {
"realtor.com": RealtorScraper,
}


def _validate_input(listing_type: str) -> None:
if listing_type.upper() not in ListingType.__members__:
raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
from .core.scrapers.models import ListingType
from .exceptions import InvalidListingType, NoResultsFound


def _scrape_single_site(location: str, site_name: str, listing_type: str, radius: float, proxy: str = None, sold_last_x_days: int = None) -> pd.DataFrame:
def scrape_property(
location: str,
listing_type: str = "for_sale",
radius: float = None,
mls_only: bool = False,
last_x_days: int = None,
proxy: str = None,
) -> pd.DataFrame:
"""
Helper function to scrape a single site.
Scrape properties from Realtor.com based on a given location and listing type.
"""
_validate_input(listing_type)
validate_input(listing_type)

scraper_input = ScraperInput(
location=location,
listing_type=ListingType[listing_type.upper()],
site_name=SiteName.get_by_value(site_name.lower()),
proxy=proxy,
radius=radius,
sold_last_x_days=sold_last_x_days
mls_only=mls_only,
last_x_days=last_x_days,
)

site = _scrapers[site_name.lower()](scraper_input)
site = RealtorScraper(scraper_input)
results = site.search()
print(f"found {len(results)}")

properties_dfs = [process_result(result) for result in results]
if not properties_dfs:
return pd.DataFrame()

return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]


def scrape_property(
location: str,
listing_type: str = "for_sale",
radius: float = None,
sold_last_x_days: int = None,
proxy: str = None,
) -> pd.DataFrame:
"""
Scrape properties from Realtor.com based on a given location and listing type.
:param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way')
:param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold'). Default is 'for_sale'.
:param radius: Radius in miles to find comparable properties on individual addresses. Optional.
:param sold_last_x_days: Number of past days to filter sold properties. Optional.
:param proxy: Proxy IP address to be used for scraping. Optional.
:returns: pd.DataFrame containing properties
"""
site_name = "realtor.com"

if site_name is None:
site_name = list(_scrapers.keys())

if not isinstance(site_name, list):
site_name = [site_name]

results = []

if len(site_name) == 1:
final_df = _scrape_single_site(location, site_name[0], listing_type, radius, proxy, sold_last_x_days)
results.append(final_df)
else:
with ThreadPoolExecutor() as executor:
futures = {
executor.submit(_scrape_single_site, location, s_name, listing_type, radius, proxy, sold_last_x_days): s_name
for s_name in site_name
}

for future in concurrent.futures.as_completed(futures):
result = future.result()
results.append(result)

results = [df for df in results if not df.empty and not df.isna().all().all()]

if not results:
return pd.DataFrame()

final_df = pd.concat(results, ignore_index=True)

columns_to_track = ["Street", "Unit", "Zip"]

#: validate they exist, otherwise create them
for col in columns_to_track:
if col not in final_df.columns:
final_df[col] = None
raise NoResultsFound("no results found for the query")

return final_df
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=FutureWarning)
return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]
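The refactored tail of `scrape_property` concatenates the per-result frames in a fixed column order while silencing pandas' `FutureWarning` about empty or all-NA entries. A self-contained sketch of that pattern (column list abbreviated, data invented):

```python
import warnings
import pandas as pd

ordered_properties = ["street", "beds", "list_price"]  # abbreviated column order

properties_dfs = [
    pd.DataFrame({"street": ["1 Main St"], "beds": [3], "list_price": [550000]}),
    pd.DataFrame({"street": ["2 Oak Ave"], "beds": [2], "list_price": [410000]}),
]

with warnings.catch_warnings():
    # pd.concat warns about empty/all-NA frames on newer pandas; suppress it
    warnings.simplefilter("ignore", category=FutureWarning)
    final_df = pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]

print(len(final_df))  # → 2
```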
36 changes: 29 additions & 7 deletions homeharvest/cli.py
@@ -5,7 +5,9 @@

def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
parser.add_argument(
"location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
)

parser.add_argument(
"-l",
@@ -33,21 +35,41 @@ def main():
help="Name of the output file (without extension)",
)

parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
parser.add_argument("-d", "--days", type=int, default=None, help="Sold in last _ days filter.")
parser.add_argument(
"-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
)
parser.add_argument(
"-d",
"--days",
type=int,
default=None,
help="Sold/listed in last _ days filter.",
)

parser.add_argument(
"-r",
"--sold-properties-radius",
dest="sold_properties_radius", # This makes sure the parsed argument is stored as radius_for_comps in args
"--radius",
type=float,
default=None,
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses."
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.",
)
parser.add_argument(
"-m",
"--mls_only",
action="store_true",
help="If set, fetches only MLS listings.",
)

args = parser.parse_args()

result = scrape_property(args.location, args.listing_type, radius_for_comps=args.radius_for_comps, proxy=args.proxy)
result = scrape_property(
args.location,
args.listing_type,
radius=args.radius,
proxy=args.proxy,
mls_only=args.mls_only,
last_x_days=args.days,
)

if not args.filename:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
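The flag wiring this commit adds to `cli.py` can be exercised without touching the network by handing `parse_args` an explicit argv list. A minimal sketch with the same flag names as the diff above (behavior illustrative only):

```python
import argparse

parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str)
parser.add_argument("-d", "--days", type=int, default=None)
parser.add_argument("-r", "--radius", type=float, default=None)
parser.add_argument("-m", "--mls_only", action="store_true")

# Simulated invocation: homeharvest 90210 -d 30 -r 1.5 -m
args = parser.parse_args(["90210", "-d", "30", "-r", "1.5", "-m"])
print(args.mls_only, args.days, args.radius)  # → True 30 1.5
```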