Refining Crop Yield Training Data Improved My Model’s R-Squared to 0.97

Steady improvements to the data source are yielding better results

  • 594 training records with 49 total features (39 used for model training)

  • Geographic coverage: 5 states, 5 crops, 10+ years (2012-2022)
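For readers following along, the R-squared in the title is the coefficient of determination: one minus the ratio of the model's squared error to the variance of the observed yields. The sketch below shows that calculation in plain Python. The yield numbers are made up for illustration and are not taken from the 594-record dataset, and the function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean yield
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Illustrative (made-up) yields in bu/acre, not real model output
observed = [150.0, 160.0, 170.0, 180.0]
predicted = [152.0, 158.0, 171.0, 179.0]
print(round(r_squared(observed, predicted), 3))
```

A value near 1 means the predictions track the observed yields closely. With a few hundred records it is worth checking the score on held-out data as well, since a high R-squared on training data alone can reflect overfitting.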

Key improvements:

  • Expanded region, crop, and date diversity

  • Added more granular soil features

  • Expanded sample size

Future improvements:

  • Add observed weather data for the region and time each sample was taken

  • Expand the record count

  • Expand crop, date, and location diversity

  • Relax the current limits on soil sample selection

** If you don’t understand this but want to, follow along and we can discover training models together.

Here’s a Feature Breakdown by Source

1. Soil Chemistry Features (12 features)

Source: ssurgo_lab_chemical_properties table - Real laboratory analysis

  • ph_h2o - Soil pH in water

  • ph_cacl2 - Soil pH in calcium chloride

  • estimated_organic_carbon - Organic carbon percentage

  • total_carbon_ncs - Total carbon content

  • total_nitrogen_ncs - Total nitrogen content

  • carbon_to_nitrogen_ratio - C:N ratio

  • cec_nh4_ph_7 - Cation exchange capacity

  • base_sat_nh4oac_ph_7 - Base saturation percentage

  • ca_nh4_ph_7 - Exchangeable calcium

  • mg_nh4_ph_7 - Exchangeable magnesium

  • k_nh4_ph_7 - Exchangeable potassium

  • na_nh4_ph_7 - Exchangeable sodium

2. Soil Physics Features (6 features)

Source: ssurgo_lab_physical_properties table - Real laboratory analysis

  • clay_total - Clay percentage

  • silt_total - Silt percentage

  • sand_total - Sand percentage

  • bulk_density_oven_dry - Soil bulk density

  • water_retention_15_bar - Water retained at 15 bar (permanent wilting point)

  • particle_density_less_than_2mm - Particle density

3. Location Features (8 features)

Source: ssurgo_lab_site + ssurgo_lab_layer tables - Real GPS coordinates from soil sampling

  • latitude_std_decimal_degrees - GPS latitude

  • longitude_std_decimal_degrees - GPS longitude

  • layer_key - Unique soil layer identifier

  • site_key - Unique soil site identifier

  • user_site_id - Site identification code

  • hzn_top - Soil horizon top depth (cm)

  • hzn_bot - Soil horizon bottom depth (cm)

  • texture_description - Soil texture classification

4. Crop Yield Features (6 features)

Source: nass_crops table - USDA NASS survey data

  • commodity_desc - Crop type (Corn, Wheat, Soybeans, Barley, Cotton)

  • year - Harvest year (2012-2022)

  • yield_value - Crop yield (bu/acre or tons/acre)

  • state_name - State where crop was grown

  • county_name - County where crop was grown

  • unit_desc - Yield measurement units

5. Weather Features (9 features) - SYNTHETIC DATA

Source: State-based climate modeling with regional averages

  • climate_zone - Köppen climate classification

  • avg_temperature - Average growing season temperature

  • max_temperature - Maximum temperature

  • min_temperature - Minimum temperature

  • total_gdd - Growing degree days (calculated)

  • avg_relative_humidity - Average humidity by state

  • total_precipitation - Total precipitation with drought/wet year adjustments

  • days_with_precipitation - Estimated precipitation days
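The exact calculation behind total_gdd is not shown here, so the following is only a minimal sketch of the standard averaging method for growing degree days: the daily mean temperature minus a base temperature, floored at zero, summed over the season. The 10 °C base (a common choice for corn) and the function names are assumptions for illustration.

```python
def daily_gdd(t_max, t_min, t_base=10.0):
    """Growing degree days for one day using the simple averaging method.

    t_base=10.0 (degrees C) is an assumed base temperature; it varies by crop.
    """
    return max(0.0, (t_max + t_min) / 2.0 - t_base)


def total_gdd(daily_temps, t_base=10.0):
    """Sum daily GDD over an iterable of (t_max, t_min) pairs in degrees C."""
    return sum(daily_gdd(t_max, t_min, t_base) for t_max, t_min in daily_temps)


# Three made-up days: two warm, one too cold to contribute
season = [(28.0, 16.0), (30.0, 18.0), (12.0, 4.0)]
print(total_gdd(season))  # 12 + 14 + 0 = 26.0
```

Because the weather inputs are currently synthetic state-level averages, any GDD value built on them inherits that approximation. Swapping in observed daily temperatures would not change the formula.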

6. Derived Features (7 features)

Source: Calculated from soil chemistry/physics and weather data

  • soil_quality_score - 0-100 composite score (pH + organic matter + CEC + clay)

  • soil_quality_class - Categorical: Poor/Good/Excellent

  • temp_optimality - Temperature suitability for crop growth (0-1)

  • precip_category - Drought/Normal/Wet classification

  • gdd_suitability - Growing degree day suitability by crop

  • ca_mg_ratio - Calcium to magnesium nutrient ratio

  • soil_crop_distance_km - Distance from soil sample to crop location
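Two of the derived features above have well-known formulas, sketched below under stated assumptions. The distance is computed with the haversine great-circle formula between the soil sample's coordinates and a crop location (for example a county centroid, which is an assumption, since the crop data is reported by county rather than by point). The ratio is exchangeable calcium divided by exchangeable magnesium. Function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2.0) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def ca_mg_ratio(ca, mg):
    """Ratio of exchangeable Ca to Mg; None if either is missing or Mg is not positive."""
    if ca is None or mg is None or mg <= 0:
        return None
    return ca / mg


# Made-up coordinates: a soil sample and a nearby crop location
print(round(haversine_km(41.0, -93.5, 41.5, -93.0), 1))
print(ca_mg_ratio(12.0, 3.0))  # 4.0
```

Returning None for a missing or zero magnesium value keeps the gap explicit, so it can be imputed or dropped during training rather than silently becoming an infinite ratio.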