A growing number of websites block the generic python-requests user agent but do not block well-meaning, uniquely identified user agents. Just as we identify ourselves in cross-client calls, we should identify ourselves when a user provides a URL to fetch and import into DocumentCloud. We should also add a logging mechanism so we can audit these requests and detect when sites are blocking users from importing their documents into DocumentCloud.
This serves two purposes: it makes imports more reliable, and it gives us a record we could use to perhaps file a public records request asking why we're blocked, given that we host primary source materials for free to the public.
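
A rough sketch of what this could look like, using only the Python standard library so it stands alone. The user agent string, the logger name, the set of status codes treated as a likely block, and the helper names (`build_request`, `looks_blocked`, `fetch`) are all assumptions for illustration, not existing DocumentCloud code.

```python
import logging
import urllib.error
import urllib.request

# Hypothetical logger name; the real project may route this elsewhere.
logger = logging.getLogger("url_import")

# Hypothetical identifying string; the actual value is up for discussion.
USER_AGENT = "DocumentCloud-URL-Import/1.0 (+https://www.documentcloud.org)"

# Assumption: these statuses usually mean the client was refused,
# as opposed to the document simply not existing (404).
BLOCK_STATUSES = {401, 403, 429}


def build_request(url):
    """Build a request that identifies us instead of using a generic agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def looks_blocked(status):
    """Return True if the HTTP status suggests the site refused our client."""
    return status in BLOCK_STATUSES


def fetch(url, timeout=30):
    """Fetch a user-supplied URL, logging responses that look like a block."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        if looks_blocked(exc.code):
            # Record enough detail to audit which sites are refusing imports.
            logger.warning("possible block: url=%s status=%s", url, exc.code)
        raise
```

If the import path uses the requests library, the same idea should apply by passing `headers={"User-Agent": USER_AGENT}` to the call that fetches the URL. Logging the URL and status code on each suspected block would give us the audit trail described above.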