Web scraping to download images from the NHTSA website (CIREN crash cases)

1 answer

I am trying to download some images from the NHTSA Crash Viewer (CIREN cases). An example case: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817 If I try to download a front crash image, no file is downloaded. I am using the beautifulsoup4 and requests libraries. This code works for other websites.

The image links are in the following format: https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0

I have also tried previous answers from SO, but none of the solutions works. Error obtained: No response from server

Code used for web scraping

    from bs4 import *
    import requests as rq
    import os

    r2 = rq.get("https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0")
    soup2 = BeautifulSoup(r2.text, "html.parser")

    links = []

    x = soup2.select('img[src^="https://crashviewer.nhtsa.dot.gov"]')

    for img in x:
        links.append(img['src'])

    os.mkdir('ciren_photos')
    i=1

    for index, img_link in enumerate(links):
        if i<=200:
            img_data = rq.get(img_link).content
            with open("ciren_photos\\"+str(index+1)+'.jpg', 'wb+') as f:
                f.write(img_data)
            i += 1
        else:
            f.close()
            break

All answers to this question, which has the identifier 59760830

The best answer:

This is a task that would require Selenium, but luckily there is a shortcut. At the top of the page there is a "Text and Images Only" link that goes to a page like this one: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?ViewText&CaseID=99817&xsl=textonly.xsl&websrc=true which contains all the images and text content on one page. You can select that link with soup.find('a', text='Text and Images Only').

That link and the image links are relative (links to the same site are usually relative links), so you'll have to use urljoin() to get the full URLs.
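As a small illustration of what urljoin() does here, the sketch below resolves a relative image src against the page it was found on. The relative src value is an assumption, put together from the image URL format quoted in the question; the actual attribute values on the page may differ.

```python
from urllib.parse import urljoin

# Page on which the relative links were found
# (the "Text and Images Only" page mentioned above).
page_url = (
    "https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx"
    "?ViewText&CaseID=99817&xsl=textonly.xsl&websrc=true"
)

# Assumed relative src, shaped like the image link format in the question.
relative_src = "GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0"

# urljoin keeps the scheme, host, and directory of page_url and swaps in
# the relative file name and query string.
full_url = urljoin(page_url, relative_src)
print(full_url)
```

The result has the same shape as the absolute image link quoted in the question.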

    from bs4 import BeautifulSoup
    import requests as rq
    from urllib.parse import urljoin

    url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'

    with rq.session() as s:
        r = s.get(url)
        soup = BeautifulSoup(r.text, "html.parser")

        url = urljoin(url, soup.find('a', text='Text and Images Only')['href'])
        r = s.get(url)
        soup = BeautifulSoup(r.text, "html.parser")

        links = [urljoin(url, i['src']) for i in soup.select('img[src^="GetBinary.aspx"]')]

        for link in links:
            content = s.get(link).content
            # write `content` to file

The site doesn't return valid images unless the request carries valid cookies. There are two ways to get the cookies: either reuse the cookies from a previous request or use a Session object. It's best to use a Session because it also reuses the underlying TCP connection and persists other parameters across requests.
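The "write `content` to file" step left open in the code above could be filled in roughly as follows. This is only a sketch: `save_images` and its `fetch` parameter are hypothetical names introduced here, and the `.jpg` extension, the `ciren_photos` folder, and the 200-file cap are carried over from the question's code rather than confirmed by the site.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def save_images(
    links: Iterable[str],
    fetch: Callable[[str], bytes],
    out_dir: str = "ciren_photos",
    limit: int = 200,
) -> List[Path]:
    """Download each link with `fetch` and store it as 1.jpg, 2.jpg, ..."""
    target = Path(out_dir)
    # exist_ok=True avoids the error os.mkdir raises if the folder exists.
    target.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, link in enumerate(links, start=1):
        if index > limit:
            break
        path = target / f"{index}.jpg"
        # write_bytes opens, writes, and closes the file in one call.
        path.write_bytes(fetch(link))
        saved.append(path)
    return saved
```

Inside the `with rq.session() as s:` block it could be called as `save_images(links, lambda link: s.get(link).content)`, so that every image request goes through the same Session and carries its cookies.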
